Stream: dev

Topic: Croissant


Philip Durbin 🚀 (Mar 14 2024 at 14:09):

#10341 is the issue we're using to track Croissant support.

There's also good discussion at https://github.com/mlcommons/croissant/issues/530

@Slava Tykhonov @Jan Range and others, I'm getting variable-level metadata from the "datasetFileDetails" JSON. See https://github.com/gdcc/dataverse-exporters/commit/dbc3fa000ebe51ef9e9e4f7ef31d955afc77ed2a

Philip Durbin 🚀 (Mar 14 2024 at 15:52):

I'm far from finished but I built a jar from https://github.com/gdcc/dataverse-exporters/compare/main...croissant and put it on https://dev3.dataverse.org if anyone wants to play with it:

Screenshot-2024-03-14-at-11.51.42-AM.png

Slava Tykhonov (Mar 14 2024 at 17:12):

Looks good! How about adding the URL of an external microservice to the Payara parameters to produce Croissant, etc.?

Philip Durbin 🚀 (Mar 14 2024 at 18:00):

Sure, but I think that's a different topic. I just kicked it off: #dev > exporters as external services

Jan Range (Mar 15 2024 at 09:06):

@Slava Tykhonov I have added DataFrame support to EasyDataverse (see example colab and PR). Since EasyDataverse will eventually be merged, it will also be part of pyDataverse.

I was thinking that adding the croissant export to this class, which currently handles tabular data, would make sense. What are your thoughts?

Slava Tykhonov (Mar 15 2024 at 11:40):

Jan, it looks great! Could you also think about adding methods to this class that return 1) DataFrame statistics (mean, median, etc.), 2) column names and their types, and 3) a JSON export of the DataFrame?

This feature should also be able to read different file types (spreadsheet, tabular, etc.) and ingest the files into a DataFrame. Then we can connect it with Croissant metadata and get ready for the ML use cases that are coming.

Jan Range (Mar 15 2024 at 12:13):

Thanks @Slava Tykhonov! Of course, I am happy to add these to the class as well as an importer. Are the .describe() statistics sufficient?

image.png
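
The three outputs requested above (summary statistics, column names with types, and a JSON export) can be sketched without any Dataverse code. This is a stdlib-only illustration, not the EasyDataverse or pandas API: `summarize` and `SAMPLE_CSV` are hypothetical names, the sample values are made up, and the statistics cover only part of what `.describe()` reports.

```python
import csv
import io
import json
import statistics

# Tiny inline table standing in for a tabular file; the values are made up.
SAMPLE_CSV = """name,weight,cylinders
a,3504,8
b,2372,4
c,2130,4
"""


def summarize(csv_text: str) -> dict:
    """Return a type label per column plus basic stats for numeric columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            # At least one value is not numeric, so treat the column as text.
            summary[column] = {"type": "string"}
            continue
        summary[column] = {
            "type": "number",
            "mean": statistics.mean(numbers),
            "median": statistics.median(numbers),
            "min": min(numbers),
            "max": max(numbers),
        }
    return summary


# JSON export of the column types and statistics.
print(json.dumps(summarize(SAMPLE_CSV), indent=2))
```

A summary in this shape is plain JSON, so it could be attached to other metadata later; where such statistics belong in Croissant is a separate question.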

Philip Durbin 🚀 (Mar 15 2024 at 12:16):

It occurs to me that perhaps I should extend an invitation to the Croissant meeting on Wednesday to a wider audience. Basically, @Slava Tykhonov and I will be showing the Croissant team what we have so far. I can (privately) send the Google Meet invite or people can sign up for the mailing list: https://github.com/mlcommons/croissant#getting-involved

Jan Range (Mar 15 2024 at 12:19):

@Philip Durbin happy to join and listen if feasible by time. At what time is the meeting on Thursday?

Philip Durbin 🚀 (Mar 15 2024 at 12:22):

Wednesday. "Weekly on Wednesday from 9:05am-10:00am Pacific." -- https://mlcommons.org/working-groups/data/croissant/

Jan Range (Mar 15 2024 at 12:24):

My goodness, I am ready for the weekend. Wednesday it is :joy:

Jan Range (Mar 15 2024 at 12:25):

9AM works well for me

Philip Durbin 🚀 (Mar 15 2024 at 12:48):

Now the question is if I should email the Dataverse Google Group. :thinking: How much do I want to embarrass myself? :crazy:

Philip Durbin 🚀 (Mar 15 2024 at 15:14):

I guess I want to embarrass myself. I just posted this: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/DqEIkiwlAgAJ :grinning:

Jan Range (Mar 15 2024 at 16:16):

@Slava Tykhonov I have added the stats and JSON export - https://github.com/gdcc/easyDataverse/pull/14

Following up with the importers on Monday. Off to the switzerland now :flag_switzerland:

Philip Durbin 🚀 (Mar 15 2024 at 16:20):

The land of the Switzer. Enjoy!

@Jan Range when you're back, I had a question about that PR at #python > tabular data addition

Philip Durbin 🚀 (Mar 15 2024 at 17:01):

I have lots of questions about Croissant. :grinning: I started a doc: https://docs.google.com/document/d/1C33FAR6s421WV9U50dzlBkVZRTWTlWguc-RoxakOly0/edit?usp=sharing

Jan Range (Mar 15 2024 at 17:07):

Ah nice, will add my questions once home :smile:

Philip Durbin 🚀 (Mar 15 2024 at 17:08):

Cool. Please leave comments for now. And I'll move these last few messages to #dev > Croissant

Slava Tykhonov (Mar 15 2024 at 18:39):

Screenshot-2024-03-15-at-19.38.57.png
Just in case: I've got variables from DDI in the Croissant export.

Philip Durbin 🚀 (Mar 15 2024 at 18:48):

recordSet! Looking good! :dataverse_man:

Slava Tykhonov (Mar 18 2024 at 15:47):

Extended with all Croissant fields which I can recognise and map: https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/croissant_sample.json

Slava Tykhonov (Mar 18 2024 at 16:18):

I'll move all semantic mappings to a separate file and publish the Croissant mapper on GitHub.

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:02):

That looks great, @Slava Tykhonov! I noticed that you used version 2.0 and it doesn't show the validation error that @Philip Durbin reported in #609. I was playing a bit with your sample, and if I put the @context before the version (your sample has it at the end), the error is displayed:
Version doesn't follow MAJOR.MINOR.PATCH: 2.0.

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:03):

Are we doing something wrong here by putting the context first, or might this indeed be a bug in the validator? :thinking:

Slava Tykhonov (Mar 18 2024 at 19:09):

pip3 install --upgrade git+https://github.com/Dans-labs/pyDataverse@croissant#egg=pyDataverse --break-system-packages

Slava Tykhonov (Mar 18 2024 at 19:09):

from pyDataverse.Croissant import Croissant
import json

# Dataverse installation and dataset DOI to export
host = "https://dataverse.nl"
DOI = "doi:10.34894/KMRAYH"
croissant = Croissant(host, DOI)
# Build the Croissant record and pretty-print it as JSON
c = croissant.get_record()
print(json.dumps(c, indent=4, default=str))

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:31):

Actually, I was probably just looking in the wrong place, because I am getting the same validation error :sweat_smile:

Slava Tykhonov (Mar 18 2024 at 19:34):

Can you take a screenshot, @Juan Pablo Tosca Villanueva?

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:35):

  -  [Metadata(Quality_of_care__UP_TSU)] Version doesn't follow MAJOR.MINOR.PATCH: 2.0. For more information refer to: https://semver.org/spec/v2.0.0.html

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:35):

Sure

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:36):

image.png

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:37):

the validate.sh just runs the validator on that sample

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:37):

image.png

Slava Tykhonov (Mar 18 2024 at 19:42):

I see, thanks! Will fix it here then https://colab.research.google.com/drive/1H-dfY_TBh6eXLkD7tUlqsxUDEsDCQPiD?usp=sharing

Slava Tykhonov (Mar 18 2024 at 22:13):

mlcroissant validate --jsonld /tmp/croissant1
I0318 23:13:19.350207 139745363906624 validate.py:53] Done.

Juan Pablo Tosca Villanueva (Mar 18 2024 at 23:27):

Oh! Did you fix it by switching to 1.0.0?

Slava Tykhonov (Mar 18 2024 at 23:29):

Added ".0" to the end of the version :)

Juan Pablo Tosca Villanueva (Mar 18 2024 at 23:31):

I think that works to pass validation, but dataset versions come in X.X format. I think that's why Phil opened that issue, but he should be around tomorrow to enlighten us more about this :rolling_on_the_floor_laughing:
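
The ".0" workaround discussed above can be sketched as follows. This is a minimal illustration, not the code used in either exporter: `to_semver` is a hypothetical helper that pads a Dataverse-style MAJOR.MINOR version out to the MAJOR.MINOR.PATCH form the validator asks for.

```python
def to_semver(dataset_version: str) -> str:
    """Pad a Dataverse-style version ("2.0") to MAJOR.MINOR.PATCH ("2.0.0").

    Hypothetical helper for illustration; versions that already have three
    numeric parts are returned unchanged.
    """
    parts = dataset_version.strip().split(".")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected version string: {dataset_version!r}")
    # Append "0" components until there are exactly three parts.
    parts += ["0"] * (3 - len(parts))
    return ".".join(parts)


print(to_semver("2.0"))    # 2.0.0
print(to_semver("1.0.0"))  # 1.0.0
```

The padded value passes the MAJOR.MINOR.PATCH check, but it no longer matches the X.X version that Dataverse itself displays, which is the concern raised here.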

Philip Durbin 🚀 (Mar 19 2024 at 01:47):

Ha. Thanks for the :thumbs_up: on https://github.com/mlcommons/croissant/issues/609 :heart:

Philip Durbin 🚀 (Mar 19 2024 at 16:40):

I just pushed a commit to export variable-level metadata: https://github.com/gdcc/dataverse-exporters/pull/4/commits/4f4361260d294280614e1112b291d632982a9dbd

Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:43):

Amazing!

Slava Tykhonov (Mar 19 2024 at 17:08):

Phil, a slightly naive question: how will it be maintained when Croissant gets updates?

Slava Tykhonov (Mar 19 2024 at 18:04):

I've put the points that are currently missing from Croissant into the slides:
Sensitive vs Restricted files
Embargo
Provenance, data ownership transfer
Primary and secondary (derivative) datasets

Do you see more?

Philip Durbin 🚀 (Mar 19 2024 at 18:35):

Well, on a related note, how are you handling original vs. archival versions of files? foo.dta (Stata) vs foo.tab (archival). For now I'm only presenting the original.

Slava Tykhonov (Mar 19 2024 at 18:56):

I'm reading the original, comparing it with the .tab version, and linking them in the graph.

Philip Durbin 🚀 (Mar 19 2024 at 20:10):

Cool. Can you show an example?

Philip Durbin 🚀 (Mar 19 2024 at 20:44):

I just added "creator": https://github.com/gdcc/dataverse-exporters/pull/4/commits/7a0d8183cc7d4ef3f1864af56374fb08726732af

@Slava Tykhonov you seem to be doing just key/value here.

Slava Tykhonov (Mar 19 2024 at 21:06):

In the first version, yes, but I extended it yesterday with affiliation and person name: https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/odissei-croissant.json

Philip Durbin 🚀 (Mar 19 2024 at 21:07):

ah, great!

Slava Tykhonov (Mar 19 2024 at 21:07):

Screenshot-2024-03-19-at-22.07.01.png
I moved all semantic transformations outside, to keep the FAIR semantic mappings in separate file(s)

Philip Durbin 🚀 (Mar 19 2024 at 21:08):

Nice. Should be a fun call tomorrow. :grinning:

Slava Tykhonov (Mar 19 2024 at 21:08):

The idea is to load any "custom" mappings from GitHub and let the community maintain them without touching the source code.

Slava Tykhonov (Mar 19 2024 at 21:09):

I'm not sure if someone is actually doing that to get mappings in the knowledge graph. :)

Philip Durbin 🚀 (Mar 19 2024 at 21:12):

For the @id of a file you're using the database id:

    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "f3056770",
            "name": "DoD_R1.DTA",

Right now I'm showing the filename, like the spec shows. It's more readable. But your way is more precise.
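
The trade-off between the two `@id` styles can be sketched like this. `file_object` is a hypothetical helper, not code from either exporter, and the `f` prefix on the database id is inferred from the snippet above.

```python
def file_object(name: str, database_id: int, use_database_id: bool) -> dict:
    """Build a minimal cr:FileObject entry with one of the two @id styles."""
    # Database ids are unique per file; file names are more readable but can
    # repeat (for example, the same name in two different directories).
    identifier = f"f{database_id}" if use_database_id else name
    return {"@type": "cr:FileObject", "@id": identifier, "name": name}


print(file_object("DoD_R1.DTA", 3056770, use_database_id=True)["@id"])   # f3056770
print(file_object("DoD_R1.DTA", 3056770, use_database_id=False)["@id"])  # DoD_R1.DTA
```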

Slava Tykhonov (Mar 19 2024 at 21:15):

It's clearer when the DDI has variables spread across several files.

Philip Durbin 🚀 (Mar 19 2024 at 21:17):

Have you experimented with path/to/data.dta? Having a file hierarchy?

Slava Tykhonov (Mar 19 2024 at 21:22):

No, can you give me an example with a DOI?

Philip Durbin 🚀 (Mar 19 2024 at 21:26):

Sure. For example, in my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP I have a tabular file in a directory called "data".

Slava Tykhonov (Mar 19 2024 at 21:44):

Ok, I'll take a look. Here's the updated version of the slides for tomorrow's meeting: https://docs.google.com/presentation/d/1hEqIFE9yS3aePLhRDgnuw9PdL-Az7ORj6utTAenNtbw/edit?usp=sharing

Philip Durbin 🚀 (Mar 19 2024 at 21:44):

Cool. No rush.

Philip Durbin 🚀 (Mar 19 2024 at 21:45):

I'm also wondering about citeAs.

Philip Durbin 🚀 (Mar 19 2024 at 21:45):

It seems like it's ideally for a paper about a dataset:

"citeAs": "@Article{asano21pass, author = \"Yuki M. Asano and Christian Rupprecht and Andrew Zisserman and Andrea Vedaldi\", title = \"PASS: An ImageNet replacement for self-supervised pretraining without humans\", journal = \"NeurIPS Track on Datasets and Benchmarks\", year = \"2021\" }",

Slava Tykhonov (Mar 19 2024 at 21:45):

citeAs has to be built from author names and date. I'm not sure if they're doing it right, to be honest.

Philip Durbin 🚀 (Mar 19 2024 at 21:46):

But where do I put the DOI of the dataset itself? Kaggle is putting it in "identifier" (which isn't in the spec) and "citeAs" (which I'm not sure is right).

Philip Durbin 🚀 (Mar 19 2024 at 21:47):

Ok, well citeAs is on my list to ask about tomorrow: https://docs.google.com/document/d/1C33FAR6s421WV9U50dzlBkVZRTWTlWguc-RoxakOly0/edit?usp=sharing :grinning:

Slava Tykhonov (Mar 19 2024 at 21:47):

I think we need to use citeAs as it's implemented in Dataverse right now

Slava Tykhonov (Mar 19 2024 at 21:48):

Can you take a screenshot of Dataverse with the Croissant button and some example in JSON-LD? For the slides? (Coming to production in the next version.)

Philip Durbin 🚀 (Mar 19 2024 at 21:48):

Huh, citeAs is in our Signposting:

$ ack citeAs
src/main/java/edu/harvard/iq/dataverse/util/SignpostingResources.java
64:            String citeAs = "<" + ds.getPersistentURL() + ">;rel=\"cite-as\"";

Philip Durbin 🚀 (Mar 19 2024 at 21:49):

Yes, sure.

Slava Tykhonov (Mar 19 2024 at 21:49):

Cool! So I'll tell my part of the story: Phil is building the "production ready" Croissant export and I'm moving all crosswalks outside of the implementation to invite the community to maintain them.

Slava Tykhonov (Mar 19 2024 at 21:50):

I have a slide on Signposting, btw :)

Philip Durbin 🚀 (Mar 19 2024 at 21:52):

Here's an example (sort of a work in progress, honestly): https://dev3.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5072/FK2/DZRHUP

Philip Durbin 🚀 (Mar 19 2024 at 21:53):

And you are welcome to grab a screenshot of the Croissant button from https://dev3.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/DZRHUP

Jan Range (Mar 20 2024 at 09:11):

Due to a short-notice scheduling conflict, I won't be able to participate in the Croissant meeting tonight :anguish: @Slava Tykhonov Would you like to join the next PyDataverse meeting and talk about the Croissant extension? It is happening next Wednesday at 4 PM CET

Slava Tykhonov (Mar 20 2024 at 09:34):

Hi Jan, I'll join pyDataverse next week then. The Croissant extension is kind of working :) https://colab.research.google.com/drive/1H-dfY_TBh6eXLkD7tUlqsxUDEsDCQPiD?usp=sharing#scrollTo=WDDs2hdcJnED

Philip Durbin 🚀 (Mar 20 2024 at 11:46):

@Jan Range no worries

Jan Range (Mar 20 2024 at 13:26):

@Slava Tykhonov looks great! Thanks for sharing :tada:

Philip Durbin 🚀 (Mar 20 2024 at 14:49):

@Juan Pablo Tosca Villanueva I added more error checking to the Croissant exporter. You should be able to upload any file now: https://github.com/gdcc/dataverse-exporters/pull/4/commits/9316860b1869108d5cf64499b04252bde86575f4

Slava Tykhonov (Mar 20 2024 at 15:02):

Phil, you should mention this during the meeting today. It looks like the Editor isn't very stable and should be improved.

Philip Durbin 🚀 (Mar 20 2024 at 19:01):

There wasn't enough time!

Philip Durbin 🚀 (Mar 20 2024 at 19:02):

@Slava Tykhonov thanks for posting https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/FuajmDKEAQAJ ... an update with links to slides, notes, etc.

Philip Durbin 🚀 (Apr 12 2024 at 17:51):

@Slava Tykhonov following up on the Croissant call this week... I see what you mean about the lack of backward compatibility within 1.0. I just upgraded from mlcroissant 1.0.3 to 1.0.5 and now I see this new error:

WARNING: The JSON-LD @context is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'examples', 'isLiveDataset', 'rai'}

Philip Durbin 🚀 (Apr 12 2024 at 18:04):

@Slava Tykhonov so is that what you do to get the latest, correct @context? Go to https://github.com/mlcommons/croissant/tree/main/datasets/1.0 and pick one of the examples (I picked "titanic") and copy it from there?

Or do you go to https://mlcommons.github.io/croissant/docs/croissant-spec.html which has a different @context?

Help! :sweat_smile:

Philip Durbin 🚀 (Apr 12 2024 at 18:29):

I'm going with titanic, and I explained in the README to look out for breaking changes: https://github.com/gdcc/dataverse-exporters/pull/4/commits/03dfeddbd8b136aeee9c2642d8bc1852e73b948b
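
The "different keys" warning above boils down to a set difference between the exporter's `@context` and the reference one. Here is a minimal sketch of that comparison, assuming both contexts are plain dictionaries. `context_key_diff` is a hypothetical helper (not part of mlcroissant), and the two contexts are abbreviated with placeholder values; the real `@context` should be copied from the example datasets.

```python
def context_key_diff(ours: dict, reference: dict) -> dict:
    """Report which top-level @context keys differ between two contexts."""
    our_keys, reference_keys = set(ours), set(reference)
    return {
        "missing": sorted(reference_keys - our_keys),
        "extra": sorted(our_keys - reference_keys),
    }


# Abbreviated contexts; the "placeholder" values are for illustration only.
reference_context = {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "examples": "placeholder",
    "isLiveDataset": "placeholder",
    "rai": "placeholder",
}
exporter_context = {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
}

print(context_key_diff(exporter_context, reference_context))
```

Running a check like this against a freshly copied reference context after each mlcroissant upgrade would surface new keys before the validator does.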

Slava Tykhonov (Apr 12 2024 at 19:10):

Yes, this is exactly why I was working on semantic mappings.

Philip Durbin 🚀 (Apr 12 2024 at 19:27):

In other Croissant news, I'm appending ".0" to dataset versions but as I say here, I'm pretty grumpy about it: https://github.com/mlcommons/croissant/issues/609#issuecomment-2052403311

Philip Durbin 🚀 (Apr 23 2024 at 12:25):

I switched citeAs to bibtex format: https://github.com/gdcc/dataverse-exporters/pull/4/commits/151efb7898164d2f8290f31392d70fb28bfec299

Philip Durbin 🚀 (Apr 26 2024 at 21:41):

I'd love some feedback on this new pull request:

add docs for Croissant, tweak exporter docs #10533

Philip Durbin 🚀 (Apr 29 2024 at 18:33):

@Slava Tykhonov do you think I should open an issue at https://github.com/mlcommons/croissant/issues about where to put summary statistics? I see that you and Rajat volunteered to think about this and I don't want to step on your toes!

Slava Tykhonov (Apr 29 2024 at 19:21):

Just open

Philip Durbin 🚀 (Apr 29 2024 at 19:23):

will do!

Philip Durbin 🚀 (Apr 29 2024 at 19:49):

ok, done: https://github.com/mlcommons/croissant/issues/640

Philip Durbin 🚀 (May 03 2024 at 17:57):

These are the little notes I leave to myself while working on the Croissant exporter:

ls -1 ../max | grep -v croissant | while read i; do FILE=$i; FMT=`echo $FILE | cut -d . -f1`; echo $FMT; cat 27626.debug | jq ".$FMT" -r > $FILE; done

:croissant:

Philip Durbin 🚀 (May 07 2024 at 18:18):

I'm comparing my output with @Slava Tykhonov's and realizing I forgot "description"! :doh:

Philip Durbin 🚀 (May 17 2024 at 15:02):

I just wrote a long passage about "version" ("1.0.0" vs "1.0" vs 1.0, etc.): https://github.com/mlcommons/croissant/issues/609#issuecomment-2117798279

What do you think? Am I making sense?

Philip Durbin 🚀 (May 28 2024 at 14:03):

First Croissant jar is up on Maven Central: https://repo1.maven.org/maven2/io/gdcc/export/croissant/0.1.1/

Philip Durbin 🚀 (May 28 2024 at 15:20):

Here's the fancy landing page: https://central.sonatype.com/artifact/io.gdcc.export/croissant

Philip Durbin 🚀 (May 30 2024 at 21:08):

I'm talking to Kaggle and comparing https://www.kaggle.com/datasets/yasserh/wine-quality-dataset to https://beta.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/YKJQY8

Croissant for both:

Philip Durbin 🚀 (May 30 2024 at 21:09):

I also invited the Dataverse community to play around with the Croissant exporter: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/ARnYS5kpCgAJ

Philip Durbin 🚀 (Jun 03 2024 at 14:48):

I just added some new feedback from Geoff at Kaggle: https://github.com/gdcc/exporter-croissant#differences-from-kaggle

Philip Durbin 🚀 (Jun 03 2024 at 18:18):

I created a pull request for the point about "field" being repeated over and over: https://github.com/gdcc/exporter-croissant/pull/2

Philip Durbin 🚀 (Jun 03 2024 at 18:19):

merged

Philip Durbin 🚀 (Jun 03 2024 at 19:24):

Another issue: https://github.com/gdcc/exporter-croissant/issues/3 - we are using sc:Integer for all numeric types. I'd like to use sc:Number instead but I get this error from the validator:

-  [Metadata(Cars) > RecordSet() > Field(weight)] The field does not specify a valid http://mlcommons.org/croissant/dataType, neither does any of its predecessor. Got: [rdflib.term.URIRef('https://schema.org/Number')]

Philip Durbin 🚀 (Jun 03 2024 at 20:43):

I just merged a fix: https://github.com/gdcc/exporter-croissant/pull/4

Philip Durbin 🚀 (Jun 04 2024 at 02:10):

I put out a new release: https://repo1.maven.org/maven2/io/gdcc/export/croissant/0.1.2/

Philip Durbin 🚀 (Jun 10 2024 at 11:16):

There's a decent chance I'll be presenting at a Croissant Task Force meeting, even as soon as Wednesday. I'll keep you posted.

Philip Durbin 🚀 (Jun 12 2024 at 11:08):

Yep, I'll be presenting in about 5 hours - 12:05 pm Boston time: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/ONqgdyKJAAAJ

Philip Durbin 🚀 (Jun 12 2024 at 16:54):

That went pretty well, I think.

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:14):

I think it was great!

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:16):

I am still concerned about 1) generating this Croissant/JSON-LD on each request and 2) including it on each page

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:17):

I wonder if there could be an ideal case where we cache the Croissant/JSON-LD, add a request parameter that controls whether it is included, and list the URL with that parameter in robots.txt

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:17):

but at least for normal browsing of the application it could be left out (no Croissant/JSON-LD in the headers)

Philip Durbin 🚀 (Jun 12 2024 at 17:26):

All exports are cached (written to disk).

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:36):

I see... logger.fine("Returning cached schema.org JSON-LD."); :whoops:

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:36):

As a wise man said once "now we know" :laughing:

Oliver Bertuch (Jul 02 2024 at 15:49):

@Philip Durbin do you want / need help with making use of the new exporter Parent POM for the Croissant exporter?

Oliver Bertuch (Jul 02 2024 at 15:51):

Not sure if you have seen what I did with https://github.com/gdcc/dataverse-exporters

Philip Durbin 🚀 (Jul 02 2024 at 15:53):

Was it within the last two weeks? I was out.

Oliver Bertuch (Jul 02 2024 at 15:56):

Yes it was :smile_cat:

Philip Durbin 🚀 (Jul 02 2024 at 18:46):

Ok, I'm still catching up on hundreds of GitHub emails. I'll see it eventually. :crazy:

Philip Durbin 🚀 (Jul 03 2024 at 12:55):

During today's Croissant Task Force meeting at noon-ish Boston time (12:05) they will be discussing future plans for Croissant.

Please feel free to DM me for the links to the meeting (on Google Meet) or the doc they will be discussing.

Slava Tykhonov (Jul 03 2024 at 13:18):

We're also testing Croissant implementation(s) in our multimodal repository (video, audio, text, haptics) https://database.sharemusic.se/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5072/FK2/T55YDC

Philip Durbin 🚀 (Jul 03 2024 at 18:26):

Oh! Does that mean you installed the Croissant jar I made, @Slava Tykhonov :grinning:

Philip Durbin 🚀 (Aug 22 2024 at 20:16):

We installed Croissant on demo and Harvard Dataverse. Please see https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/dLmV7HTcAgAJ

Philip Durbin 🚀 (Aug 22 2024 at 20:18):

I also posted an update here: https://github.com/mlcommons/croissant/issues/530#issuecomment-2305479611

Philip Durbin 🚀 (Aug 28 2024 at 19:52):

@Julian Gautier I got that same Invalid object type for field "distribution" email and just opened an issue about it: https://github.com/mlcommons/croissant/issues/725

Julian Gautier (Aug 28 2024 at 20:34):

Ah thanks. Was https://validator.schema.org not returning this error earlier? It is now, too, but I checked only today

Philip Durbin 🚀 (Aug 28 2024 at 20:35):

Come to think of it, I think it's been returning that error all along.

Philip Durbin 🚀 (Aug 28 2024 at 20:35):

I suspect the Search Console folks are not talking to the Croissant folks. Not sure.

Philip Durbin 🚀 (Sep 03 2024 at 20:08):

"Indeed the Search console doesn't know about Croissant yet. It only validates mark-up based on the schema.org vocabulary, which expects distribution to be of type sc:DataDownload. I will get in touch with them to figure out how to best address this issue." --Omar

Philip Durbin 🚀 (Sep 04 2024 at 13:34):

@Leo Andreev I believe you get these emails too. Please see the GitHub issue above for more info.

Philip Durbin 🚀 (Oct 22 2024 at 21:05):

A little bird told me that @Slava Tykhonov is giving a talk at the Croissant Working Group meeting tomorrow:

"Slava Tykhonov (DANS-KNAW) will talk about supporting external controlled vocabularies in Dataverse, and we can brainstorm on how we would like to support them in the next version of Croissant."

To join the call see How to Join and Access Croissant Working Group Resources at https://mlcommons.org/working-groups/data/croissant/

Philip Durbin 🚀 (Oct 23 2024 at 16:48):

Slava's slides: https://docs.google.com/presentation/d/1PepV5qOITW2heil_iDts6xoB9CHM7Pjj/edit?usp=sharing&ouid=117275479921759507378&rtpof=true&sd=true

Philip Durbin 🚀 (Oct 31 2024 at 20:17):

Dataverse has been added to the main Croissant image :tada:

croissant-summary-v11.png

Philip Durbin 🚀 (Oct 31 2024 at 20:18):

Stefano used the updated image in his tweet today: https://x.com/iacus/status/1852061999854948814

Slava Tykhonov (Nov 01 2024 at 07:41):

That's fantastic!

Philip Durbin 🚀 (Jan 13 2025 at 16:22):

In https://github.com/gdcc/exporter-croissant/pull/7 I'm proposing we update the media type from application/json to application/ld+json; profile="http://mlcommons.org/croissant/1.0" to be more specific.
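
To see what the more specific media type buys a client, here is a small sketch that splits the proposed value into its base type and `profile` parameter using Python's standard `email.message` module. `split_media_type` is a hypothetical helper for illustration; it is not part of the exporter.

```python
from email.message import Message
from typing import Optional, Tuple

# The media type proposed in the pull request above.
CROISSANT_MEDIA_TYPE = (
    'application/ld+json; profile="http://mlcommons.org/croissant/1.0"'
)


def split_media_type(value: str) -> Tuple[str, Optional[str]]:
    """Return (base media type, profile parameter or None)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_param() strips the surrounding quotes from the parameter value.
    return msg.get_content_type(), msg.get_param("profile")


print(split_media_type(CROISSANT_MEDIA_TYPE))
print(split_media_type("application/json"))
```

With plain `application/json` there is no profile to inspect, so a client cannot tell Croissant apart from any other JSON export without opening the body.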

Philip Durbin 🚀 (Jan 13 2025 at 16:23):

@Slava Tykhonov and others, does this change make sense to you? Please see also the issue the PR closes: https://github.com/gdcc/exporter-croissant/issues/6

Slava Tykhonov (Jan 13 2025 at 16:38):

Makes sense as discussed last week on Croissant call with Signposting.

Philip Durbin 🚀 (Jan 13 2025 at 16:42):

Great, thanks. I also created this issue upstream to add that media type to the Croissant spec: https://github.com/mlcommons/croissant/issues/792

Paul TO (Feb 10 2025 at 15:36):

Hi all, sorry for the question out of the blue; I hope this is the right place to ask. I work for Open Targets, an organisation that provides open-access biomedical data. Our data is in Parquet format and we are looking for a standardised way to support dataset discovery and schema description. Croissant seems to be a perfect match, even though our use is not directly for ML. The problem is that we are not quite sure how to express some data structures or data types. For example, how do we express key-value pairs or a list nested in a list? The data types defined by Schema.org seem minimal; will Croissant expand on them? Thanks in advance.

Philip Durbin 🚀 (Feb 10 2025 at 15:38):

@Paul TO hi! It sounds like you should join the Croissant community! https://github.com/mlcommons/croissant#getting-involved

There's a meeting every Wednesday and a mailing list where you can ask questions like this.

Philip Durbin 🚀 (Feb 10 2025 at 15:39):

@Slava Tykhonov might have some ideas for you. He's on some of the papers.

Philip Durbin 🚀 (Feb 10 2025 at 15:39):

@Paul TO key value pairs of what?

Paul TO (Feb 10 2025 at 15:40):

Philip Durbin ☃️ said:

Paul TO key value pairs of what?

We have a column of map<string, list<string>>.

Philip Durbin 🚀 (Feb 10 2025 at 15:40):

Sure, but I'm curious what kind of data.

Philip Durbin 🚀 (Feb 10 2025 at 15:40):

biomedicine, obviously

Philip Durbin 🚀 (Feb 10 2025 at 15:41):

broadly

Paul TO (Feb 10 2025 at 15:44):

Philip Durbin ☃️ said:

biomedicine, obviously

It's a column in our drug dataset named crossReferences, here is one of the records:
{DailyMed=[oxybutynin, oxybutynin%20chloride], PubChem=[174006905, 50105262, 90341037], Wikipedia=[Oxybutynin], drugbank=[DB01062], chEBI=[7856]}

Btw thank you so much for your prompt response!

Philip Durbin 🚀 (Feb 10 2025 at 15:47):

Sure. As you know, Croissant is built on top of Schema.org. That means you can use whatever Schema.org fields you like.

Paul TO (Feb 10 2025 at 15:54):

Philip Durbin ☃️ said:

Sure. As you know, Croissant is built on top of Schema.org. That means you can use whatever Schema.org fields you like.

Schema.org also doesn't provide dict or list and we try to avoid extending Schema.org ourselves as we want to follow a standardised specification. Anyway thanks for your help, I will look somewhere else for more information, maybe joining the mailing list :big_smile:

Philip Durbin 🚀 (Feb 10 2025 at 15:55):

What about additionalProperty at https://schema.org/Drug ?
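
One way to read the `additionalProperty` suggestion: each key of the `crossReferences` map becomes a Schema.org `PropertyValue` with a `name` and a list of values. The sketch below is only an illustration under that assumption. `to_additional_properties` is a hypothetical helper, and it describes a single record as Schema.org metadata; it does not show how to declare a map-typed column as a Croissant field.

```python
import json

# The crossReferences record quoted above, written as a Python dict.
cross_references = {
    "DailyMed": ["oxybutynin", "oxybutynin%20chloride"],
    "PubChem": ["174006905", "50105262", "90341037"],
    "Wikipedia": ["Oxybutynin"],
    "drugbank": ["DB01062"],
    "chEBI": ["7856"],
}


def to_additional_properties(mapping: dict) -> list:
    """Turn a map<string, list<string>> into Schema.org PropertyValue items."""
    return [
        {"@type": "PropertyValue", "name": key, "value": values}
        for key, values in mapping.items()
    ]


drug = {
    "@context": "https://schema.org",
    "@type": "Drug",
    "additionalProperty": to_additional_properties(cross_references),
}
print(json.dumps(drug, indent=2))
```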

Philip Durbin 🚀 (Feb 10 2025 at 15:56):

"A property-value pair representing an additional characteristic of the entity, e.g. a product feature or another characteristic for which there is no matching property in schema.org."

Paul TO (Feb 10 2025 at 16:26):

I was just in a meeting with my direct supervisor. Coincidentally, he also made connections with major contributors to Croissant at an AI seminar, and we may contribute to the specification soon. I think we will join the meeting.

Philip Durbin 🚀 (Feb 10 2025 at 16:37):

Cool. I dip in and out but @Slava Tykhonov attends more consistently. It's a nice meeting.

Philip Durbin 🚀 (Mar 04 2025 at 17:45):

@Slava Tykhonov @Jan Range did we ever merge the Croissant branch into pyDataverse? Also, can that branch be used to create a Croissant file from a draft dataset? Or does the dataset need to be published? I'm asking because a conference is considering hosting datasets on Harvard Dataverse but they want the dataset to be in draft AND have a Croissant file (which isn't possible in Dataverse itself, since metadata export formats like Croissant are only available AFTER publication).

Jan Range (Mar 04 2025 at 20:27):

It is not merged, but we could do that if it's ready @Slava Tykhonov :smile:

Philip Durbin 🚀 (Mar 05 2025 at 16:33):

@Jan Range I requested a review from you on https://github.com/gdcc/dataverse-recipes/pull/6

(Slava and I are talking on the side about support for drafts, etc.)

Jan Range (Mar 06 2025 at 21:22):

Sorry for the delay, had a presentation today and things ended up last minute :grinning:

Reviewed the PR and looks good! Also tested the requirements thing and opened a PR

https://github.com/gdcc/dataverse-recipes/pull/7

Philip Durbin 🚀 (Mar 06 2025 at 21:30):

Thanks! I left a comment: https://github.com/gdcc/dataverse-recipes/pull/7/files#r1984080290

Philip Durbin 🚀 (Jun 06 2025 at 15:07):

We see "@type": "WebApplication" in the JSON-LD in the <head> of https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker . Regular JSON-LD. It's software.

We see "@type": "sc:Dataset" (and a bunch of Croissant fields) in the <head> of https://huggingface.co/datasets/siacus/flourishing . True Croissant. :croissant: It's a dataset.

(This is all under <script type="application/ld+json">, of course.)

I bring this up because @Oliver Bertuch and I were talking about dataset types (#dev > datasetType (software, workflow, etc.) ). When datasetType=software, what do we want in the <head>? Not Croissant, I suppose! We'd follow Hugging Face's lead, I'd think, right @Slava Tykhonov?
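
A quick way to check what a page advertises in its `<head>` is to pull out the `application/ld+json` scripts and look at `@type`. The sketch below uses only the Python standard library. `JsonLdExtractor`, `json_ld_types`, and the sample HTML are made up for illustration, and it assumes each script holds a single JSON object.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_json_ld = False


def json_ld_types(html: str) -> list:
    """Return the @type of every JSON-LD document embedded in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [doc.get("@type") for doc in parser.documents]


# Made-up page standing in for a software landing page.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "WebApplication", "name": "demo"}</script>
</head><body></body></html>"""

print(json_ld_types(SAMPLE_HTML))
```

A landing page for a dataset would report `sc:Dataset` here instead, which is the distinction being drawn between the two Hugging Face pages.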

Philip Durbin 🚀 (Jun 06 2025 at 15:22):

@Oliver Bertuch also, I'm suggesting they link to the spec from the README in https://github.com/mlcommons/croissant/pull/887

Philip Durbin 🚀 (Jul 30 2025 at 13:50):

As I just mentioned on the mailing list, the "summary statistics (mean, max, min, etc.)" issue I opened a while back at https://github.com/mlcommons/croissant/issues/640 got a comment.

@Slava Tykhonov I know you're interested in this.

Also the DDI folks I can think of: @Amber Leahey @Janet McDougall @Victoria Lubitch @Leo Andreev

Philip Durbin 🚀 (Aug 26 2025 at 19:09):

I was just chatting with @Slava Tykhonov and he made me realize that when you create a preview URL for a dataset, you can use the token to export a draft export like Croissant. Here's an example: https://demo.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi:10.70122/FK2/QH4PDC&version=:draft&key=469367f2-357d-4df6-8f15-1bcb0e9a426b

Philip Durbin 🚀 (Aug 26 2025 at 19:10):

So, while the script I added in https://github.com/gdcc/dataverse-recipes/pull/19 to download a Croissant file using one's own API token is still useful, this is a way to share a link without using your API token. Instead, you're using the token from the preview URL.
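
The draft export link above follows a simple pattern that can be assembled programmatically. This is a minimal sketch based only on the query parameters visible in that example URL (`exporter`, `persistentId`, `version`, `key`). `draft_croissant_url` is a hypothetical helper, and the token shown is a placeholder rather than a real preview URL token.

```python
from urllib.parse import urlencode


def draft_croissant_url(base_url: str, persistent_id: str, preview_token: str) -> str:
    """Build the export URL for the draft version of a dataset."""
    query = urlencode(
        {
            "exporter": "croissant",
            "persistentId": persistent_id,
            "version": ":draft",
            "key": preview_token,
        },
        safe=":/",  # keep "doi:10.../..." and ":draft" readable, as in the example
    )
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"


print(
    draft_croissant_url(
        "https://demo.dataverse.org",
        "doi:10.70122/FK2/QH4PDC",
        "PREVIEW-URL-TOKEN",  # placeholder, not a real token
    )
)
```

Anyone holding such a link can read the draft export, so it makes sense to share it as carefully as the preview URL itself.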

Slava Tykhonov (Aug 26 2025 at 19:13):

This should go in the tutorial asap.

Philip Durbin 🚀 (Aug 26 2025 at 19:15):

Yeah. Hmm, maybe I can put it in the README at https://github.com/gdcc/dataverse-recipes/tree/main/python/download_draft_croissant at least. :sweat_smile:


Last updated: Nov 01 2025 at 14:11 UTC