Croissant · dev · Zulip Chat Archive

Stream: dev

Topic: Croissant

Philip Durbin 🚀 (Mar 14 2024 at 14:09):

#10341 is the issue we're using to track Croissant support.

There's also good discussion at https://github.com/mlcommons/croissant/issues/530

@Slava Tykhonov @Jan Range and others, I'm getting variable-level metadata from the "datasetFileDetails" JSON. See https://github.com/gdcc/dataverse-exporters/commit/dbc3fa000ebe51ef9e9e4f7ef31d955afc77ed2a

Philip Durbin 🚀 (Mar 14 2024 at 15:52):

I'm far from finished but I built a jar from https://github.com/gdcc/dataverse-exporters/compare/main...croissant and put it on https://dev3.dataverse.org if anyone wants to play with it:

Screenshot-2024-03-14-at-11.51.42-AM.png

Slava Tykhonov (Mar 14 2024 at 17:12):

Looks good! How about adding a URL to an external microservice in the Payara parameters to produce Croissant etc?

Philip Durbin 🚀 (Mar 14 2024 at 18:00):

Sure, but I think that's a different topic. I just kicked it off: #dev > exporters as external services

Jan Range (Mar 15 2024 at 09:06):

@Slava Tykhonov I have added DataFrame support to EasyDataverse (see example colab and PR). Since EasyDataverse will eventually be merged, it will also be part of pyDataverse.

I was thinking that adding the croissant export to this class, which currently handles tabular data, would make sense. What are your thoughts?

Slava Tykhonov (Mar 15 2024 at 11:40):

Jan, it looks great! Can you also think about adding methods to get from this class 1) dataframe statistics (mean, median, etc) 2) column names and their types 3) dataframe json export?

This feature should also be able to read file types (spreadsheet, tabular, etc.) and get files ingested into a dataframe. So we can also connect it with Croissant metadata and prepare for ML coming.

Jan Range (Mar 15 2024 at 12:13):

Thanks @Slava Tykhonov! Of course, I am happy to add these to the class as well as an importer. Are the .describe() statistics sufficient?

image.png

Philip Durbin 🚀 (Mar 15 2024 at 12:16):

It occurs to me that perhaps I should extend an invitation to the Croissant meeting on Wednesday to a wider audience. Basically, @Slava Tykhonov and I will be showing the Croissant team what we have so far. I can (privately) send the Google Meet invite or people can sign up for the mailing list: https://github.com/mlcommons/croissant#getting-involved

Jan Range (Mar 15 2024 at 12:19):

@Philip Durbin happy to join and listen if feasible by time. At what time is the meeting on Thursday?

Philip Durbin 🚀 (Mar 15 2024 at 12:22):

Wednesday. "Weekly on Wednesday from 9:05am-10:00am Pacific." -- https://mlcommons.org/working-groups/data/croissant/

Jan Range (Mar 15 2024 at 12:24):

My goodness, I am ready for the weekend. Wednesday it is :joy:

Jan Range (Mar 15 2024 at 12:25):

9AM works well for me

Philip Durbin 🚀 (Mar 15 2024 at 12:48):

Now the question is if I should email the Dataverse Google Group. :thinking: How much do I want to embarrass myself? :crazy:

Philip Durbin 🚀 (Mar 15 2024 at 15:14):

I guess I want to embarrass myself. I just posted this: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/DqEIkiwlAgAJ :grinning:

Jan Range (Mar 15 2024 at 16:16):

@Slava Tykhonov I have added the stats and JSON export - https://github.com/gdcc/easyDataverse/pull/14

Following up with the importers on Monday. Off to the switzerland now :flag_switzerland:

Philip Durbin 🚀 (Mar 15 2024 at 16:20):

The land of the Switzer. Enjoy!

@Jan Range when you're back, I had a question about that PR at #python > tabular data addition

Philip Durbin 🚀 (Mar 15 2024 at 17:01):

I have lots of questions about Croissant. :grinning: I started a doc: https://docs.google.com/document/d/1C33FAR6s421WV9U50dzlBkVZRTWTlWguc-RoxakOly0/edit?usp=sharing

Jan Range (Mar 15 2024 at 17:07):

Ah nice, will add my questions once home :smile:

Philip Durbin 🚀 (Mar 15 2024 at 17:08):

Cool. Please leave comments for now. And I'll move these last few messages to #dev > Croissant

Slava Tykhonov (Mar 15 2024 at 18:39):

Screenshot-2024-03-15-at-19.38.57.png
Just in case: I've got variables from DDI in Croissant export.

Philip Durbin 🚀 (Mar 15 2024 at 18:48):

recordSet! Looking good!

Slava Tykhonov (Mar 18 2024 at 15:47):

Extended with all Croissant fields which I can recognise and map: https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/croissant_sample.json

Slava Tykhonov (Mar 18 2024 at 16:18):

I'll move all semantic mappings to the separate file and will publish Croissant mapper on GitHub.

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:02):

That looks great, @Slava Tykhonov! I noticed that you used the 2.0 version and it doesn't show a validation error like the one that @Philip Durbin reported in #609. I was playing a little bit with your sample and if I post the @context before the version (your sample has it at the end) the error will display:
Version doesn't follow MAJOR.MINOR.PATCH: 2.0.

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:03):

Are we doing something wrong here by posting the context first or this may be indeed a bug on the validator :thinking:

Slava Tykhonov (Mar 18 2024 at 19:09):

pip3 install --upgrade git+https://github.com/Dans-labs/pyDataverse@croissant#egg=pyDataverse --break-system-packages

Slava Tykhonov (Mar 18 2024 at 19:09):

from pyDataverse.Croissant import Croissant
import json

host = "https://dataverse.nl"
DOI = "doi:10.34894/KMRAYH"
croissant = Croissant(host, DOI)
c = croissant.get_record()
print(json.dumps(c, indent=4, default=str))

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:31):

Actually I probably was just looking at the wrong place because I am getting the same validation error :sweat_smile:

Slava Tykhonov (Mar 18 2024 at 19:34):

Can you make screenshot, @Juan Pablo Tosca Villanueva ?

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:35):

  -  [Metadata(Quality_of_care__UP_TSU)] Version doesn't follow MAJOR.MINOR.PATCH: 2.0. For more information refer to: https://semver.org/spec/v2.0.0.html

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:35):

Sure

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:36):

image.png

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:37):

the validate.sh just runs the validator on that sample

Juan Pablo Tosca Villanueva (Mar 18 2024 at 19:37):

image.png

Slava Tykhonov (Mar 18 2024 at 19:42):

I see, thanks! Will fix it here then https://colab.research.google.com/drive/1H-dfY_TBh6eXLkD7tUlqsxUDEsDCQPiD?usp=sharing

Slava Tykhonov (Mar 18 2024 at 22:13):

mlcroissant validate --jsonld /tmp/croissant1
I0318 23:13:19.350207 139745363906624 validate.py:53] Done.

Juan Pablo Tosca Villanueva (Mar 18 2024 at 23:27):

Oh! Did you fix it by switching to 1.0.0?

Slava Tykhonov (Mar 18 2024 at 23:29):

Added ".0" in the end of version :)

Juan Pablo Tosca Villanueva (Mar 18 2024 at 23:31):

I think that works to pass the validation, but dataset versions come in X.X format. I think that is why Phil opened that issue, but he should be around tomorrow to enlighten us with more about this :rolling_on_the_floor_laughing:
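
The ".0" workaround discussed above could be sketched roughly as follows. This is only an illustration, assuming dataset versions arrive as MAJOR.MINOR strings such as "2.0"; the function name `to_semver` is hypothetical and is not taken from either exporter.

```python
import re


def to_semver(dataset_version: str) -> str:
    """Pad a MAJOR.MINOR dataset version so it reads as MAJOR.MINOR.PATCH.

    Hypothetical helper: it only illustrates the 'append .0' workaround.
    """
    # Already in MAJOR.MINOR.PATCH form: leave it untouched.
    if re.fullmatch(r"\d+\.\d+\.\d+", dataset_version):
        return dataset_version
    # MAJOR.MINOR (for example "2.0"): append a PATCH component of 0.
    if re.fullmatch(r"\d+\.\d+", dataset_version):
        return dataset_version + ".0"
    raise ValueError(f"Unexpected version format: {dataset_version!r}")
```

The padded value satisfies the validator's MAJOR.MINOR.PATCH check, but it no longer matches the version string shown for the dataset, which is the concern raised above.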

Philip Durbin 🚀 (Mar 19 2024 at 01:47):

Ha. Thanks for the :thumbs_up: on https://github.com/mlcommons/croissant/issues/609 :heart:

Philip Durbin 🚀 (Mar 19 2024 at 16:40):

I just pushed a commit to export variable-level metadata: https://github.com/gdcc/dataverse-exporters/pull/4/commits/4f4361260d294280614e1112b291d632982a9dbd

Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:43):

Amazing!

Slava Tykhonov (Mar 19 2024 at 17:08):

Phil, a slightly naive question: how will it be maintained when Croissant gets updates?

Slava Tykhonov (Mar 19 2024 at 18:04):

I've put in slides those bullet points currently missing in Croissant:
Sensitive vs Restricted files
Embargo
Provenance, data ownership transfer
Primary and secondary (derivative) datasets

Do you see more?

Philip Durbin 🚀 (Mar 19 2024 at 18:35):

Well, on a related note, how are you handling original vs. archival versions of files? foo.dta (Stata) vs foo.tab (archival). For now I'm only presenting the original.

Slava Tykhonov (Mar 19 2024 at 18:56):

I'm reading original and comparing with .tab versions, and linking them in the graph.

Philip Durbin 🚀 (Mar 19 2024 at 20:10):

Cool. Can you show an example?

Philip Durbin 🚀 (Mar 19 2024 at 20:44):

I just added "creator": https://github.com/gdcc/dataverse-exporters/pull/4/commits/7a0d8183cc7d4ef3f1864af56374fb08726732af

@Slava Tykhonov you seem to be doing just key/value here.

Slava Tykhonov (Mar 19 2024 at 21:06):

In the first version yes, but I extended it yesterday with affiliation and person name: https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/odissei-croissant.json

Philip Durbin 🚀 (Mar 19 2024 at 21:07):

ah, great!

Slava Tykhonov (Mar 19 2024 at 21:07):

Screenshot-2024-03-19-at-22.07.01.png
I moved all semantic transformations outside to have FAIR semantic mappings in the separate file(s)

Philip Durbin 🚀 (Mar 19 2024 at 21:08):

Nice. Should be a fun call tomorrow. :grinning:

Slava Tykhonov (Mar 19 2024 at 21:08):

The idea is to load any "custom" mappings from GitHub and let the community maintain them without touching the source code.

Slava Tykhonov (Mar 19 2024 at 21:09):

I'm not sure if someone is actually doing that to get mappings in the knowledge graph. :)

Philip Durbin 🚀 (Mar 19 2024 at 21:12):

For the @id of a file you're using the database id:

    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "f3056770",
            "name": "DoD_R1.DTA",

Right now I'm showing the filename, like the spec shows. It's more readable. But your way is more precise.
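
The two `@id` choices compared above can be put side by side in a small sketch. It assumes the "f3056770" value is the letter "f" followed by the file's database id, which is an inference from the sample rather than something stated in the thread; `file_object` is a hypothetical helper and only fills the three keys shown in the snippet.

```python
from typing import Optional


def file_object(name: str, database_id: Optional[int] = None) -> dict:
    """Build a minimal cr:FileObject entry for the "distribution" list.

    With a database id, "@id" is "f" + id (assumed convention, more precise);
    without one, "@id" falls back to the filename (more readable).
    """
    file_id = f"f{database_id}" if database_id is not None else name
    return {"@type": "cr:FileObject", "@id": file_id, "name": name}
```

The filename form is only unique while no two files in the dataset share a name, which is where the database id form helps.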

Slava Tykhonov (Mar 19 2024 at 21:15):

It's more clear when you have in DDI variables in a few files.

Philip Durbin 🚀 (Mar 19 2024 at 21:17):

Have you experimented with path/to/data.dta? Having a file hierarchy?

Slava Tykhonov (Mar 19 2024 at 21:22):

No, can you give me example with DOI?

Philip Durbin 🚀 (Mar 19 2024 at 21:26):

Sure for my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP for example, I have a tabular file in a directory called "data".

Slava Tykhonov (Mar 19 2024 at 21:44):

Ok, I'll take a look. Updated version of slides for tomorrow meeting. https://docs.google.com/presentation/d/1hEqIFE9yS3aePLhRDgnuw9PdL-Az7ORj6utTAenNtbw/edit?usp=sharing

Philip Durbin 🚀 (Mar 19 2024 at 21:44):

Cool. No rush.

Philip Durbin 🚀 (Mar 19 2024 at 21:45):

I'm also wondering about citeAs.

Philip Durbin 🚀 (Mar 19 2024 at 21:45):

It seems like it's ideally for a paper about a dataset:

"citeAs": "@Article{asano21pass, author = \"Yuki M. Asano and Christian Rupprecht and Andrew Zisserman and Andrea Vedaldi\", title = \"PASS: An ImageNet replacement for self-supervised pretraining without humans\", journal = \"NeurIPS Track on Datasets and Benchmarks\", year = \"2021\" }",

Slava Tykhonov (Mar 19 2024 at 21:45):

citeAs has to be built from author names and date. I'm not sure if they're doing it right to be honest.

Philip Durbin 🚀 (Mar 19 2024 at 21:46):

But where do I put the DOI of the dataset itself? Kaggle is putting it in "identifier" (which isn't in the spec) and "citeAs" (which I'm not sure is right).

Philip Durbin 🚀 (Mar 19 2024 at 21:47):

Ok, well citeAs is on my list to ask about tomorrow: https://docs.google.com/document/d/1C33FAR6s421WV9U50dzlBkVZRTWTlWguc-RoxakOly0/edit?usp=sharing :grinning:

Slava Tykhonov (Mar 19 2024 at 21:47):

I think we need to use citeAs as it's implemented in Dataverse right now

Slava Tykhonov (Mar 19 2024 at 21:48):

Can you make screenshot of Dataverse with Croissant button and some example in json-ld? For slides? (coming in production in next version)

Philip Durbin 🚀 (Mar 19 2024 at 21:48):

Huh, citeAs is in our Signposting:

$ ack citeAs
src/main/java/edu/harvard/iq/dataverse/util/SignpostingResources.java
64:            String citeAs = "<" + ds.getPersistentURL() + ">;rel=\"cite-as\"";
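
For readers following along in Python, the Java line above concatenates a Signposting link value. A rough equivalent, with a hypothetical function name and an illustrative DOI URL in the test, might look like this:

```python
def cite_as_link(persistent_url: str) -> str:
    """Format a Signposting 'cite-as' link value for a persistent URL.

    Mirrors the string concatenation in SignpostingResources.java shown above;
    the function name is made up for this sketch.
    """
    return f'<{persistent_url}>;rel="cite-as"'
```

Note that this `cite-as` relation points at the dataset's persistent URL, which is a different thing from the BibTeX-style string in Croissant's `citeAs` example.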

Philip Durbin 🚀 (Mar 19 2024 at 21:49):

Yes, sure.

Slava Tykhonov (Mar 19 2024 at 21:49):

Cool! So I'll tell my part of the story: Phil is building the "production ready" Croissant export and I'm moving all crosswalks outside of the implementation to invite the community to maintain them.

Slava Tykhonov (Mar 19 2024 at 21:50):

I have a slide on Signposting, btw :)

Philip Durbin 🚀 (Mar 19 2024 at 21:52):

Here's an example (sort of a work in progress, honestly): https://dev3.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5072/FK2/DZRHUP

Philip Durbin 🚀 (Mar 19 2024 at 21:53):

And you are welcome to grab a screenshot of the Croissant button from https://dev3.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/DZRHUP

Jan Range (Mar 20 2024 at 09:11):

Due to a shorthand collision, I won't be able to participate in the Croissant meeting tonight :anguish: @Slava Tykhonov Would you like to join the next PyDataverse meeting and talk about the Croissant extension? It is happening next Wednesday at 4 PM CET

Slava Tykhonov (Mar 20 2024 at 09:34):

Hi Jan, I'll join pyDataverse next week then. Croissant extension is kind of working :) https://colab.research.google.com/drive/1H-dfY_TBh6eXLkD7tUlqsxUDEsDCQPiD?usp=sharing#scrollTo=WDDs2hdcJnED

Philip Durbin 🚀 (Mar 20 2024 at 11:46):

@Jan Range no worries

Jan Range (Mar 20 2024 at 13:26):

@Slava Tykhonov looks great! Thanks for sharing :tada:

Philip Durbin 🚀 (Mar 20 2024 at 14:49):

@Juan Pablo Tosca Villanueva I added more error checking to the Croissant exporter. You should be able to upload any file now: https://github.com/gdcc/dataverse-exporters/pull/4/commits/9316860b1869108d5cf64499b04252bde86575f4

Slava Tykhonov (Mar 20 2024 at 15:02):

Phil, you should mention this during the meeting today. It looks like Editor isn't very stable, should be improved.

Philip Durbin 🚀 (Mar 20 2024 at 19:01):

There wasn't enough time!

Philip Durbin 🚀 (Mar 20 2024 at 19:02):

@Slava Tykhonov thanks for posting https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/FuajmDKEAQAJ ... an update with links to slides, notes, etc.

Philip Durbin 🚀 (Apr 12 2024 at 17:51):

@Slava Tykhonov following up on the Croissant call this week... I see what you mean about the lack of backward compatibility within 1.0. I just upgraded from mlcroissant 1.0.3 to 1.0.5 and now I see this new error:

WARNING: The JSON-LD @context is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'examples', 'isLiveDataset', 'rai'}
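
A quick way to see which keys differ between an exporter's @context and a reference one is a symmetric set difference, sketched below. This is not mlcroissant's own check, just an approximation of what the warning reports; the function name is hypothetical and the @context values used in any test data would be placeholders.

```python
def context_key_diff(mine: dict, official: dict) -> set:
    """Return the keys present in exactly one of the two @context objects."""
    # Symmetric difference: keys missing from either side show up here.
    return set(mine) ^ set(official)
```

Running it against the @context copied from one of the example datasets would list the keys to add or remove after an mlcroissant upgrade.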

Philip Durbin 🚀 (Apr 12 2024 at 18:04):

@Slava Tykhonov so is that what you do to get the latest, correct @context? Go to https://github.com/mlcommons/croissant/tree/main/datasets/1.0 and pick one of the examples (I picked "titanic") and copy it from there?

Or do you go to https://mlcommons.github.io/croissant/docs/croissant-spec.html which has a different @context?

Help! :sweat_smile:

Philip Durbin 🚀 (Apr 12 2024 at 18:29):

I'm going with titanic and explained to look out for breaking changes in the README: https://github.com/gdcc/dataverse-exporters/pull/4/commits/03dfeddbd8b136aeee9c2642d8bc1852e73b948b

Slava Tykhonov (Apr 12 2024 at 19:10):

Yes, this is exactly why I was working on semantic mappings.

Philip Durbin 🚀 (Apr 12 2024 at 19:27):

In other Croissant news, I'm appending ".0" to dataset versions but as I say here, I'm pretty grumpy about it: https://github.com/mlcommons/croissant/issues/609#issuecomment-2052403311

Philip Durbin 🚀 (Apr 23 2024 at 12:25):

I switched citeAs to bibtex format: https://github.com/gdcc/dataverse-exporters/pull/4/commits/151efb7898164d2f8290f31392d70fb28bfec299

Philip Durbin 🚀 (Apr 26 2024 at 21:41):

I'd love some feedback on this new pull request:

add docs for Croissant, tweak exporter docs #10533

Philip Durbin 🚀 (Apr 29 2024 at 18:33):

@Slava Tykhonov do you think I should open an issue at https://github.com/mlcommons/croissant/issues about where to put summary statistics? I see that you and Rajat volunteered to think about this and I don't want to step on your toes!

Slava Tykhonov (Apr 29 2024 at 19:21):

Just open

Philip Durbin 🚀 (Apr 29 2024 at 19:23):

will do!

Philip Durbin 🚀 (Apr 29 2024 at 19:49):

ok, done: https://github.com/mlcommons/croissant/issues/640

Philip Durbin 🚀 (May 03 2024 at 17:57):

These are the little notes I leave to myself while working on the Croissant exporter:

ls -1 ../max | grep -v croissant | while read i; do FILE=$i; FMT=`echo $FILE | cut -d . -f1`; echo $FMT; cat 27626.debug | jq ".$FMT" -r > $FILE; done

:croissant:

Philip Durbin 🚀 (May 07 2024 at 18:18):

I'm comparing my output with @Slava Tykhonov's and realizing I forgot "description"!

Philip Durbin 🚀 (May 17 2024 at 15:02):

I just wrote a long passage about "version" ("1.0.0" vs "1.0" vs 1.0, etc.): https://github.com/mlcommons/croissant/issues/609#issuecomment-2117798279

What do you think? Am I making sense?

Croissant for both:

(click the Croissant button for Kaggle)
https://beta.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5072/FK2/YKJQY8

Philip Durbin 🚀 (May 30 2024 at 21:09):

I also invited the Dataverse community to play around with the Croissant exporter: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/ARnYS5kpCgAJ

Philip Durbin 🚀 (Jun 03 2024 at 14:48):

I just added some new feedback from Geoff at Kaggle: https://github.com/gdcc/exporter-croissant#differences-from-kaggle

Philip Durbin 🚀 (Jun 03 2024 at 18:18):

I created a pull request for the point about "field" being repeated over and over: https://github.com/gdcc/exporter-croissant/pull/2

Philip Durbin 🚀 (Jun 03 2024 at 18:19):

merged

Philip Durbin 🚀 (Jun 03 2024 at 19:24):

Another issue: https://github.com/gdcc/exporter-croissant/issues/3 - we are using sc:Integer for all numeric types. I'd like to use sc:Number instead but I get this error from the validator:

- [Metadata(Cars) > RecordSet() > Field(weight)] The field does not specify a valid http://mlcommons.org/croissant/dataType, neither does any of its predecessor. Got: [rdflib.term.URIRef('https://schema.org/Number')]
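
One way to avoid a blanket sc:Integer without falling back to sc:Number is to choose between sc:Integer, sc:Float and sc:Text per column. The sketch below infers the type from sample string values; it is a hypothetical illustration and does not describe what the exporter actually does.

```python
def infer_data_type(values: list) -> str:
    """Guess a Croissant dataType for a column from its string values."""

    def all_parse(cast) -> bool:
        # True when every value can be converted with the given cast.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "sc:Integer"
    if all_parse(float):
        return "sc:Float"
    return "sc:Text"
```

A real exporter would more likely rely on the variable-level metadata it already has than on sniffing values, so treat this only as a picture of the mapping.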

Philip Durbin 🚀 (Jun 03 2024 at 20:43):

I just merged a fix: https://github.com/gdcc/exporter-croissant/pull/4

Philip Durbin 🚀 (Jun 04 2024 at 02:10):

I put out a new release: https://repo1.maven.org/maven2/io/gdcc/export/croissant/0.1.2/

Philip Durbin 🚀 (Jun 10 2024 at 11:16):

There's a decent chance I'll be presenting at a Croissant Task Force meeting, even as soon as Wednesday. I'll keep you posted.

Philip Durbin 🚀 (Jun 12 2024 at 11:08):

Yep, I'll be presenting in about 5 hours - 12:05 pm Boston time: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/ONqgdyKJAAAJ

Philip Durbin 🚀 (Jun 12 2024 at 16:54):

That went pretty well, I think.

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:14):

I think it was great!

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:16):

I am still concerned about 1) generating this croissant / JSON-LD on each request and 2)including it on each page

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:17):

I wonder if there could be an ideal case where we could cache the croissant/JSON-LD and also add a parameter to include it or not on the request and include that URL with the param on the robots.txt

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:17):

but at least for normal browsing of the application it could be ignored (no croissant - JSON-LD) in the headers

Philip Durbin 🚀 (Jun 12 2024 at 17:26):

All exports are cached (written to disk).

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:36):

I see... logger.fine("Returning cached schema.org JSON-LD."); :whoops:

Juan Pablo Tosca Villanueva (Jun 12 2024 at 17:36):

As a wise man said once "now we know" :laughing:

Oliver Bertuch (Jul 02 2024 at 15:49):

@Philip Durbin do you want / need help with making use of the new exporter Parent POM for the Croissant exporter?

Oliver Bertuch (Jul 02 2024 at 15:51):

Not sure if you have seen what I did with https://github.com/gdcc/dataverse-exporters

Philip Durbin 🚀 (Jul 02 2024 at 15:53):

Was it within the last two weeks? I was out.

Oliver Bertuch (Jul 02 2024 at 15:56):

Yes it was :smile_cat:

Philip Durbin 🚀 (Jul 02 2024 at 18:46):

Ok, I'm still catching up on hundreds of GitHub emails. I'll see it eventually. :crazy:

Philip Durbin 🚀 (Jul 03 2024 at 12:55):

During today's Croissant Task Force meeting at noon-ish Boston time (12:05) they will be discussing future plans for Croissant.

Please feel free to DM me for the links to the meeting (on Google Meet) or the doc they will be discussing.

Slava Tykhonov (Jul 03 2024 at 13:18):

We're also testing Croissant implementation(s) in our multimodal repository (video, audio, text, haptics) https://database.sharemusic.se/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5072/FK2/T55YDC

Philip Durbin 🚀 (Jul 03 2024 at 18:26):

Oh! Does that mean you installed the Croissant jar I made, @Slava Tykhonov :grinning:

Philip Durbin 🚀 (Aug 22 2024 at 20:16):

We installed Croissant on demo and Harvard Dataverse. Please see https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/dLmV7HTcAgAJ

Philip Durbin 🚀 (Aug 22 2024 at 20:18):

I also posted an update here: https://github.com/mlcommons/croissant/issues/530#issuecomment-2305479611

Philip Durbin 🚀 (Aug 28 2024 at 19:52):

@Julian Gautier I got that same Invalid object type for field "distribution" email and just opened an issue about it: https://github.com/mlcommons/croissant/issues/725

Julian Gautier (Aug 28 2024 at 20:34):

Ah thanks. Was https://validator.schema.org not returning this error earlier? It is now, too, but I checked only today

Philip Durbin 🚀 (Aug 28 2024 at 20:35):

Come to think of it, I think it's been returning that error all along.

Philip Durbin 🚀 (Aug 28 2024 at 20:35):

I suspect the Search Console folks are not talking to the Croissant folks. Not sure.

Philip Durbin 🚀 (Sep 03 2024 at 20:08):

"Indeed the Search console doesn't know about Croissant yet. It only validates mark-up based on the schema.org vocabulary, which expects distribution to be of type sc:DataDownload. I will get in touch with them to figure out how to best address this issue." --Omar

Philip Durbin 🚀 (Sep 04 2024 at 13:34):

@Leo Andreev I believe you get these emails too. Please see the GitHub issue above for more info.

Philip Durbin 🚀 (Oct 22 2024 at 21:05):

A little bird told me that @Slava Tykhonov is giving a talk at the Croissant Working Group meeting tomorrow:

"Slava Tykhonov (DANS-KNAW) will talk about supporting external controlled vocabularies in Dataverse, and we can brainstorm on how we would like to support them in the next version of Croissant."

To join the call see How to Join and Access Croissant Working Group Resources at https://mlcommons.org/working-groups/data/croissant/

Philip Durbin 🚀 (Oct 23 2024 at 16:48):

Slava's slides: https://docs.google.com/presentation/d/1PepV5qOITW2heil_iDts6xoB9CHM7Pjj/edit?usp=sharing&ouid=117275479921759507378&rtpof=true&sd=true

Philip Durbin 🚀 (Oct 31 2024 at 20:17):

Dataverse has been added to the main Croissant image :tada:

croissant-summary-v11.png

Philip Durbin 🚀 (Oct 31 2024 at 20:18):

Stefano used the updated image in his tweet today: https://x.com/iacus/status/1852061999854948814

Slava Tykhonov (Nov 01 2024 at 07:41):

That's fantastic!

Philip Durbin 🚀 (Jan 13 2025 at 16:22):

In https://github.com/gdcc/exporter-croissant/pull/7 I'm proposing we update the media type from application/json to application/ld+json; profile="http://mlcommons.org/croissant/1.0" to be more specific.
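
To illustrate what the more specific media type gives a client, here is a minimal sketch that splits the value into its base type and `profile` parameter using Python's standard library. The helper name is hypothetical.

```python
from email.message import Message


def parse_media_type(header_value: str) -> tuple:
    """Split a Content-Type value into (media type, profile or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_param returns None when the parameter is absent.
    return msg.get_content_type(), msg.get_param("profile")
```

With plain `application/json` there is no profile to inspect, so a client cannot tell Croissant apart from any other JSON response.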

Philip Durbin 🚀 (Jan 13 2025 at 16:23):

@Slava Tykhonov and others, does this change make sense to you? Please see also the issue the PR closes: https://github.com/gdcc/exporter-croissant/issues/6

Slava Tykhonov (Jan 13 2025 at 16:38):

Makes sense as discussed last week on Croissant call with Signposting.

Philip Durbin 🚀 (Jan 13 2025 at 16:42):

Great, thanks. I also created this issue upstream to add that media type to the Croissant spec: https://github.com/mlcommons/croissant/issues/792

Paul TO (Feb 10 2025 at 15:36):

Hi all, sorry for my question out of the blue, I hope this is the right place to ask. I work for Open Targets, an organisation that provides open-access biomedical data. Our data is in Parquet format and we are looking for a standardised way to handle dataset discovery and schema description, and Croissant seems to be a perfect match, even though our use is not directly for ML. The problem is, we are not quite sure how to express some data structures or data types. For example, how do we express key-value pairs or a list within a list? It seems the data types defined by Schema.org are minimal; will Croissant expand on this? Thanks in advance.

Philip Durbin 🚀 (Feb 10 2025 at 15:38):

@Paul TO hi! It sounds like you should join the Croissant community! https://github.com/mlcommons/croissant#getting-involved

There's a meeting every Wednesday and a mailing list where you can ask questions like this.

Philip Durbin 🚀 (Feb 10 2025 at 15:39):

@Slava Tykhonov might have some ideas for you. He's on some of the papers.

Philip Durbin 🚀 (Feb 10 2025 at 15:39):

@Paul TO key value pairs of what?

Paul TO (Feb 10 2025 at 15:40):

Philip Durbin ☃️ said:

Paul TO key value pairs of what?

We have a column of map<string, list<string>>.

Philip Durbin 🚀 (Feb 10 2025 at 15:40):

Sure, but I'm curious what kind of data.

Philip Durbin 🚀 (Feb 10 2025 at 15:40):

biomedicine, obviously

Philip Durbin 🚀 (Feb 10 2025 at 15:41):

broadly

Paul TO (Feb 10 2025 at 15:44):

Philip Durbin ☃️ said:

biomedicine, obviously

It's a column in our drug dataset named crossReferences, here is one of the records:
{DailyMed=[oxybutynin, oxybutynin%20chloride], PubChem=[174006905, 50105262, 90341037], Wikipedia=[Oxybutynin], drugbank=[DB01062], chEBI=[7856]}

Btw thank you so much for your prompt response!

Philip Durbin 🚀 (Feb 10 2025 at 15:47):

Sure. As you know, Croissant is built on top of Schema.org. That means you can use whatever Schema.org fields you like.

Paul TO (Feb 10 2025 at 15:54):

Philip Durbin ☃️ said:

Sure. As you know, Croissant is built on top of Schema.org. That means you can use whatever Schema.org fields you like.

Schema.org also doesn't provide dict or list and we try to avoid extending Schema.org ourselves as we want to follow a standardised specification. Anyway thanks for your help, I will look somewhere else for more information, maybe joining the mailing list :big_smile:

Philip Durbin 🚀 (Feb 10 2025 at 15:55):

What about additionalProperty at https://schema.org/Drug ?
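
One possible reading of the additionalProperty suggestion is to turn each key of the map<string, list<string>> column into a schema.org PropertyValue. The sketch below shows that conversion for a single record; the function name is hypothetical, and whether this fits the dataset's needs (or Croissant's RecordSet typing) is an open question.

```python
def to_additional_property(cross_refs: dict) -> list:
    """Convert a {source: [ids]} map into a list of schema.org PropertyValue objects."""
    return [
        {"@type": "PropertyValue", "name": source, "value": ids}
        for source, ids in cross_refs.items()
    ]
```

This describes one record's values in schema.org terms; it does not by itself declare the column's type at the schema level.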

Philip Durbin 🚀 (Feb 10 2025 at 15:56):

"A property-value pair representing an additional characteristic of the entity, e.g. a product feature or another characteristic for which there is no matching property in schema.org."

Paul TO (Feb 10 2025 at 16:26):

Was having a meeting with my direct supervisor, coincidentally he also made connections with major contributors of Croissant in an AI seminar and we may contribute to the specification soon. I think we will join the meeting.

Philip Durbin 🚀 (Feb 10 2025 at 16:37):

Cool. I dip in and out but @Slava Tykhonov attends more consistently. It's a nice meeting.

Philip Durbin 🚀 (Mar 04 2025 at 17:45):

@Slava Tykhonov @Jan Range did we ever merge the Croissant branch into pyDataverse? Also, can that branch be used to create a Croissant file from a draft dataset? Or does the dataset need to be published? I'm asking because a conference is considering hosting datasets on Harvard Dataverse but they want the dataset to be in draft AND have a Croissant file (which isn't possible in Dataverse itself, since metadata export formats like Croissant are only available AFTER publication).

Jan Range (Mar 04 2025 at 20:27):

It is not merged, but we could do that if it's ready @Slava Tykhonov :smile:

Philip Durbin 🚀 (Mar 05 2025 at 16:33):

@Jan Range I requested a review from you on https://github.com/gdcc/dataverse-recipes/pull/6

(Slava and I are talking on the side about support for drafts, etc.)

Jan Range (Mar 06 2025 at 21:22):

Sorry for the delay, had a presentation today and things ended up last minute :grinning:

Reviewed the PR and looks good! Also tested the requirements thing and opened a PR

https://github.com/gdcc/dataverse-recipes/pull/7

Philip Durbin 🚀 (Mar 06 2025 at 21:30):

Thanks! I left a comment: https://github.com/gdcc/dataverse-recipes/pull/7/files#r1984080290

Philip Durbin 🚀 (Jun 06 2025 at 15:07):

We see "@type": "WebApplication" in the JSON-LD in the <head> of https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker . Regular JSON-LD. It's software.

We see "@type": "sc:Dataset" (and a bunch of Croissant fields) in the <head> of https://huggingface.co/datasets/siacus/flourishing . True Croissant. :croissant: It's a dataset.

(This is all under <script type="application/ld+json">, of course.)
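
Checking which `@type` a page advertises can be done with the standard library alone. The sketch below collects every JSON-LD script block from an HTML string and returns its `@type`; it assumes each block holds a single JSON object, and the class and function names are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def jsonld_types(html: str) -> list:
    """Return the "@type" of each JSON-LD object found in the HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [doc.get("@type") for doc in parser.documents]
```

A page with `"@type": "sc:Dataset"` plus Croissant fields would be treated as Croissant, while `"@type": "WebApplication"` is ordinary JSON-LD.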

I bring this up because @Oliver Bertuch and I were talking about dataset types (#dev > datasetType (software, workflow, etc.) ). When datasetType=software, what do we want in the <head>? Not Croissant, I suppose! We'd follow Hugging Face's lead, I'd think, right @Slava Tykhonov?

Philip Durbin 🚀 (Jun 06 2025 at 15:22):

@Oliver Bertuch also, I'm suggesting they link to the spec from the README in https://github.com/mlcommons/croissant/pull/887

Philip Durbin 🚀 (Jul 30 2025 at 13:50):

As I just mentioned on the mailing list, the "summary statistics (mean, max, min, etc.)" issue I opened a while back at https://github.com/mlcommons/croissant/issues/640 got a comment.

@Slava Tykhonov I know you're interested in this.

Also the DDI folks I can think of: @Amber Leahey @Janet McDougall @Victoria Lubitch @Leo Andreev

Philip Durbin 🚀 (Aug 26 2025 at 19:09):

I was just chatting with @Slava Tykhonov and he made me realize that when you create a preview URL for a dataset, you can use the token to export a draft export like Croissant. Here's an example: https://demo.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi:10.70122/FK2/QH4PDC&version=:draft&key=469367f2-357d-4df6-8f15-1bcb0e9a426b

Philip Durbin 🚀 (Aug 26 2025 at 19:10):

So, while the script I added in https://github.com/gdcc/dataverse-recipes/pull/19 to download a Croissant file using one's own API token is still useful, this is a way to share a link without using your API token. Instead, you're using the token from the preview URL.
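
The export URL pattern above can be assembled programmatically. This is a minimal sketch that only builds the URL and makes no request; the function name is hypothetical, and the preview token should be replaced with the one from your own preview URL.

```python
from typing import Optional
from urllib.parse import urlencode


def croissant_export_url(base_url: str, persistent_id: str,
                         preview_token: Optional[str] = None) -> str:
    """Build a Croissant export URL, for a draft when a preview token is given."""
    params = {"exporter": "croissant", "persistentId": persistent_id}
    if preview_token is not None:
        # Drafts need both the :draft version and the preview URL token.
        params["version"] = ":draft"
        params["key"] = preview_token
    # Keep ":" and "/" unescaped so the result matches the URL form used above.
    return f"{base_url}/api/datasets/export?" + urlencode(params, safe=":/")
```

Without a token the function produces the ordinary export URL, which only works for published datasets.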

Slava Tykhonov (Aug 26 2025 at 19:13):

This should go in the tutorial asap.

Philip Durbin 🚀 (Aug 26 2025 at 19:15):

Yeah. Hmm, maybe I can put it in the README at https://github.com/gdcc/dataverse-recipes/tree/main/python/download_draft_croissant at least. :sweat_smile:

Philip Durbin 🚀 (Dec 03 2025 at 16:28):

I haven't been to a Croissant Task Force meeting in a while but I'm planning on joining (in about half an hour) since they're talking about finalizing the Croissant 1.1 spec. Here's the issue on our side: #12014.

Philip Durbin 🚀 (Dec 10 2025 at 16:47):

Today @Slava Tykhonov is giving a talk at the task force meeting.

Philip Durbin 🚀 (Dec 18 2025 at 16:29):

Slides from the Croissant task force meeting yesterday:

Descriptive statistics and frequencies for ML Commons Croissant - https://docs.google.com/presentation/d/1Y6k9qpukHDnNqlQYC8fvth_LZfV0WT2FBHxyCrDYKFM/edit?usp=sharing

Last updated: Jan 09 2026 at 14:18 UTC