#10341 is the issue we're using to track Croissant support.
There's also good discussion at https://github.com/mlcommons/croissant/issues/530
@Slava Tykhonov @Jan Range and others, I'm getting variable-level metadata from the "datasetFileDetails" JSON. See https://github.com/gdcc/dataverse-exporters/commit/dbc3fa000ebe51ef9e9e4f7ef31d955afc77ed2a
I'm far from finished but I built a jar from https://github.com/gdcc/dataverse-exporters/compare/main...croissant and put it on https://dev3.dataverse.org if anyone wants to play with it:
Screenshot-2024-03-14-at-11.51.42-AM.png
Looks good! How about adding URL to external microservice in payara parameters to produce Croissant etc?
Sure, but I think that's a different topic. I just kicked it off: #dev > exporters as external services
@Slava Tykhonov I have added DataFrame support to EasyDataverse (see example colab and PR). Since EasyDataverse will eventually be merged, it will also be part of pyDataverse.
I was thinking that adding the croissant export to this class, which currently handles tabular data, would make sense. What are your thoughts?
Jan, it looks great! Can you also think about adding methods to get from this class 1) dataframe statistics (mean, median, etc) 2) column names and their types 3) dataframe json export?
This feature should also be able to read file types (spreadsheet, tabular, etc.) and ingest files into a dataframe. Then we can also connect it with Croissant metadata and prepare for the ML use that's coming.
Thanks @Slava Tykhonov! Of course, I am happy to add these to the class as well as an importer. Are the .describe() statistics sufficient?
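For reference, the three things asked for above (summary statistics, column names with their types, and a JSON export) can be sketched without any Dataverse code. This is a stdlib-only illustration on made-up data, not the EasyDataverse implementation; `summarize` and the sample columns are hypothetical, and on a real pandas DataFrame `.describe()` reports similar figures.

```python
import json
import statistics

# Made-up tabular data: column name -> list of values.
columns = {
    "age": [34, 29, 41, 52],
    "city": ["Utrecht", "Boston", "Stuttgart", "Boston"],
}

def summarize(table):
    """Return per-column type and, for numeric columns, basic statistics."""
    summary = {}
    for name, values in table.items():
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        info = {"type": "number" if numeric else "text", "count": len(values)}
        if numeric:
            info.update(
                mean=statistics.mean(values),
                median=statistics.median(values),
                min=min(values),
                max=max(values),
            )
        summary[name] = info
    return summary

report = summarize(columns)
# JSON export of the column names, types, and statistics.
print(json.dumps(report, indent=2))
```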
It occurs to me that perhaps I should extend an invitation to the Croissant meeting on Wednesday to a wider audience. Basically, @Slava Tykhonov and I will be showing the Croissant team what we have so far. I can (privately) send the Google Meet invite or people can sign up for the mailing list: https://github.com/mlcommons/croissant#getting-involved
@Philip Durbin happy to join and listen if feasible by time. At what time is the meeting on Thursday?
Wednesday. "Weekly on Wednesday from 9:05am-10:00am Pacific." -- https://mlcommons.org/working-groups/data/croissant/
My goodness, I am ready for the weekend. Wednesday it is :joy:
9AM works well for me
Now the question is if I should email the Dataverse Google Group. :thinking: How much do I want to embarrass myself? :crazy:
I guess I want to embarrass myself. I just posted this: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/DqEIkiwlAgAJ :grinning:
@Slava Tykhonov I have added the stats and JSON export - https://github.com/gdcc/easyDataverse/pull/14
Following up with the importers on Monday. Off to the switzerland now :flag_switzerland:
The land of the Switzer. Enjoy!
@Jan Range when you're back, I had a question about that PR at #python > tabular data addition
I have lots of questions about Croissant. :grinning: I started a doc: https://docs.google.com/document/d/1C33FAR6s421WV9U50dzlBkVZRTWTlWguc-RoxakOly0/edit?usp=sharing
Ah nice, will add my questions once home :smile:
Cool. Please leave comments for now. And I'll move these last few messages to #dev > Croissant
Screenshot-2024-03-15-at-19.38.57.png
Just in case: I've got variables from DDI in the Croissant export.
recordSet! Looking good!
Extended with all Croissant fields which I can recognise and map: https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/croissant_sample.json
I'll move all semantic mappings to the separate file and will publish Croissant mapper on GitHub.
That looks great @Slava Tykhonov! I noticed that you used the 2.0 version and it doesn't show a validation error like the one @Philip Durbin reported in #609. I was playing a little with your sample, and if I put the @context before the version (your sample has it at the end) the error is displayed:
Version doesn't follow MAJOR.MINOR.PATCH: 2.0.
Are we doing something wrong here by putting the context first, or might this indeed be a bug in the validator? :thinking:
pip3 install --upgrade git+https://github.com/Dans-labs/pyDataverse@croissant#egg=pyDataverse --break-system-packages
from pyDataverse.Croissant import Croissant
import json
host = "https://dataverse.nl"
DOI = "doi:10.34894/KMRAYH"
croissant = Croissant(host, DOI)
c = croissant.get_record()
print(json.dumps(c, indent=4, default=str))
Actually I probably was just looking at the wrong place because I am getting the same validation error :sweat_smile:
Can you make a screenshot, @Juan Pablo Tosca Villanueva ?
- [Metadata(Quality_of_care__UP_TSU)] Version doesn't follow MAJOR.MINOR.PATCH: 2.0. For more information refer to: https://semver.org/spec/v2.0.0.html
Sure
the validate.sh just runs the validator on that sample
I see, thanks! Will fix it here then https://colab.research.google.com/drive/1H-dfY_TBh6eXLkD7tUlqsxUDEsDCQPiD?usp=sharing
mlcroissant validate --jsonld /tmp/croissant1
I0318 23:13:19.350207 139745363906624 validate.py:53] Done.
Oh! Did you fix it by switching to 1.0.0?
Added ".0" in the end of version :)
I think that works to pass the validation, but dataset versions come in X.X format. I think that is why Phil opened that issue, but he should be around tomorrow to enlighten us more about this :rolling_on_the_floor_laughing:
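The ".0" workaround can be written down as a tiny helper. A minimal sketch, assuming dataset versions arrive as dot-separated numbers such as "2.0"; `pad_version` is a hypothetical name and not code from either exporter.

```python
def pad_version(version: str) -> str:
    """Pad a 'MAJOR.MINOR' style version to 'MAJOR.MINOR.PATCH'."""
    parts = version.split(".")
    if not all(p.isdigit() for p in parts) or not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected version string: {version!r}")
    # Append zeros until there are exactly three numeric components.
    parts += ["0"] * (3 - len(parts))
    return ".".join(parts)

print(pad_version("2.0"))    # 2.0.0
print(pad_version("1.0.0"))  # 1.0.0
```

Padding gets past the validator, but it also means the exported version no longer matches the version string shown in Dataverse, which is the concern raised in issue #609.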
Ha. Thanks for the :thumbs_up: on https://github.com/mlcommons/croissant/issues/609 :heart:
I just pushed a commit to export variable-level metadata: https://github.com/gdcc/dataverse-exporters/pull/4/commits/4f4361260d294280614e1112b291d632982a9dbd
Amazing!
Phil, a slightly naive question: how will it be maintained when Croissant gets updates?
I've put in slides those bullet points currently missing in Croissant:
Sensitive vs Restricted files
Embargo
Provenance, data ownership transfer
Primary and secondary (derivative) datasets
Do you see more?
Well, on a related note, how are you handling original vs. archival versions of files? foo.dta (Stata) vs foo.tab (archival). For now I'm only presenting the original.
I'm reading original and comparing with .tab versions, and linking them in the graph.
Cool. Can you show an example?
I just added "creator": https://github.com/gdcc/dataverse-exporters/pull/4/commits/7a0d8183cc7d4ef3f1864af56374fb08726732af
@Slava Tykhonov you seem to be doing just key/value here.
In the first version yes, but I extended it yesterday with affiliation and person name: https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/odissei-croissant.json
ah, great!
Screenshot-2024-03-19-at-22.07.01.png
I moved all semantic transformations outside to have FAIR semantic mappings in the separate file(s)
Nice. Should be a fun call tomorrow. :grinning:
The idea is to load any "custom" mappings from GitHub and let the community maintain them without touching source code.
I'm not sure if someone is actually doing that to get mappings in the knowledge graph. :)
For the @id of a file you're using the database id:
"distribution": [
{
"@type": "cr:FileObject",
"@id": "f3056770",
"name": "DoD_R1.DTA",
Right now I'm showing the filename, like the spec shows. It's more readable. But your way is more precise.
It's clearer when you have DDI variables spread across a few files.
Have you experimented with path/to/data.dta? Having a file hierarchy?
No, can you give me an example with a DOI?
Sure for my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP for example, I have a tabular file in a directory called "data".
Ok, I'll take a look. Updated version of slides for tomorrow meeting. https://docs.google.com/presentation/d/1hEqIFE9yS3aePLhRDgnuw9PdL-Az7ORj6utTAenNtbw/edit?usp=sharing
Cool. No rush.
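To make the two @id styles discussed above concrete, here is a minimal sketch. `file_object` is a hypothetical helper, a real FileObject entry carries more properties than these three, and whether a path such as path/to/data.dta is a good @id is exactly the open question, so treat this as an illustration only.

```python
def file_object(filename, directory=None, database_id=None):
    """Build a minimal cr:FileObject entry for the "distribution" array."""
    # Readable variant: use the path within the dataset as the @id.
    path = f"{directory}/{filename}" if directory else filename
    # Precise variant: use a database id such as "f3056770" when one is given.
    identifier = database_id if database_id else path
    return {"@type": "cr:FileObject", "@id": identifier, "name": filename}

by_database_id = file_object("DoD_R1.DTA", database_id="f3056770")
by_path = file_object("data.dta", directory="path/to")
print(by_database_id["@id"])
print(by_path["@id"])
```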
I'm also wondering about citeAs.
It seems like it's ideally for a paper about a dataset:
"citeAs": "@Article{asano21pass, author = \"Yuki M. Asano and Christian Rupprecht and Andrew Zisserman and Andrea Vedaldi\", title = \"PASS: An ImageNet replacement for self-supervised pretraining without humans\", journal = \"NeurIPS Track on Datasets and Benchmarks\", year = \"2021\" }",
citeAs has to be built from author names and date. I'm not sure if they're doing it right, to be honest.
But where do I put the DOI of the dataset itself? Kaggle is putting it in "identifier" (which isn't in the spec) and "citeAs" (which I'm not sure is right).
Ok, well citeAs is on my list to ask about tomorrow: https://docs.google.com/document/d/1C33FAR6s421WV9U50dzlBkVZRTWTlWguc-RoxakOly0/edit?usp=sharing :grinning:
I think we need to use citeAs as it's implemented in Dataverse right now
Can you make a screenshot of Dataverse with the Croissant button and some example in JSON-LD? For slides? (coming in production in the next version)
Huh, citeAs is in our Signposting:
$ ack citeAs
src/main/java/edu/harvard/iq/dataverse/util/SignpostingResources.java
64: String citeAs = "<" + ds.getPersistentURL() + ">;rel=\"cite-as\"";
Yes, sure.
Cool! So I'll tell my part of the story: Phil is building the "production ready" Croissant export and I'm moving all crosswalks outside of the implementation to invite the community to maintain them.
I have a slide on Signposting, btw :)
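For anyone following along in Python rather than Java, the Signposting line found by `ack` above amounts to the following. This is a sketch of the string format only; `cite_as_link` is a hypothetical name, and the https://doi.org/ form of the persistent URL is an assumption.

```python
def cite_as_link(persistent_url: str) -> str:
    """Format one Signposting link value with the cite-as relation."""
    return f'<{persistent_url}>;rel="cite-as"'

link = cite_as_link("https://doi.org/10.5072/FK2/DZRHUP")
print(link)
```

Note that this cite-as value is a URL pointing at the dataset, whereas the Croissant "citeAs" example quoted earlier is a BibTeX string, so the two share a name but not a format.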
Here's an example (sort of a work in progress, honestly): https://dev3.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5072/FK2/DZRHUP
And you are welcome to grab a screenshot of the Croissant button from https://dev3.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/DZRHUP
Due to a shorthand collision, I won't be able to participate in the Croissant meeting tonight :anguish: @Slava Tykhonov Would you like to join the next PyDataverse meeting and talk about the Croissant extension? It is happening next Wednesday at 4 PM CET
Hi Jan, I'll join pyDataverse next week then. Croissant extension is kind of working :) https://colab.research.google.com/drive/1H-dfY_TBh6eXLkD7tUlqsxUDEsDCQPiD?usp=sharing#scrollTo=WDDs2hdcJnED
@Jan Range no worries
@Slava Tykhonov looks great! Thanks for sharing :tada:
@Juan Pablo Tosca Villanueva I added more error checking to the Croissant exporter. You should be able to upload any file now: https://github.com/gdcc/dataverse-exporters/pull/4/commits/9316860b1869108d5cf64499b04252bde86575f4
Phil, you should mention this during the meeting today. It looks like Editor isn't very stable, should be improved.
There wasn't enough time!
@Slava Tykhonov thanks for posting https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/FuajmDKEAQAJ ... an update with links to slides, notes, etc.
@Slava Tykhonov following up on the Croissant call this week... I see what you mean about the lack of backward compatibility within 1.0. I just upgraded from mlcroissant 1.0.3 to 1.0.5 and now I see this new error:
WARNING: The JSON-LD @context is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'examples', 'isLiveDataset', 'rai'}
@Slava Tykhonov so is that what you do to get the latest, correct @context? Go to https://github.com/mlcommons/croissant/tree/main/datasets/1.0 and pick one of the examples (I picked "titanic") and copy it from there?
Or do you go to https://mlcommons.github.io/croissant/docs/croissant-spec.html which has a different @context?
Help! :sweat_smile:
I'm going with titanic and explained to look out for breaking changes in the README: https://github.com/gdcc/dataverse-exporters/pull/4/commits/03dfeddbd8b136aeee9c2642d8bc1852e73b948b
Yes, this is exactly why I was working on semantic mappings.
In other Croissant news, I'm appending ".0" to dataset versions but as I say here, I'm pretty grumpy about it: https://github.com/mlcommons/croissant/issues/609#issuecomment-2052403311
I switched citeAs to bibtex format: https://github.com/gdcc/dataverse-exporters/pull/4/commits/151efb7898164d2f8290f31392d70fb28bfec299
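A rough idea of what a BibTeX-style "citeAs" value can look like when built from dataset metadata. This is a sketch under assumptions, not the exporter's code: the `@data` entry type, the choice of fields, and all values are made up for illustration, and the helper does not escape quotes inside values.

```python
import json

def bibtex_cite_as(key, authors, title, year, publisher, doi):
    """Render a BibTeX entry as a single string suitable for "citeAs"."""
    fields = {
        "author": " and ".join(authors),
        "title": title,
        "year": str(year),
        "publisher": publisher,
        "doi": doi,
    }
    body = ", ".join(f'{name} = "{value}"' for name, value in fields.items())
    return f"@data{{{key}, {body} }}"

entry = bibtex_cite_as(
    "example2024",
    ["A. Author", "B. Author"],
    "Example Dataset",
    2024,
    "Example Dataverse",
    "10.5072/FK2/EXAMPLE",
)
# json.dumps escapes the inner quotes, as in the PASS example quoted earlier.
print(json.dumps({"citeAs": entry}))
```

Putting the DOI inside the BibTeX entry is one way to answer "where do I put the DOI of the dataset itself?", though it is not confirmed here that the Croissant team recommends it.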
I'd love some feedback on this new pull request:
add docs for Croissant, tweak exporter docs #10533
@Slava Tykhonov do you think I should open an issue at https://github.com/mlcommons/croissant/issues about where to put summary statistics? I see that you and Rajat volunteered to think about this and I don't want to step on your toes!
Just open
will do!
ok, done: https://github.com/mlcommons/croissant/issues/640
These are the little notes I leave to myself while working on the Croissant exporter:
ls -1 ../max | grep -v croissant | while read i; do FILE=$i; FMT=`echo $FILE | cut -d . -f1`; echo $FMT; cat 27626.debug | jq ".$FMT" -r > $FILE; done
:croissant:
I'm comparing my output with @Slava Tykhonov's and realizing I forgot "description"!
I just wrote a long passage about "version" ("1.0.0" vs "1.0" vs 1.0, etc.): https://github.com/mlcommons/croissant/issues/609#issuecomment-2117798279
What do you think? Am I making sense?
First Croissant jar is up on Maven Central: https://repo1.maven.org/maven2/io/gdcc/export/croissant/0.1.1/
Here's the fancy landing page: https://central.sonatype.com/artifact/io.gdcc.export/croissant
I'm talking to Kaggle and comparing https://www.kaggle.com/datasets/yasserh/wine-quality-dataset to https://beta.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/YKJQY8
Croissant for both:
I also invited the Dataverse community to play around with the Croissant exporter: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/ARnYS5kpCgAJ
I just added some new feedback from Geoff at Kaggle: https://github.com/gdcc/exporter-croissant#differences-from-kaggle
I created a pull request for the point about "field" being repeated over and over: https://github.com/gdcc/exporter-croissant/pull/2
merged
Another issue: https://github.com/gdcc/exporter-croissant/issues/3 - we are using sc:Integer for all numeric types. I'd like to use sc:Number instead but I get this error from the validator:
- [Metadata(Cars) > RecordSet() > Field(weight)] The field does not specify a valid http://mlcommons.org/croissant/dataType, neither does any of its predecessor. Got: [rdflib.term.URIRef('https://schema.org/Number')]
I just merged a fix: https://github.com/gdcc/exporter-croissant/pull/4
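One way to avoid both "sc:Integer for everything" and the rejected sc:Number is to map each variable to a more specific type. A minimal sketch, not necessarily what the merged fix does: the "integer" / "float" / "text" kinds are hypothetical inputs, and sc:Float is assumed to be accepted by the validator (it is listed among the Croissant data types, but check against your mlcroissant version).

```python
# Hypothetical mapping from a variable's kind to a Croissant dataType.
DATA_TYPES = {
    "integer": "sc:Integer",
    "float": "sc:Float",
    "text": "sc:Text",
}

def field_for(name, kind):
    """Build a minimal cr:Field; unknown kinds fall back to sc:Text."""
    return {
        "@type": "cr:Field",
        "name": name,
        "dataType": DATA_TYPES.get(kind, "sc:Text"),
    }

weight = field_for("weight", "float")
print(weight)
```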
I put out a new release: https://repo1.maven.org/maven2/io/gdcc/export/croissant/0.1.2/
There's a decent chance I'll be presenting at a Croissant Task Force meeting, even as soon as Wednesday. I'll keep you posted.
Yep, I'll be presenting in about 5 hours - 12:05 pm Boston time: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/ONqgdyKJAAAJ
That went pretty well, I think.
I think it was great!
I am still concerned about 1) generating this Croissant / JSON-LD on each request and 2) including it on each page
I wonder if there could be an ideal case where we could cache the croissant/JSON-LD and also add a parameter to include it or not on the request and include that URL with the param on the robots.txt
but at least for normal browsing of the application it could be ignored (no croissant - JSON-LD) in the headers
All exports are cached (written to disk).
I see... logger.fine("Returning cached schema.org JSON-LD."); :whoops:
As a wise man said once "now we know" :laughing:
@Philip Durbin do you want / need help with making use of the new exporter Parent POM for the Croissant exporter?
Not sure if you have seen what I did with https://github.com/gdcc/dataverse-exporters
Was it within the last two weeks? I was out.
Yes it was :smile_cat:
Ok, I'm still catching up hundreds of GitHub emails. I'll see it eventually. :crazy:
During today's Croissant Task Force meeting at noon-ish Boston time (12:05) they will be discussing future plans for Croissant.
Please feel free to DM me for the links to the meeting (on Google Meet) or the doc they will be discussing.
We're also testing Croissant implementation(s) in our multimodal repository (video, audio, text, haptics) https://database.sharemusic.se/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5072/FK2/T55YDC
Oh! Does that mean you installed the Croissant jar I made, @Slava Tykhonov :grinning:
We installed Croissant on demo and Harvard Dataverse. Please see https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/dLmV7HTcAgAJ
I also posted an update here: https://github.com/mlcommons/croissant/issues/530#issuecomment-2305479611
@Julian Gautier I got that same Invalid object type for field "distribution" email and just opened an issue about it: https://github.com/mlcommons/croissant/issues/725
Ah thanks. Was https://validator.schema.org not returning this error earlier? It is now, too, but I checked only today
Come to think of it, I think it's been returning that error all along.
I suspect the Search Console folks are not talking to the Croissant folks. Not sure.
"Indeed the Search console doesn't know about Croissant yet. It only validates mark-up based on the schema.org vocabulary, which expects distribution to be of type sc:DataDownload. I will get in touch with them to figure out how to best address this issue." --Omar
@Leo Andreev I believe you get these emails too. Please see the GitHub issue above for more info.
A little bird told me that @Slava Tykhonov is giving a talk at the Croissant Working Group meeting tomorrow:
"Slava Tykhonov (DANS-KNAW) will talk about supporting external controlled vocabularies in Dataverse, and we can brainstorm on how we would like to support them in the next version of Croissant."
To join the call see How to Join and Access Croissant Working Group Resources at https://mlcommons.org/working-groups/data/croissant/
Slava's slides: https://docs.google.com/presentation/d/1PepV5qOITW2heil_iDts6xoB9CHM7Pjj/edit?usp=sharing&ouid=117275479921759507378&rtpof=true&sd=true
Dataverse has been added to the main Croissant image :tada:
Stefano used the updated image in his tweet today: https://x.com/iacus/status/1852061999854948814
That's fantastic!
In https://github.com/gdcc/exporter-croissant/pull/7 I'm proposing we update the media type from application/json to application/ld+json; profile="http://mlcommons.org/croissant/1.0" to be more specific.
@Slava Tykhonov and others, does this change make sense to you? Please see also the issue the PR closes: https://github.com/gdcc/exporter-croissant/issues/6
Makes sense as discussed last week on Croissant call with Signposting.
Great, thanks. I also created this issue upstream to add that media type to the Croissant spec: https://github.com/mlcommons/croissant/issues/792
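A client that receives the proposed media type can separate the base type from the profile with the standard library. A small sketch; `split_media_type` is a hypothetical helper, and only the header value itself comes from the pull request discussed above.

```python
from email.message import Message

CROISSANT_MEDIA_TYPE = (
    'application/ld+json; profile="http://mlcommons.org/croissant/1.0"'
)

def split_media_type(header_value):
    """Return (media type, profile) from a Content-Type style header value."""
    message = Message()
    message["Content-Type"] = header_value
    # get_param returns None when the header has no profile parameter.
    return message.get_content_type(), message.get_param("profile")

media_type, profile = split_media_type(CROISSANT_MEDIA_TYPE)
print(media_type)
print(profile)
```

A consumer that only understands plain JSON-LD can still match on application/ld+json, while a Croissant-aware one can check the profile as well.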
Hi all, sorry for my question out of the blue, I hope this is the right place to ask. I work for Open Targets, an organisation that provides open-access biomedical data. Our data is in Parquet format and we are looking for a standardised way to handle dataset discovery and schema description, and Croissant seems to be a perfect match, even though our use is not directly for ML. The problem is that we are not quite sure how to express some data structures or data types. For example, how do we express key-value pairs or a list within a list? The data types defined by Schema.org seem minimal; will Croissant expand on this? Thanks in advance.
@Paul TO hi! It sounds like you should join the Croissant community! https://github.com/mlcommons/croissant#getting-involved
There's a meeting every Wednesday and a mailing list where you can ask questions like this.
@Slava Tykhonov might have some ideas for you. He's on some of the papers.
@Paul TO key value pairs of what?
Philip Durbin said:
Paul TO key value pairs of what?
We have a column of map<string, list<string>>.
Sure, but I'm curious what kind of data.
biomedicine, obviously
broadly
Philip Durbin said:
biomedicine, obviously
It's a column in our drug dataset named crossReferences, here is one of the records:
{DailyMed=[oxybutynin, oxybutynin%20chloride], PubChem=[174006905, 50105262, 90341037], Wikipedia=[Oxybutynin], drugbank=[DB01062], chEBI=[7856]}
Btw thank you so much for your prompt response!
Sure. As you know, Croissant is built on top of Schema.org. That means you can use whatever Schema.org fields you like.
Philip Durbin said:
Sure. As you know, Croissant is built on top of Schema.org. That means you can use whatever Schema.org fields you like.
Schema.org also doesn't provide dict or list and we try to avoid extending Schema.org ourselves as we want to follow a standardised specification. Anyway thanks for your help, I will look somewhere else for more information, maybe joining the mailing list :big_smile:
What about additionalProperty at https://schema.org/Drug ?
"A property-value pair representing an additional characteristic of the entity, e.g. a product feature or another characteristic for which there is no matching property in schema.org."
Was having a meeting with my direct supervisor, coincidentally he also made connections with major contributors of Croissant in an AI seminar and we may contribute to the specification soon. I think we will join the meeting.
Cool. I dip in and out but @Slava Tykhonov attends more consistently. It's a nice meeting.
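For the record, the additionalProperty suggestion earlier in this thread could look roughly like this for the crossReferences record. It is only one reading of that suggestion: it describes a single Drug entity in plain Schema.org and does not answer how to type a map<string, list<string>> column in a Croissant RecordSet. `to_additional_property` is a hypothetical helper.

```python
import json

# The crossReferences record quoted above, written as a Python dict.
cross_references = {
    "DailyMed": ["oxybutynin", "oxybutynin%20chloride"],
    "PubChem": ["174006905", "50105262", "90341037"],
    "Wikipedia": ["Oxybutynin"],
    "drugbank": ["DB01062"],
    "chEBI": ["7856"],
}

def to_additional_property(mapping):
    """Turn a map of string -> list of strings into PropertyValue entries."""
    return [
        {"@type": "PropertyValue", "name": key, "value": values}
        for key, values in mapping.items()
    ]

drug = {
    "@context": "https://schema.org",
    "@type": "Drug",
    "name": "Oxybutynin",
    "additionalProperty": to_additional_property(cross_references),
}
print(json.dumps(drug, indent=2))
```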
@Slava Tykhonov @Jan Range did we ever merge the Croissant branch into pyDataverse? Also, can that branch be used to create a Croissant file from a draft dataset? Or does the dataset need to be published? I'm asking because a conference is considering hosting datasets on Harvard Dataverse but they want the dataset to be in draft AND have a Croissant file (which isn't possible in Dataverse itself, since metadata export formats like Croissant are only available AFTER publication).
It is not merged, but we could do that if it's ready @Slava Tykhonov :smile:
@Jan Range I requested a review from you on https://github.com/gdcc/dataverse-recipes/pull/6
(Slava and I are talking on the side about support for drafts, etc.)
Sorry for the delay, had a presentation today and things ended up last minute :grinning:
Reviewed the PR and looks good! Also tested the requirements thing and opened a PR
https://github.com/gdcc/dataverse-recipes/pull/7
Thanks! I left a comment: https://github.com/gdcc/dataverse-recipes/pull/7/files#r1984080290
We see "@type": "WebApplication" in the JSON-LD in the <head> of https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker . Regular JSON-LD. It's software.
We see "@type": "sc:Dataset" (and a bunch of Croissant fields) in the <head> of https://huggingface.co/datasets/siacus/flourishing . True Croissant. :croissant: It's a dataset.
(This is all under <script type="application/ld+json">, of course.)
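Checking what a page advertises in its <head> can be done with the standard library. A minimal sketch on a made-up page (no network access); `JsonLdExtractor` is a hypothetical helper, and real pages may contain several JSON-LD blocks or an "@type" that is a list.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False

# Made-up page standing in for a dataset landing page.
page = """<html><head>
<script type="application/ld+json">{"@type": "sc:Dataset", "name": "example"}</script>
</head><body></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(page)
types = [doc.get("@type") for doc in extractor.documents]
print(types)
```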
I bring this up because @Oliver Bertuch and I were talking about dataset types (#dev > datasetType (software, workflow, etc.) ). When datasetType=software, what do we want in the <head>? Not Croissant, I suppose! We'd follow Hugging Face's lead, I'd think, right @Slava Tykhonov?
@Oliver Bertuch also, I'm suggesting they link to the spec from the README in https://github.com/mlcommons/croissant/pull/887
As I just mentioned on the mailing list, the "summary statistics (mean, max, min, etc.)" issue I opened a while back at https://github.com/mlcommons/croissant/issues/640 got a comment.
@Slava Tykhonov I know you're interested in this.
Also the DDI folks I can think of: @Amber Leahey @Janet McDougall @Victoria Lubitch @Leo Andreev
I was just chatting with @Slava Tykhonov and he made me realize that when you create a preview URL for a dataset, you can use the token to export a draft export like Croissant. Here's an example: https://demo.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi:10.70122/FK2/QH4PDC&version=:draft&key=469367f2-357d-4df6-8f15-1bcb0e9a426b
So, while the script I added in https://github.com/gdcc/dataverse-recipes/pull/19 to download a Croissant file using one's own API token is still useful, this is a way to share a link without using your API token. Instead, you're using the token from the preview URL.
This should go in the tutorial asap.
Yeah. Hmm, maybe I can put it in the README at https://github.com/gdcc/dataverse-recipes/tree/main/python/download_draft_croissant at least. :sweat_smile:
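Until it lands in the README, here is the shape of the draft export link as a small helper. A sketch only: the parameter names are copied from the example URL above, `croissant_export_url` is a hypothetical name, and the token shown is a placeholder rather than a real preview token.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def croissant_export_url(base_url, persistent_id, preview_token=None):
    """Build a Croissant export URL; add the draft parameters when a token is given."""
    params = {"exporter": "croissant", "persistentId": persistent_id}
    if preview_token:
        # Parameter names copied from the example link above.
        params["version"] = ":draft"
        params["key"] = preview_token
    return f"{base_url}/api/datasets/export?{urlencode(params)}"

url = croissant_export_url(
    "https://demo.dataverse.org",
    "doi:10.70122/FK2/QH4PDC",
    preview_token="PREVIEW-TOKEN-PLACEHOLDER",
)
print(url)

# Round-trip check: the query string decodes back to the intended parameters.
query = parse_qs(urlsplit(url).query)
print(query)
```

Anyone holding the link can read the draft export, so it should be shared as carefully as the preview URL itself.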
Last updated: Nov 01 2025 at 14:11 UTC