The Croissant Editor seems pretty cool. You can play around with it at https://huggingface.co/spaces/MLCommons/croissant-editor as explained at https://mlcommons.org/working-groups/data/croissant/
It reminds me a bit of the Data Curation Tool by @Victoria Lubitch
I had a little trouble running it from Docker and opened https://github.com/mlcommons/croissant/issues/607 (maybe it's just me? :shrug: )
It might be a nice external tool to integrate with Dataverse somehow, some day.
It worked out of the box for me :sweat_smile:
Which dataset did you try opening?
@Slava Tykhonov 's sample from https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/croissant_sample.json
It said it was invalid but it opened it lol
This is what I get with the one generated with the exporter
Yeah, I think that's the error I got.
I'm not sure that Croissant Editor is working well. There are error messages on Kaggle and HuggingFace croissants.
The version hosted at https://huggingface.co/spaces/MLCommons/croissant-editor was working fine for me.
Did you try the examples from their GitHub? https://github.com/mlcommons/croissant/tree/main/datasets/1.0
Or real croissants from Kaggle etc.?
When I opened https://github.com/mlcommons/croissant/issues/607 I was using local examples like Titanic.
try this from HF
https://datasets-server.huggingface.co/croissant?dataset=mnist
I spun up the Docker image with this:
docker run -p 8501:8501 -v ~/.cache/croissant:/root/.cache/croissant -it mlcommons/croissant-editor
Then I loaded that mnist file but it says invalid:
Screenshot-2024-03-19-at-9.29.04-AM.png
(in the bottom right)
No one else sees this? Just me? :sweat_smile:
Try Kaggle export, it's the same :)
Same how?
It works for you? You're using Docker?
No, it doesn't work for real datasets.
Ok, so the issue I opened is valid. Good! Thanks! :grinning:
Philip Durbin said:
No one else sees this? Just me? :sweat_smile:
I see the same
Good, so the issue is valid.
So probably this is the same for everyone, but the local validator was failing for me because I didn't have cypress and libmagic installed. Once I installed those, it worked for me on Docker.
Probably these should be prerequisites
brew install nvm
nvm use default # We recommend managing NPM using NVM
For NVM I was getting "nvm N/A: version "default" is not yet installed." which was fixed for me with $ nvm install 'lts/*'
npm install
npm run cypress:open # Opens the Cypress application
npm run cypress:run # Runs e2e tests in background
brew install libmagic
If anyone else needs help to get it working let me know :smile:
Slava Tykhonov said:
try this from HF
https://datasets-server.huggingface.co/croissant?dataset=mnist
E0319 13:31:32.897765 7956631552 validate.py:55] Found the following 5 error(s) during the validation:
- "parquet-files-for-config-mnist" should have an attribute "@type": "http://mlcommons.org/croissant/FileObject" or "@type": "http://mlcommons.org/croissant/FileSet". Got https://schema.org/FileSet instead.
- "repo" should have an attribute "@type": "http://mlcommons.org/croissant/FileObject" or "@type": "http://mlcommons.org/croissant/FileSet". Got https://schema.org/FileObject instead.
- [Metadata(mnist) > RecordSet(record_set_mnist) > Field(image)] Malformed source data: parquet-files-for-config-mnist. It does not refer to any existing node. Have you used http://mlcommons.org/croissant/field or https://schema.org/distribution to indicate the source field or the source distribution? If you specified a field, it should contain all the names from the RecordSet separated by `/`, e.g.: "record_set_name/field_name"
- [Metadata(mnist) > RecordSet(record_set_mnist) > Field(label)] Malformed source data: parquet-files-for-config-mnist. It does not refer to any existing node. Have you used http://mlcommons.org/croissant/field or https://schema.org/distribution to indicate the source field or the source distribution? If you specified a field, it should contain all the names from the RecordSet separated by `/`, e.g.: "record_set_name/field_name"
- [Metadata(mnist) > RecordSet(record_set_mnist)] There is a reference to node with UUID "parquet-files-for-config-mnist" in node "record_set_mnist", but this node doesn't exist.
Found the following 3 warning(s) during the validation:
- [Metadata(mnist)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
- [Metadata(mnist)] Property "https://schema.org/datePublished" is recommended, but does not exist.
- [Metadata(mnist)] Property "https://schema.org/version" is recommended, but does not exist.
So I went through that log and the sample from HuggingFace. In the distribution they have
"@type": "sc:FileObject" vs "@type": "cr:FileObject" and "@type": "sc:FileSet" vs "@type": "cr:FileSet". Once these two are updated there are no more errors, just the warnings, which go away when these are added:
"citeAs": "TEST",
"version": "1.0.0",
"datePublished": "2024-03-19",
The file passed validations:
(venv) jptosca@HMDC-JPs-MacBook-Pro croissant % ./validate.sh
I0319 13:40:04.702379 7956631552 validate.py:53] Done.
So probably the implementation from HuggingFace is not fully complete
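In case it helps anyone repeat this, here is a minimal sketch of the manual edits described above, done with plain Python on the JSON. It only assumes what the thread shows: the "sc:" and "cr:" prefixes on the distribution entries and the three placeholder values. The function name `patch_croissant` and the trimmed sample document are made up for illustration, and the prefixes depend on the file's @context, so treat this as a workaround sketch rather than the proper fix.

```python
import copy
import json

# Type rewrites taken from this thread: the validator wanted the Croissant
# ("cr:") types where the Hugging Face export used the schema.org ("sc:") ones.
TYPE_FIXES = {"sc:FileObject": "cr:FileObject", "sc:FileSet": "cr:FileSet"}

# Placeholder values, the same ones used in the thread to clear the warnings.
RECOMMENDED_DEFAULTS = {
    "citeAs": "TEST",
    "version": "1.0.0",
    "datePublished": "2024-03-19",
}


def patch_croissant(metadata: dict) -> dict:
    """Return a patched copy of a Croissant JSON-LD document (hypothetical helper)."""
    patched = copy.deepcopy(metadata)
    for resource in patched.get("distribution", []):
        old_type = resource.get("@type")
        # Only rewrite plain string types that we know about.
        if isinstance(old_type, str) and old_type in TYPE_FIXES:
            resource["@type"] = TYPE_FIXES[old_type]
    # Fill in recommended properties only when they are missing.
    for key, value in RECOMMENDED_DEFAULTS.items():
        patched.setdefault(key, value)
    return patched


if __name__ == "__main__":
    # Hypothetical, heavily trimmed stand-in for the Hugging Face mnist export.
    sample = {
        "name": "mnist",
        "distribution": [
            {"@type": "sc:FileObject", "name": "repo"},
            {"@type": "sc:FileSet", "name": "parquet-files-for-config-mnist"},
        ],
    }
    print(json.dumps(patch_croissant(sample), indent=2))
```

The patched file would still need to go through the validator again; this only automates the hand edits.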
I am also testing with this from Kaggle, https://www.kaggle.com/datasets/bhavikjikadara/brand-laptops-dataset which initially shows 1 error and 15 warnings:
E0319 14:13:07.987779 7956631552 validate.py:55] Found the following 1 error(s) during the validation:
- [Metadata(Brand-Laptops-Dataset) > FileObject(archive.zip)] At least one of these properties should be defined: ['md5', 'sha256'].
Found the following 15 warning(s) during the validation:
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(OS)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(display_size)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(gpu_brand)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(gpu_type)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(index)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(is_touch_screen)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(primary_storage_capacity)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(primary_storage_type)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(resolution_height)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(resolution_width)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(secondary_storage_capacity)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(secondary_storage_type)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(year_of_warranty)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset)] Property "https://schema.org/datePublished" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset)] Property "https://schema.org/version" is recommended, but does not exist.
(venv) jptosca@HMDC-JPs-MacBook-Pro croissant %
Most of the warnings are caused by missing field descriptions, and the last two are caused by the missing version and datePublished. The only interesting thing I noticed in this one is the error that says At least one of these properties should be defined: ['md5', 'sha256'], because this JSON does declare an MD5. If I change this to sha256 the validation is successful :thinking:
It looks like in general the validator is working fine, but the implementations from both sites are probably not 100% complete yet
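For anyone who wants to check which properties the warnings are about without reading the whole log, here is a small sketch that tallies them. It assumes the warning lines look exactly like the ones pasted above; the helper name `count_missing_properties` is made up, and the sample input is three lines copied from that log.

```python
import re
from collections import Counter

# Matches warning lines of the shape seen in the logs above:
#   - [<context>] Property "<url>" is recommended, but does not exist.
WARNING_RE = re.compile(r'^- \[.+\] Property "(?P<prop>[^"]+)" is recommended')


def count_missing_properties(log_text: str) -> Counter:
    """Count how often each recommended property is reported as missing."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            counts[match.group("prop")] += 1
    return counts


if __name__ == "__main__":
    log = """\
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(OS)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(index)] Property "https://schema.org/description" is recommended, but does not exist.
- [Metadata(Brand-Laptops-Dataset)] Property "https://schema.org/version" is recommended, but does not exist.
"""
    for prop, n in count_missing_properties(log).most_common():
        print(n, prop)
```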
This is probably just an error in the message, no?
It seems to me that the validator only takes sha256
Also here I could only find sha256
https://schema.org/DataDownload
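If the validator really only accepts sha256, one way around it is to compute a real SHA-256 for the file and add it next to the md5. Below is a rough sketch of that, assuming you have a local copy of the file and that distribution entries are matched by their "name". The helper names `file_sha256` and `add_sha256` are hypothetical. Note that if only the key is renamed from md5 to sha256, the value would still be an MD5 digest, so the check might pass without the checksum being meaningful.

```python
import copy
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_sha256(metadata: dict, local_files: dict) -> dict:
    """Return a copy where entries lacking sha256 get one computed locally.

    local_files maps a distribution entry's "name" to a Path of a downloaded
    copy of that file (an assumption of this sketch).
    """
    patched = copy.deepcopy(metadata)
    for resource in patched.get("distribution", []):
        path = local_files.get(resource.get("name"))
        if path is not None and "sha256" not in resource:
            resource["sha256"] = file_sha256(path)
    return patched
```

For the Kaggle example above, the call might look like `add_sha256(metadata, {"archive.zip": Path("archive.zip")})`, with the existing md5 left in place.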
I didn't manage to find any dataset from external repositories that validates in Croissant Editor :)
This was one of them after the changes. I have found cases where the CLI validation passes but it fails in the web client
Can you share this validated dataset?
Screenshot-2024-03-19-at-21.21.16.png
:eyes: Did you already have cypress and libmagic set up and running?
it's here https://huggingface.co/spaces/MLCommons/croissant-editor?project=20240319212107872516
Interesting...
On the local version I am running, the same file gets validated and I can browse between the tabs, so I am not sure whether there is a difference in the version of the editor
:thinking:
Last updated: Nov 01 2025 at 14:11 UTC