Stream: dev

Topic: Croissant Editor


view this post on Zulip Philip Durbin 🚀 (Mar 15 2024 at 14:30):

The Croissant Editor seems pretty cool. You can play around with it at https://huggingface.co/spaces/MLCommons/croissant-editor as explained at https://mlcommons.org/working-groups/data/croissant/

view this post on Zulip Philip Durbin 🚀 (Mar 15 2024 at 14:31):

It reminds me a bit of the Data Curation Tool by @Victoria Lubitch

view this post on Zulip Philip Durbin 🚀 (Mar 15 2024 at 14:32):

I had a little trouble running it from Docker and opened https://github.com/mlcommons/croissant/issues/607 (maybe it's just me? :shrug: )

view this post on Zulip Philip Durbin 🚀 (Mar 15 2024 at 14:32):

It might be a nice external tool to integrate with Dataverse somehow, some day.

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 18 2024 at 14:16):

It worked out of the box for me :sweat_smile:

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 01:46):

Which dataset did you try opening?

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 01:49):

@Slava Tykhonov 's sample from https://github.com/Dans-labs/pyDataverse/blob/croissant/samples/croissant_sample.json

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 01:56):

It said it was invalid but it opened it lol

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 02:00):

image.png

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 02:01):

This is what I get with the one generated with the exporter

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 02:01):

image.png

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 11:23):

Yeah, I think that's the error I got.

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 12:17):

I'm not sure that the Croissant Editor is working well. There are error messages on Kaggle and HuggingFace croissants.

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 12:19):

The version hosted at https://huggingface.co/spaces/MLCommons/croissant-editor was working fine for me.

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 12:30):

Did you try the examples from their GitHub? https://github.com/mlcommons/croissant/tree/main/datasets/1.0
Or real croissants from Kaggle etc.?

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 12:35):

When I opened https://github.com/mlcommons/croissant/issues/607 I was using local examples like Titanic.

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 13:13):

try this from HF
https://datasets-server.huggingface.co/croissant?dataset=mnist

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 13:29):

I spun up the Docker image with this:

docker run -p 8501:8501 -v ~/.cache/croissant:/root/.cache/croissant -it mlcommons/croissant-editor

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 13:29):

Then I loaded that mnist file but it says invalid:

Screenshot-2024-03-19-at-9.29.04-AM.png

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 13:30):

(in the bottom right)

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 13:30):

No one else sees this? Just me? :sweat_smile:

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 13:44):

Try Kaggle export, it's the same :)

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 13:44):

Same how?

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 13:44):

It works for you? You're using Docker?

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 13:54):

No, it doesn't work for real datasets.

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 13:56):

Ok, so the issue I opened is valid. Good! Thanks! :grinning:

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 15:58):

Philip Durbin said:

No one else sees this? Just me? :sweat_smile:

I see the same

view this post on Zulip Philip Durbin 🚀 (Mar 19 2024 at 15:58):

Good, so the issue is valid.

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:28):

So this is probably the same for everyone, but the local validator was failing for me because I didn't have Cypress and libmagic installed. Once I installed those, it worked for me on Docker.

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:29):

Probably these should be pre-requisites

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:29):

image.png

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:30):

brew install nvm

nvm use default # We recommend managing NPM using NVM

npm install
npm run cypress:open  # Opens the Cypress application
npm run cypress:run  # Runs e2e tests in background

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:31):

brew install libmagic

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:32):

For NVM I was getting "nvm N/A: version "default" is not yet installed." which was fixed for me with $ nvm install 'lts/*'

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 16:42):

If anyone else needs help to get it working let me know :smile:

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 17:33):

Slava Tykhonov said:

try this from HF
https://datasets-server.huggingface.co/croissant?dataset=mnist

E0319 13:31:32.897765 7956631552 validate.py:55] Found the following 5 error(s) during the validation:
  -  "parquet-files-for-config-mnist" should have an attribute "@type": "http://mlcommons.org/croissant/FileObject" or "@type": "http://mlcommons.org/croissant/FileSet". Got https://schema.org/FileSet instead.
  -  "repo" should have an attribute "@type": "http://mlcommons.org/croissant/FileObject" or "@type": "http://mlcommons.org/croissant/FileSet". Got https://schema.org/FileObject instead.
  -  [Metadata(mnist) > RecordSet(record_set_mnist) > Field(image)] Malformed source data: parquet-files-for-config-mnist. It does not refer to any existing node. Have you used http://mlcommons.org/croissant/field or https://schema.org/distribution to indicate the source field or the source distribution? If you specified a field, it should contain all the names from the RecordSet separated by `/`, e.g.: "record_set_name/field_name"
  -  [Metadata(mnist) > RecordSet(record_set_mnist) > Field(label)] Malformed source data: parquet-files-for-config-mnist. It does not refer to any existing node. Have you used http://mlcommons.org/croissant/field or https://schema.org/distribution to indicate the source field or the source distribution? If you specified a field, it should contain all the names from the RecordSet separated by `/`, e.g.: "record_set_name/field_name"
  -  [Metadata(mnist) > RecordSet(record_set_mnist)] There is a reference to node with UUID "parquet-files-for-config-mnist" in node "record_set_mnist", but this node doesn't exist.
Found the following 3 warning(s) during the validation:
  -  [Metadata(mnist)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(mnist)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(mnist)] Property "https://schema.org/version" is recommended, but does not exist.

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 17:45):

So I went through that log and the sample from HuggingFace. In the distribution they have
"@type": "sc:FileObject" instead of "@type": "cr:FileObject", and "@type": "sc:FileSet" instead of "@type": "cr:FileSet". Once these two are updated there are no more errors, just the warnings. Once I added:

"citeAs": "TEST",
"version": "1.0.0",
"datePublished": "2024-03-19",

the file passed validation:

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 17:45):

(venv) jptosca@HMDC-JPs-MacBook-Pro croissant % ./validate.sh
I0319 13:40:04.702379 7956631552 validate.py:53] Done.
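
For anyone who wants to repeat the manual edits described above, here is a minimal Python sketch. It assumes the Croissant JSON is already loaded as a dict with a top-level "distribution" list; `patch_croissant` is a hypothetical helper (not part of mlcroissant), the `sample` dict is a trimmed-down stand-in rather than the real HuggingFace file, and the three property values are the same placeholders used in the thread.

```python
import copy
import json

# The two type rewrites that made the validation errors go away in the thread.
TYPE_FIXES = {"sc:FileObject": "cr:FileObject", "sc:FileSet": "cr:FileSet"}


def patch_croissant(metadata: dict) -> dict:
    """Hypothetical helper: return a patched copy, leaving the input untouched."""
    patched = copy.deepcopy(metadata)
    for resource in patched.get("distribution", []):
        old_type = resource.get("@type")
        if old_type in TYPE_FIXES:
            resource["@type"] = TYPE_FIXES[old_type]
    # Placeholder values for the recommended properties (only set if missing).
    patched.setdefault("citeAs", "TEST")
    patched.setdefault("version", "1.0.0")
    patched.setdefault("datePublished", "2024-03-19")
    return patched


# Stand-in metadata for illustration only, not the actual mnist file.
sample = {
    "@type": "sc:Dataset",
    "name": "mnist",
    "distribution": [
        {"@type": "sc:FileObject", "name": "repo"},
        {"@type": "sc:FileSet", "name": "parquet-files-for-config-mnist"},
    ],
}

fixed = patch_croissant(sample)
print(json.dumps(fixed, indent=2))
```

The patched dict can then be written back to a file and run through the validator again. Whether the placeholder values are acceptable for a real dataset is a separate question; they only silence the warnings.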

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 17:46):

So probably the implementation from HuggingFace is not fully complete

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:15):

I am also testing with this one from Kaggle, https://www.kaggle.com/datasets/bhavikjikadara/brand-laptops-dataset, which initially shows 1 error and 15 warnings:

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:16):

E0319 14:13:07.987779 7956631552 validate.py:55] Found the following 1 error(s) during the validation:
  -  [Metadata(Brand-Laptops-Dataset) > FileObject(archive.zip)] At least one of these properties should be defined: ['md5', 'sha256'].
Found the following 15 warning(s) during the validation:
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(OS)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(display_size)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(gpu_brand)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(gpu_type)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(index)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(is_touch_screen)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(primary_storage_capacity)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(primary_storage_type)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(resolution_height)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(resolution_width)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(secondary_storage_capacity)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(secondary_storage_type)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(year_of_warranty)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset)] Property "https://schema.org/version" is recommended, but does not exist.
(venv) jptosca@HMDC-JPs-MacBook-Pro croissant %
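
A long warning list like this is easier to read when grouped by property. The following is a rough sketch, assuming the validator output has been captured as text; `count_missing_properties` is a hypothetical helper, the regular expression is written against the message wording shown in this log only, and `log_excerpt` is a shortened copy of the output above.

```python
import re
from collections import Counter

# Matches the 'Property "<url>" is recommended, but does not exist.' messages.
RECOMMENDED = re.compile(r'Property "([^"]+)" is recommended, but does not exist\.')


def count_missing_properties(log_text: str) -> Counter:
    """Hypothetical helper: count how often each recommended property is missing."""
    return Counter(RECOMMENDED.findall(log_text))


# Shortened excerpt of the validator output, for illustration.
log_excerpt = """
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(OS)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset) > RecordSet(laptops.csv_records) > Field(index)] Property "https://schema.org/description" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(Brand-Laptops-Dataset)] Property "https://schema.org/version" is recommended, but does not exist.
"""

counts = count_missing_properties(log_excerpt)
for prop, n in counts.most_common():
    print(n, prop)
```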

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:33):

Most of the warnings are caused by the missing field descriptions, and the last two are caused by the missing version and datePublished. The only interesting thing I noticed in this one is the error that says "At least one of these properties should be defined: ['md5', 'sha256']", because this JSON does declare an MD5. If I change it to sha256, the validation is successful :thinking:
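
If the validator really does want a sha256 value, one option is to compute the digest from the actual file with Python's standard `hashlib` and store that in the FileObject. This is only a sketch: `sha256_of_file` and `use_sha256` are hypothetical helpers, dropping the "md5" key is an assumption (the thread only reports that switching to sha256 made validation pass), and the temporary file stands in for the real Kaggle archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def use_sha256(file_object: dict, path) -> dict:
    """Hypothetical helper: return a copy with "md5" removed and "sha256" set."""
    updated = {k: v for k, v in file_object.items() if k != "md5"}
    updated["sha256"] = sha256_of_file(path)
    return updated


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in bytes for illustration, not the real archive.zip from Kaggle.
    archive = Path(tmp) / "archive.zip"
    archive.write_bytes(b"stand-in bytes, not the real Kaggle archive")
    entry = {"@type": "cr:FileObject", "name": "archive.zip", "md5": "placeholder"}
    updated_entry = use_sha256(entry, archive)

print(updated_entry)
```

Note that simply renaming the key from md5 to sha256 while keeping the old value may satisfy the validator but would leave a checksum that does not match the file.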

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:34):

It looks like the validator is generally working fine, but the implementations from both sites are probably not 100% complete yet

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:55):

This is probably just an error in the message, no?

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:56):

It seems to me that the validator only takes sha256

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:56):

Also here I could only find sha256

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 18:56):

https://schema.org/DataDownload

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 19:04):

I didn't manage to find any dataset from external repositories that validates in the Croissant Editor :)

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 19:49):

image.png

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 19:49):

This was one of them after the changes. I have found cases where the CLI validation passes but it fails on the web client

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 20:07):

Can you share this validated dataset?

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 20:11):

laptops.json

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 20:22):

Screenshot-2024-03-19-at-21.21.16.png

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 20:26):

:eyes: Did you already have Cypress and libmagic set up and running?

view this post on Zulip Slava Tykhonov (Mar 19 2024 at 20:27):

it's here https://huggingface.co/spaces/MLCommons/croissant-editor?project=20240319212107872516

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 20:28):

Interesting...

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 20:37):

On the local version I am running, the same file gets validated and I can browse between the tabs, so I am not sure if there is a difference in the version of the editor

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 20:37):

:thinking:

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 20:37):

image.png

view this post on Zulip Juan Pablo Tosca Villanueva (Mar 19 2024 at 20:37):

image.png


Last updated: Nov 01 2025 at 14:11 UTC