@Slava Tykhonov or anyone interested in Python or Croissant, are you able to follow the README at https://github.com/mlcommons/croissant/tree/main/health ?
When I get to the scrapydweb step, I get crazy errors:
I should point out I did uncomment the early return because I don't want to actually crawl all of Hugging Face:
diff --git a/health/crawler/spiders/huggingface.py b/health/crawler/spiders/huggingface.py
index bbfc91b..e06fed3 100644
--- a/health/crawler/spiders/huggingface.py
+++ b/health/crawler/spiders/huggingface.py
@@ -17,13 +17,13 @@ class HuggingfaceSpider(BaseSpider):
     def list_datasets(self):
         """See base class."""
         # Uncomment this early return for debugging purposes:
-        # return [
-        #     "lkarjun/Malayalam-Artiicles",
-        #     "lkndsjkndgskjngkjsndkj/jsjdjsdvkjvszlhdskb",
-        #     "foo",  # does not exist
-        #     "Recag/Rp_CommonC_636_2",  # 500
-        #     "common_voice",  # timeout from Hugging Face
-        # ]
+        return [
+            "lkarjun/Malayalam-Artiicles",
+            "lkndsjkndgskjngkjsndkj/jsjdjsdvkjvszlhdskb",
+            "foo",  # does not exist
+            "Recag/Rp_CommonC_636_2",  # 500
+            "common_voice",  # timeout from Hugging Face
+        ]
         return [dataset.id for dataset in huggingface_hub.list_datasets()]
     def get_url(self, dataset_id: str):
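A possible alternative to uncommenting the hard-coded list is to cap how many dataset ids are taken from the listing. This is only a sketch and not something the repository does: `first_dataset_ids` is a hypothetical helper, and the fake listing below stands in for `huggingface_hub.list_datasets()`, assuming it yields objects with an `.id` attribute as in the diff above.

```python
from itertools import islice
from types import SimpleNamespace
from typing import Iterable, List


def first_dataset_ids(datasets: Iterable, limit: int = 5) -> List[str]:
    """Return the ids of at most `limit` datasets without consuming the rest."""
    return [dataset.id for dataset in islice(datasets, limit)]


# Stand-in for the real listing (assumption: each item exposes `.id`).
fake_listing = (SimpleNamespace(id=f"user/dataset-{i}") for i in range(1000))
print(first_dataset_ids(fake_listing, limit=3))
```

Because `islice` stops early, a lazy listing is never fully iterated, which is the point when you do not want to crawl all of Hugging Face.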
Anyway, for more context, we'd like to add a crawler for Dataverse some day, once Croissant is in place for a few installations: https://github.com/mlcommons/croissant/issues/530
I asked for help here: https://github.com/mlcommons/croissant/issues/530#issuecomment-2096806017
Shall we put it on the PyDataverse WG agenda?
If you want. I'm just trying to reach Python people. :grinning:
Sure, let's try to debug it :nerd:
It seems like the error originates from scrapydweb, which tries to talk to a database where a certain table is missing. Maybe there are some prerequisites for setting up scrapyd that are missing here.
Trying to reproduce it on my machine now
Getting the same error on Python 3.11
Seems to be a known issue that is fixed in 1.4.1:
https://github.com/my8100/scrapydweb/issues/205
The current scrapydweb is 1.5.0, but installing from requirements.txt gives 1.4.0 (the scrapydweb on PyPI is 1.4.0, though). I guess that is due to other dependencies using it already.
Works when you install the current dev version via:
python -m pip install git+https://github.com/my8100/scrapydweb.git
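To see which scrapydweb version actually ended up in the environment, something like the following could help. It is a minimal sketch using only the standard library: the helper names are made up for illustration, the 1.4.1 threshold is taken from the issue linked above, and the version parsing is deliberately naive (it only looks at the first three numeric components).

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.4.0' into (1, 4, 0); anything after the third number is ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_upgrade(installed: str, fixed_in: str = "1.4.1") -> bool:
    """True if `installed` is older than the release assumed to carry the fix."""
    return parse_version(installed) < parse_version(fixed_in)


version = installed_version("scrapydweb")
if version is None:
    print("scrapydweb is not installed in this environment")
elif needs_upgrade(version):
    print(f"scrapydweb {version} predates the 1.4.1 fix")
else:
    print(f"scrapydweb {version} should include the fix")
```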
Nice! Thanks! Gotta take my dog for a walk, but I'll try it. Thanks!
Hey, it works! Thanks, @Jan Range !
@Jan Range I opened an issue and gave you a shout-out: https://github.com/mlcommons/croissant/issues/647
Thanks again!
Glad to hear it works!
Last updated: Nov 01 2025 at 14:11 UTC