Stream: python

Topic: Croissant health


Philip Durbin 🚀 (May 06 2024 at 19:23):

@Slava Tykhonov or anyone interested in Python or Croissant, are you able to follow the README at https://github.com/mlcommons/croissant/tree/main/health ?

When I get to the scrapydweb step, I get crazy errors:

err.txt

Philip Durbin 🚀 (May 06 2024 at 19:24):

I should point out I did uncomment the early return because I don't want to actually crawl all of Hugging Face:

diff --git a/health/crawler/spiders/huggingface.py b/health/crawler/spiders/huggingface.py
index bbfc91b..e06fed3 100644
--- a/health/crawler/spiders/huggingface.py
+++ b/health/crawler/spiders/huggingface.py
@@ -17,13 +17,13 @@ class HuggingfaceSpider(BaseSpider):
     def list_datasets(self):
         """See base class."""
         # Uncomment this early return for debugging purposes:
-        # return [
-        #     "lkarjun/Malayalam-Artiicles",
-        #     "lkndsjkndgskjngkjsndkj/jsjdjsdvkjvszlhdskb",
-        #     "foo",  # does not exist
-        #     "Recag/Rp_CommonC_636_2",  # 500
-        #     "common_voice",  # timeout from Hugging Face
-        # ]
+        return [
+            "lkarjun/Malayalam-Artiicles",
+            "lkndsjkndgskjngkjsndkj/jsjdjsdvkjvszlhdskb",
+            "foo",  # does not exist
+            "Recag/Rp_CommonC_636_2",  # 500
+            "common_voice",  # timeout from Hugging Face
+        ]
         return [dataset.id for dataset in huggingface_hub.list_datasets()]

     def get_url(self, dataset_id: str):
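
As an alternative to editing the spider by hand, the debug list from the diff above could be toggled with an environment variable. This is only a sketch: the variable name `CRAWLER_DEBUG` and the standalone `list_datasets(list_all)` function are hypothetical and not part of the Croissant code. In the real spider, `list_all` would wrap the `huggingface_hub.list_datasets()` call shown in the diff.

```python
import os

# Dataset ids copied from the commented-out debug list in the diff.
DEBUG_DATASETS = [
    "lkarjun/Malayalam-Artiicles",
    "lkndsjkndgskjngkjsndkj/jsjdjsdvkjvszlhdskb",
    "foo",  # does not exist
    "Recag/Rp_CommonC_636_2",  # 500
    "common_voice",  # timeout from Hugging Face
]


def list_datasets(list_all, env=os.environ):
    """Return the debug subset when CRAWLER_DEBUG=1, else the full listing.

    `list_all` is any zero-argument callable returning dataset ids.
    `CRAWLER_DEBUG` is a hypothetical variable name used for illustration.
    """
    if env.get("CRAWLER_DEBUG") == "1":
        return list(DEBUG_DATASETS)
    return list(list_all())
```

With this shape, running the crawler with `CRAWLER_DEBUG=1` set would visit only the five test datasets, and no code change is needed to switch back to a full crawl.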

Philip Durbin 🚀 (May 06 2024 at 19:25):

Anyway, for more context, we'd like to add a crawler for Dataverse some day, once Croissant is in place for a few installations: https://github.com/mlcommons/croissant/issues/530

Philip Durbin 🚀 (May 06 2024 at 20:00):

I asked for help here: https://github.com/mlcommons/croissant/issues/530#issuecomment-2096806017

Jan Range (May 06 2024 at 20:11):

Shall we put it on the PyDataverse WG agenda?

Philip Durbin 🚀 (May 06 2024 at 20:47):

If you want. I'm just trying to reach Python people. :grinning:

Jan Range (May 06 2024 at 21:07):

Sure, let's try to debug it :nerd:

Jan Range (May 06 2024 at 21:15):

It seems like the error originates from scrapydweb, which tries to talk to a database where a certain table is missing. Maybe there are some prerequisites necessary to set up scrapyd and they are missing here.

Jan Range (May 06 2024 at 21:16):

Trying to reproduce it on my machine now

Jan Range (May 06 2024 at 21:21):

Getting the same error on Python 3.11

Jan Range (May 06 2024 at 21:42):

Seems to be known and is fixed in 1.4.1

https://github.com/my8100/scrapydweb/issues/205

Jan Range (May 06 2024 at 21:42):

Current scrapydweb on PyPI is 1.4.0 though. Edit: scrapydweb is 1.5.0 on PyPI, but installing from the requirements.txt pulls in 1.4.0. I guess that is due to other dependencies using it already.

Jan Range (May 06 2024 at 21:44):

image.png

Jan Range (May 06 2024 at 21:44):

Works when you install the current dev version via:

python -m pip install git+https://github.com/my8100/scrapydweb.git
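
To confirm which scrapydweb version actually ended up in the environment, a check along these lines might help. It is a rough sketch that assumes the fix is present from 1.4.1 onward, as the linked issue suggests. The helper names `parse_version` and `has_fix` are made up for illustration, and the version parsing is deliberately simple (it only reads the leading numeric parts).

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption taken from the thread: the bug is fixed in scrapydweb 1.4.1.
FIXED_IN = (1, 4, 1)


def parse_version(text):
    """Turn '1.4.0' into (1, 4, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed, fixed_in=FIXED_IN):
    """Return True if the installed version is at least `fixed_in`."""
    return parse_version(installed) >= fixed_in


if __name__ == "__main__":
    try:
        installed = version("scrapydweb")
    except PackageNotFoundError:
        print("scrapydweb is not installed in this environment")
    else:
        status = "includes" if has_fix(installed) else "predates"
        print(f"scrapydweb {installed} {status} the 1.4.1 fix")
```

If the script reports a version that predates the fix, reinstalling from the Git repository as shown above is the workaround from this thread.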

Philip Durbin 🚀 (May 06 2024 at 21:46):

Nice! Thanks! Gotta take my dog for a walk, but I'll try it. Thanks! :midnight:

Philip Durbin 🚀 (May 07 2024 at 14:07):

Hey, it works! Thanks, @Jan Range !

scrapydweb.png

Philip Durbin 🚀 (May 07 2024 at 14:18):

@Jan Range I opened an issue and gave you a shout out: https://github.com/mlcommons/croissant/issues/647

Thanks again! :dataverse_man:

Jan Range (May 07 2024 at 17:13):

Glad to hear it works!


Last updated: Nov 01 2025 at 14:11 UTC