Croissant health · python · Zulip Chat Archive

Stream: python

Topic: Croissant health

Philip Durbin 🚀 (May 06 2024 at 19:23):

@Slava Tykhonov or anyone interested in Python or Croissant, are you able to follow the README at https://github.com/mlcommons/croissant/tree/main/health ?

When I get to the scrapydweb step, I get crazy errors:

err.txt

Philip Durbin 🚀 (May 06 2024 at 19:24):

I should point out I did uncomment the early return because I don't want to actually crawl all of Hugging Face:

diff --git a/health/crawler/spiders/huggingface.py b/health/crawler/spiders/huggingface.py
index bbfc91b..e06fed3 100644
--- a/health/crawler/spiders/huggingface.py
+++ b/health/crawler/spiders/huggingface.py
@@ -17,13 +17,13 @@ class HuggingfaceSpider(BaseSpider):
     def list_datasets(self):
         """See base class."""
         # Uncomment this early return for debugging purposes:
-        # return [
-        #     "lkarjun/Malayalam-Artiicles",
-        #     "lkndsjkndgskjngkjsndkj/jsjdjsdvkjvszlhdskb",
-        #     "foo",  # does not exist
-        #     "Recag/Rp_CommonC_636_2",  # 500
-        #     "common_voice",  # timeout from Hugging Face
-        # ]
+        return [
+            "lkarjun/Malayalam-Artiicles",
+            "lkndsjkndgskjngkjsndkj/jsjdjsdvkjvszlhdskb",
+            "foo",  # does not exist
+            "Recag/Rp_CommonC_636_2",  # 500
+            "common_voice",  # timeout from Hugging Face
+        ]
         return [dataset.id for dataset in huggingface_hub.list_datasets()]

     def get_url(self, dataset_id: str):

Philip Durbin 🚀 (May 06 2024 at 19:25):

Anyway, for more context, we'd like to add a crawler for Dataverse some day, once Croissant is in place for a few installations: https://github.com/mlcommons/croissant/issues/530

Philip Durbin 🚀 (May 06 2024 at 20:00):

I asked for help here: https://github.com/mlcommons/croissant/issues/530#issuecomment-2096806017

Jan Range (May 06 2024 at 20:11):

Shall we put it on the PyDataverse WG agenda?

Philip Durbin 🚀 (May 06 2024 at 20:47):

If you want. I'm just trying to reach Python people. :grinning:

Jan Range (May 06 2024 at 21:07):

Sure, let's try to debug it :nerd:

Jan Range (May 06 2024 at 21:15):

It seems like the error originates from scrapydweb, which tries to talk to a database where a certain table is missing. Maybe there are some prerequisites necessary to set up scrapyd and they are missing here.

Jan Range (May 06 2024 at 21:16):

Trying to reproduce it on my machine now

Jan Range (May 06 2024 at 21:21):

Getting the same error on Python 3.11

Jan Range (May 06 2024 at 21:42):

Seems to be a known issue that is fixed in 1.4.1

https://github.com/my8100/scrapydweb/issues/205

Jan Range (May 06 2024 at 21:42):

~~Current scrapydweb on PyPI is 1.4.0 though~~ Scrapydweb is 1.5.0 on PyPI, but installing from the requirements.txt gives 1.4.0. I guess that's due to other dependencies using it already.
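
To confirm which scrapydweb version actually ended up in the environment, something like the following could be run inside the same virtualenv. This is only a minimal sketch: the helper names (`parse_version`, `has_fix`, `installed_version`) are made up for illustration, the 1.4.1 threshold is taken from the issue linked above, and the comparison only handles plain dotted versions such as 1.4.0.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain dotted version like '1.4.0' into (1, 4, 0).

    Parsing stops at the first non-numeric piece, so suffixes such as
    'rc1' are ignored. This is a simplification, not a full version parser.
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(installed: str, fixed_in: str = "1.4.1") -> bool:
    """Return True if `installed` is at least the version reported as fixed."""
    return parse_version(installed) >= parse_version(fixed_in)


def installed_version(package: str = "scrapydweb") -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("scrapydweb is not installed in this environment")
    elif has_fix(found):
        print(f"scrapydweb {found} should already include the fix")
    else:
        print(f"scrapydweb {found} predates 1.4.1; the pin probably needs a bump")
```

If this reports 1.4.0, the pin in the requirements.txt (or whatever else is holding the version back) would be the next thing to look at.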