Hey @Philip Durbin -- it's been a long time, but I see you're still deep into everything Dataverse. I wanted to reopen the topic of indexing Dataverse content into DataONE (https://dataone.org) -- my colleague Ian has been working on getting that harvest enabled, but he's been running into blocks from robots.txt and other policy issues. I just wanted to check in and make sure we're heading in the right direction. Let me know if you have a minute to chat.
@Matt Jones hi! Yes, I'm still in deep. :grinning:
I see you replied to @Leo Andreev at https://help.hmdc.harvard.edu/Ticket/Display.html?id=337611 recently. Let me check in with him. He's been working on robots.txt stuff over at https://github.com/IQSS/dataverse.harvard.edu/issues/228 but I don't remember the exact status.
Anyway, sure! Let's do a call. 3am local time for you so let's wait a bit. :big_smile:
We're talking about it but probably won't get to it today.
ok, thanks @Philip Durbin -- when we first tried to harvest schema.org, it seems we didn't throttle, and it was a problem for your server. So Ian worked on toning it down and limiting our request rate, but on the very day he tried it again, he noticed that your robots.txt had been modified to include a Disallow: / for all user agents except Google. So we were worried that we did something wrong again. Ian sent Leo a note on RT, but we'd love to work out how to handle this in a way that works for your systems.
Heh, well, we appreciate you being good Internet citizens. :grinning:
Hi @Philip Durbin, I appreciate your and Leo's time. I'm happy to jump on a call if you need to hash out technical stuff.
Much appreciated. It's a busy day here. Sorry.
Leonid is still quite busy. Are there questions I can answer? Or is the main question just, "When will you stop blocking us?" :sweat_smile:
That is the main question for sure, with a side question of why blocking was needed for our new process -- we're hoping our throttled harvest is well below whatever limits you need to impose. If not, we'd like to make sure it is before we restart.
Probably you're collateral damage. I'm sure there are worse offenders.
so if we see the robots.txt file flip back to unblocked, is that a sign it's OK to start again?
Let us know if there's anything we need to change to be good netizens
Can you or @Ian Nesbitt please provide any specifics about what he wrote in the ticket?
In addition, I cut the concurrent requests down and added some throttling. Before the requests began to fail, the harvest was running much faster than before due to the smaller response sizes. Please let me know whether to cut it back further and I can do so.
How many requests per second are we talking about?
@Philip Durbin The lowest rate I saw was 20 pages/min and the most I saw was 36 pages/min
so between 0.3 and 0.6-ish/second
More specifics: the job started with a permissive robots.txt and the spider was requesting application/ld+json successfully, but at 2023-10-13 23:06:16 UTC those requests started returning a text/html response we couldn't parse, and then at 2023-10-13 23:09:14 they began returning 404
Thanks. Out of curiosity, do you plan to index metadata from any of the other Dataverse installations? https://dataverse.org/installations ?
Yes actually, we have been in contact with Borealis recently and just began a harvest of their Dataverse metadata as well
Oh, nice. Our Canadian friends. :flag_canada: How has the indexing experience been so far?
Excellent! It's one of the endpoints in our triage with the most complete schema.org data I've seen anywhere
Wow, that's nice to hear.
I see a lot of repositories that are missing identifier or have incorrectly configured @id or other fields so it's nice to see ones that follow the spec
Gotcha. Well we did use Google's validator.
And we put in some schema.org fixes recently. Not sure if Borealis has upgraded yet.
Looks like some correspondence is happening in https://help.hmdc.harvard.edu/Ticket/Display.html?id=337611 . Great!
@Philip Durbin We've finally got most of the Harvard Dataverse corpus scraped, harvested, and indexed in DataONE: https://search.dataone.org/portals/HD :tada:
I will have Angie Garcia, our outreach coordinator, get in contact with you as soon as we're all back from AGU, or sooner if you're there to find us in person. Cheers and Happy Holidays!
@Ian Nesbitt nice, thanks for letting us know!
@Ian Nesbitt I was just thinking, what you've built is morally equivalent to SHARE, which we list as an integration: https://guides.dataverse.org/en/6.1/admin/integrations.html#share
If you'd like to create a PR to add DataONE, the file to edit is here: https://github.com/IQSS/dataverse/blob/develop/doc/sphinx-guides/source/admin/integrations.rst
Definitely! I've added PR #10192 with the suggested changes. Thank you for the suggestion.
@Matt Jones @Ian Nesbitt can you think of any reason why https://search.dataone.org/view/sha256%3A2291cc19ed4e348a344f58f656cf5b354bfd2b8a0a05d59b2799d9333ce795f4 (Ci Technology DataSet) has my ORCID on it? My ORCID ( http://orcid.org/0000-0002-9528-9470 ) is listed as both submitter (!) and rights holder (!!). Someone just emailed me about access to that dataset but I have nothing to do with it. :confused:
Hi Philip, my apologies for the delay as I've been on vacation -- to answer your question, since the records are all just placeholders that trace back to the "real" HD records, we assign the repository manager as both submitter and rightsholder. Since the records update automatically and the node instance needs a user account to do so, we tell the system that they are being maintained by "you". It also allows you to, for example, modify the records yourself using our API, should you need to do so. We really should take the mention of the submitter and rightsholder ORCID out of prominence or view entirely for schema.org records, but that is technically how it works.
As for the root cause of why they may be asking you this question: I suppose they must need the record changed in some way. If so, we would be happy to help facilitate. The records should be updated automatically based on HD JSON-LDs on a regular basis, but in rare cases our indexer drops the records and they must be reindexed manually. Before I left for vacation I noticed HD had a backlog of changes that need to be indexed, so I can take a look and see if this is one of those cases. We have yet to identify the bug, but we're working on it.
@Philip Durbin making sure you see this :up:
@Philip Durbin Now that I'm back from vacation, it looks like the Harvard Dataverse sitemap doesn't have any lastmod dates after 2024-07-03. Perhaps the sitemap has stopped updating somehow?
I've set our sitemap spider to pick up all changes after 2024-07-03 so when the sitemap does get updated, we can get to work indexing the backlog.
@Ian Nesbitt thanks for getting back to me. Now I'm back from couple days off. It sounds like my ORCID is on thousands of datasets. Can you please remove it? :grinning:
Sure. Since I manage the metadata flows, we can replace it with mine if that's an acceptable solution.
From my perspective, that's a step in the right direction. :grinning:
I don't particularly want to field questions about rights and access, etc.
Are there any other options, longer term? Maybe even just "see source dataset"?
That's fair. Perhaps a better solution would be to hide those fields from view entirely, since they don't really mean anything to the end user.
Philip Durbin said:
Are there any other options, longer term? Maybe even just "see source dataset"?
Yes. I will bring this up at our DataONE team meeting on Thursday, because ideally we don't want end users asking data managers about these automatically managed records at all.
Well, I think the fields are meaningful. Who submitted this data? Who is the rights holder? But sure, hidden fields are better than inaccurate fields, I'd say.
Of course. They would still be visible in the system metadata, just not on the dataset landing pages.
Any idea why the HD sitemap seems to be stale since early July?
Oh, we probably switched to >50K mode. One sec.
Here, please try this one: https://dataverse.harvard.edu/sitemap_index.xml
If it's helpful, here are our docs on it: https://guides.dataverse.org/en/6.4/installation/config.html#multiple-sitemap-files-sitemap-index-file
Ah, perfect. Our spider can handle indexed sitemaps. Thank you!
Sure thing, I wonder if we should do something with the old, aging, single-file sitemap. :thinking:
Yeah, good question. Maybe a 301 Moved Permanently that redirects to the base of the index?
Good idea. I'm asking internally.
Weird: the new sitemap doesn't seem to have records newer than 2024-07-03 either.
@Ian Nesbitt ok! The sitemap should be fixed now. Please try again. And thanks again for letting us know!
Thank you @Philip Durbin! Yes, I'm scraping 4200 new records now.
Great news. And where are we with "rights holder" and "submitter"? Some day it would be nice to fill these in with the proper values. Maybe stuff like "CC0" and whoever is in the Depositor field? Of course, hiding these values for now, if they don't have accurate information, sounds good to me.
Because the records are maintained automatically, the rights have to be the same across the board. I can change the rightsHolder field to my ORCiD. The submitter field is immutable, unfortunately, so I think the best route is to hide both fields from the end user's view, so nobody misreads why they don't name the authors themselves. In essence those rights are held by the authors anyway; they just have to edit the HD record, since the DataONE record is drawn automatically from HD.
I will raise this issue at our meeting on Thursday and let you know the outcome.
Thanks, I'm curious what people think about this.
From my perspective "submitter" maps to "depositor" in Dataverse.
And rights are always a hot mess. :crazy:
But we do have lots of fields in Dataverse for rights if you want them! :sweat_smile:
Philip Durbin said:
From my perspective "submitter" maps to "depositor" in Dataverse.
Does Dataverse require ORCiDs for submission? If so that would be a fairly 1:1 translation...
No, ORCIDs are optional
Our submitter field corresponds to the user identifier of the party that initially deposited the dataset (and is immutable via the API from the time of first deposit). It is closely aligned to rightsHolder, which is the user identifier of the party that has full access rights over the dataset, which can change through time (in addition to any other access rules that are provided for other users). The values in these fields are typically ORCID values now, but could also be any identifier from an identity provider that you use (e.g., from CILogon, Globus Auth, OpenID Connect, etc).
Ideally we like to have the info as it applies to each dataset individually, but in the case of schema.org harvests, this info is usually not in the record, and so we have reverted to setting a global value for the whole collection. Which is where we went awry from your perspective, I think. Maybe we need to standardize/clarify how rights and access fields are populated in schema.org Dataset entries to promote interoperability?
We've been gaining members of DataONE that use the Dataverse platform (e.g., DataverseNO most recently), and so it would be good to iron out these details for all groups that might wish to join.
Maybe. These days we're using Croissant as an extension of Schema.org.
Here's an example: https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi%3A10.7910/DVN/HOLVXA
We don't seem to put Dataverse's "depositor" field in there. If there's a good place for it, we certainly could.
We do populate "creator" but this can be different than "depositor". A depositor can upload data on a creator/author's behalf in Dataverse.
But yes, yes, yes, we should align! :grinning:
Yeah, creator and rightsHolder in DataONE are certainly different. We interpret creator following SOSO to be the list of parties that should be cited/attributed for the Dataset. Whereas rightsHolder is about access control, and orthogonal to attribution. Various other parties can act on behalf of creators when editing and depositing datasets. So I think a separate set of roles around rightsHolder, submitter/depositor, and access control lists could be useful. But it's only really useful to people that are trying to use interoperable editing APIs (and not just public read access).
(Hi Phil! Glad to be chatting again, it has been quite a while!)
SOSO?
Also, can you please remind me... is DataONE getting Dataverse metadata from the <head> of dataset pages as Schema.org JSON-LD?
SOSO - science-on-schema.org
ooo, fun!
Philip Durbin said:
is DataONE getting Dataverse metadata from the <head> of dataset pages as Schema.org JSON-LD?
IIRC we're doing content negotiation and grabbing it from an AWS instance
I wonder if the Croissant folks know about this :thinking:
well, sure, we're on AWS
Is the code open source?
Yes -- it's located at https://github.com/DataONEorg/mnlite
After downloading and parsing the sitemaps, we query each page listed and ask for JSON-LD, which I think causes a redirect to a request like this:
https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/BUOUNW/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20240925%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240925T082255Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=ce9e514a3d9e1c521733311b597a7df31e4e8387b0b6b353891f1a2c30635b28
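Roughly, that step looks like this -- a minimal sketch with Python requests; whether a stock Dataverse install redirects an Accept: application/ld+json request the same way HD does is an assumption on my part, and the DOI is just the one from the URL above:

import requests

# Ask the dataset landing page for JSON-LD via content negotiation.
# On Harvard Dataverse this redirects to a cached schema.org export
# (an S3 URL like the one above); behavior may differ elsewhere.
landing_url = "https://dataverse.harvard.edu/dataset.xhtml"
params = {"persistentId": "doi:10.7910/DVN/BUOUNW"}
headers = {"Accept": "application/ld+json"}

resp = requests.get(landing_url, params=params, headers=headers,
                    allow_redirects=True, timeout=30)
resp.raise_for_status()
dataset = resp.json()  # the schema.org Dataset document
print(dataset.get("name"))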
On a related note, every once in a while we seem to be running into an issue where the server only returns half of a JSON document...
Carl Boettiger has been pursuing the SOSO-Croissant mapping and how they relate to one another, and it has been discussed a few times in ESIP cluster meetings, but we haven't dove into it in detail yet.
Thanks, makes sense. Looks like you're using export_schema.org.cached, but if you want, you could switch to export_croissant.cached. Google Dataset Search is encouraging sites to switch to Croissant. Please see also this summary I wrote.
I don't see Carl in the Croissant meeting minutes but I do see mention of science-on-schema.org. Thanks for putting this on my radar.
Ah, it looks like @Julian Gautier attended the DataONE "Science on Schema.org Guidelines and Experiences" call back in 2021. Good.
Good morning @Philip Durbin and team. We've received some requests for DataONE to index location information for Dataverse datasets, but I don't think we scrape any from the SOSO docs. Do you store location information (bounding boxes, points, etc.)? Is there a way for me to request that this info gets serialized into schema.org documents in future versions of the Dataverse software?
Hi! We're about to start our annual conference (#community > #Dataverse2025) but quickly, yes, please check the geospatial metadata block for a bounding box.
Ah, exciting! Have a great conference!
Thanks. Which of these export formats are you importing? https://dataverse.harvard.edu/api/info/exportFormats
application/ld+json (schema.org)
Some datasets have place names, but I don't think we get any quantitative locations
Don't feel the need to respond now, I can wait until after the conference to talk about this
@Ian Nesbitt We're back! And I think I have some good news for you. I hope! :smile:
I started with https://dataverse.harvard.edu/api/search?q=*&geo_point=42.3,-71.1&geo_radius=1.5 which is the example of a geospatial search at https://guides.dataverse.org/en/6.6/api/search.html
This led me to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/E8Z5Q3
If you export as Schema.org JSON-LD like this: https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/E8Z5Q3
You'll see this:
"spatialCoverage": [
"North America",
"Global"
]
It's not bounding boxes but you can get them from Dataverse's native JSON format if you like: https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/E8Z5Q3
Or the OAI_ORE format: https://dataverse.harvard.edu/api/datasets/export?exporter=OAI_ORE&persistentId=doi%3A10.7910/DVN/E8Z5Q3
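And if it helps, here's a rough sketch of pulling the bounding box out of that native JSON export -- the JSON path and the compound subfield layout are assumptions on my part, so please check them against a real export:

import requests

# Fetch the native (dataverse_json) export and look for the geospatial
# block's geographicBoundingBox compound field. The exact path
# ("datasetVersion" -> "metadataBlocks") is an assumption; verify it.
url = "https://dataverse.harvard.edu/api/datasets/export"
params = {"exporter": "dataverse_json",
          "persistentId": "doi:10.7910/DVN/E8Z5Q3"}
doc = requests.get(url, params=params, timeout=30).json()

geo_block = doc["datasetVersion"]["metadataBlocks"].get("geospatial", {})
for field in geo_block.get("fields", []):
    if field["typeName"] == "geographicBoundingBox":
        for box in field["value"]:
            # each box is a dict of compound subfields (west/east/north/south)
            print({name: sub["value"] for name, sub in box.items()})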
Ah, I was just typing a message to you. Hope your conference went well!
Yep! Good times. Too short. :smile:
I do see the semantic place names in spatialCoverage but when we found the geospatial metadata block field definitions last week it confirmed my suspicion that the quantitative location information doesn't get serialized to schema.org
Right. It doesn't seem to. Should it? Is there a good place in schema.org (and Croissant, if you're familiar) for bounding boxes?
Yes they do! It looks like this:
"spatialCoverage": {
"@type": "Place",
"geo": {
"@type": "GeoShape",
"box": "{SOUTH} {WEST} {NORTH} {EAST}"
}
}
}
For SO
Interesting. Would you be able to make a feature request? https://github.com/IQSS/dataverse/issues
You can read about it in the science-on-schema.org Dataset guide: https://github.com/ESIPFed/science-on-schema.org/blob/main/guides/Dataset.md#spatial-coverage
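On the Dataverse side the serialization could be as simple as something like this (a sketch only; the coordinate values are placeholders, and the "SOUTH WEST NORTH EAST" ordering comes from that guide):

def soso_spatial_coverage(south, west, north, east):
    # Build a science-on-schema.org style spatialCoverage entry,
    # following the "{SOUTH} {WEST} {NORTH} {EAST}" box template above.
    return {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            "box": f"{south} {west} {north} {east}",
        },
    }

# e.g. a bounding box roughly covering Massachusetts (placeholder values)
print(soso_spatial_coverage(41.2, -73.5, 42.9, -69.9))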
Philip Durbin said:
Interesting. Would you be able to make a feature request? https://github.com/IQSS/dataverse/issues
Definitely
Awesome. Thanks. Also, out of curiosity, have you heard of Croissant? Any interest in it? It's also based on schema.org.
I ask because from the Dataverse perspective, we implemented the original JSON-LD Schema.org format to support Google Dataset Search. But now they've deprecated it in favor of Croissant.
Yes, and we've discussed formally adopting it as well, but haven't made any official moves towards that yet
Ok. I'm just reviewing https://dataverse.harvard.edu/api/info/exportFormats again and if I'm not wrong Dataverse supports three formats based on schema.org:
Would you want that "geo box" info in all three formats?
I don't know exactly how that field translates from standard SO, but those other formats definitely support bounding boxes so I'll try to include it in the issue
I think the answer is "yes"
Great, thanks. If you want you can just say "all formats based on schema.org".
Ok. I need to finish some other stuff but I can probably post the issue later this evening
no rush, we don't have time to work on it anyway :crazy:
Same story here at DataONE as always :)
I figured :rofl:
Submitted: https://github.com/IQSS/dataverse/issues/11582
Looks great! Thanks! I made a couple tiny tweaks.
Thanks! You are quick!
Hi @Philip Durbin, we have been having an issue that I've missed since July...it seems our scraper is getting empty status code 202 responses from the HD server when it tries to get the base sitemap. I think the reason I missed it is because it doesn't register as an error...
Here's what's returned when I wget from the scraper server:
$ wget https://dataverse.harvard.edu/sitemap_index.xml
--2025-10-27 18:35:21-- https://dataverse.harvard.edu/sitemap_index.xml
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 54.86.163.49, 3.211.175.147, 3.215.43.147
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|54.86.163.49|:443... connected.
HTTP request sent, awaiting response... 202 Accepted
Length: 0 [text/html]
Saving to: 'sitemap_index.xml'
sitemap_index.xml [ <=> ] 0 --.-KB/s in 0s
2025-10-27 18:35:22 (0.00 B/s) - 'sitemap_index.xml' saved [0/0]
Length 0. Interesting.
It loads fine in a browser which is odd
Hey all. I used to scrape the HTML of certain types of pages on Harvard Dataverse and had to stop back in April 2025. Leonid told me back then that the 202 status code I was seeing was because the IT folks who help manage security related things for Harvard Dataverse (HUIT) implemented some "silent challenge" that makes pages accessible from browsers only (or by using a Harvard VPN, although I couldn't get this to work back then, and eventually I stopped needing to scrape).
Ah. Well, that would explain it. Thank you @Julian Gautier
I imagine you're getting crawled by all sorts of LLM scrapers, so I understand the necessity, but it would be nice if DataONE's metadata scraper could be exempted from that restriction, because people do expect HD records to be aggregated in DataONE and we do send legitimate traffic to HD...
And it's still necessary to scrape the page instead of using the Dataverse API, right? I was able to stop scraping when the info I needed was made available with a new API endpoint, and I was able to use that instead. Sorry if I'm asking a question you've already talked about. I haven't read everything in this thread yet :sweat_smile:
It's ok. We parse the sitemaps to get landing page URLs for datasets, then use the lastmod date to filter for only the most recent ones, and download JSON-LD metadata from the endpoint using content negotiation. It's similar to what the Google Dataset Search scraper is doing
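Roughly, the sitemap half of that looks like this (a standard-library sketch; it assumes our crawler IP is allowed through the WAF challenge discussed above, and the cutoff date is just the one from earlier in this thread):

import urllib.request
import xml.etree.ElementTree as ET

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
CUTOFF = "2024-07-03"  # harvest anything modified after this date

def fetch_xml(url):
    with urllib.request.urlopen(url, timeout=30) as r:
        return ET.fromstring(r.read())

# 1. sitemap index -> child sitemap URLs
index = fetch_xml("https://dataverse.harvard.edu/sitemap_index.xml")
sitemap_urls = [loc.text for loc in index.iter(SM_NS + "loc")]

# 2. each child sitemap -> landing-page URLs changed after the cutoff
recent = []
for sm_url in sitemap_urls:
    for url_el in fetch_xml(sm_url).iter(SM_NS + "url"):
        loc = url_el.findtext(SM_NS + "loc")
        lastmod = url_el.findtext(SM_NS + "lastmod") or ""
        if loc and lastmod[:10] > CUTOFF:
            recent.append(loc)

print(len(recent), "landing pages to (re)harvest")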
@Ian Nesbitt - DataONE could you use Signposting to get the links to the JSON-LD files?
Please see this PR: expose links to all export formats via Signposting #11045
And https://guides.dataverse.org/en/6.8/api/native-api.html#retrieve-signposting-information
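Something along these lines might work -- a sketch with Python requests; which rel values and export formats show up in the Link header depends on the Dataverse version, so treat the parsing as an assumption:

import requests

landing_url = "https://dataverse.harvard.edu/dataset.xhtml"
params = {"persistentId": "doi:10.7910/DVN/BUOUNW"}

# A HEAD request returns only headers -- no landing-page payload.
resp = requests.head(landing_url, params=params, timeout=30)

# Signposting advertises its pointers in the Link header; rel="describedby"
# entries should point at metadata exports, rel="linkset" at the full link set.
print(resp.headers.get("Link"))

# requests also exposes a parsed view keyed by rel (note: if several links
# share a rel, only one of them survives in this dict).
for rel, link in resp.links.items():
    print(rel, "->", link.get("url"))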
I'm sure I can find a way to make HEAD requests to the landing pages. Currently we're doing GET requests but using content negotiation to ask for JSON-LD, so the response redirects us to a URL like
https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/BUOUNW/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20240925%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240925T082255Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=ce9e514a3d9e1c521733311b597a7df31e4e8387b0b6b353891f1a2c30635b28
Would the HEAD request be any more efficient than the content negotiation we currently use?
Definitely. Right now you're getting the whole payload of the page with a GET, right?
I don't think we end up having to download any XHTML, because we get redirected to that export_schema.org.cached function when the server sees content negotiation in the request, which I assumed was the most efficient way of doing things
I see. So you're already skipping the step of doing a GET of the dataset landing page, you're saying. You go directly to the cached export by constructing the URL you need based on the DOI. Is that right?
Correct. I can recreate the requests in a curl -v command and show you the outputs but I assume I'd run into the aforementioned command line restriction
Can you set the user agent to look like a browser?
I can. In the scraper or the curl command?
Maybe try in curl and if it works, try in the scraper?
Sadly it still knows I'm on the command line:
$ curl -v -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:143.0) Gecko/20100101 Firefox/143.0" https://dataverse.harvard.edu/sitemap_index.xml
* Trying 3.211.175.147:443...
* TCP_NODELAY set
* Connected to dataverse.harvard.edu (3.211.175.147) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
* subject: C=US; ST=Massachusetts; O=President and Fellows of Harvard College; CN=dataverse.harvard.edu
* start date: Apr 30 00:00:00 2025 GMT
* expire date: May 31 23:59:59 2026 GMT
* subjectAltName: host "dataverse.harvard.edu" matched cert's "dataverse.harvard.edu"
* issuer: C=US; O=Internet2; CN=InCommon RSA Server CA 2
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55c388e8e0d0)
> GET /sitemap_index.xml HTTP/2
> Host: dataverse.harvard.edu
> user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:143.0) Gecko/20100101 Firefox/143.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 202
< server: awselb/2.0
< date: Mon, 27 Oct 2025 20:53:52 GMT
< content-length: 0
< x-amzn-waf-action: challenge
< cache-control: no-store, max-age=0
< content-type: text/html; charset=UTF-8
< access-control-allow-origin: *
< access-control-max-age: 86400
< access-control-allow-methods: OPTIONS,GET,POST
< access-control-expose-headers: x-amzn-waf-action
<
* Connection #0 to host dataverse.harvard.edu left intact
I pinged Leo earlier. Maybe he'll save us.
Hi Ian,
< HTTP/2 202
< access-control-expose-headers: x-amzn-waf-action
Yes, this is AWS WAF Silent Challenge that HUIT are enforcing on our UI pages now, to weed out non-browser calls. (HUIT is the Harvard group that runs the load balancer our servers sit behind).
Please send me your crawler's ip address(es)/subnets so that I could ask them to be exempted from this WAF rule.
I checked and I still have the rewrite rules in place for your crawler to serve fast redirects to exported metadata records on S3.
(for the record, our /api is exempt from this blocking; but I'm not suggesting going through that as a solution, since I remember you had reasons to prefer to follow the standard sitemap route. Plus the custom redirects worked really well in the end).
And yes, virtually all Harvard sites that serve any data that can be fed to LLMs have been getting crawled to death. So they've been resorting to increasingly harsh measures to protect the perimeter from the bots.
We'll work it out.
All the best,
-Leo
P.S. I have 128.111.85.17 for your spider in my records - but that was a while ago.
Hi Leo, makes sense -- I'm actually not sure if our IPs have changed, but we have a production scraper at 128.111.85.168 (sonode.dataone.org) and a test scraper at 128.111.85.172 (so.test.dataone.org).
Yes, we do have to download and parse the whole sitemap unfortunately, but the redirects have been working quite well!
I got a confirmation that the 2 ips above have been added to the exemptions list.
Could you please remind me whether your crawler can be throttled so as not to exceed a certain call rate?
That's another thing HUIT are enforcing. Unlike the silent challenges, and for reasons I don't fully understand, they have been unable to grant us exceptions for specific url patterns etc. with that.
At the moment the rate is defined as 300 calls/5 min., after which they put the ip on their crap list (code 403) for the next 5 min.
(I am working with them on relaxing these rules/making them more flexible etc., as this is causing us real problems; but that's what we have to work around at the moment)
We can delay each call as much as needed. Currently we enforce a 2-second delay between each.
Great, that should be more than slow enough.
I think if I set it to 1/sec it would be fine, because the delay time does not include processing time, but I also want to be kind to your servers and there's really no rush in getting the scrape done.
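For reference, the pacing we're talking about amounts to something like this (a sketch only; the 2-second delay and the 300-calls-per-5-minutes ceiling are the numbers from this thread, and the safety margin is arbitrary):

import time

MIN_INTERVAL = 2.0          # seconds between calls (current spider setting)
WINDOW = 300.0              # HUIT window: 5 minutes
MAX_CALLS_PER_WINDOW = 250  # stay comfortably under the 300-call limit

_recent = []                # timestamps of calls in the current window

def wait_for_slot():
    # Block until another request is allowed under both limits.
    now = time.monotonic()
    if _recent and now - _recent[-1] < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - (now - _recent[-1]))
    while True:
        now = time.monotonic()
        _recent[:] = [t for t in _recent if now - t < WINDOW]
        if len(_recent) < MAX_CALLS_PER_WINDOW:
            break
        time.sleep(WINDOW - (now - _recent[0]))
    _recent.append(time.monotonic())

# call wait_for_slot() before each harvest request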