DataONE search · troubleshooting

Hey @Philip Durbin -- it's been a long time, but I see you're still deep into everything Dataverse. I wanted to reopen the topic of indexing Dataverse content into DataONE (https://dataone.org) -- my colleague Ian has been working on getting that harvest enabled, but is running into blocks from robots.txt and other policy issues -- I just wanted to check in and make sure we're heading in the right direction. Let me know if you have a minute to chat.

Philip Durbin 🚀 (Oct 17 2023 at 11:04):

Anyway, sure! Let's do a call. 3am local time for you so let's wait a bit. :big_smile:

Philip Durbin 🚀 (Oct 17 2023 at 18:13):

Matt Jones (Oct 17 2023 at 18:49):

ok, thanks @Philip Durbin -- when we first tried to harvest schema.org, it seems like we didn't throttle and so it was a problem for your server. So, Ian worked on toning it down and limiting our request rate, but then on the very day that he tried it again, he noticed that your robots.txt was modified to include a Disallow: / for all user agents except Google. So we were worried that we did something wrong again. Ian sent Leo a note on RT, but we'd love to work out how to handle this in a way that works for your systems.

Philip Durbin 🚀 (Oct 17 2023 at 18:53):

Ian Nesbitt (Oct 17 2023 at 19:02):

Hi @Philip Durbin, I appreciate your and Leo's time. I'm happy to jump on a call if you need to hash out technical stuff.

Philip Durbin 🚀 (Oct 17 2023 at 20:44):

Philip Durbin 🚀 (Oct 19 2023 at 16:37):

Leonid is still quite busy. Are there questions I can answer? Or is the main question just, "When will you stop blocking us?" :sweat_smile:

Matt Jones (Oct 19 2023 at 22:50):

That is the main question for sure, with a side question of why blocking was needed for our new process -- we're hoping our throttled harvest is well below whatever limits you need to impose. If not, we'd like to be sure they are before we restart.

Philip Durbin 🚀 (Oct 19 2023 at 22:50):

Matt Jones (Oct 19 2023 at 22:52):

so if we see the robots.txt file flip back to unblocked, is that a sign it's ok to start again?

Matt Jones (Oct 19 2023 at 22:54):

Philip Durbin 🚀 (Oct 19 2023 at 23:47):

Can you or @Ian Nesbitt please provide any specifics about what he wrote in the ticket?

Ian Nesbitt (Oct 20 2023 at 02:07):

@Philip Durbin The lowest rate I saw was 20 pages/min and the most I saw was 36 pages/min

Ian Nesbitt (Oct 20 2023 at 02:07):

Ian Nesbitt (Oct 20 2023 at 14:56):

More specifics: the job started with a permissive robots.txt and the spider was requesting application/ld+json successfully, but at 2023-10-13 23:06:16 UTC those requests started returning a text/html response we couldn't parse, and then at 2023-10-13 23:09:14 they began returning 404
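
A minimal sketch of the request and check described above, for illustration only. It builds a GET request that asks for `application/ld+json` and sorts a response into the three cases Ian reports: usable JSON-LD, an unexpected `text/html` body, and a 404. The function names, the `example.org` URL, and the user-agent string are hypothetical and are not taken from DataONE's spider.

```python
from urllib.request import Request

JSONLD = "application/ld+json"


def build_jsonld_request(landing_page_url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request that asks for JSON-LD."""
    return Request(
        landing_page_url,
        headers={"Accept": JSONLD, "User-Agent": user_agent},
        method="GET",
    )


def classify_response(status: int, content_type: str) -> str:
    """Sort a response into 'ok', 'not_found', or 'unexpected'."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if status == 404:
        return "not_found"
    if status == 200 and media_type == JSONLD:
        return "ok"
    return "unexpected"


# Hypothetical landing page and user agent, for illustration only.
request = build_jsonld_request("https://example.org/dataset", "example-metadata-bot")
```

Logging the `unexpected` case separately from `not_found` would make a switch from JSON-LD to HTML responses visible as soon as it happens.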

Philip Durbin 🚀 (Oct 20 2023 at 14:59):

Ian Nesbitt (Oct 20 2023 at 15:02):

Yes actually, we have been in contact with Borealis recently and just began a harvest of their Dataverse metadata as well

Philip Durbin 🚀 (Oct 20 2023 at 15:03):

Oh, nice. Our Canadian friends. :flag_canada: How has the indexing experience been so far?

Ian Nesbitt (Oct 20 2023 at 15:10):

Excellent! It's one of the endpoints in our triage that has the most complete schema.org data of anywhere I've seen

Philip Durbin 🚀 (Oct 20 2023 at 15:12):

Ian Nesbitt (Oct 20 2023 at 15:16):

I see a lot of repositories that are missing identifier or have incorrectly configured @id or other fields so it's nice to see ones that follow the spec

Philip Durbin 🚀 (Oct 20 2023 at 15:21):

Philip Durbin 🚀 (Oct 20 2023 at 15:22):

And we put in some schema.org fixes recently. Not sure if Borealis has upgraded yet.

Philip Durbin 🚀 (Oct 23 2023 at 15:49):

Ian Nesbitt (Dec 11 2023 at 02:13):

@Philip Durbin We've finally got most of the Harvard Dataverse corpus scraped, harvested, and indexed in DataONE: https://search.dataone.org/portals/HD :tada:

Ian Nesbitt (Dec 11 2023 at 02:15):

I will have Angie Garcia, our outreach coordinator, get in contact with you as soon as we're all back from AGU, or sooner if you're there to find us in person. Cheers and Happy Holidays!

Philip Durbin 🚀 (Dec 11 2023 at 02:22):

Philip Durbin 🚀 (Dec 15 2023 at 20:13):

Philip Durbin 🚀 (Dec 15 2023 at 20:14):

Ian Nesbitt (Dec 15 2023 at 22:26):

Definitely! I've added PR #10192 with the suggested changes. Thank you for the suggestion.

Philip Durbin 🚀 (Oct 07 2024 at 14:00):

Ian Nesbitt (Oct 11 2024 at 20:42):

Hi Philip, my apologies for the delay as I've been on vacation—to answer
your question, since the records are all just placeholders that trace back
to the "real" HD records, we assign the repository manager as both
submitter and rightsholder. Since the records update automatically and the
node instance needs a user account to do so, we tell the system that they
are being maintained by "you". It also allows you to, for example, modify
the records yourself using our API, should you need to do so. We really
should take the mention of the submitter and rightsholder ORCID out of
prominence or view entirely for schema.org records, but that is technically
how it works.

As for the root cause of why they may be asking you this question: I
suppose they must need the record changed in some way. If so, we would be
happy to help facilitate. The records should be updated automatically based
on HD JSON-LDs on a regular basis, but sometimes in rare cases our indexer
drops the records and they must be reindexed manually. Before I left for
vacation I noticed HD had a backlog of changes that need to be indexed, so
I can take a look and see if this is one of those cases. We have yet to
identify the bug, but we're working on it.

Ian Nesbitt (Oct 11 2024 at 21:04):

Ian Nesbitt (Oct 15 2024 at 14:07):

@Philip Durbin 🐉 Now that I'm back from vacation, it looks like the Harvard Dataverse sitemap doesn't have any lastmod dates after 2024-07-03. Perhaps the sitemap has stopped updating somehow?

Ian Nesbitt (Oct 15 2024 at 14:14):

I've set our sitemap spider to pick up all changes after 2024-07-03 so when the sitemap does get updated, we can get to work indexing the backlog.

Philip Durbin 🚀 (Oct 15 2024 at 15:40):

@Ian Nesbitt thanks for getting back to me. Now I'm back from a couple days off. It sounds like my ORCID is on thousands of datasets. Can you please remove it? :grinning:

Ian Nesbitt (Oct 15 2024 at 15:43):

Sure. Since I manage the metadata flows, we can replace it with mine if that's an acceptable solution.

Philip Durbin 🚀 (Oct 15 2024 at 15:48):

Philip Durbin 🚀 (Oct 15 2024 at 15:49):

Ian Nesbitt (Oct 15 2024 at 15:49):

That's fair. Perhaps a better solution would be to hide those fields from view entirely, since they don't really mean anything to the end user.

Ian Nesbitt (Oct 15 2024 at 15:51):

Yes. I will bring this up at our DataONE team meeting on Thursday, because ideally we don't want end users asking data managers about these automatically managed records at all.

Philip Durbin 🚀 (Oct 15 2024 at 15:55):

Well, I think the fields are meaningful. Who submitted this data? Who is the rights holder? But sure, hidden fields are better than inaccurate fields, I'd say.

Ian Nesbitt (Oct 15 2024 at 15:56):

Of course. They would still be visible in the system metadata, just not on the dataset landing pages.

Ian Nesbitt (Oct 15 2024 at 15:59):

Philip Durbin 🚀 (Oct 15 2024 at 16:00):

Philip Durbin 🚀 (Oct 15 2024 at 16:01):

Philip Durbin 🚀 (Oct 15 2024 at 16:02):

Ian Nesbitt (Oct 15 2024 at 16:06):

Philip Durbin 🚀 (Oct 15 2024 at 16:09):

Sure thing, I wonder if we should do something with the old, aging, single-file sitemap. :thinking:

Ian Nesbitt (Oct 15 2024 at 16:10):

Yeah, good question. Maybe a moved permanently that redirects to the base of the index?

Philip Durbin 🚀 (Oct 15 2024 at 16:14):

Ian Nesbitt (Oct 15 2024 at 16:15):

Weird: the new sitemap doesn't seem to have records newer than 2024-07-03 either.

Philip Durbin 🚀 (Oct 15 2024 at 16:15):

Philip Durbin 🚀 (Oct 15 2024 at 17:50):

@Ian Nesbitt ok! The sitemap should be fixed now. Please try again. And thanks again for letting us know!

Ian Nesbitt (Oct 15 2024 at 18:01):

Philip Durbin 🚀 (Oct 15 2024 at 18:10):

Great news. And where are we with "rights holder" and "submitter"? Some day it would be nice to fill these in with the proper values. Maybe stuff like "CC0" and whoever is in the Depositor field? Of course, hiding these values for now, if they don't have accurate information, sounds good to me.

Ian Nesbitt (Oct 15 2024 at 19:16):

Because the records are maintained automatically, the rights have to be the same across the board. I can change the rightsHolder field to my ORCiD. The submitter field is immutable, unfortunately, so I think the best route to take would be to just hide them from view of the end user so it doesn't get misinterpreted as to why it's not the authors themselves. Because in essence, those rights are held by the authors themselves, they just have to edit the HD record, since the DataONE record is automatically drawn from HD.

Ian Nesbitt (Oct 15 2024 at 19:16):

I will raise this issue at our meeting on Thursday and let you know the outcome.

Philip Durbin 🚀 (Oct 15 2024 at 20:10):

Philip Durbin 🚀 (Oct 15 2024 at 20:11):

But we do have lots of fields in Dataverse for rights if you want them! :sweat_smile:

Ian Nesbitt (Oct 15 2024 at 20:30):

Does Dataverse require ORCiDs for submission? If so that would be a fairly 1:1 translation...

Philip Durbin 🚀 (Oct 15 2024 at 20:34):

Matt Jones (Oct 15 2024 at 21:01):

Our submitter field corresponds to the user identifier of the party that initially deposited the dataset (and is immutable via the API from the time of first deposit). It is closely aligned to rightsHolder, which is the user identifier of the party that has full access rights over the dataset, which can change through time (in addition to any other access rules that are provided for other users). The values in these fields are typically ORCID values now, but could also be any identifier from an identity provider that you use (e.g., from CILogon, Globus Auth, OpenID Connect, etc).

Matt Jones (Oct 15 2024 at 21:03):

Ideally we like to have the info as it applies to each dataset individually, but in the case of schema.org harvests, this info is usually not in the record, and so we have reverted to setting a global value for the whole collection. Which is where we went awry from your perspective, I think. Maybe we need to standardize/clarify how rights and access fields are populated in schema.org Dataset entries to promote interoperability?

Matt Jones (Oct 15 2024 at 21:06):

We've been gaining members of DataONE that use the Dataverse platform (e.g., DataverseNO most recently), and so it would be good to iron out these details for all groups that might wish to join.

Philip Durbin 🚀 (Oct 15 2024 at 21:06):

Philip Durbin 🚀 (Oct 15 2024 at 21:07):

We don't seem to put Dataverse's "depositor" field in there. If there's a good place for it, we certainly could.

Philip Durbin 🚀 (Oct 15 2024 at 21:08):

We do populate "creator" but this can be different than "depositor". A depositor can upload data on a creator/author's behalf in Dataverse.

Philip Durbin 🚀 (Oct 15 2024 at 21:08):

Matt Jones (Oct 15 2024 at 21:13):

Yeah, creator and rightsHolder in DataONE are certainly different. We interpret creator following SOSO to be the list of parties that should be cited/attributed for the Dataset. Whereas rightsHolder is about access control, and orthogonal to attribution. Various other parties can act on behalf of creators when editing and depositing datasets. So I think a separate set of roles around rightsHolder, submitter/depositor, and access control lists could be useful. But it's only really useful to people that are trying to use interoperable editing APIs (and not just public read access).

Matt Jones (Oct 15 2024 at 21:14):

Philip Durbin 🚀 (Oct 15 2024 at 21:23):

Philip Durbin 🚀 (Oct 15 2024 at 21:24):

Also, can you please remind me... is DataONE getting Dataverse metadata from the <head> of dataset pages as Schema.org JSON-LD?

Ian Nesbitt (Oct 15 2024 at 21:24):

Philip Durbin 🚀 (Oct 15 2024 at 21:25):

Ian Nesbitt (Oct 15 2024 at 21:25):

Philip Durbin 🚀 (Oct 15 2024 at 21:25):

Philip Durbin 🚀 (Oct 15 2024 at 21:26):

Ian Nesbitt (Oct 15 2024 at 21:27):

Ian Nesbitt (Oct 15 2024 at 21:30):

After downloading and parsing the sitemaps, we query each page listed and ask for JSON-LD, which I think causes a redirect to a request like this:

https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/BUOUNW/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20240925%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240925T082255Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=ce9e514a3d9e1c521733311b597a7df31e4e8387b0b6b353891f1a2c30635b28

Ian Nesbitt (Oct 15 2024 at 21:36):

On a related note, every once in a while we seem to be running into an issue where the server only returns half of a json document...
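
One way to guard against the half-document case is to refuse to index anything that does not parse as a complete JSON object. This is only a sketch under that assumption; the function name and the sample document are hypothetical.

```python
import json
from typing import Optional


def parse_complete_json(body: str) -> Optional[dict]:
    """Return the parsed document, or None if the body is not one complete JSON object."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        # A truncated body ends mid-string or mid-object and fails to parse.
        return None
    return doc if isinstance(doc, dict) else None


# Hypothetical record, cut in half to simulate a truncated response.
full = '{"@type": "Dataset", "name": "Example"}'
half = full[: len(full) // 2]
```

A caller could retry the request when this returns `None` instead of passing the fragment on to the indexer.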

Matt Jones (Oct 15 2024 at 21:38):

Carl Boettiger has been pursuing the SOSO-Croissant mapping and how they relate to one another, and it has been discussed a few times in ESIP cluster meetings, but we haven't dived into it in detail yet.

Philip Durbin 🚀 (Oct 15 2024 at 22:44):

Thanks, makes sense. Looks like you're using export_schema.org.cached but if you want, you could switch to export_croissant.cached. Google Dataset Search is encouraging sites to switch to Croissant. Please see also this summary I wrote.

Philip Durbin 🚀 (Oct 15 2024 at 22:46):

I don't see Carl in the Croissant meeting minutes but I do see mention of science-on-schema.org. Thanks for putting this on my radar.

Philip Durbin 🚀 (Oct 16 2024 at 13:43):

Ah, it looks like @Julian Gautier attended the DataONE "Science on Schema.org Guidelines and Experiences" call back in 2021. Good.

Ian Nesbitt (Jun 10 2025 at 12:45):

Good morning @Philip Durbin and team. We've received some requests for DataONE to index location information for Dataverse datasets, but I don't think we scrape any from the SOSO docs. Do you store location information (bounding boxes, points, etc.)? Is there a way for me to request that this info gets serialized into schema.org documents in future versions of the Dataverse software?

Philip Durbin 🚀 (Jun 10 2025 at 12:46):

Hi! We're about to start our annual conference (#community > #Dataverse2025) but quickly, yes, please check the geospatial metadata block for a bounding box.

Ian Nesbitt (Jun 10 2025 at 12:48):

Philip Durbin 🚀 (Jun 10 2025 at 12:50):

Ian Nesbitt (Jun 10 2025 at 12:51):

Ian Nesbitt (Jun 10 2025 at 12:53):

Some datasets have place names, but I don't think we get any quantitative locations

Ian Nesbitt (Jun 10 2025 at 12:56):

Don't feel the need to respond now, I can wait until after the conference to talk about this

Philip Durbin 🚀 (Jun 16 2025 at 19:32):

@Ian Nesbitt We're back! And I think I have some good news for you. I hope! :smile:

  "spatialCoverage": [
    "North America",
    "Global"
  ]

Philip Durbin 🚀 (Jun 16 2025 at 19:33):

Philip Durbin 🚀 (Jun 16 2025 at 19:34):

Ian Nesbitt (Jun 16 2025 at 19:37):

Philip Durbin 🚀 (Jun 16 2025 at 19:38):

Ian Nesbitt (Jun 16 2025 at 19:41):

I do see the semantic place names in spatialCoverage but when we found the geospatial metadata block field definitions last week it confirmed my suspicion that the quantitative location information doesn't get serialized to schema.org

Philip Durbin 🚀 (Jun 16 2025 at 19:46):

Right. It doesn't seem to. Should it? Is there a good place in schema.org (and Croissant, if you're familiar) for bounding boxes?

Ian Nesbitt (Jun 16 2025 at 19:48):

{
  "spatialCoverage": {
    "@type": "Place",
    "geo": {
      "@type": "GeoShape",
      "box": "{SOUTH} {WEST} {NORTH} {EAST}"
    }
  }
}
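
As a rough illustration of how the snippet above could be filled in, the sketch below builds the same `Place`/`GeoShape` structure from four coordinates, keeping the south, west, north, east ordering. The function name and the sample coordinates are hypothetical, and how Dataverse's geospatial block fields would map onto these arguments is not specified here.

```python
def spatial_coverage(south: float, west: float, north: float, east: float) -> dict:
    """Build a schema.org Place whose GeoShape box is "south west north east"."""
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    # west may exceed east for boxes that cross the antimeridian, so that is allowed.
    return {
        "@type": "Place",
        "geo": {"@type": "GeoShape", "box": f"{south} {west} {north} {east}"},
    }


# Hypothetical bounding box, for illustration only.
example = {"spatialCoverage": spatial_coverage(24.5, -125.0, 49.4, -66.9)}
```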

Ian Nesbitt (Jun 16 2025 at 19:49):

Philip Durbin 🚀 (Jun 16 2025 at 19:49):

Ian Nesbitt (Jun 16 2025 at 19:50):

Philip Durbin 🚀 (Jun 16 2025 at 19:52):

Awesome. Thanks. Also, out of curiosity, have you heard of Croissant? Any interest in it? It's also based on schema.org.

I ask because from the Dataverse perspective, we implemented the original JSON-LD Schema.org format to support Google Dataset Search. But now they've deprecated it in favor of Croissant.

Ian Nesbitt (Jun 16 2025 at 19:53):

Yes, and we've discussed formally adopting it as well, but haven't made any official moves towards that yet

Philip Durbin 🚀 (Jun 16 2025 at 19:56):

Ian Nesbitt (Jun 16 2025 at 19:58):

I don't know exactly how that field translates from standard SO, but those other formats definitely support bounding boxes so I'll try to include it in the issue

Ian Nesbitt (Jun 16 2025 at 19:59):

Philip Durbin 🚀 (Jun 16 2025 at 19:59):

Ian Nesbitt (Jun 16 2025 at 20:00):

Ok. I need to finish some other stuff but I can probably post the issue later this evening

Philip Durbin 🚀 (Jun 16 2025 at 20:00):

Ian Nesbitt (Jun 16 2025 at 20:01):

Philip Durbin 🚀 (Jun 16 2025 at 20:01):

Ian Nesbitt (Jun 17 2025 at 14:36):

Philip Durbin 🚀 (Jun 17 2025 at 14:38):

Ian Nesbitt (Jun 17 2025 at 14:40):

Ian Nesbitt (Oct 27 2025 at 18:29):

Hi @Philip Durbin 🚀 , we have been having an issue that I've missed since July...it seems our scraper is getting empty status code 202 responses from the HD server when it tries to get the base sitemap. I think the reason I missed it is because it doesn't register as an error...

Ian Nesbitt (Oct 27 2025 at 18:37):

$ wget https://dataverse.harvard.edu/sitemap_index.xml
--2025-10-27 18:35:21--  https://dataverse.harvard.edu/sitemap_index.xml
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 54.86.163.49, 3.211.175.147, 3.215.43.147
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|54.86.163.49|:443... connected.
HTTP request sent, awaiting response... 202 Accepted
Length: 0 [text/html]
Saving to: ‘sitemap_index.xml’

sitemap_index.xml                                [ <=>                                                                                         ]       0  --.-KB/s    in 0s

2025-10-27 18:35:22 (0.00 B/s) - ‘sitemap_index.xml’ saved [0/0]

Philip Durbin 🚀 (Oct 27 2025 at 18:47):

Ian Nesbitt (Oct 27 2025 at 19:26):

Julian Gautier (Oct 27 2025 at 19:50):

Hey all. I used to scrape the HTML of certain types of pages on Harvard Dataverse and had to stop back in April 2025. Leonid told me back then that the 202 status code I was seeing was because the IT folks who help manage security related things for Harvard Dataverse (HUIT) implemented some "silent challenge" that makes pages accessible from browsers only (or by using a Harvard VPN, although I couldn't get this to work back then, and eventually I stopped needing to scrape).

Ian Nesbitt (Oct 27 2025 at 20:04):

Ian Nesbitt (Oct 27 2025 at 20:12):

I imagine you're getting crawled by all sorts of LLM scrapers so I understand the necessity, but it would be nice if DataONE's metadata scraper could be exempted from that restriction, because people do expect HD records to be aggregated in DataONE and we do send legitimate traffic to HD...

Julian Gautier (Oct 27 2025 at 20:16):

And it's still necessary to scrape the page instead of using the Dataverse API, right? I was able to stop scraping when the info I needed was made available with a new API endpoint, and I was able to use that instead. Sorry if I'm asking a question you've already talked about. I haven't read everything in this thread yet :sweat_smile:

Ian Nesbitt - DataONE (Oct 27 2025 at 20:22):

It's ok. We parse the sitemaps to get landing page URLs for datasets, then use the lastmod date to filter for only the most recent ones, and download JSON-LD metadata from the endpoint using content negotiation. It's similar to what the Google Dataset Search scraper is doing
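
The sitemap step described above can be sketched roughly as follows, assuming a standard sitemaps.org `urlset` document. The sample XML, the `example.org` URLs, and the function name are hypothetical; the real spider may handle sitemap indexes, paging, and time zones differently.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset?id=1</loc><lastmod>2024-07-03</lastmod></url>
  <url><loc>https://example.org/dataset?id=2</loc><lastmod>2024-10-15</lastmod></url>
  <url><loc>https://example.org/dataset?id=3</loc></url>
</urlset>"""


def modified_since(sitemap_xml: str, cutoff: date) -> list[str]:
    """Return <loc> URLs whose <lastmod> date is strictly after the cutoff."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        # lastmod may be a plain date or a full datetime; the first 10 characters are the date.
        if loc and lastmod and date.fromisoformat(lastmod[:10]) > cutoff:
            urls.append(loc)
    return urls


recent = modified_since(SAMPLE, date(2024, 7, 3))
```

Entries without a `lastmod` are skipped in this sketch; a harvester that must not miss records might prefer to include them instead.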

Philip Durbin 🚀 (Oct 27 2025 at 20:26):

@Ian Nesbitt - DataONE could you use Signposting to get the links to the JSON-LD files?

Philip Durbin 🚀 (Oct 27 2025 at 20:26):

Ian Nesbitt - DataONE (Oct 27 2025 at 20:34):

I'm sure I can find a way to make HEAD requests to the landing pages. Currently we're doing GET requests but using content negotiation to ask for JSON-LD, so the response redirects us to a URL like


https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/BUOUNW/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20240925%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240925T082255Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=ce9e514a3d9e1c521733311b597a7df31e4e8387b0b6b353891f1a2c30635b28

Would the HEAD request be any more efficient than the content negotiation we currently use?
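
For comparison, a Signposting-style lookup would read the `Link` header of a HEAD response and pick out the `describedby` target typed as JSON-LD. The parser below is a deliberately simple sketch that does not handle every legal `Link` header form, and the sample header, DOI, and URLs are hypothetical rather than copied from a real Dataverse response.

```python
import re


def find_links(link_header: str, rel: str, media_type: str) -> list[str]:
    """Return target URLs in a Link header that match the given rel and type."""
    matches = []
    # Each link is "<target>" followed by ";"-separated parameters.
    for target, params in re.findall(r"<([^>]+)>([^<]*)", link_header):
        attrs = {
            key.lower(): value.strip()
            for key, value in re.findall(r';\s*([\w-]+)\s*=\s*"?([^";,]+)"?', params)
        }
        # rel may carry several space-separated relation types.
        if rel in attrs.get("rel", "").split() and attrs.get("type") == media_type:
            matches.append(target)
    return matches


# Hypothetical Link header, for illustration only.
SAMPLE_LINK = (
    '<https://doi.org/10.5072/FK2/EXAMPLE>; rel="cite-as", '
    '<https://example.org/export?format=schema.org>; rel="describedby"; type="application/ld+json", '
    '<https://example.org/export?format=ddi>; rel="describedby"; type="application/xml"'
)

jsonld_links = find_links(SAMPLE_LINK, "describedby", "application/ld+json")
```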

Philip Durbin 🚀 (Oct 27 2025 at 20:35):

Definitely. Right now you're getting the whole payload of the page with a GET, right?

Ian Nesbitt - DataONE (Oct 27 2025 at 20:39):

I don't think we end up having to download any XHTML, because we get redirected to that export_schema.org.cached function when the server sees content negotiation in the request, which I assumed was the most efficient way of doing things

Philip Durbin 🚀 (Oct 27 2025 at 20:41):

I see. So you're already skipping the step of doing a GET of the dataset landing page, you're saying. You go directly to the cached export by constructing the URL you need based on the DOI. Is that right?

Ian Nesbitt - DataONE (Oct 27 2025 at 20:43):

Correct. I can recreate the requests in a curl -v command and show you the outputs but I assume I'd run into the aforementioned command line restriction

Philip Durbin 🚀 (Oct 27 2025 at 20:44):

Ian Nesbitt - DataONE (Oct 27 2025 at 20:45):

Philip Durbin 🚀 (Oct 27 2025 at 20:45):

Ian Nesbitt - DataONE (Oct 27 2025 at 20:55):

$ curl -v -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:143.0) Gecko/20100101 Firefox/143.0" https://dataverse.harvard.edu/sitemap_index.xml
*   Trying 3.211.175.147:443...
* TCP_NODELAY set
* Connected to dataverse.harvard.edu (3.211.175.147) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=Massachusetts; O=President and Fellows of Harvard College; CN=dataverse.harvard.edu
*  start date: Apr 30 00:00:00 2025 GMT
*  expire date: May 31 23:59:59 2026 GMT
*  subjectAltName: host "dataverse.harvard.edu" matched cert's "dataverse.harvard.edu"
*  issuer: C=US; O=Internet2; CN=InCommon RSA Server CA 2
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55c388e8e0d0)
> GET /sitemap_index.xml HTTP/2
> Host: dataverse.harvard.edu
> user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:143.0) Gecko/20100101 Firefox/143.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 202
< server: awselb/2.0
< date: Mon, 27 Oct 2025 20:53:52 GMT
< content-length: 0
< x-amzn-waf-action: challenge
< cache-control: no-store, max-age=0
< content-type: text/html; charset=UTF-8
< access-control-allow-origin: *
< access-control-max-age: 86400
< access-control-allow-methods: OPTIONS,GET,POST
< access-control-expose-headers: x-amzn-waf-action
<
* Connection #0 to host dataverse.harvard.edu left intact
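
Because a 202 with an empty body is not an HTTP error, a scraper has to check for it explicitly. The sketch below flags responses that match the pattern in the output above (status 202, an `x-amzn-waf-action: challenge` header, and no body). The function name is hypothetical, and treating this exact combination as a challenge is an assumption drawn from this one response.

```python
from typing import Mapping


def is_waf_challenge(status: int, headers: Mapping[str, str], body: bytes) -> bool:
    """True when a response matches the empty 202 challenge pattern."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    return (
        status == 202
        and lowered.get("x-amzn-waf-action", "").lower() == "challenge"
        and len(body) == 0
    )
```

A harvester could raise or alert when this returns `True` for the sitemap request, so that the condition is reported as a failure.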

Philip Durbin 🚀 (Oct 27 2025 at 20:56):

Philip Durbin 🚀 (Oct 27 2025 at 20:57):

Leo Andreev (Oct 27 2025 at 21:47):

Yes, this is AWS WAF Silent Challenge that HUIT are enforcing on our UI pages now, to weed out non-browser calls. (HUIT is the Harvard group that runs the load balancer our servers sit behind).
Please send me your crawler's ip address(es)/subnets so that I could ask them to be exempted from this WAF rule.
I checked and I still have the rewrite rules in place for your crawler to serve fast redirects to exported metadata records on S3.
(for the record, our /api is exempt from this blocking; but I'm not suggesting going through that as a solution, since I remember you had reasons to prefer to follow the standard sitemap route. Plus the custom redirects worked really well in the end).

And yes, virtually all Harvard sites that serve any data that can be fed to LLMs have been getting crawled to death. So they've been resorting to increasingly harsh measures to protect the perimeter from the bots.

P.S. I have 128.111.85.17 for your spider in my records - but that was a while ago.

Ian Nesbitt - DataONE (Oct 27 2025 at 22:09):

Hi Leo, makes sense—I'm actually not sure if our IPs have changed but we have a production scraper at 128.111.85.168 (sonode.dataone.org) and a test scraper at 128.111.85.172 (so.test.dataone.org).

Yes, we do have to download and parse the whole sitemap unfortunately, but the redirects have been working quite well!

Leo Andreev (Oct 28 2025 at 15:27):

I got a confirmation that the 2 ips above have been added to the exemptions list.

Leo Andreev (Oct 28 2025 at 15:32):

Could you please remind me if your crawler can be throttled so as not to exceed a certain call rate?
That's another thing HUIT are enforcing. Unlike the silent challenges, and for reasons I don't fully understand, they have been unable to grant us exceptions for specific url patterns etc. with that.
At the moment the rate is defined as 300 calls/5 min., after which they put the ip on their crap list (code 403) for the next 5 min.

Leo Andreev (Oct 28 2025 at 15:34):

(I am working with them on relaxing these rules/making them more flexible etc., as this is causing us real problems; but that's what we have to work around at the moment)

Ian Nesbitt - DataONE (Oct 28 2025 at 15:40):

We can delay each call as much as needed. Currently we enforce a 2-second delay between each.

Leo Andreev (Oct 28 2025 at 15:43):

Ian Nesbitt - DataONE (Oct 28 2025 at 15:51):

I think if I set it to 1/sec it would be fine, because the delay time does not include processing time, but I also want to be kind to your servers and there's really no rush in getting the scrape done.
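
For reference, 300 calls per 5 minutes averages out to one call per second, so a fixed 2-second delay stays near 150 calls per window. A sliding-window throttle is one way to enforce the ceiling regardless of processing time. The class below is a sketch, not DataONE's implementation; the class name and the 250-call headroom value are assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque


class WindowThrottle:
    """Allow at most `max_calls` calls in any sliding `window` seconds."""

    def __init__(
        self,
        max_calls: int,
        window: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.calls: Deque[float] = deque()

    def wait(self) -> float:
        """Block until another call is allowed; return the seconds slept."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        slept = 0.0
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            slept = self.window - (now - self.calls[0])
            self.sleep(slept)
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
        return slept


# Assumed headroom: 250 calls per 300 seconds, below the stated 300-call limit.
throttle = WindowThrottle(max_calls=250, window=300.0)
```

Calling `throttle.wait()` before each request keeps the spider under the ceiling even if the per-request delay is later shortened.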

Stream: troubleshooting

Topic: DataONE search

Matt Jones (Oct 17 2023 at 00:46):
