Stream: troubleshooting

Topic: DataONE search


view this post on Zulip Matt Jones (Oct 17 2023 at 00:46):

Hey @Philip Durbin -- it's been a long time, but I see you're still deep into everything Dataverse. I wanted to reopen the topic of indexing Dataverse content into DataONE (https://dataone.org) -- my colleague Ian has been working on getting that harvest enabled, but is running into blocks from robots.txt and other policy issues -- I just wanted to check in and make sure we're heading in the right direction. Let me know if you have a minute to chat.

view this post on Zulip Philip Durbin 🚀 (Oct 17 2023 at 11:04):

@Matt Jones hi! Yes, I'm still in deep. :grinning:

I see you replied to @Leo Andreev at https://help.hmdc.harvard.edu/Ticket/Display.html?id=337611 recently. Let me check in with him. He's been working on robots.txt stuff over at https://github.com/IQSS/dataverse.harvard.edu/issues/228 but I don't remember the exact status.

Anyway, sure! Let's do a call. 3am local time for you so let's wait a bit. :big_smile:

view this post on Zulip Philip Durbin 🚀 (Oct 17 2023 at 18:13):

We're talking about it but probably won't get to it today.

view this post on Zulip Matt Jones (Oct 17 2023 at 18:49):

OK, thanks @Philip Durbin -- when we first tried to harvest schema.org, it seems we didn't throttle, and so it was a problem for your server. So Ian worked on toning it down and limiting our request rate, but on the very day that he tried again, he noticed that your robots.txt had been modified to include a Disallow: / for all user agents except Google. So we were worried that we did something wrong again. Ian sent Leo a note on RT, but we'd love to work out how to handle this in a way that works for your systems.

view this post on Zulip Philip Durbin 🚀 (Oct 17 2023 at 18:53):

Heh, well, we appreciate you being good Internet citizens. :grinning:

view this post on Zulip Ian Nesbitt (Oct 17 2023 at 19:02):

Hi @Philip Durbin, I appreciate your and Leo's time. I'm happy to jump on a call if you need to hash out technical stuff.

view this post on Zulip Philip Durbin 🚀 (Oct 17 2023 at 20:44):

Much appreciated. It's a busy day here. Sorry.

view this post on Zulip Philip Durbin 🚀 (Oct 19 2023 at 16:37):

Leonid is still quite busy. Are there questions I can answer? Or is the main question just, "When will you stop blocking us?" :sweat_smile:

view this post on Zulip Matt Jones (Oct 19 2023 at 22:50):

That is the main question for sure, with a side question of why blocking was needed for our new process -- we're hoping our throttled harvest is well below whatever limits you need to impose. If not, we'd like to be sure they are before we restart.

view this post on Zulip Philip Durbin 🚀 (Oct 19 2023 at 22:50):

Probably you're collateral damage. I'm sure there are worse offenders.

view this post on Zulip Matt Jones (Oct 19 2023 at 22:52):

So if we see the robots.txt file flip back to unblocked, is that a sign it's OK to start again?

view this post on Zulip Matt Jones (Oct 19 2023 at 22:54):

Let us know if there's anything we need to change to be good netizens

view this post on Zulip Philip Durbin 🚀 (Oct 19 2023 at 23:47):

Can you or @Ian Nesbitt please provide any specifics about what he wrote in the ticket?

In addition, I also cut the concurrent requests down and added some throttling. Before the requests began to fail, the harvest was running much faster than before due to the smaller response sizes. Please let me know whether to cut it back further and I can do so.

How many requests per second are we talking about?

view this post on Zulip Ian Nesbitt (Oct 20 2023 at 02:07):

@Philip Durbin The lowest rate I saw was 20 pages/min and the most I saw was 36 pages/min

view this post on Zulip Ian Nesbitt (Oct 20 2023 at 02:07):

so between 0.3 and 0.6-ish/second
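For reference, the conversion behind those numbers, plus one simple way a crawler can hold itself to a ceiling like that. This is a minimal Python sketch only: `pages_per_second` and `Throttle` are hypothetical names for illustration, not taken from the DataONE scraper.

```python
import time


def pages_per_second(pages_per_minute: float) -> float:
    """Convert a crawl rate from pages/minute to requests/second."""
    return pages_per_minute / 60.0


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time the previous request was allowed to start

    def wait(self) -> float:
        """Sleep just long enough to respect the rate; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


print(round(pages_per_second(20), 2), round(pages_per_second(36), 2))
```

Calling `throttle.wait()` before each request keeps the crawler at or below `max_per_second`; the `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.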

view this post on Zulip Ian Nesbitt (Oct 20 2023 at 14:56):

More specifics: the job started with a permissive robots.txt and the spider was requesting application/ld+json successfully, but at 2023-10-13 23:06:16 UTC those requests started returning a text/html response we couldn't parse, and then at 2023-10-13 23:09:14 they began returning 404

view this post on Zulip Philip Durbin 🚀 (Oct 20 2023 at 14:59):

Thanks. Out of curiosity, do you plan to index metadata from any of the other Dataverse installations? https://dataverse.org/installations ?

view this post on Zulip Ian Nesbitt (Oct 20 2023 at 15:02):

Yes actually, we have been in contact with Borealis recently and just began a harvest of their Dataverse metadata as well

view this post on Zulip Philip Durbin 🚀 (Oct 20 2023 at 15:03):

Oh, nice. Our Canadian friends. :flag_canada: How has the indexing experience been so far?

view this post on Zulip Ian Nesbitt (Oct 20 2023 at 15:10):

Excellent! Of the endpoints in our triage, it has some of the most complete schema.org data I've seen anywhere

view this post on Zulip Philip Durbin 🚀 (Oct 20 2023 at 15:12):

Wow, that's nice to hear.

view this post on Zulip Ian Nesbitt (Oct 20 2023 at 15:16):

I see a lot of repositories that are missing identifier or have incorrectly configured @id or other fields so it's nice to see ones that follow the spec

view this post on Zulip Philip Durbin 🚀 (Oct 20 2023 at 15:21):

Gotcha. Well we did use Google's validator.

view this post on Zulip Philip Durbin 🚀 (Oct 20 2023 at 15:22):

And we put in some schema.org fixes recently. Not sure if Borealis has upgraded yet.

view this post on Zulip Philip Durbin 🚀 (Oct 23 2023 at 15:49):

Looks like some correspondence is happening in https://help.hmdc.harvard.edu/Ticket/Display.html?id=337611 . Great!

view this post on Zulip Ian Nesbitt (Dec 11 2023 at 02:13):

@Philip Durbin We've finally got most of the Harvard Dataverse corpus scraped, harvested, and indexed in DataONE: https://search.dataone.org/portals/HD :tada:

view this post on Zulip Ian Nesbitt (Dec 11 2023 at 02:15):

I will have Angie Garcia, our outreach coordinator, get in contact with you as soon as we're all back from AGU, or sooner if you're there to find us in person. Cheers and Happy Holidays!

view this post on Zulip Philip Durbin 🚀 (Dec 11 2023 at 02:22):

@Ian Nesbitt nice, thanks for letting us know!

view this post on Zulip Philip Durbin 🚀 (Dec 15 2023 at 20:13):

@Ian Nesbitt I was just thinking, what you've built is morally equivalent to SHARE, which we list as an integration: https://guides.dataverse.org/en/6.1/admin/integrations.html#share

view this post on Zulip Philip Durbin 🚀 (Dec 15 2023 at 20:14):

If you'd like to create a PR to add DataONE, the file to edit is here: https://github.com/IQSS/dataverse/blob/develop/doc/sphinx-guides/source/admin/integrations.rst

view this post on Zulip Ian Nesbitt (Dec 15 2023 at 22:26):

Definitely! I've added PR #10192 with the suggested changes. Thank you for the suggestion.

view this post on Zulip Philip Durbin 🚀 (Oct 07 2024 at 14:00):

@Matt Jones @Ian Nesbitt can you think of any reason why https://search.dataone.org/view/sha256%3A2291cc19ed4e348a344f58f656cf5b354bfd2b8a0a05d59b2799d9333ce795f4 (Ci Technology DataSet) has my ORCID on it? My ORCID ( http://orcid.org/0000-0002-9528-9470 ) is listed as both submitter (!) and rights holder (!!). Someone just emailed me about access to that dataset but I have nothing to do with it. :confused:

view this post on Zulip Ian Nesbitt (Oct 11 2024 at 20:42):

Hi Philip, my apologies for the delay as I've been on vacation. To answer
your question, since the records are all just placeholders that trace back
to the "real" HD records, we assign the repository manager as both
submitter and rightsholder. Since the records update automatically and the
node instance needs a user account to do so, we tell the system that they
are being maintained by "you". It also allows you to, for example, modify
the records yourself using our API, should you need to do so. We really
should take the mention of the submitter and rightsholder ORCID out of
prominence or view entirely for schema.org records, but that is technically
how it works.

As for the root cause of why they may be asking you this question: I
suppose they must need the record changed in some way. If so, we would be
happy to help facilitate. The records should be updated automatically based
on HD JSON-LDs on a regular basis, but sometimes in rare cases our indexer
drops the records and they must be reindexed manually. Before I left for
vacation I noticed HD had a backlog of changes that need to be indexed, so
I can take a look and see if this is one of those cases. We have yet to
identify the bug, but we're working on it.

view this post on Zulip Ian Nesbitt (Oct 11 2024 at 21:04):

@Philip Durbin making sure you see this :up:

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 14:07):

@Philip Durbin Now that I'm back from vacation, it looks like the Harvard Dataverse sitemap doesn't have any lastmod dates after 2024-07-03. Perhaps the sitemap has stopped updating somehow?

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 14:14):

I've set our sitemap spider to pick up all changes after 2024-07-03 so when the sitemap does get updated, we can get to work indexing the backlog.
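As an illustration of that kind of cutoff, here is a minimal Python sketch that keeps only sitemap entries whose `lastmod` is after a given date. It assumes each `lastmod` carries at least a full `YYYY-MM-DD` date; the function name and the example.org URLs are made up for the example and are not from the DataONE spider.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_modified_after(sitemap_xml: str, cutoff: date) -> list[str]:
    """Return <loc> values whose <lastmod> date is strictly after `cutoff`."""
    root = ET.fromstring(sitemap_xml)
    recent = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not loc or not lastmod:
            continue  # entries without a lastmod cannot be filtered by date
        # Assumption: the first 10 characters of lastmod are YYYY-MM-DD
        if date.fromisoformat(lastmod[:10]) > cutoff:
            recent.append(loc)
    return recent


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc><lastmod>2024-07-03</lastmod></url>
  <url><loc>https://example.org/dataset/2</loc><lastmod>2024-10-15</lastmod></url>
</urlset>"""

print(urls_modified_after(SAMPLE, date(2024, 7, 3)))
```

With a cutoff of 2024-07-03, only the second sample entry is returned, since the comparison is strict.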

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 15:40):

@Ian Nesbitt thanks for getting back to me. Now I'm back from a couple of days off. It sounds like my ORCID is on thousands of datasets. Can you please remove it? :grinning:

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 15:43):

Sure. Since I manage the metadata flows, we can replace it with mine if that's an acceptable solution.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 15:48):

From my perspective, that's a step in the right direction. :grinning:

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 15:48):

I don't particularly want to field questions about rights and access, etc.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 15:49):

Are there any other options, longer term? Maybe even just "see source dataset"?

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 15:49):

That's fair. Perhaps a better solution would be to hide those fields from view entirely, since they don't really mean anything to the end user.

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 15:51):

Are there any other options, longer term? Maybe even just "see source dataset"?

Yes. I will bring this up at our DataONE team meeting on Thursday, because ideally we don't want end users asking data managers about these automatically managed records at all.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 15:55):

Well, I think the fields are meaningful. Who submitted this data? Who is the rights holder? But sure, hidden fields are better than inaccurate fields, I'd say.

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 15:56):

Of course. They would still be visible in the system metadata, just not on the dataset landing pages.

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 15:59):

Any idea why the HD sitemap seems to be stale since early July?

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 16:00):

Oh, we probably switched to >50K mode. One sec.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 16:01):

Here, please try this one: https://dataverse.harvard.edu/sitemap_index.xml

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 16:02):

If it's helpful, here are our docs on it: https://guides.dataverse.org/en/6.4/installation/config.html#multiple-sitemap-files-sitemap-index-file

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 16:06):

Ah, perfect. Our spider can handle indexed sitemaps. Thank you!
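For context, the sitemap protocol caps a single sitemap file at 50,000 URLs, so larger sites publish a sitemap index that points at several child sitemaps. A rough Python sketch of telling the two document types apart follows; `classify_sitemap` and the example.org URLs are hypothetical and not how mnlite necessarily does it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classify_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index", child sitemap URLs) or ("urlset", page URLs)."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        kind, item = "index", "sitemap"
    elif root.tag == f"{{{SITEMAP_NS}}}urlset":
        kind, item = "urlset", "url"
    else:
        raise ValueError(f"not a sitemap document: {root.tag}")
    # Collect the <loc> of every <sitemap> (index) or <url> (urlset) entry
    locs = [
        el.text.strip()
        for el in root.iterfind(f"{{{SITEMAP_NS}}}{item}/{{{SITEMAP_NS}}}loc")
        if el.text
    ]
    return kind, locs


INDEX_SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://example.org/sitemap2.xml</loc></sitemap>
</sitemapindex>"""

print(classify_sitemap(INDEX_SAMPLE))
```

A crawler would fetch each child URL returned for an "index" document and run the same function on it to reach the page URLs.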

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 16:09):

Sure thing, I wonder if we should do something with the old, aging, single-file sitemap. :thinking:

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 16:10):

Yeah, good question. Maybe a 301 Moved Permanently that redirects to the base of the index?

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 16:14):

Good idea. I'm asking internally.

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 16:15):

Weird: the new sitemap doesn't seem to have records newer than 2024-07-03 either.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 16:15):

:doh:

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 17:50):

@Ian Nesbitt ok! The sitemap should be fixed now. Please try again. And thanks again for letting us know!

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 18:01):

Thank you @Philip Durbin! Yes, I'm scraping 4200 new records now.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 18:10):

Great news. And where are we with "rights holder" and "submitter"? Some day it would be nice to fill these in with the proper values. Maybe stuff like "CC0" and whoever is in the Depositor field? Of course, hiding these values for now, if they don't have accurate information, sounds good to me.

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 19:16):

Because the records are maintained automatically, the rights have to be the same across the board. I can change the rightsHolder field to my ORCiD. The submitter field is immutable, unfortunately, so I think the best route is to hide both fields from the end user, so nobody misreads them and wonders why the authors themselves aren't listed. In essence, those rights are held by the authors themselves: to change a record they just have to edit the HD record, since the DataONE record is automatically drawn from HD.

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 19:16):

I will raise this issue at our meeting on Thursday and let you know the outcome.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 20:10):

Thanks, I'm curious what people think about this.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 20:10):

From my perspective "submitter" maps to "depositor" in Dataverse.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 20:11):

And rights are always a hot mess. :crazy:

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 20:11):

But we do have lots of fields in Dataverse for rights if you want them! :sweat_smile:

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 20:30):

From my perspective "submitter" maps to "depositor" in Dataverse.

Does Dataverse require ORCiDs for submission? If so that would be a fairly 1:1 translation...

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 20:34):

No, ORCIDs are optional

view this post on Zulip Matt Jones (Oct 15 2024 at 21:01):

Our submitter field corresponds to the user identifier of the party that initially deposited the dataset (and is immutable via the API from the time of first deposit). It is closely aligned to rightsHolder, which is the user identifier of the party that has full access rights over the dataset, which can change through time (in addition to any other access rules that are provided for other users). The values in these fields are typically ORCID values now, but could also be any identifier from an identity provider that you use (e.g., from CILogon, Globus Auth, OpenID Connect, etc).

view this post on Zulip Matt Jones (Oct 15 2024 at 21:03):

Ideally we like to have the info as it applies to each dataset individually, but in the case of schema.org harvests, this info is usually not in the record, and so we have reverted to setting a global value for the whole collection. Which is where we went awry from your perspective, I think. Maybe we need to standardize/clarify how rights and access fields are populated in schema.org Dataset entries to promote interoperability?

view this post on Zulip Matt Jones (Oct 15 2024 at 21:06):

We've been gaining members of DataONE that use the Dataverse platform (e.g., DataverseNO most recently), and so it would be good to iron out these details for all groups that might wish to join.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:06):

Maybe. These days we're using Croissant as an extension of Schema.org.

Here's an example: https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi%3A10.7910/DVN/HOLVXA

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:07):

We don't seem to put Dataverse's "depositor" field in there. If there's a good place for it, we certainly could.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:08):

We do populate "creator" but this can be different than "depositor". A depositor can upload data on a creator/author's behalf in Dataverse.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:08):

But yes, yes, yes, we should align! :grinning:

view this post on Zulip Matt Jones (Oct 15 2024 at 21:13):

Yeah, creator and rightsHolder in DataONE are certainly different. We interpret creator following SOSO to be the list of parties that should be cited/attributed for the Dataset. Whereas rightsHolder is about access control, and orthogonal to attribution. Various other parties can act on behalf of creators when editing and depositing datasets. So I think a separate set of roles around rightsHolder, submitter/depositor, and access control lists could be useful. But it's only really useful to people that are trying to use interoperable editing APIs (and not just public read access).

view this post on Zulip Matt Jones (Oct 15 2024 at 21:14):

(Hi Phil! Glad to be chatting again, it has been quite a while!)

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:23):

SOSO?

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:24):

Also, can you please remind me... is DataONE getting Dataverse metadata from the <head> of dataset pages as Schema.org JSON-LD?

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 21:24):

SOSO - science-on-schema.org

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:25):

ooo, fun!

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 21:25):

is DataONE getting Dataverse metadata from the <head> of dataset pages as Schema.org JSON-LD?

IIRC we're doing content negotiation and grabbing it from an AWS instance

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:25):

I wonder if the Croissant folks know about this :thinking:

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:26):

well, sure, we're on AWS

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 21:26):

Is the code open source?

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 21:27):

Yes, it's located at https://github.com/DataONEorg/mnlite

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 21:30):

After downloading and parsing the sitemaps, we query each page listed and ask for JSON-LD, which I think causes a redirect to a request like this:

https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/BUOUNW/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20240925%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240925T082255Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=ce9e514a3d9e1c521733311b597a7df31e4e8387b0b6b353891f1a2c30635b28
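The "ask for JSON-LD" step is ordinary HTTP content negotiation: the client sends an `Accept: application/ld+json` header and follows whatever redirect comes back. A minimal Python sketch using only the standard library; the function names and the user-agent string are placeholders, and this is not the actual mnlite code.

```python
import json
import urllib.request

JSONLD = "application/ld+json"


def build_jsonld_request(landing_page_url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request asking the server for JSON-LD instead of HTML."""
    return urllib.request.Request(
        landing_page_url,
        headers={"Accept": JSONLD, "User-Agent": user_agent},
        method="GET",
    )


def fetch_jsonld(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request (urllib follows redirects) and parse the body as JSON.

    json.loads raises ValueError if the server answered with HTML instead.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


req = build_jsonld_request(
    "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BUOUNW",
    user_agent="example-harvester/0.1",  # hypothetical user agent
)
print(req.get_method(), req.get_header("Accept"))
```

Only the request construction runs here; `fetch_jsonld(req)` would perform the network call.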

view this post on Zulip Ian Nesbitt (Oct 15 2024 at 21:36):

On a related note, every once in a while we seem to be running into an issue where the server only returns half of a json document...
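The thread doesn't identify the cause of these half documents, so the sketch below only shows one way a client could detect a truncated body and retry. It is Python with hypothetical names (`parse_complete_json`, `fetch_with_retry`); `fetch` stands in for whatever function downloads the bytes.

```python
import json
from typing import Callable, Optional


def parse_complete_json(body: bytes, declared_length: Optional[int] = None):
    """Parse a JSON body, raising ValueError if it looks truncated."""
    if declared_length is not None and len(body) != declared_length:
        raise ValueError(f"got {len(body)} bytes, expected {declared_length}")
    # json.JSONDecodeError is a subclass of ValueError
    return json.loads(body)


def fetch_with_retry(fetch: Callable[[], bytes], attempts: int = 3):
    """Call `fetch` until it yields a complete JSON document."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_complete_json(fetch())
        except ValueError as error:
            last_error = error  # remember why this attempt failed
    raise RuntimeError(f"no complete document after {attempts} attempts") from last_error


# Simulated server: first reply is cut off mid-document, second is complete
replies = iter([b'{"name": "x", "desc', b'{"name": "x"}'])
print(fetch_with_retry(lambda: next(replies)))
```

Passing the response's Content-Length as `declared_length`, when the server sends one, catches truncation even in the unlikely case that the cut-off text still parses.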

view this post on Zulip Matt Jones (Oct 15 2024 at 21:38):

Carl Boettiger has been pursuing the SOSO-Croissant mapping and how they relate to one another, and it has been discussed a few times in ESIP cluster meetings, but we haven't dived into it in detail yet.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 22:44):

Thanks, makes sense. Looks like you're using export_schema.org.cached but if you want, you could switch to export_croissant.cached. Google Dataset Search is encouraging sites to switch to Croissant. Please see also this summary I wrote.

view this post on Zulip Philip Durbin 🚀 (Oct 15 2024 at 22:46):

I don't see Carl in the Croissant meeting minutes but I do see mention of science-on-schema.org. Thanks for putting this on my radar.

view this post on Zulip Philip Durbin 🚀 (Oct 16 2024 at 13:43):

Ah, it looks like @Julian Gautier attended the DataONE "Science on Schema.org Guidelines and Experiences" call back in 2021. Good.

view this post on Zulip Ian Nesbitt (Jun 10 2025 at 12:45):

Good morning @Philip Durbin and team. We've received some requests for DataONE to index location information for Dataverse datasets, but I don't think we scrape any from the SOSO docs. Do you store location information (bounding boxes, points, etc.)? Is there a way for me to request that this info gets serialized into schema.org documents in future versions of the Dataverse software?

view this post on Zulip Philip Durbin 🚀 (Jun 10 2025 at 12:46):

Hi! We're about to start our annual conference (#community > #Dataverse2025) but quickly, yes, please check the geospatial metadata block for a bounding box.

view this post on Zulip Ian Nesbitt (Jun 10 2025 at 12:48):

Ah, exciting! Have a great conference!

view this post on Zulip Philip Durbin 🚀 (Jun 10 2025 at 12:50):

Thanks. Which of these export formats are you importing? https://dataverse.harvard.edu/api/info/exportFormats

view this post on Zulip Ian Nesbitt (Jun 10 2025 at 12:51):

application/ld+json (schema.org)

view this post on Zulip Ian Nesbitt (Jun 10 2025 at 12:53):

Some datasets have place names, but I don't think we get any quantitative locations

view this post on Zulip Ian Nesbitt (Jun 10 2025 at 12:56):

Don't feel the need to respond now, I can wait until after the conference to talk about this

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:32):

@Ian Nesbitt We're back! And I think I have some good news for you. I hope! :smile:

I started with https://dataverse.harvard.edu/api/search?q=*&geo_point=42.3,-71.1&geo_radius=1.5 which is the example of a geospatial search at https://guides.dataverse.org/en/6.6/api/search.html

This led me to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/E8Z5Q3

If you export as Schema.org JSON-LD like this: https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/E8Z5Q3

You'll see this:

  "spatialCoverage": [
    "North America",
    "Global"
  ]

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:33):

It's not bounding boxes but you can get them from Dataverse's native JSON format if you like: https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/E8Z5Q3

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:34):

Or the OAI_ORE format: https://dataverse.harvard.edu/api/datasets/export?exporter=OAI_ORE&persistentId=doi%3A10.7910/DVN/E8Z5Q3

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:37):

Ah, I was just typing a message to you. Hope your conference went well!

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:38):

Yep! Good times. Too short. :smile:

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:41):

I do see the semantic place names in spatialCoverage but when we found the geospatial metadata block field definitions last week it confirmed my suspicion that the quantitative location information doesn't get serialized to schema.org
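To make the distinction concrete: a harvester reading `spatialCoverage` may meet plain place-name strings (as in the export shown earlier in the thread) or structured `Place` objects carrying a `geo` shape. A small Python sketch that separates the two; `split_spatial_coverage` is a hypothetical helper, not part of any DataONE or Dataverse code.

```python
def split_spatial_coverage(record: dict) -> tuple[list, list]:
    """Return (place names, GeoShape box strings) found in spatialCoverage."""
    coverage = record.get("spatialCoverage", [])
    if not isinstance(coverage, list):
        coverage = [coverage]  # a single Place object or string
    names, boxes = [], []
    for item in coverage:
        if isinstance(item, str):
            names.append(item)  # semantic place name only
        elif isinstance(item, dict):
            if "name" in item:
                names.append(item["name"])
            geo = item.get("geo") or {}
            if isinstance(geo, dict) and "box" in geo:
                boxes.append(geo["box"])  # quantitative location
    return names, boxes


# The spatialCoverage values from the schema.org export quoted in this thread
record = {"spatialCoverage": ["North America", "Global"]}
print(split_spatial_coverage(record))
```

For that record the second list comes back empty, which is the "no quantitative locations" situation described here.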

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:46):

Right. It doesn't seem to. Should it? Is there a good place in schema.org (and Croissant, if you're familiar) for bounding boxes?

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:48):

Yes they do! It looks like this:

{
  "spatialCoverage": {
    "@type": "Place",
    "geo": {
      "@type": "GeoShape",
      "box": "{SOUTH} {WEST} {NORTH} {EAST}"
    }
  }
}

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:49):

That's for SO (schema.org)

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:49):

Interesting. Would you be able to make a feature request? https://github.com/IQSS/dataverse/issues

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:50):

You can read about it in the science-on-schema.org Dataset guide: https://github.com/ESIPFed/science-on-schema.org/blob/main/guides/Dataset.md#spatial-coverage
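As a sketch of how a bounding box could be serialized into the GeoShape form quoted earlier in the thread, here is a small Python function. The function name and its four parameters are hypothetical; how Dataverse names or stores these values internally is not covered here.

```python
def bounding_box_to_spatial_coverage(south: float, west: float, north: float, east: float) -> dict:
    """Build a schema.org spatialCoverage Place with a GeoShape box."""
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    # west > east is allowed: such a box crosses the antimeridian
    return {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            # Order as quoted in the thread: SOUTH WEST NORTH EAST
            "box": f"{south} {west} {north} {east}",
        },
    }


coverage = bounding_box_to_spatial_coverage(42.0, -71.5, 42.5, -71.0)
print(coverage["geo"]["box"])
```

The returned dict would sit under the `spatialCoverage` key of the Dataset's JSON-LD.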

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:50):

Philip Durbin said:

Interesting. Would you be able to make a feature request? https://github.com/IQSS/dataverse/issues

Definitely

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:52):

Awesome. Thanks. Also, out of curiosity, have you heard of Croissant? Any interest in it? It's also based on schema.org.

I ask because from the Dataverse perspective, we implemented the original JSON-LD Schema.org format to support Google Dataset Search. But now they've deprecated it in favor of Croissant.

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:53):

Yes, and we've discussed formally adopting it as well, but haven't made any official moves towards that yet

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:56):

Ok. I'm just reviewing https://dataverse.harvard.edu/api/info/exportFormats again and if I'm not wrong Dataverse supports three formats based on schema.org:

Would you want that "geo box" info in all three formats?

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:58):

I don't know exactly how that field translates from standard SO, but those other formats definitely support bounding boxes so I'll try to include it in the issue

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 19:59):

I think the answer is "yes"

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 19:59):

Great, thanks. If you want you can just say "all formats based on schema.org".

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 20:00):

Ok. I need to finish some other stuff but I can probably post the issue later this evening

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 20:00):

no rush, we don't have time to work on it anyway :crazy:

view this post on Zulip Ian Nesbitt (Jun 16 2025 at 20:01):

Same story here at DataONE as always :)

view this post on Zulip Philip Durbin 🚀 (Jun 16 2025 at 20:01):

I figured :rofl:

view this post on Zulip Ian Nesbitt (Jun 17 2025 at 14:36):

Submitted: https://github.com/IQSS/dataverse/issues/11582

view this post on Zulip Philip Durbin 🚀 (Jun 17 2025 at 14:38):

Looks great! Thanks! I made a couple tiny tweaks.

view this post on Zulip Ian Nesbitt (Jun 17 2025 at 14:40):

Thanks! You are quick!

view this post on Zulip Ian Nesbitt (Oct 27 2025 at 18:29):

Hi @Philip Durbin 🚀, we have been having an issue that I've missed since July... it seems our scraper is getting empty responses with status code 202 from the HD server when it tries to get the base sitemap. I think the reason I missed it is that it doesn't register as an error...

view this post on Zulip Ian Nesbitt (Oct 27 2025 at 18:37):

Here's what's returned when I wget from the scraper server:

$ wget https://dataverse.harvard.edu/sitemap_index.xml
--2025-10-27 18:35:21--  https://dataverse.harvard.edu/sitemap_index.xml
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 54.86.163.49, 3.211.175.147, 3.215.43.147
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|54.86.163.49|:443... connected.
HTTP request sent, awaiting response... 202 Accepted
Length: 0 [text/html]
Saving to: ‘sitemap_index.xml’

sitemap_index.xml                                [ <=>                                                                                         ]       0  --.-KB/s    in 0s

2025-10-27 18:35:22 (0.00 B/s) - ‘sitemap_index.xml’ saved [0/0]
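Because 202 is still a 2xx status, a scraper that only checks for HTTP errors will treat this reply as a success. One possible guard is to require exactly 200 with a non-empty body before parsing. A minimal Python sketch; `UnusableResponse` and `require_sitemap_body` are made-up names for illustration.

```python
class UnusableResponse(Exception):
    """Raised when a response carries nothing the scraper can parse."""


def require_sitemap_body(status: int, body: bytes) -> bytes:
    """Return the body only for a 200 OK reply that actually has content."""
    if status != 200:
        # Covers the case above: 202 Accepted is 2xx but not a sitemap
        raise UnusableResponse(f"expected 200 OK, got status {status}")
    if not body.strip():
        raise UnusableResponse("200 OK but the body is empty")
    return body


try:
    require_sitemap_body(202, b"")
except UnusableResponse as error:
    print(f"flagged: {error}")
```

Raising here turns the silent empty download into something that shows up in error logs.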

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 18:47):

Length 0. Interesting.

view this post on Zulip Ian Nesbitt (Oct 27 2025 at 19:26):

It loads fine in a browser which is odd

view this post on Zulip Julian Gautier (Oct 27 2025 at 19:50):

Hey all. I used to scrape the HTML of certain types of pages on Harvard Dataverse and had to stop back in April 2025. Leonid told me back then that the 202 status code I was seeing was because the IT folks who help manage security related things for Harvard Dataverse (HUIT) implemented some "silent challenge" that makes pages accessible from browsers only (or by using a Harvard VPN, although I couldn't get this to work back then, and eventually I stopped needing to scrape).

view this post on Zulip Ian Nesbitt (Oct 27 2025 at 20:04):

Ah. Well, that would explain it. Thank you @Julian Gautier

view this post on Zulip Ian Nesbitt (Oct 27 2025 at 20:12):

I imagine you're getting crawled by all sorts of LLM scrapers so I understand the necessity, but it would be nice if DataONE's metadata scraper could be exempted from that restriction, because people do expect HD records to be aggregated in DataONE and we do send legitimate traffic to HD...

view this post on Zulip Julian Gautier (Oct 27 2025 at 20:16):

And it's still necessary to scrape the page instead of using the Dataverse API, right? I was able to stop scraping when the info I needed was made available with a new API endpoint, and I was able to use that instead. Sorry if I'm asking a question you've already talked about. I haven't read everything in this thread yet :sweat_smile:

view this post on Zulip Ian Nesbitt - DataONE (Oct 27 2025 at 20:22):

It's ok. We parse the sitemaps to get landing page URLs for datasets, then use the lastmod date to filter for only the most recent ones, and download JSON-LD metadata from the endpoint using content negotiation. It's similar to what the Google Dataset Search scraper is doing

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:26):

@Ian Nesbitt - DataONE could you use Signposting to get the links to the JSON-LD files?

Please see this PR: expose links to all export formats via Signposting #11045

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:26):

And https://guides.dataverse.org/en/6.8/api/native-api.html#retrieve-signposting-information
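Signposting exposes typed links in the HTTP `Link` header, so a client can find a metadata URL from a HEAD response. A rough Python sketch of picking out a `describedby` link of type `application/ld+json` follows. The sample header is illustrative only (not captured from a Dataverse server; the DOI uses a test prefix), the exact links an installation emits are an assumption, and the regex parsing is a simplification rather than a full RFC 8288 parser.

```python
import re

# "<target>" followed by its parameters, up to the next "<"
LINK_RE = re.compile(r"<([^>]+)>([^<]*)")
# ;name="value" or ;name=value
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*"?([^";,]+)"?')


def links_with(header: str, rel: str, media_type: str) -> list[str]:
    """Return link targets whose rel and type parameters both match."""
    found = []
    for target, raw_params in LINK_RE.findall(header):
        params = {key.lower(): value for key, value in PARAM_RE.findall(raw_params)}
        # rel may hold several space-separated relation types
        if rel in params.get("rel", "").split() and params.get("type") == media_type:
            found.append(target)
    return found


SAMPLE_LINK_HEADER = (
    "<https://example.org/api/datasets/export?exporter=schema.org"
    '&persistentId=doi:10.5072/FK2/EXAMPLE>;rel="describedby";type="application/ld+json", '
    '<https://doi.org/10.5072/FK2/EXAMPLE>;rel="cite-as"'
)

print(links_with(SAMPLE_LINK_HEADER, "describedby", "application/ld+json"))
```

In a scraper, `header` would come from the `Link` header of a HEAD request to the dataset landing page.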

view this post on Zulip Ian Nesbitt - DataONE (Oct 27 2025 at 20:34):

I'm sure I can find a way to make HEAD requests to the landing pages. Currently we're doing GET requests but using content negotiation to ask for JSON-LD, so the response redirects us to a URL like https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/BUOUNW/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20240925%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240925T082255Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=ce9e514a3d9e1c521733311b597a7df31e4e8387b0b6b353891f1a2c30635b28

Would the HEAD request be any more efficient than the content negotiation we currently use?

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:35):

Definitely. Right now you're getting the whole payload of the page with a GET, right?

view this post on Zulip Ian Nesbitt - DataONE (Oct 27 2025 at 20:39):

I don't think we end up having to download any XHTML, because we get redirected to that export_schema.org.cached file when the server sees content negotiation in the request, which I assumed was the most efficient way of doing things

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:41):

I see. So you're already skipping the step of doing a GET of the dataset landing page, you're saying. You go directly to the cached export by constructing the URL you need based on the DOI. Is that right?

view this post on Zulip Ian Nesbitt - DataONE (Oct 27 2025 at 20:43):

Correct. I can recreate the requests in a curl -v command and show you the outputs but I assume I'd run into the aforementioned command line restriction

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:44):

Can you set the user agent to look like a browser?

view this post on Zulip Ian Nesbitt - DataONE (Oct 27 2025 at 20:45):

I can. In the scraper or the curl command?

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:45):

Maybe try in curl and if it works, try in the scraper?

view this post on Zulip Ian Nesbitt - DataONE (Oct 27 2025 at 20:55):

Sadly it still knows I'm on the command line:

$ curl -v -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:143.0) Gecko/20100101 Firefox/143.0" https://dataverse.harvard.edu/sitemap_index.xml
*   Trying 3.211.175.147:443...
* TCP_NODELAY set
* Connected to dataverse.harvard.edu (3.211.175.147) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=Massachusetts; O=President and Fellows of Harvard College; CN=dataverse.harvard.edu
*  start date: Apr 30 00:00:00 2025 GMT
*  expire date: May 31 23:59:59 2026 GMT
*  subjectAltName: host "dataverse.harvard.edu" matched cert's "dataverse.harvard.edu"
*  issuer: C=US; O=Internet2; CN=InCommon RSA Server CA 2
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55c388e8e0d0)
> GET /sitemap_index.xml HTTP/2
> Host: dataverse.harvard.edu
> user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:143.0) Gecko/20100101 Firefox/143.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 202
< server: awselb/2.0
< date: Mon, 27 Oct 2025 20:53:52 GMT
< content-length: 0
< x-amzn-waf-action: challenge
< cache-control: no-store, max-age=0
< content-type: text/html; charset=UTF-8
< access-control-allow-origin: *
< access-control-max-age: 86400
< access-control-allow-methods: OPTIONS,GET,POST
< access-control-expose-headers: x-amzn-waf-action
<
* Connection #0 to host dataverse.harvard.edu left intact

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:56):

:doh:

view this post on Zulip Philip Durbin 🚀 (Oct 27 2025 at 20:57):

I pinged Leo earlier. Maybe he'll save us.

view this post on Zulip Leo Andreev (Oct 27 2025 at 21:47):

Hi Ian,

< HTTP/2 202
< access-control-expose-headers: x-amzn-waf-action

Yes, this is the AWS WAF silent challenge that HUIT is now enforcing on our UI pages to weed out non-browser calls. (HUIT is the Harvard group that runs the load balancer our servers sit behind.)
Please send me your crawler's IP address(es)/subnets so that I can ask for them to be exempted from this WAF rule.
I checked and I still have the rewrite rules in place for your crawler to serve fast redirects to exported metadata records on S3.
(for the record, our /api is exempt from this blocking; but I'm not suggesting going through that as a solution, since I remember you had reasons to prefer to follow the standard sitemap route. Plus the custom redirects worked really well in the end).

And yes, virtually all Harvard sites that serve any data that can be fed to LLMs have been getting crawled to death. So they've been resorting to increasingly harsh measures to protect the perimeter from the bots.

We'll work it out.
All the best,
-Leo

P.S. I have 128.111.85.17 for your spider in my records - but that was a while ago.
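
A scraper can recognise this situation from the response alone. The sketch below checks for the two signals visible in the curl output above, a 202 status and an `x-amzn-waf-action: challenge` header. Treating that pair as the definition of a challenge is an assumption drawn from this one response, and `is_waf_challenge` is a hypothetical helper name.

```python
def is_waf_challenge(status, headers):
    """Return True when a response looks like the challenge seen in the curl output."""
    # HTTP header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return status == 202 and lowered.get("x-amzn-waf-action", "").lower() == "challenge"
```

Checking this before parsing lets a crawler log "challenged" instead of silently treating the empty body as an empty sitemap.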

view this post on Zulip Ian Nesbitt - DataONE (Oct 27 2025 at 22:09):

Hi Leo, makes sense. I'm actually not sure if our IPs have changed, but we have a production scraper at 128.111.85.168 (sonode.dataone.org) and a test scraper at 128.111.85.172 (so.test.dataone.org).

Yes, we do have to download and parse the whole sitemap unfortunately, but the redirects have been working quite well!
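
Downloading and parsing the sitemap can be sketched as follows, assuming the files follow the standard sitemaps.org schema (a `sitemapindex` of `sitemap` entries pointing at `urlset` files). The sample XML is made up for illustration and `extract_locs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample in the standard sitemap index layout.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://example.org/sitemap2.xml</loc></sitemap>
</sitemapindex>"""


def extract_locs(xml_text):
    """Return every <loc> value from a sitemap index or a urlset document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]
```

The same function works on the child sitemaps, since both document types keep their URLs in `<loc>` elements.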

view this post on Zulip Leo Andreev (Oct 28 2025 at 15:27):

I got confirmation that the two IPs above have been added to the exemptions list.

view this post on Zulip Leo Andreev (Oct 28 2025 at 15:32):

Could you please remind me whether your crawler can be throttled so as not to exceed a certain call rate?
That's another thing HUIT are enforcing. Unlike the silent challenges, and for reasons I don't fully understand, they have been unable to grant us exceptions for specific url patterns etc. with that.
At the moment the rate is defined as 300 calls/5 min., after which they put the ip on their crap list (code 403) for the next 5 min.

view this post on Zulip Leo Andreev (Oct 28 2025 at 15:34):

(I am working with them on relaxing these rules/making them more flexible etc., as this is causing us real problems; but that's what we have to work around at the moment)

view this post on Zulip Ian Nesbitt - DataONE (Oct 28 2025 at 15:40):

We can delay each call as much as needed. Currently we enforce a 2-second delay between each.

view this post on Zulip Leo Andreev (Oct 28 2025 at 15:43):

Great, that should be more than slow enough.

view this post on Zulip Ian Nesbitt - DataONE (Oct 28 2025 at 15:51):

I think if I set it to 1/sec it would be fine, because the delay time does not include processing time, but I also want to be kind to your servers and there's really no rush in getting the scrape done.
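
The arithmetic behind this: 300 calls per 5 minutes (300 seconds) averages out to one call per second, so a fixed 2-second delay caps the crawler at roughly 150 calls per window even with zero processing time. A minimal sketch of that calculation and of a fixed-delay loop follows. The function names are illustrative, and how the limit is actually counted (sliding or fixed window) is not stated in this thread.

```python
import time

WINDOW_SECONDS = 5 * 60       # the 5-minute window mentioned above
MAX_CALLS_PER_WINDOW = 300    # the limit mentioned above


def max_calls_in_window(delay_seconds, window_seconds=WINDOW_SECONDS):
    """Approximate upper bound on calls per window for a fixed delay between calls."""
    return int(window_seconds // delay_seconds)


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one at a time, sleeping after each one (including the last)."""
    for item in items:
        yield item
        # The delay is added on top of whatever time the caller spent
        # processing the item, so the real call rate is lower than the bound.
        sleep(delay_seconds)
```

For example, `max_calls_in_window(2.0)` gives 150 and `max_calls_in_window(1.0)` gives 300, which sits right at the limit and relies on processing time for its margin.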


Last updated: Oct 30 2025 at 06:21 UTC