Stream: community

Topic: DDI-Codebook via OAI-PMH less rich than via web interface


Knut Wenzig (May 15 2025 at 08:39):

Hi, I met @Oliver Bertuch at the HMC Conference in Cologne, and he suggested I raise my issue here.

I've noticed discrepancies between the DDI-Codebook metadata available via the web interface and through OAI-PMH. Specifically, the metadata accessed via OAI-PMH appears less rich.

For example, using the web interface, I can download rich DDI-Codebook metadata for this dataset:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BT3LXD
Direct download link:
https://dataverse.harvard.edu/api/datasets/export?exporter=ddi&persistentId=doi%3A10.7910/DVN/BT3LXD
This version includes detailed elements such as <fileDscr> and <dataDscr>.

However, when accessing the same dataset via OAI-PMH:
https://dataverse.harvard.edu/oai?verb=GetRecord&identifier=doi:10.7910/DVN/BT3LXD&metadataPrefix=oai_ddi
those elements are missing.

While I can imagine there may be reasons for this difference, I encourage you to look into it and consider providing the richer DDI-Codebook metadata via OAI-PMH—especially since it’s already available through the web interface.

Would it make sense to open a GitHub issue for this?

Philip Durbin 🚀 (May 15 2025 at 10:56):

Yes! Please do! @Oliver Bertuch brought this up in Slack yesterday (I know you can't see this) and linked to the abstract of your talk. @Leo Andreev called it a solvable problem so please do open an issue!

Knut Wenzig (May 15 2025 at 11:38):

Done: https://github.com/IQSS/dataverse/issues/11493

Amber Leahey (May 15 2025 at 13:43):

There are probably more differences. Do you have a list of all the DDI fields and their mappings? It looks like the metadata doc groups the export API and OAI-PMH together, so we may have to look into the code and update the doc - https://docs.google.com/spreadsheets/d/1VAhZ83hKURX_4T-bOkn7XCb9h62Gr2gcJI1fWe4Rl4c/edit?gid=1901625433#gid=1901625433

It would be great to look into the DDI mappings across the Dataverse integrations for better alignment!! :) @Victoria Lubitch

Leo Andreev (May 15 2025 at 14:14):

Hi Knut,
The decision to serve the "skinny" version of the DDI format ("oai_ddi") vs. the full DDI ("ddi") was made for purely practical reasons, based on the experience back when we used to serve the full version.
With the "<dataDscr>" section included, the size of the resulting XML goes into tens of megabytes for a dataset with a large number of tabular datafiles. It was expensive to serve these records, and it was expensive for the harvesters to parse them (we learned in the process that most OAI harvester clients were NOT designed with the expectation of ever having to deal with large records). At the same time, most harvesters have zero need for the actual variable-level metadata from the "<dataDscr>" section - in most practical use cases, an instance harvesting from us just wants to gather the descriptive, dataset-level metadata so that they can index it in their search engines. So, the decision was made to support 2 flavors of the format and have the short version served via OAI, under the assumption that anyone with a practical need for the variable-level metadata will obtain it via the metadata export API.
(So, for anyone else reading this - an important clarification is that access to the full version is not limited to the web interface; the metadata API will happily serve both oai_ddi and ddi).
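To make the two access routes concrete, here is a minimal Python sketch that builds both kinds of URL from the example links earlier in this thread. The function names are hypothetical; the base URL and query parameters are simply the ones that appear in those links, and passing exporter="oai_ddi" to the export endpoint is assumed from the statement that the metadata API serves both flavors.

```python
from urllib.parse import urlencode

# Instance used in the examples in this thread; swap in your own installation.
BASE_URL = "https://dataverse.harvard.edu"


def export_url(persistent_id, exporter="ddi", base_url=BASE_URL):
    """Build a metadata export API URL (hypothetical helper).

    exporter="ddi" is the full format; "oai_ddi" is assumed to select
    the short flavor through the same endpoint.
    """
    query = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return f"{base_url}/api/datasets/export?{query}"


def oai_get_record_url(identifier, metadata_prefix="oai_ddi", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}/oai?{query}"


if __name__ == "__main__":
    doi = "doi:10.7910/DVN/BT3LXD"
    print(export_url(doi))             # full DDI via the export API
    print(export_url(doi, "oai_ddi"))  # short flavor via the export API (assumed)
    print(oai_get_record_url(doi))     # short flavor via OAI-PMH
```

urlencode percent-encodes the ":" and "/" in the DOI, which servers decode back to the same identifier.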

The above was just a long-ish explanation of the underlying history/legacy. If there is indeed a practical need for obtaining the full version of the format via OAI, the obvious solution is to just make both versions available and let harvesters decide which one they want (I would probably make it configurable on the individual Dataverse instance level, which formats they want to serve). There may already be a Dataverse fork out there where this has been done (Borealis, maybe?). Anyway, thank you for opening the issue, and yes, I do believe that this will be rather straightforward to implement.

best,
-Leo

Leo Andreev (May 15 2025 at 14:20):

@Amber Leahey I can confirm that the difference between the 2 formats is just as described in the opening comment: 1. oai_ddi does not include the "<dataDscr>" section. 2. In oai_ddi, ALL the files in the dataset are listed as "<otherMat>" entries, whereas in the full ddi, ingested tabular datafiles are formatted under "<fileDscr>" sections that carry some additional information, such as the numbers of variables and observations.
The dataset-level, descriptive metadata sections are identical between the 2 flavors.
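
One way to check which flavor a given XML document is would be to count these three sections. The following is a minimal sketch using only the Python standard library; the function names and the two toy documents are made up for illustration and are not real Dataverse output. Tags are matched by local name, so an OAI-PMH envelope or a differing namespace does not matter.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sections that differ between the two flavors, per the description above.
SECTIONS = ("fileDscr", "dataDscr", "otherMat")


def count_sections(xml_text):
    """Count the sections of interest anywhere in the document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Strip a "{namespace}" prefix, if any, to get the local tag name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in SECTIONS:
            counts[local_name] += 1
    return {name: counts[name] for name in SECTIONS}


def has_variable_level_metadata(xml_text):
    """True if the document contains at least one dataDscr section."""
    return count_sections(xml_text)["dataDscr"] > 0


# Toy documents for illustration only; real records are far larger.
SHORT_EXAMPLE = (
    '<codeBook xmlns="ddi:codebook:2_5">'
    "<stdyDscr/><otherMat/><otherMat/>"
    "</codeBook>"
)
FULL_EXAMPLE = (
    '<codeBook xmlns="ddi:codebook:2_5">'
    "<stdyDscr/><fileDscr/><dataDscr><var/></dataDscr><otherMat/>"
    "</codeBook>"
)

if __name__ == "__main__":
    print(count_sections(SHORT_EXAMPLE))
    print(count_sections(FULL_EXAMPLE))
```

For records in the tens of megabytes, a streaming parser such as ET.iterparse would probably be a better fit than loading the whole document at once.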

Knut Wenzig (May 15 2025 at 14:48):

Thanks to all for the feedback and your time thinking about my issue.

From our metadata I constructed a single >500MB DDI-Codebook XML file (which corresponds to one DOI - https://www.doi.org/10.5684/soep.core.v39eu), and we learned that off-the-shelf OAI-PMH implementations did not expect XML files of this size. So I am aware that fine-grained metadata will be more demanding.

Leo Andreev said:

If there is indeed a practical need for obtaining the full version of the format via OAI, the obvious solution is to just make both versions available and let harvesters decide which one they want (I would probably make it configurable on the individual Dataverse instance level, which formats they want to serve).

I would not underestimate what will be possible once this kind of rich metadata is out in the world. In terms of discoverability, fine-grained metadata will deliver better results than study-level metadata. @Slava Tykhonov can do magic things with LLMs and Croissant. Even if there are no LLMs which understand DDI-Codebook, DDI-Codebook can be much richer than Croissant because the categories are not available in Croissant. And especially if one extracts this metadata from Stata or SPSS files (or similar), it is the cheapest metadata in your system - because no additional curation is needed.

So I am watching the GitHub issue. And to be honest: I am already excited. :-)

Julian Gautier (May 22 2025 at 20:48):

In the crosswalk that @Amber Leahey mentioned earlier in this thread, I'll note somehow that variable-level metadata is missing from DDI-Codebook metadata when it's retrieved over OAI-PMH.

It's a little tricky, though, since the crosswalk really only describes metadata at the dataset level, but I can at least leave a note.


Last updated: Nov 01 2025 at 14:11 UTC