Stream: troubleshooting

Topic: OAI feed missing dataset identifiers


view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:31):

Hello .. me again with a probably stupid question .. I admit I set up our Dataverse harvesting server when I first installed the site and really haven't touched it since. I only have the default, no-name set that doesn't have a "setspec" defined, which should be all published datasets, right? Anyway, I've noticed recently that the feed is missing some datasets: there should be 83, and there are only 79. I recently upgraded to v6.5 and did a full index of the site at that time. Should I delete/recreate the set, and/or create a set that is explicitly defined? Or any ideas why else datasets would be missing, or how I can fix it?
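
For context, what the feed actually advertises can be checked from the outside with the standard OAI-PMH verbs. A rough sketch (the hostname is a placeholder, and large result sets are paged via resumptionToken, so this only counts the first page):

# Count identifiers in the default set (no "set" parameter = the default, unnamed set)
curl -s "https://dataverse.example.edu/oai?verb=ListIdentifiers&metadataPrefix=oai_dc" | grep -o "<identifier>" | wc -l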

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:34):

When I try to re-run the default export, I see these messages in the log (and it just hangs forever and never completes):

[[setService, findAllNamedSets; query: select object(o) from OAISet as o where o.spec != '' order by o.spec]]
[[ 0 results found.]]

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:35):

What have I completely messed up?

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:35):

Hmm, yep, should be all. I'm not sure why a few are missing. I'm also not sure if Solr is involved or not. :thinking:

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:39):

Yeah, it does look like Solr is involved.

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:39):

Do you see something like this in your logs?

"set query expanded to " + datasetIds.size() + " datasets."

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:40):

I think harvesting might have its own log, apart from server.log I mean.

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:41):

Actually, I take it back. I don't think Solr is involved for the default set.

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:41):

if (!oaiSet.isDefaultSet()) {
    datasetIds = expandSetQuery(query);
    exportLogger.info("set query expanded to " + datasetIds.size() + " datasets.");
} else {
    // The default set includes all the local, published datasets.
    // findAllLocalDatasetIds() finds the ids of all the local datasets -
    // including the unpublished drafts and deaccessioned ones.
    // Those will be filtered out further down the line.
    datasetIds = datasetService.findAllLocalDatasetIds();
    databaseLookup = true;
}

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:42):

yes, I do see that .. and it looked like it was recreating the missing datasets, so maybe I just didn't wait long enough ..
.. I also created a new set and used the example for pulling the identifier .. and it looked like it was going to export 83 records as well ..
.. so I wonder why the original set stopped updating?

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:43):

Not sure. Strange. Please feel free to open an issue if you think it's a bug.

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:45):

ohhh okay, so the new set I created says "83 datasets (79 records exported, 0 marked as deleted)" .. so it is not exporting 4 of them for some reason

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:46):

the actual OAI log just says "Calling OAI Record Service to re-export 93 datasets."

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:46):

Weird. I wonder if we'll be able to reproduce it, though. Is it particular to your database? :thinking:

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:46):

93? that would include the unpublished ones i think

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:47):

I just don't know how to figure out why it isn't exporting those 4 .. they are from various time periods and don't seem to have weird formatting .. although there are some differences among all of them

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:49):

What if you make a set with one of the missing datasets? Does it work?

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:51):

trying now ..

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:57):

it says "1 dataset (0 records exported, 0 marked as deleted)"

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 19:58):

it finds the dataset but can't export those particular ones for some reason

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 19:59):

And if you create a set for a working dataset? Does it say 1 exported?

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:08):

yes it worked

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:12):

ok, so something is wrong with those few, hmm :thinking:

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:12):

anything in server.log?

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:15):

the only thing that i'm seeing in server.log are messages like this:

[2025-02-11T20:09:02.324+0000] [Payara 6.2024.7] [INFO] [] [edu.harvard.iq.dataverse.harvest.server.OAISetServiceBean] [tid: _ThreadID=93 _ThreadName=http-thread-pool::jk-connector(2)] [timeMillis: 1739304542324] [levelValue: 800] [[
setService, findAllNamedSets; query: select object(o) from OAISet as o where o.spec != '' order by o.spec]]

[2025-02-11T20:09:02.325+0000] [Payara 6.2024.7] [INFO] [] [edu.harvard.iq.dataverse.harvest.server.OAISetServiceBean] [tid: _ThreadID=93 _ThreadName=http-thread-pool::jk-connector(2)] [timeMillis: 1739304542325] [levelValue: 800] [[
3 results found.]]

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:16):

but that looks like the "all published" set, and I'm not sure why it just says "3 results found"??

view this post on Zulip Leo Andreev (Feb 11 2025 at 20:17):

Hi,
Yes, "83 datasets (79 records exported ..." would almost certainly indicate that the search query has found 83 published datasets, but only 79 of them have been successfully exported, so, the remaining 4 were not included in the OAI set advertised to the clients.
You already know the actual DOIs of the 4 missing/un-unexported datasets, correct? (ok, it looks like you know at least one - the one you've tried creating a set with...)
First step would be to identify which of the metadata formats is failing to export. (So, yes, this is a limitation of our export system - it's kind of binary/all-or-nothing; it's enough for just one format out of 10+ to fail, for the dataset to end up being "unexported". Which is a bit counter-productive, for the purposes of OAI especially - since the 3 formats needed for that are somewhat less likely to fail...)
Take a look at the storage folder for the dataset in question, on the filesystem or on S3, whichever is the case, and look for the files with the names like export_*.cached - for example, export_oai_dc.cached etc., and see which ones are missing, when compared to one of the successfully exported datasets.
Once you see which formats are missing, try exporting them individually via

curl "http://localhost:8080/api/datasets/export?exporter=xxx&persistentId=yyy"

while watching the server log; hopefully there will be some errors/exceptions that will tell us what Dataverse doesn't like in the metadata. (During a bulk export of an OAI set, error messages are suppressed, I believe.)
This is pretty time-consuming, unfortunately. (but maybe someone can chime in with something easier in mind)
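
On a filesystem store, that comparison can be scripted. A hedged sketch (the files directory and the authority/identifier path layout are assumptions for local storage, and GOODID stands in for a known-good dataset; adjust, or browse S3, as applicable):

FILES_DIR=/usr/local/dvn/data   # assumption: your files directory
# Missing export_*.cached files in the second listing point at the failing format(s)
ls "$FILES_DIR"/10.48349/ASU/GOODID/export_*.cached
ls "$FILES_DIR"/10.48349/ASU/C1CWX9/export_*.cached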

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:28):

for one of the datasets, in the storage location I see "export_Datacite.cached", "export_oai_dc.cached" and "export_OAI_ORE.cached" ..

I tried the curl command for all three exporters for a dataset id that did export and for one that did not, and they all seemed to work .. nothing appears in server.log

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:29):

these curl commands:
curl "http://localhost:8080/api/datasets/export?exporter=oai_dc&persistentId=doi:<ours>/C1CWX9"
curl "http://localhost:8080/api/datasets/export?exporter=OAI_ORE&persistentId=doi:<ours>/C1CWX9"
curl "http://localhost:8080/api/datasets/export?exporter=Datacite&persistentId=doi:<ours>/C1CWX9"

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:30):

all of those seemed to generate results for the non-exported dataset

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:37):

looking at one of the datasets that is working, it has more "export" files in the storage location .. e.g. ddi, dcterms, dc, schema.org, etc.

view this post on Zulip Leo Andreev (Feb 11 2025 at 20:38):

Yes, so, the next step should be to try the export API for the formats that are NOT there/not cached.
Note that if there are a few formats that are missing, it does NOT mean that all of them have failed to export; it may just mean that the exporter stopped once it encountered the first format it wasn't able to produce.

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:42):

so try each of these?:
https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats
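
One way to run through that whole list at once. A hedged sketch (exporter names are from the guide above; whether a failed export returns a non-200 status can vary, so keep an eye on server.log too):

# Try each exporter against one DOI and report the HTTP status
PID="doi:10.48349/ASU/C1CWX9"
for fmt in ddi oai_ddi dcterms oai_dc schema.org OAI_ORE Datacite oai_datacite dataverse_json; do
  code=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://localhost:8080/api/datasets/export?exporter=$fmt&persistentId=$PID")
  echo "$fmt: HTTP $code"
done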

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:47):

the oai_datacite one failed (but the Datacite one worked)

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:49):

and this in the log:
IllegalStateException caught when exporting oai_datacite for dataset doi:10.48349/ASU/C1CWX9; may or may not be due to a mismatch between an exporter code and a metadata block update.

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:49):

how do i fix it? :sweat_smile:

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:50):

we don't have any custom metadata blocks .. other than the computational workflow one that I accidentally installed

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:52):

"may or may not"

just tell me!

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:53):

haha yea!

view this post on Zulip Leo Andreev (Feb 11 2025 at 20:53):

Yes! Except you _may_ have more metadata formats configured, on top of the 9 listed in the guide (for example, we also have "croissant" added). So comparing against what's cached in the directory of one of the known exported datasets may be the safest...
OK, it sounds like you have already found one that is failing. The error message in the log is not super helpful, unfortunately... if you haven't yet, could you please try all the remaining formats too, and _maybe_ we'll see something more interesting in the log?

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:53):

and tell me what it is and how to fix it :smile:

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:54):

The comment above the error is not very promising: https://github.com/IQSS/dataverse/blob/v6.5/src/main/java/edu/harvard/iq/dataverse/export/ExportService.java#L340

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:54):

I tried all of these: ddi, oai_ddi, dcterms, oai_dc, schema.org, OAI_ORE, Datacite, oai_datacite, and dataverse_json

and the only one that failed was the oai_datacite one

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:57):

Which dataset is failing? Can you please give us the doi or landing page?

view this post on Zulip Leo Andreev (Feb 11 2025 at 20:57):

... Another (also very time-consuming) way of going about it is to open the full metadata edit form for this dataset next to the one for one of the known "good" datasets, and then stare at the two looking for any visible differences: some obscure field that has 2 populated entries in the former but only one in the latter, etc.

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:57):

(I'm wondering if this fix will help: Openaire fix for multiple productionPlaces #11194)

view this post on Zulip Leo Andreev (Feb 11 2025 at 20:58):

Finally, since the formats that the OAI server _actually needs_ have been exported successfully, let me think of a way to cheat and mark the dataset as "exported", for the purposes of adding it to the OAI set.

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:59):

okay, I will start working on comparing the metadata .. I was doing that and didn't really see anything, but I understand better now what I am looking for

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:59):

Multiple productionPlace:

          {
            "typeName": "productionPlace",
            "multiple": true,
            "typeClass": "primitive",
            "value": [
              "Phoenix, Arizona, USA",
              "Los Angeles, California, USA",
              "Santa Barbara, California, USA",
              "Austin, Texas, USA"
            ]
          },

At https://dataverse.asu.edu/dataset.xhtml?persistentId=doi:10.48349/ASU/C1CWX9

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 20:59):

thank you both!

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 11 2025 at 20:59):

So for that dataset, #11194 should help. (Thank you, @Florian Fritze !)

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 21:00):

yay! okay have to go to a meeting then will read/try it

view this post on Zulip Leo Andreev (Feb 11 2025 at 21:08):

The fix in the PR above will only be added in 6.6.
If you are willing to resort to hacks, you could try and set the lastexporttime to some time today in the dataset table for this dataset, and re-export the OAI set again. (I would only try that with datasets for which at least the oai_dc format exports successfully). May or may not work, no promises. :)
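
In SQL terms, that hack might look something like the sketch below (the database name is a placeholder, and the dvobject authority/identifier columns are recalled from the standard schema, so verify against your own database and take a backup first):

# Mark the dataset as freshly exported by bumping lastexporttime
psql -d dvndb -c "UPDATE dataset SET lastexporttime = NOW() \
  WHERE id = (SELECT id FROM dvobject \
              WHERE authority = '10.48349' AND identifier = 'ASU/C1CWX9');"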

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 21:46):

oh got it .. I will try updating the table .. that sounds like a good solution for now, if it works!

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 21:47):

looking at the rest of the datasets that won't export to see if it is the same thing

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 21:49):

they do all have multiple production locations

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 22:15):

the db hack worked at least for that one dataset .. trying the rest

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 22:15):

THANK YOU!! :tada:

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 22:18):

well, it worked for the new OAI set that I created with the persistent ID set, but not for the default set .. it still says 79 .. wonder why? we will probably need to change our Primo feed to point to the new one, I guess

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 22:25):

and it worked for 2 of the datasets but not the other 2 .. :confused:

view this post on Zulip Deirdre Kirmis (Feb 11 2025 at 22:37):

when I point to one of the "fixed" ones it says it has been deleted
https://dataverse.asu.edu/oai?verb=GetRecord&identifier=doi%3A10.48349%2FASU%2FC1CWX9&metadataPrefix=oai_dc

view this post on Zulip Leo Andreev (Feb 12 2025 at 14:05):

I'm sorry I sent you on this hacky path!
Let's try to erase and rebuild the default set from scratch; but if these datasets are still left out of it after that, I think the sensible thing to do will be to wait for 6.6 to fix it properly.
So, first, please erase all the records in the default set:
DELETE FROM oairecord WHERE setname = '';
(Please be super careful! Deleting things from the database directly is inherently risky...)
After that, the control panel should show "no active records" for the default set. Then re-export the set and see what happens.
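
For the re-export step, the admin API also has batch export endpoints (documented in the admin guide; behavior right after clearing a set is worth double-checking):

# Re-export a single dataset in all formats:
curl "http://localhost:8080/api/admin/metadata/:persistentId/reExportDataset?persistentId=doi:10.48349/ASU/C1CWX9"
# Or re-export every dataset whose cached exports are out of date:
curl http://localhost:8080/api/admin/metadata/reExportAll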

view this post on Zulip Deirdre Kirmis (Feb 12 2025 at 15:01):

ha no worries I always learn things that I didn't know doing hacky things! :big_smile: I would like to clean up that default set anyway. Will do this morning.

view this post on Zulip Deirdre Kirmis (Feb 12 2025 at 15:13):

well, it's better: it is now showing 81 rows .. so it picked up 2 of the ones that were missing before, but 2 are still missing .. I'll figure out which ones are still missing and look at the metadata to see if there is anything else that could be causing this .. otherwise we will wait for the fix! :smile:

view this post on Zulip Deirdre Kirmis (Feb 12 2025 at 15:13):

thanks so much for your help! :glowing_star:

view this post on Zulip Leo Andreev (Feb 12 2025 at 16:03):

Well, we are 2 datasets better off than we were yesterday, so I'll call it progress :)

view this post on Zulip Deirdre Kirmis (Feb 12 2025 at 16:29):

ha yes for sure!
The 2 datasets still not showing are:
https://dataverse.asu.edu/dataset.xhtml?persistentId=doi:10.48349/ASU/7DCWIK
https://dataverse.asu.edu/dataset.xhtml?persistentId=doi:10.48349/ASU/C1CWX9

They both have multiple production locations. I tried changing the lastexporttime for both to an earlier date (most of them show November '24) .. but it didn't make a difference; they still didn't show in the feed after re-export.

I guess we could try deleting all but one of the production locations until the fix is released and see if that works, and then add them back after the fix.

view this post on Zulip Philip Durbin ๐Ÿš€ (Feb 12 2025 at 16:38):

Should work, unless that exporter is failing for more reasons than just Production Place.

view this post on Zulip Leo Andreev (Feb 12 2025 at 16:55):

And maybe double-check the lastexporttime in the dataset table for these 2: is it actually later than the releasetime on their latest datasetversions?
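
That check can be done directly in the database. A hedged sketch (table and column names are from the standard schema and the database name is a placeholder; verify before running):

# Compare export vs. release timestamps for the two stragglers
psql -d dvndb -c "SELECT o.identifier, d.lastexporttime, v.releasetime \
  FROM dataset d \
  JOIN dvobject o ON o.id = d.id \
  JOIN datasetversion v ON v.dataset_id = d.id \
  WHERE o.identifier IN ('ASU/7DCWIK', 'ASU/C1CWX9') \
  ORDER BY o.identifier, v.releasetime DESC;"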

view this post on Zulip Deirdre Kirmis (Feb 12 2025 at 17:49):

ohhhhhh that was it! The release date on those 2 was blank again (maybe because I tried to set it earlier?) .. idk .. but I made sure the lastexporttime was later than the release date, and now everything is showing as exported! I swear, yesterday when I did this I set them all to yesterday's date and it didn't work .. but for some reason those had to be set again. THANK YOU! We are good for now but will wait for the fix!

