Hi everyone,
last week I discovered an interesting "bug" in Dataverse related to generateIdentifierFromStoredProcedure.
I updated a Dataverse instance from v6.8 to v6.9, and noticed that the "create dataset" API was occasionally getting stuck and not returning a response. See this terminal log of some datasets being created in quick succession:
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":92,"persistentId":"perma:nfdi4health-64"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":96,"persistentId":"perma:nfdi4health-64"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":97,"persistentId":"perma:nfdi4health-65"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":98,"persistentId":"perma:nfdi4health-66"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":99,"persistentId":"perma:nfdi4health-67"}}
(The ^C lines are where it got stuck and I had to kill the command with CTRL+C)
(Also, notice the one identifier "perma:nfdi4health-64" being returned twice for two datasets...?)
Now, first I had no clue what could be the reason. I ran asadmin generate-jvm-report --type=thread while I had a curl request stuck in another terminal to see what the Dataverse threads were doing. Here's a shortened bit of the output:
payara@backend-testing-dv-v6-10-1-dataverse-0:~$ asadmin generate-jvm-report --type=thread
...
Thread Execution Information:
-----------------------
Thread "http-thread-pool::http-listener-1(8)" thread-id: 507 thread-state: RUNNABLE
at: org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2339)
at: org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:372)
...
at: edu.harvard.iq.dataverse.pidproviders.PidProviderFactoryBean.isGlobalIdLocallyUnique(PidProviderFactoryBean.java:224)
at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.isGlobalIdUnique(AbstractPidProvider.java:259)
at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.generateIdentifierFromStoredProcedureIndependent(AbstractPidProvider.java:455)
at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.generateDatasetIdentifier(AbstractPidProvider.java:242)
at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.generatePid(AbstractPidProvider.java:228)
at: edu.harvard.iq.dataverse.engine.command.impl.AbstractCreateDatasetCommand.execute(AbstractCreateDatasetCommand.java:90)
...
So I thought, OK, could be another bug introduced in v6.9 related to PID providers that use "storedProcGenerated", like the one I reported here: https://github.com/IQSS/dataverse/issues/12126
My SQL is based on the example for getting a numerical sequence and the function itself looks like this:
CREATE OR REPLACE FUNCTION generateIdentifierFromStoredProcedure()
RETURNS varchar AS $$
DECLARE
identifier varchar;
BEGIN
identifier := nextval('datasetidentifier_seq')::varchar;
RETURN identifier;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
In the end, I was able to fix the issue by marking this function VOLATILE instead of IMMUTABLE.
It seems that when the function was marked as IMMUTABLE (meaning it "is guaranteed to return the same results given the same arguments forever" [1]), the db was caching the function result and occasionally just returning the same cached identifier over and over, which caused the loop in generateIdentifierFromStoredProcedureIndependent to get stuck.
The function should be marked VOLATILE, since it contains a call to nextval, which means it modifies the db state and its result constantly change.
I'm thinking about opening a PR for the IMMUTABLE -> VOLATILE change, but first, I thought this bug was so interesting I'd open a Zulip thread about it :)
[1] https://www.postgresql.org/docs/current/xfunc-volatility.html
Yes, very interesting! Thanks!
Opened a PR here: https://github.com/IQSS/dataverse/pull/12391
Last updated: May 30 2026 at 06:18 UTC