Stream: dev

Topic: generateIdentifierFromStoredProcedure as a VOLATILE function


view this post on Zulip Vera Clemens (May 11 2026 at 08:53):

Hi everyone,

last week I discovered an interesting "bug" in Dataverse related to generateIdentifierFromStoredProcedure.

I updated a Dataverse instance from v6.8 to v6.9, and noticed that the "create dataset" API was occasionally getting stuck and not returning a response. See this terminal log of some datasets being created in quick succession:

clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":92,"persistentId":"perma:nfdi4health-64"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":96,"persistentId":"perma:nfdi4health-64"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C

clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C

clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":97,"persistentId":"perma:nfdi4health-65"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C

clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
^C

clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":98,"persistentId":"perma:nfdi4health-66"}}
clemens@clemens-virtual-machine:~$ curl -X POST localhost:8080/api/dataverses/nfdi4health/datasets -H "X-Dataverse-key: xxx" -H "Content-Type: application/json" --data @/home/clemens/testdata.json
{"status":"OK","data":{"id":99,"persistentId":"perma:nfdi4health-67"}}

(The ^C lines are where it got stuck and I had to kill the command with CTRL+C)

(Also, notice the one identifier "perma:nfdi4health-64" being returned twice for two datasets...?)

Now, first I had no clue what could be the reason. I ran asadmin generate-jvm-report --type=thread while I had a curl request stuck in another terminal to see what the Dataverse threads were doing. Here's a shortened bit of the output:

payara@backend-testing-dv-v6-10-1-dataverse-0:~$ asadmin generate-jvm-report --type=thread
...
Thread Execution Information:
-----------------------
Thread "http-thread-pool::http-listener-1(8)" thread-id: 507 thread-state: RUNNABLE
     at: org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2339)
     at: org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:372)
...
     at: edu.harvard.iq.dataverse.pidproviders.PidProviderFactoryBean.isGlobalIdLocallyUnique(PidProviderFactoryBean.java:224)
     at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.isGlobalIdUnique(AbstractPidProvider.java:259)
     at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.generateIdentifierFromStoredProcedureIndependent(AbstractPidProvider.java:455)
     at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.generateDatasetIdentifier(AbstractPidProvider.java:242)
     at: edu.harvard.iq.dataverse.pidproviders.AbstractPidProvider.generatePid(AbstractPidProvider.java:228)
     at: edu.harvard.iq.dataverse.engine.command.impl.AbstractCreateDatasetCommand.execute(AbstractCreateDatasetCommand.java:90)
...

So I thought, OK, could be another bug introduced in v6.9 related to PID providers that use "storedProcGenerated", like the one I reported here: https://github.com/IQSS/dataverse/issues/12126

My SQL is based on the example for getting a numerical sequence and the function itself looks like this:

CREATE OR REPLACE FUNCTION generateIdentifierFromStoredProcedure()
RETURNS varchar AS $$
DECLARE
identifier varchar;
BEGIN
    identifier := nextval('datasetidentifier_seq')::varchar;
RETURN identifier;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

In the end, I was able to fix the issue by marking this function VOLATILE instead of IMMUTABLE.

It seems that when the function was marked as IMMUTABLE (meaning it "is guaranteed to return the same results given the same arguments forever" [1]), the db was caching the function result and occasionally just returning the same cached identifier over and over, which caused the loop in generateIdentifierFromStoredProcedureIndependent to get stuck.

The function should be marked VOLATILE, since it contains a call to nextval, which means it modifies the db state and its result constantly change.

I'm thinking about opening a PR for the IMMUTABLE -> VOLATILE change, but first, I thought this bug was so interesting I'd open a Zulip thread about it :)

[1] https://www.postgresql.org/docs/current/xfunc-volatility.html

view this post on Zulip Philip Durbin 🚀 (May 11 2026 at 08:58):

Yes, very interesting! Thanks!

view this post on Zulip Vera Clemens (May 11 2026 at 11:37):

Opened a PR here: https://github.com/IQSS/dataverse/pull/12391


Last updated: May 30 2026 at 06:18 UTC