Hi, I recently updated Dataverse to version 6.5 and I noticed that updating citation.tsv, something that used to take a few seconds, now takes far too long, as you can see below. I assume this behavior started because of the new values added for the Language dataset field.
time curl "http://localhost:8080/api/admin/datasetfield/load" -X POST --upload-file /tmp/dvinstall/data/metadatablocks/citation.tsv -H "Content-type: text/tab-separated-values"
real 27m33.218s
user 0m0.059s
sys 0m0.094s
Is there any alternative that can speed up this process for future updates, or is this behavior fixed in a newer version?
Yes, exactly, it's because of all the new languages in #10762, added in Dataverse 6.4.
Hmm starting with https://github.com/IQSS/dataverse/releases/tag/v6.6 we say this:
"Expect the loading of the citation block to take several seconds
because of its size (especially due to the number of languages)."
I just added this to https://github.com/IQSS/dataverse/releases/tag/v6.4 which is where #10762 was merged.
At https://github.com/IQSS/dataverse/releases/tag/v6.5 I didn't add anything because we don't ask you to reload citation.tsv.
27 minutes is crazy though! Is that right?!? For me, it takes maybe 5-10 seconds which was enough to annoy me and prompt me to add that note about several seconds.
Yes, it took me 27 minutes.
Can you please say a little more about your hardware?
I am monitoring the machine and I don't see any load increase. I am running the container on a machine with 64GB of RAM and 32 cores.
a pretty beefy machine
Please feel free to create an issue: https://github.com/IQSS/dataverse/issues
This is well known from my experiments with migrations in containers. Please make sure to recheck your resource allocations for the Postgres and Dataverse containers. I found more RAM and a few extra CPU cores helped. Depending on your setup, you may have a beefy machine but not necessarily be using all of it.
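For example (a hedged sketch, not from this thread: the container name, image tag, and limit values below are illustrative assumptions), explicit limits with plain `docker run` would look something like:

```shell
# Illustrative only: give the Postgres container explicit CPU and RAM
# allocations instead of relying on defaults or on whatever the
# orchestrator happens to grant. Name, tag, and values are assumptions.
docker run -d --name dataverse-postgres \
  --cpus=4 \
  --memory=8g \
  -e POSTGRES_PASSWORD=secret \
  postgres:16
```

You can then watch `docker stats` during the citation.tsv load to confirm the containers are actually using the resources the host has available.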
And @Philip Durbin 🚀 I'm kinda skeptical about the "seconds" claim for this change. Even after adding a lot more hardware it was still well into minutes.
@César Ferreira I ran all of these migrations tests on a snapshot of prod, so less anxiousness involved :grinning_face_with_smiling_eyes:
Oliver Bertuch said:
And Philip Durbin 🚀 I'm kinda skeptical about the "seconds" claim for this change. Even after adding a lot more hardware it was still well into minutes.
On my laptop it takes 13 seconds:
% time curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file scripts/api/data/metadatablocks/citation.tsv
0.01s user 0.01s system 0% cpu 12.791 total
That's where I write the release notes. :sweat_smile:
@Omer M Fahim how long does it take on any of our test servers, would you say? :thinking:
your times align with mine
Whaaaaaat :flushed::flushed::flushed::flushed::flushed::flushed::flushed::flushed::flushed:
@Leo Andreev is saying it's much slower when you go from 200 langs to 8000.
Versus staying at 8000, I mean, which is what I just tested.
Is this a fresh DB or a snapshot of an existing instance?
For me a fresh db.
Hmm, that may make a difference, since a snapshot would have to update all the existing datasets with the 8000 new entries. Not sure. Might need some more testing again...
Oh wait - you mean it takes 13 secs to reload the block now because you had already loaded it before? That would make sense...
Yeah, exactly.
ehh gonna run a test right now
Then I probably misunderstood earlier. Sry
No worries.
If I drop my database and add this to scripts/api/setup-datasetfields.sh
+echo "BEGIN loading citation"
+date
curl "${DATAVERSE_URL}/api/admin/datasetfield/load" -X POST --data-binary @"$SCRIPT_PATH"/data/metadatablocks/citation.tsv -H "Content-type: text/tab-separated-values"
+date
+echo "END loading citation"
I get this:
dev_bootstrap> BEGIN loading citation
dev_bootstrap> Fri Oct 10 20:25:01 UTC 2025
dev_bootstrap> {"status":"OK","data":{"added":[{"name":"citation"...
dev_bootstrap> Fri Oct 10 20:25:14 UTC 2025
dev_bootstrap> END loading citation
So also 13 seconds for an initial load of citation.tsv.
Thank you for all the feedback, I will try to adjust container resources as @Oliver Bertuch suggested. I did the same test on another "fresh" instance, which only has one dataset, and the POST time was 1m30s. Could this time increase with the number of existing datasets?
After some tests I suspect that these POST requests are taking so long because of the filesystem. Our Dataverse instances run on Openstack, and we found out previously, while stress testing our main instance, that PostgreSQL doesn't work well with the Openstack filesystem. Because of that, our production instance has its DB running outside of Openstack, and there the POST request takes 1m25s without any resource tuning.
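One quick way to check whether the database filesystem is the bottleneck (a sketch, assuming PostgreSQL's bundled `pg_test_fsync` utility is available on the DB host or container; the data-directory path is an assumption):

```shell
# Measure fsync latency on the PostgreSQL data volume. Slow fsync on
# network-backed storage is a common cause of slow bulk inserts like
# the citation.tsv load.
#   -s 5  : run each test for 5 seconds
#   -f ...: place the test file on the volume under test (path assumed)
pg_test_fsync -s 5 -f /var/lib/postgresql/data/pg_test_fsync.tmp
```

If the reported ops/sec are dramatically lower on the Openstack-backed volume than on local disk, that would line up with the 27-minute load.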
What filesystem are you using in your Openstack?
Also, are we talking Cinder or Manila? (I hope Cinder :alien:)
If you're looking into running some FS statistics, maybe this little tool I put together in a container helps. https://jugit.fz-juelich.de/fdm/k8s/k8s-storage-benchmark
We're on Openstack as well and use Cinder block devices backed by Ceph librbd mounts. I'd rather have krbd because it has even better performance, but it's not bad either. The storage link is a 10G per OpenStack Nova Host and they run a 2-replica config in Ceph.
I am not sure, but I think it is Cinder. I can ask my colleague in charge of Openstack for more details. When we detected issues with Openstack we also ran some performance tests with Ansible. We generated test files with the command head -c {{ filesize }} /dev/zero | tr '\000' '\377' and used dool for filesystem monitoring.
I deployed Dataverse on a different Openstack with newer hardware and got better results. The other Openstack instance has older hardware, which could be one of the reasons for the bad performance. Now the POST only takes 1-2 min. I get the same times whether it is a fresh install or the database already exists. I even tried running it locally and got the same results.
That's good. Please remind me, is this going from ~200 langs to ~8000? Or from 8k to 8k?
From 8k to 8k. My citation.tsv also has about 300 more subjects, but I also tested with the one on GitHub and the results were the same.
citation.tsv
Ok, thanks. I'm just thinking we should update our release notes to say minutes instead of seconds.
Last updated: Oct 30 2025 at 06:21 UTC