Stream: troubleshooting

Topic: POST citation.tsv takes too long


view this post on Zulip César Ferreira (Oct 10 2025 at 15:57):

Hi, I recently updated Dataverse to version 6.5 and noticed that updating citation.tsv, which used to take a few seconds, now takes far too long, as you can see below. I assume this behavior started because of the new values added for the Language dataset field.

time curl "http://localhost:8080/api/admin/datasetfield/load" -X POST --upload-file /tmp/dvinstall/data/metadatablocks/citation.tsv -H "Content-type: text/tab-separated-values"
real    27m33.218s
user    0m0.059s
sys     0m0.094s

Is there any alternative that can speed up this process for future updates or is this behavior solved in a newer version?

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 15:59):

Yes, exactly, it's because of all the new languages in #10762, added in Dataverse 6.4.

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 16:01):

Hmm starting with https://github.com/IQSS/dataverse/releases/tag/v6.6 we say this:

"Expect the loading of the citation block to take several seconds
because of its size (especially due to the number of languages)."

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 16:02):

I just added this to https://github.com/IQSS/dataverse/releases/tag/v6.4 which is where #10762 was merged.

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 16:03):

At https://github.com/IQSS/dataverse/releases/tag/v6.5 I didn't add anything because we don't ask you to reload citation.tsv.

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 16:04):

27 minutes is crazy though! Is that right?!? For me, it takes maybe 5-10 seconds which was enough to annoy me and prompt me to add that note about several seconds.

view this post on Zulip César Ferreira (Oct 10 2025 at 16:05):

Yes, it took me 27 minutes.

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 16:08):

Can you please say a little more about your hardware?

view this post on Zulip César Ferreira (Oct 10 2025 at 16:10):

I am monitoring the machine and I don't see any load increase. I am running the container on a machine with 64GB of RAM and 32 cores.

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 16:11):

a pretty beefy machine

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 16:11):

Please feel free to create an issue: https://github.com/IQSS/dataverse/issues

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 18:43):

This is well known from my experiments with migrations in containers. Please make sure to recheck your resource allocations for the Postgres and Dataverse container. I found more RAM and a few CPU cores helped. Depending on your setup you may have a beefy machine but not necessarily use all of it.
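One way to check whether the containers are capped below what the host offers is to read their configured limits. This is a minimal sketch that assumes Docker is the container runtime (the thread does not say which one is used); `describe_limits` and `show_container_limits` are hypothetical helper names, and limits set by other means (for example a CPU quota) would not show up in these two fields.

```shell
# Hypothetical helpers, not part of Dataverse or Docker.

# Turn Docker's raw limit values into a readable line.
# $1: memory limit in bytes, $2: CPU limit in nano-CPUs; 0 means no limit is set.
describe_limits() (
  mem=$1
  nano=$2
  if [ "$mem" -eq 0 ]; then
    mem_txt="unlimited"
  else
    mem_txt="$((mem / 1024 / 1024)) MiB"
  fi
  if [ "$nano" -eq 0 ]; then
    cpu_txt="unlimited"
  else
    cpu_txt="$(awk -v n="$nano" 'BEGIN { printf "%.2f", n / 1e9 }') CPUs"
  fi
  echo "memory: $mem_txt, cpu: $cpu_txt"
)

# Print the configured limits of one container.
# $1: container name or ID (whatever your deployment calls it).
show_container_limits() (
  out=$(docker inspect --format '{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}' "$1") || exit 1
  # Unquoted on purpose: the output is two whitespace-separated numbers.
  set -- $out
  describe_limits "$1" "$2"
)
```

Calling `show_container_limits` once for the Dataverse container and once for the Postgres container shows whether either is restricted; `docker stats --no-stream` gives the live usage to compare against while the POST is running.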

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 18:45):

And @Philip Durbin 🚀 I kinda disbelieve about the seconds stuff for this change. Even after adding a lot more hardware it was still well into minutes.

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 18:47):

@César Ferreira I ran all of these migration tests on a snapshot of prod, so less anxiousness involved :grinning_face_with_smiling_eyes:

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 18:53):

Oliver Bertuch said:

And Philip Durbin 🚀 I kinda disbelieve about the seconds stuff for this change. Even after adding a lot more hardware it was still well into minutes.

On my laptop it takes 13 seconds:

% time curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file scripts/api/data/metadatablocks/citation.tsv

0.01s user 0.01s system 0% cpu 12.791 total

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 18:54):

That's where I write the release notes. :sweat_smile:

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 18:55):

@Omer M Fahim how long does it take on any of our test servers, would you say? :thinking:

view this post on Zulip Omer M Fahim (Oct 10 2025 at 19:19):

your times align with mine

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 20:16):

Whaaaaaat :flushed::flushed::flushed::flushed::flushed::flushed::flushed::flushed::flushed:

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 20:17):

@Leo Andreev is saying it's much slower when you go from 200 langs to 8000.

Versus staying at 8000, I mean, which is what I just tested.
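To see which of the two cases a given file represents, one could count the language values in the TSV that is about to be loaded. This is a rough sketch that assumes the usual metadata block TSV layout: a header line starting with `#controlledVocabulary`, followed by rows whose first column is empty and whose second column is the dataset field name. `count_vocab_values` is a hypothetical helper name.

```shell
# Hypothetical helper, not part of Dataverse.
# Count controlled vocabulary rows for one dataset field in a metadata block TSV.
# $1: field name (e.g. language), $2: path to the TSV file.
count_vocab_values() (
  field=$1
  tsv=$2
  awk -F '\t' -v field="$field" '
    /^#controlledVocabulary/ { in_vocab = 1; next }  # vocabulary section starts here
    /^#/                     { in_vocab = 0; next }  # any other section header ends it
    in_vocab && $2 == field  { n++ }                 # assumed layout: field name in column 2
    END                      { print n + 0 }         # print 0 when nothing matched
  ' "$tsv"
)
```

For example, `count_vocab_values language scripts/api/data/metadatablocks/citation.tsv` prints the number of language rows in the file from the source tree used earlier in this thread.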

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 20:17):

Is this a fresh DB or a snapshot of an existing instance?

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 20:17):

For me a fresh db.

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 20:18):

Hmm, that may make a difference, with updating all the datasets and having 8000 new entries. Not sure. Might need some more testing again...

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 20:20):

Oh wait - you mean it takes 13 secs to reload the block now that you had already loaded it before? That would make sense...

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 20:20):

Yeah, exactly.

view this post on Zulip Omer M Fahim (Oct 10 2025 at 20:20):

ehh gonna run a test right now

view this post on Zulip Oliver Bertuch (Oct 10 2025 at 20:20):

Then I probably misunderstood earlier. Sry

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 20:33):

No worries.

view this post on Zulip Philip Durbin 🚀 (Oct 10 2025 at 20:34):

If I drop my database and add this to scripts/api/setup-datasetfields.sh

+echo "BEGIN loading citation"
+date
 curl "${DATAVERSE_URL}/api/admin/datasetfield/load" -X POST --data-binary @"$SCRIPT_PATH"/data/metadatablocks/citation.tsv -H "Content-type: text/tab-separated-values"
+date
+echo "END loading citation"

I get this:

dev_bootstrap> BEGIN loading citation
dev_bootstrap> Fri Oct 10 20:25:01 UTC 2025
dev_bootstrap> {"status":"OK","data":{"added":[{"name":"citation"...
dev_bootstrap> Fri Oct 10 20:25:14 UTC 2025
dev_bootstrap> END loading citation

So also 13 seconds for an initial load of citation.tsv. :shrugdog:
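The same before/after timing can be wrapped into a small reusable helper so that different instances are measured the same way. This is a minimal sketch; `format_elapsed` and `time_citation_load` are hypothetical names, and the endpoint and header are the ones from the curl commands quoted in this thread.

```shell
# Hypothetical helpers, not part of Dataverse.

# Convert a number of seconds into "XmYs".
format_elapsed() (
  secs=$1
  printf '%dm%ds\n' $((secs / 60)) $((secs % 60))
)

# POST a metadata block TSV and print how long the request took.
# $1: base URL (e.g. http://localhost:8080), $2: path to the TSV file.
time_citation_load() (
  url=$1
  tsv=$2
  start=$(date +%s)
  # The response body is discarded; only the duration is reported.
  curl -sS -o /dev/null "${url}/api/admin/datasetfield/load" \
    -X POST --upload-file "$tsv" \
    -H "Content-type: text/tab-separated-values"
  end=$(date +%s)
  format_elapsed $((end - start))
)
```

Usage would look like `time_citation_load http://localhost:8080 /tmp/dvinstall/data/metadatablocks/citation.tsv`, with the URL and path adjusted to the installation being tested.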

view this post on Zulip César Ferreira (Oct 13 2025 at 09:10):

Thank you for all the feedback, I will try to adjust container resources as @Oliver Bertuch suggested. I did the same test on another "fresh" instance that only has one dataset, and the POST took 1m30s. Could the number of existing datasets increase this time?

view this post on Zulip César Ferreira (Oct 13 2025 at 12:19):

After some tests I suspect that these POST requests are taking too long because of the filesystem. We have our Dataverse instances running on Openstack and we have found out previously, while stress testing our main instance, that PostgreSQL doesn't work well with the Openstack filesystem. Because of that our production instance has a DB running outside of Openstack and the POST request takes 1m25s without any resource tuning.

view this post on Zulip Oliver Bertuch (Oct 13 2025 at 13:40):

What filesystem are you using in your Openstack?

view this post on Zulip Oliver Bertuch (Oct 13 2025 at 13:40):

Also are we talking Cinder or Manila? (I hope Cinder :alien:)

view this post on Zulip Oliver Bertuch (Oct 13 2025 at 13:42):

If you're looking into running some FS statistics, maybe this little tool I put together in a container helps. https://jugit.fz-juelich.de/fdm/k8s/k8s-storage-benchmark

view this post on Zulip Oliver Bertuch (Oct 13 2025 at 13:45):

We're on Openstack as well and use Cinder block devices backed by Ceph librbd mounts. I'd rather have krbd because it has even better performance, but it's not bad either. The storage link is 10G per OpenStack Nova host and they run a 2-replica config in Ceph.

view this post on Zulip César Ferreira (Oct 13 2025 at 13:53):

I am not sure but I think it is Cinder. I can ask my colleague in charge of Openstack for more details. When we detected issues with Openstack we also ran some performance tests with Ansible. We generated test files with the command head -c {{ filesize }} /dev/zero | tr '\000' '\377' and we used dool for filesystem monitoring.
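The file-generation command above can be tried outside of Ansible by replacing the `{{ filesize }}` template variable with a shell argument. This is a minimal sketch; `make_test_file` is a hypothetical name, and `LC_ALL=C` is added here only so that `tr` treats its input as raw bytes regardless of the locale.

```shell
# Hypothetical helper; same head | tr pipeline as quoted above.
# Write a file of the given size filled with 0xFF bytes.
# $1: size in bytes, $2: output path.
make_test_file() (
  size=$1
  out=$2
  head -c "$size" /dev/zero | LC_ALL=C tr '\000' '\377' > "$out"
)
```

Running `time make_test_file 1073741824 /path/on/the/volume/testfile` while dool is running in another terminal gives a rough write measurement (the path is a placeholder). Without a following `sync`, part of the write may still sit in the page cache when the command returns.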

view this post on Zulip César Ferreira (Oct 14 2025 at 14:41):

I deployed Dataverse on a different Openstack with newer hardware and I got better results. The other Openstack instance has older hardware and that could be one of the reasons for the bad performance. Now the POST only takes 1-2min. I get the same times whether it is a fresh install or the database already exists. I even tried running it locally and got the same results.

view this post on Zulip Philip Durbin 🚀 (Oct 14 2025 at 14:46):

That's good. Please remind me, is this going from ~200 langs to ~8000? Or from 8k to 8k?

view this post on Zulip César Ferreira (Oct 14 2025 at 14:50):

From 8k to 8k. My citation.tsv also has about 300 more subjects, but I also tested with the one on GitHub and the results were the same.
citation.tsv

view this post on Zulip Philip Durbin 🚀 (Oct 14 2025 at 14:51):

Ok, thanks. I'm just thinking we should update our release notes to say minutes instead of seconds.


Last updated: Oct 30 2025 at 06:21 UTC