Stream: community

Topic: Dataset with large number of files slowly kills dataverse


view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:20):

We have a dataset with a large number of files (about 450,000). Loading this dataset takes about 30 seconds and raises the (permanent) memory usage of Payara by at least 1GB (based on the garbage collection logs).
Is there a known mitigation for this? Some switch to disable file listing, or similar?
As it is now, about 20 loads of this dataset in the browser send Payara thrashing in GC hell. The only fix I know of is restarting Payara.

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 11:22):

Hmm, can you put those ~half a million files in a zip and use that instead as a new version of the dataset? We do have a nice zip previewer/downloader.

view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:23):

Possibly, of course. I will have to talk with the uploader. Is this the only way?

view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:24):

Of course, no single file could be downloaded then. Nor even some smaller parts (sub-directories)...

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 11:43):

With the zip previewer/downloader, single files can be downloaded. You can try it on a zip file in my dataset if you like: https://dataverse.harvard.edu/file.xhtml?fileId=6867328&version=4.0

view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:45):

Cool, I will look into this.

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 11:46):

https://dataverse.harvard.edu/dataverse/ashkelonexcavations has 28K files, one dataset per file. Perhaps an extreme example, but it's another way of avoiding having too many files in a single dataset.

view this post on Zulip Péter Pallinger (Oct 29 2024 at 12:43):

Yeah, one file per dataset is not really a good solution IMHO.
Also, the zip previewer needs to get the file list from the zip on the server, and parse it. With 450k files, it may crash the browser doing that...

view this post on Zulip Péter Pallinger (Oct 29 2024 at 12:46):

Also, the "all files in a dataverse" approach makes it possible to search among them.

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 12:55):

Well, perhaps @Markus Haarländer, author of the zip previewer/downloader, can confirm, but I believe it only downloads the bytes it needs to get the list.
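
The trick is standard zip layout: the list of entries lives in the central directory at the end of the archive, so a client can enumerate all files with two small HTTP Range reads and never touch the compressed payloads. A rough Python sketch of the idea (not the previewer's actual code; it also ignores ZIP64, which a real 450k-entry archive would in practice require):

```python
import struct
import urllib.request

def fetch_range(url, start, length):
    # HTTP Range request: fetch only `length` bytes starting at `start`.
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def content_length(url):
    # HEAD request to learn the total size of the remote zip.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def remote_zip_names(url):
    size = content_length(url)
    # The End Of Central Directory record (signature PK\x05\x06) sits within
    # the last ~65KB of the file, so one small tail read locates it.
    tail_len = min(size, 65_557)
    tail = fetch_range(url, size - tail_len, tail_len)
    eocd = tail.rfind(b"PK\x05\x06")
    cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12:eocd + 20])

    # Second read: only the central directory, which lists every entry.
    cd = fetch_range(url, cd_offset, cd_size)
    names, pos = [], 0
    while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
        name_len, extra_len, comment_len = struct.unpack(
            "<HHH", cd[pos + 28:pos + 34])
        names.append(cd[pos + 46:pos + 46 + name_len].decode("utf-8", "replace"))
        pos += 46 + name_len + extra_len + comment_len
    return names
```

Neither read scales with the size of the archived data, only with the number of entries, which is why the downloader stays cheap on the network side.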

view this post on Zulip Péter Pallinger (Oct 29 2024 at 13:09):

Yes, it really only downloads the needed parts of the zip. However, representing a 450k-entry list in JavaScript and/or in the DOM is challenging for most browsers.
Thank you for your help. I will try to convince the dataset owner to use a smaller number (~1000) of zip files; that way some of the search functionality remains, listing would be fast, and memory would not leak (that much).
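
A rough sketch of that repacking step, assuming the files sit in a local directory tree (the paths and the chunk size below are made up; grouping by sub-directory instead of by count would likely preserve more of the search value):

```python
import zipfile
from pathlib import Path

def pack_into_chunks(src_dir, out_dir, chunk_size=450):
    # Collect every file under src_dir and write one zip per chunk_size
    # files: ~450k files at 450 per archive yields roughly 1000 zips.
    files = sorted(p for p in Path(src_dir).rglob("*") if p.is_file())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(files), chunk_size):
        name = out / f"part_{i // chunk_size:04d}.zip"
        with zipfile.ZipFile(name, "w",
                             compression=zipfile.ZIP_DEFLATED) as zf:
            for f in files[i:i + chunk_size]:
                # Keep paths relative to src_dir so the tree is preserved
                # inside each archive.
                zf.write(f, f.relative_to(src_dir))

pack_into_chunks("dataset_files", "zipped_parts")
```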

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 13:30):

If it helps, @Ceilyn Boyd gave a talk fairly recently called "Transforming a Digital Collection into a Data Collection". About 80K files were in play: https://groups.google.com/g/dataverse-community/c/Teb7_Pj2ajg/m/HO0E0vMnAQAJ

view this post on Zulip Péter Pallinger (Oct 29 2024 at 16:40):

If I upload a zip (even through the API), it will be decompressed. How can you upload a zip so that it is left as a zip?

view this post on Zulip Oliver Bertuch (Oct 29 2024 at 16:47):

Double-zip it :smiley: (The official workaround)
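
Roughly, with the native API (the server URL, API token, and dataset PID below are placeholders; requests is a third-party library): wrap the real archive in a second, uncompressed zip. Dataverse unpacks the outer layer on ingest and stores the inner zip as a single file.

```python
import zipfile
import requests  # pip install requests

SERVER = "https://demo.dataverse.org"    # placeholder installation URL
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx"    # placeholder API token
PID = "doi:10.5072/FK2/EXAMPLE"          # placeholder dataset PID

# Outer wrapper: no compression needed, it only exists to be stripped.
with zipfile.ZipFile("outer.zip", "w",
                     compression=zipfile.ZIP_STORED) as zf:
    zf.write("data.zip")

# Upload via the native API's add-file endpoint; the inner data.zip
# survives ingest as one file.
with open("outer.zip", "rb") as fh:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": fh},
    )
r.raise_for_status()
print(r.json())
```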

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 17:34):

Yes, and please consider voting and commenting on this issue: Support uploading of archives (ZIP, other). #8029


Last updated: Nov 01 2025 at 14:11 UTC