Stream: community

Topic: Dataset with large number of files slowly kills dataverse


view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:20):

We have a dataset with a large number of files (about 450,000). Loading this dataset takes about 30 seconds and raises the (permanent) memory usage of Payara by at least 1GB (based on the garbage collection logs).
Is there a known mitigation for this? Some switch to disable file listing, or similar?
As it is now, about 20 loads of this dataset in the browser send Payara thrashing in GC hell. The only fix I know of is restarting Payara.

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 11:22):

Hmm, can you put those ~half a million files in a zip and use that instead as a new version of the dataset? We do have a nice zip previewer/downloader.

view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:23):

Possibly, of course. I will have to talk with the uploader. Is this the only way?

view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:24):

Of course, no single file could be downloaded then. Nor even some smaller parts (sub-directories)...

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 11:43):

With the zip previewer/downloader, single files can be downloaded. You can try it on a zip file in my dataset if you like: https://dataverse.harvard.edu/file.xhtml?fileId=6867328&version=4.0

view this post on Zulip Péter Pallinger (Oct 29 2024 at 11:45):

Cool, I will look into this.

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 11:46):

https://dataverse.harvard.edu/dataverse/ashkelonexcavations has 28K files, one dataset per file. Perhaps an extreme example, but it's another way of avoiding having too many files in a single dataset.

view this post on Zulip Péter Pallinger (Oct 29 2024 at 12:43):

Yeah, one file per dataset is not really a good solution IMHO.
Also, the zip previewer needs to get the file list from the zip on the server, and parse it. With 450k files, it may crash the browser doing that...

view this post on Zulip Péter Pallinger (Oct 29 2024 at 12:46):

Also, the "all files in a dataverse" approach makes it possible to search among them.

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 12:55):

Well, perhaps @Markus Haarländer, author of the zip previewer/downloader, can confirm, but I believe it only downloads the bytes it needs to get the list.
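
The trick is standard zip layout: the list of entries lives in the central directory at the end of the archive, so a client can enumerate all files with two small HTTP Range reads and never touch the compressed payloads. A rough Python sketch of the idea (not the previewer's actual code; it also ignores ZIP64, which a real 450k-entry archive would in practice require):

```python
import struct
import urllib.request

def fetch_range(url, start, length):
    # HTTP Range request: fetch only `length` bytes starting at `start`.
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def content_length(url):
    # HEAD request to learn the total size of the remote zip.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def remote_zip_names(url):
    size = content_length(url)
    # The End Of Central Directory record (signature PK\x05\x06) sits within
    # the last ~65KB of the file, so one small tail read locates it.
    tail_len = min(size, 65_557)
    tail = fetch_range(url, size - tail_len, tail_len)
    eocd = tail.rfind(b"PK\x05\x06")
    cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12:eocd + 20])

    # Second read: only the central directory, which lists every entry.
    cd = fetch_range(url, cd_offset, cd_size)
    names, pos = [], 0
    while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
        name_len, extra_len, comment_len = struct.unpack(
            "<HHH", cd[pos + 28:pos + 34])
        names.append(cd[pos + 46:pos + 46 + name_len].decode("utf-8", "replace"))
        pos += 46 + name_len + extra_len + comment_len
    return names
```

Neither read scales with the size of the archived data, only with the number of entries, which is why the downloader stays cheap on the network side.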

view this post on Zulip Péter Pallinger (Oct 29 2024 at 13:09):

Yes, it really only downloads the needed parts of the zip. However, representing a 450k-entry list in JavaScript and/or in the DOM is challenging for most browsers.
Thank you for your help. I will try to convince the dataset owner to use a smaller number (~1000) of zip files; that way some of the search functionality remains, listing would be fast, and memory would not leak (that much).
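
A rough sketch of that repacking step, assuming the files sit in a local directory tree (the paths and the chunk size below are made up; grouping by sub-directory instead of by count would likely preserve more of the search value):

```python
import zipfile
from pathlib import Path

def pack_into_chunks(src_dir, out_dir, chunk_size=450):
    # Collect every file under src_dir and write one zip per chunk_size
    # files: ~450k files at 450 per archive yields roughly 1000 zips.
    files = sorted(p for p in Path(src_dir).rglob("*") if p.is_file())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(files), chunk_size):
        name = out / f"part_{i // chunk_size:04d}.zip"
        with zipfile.ZipFile(name, "w",
                             compression=zipfile.ZIP_DEFLATED) as zf:
            for f in files[i:i + chunk_size]:
                # Keep paths relative to src_dir so the tree is preserved
                # inside each archive.
                zf.write(f, f.relative_to(src_dir))

pack_into_chunks("dataset_files", "zipped_parts")
```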

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 13:30):

If it helps, @Ceilyn Boyd gave a talk fairly recently called "Transforming a Digital Collection into a Data Collection". About 80K files were in play: https://groups.google.com/g/dataverse-community/c/Teb7_Pj2ajg/m/HO0E0vMnAQAJ

view this post on Zulip Péter Pallinger (Oct 29 2024 at 16:40):

If I upload a zip (even through the API), it will be decompressed. How can you upload a zip so that it is left as a zip?

view this post on Zulip Oliver Bertuch (Oct 29 2024 at 16:47):

Double-zip it :smiley: (The official workaround)
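
Roughly, with the native API (the server URL, API token, and dataset PID below are placeholders; requests is a third-party library): wrap the real archive in a second, uncompressed zip. Dataverse unpacks the outer layer on ingest and stores the inner zip as a single file.

```python
import zipfile
import requests  # pip install requests

SERVER = "https://demo.dataverse.org"    # placeholder installation URL
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx"    # placeholder API token
PID = "doi:10.5072/FK2/EXAMPLE"          # placeholder dataset PID

# Outer wrapper: no compression needed, it only exists to be stripped.
with zipfile.ZipFile("outer.zip", "w",
                     compression=zipfile.ZIP_STORED) as zf:
    zf.write("data.zip")

# Upload via the native API's add-file endpoint; the inner data.zip
# survives ingest as one file.
with open("outer.zip", "rb") as fh:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": fh},
    )
r.raise_for_status()
print(r.json())
```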

view this post on Zulip Philip Durbin 🚀 (Oct 29 2024 at 17:34):

Yes, and please consider voting and commenting on this issue: Support uploading of archives (ZIP, other). #8029


Last updated: Nov 01 2025 at 14:11 UTC