Stream: troubleshooting

Topic: Limit number of files per dataset


Markus Haarländer (Mar 27 2025 at 15:40):

Hi @all,
We recently had a user who uploaded more than 35,000 files into one dataset on our Dataverse (6.5) instance. Subsequently, not only the dataset but the whole system became unstable and was unusable most of the time. Deleting the dataset via API or UI no longer worked; we had to remove the files from the various database tables to get the system running again.
I know that Dataverse does not get along well with such a large number of files, so I was looking for a configuration setting that allows limiting the number of files per dataset. I couldn't find one, and I also could not find any related issues on GitHub. I know that @Eryk Kulikowski worked on performance improvements for large numbers of files some time ago, and I recall that things worked better for a while, but it seems the improvements are gone in v6.5?
So we would be interested in how other instances handle such issues. Is there some kind of configuration option I just can't find? Wouldn't it make sense to have such an option, so that a user cannot bring the whole system to a halt just by uploading too many files?

Philip Durbin 🚀 (Mar 27 2025 at 15:47):

The issue is less than a month old: Feature Request: (internal request) Add quota-like limit on the number of files in a dataset #11275

Philip Durbin 🚀 (Mar 27 2025 at 15:47):

But we've talked about it off and on for years. :sweat_smile:

Philip Durbin 🚀 (Mar 27 2025 at 15:48):

It completely makes sense that users should not be able to bring the system to a halt by uploading too many files!

Philip Durbin 🚀 (Mar 27 2025 at 15:48):

We do have a rate limiting feature: https://guides.dataverse.org/en/6.5/installation/config.html#rate-limiting
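For example, setting the default hourly capacities and then capping one specific command would look something like this (a sketch based on the syntax in that guide; the numbers and the command name are placeholders):

# default requests-per-hour for guest (tier 0) and authenticated (tier 1) users
curl -X PUT http://localhost:8080/api/admin/settings/:RateLimitingDefaultCapacityTiers -d '30,60'

# override the limit for one specific command
curl -X PUT http://localhost:8080/api/admin/settings/:RateLimitingCapacityByTierAndAction -d '[{"tier": 0, "limitPerHour": 100, "actions": ["SomeCommandName"]}]'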

Philip Durbin 🚀 (Mar 27 2025 at 15:49):

But I'm wondering if there's a command you can target. :thinking:

Philip Durbin 🚀 (Mar 27 2025 at 15:49):

When lots of uploads are happening, do you see a lot of the same command in the actionlogrecord table? https://guides.dataverse.org/en/6.5/admin/monitoring.html#actionlogrecord
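Something like this in psql should show it (a sketch; I'm going from memory on the column names, so double-check against the guide):

SELECT actiontype, actionsubtype, count(*)
FROM actionlogrecord
WHERE starttime > now() - interval '1 hour'
GROUP BY actiontype, actionsubtype
ORDER BY count(*) DESC;

-- for commands, actiontype is 'Command' and actionsubtype is the command class name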

Markus Haarländer (Mar 27 2025 at 16:02):

Thanks Phil.
There's a "CreateNewDataFilesCommand" in the actionlogrecord, but I'm not sure if rate limiting is the right way to go here.
But I'm glad to hear that the discussion about this is alive. We'll also have to discuss internally what numbers would make sense if such a feature becomes available. An instance-wide config option would be sufficient for us. I'm currently quite busy with another project, but when I find the time I'll try to understand the current implementation of the collection quotas and maybe get some ideas from it for such a file limit configuration.

Philip Durbin 🚀 (Mar 27 2025 at 16:05):

Sounds good!

Sherry Lake (Mar 27 2025 at 16:33):

@Markus Haarländer
Can you say more about what tables you edited to remove the files? We at the University of Virginia have such a dataset that I can't delete (over 14,000 files). See my question in the Google group: https://groups.google.com/g/dataverse-community/c/WFf34d8R0Aw

You can either reply here or send email to shlake@virginia.edu

Markus Haarländer (Mar 27 2025 at 17:30):

Hi Sherry
Here's a (not very sophisticated) SQL script that I used. I'm not 100% sure everything was removed, but it worked for us. The files didn't have any tags or restrictions; if they do, other tables may have to be cleaned too.

-- Clear the dataset thumbnail in case it points to one of the files being deleted
UPDATE dataset SET thumbnailfile_id = NULL WHERE id = '<dataset-id>';

-- Remove the file metadata rows of all files owned by the dataset
DELETE FROM filemetadata WHERE datafile_id IN (SELECT id FROM dvobject WHERE owner_id = '<dataset-id>');

-- Remove the datafile rows themselves
DELETE FROM datafile WHERE id IN (SELECT id FROM dvobject WHERE owner_id = '<dataset-id>');

-- Finally, remove the files' dvobject rows
DELETE FROM dvobject WHERE owner_id = '<dataset-id>' AND dtype = 'DataFile';
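As a sanity check afterwards, this should return 0 if everything was removed:

SELECT count(*) FROM dvobject WHERE owner_id = '<dataset-id>' AND dtype = 'DataFile';

You'll probably also want to reindex the dataset in Solr afterwards, since the search index will still reference the deleted files.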
