Hi @all, I have encountered issues registering many files after directly uploading them to S3 storage.
For context, upon direct upload the python-dvuploader library first uploads all files to the storage using the ticket system and then registers each file asynchronously at the instance. This is where I run into what I guess are rate-limiting problems: the library I am using throws an exception that the connection is closed, and I suspect it is simply too many requests at once or sent too frequently.
Hence, my question is: how many concurrent requests can the backend reasonably handle, and roughly how many per minute? I am trying to find a good default that fits most instances equally well while still being somewhat performant compared to synchronous requests.
I've been testing this with the demo Dataverse instance and locally using the Docker Compose variant with LocalStack. In both cases I have encountered this issue.
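For illustration only (this is not python-dvuploader's actual code): the failing pattern is roughly many parallel calls to the native API's single-file add endpoint, one request per already-uploaded file. The server URL, API token, persistent ID, and file list below are hypothetical placeholders.
```python
# Illustrative sketch: registering previously direct-uploaded files one
# request per file, in parallel. All identifiers below are placeholders.
import json
from concurrent.futures import ThreadPoolExecutor

import requests

SERVER = "https://demo.dataverse.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"


def register_file(storage_identifier: str, file_name: str) -> requests.Response:
    """Register one already-uploaded file via the single-file add endpoint."""
    json_data = {
        "storageIdentifier": storage_identifier,  # from the upload ticket
        "fileName": file_name,
        "mimeType": "application/octet-stream",
    }
    return requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"jsonData": (None, json.dumps(json_data))},
    )


files = [("s3://bucket:id1", "a.csv"), ("s3://bucket:id2", "b.csv")]

# Many simultaneous edits of the same dataset: each call needs to lock the
# dataset, so parallel calls quickly start failing.
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(lambda f: register_file(*f), files))
```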
I'm reminded of this cartoon. :sweat_smile:
bridge.png
Well, then it is time to load up the trucks :joy:
I'd say so. We'll learn something!
So I did a little more testing with a larger number of files (1000 small ones) and looked into the Dataverse logs. As soon as more than one request is sent simultaneously, the dataset goes into a lock, and I am not able to remove the lock via the UI. If the requests are sent synchronously, there is no issue and no lock.
Do you know what could be the cause of this? I have added the logs below:
2024-02-28 15:52:03 dev_dataverse | [#|2024-02-28T14:52:03.794+0000|SEVERE|Payara 6.2023.8|edu.harvard.iq.dataverse.datasetutility.AddReplaceFileHelper|_ThreadID=104;_ThreadName=http-thread-pool::http-listener-1(5);_TimeMillis=1709131923794;_LevelValue=1000;|
2024-02-28 15:52:03 dev_dataverse | Failed to add file to dataset.|#]
2024-02-28 15:52:03 dev_dataverse |
2024-02-28 15:52:03 dev_dataverse | [#|2024-02-28T14:52:03.795+0000|SEVERE|Payara 6.2023.8|edu.harvard.iq.dataverse.datasetutility.AddReplaceFileHelper|_ThreadID=104;_ThreadName=http-thread-pool::http-listener-1(5);_TimeMillis=1709131923795;_LevelValue=1000;|
2024-02-28 15:52:03 dev_dataverse | Dataset cannot be edited due to dataset lock.|#]
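As an aside, a minimal sketch for inspecting the locks on the dataset via the native API's dataset-locks endpoints, and for clearing them with a superuser token; the server URL, token, and dataset id are placeholders.
```python
# Hedged sketch: list (and, as a superuser, remove) locks on a dataset.
import requests

SERVER = "http://localhost:8080"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
DATASET_ID = 42  # database id of the locked dataset

headers = {"X-Dataverse-key": API_TOKEN}

# List the current locks on the dataset.
locks = requests.get(f"{SERVER}/api/datasets/{DATASET_ID}/locks", headers=headers)
print(locks.json())

# Remove all locks (requires a superuser API token).
removed = requests.delete(f"{SERVER}/api/datasets/{DATASET_ID}/locks", headers=headers)
print(removed.json())
```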
Sadly, I'm sort of not surprised that you're getting locks when sending files asynchronously.
What do you want to know the cause of? The locks? Or why you can't remove them? Or why you have to upload files synchronously? Or all of the above? :grinning:
Have you tried this? https://guides.dataverse.org/en/6.1/developers/s3-direct-upload-api.html#to-add-multiple-uploaded-files-to-the-dataset
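For reference, the linked guide describes registering all direct-uploaded files in a single call to the addFiles endpoint instead of one request per file. A minimal sketch, assuming placeholder server, token, persistent ID, and file metadata:
```python
# Hedged sketch: register every direct-uploaded file in one request.
import json

import requests

SERVER = "https://demo.dataverse.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"

# One entry per direct-uploaded file; storageIdentifier comes from the
# upload ticket, the checksum from hashing the local file. Values are
# placeholders.
json_data = [
    {
        "description": "First file",
        "fileName": "a.csv",
        "mimeType": "text/csv",
        "storageIdentifier": "s3://bucket:18ab39722140-50eb7d3c5ece",
        "checksum": {"@type": "MD5", "@value": "0386269a5acb2c57b4eadc38bb6b9abc"},
    },
    # ... more files ...
]

resp = requests.post(
    f"{SERVER}/api/datasets/:persistentId/addFiles",
    params={"persistentId": PID},
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},
)
print(resp.status_code, resp.json())
```
Because everything is registered in a single request, the dataset is only locked once, which avoids the contention seen with parallel single-file calls.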
Sometimes you can't see the forest for the trees :dizzy: That did the trick! Thanks Phil
Fantastic! :tada:
Jim wrote the code. I'm just the messenger. :grinning:
Should we improve the docs somehow? :thinking:
It's working flawlessly now, even with a bunch of files! Thanks again :smile:
Great!
Philip Durbin schrieb:
Should we improve the docs somehow? :thinking:
No, I think this was just my fault. I assumed that the default file-add endpoint was being used and that it was the only way. You never stop learning :grinning:
Ok, well, this should probably go in a new thread but I've been thinking that perhaps we need more tutorials in the API Guide.
We have https://guides.dataverse.org/en/6.1/api/getting-started.html#uploading-files but it only references the default way.
I think listing the direct upload feature would also be good; at the very least, it raises awareness of its existence. The sentence from the Native API docs would be sufficient to add to the "uploading files" guide:
when a Dataverse installation is configured to use S3 storage with direct upload enabled, there is API support to send a file directly to S3. This is more complex and is described in the Direct DataFile Upload/Replace API guide.
Sounds good. Do you want to make a PR?
Of course, just opened a PR for this
https://github.com/IQSS/dataverse/pull/10347
Thanks! :heart: I'm making a couple minor tweaks.
Merged! Thanks again!
Awesome! Thanks :heart:
Jan Range has marked this topic as resolved.