Stream: large-data

Topic: simultaneous upload


view this post on Zulip Simon Carroll (Mar 10 2025 at 11:53):

Good morning!
I am testing uploading 1 GB files simultaneously. I am quite often getting a failure with a 500 error, but nothing seems to be written to server.log. Has anyone done similar tests? Smaller file sizes seem to work fine. I wonder what is provoking the 500 error and how I can track it.

view this post on Zulip Philip Durbin 🚀 (Mar 10 2025 at 12:24):

@Simon Carroll this fix might help: https://github.com/gdcc/python-dvuploader/pull/24

view this post on Zulip Simon Carroll (Mar 10 2025 at 13:59):

Thanks! Let me see (I was actually using the native API before).

view this post on Zulip Simon Carroll (Mar 10 2025 at 14:55):

Simon Carroll said:

Thanks! Let me see (I was actually using the native API before).

OK, now I remember. We are not using S3, so the direct upload fails. I suppose it is not expected that 3 concurrent uploads would cause this with the Native API. I can try to investigate more.

view this post on Zulip Jan Range (Mar 11 2025 at 15:12):

@Simon Carroll are you using the Python-DVUploader native upload or the Native API directly?

view this post on Zulip Simon Carroll (Mar 12 2025 at 10:12):

Jan Range said:

Simon Carroll are you using the Python-DVUploader native upload or the Native API directly?

Good morning! I was using the Native API. I was seeing 500 errors when launching 3 concurrent uploads of 1 GB. Occasionally I was able to upload 2 concurrently, but normally launching 3 causes all of them to fail. I just tried using the Python-DVUploader, but since we don't have object storage (yet) it falls back to the Native API (I assume the results would be the same, but I haven't tested that yet). Is this somewhat expected or a surprising result? We are imagining the use case of several users/jobs uploading data at the same time.
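For context, this is roughly the kind of test being described: a minimal sketch (not the script attached later in this thread) of three concurrent native-API uploads using requests and a thread pool. The server URL, API token, DOIs and file names are placeholders.

```python
# Minimal sketch: several concurrent uploads through the native
# /api/datasets/:persistentId/add endpoint. Placeholder credentials.
import json
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://dataverse.example.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"


def add_file(persistent_id: str, path: str) -> requests.Response:
    """POST one file to a dataset via the native API add endpoint."""
    with open(path, "rb") as fh:
        return requests.post(
            f"{BASE_URL}/api/datasets/:persistentId/add",
            params={"persistentId": persistent_id},
            headers={"X-Dataverse-key": API_TOKEN},
            files={
                "file": (path, fh),
                "jsonData": (None, json.dumps({"description": "load test"})),
            },
        )


jobs = [
    ("doi:10.5072/FK2/AAAAAA", "file1.bin"),
    ("doi:10.5072/FK2/BBBBBB", "file2.bin"),
    ("doi:10.5072/FK2/CCCCCC", "file3.bin"),
]

with ThreadPoolExecutor(max_workers=3) as pool:
    for resp in pool.map(lambda job: add_file(*job), jobs):
        print(resp.status_code, resp.text[:200])
```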

view this post on Zulip Jan Range (Mar 12 2025 at 11:00):

Good morning @Simon Carroll :smile:

I assume that the 500 error stems from a dataset lock due to ingestion. This is typically the case for tabular files, which trigger an ingest lock so that no further uploads/edits to the dataset are possible. There are two ways to circumvent this:

* Zip files into an archive and upload it. If enabled, Dataverse will unzip the files and register each individually. This is the way Python-DVUploader handles this case in the non-S3 upload.
* Disable tabular ingest for the uploaded files (tabIngest: false), so no ingest job and thus no lock is created.

I think the latter is the easiest way to get around this, but the zipping workflow really shines when you have a lot of small files.

If you are uploading tabular files, this could potentially fix the issue :smile:
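To illustrate the first workaround above, here is a minimal sketch (placeholder URL, token, DOI and file names; it assumes the installation is configured to unpack uploaded zips) that packs several files into one archive and sends a single native-API request, letting Dataverse unzip and register each file individually.

```python
# Minimal sketch of the zip workaround: one request instead of many
# concurrent ones. Dataverse unpacks the archive server-side.
import json
import zipfile

import requests

BASE_URL = "https://dataverse.example.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"

# Pack the files into a single archive.
with zipfile.ZipFile("bundle.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in ["run1.csv", "run2.csv", "run3.csv"]:
        zf.write(name)

with open("bundle.zip", "rb") as fh:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/:persistentId/add",
        params={"persistentId": PERSISTENT_ID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={
            "file": ("bundle.zip", fh),
            "jsonData": (None, json.dumps({"description": "zipped batch"})),
        },
    )
print(resp.status_code, resp.text[:200])
```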

view this post on Zulip Jan Range (Mar 12 2025 at 11:01):

As a last instance, you could move to sequential uploads and check for dataset locks, but I guess that's not as efficient as concurrent uploads.
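A rough sketch of that sequential variant, polling the GET /api/datasets/{id}/locks endpoint between uploads. The connection details and dataset id are placeholders, and upload_file is a hypothetical helper standing in for a native-API POST like the one sketched earlier.

```python
# Minimal sketch: upload one file at a time and wait for the dataset's
# locks (e.g. an Ingest lock) to clear before sending the next one.
import time

import requests

BASE_URL = "https://dataverse.example.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
DATASET_ID = 42  # database id of the target dataset


def wait_for_unlock(dataset_id: int, poll_seconds: int = 5) -> None:
    """Block until GET /api/datasets/{id}/locks returns an empty list."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/datasets/{dataset_id}/locks",
            headers={"X-Dataverse-key": API_TOKEN},
        )
        resp.raise_for_status()
        if not resp.json().get("data"):
            return
        time.sleep(poll_seconds)


for path in ["file1.tab", "file2.tab", "file3.tab"]:
    wait_for_unlock(DATASET_ID)
    print(f"dataset unlocked, uploading {path} ...")
    # upload_file(path)  # hypothetical helper: a native-API POST as sketched above
```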

view this post on Zulip Simon Carroll (Mar 13 2025 at 08:40):

Good morning! Thanks a lot for the comprehensive feedback. I will try the different approaches to see what can work for us. Many thanks!

view this post on Zulip Simon Carroll (Mar 17 2025 at 10:02):

Jan Range said:

Good morning Simon Carroll :)

I assume that the 500 error stems from a dataset lock due to ingestion. This is typically the case for tabular files, which trigger an ingest lock so that no further uploads/edits to the dataset are possible. There are two ways to circumvent this:

I think the latter is the easiest way to get around this, but the zipping workflow really shines when you have a lot of small files.

If you are uploading tabular files, this could potentially fix the issue :)

Good morning! I am playing around. If I upload 2 files via the native API with tab ingest disabled, it seems one fails with an internal server error 500. I will attach an example. I am uploading the files into two separate datasets (in the same collection). The point is that it seems to be another problem, outside of the locks.

view this post on Zulip Simon Carroll (Mar 17 2025 at 10:02):

nativeAPIuploadTabIngestDisabledFailed.log

view this post on Zulip Simon Carroll (Mar 17 2025 at 10:04):

I don't see anything in the server logs, which is quite strange. Is there some class I need to explicitly add to the debug options that can help?

view this post on Zulip Jan Range (Mar 18 2025 at 21:36):

@Simon Carroll thanks for testing! That is odd, given that tabIngest is turned off. Can you send me the script you are using? For debugging, upon failure the function will raise an error with the message returned by the Dataverse instance. Do you have a full traceback to inspect where the error is happening?

view this post on Zulip Simon Carroll (Mar 19 2025 at 10:26):

OK, here comes a bombardment. Here is the Python script:

dataverse_uploader.py

Here is a log of a single upload:

singleUpload.log

Here is a concurrent upload with ingestion on:
TwoConcurrentUploadsDiffEnvIngestionOn.log

And here is one with it off:
TwoConcurrentUploadsDiffEnvIngestionOff.log

I have included the errors from just one environment. There is actually no error in Dataverse, and I have noticed that the file that seems to fail via the API upload is in Dataverse and valid. I suppose this is why I am not seeing an error log server side?

view this post on Zulip Jan Range (Mar 19 2025 at 13:55):

Thanks for providing the files! Now it is a bit clearer, because I thought you were using python-dvuploader.

Have you tried adding tabIngest to the jsonData payload? As far as I know, passing it in the query parameters does not work, but maybe I am wrong @Philip Durbin ☀️? In the docs it says to add it to the payload.

view this post on Zulip Philip Durbin 🚀 (Mar 19 2025 at 13:58):

Yes, it looks like tabIngest:false goes into the payload, the JSON you send.
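For reference, a minimal sketch of what that looks like with requests. The API guide's examples pass tabIngest as the string "false" inside jsonData; the URL, token, DOI and file name below are placeholders.

```python
# Minimal sketch: disable tabular ingest for a single file by putting
# tabIngest in the jsonData part of the multipart request.
import json

import requests

BASE_URL = "https://dataverse.example.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"

json_data = {
    "description": "Raw upload, no ingest",
    "tabIngest": "false",  # in the payload, not in the query parameters
}

with open("data.csv", "rb") as fh:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/:persistentId/add",
        params={"persistentId": PERSISTENT_ID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={
            "file": ("data.csv", fh),
            "jsonData": (None, json.dumps(json_data)),
        },
    )
print(resp.status_code, resp.text[:200])
```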

view this post on Zulip Philip Durbin 🚀 (Mar 19 2025 at 13:59):

@Simon Carroll which version of Dataverse are you running? tabIngest:false might be somewhat new. :thinking:

view this post on Zulip Simon Carroll (Mar 20 2025 at 09:07):

OK, thanks. With the param in the jsonData it works as expected. About this:

"* Zip files into an archive and upload it. If enabled, Dataverse will unzip the files and register each individually. This is the way Python-DVUploader handles this case in the non-S3 upload."

Do you mean the Python library does this automatically in the case of a non-S3 upload, when falling back on the native API?

view this post on Zulip Simon Carroll (Mar 20 2025 at 09:09):

Philip Durbin ☀️ said:

Simon Carroll which version of Dataverse are you running? tabIngest:false might be somewhat new. :thinking:

6.5 but I suppose it came down to me not reading the documentation properly :)

view this post on Zulip Jan Range (Mar 20 2025 at 14:15):

@Simon Carroll yes, the Python library takes care of zipping the data and shipping it. Data larger than 2 GB will be split into multiple zips and uploaded simultaneously.
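For anyone following along, the library usage looks roughly like this: a sketch based on the gdcc/python-dvuploader README, so exact argument names may vary between versions, and the URL, token and DOI are placeholders.

```python
# Rough sketch of a python-dvuploader upload; in the non-S3 (native API)
# case the library handles the zipping described above behind the scenes.
import dvuploader as dv

files = [
    dv.File(filepath="./data/run1.csv"),
    dv.File(filepath="./data/run2.csv"),
]

uploader = dv.DVUploader(files=files)
uploader.upload(
    api_token="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    dataverse_url="https://dataverse.example.org",
    persistent_id="doi:10.5072/FK2/EXAMPLE",
    n_parallel_uploads=2,  # parallelism mainly matters for direct (S3) uploads
)
```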


Last updated: Nov 01 2025 at 14:11 UTC