Stream: troubleshooting

Topic: Direct file upload to S3 using pyDataverse


view this post on Zulip Philipp Conzett (Nov 21 2025 at 14:50):

@Jan Range @Philip Durbin 🚀 A user of our installation is trying to create datasets and upload larger files using pyDataverse. It seems the files are uploaded through the web server, not directly to our S3 storage. Does pyDataverse support direct S3 upload?

view this post on Zulip Jan Range (Nov 21 2025 at 14:53):

@Philipp Conzett pyDataverse does not yet support direct S3 uploads, but python-dvuploader does. We are currently working on the next version, which includes some fixes for compatibility with 6.8. Hence, if you are using 6.8, we advise using the current main branch until the next version is released.
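
For reference, installing from the main branch would look something like this (the repository URL is assumed here, not stated above):

pip install git+https://github.com/gdcc/python-dvuploader.git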

view this post on Zulip Philipp Conzett (Nov 21 2025 at 14:55):

Thanks, @Jan Range! We are at v6.6 of Dataverse. Does python-dvuploader support v6.6?

view this post on Zulip Jan Range (Nov 21 2025 at 14:57):

Yes, 6.x versions are supported and tested. The only issue with 6.8 was a change in how directoryLabel is processed, but this is fixed on the main branch. Since you are using 6.6, the PyPI version can be used.
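
For reference, a minimal sketch of using the PyPI release (the package is published on PyPI as dvuploader; the URL, token, and DOI below are hypothetical placeholders):

pip install dvuploader

followed by, in Python:

import dvuploader as dv

# One or more files to upload; dvuploader uses direct S3 upload
# when the target installation has it enabled for the store.
files = [dv.File(filepath="data/image_001.tif")]

uploader = dv.DVUploader(files=files)
uploader.upload(
    api_token="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    dataverse_url="https://dataverse.example.org",
    persistent_id="doi:10.12345/EXAMPLE",
    n_parallel_uploads=2,
)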

view this post on Zulip Philipp Conzett (Nov 21 2025 at 14:58):

Thanks! I'll point the user to python-dvuploader. :+1:

view this post on Zulip Philipp Conzett (Dec 05 2025 at 14:36):

Hi @Jan Range, all, the user mentioned above has successfully adapted a python-dvuploader script to upload files to DataverseNO. However, he's experiencing some issues, which we're trying to resolve. Below [1], I've copied some parts of our conversation with the user. Do you have any idea what causes these issues?

[1] Extracts from conversation with user:

#####

I've continued the upload process and it still fails at random intervals. I think today I got the internal server error once; then there have been HTTP read errors reported by the python-dvuploader library. The upload of some of the datasets has gone through without errors, but mostly there has been at least one failure per dataset (each has 96 files). There is also large variation in upload times: every now and then some files are much slower to upload.

Another problem is that in a few datasets the upload process seems to go through without any errors, but one or two files are missing from the dataset. I modified my upload script so that it first checks which files have already been uploaded and then uploads only the missing ones, so that I can easily iterate until all the files have been uploaded correctly.

#####

From the server logs, I find this error in connection with DOI 10.18710/ZS5KYE:

“com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond

Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond”

My guess here is that there were connection issues with the S3 storage server at the time of uploading.
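
For reference, the "upload only the missing files" check described above can be sketched against the Dataverse native API roughly as follows (a minimal sketch; BASE_URL, API_TOKEN, ds_pid, and the local file list are hypothetical stand-ins for the user's values):

from pathlib import Path

import requests

# Hypothetical local set of files that should end up in the dataset.
local_files = sorted(Path("tiffs").glob("*.tif"))

# List the files already present in the dataset's latest version.
resp = requests.get(
    f"{BASE_URL}/api/datasets/:persistentId/versions/:latest/files",
    params={"persistentId": ds_pid},
    headers={"X-Dataverse-key": API_TOKEN},
)
resp.raise_for_status()
uploaded = {f["dataFile"]["filename"] for f in resp.json()["data"]}

# Re-upload only what is missing, then repeat until nothing is left.
missing = [p for p in local_files if p.name not in uploaded]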

view this post on Zulip Jan Range (Dec 10 2025 at 07:49):

@Philipp Conzett thanks for sharing the feedback! My first guess is that there may be too many requests hitting the server, and it stops responding. Restricting the number of parallel uploads can sometimes help.

Another suspect could be the 1 MB data chunks sent by the generator: they may be too small, so that processing one chunk while the next one arrives could overwhelm the upload handler. But this is just a guess.

I have published a new version, 0.3.1, which resolves some recent issues similar to the ones you have shared with me.

view this post on Zulip Jan Range (Dec 10 2025 at 07:55):

It could be that httpx's async client limits apply unevenly. That is, there is a connection pool of X workers and 96 distributed tasks, which are executed not in the order given in the code but in an effectively random order. This could explain the varying upload progress. I was hoping that applying the limits on the client would achieve a balanced execution. I will look into this.
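
For illustration, capping the connection pool in httpx looks like this; this is a sketch of the concept only, not necessarily how python-dvuploader configures its client internally:

import httpx

# A small pool forces the 96 queued tasks to share a few
# connections instead of opening one connection per task.
limits = httpx.Limits(max_connections=4, max_keepalive_connections=4)
client = httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(60.0))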

view this post on Zulip Jan Range (Dec 10 2025 at 08:18):

Ah, and I forgot: would it be possible to share the Python tracebacks?

view this post on Zulip Philipp Conzett (Dec 12 2025 at 05:59):

Thanks, @Jan Range, for looking into this! I'll share your comments with the depositor and ask for Python tracebacks.

view this post on Zulip Philipp Conzett (Dec 12 2025 at 13:22):

@Jan Range, unfortunately, the depositor no longer has access to the tracebacks, but he informed me that the code uploaded the images one by one, with a 5-second sleep between files, using the following snippet:

import dvuploader as dv

files = [dv.File(filepath=tif)]
dvuploader = dv.DVUploader(files=files, tab_ingest=False)
dvuploader.upload(api_token=API_TOKEN, dataverse_url=BASE_URL, persistent_id=ds_pid, n_parallel_uploads=1)

He also suspected that overall server load may have something to do with the problem.
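
Pieced together from that description, the surrounding loop presumably looked roughly like this (a reconstruction, not the depositor's actual code; tif_paths stands in for their list of image files):

import time

import dvuploader as dv

for tif in tif_paths:
    files = [dv.File(filepath=tif)]
    uploader = dv.DVUploader(files=files)
    uploader.upload(
        api_token=API_TOKEN,
        dataverse_url=BASE_URL,
        persistent_id=ds_pid,
        n_parallel_uploads=1,
    )
    time.sleep(5)  # 5-second pause between files, as described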

