i've been starting to use python-dvuploader's cli for most of our uploads, and i'm running into an issue with zip files that have been "double-zipped" to work around cases where a zip file in the dataset contains more files than the Dataverse limit. the upload seems to succeed, but results in an error like ValueError: ('File DXXFAM.zip.zip not found in Dataverse repository.', 'This may be due to the file not being uploaded to the repository:').
i'm guessing this is fine, as the unpacked double-zipped files' contents are actually there. any suggestions about what the error handling logic might be here?
@María A. Matienzo Thanks for the feedback! The issue likely stems from the postponed metadata update. I guess you are using the non-S3-upload?
In this case, all files provided are zipped and uploaded. Due to the zipping, the individual file metadata cannot be passed, and this call updates the metadata for each file afterwards. My guess is that this is what's causing the issue.
I will look into this and replicate it locally. Maybe the double-zipped case is an edge case I need to take care of.
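To illustrate what I mean, here is a rough sketch of that mapping step; the function and variable names are made up for the example and are not the actual dvuploader internals:

```python
# Rough sketch of the post-upload metadata mapping step; `update_file_metadata`,
# `local_files`, and `remote_files` are illustrative names only.

def update_file_metadata(local_files: list[str], remote_files: list[str]) -> None:
    """Update the metadata of each uploaded file by matching it to the dataset listing."""
    for name in local_files:
        if name not in remote_files:
            # Dataverse unpacks the outer zip, so e.g. "DXXFAM.zip.zip" never
            # shows up in the dataset listing and the lookup fails here.
            raise ValueError(
                f"File {name} not found in Dataverse repository.",
                "This may be due to the file not being uploaded to the repository:",
            )
        # ... send the metadata update for the matched file ...


# The double-zipped archive reproduces the reported error:
update_file_metadata(["DXXFAM.zip.zip"], ["file_a.csv", "file_b.csv"])
```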
yes, that's correct - we're still using the native API as opposed to direct upload.
i'll also note that the case where a zip file is not double-zipped and contains more files than the limit fails silently with dvuploader, which leads to a retry loop (this is based on testing a rebased version of the branch for the tabIngest PR).
silent failure in this case means that it's not reported back from dvuploader to the user, despite the API endpoint returning an error.
That's good to know! I was not aware of this. I guess it would make sense to explicitly check for this here.
I am checking for the status through raise_for_status, but it seems like it is not catching the error. Is the status code a different one in this case?
i'm not sure what the HTTP return code for this is coming from Dataverse, but the following message is returned as JSON:
{"status":"ERROR","message":"The number of files in the zip archive is over the limit (1000); please upload a zip archive with fewer files, if you want them to be ingested as individual DataFiles."}
Okay, it may have a different status code that httpx does not recognize as an error. Reproducing now and will look into the status code.
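For reference, regardless of which status code Dataverse returns, checking the JSON body explicitly would catch it. Something along these lines, assuming an httpx response; the function name is a placeholder, not current dvuploader code:

```python
# Hedged sketch of an explicit body check, independent of the exact HTTP status
# code Dataverse returns for the zip-limit error.
import httpx

ZIP_LIMIT_HINT = "The number of files in the zip archive is over the limit"

def check_upload_response(response: httpx.Response) -> None:
    """Surface the zip-limit error even if the HTTP status alone slips through."""
    try:
        body = response.json()
    except ValueError:
        body = {}
    if body.get("status") == "ERROR" and ZIP_LIMIT_HINT in body.get("message", ""):
        # Fail loudly instead of letting the caller retry the same upload.
        raise ValueError(body["message"])
    # Still raise on ordinary 4xx/5xx responses.
    response.raise_for_status()
```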
Okay, I got both cases fixed:
The zip-zip case was related to the update metadata function, which tries to map the local files to the ones at Dataverse. Since the zip is unpacked and not present in the dataset, this case is now skipped in the update step, as there is nothing to update.
The zip limit case is now handled explicitly as well. Dataverse returns a 400 in this case, and if the status code and the message match, the code raises a ValueError and stops the upload process. Hence, the retry loop is broken too.
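Roughly, the two fixes look like this; the names and the httpx usage are illustrative, not the exact code that will land in the PR:

```python
# Rough sketch of both fixes, with illustrative function names.
import httpx

ZIP_LIMIT_HINT = "The number of files in the zip archive is over the limit"

def update_file_metadata(local_files: list[str], remote_files: set[str]) -> None:
    for name in local_files:
        if name not in remote_files:
            # Zip-zip case: the outer zip was unpacked server-side, so there is
            # nothing to update for it -- skip instead of raising a ValueError.
            continue
        # ... send the metadata update for the matched file ...

def handle_upload_response(response: httpx.Response) -> None:
    if response.status_code == 400:
        message = response.json().get("message", "")
        if ZIP_LIMIT_HINT in message:
            # Zip limit case: abort the upload instead of entering the retry loop.
            raise ValueError(message)
    response.raise_for_status()
```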
Since these are rather small changes, I would ship them with the tabIngest PR after I have added test cases for this. Thanks again for raising awareness of this :smile:
thank you! this is great. :)
Perfect! I will add these to the PR this week and merge it :smile:
@María A. Matienzo The new version of python-dvuploader has just been released :raised_hands:
https://pypi.org/project/dvuploader/
Nice, I added it to the upcoming news.
wonderful!