Stream: python

Topic: uploading .tar.gz files via API


view this post on Zulip Kai König (Dec 13 2024 at 17:37):

Looks like something is wrong with the demo page? https://demo.dataverse.org/
image.png

view this post on Zulip Kai König (Dec 13 2024 at 17:43):

I really wonder if I just killed it, because it was working until now and I just tried to send a compressed TGZ archive (to test a feature for the galaxy integration). The reason is that zip files get unarchived automatically and I wanted to test if the same happens with .tgz files.

This is the response I just got after I tried to upload it via API and since then the server is not responsive anymore:

raised unexpected: Exception('Request to https://demo.dataverse.org/api/v1/datasets/:persistentId/add?persistentId=doi:10.70122/FK2/3HKFAU failed with status code 400: Failed to add file to dataset.')

view this post on Zulip Kai König (Dec 13 2024 at 17:43):

CC @Philip Durbin 🚀

view this post on Zulip Kai König (Dec 13 2024 at 17:44):

reimport-test-3.tar.gz
this should be the file that was sent, just a tar.gz file with two images inside

view this post on Zulip Kai König (Dec 13 2024 at 18:03):

page is back up apparently

view this post on Zulip Philip Durbin 🚀 (Dec 13 2024 at 18:25):

Sorry, we just released Dataverse 6.5 (#community > Dataverse 6.5 is here! ) and were updating the demo site to it.

view this post on Zulip Kai König (Dec 14 2024 at 08:37):

haha okay I'm glad :D was just in exactly that moment it went down

view this post on Zulip Kai König (Dec 14 2024 at 08:40):

I still get the 400 error though. Are .tar.gz files not accepted?

view this post on Zulip Kai König (Dec 14 2024 at 08:50):

hmm I tested it directly via API and it uploads fine actually. I will have to investigate further what exactly Galaxy is sending.

curl -H "X-Dataverse-key:XXX" -X POST -F file=@test.tar.gz "https://demo.dataverse.org/api/datasets/:persistentId/add?persistentId=doi:10.70122/FK2/DIG2DG"

view this post on Zulip Kai König (Dec 14 2024 at 11:04):

It's weird, I can't see why it works with curl and not with python. I had a look at the raw requests by using both postman and python to send the post requests to https://httpbin.org/post and the requests look virtually equal. However curl successfully uploads the file and in python I get failed with status code 400: Failed to add file to dataset.

Screenshot 2024-12-14 at 12.02.40.png
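
Editor's sketch: besides posting to httpbin, the request can be prepared locally and its multipart body dumped without sending anything. The helper name `dump_multipart` and the dummy payload are made up for illustration, and the URL is never contacted. It assumes the same `files=` tuple shape as the Galaxy code.

```python
import io
from typing import Optional, Tuple

import requests


def dump_multipart(filename: str, content: bytes,
                   content_type: Optional[str] = None) -> Tuple[str, bytes]:
    """Build the upload request without sending it; return (Content-Type header, raw body)."""
    fileobj = io.BytesIO(content)
    # Two-element tuple: no explicit per-part type. Three-element tuple: explicit type.
    part = (filename, fileobj) if content_type is None else (filename, fileobj, content_type)
    request = requests.Request(
        "POST",
        "https://demo.dataverse.org/api/datasets/:persistentId/add",  # never contacted here
        files={"file": part},
        data={},
    )
    prepared = request.prepare()
    return prepared.headers["Content-Type"], prepared.body


if __name__ == "__main__":
    header, body = dump_multipart("test.tar.gz", b"\x1f\x8b dummy bytes")
    print(header)
    print(body.decode("latin-1"))
```

With the two-element tuple, the file part carries no `Content-Type` line of its own, which is one visible difference from a typical `curl -F` upload. Later in this thread, setting it explicitly did not make the error go away, so treat this as a comparison aid rather than a diagnosis.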

view this post on Zulip Kai König (Dec 14 2024 at 11:13):

It would be amazing if somebody could help me out with the dataverse server logs next week :folded_hands: . I just made two requests via galaxy (python), one with the failing .tar.gz file and one with the working .zip (time is CET):

[2024-12-14 12:10:55,328: DEBUG/main] https://demo.dataverse.org:443 "POST /api/v1/datasets/:persistentId/add?persistentId=doi:10.70122/FK2/3HKFAU HTTP/1.1" 400 61 [2024-12-14 12:10:55,330: WARNING/main] RESPONSE: {'status': 'ERROR', 'message': 'Failed to add file to dataset.'}

[2024-12-14 12:12:23,311: DEBUG/main] https://demo.dataverse.org:443 "POST /api/v1/datasets/:persistentId/add?persistentId=doi:10.70122/FK2/3HKFAU HTTP/1.1" 200 None [2024-12-14 12:12:23,312: WARNING/main] RESPONSE: {'status': 'OK', 'message': {'message': 'This file has the same content as test.txt_64e0efaf9500cb29.txt that is in the dataset. '} ...,

view this post on Zulip Kai König (Dec 14 2024 at 11:16):

oh and I just made a third one, the working curl request with the .tar.gz at 12:15 CET:

curl -H "X-Dataverse-key:XXX" -X POST -F file=@small-file-history-test.tar.gz "https://demo.dataverse.org/api/datasets/:persistentId/add?persistentId=doi:10.70122/FK2/3HKFAU"
{"status":"OK","message":{"message":"This file has the same content as small-file-history-test.tar.gz that is in the dataset. "},"data":{"files":[{"description":"","label":"small-file-history-test.tar-3.gz","restricted":false,"version":1,"datasetVersionId":276019,"dataFile":{"id":2476460,"persistentId":"doi:10.70122/FK2/3HKFAU/9RJV3C","pidURL":"https://doi.org/10.70122/FK2/3HKFAU/9RJV3C","filename":"small-file-history-test.tar-3.gz","contentType":"application/gzip","friendlyType":"Gzip Archive","filesize":1894,"description":"","storageIdentifier":"s3://demo-dataverse-org:193c4e10eff-c0474fd01718","rootDataFileId":-1,"md5":"f804a9a4e5f8f373dd87938ad1d01325","checksum":{"type":"MD5","value":"f804a9a4e5f8f373dd87938ad1d01325"},"tabularData":false,"creationDate":"2024-12-14","fileAccessRequest":false}}]}}%
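
Editor's sketch: the response above reports an `md5` for the stored file, so a local digest can show whether the bytes that arrived match the bytes on disk. The helper names `local_md5` and `matches_server` are illustrative, and the response layout is assumed to be the one shown above.

```python
import hashlib


def local_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_server(path: str, response_json: dict) -> bool:
    """Compare the local digest with the md5 reported for the first file in the response."""
    reported = response_json["data"]["files"][0]["dataFile"]["md5"]
    return local_md5(path) == reported
```

For example, `matches_server("small-file-history-test.tar.gz", resp.json())` would be expected to return True if the upload was byte-identical.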

view this post on Zulip Philip Durbin 🚀 (Dec 16 2024 at 13:12):

Oh, so it wasn't the demo site being down. :thinking:

@Kai König what's the latest, please? It works with curl but not Python? (We might want to move this topic to #python.)

Do you want to give it a try on https://beta.dataverse.org ?

view this post on Zulip Kai König (Dec 16 2024 at 13:13):

yes exactly works with curl but not with python, my last messages are still the current state. I ignored this issue for now because zip upload works

view this post on Zulip Notification Bot (Dec 16 2024 at 13:14):

This topic was moved here from #community > uploading .tar.gz files via API by Philip Durbin 🚀.

view this post on Zulip Philip Durbin 🚀 (Dec 16 2024 at 13:15):

Can you please show us your python script?

view this post on Zulip Kai König (Dec 16 2024 at 13:18):

Sure!

        with open(file_path, "rb") as file:
            files = {'file': (filename, file)}
            payload = dict()
            add_files_url = self.add_files_to_dataset_url(dataset_id)
            response = requests.post(
                add_files_url,
                data=payload,
                files=files,
                headers=headers)
            self._ensure_response_has_expected_status_code(response, 200)

    def add_files_to_dataset_url(self, dataset_id: str) -> str:
        return f"{self.api_base_url}/datasets/:persistentId/add?persistentId={dataset_id}"

view this post on Zulip Jan Range (Dec 16 2024 at 14:25):

@Kai König thanks for reaching out! I'll read through the messages and get back to you asap

view this post on Zulip Jan Range (Dec 16 2024 at 14:37):

@Kai König are you using pyDataverse or plain requests? I have just tested it using the former and everything works well using tar.gz.

image.png

view this post on Zulip Jan Range (Dec 16 2024 at 14:45):

When using requests, it's essential to provide the form-data section jsonData as a string. Providing it as a dict may lead to issues. This might explain why the replace endpoint didn't function correctly too.

You may want to check out the pyDataverse implementation as guidance.

@Philip Durbin 🚀 would it make sense to mention this in the general docs or is it too specific for Python?

view this post on Zulip Philip Durbin 🚀 (Dec 16 2024 at 14:46):

Well, dicts are Python-specific.

view this post on Zulip Philip Durbin 🚀 (Dec 16 2024 at 14:47):

Let's see if @Kai König is unblocked now. Thanks for helping! Then we can figure out where to highlight the fix in the docs.

view this post on Zulip Kai König (Dec 16 2024 at 14:59):

thanks guys! Will have a look at this tomorrow. If what you wrote is the source of this issue, I would find it weird anyway. Because the .zip file uses the exact same function and it works without problems there.

view this post on Zulip Philip Durbin 🚀 (Dec 16 2024 at 15:00):

@Kai König please do feel free to open an issue at https://github.com/IQSS/dataverse/issues about the confusion

view this post on Zulip Jan Range (Dec 16 2024 at 15:52):

It depends: if you post metadata such as a description, the jsonData field needs to be a string. Very odd, but otherwise you'll get an error. But I missed that you are in fact not passing any metadata, so that might not be relevant here. It probably is in the replace case, though.

I have used requests to reproduce your error, but I was not able to. Here is the code I have been using to upload a tar.gz file:

from rich import print

import json
import requests

pid = "doi:10.70122/FK2/4ZCAHN"
url = f"https://demo.dataverse.org/api/datasets/:persistentId/add?persistentId={pid}"
files = {
    "file": ("some_other_name.tar.gz", open("test.tar.gz", "rb"), "application/octet-stream")
}

metadata = json.dumps({
    "description": "Look, I am a DataFile!",
})

headers = {"X-Dataverse-Key": "..."}
resp = requests.post(
    url,
    files=files,
    data={"jsonData": metadata},
    headers=headers
)

print(resp.json())

image.png

view this post on Zulip Jan Range (Dec 16 2024 at 16:00):

By the way, you get this when you don't serialize the payload to a string.

image.png
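
Editor's sketch of the client-side half of this: what requests encodes for `jsonData` when it is given a JSON string versus a plain dict. Nothing is sent, so the server's reply is not reproduced here. The helper name `encode_form` and the dummy file content are illustrative.

```python
import io
import json

import requests


def encode_form(json_data) -> bytes:
    """Prepare (but do not send) a multipart upload and return its raw body."""
    request = requests.Request(
        "POST",
        "https://demo.dataverse.org/api/datasets/:persistentId/add",  # never contacted here
        files={"file": ("test.tar.gz", io.BytesIO(b"dummy"), "application/octet-stream")},
        data={"jsonData": json_data},
    )
    return request.prepare().body


metadata = {"description": "Look, I am a DataFile!"}
as_string = encode_form(json.dumps(metadata))  # jsonData part holds the JSON text
as_dict = encode_form(metadata)                # jsonData part holds only the dict's keys

if __name__ == "__main__":
    print(as_string.decode("latin-1"))
    print(as_dict.decode("latin-1"))
```

As far as current requests releases behave, a dict value in `data=` is iterated like any other iterable, so only its keys are sent and the description text never leaves the client. Serializing with `json.dumps` first avoids that.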

view this post on Zulip Kai König (Dec 17 2024 at 09:01):

@Jan Range thanks for looking into this! Well it is odd, I tried again and added "application/octet-stream" as filetype and completely removed the metadata parameter. But I still get the same error.

view this post on Zulip Kai König (Dec 17 2024 at 09:02):

So I guess the galaxy application is doing something to the file that makes the request fail. The server logs would definitely be helpful
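
Editor's sketch: one cheap way to narrow this down is to check that the bytes handed to requests really are a gzip-compressed tar archive. It uses only the standard library, and the helper name `describe_targz` is illustrative.

```python
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def describe_targz(data: bytes) -> list:
    """Return the member names if data is a gzip-compressed tar archive, else raise ValueError."""
    if not data.startswith(GZIP_MAGIC):
        raise ValueError("not gzip data (magic bytes missing)")
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
            return archive.getnames()
    except tarfile.TarError as exc:
        raise ValueError(f"gzip data, but not a readable tar archive: {exc}") from exc
```

Calling it on the exact bytes read inside the `with open(file_path, "rb")` block would show whether the file has been altered before the upload.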

view this post on Zulip Kai König (Dec 17 2024 at 09:02):

but tbh, it's not a super high priority, because people can just export as zip and that works

view this post on Zulip Kai König (Dec 17 2024 at 09:03):

I'm going to integrate the last feature and then will try to wrap up the integration. If anyone provides me with server logs or more insights, I might look into this again.

view this post on Zulip Jan Range (Dec 17 2024 at 11:23):

@Kai König Regarding server logs, you can also host Dataverse via Docker locally. There should be no difference to the Demo website. It helps a lot with debugging, especially when the API responses are not specific enough. You can clone the main repo and run the following, given mvn is installed:

mvn -Pct clean package docker:start

I usually do it that way, but @Philip Durbin 🚀 may have better approaches?

Also, if you want to CI/CD test your Python library, I highly recommend using our GitHub Action. Here is an example workflow, which you can mostly copy-paste.

view this post on Zulip Kai König (Dec 17 2024 at 11:51):

Thanks Jan, I'm basically finished now with the integration and in my experience creating a local env is always more work than expected... Is the docker setup really easy or do I have to configure anything?

view this post on Zulip Jan Range (Dec 17 2024 at 12:22):

From my experience, it has always been straightforward. There was no additional configuration necessary, except for adding certain services. I just found the part in the docs, if that helps:

https://guides.dataverse.org/en/latest/container/dev-usage.html

view this post on Zulip Philip Durbin 🚀 (Dec 17 2024 at 12:49):

The quickstart ( https://guides.dataverse.org/en/latest/developers/dev-environment.html#quickstart ) should "just work" and if it doesn't, please let us know in #containers! :sweat_smile:
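
Editor's sketch: once a local instance is running, a small check that it answers can save time before pointing the upload code at it. This assumes the instance listens on http://localhost:8080 and exposes the `/api/info/version` endpoint; both are assumptions about the local setup, and the function names are illustrative.

```python
import json
import urllib.request


def parse_version(payload: dict) -> str:
    """Extract the version string from an /api/info/version style response."""
    if payload.get("status") != "OK":
        raise RuntimeError(f"unexpected response: {payload}")
    return payload["data"]["version"]


def local_dataverse_version(base_url: str = "http://localhost:8080") -> str:
    """Ask a running instance for its version (performs a real HTTP request)."""
    with urllib.request.urlopen(f"{base_url}/api/info/version", timeout=10) as resp:
        return parse_version(json.load(resp))
```

If `local_dataverse_version()` returns a version string, the API is reachable and the failing upload can be retried against the local base URL while watching the container logs.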


Last updated: Nov 01 2025 at 14:11 UTC