S3 Downloads · python · Zulip Chat Archive

A couple of months ago, @Oliver Bertuch requested optimized S3 downloads that could be parallelized. We’ve identified potential libraries to achieve this, but I’m stuck figuring out the S3 URLs. I know these URLs are typically exposed through the redirect when using the DataAccess API, but I’m wondering if there’s a more efficient way to obtain them.

My concern is that not all files will be stored in S3, so the download might fail on instances that use a different storage backend. Therefore, I’d like to distinguish the two cases: support S3 downloads whenever possible, and fall back to normal HTTP downloads using httpx otherwise.

Or would you suggest that using the redirect URL and checking whether "s3" is in storageIdentifier is sufficient?

Example DataFile Output

        "dataFile": {
          "checksum": {
            "type": "MD5",
            "value": "7961fc2b94cc8cef2ae4d143021394e0"
          },
          "contentType": "text/plain",
          "creationDate": "2022-01-30",
          "description": "",
          "fileAccessRequest": false,
          "filename": "DeactivationOfGlucoseOxidase_Host.omex",
          "filesize": 10743,
          "friendlyType": "Plain Text",
          "id": 90554,
          "md5": "7961fc2b94cc8cef2ae4d143021394e0",
          "persistentId": "doi:10.18419/DARUS-2469/1",
          "rootDataFileId": -1,
          "storageIdentifier": "s3://fokus-dv-prod-1:17eac74da1d-96e7a66f41eb",
          "tabularData": false
        }
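
To make the storageIdentifier idea concrete, here is a minimal sketch of the check, not a tested implementation. It assumes the part before "://" in storageIdentifier is literally "s3" for S3-backed files, which may not hold on every installation. The helper names storage_prefix and choose_download_strategy are hypothetical, and the actual S3 and httpx download calls are left out.

```python
def storage_prefix(data_file: dict) -> str:
    """Return the part of storageIdentifier before '://', lowercased.

    Returns an empty string when the field is missing or has no prefix.
    """
    identifier = data_file.get("storageIdentifier") or ""
    prefix, sep, _ = identifier.partition("://")
    return prefix.lower() if sep else ""


def choose_download_strategy(data_file: dict) -> str:
    """Pick 's3' when the file looks S3-backed, otherwise fall back to 'http'.

    Assumption: an 's3' prefix means the file can be fetched from S3 directly.
    """
    return "s3" if storage_prefix(data_file) == "s3" else "http"


# Trimmed version of the dataFile entry shown above.
example = {
    "id": 90554,
    "storageIdentifier": "s3://fokus-dv-prod-1:17eac74da1d-96e7a66f41eb",
}
print(choose_download_strategy(example))
```

Anything that does not clearly start with "s3://" (including a missing storageIdentifier) goes down the plain HTTP path, so the fallback stays the default.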

Stream: python

Topic: S3 Downloads

Jan Range (Nov 17 2025 at 14:55):

Philip Durbin 🚀 (Nov 17 2025 at 15:15):