@Oliver Bertuch had the idea of providing a custom filesystem to remotely access and upload files via PyDataverse. This works already and will soon be put into a PR.
Another idea was to allow crawling ZIP files, similar to the Zip Previewer. Unfortunately the previewer initially downloads the Zip file and then displays the content. Hence my question, is there a way or could you think of providing this as a Dataverse endpoint? At least listing the content would be beneficial.
I may not be following very well...
I did see this issue about a PyFilesystem implementation: https://github.com/gdcc/pyDataverse/issues/178
But now you're asking about zip files? You want to preview the contents from pyDataverse?
Exactly! The idea goes like this: someone wants to use a single file from a ZIP living on Dataverse. Using the ZIP Pyfilesystem backed by the Dataverse Pyfilesystem, you would be able to retrieve it, without downloading all of it.
Does the existing HTTP Range header support help here? Or do we need a new API endpoint? You just want the list of contents in the zip file, right?
I'm not sure how the ZIP file previewer does it
It uses HTTPRange too
It must extract the list of files from the ZIP to display it
And then probably has coded in the ranges to be able to download a single file from the ZIP without downloading the whole ZIP file first
I have never used HTTP Range though. Would need to dig a bit into this.
Range has your name on it! Please see https://guides.dataverse.org/en/6.2/api/dataaccess.html#headers
Hehe should be familiar thing to me :grinning:
In my imagination the zip previewer/downloader uses the Range header to get the listing of files. Then it presents the list to the user. When the user clicks a file to download, it uses the Range header again to download just the bytes for that file.
I don't think it downloads the entire zip file to get the listing of files. I sure hope not.
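For example (just a sketch, with a made-up file id), a Range request against the regular Data Access API should return only the first kilobyte of a datafile:

```python
import requests

# Fetch only the first kilobyte of a datafile via the Data Access API;
# a server that honours the Range header answers with 206 Partial Content.
resp = requests.get(
    "https://demo.dataverse.org/api/access/datafile/42",  # made-up file id
    headers={"Range": "bytes=0-1023"},
)
print(resp.status_code)   # expected: 206
print(len(resp.content))  # expected: 1024
```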
Maybe there is already something in PyFilesystem to deal with this. Will check!
Probably https://docs.pyfilesystem.org/en/latest/_modules/fs/zipfs.html#ZipFS does not support remote ZIP files or range requests...
I have found another library that supports opening remote S3 and I guess HTTP too:
https://pypi.org/project/smart-open/
Giving it a try now
Oh wow!
Looks amazing!
The unicorn we needed :grinning:
Hmm does it support ZIP?
Loads ZIPs very well from remote! At least we are getting some binary. Checking the content now
Might need to add a compression handler...
Remote file was a tar file btw.
From looking at the library, I'm not sure it supports extracting the list of files or extracting parts of the ZIP file via HTTP range requests
Yes, that doesn't seem to be supported. I can't find any documentation about it, and there is no dedicated method. Would it be a contender for the S3 download, though? Seems pretty simple to me
Related and Interesting: https://github.com/piskvorky/smart_open/issues/725
I suppose the ZIP file previewer is loading a few kilobytes from the end of the zip file (the size is known from metadata IIRC or maybe ranges support negative values)
If you find the central directory file header signature (0x02014b50), you know you have it all and can start browsing for the files
https://en.wikipedia.org/wiki/ZIP_(file_format)
Oh wait, it's actually even easier when you know the last byte... there is an End of Central Directory record at the very end that tells you where the central directory starts
Again, it should be possible to see what the ZIP file previewer is doing and try to transfer that to Python
Cool! Learned something new today :smile: Going to check it out!
Feels so good to do some coding again after a week full of enzyme catalysis stuff :grinning:
Some StackOverflow digging has helped!
https://stackoverflow.com/a/17434121 (especially the last section with ZipFile)
@Markus Haarländer created the zip previewer/downloader. Maybe he can help.
I think I have it!
Original file - https://darus.uni-stuttgart.de/file.xhtml?persistentId=doi:10.18419/darus-3372/7
That looks promising!
It might be necessary to repeat the 256 KB read if the ZIP file is large
Yes, I will add an iterative process. According to StackOverflow, ZipFile raises an error if the chunk is not large enough, so I would just repeat with a larger range until there is no error.
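Something like this as a first sketch (assuming requests and a direct download URL; it relies on zipfile treating the missing leading bytes like a self-extracting prefix):

```python
import io
import zipfile
import requests

def list_remote_zip(url: str, tail: int = 256 * 1024) -> list[str]:
    # The End of Central Directory record and the central directory itself
    # sit at the end of the archive, so a suffix Range request is enough.
    # If the chunk is too small, zipfile raises BadZipFile and we retry.
    while True:
        resp = requests.get(url, headers={"Range": f"bytes=-{tail}"})
        try:
            return zipfile.ZipFile(io.BytesIO(resp.content)).namelist()
        except zipfile.BadZipFile:
            tail *= 2  # a real implementation should stop at the full file size
```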
Sounds good to me!
So this would be a feature of DataverseFS, right?
Yes, it would be a great feature to have! The next challenge, though, is extracting a specific file
Oh and what about retrieval of a file from the ZIP? You didn't take a look at that yet, right?
I would suggest first nailing this down in the example and then generalizing it in DataverseFS
Ha! You beat me to it :racecar:
Haha
Will look into this. I guess the ZIP Previewer has some ideas already
They probably extract the byte locations from the ZIP central directory and merge it all into a request with the matching range
That makes sense
Maybe ZipFile has some nice utilities for that
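It does: a zipfile.ZipInfo already carries header_offset, compress_size, and compress_type, so a single member could be fetched with two extra Range requests, roughly like this (a sketch; it assumes the offsets refer to the real remote file and only handles stored/deflated members):

```python
import struct
import zipfile
import zlib
import requests

def fetch_member(url: str, info: zipfile.ZipInfo) -> bytes:
    # The local file header is 30 fixed bytes plus variable-length name and
    # extra fields, so read it first to learn where the data really starts.
    start = info.header_offset
    head = requests.get(url, headers={"Range": f"bytes={start}-{start + 29}"}).content
    name_len, extra_len = struct.unpack("<HH", head[26:30])
    data_start = start + 30 + name_len + extra_len
    data_end = data_start + info.compress_size - 1
    data = requests.get(url, headers={"Range": f"bytes={data_start}-{data_end}"}).content
    if info.compress_type == zipfile.ZIP_DEFLATED:
        return zlib.decompress(data, -15)  # raw deflate stream, no zlib header
    return data  # ZIP_STORED members are not compressed
```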
Hi guys.
Seems you already mastered most of it. Yes, the ZipPreviewer utilizes HTTP Range Requests to read the central directory of a ZIP file first, and to download and extract single files from the ZIP (using ranges from the central directory). It makes use of a great JavaScript Library which can do all of these things: https://github.com/gildas-lormeau/zip.js. But I don't know about a Python library
@Markus Haarländer thanks for the info! Glad to hear we are on the right track :smile:
Coincidentally, right after reading your message I stumbled across something similar to zip.js and it does exactly what we want! The library is called remotezip
mmm, pythonic :yum:
For reference, this is a 2.8 GB file, and even this works pretty well/fast. Super nice!!
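Usage is roughly this (placeholder URL and member name):

```python
from remotezip import RemoteZip

# RemoteZip behaves like zipfile.ZipFile but reads the central directory
# and individual members via HTTP Range requests instead of downloading
# the whole archive.
url = "https://demo.dataverse.org/api/access/datafile/42"  # placeholder
with RemoteZip(url) as archive:
    print(archive.namelist())       # listing without a full download
    archive.extract("member.csv")   # fetches only that member's bytes
```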
We're going to have a lot to talk about at the next pyDataverse meeting. :grinning:
True :grinning_face_with_smiling_eyes:
The code for listing the contents of a ZIP file and downloading specific parts is working well. I have separated the Dataverse and Zip Filesystem, so passing a DataFile object to the ZIP filesystem is necessary. Here is a working example:
On the left sidebar, you can see the downloaded part of the ZIP file. Once I have implemented the write method to upload data files, I will create a pull request :smile:
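In code, the flow looks roughly like this (class and method names are illustrative and may differ):

```python
# Illustrative only -- the actual class and method names may differ.
from dataversefs import DataverseFS, ZipFS  # hypothetical imports

dv = DataverseFS(
    base_url="https://demo.dataverse.org",  # any Dataverse installation
    pid="doi:10.70122/FK2/TDI8JO",          # dataset persistent identifier
    api_token="XXXXXXXX",                   # placeholder token
)

datafile = dv.get_datafile("archive.zip")   # metadata only, nothing downloaded
zip_fs = ZipFS(datafile)                    # Range requests under the hood

print(zip_fs.listdir("/"))                  # list the ZIP members
with zip_fs.open("member.csv", "rb") as f:  # downloads just this member
    chunk = f.read()
```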
This looks amazing!
Question: should the name of the ZipFS rather be "RemoteZipFS" to make it more obvious what this is about? Someone might want to combine it with the ZipFS shipped with PyFilesystem
Yes, that makes sense! Will rename it :smile:
Upload works too :smile:
@Philip Durbin Would Stefano be interested in putting the "Compute on Data" logic into the filesystem? I think this would be the right place
Maybe! Let's see what @Leo Andreev thinks.
I think the filesystem is pretty close to finished. It is now possible to write files just like on a regular filesystem, using the open("my.file", "w") pattern. Closing the handle creates a new datafile, and S3 is supported as well thanks to DVUploader. Here is an example:
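Something along these lines (illustrative only; the exact names and constructor arguments may differ):

```python
# Illustrative only -- the exact constructor arguments may differ.
from pathlib import Path
from dataversefs import DataverseFS  # hypothetical import, as above

fs = DataverseFS(
    base_url="https://demo.dataverse.org",
    pid="doi:10.70122/FK2/TDI8JO",
    api_token="XXXXXXXX",
)

# Closing each handle registers a new datafile with the dataset; uploads
# go through DVUploader, so direct S3 upload is used where available.
with fs.open("results/output.csv", "w") as f:
    f.write("a,b,c\n1,2,3\n")

# Files that already exist on disk can be pushed the same way.
for path in Path("local_results").glob("*.csv"):
    with fs.open(f"results/{path.name}", "w") as f:
        f.write(path.read_text())
```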
Feels almost like you are writing files to the hard drive :stuck_out_tongue:
Would it be an option to have a "with" thing? So on any close the file automatically gets uploaded?
Do you mean for the filesystem itself?
Or is this already happening?
The with statement is already available for the files themselves
You can either use with to close (and upload) automatically, or invoke the upload yourself afterwards
Ah! So the second loop already does the upload in the background
And the third loop is just an example to upload other files on disk?
Yes, exactly :smile:
Great!
This is really great!
I thought it would make sense to have a local file option too, since the usual open operation is blocking and can't be parallelized
Does it support remote files (for download), too?
Yes, you can download any file from a dataset including zip members
No I meant files that are in a remote store.
So not in S3 and not stored in Dataverse
But referenced using a URL
Ah alright, I have not tested this yet, but I am sure there are ways to integrate it
It would be great to enable registering URL handlers here
You mean in a way to transfer from a remote store to dataverse?
Would be great if there is a way to not have to download the intermediate file then
So people could store something using git-annex and receive the file when they execute the python script
Do you have an example for this? I haven't used git-annex yet
That way they could very naturally interact with the files and they would be fetched as necessary
Ha storing the git annex thing was a wild idea at distribits
For now there might be examples using HTTP and Globus links
Let me get to work then I'll try to cook some better example. Typing this on my mobile is hard...
Alright. From https://guides.dataverse.org/en/latest/api/native-api.html#add-remote-file-api we know that files registered as remote files contain lots of information about the file. The most important bit is probably the storage identifier.
It will contain a URL that has been configured as a valid base url in the store plus a path within that location
The filesystem will be presented with this information when downloading the file metadata
So it would know about the files and folder structures
But it could not download the file from the Dataverse instance
Which means that the filesystem would need to understand how to resolve the URLs into a file
The example JSON to register the remote file has the example storage ID trsa://themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png
Users of the filesystem would need some means to register a handler that knows how to deal with protocol "trsa://" and the rest of the URL
In case of Datalad, the idea is to store "git-annex://" URLs that encode a git-annex remote file reference as a URL.
Using a handler to access the git-annex URL and download the file would be great
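A registry along these lines would probably do (completely hypothetical API, just to make the handler idea concrete):

```python
# Completely hypothetical API, just to make the handler idea concrete.
import requests

HANDLERS = {}

def register_handler(scheme, func):
    HANDLERS[scheme] = func

def resolve(storage_identifier: str) -> bytes:
    # e.g. "trsa://themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png"
    scheme, _, path = storage_identifier.partition("://")
    if scheme not in HANDLERS:
        raise ValueError(f"No handler registered for scheme '{scheme}'")
    return HANDLERS[scheme](path)

# A trivial handler that maps a trsa:// identifier to plain HTTPS; a
# git-annex handler would resolve the reference via git-annex instead.
register_handler("trsa", lambda path: requests.get(f"https://{path}").content)
```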
Okay, got it. Sounds great! How can I set up my local instance to test it? Can I add any remote store I want? Tried it using the docs, but I have not gotten it to work.
I've added these JVM args:
-Ddataverse.files.trsa.type=remote
-Ddataverse.files.trsa.label=SomeRemoteStorage
-Ddataverse.files.trsa.base-url=trsa://
-Ddataverse.files.trsa.base-store=trsa
The script:
export API_TOKEN=7a51588f-8422-4868-bc66-c791016e4a30
export SERVER_URL=http://localhost:8080
export PERSISTENT_ID=doi:10.5072/FK2/ZTXNOV
export JSON_DATA=$(<body.json)
curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID" -F "jsonData=$JSON_DATA"
The request body:
{
"description": "A remote image.",
"storageIdentifier": "trsa://hello/testlogo.png",
"checksumType": "MD5",
"md5Hash": "509ef88afa907eaf2c17c1c8d8fde77e",
"label": "testlogo.png",
"fileName": "testlogo.png",
"mimeType": "image/png"
}
Pretty sure I am doing something wrong :grinning:
I receive the following message every time I try to add a remote file:
{"status":"ERROR","message":"Dataset store configuration does not allow provided storageIdentifier."}
The storage identifier follows the base URL scheme, but apparently it does not match.
I tried setting the collection storage to the remote store, but without effect
I'm not sure if this helps, but we have a test on the Java side (that isn't exercised regularly): https://github.com/IQSS/dataverse/blob/v6.2/src/test/java/edu/harvard/iq/dataverse/api/RemoteStoreIT.java
Awesome! Thanks for the hint. Could it be that I am missing this line?
-Ddataverse.files.trsa.base-store=file
I thought it had a default, but I will give it a try!
Could be. I'm pretty sure you need a base store for thumbnails, etc.
Stupid idea, but what if we defined this filesystem instance-wide? Instead of connecting to a single dataset, you could access all datasets. I am thinking of something like this:
from dataversefs import DataverseFS
fs = DataverseFS(base_url="https://demo.dataverse.org")
fs.listdir("doi:10.70122/FK2/TDI8JO://some/dir")
file = fs.open("doi:10.70122/FK2/TDI8JO://some/dir/myfile.txt", "r")
You don't even need the ://
It's not a resolvable DOI URI but who cares
Maybe there is a better way, but I think it would be cool to grab from any dataset you want.
Maybe for the sake of validity go for doi:10.70122/FK2/TDI8JO?file=/path/to/file
That's nice!
It becomes a valid URI this way but is simple to parse because of the "separator string"
If you don't want a parameter, you could use anchors
True that, I hadn't thought about this. The idea just popped into my head :grinning:
doi:10.70122/FK2/TDI8JO#...
Absolutely! Makes a lot of sense.
I am a fan of the hash - Looks smooth
Maybe support both. The names are much shorter when you don't always need the fully qualified one
There might be character limitations for anchors!
I will hack something together. New material for the weekend :grinning_face_with_smiling_eyes:
Sorry the correct term is "fragment"
Which even makes more sense here - you want a fragment of a dataset
Dataset Fragments sounds nice :grinning:
The characters slash ("/") and question mark ("?") are allowed to represent data within the fragment identifier. Beware that some older, erroneous implementations may not handle this data correctly when it is used as the base URI for relative references (Section 5.1).
https://www.rfc-editor.org/rfc/rfc3986#page-24
You could even support having a query part
doi:10.70122/FK2/TDI8JO?direct-download=false#path/to/file.ext
Wouldn't it be nice if we had a Dataverse API endpoint like this?
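Either way, a reference in that shape parses cleanly with the Python standard library:

```python
from urllib.parse import urlsplit, parse_qs

ref = "doi:10.70122/FK2/TDI8JO?direct-download=false#path/to/file.ext"
parts = urlsplit(ref)

persistent_id = f"{parts.scheme}:{parts.path}"  # 'doi:10.70122/FK2/TDI8JO'
options = parse_qs(parts.query)                 # {'direct-download': ['false']}
member_path = parts.fragment                    # 'path/to/file.ext'
```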
@Jan Range have you ever tried benchmarking the pyfilesystem? Wondering what kind of performance one could get using the underlying S3 and potentially caching the files locally to make sure multiple usage of a file doesn't reload over and over.
Had a discussion with a few RSEs today that do Electron Microscopy. Their sensors stream a solid 2GiB/s and I was wondering if once they put that kind of data into S3 (maybe Dataverse in the mix) what kind of speed they could achieve reading the data back again.
Currently they heavily rely on filesystems and direct IO to avoid page table madness...
But they also want to expose data to analysis stations using SMB/NFS, so going through a network stack. Wondering if the S3 direct download stuff with PyFilesystem2 would be able to compete.
@Oliver Bertuch I have not benchmarked it yet, but I can test it in the upcoming weeks.
I’ve checked the source code, and the fs-s3fs package that provides S3 support in pyFileSystem uses boto3, the official AWS SDK. According to the implementation, it automatically utilizes the Range header and parallel downloads, which should make it noticeably faster than a standard sequential download.
However, when I tried using both boto3 and pyFileSystem’s S3 backend, I ran into an issue: these libraries require AWS credentials, even for publicly accessible files. I attempted to extract credentials or any required information from the S3 redirect URL, but that wasn’t sufficient to get these libraries working. Do you have any ideas on how we could make this work using only the pre-signed URL?
That said, I believe we could still achieve better download speeds by leveraging Range headers and parallelizing the download process ourselves. For reference, I ran a quick benchmark comparing the current, non-parallelized PyDataverse download of a 1.5 GB file to a Rust implementation that uses Range requests for parallel downloading.
| Library | Size | Time Taken | GB/s |
|---|---|---|---|
| PyDataverse | 1.5 GB | ~90 s | ~0.01 |
| DVCLI (Rust) | 1.5 GB | ~30 s | ~0.05 |
The Rust implementation follows the S3 redirect and uses the Range header to enable partial downloads. It splits the file into 5 MB chunks and distributes the workload across 64 workers, which turned out to be the optimal configuration in my tests. To ensure realistic results, I used a file from production as the test case.
https://darus.uni-stuttgart.de/file.xhtml?persistentId=doi:10.18419/DARUS-444/1&version=1.0
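The same chunking strategy sketched in Python (assuming the S3 redirect has already been followed and the total size is known, e.g. from the file metadata):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

CHUNK = 5 * 1024 * 1024  # 5 MB chunks, as in the test above
WORKERS = 64

def fetch_range(url, start, end):
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return start, resp.content

def parallel_download(url, size, out_path):
    ranges = [(s, min(s + CHUNK, size) - 1) for s in range(0, size, CHUNK)]
    with open(out_path, "wb") as out, ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for start, data in pool.map(lambda r: fetch_range(url, *r), ranges):
            out.seek(start)
            out.write(data)
```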
I am not sure if we can get much faster, since the AWS libraries do practically the same thing. Do you have any ideas how we could match the insane 2 GiB/s? :smile:
It could be that my WiFi is not ideal for benchmarking. I imagine a direct wired connection would perform even better.