@Oliver Bertuch had the idea of providing a custom filesystem to remotely access and upload files via PyDataverse. This works already and will soon be put into a PR.
Another idea was to allow crawling ZIP files, similar to the Zip Previewer. Unfortunately the previewer initially downloads the Zip file and then displays the content. Hence my question, is there a way or could you think of providing this as a Dataverse endpoint? At least listing the content would be beneficial.
I may not be following very well...
I did see this issue about a PyFilesystem implementation: https://github.com/gdcc/pyDataverse/issues/178
But now you're asking about zip files? You want to preview the contents from pyDataverse?
Exactly! The idea goes like this: someone wants to use a single file from a ZIP living on Dataverse. Using the ZIP Pyfilesystem backed by the Dataverse Pyfilesystem, you would be able to retrieve it, without downloading all of it.
Does the existing HTTP Range header support help here? Or do we need a new API endpoint? You just want the list of contents in the zip file, right?
I'm not sure how the ZIP file previewer does it
It uses HTTPRange too
It must extract the list of files from the ZIP to display it
And then probably has coded in the ranges to be able to download a single file from the ZIP without downloading the whole ZIP file first
I have never used HTTP Range though. Would need to dig a bit into this.
Range has your name on it! Please see https://guides.dataverse.org/en/6.2/api/dataaccess.html#headers
Hehe should be familiar thing to me :grinning:
In my imagination the zip previewer/downloader uses the Range header to get the listing of files. Then it presents the list to the user. When the user clicks a file to download, it uses the Range header again to download just the bytes for that file.
I don't think it downloads the entire zip file to get the listing of files. I sure hope not.
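For example (just a sketch, with a made-up file id), a Range request against the regular Data Access API should return only the first kilobyte of a datafile:

```python
import requests

# Fetch only the first kilobyte of a datafile via the Data Access API;
# a server that honours the Range header answers with 206 Partial Content.
resp = requests.get(
    "https://demo.dataverse.org/api/access/datafile/42",  # made-up file id
    headers={"Range": "bytes=0-1023"},
)
print(resp.status_code)   # expected: 206
print(len(resp.content))  # expected: 1024
```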
Maybe there is already something in PyFilesystem to deal with this. Will check!
Probably https://docs.pyfilesystem.org/en/latest/_modules/fs/zipfs.html#ZipFS does not support remote ZIP files or range requests...
I have found another library that supports opening remote S3 and I guess HTTP too:
https://pypi.org/project/smart-open/
Giving it a try now
Oh wow!
Looks amazing!
The unicorn we needed :grinning:
Hmm does it support ZIP?
Loads ZIPs very well from remote! At least we are getting some binary. Checking the content now
Might need to add a compression handler...
Remote file was a tar file btw.
From looking at the library, I'm not sure it supports extracting the list of files or extracting parts of the ZIP file via HTTP range requests
Yes, that doesn't seem to be supported. I can't find any documentation about it, and there is no dedicated method. Would it be a contender for the S3 download, though? Seems pretty simple to me
Related and Interesting: https://github.com/piskvorky/smart_open/issues/725
I suppose the ZIP file previewer is loading a few kilobytes from the end of the zip file (the size is known from metadata IIRC or maybe ranges support negative values)
If you find the central directory file header signature (0x02014b50), you know you have it all and can start browsing for the files
https://en.wikipedia.org/wiki/ZIP_(file_format)
Oh wait, it's actually even easier when you know the last byte... there is an End of Central Directory record at the very end that tells you where the central directory starts
Again, it should be possible to see what the ZIP file previewer is doing and try to transfer that to Python
Cool! Learned something new today :smile: Going to check it out!
Feels so good to do some coding again after a week full of enzyme catalysis stuff :grinning:
Some StackOverflow digging has helped!
https://stackoverflow.com/a/17434121 (especially the last section with ZipFile)
@Markus Haarländer created the zip previewer/downloader. Maybe he can help.
I think I have it!
Original file - https://darus.uni-stuttgart.de/file.xhtml?persistentId=doi:10.18419/darus-3372/7
That looks promising!
It might be necessary to repeat the 256 KB read if the ZIP file is large
Yes, I will add an iterative process. According to StackOverflow, ZipFile raises an error if the chunk is not large enough, so I would just repeat with a larger range until there is no error.
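Something like this as a first sketch (assuming requests and a direct download URL; it relies on zipfile treating the missing leading bytes like a self-extracting prefix):

```python
import io
import zipfile
import requests

def list_remote_zip(url: str, tail: int = 256 * 1024) -> list[str]:
    # The End of Central Directory record and the central directory itself
    # sit at the end of the archive, so a suffix Range request is enough.
    # If the chunk is too small, zipfile raises BadZipFile and we retry.
    while True:
        resp = requests.get(url, headers={"Range": f"bytes=-{tail}"})
        try:
            return zipfile.ZipFile(io.BytesIO(resp.content)).namelist()
        except zipfile.BadZipFile:
            tail *= 2  # a real implementation should stop at the full file size
```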
Sounds good to me!
So this would be a feature of DataverseFS, right?
Yes, it would be a great feature to have! The next challenge, though, is extracting a specific file
Oh and what about retrieval of a file from the ZIP? You didn't take a look at that yet, right?
I would suggest first nailing this down in the example and then generalizing it in DataverseFS
Ha! You beat me to it :racecar:
Haha
Will look into this. I guess the ZIP Previewer has some ideas already
They probably extract the byte locations from the ZIP central directory and merge it all into a request with the matching range
That makes sense
Maybe ZipFile has some nice utilities for that
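It does: a zipfile.ZipInfo already carries header_offset, compress_size, and compress_type, so a single member could be fetched with two extra Range requests, roughly like this (a sketch; it assumes the offsets refer to the real remote file and only handles stored/deflated members):

```python
import struct
import zipfile
import zlib
import requests

def fetch_member(url: str, info: zipfile.ZipInfo) -> bytes:
    # The local file header is 30 fixed bytes plus variable-length name and
    # extra fields, so read it first to learn where the data really starts.
    start = info.header_offset
    head = requests.get(url, headers={"Range": f"bytes={start}-{start + 29}"}).content
    name_len, extra_len = struct.unpack("<HH", head[26:30])
    data_start = start + 30 + name_len + extra_len
    data_end = data_start + info.compress_size - 1
    data = requests.get(url, headers={"Range": f"bytes={data_start}-{data_end}"}).content
    if info.compress_type == zipfile.ZIP_DEFLATED:
        return zlib.decompress(data, -15)  # raw deflate stream, no zlib header
    return data  # ZIP_STORED members are not compressed
```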
Hi guys.
Seems you already mastered most of it. Yes, the ZipPreviewer utilizes HTTP Range Requests to read the central directory of a ZIP file first, and to download and extract single files from the ZIP (using ranges from the central directory). It makes use of a great JavaScript Library which can do all of these things: https://github.com/gildas-lormeau/zip.js. But I don't know about a Python library
@Markus Haarländer thanks for the info! Glad to hear we are on the right track :smile:
Coincidentally, right after reading your message I stumbled across something similar to zip.js and it does exactly what we want! The library is called remotezip
mmm, pythonic :yum:
For reference, this is a 2.8 GB file, and even this works pretty well/fast. Super nice!!
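Usage is roughly this (placeholder URL and member name):

```python
from remotezip import RemoteZip

# RemoteZip behaves like zipfile.ZipFile but reads the central directory
# and individual members via HTTP Range requests instead of downloading
# the whole archive.
url = "https://demo.dataverse.org/api/access/datafile/42"  # placeholder
with RemoteZip(url) as archive:
    print(archive.namelist())       # listing without a full download
    archive.extract("member.csv")   # fetches only that member's bytes
```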
We're going to have a lot to talk about at the next pyDataverse meeting. :grinning:
True :grinning_face_with_smiling_eyes:
The code for listing the contents of a ZIP file and downloading specific parts is working well. I have separated the Dataverse and Zip Filesystem, so passing a DataFile object to the ZIP filesystem is necessary. Here is a working example:
On the left sidebar, you can see the downloaded part of the ZIP file. Once I have implemented the write method to upload data files, I will create a pull request :smile:
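In code, the flow looks roughly like this (class and method names are illustrative and may differ):

```python
# Illustrative only -- the actual class and method names may differ.
from dataversefs import DataverseFS, ZipFS  # hypothetical imports

dv = DataverseFS(
    base_url="https://demo.dataverse.org",  # any Dataverse installation
    pid="doi:10.70122/FK2/TDI8JO",          # dataset persistent identifier
    api_token="XXXXXXXX",                   # placeholder token
)

datafile = dv.get_datafile("archive.zip")   # metadata only, nothing downloaded
zip_fs = ZipFS(datafile)                    # Range requests under the hood

print(zip_fs.listdir("/"))                  # list the ZIP members
with zip_fs.open("member.csv", "rb") as f:  # downloads just this member
    chunk = f.read()
```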
This looks amazing!
Question: should the name of the ZipFS rather be "RemoteZipFS" to make it more obvious what this is about? Someone might want to combine it with the ZipFS shipped with PyFilesystem
Yes, that makes sense! Will rename it :smile:
Upload works too :smile:
@Philip Durbin Would Stefano be interested in putting the "Compute on Data" logic into the filesystem? I think this would be the right place
Maybe! Let's see what @Leo Andreev thinks.
I think the filesystem is pretty close to finished. It is now possible to write files just like on a regular filesystem, using the open("my.file", "w") pattern. Closing the handle creates a new datafile, and S3 is supported as well thanks to DVUploader. Here is an example:
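Something along these lines (illustrative only; the exact names and constructor arguments may differ):

```python
# Illustrative only -- the exact constructor arguments may differ.
from pathlib import Path
from dataversefs import DataverseFS  # hypothetical import, as above

fs = DataverseFS(
    base_url="https://demo.dataverse.org",
    pid="doi:10.70122/FK2/TDI8JO",
    api_token="XXXXXXXX",
)

# Closing each handle registers a new datafile with the dataset; uploads
# go through DVUploader, so direct S3 upload is used where available.
with fs.open("results/output.csv", "w") as f:
    f.write("a,b,c\n1,2,3\n")

# Files that already exist on disk can be pushed the same way.
for path in Path("local_results").glob("*.csv"):
    with fs.open(f"results/{path.name}", "w") as f:
        f.write(path.read_text())
```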
Feels almost like you are writing files to the hard drive :stuck_out_tongue:
Would it be an option to have a "with" thing? So on any close the file automatically gets uploaded?
Do you mean for the filesystem itself?
Or is this already happening?
The with statement is already available for the files themselves
You can either use with to close (and upload) automatically, or invoke the upload yourself afterwards
Ah! So the second loop already does the upload in the background
And the third loop is just an example to upload other files on disk?
Yes, exactly :smile:
Great!
This is really great!
I thought it would make sense to have a local file option too, since the usual open operation is blocking and can't be parallelized
Does it support remote files (for download), too?
Yes, you can download any file from a dataset including zip members
No I meant files that are in a remote store.
So not in S3 and not stored in Dataverse
But referenced using a URL
Ah alright, I have not tested this yet, but I am sure there are ways to integrate it
It would be great to enable registering URL handlers here
You mean in a way to transfer from a remote store to dataverse?
Would be great if there is a way to not have to download the intermediate file then
So people could store something using git-annex and receive the file when they execute the python script
Do you have an example for this? I haven't used git-annex yet
That way they could very naturally interact with the files and they would be fetched as necessary
Ha storing the git annex thing was a wild idea at distribits
For now there might be examples using HTTP and Globus links
Let me get to work then I'll try to cook some better example. Typing this on my mobile is hard...
Alright. From https://guides.dataverse.org/en/latest/api/native-api.html#add-remote-file-api we know that files registered as remote files contain lots of information about the file. The most important bit is probably the storage identifier.
It will contain a URL that has been configured as a valid base url in the store plus a path within that location
The filesystem will be presented with this information when downloading the file metadata
So it would know about the files and folder structures
But it could not download the file from the Dataverse instance
Which means that the filesystem would need to understand how to resolve the URLs into a file
The example JSON to register the remote file has the example storage ID trsa://themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png
Users of the filesystem would need some means to register a handler that knows how to deal with protocol "trsa://" and the rest of the URL
In case of Datalad, the idea is to store "git-annex://" URLs that encode a git-annex remote file reference as a URL.
Using a handler to access the git-annex URL and download the file would be great
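A registry along these lines would probably do (completely hypothetical API, just to make the handler idea concrete):

```python
# Completely hypothetical API, just to make the handler idea concrete.
import requests

HANDLERS = {}

def register_handler(scheme, func):
    HANDLERS[scheme] = func

def resolve(storage_identifier: str) -> bytes:
    # e.g. "trsa://themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png"
    scheme, _, path = storage_identifier.partition("://")
    if scheme not in HANDLERS:
        raise ValueError(f"No handler registered for scheme '{scheme}'")
    return HANDLERS[scheme](path)

# A trivial handler that maps a trsa:// identifier to plain HTTPS; a
# git-annex handler would resolve the reference via git-annex instead.
register_handler("trsa", lambda path: requests.get(f"https://{path}").content)
```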
Okay, got it. Sounds great! How can I set up my local instance to test it? Can I add any remote store I want? Tried it using the docs, but I have not gotten it to work.
I've added these JVM args:
-Ddataverse.files.trsa.type=remote
-Ddataverse.files.trsa.label=SomeRemoteStorage
-Ddataverse.files.trsa.base-url=trsa://
-Ddataverse.files.trsa.base-store=trsa
The script:
export API_TOKEN=7a51588f-8422-4868-bc66-c791016e4a30
export SERVER_URL=http://localhost:8080
export PERSISTENT_ID=doi:10.5072/FK2/ZTXNOV
export JSON_DATA=$(<body.json)
curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID" -F "jsonData=$JSON_DATA"
The request body:
{
"description": "A remote image.",
"storageIdentifier": "trsa://hello/testlogo.png",
"checksumType": "MD5",
"md5Hash": "509ef88afa907eaf2c17c1c8d8fde77e",
"label": "testlogo.png",
"fileName": "testlogo.png",
"mimeType": "image/png"
}
Pretty sure I am doing something wrong :grinning:
I receive the following message every time I try to add a remote file:
{"status":"ERROR","message":"Dataset store configuration does not allow provided storageIdentifier."}
The storage identifier follows the base URL scheme, but apparently it does not match.
I tried setting the collection storage to the remote store, but without effect
I'm not sure if this helps, but we have a test on the Java side (that isn't exercised regularly): https://github.com/IQSS/dataverse/blob/v6.2/src/test/java/edu/harvard/iq/dataverse/api/RemoteStoreIT.java
Awesome! Thanks for the hint. Could it be that I am missing this line?
-Ddataverse.files.trsa.base-store=file
I thought it had a default, but I will give it a try!
Could be. I'm pretty sure you need a base store for thumbnails, etc.
Stupid idea, but what if we defined this filesystem instance-wide? Instead of connecting to a single dataset, you could access all datasets. I am thinking of something like this:
from dataversefs import DataverseFS
fs = DataverseFS(base_url="https://demo.dataverse.org")
fs.listdir("doi:10.70122/FK2/TDI8JO://some/dir")
file = fs.open("doi:10.70122/FK2/TDI8JO://some/dir/myfile.txt", "r")
You don't even need the ://
It's not a resolvable DOI URI but who cares
Maybe there is a better way, but I think it would be cool to grab from any dataset you want.
Maybe for the sake of validity go for doi:10.70122/FK2/TDI8JO?file=/path/to/file
That's nice!
It becomes a valid URI this way but is simple to parse because of the "separator string"
If you don't want a parameter, you could use anchors
True that, I hadn't thought about this. The idea just popped into my head :grinning:
doi:10.70122/FK2/TDI8JO#...
Absolutely! Makes a lot of sense.
I am a fan of the hash - Looks smooth
Maybe support both. The names are much shorter when you don't always need the fully qualified one
There might be character limitations for anchors!
I will hack something together. New material for the weekend :grinning_face_with_smiling_eyes:
Sorry the correct term is "fragment"
Which even makes more sense here - you want a fragment of a dataset
Dataset Fragments sounds nice :grinning:
The characters slash ("/") and question mark ("?") are allowed to represent data within the fragment identifier. Beware that some older, erroneous implementations may not handle this data correctly when it is used as the base URI for relative references (Section 5.1).
https://www.rfc-editor.org/rfc/rfc3986#page-24
You could even support having a query part
doi:10.70122/FK2/TDI8JO?direct-download=false#path/to/file.ext
Wouldn't it be nice if we had a Dataverse API endpoint like this?
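Either way, a reference in that shape parses cleanly with the Python standard library:

```python
from urllib.parse import urlsplit, parse_qs

ref = "doi:10.70122/FK2/TDI8JO?direct-download=false#path/to/file.ext"
parts = urlsplit(ref)

persistent_id = f"{parts.scheme}:{parts.path}"  # 'doi:10.70122/FK2/TDI8JO'
options = parse_qs(parts.query)                 # {'direct-download': ['false']}
member_path = parts.fragment                    # 'path/to/file.ext'
```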
@Jan Range have you ever tried benchmarking the pyfilesystem? Wondering what kind of performance one could get using the underlying S3 and potentially caching the files locally to make sure multiple usage of a file doesn't reload over and over.
Had a discussion with a few RSEs today that do Electron Microscopy. Their sensors stream a solid 2GiB/s and I was wondering if once they put that kind of data into S3 (maybe Dataverse in the mix) what kind of speed they could achieve reading the data back again.
Currently they heavily rely on filesystems and direct IO to avoid page table madness...
But they also want to expose data to analysis stations using SMB/NFS, so going through a network stack. Wondering if the S3 direct download stuff with PyFilesystem2 would be able to compete.
@Oliver Bertuch I have not benchmarked it yet, but I can test it in the upcoming weeks.
I’ve checked the source code, and the fs-s3fs package that provides S3 support in pyFileSystem uses boto3, the official AWS SDK. According to the implementation, it automatically utilizes the Range header and parallel downloads, which should make it noticeably faster than a standard sequential download.
However, when I tried using both boto3 and pyFileSystem’s S3 backend, I ran into an issue: these libraries require AWS credentials, even for publicly accessible files. I attempted to extract credentials or any required information from the S3 redirect URL, but that wasn’t sufficient to get these libraries working. Do you have any ideas on how we could make this work using only the pre-signed URL?
That said, I believe we could still achieve better download speeds by leveraging Range headers and parallelizing the download process ourselves. For reference, I ran a quick benchmark comparing the current, non-parallelized PyDataverse download of a 1.5 GB file to a Rust implementation that uses Range requests for parallel downloading.
| Library | Size | Time Taken | GB/s |
|---|---|---|---|
| PyDataverse | 1.5 GB | ~90 s | ~0.01 |
| DVCLI (Rust) | 1.5 GB | ~30 s | ~0.05 |
The Rust implementation follows the S3 redirect and uses the Range header to enable partial downloads. It splits the file into 5 MB chunks and distributes the workload across 64 workers, which turned out to be the optimal configuration in my tests. To ensure realistic results, I used a file from production as the test case.
https://darus.uni-stuttgart.de/file.xhtml?persistentId=doi:10.18419/DARUS-444/1&version=1.0
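The same chunking strategy sketched in Python (assuming the S3 redirect has already been followed and the total size is known, e.g. from the file metadata):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

CHUNK = 5 * 1024 * 1024  # 5 MB chunks, as in the test above
WORKERS = 64

def fetch_range(url, start, end):
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return start, resp.content

def parallel_download(url, size, out_path):
    ranges = [(s, min(s + CHUNK, size) - 1) for s in range(0, size, CHUNK)]
    with open(out_path, "wb") as out, ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for start, data in pool.map(lambda r: fetch_range(url, *r), ranges):
            out.seek(start)
            out.write(data)
```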
I am not sure if we can get much faster, since the AWS libraries do practically the same thing. Do you have any ideas how we could match the insane 2 GiB/s? :smile:
It could be that my WiFi is not ideal for benchmarking. I imagine a direct wired connection would perform even better.