pandas & the fsspec ecosystem

Importing pyDataverse registers a dataverse:// URL protocol with fsspec (a packaging entry point also registers it on install). From then on, any fsspec-aware library can open a dataset file directly from a URL — no filesystem object to construct, no download step to manage.
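As a quick sanity check, the sketch below asks fsspec whether it knows the protocol. The helper name `is_protocol_registered` is hypothetical and not part of pyDataverse. It assumes a reasonably recent fsspec that provides `available_protocols()`, and it returns False when fsspec is not installed.

```python
import importlib


def is_protocol_registered(protocol: str = "dataverse") -> bool:
    """Return True if fsspec lists `protocol` among its known protocols."""
    try:
        fsspec = importlib.import_module("fsspec")
    except ImportError:
        return False  # fsspec missing: nothing can be registered
    try:
        # Importing pyDataverse registers the protocol, per the text above.
        importlib.import_module("pyDataverse")
    except ImportError:
        pass  # not installed; fall through and report what fsspec knows
    return protocol in fsspec.available_protocols()


print(is_protocol_registered())
```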

A URL carries the dataset’s connection details in its query string; the path component is the file’s path inside the dataset:

dataverse://<host>/<file/path>?persistentId=doi:...&version=:latest
dataverse://<host>/<file/path>?id=12345
Query parameter   Purpose
persistentId      Dataset DOI (e.g. doi:10.5072/FK2/ABCDEF).
id                Dataset numeric database ID (alternative to persistentId).
version           Dataset version (:latest, :draft, 1.0, …). Optional.
scheme            Transport for the host: https (default) or http. Optional.
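To make the URL anatomy concrete, here is a minimal sketch that splits a dataverse:// URL into the pieces described above using only the standard library. The helper `split_dataverse_url` is hypothetical; it illustrates the layout and is not how pyDataverse itself parses URLs.

```python
from urllib.parse import parse_qs, urlsplit


def split_dataverse_url(url: str) -> dict:
    """Break a dataverse:// URL into host, file path and dataset details."""
    parts = urlsplit(url)
    # parse_qs returns lists; each parameter is expected once, so keep the first value.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.netloc,
        "file_path": parts.path.lstrip("/"),
        "dataset": query.get("persistentId") or query.get("id"),
        "version": query.get("version"),         # None when omitted
        "scheme": query.get("scheme", "https"),  # https is the documented default
    }


info = split_dataverse_url(
    "dataverse://demo.dataverse.org/data/table.tab"
    "?persistentId=doi:10.5072/FK2/ABCDEF&version=:latest"
)
print(info)
```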

The API token is not read from the URL. Pass it — when needed — through the reading library’s storage_options:

storage_options = {"api_token": "your-token"}
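One way to keep the token out of source code is to read it from the environment. The sketch below assumes an environment variable named DATAVERSE_API_TOKEN; that name and the helper `token_storage_options` are illustrative choices, not something pyDataverse defines.

```python
import os


def token_storage_options(env=None, var="DATAVERSE_API_TOKEN"):
    """Build a storage_options dict from an environment variable.

    Returns an empty dict when the variable is unset or blank, which
    suits public datasets that need no token.
    """
    env = os.environ if env is None else env
    token = env.get(var, "").strip()
    return {"api_token": token} if token else {}


storage_options = token_storage_options()
```

The resulting dict can then be passed as the `storage_options` argument of the reading library.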

For a tabular file, point read_csv at the URL. Dataverse serves ingested files as tab-delimited, so use sep="\t":

import pandas as pd
import pyDataverse # registers the dataverse:// protocol
url = (
    "dataverse://demo.dataverse.org/data/table.tab"
    "?persistentId=doi:10.5072/FK2/ABCDEF"
)
df = pd.read_csv(url, sep="\t")
# Restricted dataset? Add the token:
# df = pd.read_csv(url, sep="\t", storage_options={"api_token": "your-token"})
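To show why the delimiter matters, the following sketch parses a small tab-delimited sample twice with the standard-library csv module: once with the default comma delimiter and once with a tab. The sample text is made up for illustration and is not real Dataverse output.

```python
import csv
import io

# Made-up sample standing in for an ingested .tab file.
sample = "name\tnote\tvalue\nalpha\tfirst, draft\t1\nbeta\tsecond\t2\n"

# Default delimiter (comma): tabs are treated as ordinary characters.
rows_comma = list(csv.reader(io.StringIO(sample)))
# Tab delimiter: columns are split where the file actually separates them.
rows_tab = list(csv.reader(io.StringIO(sample), delimiter="\t"))

print(rows_comma[1])  # fields split at the comma inside the note
print(rows_tab[1])    # three clean fields
```

The same reasoning applies to `sep="\t"` in `read_csv`: without it, each row is split on commas instead of tabs.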

Any tool that delegates path handling to fsspec accepts dataverse:// URLs the same way:

# Dask: lazy, parallel reads
import dask.dataframe as dd
ddf = dd.read_csv(url, sep="\t", storage_options={"api_token": "your-token"})

# Polars (via fsspec)
import fsspec
import polars as pl
with fsspec.open(url, "rb", api_token="your-token") as f:
    df = pl.read_csv(f, separator="\t")

# Plain fsspec: open any file, tabular or not
with fsspec.open(url, "rb") as f:
    raw = f.read()

fsspec has no standalone CLI, but because the protocol is registered, any fsspec-aware command-line tool — and a short python -c one-liner — can read a dataset by URL. The import pyDataverse is what triggers registration:

# Print the contents of a file
python -c "import pyDataverse, fsspec; \
print(fsspec.open('dataverse://demo.dataverse.org/data/notes.txt?persistentId=doi:10.5072/FK2/ABCDEF', \
'rt').open().read())"
# List the files in a dataset
python -c "import pyDataverse, fsspec; \
fs = fsspec.filesystem('dataverse', base_url='https://demo.dataverse.org', \
identifier='doi:10.5072/FK2/ABCDEF'); \
print('\n'.join(fs.ls('/', detail=False)))"

The following example reads a real, public tabular file from DaRUS; no token is required. The dataset is doi:10.18419/DARUS-5539, a kinetic-modeling study, and results/summary.tab is a model-comparison table.

import pandas as pd
import pyDataverse # registers the dataverse:// protocol
url = (
    "dataverse://darus.uni-stuttgart.de/results/summary.tab"
    "?persistentId=doi:10.18419/DARUS-5539"
)
df = pd.read_csv(url, sep="\t")
print(df[["name", "n_parameters", "r2", "aic", "bic"]])
       name  n_parameters        r2         aic         bic
0  model_04            12  0.997071  214.849030  260.068878
1  model_07            10  0.998349   27.439997   65.123207
2  model_06             9  0.996706  246.452515  280.367401
3  model_08             9  0.998378   19.794846   53.709736

The equivalent through a filesystem instance, letting pyDataverse pick the delimiter for you:

from pyDataverse.filesystem import DataverseFS
fs = DataverseFS(
    base_url="https://darus.uni-stuttgart.de",
    identifier="doi:10.18419/DARUS-5539",
)
df = fs.open_tabular("results/summary.tab", api_token=None)

With a token and edit permission, the protocol also works for writing — for example, DataFrame.to_csv to a dataverse:// URL streams a new (or replacement) file into the dataset:

df.to_csv(
    "dataverse://demo.dataverse.org/results/out.csv?persistentId=doi:10.5072/FK2/ABCDEF",
    index=False,
    storage_options={"api_token": "your-token"},
)

See Writing files for the details of how uploads stream and how to attach metadata.