Tabular data

Dataverse has first-class support for tabular data. When you upload a CSV, TSV, or supported spreadsheet, Dataverse ingests it: the values are stored in a normalized tab-delimited .tab file and the column names and types are recorded as variable metadata. DataverseFS provides helpers that turn these files back into pandas DataFrames in one call.

After ingest, a file you uploaded as data/table.csv is served as data/table.tab. Two consequences are worth knowing:

  • Path. Refer to the ingested file by its .tab path when reading it as tabular data.
  • Size. Dataverse stores the tabular data without its header row (the column names live in metadata) but reconstructs the header when you download the file. The reported file size therefore describes the stored bytes, not the downloaded ones. DataverseFS handles this for you: a full read returns the complete content, header included. It does mean a tabular file's reported size can be smaller than the number of bytes you receive.
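The size difference can be pictured with plain strings. This is only an illustrative sketch: the column names and rows are made up, and it models the behaviour described above rather than calling Dataverse or DataverseFS.

```python
# Illustrative model of header-less storage; no Dataverse call is made.
column_names = ["name", "r2", "aic"]  # hypothetical variable metadata
stored_body = "model_a\t0.91\t120.5\nmodel_b\t0.87\t133.2\n"  # stored without a header row

# On download, the header row is rebuilt from the metadata and prepended.
header_row = "\t".join(column_names) + "\n"
downloaded = header_row + stored_body

reported_size = len(stored_body.encode("utf-8"))  # what the reported size describes
received_size = len(downloaded.encode("utf-8"))   # what a full read returns

print(reported_size, received_size)
```

In this model the two numbers differ by exactly the byte length of the reconstructed header row.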

open_tabular downloads a tabular file and parses it into a DataFrame, choosing the delimiter from the file’s MIME type automatically:

df = fs.open_tabular("data/table.tab", api_token=None)

Pass api_token when the file is restricted; None is fine for public files.
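One way to picture the automatic delimiter choice is a lookup keyed on the MIME type. The sketch below is an assumption for illustration only: DELIMITER_BY_MIME and pick_delimiter are hypothetical names, and the mapping DataverseFS actually uses may differ (it also handles Excel types, which have no delimiter).

```python
# Hypothetical lookup; not the actual DataverseFS implementation.
DELIMITER_BY_MIME = {
    "text/csv": ",",
    "text/tab-separated-values": "\t",
    "text/tsv": "\t",
}


def pick_delimiter(content_type: str) -> str:
    """Return a delimiter for a delimited-text MIME type (illustrative helper)."""
    # Drop parameters such as '; charset=utf-8' and normalize case before the lookup.
    base_type = content_type.split(";", 1)[0].strip().lower()
    try:
        return DELIMITER_BY_MIME[base_type]
    except KeyError:
        raise ValueError(f"not a delimited tabular type: {content_type!r}") from None


print(repr(pick_delimiter("text/tab-separated-values")))
```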

open_tabular forwards any additional keyword arguments to pandas.read_csv or pandas.read_excel, so you can shape the read as usual:

df = fs.open_tabular(
    "data/table.tab",
    api_token=None,
    usecols=["name", "r2", "aic"],
    nrows=100,
    dtype={"r2": "float64"},
)
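To see what these keyword arguments do without touching a Dataverse server, you can pass the same ones straight to pandas.read_csv on in-memory data. The table contents below are made up, and sep="\t" is set by hand here because there is no MIME type to infer it from.

```python
import io

import pandas as pd

# Made-up tab-delimited content standing in for a downloaded .tab file.
tab_text = (
    "name\tr2\taic\tnotes\n"
    "model_a\t0.91\t120.5\tbaseline\n"
    "model_b\t0.87\t133.2\tno intercept\n"
    "model_c\t0.95\t118.0\tfull\n"
)

df = pd.read_csv(
    io.StringIO(tab_text),
    sep="\t",
    usecols=["name", "r2", "aic"],  # keep only these columns
    nrows=2,                        # stop after the first two data rows
    dtype={"r2": "float64"},        # force the type of the r2 column
)
print(df)
```

The notes column is dropped by usecols, and only the first two data rows are parsed.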

Set no_header=True for files without a header row — columns are then named by integer position (0, 1, 2, …).
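The integer column names match what pandas itself produces when told there is no header row. The sketch below uses header=None on made-up in-memory data. That no_header=True maps onto this pandas option is an assumption, not something confirmed here.

```python
import io

import pandas as pd

# Made-up tab-delimited content with no header row.
headerless = "model_a\t0.91\t120.5\nmodel_b\t0.87\t133.2\n"

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(io.StringIO(headerless), sep="\t", header=None)

print(list(df.columns))
print(df[0].tolist())  # columns are addressed by integer position
```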

For files too large to hold in memory, stream_tabular yields row-chunked DataFrames:

total = 0
for chunk in fs.stream_tabular("data/table.tab", api_token=None, chunk_size=10_000):
    total += len(chunk)
print(total)

stream_tabular accepts the same read_csv keyword arguments as open_tabular, plus chunk_size (rows per chunk) and sep (delimiter override).
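Row-chunked reading can be tried locally with pandas alone. This sketch uses the chunksize option of pandas.read_csv on made-up in-memory data. It illustrates the chunking behaviour only, and stream_tabular may not be implemented this way.

```python
import io

import pandas as pd

# Build a small made-up table: one header row plus 25 data rows.
rows = "".join(f"row_{i}\t{i}\n" for i in range(25))
tab_text = "name\tvalue\n" + rows

total = 0
chunk_sizes = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(tab_text), sep="\t", chunksize=10):
    chunk_sizes.append(len(chunk))
    total += len(chunk)

print(total, chunk_sizes)
```

Every chunk except possibly the last holds the requested number of rows, so only one chunk needs to be in memory at a time.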

A file is treated as tabular when its MIME type is one Dataverse ingests — text/csv, text/tab-separated-values, text/tsv, the Excel types, and a few common variants. You can check a file’s type via its metadata:

info = fs.getinfo("data/table.tab")
print(info.tabular_data) # True once Dataverse has ingested it
print(info.content_type) # e.g. 'text/tab-separated-values'

Calling open_tabular or stream_tabular on a non-tabular file raises a ValueError.

Everything above goes through pyDataverse. If you’d rather hand a URL to pandas, Dask, or Polars and let them read it directly, see pandas & the fsspec ecosystem — including a complete, runnable example against a public dataset.