hi folks - i know it's been discussed before, but can someone verify the recommended process for dealing with large files over the native api? as i understand it, it's something like the following, but i can't confirm it anywhere in the documentation.
1. upload a placeholder file through the native api
2. swap in the large file at the storage location on disk
3. update the datafile record's checksumvalue, contenttype, and filesize columns directly in the database

relatedly: in the case that this large file is, for example, a zip file, will dataverse extract the file placed in this location? if it can't be done automatically, is there a way to trigger this to happen?
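to make step 3 concrete, here's a minimal sketch of the kind of direct database edit i mean (python + psycopg2; the paths, identifier format, and the datafile/dvobject join are my assumptions, not something i've verified in the docs):

```python
# sketch only: the kind of direct database edit described above, after the
# large file has been swapped in on disk. column/table names are taken from
# the thread plus my reading of the schema (datafile.id joins to dvobject.id,
# and dvobject holds the storageidentifier) -- verify before running.
import hashlib
import os

import psycopg2

storage_identifier = "file://18b4a2..."                             # placeholder
path_on_disk = "/usr/local/dvn/data/10.5072/FK2/ABC123/18b4a2..."   # placeholder

# compute the values dataverse would normally record at ingest time
md5 = hashlib.md5()
with open(path_on_disk, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

conn = psycopg2.connect("dbname=dvndb user=dvnapp")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE datafile
           SET checksumvalue = %s, contenttype = %s, filesize = %s
         WHERE id = (SELECT id FROM dvobject WHERE storageidentifier = %s)
        """,
        (md5.hexdigest(), "application/zip",
         os.path.getsize(path_on_disk), storage_identifier),
    )
conn.close()
```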
Right, we used to talk about that placeholder workaround, in this thread on the mailing list, for example.
However, these days I think we enable direct upload (requires S3) and push the files up from the client.
No, I don't believe there's a way to tell Dataverse to unzip a file afterwards.
(There is a way to trigger a reingest of a tabular file, but that's different.)
okay, thanks. at Berkeley, we're still relying on file-backed storage, and i don't think we'll be able to switch to S3 before we launch our Dataverse instance.
we do want to revisit using object storage post launch, though (along with a number of other things!)
There are some additional big data options on the horizon, such as Globus.
do you know if there's a diagram or something that describes the workflow of steps a file goes through on upload? i realize i could look at the code, but curious if there's something in addition to that.
@Oliver Bertuch created a nice diagram in this issue: Refactor file upload from web UI and temporary storage #6656
got it - so i gather that FileUtil is where that process occurs?
Well, I think a lot happens in the IngestService as well. I haven't looked closely in a while. :sweat_smile:
got it - trying to figure out which seam i should advocate for there to be an API endpoint for :thinking:
Sorry, I'm multi-tasking poorly. :sweat_smile: You might want to make a PR? To do what?
at this point, not a PR, just a feature request for an api endpoint to be able to retrigger these tasks.
Oh, retriggering unzipping of a file, for example?
exactly. :smile:
Sure, please go ahead and create an issue for that one, if you like.
Smaller issues are better for us.
done: https://github.com/IQSS/dataverse/issues/10723
Thanks!
related question: is it possible that the replace file API method would work here (writing just the json data, not the file itself) instead of updating the database directly? again, this is for file-backed storage, not s3. (rationale: i'd be more comfortable sending this as an API request rather than mucking about in the database.)
Yes, or maybe this API: https://guides.dataverse.org/en/6.3/developers/s3-direct-upload-api.html#adding-the-uploaded-file-to-the-dataset
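Untested, but here's a minimal sketch of that call with Python requests (server, token, dataset PID, and checksum type are all placeholders; the exact jsonData fields are documented in that guide):

```python
# sketch: register an already-placed file with the dataset via the /add API.
# assumes the file bytes are already in storage under storage_identifier.
import json
import requests

SERVER = "https://dataverse.example.edu"             # placeholder
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # placeholder
PERSISTENT_ID = "doi:10.5072/FK2/ABC123"             # placeholder dataset PID

json_data = {
    "storageIdentifier": "file://18b4a2...",  # must match what's on disk
    "fileName": "bigfile.zip",
    "mimeType": "application/zip",
    "checksum": {"@type": "MD5", "@value": "d41d8cd98f00b204e9800998ecf8427e"},
}

resp = requests.post(
    f"{SERVER}/api/datasets/:persistentId/add",
    params={"persistentId": PERSISTENT_ID},
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},  # multipart form field, no file part
)
print(resp.status_code, resp.json())
```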
ah, i see. i think the issue with using the /add endpoint is that there might not be an existing storageIdentifier for the file. is there something that i'm missing there?
okay, interesting. /replace is failing in this case because of a uniqueness constraint violation on dvobject if i pass it the same storage identifier. i take it that this is expected? would it be more correct to mint a new storage identifier?
okay, i think i've got this figured out. i do need to mint a new storage identifier, which i can do fairly easily.
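in case it helps anyone else, here's roughly how i'm minting the identifier and placing the file - just a sketch, and the identifier format and directory layout are my assumptions about our file-store config rather than anything i've confirmed:

```python
# sketch: mint a storage identifier and copy the large file into the dataset's
# directory under the files directory (file-backed storage). the identifier
# format and path layout are assumptions -- check how your installation names
# files on disk before relying on this.
import secrets
import shutil
from pathlib import Path

FILES_DIR = Path("/usr/local/dvn/data")                  # placeholder files directory
DATASET_DIR = FILES_DIR / "10.5072" / "FK2" / "ABC123"   # placeholder dataset PID path

def mint_storage_identifier(store_label: str = "file") -> str:
    # random hex suffix, loosely modeled on what dataverse generates
    return f"{store_label}://{secrets.token_hex(8)}"

storage_identifier = mint_storage_identifier()
on_disk_name = storage_identifier.split("://", 1)[1]

shutil.copy2("/tmp/bigfile.zip", DATASET_DIR / on_disk_name)
print(storage_identifier)  # use this value in the jsonData later
```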
/add, however, doesn't work if i've minted a new storage identifier; i get a 400 error (Dataset store configuration does not allow provided storageIdentifier.)
María A. Matienzo said:
ah, i see. i think the issue with using the /add endpoint is that there might not be an existing storageIdentifier for the file. is there something that i'm missing there?
Well, my mental model of this is that the storageIdentifier should be whatever filename you give the file. So you put the file on disk as b4c5c9ab-4b4f and then use that same string as the storageIdentifier in the JSON.
I'm not sure if I've ever done this. I'm not sure if it'll work for non-S3 files. (That S3 direct upload API should work for files on S3.)
@Philip Durbin got it. i think the new workaround, which seems to work, is as follows:
1. place the file in the dataset's storage location on disk
2. mint a new storageIdentifier for it
3. pass that storageIdentifier in the JSON to the /replace (or, in theory, the /replaceFiles) endpoint

Great! Do you want to create a pull request to document it? :grinning:
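fwiw, the /replace call in step 3 looks roughly like this for me (a sketch with python requests; the file id, PID, token, checksum, and the forceReplace flag are placeholders/assumptions - check the direct upload guide for the exact jsonData fields):

```python
# sketch: point the existing (placeholder) file record at the newly placed bytes
# via /replace, sending only jsonData -- no file bytes go over the API.
import json
import requests

SERVER = "https://dataverse.example.edu"             # placeholder
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # placeholder
FILE_ID = 42                                         # placeholder database id of the file to replace

json_data = {
    "storageIdentifier": "file://3b7d0e...",  # newly minted identifier from step 2
    "fileName": "bigfile.zip",
    "mimeType": "application/zip",
    "checksum": {"@type": "MD5", "@value": "d41d8cd98f00b204e9800998ecf8427e"},
    "forceReplace": True,  # assumption: needed when the content type changes
}

resp = requests.post(
    f"{SERVER}/api/files/{FILE_ID}/replace",
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},  # multipart form field
)
print(resp.status_code, resp.json())
```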
yup, sure - once i've done some more testing, i'd be happy to. any recommendation about where it should sit?
i suppose it could go in the direct upload documentation, if people know to look there
Hmm, you're doing this because the file is large, right? How about somewhere under a future version of https://guides.dataverse.org/en/6.3/developers/big-data-support.html ?
yup, that sounds good. thanks!
This just in from Jim Myers:
"File stores can use dataverse.files.<id>.upload-out-of-band flag which allows an improved file hack - you place the file and then call the https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html#adding-the-uploaded-file-to-the-dataset or https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html#to-add-multiple-uploaded-files-to-the-dataset to add the metadata (including size, hash, etc.). It just avoids having to edit the db."
oh, neat. thanks! i'll give that a shot.
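for posterity, here's roughly what i plan to try with that flag (a sketch; the /addFiles payload shape is my assumption based on the direct upload guide, so double-check the field names there):

```python
# sketch: with dataverse.files.<id>.upload-out-of-band enabled, place the file(s)
# out of band and then register them in one call via /addFiles.
# field names follow my reading of the direct upload guide -- verify there.
import json
import requests

SERVER = "https://dataverse.example.edu"             # placeholder
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # placeholder
PERSISTENT_ID = "doi:10.5072/FK2/ABC123"             # placeholder

json_data = [
    {
        "storageIdentifier": "file://3b7d0e...",     # already placed on disk
        "fileName": "bigfile.zip",
        "mimeType": "application/zip",
        "checksum": {"@type": "MD5", "@value": "d41d8cd98f00b204e9800998ecf8427e"},
    },
    # ...more files...
]

resp = requests.post(
    f"{SERVER}/api/datasets/:persistentId/addFiles",
    params={"persistentId": PERSISTENT_ID},
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},  # multipart form field
)
print(resp.status_code, resp.json())
```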
If you don't mind my asking, what's considered a large upload? Anything greater than 2GB? We are troubleshooting some issues with files greater than 2GB so I'm trying to figure out if the above suggestions are where we need to go with things as well. Thanks!
Sure, 2GB is pretty big. I would suggest trying DVUploader, either the traditional Java one or the Python one.