Stream: troubleshooting

Topic: large uploads via native api workaround


view this post on Zulip maría a. matienzo (Jul 23 2024 at 21:31):

hi folks - i know it's been discussed before, but can someone verify the recommended process for dealing with large files over the native api? as i understand it, it's something like the following (sketched in code after the list), but i can't confirm it anywhere in the documentation.

  1. upload a small placeholder file with the same filename and (preferably) the same mime type
  2. look up the placeholder file in the dataverse backend database to get its storage location
  3. copy the large file you'd like to publish into the storage location identified in (2)
  4. update the row in the dataverse database that corresponds to the file, setting the checksumvalue, contenttype, and filesize columns.
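
for illustration, a rough sketch of those steps in code (server, token, pid, filenames, and the table/column names in the sql are all assumptions for a file-backed store, not something i've confirmed in the docs):

```python
# Hypothetical sketch of the placeholder workaround for file-backed storage.
# SERVER, API_TOKEN, PID, filenames, and the SQL below are all assumptions.
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"

# step 1: upload a small placeholder file under the big file's name
with open("placeholder.zip", "rb") as f:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("bigfile.zip", f)},
    )
r.raise_for_status()
file_id = r.json()["data"]["files"][0]["dataFile"]["id"]
print("placeholder file id:", file_id)

# steps 2-4 happen outside the API: find the placeholder's storage location
# in the backend database, copy the real file over it, then update the
# file's metadata row with something along these lines:
#
#   UPDATE datafile
#      SET checksumvalue = '<md5 of the real file>',
#          contenttype   = 'application/zip',
#          filesize      = <size in bytes>
#    WHERE id = <file_id>;
```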

view this post on Zulip maría a. matienzo (Jul 23 2024 at 21:32):

relatedly: in the case that this large file is, for example, a zip file, will dataverse extract the file placed in this location? if it can't be done automatically, is there a way to trigger this to happen?

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 13:33):

Right, we used to talk about that placeholder workaround, in this thread on the mailing list, for example.

However, these days I think we enable direct upload (requires S3) and push the files up from the client.
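
For reference, a minimal sketch of the first step of that flow, requesting a presigned upload URL (server, token, pid, and size are placeholders; see the S3 direct upload guide for the full flow, including the multipart case):

```python
# Sketch: ask Dataverse for a presigned S3 upload URL (direct upload must be
# enabled on the dataset's store). All values are placeholders.
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"
SIZE = 5 * 1024**3  # size of the file to be uploaded, in bytes

r = requests.get(
    f"{SERVER}/api/datasets/:persistentId/uploadurls",
    params={"persistentId": PID, "size": SIZE},
    headers={"X-Dataverse-key": API_TOKEN},
)
r.raise_for_status()
data = r.json()["data"]
# Small files get a single presigned PUT URL in data["url"]; large files get
# data["urls"] for a multipart upload. The client then PUTs the bytes there
# and registers the file with the /add endpoint afterwards.
print(data["storageIdentifier"])
```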

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 13:34):

No, I don't believe there's a way to tell Dataverse to unzip a file afterwards.

(There is a way to trigger a reingest of a tabular file, but that's different.)
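
(For illustration, a reingest can be triggered like this; server, token, and file id are placeholders:)

```python
# Sketch: trigger reingest of a tabular file via the native API.
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
FILE_ID = 42  # placeholder database id of the tabular file

r = requests.post(
    f"{SERVER}/api/files/{FILE_ID}/reingest",
    headers={"X-Dataverse-key": API_TOKEN},
)
print(r.json())
```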

view this post on Zulip maría a. matienzo (Jul 24 2024 at 16:02):

okay, thanks. at Berkeley, we're still relying on file-based backend storage, and i don't think we'll be able to switch to S3 before we launch our Dataverse instance.

view this post on Zulip maría a. matienzo (Jul 24 2024 at 16:11):

we do want to revisit using object storage post launch, though (along with a number of other things!)

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 16:11):

There are some additional big data options on the horizon, such as Globus.

view this post on Zulip maría a. matienzo (Jul 24 2024 at 17:30):

do you know if there's a diagram or something that describes the workflow of steps a file goes through on upload? i realize i could look at the code, but i'm curious if there's something in addition to that.

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:21):

@Oliver Bertuch created a nice diagram in this issue: Refactor file upload from web UI and temporary storage #6656

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:22):

file-upload.png

view this post on Zulip maría a. matienzo (Jul 24 2024 at 18:36):

got it - so i gather that FileUtil is where that process occurs?

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:42):

Well, I think a lot happens in the IngestService as well. I haven't looked closely in a while. :sweat_smile:

view this post on Zulip maría a. matienzo (Jul 24 2024 at 18:43):

got it - trying to figure out which seam i should advocate adding an API endpoint at :thinking:

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:46):

Sorry, I'm multi-tasking poorly. :sweat_smile: You might want to make a PR? To do what?

view this post on Zulip maría a. matienzo (Jul 24 2024 at 18:54):

at this point, not a PR, just a feature request for an api endpoint to be able to retrigger these tasks.

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 19:02):

Oh, retriggering unzipping of a file, for example?

view this post on Zulip maría a. matienzo (Jul 24 2024 at 19:02):

exactly. :smile:

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 19:03):

Sure, please go ahead and create an issue for that one, if you like.

Smaller issues are better for us.

view this post on Zulip maría a. matienzo (Jul 24 2024 at 23:04):

done: https://github.com/IQSS/dataverse/issues/10723

view this post on Zulip Philip Durbin 🚀 (Jul 25 2024 at 02:01):

Thanks!

view this post on Zulip maría a. matienzo (Jul 30 2024 at 22:38):

related question: is it possible that the replace file API method would work here (writing just the json data, not the file itself) instead of updating the database directly? again, this is for file-backed storage, not s3. (rationale: i'd be more comfortable sending this as an API request rather than mucking about in the database.)

view this post on Zulip Philip Durbin 🚀 (Jul 31 2024 at 20:23):

Yes, or maybe this API: https://guides.dataverse.org/en/6.3/developers/s3-direct-upload-api.html#adding-the-uploaded-file-to-the-dataset

view this post on Zulip maría a. matienzo (Jul 31 2024 at 21:58):

ah, i see. i think the issue with using the /add endpoint is that there might not be an existing storageIdentifier for the file. is there something that i'm missing there?

view this post on Zulip maría a. matienzo (Jul 31 2024 at 22:26):

okay, interesting. /replace is failing in this case because of a uniqueness constraint violation on dvobject if i pass it the same storage identifier. i take it that this is expected? would it be more correct to mint a new storage identifier?

view this post on Zulip maría a. matienzo (Jul 31 2024 at 23:22):

okay, i think i've got this figured out. i do need to mint a new storage identifier, which i can do fairly easily.
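
for reference, minting just means generating a fresh identifier of the right shape and moving the file into place under it; a rough sketch (the identifier format and paths here are assumptions based on what our file store produces):

```python
# sketch: mint a new storage identifier for a file store and move the big
# file into place. the "<hex timestamp>-<random hex>" format and the dataset
# directory path are assumptions/placeholders.
import secrets
import shutil
import time

dataset_dir = "/usr/local/dvn/data/10.5072/FK2/EXAMPLE"  # placeholder

name = f"{int(time.time() * 1000):x}-{secrets.token_hex(6)}"
storage_identifier = f"file://{name}"

shutil.move("/tmp/bigfile.zip", f"{dataset_dir}/{name}")
print(storage_identifier)
```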

view this post on Zulip maría a. matienzo (Jul 31 2024 at 23:39):

/add, however, doesn't work if i've minted a new storageIdentifier; i get a 400 error (Dataset store configuration does not allow provided storageIdentifier.)

view this post on Zulip Philip Durbin 🚀 (Aug 01 2024 at 13:52):

María A. Matienzo said:

ah, i see. i think the issue with using the /add endpoint is that there might not be an existing storageIdentifier for the file. is there something that i'm missing there?

Well, my mental model of this is that the storageIdentifier should be whatever filename you give the file. So you put the file on disk as b4c5c9ab-4b4f and then use that same string as the storageIdentifier in the JSON.

I'm not sure if I've ever done this. I'm not sure if it'll work for non-S3 files. (That S3 direct upload API should work for files on S3.)

view this post on Zulip maría a. matienzo (Aug 01 2024 at 16:33):

@Philip Durbin got it. i think the new workaround, which seems to work, is as follows (sketched in code after the list):

  1. upload a small placeholder file with the same filename
  2. get the file's ID from the API response
  3. move the actual file into place under a newly minted storageIdentifier
  4. call the /replace (or, in theory, the /replaceFiles) endpoint
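
roughly, in code (server, token, pid, the storageIdentifier value, and the jsonData fields are assumptions/placeholders based on the guides):

```python
# sketch of the workaround for file-backed storage; all values are
# placeholders, and the jsonData fields follow the direct upload guide.
import json
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"

# step 1: upload a small placeholder file under the big file's name
with open("placeholder.zip", "rb") as f:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("bigfile.zip", f)},
    )
r.raise_for_status()

# step 2: get the file's id from the api response
file_id = r.json()["data"]["files"][0]["dataFile"]["id"]

# step 3: move the actual file into place under a newly minted
# storageIdentifier (see the minting sketch above); suppose it came out as:
new_storage_identifier = "file://18b39722140-50eb7d3c5ece"

# step 4: call /replace with jsonData pointing at the new storageIdentifier
json_data = {
    "storageIdentifier": new_storage_identifier,
    "fileName": "bigfile.zip",
    "mimeType": "application/zip",
    "checksum": {"@type": "MD5", "@value": "ee25b5764f0ea9f7b0d6887c1e297291"},
}
r = requests.post(
    f"{SERVER}/api/files/{file_id}/replace",
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},
)
print(r.json())
```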

view this post on Zulip Philip Durbin 🚀 (Aug 01 2024 at 17:22):

Great! Do you want to create a pull request to document it? :grinning:

view this post on Zulip maría a. matienzo (Aug 01 2024 at 18:18):

yup, sure - once i've done some more testing, i'd be happy to. any recommendation about where it should sit?

view this post on Zulip maría a. matienzo (Aug 01 2024 at 18:19):

i suppose it could go in the direct upload documentation, if people know to look there

view this post on Zulip Philip Durbin 🚀 (Aug 01 2024 at 18:22):

Hmm, you're doing this because the file is large, right? How about somewhere under a future version of https://guides.dataverse.org/en/6.3/developers/big-data-support.html ?

view this post on Zulip maría a. matienzo (Aug 01 2024 at 18:22):

yup, that sounds good. thanks!

view this post on Zulip Philip Durbin 🚀 (Aug 02 2024 at 14:14):

This just in from Jim Myers:

"File stores can use dataverse.files.<id>.upload-out-of-band flag which allows an improved file hack - you place the file and then call the https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html#adding-the-uploaded-file-to-the-dataset or https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html#to-add-multiple-uploaded-files-to-the-dataset to add the metadata (including size, hash, etc.). It just avoids having to edit the db."

view this post on Zulip maría a. matienzo (Aug 02 2024 at 15:59):

oh, neat. thanks! i'll give that a shot.

view this post on Zulip Bethany Seeger (Aug 13 2024 at 15:56):

If you don't mind my asking, what's considered a large upload? Anything greater than 2GB? We are troubleshooting some issues with files greater than 2GB, so I'm trying to figure out whether the above suggestions are where we need to go as well. Thanks!

view this post on Zulip Philip Durbin 🚀 (Aug 13 2024 at 16:05):

Sure, 2GB is pretty big. I would suggest trying DVUploader, either the traditional Java one or the Python one.

