Stream: troubleshooting

Topic: large uploads via native api workaround


view this post on Zulip maría a. matienzo (Jul 23 2024 at 21:31):

hi folks - i know it's been discussed before, but can someone verify the recommended process for dealing with large files over the native api? as i understand it, it's something like the following (sketched in code after the list), but i can't confirm it anywhere in the documentation.

  1. upload a small placeholder file with the same filename and (preferably) the same mime type
  2. look up the placeholder file in the dataverse backend database to get its storage location
  3. copy the large file you'd like to publish into the storage location identified in (2)
  4. update the row in the dataverse database that corresponds to the file, setting the checksumvalue, contenttype, and filesize columns.
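
for illustration, a rough sketch of those steps in code (server, token, pid, filenames, and the table/column names in the sql are all assumptions for a file-backed store, not something i've confirmed in the docs):

```python
# Hypothetical sketch of the placeholder workaround for file-backed storage.
# SERVER, API_TOKEN, PID, filenames, and the SQL below are all assumptions.
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"

# step 1: upload a small placeholder file under the big file's name
with open("placeholder.zip", "rb") as f:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("bigfile.zip", f)},
    )
r.raise_for_status()
file_id = r.json()["data"]["files"][0]["dataFile"]["id"]
print("placeholder file id:", file_id)

# steps 2-4 happen outside the API: find the placeholder's storage location
# in the backend database, copy the real file over it, then update the
# file's metadata row with something along these lines:
#
#   UPDATE datafile
#      SET checksumvalue = '<md5 of the real file>',
#          contenttype   = 'application/zip',
#          filesize      = <size in bytes>
#    WHERE id = <file_id>;
```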

view this post on Zulip maría a. matienzo (Jul 23 2024 at 21:32):

relatedly: in the case that this large file is, for example, a zip file, will dataverse extract the file placed in this location? if it can't be done automatically, is there a way to trigger this to happen?

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 13:33):

Right, we used to talk about that placeholder workaround, in this thread on the mailing list, for example.

However, these days I think we enable direct upload (requires S3) and push the files up from the client.
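
For reference, a minimal sketch of the first step of that flow, requesting a presigned upload URL (server, token, pid, and size are placeholders; see the S3 direct upload guide for the full flow, including the multipart case):

```python
# Sketch: ask Dataverse for a presigned S3 upload URL (direct upload must be
# enabled on the dataset's store). All values are placeholders.
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"
SIZE = 5 * 1024**3  # size of the file to be uploaded, in bytes

r = requests.get(
    f"{SERVER}/api/datasets/:persistentId/uploadurls",
    params={"persistentId": PID, "size": SIZE},
    headers={"X-Dataverse-key": API_TOKEN},
)
r.raise_for_status()
data = r.json()["data"]
# Small files get a single presigned PUT URL in data["url"]; large files get
# data["urls"] for a multipart upload. The client then PUTs the bytes there
# and registers the file with the /add endpoint afterwards.
print(data["storageIdentifier"])
```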

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 13:34):

No, I don't believe there's a way to tell Dataverse to unzip a file afterwards.

(There is a way to trigger a reingest of a tabular file, but that's different.)
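
(For illustration, a reingest can be triggered like this; server, token, and file id are placeholders:)

```python
# Sketch: trigger reingest of a tabular file via the native API.
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
FILE_ID = 42  # placeholder database id of the tabular file

r = requests.post(
    f"{SERVER}/api/files/{FILE_ID}/reingest",
    headers={"X-Dataverse-key": API_TOKEN},
)
print(r.json())
```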

view this post on Zulip maría a. matienzo (Jul 24 2024 at 16:02):

okay, thanks. at Berkeley, we're still relying on file-based backend storage, and i don't think we'll be able to switch to S3 before we launch our Dataverse instance.

view this post on Zulip maría a. matienzo (Jul 24 2024 at 16:11):

we do want to revisit using object storage post launch, though (along with a number of other things!)

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 16:11):

There are some additional big data options on the horizon, such as Globus.

view this post on Zulip maría a. matienzo (Jul 24 2024 at 17:30):

do you know if there's a diagram or something that describes the workflow of steps a file goes through on upload? i realize i could look at the code, but i'm curious if there's something in addition to that.

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:21):

@Oliver Bertuch created a nice diagram in this issue: Refactor file upload from web UI and temporary storage #6656

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:22):

file-upload.png

view this post on Zulip maría a. matienzo (Jul 24 2024 at 18:36):

got it - so i gather that FileUtil is where that process occurs?

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:42):

Well, I think a lot happens in the IngestService as well. I haven't looked closely in a while. :sweat_smile:

view this post on Zulip maría a. matienzo (Jul 24 2024 at 18:43):

got it - trying to figure out which seam i should advocate adding an API endpoint at :thinking:

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 18:46):

Sorry, I'm multi-tasking poorly. :sweat_smile: You might want to make a PR? To do what?

view this post on Zulip maría a. matienzo (Jul 24 2024 at 18:54):

at this point, not a PR, just a feature request for an api endpoint to be able to retrigger these tasks.

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 19:02):

Oh, retriggering unzipping of a file, for example?

view this post on Zulip maría a. matienzo (Jul 24 2024 at 19:02):

exactly. :smile:

view this post on Zulip Philip Durbin 🚀 (Jul 24 2024 at 19:03):

Sure, please go ahead and create an issue for that one, if you like.

Smaller issues are better for us.

view this post on Zulip maría a. matienzo (Jul 24 2024 at 23:04):

done: https://github.com/IQSS/dataverse/issues/10723

view this post on Zulip Philip Durbin 🚀 (Jul 25 2024 at 02:01):

Thanks!

view this post on Zulip maría a. matienzo (Jul 30 2024 at 22:38):

related question: is it possible that the replace file API method would work here (writing just the json data, not the file itself) instead of updating the database directly? again, this is for file-backed storage, not s3. (rationale: i'd be more comfortable sending this as an API request rather than mucking about in the database.)

view this post on Zulip Philip Durbin 🚀 (Jul 31 2024 at 20:23):

Yes, or maybe this API: https://guides.dataverse.org/en/6.3/developers/s3-direct-upload-api.html#adding-the-uploaded-file-to-the-dataset

view this post on Zulip maría a. matienzo (Jul 31 2024 at 21:58):

ah, i see. i think the issue with using the /add endpoint is that there might not be an existing storageIdentifier for the file. is there something that i'm missing there?

view this post on Zulip maría a. matienzo (Jul 31 2024 at 22:26):

okay, interesting. /replace is failing in this case because of a uniqueness constraint violation on dvobject if i pass it the same storage identifier. i take it that this is expected? would it be more correct to mint a new storage identifier?

view this post on Zulip maría a. matienzo (Jul 31 2024 at 23:22):

okay, i think i've got this figured out. i do need to mint a new storage identifier, which i can do fairly easily.
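
for reference, minting just means generating a fresh identifier of the right shape and moving the file into place under it; a rough sketch (the identifier format and paths here are assumptions based on what our file store produces):

```python
# sketch: mint a new storage identifier for a file store and move the big
# file into place. the "<hex timestamp>-<random hex>" format and the dataset
# directory path are assumptions/placeholders.
import secrets
import shutil
import time

dataset_dir = "/usr/local/dvn/data/10.5072/FK2/EXAMPLE"  # placeholder

name = f"{int(time.time() * 1000):x}-{secrets.token_hex(6)}"
storage_identifier = f"file://{name}"

shutil.move("/tmp/bigfile.zip", f"{dataset_dir}/{name}")
print(storage_identifier)
```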

view this post on Zulip maría a. matienzo (Jul 31 2024 at 23:39):

/add, however, doesn't work if i've minted a new storageIdentifier; i get a 400 error (Dataset store configuration does not allow provided storageIdentifier.)

view this post on Zulip Philip Durbin 🚀 (Aug 01 2024 at 13:52):

María A. Matienzo said:

ah, i see. i think the issue with using the /add endpoint is that there might not be an existing storageIdentifier for the file. is there something that i'm missing there?

Well, my mental model of this is that the storageIdentifier should be whatever filename you give the file. So you put the file on disk as b4c5c9ab-4b4f and then use that same string as the storageIdentifier in the JSON.

I'm not sure if I've ever done this. I'm not sure if it'll work for non-S3 files. (That S3 direct upload API should work for files on S3.)

view this post on Zulip maría a. matienzo (Aug 01 2024 at 16:33):

@Philip Durbin got it. i think the new workaround, which seems to work, is as follows (sketched in code after the list):

  1. upload a small placeholder file with the same filename
  2. get the file's ID from the API response
  3. move the actual file into place under a newly minted storageIdentifier
  4. call the /replace (or, in theory, the /replaceFiles) endpoint
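
roughly, in code (server, token, pid, the storageIdentifier value, and the jsonData fields are assumptions/placeholders based on the guides):

```python
# sketch of the workaround for file-backed storage; all values are
# placeholders, and the jsonData fields follow the direct upload guide.
import json
import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.5072/FK2/EXAMPLE"

# step 1: upload a small placeholder file under the big file's name
with open("placeholder.zip", "rb") as f:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("bigfile.zip", f)},
    )
r.raise_for_status()

# step 2: get the file's id from the api response
file_id = r.json()["data"]["files"][0]["dataFile"]["id"]

# step 3: move the actual file into place under a newly minted
# storageIdentifier (see the minting sketch above); suppose it came out as:
new_storage_identifier = "file://18b39722140-50eb7d3c5ece"

# step 4: call /replace with jsonData pointing at the new storageIdentifier
json_data = {
    "storageIdentifier": new_storage_identifier,
    "fileName": "bigfile.zip",
    "mimeType": "application/zip",
    "checksum": {"@type": "MD5", "@value": "ee25b5764f0ea9f7b0d6887c1e297291"},
}
r = requests.post(
    f"{SERVER}/api/files/{file_id}/replace",
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},
)
print(r.json())
```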

view this post on Zulip Philip Durbin 🚀 (Aug 01 2024 at 17:22):

Great! Do you want to create a pull request to document it? :grinning:

view this post on Zulip maría a. matienzo (Aug 01 2024 at 18:18):

yup, sure - once i've done some more testing, i'd be happy to. any recommendation about where it should sit?

view this post on Zulip maría a. matienzo (Aug 01 2024 at 18:19):

i suppose it could go in the direct upload documentation, if people know to look there

view this post on Zulip Philip Durbin 🚀 (Aug 01 2024 at 18:22):

Hmm, you're doing this because the file is large, right? How about somewhere under a future version of https://guides.dataverse.org/en/6.3/developers/big-data-support.html ?

view this post on Zulip maría a. matienzo (Aug 01 2024 at 18:22):

yup, that sounds good. thanks!

view this post on Zulip Philip Durbin 🚀 (Aug 02 2024 at 14:14):

This just in from Jim Myers:

"File stores can use dataverse.files.<id>.upload-out-of-band flag which allows an improved file hack - you place the file and then call the https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html#adding-the-uploaded-file-to-the-dataset or https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html#to-add-multiple-uploaded-files-to-the-dataset to add the metadata (including size, hash, etc.). It just avoids having to edit the db."

view this post on Zulip maría a. matienzo (Aug 02 2024 at 15:59):

oh, neat. thanks! i'll give that a shot.

view this post on Zulip Bethany Seeger (Aug 13 2024 at 15:56):

If you don't mind my asking, what's considered a large upload? Anything greater than 2GB? We are troubleshooting some issues with files greater than 2GB, so I'm trying to figure out whether the above suggestions are where we need to go as well. Thanks!

view this post on Zulip Philip Durbin 🚀 (Aug 13 2024 at 16:05):

Sure, 2GB is pretty big. I would suggest trying DVUploader, either the traditional Java one or the Python one.

