Stream: dev

Topic: Upload checks need optimization


view this post on Zulip Oliver Bertuch (Oct 01 2025 at 13:31):

We are not handling uploads very cleverly, are we? For example, instead of looking at the Content-Length header, we transfer the whole file and then take a look at it. Depending on the file size, that's a long turnaround time... :see_no_evil:

view this post on Zulip Philip Durbin 🚀 (Oct 01 2025 at 13:38):

Which system are you talking about? JSF? The SPA? dvwebloader? DVUploader? python-DVUploader?

view this post on Zulip Oliver Bertuch (Oct 01 2025 at 13:39):

Backend API endpoint.

view this post on Zulip Oliver Bertuch (Oct 01 2025 at 13:39):

https://github.com/poikilotherm/dataverse/blob/f79a02b4ad33f4febba96ddc41278011f42a87b1/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/CreateNewDataFilesCommand.java#L191-L200

view this post on Zulip Philip Durbin 🚀 (Oct 01 2025 at 13:40):

Ah, you mean for limiting the size?

view this post on Zulip Oliver Bertuch (Oct 01 2025 at 13:40):

Aye

view this post on Zulip Philip Durbin 🚀 (Oct 01 2025 at 13:40):

Hmm, I think you're right about that code but maybe I'm missing something. @Leo Andreev can you please comment?

view this post on Zulip Oliver Bertuch (Oct 01 2025 at 13:42):

We don't even need to have a fancy JAX-RS ContainerFilter in place. I assume we could just access the HTTP headers via the DataverseRequest.

view this post on Zulip Oliver Bertuch (Oct 01 2025 at 13:42):

Although I'm not sure this would actually stop transferring the data. A filter may be necessary to interrupt the process early.
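A minimal sketch of that kind of filter, assuming a hypothetical UploadSizeFilter class and a hard-coded limit (in practice the limit would come from configuration and the filter would be bound only to the upload endpoints; this is not existing Dataverse code):

```java
import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.core.HttpHeaders;
import jakarta.ws.rs.core.Response;
import jakarta.ws.rs.ext.Provider;

@Provider
public class UploadSizeFilter implements ContainerRequestFilter {

    // Illustrative only; a real limit would come from configuration / quota.
    private static final long MAX_UPLOAD_BYTES = 2L * 1024 * 1024 * 1024;

    @Override
    public void filter(ContainerRequestContext ctx) {
        String contentLength = ctx.getHeaderString(HttpHeaders.CONTENT_LENGTH);
        if (contentLength == null) {
            return; // e.g. chunked transfer encoding: no total size advertised
        }
        try {
            if (Long.parseLong(contentLength) > MAX_UPLOAD_BYTES) {
                // 413 Payload Too Large. Whether the client actually stops sending
                // the body at this point depends on the container and the client.
                ctx.abortWith(Response.status(413)
                        .entity("Upload exceeds the maximum allowed size.")
                        .build());
            }
        } catch (NumberFormatException ignored) {
            // Malformed header: fall through to normal request processing.
        }
    }
}
```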

view this post on Zulip Leo Andreev (Oct 01 2025 at 19:50):

Yes, this is a problem with the native ("non-direct") upload API. One of numerous problems with it. *)

Addressing it would not be as easy as checking some HTTP header, though. With large uploads in particular, there is likely not going to be any such header advertising the total size of the byte stream. The client is more likely to use chunked encoding for the transfer, where the size header is supplied for each buffer-worth ("chunk") of bytes instead.
What we can do is stop and reject the transfer the moment it reaches the file size limit (or goes over the storage quota), which would still be much better than allowing a giant, potentially filesystem-flooding upload to complete before rejecting it. (A rough sketch of this idea follows after this message.)
From what I understand, we would need to implement our own input streaming (as opposed to using the jakarta.ws.rs and IceFaces implementations we are using in the API and JSF UI, respectively).

*) I dislike the native/basic upload API (/api/datasets/{id}/add) enough that I am considering disabling it on the IQSS prod. instance, now that we have direct upload enabled on all storage volumes.
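To make the "stop at the limit" idea above concrete, here is a minimal sketch of a size-limiting stream wrapper; the class name is hypothetical and this is not existing Dataverse code, and the limit would be the file size limit or the remaining quota:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SizeLimitingInputStream extends FilterInputStream {

    private final long maxBytes; // file size limit or remaining storage quota
    private long bytesRead = 0;

    public SizeLimitingInputStream(InputStream in, long maxBytes) {
        super(in);
        this.maxBytes = maxBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count(1);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count(n);
        }
        return n;
    }

    private void count(long n) throws IOException {
        bytesRead += n;
        if (bytesRead > maxBytes) {
            // Aborting here stops the rest of a potentially huge body from ever
            // being written to temporary storage.
            throw new IOException("Upload exceeds the size limit of " + maxBytes + " bytes.");
        }
    }
}
```

The upload handling code would wrap the request's input stream with this before copying anything to temporary storage, and translate the exception into an error response.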

view this post on Zulip Oliver Bertuch (Oct 02 2025 at 06:05):

With the current implementation, we have no chunked upload support either. The multipart/form-data thing is single file in one go only...

view this post on Zulip Oliver Bertuch (Oct 02 2025 at 06:06):

For chunked uploads we'd need to implement custom handling anyway. There are a few "vendor standards" (Google, AWS, tus.io), but not one single concise thing. It boils down to having three additional endpoints: one to initialise a chunked upload, one to upload the chunks themselves, and one to check status.
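As a rough illustration of that three-endpoint shape, a sketch of what such a resource could look like; the paths, parameters, and class name are hypothetical, not an existing or proposed Dataverse API:

```java
import java.io.InputStream;
import java.util.UUID;

import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.PUT;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("chunked-uploads")
public class ChunkedUploadResource {

    // 1) Initialise: the client declares the total size up front, so limits and
    //    quotas can be enforced before any bytes are transferred.
    @POST
    public Response initiate(@QueryParam("fileName") String fileName,
                             @QueryParam("totalSize") long totalSize) {
        // ... validate totalSize against the size limit / storage quota ...
        String uploadId = UUID.randomUUID().toString();
        return Response.status(Response.Status.CREATED).entity(uploadId).build();
    }

    // 2) Upload a single chunk of the file.
    @PUT
    @Path("{uploadId}/chunks/{index}")
    @Consumes(MediaType.APPLICATION_OCTET_STREAM)
    public Response uploadChunk(@PathParam("uploadId") String uploadId,
                                @PathParam("index") int index,
                                InputStream chunk) {
        // ... persist the chunk; reject if the running total exceeds the declared size ...
        return Response.noContent().build();
    }

    // 3) Status check: lets the client see which chunks arrived, enabling resume.
    @GET
    @Path("{uploadId}")
    public Response status(@PathParam("uploadId") String uploadId) {
        return Response.ok("{\"receivedChunks\": []}").build();
    }
}
```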

view this post on Zulip Oliver Bertuch (Oct 02 2025 at 06:24):

I suppose these days a lot of development in the HTTP world focuses on streaming. HTTP/3 (QUIC) removes the "chunked encoding" altogether and replaces it with stream support. The application doesn't need to deal with the details; the stack handles that for you. You just send a "normal" PUT/POST and QUIC takes care of splitting this into streams. Obviously, we can't use that - Payara is at HTTP/2 only. Using something like tus.io would give us compatibility with HTTP/1.1 and is at least an open standard (there is even an active IETF draft to become a "real" standard).
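To give a feel for what tus looks like on the wire, a rough client-side sketch of the core protocol (creation, then a PATCH of bytes at an offset), based on the published tus 1.0 spec; the server URL is a placeholder, and error handling and resumption via HEAD are omitted:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TusUploadSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        byte[] data = "hello".getBytes();

        // 1) Creation: POST the total size; the server answers 201 with a Location
        //    header pointing at the new upload resource.
        HttpRequest create = HttpRequest.newBuilder(URI.create("https://example.org/files"))
                .header("Tus-Resumable", "1.0.0")
                .header("Upload-Length", String.valueOf(data.length))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        String uploadUrl = client.send(create, HttpResponse.BodyHandlers.discarding())
                .headers().firstValue("Location").orElseThrow();

        // 2) Transfer: PATCH bytes at the current offset. After an interruption the
        //    client asks the server for the offset again (HEAD) and resumes from there.
        HttpRequest patch = HttpRequest.newBuilder(URI.create(uploadUrl))
                .header("Tus-Resumable", "1.0.0")
                .header("Upload-Offset", "0")
                .header("Content-Type", "application/offset+octet-stream")
                .method("PATCH", HttpRequest.BodyPublishers.ofByteArray(data))
                .build();
        client.send(patch, HttpResponse.BodyHandlers.discarding());
    }
}
```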

view this post on Zulip Leo Andreev (Oct 02 2025 at 13:27):

With the current implementation, we have no chunked upload support either. The multipart/form-data thing is single file in one go only...

Are you referring to the API, or the JSF/IceFaces upload?

view this post on Zulip Philip Durbin ๐Ÿš€ (Oct 02 2025 at 13:28):

IceFaces takes me back :smile:

view this post on Zulip Leo Andreev (Oct 02 2025 at 13:28):

Yeah. Ice, prime, same stuff. You know what I meant. :)

view this post on Zulip Leo Andreev (Oct 02 2025 at 13:43):

If the /add API only supports multipart/form-data (does it really? hmm)... That would mean that it is in fact possible to check the total size header and reject the call early on if it's too large. We would need to reimplement how it's handled (i.e., I'm pretty sure that in the current implementation the API method is only called once the transfer is complete; we'll need to intercept that call earlier).
But, rather than making this API support true streaming/chunked encoding, I would rather effectively deprecate it. I really believe we need to replace or supplement it with an equivalent of the direct upload we are already using for S3; that would allow streaming uploads more efficiently and, most importantly, allow multiple file uploads, using /addFiles (again, similarly to the direct-to-S3 uploads).

view this post on Zulip Oliver Bertuch (Oct 02 2025 at 15:02):

If the JSF upload uses chunks, it's a custom PrimeFaces thing with its own protocol. I'm only talking about the /api/datasets/ID/add endpoint.

view this post on Zulip Oliver Bertuch (Oct 02 2025 at 15:06):

I don't think it's necessary to have an endpoint that allows multiple files at once. But I'd like to see chunked or streamed/multiplexed uploads become an option without S3, making uploads resumable and faster thanks to parallel uploads.

The client side would need to implement such a chunked protocol anyway, so it's no problem to make the client iterate over files. (Potentially uploading multiple files in parallel, each within its own "chunk upload session".)

The tus.io stuff looks very interesting; I could totally see us using that and not needing to rely on out-of-band uploads. After all, exposing an S3 server on the net is not without risks.

view this post on Zulip Leo Andreev (Oct 02 2025 at 15:48):

When I'm talking about multi-file uploads, again, I'm referring to the model already used for the direct-to-S3 uploads.
You are still uploading one file at a time. You also request the upload authorization for each file, one at a time. Note that you must supply the file size when requesting this pre-signed upload URL. So it is very easy to reject an upload request based on size early on, before any bytes are transferred.
But, you can then use one /addFiles call to _finalize_ multiple file uploads to add the datafiles to the dataset, once the physical file transfers have been completed. This is extremely important, because updating the version after every upload (like the /add API does) can become very expensive as the number of files in the dataset grows.
I absolutely want to be able to do something like this for non-S3 uploads.
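For reference, a condensed client-side sketch of that existing direct-to-S3 flow, simplified from the "Direct DataFile Upload/Replace API" guide; the server, PID, token, and file metadata below are placeholders, real calls also send a checksum per file, and large files get multipart pre-signed URLs:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DirectUploadSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String server = "https://demo.dataverse.org";   // placeholder
        String pid = "doi:10.5072/FK2/EXAMPLE";         // placeholder
        String apiToken = "xxxxxxxx";                   // placeholder
        long fileSize = 123_456_789L;

        // 1) Per file: request a pre-signed upload URL, declaring the size up front,
        //    which is what makes early, zero-byte rejection possible.
        HttpRequest uploadUrls = HttpRequest.newBuilder(URI.create(server
                + "/api/datasets/:persistentId/uploadurls?persistentId=" + pid + "&size=" + fileSize))
                .header("X-Dataverse-key", apiToken)
                .GET()
                .build();
        String response = client.send(uploadUrls, HttpResponse.BodyHandlers.ofString()).body();
        // The response carries the pre-signed URL(s) and a storageIdentifier;
        // the bytes are then PUT straight to S3, bypassing the Dataverse server.
        String storageIdentifier = "s3://bucket:generated-id"; // parsed from the response in reality

        // 2) Once per batch: finalize all uploaded files with a single /addFiles call,
        //    so the dataset version is only updated once.
        String jsonData = "[{\"storageIdentifier\":\"" + storageIdentifier
                + "\",\"fileName\":\"data.csv\",\"mimeType\":\"text/csv\"}]";
        String boundary = "----dvboundary";
        String form = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"jsonData\"\r\n\r\n"
                + jsonData + "\r\n--" + boundary + "--\r\n";
        HttpRequest addFiles = HttpRequest.newBuilder(URI.create(server
                + "/api/datasets/:persistentId/addFiles?persistentId=" + pid))
                .header("X-Dataverse-key", apiToken)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        client.send(addFiles, HttpResponse.BodyHandlers.ofString());
    }
}
```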

view this post on Zulip Leo Andreev (Oct 07 2025 at 18:41):

In the spirit of useful incremental improvements, it would be great to start with re-working the "classic" upload API (/add), just so that it can check multipart/form-data headers and reject over-the-limit uploads without accepting the entire upload first.

