Stream: large-data

Topic: Direct Upload to Immutable S3 Buckets


Daniel C Schmidt (Oct 16 2025 at 16:56):

Hi folks, hope this is the right place to ask (and happy to relay elsewhere). We're testing Dataverse with direct uploads to a Cloudian S3 bucket with immutability enabled, which in their implementation requires that the Content-MD5 header be sent when PUTting an object to the backend. My understanding is that the presigned URL would need to list Content-MD5 among its signed headers and incorporate the value into its signature, in addition to the client separately sending that header in the direct PUT request. I don't think that's possible unless the UI prompts the user for the MD5 or otherwise calculates it before submitting the Dataverse API request that generates the presigned URL. But I just wanted to throw it out there in case anyone else had run into this specific issue! Any thoughts would be appreciated.
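
For reference, the value that S3-style APIs expect in Content-MD5 is the base64 encoding of the raw 16-byte MD5 digest (per RFC 1864), not the hex string. Below is a minimal Python sketch of computing both forms on the client before asking for a presigned URL. The helper name `md5_for_upload` is made up for illustration, and which form any particular Dataverse field expects is not something this sketch settles.

```python
import base64
import hashlib
from typing import BinaryIO, Tuple


def md5_for_upload(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[str, str]:
    """Return (hex_digest, content_md5_header_value) for a binary stream.

    Hypothetical helper: reads in chunks so large files need not fit in memory.
    """
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    # Content-MD5 carries base64 of the raw digest bytes, not of the hex text.
    header_value = base64.b64encode(digest.digest()).decode("ascii")
    return digest.hexdigest(), header_value
```

With a real file this would be called as `md5_for_upload(open(path, "rb"))`, and the second element of the result is what the PUT request would carry in its Content-MD5 header.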

Philip Durbin 🚀 (Oct 16 2025 at 18:03):

Hmm. Could you please open an issue about this at https://github.com/IQSS/dataverse/issues ?

Daniel C Schmidt (Oct 16 2025 at 18:11):

@Philip Durbin 🚀 Will do! I'll file that later today.

Daniel C Schmidt (Oct 16 2025 at 19:53):

OK, submitted: https://github.com/IQSS/dataverse/issues/11901 Hope that's clear. This is not a big ask for us, as we're going to use a workaround, but given that it touches on data integrity I think it's useful for you (us?) all to ponder even if it doesn't make it into Dataverse.

Daniel C Schmidt (Oct 16 2025 at 19:55):

As for our workaround(s):

  1. For integrity checking, it looks like we're going to have to do something out of band with the ComputeChecksum bucket-side feature. That is, we'll upload without checksums, compute checksums on the bucket, then compare them with our metadata in Dataverse before publishing.
  2. For immutability, in our case it's possible to disable ObjectLock in the application bucket but enable it in our replicated bucket. That gets us the security protections we want. There aren't a lot of downsides, either: if we need to fail over to the replicated bucket, it's possible in a pinch to temporarily disable ObjectLock.

Philip Durbin 🚀 (Oct 16 2025 at 19:58):

Looks great. Thanks! Have you looked at https://guides.dataverse.org/en/6.8/developers/s3-direct-upload-api.html ?

You'll find md5Hash in there.

Daniel C Schmidt (Oct 16 2025 at 20:01):

We have, but unless I'm mistaken that doesn't solve this problem. We're not able to upload files at all to S3 buckets with ObjectLock enabled because the Content-MD5 header needs to be submitted at upload time. That's true whether the upload goes through Dataverse (direct-upload=false) or comes from the client (direct-upload=true, in which case the presigned URL would also need to incorporate the header and its value). Does that make sense?
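
To make the signing requirement concrete, here is a rough Python sketch of AWS Signature Version 4 query-string presigning with content-md5 listed among the signed headers. It assumes the Cloudian endpoint follows AWS SigV4 and virtual-hosted-style addressing, which the thread does not confirm. `presign_put`, the host, and the credentials are placeholders, and this is not a description of how Dataverse generates its URLs.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_put(host, key, content_md5, access_key, secret_key,
                region="us-east-1", expires=3600, now=None):
    """Hypothetical helper: presign a PUT whose signature covers Content-MD5."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    scope = f"{amz_date[:8]}/{region}/s3/aws4_request"
    # The client must send both headers with exactly the signed values.
    signed_headers = "content-md5;host"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": signed_headers,
    }
    query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}"
        for k, v in sorted(params.items())
    )
    path = "/" + quote(key, safe="/-_.~")
    canonical_request = "\n".join([
        "PUT",
        path,
        query,
        f"content-md5:{content_md5}\nhost:{host}\n",  # canonical headers
        signed_headers,
        "UNSIGNED-PAYLOAD",  # body hash is not part of a presigned S3 URL
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> terminator.
    signing_key = ("AWS4" + secret_key).encode("utf-8")
    for part in (amz_date[:8], region, "s3", "aws4_request"):
        signing_key = _sign(signing_key, part)
    signature = hmac.new(signing_key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return f"https://{host}{path}?{query}&X-Amz-Signature={signature}"
```

The client would then PUT to the returned URL with a Content-MD5 header carrying exactly the signed value. A different value changes the canonical request, so the signature no longer matches, which is why the MD5 has to be known before the URL is generated.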

Daniel C Schmidt (Oct 16 2025 at 20:03):

It brings up a related question: how are folks verifying file integrity when using S3 backends? Is that all out-of-band? If we're not validating checksums on submission, then the obvious answer would be to audit buckets by comparing metadata in Dataverse with the results of a ComputeChecksum job on the bucket.
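
A minimal Python sketch of that kind of audit, assuming the two checksum listings have already been exported as plain mappings from storage key to hex MD5. How those listings are obtained (Dataverse metadata on one side, the ComputeChecksum job output on the other) is left out, and `audit_checksums` is a hypothetical helper.

```python
from typing import Dict, List


def audit_checksums(dataverse_md5: Dict[str, str],
                    bucket_md5: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare checksums recorded in Dataverse with those computed on the bucket.

    Both arguments map a storage key to a hex MD5 string (assumed format).
    """
    report: Dict[str, List[str]] = {
        "mismatched": [],
        "missing_in_bucket": [],
        "untracked_in_bucket": [],
    }
    for key, expected in sorted(dataverse_md5.items()):
        actual = bucket_md5.get(key)
        if actual is None:
            report["missing_in_bucket"].append(key)
        elif actual.strip().lower() != expected.strip().lower():
            report["mismatched"].append(key)
    # Objects present in the bucket that Dataverse has no record of.
    report["untracked_in_bucket"] = sorted(set(bucket_md5) - set(dataverse_md5))
    return report
```

An empty report across all three lists would be the condition to check before publishing; anything else points at objects worth inspecting by hand.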

Philip Durbin 🚀 (Oct 16 2025 at 20:21):

Hmm, I see what you mean, I think. That md5Hash I mentioned is the MD5 that you report to Dataverse for the file. But Dataverse is just trusting you on that, right?

Philip Durbin 🚀 (Oct 16 2025 at 20:23):

Anyway, I think Jim is on his way to the Head of the Charles but he'll probably see your issue and get back to you next week. He implemented most of this. :smile:


Last updated: Nov 01 2025 at 14:11 UTC