There's a nice "direct upload to S3" script at https://github.com/IQSS/dataverse.harvard.edu/tree/3fc9bfe9a171b2f7546ad44b1114f5c3920907d1/util/python/direct-upload
@Jan Range how do you feel about adding it to easyDataverse or pyDataverse?
I just checked with Leonid and he's cool with it. It sounds like you two even discussed it already. The only caveat, he said, is that it doesn't support multipart S3 upload. This is already mentioned in the README. ^^
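For context, the single-part flow behind that script looks roughly like this (a minimal sketch against the Dataverse direct upload API as documented; BASE_URL, API_TOKEN, PID, and PATH are placeholders, not values from this thread):

```python
# Sketch of single-part direct upload: ask Dataverse for a pre-signed S3 URL,
# PUT the bytes straight to the bucket, then register the file in the dataset.
import hashlib
import json
import os

import requests

BASE_URL = "https://demo.dataverse.org"   # placeholder
API_TOKEN = "xxxx-xxxx"                   # placeholder Dataverse API token
PID = "doi:10.70122/FK2/XXXXXX"           # placeholder dataset PID
PATH = "data.csv"                         # placeholder file

size = os.path.getsize(PATH)

# 1. Request a pre-signed upload URL and storage identifier for this size.
r = requests.get(
    f"{BASE_URL}/api/datasets/:persistentId/uploadurls",
    params={"persistentId": PID, "size": size},
    headers={"X-Dataverse-key": API_TOKEN},
)
r.raise_for_status()
data = r.json()["data"]
upload_url = data["url"]                  # single-part case only
storage_identifier = data["storageIdentifier"]

# 2. PUT the file directly to S3, tagged as temporary until registration.
with open(PATH, "rb") as fh:
    body = fh.read()                      # fine for a sketch; stream for big files
requests.put(
    upload_url, data=body, headers={"x-amz-tagging": "dv-state=temp"}
).raise_for_status()

# 3. Register the uploaded object as a DataFile in the dataset.
json_data = {
    "storageIdentifier": storage_identifier,
    "fileName": os.path.basename(PATH),
    "mimeType": "text/csv",
    "md5Hash": hashlib.md5(body).hexdigest(),
}
requests.post(
    f"{BASE_URL}/api/datasets/:persistentId/add",
    params={"persistentId": PID},
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},
).raise_for_status()
```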
Yes, I already put this in my ToDos for this and next week. Does Demo Dataverse allow direct uploads already?
Yes, it does. But if you have any trouble at all, please ping me!
Should I create the issue in the easyDataverse repo or the pyDataverse repo?
Let's go for pyDataverse, I think it's best suited for this.
Ah, the person who wants this gave it a :tada: :smile:
Perfect :raised_hands: In this case we may not need to create a new issue as it has already been pointed out in #136
Just added this issue: https://github.com/gdcc/pyDataverse/issues/157
@Jan Range as Phil says, demo.dataverse.org does support direct upload, but I don't believe that's the "default" datastore. Be certain to ask (Kevin or Leonid) which datastore your collection is using - only cautioning as I don't believe you get direct S3 upload by default.
Yes, Leonid emphasized we should let him know if direct upload isn't working on demo.
Alright, perfect! Thanks for the heads up :-)
Just tried it and I don't have permission for a direct upload :sweat: Can I just send you the name of my collection, or can you set one up for me on Demo Dataverse?
@Jan Range when you get a chance, can you please ask Leonid on Slack?
Alright, will do :-)
Got it :tada:
So far everything is working; I will transfer the code to pyDataverse and open a pull request once tests are implemented. One thing I wanted to ask is whether there is functionality to assemble chunks on the Dataverse side. Maybe this way we can get around the max_part_size of AWS.
Uh. I'm probably misunderstanding the question (let me know if you'd like to hop on a video call), but DVUploader has a way to break up files client-side, if that helps: https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader/blob/v1.1.0/src/main/java/org/sead/uploader/dataverse/HttpPartUploadJob.java#L25
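For the record, the multipart side of the direct upload API works roughly as follows, as far as I understand it: when the requested size exceeds the store's part size, /uploadurls returns pre-signed URLs for each part plus abort and complete URLs; you PUT each chunk, collect the ETags, and then PUT the ETag map to the complete URL, which is what assembles the parts on the S3 side. A rough, untested sketch (same placeholder names as above):

```python
# Sketch of the multipart direct-upload flow: split the file client-side,
# PUT each part to its pre-signed URL, then "complete" with the ETag map.
import json
import os

import requests

BASE_URL = "https://demo.dataverse.org"   # placeholder
API_TOKEN = "xxxx-xxxx"                   # placeholder
PID = "doi:10.70122/FK2/XXXXXX"           # placeholder
PATH = "big_file.bin"                     # placeholder

size = os.path.getsize(PATH)
r = requests.get(
    f"{BASE_URL}/api/datasets/:persistentId/uploadurls",
    params={"persistentId": PID, "size": size},
    headers={"X-Dataverse-key": API_TOKEN},
)
r.raise_for_status()
data = r.json()["data"]

part_size = data["partSize"]
part_urls = data["urls"]                  # {"1": presigned_url, "2": ..., ...}
complete_url = BASE_URL + data["complete"]
abort_url = BASE_URL + data["abort"]

etags = {}
try:
    with open(PATH, "rb") as fh:
        for part_no in sorted(part_urls, key=int):
            chunk = fh.read(part_size)
            resp = requests.put(part_urls[part_no], data=chunk)
            resp.raise_for_status()
            etags[part_no] = resp.headers["ETag"].strip('"')
    # This call assembles the uploaded parts into a single S3 object.
    requests.put(
        complete_url,
        data=json.dumps(etags),
        headers={"X-Dataverse-key": API_TOKEN},
    ).raise_for_status()
except Exception:
    # Abort so a failed upload doesn't leave orphaned parts in the bucket.
    requests.delete(abort_url, headers={"X-Dataverse-key": API_TOKEN})
    raise

# The file still has to be registered via /api/datasets/:persistentId/add,
# exactly like the single-part case.
```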
Perfect, that was what I was looking for!
Happy to report that the S3 upload is now implemented in Python. It will be shipped with the next EasyDataverse update!
I have a question though: the multipart upload is parallelized, and so far I have only tested it with rather small files. However, I am fairly certain that big files might cause memory/bandwidth issues when uploaded in parallel. Hence, I'd like to restrict the number or size of parallel part uploads.
This may depend on the user's machine/connection and will be customizable, but do you have any suggestion for a default? Like a 4 GB maximum upload volume at once?
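One common way to bound both memory and bandwidth is to cap the number of in-flight parts rather than a total byte budget: with, say, 4 workers and 1 GB parts you never hold more than roughly 4 GB in memory, and each part is only read from disk when its worker starts. A rough sketch of that idea (the function names and the 4-worker default are my assumptions, not anything decided in this thread):

```python
# Sketch: bound parallel part uploads so memory stays ~ n_workers * part_size.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

N_WORKERS = 4  # assumed default; should be user-configurable


def upload_part(path, part_no, part_size, presigned_url):
    """Read one part lazily, PUT it to its pre-signed URL, return its ETag."""
    with open(path, "rb") as fh:
        fh.seek((int(part_no) - 1) * part_size)
        chunk = fh.read(part_size)
    resp = requests.put(presigned_url, data=chunk)
    resp.raise_for_status()
    return part_no, resp.headers["ETag"].strip('"')


def upload_all_parts(path, part_size, part_urls):
    """part_urls: {"1": url, "2": url, ...} as returned by /uploadurls."""
    etags = {}
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        futures = [
            pool.submit(upload_part, path, no, part_size, url)
            for no, url in part_urls.items()
        ]
        for fut in as_completed(futures):
            no, etag = fut.result()
            etags[no] = etag
    return etags
```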
Btw, if you are interested, here is the branch of the next EasyDataverse release. Once tests are up and running, it'll be shipped.
https://github.com/gdcc/easyDataverse/tree/flexible-connect
@Leo Andreev check it out! S3 upload in EasyDataverse! ^^ :tada:
Hi All,
I've uploaded 180 GB files to S3 (multipart, of course) with my own script, and the main problems I've noticed are:
I know that 180 GB is extreme, but those problems will still occur for 32 GB+ files. I put it in this thread because direct upload to S3 is in fact the only way to deal with large files.
@Andrzej Zemla hi! Have you tried https://github.com/gdcc/python-dvuploader by @Jan Range ? It's new! Announced here: https://groups.google.com/g/dataverse-community/c/TQZJOpYmXbU/m/6m27nm4dAQAJ
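(For anyone landing here later: the README usage is, from memory, roughly the snippet below; the exact class and parameter names may have changed, so check the repo rather than trusting this sketch.)

```python
# Rough usage sketch for python-dvuploader, reconstructed from memory of its
# README; verify names against https://github.com/gdcc/python-dvuploader.
import dvuploader as dv

files = [
    dv.File(filepath="./small.txt"),
    dv.File(directory_label="some/dir", filepath="./big_file.bin"),
]

uploader = dv.DVUploader(files=files)
uploader.upload(
    api_token="xxxx-xxxx",                    # placeholder
    dataverse_url="https://demo.dataverse.org/",
    persistent_id="doi:10.70122/FK2/XXXXXX",  # placeholder
)
```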
No, I didn't, I needed it in July ;), but I'll test it for sure and send you feedback.
@Andrzej Zemla thanks, that would be extremely helpful :blush: