Stream: python

Topic: direct upload to S3


view this post on Zulip Philip Durbin 🚀 (Mar 15 2023 at 21:24):

There's a nice "direct upload to S3" script at https://github.com/IQSS/dataverse.harvard.edu/tree/3fc9bfe9a171b2f7546ad44b1114f5c3920907d1/util/python/direct-upload

@Jan Range how do you feel about adding it to easyDataverse or pyDataverse?

I just checked with Leonid and he's cool with it. It sounds like you two even discussed it already. The only caveat, he said, is that it doesn't support multipart S3 upload. This is mentioned already in the README. ^^

view this post on Zulip Jan Range (Mar 15 2023 at 21:39):

Yes, already put this in my ToDo's for this and next week. Does Demo Dataverse allow direct uploads already?

view this post on Zulip Philip Durbin 🚀 (Mar 15 2023 at 21:58):

Yes, it does. But if you have any trouble at all, please ping me!

view this post on Zulip Philip Durbin 🚀 (Mar 15 2023 at 21:58):

Should I create the issue in the easyDataverse repo or the pyDataverse repo?

view this post on Zulip Jan Range (Mar 15 2023 at 22:19):

Let's go for pyDataverse, I think it's the best fit for this.

view this post on Zulip Philip Durbin 🚀 (Mar 16 2023 at 11:26):

Ah, the person who wants this gave it a :tada: :smile:

view this post on Zulip Jan Range (Mar 16 2023 at 11:29):

Perfect :raised_hands: In this case we may not need to create a new issue as it has already been pointed out in #136

view this post on Zulip Ceilyn Boyd (Mar 16 2023 at 13:48):

Just added this issue: https://github.com/gdcc/pyDataverse/issues/157

view this post on Zulip Don Sizemore (Mar 21 2023 at 11:25):

@Jan Range as Phil says, demo.dataverse.org does support direct upload, but I don't believe that's the "default" datastore. Be certain to ask (Kevin or Leonid) which datastore your collection is using. I'm only cautioning because I don't believe you get direct S3 upload by default.

view this post on Zulip Philip Durbin 🚀 (Mar 21 2023 at 11:28):

Yes, Leonid emphasized we should let him know if direct upload isn't working on demo.

view this post on Zulip Jan Range (Mar 21 2023 at 11:56):

Alright, perfect! Thanks for the heads up :-)

view this post on Zulip Jan Range (Mar 22 2023 at 21:40):

Just tried it and I don't have permission for a direct upload :sweat: Can I just send you the name of my collection, or can you set one up for me on Demo Dataverse?

view this post on Zulip Philip Durbin 🚀 (Mar 22 2023 at 22:14):

@Jan Range when you get a chance, can you please ask Leonid on Slack?

view this post on Zulip Jan Range (Mar 22 2023 at 22:19):

Alright, will do :-)

view this post on Zulip Jan Range (Mar 22 2023 at 22:25):

Got it :tada:

view this post on Zulip Jan Range (Mar 23 2023 at 10:48):

So far everything is working. I will transfer the code to pyDataverse and open a pull request once tests are implemented. One thing I wanted to ask is whether Dataverse has functionality to assemble chunks on its side. Maybe this way we can override the max_part_size of AWS.
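
For context on part sizes, here is a minimal sketch of the arithmetic, assuming the "max_part_size of AWS" refers to S3's documented multipart limits (parts of 5 MiB to 5 GiB, at most 10,000 parts per upload). The function name `count_parts` and the constants are illustrative and are not taken from pyDataverse or Dataverse.

```python
# S3 multipart limits as documented by AWS (assumed to be what is meant above).
MIN_PART_SIZE = 5 * 1024**2   # 5 MiB; the last part may be smaller
MAX_PART_SIZE = 5 * 1024**3   # 5 GiB
MAX_PARTS = 10_000


def count_parts(file_size: int, part_size: int) -> int:
    """Return how many parts a file of file_size bytes needs at part_size bytes each.

    Hypothetical helper: raises ValueError if the plan would violate the
    S3 limits listed above.
    """
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    if not MIN_PART_SIZE <= part_size <= MAX_PART_SIZE:
        raise ValueError("part_size must be between 5 MiB and 5 GiB")
    # Ceiling division without floating point; an empty file still needs one part.
    parts = max(1, -(-file_size // part_size))
    if parts > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts
```

With 1 GiB parts, for example, `count_parts(180 * 1024**3, 1024**3)` gives 180 parts, well under the 10,000-part ceiling.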

view this post on Zulip Philip Durbin 🚀 (Mar 23 2023 at 13:21):

Uh. I'm probably misunderstanding the question (let me know if you'd like to hop on a video call) but DVUploader has a way to break up files client-side, if that helps: https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader/blob/v1.1.0/src/main/java/org/sead/uploader/dataverse/HttpPartUploadJob.java#L25
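
Breaking a file up client-side can be sketched in Python roughly as below. This is not a port of the linked Java class; `iter_parts` is a hypothetical helper that only shows the chunking step, and the actual upload of each part is left out.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_parts(stream: BinaryIO, part_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (part_number, data) pairs, holding at most part_size bytes at a time."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    part_number = 1  # S3 part numbers start at 1
    while True:
        data = stream.read(part_size)
        if not data:  # end of stream
            break
        yield part_number, data
        part_number += 1


# Usage with an in-memory stream; a real file opened with open(path, "rb") works the same way.
for number, chunk in iter_parts(io.BytesIO(b"a" * 25), part_size=10):
    print(number, len(chunk))
```

Because the generator reads lazily, only one chunk is in memory at a time unless the caller keeps references to earlier ones.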

view this post on Zulip Jan Range (Mar 23 2023 at 13:53):

Perfect, that was what I was looking for!

view this post on Zulip Jan Range (Sep 18 2023 at 05:43):

Happy to update you that the S3 upload is now implemented in Python. It will be shipped with the next EasyDataverse update!

view this post on Zulip Jan Range (Sep 18 2023 at 05:47):

I have a question though: the multipart upload is parallelized and so far I have only tested it with rather small files. However, I am certain that big files might cause memory/bandwidth issues when uploaded in parallel. Hence, I'd like to restrict the number or size of part uploads.

This may depend on the user's machine/connection and will be customizable, but do you have any suggestion for a default? Like a 4 GB maximum upload volume at once?
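
One way to cap the upload volume in flight is a semaphore that stops the reader from loading the next part until a slot frees up. The sketch below is an assumption about how such a limit could look, not the EasyDataverse implementation; `upload_in_parts` and the `upload_part` callback are hypothetical names, and the callback stands in for whatever performs the actual HTTP PUT.

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, BinaryIO, Callable, List


def upload_in_parts(
    stream: BinaryIO,
    part_size: int,
    max_in_flight: int,
    upload_part: Callable[[int, bytes], Any],
) -> List[Any]:
    """Upload a stream in parts, holding at most max_in_flight parts in memory.

    upload_part(part_number, data) is a caller-supplied placeholder.
    Results are returned in part order.
    """
    slots = threading.BoundedSemaphore(max_in_flight)

    def worker(part_number: int, data: bytes) -> Any:
        try:
            return upload_part(part_number, data)
        finally:
            slots.release()  # free the slot even if the upload failed

    futures = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        part_number = 1
        while True:
            slots.acquire()  # wait for a free slot before reading more data
            data = stream.read(part_size)
            if not data:
                slots.release()
                break
            futures.append(pool.submit(worker, part_number, data))
            part_number += 1
    # Leaving the with-block waits for all uploads; result() re-raises any error.
    return [future.result() for future in futures]
```

Peak memory is then roughly `max_in_flight * part_size`, so a 4 GB ceiling like the one floated above would correspond to, say, four parts of 1 GiB each. Which default suits typical machines and connections is left open here.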

view this post on Zulip Jan Range (Sep 18 2023 at 05:54):

Btw if you are interested, here is the branch of the next ED release. Once tests are up and running, it'll be shipped.

https://github.com/gdcc/easyDataverse/tree/flexible-connect

view this post on Zulip Philip Durbin 🚀 (Sep 18 2023 at 14:40):

@Leo Andreev check it out! S3 upload in EasyDataverse! ^^ :tada:

view this post on Zulip Andrzej Zemla (Oct 10 2023 at 10:46):

Hi All,

I've uploaded 180 GB files to S3 (multipart of course), with my own script, and the main problems I've noticed are:

I know that 180 GB is extreme, but those problems will still occur for 32 GB+ files. I put it in this thread because direct upload to S3 is in fact the only way to deal with large files.

view this post on Zulip Philip Durbin 🚀 (Oct 10 2023 at 11:07):

@Andrzej Zemla hi! Have you tried https://github.com/gdcc/python-dvuploader by @Jan Range ? It's new! Announced here: https://groups.google.com/g/dataverse-community/c/TQZJOpYmXbU/m/6m27nm4dAQAJ

view this post on Zulip Andrzej Zemla (Oct 10 2023 at 12:09):

No, I didn't, I needed it in July ;), but I'll test it for sure and send you feedback.

view this post on Zulip Jan Range (Oct 10 2023 at 12:53):

@Andrzej Zemla thanks, that would be extremely helpful :blush:


Last updated: Nov 01 2025 at 14:11 UTC