Stream: python

Topic: pyDataverse dependencies vs convenience


Oliver Bertuch (May 22 2025 at 10:09):

From the discussion yesterday in the WG meeting something stuck with me: what should be in the pyDataverse package? I looked a bit into Pythonic ways to speed up transfers, and there are some great libraries out there already.

That made me wonder: should we just go ahead and add nice, convenient wrappers around them? It would make the library depend on many new things. On the other hand, people just looking for an easy way to interact with Dataverse from a script, Jupyter notebook, etc. might actually prefer a convenient and well-documented way, and might not care much about dependencies.

Should we just copy dvuploader into pyDataverse, maybe even without paying much attention to the class tree? One-stop shop vs. more granular, reusable pieces...

Oliver Bertuch (May 22 2025 at 10:12):

To give an example: there's pydl, which seems to be exactly the library we might put in people's hands to download data from an installation fast. But it comes with aiohttp, and we use httpx. Reading up on httpx vs aiohttp suggests that the latter is often much faster in async cases (see https://github.com/encode/httpx/issues/3215). Adding these dependencies means we don't need to code all of this again and can focus on making it available in a nice Dataverse API package, so it "just works" for users. But they need to install quite a few dependencies to get this going...

Oliver Bertuch (May 22 2025 at 10:14):

The way pyDataverse is structured now also makes it hard to enable selective install choices with optional dependencies. :shrugdog:
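
For illustration, one common pattern for optional dependencies is to import them lazily and fail with an install hint. This is only a minimal sketch: the helper `require_optional` and the extra name `fast-download` are hypothetical and not part of pyDataverse today.

```python
import importlib
import importlib.util


def require_optional(module_name: str, extra: str):
    """Import an optional top-level module, or fail with an install hint.

    `extra` is the name of a (hypothetical) optional-dependency group.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is needed for this feature. "
            f"Install it with: pip install 'pyDataverse[{extra}]'"
        )
    return importlib.import_module(module_name)
```

A download helper could then call `require_optional("aiohttp", extra="fast-download")` inside the function body, so that a plain install never has to import aiohttp at all.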

Oliver Bertuch (May 22 2025 at 10:15):

So maybe we should just educate people on how to use these libraries? And maybe only provide a thin layer for retrieving the download URL from the API (to stick with the example here)?
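
As a rough sketch of what such a thin layer might look like: the helpers below only build the URL and the auth header and leave the actual transfer to whatever download library the user prefers. The function names are hypothetical; the sketch assumes the Dataverse Data Access endpoint `/api/access/datafile/{id}` and the `X-Dataverse-key` header.

```python
from typing import Dict, Optional


def datafile_download_url(base_url: str, file_id: int) -> str:
    """Build the Data Access API URL for a file, given its numeric database ID."""
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"


def auth_headers(api_token: Optional[str]) -> Dict[str, str]:
    """Return the API token header, or no headers for anonymous access."""
    return {"X-Dataverse-key": api_token} if api_token else {}


# Example with a placeholder file ID; pass the result to any HTTP client.
url = datafile_download_url("https://demo.dataverse.org/", 42)
```

The resulting `url` and headers can be handed to httpx, aiohttp, or a dedicated download tool without pyDataverse depending on any of them.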

Oliver Bertuch (May 22 2025 at 10:17):

So instead of sticking with one library, maybe we should use an HTTP client facade where people can plug in any Python HTTP client they like best? (Or we provide some of them ourselves, using optional dependencies...)
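
To make the facade idea a bit more concrete, here is a minimal sketch using a structural `Protocol`. All class names (`HttpClient`, `UrllibClient`, `DataverseApi`) are hypothetical and do not reflect the current pyDataverse class tree; the standard-library adapter is just a stand-in so the example has no third-party dependencies.

```python
import json
import urllib.request
from typing import Dict, Mapping, Optional, Protocol


class HttpClient(Protocol):
    """Anything with a matching `get` method can be plugged in."""

    def get(self, url: str, headers: Mapping[str, str]) -> bytes: ...


class UrllibClient:
    """Default adapter built on the standard library only."""

    def get(self, url: str, headers: Mapping[str, str]) -> bytes:
        request = urllib.request.Request(url, headers=dict(headers))
        with urllib.request.urlopen(request) as response:
            return response.read()


class DataverseApi:
    """Talks to Dataverse through whichever client was injected."""

    def __init__(self, base_url: str, client: HttpClient,
                 api_token: Optional[str] = None) -> None:
        self._base_url = base_url.rstrip("/")
        self._client = client
        self._headers: Dict[str, str] = (
            {"X-Dataverse-key": api_token} if api_token else {}
        )

    def get_version(self) -> dict:
        # Calls the installation's /api/info/version endpoint.
        body = self._client.get(f"{self._base_url}/api/info/version", self._headers)
        return json.loads(body)
```

An httpx- or aiohttp-based adapter would implement the same `get` method and could live behind an optional dependency, while `DataverseApi` itself stays unaware of which client is in use. An async variant would need its own protocol with an `async def get`.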

Jan Range (May 22 2025 at 12:10):

I agree that the current structure is relatively flat and challenging to maintain. I have already started a refactor and extracted classes into separate modules in this branch.

I honestly think that having a couple more dependencies is not a big issue. Of course, it puts a higher maintenance burden on us when other libraries move to another major version of a dependency, but this is manageable, in my opinion.

Moving the DVUploader into pyDataverse makes sense and de-clutters the DV Python landscape. I have used aiohttp previously, but moved to httpx because the API design is way nicer to work with. Both of them can coexist, though, and we could use pydl directly for the downloads.

There is also EasyDataverse, which is very Jupyter-friendly and easy to use for setting up datasets and downloads. Maybe we could also merge this one as a higher-level interface. The good thing is that it already uses DVUploader/pyDataverse and would not require much adaptation, except for some dependencies.


Last updated: Nov 01 2025 at 14:11 UTC