Stream: python

Topic: creating datasets from an Excel file


view this post on Zulip Philip Durbin πŸš€ (Jan 14 2025 at 14:32):

Given an Excel file with columns like "dataset title", "description", etc., what's the recommended way to create datasets using Python? pyDataverse? EasyDataverse? Something else?

view this post on Zulip Julian Gautier (Jan 14 2025 at 16:24):

So great you asked this today! I was just talking with the manager of a collection about this, but for editing the metadata of multiple datasets. I pointed them to pyDataverse documentation and the advanced usage page, which mentions using their csv templates. Seems like you can create and edit dataset metadata that way. I've never got it to work, but always figured I didn't know enough about Python :grimacing:

view this post on Zulip Philip Durbin πŸš€ (Jan 14 2025 at 16:25):

In practice I'm using EasyDataverse. I'll push my script to GitHub soon.

view this post on Zulip GwenaΓ«l Doux (Jan 15 2025 at 10:25):

Wouldn't it be interesting to have a (community) repository with examples of concrete cases using python or easyDataverse? For people who aren't necessarily developers but can modify scripts. There could also be an entry somewhere in the documentation, in https://guides.dataverse.org/en/latest/admin/reporting-tools-and-queries.html#reporting-tools-and-common-queries for example.

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 12:44):

Yes, absolutely. Great idea. I'm not sure where these scripts should go.

view this post on Zulip Jan Range (Jan 15 2025 at 12:51):

We could host a repository that includes Jupyter notebooks with examples. The upside is that these can be used pretty nicely within an mkdocs documentation. Did this for my latest Python course (Doc Page)

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 15:53):

@Jan Range we forgot to talk about this today! Where to put scripts, I mean. Next time. :big_smile: (#python > meetings)

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 16:49):

I'm still not quite sure where to put this script but I'll go ahead and attach it here so people can see it: create-datasets-from-excel.py

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 16:50):

This is a slightly redacted version but quite similar to the one we plan to use.

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 16:50):

Note that we plan to fix the license issue mentioned in the script. The discussion is happening here: #python > setting a license with EasyDataverse

view this post on Zulip Jan Range (Jan 15 2025 at 17:47):

We could open a general repository. Maybe something along the lines of "Dataverse Recipes" - Could then be used for Python and other languages.

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 18:36):

Back in the day, Perl had a cookbook: https://en.wikipedia.org/wiki/Perl_Cookbook :cook:

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 21:05):

It looks like https://github.com/ubc-library-rc/dataverse_utils is still being maintained. We link to it from https://guides.dataverse.org/en/6.5/api/client-libraries.html#python

All commits by Paul Lesack, it looks like. I don't think he's on Zulip but here's his GitHub: https://github.com/plesubc

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 21:09):

From the perspective of the person trying to get something with Python, it probably doesn't matter if there's a mix of examples that use pyDataverse, EasyDataverse, requests, the standard library, etc.

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 21:10):

We had an intern suggest that we add even more Python snippets to the API Guide in https://github.com/IQSS/dataverse/issues/4255 but we resisted this. We have been trying to keep the API Guide agnostic when it comes to language. So we document everything with curl.

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 21:11):

@Don Sizemore you also have https://github.com/uncch-rdmc/dataverse-toolbox

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 21:13):

Whether it's that "dataverse_utils" repo or that "dataverse-toolbox" repo, I think it's the right approach, to have a separate repo, rather than trying to maintain these scripts as part of the API Guide.

view this post on Zulip Jan Range (Jan 15 2025 at 21:20):

I agree, a separate repo is a cleaner solution and easier in terms of maintenance. What do you think of the idea of having a documentation on top of it, like I did for the Python course?

Docs - https://jr-1991.github.io/PythonProgrammingBio24/
Repo - https://github.com/JR-1991/PythonProgrammingBio24

Jupyter notebooks integrate quite well into MkDocs and could essentially function as doc and recipe at the same time. Plus, we could add a CI that tests these using the DV Action by running the notebooks. So, when something falls out of sync we know it.

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 21:21):

Sure! Docs on top sounds great!

view this post on Zulip Philip Durbin πŸš€ (Jan 15 2025 at 21:23):

And no objection to CI, of course! Fancy! :unicorn:

view this post on Zulip Philip Durbin πŸš€ (Jan 16 2025 at 19:43):

I created a dedicated topic for this: #python > a place for Python scripts

view this post on Zulip Philip Durbin πŸš€ (Jan 16 2025 at 19:43):

Now we can keep talking about Excel if we like. :big_smile:

view this post on Zulip GwenaΓ«l Doux (Jan 17 2025 at 10:08):

Sorry for the hijacking

view this post on Zulip Philip Durbin πŸš€ (Jan 17 2025 at 13:08):

Ha, no problem. :big_smile:

view this post on Zulip Philip Durbin πŸš€ (Jan 17 2025 at 13:08):

We have a place now! See #python > a place for Python scripts

view this post on Zulip Philip Durbin πŸš€ (Feb 26 2025 at 13:53):

@Jan Range thanks for https://github.com/gdcc/dataverse-recipes/pull/2 ! Merged! Can you (and others, if you like) please look again at https://github.com/gdcc/dataverse-recipes/pull/1 ? Thanks!

view this post on Zulip Philip Durbin πŸš€ (Feb 27 2025 at 15:56):

Heads up that I closed PR#1 in favor of this one: https://github.com/gdcc/dataverse-recipes/pull/4

view this post on Zulip Philip Durbin πŸš€ (Feb 27 2025 at 15:58):

I'm having a little trouble with Excel actually. When I use the original file Sonia gave me, I'm able to extract the hyperlinks.

However, if I edit the file in Excel to edit the hyperlinks (to make them anonymous), the script can no longer extract them.

view this post on Zulip Philip Durbin πŸš€ (Feb 27 2025 at 15:59):

@Jan Range or others, I'm happy to send you the original file if you want to play with it.

view this post on Zulip Jan Range (Feb 27 2025 at 16:16):

Happy to check this :smile:

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:09):

Maybe I should add a comment saying that the hyperlink extraction doesn't work with the sample file. What do you think?

view this post on Zulip Jan Range (Feb 28 2025 at 15:21):

Is the data_example.xlsxcausing the issue?

view this post on Zulip Jan Range (Feb 28 2025 at 15:21):

Tested it outside the script and the extraction is working. Looking through your script now.

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:21):

Oh! Interesting.

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:22):

Please try it with my script. It doesn't work for me. If you can get it working, great!

view this post on Zulip Jan Range (Feb 28 2025 at 15:22):

image.png

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:22):

Or we could even switch to your way.

view this post on Zulip Jan Range (Feb 28 2025 at 15:22):

Yes, will do. Wanted to sort out the openpyxl stuff beforehand

view this post on Zulip Jan Range (Feb 28 2025 at 15:23):

Got it!

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:23):

pandas is probably more standard anyway :shrug:

view this post on Zulip Jan Range (Feb 28 2025 at 15:24):

The issue lies in the following:

if values[3].hyperlink:
        access_link_url = values[3].hyperlink.display # <-- Extracts the shown text

It should work by using

if values[3].hyperlink:
        access_link_url = values[3].hyperlink.target # <-- Extracts the URL

view this post on Zulip Jan Range (Feb 28 2025 at 15:24):

That's a good primer for my fresh local DV 6.5 now :stuck_out_tongue:

view this post on Zulip Jan Range (Feb 28 2025 at 15:25):

Giving it a test on it

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:31):

Go, go, go!

view this post on Zulip Jan Range (Feb 28 2025 at 15:35):

Okay, it works now, but my guess is that it is just some oddity of openpyxl

view this post on Zulip Jan Range (Feb 28 2025 at 15:35):

image.png

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:35):

you got it working? with that change above?

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:36):

display vs target?

view this post on Zulip Jan Range (Feb 28 2025 at 15:37):

The boolean check if value[3].hyperlink: is what has caused the condition to never be true. Hence, there was always None. Being explicit by using value[3].hyperlink is not None works though.

view this post on Zulip Jan Range (Feb 28 2025 at 15:37):

But in my Jupyter notebook using the exact same way does not cause these issues :shaking_face:

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:38):

But why would it work with the original Excel file and not the one I created from scratch?

view this post on Zulip Jan Range (Feb 28 2025 at 15:39):

I dont know. Strangely, I just fumbled around and reverted the explicit check and now it works too. I am confused :-D

view this post on Zulip Jan Range (Feb 28 2025 at 15:40):

Feels like rolling dice :grinning:

view this post on Zulip Jan Range (Feb 28 2025 at 15:40):

But yea, using target extracts the link as expected

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:41):

Yeah, seems like a good fix. I tested with the original file and my sample. Pushed! Thanks! https://github.com/gdcc/dataverse-recipes/pull/4/commits/9363ce7d705e490abb5cae0d7a9f8d1b0e977fdf

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:43):

Anything else to fix before we merge it?

view this post on Zulip Jan Range (Feb 28 2025 at 15:43):

Runs fine on my machine, nothing to fix from my side :smile:

view this post on Zulip Philip Durbin πŸš€ (Feb 28 2025 at 15:44):

Merged!


Last updated: Nov 01 2025 at 14:11 UTC