creating datasets from an Excel file · python

Stream: python

Topic: creating datasets from an Excel file

Philip Durbin 🚀 (Jan 14 2025 at 14:32):

Given an Excel file with columns like "dataset title", "description", etc., what's the recommended way to create datasets using Python? pyDataverse? EasyDataverse? Something else?

Julian Gautier (Jan 14 2025 at 16:24):

So great you asked this today! I was just talking with the manager of a collection about this, but for editing the metadata of multiple datasets. I pointed them to pyDataverse documentation and the advanced usage page, which mentions using their csv templates. Seems like you can create and edit dataset metadata that way. I've never got it to work, but always figured I didn't know enough about Python :grimacing:

Philip Durbin 🚀 (Jan 14 2025 at 16:25):

In practice I'm using EasyDataverse. I'll push my script to GitHub soon.

Gwenaël Doux (Jan 15 2025 at 10:25):

Wouldn't it be interesting to have a (community) repository with examples of concrete cases using python or easyDataverse? For people who aren't necessarily developers but can modify scripts. There could also be an entry somewhere in the documentation, in https://guides.dataverse.org/en/latest/admin/reporting-tools-and-queries.html#reporting-tools-and-common-queries for example.

Philip Durbin 🚀 (Jan 15 2025 at 12:44):

Yes, absolutely. Great idea. I'm not sure where these scripts should go.

Jan Range (Jan 15 2025 at 12:51):

We could host a repository that includes Jupyter notebooks with examples. The upside is that these can be used pretty nicely within an mkdocs documentation. Did this for my latest Python course (Doc Page)

Philip Durbin 🚀 (Jan 15 2025 at 15:53):

@Jan Range we forgot to talk about this today! Where to put scripts, I mean. Next time. :big_smile: (#python > meetings)

Philip Durbin 🚀 (Jan 15 2025 at 16:49):

I'm still not quite sure where to put this script but I'll go ahead and attach it here so people can see it: create-datasets-from-excel.py

Philip Durbin 🚀 (Jan 15 2025 at 16:50):

This is a slightly redacted version but quite similar to the one we plan to use.

Philip Durbin 🚀 (Jan 15 2025 at 16:50):

Note that we plan to fix the license issue mentioned in the script. The discussion is happening here: #python > setting a license with EasyDataverse

Jan Range (Jan 15 2025 at 17:47):

We could open a general repository. Maybe something along the lines of "Dataverse Recipes" - Could then be used for Python and other languages.

Philip Durbin 🚀 (Jan 15 2025 at 18:36):

Back in the day, Perl had a cookbook: https://en.wikipedia.org/wiki/Perl_Cookbook :cook:

Philip Durbin 🚀 (Jan 15 2025 at 21:05):

It looks like https://github.com/ubc-library-rc/dataverse_utils is still being maintained. We link to it from https://guides.dataverse.org/en/6.5/api/client-libraries.html#python

All commits by Paul Lesack, it looks like. I don't think he's on Zulip but here's his GitHub: https://github.com/plesubc

Philip Durbin 🚀 (Jan 15 2025 at 21:09):

From the perspective of the person trying to get something with Python, it probably doesn't matter if there's a mix of examples that use pyDataverse, EasyDataverse, requests, the standard library, etc.

Philip Durbin 🚀 (Jan 15 2025 at 21:10):

We had an intern suggest that we add even more Python snippets to the API Guide in https://github.com/IQSS/dataverse/issues/4255 but we resisted this. We have been trying to keep the API Guide agnostic when it comes to language. So we document everything with curl.

Philip Durbin 🚀 (Jan 15 2025 at 21:11):

@Don Sizemore you also have https://github.com/uncch-rdmc/dataverse-toolbox

Philip Durbin 🚀 (Jan 15 2025 at 21:13):

Whether it's that "dataverse_utils" repo or that "dataverse-toolbox" repo, I think it's the right approach, to have a separate repo, rather than trying to maintain these scripts as part of the API Guide.

Jan Range (Jan 15 2025 at 21:20):

I agree, a separate repo is a cleaner solution and easier in terms of maintenance. What do you think of the idea of having a documentation on top of it, like I did for the Python course?

Docs - https://jr-1991.github.io/PythonProgrammingBio24/
Repo - https://github.com/JR-1991/PythonProgrammingBio24

Jupyter notebooks integrate quite well into MkDocs and could essentially function as doc and recipe at the same time. Plus, we could add a CI that tests these using the DV Action by running the notebooks. So, when something falls out of sync we know it.

Philip Durbin 🚀 (Jan 15 2025 at 21:21):

Sure! Docs on top sounds great!

Philip Durbin 🚀 (Jan 15 2025 at 21:23):

And no objection to CI, of course! Fancy! :unicorn:

Philip Durbin 🚀 (Jan 16 2025 at 19:43):

I created a dedicated topic for this: #python > a place for Python scripts

Philip Durbin 🚀 (Jan 16 2025 at 19:43):

Now we can keep talking about Excel if we like. :big_smile:

Gwenaël Doux (Jan 17 2025 at 10:08):

Sorry for the hijacking

Philip Durbin 🚀 (Jan 17 2025 at 13:08):

Ha, no problem. :big_smile:

Philip Durbin 🚀 (Jan 17 2025 at 13:08):

We have a place now! See #python > a place for Python scripts

Philip Durbin 🚀 (Feb 26 2025 at 13:53):

@Jan Range thanks for https://github.com/gdcc/dataverse-recipes/pull/2 ! Merged! Can you (and others, if you like) please look again at https://github.com/gdcc/dataverse-recipes/pull/1 ? Thanks!

Philip Durbin 🚀 (Feb 27 2025 at 15:56):

Heads up that I closed PR#1 in favor of this one: https://github.com/gdcc/dataverse-recipes/pull/4

Philip Durbin 🚀 (Feb 27 2025 at 15:58):

I'm having a little trouble with Excel actually. When I use the original file Sonia gave me, I'm able to extract the hyperlinks.

However, if I edit the file in Excel to edit the hyperlinks (to make them anonymous), the script can no longer extract them.

Philip Durbin 🚀 (Feb 27 2025 at 15:59):

@Jan Range or others, I'm happy to send you the original file if you want to play with it.

Jan Range (Feb 27 2025 at 16:16):

Happy to check this :smile:

Philip Durbin 🚀 (Feb 28 2025 at 15:09):

Maybe I should add a comment saying that the hyperlink extraction doesn't work with the sample file. What do you think?

Jan Range (Feb 28 2025 at 15:21):

Is the data_example.xlsxcausing the issue?

Jan Range (Feb 28 2025 at 15:21):

Tested it outside the script and the extraction is working. Looking through your script now.

Philip Durbin 🚀 (Feb 28 2025 at 15:21):

Oh! Interesting.

Philip Durbin 🚀 (Feb 28 2025 at 15:22):

Please try it with my script. It doesn't work for me. If you can get it working, great!

Jan Range (Feb 28 2025 at 15:22):

image.png

Philip Durbin 🚀 (Feb 28 2025 at 15:22):

Or we could even switch to your way.

Jan Range (Feb 28 2025 at 15:22):

Yes, will do. Wanted to sort out the openpyxl stuff beforehand

Jan Range (Feb 28 2025 at 15:23):

Got it!

Philip Durbin 🚀 (Feb 28 2025 at 15:23):

pandas is probably more standard anyway :shrug:

Jan Range (Feb 28 2025 at 15:24):

The issue lies in the following:

if values[3].hyperlink:
        access_link_url = values[3].hyperlink.display # <-- Extracts the shown text

It should work by using

if values[3].hyperlink:
        access_link_url = values[3].hyperlink.target # <-- Extracts the URL

Jan Range (Feb 28 2025 at 15:24):

That's a good primer for my fresh local DV 6.5 now :stuck_out_tongue:

Jan Range (Feb 28 2025 at 15:25):

Giving it a test on it

Philip Durbin 🚀 (Feb 28 2025 at 15:31):

Go, go, go!

Jan Range (Feb 28 2025 at 15:35):

Okay, it works now, but my guess is that it is just some oddity of openpyxl

Jan Range (Feb 28 2025 at 15:35):

image.png

Philip Durbin 🚀 (Feb 28 2025 at 15:35):

you got it working? with that change above?

Philip Durbin 🚀 (Feb 28 2025 at 15:36):

display vs target?

Jan Range (Feb 28 2025 at 15:37):

The boolean check if value[3].hyperlink: is what has caused the condition to never be true. Hence, there was always None. Being explicit by using value[3].hyperlink is not None works though.

Jan Range (Feb 28 2025 at 15:37):

But in my Jupyter notebook using the exact same way does not cause these issues :shaking_face:

Philip Durbin 🚀 (Feb 28 2025 at 15:38):

But why would it work with the original Excel file and not the one I created from scratch?

Jan Range (Feb 28 2025 at 15:39):

I dont know. Strangely, I just fumbled around and reverted the explicit check and now it works too. I am confused :-D

Jan Range (Feb 28 2025 at 15:40):

Feels like rolling dice :grinning:

Jan Range (Feb 28 2025 at 15:40):

But yea, using target extracts the link as expected

Philip Durbin 🚀 (Feb 28 2025 at 15:41):

Yeah, seems like a good fix. I tested with the original file and my sample. Pushed! Thanks! https://github.com/gdcc/dataverse-recipes/pull/4/commits/9363ce7d705e490abb5cae0d7a9f8d1b0e977fdf

Philip Durbin 🚀 (Feb 28 2025 at 15:43):

Anything else to fix before we merge it?

Jan Range (Feb 28 2025 at 15:43):

Runs fine on my machine, nothing to fix from my side :smile:

Philip Durbin 🚀 (Feb 28 2025 at 15:44):

Merged!

Last updated: Jan 09 2026 at 14:18 UTC