Stream: python

Topic: auth options


view this post on Zulip Oliver Bertuch (Jul 25 2024 at 20:27):

Just today I've been looking into options how to interact with our competitor, InvenioRDM, when you don't have an API Token yet and want to support users to make the process to get one as simple as possible.

There is already an API endpoint to receive a token for a user, but it's not very usable at the moment.

IMHO a nice feature for pyDataverse would be an option to enable an auth flow. Wouldn't it be nice if some users run a Python script with pyDataverse to open a browser window for them, make them login with their usual credentials and afterwards continue working with the API, all without asking them to create and provide an API token first? (Or even more complicated, a signed URL)

Thoughts anyone?

view this post on Zulip Philip Durbin 🚀 (Jul 25 2024 at 20:41):

That would be killer. :rock_on:

view this post on Zulip Sebastian Höffner (Jul 29 2024 at 08:12):

Getting the API key is difficult because it's behind some JavaScript, so that's not easily scriptable, although it might be possible. I just tried it for like 5 minutes or so.

Opening the browser is possible, but transferring something out of the browser without a cooperating website is difficult, as you would need to allow and trigger a callback URL like http://localhost:9473/login-callback, so this would require implementation efforts on Dataverse's side, unless they already support a similar callback for OIDC one could re-use.

However, in general Dataverse seems to support this kind of authentication (and pyDataverse kind of as well in https://github.com/gdcc/pyDataverse/pull/201, as that adds support for Bearer tokens), although it expects authentication to happen out-of-band: https://guides.dataverse.org/en/latest/developers/remote-users.html . So if you use Shibboleth, Keycloak, or another OIDC provider, you can handle this use case by first logging in (potentially with any OIDC Python library) and using that token in BearerTokenAuth(...) in #201.
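
For illustration, here is a minimal standard-library sketch of what "using that token" amounts to on the wire: the access token obtained out-of-band is sent in an `Authorization: Bearer` header. It does not use the `BearerTokenAuth` class from #201; the helper name `bearer_request`, the URL and the token value are assumptions made up for this example.

```python
import urllib.request


def bearer_request(url: str, access_token: str) -> urllib.request.Request:
    """Build a GET request that carries an OIDC access token as a Bearer token."""
    request = urllib.request.Request(url, method="GET")
    # RFC 6750: the access token travels in the Authorization header.
    request.add_header("Authorization", f"Bearer {access_token}")
    return request


# Hypothetical values, for illustration only. Nothing is sent over the network here;
# pass the request to urllib.request.urlopen() to actually call the API.
req = bearer_request("http://localhost:8080/api/users/:me", "example-access-token")
```

The Dataverse installation has to be configured to accept such tokens, as described in the remote-users guide linked above.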

In general, all this should be considered as part of Phase 3 in https://py.gdcc.io/ , especially when integrating DVCLI.

For Signed URLs, I started a longer discussion here: https://github.com/gdcc/pyDataverse/issues/200, and I am not sure those will solve the problem at hand as their use case is somewhat different.

view this post on Zulip Philip Durbin 🚀 (Jul 29 2024 at 13:50):

I appreciate you digging into this. I'm afraid I don't have any bright ideas on how to move forward. Maybe we can brainstorm a bit at a future pyDataverse meeting.

view this post on Zulip Philip Durbin 🚀 (Jul 29 2024 at 14:09):

Oh, wow, you even made diagrams: https://github.com/gdcc/pyDataverse/issues/200#issuecomment-2254517953

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:05):

Maybe it's fine to make this auth thing require people to use an OIDC provider. That will be a necessity as soon as the SPA is around anyway.

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:06):

WRT receiving an API token for a user, there are some not very well documented endpoints available.

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:06):

For instance one to recreate your token https://github.com/IQSS/dataverse/blob/c39ac8843738ebf3e48be17370b2a35f49432226/src/main/java/edu/harvard/iq/dataverse/api/Users.java#L160

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:06):

It doesn't return a nice JSON response, but we could change that

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:07):

So folks could trade an OIDC access token for an API token.

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:11):

Using a shortlived localhost server started by pyDataverse, this should be fairly simple to achieve.

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:12):

Around the SPA there has been discussion about making Dataverse an OAuth2/OIDC identity provider, too. Builtin users, migrations and such things would potentially be a lot easier that way.

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:13):

That obviously would require many more implementation changes in DV... Which is why one of the ideas is to ask people to use Keycloak and add an HTTP Basic Auth against Dataverse to it.

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:46):

Dang, with this tutorial https://www.baeldung.com/java-ee-oauth2-implementation it doesn't seem so complicated to make Dataverse an Authorization Server itself.

view this post on Zulip Oliver Bertuch (Jul 29 2024 at 15:47):

"Login with Dataverse" how's that sound @Philip Durbin

view this post on Zulip Philip Durbin 🚀 (Jul 29 2024 at 16:00):

Sounds nice :grinning:

view this post on Zulip Sebastian Höffner (Jul 29 2024 at 21:41):

I cannot comment on identity management, so I'll instead focus on the token auth.

Recreating a token might solve the issue for testing or single app access. However, since Dataverse only allows a single API token (and this one will be rotated with the recreation request), this will cause problems if you use your API token in multiple services. I didn't find another API endpoint to actually retrieve a (new) token.
I was about to try it though with the recreate (which would make the (local) test setup for pyDataverse much easier) and found another edge-case we should support in pyDataverse: Currently demo.dataverse.org is in maintenance mode and returns a 200 status code with a big html for every API call... I think this should maybe be a 503 status code, but well. I stored the html locally and will see if we can make that work in the error handling.
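
A rough sketch of how such a maintenance page could be caught in error handling, assuming the endpoint in question normally answers with JSON: check the declared content type before trusting a 200 status. The names `UnexpectedResponseError` and `parse_api_response` are hypothetical and not part of pyDataverse.

```python
import json


class UnexpectedResponseError(Exception):
    """Raised when an API call does not return the JSON we expected."""


def parse_api_response(status: int, content_type: str, body: str):
    """Return the parsed JSON body, or raise if the server sent something else.

    A 200 status alone is not trusted: a maintenance page can also come back as 200.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/json":
        raise UnexpectedResponseError(
            f"expected JSON but got {media_type or 'no content type'} (HTTP {status})"
        )
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        raise UnexpectedResponseError(f"invalid JSON body (HTTP {status})") from exc
```

This only fits endpoints that are supposed to return JSON; file downloads and similar calls would need a different check.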

Do you know how to call the users/token/recreate path? In the browser it's some very tricky POST call to the http://localhost:8080/dataverseuser.xhtml which only works because at that point I already have a session cookie.
I tried it with POST to /users/token/recreate, /api/users/token/recreate, /api/v1/users/token/recreate and directly to dataverseuser.xhtml similar to what the web UI does. I tried the following form data:

loginForm:credentialsContainer:0:credValue = username
loginForm:credentialsContainer:1:sCredValue = password

but it didn't work. So I guess I need to somehow perform a login, retrieve the cookie and can then interact with the token?

view this post on Zulip Sebastian Höffner (Jul 29 2024 at 21:51):

@Philip Durbin re the diagrams, note that those are solely related to signed URLs, which I originally understood should work like variant 2 which seems... wrong. They seem to be intended more in line with variants 1 and 3 there.

view this post on Zulip Philip Durbin 🚀 (Jul 30 2024 at 13:47):

@Sebastian Höffner you're asking how to recreate a token via API? Please see https://guides.dataverse.org/en/6.3/api/native-api.html#recreate-a-token
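
As a sketch of that call from Python, assuming the guide section describes a POST to `/api/users/token/recreate` authenticated with the current token in the `X-Dataverse-key` header (please double-check against the linked page). The helper name `recreate_token_request` and the example values are made up for illustration.

```python
import urllib.request


def recreate_token_request(server_url: str, api_token: str) -> urllib.request.Request:
    """Build the POST request that asks Dataverse to recreate the caller's API token.

    Note: the caller must already hold a valid API token, and that token is
    replaced by the new one, so other services using it will stop working.
    """
    url = server_url.rstrip("/") + "/api/users/token/recreate"
    request = urllib.request.Request(url, data=b"", method="POST")
    request.add_header("X-Dataverse-key", api_token)
    return request


# Hypothetical values; nothing is sent until the request is passed to urlopen().
req = recreate_token_request("http://localhost:8080/", "current-api-token")
```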

view this post on Zulip Sebastian Höffner (Jul 30 2024 at 19:19):

Thanks, I was more asking how to bootstrap a token or retrieve it via the API, because recreating it with one app might break others.
In this case, one already needs to know the API token to authenticate, but I am wondering how to get an API token without manually logging in. Although it is probably the safer way to simply not allow that, otherwise other services might want to grab your credentials to retrieve a token.

view this post on Zulip Oliver Bertuch (Jul 30 2024 at 19:23):

I'm pretty sure we can create new endpoints for this kind of thing.

view this post on Zulip Oliver Bertuch (Jul 30 2024 at 19:24):

To ensure safety, we can add filters so it would for example require logging in via bearer token.

view this post on Zulip Oliver Bertuch (Jul 30 2024 at 19:24):

Another thought I had: it would probably help if we can give people collection based access tokens and not just PATs

view this post on Zulip Sebastian Höffner (Jul 30 2024 at 19:45):

So many options :-)
I don't think it's really necessary right now.

But I think since Dataverse supports various OIDC sources, we could at least make the auth flow happen somehow. I'll read up a little bit on that to see how it might go.

view this post on Zulip Philip Durbin 🚀 (Jul 30 2024 at 19:53):

Are there other apps that do this well? DataLad or whatever app?

view this post on Zulip Sebastian Höffner (Jul 30 2024 at 20:25):

I know that Vault and Nomad have such CLI login flows, but I haven't had a closer look at how they are implemented – I just know that I had to configure a localhost:... callback URL to make it work (https://github.com/hashicorp/nomad/blob/main/command/login.go), and it works really well: you type nomad login, it opens the browser, you do your OIDC login, it performs a callback to localhost, you have a token. They also support the other way: nomad ui -authenticate will open the browser and pass a token to it, if you happen to have one on the CLI.

view this post on Zulip Oliver Bertuch (Jul 30 2024 at 20:28):

The most prominent example I know is Kubernetes, using kubectl with OIDC login

view this post on Zulip Oliver Bertuch (Jul 30 2024 at 20:29):

All you need is a local, shortlived webserver you redirect to. That way you get the auth code flow going
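
A minimal sketch of such a short-lived local server using only the standard library, assuming the identity provider redirects the browser to a localhost URL with `code` and `state` query parameters (the usual authorization code flow). The class and function names are hypothetical, and exchanging the code for tokens is left out.

```python
import urllib.parse
from http.server import BaseHTTPRequestHandler, HTTPServer


class CallbackServer(HTTPServer):
    """Localhost server that remembers the parameters of one callback request."""

    auth_code = None
    state = None


class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Pull ?code=...&state=... out of the redirect the browser follows.
        query = urllib.parse.urlparse(self.path).query
        params = urllib.parse.parse_qs(query)
        self.server.auth_code = params.get("code", [None])[0]
        self.server.state = params.get("state", [None])[0]
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"Login finished, you can close this window.")

    def log_message(self, format, *args):
        pass  # keep the terminal quiet


def wait_for_code(server: CallbackServer, expected_state: str) -> str:
    """Handle exactly one request, then return the authorization code."""
    server.handle_request()  # blocks until the browser hits the callback URL
    if server.state != expected_state:
        raise ValueError("state mismatch, refusing the callback")
    if server.auth_code is None:
        raise ValueError("callback did not contain an authorization code")
    return server.auth_code


# Usage idea (not executed here): bind to a fixed port that is registered as
# redirect URI, e.g. CallbackServer(("127.0.0.1", 9473), CallbackHandler),
# open the login page with webbrowser.open(...), then call wait_for_code(...).
```

Handling only a single request keeps the server short-lived, but a stray browser request (a favicon lookup, for example) arriving first would make this sketch fail; a real implementation would want to be more forgiving.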

view this post on Zulip Philip Durbin 🚀 (Jul 30 2024 at 20:29):

I haven't tried Zulip Terminal but I wonder how auth works for it.

view this post on Zulip Oliver Bertuch (Jul 30 2024 at 20:29):

Another option is the device flow, but it is less commonly supported by OIDC/OAuth IdPs
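
For comparison, the polling half of the device flow (RFC 8628) could look roughly like the sketch below. It assumes the device code has already been requested and shown to the user; `post_form` is a hypothetical callable that sends a form-encoded POST and returns the decoded JSON response as a dict, so no particular HTTP library is implied.

```python
import time

DEVICE_GRANT = "urn:ietf:params:oauth:grant-type:device_code"


def poll_for_token(post_form, token_url, client_id, device_code,
                   interval=5, sleep=time.sleep, max_attempts=60):
    """Poll the token endpoint until the user has approved the login in a browser."""
    for _ in range(max_attempts):
        response = post_form(token_url, {
            "grant_type": DEVICE_GRANT,
            "device_code": device_code,
            "client_id": client_id,
        })
        if "access_token" in response:
            return response
        error = response.get("error")
        if error == "authorization_pending":
            pass  # user has not finished yet, keep the current interval
        elif error == "slow_down":
            interval += 5  # RFC 8628: back off by five seconds
        else:
            raise RuntimeError(f"device flow failed: {error}")
        sleep(interval)
    raise TimeoutError("user did not complete the login in time")
```

No local web server is needed in this variant, which is what makes it attractive for remote shells and notebooks.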

view this post on Zulip Oliver Bertuch (Jul 30 2024 at 20:31):

Pydv should also take care of caching tokens and refreshing them

view this post on Zulip Philip Durbin 🚀 (Jul 30 2024 at 20:31):

Oh. "NOTE: If you use Google, Github or another external authentication to access your Zulip organization then you likely won't have a password set and currently need to create one to use zulip-terminal." -- https://github.com/zulip/zulip-terminal#running-for-the-first-time

view this post on Zulip Sebastian Höffner (Jul 30 2024 at 20:38):

This shouldn't be the case for nomad and k8s though, so I guess with the shortlived local server we are probably good to go.

view this post on Zulip Sebastian Höffner (Jul 30 2024 at 20:39):

I'll create an issue to track this and link to this thread for some details.

view this post on Zulip Sebastian Höffner (Jul 30 2024 at 20:59):

https://github.com/gdcc/pyDataverse/issues/209

view this post on Zulip Philip Durbin 🚀 (Jul 30 2024 at 21:02):

Awesome, thanks

view this post on Zulip Jan Range (Jul 31 2024 at 08:24):

I have to admit I have no experience with OIDC/OAuth yet, but I think this is a nice feature! I am happy to support you wherever possible :smile:

view this post on Zulip Jan Range (Jul 31 2024 at 09:08):

On a different note, I’ve been experimenting with the keyring crate in Rust for the rust-dataverse library. This crate allows users to securely store credentials (URL and token) under an alias in the OS’s dedicated secure store. When these credentials are used within the CLI, access must be granted, with the option to permanently allow it for convenience.

While it's not exactly the same as having an online login, it has made my workflow more convenient by eliminating the need to constantly copy the token and URL into my environment. Perhaps there's a similar solution in Python that could offer the same level of convenience.
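
On the Python side, the third-party `keyring` package offers a comparable interface (`set_password` / `get_password`). The sketch below stores URL and token together under an alias; the service name, the helper names and the JSON layout are assumptions for this example, not an existing pyDataverse feature. The `backend` parameter only exists so the helpers can be tried without touching the real OS store.

```python
import json

SERVICE_NAME = "pydataverse-example"  # hypothetical service name


def _backend(backend):
    if backend is None:
        import keyring  # third-party package: pip install keyring
        return keyring
    return backend


def store_credentials(alias: str, url: str, token: str, backend=None) -> None:
    """Save URL and token under an alias in the OS credential store."""
    payload = json.dumps({"url": url, "token": token})
    _backend(backend).set_password(SERVICE_NAME, alias, payload)


def load_credentials(alias: str, backend=None):
    """Return {'url': ..., 'token': ...} for the alias, or None if nothing is stored."""
    payload = _backend(backend).get_password(SERVICE_NAME, alias)
    return None if payload is None else json.loads(payload)
```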

view this post on Zulip Oliver Bertuch (Jul 31 2024 at 15:15):

The OIDC/OAuth thing is mostly making it much more convenient to retrieve some token for further use. Either caching the access and refresh tokens to interact with the API or retrieve a longer lasting PAT. That could, as you said, be stored in some secure storage integrated with the OS.

view this post on Zulip Oliver Bertuch (Jul 31 2024 at 15:20):

Here's also an example of combining OIDC tokens with shortlived API tokens: https://docs.pypi.org/trusted-publishers

view this post on Zulip Jan Range (Jul 31 2024 at 15:36):

Thanks, that's fancy!

view this post on Zulip Oliver Bertuch (Aug 06 2024 at 08:36):

I just learned that Zenodo is an OAuth2 Authorization Server! You can even add OAuth applications as a user :smile: Dataverse should certainly have the same functionality :see_no_evil:

view this post on Zulip Sebastian Höffner (Aug 06 2024 at 21:38):

I checked out the OIDC stuff but I wasn't able to spin it up properly without modifying the /etc/hosts file (see https://guides.dataverse.org/en/latest/developers/remote-users.html#openid-connect-oidc).
This makes it tricky to actually write some tests, so I'm gonna have to think about this a little more. Maybe I can configure keycloak in a different way than what the repo does (the config is not linked in the docs but referenced, it's located at https://github.com/IQSS/dataverse/tree/develop/conf/keycloak).

view this post on Zulip Oliver Bertuch (Aug 06 2024 at 21:39):

At some point we should add this to the dataverse-action, so it at least is easy to test within CI

view this post on Zulip Jan Range (Aug 07 2024 at 06:23):

@Oliver Bertuch we could have a small hackathon and implement the localstack/minio services too. Would be beneficial for testing the direct S3 upload.

view this post on Zulip Jan Range (Aug 21 2024 at 23:37):

It clicked after today's PyWG meeting and a deeper dive into OIDC. I took the server idea from @Sebastian Höffner to Python and tested the auth flow using the httpx.Auth base class. Works just fine, although it is very much hard-coded to work with the local keycloak service. Maybe we can use this as a starter to work toward a general solution.

oidc-httpx-flow.mov

One thing I am still puzzled with is how one should know the client_secret & client_id in advance. I am not very experienced with this type of auth flow, but I am sure there are clever ways to do this or work around.

view this post on Zulip Jan Range (Aug 22 2024 at 08:03):

Here are some additional thoughts:

Would it make more sense for the callback and bearer token retrieval to be handled server-side?

Given that Dataverse already has access to the Auth Provider’s ID and Secret, it could manage this process instead of pyDataverse. In this setup, pyDataverse would initiate the authentication flow, manage the web browser opening for user authentication, and then receive the token directly from Dataverse. To test this approach, we could consider extending the Docker Compose file with a small sidekick API using Flask (Python) or Rocket (Rust) for now instead of extending the Dataverse API.

Additionally, I believe this workflow could eliminate the need for local /etc/hosts modifications, as the sidekick server is already within the Docker network, making testing more straightforward.

If this has already been implemented elsewhere or if this was the plan already, feel free to disregard— I'm just learning as I go :grinning_face_with_smiling_eyes:

view this post on Zulip Lincoln (Aug 22 2024 at 08:11):

Regarding client_secret and client_id: you need to ask the OIDC provider for them,
and they will register your return URL (the URL users land on after logging in) on their side.

You can try with Helmholtz AAI, it's pretty straightforward.
[Although I'm completely unaware of what the earlier conversation in this chat was.]

view this post on Zulip Jan Range (Aug 22 2024 at 08:17):

Thanks @Lincoln I will look into this :raised_hands:

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:19):

It reads like there are some mixups of concepts and tech here...

view this post on Zulip Lincoln (Aug 22 2024 at 08:19):

Just remembered: in the Helmholtz AAI portal you can actually register your return URL yourself (it's customizable),

but somehow for me the flask post response was not working

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:21):

Probably someone trying to use pyDataverse as an OIDC client and interacting with Dataverse's API using an access token should use a public client. Then no secret is necessary.

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:22):

These clients should always be different from the client credentials a Dataverse installation uses.

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:25):

For pyDataverse usually acting as a CLI client, there are two ways to retrieve an access token. Either make pyDataverse run a simple localhost server that you send a browser window to - or - use the device auth flow.

view this post on Zulip Lincoln (Aug 22 2024 at 08:27):

If an access token from an OIDC provider could be used as the Dataverse access token...
That would be really cool.
However, access tokens from an OIDC provider are (by default) only short-lived for security reasons.

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:28):

Here's a work in progress using Github OAuth2 and a minimal local server for Hermes init purposes: https://github.com/softwarepub/hermes/blob/feature/init-command/src/hermes/commands/init/oauth_github.py Just as an example of what this could look like

view this post on Zulip Jan Range (Aug 22 2024 at 08:29):

@Oliver Bertuch Yes, I did that in the example provided, but to make it work, I had to hard-code the ID and secret into the authentication flow at pyDataverse. However, this approach isn’t sustainable, so I was looking for alternative solutions. Apologies for the confusion—I’m still in the early stages of learning OIDC.

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:30):

Lincoln said:

if accesstoken from an OIDC provider is merged with /used as Dataverse access token..

This is already available as a feature, hidden behind a feature flag for now as experimental.

token from an OIDC provider (by default) are only short lived due to security reasons

True. But: when authenticating with the provider, you also receive a refresh token. That one is usually longer lived and can be used to get a new access token after it has expired.
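
A small sketch of how a client could cache the access token and fall back to the refresh token once it is about to expire. It assumes the token response contains `access_token`, `expires_in` and optionally `refresh_token`; `TokenCache` and the injected `refresh` callable (which would POST `grant_type=refresh_token` to the provider) are hypothetical names.

```python
import time


class TokenCache:
    """Keeps the current access token and renews it shortly before it expires."""

    def __init__(self, refresh, clock=time.monotonic, leeway=30):
        self._refresh = refresh      # callable(refresh_token) -> token response dict
        self._clock = clock
        self._leeway = leeway        # seconds before expiry at which we already renew
        self._access_token = None
        self._refresh_token = None
        self._expires_at = 0.0

    def store(self, response: dict) -> None:
        """Remember a token response from the provider."""
        self._access_token = response["access_token"]
        # Providers may or may not rotate the refresh token; keep the old one if absent.
        self._refresh_token = response.get("refresh_token", self._refresh_token)
        self._expires_at = self._clock() + response["expires_in"]

    def access_token(self) -> str:
        """Return a usable access token, refreshing it first if necessary."""
        if self._access_token is None:
            raise RuntimeError("no token stored yet, log in first")
        if self._clock() >= self._expires_at - self._leeway:
            if self._refresh_token is None:
                raise RuntimeError("access token expired and no refresh token available")
            self.store(self._refresh(self._refresh_token))
        return self._access_token
```

Once the refresh token itself has expired, the only way forward is a fresh interactive login.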

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:36):

hard-code the ID and secret into the authentication flow

With a public client you can at least omit the secret, just need the client ID. There's not really a good way around that one.
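
To make the public-client idea concrete, here is a sketch of building the authorization URL with only a client ID. It additionally uses PKCE (RFC 7636), which is not mentioned in this thread but is the common way to protect the code exchange when no client secret exists. The function names are hypothetical, and the `openid` scope is an assumption about what the provider expects.

```python
import base64
import hashlib
import secrets
import urllib.parse


def make_code_verifier() -> str:
    """Random PKCE verifier (RFC 7636 allows 43 to 128 characters)."""
    return secrets.token_urlsafe(64)


def code_challenge(verifier: str) -> str:
    """S256 challenge: base64url(sha256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def authorization_url(authorize_endpoint, client_id, redirect_uri, state, verifier):
    """URL to open in the user's browser; note that no client secret is involved."""
    query = urllib.parse.urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "openid",
        "state": state,
        "code_challenge": code_challenge(verifier),
        "code_challenge_method": "S256",
    })
    return f"{authorize_endpoint}?{query}"
```

The verifier stays on the client and is sent along later when the authorization code is exchanged for tokens.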

However, this approach isn’t sustainable, so I was looking for alternative solutions.

Which is why I was suggesting making Dataverse an OAuth2 identity provider. It's probably a lot easier to make integrations between Dataverse and pyDataverse happen than some workaround how to get the OIDC provider going. We could create a discovery endpoint in Dataverse to retrieve a config that pyDataverse or others can work with.

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:37):

With OIDC there exist mechanisms to register a client dynamically. But as far as I know, these are not very widespread in academia.

view this post on Zulip Jan Range (Aug 22 2024 at 08:42):

I agree, that would simplify the process by a lot. But I guess that's a lot of work to implement upstream, or am I wrong?

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:44):

It's nothing done in 5 minutes, no. But it would solve a lot of problems and is also very relevant to the SPA work (builtin users compatibility). I don't know if there is an issue already, but it's certainly worth opening one.

view this post on Zulip Jan Range (Aug 22 2024 at 08:53):

Should we consider postponing the OIDC feature for pyDataverse? It seems that any current solution either requires users to have access to sensitive information or is more cumbersome than simply using an API Token. We could include the server you mentioned as a service within the compose setup, but I’m concerned that this might be a limited solution, as other installations are unlikely to have this server available if it is not part of the Dataverse instance itself.

view this post on Zulip Oliver Bertuch (Aug 22 2024 at 08:55):

Yeah, maybe postpone for now. The OIDC Bearer Access to the DV API is still experimental, too.

view this post on Zulip Jan Range (Aug 22 2024 at 08:58):

Makes sense. Have a great and relaxing vacation :island:

view this post on Zulip Sebastian Höffner (Sep 11 2024 at 08:14):

Sorry for the long silence, but I had to a) think about some of the "mixup of concepts and tech" and b) think about the feature a little more.

I met with @Jan Range yesterday and we cleared up some of the "mixup of concepts and tech" by implementing a prototype for a login of pyDataverse via Dataverse and Keycloak. You can find the details in https://github.com/gdcc/pyDataverse/issues/209#issuecomment-2342862132. We'll meet next week to flesh out the details and maybe implement it in pyDataverse.

Regarding tests and CI etc. I am not yet sure, as we still have a few manual/interactive steps:

view this post on Zulip Philip Durbin 🚀 (Sep 11 2024 at 13:41):

Go go go! :tada:

view this post on Zulip Philip Durbin 🚀 (Sep 18 2024 at 15:15):

At standup @Guillermo Portas just mentioned that the frontend team might turn its attention to auth soon. There's a pretty good chance we'll talk about it at our next tech hours on Tuesday.

view this post on Zulip Philip Durbin 🚀 (Sep 18 2024 at 15:16):

As a starting point, we have a short and long doc on auth (via our list of re-arch docs).

view this post on Zulip Philip Durbin 🚀 (Sep 18 2024 at 15:18):

Also, @Jan Range @Slava Tykhonov and I just talked about auth a bit at the pyDataverse meeting. The recording should be up soon at https://py.gdcc.io

view this post on Zulip Jan Range (Sep 18 2024 at 15:39):

Thanks @Philip Durbin 🐉 ! Recording and date/notes of the next meeting are online at https://py.gdcc.io :raised_hands:

view this post on Zulip Philip Durbin 🚀 (Oct 07 2024 at 17:00):

Check this out:

view this post on Zulip Oliver Bertuch (Oct 07 2024 at 17:01):

Technically this has been around before, too :wink: So no need to compile manual images, should work with what's in 6.2, 6.3 and 6.4

view this post on Zulip Jan Range (Oct 07 2024 at 17:02):

Just checked it out! Looks great, will test it tomorrow :muscle:

view this post on Zulip Philip Durbin 🚀 (Oct 16 2024 at 15:34):

@Jan Range it was nice to be reminded that the gh command line app shows the experience we want. Tell it to auth and a browser window pops up. You auth there and return to the command line.

view this post on Zulip Philip Durbin 🚀 (Oct 16 2024 at 15:35):

I brought it up at standup and there was some talk of device flow but I don't have any particular insight to share with you.

view this post on Zulip Philip Durbin 🚀 (Oct 16 2024 at 15:35):

We completely understand the need for this. We don't want people hard coding API tokens into their notebooks either! :sweat_smile:

view this post on Zulip Sebastian Höffner (Oct 24 2024 at 08:15):

The bearer-token-example is not really what we want to achieve though.
It's opening a full-blown test browser session to grab content from the DOM – it's a bit crazy to bring in selenium or similar just to open a browser page – instead, one could do what Jan and I did and simply do the requests with httpx, resulting in the same outcome.

What we want to achieve is to have the normal browser session handle authentication and return some bearer token to a callback URL. Unfortunately, the way this currently works, we would need to MITM to the IdP (which kind of works in this case but is not a good practice). I think getting it to work the same way the gh CLI et al. do it will require some changes in Dataverse itself – that's what Jan and I figured out during our last coding session.

We probably won't be able to stop people from storing some tokens in notebooks, though. It's tricky to do that right if you are in a remote-only context (as you might need JS–Python–interop), but also requires some conscious thoughts in a local notebook (e.g., storing tokens in a path or credential manager the server has access to and not accidentally committing it etc.) I think the only way to do that is to educate people about tokens and lead by example...

view this post on Zulip Oliver Bertuch (Oct 24 2024 at 10:41):

The only way to avoid storing some kind of secret inside a notebook is by having it injected by the service that runs the notebook for you.

view this post on Zulip Oliver Bertuch (Oct 24 2024 at 10:41):

This may be some kind of secret env var that is stored for your user account. Or, even better, the service injects an ID Token, which is a JWT. As long as the Dataverse backend trusts the origin of this token, it would grant you access.

view this post on Zulip Oliver Bertuch (Oct 24 2024 at 10:43):

Injecting these tokens may happen in different ways. One is to have a service that already uses OIDC request another access token on your behalf, or forward one you already provided from the IdP, and send it along to the Dataverse backend.

view this post on Zulip Oliver Bertuch (Oct 24 2024 at 10:44):

The other way is to have the service create JWT tokens from its own provider and move those along. This is especially useful in unattended jobs like pipelines etc.

view this post on Zulip Oliver Bertuch (Oct 24 2024 at 10:45):

Examples for those are the Github Tokens you can use to publish Python packages via OIDC. Gitlab offers the same type of tokens.

view this post on Zulip Oliver Bertuch (Oct 24 2024 at 10:46):

Again, in any case it will require trust at the Dataverse backend side by configuring the IdPs.

view this post on Zulip Oliver Bertuch (Oct 24 2024 at 11:00):

And of course there are other options, how to get ahold of a secret without storing it in the Notebook... getpass, SOPS, integration with secrets managers, ...
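
The simplest of these can be sketched in a few lines: read the token from an environment variable if the notebook service injected one, otherwise ask for it interactively with `getpass` so it never ends up in a cell. The variable name `DATAVERSE_API_TOKEN` and the helper name are assumptions for this example.

```python
import getpass
import os


def load_api_token(env_var: str = "DATAVERSE_API_TOKEN", prompt=getpass.getpass) -> str:
    """Return the token from the environment, or prompt for it without echoing."""
    token = os.environ.get(env_var)
    if token:
        return token
    # Nothing injected: ask the user; the input is not shown and not stored in the notebook.
    return prompt("Dataverse API token: ")
```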

view this post on Zulip Philip Durbin 🚀 (Oct 24 2024 at 11:37):

@Sebastian Höffner a while ago you and @Jan Range posted a video called "oidc-httpx-flow.mov". Is the code available?

view this post on Zulip Jan Range (Nov 06 2024 at 06:47):

@Philip Durbin 🐉 This one required client id and secret to run, and I am not sure if this is suitable for a generic implementation. Ultimately, when @Sebastian Höffner and I tried to replicate the browser flow, we encountered KeyCloak-specific issues, which prevented at least a single solution.

Here is the code:

OIDC_Auth_Prototype.zip

view this post on Zulip Philip Durbin 🚀 (Nov 06 2024 at 13:30):

I took a quick look. Thanks. Maybe it'll come in handy again. :grinning:

view this post on Zulip Sebastian Höffner (Nov 07 2024 at 12:12):

Thanks for uploading the sources, Jan!

In general the main roadblock we faced was more of an "ethical" issue. We could pretend to be the browser and have the users pass their login credentials to us, effectively MITM'ing them. That's exactly what the bearer-token-example in https://dataverse.zulipchat.com/#narrow/channel/377090-python/topic/auth.20options/near/475346538 does with a full blown browser driver they have control over. Then everything works. But the way it should work is without pretending to be the browser and not forcing the user to enter credentials in an "untrusted" browser or app.

But that is something dataverse likely has to implement properly.


Last updated: Nov 01 2025 at 14:11 UTC