Just today I've been looking into options for interacting with our competitor, InvenioRDM, when you don't have an API token yet, and into how to make the process of getting one as simple as possible for users.
There is already an API endpoint to receive a token for a user, but it's not very usable at the moment.
IMHO a nice feature for pyDataverse would be an option to enable an auth flow. Wouldn't it be nice if users could run a Python script with pyDataverse that opens a browser window for them, lets them log in with their usual credentials, and afterwards continues working with the API, all without asking them to create and provide an API token first? (Or, even more complicated, a signed URL.)
Thoughts anyone?
That would be killer. :rock_on:
Getting the API key is difficult because it's behind some JavaScript, so that's not easily scriptable, although it might be possible. I just tried it for like 5 minutes or so.
Opening the browser is possible, but transferring something out of the browser without a cooperating website is difficult: you would need to allow and trigger a callback URL like http://localhost:9473/login-callback. That would require implementation effort on Dataverse's side, unless they already support a similar callback for OIDC one could re-use.
However, in general Dataverse seems to support this kind of authentication (and pyDataverse kind of as well in https://github.com/gdcc/pyDataverse/pull/201, as that adds support for Bearer tokens), although it expects authentication to happen out-of-band: https://guides.dataverse.org/en/latest/developers/remote-users.html . So if you use Shibboleth, Keycloak, or another OIDC provider, you can handle this use case by first logging in (potentially with any OIDC Python library) and using that token in BearerTokenAuth(...) in #201.
In general, all this should be considered as part of Phase 3 in https://py.gdcc.io/ , especially when integrating DVCLI.
For Signed URLs, I started a longer discussion here: https://github.com/gdcc/pyDataverse/issues/200, and I am not sure those will solve the problem at hand as their use case is somewhat different.
I appreciate you digging into this. I'm afraid I don't have any bright ideas on how to move forward. Maybe we can brainstorm a bit at a future pyDataverse meeting.
Oh, wow, you even made diagrams: https://github.com/gdcc/pyDataverse/issues/200#issuecomment-2254517953
Maybe it's fine to make this auth thing require people to use an OIDC provider. That will be a necessity as soon as the SPA is around anyway.
WRT receiving an API token for a user, there are some not very well documented endpoints available.
For instance one to recreate your token https://github.com/IQSS/dataverse/blob/c39ac8843738ebf3e48be17370b2a35f49432226/src/main/java/edu/harvard/iq/dataverse/api/Users.java#L160
It doesn't return a nice JSON response, but we could change that.
So folks could trade an OIDC access token for an API token.
Using a short-lived localhost server started by pyDataverse, this should be fairly simple to achieve.
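A sketch of that short-lived localhost server with nothing but the standard library — the port and callback path here are made up and would have to match whatever redirect URI gets registered with the IdP:

```python
import http.server
import urllib.parse


class _CallbackHandler(http.server.BaseHTTPRequestHandler):
    code = None  # the OAuth2 authorization code lands here

    def do_GET(self):
        query = urllib.parse.urlparse(self.path).query
        params = urllib.parse.parse_qs(query)
        _CallbackHandler.code = params.get("code", [None])[0]
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Logged in - you can close this window now.")

    def log_message(self, *args):
        pass  # keep the user's console quiet


def wait_for_auth_code(port: int = 9473, timeout: float = 120.0):
    """Serve exactly one request on localhost and return its ?code=... value.

    pyDataverse would open the IdP's authorization URL in a browser first
    (e.g. via webbrowser.open) with redirect_uri=http://localhost:9473/login-callback,
    then trade the returned code for tokens at the token endpoint.
    """
    server = http.server.HTTPServer(("127.0.0.1", port), _CallbackHandler)
    server.timeout = timeout
    server.handle_request()  # blocks until the IdP redirects the browser back
    server.server_close()
    return _CallbackHandler.code
```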
Around the SPA there has been discussion about making Dataverse an OAuth2/OIDC identity provider, too. Built-in users, migrations, and such things would potentially be a lot easier that way.
That would obviously require many more implementation changes in DV... Which is why one of the ideas is to ask people to use Keycloak and add HTTP Basic Auth against Dataverse to it.
Dang, with this tutorial https://www.baeldung.com/java-ee-oauth2-implementation it doesn't seem so complicated to make Dataverse an Authorization Server itself.
"Login with Dataverse" how's that sound @Philip Durbin
Sounds nice :grinning:
I cannot comment on identity management, so I'll instead focus on the token auth.
Recreating a token might solve the issue for testing or single app access. However, since Dataverse only allows a single API token (and this one will be rotated with the recreation request), this will cause problems if you use your API token in multiple services. I didn't find another API endpoint to actually retrieve a (new) token.
I was about to try it with the recreate endpoint, though (which would make the (local) test setup for pyDataverse much easier), and found another edge case we should support in pyDataverse: currently demo.dataverse.org is in maintenance mode and returns a 200 status code with a big HTML page for every API call... I think this should probably be a 503 status code, but well. I stored the HTML locally and will see if we can make that work in the error handling.
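For the error handling, a heuristic along these lines might do — the function name is made up, not existing pyDataverse API:

```python
def looks_like_html(body: str, content_type: str = "") -> bool:
    """Detect maintenance-mode responses: a 200 status code carrying a
    full HTML page instead of the expected JSON payload."""
    if "text/html" in content_type.lower():
        return True
    head = body.lstrip()[:100].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")
```

pyDataverse could then raise a dedicated "installation unavailable" error instead of failing later on JSON parsing.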
Do you know how to call the users/token/recreate path? In the browser it's some very tricky POST call to http://localhost:8080/dataverseuser.xhtml, which only works because at that point I already have a session cookie.
I tried it with POST to /users/token/recreate, /api/users/token/recreate, /api/v1/users/token/recreate and directly to dataverseuser.xhtml similar to what the web UI does. I tried the following form data:
loginForm:credentialsContainer:0:credValue = username
loginForm:credentialsContainer:1:sCredValue = password
but it didn't work. So I guess I need to somehow perform a login, retrieve the cookie, and then interact with the token?
@Philip Durbin re the diagrams, note that those are solely related to signed URLs, which I originally understood should work like variant 2, which seems... wrong. They seem to be intended more in line with variants 1 and 3 there.
@Sebastian Höffner you're asking how to recreate a token via API? Please see https://guides.dataverse.org/en/6.3/api/native-api.html#recreate-a-token
Thanks, I was more asking how to bootstrap a token or retrieve it via the API, because recreating it with one app might break others.
In this case, one already needs to know the API token to authenticate, but I am wondering how to get an API token without manually logging in. Although it is probably safer to simply not allow that; otherwise other services might want to grab your credentials to retrieve a token.
I'm pretty sure we can create new endpoints for this kind of thing.
To ensure safety, we can add filters so it would for example require logging in via bearer token.
Another thought I had: it would probably help if we could give people collection-based access tokens and not just PATs.
So many options :-)
I don't think it's really necessary right now.
But I think since Dataverse supports various OIDC sources, we could at least make the auth flow happen somehow. I'll read up on that a little to see how it might go.
Are there other apps that do this well? DataLad or whatever app?
I know that Vault and Nomad have such CLI login flows, but I haven't had a closer look at how they are implemented – I just know that I had to configure a localhost:... callback URL to make it work (https://github.com/hashicorp/nomad/blob/main/command/login.go), and it works really well: you type nomad login, it opens the browser, you do your OIDC login, it performs a callback to localhost, you have a token. They also support the other way: nomad ui -authenticate will open the browser and pass a token to it, if you happen to have one on the CLI.
The most prominent example I know is Kubernetes, using kubectl with OIDC login
All you need is a local, short-lived webserver you redirect to. That way you get the auth code flow going.
I haven't tried Zulip Terminal but I wonder how auth works for it.
Another option is the device flow, but it is less commonly supported by OIDC/OAuth IdPs.
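For comparison, here's a sketch of that device flow (RFC 8628) with just the standard library. The endpoint URLs would come from the IdP's /.well-known/openid-configuration discovery document, and the client ID is a placeholder:

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request


def device_flow(device_endpoint: str, token_endpoint: str, client_id: str,
                scope: str = "openid") -> str:
    """Run the OAuth2 device authorization grant and return an access token."""
    # Step 1: ask the IdP for a device_code / user_code pair.
    body = urllib.parse.urlencode({"client_id": client_id, "scope": scope}).encode()
    with urllib.request.urlopen(device_endpoint, data=body) as resp:
        dev = json.load(resp)

    # Step 2: the user completes the login in any browser, on any device.
    print(f"Visit {dev['verification_uri']} and enter the code {dev['user_code']}")

    # Step 3: poll the token endpoint until the user has approved.
    poll = urllib.parse.urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:device_code",
        "client_id": client_id,
        "device_code": dev["device_code"],
    }).encode()
    while True:
        time.sleep(dev.get("interval", 5))
        try:
            with urllib.request.urlopen(token_endpoint, data=poll) as resp:
                return json.load(resp)["access_token"]
        except urllib.error.HTTPError as err:
            # The IdP answers 400 "authorization_pending" until login completes.
            if json.load(err).get("error") != "authorization_pending":
                raise
```

The upside over the localhost-callback variant: it also works in SSH sessions and notebooks where no browser can be opened on the same machine.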
pyDataverse should also take care of caching tokens and refreshing them.
Oh. "NOTE: If you use Google, Github or another external authentication to access your Zulip organization then you likely won't have a password set and currently need to create one to use zulip-terminal." -- https://github.com/zulip/zulip-terminal#running-for-the-first-time
This shouldn't be the case for Nomad and k8s though, so I guess with the short-lived local server we are probably good to go.
I'll create an issue to track this and link to this thread for some details.
https://github.com/gdcc/pyDataverse/issues/209
Awesome, thanks
I have to admit I have no experience with OIDC/OAuth yet, but I think this is a nice feature! I am happy to support you wherever possible :smile:
On a different note, I’ve been experimenting with the keyring crate in Rust for the rust-dataverse library. This crate allows users to securely store credentials (URL and token) under an alias in the OS’s dedicated secure store. When these credentials are used within the CLI, access must be granted, with the option to permanently allow it for convenience.
While it's not exactly the same as having an online login, it has made my workflow more convenient by eliminating the need to constantly copy the token and URL into my environment. Perhaps there's a similar solution in Python that could offer the same level of convenience.
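There is indeed a direct Python counterpart: the `keyring` package (`pip install keyring`), which talks to the OS secret store (macOS Keychain, Windows Credential Manager, Secret Service on Linux). A sketch — the service name and alias scheme below are made up:

```python
import keyring  # pip install keyring

SERVICE = "pyDataverse"  # hypothetical service name in the OS secret store


def store_credentials(alias: str, base_url: str, api_token: str) -> None:
    """Save URL and token under an alias, like the Rust keyring crate setup."""
    keyring.set_password(SERVICE, f"{alias}:url", base_url)
    keyring.set_password(SERVICE, f"{alias}:token", api_token)


def load_credentials(alias: str):
    """Return (base_url, api_token) for an alias; entries are None if unset."""
    return (keyring.get_password(SERVICE, f"{alias}:url"),
            keyring.get_password(SERVICE, f"{alias}:token"))
```

Same caveat as on the Rust side: on first access the OS may prompt the user to grant the process access to the stored secret.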
The OIDC/OAuth thing is mostly about making it much more convenient to retrieve some token for further use: either caching the access and refresh tokens to interact with the API, or retrieving a longer-lasting PAT. That could, as you said, be stored in some secure storage integrated with the OS.
Here's also an example of combining OIDC tokens with shortlived API tokens: https://docs.pypi.org/trusted-publishers
Thanks, that's fancy!
I just learned that Zenodo is an OAuth2 Authorization Server! You can even add OAuth applications as a user :smile: Dataverse should certainly have the same functionality :see_no_evil:
I checked out the OIDC stuff but I wasn't able to spin it up properly without modifying the /etc/hosts file (see https://guides.dataverse.org/en/latest/developers/remote-users.html#openid-connect-oidc).
This makes it tricky to actually write some tests, so I'm gonna have to think about this a little more. Maybe I can configure Keycloak in a different way than what the repo does (the config is not linked in the docs but referenced; it's located at https://github.com/IQSS/dataverse/tree/develop/conf/keycloak).
At some point we should add this to the dataverse-action, so it's at least easy to test within CI.
@Oliver Bertuch we could have a small hackathon and implement the localstack/minio services too. Would be beneficial for testing the direct S3 upload.
It clicked after today's PyWG meeting and a deeper dive into OIDC. I took the server idea from @Sebastian Höffner to Python and tested the auth flow using the httpx.Auth base class. Works just fine, although it is very much hard-coded to work with the local keycloak service. Maybe we can use this as a starter to work toward a general solution.
One thing I am still puzzled by is how one should know the client_secret and client_id in advance. I am not very experienced with this type of auth flow, but I am sure there are clever ways to do this or work around it.
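For reference, a hard-coded variant like that presumably boils down to a direct (resource owner password) grant against the local Keycloak token endpoint, which only works because client ID and secret are baked in. All values below are placeholders for a local test realm; a browser-based auth code flow would be preferable outside of tests:

```python
import json
import urllib.parse
import urllib.request


def password_grant(token_endpoint: str, client_id: str, client_secret: str,
                   username: str, password: str) -> dict:
    """Trade user credentials for tokens at a Keycloak token endpoint.

    Returns the token response (access_token, refresh_token, expires_in, ...).
    """
    body = urllib.parse.urlencode({
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
    }).encode()
    with urllib.request.urlopen(token_endpoint, data=body) as resp:
        return json.load(resp)
```

A public client would drop the client_secret field entirely, which sidesteps half of the "how do we know the secret in advance" problem.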
Here are some additional thoughts:
Would it make more sense for the callback and bearer token retrieval to be handled server-side?
Given that Dataverse already has access to the Auth Provider’s ID and Secret, it could manage this process instead of pyDataverse. In this setup, pyDataverse would initiate the authentication flow, manage the web browser opening for user authentication, and then receive the token directly from Dataverse. To test this approach, we could consider extending the Docker Compose file with a small sidekick API using Flask (Python) or Rocket (Rust) for now instead of extending the Dataverse API.
Additionally, I believe this workflow could eliminate the need for local etc/hosts modifications, as the sidekick server is already within the Docker network, making testing more straightforward.
If this has already been implemented elsewhere or if this was the plan already, feel free to disregard— I'm just learning as I go :grinning_face_with_smiling_eyes:
Regarding client_secret and client_id: you need to ask the OIDC provider,
and they will register your return-url (the URL after being logged in) on their side.
You can try with Helmholtz AAI, it's pretty straightforward.
[Although I'm completely unaware of what the earlier conversation on this chat was]
Thanks @Lincoln I will look into this :raised_hands:
It reads like there are some mixups of concepts and tech here...
Just remembered: in the Helmholtz AAI portal you can actually register your return-url yourself, it's customizable.
but somehow for me the Flask POST response was not working
Probably someone trying to use pyDataverse as an OIDC client and interacting with Dataverse's API using an access token should use a public client. Then no secret is necessary.
These clients should always be different from the client credentials a Dataverse installation uses.
For pyDataverse usually acting as a CLI client, there are two ways to retrieve an access token. Either make pyDataverse run a simple localhost server that you send a browser window to - or - use the device auth flow.
If the access token from an OIDC provider could be merged with or used as the Dataverse access token...
That would be really cool
However, access tokens from an OIDC provider are (by default) only short-lived for security reasons.
Here's a work in progress using GitHub OAuth2 and a minimal local server for Hermes init purposes: https://github.com/softwarepub/hermes/blob/feature/init-command/src/hermes/commands/init/oauth_github.py Just as an example of what this could look like.
@Oliver Bertuch Yes, I did that in the example provided, but to make it work, I had to hard-code the ID and secret into the authentication flow at pyDataverse. However, this approach isn’t sustainable, so I was looking for alternative solutions. Apologies for the confusion—I’m still in the early stages of learning OIDC.
Lincoln said:
If the access token from an OIDC provider could be merged with or used as the Dataverse access token...
This is already available as a feature, hidden behind a feature flag for now as experimental.
access tokens from an OIDC provider are (by default) only short-lived for security reasons
True. But: when authenticating with the provider, you also receive a refresh token. That one is usually longer lived and can be used to get a new access token after it has expired.
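The refresh itself is a single POST to the IdP's token endpoint. A sketch for a public client (so no secret); the endpoint URL and client ID are placeholders that would come from the IdP's discovery document and client registration:

```python
import json
import urllib.parse
import urllib.request


def refresh_access_token(token_endpoint: str, client_id: str,
                         refresh_token: str) -> str:
    """Exchange a refresh token for a new access token (public client)."""
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "refresh_token": refresh_token,
    }).encode()
    with urllib.request.urlopen(token_endpoint, data=body) as resp:
        return json.load(resp)["access_token"]
```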
hard-code the ID and secret into the authentication flow
With a public client you can at least omit the secret, just need the client ID. There's not really a good way around that one.
However, this approach isn’t sustainable, so I was looking for alternative solutions.
Which is why I was suggesting making Dataverse an OAuth2 identity provider. It's probably a lot easier to make integrations between Dataverse and pyDataverse happen than some workaround how to get the OIDC provider going. We could create a discovery endpoint in Dataverse to retrieve a config that pyDataverse or others can work with.
With OIDC there exist mechanisms to register a client dynamically. But as far as I know, these are not very widespread in academia.
I agree, that would simplify the process by a lot. But I guess that's a lot of work to implement upstream, or am I wrong?
It's nothing done in 5 minutes, no. But it would solve a lot of problems and is also very relevant to the SPA work (builtin users compatibility). I don't know if there is an issue already, but it's certainly worth opening one.
Should we consider postponing the OIDC feature for pyDataverse? It seems that any current solution either requires users to have access to sensitive information or is more cumbersome than simply using an API Token. We could include the server you mentioned as a service within the compose setup, but I’m concerned that this might be a limited solution, as other installations are unlikely to have this server available if it is not part of the Dataverse instance itself.
Yeah, maybe postpone for now. The OIDC Bearer Access to the DV API is still experimental, too.
Makes sense. Have a great and relaxing vacation :island:
Sorry for the long silence, but I had to a) think about some of the "mixup of concepts and tech" and b) think about the feature a little more.
I met with @Jan Range yesterday and we cleared up some of the "mixup of concepts and tech" by implementing a prototype for a login of pyDataverse via Dataverse and Keycloak. You can find the details in https://github.com/gdcc/pyDataverse/issues/209#issuecomment-2342862132. We'll meet next week to flesh out the details and maybe implement it in pyDataverse.
Regarding tests and CI etc. I am not yet sure, as we still have a few manual/interactive steps:
Go go go! :tada:
At standup @Guillermo Portas just mentioned that the frontend team might turn its attention to auth soon. There's a pretty good chance we'll talk about it at our next tech hours on Tuesday.
As a starting point, we have a short and long doc on auth (via our list of re-arch docs).
Also, @Jan Range @Slava Tykhonov and I just talked about auth a bit at the pyDataverse meeting. The recording should be up soon at https://py.gdcc.io
Thanks @Philip Durbin 🐉 ! Recording and date/notes of the next meeting are online at https://py.gdcc.io :raised_hands:
Check this out:
Technically this has been around before, too :wink: So no need to build images manually; it should work with what's in 6.2, 6.3, and 6.4.
Just checked it out! Looks great, will test it tomorrow :muscle:
@Jan Range it was nice to be reminded that the gh command line app shows the experience we want. Tell it to auth and a browser window pops up. You auth there and return to the command line.
I brought it up at standup and there was some talk of device flow, but I don't have any particular insight to share with you.
We completely understand the need for this. We don't want people hard coding API tokens into their notebooks either! :sweat_smile:
The bearer-token-example is not really what we want to achieve though.
It's opening a full-blown test browser session to grab content from the DOM – it's a bit crazy to bring in selenium or similar just to open a browser page – instead, one could do what Jan and I did and simply do the requests with httpx, resulting in the same outcome.
What we want to achieve is to have the normal browser session handle authentication and return some bearer token to a callback URL. Unfortunately, the way this currently works, we would need to MITM the IdP (which kind of works in this case but is not good practice). I think getting it to work the same way the gh CLI et al. do it will require some changes in Dataverse itself – that's what Jan and I figured out during our last coding session.
We probably won't be able to stop people from storing some tokens in notebooks, though. It's tricky to do that right if you are in a remote-only context (as you might need JS–Python interop), and it also requires some conscious thought in a local notebook (e.g., storing tokens in a path or credential manager the server has access to, and not accidentally committing them, etc.). I think the only way to do that is to educate people about tokens and lead by example...
The only way to avoid storing some kind of secret inside a notebook is by having it injected by the service that runs the notebook for you.
This may be some kind of secret env var that is stored for your user account. Or, even better, the service injects an ID Token, which is a JWT. As long as the Dataverse backend trusts the origin of this token, it would grant you access.
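Such a JWT can be inspected without any extra dependencies. Note that this decodes without verifying the signature; the Dataverse backend would of course have to verify it against the IdP's published keys:

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the (unverified!) claims of a JWT of the form header.payload.signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
```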
Injecting these tokens may happen in different ways. One is to have a service that already uses OIDC request another access token on your behalf, or forward one you already provide from the IdP, and send it along to the Dataverse backend.
The other way is to have the service create JWT tokens from its own provider and move those along. This is especially useful in unattended jobs like pipelines etc.
Examples of those are the GitHub tokens you can use to publish Python packages via OIDC. GitLab offers the same type of tokens.
Again, in any case it will require trust at the Dataverse backend side by configuring the IdPs.
And of course there are other options for getting hold of a secret without storing it in the notebook... getpass, SOPS, integration with secrets managers, ...
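The simplest of those as a sketch — the environment variable name is made up:

```python
import getpass
import os


def get_api_token() -> str:
    """Resolve an API token without ever writing it into the notebook:
    prefer an injected environment variable, otherwise prompt for it
    interactively (getpass does not echo the input)."""
    return os.environ.get("DATAVERSE_API_TOKEN") or getpass.getpass("Dataverse API token: ")
```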
@Sebastian Höffner a while ago you and @Jan Range posted a video called "oidc-httpx-flow.mov". Is the code available?
@Philip Durbin 🐉 This one required a client ID and secret to run, and I am not sure that is suitable for a generic implementation. Ultimately, when @Sebastian Höffner and I tried to replicate the browser flow, we encountered Keycloak-specific issues, which prevented a single generic solution.
Here is the code:
I took a quick look. Thanks. Maybe it'll come in handy again. :grinning:
Thanks for uploading the sources, Jan!
In general the main roadblock we faced was more of an "ethical" issue. We could pretend to be the browser and have the users pass their login credentials to us, effectively MITM'ing them. That's exactly what the bearer-token-example in https://dataverse.zulipchat.com/#narrow/channel/377090-python/topic/auth.20options/near/475346538 does with a full-blown browser driver they have control over. Then everything works. But the way it should work is without pretending to be the browser and without forcing the user to enter credentials in an "untrusted" browser or app.
But that is something Dataverse likely has to implement properly.