In https://github.com/IQSS/dataverse-pm/issues/394 we're talking about reporting on various metrics from https://hub.dataverse.org
What if we add this functionality to pyDataverse? :thinking:
There's a nice OpenAPI endpoint: https://hub.dataverse.org/openapi
And Swagger: https://hub.dataverse.org/swagger-ui/index.html?url=/openapi
@Jan Range @Juan Pablo Tosca Villanueva what do you think? :big_smile:
For now, for the number of installations, I've made a draft PR with a recipe: https://github.com/gdcc/dataverse-recipes/pull/15
@Juan Pablo Tosca Villanueva @Ceilyn Boyd ^^
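The core of such a recipe is small. A minimal sketch, assuming the hub's /api/installation endpoint returns a JSON array of installation objects (httpx is just one choice of HTTP client):

import httpx

# Count the installations registered at the hub.
# Assumes the endpoint returns a JSON array of installation objects.
response = httpx.get("https://hub.dataverse.org/api/installation")
response.raise_for_status()
installations = response.json()
print(f"Number of installations: {len(installations)}")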
Nice!! Sure, sounds great :smile:
Would also put it in DVCLI :-P
I keep forgetting about dvcli!
But I think @Ceilyn Boyd would rather have it in Python than Rust.
Ah alright, Python first then :muscle:
So, my real question is given https://hub.dataverse.org/openapi do you have some Pythonic way to generate some nice bindings?
Can you open an issue at pyDataverse?
Yes, there are a couple of generators. I think there are also some that use Pydantic. I'll have a look at them.
Sure! Done! https://github.com/gdcc/pyDataverse/issues/218
Awesome! Thanks :smile:
We have two options available to utilize the OpenAPI specs in pyDataverse/Python:
The first option is a streamlined approach that generates both models and endpoints, resulting in a fully functional client library. Various tools support this, some open source, others partially proprietary, and they often also generate documentation alongside the code (an example command follows the pros and cons).
Pros:
Cons:
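To illustrate the first option: a tool like openapi-python-client (my example here, not necessarily what we'd pick) can generate a complete client package, models and endpoints included, from the hub's spec:

# A hedged example using openapi-python-client, one of several such generators.
pip install openapi-python-client
openapi-python-client generate --url https://hub.dataverse.org/openapi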
The second option is to generate only the models, for example using a Pydantic V2 generator, and implement the endpoints manually. This approach gives us more control and makes integration with pyDataverse smoother.
Pros:
Cons:
I'm leaning toward this second option, as it aligns better with the current pyDataverse codebase. Implementing an endpoint is relatively simple: it usually involves an HTTP call and some optional validation logic. As seen in the current PR, it's just a few lines of mostly repetitive code. The biggest challenge, in my view, is setting up the models, and that is well solved with the second approach.
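To make "a few lines of mostly repetitive code" concrete, here is a minimal sketch of a hand-written endpoint on top of a generated model; the function name and the trimmed-down Installation model are hypothetical stand-ins:

import httpx
from pydantic import BaseModel

# Stand-in for a model that would come out of the generator.
class Installation(BaseModel):
    dvHubId: str
    name: str
    country: str

# A hand-written endpoint: one HTTP call plus validation against the model.
def get_installations(base_url: str = "https://hub.dataverse.org") -> list[Installation]:
    response = httpx.get(f"{base_url}/api/installation")
    response.raise_for_status()
    return [Installation.model_validate(item) for item in response.json()]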
Alongside the model generation, we could opt for a hybrid approach and generate the methods dynamically from the OpenAPI spec; one possible approach is sketched below.
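A minimal sketch of such dynamic generation, assuming we only wire up parameterless GET paths (all names here are hypothetical):

import httpx

BASE_URL = "https://hub.dataverse.org"

def make_endpoint(path: str):
    # Build a GET method for one path in the OpenAPI spec.
    def method(self):
        response = httpx.get(f"{BASE_URL}{path}")
        response.raise_for_status()
        return response.json()
    return method

class HubClient:
    pass

# Attach one method per parameterless GET path found in the spec.
spec = httpx.get(f"{BASE_URL}/openapi").json()
for path, operations in spec["paths"].items():
    if "get" in operations and "{" not in path:
        name = path.strip("/").replace("/", "_").replace("-", "_")
        setattr(HubClient, name, make_endpoint(path))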
Thanks so much for working on this, @Jan Range!
I see a lot of value in how Pydantic can generate the models automatically.
In https://github.com/gdcc/pyDataverse/pull/219 you've already done this for us. You created pyDataverse/hub/models.py. But how was it done? That's the part I'm especially curious about.
I initially implemented these models using Claude along with some manual work. However, this morning I tried out the Pydantic Data Model Generator, and it works well, except for the monthly endpoint.
The issue lies in the OpenAPI specification for api/installation/metric/monthly, which appears to be out of sync with the actual API response. Here's a breakdown:
OpenAPI Spec
The response is defined as an array of InstallationMetrics objects, each of which nests the installation inside the metric:

"200": {
  "description": "Registered installations metrics by month success",
  "content": {
    "application/json": {
      "schema": {
        "type": "array",
        "items": {
          "$ref": "#/components/schemas/InstallationMetrics"
        }
      }
    }
  }
}

And the referenced InstallationMetrics schema (abridged):

"InstallationMetrics": {
  "type": "object",
  "description": "Dataverse installation metrics",
  "properties": {
    "installation": {
      "$ref": "#/components/schemas/Installation"
    },
    "recordDate": {
      "type": "string",
      "format": "date-time",
      "description": "Date when the metrics were captured",
      "example": "2024-10-31T20:13:03.422+00:00"
    },
    "files": {
      "type": "integer",
      "format": "int64",
      "description": "Number of files in the Dataverse installation",
      "example": 100000
    }
  }
}
The actual response is the other way around: the Installation object contains the metrics within its metrics property. Here is an example (one element of the returned array):

{
  "dvHubId": "DVN_JOHNS_HOPKINS_RESEARCH_DATA_REPOSITORY_2013",
  "name": "Johns Hopkins Research Data Repository",
  "country": "USA",
  "continent": "North America",
  "launchYear": 2013,
  "metrics": [
    {
      "recordDate": "2025-03-14T09:04:19.638513",
      "files": 10449,
      "downloads": 68983,
      "datasets": 505,
      "harvestedDatasets": 0,
      "localDatasets": 505,
      "dataverses": 66
    }
  ]
}
Once this is fixed on the server side, the generated models will match the actual API response.
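For reference, models matching the actual (nested) response shape would look roughly like this; a sketch with field types inferred from the example above:

from datetime import datetime
from pydantic import BaseModel

class InstallationMetrics(BaseModel):
    recordDate: datetime
    files: int
    downloads: int
    datasets: int
    harvestedDatasets: int
    localDatasets: int
    dataverses: int

class Installation(BaseModel):
    dvHubId: str
    name: str
    country: str
    continent: str
    launchYear: int
    metrics: list[InstallationMetrics]  # nested inside the installation, as in the actual response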
I already have a PR prepared that includes the CI for model generation and updated endpoints based on the new models.
As soon as the server-side fix is in place, we can open the PR on the repository and proceed with the merge.
Currently, the hub tests are failing because of this mismatch.
Interesting! Do you mind creating an issue about that at https://github.com/IQSS/dataverse-hub ?
Of course, will open an issue once back home :smile:
Awesome. Thanks.
Meanwhile, this seems to work fine!
datamodel-codegen --url https://hub.dataverse.org/openapi
I did have to install httpx as well.
% cat requirements.txt
datamodel-code-generator==0.28.5
httpx==0.28.1
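For the record, the same tool can also target Pydantic v2 models and write straight to a file; the output path below is just where pyDataverse currently keeps these models:

datamodel-codegen --url https://hub.dataverse.org/openapi \
  --output pyDataverse/hub/models.py \
  --output-model-type pydantic_v2.BaseModel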
Yes, it's pretty cool! Maybe we can also adapt it to the DV OpenAPI specs
Oh, sure. Sounds good.
The hub is somewhat in flux. I'd like us to change /api/installation to /api/installations, for example. See https://github.com/IQSS/dataverse-hub/issues/21
(I'm curious if you think this would be an improvement, by the way.) :big_smile:
Using installations seems more logical and aligns with common REST design conventions. Effectively, a list of installations is also what you get back when calling the endpoint, so it makes total sense to use the plural.
Great, I'm glad you agree. :big_smile:
I also opened an issue about changing the dvHubId: https://github.com/IQSS/dataverse-hub/issues/15
I agree, the current ID approach has its flaws. It's nice for humans to read, but it might not be the most flexible when things change.
Have you thought of using geographic location as an identifier? I've recently learned about Google Plus Codes that encode longitude and latitude into a string. I guess the location won't change, or what do you think?
https://en.m.wikipedia.org/wiki/Open_Location_Code
nope, this is new to me
Maybe we should ask about it in #geospatial :big_smile:
Here is a converter:
https://www.dcode.fr/open-location-code
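And for anyone who wants to play with it in Python: a minimal sketch, assuming the openlocationcode package from PyPI (the reference implementation); the coordinates are just example values:

from openlocationcode import openlocationcode as olc

# Encode example coordinates (Cambridge, MA-ish) into a Plus Code.
code = olc.encode(42.3736, -71.1106)
print(code)  # a short string encoding latitude/longitude

# Decoding gives back a small bounding box around the point.
area = olc.decode(code)
print(area.latitudeCenter, area.longitudeCenter)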
Yeah, let's move this to #geospatial > Open Location Code
To summarize, Google Plus Codes might be problematic as a dvHubId because an org might host multiple installations of Dataverse (like DANS does). Also, while it's a remote possibility, an org could move physically.
@Jan Range looking through old posts about OpenAPI I forgot you did some analysis already: #python > Analysis of OpenAPI code generators :sweat_smile: