Stream: python

Topic: hub


view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 11:56):

In https://github.com/IQSS/dataverse-pm/issues/394 we're talking about reporting on various metrics from https://hub.dataverse.org

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 11:56):

What if we add this functionality to pyDataverse? :thinking:

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 11:57):

There's a nice OpenAPI endpoint: https://hub.dataverse.org/openapi

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 11:57):

And Swagger: https://hub.dataverse.org/swagger-ui/index.html?url=/openapi

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 11:57):

@Jan Range @Juan Pablo Tosca Villanueva what do you think? :big_smile:

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 16:58):

For now, for the number of installations, I made a draft PR as a recipe: https://github.com/gdcc/dataverse-recipes/pull/15

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 16:58):

@Juan Pablo Tosca Villanueva @Ceilyn Boyd ^^

view this post on Zulip Jan Range (Apr 04 2025 at 18:47):

Nice!! Sure, sounds great :smile:

view this post on Zulip Jan Range (Apr 04 2025 at 18:48):

Would also put it in DVCLI :-P

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 18:48):

I keep forgetting about dvcli!

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 18:48):

But I think @Ceilyn Boyd would rather have it in Python than Rust.

view this post on Zulip Jan Range (Apr 04 2025 at 18:49):

Ah alright, Python first then :muscle:

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 18:49):

So, my real question is: given https://hub.dataverse.org/openapi, do you have some Pythonic way to generate some nice bindings?

view this post on Zulip Jan Range (Apr 04 2025 at 18:49):

Can you open an issue at pyDataverse?

view this post on Zulip Jan Range (Apr 04 2025 at 18:50):

Yes, there are a couple of generators. I think there are also some that use Pydantic. I'll have a look at them.

view this post on Zulip Philip Durbin 🚀 (Apr 04 2025 at 18:53):

Sure! Done! https://github.com/gdcc/pyDataverse/issues/218

view this post on Zulip Jan Range (Apr 04 2025 at 18:53):

Awesome! Thanks :smile:

view this post on Zulip Jan Range (Apr 10 2025 at 06:28):

We have two options available to utilize the OpenAPI specs in pyDataverse/Python:

1. Full Client Library Generation

The first option is a streamlined approach that generates both models and endpoints, resulting in a fully functional client library. There are various tools available, some open source, others partially proprietary, that support this. These tools often also generate documentation alongside the code.

Pros:

Cons:

2. Manual Endpoints with Generated Models

The second option is to generate only the models, for example using a Pydantic V2 generator, and implement the endpoints manually. This approach gives us more control and makes integration with pyDataverse smoother.

Pros:

Cons:

I'm leaning toward this second option as it aligns better with the current pyDataverse codebase. Implementing an endpoint is relatively simple: it usually involves an HTTP call and some optional validation logic. As seen in the current PR, it's just a few lines of mostly repetitive code. The biggest challenge, in my view, is setting up the models, and this is well solved with the second approach.
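
As a rough illustration of the second option, here is a minimal sketch of a manually written endpoint on top of a model. It is not the code from the pyDataverse PR. To stay dependency-free it uses a plain dataclass as a stand-in for a generated Pydantic model; the names `Installation`, `fetch_json`, and `get_installations` are hypothetical, the field subset is taken from the example response further down in this thread, and it assumes `/api/installation` returns a JSON array of installation objects.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional
from urllib.request import urlopen

# Base URL as mentioned in this thread.
HUB_BASE_URL = "https://hub.dataverse.org"


@dataclass
class Installation:
    """Hand-written stand-in for a generated model (hypothetical field subset)."""

    dv_hub_id: str
    name: str
    country: Optional[str] = None
    continent: Optional[str] = None
    launch_year: Optional[int] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Installation":
        # Map the camelCase JSON keys onto snake_case attributes.
        return cls(
            dv_hub_id=data["dvHubId"],
            name=data["name"],
            country=data.get("country"),
            continent=data.get("continent"),
            launch_year=data.get("launchYear"),
        )


def fetch_json(url: str) -> Any:
    """Perform the HTTP GET and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def get_installations(
    fetch: Callable[[str], Any] = fetch_json,
    base_url: str = HUB_BASE_URL,
) -> List[Installation]:
    """The manual endpoint: one HTTP call plus validation into models."""
    payload = fetch(f"{base_url}/api/installation")
    if not isinstance(payload, list):
        raise ValueError("expected a JSON array of installations")
    return [Installation.from_dict(item) for item in payload]
```

Passing `fetch` as a parameter is only a design choice for this sketch: it lets the endpoint be exercised with canned data instead of a live request.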

A Hybrid Approach

Alongside the model generation, we could opt for a hybrid approach and generate the methods dynamically by the following approaches:

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 12:15):

Thanks so much for working on this, @Jan Range! :dataverse_man:

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 12:15):

I see a lot of value in how Pydantic can generate the models automatically.

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 12:16):

In https://github.com/gdcc/pyDataverse/pull/219 you've already done this for us. You created pyDataverse/hub/models.py. But how was it done? That's the part I'm especially curious about.

view this post on Zulip Jan Range (Apr 10 2025 at 12:42):

I initially implemented these models using Claude along with some manual work. However, this morning I tried out the Pydantic Data Model Generator, and it works well, except for the monthly endpoint.

The issue lies in the OpenAPI specification for api/installation/metric/monthly, which appears to be out of sync with the actual API response. Here's a breakdown:

OpenAPI Spec

The spec defines the response as an array of InstallationMetrics objects, each with the installation nested inside it, on the same level as the metric fields:

"200": {
  "description": "Registered installations metrics by month success",
  "content": {
    "application/json": {
      "schema": {
        "type": "array",
        "items": {
          "$ref": "#/components/schemas/InstallationMetrics"
        }
      },

"InstallationMetrics": {
  "type": "object",
  "description": "Dataverse installation metrics",
  "properties": {
    "installation": {
      "$ref": "#/components/schemas/Installation"
    },
    "recordDate": {
      "type": "string",
      "format": "date-time",
      "description": "Date when the metrics were captured",
      "example": "2024-10-31T20:13:03.422+00:00"
    },
    "files": {
      "type": "integer",
      "format": "int64",
      "description": "Number of files in the Dataverse installation",
      "example": 100000
    },
}

The actual response is the other way around. The Installation object contains the metrics within the metrics property. Here is an example output:

{
    "dvHubId": "DVN_JOHNS_HOPKINS_RESEARCH_DATA_REPOSITORY_2013",
    "name": "Johns Hopkins Research Data Repository",
    "country": "USA",
    "continent": "North America",
    "launchYear": 2013,
    "metrics": [
        {
            "recordDate": "2025-03-14T09:04:19.638513",
            "files": 10449,
            "downloads": 68983,
            "datasets": 505,
            "harvestedDatasets": 0,
            "localDatasets": 505,
            "dataverses": 66
        }
    ]
},

Once this is fixed on the server side, the generated models will match the actual API response.

I already have a PR prepared that includes the CI for model generation and updated endpoints based on the new models.

As soon as the server-side fix is in place, we can open the PR on the repository and proceed with the merge.
Currently, the hub tests are failing because of this mismatch.
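
To make the mismatch above concrete, here is a small sketch that tells the two shapes apart and converts the actual shape into the one the spec describes. The helper names `detect_monthly_shape` and `flatten_actual` are hypothetical and not part of pyDataverse or the hub; the sample item is copied from the response above.

```python
from typing import Any, Dict, List


def detect_monthly_shape(item: Dict[str, Any]) -> str:
    """Classify one array item from the monthly metrics endpoint."""
    if "installation" in item:
        return "spec"  # metrics object with a nested installation
    if "metrics" in item:
        return "actual"  # installation with a nested list of metrics
    return "unknown"


def flatten_actual(item: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one 'actual' item into spec-style records, one per metric entry."""
    installation = {key: value for key, value in item.items() if key != "metrics"}
    return [
        {"installation": installation, **metric}
        for metric in item.get("metrics", [])
    ]


# Example item copied from the response shown above.
actual_item = {
    "dvHubId": "DVN_JOHNS_HOPKINS_RESEARCH_DATA_REPOSITORY_2013",
    "name": "Johns Hopkins Research Data Repository",
    "country": "USA",
    "continent": "North America",
    "launchYear": 2013,
    "metrics": [
        {
            "recordDate": "2025-03-14T09:04:19.638513",
            "files": 10449,
            "downloads": 68983,
            "datasets": 505,
            "harvestedDatasets": 0,
            "localDatasets": 505,
            "dataverses": 66,
        }
    ],
}

records = flatten_actual(actual_item)
```

A check like this could serve as a temporary guard in tests until the spec and the server agree, but that is only a suggestion.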

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:27):

Interesting! Do you mind creating an issue about that at https://github.com/IQSS/dataverse-hub ?

view this post on Zulip Jan Range (Apr 10 2025 at 13:34):

Of course, will open an issue once back home :smile:

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:35):

Awesome. Thanks.

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:36):

Meanwhile, this seems to work fine!

datamodel-codegen --url https://hub.dataverse.org/openapi

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:36):

I did have to install httpx as well.

% cat requirements.txt
datamodel-code-generator==0.28.5
httpx==0.28.1
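
For a scripted or CI run, the same command can be wrapped in Python. This is only a sketch: the helper names and the `models.py` output path are hypothetical, and the extra flags (`--input-file-type`, `--output-model-type`, `--output`) are options of datamodel-code-generator that were not part of the command above. It assumes `datamodel-codegen` and `httpx` are installed as in the requirements file.

```python
import subprocess
from typing import List

HUB_OPENAPI_URL = "https://hub.dataverse.org/openapi"


def build_codegen_command(
    url: str = HUB_OPENAPI_URL,
    output: str = "models.py",  # hypothetical output path
) -> List[str]:
    """Build the datamodel-codegen invocation as an argument list."""
    return [
        "datamodel-codegen",
        "--url", url,
        "--input-file-type", "openapi",
        "--output-model-type", "pydantic_v2.BaseModel",
        "--output", output,
    ]


def generate_models(output: str = "models.py") -> None:
    """Run the generator; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_codegen_command(output=output), check=True)
```

Calling `generate_models()` needs network access to the hub, so the command builder is kept separate to make it easy to inspect or log.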

view this post on Zulip Jan Range (Apr 10 2025 at 13:36):

Yes, it's pretty cool! Maybe we can also adapt it to the DV OpenAPI specs

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:37):

Oh, sure. Sounds good.

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:43):

The hub is somewhat in flux. I'd like us to change /api/installation to /api/installations, for example. See https://github.com/IQSS/dataverse-hub/issues/21

(I'm curious if you think this would be an improvement, by the way.) :big_smile:

view this post on Zulip Jan Range (Apr 10 2025 at 13:45):

Using installations seems more logical and aligns with common design principles. Effectively, that's also what you get when calling the endpoint. So, it makes total sense to use the plural.

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:47):

Great, I'm glad you agree. :big_smile:

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 13:47):

I also opened an issue about changing the dvHubId: https://github.com/IQSS/dataverse-hub/issues/15

view this post on Zulip Jan Range (Apr 10 2025 at 14:45):

I agree, the current ID approach has its flaws. I see that it is nice to read for humans, but when things change it might not be the most flexible.

Have you thought of using geographic location as an identifier? I've recently learned about Google Plus Codes that encode longitude and latitude into a string. I guess the location won't change, or what do you think?
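
For anyone curious how latitude and longitude end up in one string, here is a simplified sketch of the Open Location Code encoding for standard 10-digit codes. It is not a replacement for a maintained library; the function name `encode_plus_code` is hypothetical, and shortened or longer-precision codes are not handled.

```python
import math

# The 20-character alphabet used by Open Location Code.
ALPHABET = "23456789CFGHJMPQRVWX"
PAIR_COUNT = 5  # five latitude/longitude digit pairs give a 10-digit code
# The finest grid step of a 10-digit code is 1/8000 of a degree (20 ** -3).
STEPS_PER_DEGREE = 8000


def encode_plus_code(latitude: float, longitude: float) -> str:
    """Encode coordinates in degrees as a 10-digit plus code."""
    # Shift to non-negative ranges and count whole grid steps.
    max_lat_steps = 180 * STEPS_PER_DEGREE - 1  # keeps latitude 90 in range
    lat_steps = math.floor((latitude + 90) * STEPS_PER_DEGREE)
    lat_steps = min(max(lat_steps, 0), max_lat_steps)
    lng_steps = math.floor(((longitude + 180) % 360) * STEPS_PER_DEGREE)

    # Read off base-20 digits, least significant pair first.
    digits = []
    for _ in range(PAIR_COUNT):
        digits.append(ALPHABET[lng_steps % 20])
        digits.append(ALPHABET[lat_steps % 20])
        lat_steps //= 20
        lng_steps //= 20

    # Reversing yields latitude digit, then longitude digit, for each pair.
    code = "".join(reversed(digits))
    return code[:8] + "+" + code[8:]


print(encode_plus_code(47.365590, 8.524997))  # 8FVC9G8F+6X
```

Each pair of characters narrows the cell by a factor of 20 in both directions, so two points in the same roughly 14 m cell get the same code.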

view this post on Zulip Jan Range (Apr 10 2025 at 14:46):

https://en.m.wikipedia.org/wiki/Open_Location_Code

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 14:46):

nope, this is new to me

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 14:46):

Maybe we should ask about it in #geospatial :big_smile:

view this post on Zulip Jan Range (Apr 10 2025 at 14:52):

Here is a converter:

https://www.dcode.fr/open-location-code

view this post on Zulip Philip Durbin 🚀 (Apr 10 2025 at 14:59):

Yeah, let's move this to #geospatial > Open Location Code

view this post on Zulip Philip Durbin 🚀 (Apr 11 2025 at 14:02):

To summarize, Google Plus Codes might be problematic as a dvHubId because an org might host multiple installations of Dataverse (like DANS does). Also, while it's a remote possibility, an org could move physically.

view this post on Zulip Philip Durbin 🚀 (Apr 11 2025 at 14:03):

@Jan Range looking through old posts about OpenAPI, I realized I'd forgotten you already did some analysis: #python > Analysis of OpenAPI code generators :sweat_smile:


Last updated: Nov 01 2025 at 14:11 UTC