Stream: dev

Topic: JSON Schema for datasets


Philip Durbin 🚀 (Nov 29 2023 at 11:46):

I've been reviewing this pull request: JSON Schema creator and validator #10109

I didn't write the code but I'm happy to discuss and answer any questions.

Philip Durbin 🚀 (Nov 29 2023 at 11:47):

Preliminary docs are here: https://dataverse-guide--10109.org.readthedocs.build/en/10109/api/native-api.html#retrieve-a-dataset-json-schema-for-a-collection

Philip Durbin 🚀 (Nov 29 2023 at 11:47):

In short, you can ask a specific collection for a JSON Schema for creating a dataset within it.

Philip Durbin 🚀 (Nov 29 2023 at 11:48):

The idea is that some collections require additional fields.

Philip Durbin 🚀 (Nov 29 2023 at 11:48):

And those fields are reflected in the JSON Schema.
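
A minimal sketch of the request described above, in Python with only the standard library. It assumes the endpoint path `/api/dataverses/{alias}/datasetSchema` and the `X-Dataverse-key` header as described in the preliminary docs; `dataset_schema_url` and `fetch_dataset_schema` are hypothetical helper names, and the base URL, alias, and token are placeholders.

```python
import json
import urllib.request
from urllib.parse import quote


def dataset_schema_url(base_url: str, collection_alias: str) -> str:
    """Build the URL of a collection's dataset JSON Schema.

    The path is an assumption taken from the preliminary docs.
    """
    alias = quote(collection_alias, safe="")
    return f"{base_url.rstrip('/')}/api/dataverses/{alias}/datasetSchema"


def fetch_dataset_schema(base_url: str, collection_alias: str, api_token: str) -> dict:
    """Ask a collection for the JSON Schema used to create datasets in it."""
    request = urllib.request.Request(
        dataset_schema_url(base_url, collection_alias),
        headers={"X-Dataverse-key": api_token},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Against a local dev instance this might be called as `fetch_dataset_schema("http://localhost:8080", "root", token)`; collections that require additional fields should then return a schema listing those fields.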

Juan Pablo Tosca Villanueva (Nov 29 2023 at 14:05):

I accidentally marked this as resolved, and now it is back to its original state :rolling_on_the_floor_laughing:. I am doing QA and trying to understand the issue/PR, so I am sure I will ask around here for more information.

Philip Durbin 🚀 (Nov 29 2023 at 14:07):

Sure! I started this thread initially because @Jan Range and I were talking about the JSON Schema stuff in the new pyDataverse revamp doc at #python > PyDataverse Re-Vamp but everyone is absolutely welcome!

Jan Range (Nov 29 2023 at 16:43):

That's awesome and solving a couple of issues for pyDataverse/EasyDataverse! Is there a way to test this functionality already?

Philip Durbin 🚀 (Nov 29 2023 at 16:44):

ghcr.io/gdcc/dataverse:9464-schema-creator-validator I guess. :smile:

Philip Durbin 🚀 (Nov 29 2023 at 16:45):

You want the code running on your laptop or a server?

Jan Range (Nov 29 2023 at 17:00):

Either way works fine for me :blush:

Philip Durbin 🚀 (Nov 29 2023 at 18:37):

Do you have Java and Maven installed? If so, switch to the 9464-schema-creator-validator branch and run the quickstart: https://guides.dataverse.org/en/6.0/developers/dev-environment.html#quickstart

Jan Range (Nov 30 2023 at 09:14):

Smooth! Installation and endpoint working flawlessly :raised_hands:

Philip Durbin 🚀 (Nov 30 2023 at 12:34):

@Jan Range fantastic! Is the JSON Schema more or less what you expect? Can you work with it?

Jan Range (Nov 30 2023 at 12:50):

It looks great so far! I will experiment with it and see how to plug it into EasyDataverse.

Jan Range (Nov 30 2023 at 12:50):

One missing thing I found is that controlled vocabularies are not included. Afaik the subject field is a controlled vocab, and maybe this could be added as an enum?
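
A minimal sketch of what adding a controlled vocabulary as an enum could look like. The subject values are an illustrative subset, `add_enum` and `matches` are hypothetical helpers, and `matches` only understands `type: string` and `enum`, so it is not a full JSON Schema validator.

```python
def add_enum(schema_fragment: dict, allowed_values: list) -> dict:
    """Return a copy of a schema fragment restricted to the allowed values."""
    restricted = dict(schema_fragment)
    restricted["enum"] = list(allowed_values)
    return restricted


def matches(schema_fragment: dict, value) -> bool:
    """Check a value against 'type: string' and 'enum' only."""
    if schema_fragment.get("type") == "string" and not isinstance(value, str):
        return False
    enum = schema_fragment.get("enum")
    return enum is None or value in enum


# Illustrative subset of subject terms, not the full vocabulary.
subject_item = add_enum({"type": "string"}, ["Physics", "Chemistry", "Other"])
```

With the enum in place, a client can reject an unknown subject before sending the payload instead of waiting for the server to complain.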

Jan Range (Nov 30 2023 at 12:52):

Happy to comment on the PR about this if I am not missing something.

Philip Durbin 🚀 (Nov 30 2023 at 13:00):

Yes, please comment on the PR. Thanks!

Juan Pablo Tosca Villanueva (Nov 30 2023 at 15:53):

Thanks for that, @Jan Range. I will bring it up during standup today. :smile:

Philip Durbin 🚀 (Nov 30 2023 at 19:53):

Hey, this PR also addresses this old issue: Query Dataverse for mandatory metadata fields via API #6978

Johannes D (Dec 01 2023 at 09:04):

Is it possible to also add information about data types (int, float) to the schema?

Jan Range (Dec 01 2023 at 10:58):

If I am correct, the new schemas are meant for the payload input to endpoints that add/update metadata (see Example 1). These do not contain the type information that is shipped with the basic metadatablock schemas.

Also, AFAIK, the endpoints expect a string (or an array of strings) for the value property when the field is a primitive. This is also part of the schema for a field (see Example 2), so I expect the types cannot be added in the typical way. Hence, the payload has no type enforcement per se, and types are handled on Dataverse's side. Please correct me if I am wrong, @Philip Durbin.

Example 1

Example 2
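
To illustrate the point about string values: a minimal client-side sketch that checks whether a string value would parse as the type its field declares. The type names `INT`, `FLOAT`, and `TEXT` are assumed labels for this sketch, and `value_matches_field_type` is a hypothetical helper, not part of Dataverse or pyDataverse.

```python
def value_matches_field_type(value, field_type: str) -> bool:
    """Check that a string payload value parses as the declared field type.

    Only INT and FLOAT are checked; every other type accepts any string.
    """
    if not isinstance(value, str):
        # The payload carries primitives as strings, so anything else fails.
        return False
    try:
        if field_type == "INT":
            int(value)
        elif field_type == "FLOAT":
            float(value)
    except ValueError:
        return False
    return True
```

A check like this has to live outside the payload schema, because the schema itself only sees strings.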

Jan Range (Dec 01 2023 at 11:06):

Now that both schemas (basic and collection-specific) are at hand, one could condense them into an intermediate schema that complies with the collection requirements and the types expected by the metadata block. That's essentially what EasyDataverse is doing.

Here is an example of a JSON schema for Citation generated by EasyDataverse. @Johannes D would this be useful for you?
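
A rough sketch of the condensing idea, not EasyDataverse's actual implementation. It assumes the field types have already been extracted from the metadata block into a plain dict and the required field names from the collection schema into a list; `JSON_TYPES`, `build_intermediate_schema`, and the field name `exampleFloatField` are hypothetical.

```python
# Assumed mapping from metadata block field types to JSON Schema types.
JSON_TYPES = {"INT": "integer", "FLOAT": "number", "TEXT": "string"}


def build_intermediate_schema(field_types: dict, required: list) -> dict:
    """Combine per-field types with the collection's required fields."""
    properties = {
        name: {"type": JSON_TYPES.get(field_type, "string")}
        for name, field_type in field_types.items()
    }
    return {
        "type": "object",
        "properties": properties,
        # Keep only required names that the metadata block actually defines.
        "required": [name for name in required if name in properties],
    }


schema = build_intermediate_schema(
    {"title": "TEXT", "exampleFloatField": "FLOAT"},
    ["title", "exampleFloatField"],
)
```

The result is a flat schema that carries both the collection's requirements and the types, which is the shape form generators tend to expect.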

Philip Durbin 🚀 (Dec 01 2023 at 12:22):

@Johannes D thanks for leaving a comment on the PR. That's perfect. I just replied there.

Philip Durbin 🚀 (Dec 01 2023 at 12:23):

@Jan Range I think I'm confused by how you're saying old and new. To me there's only one schema but I'm sure I'm simply misunderstanding what you're saying. :sweat_smile:

Jan Range (Dec 01 2023 at 12:29):

Sorry, I should have phrased that better. End of the week and my brain goes :dizzy:

By "old" I mean the basic block schema (such as this), and by "new" the novel collection JSON schema.

Philip Durbin 🚀 (Dec 01 2023 at 12:38):

Oh, that makes much more sense. Thanks.

Johannes D (Dec 01 2023 at 12:42):

Kind of. The old format allows specifying a fieldType, and I'd like to have that integrated into the new schema. One use case would be the SPA, which uses something like this (https://rjsf-team.github.io/react-jsonschema-form/) to auto-generate a form based on the schema. Here the field type is needed to create nice forms for numbers or dates...

Jan Range (Dec 01 2023 at 13:23):

Thanks for the explanation @Johannes D :blush: I am unsure if it is possible to include fieldType in the novel JSON schema. The schema validates whether a typeName is given in a payload to check compliance with the collection and utilizes a generic field schema.

To verify this, I checked the JSON schema for a collection that uses the astrophysics config and requires a float field. Unfortunately, the schema does not include any type checks. You only receive a validation error upon sending the payload.

Collection schema

Philip Durbin 🚀 (Dec 01 2023 at 13:26):

@Johannes D thanks! I just copied your comment about that React tool over at https://github.com/IQSS/dataverse-frontend/issues/231#issuecomment-1836116092 (we are actively building forms in React now)

Jan Range (Dec 01 2023 at 13:26):

However, given the collection and metadatablock schema, it is possible to create a new schema from both. @Philip Durbin I think receiving a schema such as this one would be awesome since it checks on types too and you can plug it easily into other plugins such as the Form Creator.

Johannes D (Dec 01 2023 at 13:36):

@Jan Range The lack of further distinction for the primitive values (int, float, boolean, date) is the problem, and I hope we can fix that with the new schema representation. IMHO the new schema would be a perfect start for v2 of the API.

Johannes D (Dec 01 2023 at 13:40):

Philip Durbin said:

Johannes D thanks! I just copied your comment about that React tool over at https://github.com/IQSS/dataverse-frontend/issues/231#issuecomment-1836116092 (we are actively building forms in React now)

Thanks. Before one can use the tooling, we need to translate the rather complex Dataverse JSON representation into a more readable, more intuitive JSON representation, basically into what Jan suggested. Otherwise the form mirrors the complex internal data structure, which is something a normal user should not see.

Johannes D (Dec 01 2023 at 13:41):

Actually, that's one reason why we have a Python facade between our React SPA and Dataverse.

Philip Durbin 🚀 (Dec 01 2023 at 13:41):

right, v2 territory

Philip Durbin 🚀 (Dec 01 2023 at 13:41):

well, maybe we could implement a facade in js-dataverse

Johannes D (Dec 01 2023 at 13:46):

I'd rather see that in the backend, as other non-JS clients would also benefit from it. I foresee two tasks in the backend: creating collection-specific simple schemas that include all needed information, and transforming JSON in that specific schema to the DB model and vice versa.
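
A minimal sketch of the second task, translating between a simple JSON object and the field-list representation. It handles primitive fields only and assumes the `typeName` / `typeClass` / `multiple` / `value` keys of the dataset JSON; `to_dataverse_fields` and `from_dataverse_fields` are hypothetical names, and compound and controlled vocabulary fields are left out.

```python
def to_dataverse_fields(simple: dict) -> list:
    """Turn {"title": "..."} into a list of primitive field objects."""
    fields = []
    for name, value in simple.items():
        multiple = isinstance(value, list)
        # Primitive values travel as strings in the payload.
        as_text = [str(v) for v in value] if multiple else str(value)
        fields.append(
            {
                "typeName": name,
                "typeClass": "primitive",
                "multiple": multiple,
                "value": as_text,
            }
        )
    return fields


def from_dataverse_fields(fields: list) -> dict:
    """Flatten primitive field objects back into a simple dict."""
    return {
        field["typeName"]: field["value"]
        for field in fields
        if field.get("typeClass") == "primitive"
    }
```

Note that the round trip is lossy for numbers: `3.5` goes out as `"3.5"` and comes back as a string, which is exactly where the type information from the metadata block would be needed.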

Philip Durbin 🚀 (Dec 01 2023 at 13:47):

Oh, sure. I just meant that until we have a slick v2 API maybe js-dataverse could follow your lead and implement a similar facade.

Philip Durbin 🚀 (Dec 01 2023 at 13:47):

@Jan Range python stuff ^^ :grinning:

Philip Durbin 🚀 (Dec 01 2023 at 14:56):

@Johannes D are you actually using react-jsonschema-form or is it just a dream?

Johannes D (Dec 01 2023 at 15:00):

@Philip Durbin We wanted to use it but our designers and users requested a complex stepper for the input forms. The effort to adapt the lib for the use case was more complex than writing a form by hand, so we are not using it in this project. In a different project I used the library and was happy with the lib:)

Philip Durbin 🚀 (Dec 01 2023 at 17:16):

Very interesting. Thanks.

Philip Durbin 🚀 (Jun 14 2024 at 21:06):

How do folks feel about phase two of "JSON Schema for dataset"? Is #10543 what you expected? I just left a comment but maybe I'm confused: https://github.com/IQSS/dataverse/pull/10543#pullrequestreview-2119147899

Philip Durbin 🚀 (Jul 01 2024 at 20:21):

@Oliver Bertuch you were part of the discussion early on and created this issue: bklog: Deliverable - As a system integrator, I would appreciate a JSON Schema for validating my dataset JSON before uploading via API - https://github.com/IQSS/dataverse-pm/issues/26

Any thoughts on my comment above?

Philip Durbin 🚀 (Jul 03 2024 at 14:33):

@Jan Range this is the thread I just mentioned on the pyDataverse call.

Philip Durbin 🚀 (Jul 11 2024 at 12:06):

@Jan Range @Oliver Bertuch have you had a chance to think about https://github.com/IQSS/dataverse/pull/10543 ?

It came up again in sprint planning yesterday.

I think we all agree that the PR should add value (more detailed error messages). However, it doesn't improve the JSON Schema we offer for datasets. Does that matter? Is that what you want?

Jan Range (Jul 11 2024 at 12:25):

To me, it looks great already! I have two small points that could be beneficial for integration into external tools/libs:

The message is good, but it is currently limited to a human-readable format. Adding a JSONPath or any other path that displays the exact location would allow other libraries to do more with the validation result. Furthermore, adding different error types could help. For instance, if a type validation fails, this could be indicated.

If I could imagine a response example it would look something like this:

Paths are not accurate

{
  "is_valid": false,
  "errors": [
    {
      "location": "citation/fields/0/value",
      "error_type": "required",
      "message": "The title field is required."
    },
    {
      "location": "citation/fields/1/value",
      "error_type": "invalid",
      "message": "The description must be a string."
    }
  ]
}

Is it possible to derive such a format from your validator? I know in Python and Rust it is possible, but I am a Java Noob :grinning:
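
For comparison, a minimal Python sketch of how a response in that shape could be assembled. It is a hand-rolled check, not the validator from the PR; `validate_fields`, its arguments, and the path format are assumptions made for illustration.

```python
def validate_fields(fields: list, required: list, expected_types: dict) -> dict:
    """Validate field objects and report machine-readable errors.

    expected_types maps a typeName to the Python type its value must have.
    """
    errors = []
    present = {field.get("typeName") for field in fields}
    for name in required:
        if name not in present:
            errors.append(
                {
                    "location": "fields",
                    "error_type": "required",
                    "message": f"The {name} field is required.",
                }
            )
    for index, field in enumerate(fields):
        name = field.get("typeName")
        expected = expected_types.get(name)
        if expected is not None and not isinstance(field.get("value"), expected):
            errors.append(
                {
                    "location": f"fields/{index}/value",
                    "error_type": "invalid",
                    "message": f"The {name} value must be of type {expected.__name__}.",
                }
            )
    return {"is_valid": not errors, "errors": errors}
```

Because each error carries a location and an error type, a client library can map it back to a form field or a model attribute without parsing the message text.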

Philip Durbin 🚀 (Jul 11 2024 at 12:37):

Great, thanks. @Jan Range would you mind copying and pasting your comment into the PR or linking here?

Jan Range (Jul 11 2024 at 12:55):

Done :smile:

Philip Durbin 🚀 (Aug 20 2024 at 18:16):

#10543 has been merged


Last updated: Nov 01 2025 at 14:11 UTC