harvest and use controlled vocabularies in DV · community

Stream: community

Topic: harvest and use controlled vocabularies in DV

Oliver Bertuch (Jul 16 2024 at 08:29):

I have been talking with the folks from https://join2.de today. (My library is part of the consortium.) They are currently transitioning from Invenio to MyCoRe. One of the things I am very interested in for our institutional repository, Jülich DATA, is using their controlled vocabularies in Dataverse. I could of course write some JavaScript that reads their REST API.

But I'm wondering if it wouldn't make more sense (also with the SPA on the horizon, where we will still need server-side validation of data) for Dataverse itself, or some microservice that the Dataverse backend knows how to talk to, to harvest these vocabularies (e.g. using OAI-PMH). It could then offer them via the DV REST API / JSON Schema to some client script and via JSF as options, and use them to validate datasets coming in via the API.

Harvesting controlled vocabularies makes a lot of sense from an architectural and availability point of view. What do y'all think @Slava Tykhonov @Philip Durbin @Julian Gautier @Philipp Conzett ?

Oliver Bertuch (Jul 16 2024 at 08:35):

(This might be connected to the idea of custom validators for fields, see #dev > metadata validators per field )

Slava Tykhonov (Jul 16 2024 at 08:41):

Hi Oliver, how is it different from https://zenodo.org/records/8133723?

Slava Tykhonov (Jul 16 2024 at 08:46):

scripts and docs here https://github.com/gdcc/dataverse-external-vocab-support

Oliver Bertuch (Jul 16 2024 at 08:46):

To quote from the paper:

This mechanism does not currently take advantage of the configuration mechanism, data-* attributes, or caching of our external vocabulary support mechanism which makes it harder to see how they could be shared across repositories.

Oliver Bertuch (Jul 16 2024 at 08:50):

What I am envisioning is kind of going to that place. If we could harvest the controlled vocabularies from external sources, we get this caching in place. A Dataverse installation would be more independent from the vocabulary provider, as it keeps a _synchronized_ copy. OAI-PMH has been used for such harvesting for a very long time and would come in handy here.

Also, the JavaScript solution is great for the UI part. But it doesn't yet allow any server-side validation of the content, and it cannot be used from API clients that do not run JavaScript.
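
A minimal sketch of what validation against a synchronized local copy could look like, written in Python for brevity. The `VocabularyCache` class, its `sync` and `validate` methods, and the example.org term URIs are all hypothetical; none of this is an existing Dataverse API.

```python
from dataclasses import dataclass, field


@dataclass
class VocabularyCache:
    """Hypothetical in-memory stand-in for a synchronized local vocabulary copy."""

    terms: dict[str, str] = field(default_factory=dict)  # term URI -> label

    def sync(self, harvested: dict[str, str]) -> None:
        """Replace the local copy with the latest harvested snapshot."""
        self.terms = dict(harvested)

    def validate(self, values: list[str]) -> list[str]:
        """Return the submitted values that are not known term URIs."""
        return [v for v in values if v not in self.terms]


cache = VocabularyCache()
# Assumption: a harvester has already produced this URI -> label mapping.
cache.sync({
    "https://example.org/vocab/physics": "Physics",
    "https://example.org/vocab/chemistry": "Chemistry",
})

# Values as they might arrive in a dataset submitted via API.
rejected = cache.validate([
    "https://example.org/vocab/physics",
    "https://example.org/vocab/alchemy",
])
print(rejected)
```

Because the check runs against the local copy, it would work the same way for the UI and for API clients, and it would not depend on the vocabulary provider being reachable at deposit time.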

Oliver Bertuch (Jul 16 2024 at 08:55):

Come to think of it: if we have a cache, we can also expose it via OAI-PMH, so others can harvest the vocabularies again...

Oliver Bertuch (Jul 16 2024 at 09:05):

We could even go a step further and change the data model of our controlled vocabularies to implement them in a more SKOS-like manner. Exposing those again as SKOS via OAI-PMH could at least partially solve the point you mentioned in the paper about "how they could be shared across repositories".
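
To make "SKOS-like" a bit more concrete, here is a rough Python sketch of a concept record that carries a URI, per-language preferred labels, and a `skos:broader` link. The `Concept` class, the `broader_chain` helper, and the example.org URIs are hypothetical illustrations, not Dataverse's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    uri: str
    pref_label: dict[str, str]       # language tag -> skos:prefLabel
    broader: Optional[str] = None    # URI of the skos:broader concept, if any
    alt_labels: list[str] = field(default_factory=list)


def broader_chain(concepts: dict[str, Concept], uri: str) -> list[str]:
    """Walk skos:broader links upward from a concept; stops on a cycle."""
    chain: list[str] = []
    current = concepts[uri].broader
    while current is not None and current not in chain:
        chain.append(current)
        current = concepts[current].broader
    return chain


base = "https://example.org/vocab/"
concepts = {
    base + "science": Concept(base + "science", {"en": "Science"}),
    base + "physics": Concept(base + "physics", {"en": "Physics", "de": "Physik"},
                              broader=base + "science"),
    base + "optics": Concept(base + "optics", {"en": "Optics"},
                             broader=base + "physics"),
}

print(broader_chain(concepts, base + "optics"))
```

Compared with a flat list of allowed strings, a model like this keeps identifiers, translations, and hierarchy together, which is the information another repository would need when harvesting the vocabulary again.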

Slava Tykhonov (Jul 16 2024 at 09:25):

How are they hosting their vocabularies now? If they are in SKOS, you can upload them directly to Skosmos (https://skosmos.org) and get the connection to Dataverse working.

Slava Tykhonov (Jul 16 2024 at 09:27):

We've implemented a "cache" from Dataverse in Jena Fuseki, which is a component of the Skosmos platform.

Slava Tykhonov (Jul 16 2024 at 09:28):

And forget about OAI-PMH, it's not suitable for controlled vocabularies. The OAI-ORE export has it all.

Oliver Bertuch (Jul 16 2024 at 10:06):

OAI-PMH would only be used as the transport / sync protocol. The payload can be OAI-ORE, serialized as RDF/XML.
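
As a rough illustration of the "transport plus payload" split, the Python sketch below pulls vocabulary terms out of an OAI-PMH `ListRecords` response whose `<metadata>` element carries RDF/XML. The sample response is hand-written and abridged (a real response also has `responseDate` and `request` elements), and it assumes the payload contains plain `skos:Concept` entries; the identifiers and example.org URIs are made up.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Hand-written, abridged sample; not output from MyCoRe or Dataverse.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:vocab/physics</identifier>
        <datestamp>2024-07-16</datestamp>
      </header>
      <metadata>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:skos="http://www.w3.org/2004/02/skos/core#">
          <skos:Concept rdf:about="https://example.org/vocab/physics">
            <skos:prefLabel xml:lang="en">Physics</skos:prefLabel>
          </skos:Concept>
        </rdf:RDF>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:vocab/chemistry</identifier>
        <datestamp>2024-07-16</datestamp>
      </header>
      <metadata>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:skos="http://www.w3.org/2004/02/skos/core#">
          <skos:Concept rdf:about="https://example.org/vocab/chemistry">
            <skos:prefLabel xml:lang="en">Chemistry</skos:prefLabel>
          </skos:Concept>
        </rdf:RDF>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_concepts(oai_xml: str) -> dict[str, str]:
    """Map each skos:Concept URI in the response to its skos:prefLabel."""
    root = ET.fromstring(oai_xml)
    about = f"{{{NS['rdf']}}}about"  # fully qualified rdf:about attribute name
    concepts: dict[str, str] = {}
    for record in root.iterfind(".//oai:record", NS):
        for concept in record.iterfind(".//skos:Concept", NS):
            uri = concept.get(about)
            label = concept.findtext("skos:prefLabel", default="", namespaces=NS)
            if uri:
                concepts[uri] = label
    return concepts


harvested = extract_concepts(SAMPLE_RESPONSE)
print(harvested)
```

The OAI-PMH envelope only contributes the record identifiers and datestamps used for syncing; everything about the terms themselves comes from the RDF payload, so the same parsing step would apply if the payload were fetched some other way.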

Oliver Bertuch (Jul 16 2024 at 10:08):

Currently, MyCoRe exposes "classifications" as a custom XML thing. They are experimenting with exposing it as SKOS though. https://cmswiki.rrz.uni-hamburg.de/hummel/MyCoRe/Organisation/AnwenderWorkshop2022?action=AttachFile&do=view&target=221109_MyCoRe-ObjectListing_SKOS.pdf

Slava Tykhonov (Jul 16 2024 at 11:07):

If you can get vocabs in SKOS, the integration is pretty straightforward like we designed it.

Philip Durbin 🚀 (Jul 16 2024 at 13:40):

It sounds like @Oliver Bertuch wants to pull down and sync the vocab values locally.

@Slava Tykhonov is saying he already implemented a cache. Is that enough? A cache?

It sounds like Oliver wants a local service as well.

Julian Gautier (Jul 16 2024 at 15:35):

The technical details here go over my head, but not having to pull data from an API every time a depositor uses a metadata field could be really helpful. When the external vocabulary support mechanism was used to suggest names from a Crossref API, we saw some performance-related issues that might make it tough for depositors to use those metadata fields, and we talked about how maintaining "local" copies of what's in that API might help.

Philip Durbin 🚀 (Jul 16 2024 at 15:43):

Yes, exactly, it should be a performance win.

Julian Gautier (Jul 16 2024 at 16:11):

Awesome, yeah. In the UX WG's plans to usability test a redesign of the Citation metadata block that uses the external vocabulary support mechanism, I mention that moderators should look out for problems that might be caused by these performance-related challenges.

Slava Tykhonov (Jul 16 2024 at 16:49):

Don't forget about ontologies like Dublin Core during redesign.

Julian Gautier (Jul 16 2024 at 16:59):

@Slava Tykhonov could you write more about what that could mean? For example, would this involve thinking about how what's entered in the deposit form is included in the Dublin Core metadata that Dataverse repositories export?

Julian Gautier (Aug 27 2024 at 18:23):

Hi @Slava Tykhonov. The UX WG has been getting more in-depth about the redesign and we haven't discussed Dublin Core in any capacity, including how deposit metadata is imported into and exported out of repositories that use Dataverse. We have discussed the DataCite schema.

But assuming that metadata mapping to Dublin Core, import/export and controlled vocabularies is generally what you were thinking of last month, I've thought there are more appropriate standards for sharing metadata that includes values from controlled vocabularies, and that a lot of details about controlled vocabulary terms used to describe deposits will be lost when Dublin Core is used.

Let me know what you think or if you were thinking about something else. :)

Slava Tykhonov (Sep 02 2024 at 12:12):

Hi Julian, sorry for the late reaction. It would be very interesting if Dataverse supported DCAT and Croissant alongside Dublin Core for the import/export of metadata.

Julian Gautier (Sep 05 2024 at 12:52):

Ah, @Sonia Barbosa asked about interest in DCAT, too, in another Zulip thread at https://dataverse.zulipchat.com/#narrow/stream/375707-community/topic/Support.20for.20Importing.20and.20Exporting.20DCAT.20Metadata

Last updated: Nov 01 2025 at 14:11 UTC