In the current "specification" of our metadata fields, the TSV format expresses whether a field is "required", where it displays relative to others, and whether it is shown on dataset creation, all as attributes of the field.
But is that truly an "attribute"? From my point of view it is something the surrounding block imposes on the field! The field doesn't care about being required, where it sits, or whether it's important to display during creation.
It must care about how it is displayed (format), whether it has a CV, what its parent or children are, and whether it can be used in facets/advanced searches, as this influences how the field itself must behave (Solr indexing, etc.).
So in a future data model of these things (not a file format, just an in-memory Java class model), they should rather be defined on the block, not the field, right?
Change my mind @Julian Gautier @Philip Durbin 🚀 @Leo Andreev
(This is docs related as the current docs page is a mix of specification and file format. A future specification of a data model should be format agnostic and define a reliable data contract.)
For the logic around displayOnCreate, the TSV only tells part of the story. There's a good amount of logic and configuration available in the backend. See my review of display on create starting on slide 12 of this talk I gave at #community > #Dataverse2025 - https://docs.google.com/presentation/d/18tPj-t_v-5amXiaGoxFm-3LTm-reBKL3d303oydLURY/edit?usp=sharing
Yeah I thought of that - you can do boatloads of stuff these days with the fields, so this is not really part of what a field "is", just what you "do" with it, right?
Hey @Oliver Bertuch this is far from my expertise so I'll defer to you and others. It sounded from other conversations like @Balázs Pataki would be helpful too :slight_smile:
But you're the IQSS metadata expert, right?
Technically, this POJO model in Java is as simple as it can be
It should be the simplest and cleanest transformation between a written spec and the technical implementation
Oh I see! :see_no_evil: I always thought of you as one, since others seemed to refer to you that way... :see_no_evil: My bad!
Who'd be the right person to ask? I'd prefer human experience over any LLM :wink:
No worries :slight_smile: I think input from Phil, Leonid and maybe Balázs would help
And maybe Jim? Outside of Zulip of course
Do we have any domain modelling experts around? :smile:
Where are they when you need 'em? :wink:
Are you thinking along these lines?
Attributes:
- name
- title
- description
- watermark
- fieldType
- advancedSearchField
- allowControlledVocabulary
- allowmultiples
- facetable
- parent
- metadatablock_id
Non-attributes:
- displayOrder
- displayFormat
- displayoncreate
- required
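To make that split concrete, here's a minimal Java sketch (all class and method names are hypothetical, not the actual Dataverse POJOs): the field carries only its intrinsic attributes, while the block wraps it in a "usage" object that adds the contextual ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only -- not the real Dataverse classes.

// Intrinsic attributes: things the field must know about itself.
class FieldDefinition {
    final String name;
    final String fieldType;
    final boolean allowControlledVocabulary;
    final boolean facetable;

    FieldDefinition(String name, String fieldType,
                    boolean allowControlledVocabulary, boolean facetable) {
        this.name = name;
        this.fieldType = fieldType;
        this.allowControlledVocabulary = allowControlledVocabulary;
        this.facetable = facetable;
    }
}

// Contextual attributes: things the surrounding block imposes on the field.
class FieldUsage {
    final FieldDefinition field;
    final boolean required;
    final int displayOrder;
    final boolean displayOnCreate;

    FieldUsage(FieldDefinition field, boolean required,
               int displayOrder, boolean displayOnCreate) {
        this.field = field;
        this.required = required;
        this.displayOrder = displayOrder;
        this.displayOnCreate = displayOnCreate;
    }
}

// A block owns usages, not bare fields.
class MetadataBlock {
    final String name;
    final List<FieldUsage> usages = new ArrayList<>();

    MetadataBlock(String name) { this.name = name; }

    void use(FieldDefinition f, boolean required, int order, boolean onCreate) {
        usages.add(new FieldUsage(f, required, order, onCreate));
    }
}
```

With this shape, the same FieldDefinition could be required in one block and optional in another, without the field itself knowing anything about it.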
Maybe even more radical:
Links, provided by model building:
Attributes:
DataTypes / Schema Context (part of Metadata Block):
UI Context (Metadata Block):
I think that makes sense.
Again, we must shake off the mental model the TSV format has burned into us (and in part also how the database entities look)
Separating UI-related constraints from data constraints would be ideal.
Regarding "required", JSON Schema handles this in a similar way: it’s defined at the object level, rather than as an inherent attribute of each property.
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  },
  "required": ["name"]
}
Oooh, I like that. Let's start with a JSON Schema and backward-engineer it. :smile:
Actually, this is my mental model :-D
Another thing we did during the CEDAR integration is that we made it possible for a field to be part of multiple MDBs. This may be worth considering as well if we want to remodel MDBs.
For example, there's the field Frequency in "Social Science and Humanities Metadata". Assume its name is "frequency" and it is a number. What if a collection doesn't use this MDB, but wants its own custom MDB with a numeric "frequency" field? In stock Dataverse this is not possible: one field can belong to only one MDB and you cannot redefine it.
Of course this is also tightly coupled with Solr indexing and Solr field names, so it is a tricky thing both in the current data model and on the Solr end. We worked around it, but it is not nice. :face_with_peeking_eye:
a field can be part of multiple MDB-s
Any change that would improve support for including a metadata field in any grouping of metadata fields (or what we've called a metadata block) seems very worthwhile!
Just out of curiosity, here’s how CEDAR structures its model.
It has 3 types of objects: Templates, Elements, and Fields.
Fields and Elements are fully reusable: a Field can appear in multiple Elements, and an Element can be nested in multiple other Elements. Templates, however, act as "root" objects and cannot be nested within other objects.
Each Field, Element, and Template includes metadata, data constraints, and UI specification. Much like what you’d expect from JSON Schema. Because it is JSON Schema. :slight_smile:
Requirement and multiplicity are defined at the point where an object is used. For example, when a Field is included in an Element, the Element determines whether that Field is required and what its multiplicity is.
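Schematically, the idea looks something like this (to be clear, this is not CEDAR's actual JSON representation, just an illustration of where requiredness and multiplicity live):

{
  "element": "SamplingInfo",
  "contains": [
    { "field": "frequency", "required": true, "multiplicity": "1..n" },
    { "field": "sampleSize", "required": false, "multiplicity": "0..1" }
  ]
}

The Field definitions themselves say nothing about being required; the containing Element decides that at the point of use.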
In Dataverse at the moment a field (or field type) identity is effectively determined by its name. For example, title, author, subject, etc. Although fields have internal database identifiers, they are mapped one-to-one to Solr fields, which means the field name already functions as a unique identifier in practice.
This approach has an important implication: there can be only one field named frequency. Even if multiple MDBs share the same field definition, they must also share the same semantics. However, this is often too restrictive. One MDB might want frequency to be an enumerated list (e.g., Day, Week, Month), while another might want it to be a numeric value.
To address this limitation, our implementation binds fields to the URIs that are already associated with fields and uses those URIs as their effective identifiers. We assume that if two fields share the same URI, they are semantically and syntactically the same. These URIs are also used as the corresponding field names in Solr (after some transformation, of course, because field names cannot include "/", etc.).
At the same time, we decouple field names from field identity. This allows:
different field names to point to the same URI (same semantics), and
the same field name to point to different URIs (different semantics).
This makes the system more flexible while preserving semantic clarity.
For example:
MDB1 (where frequency is first defined)
Name: frequency
URI: http://dataverse.org/fields/frequency
MDB2 (frequency reused with the same meaning)
Name: frequency
URI: http://dataverse.org/fields/frequency
MDB3 (frequency with different semantics)
Name: frequency
URI: http://dataverse.org/fields/frequencyByDayWeekMonth
MDB4 (different name, same semantics as MDB3)
Name: frequencyInterval
URI: http://dataverse.org/fields/frequencyByDayWeekMonth
Making URIs (or, to be irritatingly precise, IRIs) the effective identifiers also improves interoperability with other linked-data oriented representations like RO-Crate.
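The URI-to-Solr-name transformation mentioned above could be sketched like this in Java (the exact replacement rules here are made up, not the CEDAR-integration code; Solr field names conventionally stick to letters, digits, and underscores):

```java
// Hypothetical transformation -- the real implementation may differ.
class SolrFieldNames {
    // Turn a field URI into a Solr-safe field name by collapsing
    // every run of characters outside [A-Za-z0-9] into a single
    // underscore, then trimming leading/trailing underscores.
    static String fromUri(String uri) {
        return uri.replaceAll("[^A-Za-z0-9]+", "_")
                  .replaceAll("^_+|_+$", "");
    }
}
```

So two fields named differently but bound to the same URI would land in the same Solr field, while two fields with the same name but different URIs would be indexed separately.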
The mental model currently forming in my head:
Gotchas:
I agree @Balázs Pataki that the same field name may have different semantic meanings depending on context. As outlined above, embedding this context is hard given how the core works at the moment.
Especially since people use field names in the search function, they would also have to add the context there... That's not an easy thing to figure out, especially from the UX side.
So maybe for the time being, we have to live with this shortcoming :relieved:
yes, I know changes that deep have many implications. I’m not pushing for them, just explaining our approach. It isn’t in use today, but we built it to support future use cases.
Or actually, we had one use case already with the "keyword" field.
Also @Balázs Pataki I'm not sure that the currwnt database SQL model is the limitation here. IIRC the name of a dataset is not part of the DB keys. AFAIK the technical limitation is in the search index! (That's good news...)
No, actucally I think datasetfieldtype has a metadatablock_id relation which is unique.
We definitely could not add the same field to multiple MDB-s. Maybe this changed since 6.1?
Or maybe you can add but it won't appear in all of those MDB-s? :thinking: Anyway, there were some issues with them I cannot remember now without looking at the code.
Oh. Yeah. True, there is still the implication of 1:1 MDB-DF. (see above) But that's not the same as making the name a part of it!
The name limitation mostly comes from the Sold schema... We can't have cardinal fields there. And unless we work around that somehow (multiple cores per context / field naming policy) might be hard to overcome without switching tech. Both ways: big fat project...
We're talking about metadata field at the dataset level, but what about the file and variable level? See https://github.com/IQSS/dataverse-pm/issues/112 and its design doc.
Rather than using our existing dataset-level model, perhaps we could design something better for datasets, files, and variables. And eventually migrate the current dataset-level stuff to the new and improved model.
That is to say, we could design something new without disturbing the existing infrastructure.
Oh, yeah, file metadata! At the “mental model” level, it should be the same, isn’t it? Even at implementation level. That would be ideal.
I mean, 10 years ago our thought was to just use the same model, the dataset model, for file metadata. But we never got around to it. Probably we should build something better, given what we've learned with our TSV adventure. :sweat_smile:
As you know, this is exactly why we moved fully to RO-Crate: it gives us a single, consistent model for describing Dataset and File metadata while remaining compatible with Dataverse’s existing metadata storage. With RO-Crate we can even use existing Dataverse MDBs to describe files. So, for example, you can simply drop the Geospatial MDB onto a file and instantly add location metadata to it.
Now, the rest is just a far-fetched thought experiment for fun.
Imagine a Dataverse where MDBs are represented purely as JSON Schemas (or similar), and all dataset metadata lives in a single ro-crate-metadata.json file. No separate database representation, just one JSON file per dataset, which can describe the Dataset and Files. That JSON could still be stored in a PG jsonb field for indexing and full-text search, or fed into Elasticsearch/OpenSearch for more advanced discovery. But otherwise, most metadata and schema interactions would become straightforward JSON editing and parsing operations that require no backend roundtrips or database logic and could be handled entirely within the SPA.
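For illustration, a heavily abbreviated (and entirely hypothetical) ro-crate-metadata.json along those lines might look like this, with the dataset and its files described in one @graph:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example dataset",
      "frequency": "Month"
    },
    {
      "@id": "data/survey.csv",
      "@type": "File",
      "contentSize": "12345"
    }
  ]
}

The example field names and values here are invented; the point is just that dataset- and file-level metadata live side by side in one JSON document that could be stored in a jsonb column and indexed directly.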
Interesting. We definitely considered MongoDB a decade ago when we came up with the custom metadata blocks system.
But I like your idea of a Postgres jsonb field. Fewer moving parts. :sweat_smile:
I started writing a doc: https://docs.google.com/document/d/16qxyUejjkcSPb37d9lJlsLnuhLNLTpu5lAsJglugmbw/edit?usp=sharing
As I do have a hard time writing all of this, I wanted to ease my mind by doing some coding work on this. Looks like as I probably will get my hands dirty with exporters for citations and stuff, a better way to transfer the information about which fields actually exist etc would be VERY handy. Here's a start: https://github.com/gdcc/dataverse-spi/blob/core/core/src/main/java/io/gdcc/spi/core/metadata/description/Field.java
Check out #community > JSON Schema to Dataverse TSV converter by @Vera Clemens
Last updated: Apr 03 2026 at 06:08 UTC