In the current "specification" of our metadata fields, the TSV format expresses whether a field is "required", where it displays relative to others, and whether it is shown on dataset creation, all as attributes of the field.
But is that truly an "attribute"? From my point of view it is something the surrounding block imposes on the field! The field doesn't care about being required, where it sits, or whether it's important to display during creation.
It must care about how it is displayed (format), whether it has a CV, what its parent or children are, and whether it can be used in facets/advanced searches, as this influences how the field itself must behave (Solr indexing, etc.).
So in a future data model of these things (not a file format, just an in-memory Java class model), they should rather be defined on the block, not the field, right?
Change my mind @Julian Gautier @Philip Durbin 🚀 @Leo Andreev
(This is docs related as the current docs page is a mix of specification and file format. A future specification of a data model should be format agnostic and define a reliable data contract.)
For the logic around displayOnCreate, the TSV only tells part of the story. There's a good amount of logic and configuration available in the backend. See my review of display on create starting on slide 12 of this talk I gave at #community > #Dataverse2025 - https://docs.google.com/presentation/d/18tPj-t_v-5amXiaGoxFm-3LTm-reBKL3d303oydLURY/edit?usp=sharing
Yeah I thought of that - you can do boatloads of stuff these days with the fields, so this is not really part of what a field "is", just what you "do" with it, right?
Hey @Oliver Bertuch this is far from my expertise so I'll defer to you and others. It sounded from other conversations like @Balázs Pataki would be helpful too :slight_smile:
But you're the IQSS metadata expert, right?
Technically, this POJO model in Java is as simple as it can be
It should be the simplest and cleanest transformation between a written spec and the technical implementation
Oh I see! :see_no_evil: I always thought of you as one, since others seemed to refer to you that way... :see_no_evil: My bad!
Who'd be the right person to ask? I'd prefer human experience over any LLM :wink:
No worries :slight_smile: I think input from Phil, Leonid and maybe Balázs would help
And maybe Jim? Outside of Zulip of course
Do we have any domain modelling experts around? :smile:
Where are they when you need 'em? :wink:
Are you thinking along these lines?
Attributes:
- name
- title
- description
- watermark
- fieldType
- advancedSearchField
- allowControlledVocabulary
- allowmultiples
- facetable
- parent
- metadatablock_id
Non-attributes:
- displayOrder
- displayFormat
- displayoncreate
- required
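To make that split concrete, here's a minimal Java sketch (all class and method names are hypothetical, not the actual Dataverse POJOs): the field carries only its intrinsic attributes, while the block wraps it in a "usage" object that adds the contextual ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only -- not the real Dataverse classes.

// Intrinsic attributes: things the field must know about itself.
class FieldDefinition {
    final String name;
    final String fieldType;
    final boolean allowControlledVocabulary;
    final boolean facetable;

    FieldDefinition(String name, String fieldType,
                    boolean allowControlledVocabulary, boolean facetable) {
        this.name = name;
        this.fieldType = fieldType;
        this.allowControlledVocabulary = allowControlledVocabulary;
        this.facetable = facetable;
    }
}

// Contextual attributes: things the surrounding block imposes on the field.
class FieldUsage {
    final FieldDefinition field;
    final boolean required;
    final int displayOrder;
    final boolean displayOnCreate;

    FieldUsage(FieldDefinition field, boolean required,
               int displayOrder, boolean displayOnCreate) {
        this.field = field;
        this.required = required;
        this.displayOrder = displayOrder;
        this.displayOnCreate = displayOnCreate;
    }
}

// A block owns usages, not bare fields.
class MetadataBlock {
    final String name;
    final List<FieldUsage> usages = new ArrayList<>();

    MetadataBlock(String name) { this.name = name; }

    void use(FieldDefinition f, boolean required, int order, boolean onCreate) {
        usages.add(new FieldUsage(f, required, order, onCreate));
    }
}
```

With this shape, the same FieldDefinition could be required in one block and optional in another, without the field itself knowing anything about it.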
Maybe even more radical:
Links, provided by model building:
Attributes:
DataTypes / Schema Context (part of Metadata Block):
UI Context (Metadata Block):
I think that makes sense.
Again, we must shake off the mental model the TSV format has burned into us (and in part also how the database entities look)
Separating UI-related constraints from data constraints would be ideal.
Regarding "required", JSON Schema handles this in a similar way: it’s defined at the object level, rather than as an inherent attribute of each property.
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  },
  "required": ["name"]
}
Oooh, I like that. Let's start with a JSON Schema and backward-engineer it. :smile:
Actually, this is my mental model :-D
Another thing we did during the CEDAR integration is that we made it possible for a field to be part of multiple MDBs. This may be worth considering as well if we want to remodel MDBs.
For example, there's the field Frequency in "Social Science and Humanities Metadata". Assume its name is "frequency" and it is a number. What if a collection doesn't use this MDB, but wants its own custom MDB with a numeric "frequency" field? In stock Dataverse this is not possible: one field can belong to only one MDB and you cannot redefine it.
Of course this is also tightly coupled with Solr indexing and Solr field names, so it is a tricky thing both in the current data model and on the Solr end. We worked around it, but it is not nice. :face_with_peeking_eye:
a field can be part of multiple MDB-s
Any change that would improve support for including a metadata field in any grouping of metadata fields (or what we've called a metadata block) seems very worthwhile!
Just out of curiosity, here’s how CEDAR structures its model.
It has 3 types of objects: Templates, Elements, and Fields.
Fields and Elements are fully reusable: a Field can appear in multiple Elements, and an Element can be nested in multiple other Elements. Templates, however, act as "root" objects and cannot be nested within other objects.
Each Field, Element, and Template includes metadata, data constraints, and UI specification. Much like what you’d expect from JSON Schema. Because it is JSON Schema. :slight_smile:
Requirement and multiplicity are defined at the point where an object is used. For example, when a Field is included in an Element, the Element determines whether that Field is required and what its multiplicity is.
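Schematically, the idea looks something like this (to be clear, this is not CEDAR's actual JSON representation, just an illustration of where requiredness and multiplicity live):

{
  "element": "SamplingInfo",
  "contains": [
    { "field": "frequency", "required": true, "multiplicity": "1..n" },
    { "field": "sampleSize", "required": false, "multiplicity": "0..1" }
  ]
}

The Field definitions themselves say nothing about being required; the containing Element decides that at the point of use.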
In Dataverse at the moment a field (or field type) identity is effectively determined by its name. For example, title, author, subject, etc. Although fields have internal database identifiers, they are mapped one-to-one to Solr fields, which means the field name already functions as a unique identifier in practice.
This approach has an important implication: there can be only one field named frequency. Even if multiple MDBs share the same field definition, they must also share the same semantics. However, this is often too restrictive. One MDB might want frequency to be an enumerated list (e.g., Day, Week, Month), while another might want it to be a numeric value.
To address this limitation, our implementation binds fields to the URIs that are already associated with fields and uses those URIs as their effective identifiers. We assume that if two fields share the same URI, they are semantically and syntactically the same. These URIs are also used as the corresponding field names in Solr (after some transformation, of course, because field names cannot include "/", etc.).
At the same time, we decouple field names from field identity. This allows:
different field names to point to the same URI (same semantics), and
the same field name to point to different URIs (different semantics).
This makes the system more flexible while preserving semantic clarity.
For example:
MDB1 (where frequency is first defined)
Name: frequency
URI: http://dataverse.org/fields/frequency
MDB2 (frequency reused with the same meaning)
Name: frequency
URI: http://dataverse.org/fields/frequency
MDB3 (frequency with different semantics)
Name: frequency
URI: http://dataverse.org/fields/frequencyByDayWeekMonth
MDB4 (different name, same semantics as MDB3)
Name: frequencyInterval
URI: http://dataverse.org/fields/frequencyByDayWeekMonth
Making URIs (or, to be irritatingly precise, IRIs) the effective identifiers also improves interoperability with other linked-data oriented representations like RO-Crate.
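The URI-to-Solr-name transformation mentioned above could be sketched like this in Java (the exact replacement rules here are made up, not the CEDAR-integration code; Solr field names conventionally stick to letters, digits, and underscores):

```java
// Hypothetical transformation -- the real implementation may differ.
class SolrFieldNames {
    // Turn a field URI into a Solr-safe field name by collapsing
    // every run of characters outside [A-Za-z0-9] into a single
    // underscore, then trimming leading/trailing underscores.
    static String fromUri(String uri) {
        return uri.replaceAll("[^A-Za-z0-9]+", "_")
                  .replaceAll("^_+|_+$", "");
    }
}
```

So two fields named differently but bound to the same URI would land in the same Solr field, while two fields with the same name but different URIs would be indexed separately.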
The mental model currently forming in my head:
Gotchas:
I agree @Balázs Pataki that the same field name may have different semantic meanings depending on context. As outlined above, embedding this context is hard given how the core works at the moment.
Especially since people use field names in the search function, they would also have to add the context there... That's not an easy thing to figure out, especially from the UX side.
So maybe for the time being, we have to live with this shortcoming :relieved:
yes, I know changes that deep have many implications. I’m not pushing for them, just explaining our approach. It isn’t in use today, but we built it to support future use cases.
Or actually, we had one use case already with the "keyword" field.
Also @Balázs Pataki I'm not sure that the currwnt database SQL model is the limitation here. IIRC the name of a dataset is not part of the DB keys. AFAIK the technical limitation is in the search index! (That's good news...)
No, actucally I think datasetfieldtype has a metadatablock_id relation which is unique.
We definitely could not add the same field to multiple MDB-s. Maybe this changed since 6.1?
Or maybe you can add but it won't appear in all of those MDB-s? :thinking: Anyway, there were some issues with them I cannot remember now without looking at the code.
Oh. Yeah. True, there is still the implication of 1:1 MDB-DF. (see above) But that's not the same as making the name a part of it!
The name limitation mostly comes from the Sold schema... We can't have cardinal fields there. And unless we work around that somehow (multiple cores per context / field naming policy) might be hard to overcome without switching tech. Both ways: big fat project...
We're talking about metadata field at the dataset level, but what about the file and variable level? See https://github.com/IQSS/dataverse-pm/issues/112 and its design doc.
Rather than using our existing dataset-level model, perhaps we could design something better for datasets, files, and variables. And eventually migrate the current dataset-level stuff to the new and improved model.
That is to say, we could design something new without disturbing the existing infrastructure.
Oh, yeah, file metadata! At the “mental model” level, it should be the same, isn’t it? Even at implementation level. That would be ideal.
I mean, 10 years ago our thought was to just use the same model, the dataset model, for file metadata. But we never got around to it. Probably we should build something better, given what we've learned with our TSV adventure. :sweat_smile:
As you know, this is exactly why we moved fully to RO-Crate: it gives us a single, consistent model for describing Dataset and File metadata while remaining compatible with Dataverse’s existing metadata storage. With RO-Crate we can even use existing Dataverse MDBs to describe files. So, for example, you can simply drop the Geospatial MDB onto a file and instantly add location metadata to it.
Now, the rest is just a far-fetched thought experiment for fun.
Imagine a Dataverse where MDBs are represented purely as JSON Schemas (or similar), and all dataset metadata lives in a single ro-crate-metadata.json file. No separate database representation, just one JSON file per dataset, which can describe the Dataset and Files. That JSON could still be stored in a PG jsonb field for indexing and full-text search, or fed into Elasticsearch/OpenSearch for more advanced discovery. But otherwise, most metadata and schema interactions would become straightforward JSON editing and parsing operations that require no backend roundtrips or database logic and could be handled entirely within the SPA.
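For illustration, a heavily abbreviated (and entirely hypothetical) ro-crate-metadata.json along those lines might look like this, with the dataset and its files described in one @graph:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example dataset",
      "frequency": "Month"
    },
    {
      "@id": "data/survey.csv",
      "@type": "File",
      "contentSize": "12345"
    }
  ]
}

The example field names and values here are invented; the point is just that dataset- and file-level metadata live side by side in one JSON document that could be stored in a jsonb column and indexed directly.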
Interesting. We definitely considered MongoDB a decade ago when we came up with the custom metadata blocks system.
But I like your idea of a Postgres jsonb field. Fewer moving parts. :sweat_smile:
I started writing a doc: https://docs.google.com/document/d/16qxyUejjkcSPb37d9lJlsLnuhLNLTpu5lAsJglugmbw/edit?usp=sharing
As I do have a hard time writing all of this, I wanted to ease my mind by doing some coding work on this. Looks like as I probably will get my hands dirty with exporters for citations and stuff, a better way to transfer the information about which fields actually exist etc would be VERY handy. Here's a start: https://github.com/gdcc/dataverse-spi/blob/core/core/src/main/java/io/gdcc/spi/core/metadata/description/Field.java
Check out #community > JSON Schema to Dataverse TSV converter by @Vera Clemens
Last updated: Apr 03 2026 at 06:08 UTC