Stream: docs

Topic: Metadata block naming convention docs and practice


Balázs Pataki (Jan 16 2026 at 10:42):

This may be a documentation issue, but it could also point to something deeper.

In the Dataverse documentation:
https://guides.dataverse.org/en/6.9/admin/metadatacustomization.html#id20
the convention for metadata field names is described as:

"By convention, should start with a letter, and use lower camel case"

This guidance was accurate for a long time. However, with the introduction of 3D Objects Metadata, it appears that field names can now also start with digits. If this behavior is intentional, the documentation should probably be updated to reflect it.

That said, field naming conventions are ultimately constrained by what Solr allows and officially supports. According to the Solr documentation:
https://solr.apache.org/guide/solr/latest/indexing-guide/fields.html

"Field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed."

In other words, while field names starting with digits may work in practice, they are not officially supported by Solr and could lead to compatibility issues in the future.

Given that this behavior has existed for years, it is unlikely to cause immediate problems. Still, it is an important constraint to keep in mind when defining metadata fields, and it would be helpful for the Dataverse documentation to clarify this nuance explicitly.
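
A minimal sketch of the Solr recommendation quoted above: the check accepts only names made of letters, digits, and underscores that do not start with a digit. The function name `is_solr_safe_name` is hypothetical, ASCII-only matching is an assumption, and the example names are made up for illustration. The Dataverse convention (start with a letter, lower camel case) is stricter than this pattern.

```python
import re

# Solr's recommendation: alphanumeric or underscore characters only,
# and not starting with a digit. Restricting to ASCII is an assumption.
_SOLR_SAFE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_solr_safe_name(name: str) -> bool:
    """Return True if `name` matches the recommended Solr field-name pattern."""
    return _SOLR_SAFE.fullmatch(name) is not None


print(is_solr_safe_name("authorName"))  # True
print(is_solr_safe_name("3dScanner"))   # False: starts with a digit
```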

Philip Durbin 🚀 (Jan 16 2026 at 12:09):

Huh. Interesting. @Julian Gautier, what do you think?

Julian Gautier (Jan 16 2026 at 14:05):

Hey @Balázs Pataki and @Philip Durbin 🚀.

I see in the Metadata Customization page's #datasetField (field) properties table that someone used that text you pointed to, @Balázs Pataki, from https://solr.apache.org/guide/solr/latest/indexing-guide/fields.html, when describing the allowed values and restrictions of the database names of metadata fields.

Since it sounds like the database names of metadata blocks and the database names of metadata fields should not start with digits for the same reason, maybe we should adjust the text in the allowed values and restrictions of the metadata block "name" property so it's similar to text in the allowed values and restrictions of the dataset field "name" property.

Do you think that would be helpful?

I don't have enough experience with Solr to guess how problematic database field names that start with digits might be in the future, either for the database names of metadata blocks or of metadata fields.

Do y'all think we should fix this and any other database names of metadata block fields and dataset fields that start with numbers?

I think that the folks who designed the 3D metadata block just didn't see that the guides say that these database names should start with letters.

I do remember being worried about the process we used to design and add this 3D Objects metadata block. We didn't have the time to test it as I had planned, and we were asked to include it like we include the other "Supported Metadata" instead of adding it as "Experimental Metadata", which to be fair I think has its own challenges.

Balázs Pataki (Jan 16 2026 at 14:19):

From a pragmatic point of view, I think we should stick to “not starting with a digit,” since practically no mainstream programming languages allow symbol names that begin with digits. And there are probably good reasons for that.

Now, the 3D metadata block might already be in use by the community, so I’m not sure how difficult it would be to change it at this point.

On the other hand, it does work, so… ¯_(ツ)_/¯

Still, I’d support the pragmatic approach. Both in docs and in actual use.

Philip Durbin 🚀 (Jan 16 2026 at 14:30):

In #9176 @Oliver Bertuch added a script to check for duplicate metadata field names. Maybe it could be extended to enforce a naming convention.
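
One way such an extended check could look, as a rough sketch only: this is not the script from #9176, and the function name `check_field_names` is hypothetical. It assumes the metadata block TSV layout in which section header rows start with a `#` marker (such as `#datasetField`) followed by column names, and data rows start with an empty first cell. The sample block and its field names are made up.

```python
import re
from collections import Counter

# Convention from the guides: start with a letter. Allowing only ASCII
# letters, digits, and underscores afterwards is an assumption.
NAME_OK = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_field_names(tsv_text: str) -> list[str]:
    """Return a list of problems found in the #datasetField section."""
    problems, names = [], []
    in_fields, name_col = False, None
    for line in tsv_text.splitlines():
        cells = [c.strip() for c in line.split("\t")]
        if not any(cells):
            continue  # skip blank lines
        if cells[0].startswith("#"):
            # Section header row; only #datasetField rows are inspected.
            in_fields = cells[0] == "#datasetField"
            name_col = cells.index("name") if in_fields and "name" in cells else None
            continue
        if in_fields and name_col is not None and len(cells) > name_col:
            name = cells[name_col]
            if name:
                names.append(name)
                if not NAME_OK.fullmatch(name):
                    problems.append(f"invalid field name: {name}")
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"duplicate field name: {name}")
    return problems


sample = (
    "#metadataBlock\tname\tdataverseAlias\tdisplayName\n"
    "\tdemo\t\tDemo Block\n"
    "#datasetField\tname\ttitle\n"
    "\tmodelType\tModel Type\n"
    "\t3dScanner\tScanner\n"
    "\tmodelType\tModel Type Again\n"
)
print(check_field_names(sample))
```

Collecting all problems instead of stopping at the first one keeps the check usable in CI, where a full report per block is more helpful than a single failure.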

Oliver Bertuch (Jan 16 2026 at 14:31):

I have been preaching about our "specification" of how our TSV format works, how it mixes specification and file format, and how the parser is lacking... OMG, finally someone else is stumbling over this and being annoyed.

Oliver Bertuch (Jan 16 2026 at 14:32):

After some rant, here is some more productive stuff: I currently have a trainee whom I have assigned to do something about this.

Oliver Bertuch (Jan 16 2026 at 14:34):

We will implement a Java library that a) has a proper data model for our metadata blocks, b) provides an extensible parser interface, c) makes these serializable (so we can generate Solr Schema from it or whatever), d) adds validation, e) is easy to use programmatically and f) is also available as a CLI application for more use cases.

Oliver Bertuch (Jan 16 2026 at 14:34):

That way I want him to gain a solid understanding of everything we do around these parts and also get a real contribution into the core! (The idea would be to replace the current "parser".)

Balázs Pataki (Jan 16 2026 at 14:35):

Maybe it is also time to move away from TSV then, isn’t it?

Oliver Bertuch (Jan 16 2026 at 14:35):

We will be free to do so, yes.

Oliver Bertuch (Jan 16 2026 at 14:35):

But this needs a proper model first! Then parsers can use whatever format they want.
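
The "model first, parsers second" idea could be illustrated roughly as follows. The planned library is in Java; this Python sketch only shows the separation of concerns, and every name in it (`DatasetField`, `MetadataBlock`, `BlockParser`, `JsonBlockParser`, `validate`) as well as the JSON layout is hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class DatasetField:
    name: str
    title: str


@dataclass
class MetadataBlock:
    name: str
    fields: list[DatasetField] = field(default_factory=list)


class BlockParser(Protocol):
    """Any input format only needs to produce the shared model."""

    def parse(self, text: str) -> MetadataBlock: ...


class JsonBlockParser:
    """One possible parser; the JSON layout is invented for this example."""

    def parse(self, text: str) -> MetadataBlock:
        raw = json.loads(text)
        fields = [DatasetField(f["name"], f["title"]) for f in raw.get("fields", [])]
        return MetadataBlock(raw["name"], fields)


def validate(block: MetadataBlock) -> list[str]:
    """Validation runs on the model, so it is independent of the input format."""
    return [
        f"field name must not start with a digit: {f.name}"
        for f in block.fields
        if f.name[:1].isdigit()
    ]


def load(parser: BlockParser, text: str) -> MetadataBlock:
    return parser.parse(text)


doc = '{"name": "demo", "fields": [{"name": "3dScanner", "title": "Scanner"}]}'
print(validate(load(JsonBlockParser(), doc)))
```

Because `validate` only sees `MetadataBlock`, a TSV, YAML, or TOML parser could be added later without touching the validation rules.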

Oliver Bertuch (Jan 16 2026 at 14:36):

If someone wants to use some fancy exotic whatever: fine, write a parser and be done with it.

Philip Durbin 🚀 (Jan 16 2026 at 14:38):

This reminds me of how markdown.pl evolved into CommonMark. :smile:

Balázs Pataki (Jan 16 2026 at 14:38):

I had some discussions with Gustavo, and he said it was a rather ad hoc decision to use TSV at the time, not much thinking went into it, so I guess IQSS won’t be against some better input/output formats.

Oliver Bertuch (Jan 16 2026 at 14:39):

Again, the culprit was not creating a specification and data model first! What we have in the guides is a whole mix of spec, model and format, all influencing each other.

Oliver Bertuch (Jan 16 2026 at 14:39):

Would a discussion around the model/spec need a community review?

Oliver Bertuch (Jan 16 2026 at 14:40):

As you said: these choices have consequences! (see 3D fields)

Philip Durbin 🚀 (Jan 16 2026 at 14:40):

We used TSV because the person developing the metadata blocks used Google Spreadsheets. We'd download the TSV from the spreadsheet. Then we wrote a parser.

Oliver Bertuch (Jan 16 2026 at 14:41):

Historically grown mess :stuck_out_tongue_wink: (I get it, no blame here. But we should pay back that tech debt...)

Philip Durbin 🚀 (Jan 16 2026 at 14:43):

Related: harmonize formats for metadata schema and dataset creation #4451

"As a repository administrator / curator / metadata person, I would like to deal with hierarchical metadata schema in a way that is less awkward than a spreadsheet..."

Balázs Pataki (Jan 16 2026 at 14:44):

Philip Durbin 🚀 said:

Related: harmonize formats for metadata schema and dataset creation #4451

"As a repository administrator / curator / metadata person, I would like to deal with hierarchical metadata schema in a way that is less awkward than a spreadsheet..."

This is me! :rolling_on_the_floor_laughing:

Oliver Bertuch (Jan 16 2026 at 14:45):

Been there, done that. Failed... :sad:

Balázs Pataki (Jan 16 2026 at 14:47):

This “symbols starting with digit” problem actually came up in our CEDAR integration meeting about the new 3D MDB, because we validate/enforce the Solr rules, and this MDB failed to work.

Oliver Bertuch (Jan 16 2026 at 14:47):

After DCM2025 I already sketched out a few ideas for a JSON/YAML/TOML based format... But then I again found myself stuck on the problem that we don't have a model and a validator. @Balázs Pataki how interested are you in this? You know, for now I have plenty to do. But FZJ is offering paid development services... (As is Jim under the GDCC flag...)

Balázs Pataki (Jan 16 2026 at 14:48):

I am interested for sure.

Oliver Bertuch (Jan 16 2026 at 14:48):

Yeah, I remember that field names were limited due to Solr... :wink: The current attempt at writing this is not publicly available (yet).

Oliver Bertuch (Jan 16 2026 at 14:49):

Balázs Pataki said:

I am interested for sure.

Ha! But I suppose that money is not gonna find itself, right? :wink: :money_with_wings: (DM me if you want to talk more)


Last updated: Apr 03 2026 at 06:08 UTC