Stream: dev

Topic: Solr index field types


view this post on Zulip Vera Clemens (Sep 13 2024 at 12:15):

Hi, I was just playing around with the Dataverse Search API and I noticed that almost all fields are indexed in Solr as text_en fields, which doesn't allow e.g. range searches for integer and date fields. I found this relatively old (and now closed) issue from 2014 talking about using more fitting index data types: https://github.com/IQSS/dataverse/issues/370 It seems that since the date range facet was so far not implemented, the index types have not been touched yet (?) Is there any open issue currently talking about this? Or any ongoing work? :)

view this post on Zulip Philip Durbin 🚀 (Sep 13 2024 at 12:47):

Yes, we should do this and no, there hasn't been any recent discussion.

view this post on Zulip Philip Durbin 🚀 (Sep 13 2024 at 12:47):

@Vera Clemens are you mostly interested in date ranges?

view this post on Zulip Vera Clemens (Sep 13 2024 at 12:53):

:+1: I'm interested in both date ranges and integer ranges

view this post on Zulip Philip Durbin 🚀 (Sep 13 2024 at 13:19):

Cool. Out of curiosity, are the integers in one of your custom metadata blocks? Or one of the standard blocks we ship with Dataverse?

view this post on Zulip Vera Clemens (Sep 13 2024 at 13:24):

They're in custom metadata blocks.

view this post on Zulip Philip Durbin 🚀 (Sep 13 2024 at 13:26):

Ok. I'm wondering if we should also fix anything in the standard blocks. There are dates everywhere. And I believe some of the astro fields use integers.

That way, when we make a release, there are some nice examples people can see without installing custom metadata blocks.

view this post on Zulip Vera Clemens (Sep 13 2024 at 13:32):

Yes, sure! I think this could benefit integer and date fields in any metadata block.

Even without implementing any new facet with sliders, I think if the fields were indexed as integers/dates, you could run Solr range queries via the search input field (or the search API ?q=...) like integerfield:[25 TO 50] or datefield:[2000-11-01 TO 2014-12-01].
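A minimal sketch of such a request, assuming the fields are already indexed with numeric/date types. The server URL and the field names `integerfield` and `datefield` are placeholders taken from the examples above, not real fields:

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, field: str, low: str, high: str) -> str:
    """Return a Search API URL whose q parameter is a Solr range query."""
    # Solr range syntax: field:[low TO high]; "*" can stand for an open bound.
    query = f"{field}:[{low} TO {high}]"
    # urlencode percent-encodes the colon and brackets and turns spaces into "+".
    return f"{base_url}/api/search?" + urlencode({"q": query})


# Hypothetical installation URL and field names, for illustration only.
int_url = build_search_url("https://demo.dataverse.org", "integerfield", "25", "50")
date_url = build_search_url(
    "https://demo.dataverse.org", "datefield", "2000-11-01", "2014-12-01"
)
print(int_url)
print(date_url)
```

The same query string can be typed directly into the search box; the URL form is only needed when calling the API.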

view this post on Zulip Philip Durbin 🚀 (Sep 13 2024 at 13:37):

Yes, definitely. Please see https://guides.dataverse.org/en/6.3/api/search.html#date-range-search-example

view this post on Zulip Philip Durbin 🚀 (Sep 13 2024 at 13:39):

And fileSizeInBytes:[32212254720 TO *] at https://github.com/IQSS/dataverse/issues/4439#issuecomment-468685228

view this post on Zulip Vera Clemens (Sep 25 2024 at 14:21):

I've been playing around with trying to index fields as something other than text_en and it seems to be working OK.

You can try it out here: http://solr-fieldtypes-test-dataverse.qa.km.k8s.zbmed.de/ (note, this is on our dev cluster that is offline during the night, so if you're in the US, it might be offline if you are checking after 2pm-ish, sorry about that)

I've added an experimental metadata block containing an integer, a float and a date field and indexed them in Solr as plong, pdouble and date_range (by just manually editing the schema.xml) and created 3 test datasets with the following values:

Here are some sample queries based on these fields:

(Filtering the date field still has some issues and needs some more testing)
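For illustration, the manual schema.xml edit described above could look roughly like the `<field>` lines this sketch prints. The field names and the `indexed`/`stored`/`multiValued` attributes are assumptions; only the type names plong, pdouble and date_range come from the message, and date_range is assumed to be declared as a fieldType in the schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical field names mapped to the Solr field types mentioned above.
FIELD_TYPES = {
    "exampleInteger": "plong",
    "exampleFloat": "pdouble",
    "exampleDate": "date_range",
}


def field_element(name: str, solr_type: str) -> ET.Element:
    """Build a <field> declaration replacing the default text_en type."""
    return ET.Element(
        "field",
        {
            "name": name,
            "type": solr_type,
            # The attribute values below are assumptions, not copied from Dataverse.
            "indexed": "true",
            "stored": "true",
            "multiValued": "false",
        },
    )


declarations = [
    ET.tostring(field_element(name, solr_type), encoding="unicode")
    for name, solr_type in FIELD_TYPES.items()
]
for line in declarations:
    print(line)
```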

view this post on Zulip Philip Durbin 🚀 (Sep 25 2024 at 14:23):

Great! I'm glad the testing is going well!

view this post on Zulip Vera Clemens (Sep 26 2024 at 09:11):

Found the issue with the date field. The date fields were getting indexed as years only, without months or days (if present). It seems this is intentional: https://github.com/IQSS/dataverse/blob/050064ef264c667c2473c78b893def832c33f992/src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java#L1070 Do you see any issue with changing this? I assume maybe there might be an issue if the date field is set to be facetable? That's something I didn't test yet.

view this post on Zulip Philip Durbin 🚀 (Sep 26 2024 at 10:54):

Yes, probably we did that to have a reasonable (string-based) facet of four digits. But if we had a proper range facet... :grinning:

view this post on Zulip Philip Durbin 🚀 (Sep 26 2024 at 10:55):

Would you want to add the UI for this in JSF? I always hesitate with this since it will be replaced by the new frontend. Would you rather see it added there instead?

view this post on Zulip Vera Clemens (Sep 26 2024 at 11:51):

Could we index the full date in "<dateFieldName>", but only the year in "<dateFieldName>_s"? From a quick test, that seems to allow proper range searches (using the [... TO ...] syntax) but keeps the facets as-is for now.
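A rough sketch of that idea, not the actual IndexServiceBean logic: the function name is hypothetical, and it assumes date values of the form YYYY, YYYY-MM or YYYY-MM-DD:

```python
import re

# Accepts YYYY, YYYY-MM or YYYY-MM-DD (an assumption about the input format).
DATE_PATTERN = re.compile(r"(\d{4})(-\d{2}){0,2}")


def date_index_fields(field_name: str, value: str) -> dict:
    """Map one date value to the two Solr fields discussed above."""
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported date value: {value!r}")
    return {
        # Full date, usable for [... TO ...] range queries.
        field_name: value,
        # Year only, so the existing string-based facet stays as-is.
        f"{field_name}_s": match.group(1),
    }


print(date_index_fields("exampleDate", "2014-12-01"))
```

With this split, range searches go against the full-date field while the facet keeps showing four-digit years.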

view this post on Zulip Vera Clemens (Sep 26 2024 at 11:52):

And yes, if we were to add a proper range facet to the UI, I would like to see it in the new frontend :)

view this post on Zulip Philip Durbin 🚀 (Sep 26 2024 at 12:08):

Hmm, that might work.

view this post on Zulip Vera Clemens (Sep 27 2024 at 12:48):

Yep, seems to work

view this post on Zulip Vera Clemens (Sep 27 2024 at 12:48):

I've opened a PR with my changes here: https://github.com/IQSS/dataverse/pull/10887

view this post on Zulip Philip Durbin 🚀 (Sep 27 2024 at 13:03):

@Vera Clemens looks great! Should the title of the PR have something about highlighting in it?

view this post on Zulip Philip Durbin 🚀 (Sep 27 2024 at 13:11):

Also, can you please add a release note snippet?

view this post on Zulip Vera Clemens (Sep 27 2024 at 13:43):

Both done :smile:

view this post on Zulip Philip Durbin 🚀 (Sep 27 2024 at 13:50):

Looks great. Amazing release note. Do you think it's worth it to add an API test?

view this post on Zulip Vera Clemens (Sep 27 2024 at 14:00):

Thanks! :dataverse_woman: Hm, yes maybe. I'll try and take a look on Monday!

view this post on Zulip Philip Durbin 🚀 (Sep 27 2024 at 14:09):

Wonderful! Thanks!


Last updated: Nov 01 2025 at 14:11 UTC