[VUFIND-480] Add dynamic fields to Solr schema for easier customization Created: 09/Dec/11  Updated: 03/Jul/12  Resolved: 11/Jan/12

Status: Resolved
Project: VuFind®
Components: Search
Affects versions: None
Fix versions: 1.3

Type: Improvement
Priority: Minor
Reporter: Demian Katz
Assignee: Demian Katz
Resolution: Fixed
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: dynamicfields.patch, localsearchspecs.patch

 Description   
We should add some dynamic fields (see http://wiki.apache.org/solr/SchemaXml#Dynamic_fields ) to make it easier for users to add custom fields to their VuFind indexes without being forced to modify the Solr schema.

Before adding this feature, we need to decide on a naming convention -- should we use a prefix or suffix? How should we represent field types? What field types do we need to support? How do we indicate whether fields are stored and/or multivalued? Which combinations of these settings do we need/want to support?

Please comment on this ticket if you have ideas or preferences.

 Comments   
Comment by Tod Olson [ 09/Dec/11 ]
I think the suffix system used by Blacklight works pretty well, as it indicates the type and parsing of the fields. In looking at their indexing, I found it pretty easy to add local fields and quickly get the kind of tokenizing I wanted.

You wind up with the first part of the field name describing the field content and the suffix describing the data type, and then it's really easy to scan an alphabetical list of available fields and see what variations are available for each field, something akin to:

title_t
title_display
title_sort

-Tod
Comment by Ted Lawless [ 09/Dec/11 ]
Since we are indexing a fair amount of non-MARC metadata, I like the idea of dynamic fields, particularly if there were a core set of required (Dublin Core-ish) fields (title, author/creator, date, language, topic, etc.) while the rest remained flexible. As Tod points out, the Blacklight group has come up with a nice way of accomplishing this.

One potential drawback to adding many dynamic fields is that each installation will have to customize searchspecs.yaml to match its dynamic fields. Right now it is nice to have robust search weighting out of the box, which adds to the ease of getting VuFind up and running.

Thanks for starting this discussion.

Ted
Comment by Tod Olson [ 09/Dec/11 ]
Well, I think Blacklight also has a very robust out of the box indexing configuration.

They organize the indexing a bit differently, though. They put the search definitions with per-field boosts in the search handler definitions in solrconfig.xml. If I understand this correctly, this means you can query Solr directly through a URL or the admin interface and get the same rankings you would get in the web catalog interface. Seems useful, like for regression testing on ranking.

But regardless of where the searches are defined, the fields must be defined in one place and used in searches in another place. Anyone adding fields will have to be explicit about how they are to be used. I don't think there's a way around that.

-Tod
Comment by Demian Katz [ 12/Dec/11 ]
The reason VuFind's search definitions and boosts are not stored directly in solrconfig.xml is that VuFind offers options beyond what solrconfig.xml can do -- since it supports PHP-driven field munging (for things like its call number search, and to allow advanced search operators beyond what Dismax supports) it made more sense to put all of that information in one place, rather than having the Dismax-specific stuff in the Solr configuration and then (by necessity) having other definitions elsewhere.

I'm not currently in favor of stripping down the existing Solr schema and switching to dynamic fields. I think it makes sense to keep a fairly standardized out-of-the-box configuration to keep it simple to get things up and running. I'd just like to make it cleaner to extend the schema without modifying as many files... and I suppose we could consider trimming the existing schema a little bit (things that aren't used by default, like the longitude/latitude field, for example, might be worth switching to dynamic fields).

Another potentially helpful addition would be a searchspecs_local.yaml file that extends/overrides the default searchspecs.yaml file, so you can add new search types without editing the existing file (another way of making upgrades less complex).

Taking a look at the Blacklight schema (https://github.com/projectblacklight/blacklight-jetty/blob/master/solr/conf/schema.xml) it seems to be a bit of a mix of purely data-type-specific suffixes (e.g. *_i for integers) and use-type-specific suffixes (e.g. *_facet for facet fields). Whether things are stored, indexed or multivalued is not encoded into the suffixes and does not appear to be consistent (presumably the selections are based on specific common use cases). I think I'd like to see something a little more consistent in VuFind. I think I prefer the data-type-specific suffixes since that offers the most flexibility -- we can recommend a particular data type for a particular purpose, but we're not locking users into a specific set of use cases (though I understand that Blacklight's naming is at least partially driven by the way that Rails uses field names to control code behavior). Perhaps a starting point would be using type suffixes both with and without an additional "multi" suffix for multi-valued fields. Everything could be stored and indexed for simplicity's sake (we could add nostore and/or noindex suffixes if we really want to, but it gets combinatorically ugly -- maybe better to recommend manual schema customization at that point). Any further thoughts?

- Demian
Comment by Ted Lawless [ 12/Dec/11 ]

>>Another potentially helpful addition would be a searchspecs_local.yaml file that extends/overrides the default searchspecs.yaml file, so you can add new search types without editing the existing file (another way of making upgrades less complex).

>>Perhaps a starting point would be using type suffixes both with and without an additional "multi" suffix for multi-valued fields.

These points sound good to me. It would offer a lot of flexibility (if needed) and still allow for a straightforward setup process.

Ted
Comment by Demian Katz [ 14/Dec/11 ]
I'm attaching two patches:

localsearchspecs.patch adds a mechanism for overriding/extending searchspecs.yaml with searchspecs_local.yaml

dynamicfields.patch proposes some dynamic field definitions:

_date, _date_mv (date, single-valued or multi-valued)
_isn, _isn_mv (isn, single-valued or multi-valued)
_str, _str_mv (string, single-valued or multi-valued)
_txt, _txt_mv (text, single-valued or multi-valued)
_txtF, _txtF_mv (textFacet, single-valued or multi-valued)
_txtP, _txtP_mv (textProper, single-valued or multi-valued)

This was the best compromise I could think of between brevity and clarity -- but I'm open to suggestions for better suffixes.
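
For illustration, the suffix list above could map onto schema.xml entries along the following lines. This is only a sketch, not the contents of dynamicfields.patch: it assumes that field types named date, isn, string, text, textFacet and textProper are already defined in the schema, and it marks everything as stored and indexed, as suggested earlier in this thread.

```xml
<!-- Sketch only: the attribute choices are assumptions, not copied from the patch. -->
<!-- Single-valued variants -->
<dynamicField name="*_date" type="date" indexed="true" stored="true"/>
<dynamicField name="*_isn" type="isn" indexed="true" stored="true"/>
<dynamicField name="*_str" type="string" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text" indexed="true" stored="true"/>
<dynamicField name="*_txtF" type="textFacet" indexed="true" stored="true"/>
<dynamicField name="*_txtP" type="textProper" indexed="true" stored="true"/>

<!-- Multi-valued variants use the same types with an extra _mv suffix -->
<dynamicField name="*_date_mv" type="date" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_isn_mv" type="isn" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_str_mv" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_txt_mv" type="text" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_txtF_mv" type="textFacet" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_txtP_mv" type="textProper" indexed="true" stored="true" multiValued="true"/>
```

With definitions like these, a record could carry a custom field without any further schema change. The example below uses a hypothetical field name, local_note_txt_mv, in a Solr XML update message; other fields that a real VuFind record would need are omitted.

```xml
<add>
  <doc>
    <field name="id">example-1</field>
    <!-- Hypothetical custom field; the _txt_mv suffix selects the multi-valued text type -->
    <field name="local_note_txt_mv">First local note</field>
    <field name="local_note_txt_mv">Second local note</field>
  </doc>
</add>
```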

I've also omitted the spelling-specific fields from the dynamic definitions; I can't think of a reason to create dynamic spelling fields since spelling changes would necessitate other configuration customizations anyway... but if you can think of a use case, please share it.

I'll commit these patches in time for the 1.3 release unless somebody objects in the meantime. As far as I can tell, the only possible downside is that the searchspecs_local.yaml check requires an extra file access and might slow things down by a trivial amount.
Comment by Demian Katz [ 11/Jan/12 ]
Resolved in r4822.
Comment by Václav Rosecký [ 04/Jun/12 ]
Please add a dynamic field for browse indexes, to allow custom indexes for browsing ISBN, ISSN, etc.

<dynamicField name="*_browse" type="string" indexed="true" stored="false" multiValued="true"/>
Comment by Demian Katz [ 04/Jun/12 ]
I see how this would be useful, but I wonder if it would be better to come up with a naming convention for unstored versions of fields since this type of field might be useful for other purposes as well... though then we end up with a profusion of field definitions as opposed to just one.
Comment by Ronan McHugh [ 02/Jul/12 ]
Hi, just wondering if there is any interest in adding this to the Authority Index at some point?
Comment by Demian Katz [ 02/Jul/12 ]
As of r5908, I've added dynamic fields to the authority core for consistency with biblio. Since authority doesn't define as many field types, the list is shorter, but date, string and text are supported using the same suffixes as in biblio. This change will be available in releases 1.4 and 2.0alpha.
Comment by Ronan McHugh [ 03/Jul/12 ]
great, thanks!
Generated at Fri Apr 26 13:45:36 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100251-rev:4690f9fa025ccb713885a7f8212eefdeb0c508be.