properties | |
---|---|
Page Owner | emaijala |
The features described on this page are available starting with VuFind® 2.3.
The deduplication feature allows records from multiple sources to be combined and displayed as a single search result. It requires some extra setup and external tools to fully implement. Users who choose not to use deduplication may instead wish to configure the Record Versions feature, which offers a different method for associating related records. (The Deduplication and Record Versions features can also be used to complement one another.)
VuFind® includes support for displaying deduplicated records. This requires that records be deduplicated before they are indexed into Solr, and that a so-called merged record be created for each dedup group (a group of original duplicate records) alongside the original records.
RecordManager can be used for deduplication, as it has built-in support for VuFind®-compatible deduplication. VuFind® does not require RecordManager, however; it only requires that certain index fields and the merged record be present.
Using RecordManager does offer some specific advantages:
Here are a couple of shortened sample Solr records:
Two original records that were found to be duplicates:
field | contents |
---|---|
id | alli.503626 |
title | Network science |
language | eng |
publishDate | 2005 |
topic | Computer networks |
topic | Electronics in military engineering |
building | CityLib |
merged_child_boolean | 1 |

field | contents |
---|---|
id | testsrc.21346 |
title | Network science |
language | eng |
publishDate | 2005 |
topic | Computer networks |
topic | Military engineering |
building | TestLib |
merged_child_boolean | 1 |

The merged record created for this dedup group:

field | contents |
---|---|
id | merge123 |
title | Network science |
language | eng |
publishDate | 2005 |
topic | Computer networks |
topic | Electronics in military engineering |
topic | Military engineering |
merged_boolean | 1 |
local_ids_str_mv | alli.503626 |
local_ids_str_mv | testsrc.21346 |

A record that has no duplicates and carries neither merge flag:

field | contents |
---|---|
id | testsrc.123 |
title | Network science illustrated |
language | eng |
publishDate | 2012 |
topic | Computer networks |
topic | Electronics in military engineering |
topic | Military engineering |
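
As a rough illustration of how the sample records relate to each other, the following sketch builds a merged record from a dedup group. It is not RecordManager's or VuFind®'s actual code: the function name `build_merged_record`, its parameters, and the choice of which fields to copy or skip are assumptions made to reproduce the shortened samples above.

```python
def build_merged_record(merged_id, originals, multi_valued=("topic",),
                        skip=("id", "building")):
    """Return (merged_record, flagged_originals) for one dedup group.

    Hypothetical helper: 'skip' only mirrors the shortened sample above,
    where the merged record shows no 'building' field.
    """
    merged = {"id": merged_id, "merged_boolean": True, "local_ids_str_mv": []}
    flagged = []
    for rec in originals:
        # The merged record lists the IDs of all records in the group.
        merged["local_ids_str_mv"].append(rec["id"])
        for field, value in rec.items():
            if field in skip:
                continue
            if field in multi_valued:
                # Multi-valued fields become the union of the originals' values.
                values = merged.setdefault(field, [])
                for item in value:
                    if item not in values:
                        values.append(item)
            else:
                # Single-valued fields keep the first value seen.
                merged.setdefault(field, value)
        # Each original is indexed as well, flagged as a child of a merged record.
        flagged.append({**rec, "merged_child_boolean": True})
    return merged, flagged


originals = [
    {"id": "alli.503626", "title": "Network science", "language": "eng",
     "publishDate": "2005", "building": "CityLib",
     "topic": ["Computer networks", "Electronics in military engineering"]},
    {"id": "testsrc.21346", "title": "Network science", "language": "eng",
     "publishDate": "2005", "building": "TestLib",
     "topic": ["Computer networks", "Military engineering"]},
]
merged, flagged = build_merged_record("merge123", originals)
print(merged["topic"])
print(merged["local_ids_str_mv"])
```

The merged record ends up with the three distinct topics and both original IDs, matching the sample tables.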
Records with merged_child_boolean set to true are filtered out during the initial Solr search, so a dedup group is represented in the results by its merged record only. For each merged record found, VuFind® then selects the preferred original record and replaces the merged record with it. Information about all records in the dedup group is added to the selected record in the “dedup_data” field, so that it can be shown to the user, for example as links to the other records. The preferred record is always listed first in “dedup_data”.
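
The flow described above can be sketched in a few lines. This is a simplified model, not VuFind®'s implementation: the filtering really happens in the Solr query rather than in application code, and the names `source_of` and `resolve_results`, the shape of the `dedup_data` entries, and the rule that the source is the part of the ID before the first dot are all assumptions for illustration.

```python
def source_of(record_id):
    # Assumption: the source prefix is everything before the first dot.
    return record_id.split(".", 1)[0]


def resolve_results(docs, records_by_id, source_priority):
    """Replace each merged record with its preferred original record."""

    def rank(record_id):
        src = source_of(record_id)
        # Sources missing from the priority list sort last.
        if src in source_priority:
            return source_priority.index(src)
        return len(source_priority)

    resolved = []
    for doc in docs:
        if doc.get("merged_child_boolean"):
            continue  # children are filtered from the initial search
        if not doc.get("merged_boolean"):
            resolved.append(doc)  # record without duplicates, shown as-is
            continue
        ordered = sorted(doc["local_ids_str_mv"], key=rank)
        preferred = dict(records_by_id[ordered[0]])
        # Hypothetical dedup_data shape; the preferred record comes first.
        preferred["dedup_data"] = [
            {"source": source_of(rid), "id": rid} for rid in ordered
        ]
        resolved.append(preferred)
    return resolved


docs = [
    {"id": "alli.503626", "merged_child_boolean": True},
    {"id": "testsrc.21346", "merged_child_boolean": True},
    {"id": "merge123", "merged_boolean": True,
     "local_ids_str_mv": ["alli.503626", "testsrc.21346"]},
    {"id": "testsrc.123"},
]
records_by_id = {d["id"]: d for d in docs}
results = resolve_results(docs, records_by_id, ["testsrc", "alli"])
print([r["id"] for r in results])
```

With `testsrc` ranked ahead of `alli`, the merged record is replaced by `testsrc.21346`, and the record without duplicates passes through unchanged.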
Note that while Solr supports features such as “field collapsing” and “collapse/expand” which could be used to achieve similar deduplication behavior, the deduplication mechanism here does not use them. This avoids the performance cost associated with such functionality, and also allows broader search results. Collapse/expand operates only on the records in a search result set, whereas VuFind®'s deduplication does not require every record in the group to match the search terms; it is enough that the merged record does. This may or may not matter in a given setup, but at least it allows the “best” record to be presented in search results without having to re-merge anything.
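
The difference can be illustrated with a toy example based on the sample records. The substring match below stands in for a Solr query and is not how Solr scores or collapses results; the helper `matches` and the variable names are hypothetical.

```python
def matches(record, term):
    # Toy matcher: case-insensitive substring search over the topic values.
    term = term.lower()
    return any(term in value.lower() for value in record.get("topic", []))


alli = {"id": "alli.503626",
        "topic": ["Computer networks", "Electronics in military engineering"]}
testsrc = {"id": "testsrc.21346",
           "topic": ["Computer networks", "Military engineering"]}
merged = {"id": "merge123",
          "topic": ["Computer networks", "Electronics in military engineering",
                    "Military engineering"],
          "local_ids_str_mv": ["alli.503626", "testsrc.21346"]}

term = "electronics"

# Working only within the result set: a group member is available
# only if it matches the search terms itself.
collapse_candidates = [r["id"] for r in (alli, testsrc) if matches(r, term)]

# Merged-record approach: the hit comes from the merged record, so any
# member of the group may be chosen for display.
display_candidates = merged["local_ids_str_mv"] if matches(merged, term) else []

print(collapse_candidates)
print(display_candidates)
```

Here only the `alli` record matches "electronics" on its own, but because the merged record matches, the preferred record can still be `testsrc.21346` if that source has higher priority.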
The following settings in the Records section of searches.ini affect deduplication:
There is also an optional datasources.ini that maps institutions to source prefixes so that the correct record can be selected when the building facet is used. This mapping is needed unless the building facet values are identical to the source IDs. Each section in datasources.ini, which is compatible with RecordManager's datasources.ini, represents a source ID. The institution setting in each section is matched against the selected building facet value, and the section name is used to look up the record priority from record_sources. For example:

    [alli]
    institution = CityLib

    [testsrc]
    institution = TestLib

Source strings can be translated with the source_ prefix. An example translation in en.ini could be:

    source_alli = "Alli Library"
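
To make the building-to-source mapping more concrete, here is a minimal sketch that reads a datasources.ini-style mapping with Python's standard `configparser` module and lets a selected building override the normal source priority. The function names `sources_for_building` and `pick_record`, the priority list, and the assumption that facet values are plain institution names are illustrative only and do not reflect VuFind®'s internal code.

```python
import configparser

DATASOURCES_INI = """
[alli]
institution = CityLib

[testsrc]
institution = TestLib
"""


def sources_for_building(ini_text, building):
    """Return the source IDs (section names) whose institution matches."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return [
        section for section in parser.sections()
        if parser.get(section, "institution", fallback=None) == building
    ]


def pick_record(local_ids, priority, ini_text, building=None):
    """Pick the record to display, preferring the selected building's source."""
    preferred = sources_for_building(ini_text, building) if building else []

    def rank(record_id):
        # Assumption: the source prefix is everything before the first dot.
        src = record_id.split(".", 1)[0]
        if src in preferred:
            return -1  # a source matching the building facet wins
        return priority.index(src) if src in priority else len(priority)

    return min(local_ids, key=rank)


ids = ["alli.503626", "testsrc.21346"]
priority = ["alli", "testsrc"]  # hypothetical priority order
print(pick_record(ids, priority, DATASOURCES_INI))
print(pick_record(ids, priority, DATASOURCES_INI, building="TestLib"))
```

Without a building filter the normal priority applies and the `alli` record is chosen; when the user filters on TestLib, the `testsrc` record is chosen instead.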