
Support for Deduplication

The features described on this page are available starting with VuFind 2.3.

Solr Setup

VuFind includes support for displaying deduplicated records. This requires that records are deduplicated before they are indexed into Solr, and that a so-called merged record is created for each dedup group (a group of original records identified as duplicates) alongside the original records. RecordManager has built-in support for VuFind-compatible deduplication, but VuFind does not require RecordManager: only the index fields described below and the merged records need to be present.

Required Solr Fields in Merged Records

  • local_ids_str_mv - An array of the IDs of the original records belonging to this dedup group.
  • merged_boolean - A boolean field with value “true” to indicate that this is a merged record.

Required Solr Fields in Original Records with Duplicates

  • merged_child_boolean - A boolean field with value “true” to indicate that this is an original record belonging to a dedup group. Omit this field from records that have no duplicates.

Sample Records

Here are a couple of shortened sample Solr records:

Two original records that were found to be duplicates:

Record 1

field                 contents
id                    alli.503626
title                 Network science
language              eng
publishDate           2005
topic                 Computer networks
                      Electronics in military engineering
building              CityLib
merged_child_boolean  1

Record 2

field                 contents
id                    testsrc.21346
title                 Network science
language              eng
publishDate           2005
topic                 Computer networks
                      Military engineering
building              TestLib
merged_child_boolean  1

Merged Record

field             contents
id                merge123
title             Network science
language          eng
publishDate       2005
topic             Computer networks
                  Electronics in military engineering
                  Military engineering
merged_boolean    1
local_ids_str_mv  alli.503626
                  testsrc.21346

Record With No Duplicates

field        contents
id           testsrc.123
title        Network science illustrated
language     eng
publishDate  2012
topic        Computer networks
             Electronics in military engineering
             Military engineering
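
The relationship between the original records and the merged record can be sketched in Python. This is only an illustration of the field layout shown in the samples, not the way RecordManager or VuFind actually builds records. The function names, the choice of topic as the only multi-valued field, and the decision to leave building out of the merged record (mirroring the shortened sample) are all assumptions.

```python
def flag_as_duplicate(doc):
    """Return a copy of an original record marked as part of a dedup group."""
    flagged = dict(doc)
    flagged["merged_child_boolean"] = True
    return flagged


def build_merged_record(merged_id, originals, multi_valued=("topic",),
                        skipped=("id", "building", "merged_child_boolean")):
    """Combine original records into one merged record (hypothetical helper)."""
    merged = {"id": merged_id, "merged_boolean": True, "local_ids_str_mv": []}
    for doc in originals:
        # Every original record ID is listed in the merged record.
        merged["local_ids_str_mv"].append(doc["id"])
        for field, value in doc.items():
            if field in skipped:
                continue
            if field in multi_valued:
                # Union of values, keeping first-seen order.
                combined = merged.setdefault(field, [])
                for item in value:
                    if item not in combined:
                        combined.append(item)
            else:
                # Single-valued fields: keep the first value encountered.
                merged.setdefault(field, value)
    return merged


record_1 = flag_as_duplicate({
    "id": "alli.503626", "title": "Network science", "language": "eng",
    "publishDate": "2005", "building": "CityLib",
    "topic": ["Computer networks", "Electronics in military engineering"],
})
record_2 = flag_as_duplicate({
    "id": "testsrc.21346", "title": "Network science", "language": "eng",
    "publishDate": "2005", "building": "TestLib",
    "topic": ["Computer networks", "Military engineering"],
})
merged = build_merged_record("merge123", [record_1, record_2])
print(merged)
```

A record without duplicates, such as testsrc.123, would simply be indexed as-is, without either boolean field.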

Search Process

Records with merged_child_boolean=true are filtered out of the results during the initial Solr search. For each merged record found, the preferred original record is then selected, and the merged record is replaced with that original record. Information on all records belonging to the dedup group is added to the original record in the “dedup_data” field, so that it can be displayed to the user, e.g. as links to the other records. The preferred record is always first in “dedup_data”.
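
The steps above can be sketched in Python against an in-memory list of documents instead of a real Solr index. This is a simplified illustration and not VuFind's actual implementation: the function names, the use of a plain list for the source priority order, and the shape of the “dedup_data” entries are assumptions made for the example.

```python
def source_of(record_id):
    """Source prefix of an ID such as 'alli.503626' (the part before the period)."""
    return record_id.split(".", 1)[0]


def initial_search(docs):
    """Stand-in for the Solr query: drop originals that belong to a dedup group."""
    return [doc for doc in docs if not doc.get("merged_child_boolean")]


def resolve_merged(results, docs_by_id, record_sources):
    """Replace each merged record with its preferred original record."""
    priority = {src: pos for pos, src in enumerate(record_sources)}
    resolved = []
    for doc in results:
        if not doc.get("merged_boolean"):
            resolved.append(doc)  # record without duplicates: keep as-is
            continue
        # Stable sort: listed sources first, unlisted ones keep their order.
        local_ids = sorted(
            doc["local_ids_str_mv"],
            key=lambda rid: priority.get(source_of(rid), len(priority)),
        )
        preferred = dict(docs_by_id[local_ids[0]])
        # Hypothetical dedup_data shape: one entry per group member,
        # with the preferred record first.
        preferred["dedup_data"] = [
            {"source": source_of(rid), "id": rid} for rid in local_ids
        ]
        resolved.append(preferred)
    return resolved


docs = [
    {"id": "alli.503626", "title": "Network science", "merged_child_boolean": True},
    {"id": "testsrc.21346", "title": "Network science", "merged_child_boolean": True},
    {"id": "merge123", "title": "Network science", "merged_boolean": True,
     "local_ids_str_mv": ["alli.503626", "testsrc.21346"]},
    {"id": "testsrc.123", "title": "Network science illustrated"},
]
docs_by_id = {doc["id"]: doc for doc in docs}
for record in resolve_merged(initial_search(docs), docs_by_id, ["testsrc", "alli"]):
    print(record["id"], record.get("dedup_data"))
```

With testsrc listed before alli, the merged record merge123 is replaced by testsrc.21346, while testsrc.123 passes through unchanged.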

Configuration

The following settings in the Records section of searches.ini affect deduplication:

  • deduplication - Whether support for deduplicated records is enabled.
  • record_sources - Optional. A comma-separated list of record sources (ID prefixes, separated from the actual record ID by a period) in order of precedence. The earlier a source appears in the list, the higher its priority when selecting the preferred record to display.

There is also an optional datasources.ini that can be used to map institutions to source prefixes so that the correct record can be selected when the building facet is used. This mapping is needed unless the building facet values are identical to the source IDs. Each section in datasources.ini, which is compatible with RecordManager's datasources.ini, corresponds to a source ID. The institution setting in each section is matched against the selected building facet value, and the section name is used to look up the record priority in record_sources.

Source strings can be translated by using keys with the source_ prefix. An example translation in en.ini could be:

source_alli = "Alli Library"

Sample datasources.ini

[alli]
institution = CityLib

[testsrc]
institution = TestLib
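
How a selected building facet value could be mapped to a source prefix, and from there to a position in record_sources, can be sketched in Python as follows. This is an illustration under stated assumptions, not VuFind's own code: the function names are hypothetical, Python's configparser is used as a stand-in that only handles simple INI files like the sample, and the record_sources value "alli,testsrc" is an example.

```python
import configparser

DATASOURCES_INI = """
[alli]
institution = CityLib

[testsrc]
institution = TestLib
"""


def load_institution_map(ini_text):
    """Map each institution value to the name of its section (the source ID)."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        parser[section]["institution"]: section
        for section in parser.sections()
        if "institution" in parser[section]
    }


def source_priority(building, institution_map, record_sources):
    """Position of the building's source in record_sources (lower is preferred)."""
    # Without a mapping, the facet value is assumed to equal the source ID.
    source = institution_map.get(building, building)
    sources = [s.strip() for s in record_sources.split(",") if s.strip()]
    # Sources missing from the list rank after all listed ones.
    return sources.index(source) if source in sources else len(sources)


institution_map = load_institution_map(DATASOURCES_INI)
print(institution_map)
print(source_priority("TestLib", institution_map, "alli,testsrc"))
```

If the building facet values were already identical to the source IDs, the mapping would be empty and the lookup would fall through to the facet value itself.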
indexing/deduplication.txt · Last modified: 2015/12/21 16:51 by demiankatz