Warning: This page has not been updated in over a year and may be outdated or deprecated.
indexing:deduplication
Differences
This shows you the differences between two versions of the page.
deduplication [2014/01/02 20:40] – [Sample datasources.ini] demiankatz | indexing:deduplication [2023/03/20 16:33] (current) – [Search Process] demiankatz | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Support for Deduplication ====== | ====== Support for Deduplication ====== | ||
- | // The features described on this page are available starting with VuFind | + | // The features described on this page are available starting with VuFind® |
+ | |||
+ | ===== Introduction ===== | ||
+ | |||
+ | The deduplication feature allows records from multiple sources to be combined and displayed as a single search result. It requires some extra setup and external tools to fully implement. Users who choose not to use deduplication may instead wish to configure the [[configuration: | ||
===== Solr Setup ===== | ===== Solr Setup ===== | ||
- | VuFind | + | VuFind® |
+ | |||
+ | ==== RecordManager ==== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | Using RecordManager does offer some specific advantages: | ||
+ | - RecordManager can find the best record among the deduplicated records to use as the base record when creating a merged record. | ||
+ | - There' | ||
+ | - The records belonging to a dedup group can also be enriched with data from the merged record, so enrichment can be achieved in both directions. | ||
+ | - The mechanism can ensure e.g. that two records from the same data source never get deduplicated. This is a built-in assumption in RecordManager' | ||
==== Required Solr Fields in Merged Records ==== | ==== Required Solr Fields in Merged Records ==== | ||
Line 12: | Line 26: | ||
* **merged_boolean** - A boolean field with value " | * **merged_boolean** - A boolean field with value " | ||
| | ||
- | ==== Required Solr Fields in Original Records ==== | + | ==== Required Solr Fields in Original Records |
- | * **merged_child_boolean** - A boolean field with value " | + | * **merged_child_boolean** - A boolean field with value " |
==== Sample Records ==== | ==== Sample Records ==== | ||
Line 60: | Line 74: | ||
| | testsrc.21346 | | | | testsrc.21346 | | ||
+ | === Record With No Duplicates === | ||
+ | |||
+ | ^ field ^ contents | ||
+ | | id | testsrc.123 | ||
+ | | title | Network science illustrated | | ||
+ | | language | eng | | ||
+ | | publishDate | 2012 | | ||
+ | | topic | Computer networks | | ||
+ | | | Electronics in military engineering | | ||
+ | | | Military engineering | | ||
===== Search Process ===== | ===== Search Process ===== | ||
Line 65: | Line 89: | ||
Records with merged_child_boolean=true are filtered from the results during the initial Solr search. Then the preferred original record is selected from each merge record found, and the merge record is replaced with the original record. Information on all the records belonging to the dedup group is added to the original records in " | Records with merged_child_boolean=true are filtered from the results during the initial Solr search. Then the preferred original record is selected from each merge record found, and the merge record is replaced with the original record. Information on all the records belonging to the dedup group is added to the original records in " | ||
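The search flow described above can be sketched roughly as follows. This is an illustrative Python sketch, not VuFind®'s actual implementation: the function name ''resolve_results'', the ''local_ids'' key listing a merge record's members, the ''dedup_group'' key, and the ''fetch_record'' callback are all hypothetical names chosen for the example. In a real installation the first step would normally be a Solr filter rather than a loop in application code.

```python
def resolve_results(solr_docs, fetch_record, pick_preferred=lambda ids: ids[0]):
    """Turn raw search hits into the records that are shown to the user.

    solr_docs      -- list of dicts standing in for Solr documents
    fetch_record   -- hypothetical callback: record id -> original record dict
    pick_preferred -- hypothetical callback choosing one id from a dedup group
    """
    results = []
    for doc in solr_docs:
        # Step 1: original records that belong to a merge record are hidden.
        if doc.get("merged_child_boolean"):
            continue
        if doc.get("merged_boolean"):
            # Step 2: choose the preferred original record of the group.
            local_ids = doc["local_ids"]  # assumed key, for illustration only
            preferred = dict(fetch_record(pick_preferred(local_ids)))
            # Step 3: attach information about the whole dedup group.
            preferred["dedup_group"] = list(local_ids)
            results.append(preferred)
        else:
            # Records with no duplicates pass through unchanged.
            results.append(doc)
    return results
```

A record with no duplicates, such as testsrc.123 in the sample records, carries neither flag and is returned as-is.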
+ | ==== Architecture Note: Field Collapsing / Collapse/ | ||
+ | |||
+ | Note that while Solr supports features such as "field collapsing" | ||
===== Configuration ===== | ===== Configuration ===== | ||
- | The following settings in the Records section of [[searches.ini]] affect deduplication: | + | The following settings in the Records section of [[configuration: |
* **deduplication** Whether support for deduplicated records is enabled | * **deduplication** Whether support for deduplicated records is enabled | ||
  * **record_sources** Optional: A comma-separated list of record sources (ID prefixes separated from the actual record id with a period) in order of precedence. The earlier a source appears in the list, the higher its priority when selecting a preferred record to display. | * **record_sources** Optional: A comma-separated list of record sources (ID prefixes separated from the actual record id with a period) in order of precedence. The earlier a source appears in the list, the higher its priority when selecting a preferred record to display. | ||
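To illustrate how an ordered record_sources value could drive the choice of preferred record, here is a minimal Python sketch. The function names and the source name ''alli'' are hypothetical, and ranking unlisted sources after all listed ones is an assumption made for the example, not documented VuFind® behaviour.

```python
def source_of(record_id):
    """Return the source prefix, i.e. the part of the id before the first period."""
    return record_id.split(".", 1)[0]


def pick_preferred(local_ids, record_sources):
    """Pick the id whose source appears earliest in the comma-separated list.

    Sources missing from the list rank after every listed source
    (an assumption for this sketch); ties keep the original order.
    """
    priority = {
        src.strip(): rank
        for rank, src in enumerate(record_sources.split(","))
    }
    return min(
        local_ids,
        key=lambda rid: priority.get(source_of(rid), len(priority)),
    )
```

With a setting such as ''alli,testsrc'', a group containing testsrc.21346 and alli.1 would resolve to alli.1, because ''alli'' is listed first.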
- | There is also an optional datasources.ini that can be used to map institutions to source prefixes so that the correct record can be selected when the building facet is used. Each section in datasources.ini, | + | There is also an optional datasources.ini that can be used to map institutions to source prefixes so that the correct record can be selected when the building facet is used. This mapping is needed unless the building facet values are identical to the source IDs. Each section in datasources.ini, | ||
Source strings can be translated using the source_ prefix. An example translation in en.ini could be: | Source strings can be translated using the source_ prefix. An example translation in en.ini could be: | ||
Line 88: | Line 115: | ||
</ | </ | ||
---- struct data ---- | ---- struct data ---- | ||
+ | properties.Page Owner : emaijala | ||
---- | ---- | ||
indexing/deduplication.1388695253.txt.gz · Last modified: 2014/06/13 13:12 (external edit)