VuFind / VUFIND-542

Improvements to Author indexing

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Component/s: Import Tools, Search
    • Labels: None

      Description

      Over the past while, users and librarians here at NLI have given us feedback about problems with author searches in our VuFind instance. Eoghan has asked me to summarise these problems and suggest some solutions in order to kickstart a discussion about how to improve author search in VuFind. Since this would involve relatively core changes to the way that VuFind does search, we'd prefer to have some feedback from other developers before working on our own solution.

      Summary of issues:

      1) At present only Main Authors - Personal Name (MARC 100) are indexed in the author field in Solr. Since MARC records only permit one main author, this has the disadvantage of relegating second authors to the 700 field and thus the author2 field in Solr. The 700 field (Added Entry - Personal Name) is the same field used for other contributors such as illustrators, donors etc. This additional relationship information is typically recorded in subfield $e, although second authors will not receive an entry in $e. This means that second authors will not receive query boosting and will effectively be ranked the same in results as donors, illustrators etc. Similarly, where main authors are Corporate Names or Meeting Names (MARC 110, 111), they will be indexed as author2 in Solr instead of author. This problem also carries over into faceting: since only main authors are used in faceting, it is not possible to facet by corporate name or second author.

      2) When searching for authors, users who enter only the initial of the first name, e.g. "Lee, J." for Joseph Lee, will not receive any results. This is because Solr doesn't have any tokens for the initials.

       
      Suggested Solutions:

      Add 110, 111 to author in marc.properties. This will have the effect of weighting corporate authors / meetings on the same level as personal names.
       
      A BeanShell script could be written to distinguish between different types of 700 field entries, e.g.:
      - When 700$e is blank or contains a value denoting authorship, index in the Solr author field

      - When 700$e contains a value denoting contribution (e.g. illustrator), index as author2

      - When 700$e contains other values not related to authorship (e.g. donor), don't index as an author but possibly index elsewhere

      This would require making author multi-valued, which presumably would have a knock-on effect for both PHP logic and Smarty templates, and would require tweaking the search weightings. The script could use the LOC relator terms/codes [1] as a basis, but should be able to look up a user-specified list of terms/codes too.
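
The three-way rule for 700$e described above can be sketched as follows. This is only an illustration in Python, not the BeanShell script proposed here: the function name `target_field_for_700` and the two small relator sets are hypothetical, and a real list would come from the LOC relator terms/codes plus any user-specified additions.

```python
# Hypothetical relator lists for illustration only; a real deployment would
# load these from configuration (e.g. the LOC relator list plus local terms).
AUTHOR_RELATORS = {"aut", "author"}
CONTRIBUTOR_RELATORS = {"ill", "illustrator", "edt", "editor"}


def target_field_for_700(relator):
    """Pick a Solr field for one 700 entry based on its $e value.

    Returns "author", "author2", or None (do not index as an author).
    """
    # Normalise: tolerate a missing subfield, stray spaces, trailing
    # punctuation and mixed case.
    term = (relator or "").strip().strip(".,").lower()
    if not term or term in AUTHOR_RELATORS:
        # Blank $e or an authorship term: treat like a main author.
        return "author"
    if term in CONTRIBUTOR_RELATORS:
        # Recognised contribution: keep as a secondary author.
        return "author2"
    # Anything else (e.g. donor) is not indexed as an author.
    return None
```

With these assumed lists, a 700 entry with no $e goes to author, one with "illustrator." goes to author2, and one with "donor" is left out of both author fields.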

       A .bsh script or Solr regex script could be written to do some additional processing of names (e.g. Lee, Joseph -> Lee + J) and index the results in a new Solr field or in author_additional.
       

      Suggestions and comments welcome.

      1. authorInitials.patch
        5 kB
        Ronan McHugh
      2. author-mod 02-10-12.patch
        33 kB
        Ronan McHugh
      3. author-mod 10-04-12.patch
        32 kB
        Ronan McHugh
      4. author-mod 10-10-12.patch
        32 kB
        Ronan McHugh
      1. NLI - Multiple Author Facets.png
        11 kB

        Activity

        Demian Katz added a comment -
        If I had to take a guess, I would speculate that the majority of users aren't going to care about corporate authorship getting its own distinct label in the record view... but there might be a vocal minority that needs this. It might be worth polling the mailing lists about this to see how people feel (getting the cataloger's perspective would be useful), but I would be inclined to let the corporate author field display go away in the default trunk setup, as long as we provide instructions on how to get it back as a custom option in case somebody really needs it.
        Ronan McHugh (Inactive) added a comment -
        Here is a bsh script that will process the supplied author fields and return initials in several forms. For example, Yeats, William Butler, will be indexed as "w b y wb wby", Duncan-Smith, Iain, will be indexed as "i d s id ids". International Labour Organisation will be indexed as "i l o ilo". The aim of this is to ensure that users get results when they search for initials without spaces, e.g. "wb yeats" or "ilo".
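
As a rough illustration of the output described in this comment, the sketch below reproduces the three quoted examples. It is written in Python rather than BeanShell, and the rule it uses is an assumption inferred from those examples alone (the attached patch may work differently): names containing a comma are treated as "Surname, Forenames", hyphens split words, and combined forms are built from the leading initials. The function name `author_initials` is hypothetical.

```python
import re


def author_initials(name):
    """Return space-separated initial forms for one author heading."""
    if "," in name:
        # Personal name in "Surname, Forenames" order: put forenames first.
        surname, forenames = name.split(",", 1)
        ordered = f"{forenames} {surname}"
        personal = True
    else:
        # No comma: assume a corporate or meeting name in natural order.
        ordered = name
        personal = False

    # Split on whitespace and hyphens so "Duncan-Smith" yields two initials;
    # skip empty pieces and pieces that do not start with a letter.
    words = [w for w in re.split(r"[\s\-]+", ordered) if w and w[0].isalpha()]
    initials = [w[0].lower() for w in words]
    if not initials:
        return ""

    # Start with each initial on its own, then add the run-together forms.
    forms = list(initials)
    if len(initials) > 1:
        # Assumed rule: personal names get every leading run of two or more
        # initials; other names get only the full run.
        start = 2 if personal else len(initials)
        for n in range(start, len(initials) + 1):
            forms.append("".join(initials[:n]))
    return " ".join(forms)
```

Indexing these forms alongside the full name is what would let a query such as "wb yeats" or "ilo" find a matching token.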
        Demian Katz added a comment -
        Another issue to consider as part of comprehensive author field redesign:

        Right now, author_additional is used only for table of contents authors; these names are searched but never displayed from Solr (since TOC display is handled by direct MARC processing). This seems inelegant; should we rename author_additional to author-toc, change the way the field is used, or do nothing?
        Demian Katz added a comment -
        A VuFind 2 port of this code is in progress in this pull request:

        https://github.com/vufind-org/vufind/pull/354

        I've significantly reworked the BeanShell code from the original patch to make it simpler and more generic. Functions have been renamed, and the basic idea here is that, rather than caring about semantic meanings of particular MARC tags, and instead of having built-in concepts of primary/secondary authors, the revised code simply filters the author results using a couple of parameters: the tags which may be included if no relator is present, and a whitelist of relators to allow when a relator value is found.
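
The two-parameter filter described in this comment might look roughly like the sketch below. It is a Python illustration only, not the BeanShell code from the pull request: the function name `filter_authors`, the `(tag, name, relators)` tuple layout, and the sample values are all hypothetical.

```python
def filter_authors(entries, tags_without_relator, relator_whitelist):
    """Return the names that pass the relator filter.

    entries: iterable of (tag, name, relators) tuples, where relators is a
        list of relator strings found on that field (possibly empty).
    tags_without_relator: tags accepted when a field carries no relator.
    relator_whitelist: relator values accepted when a relator is present.
    """
    allowed = {r.lower() for r in relator_whitelist}
    names = []
    for tag, name, relators in entries:
        if relators:
            # A relator is present: keep the name only if one is whitelisted.
            keep = any(r.strip().strip(".,").lower() in allowed for r in relators)
        else:
            # No relator: fall back to the list of acceptable tags.
            keep = tag in tags_without_relator
        if keep:
            names.append(name)
    return names


# Made-up sample data for illustration.
sample = [
    ("100", "Lee, Joseph", []),
    ("700", "Smith, Ann", ["illustrator."]),
    ("700", "Jones, Tom", ["donor"]),
    ("700", "Brown, Sue", []),
]
primary = filter_authors(sample, {"100"}, {"author"})
secondary = filter_authors(sample, {"700"}, {"illustrator"})
```

In this sketch the same function serves both calls, and only the two parameters change, which is the point of dropping built-in notions of primary and secondary authors.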

        I'm sure there's still room to further improve this code, but I feel this is a step in the right direction.

        I've also made some schema adjustments which are similar (but not identical) to the ones proposed in the older patch. The net result is just about the same, but I've also taken the liberty of eliminating some unused author fields to simplify matters.

        There's a lot of work still to be done on the PHP side -- watch the PR for progress there.

        I also haven't done anything with the initials patch yet. That feels like a separate (and smaller) issue, so I think I'll work through the primary patch before I worry too much about that one.
        Demian Katz added a comment -
        Just an update to note that all functionality from this ticket (including the author initials) is now implemented in the pull request. We still need to do some review before merging to master (not to mention minting a new SolrMarc release), but this is great progress!

          People

          • Assignee: Demian Katz
          • Reporter: Ronan McHugh (Inactive)
          • Votes: 3
          • Watchers: 6