[#VUFIND-542] Improvements to Author indexing

[VUFIND-542] Improvements to Author indexing Created: 29/Mar/12 Updated: 18/Mar/16 Resolved: 18/Mar/16
Status:	Resolved
Project:	VuFind®
Components:	Import Tools, Search
Affects versions:	None
Fix versions:	3.0

Type:	Improvement
Priority:	Minor
Reporter:	Ronan McHugh
Assignee:	Demian Katz
Resolution:	Fixed
Votes:	
Labels:	None
Remaining Estimate:	Not Specified
Time Spent:	Not Specified
Original estimate:	Not Specified
Attachments:	NLI - Multiple Author Facets.png, author-mod 02-10-12.patch, author-mod 10-04-12.patch, author-mod 10-10-12.patch, authorInitials.patch

Description

Over the past while, users and librarians here at NLI have given us some feedback about problems with author searches in our VuFind instance. Eoghan has asked me to summarise these problems and suggest some solutions in order to kickstart a discussion about how to improve author search in VuFind. Since this would involve relatively core changes to the way that VuFind does search, we'd prefer to have some feedback from other developers before working on our own solution.

Summary of issues:

1) At present only Main Authors - Personal Name (MARC 100) are indexed in the author field in Solr. Since MARC records only permit one main author, this has the disadvantage of relegating second authors to the 700 field and thus the author2 field in Solr. The 700 field (Added Entry - Personal Name) is the same field used for other contributors such as illustrators, donors etc. This additional relationship information is typically recorded in the $e subfield, although second authors will not receive an entry in $e. This means that second authors will not receive query boosting and will effectively be ranked the same in results as donors, illustrators etc. Similarly, where main authors are Corporate Names or Meeting Names (MARC 110, 111), they will be indexed as author2 in Solr instead of author. This problem also carries over into faceting. Since only main authors are used in faceting, it is not possible to facet by Corporate Name or second author.

2) When searching for authors, users who enter only the initial for the first name, e.g. "Lee, J." for Joseph Lee will not receive any results. This is because Solr doesn't have any tokens for the initials.

Suggested Solutions:

Add 110, 111 to author in marc.properties. This will have the effect of weighting corporate authors / meetings on the same level as personal names.

A beanshell script could be written to distinguish between different types of 700 field entries, e.g.:
- When 700$e is blank or contains a value denoting authorship, index in the Solr author field

- When 700$e contains a value denoting contribution (e.g. illustrator), index as author2

- When 700$e contains other values not related to authorship (e.g. donor), don't index as an author but possibly index elsewhere

This would require making author multi-valued, which presumably would have a knock-on effect for both PHP logic and Smarty templates, and would require tweaking the search weightings. The script could use the LOC relator terms/codes [1] as a basis, but should be able to look up a user-specified list of terms/codes too.
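The three-way rule for 700$e can be sketched as follows. This is only an illustration in Python rather than BeanShell; the function name `classify_700`, its parameters and the short term lists are assumptions made for the example, not part of any patch, and a real list would come from the LOC relator terms/codes.

```python
# Illustrative term lists only; a real deployment would load the LOC relator list.
AUTHOR_TERMS = {"author", "aut", "joint author"}
CONTRIBUTOR_TERMS = {"illustrator", "ill", "editor", "edt", "translator", "trl"}


def classify_700(relator, extra_author=(), extra_contributor=()):
    """Return the Solr field for a 700 entry, or None if it should be skipped.

    relator is the raw 700$e value (may be None or empty).
    extra_author / extra_contributor are user-specified additions (hypothetical).
    """
    term = (relator or "").strip().strip(".,").lower()
    authors = AUTHOR_TERMS | {t.lower() for t in extra_author}
    contributors = CONTRIBUTOR_TERMS | {t.lower() for t in extra_contributor}
    if not term or term in authors:
        return "author"      # blank $e or a value denoting authorship
    if term in contributors:
        return "author2"     # a value denoting contribution
    return None              # e.g. donor: not indexed as an author
```

For example, `classify_700("Illustrator.")` gives `"author2"`, while `classify_700("donor")` gives `None` unless the administrator adds "donor" to one of the extra lists.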

A .bsh script or Solr regex script could be written to do some additional processing of names (e.g. Lee, Joseph -> Lee + J) and index the results in a new Solr field or in author_additional.
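A minimal sketch of the name processing, written in Python for illustration rather than as a .bsh script. The function name `name_variants` is hypothetical, and the sketch assumes the heading holds only "Surname, Forenames" (no dates or titles, i.e. roughly subfield $a on its own).

```python
def name_variants(heading):
    """Return extra search strings with initials for a 'Surname, Forenames' heading."""
    surname, _, forenames = heading.partition(",")
    surname = surname.strip()
    # Drop trailing full stops so "J." and "Joseph" are treated alike.
    parts = [p.strip(".") for p in forenames.split()]
    initials = [p[0] for p in parts if p]
    if not surname or not initials:
        return []
    variants = [f"{surname}, {initials[0]}."]
    if len(initials) > 1:
        variants.append(f"{surname}, " + " ".join(i + "." for i in initials))
    return variants
```

The returned strings could be indexed in a new Solr field or in author_additional, so that a search for "Lee, J." has something to match against.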

Suggestions and comments welcome.

Comments

Comment by Tod Olson [ 29/Mar/12 ]

Making author multi-valued may have a side-effect on sort by author. Currently the author sort uses authorStr, which is a field copy from author. Maybe that will work just fine when sorting records with multiple authors; maybe the author sort will require some modification.

Comment by Demian Katz [ 29/Mar/12 ]

You're right -- some modification will be needed; I don't think Solr will let you copyField a multi-valued field to a non-multi-valued field. We would have to change some indexing rules (and possibly add a new sort-specific field) to make this work.

Comment by Ronan McHugh [ 30/Mar/12 ]

Here are some screenshots from a quick test I made with some sample data. The only PHP error I am getting is as follows:

Warning: urlencode() expects parameter 1 to be string, array given in C:\vufind-test\web\RecordDrivers\IndexRecord.php on line 710

I guess that it won't be difficult to change the function to accept parameters from an array, but presumably this will have knock on effects elsewhere.

Otherwise:

As can be seen from the screenshots, the Author field is "Array" in List view. No surprise there. I guess wherever that is defined would need to be modified to check for how many authors there are and return "X and Y" for multiple authors. This will mean that the clickthrough will be for both the authors.
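One way to handle both the old single string and the new array is sketched below in Python (the real code would be PHP and Smarty). All function names and the `/Author/Home?author=` link path are assumptions for illustration. Unlike the combined clickthrough described above, this variant builds one link per author.

```python
from urllib.parse import quote_plus


def as_list(value):
    """Accept either the old single string or the new list of authors."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def display_authors(value):
    """Join names as 'X', 'X and Y' or 'X, Y and Z'."""
    names = as_list(value)
    if len(names) <= 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def author_links(value):
    # Encode each name separately, so no array is ever passed to the URL encoder.
    return [(n, "/Author/Home?author=" + quote_plus(n)) for n in as_list(value)]
```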

Searching for author works fine, even after I deleted the cache. I hadn't expected this to work, but perhaps there is an obvious reason? Likewise with a faceted search with author as a facet.

Checking my Solr Schema Browser, it seems that authorStr was populated correctly, so I suppose this is why Author search works but PHP doesn't display Author correctly.

Apologies if there's anything obvious I'm missing here, I'm still getting to grips with the system.

Comment by Ronan McHugh [ 10/Apr/12 ]

Author Modifications patch

This patch is a practical implementation of some of the ideas discussed in VUFIND-542. Namely, it enables multiple first authors by modifying the relevant PHP and display templates and adding a configurable BeanShell script to determine authorship based on user-supplied parameters.

1) Beanshell script

The role of this script is to allow a more nuanced determination of authorship and second authorship based on a MARC record. The script references a file /web/conf/author-classification.ini that classifies the LOC relator codes according to their creative role, firstAuthor, secondAuthor or nonCreative. This allows the system administrator to alter the fields defined as first or second author.
The script is called from marc_local.properties. The administrator calls the getAuthors method and passes in the list of fields to be searched (e.g. 100abcd:110abc) for this particular author type (first or second author). Finally they pass in the type of author field currently being populated, either Author, secondAuthor or AuthorStr.
The getAuthors method checks all the desired fields of each record to see 1) whether they are populated and 2) how they are populated. If the fields are 100, 110 or 111, they are automatically considered to be first authors, unless the administrator wants 110 or 111 fields to be considered as second authors.

If the field is 700 and there is no relator information, the field is considered to be an author. If the field is 700 and there is relator information, the script compares the relator information given to the relator classifications in author-classification.ini. It assigns the field a value based on the result of this comparison.
NOTE - if the administrator does not want 700s with a blank relator field to be automatically considered authors, they can simply not pass 700s into the script from marc_local.properties.
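The decision logic described in this section can be sketched roughly as follows. This is a Python illustration, not the BeanShell script from the patch. The .ini layout, the `load_roles` and `get_authors` signatures, and the treatment of unknown relators as nonCreative are all assumptions made for the example.

```python
import configparser

# Hypothetical layout for author-classification.ini; the real file may differ.
SAMPLE_INI = """
[AuthorRoles]
aut = firstAuthor
ill = secondAuthor
dnr = nonCreative
"""


def load_roles(ini_text):
    """Parse the classification file into a {relator: role} mapping."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["AuthorRoles"])


def get_authors(fields, wanted_tags, author_type, roles):
    """fields: list of (tag, name, relator) tuples taken from one record."""
    found = []
    for tag, name, relator in fields:
        if tag not in wanted_tags or not name:
            continue  # field not requested, or not populated
        if tag in ("100", "110", "111"):
            # Assumption: main-entry tags count for whichever author type lists them.
            role = author_type
        elif not relator:
            role = "firstAuthor"  # 700 with no relator is treated as an author
        else:
            # Assumption: relators missing from the file are treated as nonCreative.
            role = roles.get(relator.lower(), "nonCreative")
        if role == author_type:
            found.append(name)
    return found
```

Leaving "700" out of `wanted_tags` for the first-author call corresponds to the NOTE above: 700s with a blank relator are then never promoted to authors.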

2) Schema XML

The Solr biblio import schema had to be altered to allow this patch to work. Author is now multivalued. AuthorStr is no longer populated by Author, but instead by the first Author value using the author-modifications script.

3) Internal PHP and display templates

Several PHP files and related tpl files had to be modified to allow for multiple author values. In IndexRecord.php, calls to getPrimaryAuthor have mostly been replaced with calls to the getPrimaryAuthors method; the exception is getOpenURL, which still uses getPrimaryAuthor. The getPrimaryAuthor method has been left in place, but it now returns the first element from getPrimaryAuthors.

The relevant tpl files have been updated to display multiple authors (core.tpl, listentry.tpl, result.tpl). One error remains in the similar items module, /record/view.tpl, whereby a trailing space is displayed after each author.
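The backwards-compatible accessor pattern looks roughly like this. It is a Python sketch of the idea only; the class name `IndexRecordSketch` and the `fields` dictionary are hypothetical stand-ins for the PHP record driver and its Solr document.

```python
class IndexRecordSketch:
    """Stand-in for a record driver wrapping one Solr document."""

    def __init__(self, fields):
        self.fields = fields

    def get_primary_authors(self):
        value = self.fields.get("author", [])
        # Records indexed before the schema change may still hold a single string.
        return [value] if isinstance(value, str) else list(value)

    def get_primary_author(self):
        # Kept for existing callers: returns the first primary author, or "" if none.
        authors = self.get_primary_authors()
        return authors[0] if authors else ""
```

Callers that can only use one name keep calling the singular method, while templates iterate over the plural one.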

Comment by Ronan McHugh [ 29/Jun/12 ]

This is also relevant to a user complaint we have received about title searching. When users make a title search, VuFind also returns results that match on author name. This is because VuFind includes the value of the 245c subfield (the statement of responsibility) in the title_full index. To rectify this, Solr should be prevented from indexing 245c as part of title_full. In the example below, a title search for "murphy report" returns records with the title "report" and the author "murphy".

http://catalogue.nli.ie/Search/Results?lookfor=murphy+report&type=Title&submit=FIND

Our fix:
in marc.properties:

author_additional = 505r:245c
title_full = 245abfghknps, first
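The effect of the two lines above can be illustrated with a small Python sketch. The `join_subfields` helper and the sample 245 field are made up for the example, and this is not how SolrMarc itself evaluates marc.properties.

```python
def join_subfields(subfields, codes):
    """subfields: list of (code, value) pairs from the first 245 field."""
    return " ".join(v.strip() for c, v in subfields if c in codes and v.strip())


# Made-up record: title "Report", statement of responsibility "by Murphy."
field_245 = [("a", "Report /"), ("c", "by Murphy.")]

title_full = join_subfields(field_245, "abfghknps")   # 245c is left out
author_additional = join_subfields(field_245, "c")    # 245c is indexed here instead
```

With 245c excluded, "murphy" no longer appears in title_full, so a title search for it would not match this record on the author's name.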

Comment by Filipe MS Bento [ 09/Jul/12 ]

Dear Ronan (et alii),

When I saw this ticket of yours I thought: OK, finally a solution that removes the barrier of not being able to have more than one Main Author (main intellectual responsibility). The VuFind Solr biblio schema forces this because, I suspect, it was originally designed as a pure OPAC 2.0 with the OPAC as the sole source of records (pure speculation of mine, no facts to prove it). There is also this weird thing of MARC not allowing the main author field to be repeatable: it forces one author to be elected as the main one and the other(s) to be relegated to the not so honorable position of co-authors (in UNIMARC, at least, they are near each other: 700, 701 & 702).

In the real world, especially in scientific publishing, many articles and the like are, most of the time and whether or not it corresponds to the truth, the fruit of joint efforts and co-writing by two or more colleagues.

I was about to install it because I thought it would resolve this for good, but then I saw this “documentation” from you:

2) Schema XML

(…). AuthorStr is no longer populated by Author, but instead by the first Author value using the author-modifications script.

Well, and writing to the entire team, not directly at Ronan (by all means, I do truly appreciate all your patches and contributions, you’re doing a fantastic job; I wish I had a small sample of your programming skills [I have some, but not so much in PHP > breaking loose of MS dependency]): keep in mind that the Author facet, sorting and the list of authors retrieved in autocomplete all come from the AuthorStr value. Taking, for instance and randomly :) , the example of the Author facet, just one of the authors is displayed (amongst same-level ones), so in fact it’s not possible to filter by a certain author, because he/she is not present in the facet, just because he/she comes from a family whose name is on the wrong side of the A to Z sort world…

Ironies apart, I’m aware that in terms of Solr it is not so simple to turn it into a multivalued field. But for all the purposes mentioned above, since each author would have their own AuthorStr for a given record (one that has several “main” authors, or “contributors” as many OAI-PMH sources and formats call them), they would be displayed as a unique entry, not an array. Take for instance the example of “format” (multiValued="true") – and even “language” is defined as multivalued.

I am surely missing something here, or else you wouldn’t have taken that option (or rather, maintained it); apologies for that. But whatever the implications are, I think we should, if it has not already been thought of in VF2.0, find ways to overcome it. If not, people will point fingers at a VuFind install and say, “hey, why is Xyz not being shown in the Author facet?”, if they aren’t doing so already.

Thanks,

Filipe

PS: I’m aware of <copyField source="author2" dest="author2Str"/> and of <copyField source="author_additional" dest="author_additionalStr"/>, but

./sys/Recommend/AuthorFacets.php: 'field' => 'authorStr', 'limit' => 10, 'sort' => 'count'

and more like that > none for author2Str or author_additionalStr:

[vuser@iia web]$ grep -i -R author2Str ./
[vuser@iia web]$ grep -i -R author_additionalStr ./
[vuser@iia web]$

= aren’t used anywhere at all! Poor guys! (men and women, authors that also worked hard and aren’t given credit for it :| )

Comment by Ronan McHugh [ 10/Jul/12 ]

Hi Filipe, that is a good point, I hadn't thought of the implications for narrowing searches on the authors. We haven't committed this patch to our local instance yet since it requires some more testing and feedback from our users, but I will make sure to look at that aspect before long. If I can improve it with your suggestions I will commit back here. At the moment I am living in VuFind 2.0 world, which is a very different place altogether.

Comment by Demian Katz [ 10/Jul/12 ]

Filipe brings up a good point that it makes sense to have all author names available for faceting. This will help make the author recommendation module better as well as improving the normal side faceting behavior.

However, there is one place where we can't escape picking a single author: sorting. You can't sort on a multi-valued field. You have to pick one single value to determine the position of the record in the list.

One possible solution would be to maintain the current "main author / secondary authors" system but use copyFields to generate an "allAuthors" multi-valued field. Then you have a single value in the one case where it really matters, but you can grab a pool of values in other situations. The one disadvantage of using copyFields (rather than directly populating) is that you may have less control over the order in which values are loaded into the index... for author lists where people are sometimes sensitive about ranking, that might be a problem... but it's worth experimenting with.
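The trade-off can be sketched in Python as below. Note that this shows the direct-population alternative mentioned above rather than copyFields, because it keeps the order of names explicit. The function name `build_author_fields` is hypothetical; the field names follow the discussion.

```python
def build_author_fields(main_author, secondary_authors):
    """Build the author-related part of one index document."""
    doc = {"author": main_author, "author2": list(secondary_authors)}
    # Exactly one value for sorting, as discussed above.
    doc["authorStr"] = main_author
    # Pool of all names for faceting; main author first, record order kept,
    # duplicates removed without reordering.
    ordered = [main_author] + list(secondary_authors)
    doc["allAuthors"] = list(dict.fromkeys(a for a in ordered if a))
    return doc
```

Sorting would then use authorStr, while side facets and the author recommendation module could read from allAuthors.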

Comment by Ronan McHugh [ 02/Oct/12 ]

Here is a new version which enables faceting based on all primary authors. This was accomplished through the creation of an allAuthors copyField which handles faceting, while authorStr is retained as a single value to allow sorting. Thanks to Demian and Filipe for the feedback and suggestions!

Comment by Demian Katz [ 05/Oct/12 ]

A few comments/questions related to this latest patch:

1.) Have you tested the implications for citations and exporting? Those areas of the code didn't appear to be touched -- I think they use a combination of getPrimaryAuthor and getSecondaryAuthors; if PrimaryAuthor + SecondaryAuthors != AllAuthors, these changes might cause some names to get dropped in those places.

2.) Did you check that new template strings are present in language files? For example, I think "Primary Authors" is probably new. And on that subject, is it worth counting the list so we can use "Primary Author" for the common single-author case and only pull out "Primary Authors" when necessary?

3.) .ini loading has become more complicated because of a need to account for different layouts in VuFind 1.x and 2.x. It might be worth adding some public methods to the compiled SolrMarc VuFindIndexer class so that Beanshell scripts don't need to reinvent the wheel when loading configurations. I'll see if I can get something like this into the next SolrMarc release. (No action needed on your part right now... just making a note about the issue for future reference).
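For point 2 above, both checks are small. The sketch below is in Python for illustration (the real change would live in the templates and language files); the helper names and the dictionary standing in for a language file are assumptions.

```python
def author_label(authors):
    """Pick the singular label for the common single-author case."""
    return "Primary Author" if len(authors) == 1 else "Primary Authors"


def missing_keys(required, translations):
    """List template strings that have no entry in a language file."""
    return sorted(k for k in required if k not in translations)


# Stand-in for one parsed language file (key = translation).
language_file = {"Primary Author": "Hauptautor"}
needed = ["Primary Author", "Primary Authors"]
```

Here `missing_keys(needed, language_file)` reports "Primary Authors" as untranslated, which is the kind of gap point 2 asks about.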