[VUFIND-542] Improvements to Author indexing Created: 29/Mar/12  Updated: 18/Mar/16  Resolved: 18/Mar/16

Status: Resolved
Project: VuFind®
Components: Import Tools, Search
Affects versions: None
Fix versions: 3.0

Type: Improvement Priority: Minor
Reporter: Ronan McHugh Assignee: Demian Katz
Resolution: Fixed Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: NLI - Multiple Author Facets.png (PNG), author-mod 02-10-12.patch, author-mod 10-04-12.patch, author-mod 10-10-12.patch, authorInitials.patch

 Description   
Over the past while, users and librarians here at NLI have given us some feedback about problems with author searches in our Vufind instance. Eoghan has asked me to summarise these problems and suggest some solutions in order to kickstart a discussion about how to improve author search in Vufind. Since this would involve relatively core changes to the way that Vufind does search, we'd prefer to have some feedback from other developers before working on our own solution.

Summary of issues:

1) At present only Main Authors - Personal Name (MARC 100) are indexed in the author field in Solr. Since a MARC record permits only one main author, second authors are relegated to the 700 field and thus to the author2 field in Solr. The 700 field (Added Entry - Personal Name) is also used for other contributors such as illustrators, donors etc. This additional relationship information is typically recorded in subfield $e, although second authors will not receive an entry in $e. This means that second authors do not receive query boosting and are effectively ranked the same in results as donors, illustrators etc. Similarly, where main authors are Corporate Names or Meeting Names (MARC 110, 111), they are indexed as author2 in Solr instead of author. This problem also carries over into faceting: since only main authors are used in faceting, it is not possible to facet by Corporate Name or second author.

2) When searching for authors, users who enter only the initial for the first name, e.g. "Lee, J." for Joseph Lee will not receive any results. This is because Solr doesn't have any tokens for the initials.

 
Suggested Solutions:

Add 110, 111 to author in marc.properties. This will have the effect of weighting corporate authors / meetings on the same level as personal names.
 
A beanshell script could be written to distinguish between different types of 700 field entries, e.g.:
- When 700$e is blank or contains a value denoting authorship, index in the Solr author field

- When 700$e contains a value denoting contribution (e.g. illustrator), index as author2

- When 700$e contains other values not related to authorship (e.g. donor), don't index as an author but possibly index elsewhere

This would require making author multi-valued which presumably would have a knock-on effect for both PHP logic and Smarty templates, and would require tweaking the search weightings. The script could use the LOC relator terms/codes [1] as a basis, but should be able to lookup a user-specified list of terms/codes too.

 A .bsh script or Solr regex script could be written to do some additional processing of names (e.g. Lee, Joseph -> Lee + J) and index the results in a new Solr field or in author_additional.
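
The name processing suggested above could look roughly like the sketch below. It is written in Python purely for illustration (the proposal itself is for a .bsh or Solr regex script); the function name `surname_with_initials` and the exact output format are assumptions, not part of any patch.

```python
def surname_with_initials(name: str) -> str:
    """Reduce an inverted name such as "Lee, Joseph" to "Lee J".

    Hypothetical helper: the output format is an assumption made for
    illustration, not the behaviour of any attached patch.
    """
    surname, sep, forenames = name.partition(",")
    if not sep:
        # No comma, so this is not an inverted personal name; leave it alone.
        return name.strip()
    # str.split() never yields empty strings, so part[0] is always safe.
    initials = [part[0].upper() for part in forenames.split() if part[0].isalpha()]
    return " ".join([surname.strip()] + initials)
```

If the same reduction were applied to the indexed value and to the query, "Lee, Joseph" and "Lee, J." would both become "Lee J" and so could match each other.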
 

Suggestions and comments welcome.



 Comments   
Comment by Tod Olson [ 29/Mar/12 ]
Making author multi-valued may have a side effect on sort by author. Currently the author sort uses authorStr, which is a field copy from author. Maybe that will work just fine when sorting records with multiple authors; maybe the author sort will require some modification.
Comment by Demian Katz [ 29/Mar/12 ]
You're right -- some modification will be needed; I don't think Solr will let you copyField a multi-valued field to a non-multi-valued field. We would have to change some indexing rules (and possibly add a new sort-specific field) to make this work.
Comment by Ronan McHugh [ 30/Mar/12 ]
Here are some screenshots from a quick test I made with some sample data. The only PHP error I am getting is as follows:

Warning: urlencode() expects parameter 1 to be string, array given in C:\vufind-test\web\RecordDrivers\IndexRecord.php on line 710

I guess that it won't be difficult to change the function to accept parameters from an array, but presumably this will have knock on effects elsewhere.

Otherwise:

As can be seen from the screenshots, the Author field is "Array" in List view. No surprise there. I guess wherever that is defined would need to be modified to check for how many authors there are and return "X and Y" for multiple authors. This will mean that the clickthrough will be for both the authors.

Searching for author works fine, even after I deleted the cache. I hadn't expected this to work, but perhaps there is an obvious reason? Likewise with a faceted search with author as a facet.

Checking my Solr Schema Browser, it seems that authorStr was populated correctly, so I suppose this is why Author search works but PHP doesn't display Author correctly.

Apologies if there's anything obvious I'm missing here, I'm still getting to grips with the system.
Comment by Ronan McHugh [ 10/Apr/12 ]
Author Modifications patch

This patch is a practical implementation of some of the ideas discussed in [VUFIND-542]. Namely it enables multiple first authors by modifying the relevant php and display templates and adding a configurable beanshell script to determine authorship based on user-supplied parameters.

1) Beanshell script

The role of this script is to allow a more nuanced determination of authorship and second authorship based on a MARC record. The script references a file /web/conf/author-classification.ini that classifies the LOC relator codes according to their creative role, firstAuthor, secondAuthor or nonCreative. This allows the system administrator to alter the fields defined as first or second author.
The script is called from marc_local.properties. The administrator calls the getAuthors method and passes in the list of fields to be searched (e.g. 100abcd:110abc) for this particular author type (first or second author). Finally they pass in the type of author field currently being populated, either Author, secondAuthor or AuthorStr.
The getAuthors method checks all the desired fields of each record to see 1) if they are populated 2) how they are populated. If the fields are 100,110 or 111, they are automatically considered to be first authors, unless the administrator wants 110 or 111 fields to be considered as second authors.

If the field is 700 and there is no relator information, the field is considered to be an author. If the field is 700 and there is relator information, the script compares the relator information given to the relator classifications in author-classification.ini. It assigns the field a value based on the result of this comparison.
NOTE - if the administrator does not want 700s with a blank relator field to be automatically considered authors, they can simply not pass 700s into the script from marc_local.properties.
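
The decision logic described above can be sketched as follows. This is a Python illustration, not the BeanShell from the patch. The three section names come from the description; the INI layout, the sample relator codes, the function names and the `None` result for unlisted relators are all assumptions.

```python
import configparser

# Assumed layout for author-classification.ini: one section per role,
# one relator code per key. The real file in the patch may differ.
SAMPLE_INI = """
[firstAuthor]
aut = Author
[secondAuthor]
ill = Illustrator
edt = Editor
[nonCreative]
dnr = Donor
"""

def load_roles(ini_text):
    """Map each relator code to the section (role) that lists it."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    roles = {}
    for section in parser.sections():
        for code in parser[section]:  # configparser lower-cases keys
            roles[code] = section
    return roles

def parse_field_spec(spec):
    """Split a spec such as '100abcd:110abc' into {tag: subfield codes}."""
    return {part[:3]: part[3:] for part in spec.split(":") if part}

def classify(tag, relator, roles, first_author_tags=("100", "110", "111")):
    """Return 'firstAuthor', 'secondAuthor', 'nonCreative', or None."""
    if tag in first_author_tags:
        return "firstAuthor"
    if tag == "700":
        if not relator or not relator.strip():
            # A 700 with no relator is treated as an author.
            return "firstAuthor"
        # Assumption: relators missing from the INI are left unclassified.
        return roles.get(relator.strip().rstrip(".").lower())
    return None
```

Passing a shorter `first_author_tags` tuple is one hypothetical way to model an administrator who wants 110 or 111 treated as something other than first authors.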

2) Schema XML

The solr biblio import schema had to be altered to allow this patch to work. Author is now multivalued. AuthorStr is no longer populated by Author, but instead by the first Author value using the author-modifications script.

3) Internal PHP and display templates

Several PHP files and related tpl files had to be modified to allow for multiple author values. In IndexRecord.php the getPrimaryAuthor method was replaced with getPrimaryAuthors in most cases. The getPrimaryAuthor method has been left in place, but it now returns the first element from getPrimaryAuthors. Most calls have been updated to getPrimaryAuthors(); the exception is the getOpenURL method, which still uses getPrimaryAuthor.
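
The relationship between the two accessors can be sketched like this. It is a Python stand-in for the PHP methods; the class name `IndexRecordSketch`, the dictionary-based record and the empty-string fallback for records without authors are assumptions made for illustration.

```python
class IndexRecordSketch:
    """Hypothetical stand-in for the record driver, holding Solr fields."""

    def __init__(self, fields):
        self.fields = fields

    def get_primary_authors(self):
        """Always return a list, whatever shape the stored value has."""
        value = self.fields.get("author") or []
        if isinstance(value, str):
            # Tolerate records indexed while author was still single-valued.
            return [value]
        return list(value)

    def get_primary_author(self):
        """Return the first primary author, or '' when there is none."""
        authors = self.get_primary_authors()
        return authors[0] if authors else ""
```

Keeping the single-value method as a thin wrapper means older callers keep working while templates move over to the list form.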

The relevant tpl files have been updated to display multiple authors (core.tpl,listentry.tpl,result.tpl). One error remains with the similar items module /record/view.tpl whereby a trailing space after each author is displayed.
Comment by Ronan McHugh [ 29/Jun/12 ]
This is also relevant to a user complaint we have received about title searching. When users make a title search, VuFind returns results based on author name. This is because VuFind includes the value of the 245c field in the index of title_full. To rectify this, Solr should be prevented from indexing the 245c field as part of title_full. In the below example, title searching for murphy report will return records with title report and author murphy.

http://catalogue.nli.ie/Search/Results?lookfor=murphy+report&type=Title&submit=FIND

Our fix:
in marc.properties:

author_additional = 505r:245c
title_full = 245abfghknps, first
Comment by Filipe MS Bento [ 09/Jul/12 ]
Dear Ronan (et alii),

When I saw this ticket of yours I thought: finally, a solution that removes the barrier of not being able to have more than one Main Author (main intellectual responsibility). The VuFind Solr biblio schema imposes this limit, I suspect because it was originally designed as a pure OPAC 2.0 with the OPAC as the sole source of records (pure speculation on my part, with no facts to prove it). Add to that the odd MARC rule that the main author field is not repeatable: one author must always be elected as the main one, and the others are relegated to the less honorable position of co-authors (in UNIMARC, at least, they sit near each other: 700, 701 and 702).

In the real world, especially in scientific publishing, many articles and similar works are, whether or not it reflects the truth, the fruit of joint effort and co-writing by two or more colleagues.
 
I was about to install it because I thought it would resolve this for good, but then I saw this “documentation” from you:

  2) Schema XML

  (…). AuthorStr is no longer populated by Author, but instead by the first Author value using the author-modifications script.

I am writing this to the entire team, not just to Ronan (by all means, I truly appreciate all your patches and contributions; you’re doing a fantastic job, and I wish I had a small sample of your programming skills [I have some, but not so much in PHP > breaking loose of MS dependency]). Bear in mind that sorting, the list of authors retrieved in autocomplete and, most importantly, the Author facet all come from the AuthorStr value. Taking, for instance and at random :) , the example of the Author facet: only one of the authors is displayed (among those at the same level), so in fact it is not possible to filter by a certain author, because he/she is not present in the facet, simply because he/she comes from a family whose name is on the wrong side of the A to Z sort world…

Irony aside, I’m aware that in Solr terms turning it into a multivalued field is not so simple. But for all the purposes mentioned above, each author of a given record (one that has several “main” authors, or “contributors” as many OAI-PMH sources and formats call them) would have his/her own AuthorStr value, so each would be displayed as a unique entry, not an array. Take for instance “format” (multiValued="true"); even “language” is defined as multivalued.

I am surely missing something here, or else you wouldn’t have taken that option (or rather, kept it); apologies for that. But whatever the implications are, I think we should look for ways to overcome them, if that has not already been considered for VF2.0. Otherwise people will point fingers at their VuFind install and say, “hey, why is Xyz not being shown in the Author facet?”, if they aren’t doing so already.

Thanks,

Filipe
 
PS: I’m aware of <copyField source="author2" dest="author2Str"/> and of <copyField source="author_additional" dest="author_additionalStr"/>, but

./sys/Recommend/AuthorFacets.php: 'field' => 'authorStr', 'limit' => 10, 'sort' => 'count'

and more like that > none for author2Str or author_additionalStr:

[vuser@iia web]$ grep -i -R author2Str ./
[vuser@iia web]$ grep -i -R author_additionalStr ./
[vuser@iia web]$

= aren’t used anywhere at all! Poor guys! (men and women, authors that also worked hard and aren’t given credit for it :| )
Comment by Ronan McHugh [ 10/Jul/12 ]
Hi Filipe, that is a good point; I hadn't thought of the implications for narrowing searches on the authors. We haven't committed this patch to our local instance yet since it requires some more testing and feedback from our users, but I will make sure to look at that aspect before long. If I can improve it with your suggestions I will commit back here. At the moment I am living in Vufind 2.0 world, which is a very different place altogether.
Comment by Demian Katz [ 10/Jul/12 ]
Filipe brings up a good point that it makes sense to have all author names available for faceting. This will help make the author recommendation module better as well as improving the normal side faceting behavior.

However, there is one place where we can't escape picking a single author: sorting. You can't sort on a multi-valued field. You have to pick one single value to determine the position of the record in the list.

One possible solution would be to maintain the current "main author / secondary authors" system but use copyFields to generate an "allAuthors" multi-valued field. Then you have a single value in the one case where it really matters, but you can grab a pool of values in other situations. The one disadvantage of using copyFields (rather than directly populating) is that you may have less control over the order in which values are loaded into the index... for author lists where people are sometimes sensitive about ranking, that might be a problem... but it's worth experimenting with.
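
As a rough illustration of the field layout proposed above, the sketch below builds the author fields of one Solr document in Python. It populates allAuthors directly rather than through copyFields, which is the variant that keeps control over value order. The function name is hypothetical, and the field names are taken from this discussion rather than from any committed schema.

```python
def build_author_fields(main_author, secondary_authors):
    """Return the author-related fields for one Solr document (sketch)."""
    doc = {
        "author": main_author,               # single-valued, boosted in search
        "author2": list(secondary_authors),  # multi-valued
        "authorStr": main_author,            # single value, so still sortable
    }
    # Pool every name for faceting: main author first, then the secondary
    # authors in their original order, skipping blanks and duplicates.
    pool = []
    for name in [main_author, *secondary_authors]:
        if name and name not in pool:
            pool.append(name)
    doc["allAuthors"] = pool
    return doc
```

Sorting a result list then only ever reads `authorStr`, while faceting and the author recommendation module could read `allAuthors`.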
Comment by Ronan McHugh [ 02/Oct/12 ]
Here is a new version which enables faceting based on all primary authors. This was accomplished through the creation of an allAuthors copyField which handles faceting, while authorStr is retained as a single value to allow sorting. Thanks to Demian and Filipe for the feedback and suggestions!
Comment by Demian Katz [ 05/Oct/12 ]
A few comments/questions related to this latest patch:

1.) Have you tested the implications for citations and exporting? Those areas of the code didn't appear to be touched -- I think they use a combination of getPrimaryAuthor and getSecondaryAuthors; if PrimaryAuthor + SecondaryAuthors != AllAuthors, these changes might cause some names to get dropped in those places.

2.) Did you check that new template strings are present in language files? For example, I think "Primary Authors" is probably new. And on that subject, is it worth counting the list so we can use "Primary Author" for the common single-author case and only pull out "Primary Authors" when necessary?

3.) .ini loading has become more complicated because of a need to account for different layouts in VuFind 1.x and 2.x. It might be worth adding some public methods to the compiled SolrMarc VuFindIndexer class so that Beanshell scripts don't need to reinvent the wheel when loading configurations. I'll see if I can get something like this into the next SolrMarc release. (No action needed on your part right now... just making a note about the issue for future reference).
Comment by Ronan McHugh [ 08/Oct/12 ]
A new patch has been uploaded, incorporating changes to the tpls (Main Author vs Main Authors) and en.ini, plus some tidying up around the place. Citation seems to be working fine for me, but I couldn't fully test Export since I don't have EndNote and their website is down.
Comment by Demian Katz [ 08/Oct/12 ]
Thanks, Ronan. If you need to test EndNote without access to the application itself, you can also use the Zotero Firefox plugin -- I believe it still supports the same import format.
Comment by Ronan McHugh [ 08/Oct/12 ]
Hi Demian, I tried that and it seems like only the first author is getting exported as author. It looks like the export-endnote tpl is a pretty old bit of code. The entire marc record is passed to the tpl which then does all the parsing of the fields. So the author field is coming straight from the 100a. If I were to apply the changes here, I think I would need to replicate the entire logic from my beanshell script in smarty form which doesn't really seem worth the hassle. Do you think it's okay if I just leave it?
Comment by Demian Katz [ 08/Oct/12 ]
Ronan, you are correct -- I forgot how old that code was in 1.x. It's all been rewritten in 2.0, so we can worry about this when porting the patch forward.
Comment by Ronan McHugh [ 08/Oct/12 ]
Great, are you thinking of committing this or just leaving it as a patch?
Comment by Demian Katz [ 08/Oct/12 ]
I think it makes sense to commit this at some point -- it's a more flexible model than the current configuration. I haven't decided whether to make it part of 1.4 or leave it a patch for 1.x and worry about an official implementation for 2.0 only, though. Depends on time and demand. If it would make your life easier to have it in trunk, I'll certainly try to take that into consideration.
Comment by Ronan McHugh [ 08/Oct/12 ]
Just noticed a little bug in my getPrimaryAuthor method that would cause an error if there were no authors for a record.
Comment by Ronan McHugh [ 08/Oct/12 ]
Demian, we were just discussing how this would affect the presentation of corporate authors on record pages. The method at present doesn't distinguish between corporate authors and human authors; both types will be shown as either Main Authors or contributors. Therefore, the "Corporate Author" field on the record page is redundant for us. We have decided to comment it out in our trunk to prevent duplication, but were wondering if you had any thoughts about dealing with it in the trunk? If users still want to distinguish between corporate authors and human authors, they can do so by editing the fields supplied to the bsh in marc.properties. But if users don't do this, they will be presented with duplicate information based on our modifications to the import rules.
Comment by Demian Katz [ 08/Oct/12 ]
If I had to take a guess, I would speculate that the majority of users aren't going to care about corporate authorship getting its own distinct label in the record view... but there might be a vocal minority that needs this. It might be worth polling the mailing lists about this to see how people feel (getting the cataloger's perspective would be useful), but I would be inclined to let the corporate author field display go away in the default trunk setup, as long as we provide instructions on how to get it back as a custom option in case somebody really needs it.
Comment by Ronan McHugh [ 11/Oct/12 ]
Here is a bsh script that will process the supplied author fields and return initials in several forms. For example, Yeats, William Butler, will be indexed as "w b y wb wby", Duncan-Smith, Iain, will be indexed as "i d s id ids". International Labour Organisation will be indexed as "i l o ilo". The aim of this is to ensure that users get results when they search for initials without spaces, e.g. "wb yeats" or "ilo".
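
The sketch below reproduces the three examples from this comment in Python. The rule it uses is inferred from those examples alone and is an assumption; the attached authorInitials.patch may work differently, and the function name is hypothetical.

```python
import re

def initials_tokens(name: str) -> str:
    """Build a space-separated string of initials tokens for one name.

    Assumed rule, inferred from the three examples in the comment:
    - inverted personal names ("Surname, Forenames") are reordered to
      forenames then surname, and hyphenated parts are split;
    - every initial is emitted on its own;
    - personal names also get each run of leading initials of length 2
      or more ("wb", "wby"); other names get only the full run ("ilo").
    """
    name = name.strip().rstrip(",")
    personal = "," in name
    if personal:
        surname, _, forenames = name.partition(",")
        words = re.split(r"[\s\-]+", forenames.strip())
        words += re.split(r"[\s\-]+", surname.strip())
    else:
        words = re.split(r"[\s\-]+", name)
    letters = [w[0].lower() for w in words if w and w[0].isalpha()]
    tokens = list(letters)
    if personal:
        for end in range(2, len(letters) + 1):
            tokens.append("".join(letters[:end]))
    elif len(letters) > 1:
        tokens.append("".join(letters))
    return " ".join(tokens)
```

With this rule, `initials_tokens("Yeats, William Butler")` gives `"w b y wb wby"`, `initials_tokens("Duncan-Smith, Iain")` gives `"i d s id ids"`, and `initials_tokens("International Labour Organisation")` gives `"i l o ilo"`.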
Comment by Demian Katz [ 18/Feb/15 ]
Another issue to consider as part of comprehensive author field redesign:

Right now, author_additional is used only for table of contents authors; these names are searched but never displayed from Solr (since TOC display is handled by direct MARC processing). This seems inelegant; should we rename author_additional to author-toc, change the way the field is used, or do nothing?
Comment by Demian Katz [ 27/Apr/15 ]
A VuFind 2 port of this code is in progress in this pull request:

https://github.com/vufind-org/vufind/pull/354

I've significantly reworked the BeanShell code from the original patch to make it simpler and more generic. Functions have been renamed, and the basic idea here is that, rather than caring about semantic meanings of particular MARC tags, and instead of having built-in concepts of primary/secondary authors, the revised code simply filters the author results using a couple of parameters: the tags which may be included if no relator is present, and a whitelist of relators to allow when a relator value is found.
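
The two-parameter filter described above can be sketched as follows. This is a Python illustration of the idea only; the function name, argument names and relator normalisation are assumptions, and the actual BeanShell functions in the pull request are not reproduced here.

```python
def keep_author(tag, relators, tags_without_relator, relator_whitelist):
    """Decide whether one MARC author field should be indexed (sketch).

    tags_without_relator: tags accepted when the field carries no relator.
    relator_whitelist: lower-case relator values accepted when one is present.
    """
    # Normalise relators: trim, drop trailing punctuation, lower-case.
    cleaned = [r.strip().rstrip(".,").lower() for r in relators if r and r.strip()]
    if not cleaned:
        # No relator: the tag alone decides.
        return tag in tags_without_relator
    # Relator present: at least one must be whitelisted.
    return any(r in relator_whitelist for r in cleaned)
```

Under this sketch a primary-author field and a secondary-author field would differ only in the arguments supplied, for example `keep_author("700", ["ill."], {"100", "700"}, {"aut", "author"})` returns `False` while the same call with an empty relator list returns `True`.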

I'm sure there's still room to further improve this code, but I feel this is a step in the right direction.

I've also made some schema adjustments which are similar (but not identical) to the ones proposed in the older patch. The net result is just about the same, but I've also taken the liberty of eliminating some unused author fields to simplify matters.

There's a lot of work still to be done on the PHP side -- watch the PR for progress there.

I also haven't done anything with the initials patch yet. That feels like a separate (and smaller) issue, so I think I'll work through the primary patch before I worry too much about that one.
Comment by Demian Katz [ 11/Jan/16 ]
Just an update to note that all functionality from this ticket (including the author initials) is now implemented in the pull request. We still need to do some review before merging to master (not to mention minting a new SolrMarc release), but this is great progress!