[VUFIND-598] Normalization of Call Numbers Created: 08/Jun/12  Updated: 13/Dec/13  Resolved: 13/Dec/13

Status: Resolved
Project: VuFind®
Components: Import Tools
Affects versions: None
Fix versions: 2.2

Type: Improvement Priority: Trivial
Reporter: Luke O'Sullivan Assignee: Demian Katz
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: PNG File Solr admin page_20130403-140631.png     File callnumber.bsh     File callnumber_normalize.bsh     File callnumber_normalize.bsh     File normalizedCallnumber.patch    

 Description   
The addition of normalised call numbers to the index would allow a more accurate sorting of items by call number.

Bob Haschart has suggested that

"In SolrMarc, in the Utils class, there is a method that Naomi Dushay created that takes an LC call number and transforms it into an expanded version that should be directly sortable. So for a random sampling of records with LC call numbers 'near' PQ239:"

PQ239.Z56
PQ239.H63 2008
PQ239.S62 1982
PQ239.B68 1983
PQ2390.S35 A5
PQ2390.S35 B8 1898
PQ2389 .R65 F3 1854 t.1
PQ239.A7 1969
PQ2.N6 1959
PQ22.A4 D47 1949
PQ238.L57 1985

the expanded sortable version returned by Naomi's routine would be:

PQ 0239.000000 Z0.560000
PQ 0239.000000 H0.630000 002008
PQ 0239.000000 S0.620000 001982
PQ 0239.000000 B0.680000 001983
PQ 2390.000000 S0.350000 A0.500000
PQ 2390.000000 S0.350000 B0.800000 001898
PQ 2389.000000 R0.650000 F0.300000 001854 T.000001
PQ 0002.000000 N0.600000 001959
PQ 0022.000000 A0.400000 D0.470000 001949
PQ 0238.000000 L0.570000 001985

Which is easily sorted to produce the desired ordering.

I don't know whether there is an existing "standard" indexing function that invokes the CallNumUtils.getLCShelfkey() method that produces the above expanded strings, but it could be easily added as a custom method, or as a scripted method.
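As a quick check of the sorting claim, the following Java sketch sorts the expanded keys from the sample above using ordinary string comparison. The class name ShelfKeySortDemo is made up for this illustration, and the keys are copied from the list above rather than produced by calling SolrMarc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: this class is hypothetical and not part of SolrMarc or VuFind.
class ShelfKeySortDemo {
    // Expanded keys copied verbatim from the sample above, in their original order.
    static final List<String> KEYS = Arrays.asList(
        "PQ 0239.000000 Z0.560000",
        "PQ 0239.000000 H0.630000 002008",
        "PQ 0239.000000 S0.620000 001982",
        "PQ 0239.000000 B0.680000 001983",
        "PQ 2390.000000 S0.350000 A0.500000",
        "PQ 2390.000000 S0.350000 B0.800000 001898",
        "PQ 2389.000000 R0.650000 F0.300000 001854 T.000001",
        "PQ 0002.000000 N0.600000 001959",
        "PQ 0022.000000 A0.400000 D0.470000 001949",
        "PQ 0238.000000 L0.570000 001985"
    );

    // Returns a copy of the keys sorted by plain String ordering.
    static List<String> sorted() {
        List<String> copy = new ArrayList<>(KEYS);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        for (String key : sorted()) {
            System.out.println(key);
        }
    }
}
```

Because the class number is zero-padded to a fixed width, the sorted list runs through classes 2, 22, 238, 239, 2389 and 2390 in numeric order, with the cutters in alphabetical order inside PQ 0239.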

If user-entered call numbers were also normalised, it would be possible to perform ranged searches on call numbers. It has been suggested that a custom Solr plugin might achieve the latter.


 Comments   
Comment by Nathan Tallman [ 12/Jun/12 ]
Would this also work for non-LC numbers? We're an archive so our call numbers are collection numbers, e.g. MS-1, MS-2, etc. Right now, in the alpha-browse, these are not sorted correctly because there is no normalization, e.g. MS-1, MS-10, MS-100, MS-101, etc.
Comment by Demian Katz [ 12/Jun/12 ]
Without trying it, I'm not sure how the existing normalization code would handle your call numbers, but if all of your call numbers are of the form prefix-dash-number, it should be possible to write a custom normalizer as a BeanShell script fairly easily. I can help if you're interested.
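To make the prefix-dash-number idea concrete, here is a minimal Java sketch of the string logic only. Everything in it is an assumption for illustration: the class name CollectionNumberNormalizer, the six-digit padding width, and the choice to return null for values that do not match the pattern. Wiring it into a SolrMarc BeanShell script is not shown.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of VuFind or SolrMarc.
class CollectionNumberNormalizer {
    // Letters, optional dashes or spaces, then digits, e.g. "MS-1" or "ms 12".
    private static final Pattern FORM =
        Pattern.compile("^\\s*([A-Za-z]+)[-\\s]*(\\d+)\\s*$");

    // Assumed width; it should exceed the longest number in the collection.
    private static final int WIDTH = 6;

    // Returns e.g. "MS 000010" for "MS-10", or null when the input does not fit the pattern.
    static String normalize(String callNumber) {
        if (callNumber == null) {
            return null;
        }
        Matcher m = FORM.matcher(callNumber);
        if (!m.matches()) {
            return null;
        }
        // Drop leading zeros (keeping at least one digit) so "MS-007" and "MS-7" agree.
        String digits = m.group(2).replaceFirst("^0+(?=\\d)", "");
        StringBuilder sb = new StringBuilder(m.group(1).toUpperCase(Locale.ROOT));
        sb.append(' ');
        for (int i = digits.length(); i < WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }
}
```

With this padding, the key for MS-2 sorts before the key for MS-10, which is the ordering the unnormalized strings get wrong.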
Comment by Luke O'Sullivan [ 18/Jun/12 ]
This patch will create a normalized LCC callnumber in the callnumber-norm_str dynamic field.

It can be used by uncommenting the field in marc_local.properties and switching the callnumber sort options in searches.ini.
Comment by Luke O'Sullivan [ 18/Jun/12 ]
Initial Normalized Callnumber Patch
Comment by Alan Rykhus [ 24/Jan/13 ]
I have discovered two issues with this patch.

1. When there is no call number in the record, the record is not indexed. In the bsh script, after calling:

  fieldSpec= indexer.getFirstFieldVal(record, fieldSpec);

You need to check whether the value of fieldSpec is null. If it is, do not call getLCShelfkey: it throws an exception and the record is not indexed.

2. Some records I've tried to index have unusual call numbers consisting only of digits. I'm not sure why, but getLCShelfkey throws an exception on these too. It might be that getLCstartLetters returns null, so getLCShelfkey throws the exception. These records are getting their call numbers locally, not from the 099, 090, or 050 fields.

The problem with the exceptions is that the record is not indexed. If getLCShelfkey returned null instead, at least the record would be indexed without the call number.
Comment by Alan Rykhus [ 24/Jan/13 ]
New suggested normalization routines
Comment by Demian Katz [ 25/Jan/13 ]
Would it help to surround the getLCShelfkey call with a try..catch? That way you could convert exceptions to nulls in your routine without having to change the way getLCShelfkey currently behaves.
Comment by Alan Rykhus [ 25/Jan/13 ]
I can agree with that. Then if there is some other error I haven't seen yet, at least the record is still in the database.

A revised function.
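For reference, the null check plus try..catch approach discussed above can be sketched in plain Java as follows. This is not the attached script: SafeShelfKey and shelfKeyOrNull are hypothetical names, and the shelfKeyFn parameter is a stand-in for the real getLCShelfkey call so that the example does not depend on SolrMarc.

```java
import java.util.function.UnaryOperator;

// Hypothetical wrapper, not the attached .bsh script.
class SafeShelfKey {
    // shelfKeyFn stands in for the real normalization call (getLCShelfkey in the script).
    // Returns null instead of throwing, so the caller can index the record
    // without a normalized call number.
    static String shelfKeyOrNull(String rawCallNumber, UnaryOperator<String> shelfKeyFn) {
        if (rawCallNumber == null) {
            return null; // the record has no call number at all
        }
        try {
            return shelfKeyFn.apply(rawCallNumber);
        } catch (RuntimeException e) {
            // Covers the NullPointerException case and any other unchecked failure.
            return null;
        }
    }
}
```

Catching RuntimeException rather than only NullPointerException is a deliberate trade-off here: it keeps the record in the index for failures that have not been seen yet, at the cost of hiding them unless they are logged.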
Comment by Luke O'Sullivan [ 20/Mar/13 ]
See VUFIND-657 also

It is possible that Solr 3.6+ might resolve the issues with ranged searches:
http://wiki.apache.org/solr/MultitermQueryAnalysis
Comment by Luke O'Sullivan [ 03/Apr/13 ]
I have tried this with VuFind 2.0, and the ranged search is still not working as expected.

callnumber-normalized=[DS+TO+FE] will correctly list items with callnumbers between DS and FE, starting with DT* and finishing with FD*

callnumber-normalized=[DS763+TO+FE] incorrectly starts at DT* and finishes with FD*

(See VUFIND-657)
Comment by Demian Katz [ 03/Apr/13 ]
Have you tried using debugQuery=true to see if that offers any clues to exactly what is going on? Is the problem that because of the normalization, you would have to put a space in "DS763" to match properly? (Obviously if that's the problem, it means that Solr itself is not normalizing input within range queries... but maybe it can be configured to do so using something similar to the technique discussed on VUFIND-172).
Comment by Luke O'Sullivan [ 03/Apr/13 ]
The analyzer suggests that matches are being made correctly...
Comment by Demian Katz [ 03/Apr/13 ]
The analyzer is showing correct matching, but if the range query processing bypasses analysis and uses only raw strings (which I suspect may be the case), then that doesn't help. That's what VUFIND-172 is about: the fact that some queries don't get analyzed, and that Solr 3.6 adds a mechanism for adding analysis to some of those queries (specifically wildcards, but perhaps ranges can be involved too -- I haven't done the research yet).
Comment by Luke O'Sullivan [ 03/Apr/13 ]
A search for callnumber-normalized:["ds+762"+TO+fe] also returns incorrect results
Comment by Demian Katz [ 03/Apr/13 ]
Is it case sensitive? What about callnumber-normalized:["DS+762"+TO+FE] ?
Comment by Luke O'Sullivan [ 03/Apr/13 ]
Here's the debug info:

<lst name="debug"><str name="rawquerystring">*:*</str><str name="querystring">*:*</str><str name="parsedquery">MatchAllDocsQuery(*:*)</str><str name="parsedquery_toString">*:*</str><lst name="explain"><str name="462874">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="156924">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="342087">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="528905">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="432930">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="70111">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="289553">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="457315">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="35154">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str><str name="367102">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
</str></lst><str name="QParser">LuceneQParser</str><arr name="filter_queries"><str>callnumber-normalized:["ds 762" TO fe]</str></arr><arr name="parsed_filter_queries"><str>callnumber-normalized:[ds 762 TO fe]</str></arr><lst name="timing"><double name="time">1.0</double><lst name="prepare"><double name="time">0.0</double><lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.SpellCheckComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst></lst><lst name="process"><double name="time">1.0</double><lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.SpellCheckComponent"><double name="time">0.0</double></lst><lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">1.0</double></lst></lst></lst></lst>

Comment by Luke O'Sullivan [ 03/Apr/13 ]
At present, it does appear to be case sensitive when using fq, though a search in the q field is not (I added the lowercase filter factory to the field). That by itself suggests to me that the fq values (at least in ranged searches) are not put through the same transformations as q.
Comment by Demian Katz [ 03/Apr/13 ]
Yes, I think the way this works is that (starting in Solr 3.6) you can configure a special "multiterm" analysis chain in your schema which gets applied to ranges and wildcard queries. Analyzers used in this chain may need to have some special characteristics (possibly related to that MultiTermAwareComponent interface you mentioned). Without multiterm configured, these types of queries are not subject to analysis at all.

So bottom line: upgrade to Solr 3.6+, make your custom analyzer multiterm aware, and adjust your schema accordingly.

Not sure exactly how difficult this will be... but if you're successful and can share results, I think this could help us resolve VUFIND-172.
Comment by Luke O'Sullivan [ 03/Apr/13 ]
I've just confirmed that the ranged search does work if you put a normalised string into the range, i.e. the fq value is not being run through the normalisation process.
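A small Java sketch of why a raw bound behaves differently from a normalised one. It assumes the range bounds are compared to the indexed keys as plain strings, and the keys below are hypothetical values that merely follow the expanded format shown in the description; they were not generated by SolrMarc, and any lowercasing applied by the field is ignored.

```java
// Hypothetical demo class; approximates an unanalyzed range check with String ordering.
class RangeBoundDemo {
    // True when lower <= key <= upper under plain String comparison.
    static boolean inRange(String key, String lower, String upper) {
        return key.compareTo(lower) >= 0 && key.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String ds800 = "DS 0800.000000"; // assumed indexed form of a DS800 item
        String dt1 = "DT 0001.000000";   // assumed indexed form of a DT1 item

        // Raw bound: the space in the key sorts before '7', so every "DS ..." key
        // falls below "DS763" and the range effectively starts at DT.
        System.out.println(inRange(ds800, "DS763", "FE"));
        System.out.println(inRange(dt1, "DS763", "FE"));

        // Normalised bound: the DS800 key is now compared digit by digit.
        System.out.println(inRange(ds800, "DS 0763.000000", "FE"));
    }
}
```

Under these assumptions the first check fails and the other two succeed, which matches the observation that a range starting at DS763 begins with DT* unless the bound itself is normalised.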

Comment by Demian Katz [ 03/Apr/13 ]
Great, that all makes sense then -- hopefully my multiterm solution suggested above can eventually solve the problem.
Comment by Luke O'Sullivan [ 03/Apr/13 ]
See VUFIND-657 for a solution to the range query issue.

The patch attached to this ticket is still valid if all you want to do is sort LCC callnumbers more accurately.
Comment by Nathan Tallman [ 13/Nov/13 ]
This script, when implemented, halts indexing (on a per-record basis) when it hits a record with no 099:090:050. Is it possible to have the script bypass itself when there is no call number present, so that the record will still be indexed?
Comment by Demian Katz [ 13/Nov/13 ]
What is the exact error you are seeing? It doesn't look like it's intentionally failing when the field is missing, but I think it may need to be rewritten slightly to be more error-tolerant. Perhaps adding:

        if (fieldSpec == null) {
            return null;
        }

right after:

        fieldSpec= indexer.getFirstFieldVal(record, fieldSpec);

would help.
Comment by Nathan Tallman [ 13/Nov/13 ]
Below is the error message. Adding what you suggest will probably do the trick.

Nov 13, 2013 12:50:49 AM org.solrmarc.marc.MarcImporter importRecords
SEVERE: Unable to index record vtls000027057 (record count 27582) -- Error while trying to evaluate script: callnumber.bsh
java.lang.IllegalArgumentException: Error while trying to evaluate script: callnumber.bsh
at org.solrmarc.index.SolrIndexer.handleScript(SolrIndexer.java:1170)
at org.solrmarc.index.SolrIndexer.addFieldValueToMap(SolrIndexer.java:914)
at org.solrmarc.index.SolrIndexer.map(SolrIndexer.java:821)
at org.solrmarc.marc.MarcImporter.addToIndex(MarcImporter.java:399)
at org.solrmarc.marc.MarcImporter.importRecords(MarcImporter.java:313)
at org.solrmarc.marc.MarcImporter.handleAll(MarcImporter.java:607)
at org.solrmarc.marc.MarcImporter.main(MarcImporter.java:867)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at com.simontuffs.onejar.Boot.run(Boot.java:334)
at com.simontuffs.onejar.Boot.main(Boot.java:170)
Caused by: Typed variable declaration : Method Invocation CallNumUtils.getLCShelfkey : at Line: 29 : in file: inline evaluation of: ``/** * Custom call number script. * * This can be used to override built-in So . . . '' : CallNumUtils .getLCShelfkey ( fieldSpec , recordID )

Target exception: java.lang.NullPointerException

at bsh.BSHMethodInvocation.eval(Unknown Source)
at bsh.BSHPrimaryExpression.eval(Unknown Source)
at bsh.BSHPrimaryExpression.eval(Unknown Source)
at bsh.BSHVariableDeclarator.eval(Unknown Source)
at bsh.BSHTypedVariableDeclaration.eval(Unknown Source)
at bsh.BSHBlock.evalBlock(Unknown Source)
at bsh.BSHBlock.eval(Unknown Source)
at bsh.BSHBlock.eval(Unknown Source)
at bsh.BSHIfStatement.eval(Unknown Source)
at bsh.BSHBlock.evalBlock(Unknown Source)
at bsh.BSHBlock.eval(Unknown Source)
at bsh.BshMethod.invokeImpl(Unknown Source)
at bsh.BshMethod.invoke(Unknown Source)
at bsh.BshMethod.invoke(Unknown Source)
at org.solrmarc.index.SolrIndexer.handleScript(SolrIndexer.java:1149)
... 12 more
Nov 13, 2013 12:50:49 AM org.solrmarc.marc.MarcImporter importRecords
SEVERE: ******** Halting indexing! ********
Comment by Demian Katz [ 13/Nov/13 ]
Yeah, I think that should help -- if it works, please upload an updated .bsh for future reference! Thanks for your help!
Comment by Nathan Tallman [ 15/Nov/13 ]
Updated callnumber.bsh from patch, updated to handle records without a 099:090:050 field.
Comment by Demian Katz [ 13/Dec/13 ]
I have added the getFullCallNumberNormalized routine to callnumber.bsh and have also committed it to the SolrMarc trunk so that it will be available in the next SolrMarc release (2.7). This isn't currently being used for anything in the default configuration -- it's just available as an option. We should consider the options offered by VUFIND-657 before making schema/default indexing changes.
Generated at Fri Apr 26 07:14:43 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100251-rev:2d0d695520e7095763476433152508933e579798.