[VUFIND-274] Support for fulltext indexing Created: 21/May/10  Updated: 17/Jun/11  Resolved: 17/Jun/11

Status: Resolved
Project: VuFind®
Components: Import Tools
Affects versions: Wishlist
Fix versions: 1.2

Type: New Feature Priority: Minor
Reporter: Till Kinstler Assignee: Till Kinstler
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified
Environment: SolrMarc, Solr, VuFind (searchspecs.yaml)

Attachments: fulltext.ini, getFulltext.bsh, getFulltext_old.bsh

 Description   
Add a BeanShell script to SolrMarc that handles indexing of full-text documents linked from MARC records (in 856$u).
The script will
- pull the content of URLs in MARC 856$u (perhaps not all 856$u URLs, but only those that match some criteria)
- extract text using Aperture (http://aperture.sourceforge.net/)
- return that text
SolrMarc can then add the extracted text to a fulltext field in the Solr index. VuFind will search this field through additional configuration in searchspecs.yaml.

 Comments   
Comment by Demian Katz [ 21/Oct/10 ]
As of r3064, I have added fulltext fields to the Solr schema and searchspecs.yaml. I will be working on tools to populate these fields via the XSLT indexer (since this will be useful for harvesting open journal/digital library content, which will frequently have full text available). We still need to work on dealing with this from the MARC perspective, and we may still want to discuss some tuning options (e.g. field weights in searchspecs.yaml, Solr options regarding indexed words per document).
Comment by Demian Katz [ 25/Oct/10 ]
As of r3066, I have added hooks that allow the XSLT indexer to take advantage of Aperture. I don't think we want to bundle Aperture with VuFind by default (it's huge and has cryptographic components that might cause export problems), so I have added a web/conf/fulltext.ini that allows the user to configure the location of an external Aperture installation in either Windows or Linux. If we eventually add full text support to SolrMarc as well, perhaps we can leverage this file, adding some new settings as needed.
Comment by Eoghan Ó Carragáin [ 11/Jun/11 ]
The attached getFulltext.bsh is a first pass at porting the current XSLT indexer fulltext process to a BeanShell script for use with SolrMarc.

The script can be called by adding the following line to marc.properties:

fulltext = script(getFulltext.bsh), getFulltext

The harvestWithAperture function tries to match its namesake from /import/xsl/vufind.php as closely as possible. I had to use org.apache.commons.lang.StringEscapeUtils.unescapeHtml to mimic html_entity_decode. You will need the org.apache.commons.lang library to use the script as is. Does anyone know a standard Java alternative that won't introduce a new dependency?

getFulltext.bsh also gets the Aperture path from /web/conf/fulltext.ini, but I had to make minor changes to that file to allow it to be read with the Properties class. Amended version attached.

Some possible improvements:
- allow the user to specify a tag or an array of tags from marc.properties. The script is currently hardcoded to look at each 856u occurrence
- allow the user to specify filter criteria for URLs, e.g. only those that end in .pdf
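
The URL filter suggested above could look roughly like the following. This is a minimal sketch, not the attached script: the class name UrlExtensionFilter and the method filterByExtension are hypothetical, and it assumes the filter is a plain list of file extensions matched case-insensitively against the URL path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of getFulltext.bsh.
class UrlExtensionFilter {
    /** Returns only the URLs whose path ends in one of the given extensions. */
    static List<String> filterByExtension(List<String> urls, List<String> extensions) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            // Drop any query string or fragment before checking the suffix.
            String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
            for (String ext : extensions) {
                if (path.endsWith("." + ext.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.org/thesis.pdf",
            "http://example.org/catalog/record/42",
            "http://example.org/paper.PDF?download=1");
        // Keeps the two PDF links and drops the catalog link.
        System.out.println(filterByExtension(urls, Arrays.asList("pdf")));
    }
}
```

Matching on the extension is cheap but only approximate, since a URL without a file extension may still serve a PDF.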

Feedback much appreciated, especially re choice of Java classes/methods and performance.
Comment by Eoghan Ó Carragáin [ 14/Jun/11 ]
Improved patch (getFulltext.bsh). Thanks to Demian for suggestions.

-- Uses ini4j to parse fulltext.ini as it is already packaged with VuFind (however, the changes to fulltext.ini still seem to be necessary)
-- Removes hardcoded Windows paths
-- Uses fieldSpec to allow the user to pass in an array of MARC tags/subfields
-- Uses a Java XML parser rather than a regex to get the contents of the Aperture output. This removes the need for org.apache.commons.lang
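
To illustrate the last point, here is a minimal sketch of reading the extracted text with the JDK's DOM parser. It is not the attached script. It assumes the Aperture output is XML with the text in elements whose local name is plainTextContent. The class and method names are hypothetical, and the namespace URI in the usage note is a placeholder.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper; the element name "plainTextContent" is an assumption.
class ApertureOutputParser {
    static String extractPlainText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // The harvested documents are untrusted, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // "*" matches the element in any namespace.
        NodeList nodes = doc.getElementsByTagNameNS("*", "plainTextContent");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (text.length() > 0) {
                text.append(' ');
            }
            // getTextContent() returns the text with entities already decoded.
            text.append(nodes.item(i).getTextContent().trim());
        }
        return text.toString();
    }
}
```

Because the parser decodes entities such as &amp;amp; itself, passing it a document like `<r:RDF xmlns:r="http://example.org/r#"><r:plainTextContent>Fish &amp;amp; chips</r:plainTextContent></r:RDF>` yields the text with a literal ampersand, which is why no separate unescaping step is needed.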

This script should now be called from marc.properties as follows:
fulltext = script(getFulltext.bsh), getFulltext(856u:530u)
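
For readers unfamiliar with the argument format, the sketch below shows how a spec such as 856u:530u breaks down into tag/subfield pairs. It is illustrative only and is not SolrMarc's own parsing code. The names FieldSpecParser and TagSpec are hypothetical, and it assumes each colon-separated part is a three-character tag followed by one or more subfield codes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the fieldSpec format; not SolrMarc code.
class FieldSpecParser {
    /** One MARC tag plus the subfield codes requested for it. */
    static final class TagSpec {
        final String tag;
        final String subfields;

        TagSpec(String tag, String subfields) {
            this.tag = tag;
            this.subfields = subfields;
        }

        @Override
        public String toString() {
            return tag + "$" + subfields;
        }
    }

    static List<TagSpec> parse(String fieldSpec) {
        List<TagSpec> specs = new ArrayList<>();
        for (String part : fieldSpec.split(":")) {
            String trimmed = part.trim();
            // Skip parts too short to hold a tag and at least one subfield code.
            if (trimmed.length() < 4) {
                continue;
            }
            specs.add(new TagSpec(trimmed.substring(0, 3), trimmed.substring(3)));
        }
        return specs;
    }

    public static void main(String[] args) {
        System.out.println(parse("856u:530u"));
    }
}
```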
Comment by Demian Katz [ 17/Jun/11 ]
I've committed a modified version of Eoghan's script as r4015. Main changes:

- Minor style cleanup
- Overloaded getFulltext function to allow default parameter values
- Enabled extension filtering
- Fixed .ini parsing so that code works with existing fulltext.ini file (no additional modifications necessary)
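
The overloading mentioned above follows the usual Java pattern for default arguments, sketched below. This is not the committed code. The default values shown (856u and an empty extension filter) are assumptions, and the MARC record argument that the real function presumably receives from SolrMarc is left out so the example runs on its own.

```java
// Hypothetical sketch of default parameters via overloading; not r4015.
class GetFulltextDefaults {
    static final String DEFAULT_FIELD_SPEC = "856u"; // assumed default
    static final String DEFAULT_EXTENSIONS = "";     // assumed: no filtering

    // No arguments: fall back to both defaults.
    static String getFulltext() {
        return getFulltext(DEFAULT_FIELD_SPEC);
    }

    // Field spec only: fall back to the default extension filter.
    static String getFulltext(String fieldSpec) {
        return getFulltext(fieldSpec, DEFAULT_EXTENSIONS);
    }

    // Full form: every shorter overload ends up here.
    static String getFulltext(String fieldSpec, String extensions) {
        // A real implementation would harvest and return text; this sketch
        // only reports which settings it was called with.
        return "fieldSpec=" + fieldSpec + ", extensions=" + extensions;
    }

    public static void main(String[] args) {
        System.out.println(getFulltext());
        System.out.println(getFulltext("856u:530u", "pdf"));
    }
}
```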

At some point in the future, I will look into creating a compiled version of the script as part of the main SolrMarc distribution.
Comment by Eoghan Ó Carragáin [ 17/Jun/11 ]
Till suggested calling Aperture directly rather than through the *.bat/*.sh file (see http://aperture.sourceforge.net/tutorial/extractors.html).

This might be worth looking into in the future if the current script turns out to have performance issues.
Comment by Demian Katz [ 17/Jun/11 ]
Agreed -- calling Aperture directly is probably a slightly more efficient option... though doing it through the command-line scripts has the advantage of sharing configuration files with the XSLT importer and avoiding the complexity of dealing with Java classpaths (since I don't think we want to bundle Aperture with VuFind by default).