[VUFIND-600] Investigate using Tika instead of Aperture for Full Text Created: 12/Jun/12 Updated: 23/Jan/13 Resolved: 23/Jan/13 |
|
Status: | Resolved |
Project: | VuFind® |
Components: | Import Tools |
Affects versions: | None |
Fix versions: | 1.4, 2.0RC1 |
Type: | Bug | Priority: | Major |
Reporter: | Demian Katz | Assignee: | Demian Katz |
Resolution: | Fixed | Votes: | 1 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original estimate: | Not Specified |
Attachments: | TikaURLContentReader.java tika.patch tika_xslt.patch tika_xslt_09-10-12.patch tika_xslt_25-09-12.patch | ||||||||
Issue links: |
|
Description |
The Aperture library that VuFind uses for full text indexing is no longer being developed. It sounds like Apache Tika is the logical successor. There is a tool here which implements Aperture-like crawler functionality on top of Tika: http://leechcrawler.github.com/leech/ We should investigate replacing Aperture with something that is still being maintained before the project disappears entirely. |
Comments |
Comment by guenter.hipler [ 07/Sep/12 ] |
Integration of Tika as enrichment for document processing (as it is done in the swissbib.ch project):
- Java code is called from XSLT.
- Once a document is parsed, we store the parsed content in a DB, so we don't have to fetch the content a second time if the document is processed repeatedly. This avoids, for example, difficulties with repositories that provide a lot of content, because requests against the repository can be heavy, and it makes the whole document process faster.
- Configuration gives us the possibility to exclude repositories temporarily.
- We fetch only links in bibliographic records that we know provide a minimal quality. This is done via configuration.
- All this configuration is provided via a type we call ConfigurationContainer.
- Special content might be stored in Lucene indexes. Therefore we are able to activate a connection via configuration, if desired. We use SolrJ to fetch content from a Lucene index.

SolrJ caused problems in conjunction with Tika because the SLF4J versions used were inconsistent, at least in the versions we have used so far. We had to adjust a line in the pom.xml of the Maven package, ./tika-app/pom.xml:

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <!-- Adjustment GH: compatible with SOLR(J) 3.x and SOLR(J) 4.x
      <version>1.5.6</version>
      -->
      <version>1.6.1</version>
      <scope>provided</scope>
    </dependency>

Günter |
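The caching and exclusion behaviour described in this comment can be sketched roughly as follows. This is only an illustration, not the swissbib.ch code: the class name ParsedContentCache, the in-memory map standing in for the DB, the set of excluded hosts standing in for the ConfigurationContainer settings, and the extractor function standing in for the Tika call are all hypothetical.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: cache extracted text per URL and skip excluded hosts. */
class ParsedContentCache {
    // Stands in for the database of already parsed documents (assumption).
    private final Map<String, String> store = new HashMap<>();
    // Stands in for the "temporarily excluded repositories" configuration (assumption).
    private final Set<String> excludedHosts;
    // Stands in for the real extraction step, e.g. a call into Tika (assumption).
    private final Function<String, String> extractor;

    ParsedContentCache(Set<String> excludedHosts, Function<String, String> extractor) {
        this.excludedHosts = excludedHosts;
        this.extractor = extractor;
    }

    /**
     * Returns the text for a URL, running the extractor only on the first
     * request. URLs on excluded hosts yield an empty string and are never fetched.
     */
    String fullTextFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null || excludedHosts.contains(host)) {
            return "";
        }
        return store.computeIfAbsent(url, extractor);
    }
}
```

With this shape, processing the same record twice triggers only one request against the repository, and a repository can be switched off by configuration without touching the import code.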
Comment by Ronan McHugh [ 07/Sep/12 ] |
This patch contains a new bsh script for harvesting records with Tika based on the previous getFullText script. Instead of creating a tempFile and parsing this, it builds the output of the Tika command into a string which is returned to Solr. Note that we have observed some problems with incorrect character encodings but do not know how to solve these at present. |
Comment by Ronan McHugh [ 07/Sep/12 ] |
Updated the patch to preserve correct character encodings. Thanks, Demian! |
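For illustration, collecting a command's output into a string without a temporary file might look like the sketch below. It is not the attached patch: the class name CommandOutputReader and the choice to decode as UTF-8 are assumptions. Decoding with an explicit charset rather than the platform default is one common way to avoid encoding problems of this kind; the ticket does not say how the patch itself solved them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch: capture a command's standard output as a UTF-8 string. */
class CommandOutputReader {
    /** Drains the stream completely and decodes the bytes as UTF-8 (assumed encoding). */
    static String readUtf8(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // Decode once, after all bytes are read, so multi-byte characters
            // are never split across chunk boundaries.
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the command and returns everything it wrote to standard output. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                // Discard stderr so the child cannot block on an unread pipe.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String output;
        try (InputStream stdout = process.getInputStream()) {
            output = readUtf8(stdout);
        }
        process.waitFor();
        return output;
    }
}
```

A call such as `CommandOutputReader.run(List.of("java", "-jar", "/path/to/tika-app.jar", "--text", "--encoding=UTF-8", url))` would then return the extracted text; the jar path here is a placeholder, and the command line should be checked against the Tika version in use.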
Comment by Ronan McHugh [ 21/Sep/12 ] |
Another patch to enable use of Tika in XSLT parsing. A generic harvestWithParser method has been added to import/xslt/Vufind.php; it calls either harvestWithTika or harvestWithAperture depending on the parser settings in fulltext.ini. |
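The dispatch idea can be sketched as below. The actual patch is PHP, so this Java version only illustrates the pattern: the class name ParserDispatchSketch, the setting values "tika" and "aperture", and the empty-string fallback for an unknown or missing setting are all assumptions rather than details taken from the patch or from fulltext.ini.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: pick a full-text harvester based on a configured parser name. */
class ParserDispatchSketch {
    // Maps a parser setting (assumed values) to the function that harvests a URL.
    private final Map<String, Function<String, String>> parsers;
    // Value that would be read from configuration; may be null if unset.
    private final String configuredParser;

    ParserDispatchSketch(String configuredParser,
                         Function<String, String> harvestWithTika,
                         Function<String, String> harvestWithAperture) {
        this.configuredParser = configuredParser;
        this.parsers = Map.of(
                "tika", harvestWithTika,
                "aperture", harvestWithAperture);
    }

    /** Delegates to the configured harvester; returns "" if none matches (assumption). */
    String harvestWithParser(String url) {
        String key = configuredParser == null
                ? ""
                : configuredParser.trim().toLowerCase(Locale.ROOT);
        Function<String, String> harvester = parsers.get(key);
        return harvester == null ? "" : harvester.apply(url);
    }
}
```

Keeping the choice in one generic method means callers never need to know which extraction tool is installed.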
Comment by Ronan McHugh [ 25/Sep/12 ] |
Updated version to improve modularity and compatibility with updates to VuFindSitemap.php. |
Comment by Demian Katz [ 09/Oct/12 ] |
I've revised the latest XSLT patch slightly (style fixes, expanded comments, made some things more explicit) -- see tika_xslt_09-10-12.patch. This version has been committed to trunk as r5965. |
Comment by Ronan McHugh [ 09/Oct/12 ] |
Looks good! |
Comment by Demian Katz [ 09/Oct/12 ] |
I have updated your SolrMarc import code to more closely match the XSLT version (i.e. one BeanShell script with both Tika and Aperture methods inside; picking which method to call based on configuration). I have committed the updated BeanShell as well as a compiled Java version of the same logic as r1663 of the SolrMarc trunk. These enhancements will make it into VuFind after the next official SolrMarc release; we can close this ticket at the same time as |
Comment by Demian Katz [ 23/Jan/13 ] |
Resolved as of SolrMarc 2.5 update in r6238. |