[VUFIND-600] Investigate using Tika instead of Aperture for Full Text Created: 12/Jun/12  Updated: 23/Jan/13  Resolved: 23/Jan/13

Status: Resolved
Project: VuFind®
Components: Import Tools
Affects versions: None
Fix versions: 1.4, 2.0RC1

Type: Bug Priority: Major
Reporter: Demian Katz Assignee: Demian Katz
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: TikaURLContentReader.java, tika.patch, tika_xslt.patch, tika_xslt_09-10-12.patch, tika_xslt_25-09-12.patch
Issue links:
incorporate
is incorporated by VUFIND-693 Upgrade to SolrMarc 2.5 Resolved

 Description   
The Aperture library that VuFind uses for full text indexing is no longer being developed. It sounds like Apache Tika is the logical successor. There is a tool here which implements Aperture-like crawler functionality on top of Tika:

http://leechcrawler.github.com/leech/

We should investigate replacing Aperture with something that is still being maintained before the project disappears entirely.

 Comments   
Comment by guenter.hipler [ 07/Sep/12 ]
Integration of Tika as enrichment for document processing (as done in the swissbib.ch project):

- Java code is called from XSLT.
- Once a document is parsed, we store the parsed content in a DB, so we don't have to fetch the content a second time if the document is processed repeatedly. This avoids difficulties with repositories that provide a lot of content, because requests against the repository can be heavy, and it makes the whole document process faster. Configuration also lets us exclude repositories temporarily.
- We fetch only those links in bibliographic records that we know provide a minimal quality. This is done via configuration.
- All this configuration is provided via a type we call ConfigurationContainer.
- Special content might be stored in Lucene indexes, so we can activate a connection to them via configuration, if desired.
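
To illustrate the caching and exclusion points above, here is a minimal Java sketch. It is not the swissbib code: the class name CachedFullTextFetcher, the in-memory map standing in for the DB, the set of excluded hosts, and the extractor function standing in for the Tika call are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: cache parsed full text and skip excluded repositories. */
public class CachedFullTextFetcher {
    // Stand-in for the DB table of already parsed documents (assumption).
    private final Map<String, String> cache = new HashMap<>();
    // Hosts that configuration has temporarily excluded (assumption).
    private final Set<String> excludedHosts;
    // Stand-in for the real fetch-and-parse step, e.g. a call into Tika.
    private final Function<String, String> extractor;
    private int extractorCalls = 0;

    public CachedFullTextFetcher(Set<String> excludedHosts,
                                 Function<String, String> extractor) {
        this.excludedHosts = excludedHosts;
        this.extractor = extractor;
    }

    /** Returns the full text for a URL, or "" if its host is excluded. */
    public String fetch(String url) {
        String host = URI.create(url).getHost();
        if (host == null || excludedHosts.contains(host)) {
            return "";
        }
        String cached = cache.get(url);
        if (cached != null) {
            return cached; // parsed before: no second request to the repository
        }
        extractorCalls++;
        String text = extractor.apply(url);
        if (text == null) {
            text = "";
        }
        cache.put(url, text);
        return text;
    }

    /** Number of times the extractor was actually invoked. */
    public int getExtractorCalls() {
        return extractorCalls;
    }
}
```

A caller would construct it once with the configured exclusions and reuse it for every record, so that reprocessing the same record does not hit the repository again.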

We use SolrJ to fetch content from a Lucene index. SolrJ caused problems in conjunction with Tika because the SLF4J versions they depend on were inconsistent, at least in the versions we have used so far. We had to adjust a line in the pom.xml of the Maven package
./tika-app/pom.xml:

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <!-- Adjustment GH: compatible with SOLR(J) 3.x and SOLR(J) 4.x
        <version>1.5.6</version> -->
      <version>1.6.1</version>
      <scope>provided</scope>
    </dependency>

Günter
Comment by Ronan McHugh [ 07/Sep/12 ]
This patch contains a new bsh script for harvesting records with Tika based on the previous getFullText script. Instead of creating a tempFile and parsing this, it builds the output of the Tika command into a string which is returned to Solr. Note that we have observed some problems with incorrect character encodings but do not know how to solve these at present.
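
For readers without the patch at hand, the general idea of capturing a command's output as a string rather than going through a temp file can be sketched in Java as follows. This is not the attached BeanShell script. The class and method names are hypothetical, the jar path is whatever the caller supplies, and the -t (plain text) and -eUTF-8 (output encoding) flags reflect my understanding of the tika-app command line and should be checked against the installed version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch: run a command and return its stdout as one string. */
public class TikaCommandCapture {

    /** Reads the whole stream as bytes first, then decodes once. */
    public static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Decoding the complete byte array avoids splitting multi-byte characters.
        return new String(buffer.toByteArray(), charset);
    }

    /** Builds the assumed tika-app invocation for one URL. */
    public static List<String> tikaCommand(String jarPath, String url) {
        return List.of("java", "-jar", jarPath, "-t", "-eUTF-8", url);
    }

    /** Runs the command and returns stdout decoded as UTF-8. */
    public static String runAndCapture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send stderr to the parent process so it does not mix into the text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        String output;
        try (InputStream in = process.getInputStream()) {
            output = readAll(in, StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode);
        }
        return output;
    }
}
```

Asking the command for a fixed output encoding and decoding with the same charset on the reading side is one common way to avoid mismatched character encodings; whether that applies to the problem mentioned above is not something this sketch can confirm.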
Comment by Ronan McHugh [ 07/Sep/12 ]
Updated patch to preserve correct character encodings. Thanks Demian!
Comment by Ronan McHugh [ 21/Sep/12 ]
Another patch to enable use of Tika in XSLT parsing. A generic harvestWithParser method has been added to import/xslt/Vufind.php; it calls either harvestWithTika or harvestWithAperture depending on the parser settings in fulltext.ini.
Comment by Ronan McHugh [ 25/Sep/12 ]
Updated version to improve modularity and compatibility with updates to VuFindSitemap.php.
Comment by Demian Katz [ 09/Oct/12 ]
I've revised the latest XSLT patch slightly (style fixes, expanded comments, made some things more explicit) -- see tika_xslt_09-10-12.patch. This version has been committed to trunk as r5965.
Comment by Ronan McHugh [ 09/Oct/12 ]
Looks good!
Comment by Demian Katz [ 09/Oct/12 ]
I have updated your SolrMarc import code to more closely match the XSLT version (i.e. one BeanShell script with both Tika and Aperture methods inside; picking which method to call based on configuration). I have committed the updated BeanShell as well as a compiled Java version of the same logic as r1663 of the SolrMarc trunk. These enhancements will make it into VuFind after the next official SolrMarc release; we can close this ticket at the same time as VUFIND-693.
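
As a rough illustration of the "both methods in one place, pick by configuration" structure described above, here is a minimal Java sketch. It is not the code committed in r1663: the class name, the "parser" settings key, and the placeholder method bodies are assumptions, and the real methods would invoke Tika or Aperture rather than return a tagged string.

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical sketch: choose a full-text parser based on configuration. */
public class FullTextDispatcher {

    public enum Parser { TIKA, APERTURE, NONE }

    /** Maps the (assumed) "parser" setting to a parser choice. */
    public static Parser chooseParser(Properties settings) {
        String value = settings.getProperty("parser", "")
                .trim()
                .toLowerCase(Locale.ROOT);
        switch (value) {
            case "tika":
                return Parser.TIKA;
            case "aperture":
                return Parser.APERTURE;
            default:
                return Parser.NONE; // no parser configured: index without full text
        }
    }

    /** Calls whichever harvesting method the configuration selects. */
    public static String harvest(String url, Properties settings) {
        switch (chooseParser(settings)) {
            case TIKA:
                return harvestWithTika(url);
            case APERTURE:
                return harvestWithAperture(url);
            default:
                return "";
        }
    }

    // Placeholder: a real implementation would run Tika on the URL.
    static String harvestWithTika(String url) {
        return "[tika] " + url;
    }

    // Placeholder: a real implementation would run Aperture on the URL.
    static String harvestWithAperture(String url) {
        return "[aperture] " + url;
    }
}
```

Keeping the choice in a single method means both import paths can fall back to indexing without full text when no parser is configured.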
Comment by Demian Katz [ 23/Jan/13 ]
Resolved as of SolrMarc 2.5 update in r6238.