VuFind / VUFIND-600

Investigate using Tika instead of Aperture for Full Text

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4, 2.0RC1
    • Component/s: Import Tools
    • Labels: None

      Description

      The Aperture library that VuFind uses for full text indexing is no longer being developed. It sounds like Apache Tika is the logical successor. There is a tool here which implements Aperture-like crawler functionality on top of Tika:

      http://leechcrawler.github.com/leech/

      We should investigate replacing Aperture with something that is still being maintained before the project disappears entirely.

      Attachments

      1. tika_xslt_09-10-12.patch (9 kB, Demian Katz)
      2. tika_xslt_25-09-12.patch (8 kB, Ronan McHugh)
      3. tika_xslt.patch (6 kB, Ronan McHugh)
      4. tika.patch (6 kB, Ronan McHugh)
      5. TikaURLContentReader.java (10 kB, Guenter Hipler)

          Activity

          Guenter Hipler added a comment -
          Integration of Tika as an enrichment step for document processing (as done in the swissbib.ch project):

          - Java code is called from XSLT.
          - Once a document is parsed, we store the parsed content in a DB, so we don't have to fetch the content a second time if the document is processed repeatedly. This avoids difficulties with repositories that provide a lot of content, where requests against the repository can be heavy, and it makes the whole document process faster.
          - Configuration lets us exclude repositories temporarily.
          - We only fetch links in bibliographic records that we know provide a minimum level of quality. This is done via configuration.
          - All of this configuration is provided via a type we call ConfigurationContainer.
          - Special content might be stored in Lucene indexes, so we can activate a connection to them via configuration, if desired.
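
The caching and exclusion behaviour described in the list above could be sketched roughly as follows. This is not the swissbib.ch code: the class name, the in-memory map standing in for the DB, the host-based exclusion set, and the injected parser function (standing in for the Tika call) are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Sketch only: caches parsed full text per URL and skips excluded repositories. */
class CachingFullTextFetcher {
    // Stands in for the DB that stores already parsed content (assumption).
    private final Map<String, String> store = new HashMap<>();
    // Hypothetical configuration value: hosts that are temporarily excluded.
    private final Set<String> excludedHosts;
    // Stands in for the actual Tika call, which is not shown here.
    private final Function<String, String> parser;
    // Counts real parser invocations so the effect of the cache is visible.
    int parserCalls = 0;

    CachingFullTextFetcher(Set<String> excludedHosts, Function<String, String> parser) {
        this.excludedHosts = excludedHosts;
        this.parser = parser;
    }

    /** Returns cached or freshly parsed text; empty string for excluded or host-less URLs. */
    String fetch(String url) {
        String host = URI.create(url).getHost();
        if (host == null || excludedHosts.contains(host)) {
            return "";
        }
        String cached = store.get(url);
        if (cached != null) {
            return cached; // no second request against the repository
        }
        parserCalls++;
        String text = parser.apply(url);
        store.put(url, text);
        return text;
    }
}
```

With this shape, fetching the same URL twice calls the parser only once, and a URL on an excluded host never reaches the parser at all.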

          We use SolrJ to fetch content from a Lucene index. SolrJ caused problems in conjunction with Tika because the SLF4J versions they used were inconsistent, at least in the versions we have used so far. We had to adjust one line in the pom.xml of the Maven package
          ./tika-app/pom.xml:

             <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
                <!-- Adjustment GH: compatible with SOLR(J) 3.x and SOLR(J) 4.x
                  <version>1.5.6</version> -->
                <version>1.6.1</version>
                <scope>provided</scope>
             </dependency>

          Günter
          Ronan McHugh (Inactive) added a comment -
          This patch contains a new BeanShell (bsh) script for harvesting records with Tika, based on the previous getFullText script. Instead of creating a temporary file and parsing it, the script collects the output of the Tika command into a string, which is returned to Solr. Note that we have observed some problems with incorrect character encodings and do not currently know how to solve them.
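
A rough Java sketch of the "capture the command output as a string" approach described above. It is not the attached patch: the class and method names and the jar path are hypothetical, and the `-t` (plain text) and `-eUTF-8` (output encoding) flags are given as I understand them from tika-app's help output, so check them against your Tika version. Decoding the bytes explicitly as UTF-8 rather than with the platform default is one common way to avoid encoding problems; whether the updated patch took this route is not stated here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Sketch only: runs an external Tika command and returns its output as a string. */
class TikaCommandSketch {
    /** Builds the command line; tikaJar is a hypothetical path to tika-app. */
    static List<String> buildCommand(String tikaJar, String url) {
        // Assumed flags: -t for plain text output, -eUTF-8 for UTF-8 output encoding.
        return List.of("java", "-jar", tikaJar, "-t", "-eUTF-8", url);
    }

    /** Reads the whole stream, then decodes it as UTF-8 in a single step. */
    static String readAll(InputStream in) {
        try {
            // Decoding all bytes at once avoids splitting multi-byte characters.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the command and returns its standard output; no temp file is involved. */
    static String harvest(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore Tika's log output
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = readAll(out);
        }
        process.waitFor();
        return text;
    }
}
```

A caller would combine the two steps, for example `TikaCommandSketch.harvest(TikaCommandSketch.buildCommand("tika-app.jar", url))`, and hand the resulting string to the indexer.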
          Ronan McHugh (Inactive) added a comment -
          Updated patch to preserve correct character encodings. Thanks, Demian!
          Ronan McHugh (Inactive) added a comment -
          Another patch, this one enabling the use of Tika in XSLT parsing. A generic harvestWithParser method has been added to import/xslt/Vufind.php; it calls either harvestWithTika or harvestWithAperture depending on the parser settings in fulltext.ini.
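
The dispatch described above might look roughly like this. The real method lives in PHP (import/xslt/Vufind.php); this Java sketch only reuses the three method names from the comment. The class name, the setting values "tika" and "aperture", the empty-string fallback, and the injected harvester functions are assumptions for illustration.

```java
import java.util.Locale;
import java.util.function.Function;

/** Sketch only: picks a full-text harvester based on a configured parser name. */
class FullTextDispatcher {
    // Stand-ins for the real Tika and Aperture harvesting logic (assumptions).
    private final Function<String, String> tika;
    private final Function<String, String> aperture;

    FullTextDispatcher(Function<String, String> tika, Function<String, String> aperture) {
        this.tika = tika;
        this.aperture = aperture;
    }

    /** parserSetting is the value read from configuration; the values are hypothetical. */
    String harvestWithParser(String url, String parserSetting) {
        String parser = parserSetting == null
                ? ""
                : parserSetting.trim().toLowerCase(Locale.ROOT);
        switch (parser) {
            case "tika":
                return harvestWithTika(url);
            case "aperture":
                return harvestWithAperture(url);
            default:
                return ""; // no parser configured: contribute no full text
        }
    }

    String harvestWithTika(String url) {
        return tika.apply(url);
    }

    String harvestWithAperture(String url) {
        return aperture.apply(url);
    }
}
```

Keeping the choice in one generic method means callers never need to know which parser is configured.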
          Ronan McHugh (Inactive) added a comment -
          Updated version to improve modularity and compatibility with updates to VuFindSitemap.php.
          Demian Katz added a comment -
          I've revised the latest XSLT patch slightly (style fixes, expanded comments, made some things more explicit); see tika_xslt_09-10-12.patch. This version has been committed to trunk as r5965.
          Ronan McHugh (Inactive) added a comment -
          Looks good!
          Demian Katz added a comment -
          I have updated your SolrMarc import code to more closely match the XSLT version (i.e. one BeanShell script with both Tika and Aperture methods inside; picking which method to call based on configuration). I have committed the updated BeanShell as well as a compiled Java version of the same logic as r1663 of the SolrMarc trunk. These enhancements will make it into VuFind after the next official SolrMarc release; we can close this ticket at the same time as VUFIND-693.
          Demian Katz added a comment -
          Resolved as of SolrMarc 2.5 update in r6238.

            People

            • Assignee: Demian Katz
            • Reporter: Demian Katz
            • Votes: 1
            • Watchers: 5
