Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Wishlist
    • Fix Version/s: 1.2
    • Component/s: Import Tools
    • Labels:
      None
    • Environment:
      SolrMarc, Solr, VuFind (searchspecs.yaml)

      Description

      Add a BeanShell script to SolrMarc that takes care of indexing full-text documents linked from MARC records (in 856$u).
      The script will
      - pull the content of URLs in MARC 856$u (maybe shouldn't take all 856$u URLs, but only those that match some criteria)
      - extract text using Aperture (http://aperture.sourceforge.net/)
      - return that text
      SolrMarc may then add the extracted text to a fulltext field in the Solr index. VuFind will search this additional fulltext field through additional configuration in searchspecs.yaml.
      1. fulltext.ini
        0.7 kB
        Eoghan Ó Carragáin
      2. getFulltext_old.bsh
        5 kB
        Eoghan Ó Carragáin
      3. getFulltext.bsh
        4 kB
        Eoghan Ó Carragáin

        Activity

        Demian Katz added a comment -
        As of r3064, I have added fulltext fields to the Solr schema and searchspecs.yaml. I will be working on tools to populate these fields via the XSLT indexer (since this will be useful for harvesting open journal/digital library content, which will frequently have full text available). We still need to work on dealing with this from the MARC perspective, and we may still want to discuss some tuning options (e.g. field weights in searchspecs.yaml, Solr options regarding indexed words per document, etc.).
        Demian Katz added a comment -
        As of r3066, I have added hooks that allow the XSLT indexer to take advantage of Aperture. I don't think we want to bundle Aperture with VuFind by default (it's huge and has cryptographic components that might cause export problems), so I have added a web/conf/fulltext.ini that allows the user to configure the location of an external Aperture installation in either Windows or Linux. If we eventually add full text support to SolrMarc as well, perhaps we can leverage this file, adding some new settings as needed.
        Eoghan Ó Carragáin added a comment -
        Attached getFulltext.bsh is a first pass at porting the current XSLT indexer fulltext process to a beanshell script for use with SolrMarc.

        The script can be called by adding the following line to marc.properties:

        fulltext = script(getFulltext.bsh), getFulltext

        The harvestWithAperture function tries to match its namesake from /import/xsl/vufind.php as closely as possible. I had to use org.apache.commons.lang.StringEscapeUtils.unescapeHtml to mimic html_entity_decode. You will need the org.apache.commons.lang libraries to use the script as is. Does anyone know a standard Java alternative that won't introduce a new dependency?

        getFulltext.bsh also gets the Aperture path from /web/conf/fulltext.ini, but I had to make minor changes to that file to allow it to be read with the Properties class. Amended version attached.
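
        To illustrate why an INI file may need adjusting before java.util.Properties can read it, here is a minimal sketch. Properties does not understand INI syntax: a section header such as [Aperture] is loaded as an ordinary key, surrounding quotes stay part of the value, and backslashes are treated as escape characters. The class name, the key name "webcrawler", and the sample path are assumptions made for this example; they are not taken from the attached fulltext.ini or from getFulltext.bsh.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of getFulltext.bsh.
final class FulltextIniSketch {

    /**
     * Loads INI-style text with java.util.Properties and returns one setting
     * with any surrounding double quotes removed. Returns null if the key is absent.
     */
    static String readSetting(String iniText, String key) {
        Properties props = new Properties();
        try {
            // A "[Section]" line is loaded as a key with an empty value; it is not
            // treated as a section, so keys from all sections share one namespace.
            props.load(new StringReader(iniText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(key);
        if (value == null) {
            return null;
        }
        value = value.trim();
        // Properties keeps the quotes, so strip them by hand.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        // Sample content and key name are assumptions for illustration only.
        String ini = "[Aperture]\nwebcrawler = \"/opt/aperture/bin/webcrawler.sh\"\n";
        System.out.println(readSetting(ini, "webcrawler"));
    }
}
```

        Because backslashes are escape characters for Properties, a Windows path written with single backslashes would not survive loading unchanged, which is one reason a real INI parser is the safer choice.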

        Some possible improvements:
        - allow the user to specify a tag or an array of tags from marc.properties. The script is currently hardcoded to look at each 856u occurrence
        - allow the user to specify filter criteria for URLs, e.g. only those that end in .pdf

        Feedback much appreciated, especially re choice of Java classes/methods and performance.
        Eoghan Ó Carragáin added a comment - - edited
        Improved patch (getFulltext.bsh). Thanks to Demian for suggestions.

        -- Uses ini4j to parse fulltext.ini as it is already packaged with vufind (however, the changes to fulltext.ini still seem to be necessary)
        -- Removes hardcoded windows paths
        -- Uses fieldSpec to allow user to pass in an array of marc tags/subfields
        -- Uses Java xml parser rather than regex to get contents of Aperture output. This removes the need for org.apache.commons.lang
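
        As a rough illustration of the last point, the sketch below pulls text out of extractor output with the JDK's built-in XML parser. The parser decodes entities such as &amp; itself, so no separate unescaping library is needed. The class name, the element name "plainTextContent", and the sample namespaces are assumptions for this example; the actual getFulltext.bsh may read the output differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not the code committed to VuFind.
final class ApertureOutputSketch {

    /**
     * Returns the text of every element with the given local name,
     * joined by single spaces. Entities are decoded by the parser.
     */
    static String extractText(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // lets us match on local name only
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagNameNS("*", localName);
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(nodes.item(i).getTextContent().trim());
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse extractor output", e);
        }
    }

    public static void main(String[] args) {
        // Sample document with made-up namespaces, for illustration only.
        String xml = "<r:RDF xmlns:r=\"urn:example:rdf\" xmlns:n=\"urn:example:nie\">"
                + "<r:Description><n:plainTextContent>Tom &amp; Jerry</n:plainTextContent>"
                + "</r:Description></r:RDF>";
        System.out.println(extractText(xml, "plainTextContent"));
    }
}
```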

        This script should now be called from marc.properties as follows:
        fulltext = script(getFulltext.bsh), getFulltext(856u:530u)
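
        For readers unfamiliar with the argument format, a specification such as 856u:530u is a colon-separated list in which each entry is a three-character MARC tag followed by subfield codes. The sketch below shows one way such a string could be split. It is illustrative only: the class and method names are hypothetical, and SolrMarc resolves field specifications itself, so the script does not need code like this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of SolrMarc or getFulltext.bsh.
final class FieldSpecSketch {

    /**
     * Splits a spec such as "856u:530u" into {tag, subfields} pairs,
     * e.g. {"856", "u"} and {"530", "u"}. Entries shorter than a
     * three-character tag are skipped.
     */
    static List<String[]> parse(String fieldSpec) {
        List<String[]> result = new ArrayList<>();
        for (String part : fieldSpec.split(":")) {
            String spec = part.trim();
            if (spec.length() < 3) {
                continue; // not a complete MARC tag
            }
            result.add(new String[] { spec.substring(0, 3), spec.substring(3) });
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("856u:530u")) {
            System.out.println("tag=" + pair[0] + " subfields=" + pair[1]);
        }
    }
}
```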
        Demian Katz added a comment - - edited
        I've committed a modified version of Eoghan's script as r4015. Main changes:

        - Minor style cleanup
        - Overloaded getFulltext function to allow default parameter values
        - Enabled extension filtering
        - Fixed .ini parsing so that code works with existing fulltext.ini file (no additional modifications necessary)
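
        The overloading and extension-filtering changes listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code committed in r4015: the class name, method names, the default extension list, and the rule that an empty list disables filtering are all invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the committed script.
final class UrlFilterSketch {

    // Assumed default for illustration; the real script's defaults may differ.
    static final List<String> DEFAULT_EXTENSIONS = Arrays.asList("pdf");

    /** Overload that supplies the default extension list. */
    static List<String> filterUrls(List<String> urls) {
        return filterUrls(urls, DEFAULT_EXTENSIONS);
    }

    /**
     * Keeps only URLs whose path ends in one of the given extensions
     * (case-insensitive, ignoring any query string or fragment).
     * An empty extension list is treated as "no filtering".
     */
    static List<String> filterUrls(List<String> urls, List<String> extensions) {
        if (extensions.isEmpty()) {
            return new ArrayList<>(urls);
        }
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String path = url;
            int cut = path.indexOf('?');
            if (cut >= 0) {
                path = path.substring(0, cut);
            }
            cut = path.indexOf('#');
            if (cut >= 0) {
                path = path.substring(0, cut);
            }
            String lower = path.toLowerCase(Locale.ROOT);
            for (String ext : extensions) {
                if (lower.endsWith("." + ext.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.org/paper.PDF?download=1",
                "http://example.org/landing.html");
        System.out.println(filterUrls(urls));
    }
}
```

        Java and BeanShell have no default parameter values, so a short overload that delegates to the full method is the usual way to offer them.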

        At some point in the future, I will look into creating a compiled version of the script as part of the main SolrMarc distribution.
        Eoghan Ó Carragáin added a comment -
        Till suggested calling Aperture directly rather than through the *.bat/*.sh file (see http://aperture.sourceforge.net/tutorial/extractors.html).

        This might be worth looking into in the future if there are any performance issues with the current script.
        Demian Katz added a comment -
        Agreed -- calling Aperture directly is probably a slightly more efficient option... though doing it through the command-line scripts has the advantage of sharing configuration files with the XSLT importer and avoiding the complexity of dealing with Java class-paths (since I don't think we want to bundle Aperture with VuFind by default).

          People

          • Assignee:
            Till Kinstler
            Reporter:
            Till Kinstler
          • Votes:
            1
            Watchers:
            1