VuFind / VUFIND-1330

Multiple Tika processes are spawned & hanging while indexing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.4
    • Fix Version/s: 5.1.1
    • Component/s: Import Tools
    • Labels:
    • Environment:
      Ubuntu 16.04 LTS, tika-app-1.20.jar

      Description

      When importing the attached record, Tika seems to hang, and with it the entire import process (harvest/batch-import-marc.sh).

      Running the Tika command directly from the terminal finishes within a second. In my case the command is `java -jar /usr/local/vufind/tika/tika.jar -t -eUTF8 https://www.muenzfunde.ch/downloads/bulletins/ifs_bulletin_2008.pdf`. The exact same command shows up multiple times in the process list while the import hangs.

      Any ideas how to debug this further?

      Originally I tried to import a rather large XML file with ~20k records; I traced this record by searching for the muenzfunde.ch URL.

        Activity

        Tod Olson added a comment (edited):
        I've looked a little at the code and have done some searching around. Some buffering madness seems likely. The "About Runtime.exec()" article[1] on Stack Overflow sums up what I've found:
        1. use `ProcessBuilder`[2] instead of `Runtime.exec()`, and
        2. implement all of the suggestions in "When Runtime.exec() won't work".[3]

        [1] https://stackoverflow.com/tags/runtime.exec/info
        [2] https://docs.oracle.com/javase/7/docs/api/java/lang/ProcessBuilder.html
        [3] https://www.javaworld.com/article/2071275/when-runtime-exec---won-t.html
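
As a rough illustration of suggestion 1, here is a minimal Java sketch that launches a child process with `ProcessBuilder` so that its stderr never sits unread in a pipe. The class and method names (`StdoutOnlyRunner`, `run`) are hypothetical, and this is not VuFind's actual import code. It assumes the caller only needs the child's stdout, as with Tika's `-t` text output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper for illustration; not part of VuFind or Tika. */
final class StdoutOnlyRunner {

    /**
     * Runs the command and returns everything it wrote to stdout.
     * The child's stderr is sent to the parent's stderr instead of a pipe,
     * so the child cannot block on a full, unread stderr buffer.
     */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // INHERIT is available since Java 7; Redirect.DISCARD (Java 9+) would drop stderr entirely.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // We send no input, so close stdin to give the child an immediate EOF.
        process.getOutputStream().close();

        // Drain stdout completely before waiting for the exit code.
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try (InputStream stdout = process.getInputStream()) {
            int n;
            while ((n = stdout.read(buffer)) != -1) {
                collected.write(buffer, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with status " + exitCode);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Called with the Tika command from the description as a list of arguments, `run` would return the extracted text while Tika's log messages pass straight through to the parent's stderr.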
        Demian Katz added a comment:
        Thanks, Tod, this was very helpful! I've put together a pull request which (at least on my system) seems to resolve the problem:

        https://github.com/vufind-org/vufind/pull/1366

        The issue was, as I had already sort of suspected, that the increased output to stderr in newer versions of Tika was causing the spawned process to choke. However, the simplistic loop-based solution I was originally trying to use to work around that wasn't helping -- the suggestions in [3] led me to the idea of creating a Thread to handle the error output, and that does seem to do the trick.

        I'm a little nervous about this solution, though, since Threads make everything more complicated, and I'm not an expert. I'm not sure, for example, whether there might be dire consequences to using this solution in combination with running multiple indexing threads in SolrMarc.

        I'd welcome further feedback from Tod on whether or not this looks like a sane solution... and of course I'd love to hear if this fully solves Simon's problem.
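
The thread-based approach described in this comment can be sketched roughly as follows. This is not the code from the pull request; it is a minimal illustration under stated assumptions, with hypothetical names (`StderrDrainingRunner`, `Result`, `copy`). The idea is that a helper thread empties the child's stderr pipe while the calling thread reads stdout, so neither pipe can fill up and stall the child.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper for illustration; not the code merged into VuFind. */
final class StderrDrainingRunner {

    /** Simple holder for what the child produced. */
    static final class Result {
        final String stdout;
        final String stderr;
        final int exitCode;

        Result(String stdout, String stderr, int exitCode) {
            this.stdout = stdout;
            this.stderr = stderr;
            this.exitCode = exitCode;
        }
    }

    static Result run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).start();
        process.getOutputStream().close(); // no input for the child

        // Drain stderr on a separate thread so it is consumed concurrently with stdout.
        ByteArrayOutputStream errBytes = new ByteArrayOutputStream();
        Thread errDrainer = new Thread(
                () -> copy(process.getErrorStream(), errBytes), "stderr-drainer");
        errDrainer.setDaemon(true);
        errDrainer.start();

        // Read stdout on the calling thread.
        ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
        copy(process.getInputStream(), outBytes);

        int exitCode = process.waitFor();
        errDrainer.join(); // after join(), everything the drainer wrote is visible here

        return new Result(
                new String(outBytes.toByteArray(), StandardCharsets.UTF_8),
                new String(errBytes.toByteArray(), StandardCharsets.UTF_8),
                exitCode);
    }

    /** Copies a stream until EOF, then closes it. */
    private static void copy(InputStream in, ByteArrayOutputStream out) {
        byte[] buffer = new byte[8192];
        try (InputStream source = in) {
            int n;
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // The stream was closed early; keep whatever was read so far.
        }
    }
}
```

Each call creates its own drainer thread and its own buffers, so in this sketch nothing is shared between concurrent callers. Whether that holds for the real fix alongside multiple SolrMarc indexing threads is exactly the open question raised above.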
        Tod Olson added a comment:
        I consulted further with someone who does Java all the time. You could do it with an async logger, with the caveat that some stderr data could be lost.

        Appenders: https://logging.apache.org/log4j/2.x/manual/appenders.html
        AsyncAppender: https://logging.apache.org/log4j/2.x/manual/async.html

        I can't say where the tradeoff falls in this instance.
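
To make the data-loss caveat concrete, here is a small Java sketch of one way an asynchronous hand-off can lose lines: a bounded queue whose producer never blocks. This is not Log4j's implementation or configuration; the class `LossyLineBuffer` and its methods are hypothetical and use only the JDK. It illustrates the tradeoff, not the recommended setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of a non-blocking, bounded hand-off; not Log4j code. */
final class LossyLineBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicInteger dropped = new AtomicInteger();

    LossyLineBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Hands a line to the consumer side without ever blocking the producer.
     * When the queue is full the line is discarded and counted as dropped.
     */
    boolean offer(String line) {
        boolean accepted = queue.offer(line);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Removes and returns all lines currently waiting, in arrival order. */
    List<String> drain() {
        List<String> lines = new ArrayList<>();
        queue.drainTo(lines);
        return lines;
    }

    /** Number of lines discarded because the queue was full. */
    int dropped() {
        return dropped.get();
    }
}
```

The producer stays fast because it never waits, at the cost of discarding lines whenever the consumer falls behind. A blocking hand-off would keep every line but could stall the producer instead.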
        Demian Katz added a comment:
        Tod, are you proposing that some changes should be made to my pull request, or are you suggesting that if performance problems are encountered, adjusting the Log4j configuration might resolve them? Just want to be sure I understand what my next step should be. In any case, thanks for the input!
        Demian Katz added a comment:
        Since the 5.1.1 release is next week, I decided to go ahead and do some more thorough testing of my fix. It seems to work in multi-threaded mode with multiple problem records, so I'm going to go ahead and merge it. If further adjustments are needed, we can address that in 6.0. Thanks, everyone, and please let me know if you run into any new problems!

          People

          • Assignee: Unassigned
          • Reporter: Simon Hohl
          • Votes: 0
          • Watchers: 3
