VuFind / VUFIND-926

speed up OAI-PMH results XSLT transformation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.2
    • Component/s: Import Tools, OAI
    • Labels:
      None

      Description

      OAI harvesting and import are slow. The reason is that the harvested records are cut up into one XML file per record, and when an XSLT transformation is run on each file, the XSLT processor pays a JVM startup penalty for every record.

      This could be improved by
      a) not cutting up the harvested record batch (the batch is "naturally" grouped into pages separated by resumption tokens given by the OAI provider) and running the XSLT transformation on the whole batch; or by
      b) applying some solution that eliminates repeated JVM startup (one such solution is Nailgun, but there might be a better solution).
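
The per-record startup cost described above can be estimated with a small timing experiment. The sketch below is only an analogy: it uses the Python interpreter as a stand-in for the XSLT processor, and the function names are made up for this example. It compares starting a fresh process for each record with starting one process that loops over all records.

```python
import subprocess
import sys
import time


def time_separate_processes(n: int) -> float:
    """Start a fresh interpreter n times, once per (imaginary) record."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_single_process(n: int) -> float:
    """Start one interpreter that loops over all n (imaginary) records."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"for _ in range({n}): pass"], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 20
    print(f"{n} separate processes: {time_separate_processes(n):.3f}s")
    print(f"one process, {n} iterations: {time_single_process(n):.3f}s")
```

The absolute numbers depend on the machine and the runtime; the point is only that the first figure grows with the number of records while the second stays close to a single startup.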

        Activity

        Demian Katz added a comment -
        Actually, there is no JVM involved at all in the OAI harvesting and XML import -- it's all PHP. Of course, I'm sure there's still a startup penalty, but I don't think it's as severe as for a Java app.

        The current division of labor was a design decision to make VuFind's tools more flexible -- following the "Unix pipeline" model of doing small, simple tasks that can be chained together. We have one tool that imports XML files and doesn't care where they come from. We have another tool that harvests via OAI-PMH to produce XML files. We use shell scripts to glue them together for batch operations.

        An importer that loads entire batches of OAI-PMH records might be a bit more efficient, but it would require moving significant chunks of logic from the harvester into the importer (all the record manipulation stuff for inserting/normalizing IDs, etc.) and would also require some workflow changes (currently, with each record in its own file, it's easy to isolate errors since problem records are always processed on their own). I'm not completely opposed to this idea, but I think it makes things a little more complicated, and I'm not sure if the benefits would outweigh the costs.

        If PHP startup costs are really the problem here, another possibility worth considering is rewriting the current bash scripts/batch files that do batch processing of XML records as PHP. The import-xsl.php script could take a "batch" switch or something so it knows it needs to process multiple records, and then it could be responsible for looping through all files in a directory, etc., etc. This might be beneficial in terms of reducing the amount of script maintenance we need to do, since there would be less Linux/Windows-specific code to manage. Not really sure if the performance implications would be significant, though.
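
As a rough illustration of the batch idea above, here is a sketch in Python rather than PHP. It is not VuFind code: `transform_record` and `process_directory` are hypothetical names, and the "transformation" is a stand-in for a real XSLT step. It shows one process looping over every XML file in a directory while still handling each record on its own, so a bad record is reported without stopping the batch.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def transform_record(xml_text: str) -> dict:
    """Hypothetical stand-in for the XSLT step: pull out an id and a title."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    return {"id": root.findtext("id"), "title": root.findtext("title")}


def process_directory(directory: Path):
    """Process every *.xml file in one run, isolating failures per file."""
    imported, failed = [], []
    for path in sorted(directory.glob("*.xml")):
        try:
            imported.append(transform_record(path.read_text(encoding="utf-8")))
        except ET.ParseError as exc:
            # One bad record does not abort the rest of the batch.
            failed.append((path.name, str(exc)))
    return imported, failed


def demo():
    """Build a throwaway directory with two good records and one broken one."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "a.xml").write_text(
            "<record><id>1</id><title>One</title></record>", encoding="utf-8")
        (base / "b.xml").write_text("<record><id>2</id>", encoding="utf-8")  # malformed
        (base / "c.xml").write_text(
            "<record><id>3</id><title>Three</title></record>", encoding="utf-8")
        return process_directory(base)


if __name__ == "__main__":
    imported, failed = demo()
    print(len(imported), "imported;", len(failed), "failed")
```

Keeping one file per record and catching errors per file preserves the error isolation mentioned earlier, while paying the interpreter startup cost only once.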

        If you're interested in experimenting with any of this, I'd certainly be happy to talk further!
        Demian Katz added a comment -
        Starting with release 2.2, there is a "combine records" setting in the OAI-PMH harvester which can retrieve batches of records and group them together. This can be combined with custom XSLT to achieve faster import.
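
To illustrate what grouping records means, here is a minimal Python sketch; it is not the harvester's implementation. The function name `combine_records` and the `collection` wrapper element are assumptions made for the example. Individual record fragments are placed under one root element so that a single transformation pass can see the whole batch.

```python
import xml.etree.ElementTree as ET


def combine_records(record_xml_strings, wrapper_tag="collection"):
    """Wrap individual record documents in a single root element.

    The wrapper tag name is an assumption for this example.
    """
    root = ET.Element(wrapper_tag)
    for xml_text in record_xml_strings:
        root.append(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    batch = combine_records([
        "<record><id>1</id></record>",
        "<record><id>2</id></record>",
    ])
    print(batch)
```

A custom XSLT used with such a file would need to match the wrapper element and iterate over the records inside it, rather than expecting a single record at the document root.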

        The default behavior of VuFind remains the one-record-at-a-time approach, and there may be some value in modifying the defaults for greater performance. I'm open to suggestions in this area. However, since this ticket has been open a long time without further discussion and new tools now exist to address the needs, I am going to close it for now. Feel free to open a new one if specific problems remain.

          People

          • Assignee: Unassigned
          • Reporter: Ivan Masár
          • Votes: 1
          • Watchers: 2
