[VUFIND-926] speed up OAI-PMH results XSLT transformation Created: 29/Oct/13  Updated: 14/Jul/15  Resolved: 14/Jul/15

Status: Resolved
Project: VuFind®
Components: Import Tools, OAI
Affects versions: 2.0
Fix versions: 2.2

Type: Improvement Priority: Major
Reporter: helix84 Assignee: Unassigned
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified


Description
OAI harvesting and import are slow. The reason is that the harvested records are cut up into one XML file per record, and when an XSLT transformation is run on each file, the JVM startup penalty of the XSLT processor is incurred once per record.

This could be improved by
a) not cutting up the harvested record batch (the batch is "naturally" grouped into pages separated by resumption tokens given by the OAI provider) and running the XSLT transformation on the whole batch (see the sketch after this list); or by
b) applying some solution that eliminates repeated JVM startup (one such solution is Nailgun, but there may be a better option).
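
For illustration, here is a minimal PHP sketch of option (a), assuming the per-record files from one page sit together in a harvest directory; the paths, directory layout, and stylesheet name are hypothetical, not VuFind's actual import code:

    <?php
    // Combine all harvested per-record files into one document and run the
    // XSLT once, so processor startup is paid per batch, not per record.
    // Requires PHP's dom and xsl extensions.
    $harvestDir = 'harvest/my_source';              // hypothetical directory
    $collection = new DOMDocument('1.0', 'UTF-8');
    $root = $collection->createElement('collection');
    $collection->appendChild($root);

    foreach (glob($harvestDir . '/*.xml') as $file) {
        $record = new DOMDocument();
        $record->load($file);
        // Copy each record's root element into the combined document:
        $root->appendChild($collection->importNode($record->documentElement, true));
    }

    $xsl = new DOMDocument();
    $xsl->load('import/xsl/my_mapping.xsl');        // hypothetical stylesheet
    $proc = new XSLTProcessor();                    // started once per batch
    $proc->importStylesheet($xsl);
    echo $proc->transformToXML($collection);        // transformed batch output

Note that the stylesheet would have to be written to iterate over multiple records under the wrapper element rather than expecting a single record per document.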

Comments
Comment by Demian Katz [ 30/Oct/13 ]
Actually, there is no JVM involved at all in the OAI harvesting and XML import -- it's all PHP. Of course, I'm sure there's still a startup penalty, but I don't think it's as severe as for a Java app.

The current division of labor was a design decision to make VuFind's tools more flexible -- following the "Unix pipeline" model of doing small, simple tasks that can be chained together. We have one tool that imports XML files and doesn't care where they come from. We have another tool that harvests via OAI-PMH to produce XML files. We use shell scripts to glue them together for batch operations.

An importer that loads entire batches of OAI-PMH records might be a bit more efficient, but it would require moving significant chunks of logic from the harvester into the importer (all the record manipulation stuff for inserting/normalizing IDs, etc.) and would also require some workflow changes (currently, with each record in its own file, it's easy to isolate errors since problem records are always processed on their own). I'm not completely opposed to this idea, but I think it makes things a little more complicated, and I'm not sure if the benefits would outweigh the costs.

If PHP startup costs are really the problem here, another possibility worth considering is rewriting the current bash scripts/batch files that do batch processing of XML records as PHP. The import-xsl.php script could take a "batch" switch or something so it knows it needs to process multiple records, and then it could be responsible for looping through all files in a directory, etc., etc. This might be beneficial in terms of reducing the amount of script maintenance we need to do, since there would be less Linux/Windows-specific code to manage. Not really sure if the performance implications would be significant, though.
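
As a rough sketch of what such a "batch" switch might look like (the argument handling, paths, and loop are assumptions for illustration, not the actual import-xsl.php):

    <?php
    // Hypothetical batch mode: one PHP process loops over every file in a
    // directory, so PHP and stylesheet startup costs are paid once rather
    // than once per record. Each file is still transformed on its own,
    // preserving the per-record error isolation mentioned above.
    $dir = isset($argv[1]) ? $argv[1] : 'harvest/my_source';

    $xsl = new DOMDocument();
    $xsl->load('import/xsl/my_mapping.xsl');    // hypothetical stylesheet
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);              // compiled once, reused below

    foreach (glob($dir . '/*.xml') as $file) {
        $record = new DOMDocument();
        if (!@$record->load($file)) {
            fwrite(STDERR, "Skipping unparseable file: $file\n");
            continue;                           // problem records stay isolated
        }
        $solrXml = $proc->transformToXML($record);
        // ...hand $solrXml off to Solr here...
        echo "Processed $file\n";
    }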

If you're interested in experimenting with any of this, I'd certainly be happy to talk further!
Comment by Demian Katz [ 14/Jul/15 ]
Starting with release 2.2, there is a "combine records" setting in the OAI-PMH harvester which can retrieve batches of records and group them together. This can be combined with custom XSLT to achieve faster import.
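
For reference, enabling this looks roughly like the following in the harvester's oai.ini; the setting names below are as recalled from the 2.2 harvester documentation and should be verified against your installation:

    [my_source]
    url = http://example.com/oai/request
    metadataPrefix = oai_dc
    ; Group each OAI-PMH response page into a single XML file instead of
    ; writing one file per record (setting names assumed -- please verify):
    combineRecords = true
    combineRecordsTag = "<collection>"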

The default behavior of VuFind remains the one-record-at-a-time approach, and there may be some value in modifying the defaults for greater performance. I'm open to suggestions in this area. However, since this ticket has been open a long time without further discussion and new tools now exist to address the needs, I am going to close it for now. Feel free to open a new one if specific problems remain.