Starting with VuFind 1.0.1, a simple tool is included for harvesting records using the OAI-PMH protocol.
Setting up OAI-PMH
To set up OAI-PMH harvesting, simply edit the oai.ini file in the harvest subdirectory of your VuFind installation (or better still, edit a copy of it inside the harvest subdirectory of your local settings directory).
You can set up one or more OAI-PMH repositories in the configuration – details are included in comments within the file.
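As a rough illustration of the "edit a copy" approach described above, the following sketch copies the stock oai.ini into the local settings directory. It assumes that both the $VUFIND_HOME and $VUFIND_LOCAL_DIR environment variables are set; the function name copy_oai_ini is made up for this example and is not part of VuFind.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with VuFind): copy the stock oai.ini into
# the local settings directory so that local edits are kept separate from
# the distributed file.
# Assumes VUFIND_HOME and VUFIND_LOCAL_DIR are both set.
copy_oai_ini() {
    src="${VUFIND_HOME:?VUFIND_HOME is not set}/harvest/oai.ini"
    dest_dir="${VUFIND_LOCAL_DIR:?VUFIND_LOCAL_DIR is not set}/harvest"

    mkdir -p "$dest_dir" || return 1

    # Never overwrite a local copy that may already contain edits.
    if [ -e "$dest_dir/oai.ini" ]; then
        echo "Local copy already exists: $dest_dir/oai.ini"
        return 0
    fi

    cp "$src" "$dest_dir/oai.ini"
}
```

Calling copy_oai_ini once and then editing the copy under $VUFIND_LOCAL_DIR/harvest should leave the original file untouched.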
Once OAI-PMH is configured, you can follow these steps to get documents from an OAI-PMH repository into your VuFind index:
- Run the harvester by switching to the harvest subdirectory of your VuFind installation and running “php harvest_oai.php”. If you configured multiple repositories and want to harvest from just one, you can pass the name of that repository (the section header used for it in oai.ini) as a parameter to limit your harvesting.
- For each OAI-PMH repository you harvested, a number of files will have been created in a subdirectory of harvest whose name matches the appropriate section of the oai.ini configuration file. This subdirectory will be found under $VUFIND_LOCAL_DIR/harvest or $VUFIND_HOME/harvest depending on whether the $VUFIND_LOCAL_DIR environment variable is set.
- Run the ./batch-delete.sh file (with a harvest subdirectory name as a parameter) to remove any records from your index that have been reported as deleted by the OAI-PMH server.
- Run the ./batch-import-marc.sh file (with a harvest subdirectory name as a parameter) to index all MARC records harvested from an OAI-PMH server. If you are harvesting non-MARC data, you may wish to use ./batch-import-xsl.sh instead – see notes on XSLT above.
- After all deleted and new records have been processed, the records retrieved from the OAI-PMH server will have been moved to a “processed” subdirectory of their containing directory. You can periodically clear out this directory if you no longer feel you need to retain records. However, it may be useful to keep them, since you can always move them back up a directory level and re-run the batch processing scripts in order to reindex everything.
- A “last_harvest.txt” file is created in each OAI-PMH harvest directory to keep track of the most recent harvest. This allows subsequent harvest operations to pick up where previous ones left off. To harvest all records again from the beginning (for example, before a full reindex), you can simply delete this file. Note that it is normal for some duplicate records to be retrieved on subsequent harvests – new harvests overlap slightly with the previous set in order to ensure that nothing is missed.
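The idea of moving processed records back up a directory level, mentioned in the list above, might look something like the following sketch. The function name restore_processed is hypothetical, and the argument is assumed to be the full path of a harvest subdirectory that contains a “processed” directory.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with VuFind): move previously processed
# files back into their harvest directory so that the batch scripts will
# pick them up again on their next run.
# $1 = full path to a harvest subdirectory containing "processed".
restore_processed() {
    dir="$1"

    if [ ! -d "$dir/processed" ]; then
        echo "No processed directory found in $dir" >&2
        return 1
    fi

    for f in "$dir"/processed/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$f" ] || continue
        mv "$f" "$dir/"
    done
}
```

After running this against a harvest subdirectory, re-running the batch processing scripts for that subdirectory should process the restored files again.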
It should be possible to automate this process using a top-level script and cron job in order to do a nightly harvest/index operation.
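A minimal sketch of such a top-level script is shown here. It simply chains the commands from the steps above for one repository and stops at the first failure. It assumes that $VUFIND_HOME is set, that the harvested data is MARC, and that the argument matches a section header in oai.ini; the function name nightly_harvest is invented for this example.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with VuFind): harvest one repository,
# then apply deletions and index the harvested MARC records.
# $1 = repository name, as used for the section header in oai.ini.
# The body runs in a subshell so the caller's working directory is unchanged.
nightly_harvest() (
    repo="$1"
    cd "${VUFIND_HOME:?VUFIND_HOME is not set}/harvest" || exit 1

    # 1. Harvest records from the named repository only.
    php harvest_oai.php "$repo" || exit 1

    # 2. Remove records that the OAI-PMH server reported as deleted.
    ./batch-delete.sh "$repo" || exit 1

    # 3. Index the harvested MARC records.
    ./batch-import-marc.sh "$repo" || exit 1
)
```

To use it, a call such as “nightly_harvest MyRepository” (where MyRepository is a placeholder section name) could be appended to the script, which could then be scheduled from cron. For non-MARC data, step 3 would presumably need to call ./batch-import-xsl.sh with whatever parameters that script requires.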
Processing a large number of MARC files using default settings can be very slow, since records are processed one file at a time. The “combineRecords” and “combineRecordsTag” settings in oai.ini can be used to counteract this problem. These settings were introduced in VuFind 2.4.