Video 6: OAI-PMH Server and Harvest Functionality
The sixth VuFind® instructional video provides a brief overview of VuFind's use of the OAI-PMH protocol for sharing and harvesting metadata.
Hello and welcome to this VuFind tutorial video, in which I am going to talk about how VuFind uses the OAI-PMH protocol to both share and receive records.
OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting, a well-supported and widely used method of sharing XML metadata between systems. It supports not just harvesting entire collections of metadata but also doing incremental harvests, so you can get only things that have changed since your prior harvest, and it can also address deleted records so you can find out what has been removed from an upstream system. The protocol always supports Dublin Core metadata, but it can also support any other kind of XML format. The server and client are both able to deal with the same standard.
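A minimal sketch of what full and incremental harvest requests look like as URLs. The verb and argument names (ListRecords, metadataPrefix, from) come from the OAI-PMH specification; the helper name oai_url and the example date are hypothetical, and the base URL is the localhost address used later in this video.

```python
from urllib.parse import urlencode


def oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    # Skip arguments that were not supplied so the URL stays minimal.
    query.update({k: v for k, v in params.items() if v is not None})
    return base_url + "?" + urlencode(query)


# Assumed base URL of a local VuFind OAI-PMH server.
BASE = "http://localhost/vufind/OAI/Server"

# Full harvest: every record in Dublin Core.
full = oai_url(BASE, "ListRecords", metadataPrefix="oai_dc")

# Incremental harvest: only records changed since a given date.
# "from" is a Python keyword, so it is passed through a dict.
incremental = oai_url(
    BASE, "ListRecords", metadataPrefix="oai_dc", **{"from": "2024-01-01"}
)

print(full)
print(incremental)
```

The only difference between the two requests is the from argument, which is what makes an incremental harvest possible.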
First of all I am going to show you how you can turn on VuFind's OAI-PMH server. I'm going to the command line and I'm going to edit my local config.ini file and you'll see that in the default configuration that comes with VuFind the entire [OAI] section is commented out, so by deleting this semicolon and uncommenting the section header I have now activated my OAI-PMH server.
That's all I need to do to turn on the basic functionality, but there are a few things here that I would probably also want to do, like give the repository a name, and you can set a separate administrative email for your OAI server; otherwise it will use the default email address.
There are also some settings related to sets since OAI servers can divide a collection into specific sets. You can use a Solr field like a facet for defining sets or you can specify particular named sets with particular queries associated with them if you want to allow people to harvest specific subsets of your collection, but if you just leave all this stuff commented out then set functionality will be disabled and people will only be able to harvest your entire collection.
There is another important step though that you have to take before you can use OAI-PMH server capabilities in VuFind, and that is to turn on record change tracking because the OAI-PMH protocol needs to know the history of when everything in your system was created or changed so that it can do incremental updates. VuFind needs to track more information at index time so that the server has the information that it needs. By default, VuFind does not track record change information because doing so makes the index process slower, but if you do turn this on you not only get the benefit of being able to use the OAI-PMH server but you also gain access to some other functionality that otherwise won't work including RSS feeds that are sorted based on actual record creation times and the ability to use Solr-based new record searching where you can actually limit your search by how recently records were added to the index.
To turn this on you just need to uncomment a couple of lines in the default marc_local.properties file, so I'm going to bring that up. This is the same file that we've worked on. You can see here near the top there are two lines, first_indexed and last_indexed, and just by uncommenting these I turn on change tracking. The difference between these two fields is that the first_indexed field will contain the date of the first time a particular record ID was indexed into the system and the last_indexed date will contain the most recent time that record changed, so when you index a record for the first time first_indexed and last_indexed will be set, but if that record gets revised over time last_indexed will change to reflect those changes but first_indexed will always stay the same so you know the age of the overall record as well as the date of its most recent change and this is sort of the minimum amount of information needed to implement OAI-PMH.
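The relationship between the two dates can be sketched in a few lines. This is only an illustration of the behavior described above, not VuFind's actual implementation: the dict-based store and the function name track_change are hypothetical, and the function is assumed to be called only when a record is new or has changed.

```python
from datetime import datetime, timezone


def track_change(tracking, record_id, now):
    """Update first_indexed/last_indexed for one record in a dict-based store."""
    entry = tracking.get(record_id)
    if entry is None:
        # First time this ID is seen: both dates start out equal.
        tracking[record_id] = {"first_indexed": now, "last_indexed": now}
    else:
        # The record was revised: only last_indexed moves forward.
        entry["last_indexed"] = now
    return tracking[record_id]


tracking = {}
t1 = datetime(2020, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2020, 6, 1, tzinfo=timezone.utc)

track_change(tracking, "rec1", t1)  # initial index
track_change(tracking, "rec1", t2)  # later revision

print(tracking["rec1"]["first_indexed"].date())
print(tracking["rec1"]["last_indexed"].date())
```

After the second call, first_indexed still holds the original date while last_indexed holds the date of the revision, which is the information an incremental harvest relies on.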
Of course, simply making a change to my marc_local.properties file is not enough. I also need to reindex all of my records, and just in keeping with past demos I'm going to index 3 of the sample MARC record files included with VuFind: journals.mrc, geo.mrc and authoritybibs.mrc.
Of course, I've shown you how to turn on change tracking for MARC records. At some point in the future we'll also index XML. When we get that far you can also turn on change tracking there; it's just done in a different way. For now we've got our index updated the way we need it to be. We have the OAI server functionality turned on in config.ini, so I'm going to switch over to a web browser and show you how this works.
If you go to your VuFind URL with /oai on the end of it, you will get to a convenient page that shows you all of the verbs supported by the OAI-PMH protocol. It lets you test them out on your instance, so for example, the simplest thing you can do is just say “Identify”, which will dump out basic information about the server, and as you can see, the “Demian's repo” repository name I put into config.ini comes through here.
Of course, much more interesting is finding out what kind of metadata formats are supported by an OAI-PMH server. As I mentioned before, they always support Dublin Core, but different formats may be supported by different servers, so in VuFind's case I'm just going to give it one of the records in the index and find out what formats are supported.
So here is oai_dc, which is Dublin Core, but you'll also see there's a marc21 format supported, so MARCXML can be used. If we wanted to actually see some records, we can use the ListRecords verb, which at a bare minimum requires that we give it a metadata format. So I'm going to give it one, hit Go, and there's my response.
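A minimal sketch of reading the supported formats out of a ListMetadataFormats response. The sample XML is a hand-written, trimmed illustration rather than real server output, and the function name metadata_prefixes is hypothetical; the namespace URI is the one defined by the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written, abbreviated sample response (illustrative only).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>marc21</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def metadata_prefixes(xml_text):
    """Return the metadataPrefix values found in a ListMetadataFormats response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//oai:metadataPrefix", OAI_NS)]


prefixes = metadata_prefixes(SAMPLE)
print(prefixes)
```

Any prefix returned this way can then be passed as the metadataPrefix argument of a ListRecords request.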
And as you can see, there's some MARCXML getting dumped out here. So by turning on this functionality, you can share all of the records in your VuFind index with other systems, whether that means union catalogs, participating in projects like the Digital Public Library of America, or actually indexing things into VuFind.
So now that we've shown how OAI-PMH server functionality works, let's show what VuFind can do as an OAI-PMH client and actually make it harvest itself as an example.
So going back to the command line, there is a folder we haven't looked at yet called harvest in the VuFind directory and, like just about everything in VuFind, you can override things from the harvest directory inside the local/harvest directory.
One of the important files under harvest is called oai.ini, which is just an .ini file that you can use to set up OAI harvesting. So I'm going to copy harvest/oai.ini into local/harvest so that I have a local copy in my local settings directory.
So oai.ini has lots of comments at the top describing the many, many settings that are supported by this file. Read through those at your convenience. At a bare minimum, all you need to do to perform an OAI harvest is to create a section, which I'll name vufind because we are harvesting VuFind, and the main purpose of the section name is that records that are harvested will be saved in a directory whose name matches the section.
When I perform a harvest, I will end up with a local/harvest/vufind directory filled with XML files. Now I need to give it the base URL of an OAI server. In this case, that's going to be http://localhost/vufind/OAI/Server. This is the URL that you would share with others who want to harvest from you, though of course, in a real-life scenario, the hostname would be something other than localhost.
I also have to provide a metadata prefix telling it what metadata format to harvest, and in this example I actually just want Dublin Core, so that's oai_dc. Save this file. Once you have your oai.ini set up, there is a PHP script called harvest/harvest_oai.php, and when you run that, it will loop through oai.ini and harvest every section it finds.
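A minimal sketch of the kind of oai.ini section described above, and of how each section maps to a harvest directory. The key names url and metadataPrefix are assumptions made for this illustration, so check the comments at the top of oai.ini for the exact setting names; the function harvest_targets is hypothetical and only mimics the idea of looping through sections, it is not the harvest script itself.

```python
import configparser

# Assumed minimal configuration: one section per repository to harvest.
OAI_INI = """
[vufind]
url = http://localhost/vufind/OAI/Server
metadataPrefix = oai_dc
"""


def harvest_targets(ini_text, harvest_root="local/harvest"):
    """Map each section name to its URL, metadata prefix, and output directory."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(ini_text)
    targets = {}
    for section in parser.sections():
        targets[section] = {
            "url": parser[section]["url"],
            "metadataPrefix": parser[section]["metadataPrefix"],
            # Harvested records land in a directory named after the section.
            "directory": harvest_root + "/" + section,
        }
    return targets


targets = harvest_targets(OAI_INI)
print(targets["vufind"]["directory"])
```

The section name does double duty: it identifies the repository and names the directory where its harvested XML files are stored.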
Or you can tell it the name of a specific section and it will harvest just that one repository. I'll do that. I'll tell it to harvest vufind. Now there we go, it just downloaded 250 Dublin Core records in just a couple of seconds.
So now, if I go into my local/harvest/vufind directory and list my files, I have lots and lots of XML files. And if I look at one of them, there is a little bit of Dublin Core with a title, a creator, and an identifier.
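A minimal sketch of pulling the title, creator, and identifier out of one harvested Dublin Core record. The sample record is hand-written with placeholder values, not real harvest output, and the function name read_dc is hypothetical; the two namespace URIs are the standard ones for oai_dc and Dublin Core elements.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hand-written sample record with placeholder values (illustrative only).
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:creator>Example, Author</dc:creator>
  <dc:identifier>example-id-1</dc:identifier>
</oai_dc:dc>"""


def read_dc(xml_text, fields=("title", "creator", "identifier")):
    """Return a dict mapping each Dublin Core field to a list of its values."""
    root = ET.fromstring(xml_text)
    # Fields may repeat in Dublin Core, so each one maps to a list.
    return {
        name: [el.text for el in root.findall("dc:" + name, DC_NS)]
        for name in fields
    }


record = read_dc(SAMPLE_RECORD)
print(record)
```

Each field is returned as a list because Dublin Core allows elements such as creator to appear more than once in a record.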
So, that's all I wanted to show this month. This will become much more interesting when we talk about ingesting XML, because you can harvest with OAI and then load a whole directory of records into VuFind. We will look at that next time. In the meantime, I also just wanted to quickly mention that if you want to do this OAI-PMH harvesting without having to install all of VuFind, it has actually been split out into a separate project called VuFind Harvest. So you can just check out VuFind Harvest and run a simplified version of the script without having to carry the whole weight of VuFind around with you. And I will include a link to that project in the notes with the video. That's all for now. Thank you, and I will provide more information next month.
This is an edited version of an automated transcript. Apologies for any errors.