====== Video 7: Indexing XML Records ======
  
The seventh VuFind® instructional video explains how to import XML records using XSLT, with an emphasis on records that were harvested via OAI-PMH.
  
Video is available as an [[https://vufind.org/video/Ingesting_XML.mp4|mp4 download]] or through [[https://www.youtube.com/watch?v=qzY5nC9PLLQ&feature=youtu.be|YouTube]].
For today's example, I'm going to harvest an OJS journal called Expositions, which is hosted at Villanova. OJS is Open Journal Systems, an open-source journal hosting platform that supports OAI-PMH. So this is a good example of a real-world system that you can harvest from and index in VuFind. VuFind includes some sample configurations and an XSLT for harvesting from OJS and indexing the resulting data. Again, it's a pretty good, simple real-world example.
  
So I'm going to go to the command line where I'm in my VuFind home directory and just show a couple of files to give you a taste of what this all looks like. All of VuFind's sample XSLT sheets are in the ''import/xsl'' subdirectory of your VuFind home. And as you can see, we actually have three different flavors of OJS XSLTs. We have an NLM OJS XSLT, which uses the National Library of Medicine's metadata standard, which is a bit richer than the default OAI DC Dublin Core data. But for today's demonstration, I'm just going to use ''ojs.xsl'', which indexes the Dublin Core.
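
(If you want to follow along, a quick way to see these files is simply to list the directory; the lowercase ''import/xsl'' path below may differ slightly between VuFind versions.)

<code bash>
# list the sample XSLT sheets that ship with VuFind (run from the VuFind home directory)
ls import/xsl/

# narrow the listing down to the OJS-related sheets mentioned above
ls import/xsl/ | grep -i ojs
</code>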
  
We also have OJS multi-record, which I will show you a little later, so stay tuned for that. But to get things started, I'm just going to show you what the ''ojs.xsl'' looks like. As I mentioned, an XSLT is just an XML document, and it really works by pattern matching using XPath, which is a way of specifying particular locations within an XML document.
  
So within an XSLT, anything that you see that's prefixed with XSL colon is an XSLT command, and anything else is actually output that the XSLT is going to create. The XSLTs in VuFind are all designed to create Solr documents for indexing, which always have a top-level add tag that contains doc tags that contain fields that need to be added to the Solr documents.
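
(For readers following along without the video, the output the XSLT builds is shaped roughly like the hand-written sketch below; the field names and values are illustrative only, not taken from the actual Expositions records.)

<code xml>
<add>
  <doc>
    <field name="id">expositions-article-2486</field>
    <field name="record_format">ojs</field>
    <field name="title">An example article title</field>
    <field name="allfields">All of the free text extracted from the source XML...</field>
  </doc>
</add>
</code>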
So, this is a really important feature of VuFind's harvester that enables you to harvest just about anything and reliably be able to index it in Solr with a unique ID. But the IDs that you get back from OAI-PMH are often extremely verbose, and they would make for ugly and unreadable URLs. So, we also have some settings called ID search and ID replace, which let us use regular expressions to transform the identifiers at the same time that we're injecting them.
  
So in the case of OJS, the IDs have a long prefix: ''/oai:ojs.pkp.sfu.ca:/''. We don't want to show that to our users, so we're going to replace it with ''expositions-''. This way, everything that we index from Expositions will have a distinctive prefix on the ID, so we don't have to worry about Expositions records clashing with records from other sources. The other thing about this is that there are several slashes in some of the IDs, and slashes in IDs can create problems because slashes have a special meaning in URLs, and it requires extra configuration of your web server to make things work nicely. So let's just get rid of all the slashes as well. We're going to say ''idSearch[] = '|/|' '' and ''idReplace[] = '-' ''.
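
(Put together, the relevant lines of the harvest configuration would look something like this sketch; the section name and the surrounding settings are placeholders, not copied from the video.)

<code ini>
[expositions]
; ...existing settings for this harvest target go here...

; strip the verbose OAI prefix and replace it with a readable, collision-proof one
idSearch[] = "/oai:ojs.pkp.sfu.ca:/"
idReplace[] = "expositions-"

; turn any remaining slashes into dashes so the IDs are URL-friendly
idSearch[] = "|/|"
idReplace[] = "-"
</code>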
  
Let me explain all of this in whole now that I've typed it all in. ID search and ID replace are repeatable settings in the file. You can have as many pairs of search and replace as you need to transform your IDs. You just have to be sure to include the brackets on the end of ID search and ID replace, so that when the configuration is read, the multiple values are processed correctly. ID search, as I mentioned, is a regular expression. It uses the Perl-style regular expressions supported in PHP, and those regular expressions require you to start and end the expression for the pattern you're matching with the same character. So, in this first example, where we're getting rid of the OAI OJS prefix, I surrounded it with matching forward slashes because that is a fairly common convention for regular expressions.
But for the second pair, where we want to turn forward slashes into dashes, I can't surround the forward slash with forward slashes; that would confuse the regular expression engine. So I just used pipe characters instead so that it has matching characters on the beginning and end of the expression that don't conflict with the internal part. I could have chosen a different character here. It doesn't really matter, but I think pipes look pretty. So there you go.
  
With all of that in place, we're now ready to harvest Expositions. So now I just need to run VuFind's OAI-PMH harvester to harvest the Expositions content. So I run ''php harvest/harvest_oai.php'' and I tell it I want to harvest Expositions, and now I wait as it pulls down a whole bunch of records. 285 records, one for each record in Expositions; each of them is an XML file, and they are all in my local ''harvest/expositions'' directory.
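
(As a rough sketch, the harvesting step is a single command run from the VuFind home directory; ''expositions'' here is assumed to be the section name used in the harvest configuration.)

<code bash>
# harvest everything from the OAI-PMH target defined in the [expositions] section
php harvest/harvest_oai.php expositions

# the harvested records end up as individual XML files under local/harvest/expositions/
ls local/harvest/expositions/ | wc -l
</code>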
  
So now we're ready to put all these pieces together. We have a directory full of XML files in Dublin Core format. We have an XSLT and a properties file. There is a command-line tool that comes with VuFind called ''import-xsl.php''. So it's ''php import/import-xsl.php''. And this has a nice ''--test-only'' mode that you can use if you want to see what it does without actually writing anything into Solr. So I'm going to use that for the first run here, just to demonstrate what happens. The first parameter to this command is the name of an XML file. So I'm going to choose just one of these files, more or less at random.
  
 So I chose "local/harvest/expositions/1588685192/expositions-article-2486.xml". That big number at the front is actually just a time stamp, it's the harvester plus the time of harvest on every file download. The second parameter is the name of the properties file. I've configured it to do the import, and I don't need to tell it the path to that file. I just need to tell it the file name because, like many things in VuFind, what it's going to do is it's first going to look in "VuFind/local_dir/import" to see if we have a local customized properties file. If it doesn't find that, it's then going to fall back and look in "VuFind/home/import" and use the default one. So since I haven't customized anything yet, it's just going to go for the defaults. So I chose "local/harvest/expositions/1588685192/expositions-article-2486.xml". That big number at the front is actually just a time stamp, it's the harvester plus the time of harvest on every file download. The second parameter is the name of the properties file. I've configured it to do the import, and I don't need to tell it the path to that file. I just need to tell it the file name because, like many things in VuFind, what it's going to do is it's first going to look in "VuFind/local_dir/import" to see if we have a local customized properties file. If it doesn't find that, it's then going to fall back and look in "VuFind/home/import" and use the default one. So since I haven't customized anything yet, it's just going to go for the defaults.
So I'm going to run this command and it outputs a Solr document which is created by transforming the input. So as you can see, the allfields field is just a whole bunch of text: it extracted all the free text from the XML, taking the tags off of it. There's that hard-coded record format of OJS. The ID is that identifier that we injected, and as you can see, it's prefixed with "expositions" like we told it to be, and the slash that would have been here has become a dash, so all my regular expressions worked. And here's my university and OJS that came in from those variables that were set in the properties file, and a whole bunch of other stuff. So let's repeat that command but just take the test-only off to actually index it into Solr.
  
The XML import does not immediately commit changes to Solr. If you run the command and search for a record, it won't show up instantly. To ensure that Solr is up to date, run the util/commit.php script to send a Solr commit. I'll do that now to demonstrate that it worked. If I search for all records prior to indexing, I can see there were 250 records at that time, but now there are 251. One record is from my university, which was working from the ''ojs.properties'' file. If I click on it to filter down, I can see the non-violence article that we indexed from the XML.
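
(A sketch of the commit step; the script ships with VuFind and is run from the VuFind home directory.)

<code bash>
# send an explicit commit so newly indexed records become visible in search results
php util/commit.php
</code>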
  
We have more than 200 of these records, and we don't want to have to index them by hand one at a time. Fortunately, there is a script called ''harvest/batch-import-xsl.sh''. It takes the name of a directory under your local harvest path and the name of a properties file. It loops through and indexes every single file in that directory using that configuration. This saves lots of typing. As it indexes, it creates a subdirectory of your harvest directory called "processed" and moves those files into the processed directory.
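
(A sketch of the batch import, assuming the script name is the standard ''batch-import-xsl.sh'' that the narration refers to; check your own ''harvest/'' directory if the name differs.)

<code bash>
# index every XML file in local/harvest/expositions/ using ojs.properties;
# successfully processed files are moved into a "processed" subdirectory
harvest/batch-import-xsl.sh expositions ojs.properties
</code>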
First, I'm going to edit my OAI harvesting configuration in ''local/harvest/oai.ini''. All I need to do is add one more setting at the bottom of this called combine records equals true. This is going to tell the harvester that instead of writing one Dublin Core record into each file, you want to create one file for every batch of records that comes back over OAI-PMH, and you're going to wrap them in a tag called collection. If you want to use a different tag name, there's another setting you can use for that, but for this example, just turning on combine records and accepting the default tag name of collection is good enough.
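
(The added line would look roughly like this at the bottom of the relevant section of ''local/harvest/oai.ini''; the setting name ''combineRecords'' is my reading of the narration, so confirm it against your version's documentation.)

<code ini>
[expositions]
; ...existing harvest settings...

; write one file per OAI-PMH response batch, wrapped in a <collection> tag,
; instead of one file per record (a separate setting, not shown here, changes the tag name)
combineRecords = true
</code>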
  
The other thing we need to do is set up the ''ojs.properties'' file to use the combined XSLT file. Let's copy the default ''import/ojs.properties'' into ''local/import'' because, as with everything else, files inside local are going to override defaults in the core code, and let's edit ''local/import/ojs.properties''. I'm just going to comment out ''ojs.xsl'' and uncomment the OJS multi-record XSLT.
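
(After copying ''import/ojs.properties'' into your local import directory, the change amounts to swapping which ''xslt'' line is active, roughly as sketched below; the multi-record file name shown is an assumption, so match it to whatever you saw in ''import/xsl''.)

<code ini>
; local/import/ojs.properties
; single-record XSLT, now commented out:
;xslt = ojs.xsl
; multi-record XSLT (file name assumed -- check your import/xsl directory):
xslt = ojs-multirecord.xsl
</code>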
  
Let's just take a quick look at that other XSLT to see what the differences are. So I'm going to edit the OJS multi-record XSLT in ''import/xsl''. This uses template matching. It's going to match the top-level collection tag, and then it's going to loop through the collection looking for OAI DC and apply templates to each of them in turn. Then there's this OAI DC template, and this code is quite similar to the single record code.
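
(To make that structure concrete, here is a heavily simplified, hand-written sketch of a multi-record stylesheet of the kind described; it is not VuFind's actual file, and the field choices are illustrative.)

<code xml>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">

  <!-- match the wrapping <collection> tag and emit one <add> for the whole batch -->
  <xsl:template match="collection">
    <add>
      <xsl:apply-templates select="//oai_dc:dc"/>
    </add>
  </xsl:template>

  <!-- each Dublin Core record becomes one Solr <doc> -->
  <xsl:template match="oai_dc:dc">
    <doc>
      <field name="id"><xsl:value-of select="dc:identifier"/></field>
      <field name="title"><xsl:value-of select="dc:title"/></field>
    </doc>
  </xsl:template>

</xsl:stylesheet>
</code>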
It just matches within the scope of a single OAI DC instead of globally looking for particular tags. This is really probably a better way to approach all XSLT writing. The difference between multi record and single record is that I wrote the single record one when I didn't know what I was doing, and somebody else who's better at XSLT than me wrote the multi record one. So I welcome contributions of multi record import scripts for other metadata formats as well, but I do offer the single and multi record options because there are scenarios where each can be useful. We'll talk about that a little more momentarily.
  
In any case, I've now shown you the multi-record XSLT. I've reconfigured the OAI-PMH harvester to harvest in groups, and I've configured ''ojs.properties'' to use the multi-record XSLT. So everything should be aligned correctly. So let's run the OAI-PMH harvester again. So ''php harvest/harvest_oai.php'' to harvest expositions, and the harvest should take the same amount of time. We're still harvesting the same 285 records, but if I look inside ''local/harvest/expositions'' this time, there are only three files there because the OAI server provided us with three batches of records, and each of those got saved to a single file.
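
(The re-run is the same harvest command as before; a quick listing shows the difference in how the files are written. The section name is again assumed to be ''expositions''.)

<code bash>
# harvest again with combineRecords enabled
php harvest/harvest_oai.php expositions

# this time there are only a handful of files, one per OAI-PMH response batch
ls local/harvest/expositions/
</code>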
  
And now, if I were to run the single-file ''import-xsl.php'' script in test-only mode on one of these files, you'll see that the output is much longer than before because now, instead of just having one record transformed to Solr, we now have a whole collection of records, 285 of them to be precise.