====== Video 7: Indexing XML Records ======
The seventh VuFind tutorial video explains how to index XML records.

Video is available as an mp4 download.
===== Related Resources =====
  - [[indexing:
+ | |||
+ | ===== Update Notes ===== | ||
+ | |||
+ | :!: This video was recorded using VuFind 6.1. In VuFind 8.0, changes were made which impact the content of this video: | ||
+ | |||
+ | * The ojs-multirecord.xsl file has been removed, and the standard ojs.xsl file has been updated to handle both the single-record and multi-record cases. All of the information in this video about the advantages and disadvantages of each technique still applies, but it is no longer necessary to make changes to ojs.properties in order to support the multi-record case. The only changes you need to make are in oai.ini, to control how records are harvested. | ||
+ | * All of the other example XSLT files have been adjusted to support multi-record indexing, so you can apply this technique to records harvested from other systems as well. | ||
===== Transcript =====
Welcome to the seventh VuFind tutorial video. This is a continuation of last month's video about harvesting records with OAI-PMH; this time, we are going to look at how to index the kind of XML records that the harvester retrieves.

The first thing that I should emphasize is that MARC XML is a special exception. You can use VuFind's standard MARC indexing tools to handle MARC XML; the XSLT-based technique shown in this video is for all of the other XML formats you might encounter.

So, I've already mentioned XSLT, so that was a bit of a spoiler. VuFind uses XSLT for loading XML data into Solr. So, first, I should talk a little bit about what XSLT is. It's short for Extensible Stylesheet Language Transformations, and it is a language for transforming XML documents into other formats, such as the Solr documents that VuFind needs for indexing.
There are several versions of XSLT. I believe the language is up to version 3.0 right now, but PHP's built-in XSLT processor only supports version 1.0 of the language. Obviously, I'm not going to teach you XSLT today in five minutes. It's a bit of a project to learn. So, if you do go off and read a tutorial about it, be sure you find one about the original version of the language and not the later ones that add a lot of additional features.

It's perhaps a little unfortunate that PHP doesn't support the newer versions of the language, but version 1.0 is still capable of doing everything we need here.

For today's example, I'm going to harvest and index Dublin Core records from an Open Journal Systems (OJS) journal called Expositions, using the example XSLT files that are bundled with VuFind.
So I'm going to go to the command line where I'm in my VuFind home directory and just show a couple of files to give you a taste of what this all looks like. All of VuFind's example XSLT files live in the ''import/xsl'' directory, and among them there is an ''ojs.xsl'' file designed for Open Journal Systems records.

We also have an OJS multi-record stylesheet, which I will show you a little later, so stay tuned for that. But to get things started, I'm just going to show you what the ''ojs.xsl'' file looks like.

So within an XSLT, anything that you see that's prefixed with ''xsl:'' is an XSLT command, and anything else is actually output that the XSLT is going to create. The XSLTs in VuFind are all designed to create Solr documents for indexing, which always have a top-level add tag that contains doc tags that contain fields that need to be added to the Solr documents.
So the XSLTs are mostly defining Solr fields and containing rules using XSLT to fill those fields with the appropriate data. For example, to get our unique ID, we're pulling from an XML tag called identifier. We have a hard-coded record format, so this is just putting this literal value into every record, which would enable us to create an OJS-specific record driver if we wanted to.

We have an allfields field to index all of the text within the XML document, which uses some XSLT functions to extract that text. We use variables, which XSLT supports, to pass in institution and collection values. I will show you momentarily how these variables get set. XSLT supports looping for multi-valued fields. So, for example, this code here populates VuFind's language field by looping through all of the language tags in the record.

It calls a PHP function which translates the strings from two-letter or three-letter codes into full textual representations. Again, I obviously can't go into great depth about how all of this works here, but hopefully, this gives you a little taste. If you go off and read an XSLT tutorial or two, it should make even more sense.
So, the XSLT is only part of what VuFind needs to do XML indexing. The other part is a properties file for the import tool, which tells it not only which XSLT to use but also what custom PHP functions to make available and what values to set for any variables that are used within the XSLT.

Let's look at a properties file that goes with that XSLT. As I just showed you, all of the import properties files live in the import directory, and they all contain lots of comments explaining in detail what all of the settings mean. But just to go through the highlights: of course, there's a setting specifying which XSLT file to apply, and for OJS you can choose between the single-record stylesheet and the multi-record one.

The multi-record one is much faster. It just requires some extra work when you harvest, and I'm going to show you how to use both of these today. We'll start one at a time, and we'll work our way up to multi-record. You also can expose specific PHP functions directly to the XSLT by just creating a list of functions here.

By default, none of the packaged configurations do this, but it is a possibility if you want to make PHP functions available to your XSLT. You can also create a class full of custom functions and expose all of them to your XSLT. Most of VuFind's example configurations take this approach, loading a class of custom helper functions.

So I showed you earlier that the institution and collection fields in the Solr index are getting set to variables, and the variables are set here. So by default, you're going to get institution set to "My University" and collection set to "OJS" unless you edit these values to match your own site.
So, when I run the harvest, all my records go to the directory called "expositions" under VuFind's local harvest directory, and, thanks to the harvester's ID injection setting, each record file has the record's OAI-PMH identifier injected into it.

So, this is a really important feature of VuFind's harvester that enables you to harvest just about anything and reliably be able to index it in Solr with a unique ID. But the IDs that you get back from OAI-PMH are often extremely verbose, and they would make for ugly and unreadable URLs. So, we also have some settings called ID search and ID replace, which let us use regular expressions to transform the identifiers at the same time that we're injecting them.

So in the case of OJS, the IDs have a long ''oai:'' prefix that we want to strip off, and the part that remains contains forward slashes, which we want to turn into dashes.

Let me explain all of this in whole now that I've typed it all in. ID search and ID replace are repeatable settings in the file. You can have as many pairs of search and replace as you need to transform your IDs. You just have to be sure to include the brackets on the end of ID search and ID replace, so that when the configuration is read, the multiple values are processed correctly. ID search, as I mentioned, is a regular expression. It uses the Perl-style regular expressions supported in PHP, and those regular expressions require you to start and end the pattern you're matching with the same character. So, in this first example, where we're getting rid of the OAI OJS prefix, I surrounded it with matching forward slashes because that is a fairly common convention for regular expressions.

But for the second pair, where we want to turn forward slashes into dashes, I can't surround the pattern with forward slashes, since the pattern itself contains a forward slash, so I've used a different delimiter character instead.
With all of that in place, we're now ready to harvest Expositions. So now I just need to run a few ''harvest_oai.php'' commands to pull all of the records down.

So now we're ready to put all these pieces together. We have a directory full of XML files in Dublin Core format. We have an XSLT and a properties file. There is a command-line tool that comes with VuFind called ''import-xsl.php'' that ties it all together: you give it the name of an XML file to import and the name of the properties file to use.

So I chose one of the harvested XML files, and I'm also going to use the tool's test mode, which displays the transformed document on the screen instead of actually indexing it into Solr.
So I'm going to run this command, and it outputs a Solr document which is created by transforming the input. So, as you can see, in allfields, it's just a whole bunch of text. It extracted all the free text from the XML, taking the tags off of it. There's also the unique ID, the hard-coded record format, and the other fields that the XSLT defines.

The XML import does not immediately commit changes to Solr. If you run the command and search for a record, it won't show up instantly. To ensure that Solr is up to date, run the ''util/commit.php'' script after finishing a round of indexing, to tell Solr to commit all of the pending changes and make them visible in search results.

We have more than 200 of these records, and we don't want to have to index them by hand one at a time. Fortunately, VuFind also includes a batch import tool, ''harvest/batch-import-xsl.sh'', which runs the XSLT import on every XML file in a harvest subdirectory, so we can index the whole collection with a single command.
The batch process is smart enough that if anything should go wrong during the index, it will not move files that failed to import correctly. So, if I had one bad record in this batch, all the good ones would get successfully indexed and moved into the processed directory, but the bad one would stay where it was. I could then run that test mode I showed you on the one record to see exactly what the error message is that was preventing the transformation, or to see if there's a problem with the data itself.

Now the index process has completed, and if I do a directory listing of ''local/harvest/expositions'', you can see that all of the imported XML files have been moved into the processed subdirectory.

So what I'm going to do is remove the whole ''local/harvest/expositions'' directory so we can start over, and I can show you how much faster this is if we do records in batches instead of one at a time.
First, I'm going to edit my OAI harvesting configuration in ''local/harvest/oai.ini''. All I need to do is add one more setting at the bottom: ''combineRecords = true''. This is going to tell the harvester that, instead of writing one Dublin Core record into each file, you want to create one file for every batch of records that comes back over OAI-PMH, and you're going to wrap them in a tag called collection. If you want to use a different tag name, there's a companion ''combineRecordsTag'' setting for that as well.
The other thing we need to do is set up the ''ojs.properties'' file to point at the multi-record XSLT instead of the single-record one.

Let's just take a quick look at that other XSLT to see what the differences are. So I'm going to edit ''import/xsl/ojs-multirecord.xsl''. This uses template matching. It's going to match the top-level collection tag, and then it's going to loop through the collection looking for oai_dc records and apply templates to each of them in turn. Then there's a separate template for processing an individual record, and that part looks very much like the single-record XSLT we saw earlier.
It just matches within the scope of a single oai_dc record instead of globally looking for particular tags. This is really probably a better way to approach all XSLT writing. The difference between multi-record and single-record is that I wrote the single-record one when I didn't know what I was doing, and somebody else who's better at XSLT than me wrote the multi-record one. So I welcome contributions of multi-record import scripts for other metadata formats as well, but I do offer the single and multi-record options because there are scenarios where each can be useful. We'll talk about that a little more momentarily.

In any case, I've now shown you the multi-record XSLT. I've reconfigured the OAI-PMH harvester to harvest in groups, and I've configured ''ojs.properties'' to use the multi-record XSLT. So everything should be aligned correctly. So let's run the OAI-PMH harvester again; this time, instead of hundreds of single-record files, the harvest produces just a few combined files.

And now, if I were to run the single-file ''import-xsl.php'' script in test-only mode on one of these files, you'll see that the output is much longer than before, because now, instead of just having one record transformed for Solr, we have a whole collection of records, 285 of them to be precise.
So it goes on and on and on. But the advantage of this is, you remember how long it took to batch import the Expositions records when every file contained only one record. Let me show you how much faster it is when there are only three files, each containing a batch of records: the whole import finishes in a fraction of the time.

The only disadvantage to doing things this way that I can see is that, as I mentioned, the import script will skip files that fail the import. So if I had one corrupted record in this OJS instance and I ran this batch import, one of these three files would fail, and I would know there was a problem with one of the hundred records within that file, but it would be hard to figure out which one had caused the problem. So doing single-record importing may be valuable for troubleshooting purposes if nothing else, and I would suggest that if you do a batch import and you run into trouble, try doing a single import; that will probably help you pinpoint the causes of your problems.

I should also note that, as I said, most of the example XSLTs are things I wrote that are designed for a single record at a time. There's currently only the one multi-record example for OJS, but, as I mentioned, contributions of multi-record stylesheets for other formats are very welcome.

So that's it for this month. Thank you for listening, and we'll have more next time.
//This is an edited version of an automated transcript. Apologies for any errors.//
----
videos/indexing_xml_records.1590174324.txt.gz · Last modified: 2020/05/22 19:05 by demiankatz