Video 2: Indexing MARC Records

The second VuFind instructional video discusses how to index MARC records (and configure/customize the indexing process).

Video is available as an mp4 download or through YouTube.


Alright. So in this video, we are going to talk about loading MARC records into VuFind, which is a good first step after you have installed it, as was described in our previous video. Because they are the easiest to load, we are going to work exclusively with MARC records today. I'm going to assume that you know what a MARC record is, but the one-sentence description is that it's a standard interchange format for library records where the data is split up into numeric fields with alphabetic subfields. MARC records are included with VuFind as part of its test suite, so if you just want to kick the tires, it's really easy to get some records loaded in there. You just switch into your VuFind directory; at the moment I'm looking at the virtual machine I set up in the previous video, and I'm in the VuFind home directory. If I do an “ls tests/data/*.mrc”, I can see all of the MARC records that are available.

So first things first: we need to start up our Solr index, which is where VuFind stores all of its records. We can't put things in the index until the index is running. I do that with “./solr.sh start”, and it tells me that Solr is starting up on port 8080, a fact that is useful to remember for later, since we might want to look at the Solr index once it has records in it. Now that Solr is running, there's a handy script called “import-marc.sh”: if we feed it the name of a MARC file (which can be either binary MARC or MARC XML), it will load everything into the index for us. I'm going to choose journals.mrc, which contains some journal records, and it takes just a couple of seconds to load all of the data into the index once it gets going. And there we go! Now I'm going to go to my web browser and do a blank search in VuFind, which is a good way to get a list of all the records that have been indexed, and I see that I now have ten journal records appearing in the search results.
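As a recap, the steps so far can be collected into one small shell function. This is only a sketch: the function name is made up for illustration, and the default VUFIND_HOME path is simply the one used on the demo machine in this video, so adjust it for your own install.

```shell
# Default to the install path used in this demo (an assumption; change as needed).
VUFIND_HOME="${VUFIND_HOME:-/usr/local/vufind}"

# Hypothetical helper: runs the three commands shown in the video.
# The body runs in a subshell so the caller's working directory is untouched.
load_sample_journals() (
    cd "$VUFIND_HOME" || exit 1
    ls tests/data/*.mrc                        # list the sample MARC files
    ./solr.sh start                            # Solr must be running before import
    ./import-marc.sh tests/data/journals.mrc   # index the sample journal records
)
```

Defining the function does nothing by itself; call load_sample_journals from a shell on the VuFind server when you actually want to run the steps.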

So what just happened? It might be useful to take a step back and talk a little bit about what Solr is. Solr is an open-source search engine which essentially takes a set of keys and values and applies some rules to them to make them very easy to search and to perform faceted filtering on. So what we did with our script was map some MARC records into Solr.

I mentioned when I started up Solr that it's useful to make note of the port number that it's running on, and that is because Solr has a web interface that you can actually look at. If you go to the name of the server where it's running, access the port it's running on, and then go to “/solr”, you will find a handy administrative interface. Solr arranges its records into a set of cores so that you can store different types of records in different index schemas, and in the case of VuFind the main records are stored in a core named “biblio”. So if we go to biblio and then to the querying tool, we can run the default “*:*” everything search, and we will see that here indeed are the ten records that were put into the index. We can also see some of the field names and values that got placed there; all of this data was mapped out of those MARC records that we loaded.
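The same “*:*” search can also be run from the command line. The sketch below only builds the URL; it assumes Solr is reachable on localhost at port 8080 (the port reported earlier in this demo) and uses Solr's standard “select” handler on the biblio core. The function name is made up for illustration.

```shell
# Hypothetical helper: build the "match everything" query URL for a Solr core.
# $1 = host, $2 = port, $3 = core name
solr_select_url() {
    printf 'http://%s:%s/solr/%s/select?q=*:*\n' "$1" "$2" "$3"
}

# Example (requires a running Solr, so it is left commented out):
# curl "$(solr_select_url localhost 8080 biblio)"
```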

So how did the indexing tool know what to do with the contents of the MARC records? The answer is that there is a configuration file called “marc.properties” in VuFind's import directory which contains all of the rules for mapping MARC fields into Solr fields. If we edit this file, we can browse through and see some of the mappings that are present. This is using a separate open-source tool called SolrMarc, which is purpose-built for mapping MARC records into Solr, and so it contains a domain-specific language for specifying these mappings. The simplest thing that you can possibly do is set a field to a fixed string, so that every record you import will just have that particular value. In VuFind's defaults, we set everything to a collection of “Catalog”, an institution of “My Institution”, and a building of “Library A”. These are obviously settings that you will want to override, and I'll show you how to do that in a little bit.

So MARC records contain two types of fields. The first nine fields (001 through 009) are called control fields, and they simply have numeric field designations without subfields. You can see that by default VuFind takes the first value found in an 001 field and stores that as the unique identifier for the record. IDs are really important in the Solr index because that's how Solr tells records apart. Every record must have a unique ID, and if you index two records with the same ID, the second record you index will replace the first one. So if your MARC records don't have their IDs in the 001 field, this is another detail that you will need to customize.

After those first nine control fields, all of the other fields in the MARC record are called data fields. They have alphabetic subfields containing different bits of data, and we store a lot of things from those data fields in the Solr index. So for example, on the second line here, we get the Library of Congress card number out of the 010 field, subfield a, and this “first” designation again stores only the first value in the Solr index.

Similarly, here we take the 035a control number, but because we're not specifying that only the first one should be taken, this could potentially store multiple values in the index. Moving down a little bit, let's look at how we index languages. This is a little bit more complicated. Taking this part by part: the SolrMarc field specifications can have a colon-separated list of different things to pull out, and SolrMarc will take all of those values and store them in the index as separate values. So here we are taking characters 35-37 of the 008 control field (which is usually a language code) as well as 041a, d, h, and j, all as separate values. SolrMarc also supports what it calls translation maps, which can take one string and map it into another. There's a file in the translation_maps subfolder of import called language_map.properties, which translates all of the three-letter language codes that are found in MARC records into the full English names for those codes.
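To make the idea of a translation map concrete, here is a rough shell sketch of the lookup: given a properties-style file of “key = value” lines, return the value for one key. This is only an illustration of the concept, not how SolrMarc itself is implemented; the function name is made up, and it assumes simple entries such as “eng = English” with no “=” inside the value.

```shell
# Hypothetical helper: look up one key in a "key = value" map file.
# $1 = map file, $2 = key; prints the mapped value, or nothing if unmapped.
translate() {
    awk -F ' *= *' -v key="$2" '$1 == key { print $2; exit }' "$1"
}
```

For example, with a map file containing the line “fre = French”, running translate on that file with the key “fre” prints “French”.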

You'll also see in here that there are some custom methods. When SolrMarc sees the word “custom”, that means it's looking for a Java function with the name that follows. So there's actually a Java class that ships with VuFind and contains a getFormats method, which holds all the rules for determining record formats. Looking at that is beyond the scope of this introductory video, but by including custom methods in Java it is possible to write your own and to extend the ones that ship with VuFind, and that might be a good topic for a future video. Also, you can apply translation maps to the output of custom methods, so there's a format_map.properties which takes the very granular formats that come out of our Java code and translates them into a smaller set of more human-readable values that will be stored in the index. One last example of the power and flexibility of SolrMarc is here at the very bottom. The translation maps I mentioned before are separate files, but you can also include translation maps inline in the marc.properties file by giving them a name and wrapping it in parentheses. The translations don't have to be straight string-to-string; you can also define a set of regular expression rules to apply, which is called a pattern map. These last five lines of the file are the rules for extracting the numeric portion of OCLC numbers found in 035a. The rule says to take all the values from 035a and apply this pattern map, and then this pattern_0, pattern_1, pattern_2, pattern_3 stuff provides a series of regular expressions that will be tested against all the strings that are extracted; if matches are found, mappings will be made. So this one, for example, takes the different case variations of “OCoLC” in parentheses and then extracts just the numeric portion, using a regular expression subpattern that is matched to $1. 
Then this gets the OCM versions, the OCN versions, and the ON versions. If you don't know what I'm talking about: there are a variety of different ways that OCLC numbers are prefixed, depending on the age and source of the records, and for VuFind's purposes we only care about the numbers, hence all of these mapping rules.
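The effect of those pattern rules can be approximated with sed. This is a sketch of the behavior described above, not a copy of the actual expressions in marc.properties: the function name is made up, and the patterns (including the choice to drop leading zeros) are assumptions written for illustration.

```shell
# Hypothetical helper: print the numeric part of an OCLC-style identifier,
# or nothing at all if no known prefix matches.
extract_oclc_num() {
    printf '%s\n' "$1" | sed -n -E \
        -e 's/^\([Oo][Cc][Oo][Ll][Cc]\)[^0-9]*0*([0-9]+).*$/\1/p' \
        -e 's/^ocm0*([0-9]+).*$/\1/p' \
        -e 's/^ocn0*([0-9]+).*$/\1/p' \
        -e 's/^on0*([0-9]+).*$/\1/p'
}

extract_oclc_num '(OCoLC)00012345'   # prints 12345
extract_oclc_num 'ocn987654321'      # prints 987654321
```

Each expression is tried in order, and a value that matches none of them (for example one with a different parenthesized prefix) produces no output, which mirrors the idea that only recognized OCLC numbers get mapped.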

So! Now that I've shown you how VuFind works by default, it would be useful to know how you can change and customize these things. Before I dive completely into that, a quick aside about the VuFind local directory, which is an important concept not just for import but also for configuration of VuFind more generally. When you set up VuFind, the installer gives you two environment variables: there's VUFIND_HOME, which in this instance is “/usr/local/vufind”, and there's VUFIND_LOCAL_DIR, which in this instance is “/usr/local/vufind/local”. The reason for this local directory is to give you a place to store all of your local files separately from the files that are distributed as part of VuFind. This offers a couple of advantages. One is that you can have multiple configurations of VuFind at once and switch between them by just changing the variable that tells VuFind where to load its configurations. The other is that when you upgrade VuFind, you only need to keep track of a couple of directories of your local files; the rest you can delete and replace with a new version, which makes the upgrade process a little bit more streamlined. In general, the way VuFind uses the local directory is that when it's trying to load a configuration of any sort, it will look in the local directory first. If it finds something there, it will use it. If it does not find it there, it will go up to VUFIND_HOME. This is why there are a lot of parallel structures: for example, there's an import directory inside VUFIND_HOME which contains lots of default configurations, but there's also a local import directory which contains a smaller number of files which are all local overrides. As you can see here, when you first install VuFind you actually get a couple of files in there for free: “import.properties” and “import_auth.properties”. 
These are the files which tell SolrMarc where to find its configurations and other dependencies, and so they are auto-generated for you by the installer to point at the directories that VuFind is being installed into. “import.properties” is the primary configuration file for loading into the biblio core, and “import_auth.properties” is for loading authority records, which again may be a topic for a future video.
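The “local first, then home” lookup described above can be pictured with a few lines of shell. This is purely an illustration of the rule, not VuFind's own code; the function name is made up, and it relies on the VUFIND_HOME and VUFIND_LOCAL_DIR environment variables being set as described.

```shell
# Hypothetical helper: print the path that would win for a given config file.
# $1 = path relative to the VuFind root, e.g. import/marc.properties
resolve_config() {
    if [ -f "$VUFIND_LOCAL_DIR/$1" ]; then
        printf '%s\n' "$VUFIND_LOCAL_DIR/$1"   # a local override exists
    elif [ -f "$VUFIND_HOME/$1" ]; then
        printf '%s\n' "$VUFIND_HOME/$1"        # fall back to the shipped default
    else
        return 1                               # not found in either place
    fi
}
```

On the demo machine, asking for import/import.properties would resolve to the copy under the local directory, because the installer generated one there.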

Let's just take a quick look at “import.properties” so you can see what's in there. It's not a whole lot. We specify which core name we're loading things into, and we specify the names of the properties files that we're using for mappings. As you can see, here is the “marc.properties” file I showed you, and also a second file called “marc-local.properties”, which lets you override parts of “marc.properties” without overwriting all of it; we will be working with that in a little bit. We also have the URL where Solr is running, so that SolrMarc can push the records into its update handler, and then “solrmarc.path”, which is the location of the SolrMarc files. As you can see, we specify first the local import directory and then the main import directory, which is how we load local files ahead of default files. Then there are also a few defaults about error handling and character encodings that you should be able to leave alone, but which are here for you to tweak if you ever need to.

So with all of that out of the way, let's start customizing some things. The first thing that we should do is create a “marc-local.properties” file in our local import directory, so we have a local place to begin putting our customizations. The easiest way to do that is to copy the standard “marc-local.properties”, which contains a whole lot of commented-out examples that might be useful, into the local import directory and then edit it there. As I mentioned, by default VuFind indexes everything to “My Institution” and “Library A”, and some of the suggestions here in marc-local are about changing that. So I'm going to call the institution “Demian's University” and I'm going to call the building “Demian's House”, because I have started a university in which I record videos about VuFind. Having made that change, I need to re-index all of my records in order for the change to take effect. Whenever you change your import rules you need to reload your records, because the mappings occur at index time. So once again: “./import-marc.sh tests/data/journals.mrc”. We wait a few seconds while the records all load. Now if I go back to my web browser and my VuFind search results, you'll see that when I loaded things before, “My Institution” and “Library A” were showing up in these facets, but if I refresh the page, my changes have taken effect. It's now Demian's University and Demian's House. Please don't come to my house. Now, going back to the terminal, let's look at another thing we might want to change. As I mentioned, this is a file full of journals, so if I look at the format facet I'll see there are ten journals here. But suppose that I don't like calling them journals; I want to degrade them by calling them magazines. That is something I can change by editing the format translation map that I mentioned earlier. So once again, we can create a copy of the map in our local directory, and this will override the default map. 
First let me create a directory to hold my custom translation maps, because you're not going to have such a directory by default. We have to create “local/import/translation_maps”, and then I can copy “import/translation_maps/format_map.properties” into “local/import/translation_maps”. I've now created a local copy that I can customize, so I will edit that. As you can see, here is where we map the more technical terms that come out of the custom getFormats method into more human-readable versions. Down here we just map journal straight back into journal again, but I can change this: “Journal = Magazine”. Now if I go back and import my records one more time, then go back to my results and refresh the page, “Journal” has turned into “Magazine”. As you can see, it's pretty easy to customize some of these strings, but let's look at some more complicated things. As I mentioned, languages are currently pulled from a whole bunch of fields and run through a translation map, but let's suppose that I don't want to display those English language names for some reason. Perhaps I would rather set up some settings in my code to change them into something on the fly. What I can do is look at the rule in the core “import/marc.properties” and copy it, and then edit my “local/import/marc-local.properties” so I can override it. I'm just going to go to the bottom and paste the rule in. I'm actually going to paste it in twice, so I can keep the old version commented out for reference, and then I'm going to get rid of the language map. Now this will just take the codes and index them directly without doing any translation work on them. So I'm going to import the records one more time. Before I refresh the page to show the results, let's see what this looked like before: here we have “English”, “French”, “German”, “Italian”. Now I'm going to refresh the page, and sure enough, we are now only getting the raw codes.
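The format map override from a moment ago can be scripted along these lines. This is a sketch under a few assumptions: the function name is made up, the two directories are passed in explicitly rather than read from the environment, and the sed expression assumes the default map contains a line of the form “Journal = Journal”.

```shell
# Hypothetical helper: create a local copy of format_map.properties in which
# "Journal" is displayed as "Magazine". The shipped default is left untouched.
# $1 = VUFIND_HOME, $2 = VUFIND_LOCAL_DIR
override_format_map() {
    mkdir -p "$2/import/translation_maps" &&
    sed 's/^Journal *= *Journal *$/Journal = Magazine/' \
        "$1/import/translation_maps/format_map.properties" \
        > "$2/import/translation_maps/format_map.properties"
}

# Example (left commented out; run it on the VuFind server, then re-import):
# override_format_map /usr/local/vufind /usr/local/vufind/local
```

As with every change to the import rules, the new label only shows up after the records have been re-indexed with import-marc.sh.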

One more interesting feature of SolrMarc is that you can actually combine multiple rules. I realize my example is getting a little bit contrived, but let's suppose that I wanted to see both the mapped versions and the raw versions. What I can do is put both versions of the rule in place, but on the second one change the “=” to a “+=”. What this will do is take all the mapped versions and put them in the language field, and then also take all the raw versions and add those to the language field. This kind of combination can be really useful when you have information in multiple fields of the MARC record formatted in different ways and you want to normalize it in some fashion. You can create a separate rule for each field and then combine all the fields together with “+=”, but for this example we'll just do it this way. Now I will reindex one more time, and if I refresh this page and scroll down to language, you'll see that we now have both the three-letter codes and the mapped names.
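Written out, the pair of rules looks roughly like the two lines appended by the sketch below. The field specification is reconstructed from the description earlier in this video (008 characters 35-37 plus 041a, d, h, and j, with language_map.properties), so check it against the rule in your own marc.properties before relying on it; the function name is made up for illustration.

```shell
# Hypothetical helper: append the combined language rules to a properties file.
# $1 = path to your local marc-local.properties
add_language_rules() {
    cat >> "$1" <<'EOF'
# Mapped language names first, then the raw three-letter codes added with "+=".
language = 008[35-37]:041a:041d:041h:041j, language_map.properties
language += 008[35-37]:041a:041d:041h:041j
EOF
}
```

The first line replaces the default rule, and the “+=” line adds its values to the same Solr field instead of overriding it.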

So that covers the very basic ways that you can import MARC records and customize the way that they're mapped. Of course, there's a much deeper dive possible here, such as looking at the custom Java code, or the whole end-to-end experience of adding a new field to VuFind: not only indexing it but also making it display in the interface. As I said, these are good topics for the future, but for right now I'm just trying to introduce all of the basics. If you want to learn more about SolrMarc and indexing, be sure to take a look at the SolrMarc page in VuFind's wiki and the SolrMarc GitHub project, both of which have more documentation on how to solve common problems using the importer, and more detail about the depths of the importing language, which is actually quite rich and has quite a few features that can be strung together to do complicated transformations. There's also a SolrMarc tech mailing list, which is a great place to ask for help, in addition to the usual VuFind support venues.

I hope this has been useful and I will share another video next month!

