Warning: This page has not been updated in over a year and may be outdated or deprecated.
videos:indexing_marc_records
Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
videos:indexing_marc_records [2020/01/29 17:37] – created demiankatz | videos:indexing_marc_records [2023/04/26 13:26] (current) – crhallberg | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Video 2: Indexing MARC Records ====== | ====== Video 2: Indexing MARC Records ====== | ||
- | The second | + | The second |
Video is available as an [[https:// | Video is available as an [[https:// | ||
Line 7: | Line 7: | ||
===== Related Resources ===== | ===== Related Resources ===== | ||
+ | - [[indexing: | ||
- [[indexing: | - [[indexing: | ||
- [[https:// | - [[https:// | ||
Line 13: | Line 14: | ||
===== Transcript ===== | ===== Transcript ===== | ||
- | // coming soon! // | + | Alright. |
+ | So in this video, we are going to | ||
+ | talk about loading marc records into | ||
+ | VuFind, which is a good first step | ||
+ | after you have installed it, as was | ||
+ | described in our previous video. Because | ||
+ | they are the easiest to load we are | ||
+ | going to work exclusively with marc | ||
+ | records today and I'm going to assume | ||
+ | that you know what a marc record is but | ||
+ | just the one-sentence description is | ||
+ | it's a standard interchange format for | ||
+ | library records where the data is split | ||
+ | up into numeric fields with alphabetic | ||
+ | subfields. So marc records are included | ||
+ | with VuFind as part of its test suite | ||
+ | so if you just wanted to kick the tires, | ||
+ | it's really easy to get some records | ||
+ | loaded in there. You just switch into | ||
+ | your VuFind directory and at the | ||
+ | moment I'm actually looking at the | ||
+ | virtual machine I set up in the previous | ||
+ | video and I'm in the VuFind home directory. | ||
+ | If I do an "ls tests/data/*.mrc", | ||
+ | I can see all of the marc records that | ||
+ | are available. | ||
+ | So first things first we | ||
+ | need to start up our Solr index, which | ||
+ | is where VuFind stores all of its | ||
+ | records. We can't put things in the index | ||
+ | until the index is running. So I do that | ||
+ | with " | ||
+ | to tell me that Solr is starting up on | ||
+ | port 8080, a fact that is useful to | ||
+ | remember for later since we might want | ||
+ | to look at the Solr index once it has | ||
+ | records in it. And now that Solr is | ||
+ | running there' | ||
+ | called " | ||
+ | feed it the name of a marc file (which | ||
+ | can be either binary marc or marc XML) it | ||
+ | will load everything into the index for | ||
+ | us. So I'm going to choose journals.mrc | ||
+ | which contains some journal records | ||
+ | and it takes just a couple seconds to | ||
+ | load all of the data into the index, once | ||
+ | it gets going. And there we go! So I'm | ||
+ | going to go to my web browser | ||
+ | now and just do a blank search in | ||
+ | VuFind which is a good way to get a list | ||
+ | of all the records that have been | ||
+ | indexed and I will see that I now have | ||
+ | ten journal records appearing in the | ||
+ | search results. | ||
+ | |||
+ | So what just happened? | ||
+ | It might be useful to take a step back and | ||
+ | talk a little bit about what Solr is. | ||
+ | Solr is an open-source search engine | ||
+ | which essentially takes a set of keys | ||
+ | and values and applies some rules to | ||
+ | those to make them very easy to search | ||
+ | and perform faceted filtering on. | ||
+ | So what we did with our script was map some | ||
+ | marc records into Solr. | ||
+ | |||
+ | I mentioned | ||
+ | when I started up Solr that it's useful | ||
+ | to make note of the port number that | ||
+ | it's running on and that is because | ||
+ | Solr has a web interface that you can | ||
+ | actually look at. If you just go to the | ||
+ | name of the server where it's running | ||
+ | access the port it's running on and then | ||
+ | go to "/ | ||
+ | a handy administrative interface. Solr | ||
+ | arranges its records into a set of cores | ||
+ | so that you can store different types of | ||
+ | records in different index schemas and | ||
+ | in the case of VuFind the main records | ||
+ | are stored in a core named " | ||
+ | So if we go to biblio and we go to the | ||
+ | querying tool we can run their default | ||
+ | " | ||
+ | will see that here indeed are the ten | ||
+ | records that were put into the index and | ||
+ | we can see some of the field names and | ||
+ | values that got placed there so all of | ||
+ | this data was mapped out of those marc | ||
+ | records that we loaded. | ||
+ | |||
+ | So how did the | ||
+ | indexing tool know what to do with the | ||
+ | contents of the marc records? | ||
+ | The answer is there is a configuration file | ||
+ | called " | ||
+ | import directory which contains all of | ||
+ | the rules for mapping marc fields into | ||
+ | Solr fields. So if we edit this file we | ||
+ | can browse through and see some of the | ||
+ | mappings that are present. This is using | ||
+ | a separate open-source tool called SolrMarc | ||
+ | which is special-built for mapping | ||
+ | marc records into Solr and so it | ||
+ | contains a domain-specific language for | ||
+ | specifying these mappings. The simplest | ||
+ | thing that you can possibly do is set a | ||
+ | field to a fixed string so that every | ||
+ | record you import will just have those | ||
+ | particular values and in VuFind' | ||
+ | default we set everything to a | ||
+ | collection of catalog, an institution of | ||
+ | "My Institution", | ||
+ | These are obviously settings that you | ||
+ | will want to override and I'll show you | ||
+ | how to do that in a little bit. | ||
+ | |||
+ | So marc records contain two types of fields. | ||
+ | The first nine fields (001 through 009) | ||
+ | are called control fields and they simply | ||
+ | have numeric field designations without | ||
+ | subfields. You can see that by default VuFind | ||
+ | takes the first value found in an 001 field | ||
+ | and stores that as the unique identifier | ||
+ | for the record. IDs are really important | ||
+ | in the Solr index because that's how | ||
+ | Solr tells records apart. Every record | ||
+ | must have a unique ID and if you index | ||
+ | two records with the same ID, the second | ||
+ | record you index will replace the first | ||
+ | one. So if your marc records don't have | ||
+ | their IDs in the 001 field, this is | ||
+ | another detail that you will need to | ||
+ | customize. | ||
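The replace-on-duplicate-ID behavior described above can be pictured with a plain dictionary. This is only a sketch of the idea, not how Solr is implemented; the `index_records` helper, the record IDs, and the titles are all made up for illustration.

```python
# Hypothetical in-memory stand-in for an index keyed by unique ID.
# This only mimics the "same ID replaces the earlier record" rule.
def index_records(index, records):
    """Store each record under its ID; a repeated ID overwrites the old entry."""
    for record in records:
        index[record["id"]] = record
    return index


index = {}
index_records(index, [
    {"id": "rec1", "title": "First journal"},
    {"id": "rec2", "title": "Second journal"},
])
# Indexing another record with the ID "rec1" replaces the first one.
index_records(index, [{"id": "rec1", "title": "Replacement journal"}])

record_count = len(index)
rec1_title = index["rec1"]["title"]
```

After the second call the index still holds two records, and the entry for `rec1` carries the replacement title.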
+ | |||
+ | So after those first 9 control | ||
+ | fields all of the other fields in the marc | ||
+ | record are called data fields and they | ||
+ | have alphabetic subfields containing | ||
+ | different bits of data and we store a | ||
+ | lot of things from those data fields in | ||
+ | the Solr index. So for example the | ||
+ | second line here we get the Library of | ||
+ | Congress card number out of | ||
+ | the 010 field, subfield a | ||
+ | and this first designation | ||
+ | again just stores only the first value | ||
+ | in the Solr index. | ||
+ | |||
+ | Similarly here we | ||
+ | take the 035a control number | ||
+ | but because we're not | ||
+ | specifying only taking the first one | ||
+ | that could potentially store multiple | ||
+ | values in the index. | ||
+ | Moving down a little | ||
+ | bit to how we index languages. This is a | ||
+ | little bit more complicated. Taking this | ||
+ | part by part, the SolrMarc field | ||
+ | specifications can have a colon | ||
+ | separated list of different things to | ||
+ | pull out and it will take all of those | ||
+ | values and store them in the index as | ||
+ | separate values. So here we are taking | ||
+ | characters 35-37 of the 008 control field | ||
+ | (which is usually a language code) | ||
+ | as well as 041a, d, h, and j | ||
+ | all as separate values and SolrMarc | ||
+ | supports what it calls translation | ||
+ | maps which can take one string and map | ||
+ | it into another. There' | ||
+ | translation maps subfolder of import | ||
+ | called language map properties which | ||
+ | translates all of the three-letter | ||
+ | language codes that are found in marc | ||
+ | records into the full English names for | ||
+ | those codes. | ||
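To make the language rule a little more concrete, here is a rough Python sketch of the same idea: collect codes from several places, then pass them through a lookup table. It is not SolrMarc code. The three-entry `LANGUAGE_MAP`, the helper names, the sample 008 string, and the choices to drop duplicates and pass unmapped codes through unchanged are all assumptions made for this example.

```python
# Hypothetical excerpt of a translation map: three-letter code -> English name.
LANGUAGE_MAP = {"eng": "English", "fre": "French", "ger": "German"}


def extract_language_codes(field_008, subfields_041):
    """Collect codes from 008 positions 35-37 plus any 041 subfield values."""
    codes = []
    # Positions are counted from zero, so 35-37 inclusive is the slice [35:38].
    code_008 = field_008[35:38].strip()
    if code_008:
        codes.append(code_008)
    codes.extend(subfields_041)
    return codes


def translate(codes, mapping):
    """Map each code to its name, keeping first-seen order without duplicates."""
    names = []
    for code in codes:
        name = mapping.get(code, code)  # assumption: unmapped codes pass through
        if name not in names:
            names.append(name)
    return names


# A made-up 40-character 008 field with "eng" at positions 35-37.
sample_008 = "0" * 35 + "eng" + "  "
codes = extract_language_codes(sample_008, ["fre", "eng"])
languages = translate(codes, LANGUAGE_MAP)
```

Here `codes` ends up as `["eng", "fre", "eng"]` and `languages` as `["English", "French"]`.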
+ | |||
+ | You'll also see in here that | ||
+ | there are some custom methods and when | ||
+ | SolrMarc sees the word " | ||
+ | means it's looking for a Java function | ||
+ | with the name that follows. So there' | ||
+ | actually a Java class that ships with | ||
+ | VuFind and contains a getFormats | ||
+ | method which contains all the rules for | ||
+ | specifying record formats. Looking at | ||
+ | that is beyond the scope of this | ||
+ | introductory video but by including | ||
+ | custom methods in Java it is possible to | ||
+ | write your own and to extend the ones | ||
+ | that ship with VuFind and that might | ||
+ | be a good topic for a future video. | ||
+ | Also, you can apply translation maps to | ||
+ | the output of custom methods so there' | ||
+ | a format map properties which takes the | ||
+ | very granular formats that come out of | ||
+ | our Java code and translates them into a | ||
+ | smaller set of more human-readable | ||
+ | values that will be displayed in the | ||
+ | index. One last example of some of the | ||
+ | power and flexibility of SolrMarc is | ||
+ | here at the very bottom. So the pattern | ||
+ | maps I mentioned before or rather the | ||
+ | translation maps I mentioned before are | ||
+ | separate files but you can also include | ||
+ | translation maps inline in the | ||
+ | marc.properties file by giving them a name | ||
+ | and wrapping them in parentheses. | ||
+ | The translations don't have to be straight | ||
+ | string-to-string you can also define a | ||
+ | set of regular expression rules to apply. | ||
+ | So these last five lines of the file are | ||
+ | the rules for extracting the numeric | ||
+ | portion of OCLC numbers found in 035a. | ||
+ | So the rule says take all the values | ||
+ | from 035a and apply this pattern map | ||
+ | and then this pattern_0, pattern_1, | ||
+ | pattern_2, pattern_3 stuff is providing a | ||
+ | series of regular expressions that will | ||
+ | be tested against all the strings that | ||
+ | are extracted and if matches are found | ||
+ | mappings will be made. So this, for | ||
+ | example, takes the different case | ||
+ | variations of O-C-O-L-C in parentheses and | ||
+ | then extracts just the numeric portion | ||
+ | using a regular expression subpattern | ||
+ | matching to $1. Then this gets the | ||
+ | OCM versions, the OCN versions, and the | ||
+ | ON versions and if you don't know what | ||
+ | I'm talking about: there are a variety of | ||
+ | different ways that OCLC numbers are | ||
+ | prefixed depending on the age and source | ||
+ | of the records and for VuFind' | ||
+ | purposes we only care about the numbers | ||
+ | hence all of these mapping rules. | ||
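The pattern map idea can be approximated in Python with an ordered list of regular expressions, where the first one that matches wins. The expressions below are illustrative guesses at the four prefix styles mentioned above, not the rules that ship with VuFind, and the `extract_oclc_number` helper is hypothetical. Python writes the first captured group as `\1` where the pattern map writes `$1`.

```python
import re

# Hypothetical patterns for the prefix styles described above; each pair is
# (compiled pattern, replacement template that refers to capture group 1).
OCLC_PATTERNS = [
    (re.compile(r"^\(ocolc\)[^0-9]*([0-9]+)", re.IGNORECASE), r"\1"),
    (re.compile(r"^ocm([0-9]+)"), r"\1"),
    (re.compile(r"^ocn([0-9]+)"), r"\1"),
    (re.compile(r"^on([0-9]+)"), r"\1"),
]


def extract_oclc_number(value):
    """Return the numeric part of an OCLC-style identifier, or None."""
    for pattern, template in OCLC_PATTERNS:
        match = pattern.search(value)
        if match:
            return match.expand(template)
    return None  # values such as "(DLC)..." are not OCLC numbers


samples = ["(OCoLC)12345", "(ocolc)ocm67890", "ocm24680", "ocn13579", "on112233"]
numbers = [extract_oclc_number(sample) for sample in samples]
```

Each sample reduces to just its digits, and a value with some other prefix yields `None`, so it contributes nothing.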
+ | |||
+ | So! Now that I've shown you how VuFind | ||
+ | works by default, it would be useful | ||
+ | to know how you can change | ||
+ | and customize these things. Before I | ||
+ | dive completely into that, a quick | ||
+ | aside about the VuFind local | ||
+ | directory, which is an important concept | ||
+ | for not just import but also | ||
+ | configuration more generally of VuFind. | ||
+ | So when you set up VuFind, the installer | ||
+ | gives you two environment variables | ||
+ | there' | ||
+ | instance is "/ | ||
+ | there' | ||
+ | this instance is "/ | ||
+ | The reason for this local | ||
+ | directory is to give you a place to | ||
+ | store all of your local files separately | ||
+ | from the files that are distributed as | ||
+ | part of VuFind. This offers a couple of | ||
+ | advantages. One is that you can have | ||
+ | multiple configurations of VuFind at | ||
+ | once and switch between them by just | ||
+ | changing the variable that tells VuFind | ||
+ | where to load its configurations | ||
+ | the other is that when you upgrade VuFind | ||
+ | you only need to keep track of a | ||
+ | couple of directories of your local | ||
+ | files the rest you can delete and | ||
+ | replace with a new version which makes | ||
+ | the upgrade process a little bit more | ||
+ | streamlined. In general, the way VuFind | ||
+ | uses the local directory is | ||
+ | that when it's trying to load a | ||
+ | configuration of any sort it will look | ||
+ | in the local directory first. | ||
+ | If it finds something there it will use it. | ||
+ | If it does not find it there it will go up to | ||
+ | VUFIND_HOME. So this is why there are | ||
+ | a lot of parallel structures so for | ||
+ | example there' | ||
+ | inside VUFIND_HOME which contains | ||
+ | lots of default configurations but | ||
+ | there' | ||
+ | which contains a smaller number of files | ||
+ | which are all local overrides. As you | ||
+ | can see here when you first install | ||
+ | VuFind you actually get a couple of files | ||
+ | in there for free: " | ||
+ | " | ||
+ | the files which tell SolrMarc where to | ||
+ | find its configurations and | ||
+ | other dependencies and so they are | ||
+ | auto-generated for you by the installer | ||
+ | to point at the directories that VuFind | ||
+ | is being installed into. | ||
+ | " | ||
+ | file for loading into the biblio core | ||
+ | and " | ||
+ | records which again may be a topic for a | ||
+ | future video. | ||
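The "look in the local directory first, then fall back to VUFIND_HOME" behavior described above can be sketched as a small lookup function. This is an illustration of the lookup order only, not VuFind's actual loading code; the `resolve_config` helper and the file names `default.properties` and `shared.properties` are invented for the example.

```python
import tempfile
from pathlib import Path


def resolve_config(relative_path, local_dir, home_dir):
    """Return the first existing copy of a file, checking local_dir before home_dir."""
    for base in (Path(local_dir), Path(home_dir)):
        candidate = base / relative_path
        if candidate.is_file():
            return candidate
    return None


# Build a throwaway directory layout with parallel "import" folders.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp) / "home"
    local = home / "local"
    (home / "import").mkdir(parents=True)
    (local / "import").mkdir(parents=True)
    # Hypothetical file names: one exists only in home, one is overridden locally.
    (home / "import" / "default.properties").write_text("a = 1\n")
    (home / "import" / "shared.properties").write_text("b = 1\n")
    (local / "import" / "shared.properties").write_text("b = 2\n")

    default_hit = resolve_config("import/default.properties", local, home)
    shared_hit = resolve_config("import/shared.properties", local, home)
    default_is_home = default_hit == home / "import" / "default.properties"
    shared_is_local = shared_hit == local / "import" / "shared.properties"
```

The file that exists only under the home directory is found there, while the file that has a local copy resolves to the local override.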
+ | |||
+ | Let's just take a quick | ||
+ | look at " | ||
+ | see what's in there. It's not a whole lot. | ||
+ | We just specify which core name we're | ||
+ | loading things into. We're specifying the | ||
+ | names of the properties files that we're | ||
+ | using for mappings. So as you can see | ||
+ | here is the " | ||
+ | showed you and also a second file called | ||
+ | " | ||
+ | that you can create a file that | ||
+ | overrides parts of " | ||
+ | without overwriting all of it and we | ||
+ | will be working with that in a little | ||
+ | bit. Also we have the URL where Solr is | ||
+ | running so that SolrMarc can push the | ||
+ | records into its update handler and then | ||
+ | " | ||
+ | location of the SolrMarc files and as | ||
+ | you can see we specify first the local | ||
+ | import directory and then the main | ||
+ | import directory which is how we load | ||
+ | local files ahead of default files. Then | ||
+ | there are also a few defaults about | ||
+ | error handling and character encodings | ||
+ | that you should be able to leave alone | ||
+ | but which are here for you to tweak if | ||
+ | you ever need to. | ||
+ | |||
+ | So with all of that out of the way, | ||
+ | let's start customizing some things. | ||
+ | The first thing that we should | ||
+ | do is create a " | ||
+ | file in our local import directory so we | ||
+ | have a local place to begin putting our | ||
+ | customizations. | ||
+ | So the easiest way to do that is to copy | ||
+ | the standard " | ||
+ | which contains a whole lot of | ||
+ | commented-out examples that might be useful | ||
+ | into local import and then edit it there. | ||
+ | As I mentioned by default VuFind | ||
+ | indexes everything to "My Institution" | ||
+ | " | ||
+ | here in marc-local are about changing | ||
+ | that. So I'm going to call the | ||
+ | institution " | ||
+ | going to call the building " | ||
+ | because I have started a university | ||
+ | in which I record videos about VuFind. | ||
+ | So now having made that | ||
+ | change, I need to re-index all of my | ||
+ | records in order for the change to take | ||
+ | effect. Whenever you change your import | ||
+ | rules you need to reload your records | ||
+ | because the mappings occur at index time. | ||
+ | So once again: | ||
+ | " | ||
+ | We wait a few seconds | ||
+ | while the records all load and now if I | ||
+ | go back to my web browser and my VuFind | ||
+ | search results you'll see that when | ||
+ | I loaded things before "My Institution" | ||
+ | and " | ||
+ | facets but if I refresh the page my | ||
+ | changes have taken effect. It's now | ||
+ | Demian' | ||
+ | Please don't come to my house. | ||
+ | Now going back to the terminal let's look at | ||
+ | another thing we might want to change. | ||
+ | As I mentioned this is a file full of | ||
+ | journals, so if I look at the format | ||
+ | facet I'll see there are 10 journals | ||
+ | here but suppose that I don't like | ||
+ | calling them journals; | ||
+ | I want to degrade them by calling them | ||
+ | magazines. That is something I can | ||
+ | change by editing the | ||
+ | format translation maps that I mentioned earlier. | ||
+ | So once again we can create a copy of | ||
+ | the map in our local directory and this | ||
+ | will override the default map. First | ||
+ | let me create a directory to hold my | ||
+ | custom translation maps because you're | ||
+ | not going to have such a directory by | ||
+ | default. We have to create | ||
+ | " | ||
+ | and then I can copy | ||
+ | " | ||
+ | into local " | ||
+ | I've now created a local copy | ||
+ | that I can customize, so I | ||
+ | will edit that. | ||
+ | As you can see here | ||
+ | is where we map the more technical terms | ||
+ | that come out of the custom getFormat | ||
+ | method into more human-readable versions | ||
+ | but down here we just map journal | ||
+ | straight back into journal again but I | ||
+ | can change this. | ||
+ | " | ||
+ | and now if I just go back and import my | ||
+ | records one more time now when I go back | ||
+ | to my results and refresh the page | ||
+ | " | ||
+ | As you can see it's pretty easy to | ||
+ | customize some of these strings but | ||
+ | let's look at some more complicated things. | ||
+ | As I mentioned languages are | ||
+ | currently pulled from a whole bunch of | ||
+ | fields and run through a translation map | ||
+ | but let's suppose that I don't want to | ||
+ | display those English language names for | ||
+ | some reason. Perhaps I would rather set | ||
+ | up some settings in my code to change | ||
+ | them into something on the fly. What I | ||
+ | can do is look at the rule | ||
+ | in the core import-marc.properties | ||
+ | and I am going to copy this rule | ||
+ | then I'm going to edit my | ||
+ | " | ||
+ | so I can override it. | ||
+ | Just going to go to the bottom | ||
+ | and I'm going to paste the rule in. | ||
+ | I'm actually going to paste it in twice | ||
+ | so I can keep the old version | ||
+ | commented out for reference and then I'm | ||
+ | just going to get rid of the language map. | ||
+ | Now this will just take the codes and | ||
+ | index them directly without doing any | ||
+ | translation work on them. | ||
+ | So now I'm going to import the records | ||
+ | one more time so again before I refresh | ||
+ | the page to show the results, let's see | ||
+ | what this looked like before so here we | ||
+ | have " | ||
+ | Now I'm going to refresh the page | ||
+ | and sure enough now we are only getting | ||
+ | the raw codes. | ||
+ | |||
+ | One more interesting | ||
+ | feature of SolrMarc is that you can | ||
+ | actually combine together multiple rules. | ||
+ | I realize my example is getting a | ||
+ | little bit contrived but let's suppose | ||
+ | that I actually wanted to see both the | ||
+ | mapped versions and the raw versions. | ||
+ | What I can do is just put both versions | ||
+ | of the rule in place but on the second | ||
+ | one change the " | ||
+ | and what this will do is it will take | ||
+ | all the mapped versions put them in the | ||
+ | language field and then also take all | ||
+ | the raw versions and add those to the | ||
+ | language field. | ||
+ | This kind of combination | ||
+ | can be really useful when you have | ||
+ | information in multiple fields of the | ||
+ | marc record formatted in different ways | ||
+ | and you want to normalize it in some | ||
+ | fashion. You can create a separate rule | ||
+ | for each field and then combine all the | ||
+ | fields together with " | ||
+ | this example we'll just do it this way. | ||
+ | Now I will reindex one more time | ||
+ | and now if I refresh this page one more | ||
+ | time and scroll down to language, you'll | ||
+ | see that we now have both the | ||
+ | three-letter codes and the mapped names. | ||
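As a loose illustration of combining rules, the sketch below runs two rules over the same language codes and appends both sets of results to a single multi-valued field. It is a Python stand-in for the idea, not SolrMarc syntax; the rule functions, the two-entry map, and the duplicate handling are assumptions for this example.

```python
# Hypothetical two-entry translation map.
LANGUAGE_MAP = {"eng": "English", "fre": "French"}


def mapped_rule(codes):
    """First rule: translate each code through the map."""
    return [LANGUAGE_MAP.get(code, code) for code in codes]


def raw_rule(codes):
    """Second rule: keep the codes exactly as they are."""
    return list(codes)


def apply_rules(codes, rules):
    """Run every rule and append its values to one field, skipping duplicates."""
    values = []
    for rule in rules:
        for value in rule(codes):
            if value not in values:
                values.append(value)
    return values


language_field = apply_rules(["eng", "fre"], [mapped_rule, raw_rule])
```

The resulting field holds `["English", "French", "eng", "fre"]`, the mapped names followed by the raw codes.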
+ | |||
+ | So that covers the very basic ways that | ||
+ | you can import marc records and | ||
+ | customize the way that they' | ||
+ | Of course, there' | ||
+ | possible here because of looking at the | ||
+ | custom Java code or the whole end-to-end | ||
+ | experience of adding a new field to | ||
+ | VuFind not only indexing it but also | ||
+ | making it display in the interface. | ||
+ | As I said these are good topics for the future | ||
+ | but for right now I'm just trying to | ||
+ | introduce all of the basics. If you want | ||
+ | to learn more about SolrMarc and | ||
+ | indexing be sure to take a look at the | ||
+ | SolrMarc page in VuFind' | ||
+ | the SolrMarc GitHub project, both of | ||
+ | which have some more documentation on | ||
+ | how to solve common problems using the | ||
+ | importer and more detail about the | ||
+ | depths of the importing language which | ||
+ | is actually quite rich and has quite a | ||
+ | few features that can be strung together | ||
+ | to do complicated transformations. | ||
+ | There' | ||
+ | list which is a great place to ask for | ||
+ | help, in addition to the usual VuFind | ||
+ | support venues. | ||
+ | |||
+ | I hope this has been | ||
+ | useful and I will share another video | ||
+ | next month! | ||
---- struct data ---- | ---- struct data ---- | ||
+ | properties.Page Owner : | ||
---- | ---- | ||
videos/indexing_marc_records.1580319468.txt.gz · Last modified: 2020/01/29 17:37 by demiankatz