Warning: This page has not been updated in over a year and may be outdated or deprecated.
videos:indexing_marc_records

Differences

This shows you the differences between two versions of the page.

videos:indexing_marc_records [2020/01/29 17:37] – created demiankatz
videos:indexing_marc_records [2023/04/26 13:26] (current) crhallberg
Line 1: Line 1:
 ====== Video 2: Indexing MARC Records ======
  
-The second VuFind instructional video discusses how to index MARC records (and configure/customize the indexing process).
+The second VuFind® instructional video discusses how to index MARC records (and configure/customize the indexing process).
  
 Video is available as an [[https://vufind.org/video/Importing_MARC_Records.mp4|mp4 download]] or through [[https://www.youtube.com/watch?v=ayhWQFi3h_w&feature=youtu.be|YouTube]].
Line 7: Line 7:
 ===== Related Resources =====
  
 +  - [[indexing:marc#related_video|Indexing MARC Records wiki page]]
   - [[indexing:solrmarc|SolrMarc wiki page]]
   - [[https://github.com/solrmarc/solrmarc|SolrMarc Documentation]]
Line 13: Line 14:
 ===== Transcript =====
  
-// coming soon! //
+Alright.
 +So in this video, we are going to
 +talk about loading MARC records into
 +VuFind, which is a good first step
 +after you have installed it, as was
 +described in our previous video. Because
 +they are the easiest to load, we are
 +going to work exclusively with MARC
 +records today. I'm going to assume
 +that you know what a MARC record is, but
 +the one-sentence description is that
 +it's a standard interchange format for
 +library records where the data is split
 +up into numeric fields with alphabetic
 +subfields. So MARC records are included
 +with VuFind as part of its test suite,
 +so if you just wanted to kick the tires,
 +it's really easy to get some records
 +loaded in there. You just switch into
 +your VuFind directory. At the
 +moment I'm actually looking at the
 +virtual machine I set up in the previous
 +video, and I'm in the VuFind home directory.
 +If I do an "ls tests/data/*.mrc",
 +I can see all of the MARC records that
 +are available.
  
 +So first things first: we
 +need to start up our Solr index, which
 +is where VuFind stores all of its
 +records. We can't put things in the index
 +until the index is running. So I do that
 +with "./solr.sh start" and then it's going
 +to tell me that Solr is starting up on
 +port 8080, a fact that is useful to
 +remember for later since we might want
 +to look at the Solr index once it has
 +records in it. Now that Solr is
 +running, there's a handy script
 +called "import-marc.sh": if we
 +feed it the name of a MARC file (which
 +can be either binary MARC or MARCXML), it
 +will load everything into the index for
 +us. So I'm going to choose journals.mrc,
 +which contains some journal records,
 +and it takes just a couple of seconds to
 +load all of the data into the index once
 +it gets going. And there we go! So I'm
 +going to go to my web browser
 +now and do a blank search in
 +VuFind, which is a good way to get a list
 +of all the records that have been
 +indexed, and I will see that I now have
 +ten journal records appearing in the
 +search results.
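
The steps so far can be collected into a short shell sketch. It assumes VuFind lives in /usr/local/vufind (the VUFIND_HOME value shown later in this video); the guard and the `status` variable are additions for illustration, so that nothing runs when no installation is found at that path.

```shell
#!/bin/sh
# Sketch of the quick-start steps described above.
# Assumption: VuFind is installed at /usr/local/vufind unless VUFIND_HOME says otherwise.
VUFIND_HOME="${VUFIND_HOME:-/usr/local/vufind}"

if [ -x "$VUFIND_HOME/import-marc.sh" ]; then
    cd "$VUFIND_HOME" || exit 1
    ls tests/data/*.mrc                       # sample MARC files from the test suite
    ./solr.sh start                           # the index must be running before import
    ./import-marc.sh tests/data/journals.mrc  # load the sample journal records
    status="imported"
else
    echo "No VuFind installation found at $VUFIND_HOME; nothing to do."
    status="skipped"
fi
```

Once the import finishes, a blank search in the VuFind web interface should list the newly indexed records.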
 +
 +So what just happened? 
 +It might be useful to take a step back and
 +talk a little bit about what Solr is.
 +Solr is an open-source search engine
 +which essentially takes a set of keys
 +and values and applies some rules to
 +those to make them very easy to search
 +and perform faceted filtering on. 
 +So what we did with our script was map some
 +MARC records into Solr.
 +
 +I mentioned
 +when I started up Solr that it's useful
 +to make note of the port number that
 +it's running on, and that is because
 +Solr has a web interface that you can
 +actually look at. If you go to the
 +name of the server where it's running,
 +access the port it's running on, and then
 +go to "/solr", you will find
 +a handy administrative interface. Solr
 +arranges its records into a set of cores
 +so that you can store different types of
 +records in different index schemas, and
 +in the case of VuFind the main records
 +are stored in a core named "biblio".
 +So if we go to biblio and we go to the
 +querying tool, we can run the default
 +"*:*" everything search and we
 +will see that here indeed are the ten
 +records that were put into the index. We
 +can see some of the field names and
 +values that got placed there; all of
 +this data was mapped out of those MARC
 +records that we loaded.
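
The same everything search can also be sent from the command line. This is only a sketch: it assumes Solr is reachable on localhost at port 8080 and that curl is installed. The `select` path is Solr's standard query handler; the `rows` and `wt` parameters are optional extras that the video does not mention.

```shell
#!/bin/sh
# Build the URL for a match-all query against the "biblio" core.
# Assumptions: host "localhost" and port 8080 (the port reported by solr.sh above).
SOLR_HOST="${SOLR_HOST:-localhost}"
SOLR_PORT="${SOLR_PORT:-8080}"
query_url="http://${SOLR_HOST}:${SOLR_PORT}/solr/biblio/select?q=*:*&rows=10&wt=json"
echo "$query_url"

# Only attempt the request when curl is available; report failure instead of aborting.
if command -v curl >/dev/null 2>&1; then
    curl --silent --max-time 5 "$query_url" || echo "Solr is not reachable at $query_url"
fi
```

The response contains the same field names and values that the administrative interface shows.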
 +
 +So how did the
 +indexing tool know what to do with the
 +contents of the MARC records?
 +The answer is that there is a configuration file
 +called "marc.properties" in VuFind's
 +import directory which contains all of
 +the rules for mapping MARC fields into
 +Solr fields. If we edit this file we
 +can browse through and see some of the
 +mappings that are present. This is using
 +a separate open-source tool called SolrMarc,
 +which is purpose-built for mapping
 +MARC records into Solr, and so it
 +contains a domain-specific language for
 +specifying these mappings. The simplest
 +thing that you can possibly do is set a
 +field to a fixed string so that every
 +record you import will have that
 +particular value. In VuFind's
 +defaults we set everything to a
 +collection of "Catalog", an institution of
 +"My Institution", and a building of "Library A".
 +These are obviously settings that you
 +will want to override, and I'll show you
 +how to do that in a little bit.
 +
 +So MARC records contain two types of fields.
 +The first nine fields (001 through 009)
 +are called control fields, and they simply
 +have numeric field designations without
 +subfields. You can see that by default VuFind
 +takes the first value found in an 001 field
 +and stores that as the unique identifier
 +for the record. IDs are really important
 +in the Solr index because that's how
 +Solr tells records apart. Every record
 +must have a unique ID, and if you index
 +two records with the same ID, the second
 +record you index will replace the first
 +one. So if your MARC records don't have
 +their IDs in the 001 field, this is
 +another detail that you will need to
 +customize.
 +
 +So after those first nine control
 +fields, all of the other fields in the MARC
 +record are called data fields. They
 +have alphabetic subfields containing
 +different bits of data, and we store a
 +lot of things from those data fields in
 +the Solr index. So for example, in the
 +second line here we get the Library of
 +Congress card number out of
 +the 010 field, subfield a,
 +and this "first" designation
 +again stores only the first value
 +in the Solr index.
 +
 +Similarly, here we
 +take the 035a control number,
 +but because we're not
 +specifying that only the first one is taken,
 +that could potentially store multiple
 +values in the index.
 +Moving down a little
 +bit to how we index languages: this is a
 +little bit more complicated. Taking this
 +part by part, the SolrMarc field
 +specifications can have a colon-separated
 +list of different things to
 +pull out, and it will take all of those
 +values and store them in the index as
 +separate values. So here we are taking
 +characters 35-37 of the 008 control field
 +(which is usually a language code)
 +as well as 041a, d, h, and j,
 +all as separate values. SolrMarc also
 +supports what it calls translation
 +maps, which can take one string and map
 +it into another. There's a file in the
 +"translation_maps" subfolder of "import"
 +called "language_map.properties" which
 +translates all of the three-letter
 +language codes that are found in MARC
 +records into the full English names for
 +those codes.
 +
 +You'll also see in here that
 +there are some custom methods and when
 +SolrMarc sees the word "custom" that
 +means it's looking for a Java function
 +with the name that follows. So there's
 +actually a Java class that ships with
 +VuFind and contains a getFormats
 +method which contains all the rules for
 +specifying record formats. Looking at
 +that is beyond the scope of this
 +introductory video but by including
 +custom methods in Java it is possible to
 +write your own and to extend the ones
 +that ship with VuFind and that might
 +be a good topic for a future video. 
 +Also, you can apply translation maps to
 +the output of custom methods, so there's
 +a "format_map.properties" which takes the
 +very granular formats that come out of
 +our Java code and translates them into a
 +smaller set of more human-readable
 +values that will be displayed in the
 +index. One last example of some of the
 +power and flexibility of SolrMarc is
 +here at the very bottom. The
 +translation maps I mentioned before are
 +separate files, but you can also include
 +translation maps inline in the
 +marc.properties file by giving them a name
 +and wrapping them in parentheses.
 +The translations don't have to be straight
 +string-to-string; you can also define a
 +set of regular expression rules to apply.
 +So these last five lines of the file are
 +the rules for extracting the numeric
 +portion of OCLC numbers found in 035a.
 +The rule says to take all the values
 +from 035a and apply this pattern map,
 +and then this pattern_0, pattern_1,
 +pattern_2, pattern_3 stuff is providing a
 +series of regular expressions that will
 +be tested against all the strings that
 +are extracted; if matches are found,
 +mappings will be made. So this one, for
 +example, takes the different case
 +variations of "OCoLC" in parentheses and
 +then extracts just the numeric portion
 +using regular expression subpattern
 +matching to $1. Then this gets the
 +OCM versions, the OCN versions, and the
 +ON versions. If you don't know what
 +I'm talking about: there are a variety of
 +different ways that OCLC numbers are
 +prefixed depending on the age and source
 +of the records, and for VuFind's
 +purposes we only care about the numbers,
 +hence all of these mapping rules.
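
To make the prefix stripping concrete, here is a small shell illustration using sed. The two expressions only approximate the prefixes named above; they are not copied from VuFind's marc.properties, and `normalize_oclc` is a hypothetical helper name. Dropping values that match no pattern is an assumption made for this sketch.

```shell
#!/bin/sh
# Illustration only: these expressions approximate the prefixes named in the
# video; they are not copied from VuFind's marc.properties.
normalize_oclc() {
    # -n plus the "p" flag prints a value only when one of the patterns matched,
    # so strings with no recognized prefix produce no output (an assumption here).
    printf '%s\n' "$1" | sed -n -E \
        -e 's/^\([Oo][Cc][Oo][Ll][Cc]\)[^0-9]*([0-9]+).*$/\1/p' \
        -e 's/^(ocm|ocn|on)([0-9]+).*$/\2/p'
}

normalize_oclc "(OCoLC)12345"   # prints 12345
normalize_oclc "ocm67890"       # prints 67890
normalize_oclc "on424242"       # prints 424242
```

Each expression plays the role of one or more pattern_N entries: a regular expression on the left and the captured numeric portion as the result.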
 +
 +So! Now that I've shown you how VuFind 
 +works by default, it would be useful
 +to know how you can change
 +and customize these things. Before I
 +dive completely into that, a quick
 +aside about the VuFind local
 +directory, which is an important concept
 +for not just import but also
 +configuration more generally of VuFind.
 +So when you set up VuFind, the installer
 +gives you two environment variables:
 +there's VUFIND_HOME, which in this
 +instance is "/usr/local/vufind", and
 +there's VUFIND_LOCAL_DIR, which in
 +this instance is "/usr/local/vufind/local".
 +The reason for this local
 +directory is to give you a place to
 +store all of your local files separately
 +from the files that are distributed as
 +part of VuFind. This offers a couple of
 +advantages. One is that you can have
 +multiple configurations of VuFind at
 +once and switch between them by just
 +changing the variable that tells VuFind
 +where to load its configurations.
 +The other is that when you upgrade VuFind
 +you only need to keep track of a
 +couple of directories of your local
 +files; the rest you can delete and
 +replace with a new version, which makes
 +the upgrade process a little bit more
 +streamlined. In general, the way VuFind
 +uses the local directory is
 +that when it's trying to load a
 +configuration of any sort it will look
 +in the local directory first.
 +If it finds something there, it will use it.
 +If it does not find it there, it will fall back to
 +VUFIND_HOME. This is why there are
 +a lot of parallel structures. For
 +example, there's an import directory
 +inside VUFIND_HOME which contains
 +lots of default configurations, but
 +there's also a local import directory
 +which contains a smaller number of files
 +which are all local overrides. As you
 +can see here, when you first install
 +VuFind you actually get a couple of files
 +in there for free: "import.properties" and
 +"import_auth.properties". These are
 +the files which tell SolrMarc where to
 +find its configurations and
 +other dependencies, and so they are
 +auto-generated for you by the installer
 +to point at the directories that VuFind
 +is being installed into.
 +"import.properties" is the primary configuration
 +file for loading into the biblio core,
 +and "import_auth.properties" is for loading authority
 +records, which again may be a topic for a
 +future video.
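
The "local first, then home" lookup can be pictured with a few lines of shell. This is a sketch of the idea rather than VuFind's own code: `resolve_config` is a hypothetical helper, and the default paths are the ones quoted in this video.

```shell
#!/bin/sh
# Sketch of the "local first, then home" lookup described above.
# resolve_config is a hypothetical helper, not part of VuFind.
VUFIND_HOME="${VUFIND_HOME:-/usr/local/vufind}"
VUFIND_LOCAL_DIR="${VUFIND_LOCAL_DIR:-$VUFIND_HOME/local}"

resolve_config() {
    # $1 is a path relative to either directory, e.g. import/marc.properties
    if [ -f "$VUFIND_LOCAL_DIR/$1" ]; then
        printf '%s\n' "$VUFIND_LOCAL_DIR/$1"
    elif [ -f "$VUFIND_HOME/$1" ]; then
        printf '%s\n' "$VUFIND_HOME/$1"
    else
        return 1
    fi
}

resolve_config import/import.properties || echo "import/import.properties not found"
```

A file that exists in both places resolves to the local copy, which is why copying a default file into the local directory is enough to override it.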
 +
 +Let's just take a quick
 +look at "import.properties" so you can
 +see what's in there. It's not a whole lot.
 +We just specify which core name we're
 +loading things into. We're specifying the
 +names of the properties files that we're
 +using for mappings. So as you can see,
 +here is the "marc.properties" file I
 +showed you and also a second file called
 +"marc_local.properties", which is a way
 +that you can create a file that
 +overrides parts of "marc.properties"
 +without overwriting all of it; we
 +will be working with that in a little
 +bit. We also have the URL where Solr is
 +running so that SolrMarc can push the
 +records into its update handler, and then
 +"solrmarc.path", which is the
 +location of the SolrMarc files. As
 +you can see, we specify first the local
 +import directory and then the main
 +import directory, which is how we load
 +local files ahead of default files. Then
 +there are also a few defaults about
 +error handling and character encodings
 +that you should be able to leave alone
 +but which are here for you to tweak if
 +you ever need to.
 +
 +So with all of that out of the way, 
 +let's start customizing some things. 
 +The first thing that we should
 +do is create a "marc_local.properties"
 +file in our local import directory so we
 +have a local place to begin putting our
 +customizations.
 +The easiest way to do that is to copy
 +the standard "marc_local.properties"
 +(which contains a whole lot of
 +commented-out examples that might be useful)
 +into local import and then edit it there.
 +As I mentioned, by default VuFind
 +indexes everything to "My Institution" and
 +"Library A", and so some of the suggestions
 +here in marc_local are about changing
 +that. So I'm going to call the
 +institution "Demian's University" and I'm
 +going to call the building "Demian's House"
 +because I have started a university
 +in which I record videos about VuFind.
 +So now, having made that
 +change, I need to reindex all of my
 +records in order for the change to take
 +effect. Whenever you change your import
 +rules you need to reload your records
 +because the mappings occur at index time.
 +So once again:
 +"./import-marc.sh tests/data/journals.mrc".
 +We wait a few seconds
 +while the records all load and now if I
 +go back to my web browser and my VuFind 
 +search results you'll see that when
 +I loaded things before "My Institution" 
 +and "Library A" were showing up in these
 +facets but if I refresh the page my
 +changes have taken effect. It's now
 +Demian's University and Demian's House.
 +Please don't come to my house. 
 +Now going back to the terminal let's look at
 +another thing we might want to change.
 +As I mentioned this is a file full of
 +journals, so if I look at the format
 +facet I'll see there are 10 journals
 +here, but suppose that I don't like
 +calling them journals;
 +I want to degrade them by calling them
 +magazines. That is something I can
 +change by editing the
 +format translation map that I mentioned earlier.
 +So once again we can create a copy of
 +the map in our local directory and this
 +will override the default map. First
 +let me create a directory to hold my
 +custom translation maps because you're
 +not going to have such a directory by
 +default. We have to create
 +"local/import/translation_maps"
 +and then I can copy
 +"import/translation_maps/format_map.properties"
 +into "local/import/translation_maps".
 +I've now created a local copy
 +that I can customize, so I
 +will edit that.
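
Both copy steps so far (the local properties file and the format map) follow the same pattern, sketched below. `copy_to_local` is a hypothetical helper, not a VuFind command; the paths are the defaults quoted in this video, and the override file is assumed to be named marc_local.properties as it is in VuFind's import directory.

```shell
#!/bin/sh
# copy_to_local is a hypothetical helper, not a VuFind command. It copies a
# default file into the matching place under the local directory.
VUFIND_HOME="${VUFIND_HOME:-/usr/local/vufind}"
VUFIND_LOCAL_DIR="${VUFIND_LOCAL_DIR:-$VUFIND_HOME/local}"

copy_to_local() {
    # $1 is a path relative to VUFIND_HOME, e.g. import/marc_local.properties
    src="$VUFIND_HOME/$1"
    dest="$VUFIND_LOCAL_DIR/$1"
    [ -f "$src" ] || { echo "missing default file: $src" >&2; return 1; }
    # Never overwrite an existing local copy, so earlier customizations survive.
    [ -f "$dest" ] && return 0
    mkdir -p "$(dirname "$dest")" && cp "$src" "$dest"
}

for f in import/marc_local.properties import/translation_maps/format_map.properties; do
    copy_to_local "$f" || echo "skipped $f"
done
```

After editing either local copy, the records have to be imported again, because the mappings are applied at index time.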
 +As you can see here
 +is where we map the more technical terms
 +that come out of the custom getFormat
 +method into more human-readable versions
 +but down here we just map journal
 +straight back into journal again but I
 +can change this. 
 +"Journal = Magazine"
 +and now if I just go back and import my
 +records one more time now when I go back
 +to my results and refresh the page
 +"Journal" has just turned into "Magazine".
 +As you can see it's pretty easy to
 +customize some of these strings but
 +let's look at some more complicated things. 
 +As I mentioned languages are
 +currently pulled from a whole bunch of
 +fields and run through a translation map
 +but let's suppose that I don't want to
 +display those English language names for
 +some reason. Perhaps I would rather set
 +up some settings in my code to change
 +them into something on the fly. What I
 +can do is look at the rule
 +in the core "import/marc.properties"
 +and copy it;
 +then I'm going to edit my
 +"local/import/marc_local.properties"
 +so I can override it.
 +I'm just going to go to the bottom
 +and paste the rule in.
 +I'm actually going to paste it in twice
 +so I can keep the old version
 +commented out for reference, and then I'm
 +going to get rid of the language map.
 +Now this will just take the codes and
 +index them directly without doing any
 +translation work on them.
 +So now I'm going to import the records
 +one more time so again before I refresh
 +the page to show the results, let's see
 +what this looked like before so here we
 +have "English", "French", "German", "Italian".
 +Now I'm going to refresh the page 
 +and sure enough now we are only getting 
 +the raw codes. 
 +
 +One more interesting
 +feature of SolrMarc is that you can
 +actually combine together multiple rules.
 +I realize my example is getting a
 +little bit contrived but let's suppose
 +that I actually wanted to see both the
 +mapped versions and the raw versions.
 +What I can do is just put both versions
 +of the rule in place but on the second
 +one change the "=" to a "+="
 +and what this will do is it will take
 +all the mapped versions put them in the
 +language field and then also take all
 +the raw versions and add those to the
 +language field.
 +This kind of combination
 +can be really useful when you have
 +information in multiple fields of the
 +marc record formatted in different ways
 +and you want to normalize it in some
 +fashion. You can create a separate rule
 +for each field and then combine all the
 +fields together with "+=" but for
 +this example we'll just do it this way.
 +Now I will reindex one more time
 +and now if I refresh this page one more
 +time and scroll down to language, you'll
 +see that we now have both the
 +three-letter codes and the mapped names.
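
For reference, the pair of rules discussed in this part might look like the output of the sketch below. The field specification is reconstructed from the spoken description (characters 35-37 of 008 plus 041 a, d, h, and j), so the exact syntax is an assumption and should be checked against the language rule in import/marc.properties; `language_rules` is a hypothetical helper that only prints the lines.

```shell
#!/bin/sh
# Prints the two rules discussed above so they can be appended to the local
# override file. The field specification is reconstructed from the description
# in this video and should be checked against import/marc.properties.
language_rules() {
    cat <<'EOF'
language = 008[35-37]:041a:041d:041h:041j, language_map.properties
language += 008[35-37]:041a:041d:041h:041j
EOF
}

language_rules
# To apply: language_rules >> "$VUFIND_LOCAL_DIR/import/marc_local.properties"
```

The first line stores the mapped names; the second uses "+=" to add the raw codes to the same field instead of replacing it.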
 +
 +So that covers the very basic ways that
 +you can import marc records and
 +customize the way that they're mapped. 
 +Of course, a much deeper dive is
 +possible here, such as looking at the
 +custom Java code or the whole end-to-end
 +experience of adding a new field to
 +VuFind: not only indexing it but also
 +making it display in the interface.
 +As I said, these are good topics for the future,
 +but for right now I'm just trying to
 +introduce all of the basics. If you want
 +to learn more about SolrMarc and
 +indexing, be sure to take a look at the
 +SolrMarc page in VuFind's wiki and
 +the SolrMarc GitHub project, both of
 +which have some more documentation on
 +how to solve common problems using the
 +importer and more detail about the
 +depths of the importing language, which
 +is actually quite rich and has quite a
 +few features that can be strung together
 +to do complicated transformations.
 +There's also a SolrMarc tech mailing
 +list which is a great place to ask for
 +help, in addition to the usual VuFind
 +support venues.
 +
 +I hope this has been
 +useful and I will share another video
 +next month!
  
videos/indexing_marc_records.1580319468.txt.gz · Last modified: 2020/01/29 17:37 by demiankatz