--- videos:indexing_marc_records [2020/01/29 17:37] – created demiankatz
+++ videos:indexing_marc_records [2023/04/26 13:26] (current) – crhallberg
@@ Line 1: / Line 1: @@
 ====== Video 2: Indexing MARC Records ======
-The second VuFind instructional video discusses how to index MARC records (and configure/customize the indexing process).
+The second VuFind® instructional video discusses how to index MARC records (and configure/customize the indexing process).
 Video is available as an [[https://vufind.org/video/Importing_MARC_Records.mp4|mp4 download]] or through [[https://www.youtube.com/watch?v=ayhWQFi3h_w&feature=youtu.be|YouTube]].
@@ Line 7: / Line 7: @@
 ===== Related Resources =====
+  - [[indexing:marc#related_video|Indexing MARC Records wiki page]]
   - [[indexing:solrmarc|SolrMarc wiki page]]
   - [[https://github.com/solrmarc/solrmarc|SolrMarc Documentation]]
@@ Line 13: / Line 14: @@
 ===== Transcript =====
-// coming soon! //
+Alright.
+So in this video, we are going to
+talk about loading MARC records into
+VuFind, which is a good first step
+after you have installed it, as was
+described in our previous video. Because
+they are the easiest to load, we are
+going to work exclusively with MARC
+records today. I'm going to assume
+that you know what a MARC record is, but
+the one-sentence description is that
+it's a standard interchange format for
+library records where the data is split
+up into numeric fields with alphabetic
+subfields. MARC records are included
+with VuFind as part of its test suite,
+so if you just want to kick the tires,
+it's really easy to get some records
+loaded in there. You just switch into
+your VuFind directory. At the
+moment I'm looking at the
+virtual machine I set up in the previous
+video, and I'm in the VuFind home directory.
+If I do an "ls tests/data/*.mrc",
+I can see all of the MARC records that
+are available.
+So first things first: we
+need to start up our Solr index, which
+is where VuFind stores all of its
+records. We can't put things in the index
+until the index is running. So I do that
+with "./solr.sh start" and then it's going
+to tell me that Solr is starting up on
+port 8080, a fact that is useful to
+remember for later since we might want
+to look at the Solr index once it has
+records in it. Now that Solr is
+running, there's a handy script
+called "import-marc.sh": if we
+feed it the name of a MARC file (which
+can be either binary MARC or MARC XML), it
+will load everything into the index for
+us. So I'm going to choose journals.mrc,
+which contains some journal records,
+and it takes just a couple of seconds to
+load all of the data into the index once
+it gets going. And there we go! So I'm
+going to go to my web browser
+now and just do a blank search in
+VuFind, which is a good way to get a list
+of all the records that have been
+indexed, and I will see that I now have
+ten journal records appearing in the
+search results.
+So what just happened?
+It might be useful to take a step back and
+talk a little bit about what Solr is.
+Solr is an open-source search engine
+which essentially takes a set of keys
+and values and applies some rules to
+those to make them very easy to search
+and perform faceted filtering on.
+So what we did with our script was map some
+MARC records into Solr.
+I mentioned
+when I started up Solr that it's useful
+to make note of the port number that
+it's running on, and that is because
+Solr has a web interface that you can
+actually look at. If you just go to the
+name of the server where it's running,
+access the port it's running on, and then
+go to "/solr", you will find
+a handy administrative interface. Solr
+arranges its records into a set of cores
+so that you can store different types of
+records in different index schemas, and
+in the case of VuFind the main records
+are stored in a core named "biblio".
+So if we go to biblio and we go to the
+querying tool, we can run the default
+"*:*" everything search, and we
+will see that here indeed are the ten
+records that were put into the index.
+We can see some of the field names and
+values that got placed there. All of
+this data was mapped out of those MARC
+records that we loaded.
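The same "*:*" search can also be sent to Solr over HTTP rather than through the admin interface. The following is a minimal Python sketch that only builds the request URL. Port 8080 and the "biblio" core come from the walkthrough above, while the host name "localhost", the `rows` value, and the helper name `solr_select_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlunsplit


def solr_select_url(host, port, core, query="*:*", rows=10):
    """Build the URL for a select query against a single Solr core."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return urlunsplit(("http", f"{host}:{port}", f"/solr/{core}/select", params, ""))


if __name__ == "__main__":
    # "localhost" is an assumption about where Solr is running.
    print(solr_select_url("localhost", 8080, "biblio"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the indexed records as JSON, provided Solr is running at that address.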
+So how did the
+indexing tool know what to do with the
+contents of the MARC records?
+The answer is there is a configuration file
+called "marc.properties" in VuFind's
+import directory which contains all of
+the rules for mapping MARC fields into
+Solr fields. So if we edit this file we
+can browse through and see some of the
+mappings that are present. This is using
+a separate open-source tool called SolrMarc,
+which is purpose-built for mapping
+MARC records into Solr, and so it
+contains a domain-specific language for
+specifying these mappings. The simplest
+thing that you can possibly do is set a
+field to a fixed string so that every
+record you import will just have those
+particular values. In VuFind's
+defaults we set everything to a
+collection of "Catalog", an institution of
+"My Institution", and a building of "Library A".
+These are obviously settings that you
+will want to override, and I'll show you
+how to do that in a little bit.
+MARC records contain two types of fields.
+The first nine fields (001 through 009)
+are called control fields, and they simply
+have numeric field designations without
+subfields. You can see that by default VuFind
+takes the first value found in an 001 field
+and stores that as the unique identifier
+for the record. IDs are really important
+in the Solr index because that's how
+Solr tells records apart. Every record
+must have a unique ID, and if you index
+two records with the same ID, the second
+record you index will replace the first
+one. So if your MARC records don't have
+their IDs in the 001 field, this is
+another detail that you will need to
+customize.
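The replace-on-duplicate-ID behavior described above can be pictured with a plain dictionary keyed by record ID. This is only a toy model of the idea in Python, not how Solr is implemented, and the record contents are invented for the example.

```python
def add_to_index(index, record):
    """Store a record under its ID; an existing record with the same ID is replaced."""
    index[record["id"]] = record
    return index


index = {}
add_to_index(index, {"id": "000123", "title": "First version"})
add_to_index(index, {"id": "000123", "title": "Second version"})
add_to_index(index, {"id": "000456", "title": "Another record"})

print(len(index))                # two records remain, not three
print(index["000123"]["title"])  # the later record replaced the earlier one
```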
+So after those first 9 control
+fields, all of the other fields in the MARC
+record are called data fields. They
+have alphabetic subfields containing
+different bits of data, and we store a
+lot of things from those data fields in
+the Solr index. So for example, in the
+second line here we get the Library of
+Congress card number out of
+the 010 field, subfield a,
+and this "first" designation
+again stores only the first value
+in the Solr index.
+Similarly, here we
+take the 035a control number,
+but because we're not
+specifying only taking the first one,
+that could potentially store multiple
+values in the index.
+Moving down a little
+bit, we can see how we index languages. This is a
+little bit more complicated. Taking this
+part by part: the SolrMarc field
+specifications can have a colon-separated
+list of different things to
+pull out, and it will take all of those
+values and store them in the index as
+separate values. So here we are taking
+characters 35-37 of the 008 control field
+(which is usually a language code)
+as well as 041a, d, h, and j,
+all as separate values. SolrMarc also
+supports what it calls translation
+maps, which can take one string and map
+it into another. There's a file in the
+"translation_maps" subfolder of import
+called "language_map.properties" which
+translates all of the three-letter
+language codes that are found in MARC
+records into the full English names for
+those codes.
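To make the language rule more concrete, here is a small Python model of the two steps described above: pulling values from a colon-separated list of field specifications, then running them through a translation map. It is a sketch of the concept rather than SolrMarc itself. The record layout, the three-entry map, the de-duplication, and the choice to pass unmapped codes through unchanged are all simplifying assumptions.

```python
# Hypothetical simplified record: control fields are plain strings,
# data fields are lists of {subfield code: [values]} dictionaries.
record = {
    "008": " " * 35 + "eng" + "  ",
    "041": [{"a": ["eng", "fre"], "h": ["ger"]}],
}

# Tiny stand-in for a translation map (code -> display name).
language_map = {"eng": "English", "fre": "French", "ger": "German"}


def extract(record, spec):
    """Return the values for one spec such as '008[35-37]' or '041a'."""
    tag, rest = spec[:3], spec[3:]
    if rest.startswith("["):
        start, end = (int(n) for n in rest.strip("[]").split("-"))
        chunk = record.get(tag, "")[start:end + 1]  # positions are inclusive
        return [chunk] if chunk.strip() else []
    values = []
    for field in record.get(tag, []):
        values.extend(field.get(rest, []))
    return values


def map_field(record, specs, translation_map):
    """Collect values from every spec, translate them, and drop duplicates."""
    results = []
    for spec in specs.split(":"):
        for raw in extract(record, spec):
            mapped = translation_map.get(raw, raw)  # assumption: keep unmapped codes
            if mapped not in results:
                results.append(mapped)
    return results


print(map_field(record, "008[35-37]:041a:041d:041h:041j", language_map))
```

Passing an empty dictionary as the translation map leaves the raw three-letter codes in place, which is the same effect as removing the map from the rule.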
+You'll also see in here that
+there are some custom methods. When
+SolrMarc sees the word "custom", that
+means it's looking for a Java function
+with the name that follows. So there's
+actually a Java class that ships with
+VuFind and contains a getFormat
+method, which contains all the rules for
+specifying record formats. Looking at
+that is beyond the scope of this
+introductory video, but by including
+custom methods in Java it is possible to
+write your own and to extend the ones
+that ship with VuFind, and that might
+be a good topic for a future video.
+Also, you can apply translation maps to
+the output of custom methods, so there's
+a "format_map.properties" which takes the
+very granular formats that come out of
+our Java code and translates them into a
+smaller set of more human-readable
+values that will be displayed in the
+index. One last example of some of the
+power and flexibility of SolrMarc is
+here at the very bottom. The
+translation maps I mentioned before are
+separate files, but you can also include
+translation maps inline in the
+marc.properties file by giving them a name
+and wrapping them in parentheses.
+The translations don't have to be straight
+string-to-string; you can also define a
+set of regular expression rules to apply.
+So these last five lines of the file are
+the rules for extracting the numeric
+portion of OCLC numbers found in 035a.
+The rule says take all the values
+from 035a and apply this pattern map,
+and then this pattern_0, pattern_1,
+pattern_2, pattern_3 stuff is providing a
+series of regular expressions that will
+be tested against all the strings that
+are extracted, and if matches are found,
+mappings will be made. So this, for
+example, takes the different case
+variations of O-C-O-L-C in parentheses and
+then extracts just the numeric portion
+using regular expression subpattern
+matching to $1. Then this gets the
+OCM versions, the OCN versions, and the
+ON versions. If you don't know what
+I'm talking about: there are a variety of
+different ways that OCLC numbers are
+prefixed depending on the age and source
+of the records, and for VuFind's
+purposes we only care about the numbers,
+hence all of these mapping rules.
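The idea of an ordered pattern map can be sketched in Python: each value is tested against a list of regular expressions, and the first match is rewritten using its captured group. The expressions below are illustrative approximations written for this example, not the ones shipped in marc.properties, and returning `None` for unmatched values is a choice made for the sketch.

```python
import re

# Illustrative patterns only: one per OCLC prefix style, tried in order.
OCLC_PATTERNS = [
    (re.compile(r"\((?i:ocolc)\)\D*0*(\d+)"), r"\1"),  # (OCoLC) in any letter case
    (re.compile(r"ocm0*(\d+)"), r"\1"),
    (re.compile(r"ocn0*(\d+)"), r"\1"),
    (re.compile(r"on0*(\d+)"), r"\1"),
]


def oclc_number(value):
    """Return the numeric part of an OCLC number, or None if no pattern matches."""
    for pattern, replacement in OCLC_PATTERNS:
        match = pattern.match(value)
        if match:
            return match.expand(replacement)
    return None


for sample in ["(OCoLC)00012345", "ocm00012345", "ocn123456789", "on1234567890", "12345"]:
    print(sample, "->", oclc_number(sample))
```

Leading zeros are stripped by the `0*` before the capture group, so differently padded forms of the same number collapse to one value.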
+So! Now that I've shown you how VuFind
+works by default, it would be useful
+to know how you can change
+and customize these things. Before I
+dive completely into that, a quick
+aside about the VuFind local
+directory, which is an important concept
+for not just import but also
+configuration more generally of VuFind.
+So when you set up VuFind, the installer
+gives you two environment variables:
+there's VUFIND_HOME, which in this
+instance is "/usr/local/vufind", and
+there's VUFIND_LOCAL_DIR, which in
+this instance is "/usr/local/vufind/local".
+The reason for this local
+directory is to give you a place to
+store all of your local files separately
+from the files that are distributed as
+part of VuFind. This offers a couple of
+advantages. One is that you can have
+multiple configurations of VuFind at
+once and switch between them by just
+changing the variable that tells VuFind
+where to load its configurations.
+The other is that when you upgrade VuFind
+you only need to keep track of a
+couple of directories of your local
+files; the rest you can delete and
+replace with a new version, which makes
+the upgrade process a little bit more
+streamlined. In general, the way VuFind
+uses the local directory is
+that when it's trying to load a
+configuration of any sort it will look
+in the local directory first.
+If it finds something there it will use it.
+If it does not find it there it will go up to
+VUFIND_HOME. This is why there are
+a lot of parallel structures. For
+example, there's an import directory
+inside VUFIND_HOME which contains
+lots of default configurations, but
+there's also a local import directory
+which contains a smaller number of files
+which are all local overrides. As you
+can see here, when you first install
+VuFind you actually get a couple of files
+in there for free: "import.properties" and
+"import_auth.properties". These are
+the files which tell SolrMarc where to
+find its configurations and
+other dependencies, and so they are
+auto-generated for you by the installer
+to point at the directories that VuFind
+is being installed into.
+"import.properties" is the primary configuration
+file for loading into the biblio core,
+and "import_auth.properties" is for loading authority
+records, which again may be a topic for a
+future video.
+Let's just take a quick
+look at "import.properties" so you can
+see what's in there. It's not a whole lot.
+We just specify which core name we're
+loading things into. We're specifying the
+names of the properties files that we're
+using for mappings. As you can see,
+here is the "marc.properties" file I
+showed you, and also a second file called
+"marc-local.properties", which is a way
+that you can create a file that
+overrides parts of "marc.properties"
+without overwriting all of it. We
+will be working with that in a little
+bit. We also have the URL where Solr is
+running so that SolrMarc can push the
+records into its update handler, and then
+"solrmarc.path", which is the
+location of the SolrMarc files. As
+you can see, we specify first the local
+import directory and then the main
+import directory, which is how we load
+local files ahead of default files. Then
+there are also a few defaults about
+error handling and character encodings
+that you should be able to leave alone
+but which are here for you to tweak if
+you ever need to.
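The "local first, then default" lookup order can be modeled as a search along an ordered list of directories. The Python sketch below builds a throwaway directory tree and reports which copy of each file would win. The file names are hypothetical, and the real lookup is performed by VuFind and SolrMarc, not by code like this.

```python
import os
import tempfile


def resolve(filename, search_path):
    """Return the first existing copy of filename along an ordered search path."""
    for directory in search_path:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    """Create a temporary tree and show which copy of each file is picked."""
    with tempfile.TemporaryDirectory() as home:
        default_dir = os.path.join(home, "import")
        local_dir = os.path.join(home, "local", "import")
        os.makedirs(default_dir)
        os.makedirs(local_dir)

        # Hypothetical file names: one exists only as a default,
        # the other also has a local override.
        for directory, name in [
            (default_dir, "defaults_only.properties"),
            (default_dir, "overridden.properties"),
            (local_dir, "overridden.properties"),
        ]:
            with open(os.path.join(directory, name), "w") as handle:
                handle.write("# placeholder\n")

        search_path = [local_dir, default_dir]  # local first, then defaults
        return {
            name: os.path.relpath(resolve(name, search_path), home)
            for name in ("defaults_only.properties", "overridden.properties")
        }


if __name__ == "__main__":
    for name, winner in demo().items():
        print(name, "->", winner)
```

Because the local directory comes first in the list, a file only needs to be copied there when it actually differs from the default.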
+So with all of that out of the way,
+let's start customizing some things.
+The first thing that we should
+do is create a "marc-local.properties"
+file in our local import directory so we
+have a local place to begin putting our
+customizations.
+So the easiest way to do that is to copy
+the standard "marc-local.properties"
+(which contains a whole lot of
+commented-out examples that might be useful)
+into local import and then edit it there.
+As I mentioned by default VuFind
+indexes everything to "My Institution" and
+"Library A" and so some of the suggestions
+here in marc-local are about changing
+that. So I'm going to call the
+institution "Demian's University" and I'm
+going to call the building "Demian's house"
+because I have started a university
+in which I record videos about VuFind.
+So now having made that
+change, I need to re-index all of my
+records in order for the change to take
+effect. Whenever you change your import
+rules you need to reload your records
+because the mappings occur at index time.
+So once again:
+"./import-marc.sh tests/data/journals.mrc".
+We wait a few seconds
+while the records all load and now if I
+go back to my web browser and my VuFind
+search results you'll see that when
+I loaded things before "My Institution"
+and "Library A" were showing up in these
+facets but if I refresh the page my
+changes have taken effect. It's now
+Demian's University and Demian's House.
+Please don't come to my house.
+Now going back to the terminal, let's look at
+another thing we might want to change.
+As I mentioned, this is a file full of
+journals, so if I look at the format
+facet I'll see there are 10 journals
+here. But suppose that I don't like
+calling them journals;
+I want to degrade them by calling them
+magazines. That is something I can
+change by editing the
+format translation map that I mentioned earlier.
+So once again we can create a copy of
+the map in our local directory and this
+will override the default map. First
+let me create a directory to hold my
+custom translation maps because you're
+not going to have such a directory by
+default. We have to create
+"local/import/translation_maps"
+and then I can copy
+"import/translation_maps/format_map.properties"
+into "local/import/translation_maps".
+I've now created a local copy
+that I can customize, so I
+will edit that.
+As you can see, here
+is where we map the more technical terms
+that come out of the custom getFormat
+method into more human-readable versions,
+but down here we just map journal
+straight back into journal again. I
+can change this:
+"Journal = Magazine".
+Now if I just go back and import my
+records one more time, when I go back
+to my results and refresh the page,
+"Journal" has just turned into "Magazine".
+As you can see it's pretty easy to
+customize some of these strings but
+let's look at some more complicated things.
+As I mentioned languages are
+currently pulled from a whole bunch of
+fields and run through a translation map
+but let's suppose that I don't want to
+display those English language names for
+some reason. Perhaps I would rather set
+up some settings in my code to change
+them into something on the fly. What I
+can do is look at the rule
+in the core "import/marc.properties"
+and copy this rule.
+Then I'm going to edit my
+"local/import/marc-local.properties"
+so I can override it.
+Just going to go to the bottom
+and I'm going to paste the rule in.
+I'm actually going to paste it in twice
+so I can keep the old version
+commented out for reference and then I'm
+just going to get rid of the language map.
+Now this will just take the codes and
+index them directly without doing any
+translation work on them.
+So now I'm going to import the records
+one more time. Before I refresh
+the page to show the results, let's see
+what this looked like before: here we
+have "English", "French", "German", "Italian".
+Now I'm going to refresh the page,
+and sure enough, now we are only getting
+the raw codes.
+One more interesting
+feature of SolrMarc is that you can
+actually combine together multiple rules.
+I realize my example is getting a
+little bit contrived but let's suppose
+that I actually wanted to see both the
+mapped versions and the raw versions.
+What I can do is just put both versions
+of the rule in place but on the second
+one change the "=" to a "+="
+and what this will do is it will take
+all the mapped versions put them in the
+language field and then also take all
+the raw versions and add those to the
+language field.
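The difference between "=" and "+=" can be shown with a toy rule processor in Python. This is a sketch of the semantics as described above, not SolrMarc's parser; the rule tuples, the extractor functions, and the sample values are invented for the illustration.

```python
def apply_rules(rules, extractors):
    """Apply (field, operator, extractor name) rules in order.

    "=" replaces whatever the field held so far; "+=" adds to it.
    """
    document = {}
    for field, operator, name in rules:
        values = extractors[name]()
        if operator == "=":
            document[field] = list(values)
        elif operator == "+=":
            merged = document.setdefault(field, [])
            for value in values:
                if value not in merged:
                    merged.append(value)
        else:
            raise ValueError(f"unknown operator: {operator}")
    return document


# Hypothetical extractors standing in for the mapped and raw language rules.
extractors = {
    "mapped": lambda: ["English", "French"],
    "raw": lambda: ["eng", "fre"],
}

combined = apply_rules(
    [("language", "=", "mapped"), ("language", "+=", "raw")], extractors
)
replaced = apply_rules(
    [("language", "=", "mapped"), ("language", "=", "raw")], extractors
)

print(combined["language"])  # mapped names followed by raw codes
print(replaced["language"])  # only the raw codes survive
```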
+This kind of combination
+can be really useful when you have
+information in multiple fields of the
+MARC record formatted in different ways
+and you want to normalize it in some
+fashion. You can create a separate rule
+for each field and then combine all the
+fields together with "+=" but for
+this example we'll just do it this way.
+Now I will reindex one more time
+and now if I refresh this page one more
+time and scroll down to language, you'll
+see that we now have both the
+three-letter codes and the mapped names.
+So that covers the very basic ways that
+you can import MARC records and
+customize the way that they're mapped.
+Of course, there's a much deeper dive
+possible here, such as looking at the
+custom Java code or the whole end-to-end
+experience of adding a new field to
+VuFind: not only indexing it but also
+making it display in the interface.
+As I said, these are good topics for the future,
+but for right now I'm just trying to
+introduce all of the basics. If you want
+to learn more about SolrMarc and
+indexing, be sure to take a look at the
+SolrMarc page in VuFind's wiki and
+the SolrMarc GitHub project, both of
+which have some more documentation on
+how to solve common problems using the
+importer and more detail about the
+depths of the importing language which
+is actually quite rich and has quite a
+few features that can be strung together
+to do complicated transformations.
+There's also a SolrMarc tech mailing
+list which is a great place to ask for
+help, in addition to the usual VuFind
+support venues.
+I hope this has been
+useful and I will share another video
+next month!
 ---- struct data ----
+properties.Page Owner :
 ----