
Video 2: Indexing MARC Records

The second VuFind instructional video discusses how to index MARC records (and configure/customize the indexing process).

Video is available as an mp4 download or through YouTube.

Transcript

:!: Needs editing and cleanup

Alright, so in this video we are going to talk about loading MARC records into VuFind, which is a good first step after you have installed it as described in our previous video. Because they are the easiest to load, we are going to work exclusively with MARC records today. I'm going to assume that you know what a MARC record is, but the one-sentence description is that it's a standard interchange format for library records where the data is split up into numeric fields with alphabetic subfields.

MARC records are included with VuFind as part of its test suite, so if you just want to kick the tires, it's really easy to get some records loaded in there. You just switch into your VuFind directory. At the moment I'm actually looking at the virtual machine I set up in the previous video, and I'm in the VuFind directory. If I do an ls of tests/data/*.mrc, I can see all of the MARC records that are available.

First things first: we need to start up our Solr index, which is where VuFind stores all of its records. We can't put things in the index until the index is running. I do that with solr.sh start, and it tells me that Solr is starting up on port 8080, a fact that is useful to remember for later, since we might want to look at the Solr index once it has records in it.

Now that Solr is running, there's a handy script called import-marc.sh. If we feed it the name of a MARC file, which can be either binary MARC or MARCXML, it will load everything into the index for us. I'm going to choose journals.mrc, which contains some journal records, and it takes just a couple of seconds to load all of the data into the index once it gets going. And there we go.

I'm going to go to my web browser now and do a blank search in VuFind, which is a good way to get a list of all the records that have been indexed, and I see that I now have ten journal records appearing in the search results.

So what just happened? It
might be useful to take a step back and talk a little bit about what Solr is. Solr is an open source search engine which essentially takes a set of keys and values and applies some rules to them to make them very easy to search and to perform faceted filtering on. What we did with our script was map some MARC records into Solr.

I mentioned when I started up Solr that it's useful to make note of the port number it's running on. That is because Solr has a web interface that you can actually look at: if you go to the name of the server where it's running, access the port it's running on, and then go to /solr, you will find a handy administrative interface. Solr arranges its records into a set of cores so that you can store different types of records in different index schemas, and in the case of VuFind, the main records are stored in a core named biblio. If we go to biblio and then to the querying tool, we can run the default *:* everything search, and we see that here indeed are the ten records that were put into the index. We can see some of the field names and values that got placed there. All of this data was mapped out of those MARC records that we loaded.

So how did the indexing tool know what to do with the contents of the MARC records? The answer is that there is a configuration file called marc.properties in VuFind's index, sorry, in VuFind's import directory, which contains all of the rules for mapping MARC fields into Solr fields. If we edit this file, we can browse through and see some of the mappings that are present. This is using a separate open source tool called SolrMarc, which is specially built for mapping MARC records into Solr, and so it contains a domain-specific language for specifying these mappings.

The simplest thing that you can possibly do is set a field to a fixed string, so that every record you import will have that particular value. In VuFind's defaults, we set everything to a
collection of "Catalog", an institution of "MyInstitution" and a building of "Library A". These are obviously settings that you will want to override, and I'll show you how to do that in a little bit.

MARC records contain two types of fields. The first nine fields, 001 through 009, are called control fields, and they simply have numeric field designations without subfields. You can see that by default VuFind takes the first value found in an 001 field and stores that as the unique identifier for the record. IDs are really important in the Solr index because that's how Solr tells records apart. Every record must have a unique ID, and if you index two records with the same ID, the second record you index will replace the first one. So if your MARC records don't have their IDs in the 001 field, this is another detail that you will need to customize.

After those first nine control fields, all of the other fields in a MARC record are called data fields, and they have alphabetic subfields containing different bits of data. We store a lot of things from those data fields in the Solr index. For example, in the second line here, we get the Library of Congress card number out of the 010 field, subfield a, and this "first" designation again stores only the first value in the Solr index. Similarly, here we take the 035a control number, but because we're not specifying that we only take the first one, that could potentially store multiple values in the index.

Moving down a little bit to how we index languages, this is a little bit more complicated. Taking this part by part: SolrMarc field specifications can have a colon-separated list of different things to pull out, and it will take all of those values and store them in the index as separate values. So here we are taking characters 35 through 37 of the 008 control field, which is usually a language code, as well as 041 subfields a, d, h and j, all as separate values. And SolrMarc supports what it calls translation
maps, which can take one string and map it into another. There's a file in the translation_maps subfolder of import called language_map.properties, which translates all of the three-letter language codes that are found in MARC records into the full English names for those codes.

You'll also see in here that there are some custom methods. When SolrMarc sees the word "custom", that means it's looking for a Java function with the name that follows. So there's actually a Java class that ships with VuFind and contains a getFormat method, which contains all the rules for specifying record formats. Looking at that is beyond the scope of this introductory video, but by including custom methods in Java, it is possible to write your own and to extend the ones that ship with VuFind, and that might be a good topic for a future video. You can also apply translation maps to the output of custom methods, so there's a format_map.properties which takes the very granular formats that come out of our Java code and translates them into a smaller set of more human-readable values that will be displayed in the index.

One last example of some of the power and flexibility of SolrMarc is here at the very bottom. The pattern maps I mentioned before, or rather the translation maps I mentioned before, are separate files, but you can also include translation maps inline in the marc.properties file by giving them a name and wrapping them in parentheses. And the translations don't have to be straight string to string; you can also define a set of regular expression rules to apply. These last five lines of the file are the rules for extracting the numeric portion of OCLC numbers found in 035a. The rule says: take all the values from 035a and apply this pattern map. Then this pattern_0, pattern_1, pattern_2, pattern_3 stuff provides a series of regular expressions that will be tested against all the strings that are extracted, and if matches are found, mappings will be made. So this, for
example, takes the different case variations of "OCoLC" in parentheses and then extracts just the numeric portion, using regular expression sub-pattern matching to $1. Then there are the ocm versions, the ocn versions and the on versions. If you don't know what I'm talking about, there are a variety of different ways that OCLC numbers are prefixed depending on the age and source of the records, and for VuFind's purposes we only care about the numbers, hence all of these mapping rules.

So now that I've shown you how VuFind works by default, it would be useful to know how you can change and customize these things. Before I dive completely into that, a quick aside about the VuFind local directory, which is an important concept not just for import but also for configuration of VuFind more generally.

When you set up VuFind, the installer gives you two environment variables. There's VUFIND_HOME, which in this instance is /usr/local/vufind, and there's VUFIND_LOCAL_DIR, which in this instance is /usr/local/vufind/local. The reason for this local directory is to give you a place to store all of your local files separately from the files that are distributed as part of VuFind. This offers a couple of advantages. One is that you can have multiple configurations of VuFind at once and switch between them by just changing the variable that tells VuFind where to load its configurations. The other is that when you upgrade VuFind, you only need to keep track of a couple of directories of your local files; the rest you can delete and replace with a new version, which makes the upgrade process a little bit more streamlined.

In general, the way VuFind uses the local directory is that when it's trying to load a configuration of any sort, it will look in the local directory first. If it finds something there, it will use it; if it does not find it there, it will go up to VUFIND_HOME. This is why there are a lot of parallel structures. So,
for example, there's an import directory inside VUFIND_HOME which contains lots of default configurations, but there's also a local/import directory which contains a smaller number of files, all of which are local overrides. As you can see here, when you first install VuFind you actually get a couple of files in there for free: import.properties and import_auth.properties. These are the files which tell SolrMarc where to find its configurations and other dependencies, so they are auto-generated for you by the installer to point at the directories that VuFind is being installed into. import.properties is the primary configuration file for loading into the biblio core, and import_auth.properties is for loading authority records, which again may be a topic for a future video.

Let's take a quick look at import.properties so you can see what's in there. It's not a whole lot. We specify which core name we're loading things into. We specify the names of the properties files that we're using for mappings: as you can see, here is the marc.properties file I showed you, and also a second file called marc_local.properties, which is a way that you can create a file that overrides parts of marc.properties without overwriting all of it, and we will be working with that in a little bit. We also have the URL where Solr is running, so that SolrMarc can push the records into its update handler, and then solrmarc.path, which is the location of the SolrMarc files. As you can see, we specify first the local import directory and then the main import directory, which is how we load local files ahead of default files. Then there are also a few defaults about error handling and character encodings that you should be able to leave alone, but which are here for you to tweak if you ever need to.

With all of that out of the way, let's start customizing some things. The first thing that we should do is create a marc_local.properties file in our local
import directory, so we have a local place to begin putting our customizations. The easiest way to do that is to copy the standard marc_local.properties, which contains a whole lot of commented-out examples that might be useful, into local/import and then edit it there.

As I mentioned, by default VuFind indexes everything to MyInstitution and Library A, and some of the suggestions here in marc_local.properties are about changing that. I'm going to call the institution Demian's University and I'm going to call the building Demian's House, because I have started a university in which I record videos about VuFind. Having made that change, I need to reindex all of my records in order for the change to take effect. Whenever you change your import rules, you need to reload your records, because the mappings occur at index time. So once again: import-marc.sh tests/data/journals.mrc. We wait a few seconds while the records all load, and now if I go back to my web browser and my VuFind search results, you'll see that when I loaded things before, MyInstitution and Library A were showing up in these facets, but if I refresh the page, my changes have taken effect. It's now Demian's University and Demian's House. Please don't come to my house.

Now, going back to the terminal, let's look at another thing we might want to change. As I mentioned, this is a file full of journals, so if I look at the format facet, I'll see there are 10 journals here. But suppose that I don't like calling them journals; I want to degrade them by calling them magazines. That is something I can change by editing the format translation map that I mentioned earlier. Once again, we can create a copy of the map in our local directory, and this will override the default map. First let me create a directory to hold my custom translation maps, because you're not going to have such a directory by default. So we have to create local/import/translation_maps, and then I can copy import/translation_maps/format
properties (sorry, format_map.properties) into local/import/translation_maps. I've now created a local copy that I can customize, so I will edit that. As you can see, here is where we map the more technical terms that come out of the custom getFormat method into more human-readable versions, but down here we just map Journal straight back into Journal again. I can change this to Journal = Magazine, and now if I go back and import my records one more time, then go back to my results and refresh the page, Journal has just turned into Magazine.

As you can see, it's pretty easy to customize some of these strings, but let's look at some more complicated things. As I mentioned, languages are currently pulled from a whole bunch of fields and run through a translation map. But let's suppose that I don't want to display those English language names for some reason; perhaps I would rather set up some settings in my code to change them into something on the fly. What I can do is look at the rule in the core import/marc.properties, and I am going to copy this rule. Then I'm going to edit my local/import/marc_local.properties so I can override it. I'm going to go to the bottom and paste the rule in. I'm actually going to paste it in twice so I can keep the old version commented out for reference, and then I'm just going to get rid of the language map. This will just take the codes and index them directly without doing any translation work on them. Now I'm going to import the records one more time. Again, before I refresh the page to show the results, let's see what this looked like before: here we have English, French, German, Italian. Now I'm going to refresh the page, and sure enough, we are now only getting the raw codes.

One more interesting feature of SolrMarc is that you can actually combine multiple rules. I realize my example is getting a little bit contrived, but let's suppose that I actually wanted to see both the mapped
versions and the raw versions. What I can do is put both versions of the rule in place, but on the second one change the = to a +=. What this will do is take all the mapped versions and put them in the language field, and then also take all the raw versions and add those to the language field. This kind of combination can be really useful when you have information in multiple fields of the MARC record formatted in different ways and you want to normalize it in some fashion: you can create a separate rule for each field and then combine all the fields together with +=. But for this example we'll just do it this way. Now I will reindex one more time, and if I refresh this page one more time and scroll down to language, you'll see that we now have both the three-letter codes and the mapped names.

So that covers the very basic ways that you can import MARC records and customize the way that they're mapped. Of course, there's a much deeper dive possible here, such as looking at the custom Java code, or the whole end-to-end experience of adding a new field to VuFind: not only indexing it but also making it display in the interface. As I say, these are good topics for the future, but for right now I'm just trying to introduce all of the basics. If you want to learn more about SolrMarc and indexing, be sure to take a look at the SolrMarc page in VuFind's wiki and the SolrMarc GitHub project, both of which have more documentation on how to solve common problems using the importer, and more detail about the depths of the importing language, which is actually quite rich and has quite a few features that can be strung together to do complicated transformations. There's also a solrmarc-tech mailing list, which is a great place to ask for help in addition to the usual VuFind support venues.

I hope this has been useful, and I will share another video next month.
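
The pattern-map idea described in the transcript (test a series of regular expressions against each value extracted from 035a and keep only the captured digits) can be sketched in Python. This is an illustration only, not SolrMarc's implementation: the function name extract_oclc_numbers and the exact expressions are assumptions modeled on the spoken description, and dropping values that match no pattern is likewise an assumption.

```python
import re

# Hypothetical rules modeled on the transcript's description of the OCLC
# pattern map; the expressions in VuFind's marc.properties may differ.
OCLC_RULES = [
    re.compile(r"\(ocolc\)[^0-9]*0*([0-9]+).*", re.IGNORECASE),  # (OCoLC) prefix, any case
    re.compile(r"ocm0*([0-9]+).*"),                              # ocm prefix
    re.compile(r"ocn0*([0-9]+).*"),                              # ocn prefix
    re.compile(r"on0*([0-9]+).*"),                               # on prefix
]


def extract_oclc_numbers(values):
    """Return the numeric part of each value that matches one of the rules.

    The first matching rule wins; values matching no rule are dropped
    (an assumption made for this sketch).
    """
    results = []
    for value in values:
        for rule in OCLC_RULES:
            match = rule.fullmatch(value.strip())
            if match:
                # group(1) plays the role of $1 in the pattern map.
                results.append(match.group(1))
                break
    return results


if __name__ == "__main__":
    sample_035a = ["(OCoLC)00012345", "ocm00067890", "ocn123456789", "(DLC)2001012345"]
    print(extract_oclc_numbers(sample_035a))
```

In this sketch the leading zeros are consumed outside the capture group, so differently prefixed forms of the same OCLC number reduce to the same digits.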

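The language example from the transcript (pull codes from 008 positions 35 through 37 and from 041 subfields a, d, h and j, run them through a translation map, then use += to add the raw codes as well) can also be sketched in Python. Everything here is a simplified model rather than SolrMarc code: the dictionary-based record layout, the four-entry LANGUAGE_MAP, the function names, and the choices to drop unmapped codes and remove duplicates are all assumptions made for illustration.

```python
# Hypothetical excerpt standing in for language_map.properties.
LANGUAGE_MAP = {"eng": "English", "fre": "French", "ger": "German", "ita": "Italian"}

LANGUAGE_SUBFIELDS = {"a", "d", "h", "j"}


def raw_language_codes(record):
    """Collect codes from 008 positions 35-37 and 041 subfields a, d, h, j.

    `record` is a simplified stand-in for a MARC record: the 008 field is a
    string and 041 is a list of fields, each a list of (code, value) pairs.
    """
    codes = []
    field_008 = record.get("008", "")
    if len(field_008) >= 38:
        codes.append(field_008[35:38])
    for subfields in record.get("041", []):
        for code, value in subfields:
            if code in LANGUAGE_SUBFIELDS:
                codes.append(value)
    return codes


def unique(values):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


def index_language(record):
    """Model 'language = <spec>, language_map' followed by 'language += <spec>'."""
    codes = raw_language_codes(record)
    mapped = [LANGUAGE_MAP[c] for c in codes if c in LANGUAGE_MAP]  # first rule (=)
    return unique(mapped + codes)                                   # second rule (+=)


if __name__ == "__main__":
    record = {
        "008": "x" * 35 + "eng" + "  ",
        "041": [[("a", "eng"), ("h", "fre")]],
    }
    print(index_language(record))
```

Removing the second step (returning only the mapped list, or only the raw codes) corresponds to the earlier states of the rule shown in the video.
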
videos/indexing_marc_records.1580761723.txt.gz · Last modified: 2020/02/03 20:28 by demiankatz