






Video 8: Sitemaps and Web Indexing

The eighth VuFind instructional video explains how to share your VuFind content with search engines using XML sitemaps, as well as how to crawl the contents of existing sitemaps to build a web index.

Video is available as an mp4 download or through YouTube.

Transcript

This is an edited version of an automated transcript. Apologies for any errors.

So this month's video is going to be a discussion of sitemaps and web crawling, and it's a logical follow-up to last month's video about indexing XML.

First, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it, crawl through it, and make it searchable. Without a little bit of help, though, this can happen somewhat haphazardly, and that is where XML sitemaps come into play. A sitemap is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let it find all of your content. This is especially important for something like VuFind, where the interface is search-driven and there may not be any way to crawl into every single record without typing something into a box first. By publishing a sitemap, we make it possible for all of our indexed records to be found. VuFind includes facilities for generating sitemaps, which makes it very search-engine friendly and will make your content a lot more visible. On the flip side of the coin, sitemaps are also a really useful tool for harvesting content, so VuFind also includes tools for crawling all of the pages in a sitemap and creating a website index. Today I will show both sides of that equation: how you can make sitemaps from VuFind, and how you can populate VuFind using sitemaps. I have up here sitemaps.org, which is where the sitemap specification can be found if you want to learn more about how these documents are structured.

As a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root: there is a front page, and it has a link that leads to one other page. I have also generated, by hand, a sitemap.xml file that simply lists both of these pages, the root of the site and the linked page.

Now suppose that I want this website I've just created to live in harmony with the VuFind instance I've been demonstrating for some time. What I would want to do is create a sitemap containing all of the content of VuFind, as well as a sitemap index, which is another part of the sitemap specification that allows you to group together multiple sitemaps so that they can all be discovered as a bundle.
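To make that concrete, a minimal hand-made sitemap for a two-page site like this one could look roughly like the following; the URLs are placeholders standing in for the demo pages described above, and optional elements from the sitemaps.org specification (such as lastmod and changefreq) are omitted:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- the front page of the demo site -->
    <url>
      <loc>http://localhost/</loc>
    </url>
    <!-- the second, linked page (the filename here is a placeholder) -->
    <url>
      <loc>http://localhost/linked-page.html</loc>
    </url>
  </urlset>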
Fortunately, VuFind includes a command-line tool that will do all of this for you, so I'm going to drop to the terminal and go to my VuFind home directory. There is a configuration file called sitemap.ini, so I'm going to copy config/vufind/sitemap.ini into my local/config/vufind directory and then edit that local copy. Like all of VuFind's configuration files, sitemap.ini is full of comments explaining what all of the settings do; I won't go through every one of them right now, just the ones that are most important to get things working.

At the top there is a [Sitemap] section that controls how VuFind generates your sitemaps. There are settings here like frequency, which affects the content of the generated sitemap and can influence how frequently search engines come back and recrawl your pages. The countPerPage setting controls how many URLs VuFind puts into each of the sitemap files it generates; VuFind sites can potentially have millions of records, and creating one sitemap with a million records in it is probably going to cause problems, so there is a mechanism for breaking the output into chunks, by default 10,000 records per chunk. VuFind will then generate a sitemap index file that points to all of the chunks; as mentioned above, lists of sitemaps are another part of the sitemap specification. The fileName setting controls the name of the sitemap files that VuFind generates. It defaults to sitemap.xml, but in many cases you don't want VuFind to overwrite an existing sitemap.xml that was created by hand or by a different tool, so this setting lets us choose a more specific name; in this case I'm going to call it vufind_sitemap. The fileLocation setting determines where VuFind writes the sitemap files. By default it uses the temporary directory, but that is not going to be very useful in a real-life situation, so I'm going to change it to /var/www/html, which happens to be the web root on the Ubuntu server I am using for this example. We can also control which indexes get included, but we can stick with the defaults there, and we can affect how VuFind retrieves the URLs that go into the sitemap; again we'll use the default, but you can tune that if it is not performing quickly enough.

Next up is the [SitemapIndex] section, which controls how VuFind generates that high-level index XML I mentioned. There are a couple of important things to set here. One is the base sitemap URL: we have already set the directory where the files will be generated, and we also need to tell VuFind which URL that directory corresponds to. In this instance it is http://localhost; obviously in a real-world scenario you would not be using localhost, but for this example it will do. You can also control the name of the index file that VuFind generates (we'll leave that at the default), and you can tell it the name of an existing sitemap that should be included in the index. The default base sitemap name does not refer to a file that actually exists in our example, so we will point it at the sitemap.xml that was generated by hand for the demonstration web page; that way it gets incorporated into the index and is findable alongside my generated content. With that, VuFind's sitemap generator is fully configured.
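Pulled together, the local overrides described above would look something like the sketch below in local/config/vufind/sitemap.ini. The option names and defaults are recalled from the distributed sitemap.ini and may vary between VuFind versions, so check the comments in that file before copying anything:

  [Sitemap]
  ; number of URLs per generated sitemap file (10,000 is the default chunk size)
  countPerPage = 10000
  ; base name for generated files, so the hand-made sitemap.xml is not overwritten
  fileName     = vufind_sitemap
  ; write the generated files directly into the web root
  fileLocation = /var/www/html

  [SitemapIndex]
  ; public URL corresponding to fileLocation above
  baseSitemapUrl      = http://localhost
  ; existing hand-made sitemap to fold into the generated index
  baseSitemapFileName = sitemap.xml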
There is just one more important detail: I need to be sure that the user running the command-line tool has write access to the web root, or else it won't be able to write out the sitemaps it generates. Right now I'm running all of these tools under my own katz account; in a production environment you would of course want a dedicated user for running VuFind utilities, with ownership set accordingly, but in the interest of expediency I'm just going to use my own account for these demonstrations. So I'm going to run sudo chown on /var/www/html so that I own the web root and have permission to write files there. Then all I need to do is run php util/sitemap.php from the VuFind home directory, and that runs the generator. It only took a couple of seconds, and if I now do a file listing of /var/www/html I can see a vufind_sitemap.xml and a sitemapIndex.xml that were not there before. Let's look at those through the web browser. If we go to localhost/sitemapIndex.xml, sure enough it points us to two different files: the existing sitemap.xml that was already there and that we told VuFind about as the base sitemap, and the newly generated vufind_sitemap.xml. If I open that one, it contains a list of all the records in VuFind, so they can easily be crawled.

There is one more small step you might want to take with sitemaps, which is to publish for search engines where they can be found. You may be familiar with a file called robots.txt, which you can use to tell crawlers which parts of your site they should or should not be crawling; you can also use that file to specify where a sitemap lives. So I'll edit /var/www/html/robots.txt (in this example I'm creating a new file, but in some situations you might already have one), and all I need to do is add a line that says "Sitemap:" followed by the URL of my sitemap, in this case http://localhost/sitemapIndex.xml. Now if a robot comes to the site, looks at robots.txt, and supports the part of the protocol that covers sitemaps, it will know exactly where to look and can find all of your content.

Now that we have seen how to create sitemaps within VuFind, let's talk about how VuFind can take advantage of other people's sitemaps, including sitemaps from our own content management systems or websites. As I mentioned earlier, VuFind has the capacity to index content from sitemaps to create a web index so that you can search your own website. Before it can do that, however, you need to set up a full-text extraction tool. There are several places in VuFind that can take advantage of full-text extraction to make the content of a file searchable; for example, when indexing MARC records or XML files, you can set up custom rules that use URLs found in the metadata to retrieve documents and then index the text that comes back. That same mechanism is used by VuFind's sitemap indexer. VuFind supports two different full-text extraction tools. One is called Aperture, and it has not been updated in many years, so I strongly encourage everyone to use the second option, which is Apache Tika, available from tika.apache.org. If you go to the download page there are a number of downloads available, but the one we want is the tika-app jar file, which is all you need to extract text content from a variety of file formats, including PDFs, Office documents and, fortunately for us, web pages. I have already downloaded this file to save a little bit of time. Once it is downloaded, we should set it up somewhere easily accessible to VuFind. What I like to do is give it its own directory, /usr/local/tika: I copy the jar from my download directory into /usr/local/tika, and then I create a symbolic link from the long versioned tika-app jar file name to plain tika.jar. That makes the VuFind configuration easier, because when new versions of the app are released we only have to update the symbolic link instead of editing VuFind configuration files. So I'm going to run a quick ln -s from the versioned tika-app jar in /usr/local/tika to /usr/local/tika/tika.jar. That is all there is to installing Tika: download a jar file and put it someplace convenient.
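In shell terms, the Tika setup just described amounts to something like the following; the download location and the versioned jar name are placeholders for whatever you actually downloaded:

  # create a home for Tika and copy in the downloaded jar
  sudo mkdir -p /usr/local/tika
  sudo cp ~/Downloads/tika-app-1.24.1.jar /usr/local/tika/
  # point a stable name at the versioned jar so VuFind's config never has to change
  sudo ln -s /usr/local/tika/tika-app-1.24.1.jar /usr/local/tika/tika.jar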
We do also have to tell VuFind where to find Tika, though, and for that there is a configuration file called fulltext.ini. Let's copy config/vufind/fulltext.ini to local/config/vufind, as we always do to override a configuration, and then edit the local copy. Like all of VuFind's configuration files, it contains lots of comments explaining what everything does, and we only need to do two simple things. First, uncomment the [General] section and tell it to use Tika; if we didn't, VuFind would try to auto-detect which tool is being used, which works, but telling it explicitly is a little bit faster. You can skip all of the Aperture settings. Second, uncomment the Tika path setting so VuFind knows where to find Tika; as you can see, the default value matches the symbolic link I just set up, so there is no need to change anything beyond uncommenting the line.

Now we're halfway there: the full-text extractor is set up, but we still need to set up the web crawler, and there is another config file for that. We'll copy config/vufind/webcrawl.ini to local/config/vufind and edit that file. All webcrawl.ini does is tell VuFind's sitemap indexer where to find sitemaps; you can list as many sitemaps as you want, but for this example the only thing we want to index is our locally created sitemap.xml. We could point it at the sitemap index instead (the VuFind crawler is smart enough to crawl into all of the sitemaps referenced by an index), but we really don't want to index VuFind's own record pages into VuFind's web index, which would only be confusing, so we're going to focus on the content that exists outside of VuFind itself. We could also turn on the verbose setting just to get some more feedback from the crawler while it runs, but it makes no functional difference either way.

Now we're all set up: VuFind knows where to find Tika for full-text extraction, and it knows where to find a sitemap to crawl. All we need to do is run the crawler, which is php import/webcrawl.php from the VuFind home directory. We can see it harvesting the sitemap.xml file and doing a bit of work, and in just a moment we should have our content. What the web crawler is actually doing is running the XML importer that was demonstrated last month: an XSLT processes the sitemap XML in combination with some custom PHP code, uses Tika to extract content from the web pages, and indexes them into a special Solr core that was designed specifically for searching websites. The other little piece the web crawler handles is bookkeeping: it keeps track of when it runs and deletes anything that was indexed on a prior run. Every time you run the web crawler it captures a timestamp, indexes all of the sitemaps you have referred to, and then deletes anything older than the time at which the process started, so if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this, though, because if you run the web crawler at a time when a website is temporarily offline, it is going to wipe out parts of your index; use it with caution.

In any case, now that the indexing has completed, we can go back to our VuFind interface, and if we go to /vufind/Web (with a capital W) that brings up the website search, which uses the Solr website index I mentioned. If I do a blank search, we see that I now have two pages: both of the pages from my sitemap were indexed. Just to prove that the full-text searching works, I type the word "lazy" here, a word that appears on only one of these two pages, and sure enough there it is, with highlighting showing where the text matched. Everything is working.
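For reference, the two local overrides that made this crawl work might look roughly like the sketch below. The option names (parser and path in fulltext.ini, and the url[] list and verbose flag in webcrawl.ini) are recalled from the distributed configuration files and may differ between VuFind versions, so verify them against the comments in config/vufind/fulltext.ini and config/vufind/webcrawl.ini:

  ; local/config/vufind/fulltext.ini
  [General]
  ; use Tika rather than relying on auto-detection
  parser = tika

  [Tika]
  ; the stable symlink created earlier
  path = "/usr/local/tika/tika.jar"

  ; local/config/vufind/webcrawl.ini
  [Sitemaps]
  ; crawl only the hand-made site, not VuFind's own generated sitemap
  url[] = http://localhost/sitemap.xml

  [General]
  ; optional: print extra progress information while crawling
  verbose = true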
One last thing to cover before we call it a day: there are a couple of configuration files that might be of interest. There is a website.ini file, which controls the behavior of the web search; it is kind of like a combination of facets.ini and searches.ini for the main Biblio index, but its settings are applied to the website search. If you want to customize recommendation modules, change labels, sort options, facets and so on, it's all in there; this is how you control the presentation of your website search. Also of possible interest is config/vufind/websearchspecs.yaml. This is just like the regular searchspecs.yaml for the Biblio index, but tuned to the website index, so if you want to change which fields are searched or how relevancy ranking works, this is the place to do it. Finally, if you want to configure the actual indexing process, there is import/xsl/sitemap.xsl, which is the XSLT that gets applied to all of the sitemaps in order to index them. As you can see, it is really just a wrapper around a PHP function, VuFindSitemap's getDocument, and import/sitemap.properties is the import configuration that sets up that custom class, specifies the XSLT, and so forth. It is beyond the scope of today's video, but if you want to customize things, what you would do is override the VuFind sitemap import class with your own behavior, and you can do anything you like in that PHP; for example, you might want to extract values from particular HTML meta tags and use them for facets, or whatever else you need to do.

So that's it for this month. Next month we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular Biblio book-and-journal index at the same time; I will show you how to do that. Until then, have a good month, and thank you for listening.
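As a small illustration of the kind of customization mentioned near the end of the transcript, the sketch below shows one generic way to pull values out of HTML meta tags with PHP's DOMDocument. It is not VuFind's actual import class, just a standalone example of the technique you might adapt inside your own override; the field name custom_subject_str_mv and the "subject" meta tag are hypothetical:

  <?php
  // Generic sketch: collect <meta name="..." content="..."> values from an HTML
  // page so they could be mapped onto extra index fields (e.g. for facets).
  function extractMetaTags(string $html): array
  {
      $dom = new DOMDocument();
      libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
      $dom->loadHTML($html);
      libxml_clear_errors();

      $values = [];
      foreach ($dom->getElementsByTagName('meta') as $meta) {
          $name = $meta->getAttribute('name');
          $content = $meta->getAttribute('content');
          if ($name !== '' && $content !== '') {
              $values[$name][] = $content;
          }
      }
      return $values;
  }

  // Hypothetical usage: map a "subject" meta tag onto a custom facet field.
  $meta = extractMetaTags(file_get_contents('http://localhost/'));
  $customFields = ['custom_subject_str_mv' => $meta['subject'] ?? []];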
