Video 8: Sitemaps and Web Indexing
The eighth VuFind® instructional video explains how to share your VuFind contents using XML sitemaps as well as how to crawl the contents of existing sitemaps to build a web index.
So, this month's video is going to be a discussion of site maps and web crawling, and it's sort of a logical follow up to last month's video about indexing XML.
So, first, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it and crawl through it and make it searchable. But without a little bit of help, this can happen somewhat haphazardly, and that is where XML site maps come into play.
A site map is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let them find all of your content. This is certainly important for something like VuFind where it's a search driven interface and there may not actually be a way to crawl into every single record without typing something into a box first.
So, by publishing a site map, we make it possible for all of our index records to be findable. And so VuFind includes facilities for generating site maps, which make it very search engine friendly and will make your content a lot more visible.
On the flip side of the coin, site maps are also a really useful tool for harvesting content. And so VuFind also includes tools for crawling all of the pages in a site map and creating a website index.
I will show sort of both sides of that equation, how you can make site maps from VuFind and how you can populate VuFind using site maps. And I have up here sitemaps.org, which is where the site map specification can be found if you want to learn more about how these documents are structured.
As a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand edited in the web root. So I've got this front page and I have a link that leads to this other page. And I have generated by hand a sitemap.xml file, which just lists both of these pages, the root of the site and the link page.
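For anyone following along without the screen, a sitemap like this can be sketched in a few lines of Python. This is only an illustration of the document structure described at sitemaps.org, not the file used in the video; the page name link.html is a made-up placeholder for the second page.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given page URLs."""
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs standing in for the front page and the linked page.
sitemap_xml = build_sitemap(["http://localhost/", "http://localhost/link.html"])
print(sitemap_xml)
```

Each page gets one url element with a loc child; the protocol also allows optional children such as lastmod and changefreq, which are left out here for brevity.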
So, suppose that I want this website I've just created to live in harmony with the VuFind instance I've been demonstrating for some time. What I would want to do is create a site map containing all of the content of VuFind as well as a site map index, which is another part of the site map specification, which allows you to group together multiple site maps so that they can all be discovered as a bundle.
Fortunately, VuFind includes a command-line tool that will do all of this for you. I'm just going to drop to the terminal and go to my VuFind home directory. There is a configuration file named “sitemap.ini” that I will copy into my local “config” directory and then edit. The “sitemap.ini” file contains comments explaining the settings, and I will highlight the most important ones.
The top “Sitemap” section controls how VuFind generates sitemaps. The “frequency” setting affects the content of the sitemap (it tells crawlers how often pages are expected to change), and “count per page” controls the number of URLs in each sitemap file. By default, VuFind generates 10,000 URLs per chunk, and a sitemap index file points to all the chunks. The “name” setting controls the name of the sitemap file, which defaults to “sitemap.xml.” The “file location” setting determines where VuFind generates the sitemap files, and I will change it to “/var/www/html,” the web root of my Ubuntu server.
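The chunking idea behind “count per page” is easy to picture with a short Python sketch. This is not VuFind's own code; the function name and the record URLs are invented for illustration, and only the 10,000 default comes from the description above.

```python
def chunk_urls(urls, count_per_page=10000):
    """Split a list of URLs into consecutive chunks of at most count_per_page."""
    if count_per_page < 1:
        raise ValueError("count_per_page must be at least 1")
    return [urls[i:i + count_per_page] for i in range(0, len(urls), count_per_page)]


# 25,000 made-up record URLs; the /vufind/Record/ path is only illustrative.
record_urls = [f"http://localhost/vufind/Record/{n}" for n in range(25000)]
chunks = chunk_urls(record_urls)
print([len(c) for c in chunks])  # [10000, 10000, 5000]
```

Each chunk would become its own sitemap file, and the sitemap index would then list every chunk so that crawlers can find them all.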
The “site map index” section controls how VuFind generates the high-level index XML. One important setting is the “base site map URL,” which corresponds to the directory where the files will be generated. In this instance, it's http://localhost. Obviously, in a real world scenario, you would not be using localhost here, but for this example, it will do.
You can control the name of the index file that VuFind generates. We'll leave that as the default. And we can tell it the name of an existing sitemap that we want to include in the index. The default of base sitemap is not a file that actually exists in our example, so we will just tell it to use the sitemap.xml that was generated by hand for the demonstration web page and incorporate that into the index so that it's findable along with VuFind-generated content.
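To show what a sitemap index looks like, here is a rough Python sketch that builds one from a base URL and a list of sitemap file names. It follows the sitemapindex structure from sitemaps.org and is not VuFind's generator; the function name is hypothetical, and the file names are the ones that appear in this demonstration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, sitemap_files):
    """Return a sitemap index (as a string) pointing at each sitemap file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for name in sitemap_files:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        # Each entry's location is the base URL plus the file name.
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = f"{base_url.rstrip('/')}/{name}"
    return ET.tostring(index, encoding="unicode")


index_xml = build_sitemap_index(
    "http://localhost", ["sitemap.xml", "vufind-sitemap.xml"]
)
print(index_xml)
```

The base URL plays the same role as the “base site map URL” setting: it is the public address under which the generated files can be fetched.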
And now VuFind's sitemap generator is fully configured. There's just one more important detail, which is that I need to be sure that the user that runs the command-line tool has write access to the web root, or else it won't be able to successfully write out the sitemaps that it generates.
Right now I'm just running all of these tools using my own dkatz account. In a production environment, of course, you would want to have a dedicated user for running VuFind utilities, and you would have ownership set accordingly. But in the interest of expediency, I'm just going to use dkatz for all of these demonstrations, so I'm going to chown the web root to dkatz so that I now own the web root and have permission to write files there.
And then all I need to do is say php util/sitemap.php while I'm in the VuFind home directory, and that will run the generator. That only took a couple of seconds, and now if I do a file listing of /var/www/html, I will see that there is a vufind-sitemap.xml and a sitemapIndex.xml that were not there before. You can look at those in a web browser if you go to localhost/sitemapIndex.xml. Sure enough, this points us to two different files: the existing sitemap.xml that was already there, which we told VuFind about as the base sitemap, and also this new vufind-sitemap.xml, which has been generated. And if I go there, we will see that this contains a list of all the records in VuFind so they can be easily crawled.
So one more small step that you might want to take with sitemaps is to publish where they can be found for search engines. You may be familiar with a file called robots.txt, which you can use to tell crawlers which parts of your site they should or should not be crawling. You can also use that file to specify where a sitemap lives.
So if I edit /var/www/html/robots.txt (in this example, I'm just creating a new file, but you might have an existing one in some situations), all I need to do is say Sitemap: and give the URL of my site map. In this case, Sitemap: http://localhost/sitemapIndex.xml. Now, if a robot comes to the site and looks at robots.txt and supports the part of the protocol that includes the site map specification, it will know exactly where to look and can find all of your content.
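To see the robots.txt line from a crawler's point of view, here is a small Python sketch using the standard library's robots.txt parser. It assumes Python 3.8 or later (where the site_maps() method was added) and parses the text in memory rather than fetching it from a server.

```python
from urllib.robotparser import RobotFileParser

# The single line added to robots.txt in this example.
robots_txt = "Sitemap: http://localhost/sitemapIndex.xml\n"

parser = RobotFileParser()
# parse() accepts the file content as a list of lines, so no network is needed.
parser.parse(robots_txt.splitlines())

# site_maps() returns the URLs from all Sitemap lines (or None if there are none).
print(parser.site_maps())  # ['http://localhost/sitemapIndex.xml']
```

A crawler that supports this part of the protocol would then fetch that URL and follow the index to each individual sitemap.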
Now that we've seen how to create site maps within VuFind, let's talk about how VuFind can take advantage of other people's site maps, including our own site maps from the content management systems behind our websites. VuFind has the capacity to index content from site maps to create a web index so that you can search your own website. However, before it can do that, you need to set up a full-text extraction tool. There are several places in VuFind where it can take advantage of full-text extraction to make the content of a file searchable. For example, when indexing MARC records or XML files, you can write custom rules that will use URLs within metadata to retrieve content and then index all of the text coming back from those URLs. That same mechanism is used by VuFind's site map indexer.
VuFind supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so it's strongly encouraged that everyone use the second option, which is Tika, which can be obtained at https://tika.apache.org. You just go to the download page. There are a number of downloads available, but what we want is the tika-app jar file, which is all that you need to extract text content from a variety of different file formats, including PDFs, office documents, and web pages. I've actually already downloaded this file to save a little bit of time.
Once we've downloaded Tika, we should set it up in a way where it's easily accessible to VuFind. What I like to do is give it its own directory. We will call it /usr/local/tika. I'm going to copy the file from my download directory, and I like to create a symbolic link shortcut from the long Tika jar file to just tika.jar, which makes VuFind configuration easier because we can download new versions of the app as they are released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind configuration files. I'm just going to do a quick sudo ln -s /usr/local/tika/tika-app-1.24.1.jar /usr/local/tika/tika.jar. So that's all we have to do to install Tika: just download a jar file and put it someplace nice and easy. But we also have to tell VuFind where to find it. And for that, there is a configuration file called fulltext.ini. So let's copy it into local/config/vufind/, as you always do to override a configuration file, and then edit it.
Like all of VuFind's configuration files, once again there are lots of comments here explaining what it does. We just need to do two simple things. Uncomment the general section and tell it to use Tika. If we didn't do that, it would try to auto-detect which tools are being used, which is fine, but telling it is a little bit faster. You can skip all the Aperture settings, and then we can just uncomment the Tika path so VuFind knows where to find Tika. As you can see, the default setting matches the symbolic link that I set up, so there's no need to change anything other than uncommenting the line.
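If you want to check that Tika itself works before involving VuFind, something like the following Python sketch can help. It is an assumption-laden illustration, not how VuFind calls Tika: it relies on the tika-app command-line option --text (plain-text output), on Java being installed, and on the symbolic link path used above. The function names are made up for this example.

```python
import shutil
import subprocess


def tika_text_command(target, jar_path="/usr/local/tika/tika.jar"):
    """Build the command line that asks the Tika app for plain text output."""
    # Assumption: tika-app accepts --text followed by a file path or URL.
    return ["java", "-jar", jar_path, "--text", target]


def extract_text(target, jar_path="/usr/local/tika/tika.jar"):
    """Run Tika on a file or URL and return the extracted text."""
    if shutil.which("java") is None:
        raise RuntimeError("java was not found on the PATH")
    result = subprocess.run(
        tika_text_command(target, jar_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return result.stdout


# Only the command is printed here; extract_text() needs Java and the jar.
print(tika_text_command("http://localhost/"))
```

Calling extract_text("http://localhost/") on a machine set up as described should return the visible text of the page, which is the kind of content that ends up in the full-text field of the index.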
Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there's another config file for that. So we'll copy webcrawl.ini into local/config/vufind, and we'll edit that file.
So all that webcrawl.ini does is tell VuFind's site map indexer where to find site maps. You can create a list of as many site maps as you want, but for this example, the only thing we want to index is our locally created sitemap.xml. We could index sitemapIndex.xml, since VuFind's crawler is smart enough to crawl into all of the site maps referenced by an index, but we really don't want to index VuFind record pages into the VuFind web index. That would only be confusing, so we're going to focus on the content that exists outside of VuFind itself.
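The idea of crawling into the site maps referenced by an index can be sketched in Python as follows. This is not VuFind's crawler; it is a simplified illustration in which the documents live in an in-memory dictionary instead of being fetched over HTTP, and the record URL is invented.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'


def collect_page_urls(url, fetch):
    """Return all page URLs reachable from a sitemap or a sitemap index."""
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):
        # An index lists other sitemaps, so recurse into each one.
        pages = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            pages.extend(collect_page_urls(loc.text.strip(), fetch))
        return pages
    # A plain sitemap lists pages directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


# Stand-in for the web server: URL -> document text (illustrative content).
DOCS = {
    "http://localhost/sitemapIndex.xml":
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>http://localhost/sitemap.xml</loc></sitemap>"
        "<sitemap><loc>http://localhost/vufind-sitemap.xml</loc></sitemap>"
        "</sitemapindex>",
    "http://localhost/sitemap.xml":
        f"<urlset {XMLNS}>"
        "<url><loc>http://localhost/</loc></url>"
        "<url><loc>http://localhost/link.html</loc></url>"
        "</urlset>",
    "http://localhost/vufind-sitemap.xml":
        f"<urlset {XMLNS}>"
        "<url><loc>http://localhost/vufind/Record/1</loc></url>"
        "</urlset>",
}

print(collect_page_urls("http://localhost/sitemapIndex.xml", DOCS.__getitem__))
print(collect_page_urls("http://localhost/sitemap.xml", DOCS.__getitem__))
```

The first call returns the record page as well as the two website pages, which is exactly why this example points the crawler at sitemap.xml rather than at the index.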
We could also turn on this verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not.
So now we're all set up! VuFind knows where to find Tika for full-text extraction, and it knows where to find a site map to crawl, so all we need to do is run the crawler, which is php import/webcrawl.php from the VuFind home directory. It's harvesting the sitemap XML file, doing a bit of work, and in just a moment, we should have our content.
So what the web crawler is actually doing is running the XML importer that was demonstrated last month. We have an XSLT that uses site map XML in combination with some custom PHP code to use Tika to extract content from web pages and then index them into a special Solr core that was designed specifically for searching web pages. The other little piece that the web crawler does is it keeps track of when it runs and it deletes anything that was indexed on a prior run.
Every time you run the web crawler, it captures a timestamp. It indexes all of the site maps you referred to, and then it deletes anything that's older than the time at which the process started. So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this though because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index, so use with caution.
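The timestamp-and-delete behavior, including the danger of crawling while a site is offline, can be simulated with a toy index in Python. This is a sketch of the logic as described here, not VuFind's implementation; the dictionary stands in for the Solr core, and the field name last_indexed and the page texts are made up.

```python
from datetime import datetime, timedelta, timezone


def crawl(index, fetch_pages, now=None):
    """Index the pages that can be fetched, then drop anything older than this run."""
    start = now or datetime.now(timezone.utc)  # timestamp captured at start
    for url, text in fetch_pages().items():
        index[url] = {"text": text, "last_indexed": start}
    # Anything not refreshed during this run is older than the start time.
    stale = [url for url, doc in index.items() if doc["last_indexed"] < start]
    for url in stale:
        del index[url]
    return stale


t0 = datetime(2020, 1, 1, tzinfo=timezone.utc)
index = {}

# First run: both pages exist.
crawl(index, lambda: {"http://localhost/": "front page",
                      "http://localhost/link.html": "the lazy dog"}, now=t0)

# Second run: one page has been removed from the site, so it is deleted.
removed = crawl(index, lambda: {"http://localhost/": "front page"},
                now=t0 + timedelta(days=1))
print(removed)  # ['http://localhost/link.html']

# Third run while the site is offline: nothing is fetched, everything is wiped.
wiped = crawl(index, lambda: {}, now=t0 + timedelta(days=2))
print(wiped, index)  # ['http://localhost/'] {}
```

The third run is the failure mode to watch for: an unreachable site looks the same as a site whose pages were all removed.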
In any case, now that the indexing has completed, we can go back to our VuFind interface. And if we go to http://localhost/vufind/Web with a capital W, that brings up the website search, which uses the Solr website index I mentioned. By just doing a blank search, I can see that I now have two pages; both of the pages from my site map were indexed. And just to prove that the full-text searching works right, if I type the word “lazy” here, that word appears on only one of these two pages, and sure enough, there it is. It highlights where the text matched, so everything is working.
One quick thing to demonstrate before we call it a day is that there are two configuration files that might be of interest. There's a website.ini file which controls all the behavior of the web search, and this is kind of like a combination of facets.ini and searches.ini for the main search, but these settings are applied to the website search. So if you want to customize recommendation modules or change labels, sort options, facets, etc., it's all in here. This is how you can control the presentation of your website search.
Also of possible interest is config/vufind/website/searchspecs.yaml. Again, this is just like the regular searchspecs.yaml for the biblio index, but this is tuned to the website index. So if you want to change which fields are searched or how relevancy ranking works, this is the place where you can do that.
Finally, if you want to configure the actual indexing process, there is an import/xsl/sitemap.xsl, which is the XSLT that gets applied to all of the site maps in order to index them. As you can see here, this is really just a wrapper around a PHP function called “VuFindSitemap::getDocument()”. There is also import/sitemap.properties, which is the import configuration that sets up the custom class and specifies the XSLT and so forth. The details are beyond the scope of today's video, but if you want to customize things, what you want to do is override the VuFindSitemap class with your own behavior. You can do anything you like in that PHP. For example, you might want to extract values from particular HTML meta tags and use them for facets or whatever you need to do.
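As an illustration of the meta tag idea, here is what the extraction step might look like. In VuFind this logic would live in your own PHP class; the sketch below is in Python purely to show the technique, and the class name, function name, and sample page are all invented.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the content of <meta name="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            # A page may repeat a tag name, so keep a list of values per name.
            self.meta.setdefault(name.lower(), []).append(content)


def extract_meta(html):
    """Return a dict mapping meta tag names to lists of content values."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta


# Made-up sample page.
page = (
    '<html><head><meta name="keywords" content="libraries">'
    '<meta name="author" content="Example Author"></head>'
    "<body>Hello</body></html>"
)
print(extract_meta(page))
```

Values collected this way could then be mapped to whichever index fields you want to use as facets.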
That's it for this month. Next month we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular bibliographic journal index at the same time. I will show you how to do that. Until then, have a good month, and thank you for listening.
This is an edited version of an automated transcript. Apologies for any errors.