Warning: This page has not been updated in over a year and may be outdated or deprecated.
videos:sitemaps_and_web_indexing

Differences

This shows you the differences between two versions of the page.

videos:sitemaps_and_web_indexing [2023/04/25 14:14] – [Transcript] crhallberg
videos:sitemaps_and_web_indexing [2023/04/25 19:19] – [Transcript] crhallberg
Line 39: Line 39:
 And now I'm just running all of these tools using my own dkatz account. In a production environment, of course, you would want to have a dedicated user for running VuFind™ utilities, and you would have ownership set accordingly. But in the interest of expediency, I'm just going to use dkatz for all of these demonstrations, so I'm going to chown our web root to dkatz so that I now own the web root and have permission to write files there.
  
-And then all I need to do is say `php util/sitemap.php` while I'm in the VuFind™ home directory, and that will run the generator. That only took a couple of seconds, and now if I do a file listing of our `/var/www/html`, I will see that there is a `vufind-sitemap.xml` and a `sitemapIndex.xml` that were not there before. You can look at those in our web browser if you go to `localhost/sitemapIndex.xml`. Sure enough, this points us to two different files: the existing `sitemap.xml` that was already there, which we told VuFind™ about as the base sitemap, and also this new `vufind-sitemap.xml`, which has been generated. And if I go there, we will see that this contains a list of all the records in VuFind™ so they can be easily crawled.
+And then all I need to do is say ''php util/sitemap.php'' while I'm in the VuFind™ home directory, and that will run the generator. That only took a couple of seconds, and now if I do a file listing of our ''/var/www/html'', I will see that there is a ''vufind-sitemap.xml'' and a ''sitemapIndex.xml'' that were not there before. You can look at those in our web browser if you go to ''localhost/sitemapIndex.xml''. Sure enough, this points us to two different files: the existing ''sitemap.xml'' that was already there, which we told VuFind™ about as the base sitemap, and also this new ''vufind-sitemap.xml'', which has been generated. And if I go there, we will see that this contains a list of all the records in VuFind™ so they can be easily crawled.
  
-So one more small step that you might want to take with sitemaps is to tell search engines where they can be found. You may be familiar with a file called `robots.txt`, which you can use to tell crawlers which parts of your site they should or should not be crawling. You can also use that file to specify where a sitemap lives.
+So one more small step that you might want to take with sitemaps is to tell search engines where they can be found. You may be familiar with a file called ''robots.txt'', which you can use to tell crawlers which parts of your site they should or should not be crawling. You can also use that file to specify where a sitemap lives.
  
-So I edit `/var/www/html/robots.txt`. In this example, I'm just creating a new file, but you might have an existing one in some situations. All I need to do is say `Sitemap:` and give the URL of my site map, in this case `Sitemap: http://localhost/sitemapIndex.xml`. Now, if a robot comes to the site, looks at `robots.txt`, and supports the part of the protocol that includes the site map specification, it will know exactly where to look and can find all of your content.
+So I edit ''/var/www/html/robots.txt''. In this example, I'm just creating a new file, but you might have an existing one in some situations. All I need to do is say ''Sitemap:'' and give the URL of my site map, in this case ''Sitemap: http://localhost/sitemapIndex.xml''. Now, if a robot comes to the site, looks at ''robots.txt'', and supports the part of the protocol that includes the site map specification, it will know exactly where to look and can find all of your content.
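A minimal sketch of that robots.txt step, assuming a scratch directory in place of ''/var/www/html'' so it can be run without touching a real web root. The ''webroot'' and ''robots'' variable names are only for illustration; the ''Sitemap:'' line is the one quoted above.

```shell
#!/bin/sh
# Scratch directory standing in for the real web root (/var/www/html in the video).
webroot="$(mktemp -d)"
robots="$webroot/robots.txt"

# Append rather than overwrite, so any rules already in the file are preserved.
printf 'Sitemap: %s\n' 'http://localhost/sitemapIndex.xml' >> "$robots"

# Show the resulting file.
cat "$robots"
```

Appending with ''>>'' covers both cases mentioned above: it creates the file if it does not exist and leaves existing rules alone if it does.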
  
 Now that we've seen how to create site maps within VuFind™, let's talk about how VuFind™ can take advantage of other people's site maps, including our own site maps from our content management systems for websites. VuFind™ has the capacity to index content from site maps to create a web index so that you can search your own website. However, before it can do that, you need to set up a full-text extraction tool. There are several places in VuFind™ where it can take advantage of full-text extraction to make the content of a file searchable. For example, when indexing MARC records or XML files, you can set up custom rules that will use URLs within metadata to retrieve content and then index all of the text coming back from those URLs. That same mechanism is used by VuFind™'s site map indexer.
  
-VuFind™ supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so it's strongly encouraged that everyone uses the second option, which is Tika, which can be obtained at `https://tika.apache.org`. You just go to the download page. There are a number of downloads available, but what we want is the tika-app jar file, which is all that you need to extract text content from a variety of different file formats, including PDFs, office documents, and web pages. I've actually already downloaded this file to save a little bit of time.
+VuFind™ supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so it's strongly encouraged that everyone uses the second option, which is Tika, which can be obtained at ''https://tika.apache.org''. You just go to the download page. There are a number of downloads available, but what we want is the tika-app jar file, which is all that you need to extract text content from a variety of different file formats, including PDFs, office documents, and web pages. I've actually already downloaded this file to save a little bit of time.
  
-Once we've downloaded Tika, we should set it up in a way where it's easily accessible to VuFind™. What I like to do is give it its own directory. We will call it `/usr/local/tika`. I'm going to copy the file from my download directory, and I like to create a symbolic link shortcut from the long Tika jar file to just `tika.jar`, which makes VuFind™ configuration easier because we can download new versions of the app as they are released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind™ configuration files. I'm just going to do a quick `sudo ln -s /usr/local/tika/tika-app-1.24.1.jar /usr/local/tika/tika.jar`. So that's all we have to do to install Tika: just download a jar file and put it someplace nice and easy. But we also have to tell VuFind™ where to find it. And for that, there is a configuration file called `fulltext.ini`. So let's copy `config/vufind/fulltext.ini` to `local/config/vufind/`, as you always do to override it. And then edit `local/config/vufind/fulltext.ini`.
+Once we've downloaded Tika, we should set it up in a way where it's easily accessible to VuFind™. What I like to do is give it its own directory. We will call it ''/usr/local/tika''. I'm going to copy the file from my download directory, and I like to create a symbolic link shortcut from the long Tika jar file to just ''tika.jar'', which makes VuFind™ configuration easier because we can download new versions of the app as they are released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind™ configuration files. I'm just going to do a quick ''sudo ln -s /usr/local/tika/tika-app-1.24.1.jar /usr/local/tika/tika.jar''. So that's all we have to do to install Tika: just download a jar file and put it someplace nice and easy. But we also have to tell VuFind™ where to find it. And for that, there is a configuration file called ''fulltext.ini''. So let's copy ''config/vufind/fulltext.ini'' to ''local/config/vufind/'', as you always do to override it. And then edit ''local/config/vufind/fulltext.ini''.
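The symbolic link idea can be tried out safely with a sketch like the following. It assumes a scratch directory in place of ''/usr/local/tika'' and empty placeholder files in place of real jar downloads; the ''9.9.9'' version number is made up purely to illustrate a later upgrade.

```shell
#!/bin/sh
# Scratch directory standing in for /usr/local/tika.
tika_dir="$(mktemp -d)"

# Empty placeholder for the downloaded jar (a real install would copy the download here).
touch "$tika_dir/tika-app-1.24.1.jar"

# Stable name that configuration can point at.
ln -s "$tika_dir/tika-app-1.24.1.jar" "$tika_dir/tika.jar"

# Later upgrade (hypothetical version number): drop in the new jar and re-point the link.
touch "$tika_dir/tika-app-9.9.9.jar"
ln -sfn "$tika_dir/tika-app-9.9.9.jar" "$tika_dir/tika.jar"

# Print where the stable name points now.
readlink "$tika_dir/tika.jar"
```

Because only the link changes, anything configured to use ''tika.jar'' keeps working after the upgrade without being edited.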
  
 Like all of VuFind™'s configuration files, once again there are lots of comments here explaining what it does. We just need to do two simple things. Uncomment the general section and tell it to use Tika. If we didn't do that, it would try to auto-detect which tools are being used, which is fine, but telling it is a little bit faster. You can skip all the Aperture settings, and then we can just uncomment the Tika path. So now VuFind™ knows where to find Tika, and as you can see, the default setting matches the symbolic link that I set up, so there's no need to change anything other than uncommenting the line.
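Those two edits could also be scripted. The sketch below works on a small sample file in a scratch location; the sample contents, including the section and key names ''[General]'', ''parser'', ''[Tika]'' and ''path'', are assumptions modelled on the description above rather than a copy of the shipped ''fulltext.ini'', so check them against your own copy before adapting this.

```shell
#!/bin/sh
# Scratch file standing in for local/config/vufind/fulltext.ini.
cfg="$(mktemp)"

# Assumed shape of the relevant commented-out settings (not the real shipped file).
cat > "$cfg" <<'EOF'
;[General]
;parser = Aperture

[Tika]
;path = "/usr/local/tika/tika.jar"
EOF

# Uncomment the general section, select Tika, and uncomment the Tika path.
sed -e 's/^;\[General\]/[General]/' \
    -e 's/^;parser = .*/parser = Tika/' \
    -e 's/^;path = /path = /' \
    "$cfg" > "$cfg.new"
mv "$cfg.new" "$cfg"

# Show the edited file.
cat "$cfg"
```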
  
-Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there's another config file for that. So we'll copy `config/vufind/webcrawl.ini` to `local/config/vufind`. And we'll edit that file.
+Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there's another config file for that. So we'll copy ''config/vufind/webcrawl.ini'' to ''local/config/vufind''. And we'll edit that file.
  
-So all that webcrawl.ini does is tell VuFind™'s site map indexer where to find site maps. You can create a list of as many site maps as you want, but for this example, the only thing we want to index is our locally created `sitemap.xml`. We could index `sitemapIndex.xml`; the VuFind™ crawler is smart enough to crawl into all of the site maps referenced by an index, but we really don't want to index VuFind™ record pages into the VuFind™ web index. That would only be confusing, so we're going to focus on the content that exists outside of VuFind™ itself.
+So all that webcrawl.ini does is tell VuFind™'s site map indexer where to find site maps. You can create a list of as many site maps as you want, but for this example, the only thing we want to index is our locally created ''sitemap.xml''. We could index ''sitemapIndex.xml''; the VuFind™ crawler is smart enough to crawl into all of the site maps referenced by an index, but we really don't want to index VuFind™ record pages into the VuFind™ web index. That would only be confusing, so we're going to focus on the content that exists outside of VuFind™ itself.
  
 We could also turn on this verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not.
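As a rough illustration of what such a crawler configuration might contain, the sketch below writes a small ''webcrawl.ini'' into a scratch directory. The section and key names (''[General]'', ''verbose'', ''[Sitemaps]'', ''url[]'') are assumptions based on the description above, so compare them with the comments in the real file before relying on them.

```shell
#!/bin/sh
# Scratch directory standing in for local/config/vufind.
cfg_dir="$(mktemp -d)"
cfg="$cfg_dir/webcrawl.ini"

# Assumed layout: one optional verbose switch and a list of site maps to crawl.
cat > "$cfg" <<'EOF'
[General]
verbose = true

[Sitemaps]
url[] = http://localhost/sitemap.xml
EOF

# Count how many site maps are listed.
sitemap_count="$(grep -c '^url\[\] = ' "$cfg")"
echo "site maps configured: $sitemap_count"
```

Only the locally created ''sitemap.xml'' is listed here, which keeps VuFind™ record pages out of the web index as discussed above.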
  
-So now we're all set up! VuFind™ knows where to find Tika for full-text extraction, and it knows where to find a site map to crawl, so all we need to do is run the crawler, which is `php import/webcrawl.php` from the home VuFind™ directory. It's harvesting the sitemap XML file, doing a bit of work, and in just a moment, we should have our content.
+So now we're all set up! VuFind™ knows where to find Tika for full-text extraction, and it knows where to find a site map to crawl, so all we need to do is run the crawler, which is ''php import/webcrawl.php'' from the home VuFind™ directory. It's harvesting the sitemap XML file, doing a bit of work, and in just a moment, we should have our content.
  
 So what the web crawler is actually doing is running the XML importer that was demonstrated last month. We have an XSLT that uses site map XML in combination with some custom PHP code.
Line 67: Line 67:
 Every time you run the web crawler, it captures a timestamp. It indexes all of the site maps you referred to, and then it deletes anything that's older than the time at which the process started. So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this though because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index, so use with caution.
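The timestamp-and-delete behaviour can be illustrated with a toy model. This is only a sketch of the idea, not VuFind™'s actual code: the "index" is a text file of ''last_indexed url'' lines, and the timestamps and URLs are made-up values.

```shell
#!/bin/sh
# Toy index: each line is "<last_indexed> <url>"; both pages were indexed at time 100.
index="$(mktemp)"
printf '%s\n' '100 http://localhost/a.html' '100 http://localhost/b.html' > "$index"

# Timestamp captured when this crawl starts (made-up value).
start=200

# URLs currently listed in the site map; b.html has been removed from the site.
sitemap_urls='http://localhost/a.html'

# Re-index every listed URL, stamping it with the current run's start time.
for url in $sitemap_urls; do
  awk -v u="$url" '$2 != u' "$index" > "$index.new"
  printf '%s %s\n' "$start" "$url" >> "$index.new"
  mv "$index.new" "$index"
done

# Delete anything whose stamp is older than the start of this run.
awk -v start="$start" '$1 >= start' "$index" > "$index.new"
mv "$index.new" "$index"

cat "$index"
```

If ''sitemap_urls'' were empty, for example because the site was temporarily unreachable, the final step would remove every entry, which is the risk described above.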
  
-In any case, now that the indexing has completed, we can go back to our VuFind™ interface. And if we go to `http://localhost/vufind/Web` with a capital W, that brings up the website search, which uses the Solr website index I mentioned.
+In any case, now that the indexing has completed, we can go back to our VuFind™ interface. And if we go to ''http://localhost/vufind/Web'' with a capital W, that brings up the website search, which uses the Solr website index I mentioned.
  
 By just doing a blank search, I can see that I now have two pages; both of the pages from my site map were indexed. And just to prove that the full-text searching works right, if I type the word "lazy" here, that word appears on only one of these two pages, and sure enough, there it is. It highlights where the text matched; everything is working.
Line 73: Line 73:
 One quick thing to demonstrate before we call it a day is that there are two configuration files that might be of interest. There's a website.ini file which controls all the behavior of the web search, and this is kind of like a combination of facets.ini and searches.ini for the main search, but these settings are applied to the website search. So if you want to customize recommendation modules or change labels, sort options, facets, etc., it's all in here. So this is how you can control the presentation of your website search.
  
-Also of possible interest is `config/vufind/websearchspecs.yaml`. Again, this is just like the regular searchspecs.yaml for the biblio index, but this is tuned to the website index. So if you want to change which fields are searched or how relevancy ranking works, this is the place where you can do that.
+Also of possible interest is ''config/vufind/websearchspecs.yaml''. Again, this is just like the regular searchspecs.yaml for the biblio index, but this is tuned to the website index. So if you want to change which fields are searched or how relevancy ranking works, this is the place where you can do that.
  
 Finally, if you want to configure the actual indexing process, there is an import/xsl/sitemap.xsl, which is the XSLT that gets applied to all of the site maps in order to index them. As you can see here, this is really just a wrapper around a PHP function called "VuFindSitemap::getDocument()". There is also import/sitemap.properties, the import configuration that sets up the custom class, specifies the XSLT, and so forth. This is beyond the scope of today's video, but if you want to customize things, what you want to do is override the VuFindSitemap class with your own behavior. You can do anything you like in that PHP. For example, you might want to extract values from particular HTML meta tags and use them for facets or whatever you need to do.
videos/sitemaps_and_web_indexing.txt · Last modified: 2023/04/26 13:29 by crhallberg