Warning: This page has not been updated in over a year and may be outdated or deprecated.
videos:sitemaps_and_web_indexing
Differences
This shows you the differences between two versions of the page.
Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
videos:sitemaps_and_web_indexing [2023/04/25 14:14] – [Transcript] crhallberg | videos:sitemaps_and_web_indexing [2023/04/26 13:29] (current) – crhallberg | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Video 8: Sitemaps and Web Indexing ====== | ====== Video 8: Sitemaps and Web Indexing ====== | ||
- | The eighth | + | The eighth |
Video is available as an [[https:// | Video is available as an [[https:// | ||
Line 17: | Line 17: | ||
So, first, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it and crawl through it and make it searchable. But without a little bit of help, this can happen somewhat haphazardly, | So, first, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it and crawl through it and make it searchable. But without a little bit of help, this can happen somewhat haphazardly, | ||
- | A site map is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let it find all of your content. This is certainly important for something like VuFind™, where it's a search-driven interface and there may not actually be a way to crawl into every single record without typing something into a box first. | + | A site map is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let it find all of your content. This is certainly important for something like VuFind, where it's a search-driven interface and there may not actually be a way to crawl into every single record without typing something into a box first. | ||
- | So, by publishing a site map, we make it possible for all of our index records to be findable. And so VuFind™ includes facilities for generating site maps, which make it very search engine friendly and will make your content a lot more visible. | + | So, by publishing a site map, we make it possible for all of our index records to be findable. And so VuFind includes facilities for generating site maps, which make it very search engine friendly and will make your content a lot more visible. |
- | On the flip side of the coin, site maps are also a really useful tool for harvesting content. And so VuFind™ also includes tools for crawling all of the pages in the site map and creating a website index. | + | On the flip side of the coin, site maps are also a really useful tool for harvesting content. And so VuFind also includes tools for crawling all of the pages in the site map and creating a website index. | ||
- | I will show sort of both sides of that equation, how you can make site maps from VuFind™ and how you can populate VuFind™ using site maps. And I have up here sitemaps.org, | + | I will show sort of both sides of that equation, how you can make site maps from VuFind and how you can populate VuFind using site maps. And I have up here sitemaps.org, |
As a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root. So I've got this front page and I have a link that leads to this other page. And I have generated by hand a sitemap.xml file, which just lists both of these pages, the root of the site and the link page. | As a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root. So I've got this front page and I have a link that leads to this other page. And I have generated by hand a sitemap.xml file, which just lists both of these pages, the root of the site and the link page. | ||
- | So, suppose that I want this website I've just created to live in harmony with the VuFind™ instance that I've been demonstrating for some time. What I would want to do is create a site map containing all of the content of VuFind™ as well as a site map index, which is another part of the site map specification, | + | So, suppose that I want this website I've just created to live in harmony with the VuFind instance that I've been demonstrating for some time. What I would want to do is create a site map containing all of the content of VuFind as well as a site map index, which is another part of the site map specification, | ||
- | Fortunately, | + | Fortunately, |
- | The "site map index" section controls how VuFind™ generates the high-level index XML. One important setting is the "base site map URL," which corresponds to the directory where the files will be generated. In this instance, it's http:// | + | The "site map index" section controls how VuFind generates the high-level index XML. One important setting is the "base site map URL," which corresponds to the directory where the files will be generated. In this instance, it's http:// |
- | You can control the name of the index file that VuFind™ generates. We'll leave that as the default. And we can tell it the name of an existing sitemap that we want to include in the index. The default of base sitemap is not a file that actually exists in our example, so we will just tell it to use the sitemap.xml that was generated by hand for the demonstration web page and incorporate that into the index so that it's findable along with VuFind™-generated content. | + | You can control the name of the index file that VuFind generates. We'll leave that as the default. And we can tell it the name of an existing sitemap that we want to include in the index. The default of base sitemap is not a file that actually exists in our example, so we will just tell it to use the sitemap.xml that was generated by hand for the demonstration web page and incorporate that into the index so that it's findable along with VuFind-generated content. | ||
- | And now VuFind™'s sitemap generator is fully configured. There' | + | And now VuFind's sitemap generator is fully configured. There' | ||
- | And now I'm just running all of these tools using my own decats account. In a production environment, | + | And now I'm just running all of these tools using my own decats account. In a production environment, |
- | And then all I need to do is say `phputil/ | + | And then all I need to do is say '' |
- | So one more small step that you might want to take with sitemaps is to publish for search engines where they can be found. You may be familiar with a file called | + | So one more small step that you might want to take with sitemaps is to publish for search engines where they can be found. You may be familiar with a file called |
- | So if I edit `/ | + | So if I edit '' |
- | Now that we've seen how to create site maps within VuFind™, let's talk about how VuFind™ can take advantage of other people' | + | Now that we've seen how to create site maps within VuFind, let's talk about how VuFind can take advantage of other people' |
- | VuFind™ supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so it's strongly encouraged that everyone use the second option, Tika, which can be obtained at `https:// | + | VuFind supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so it's strongly encouraged that everyone use the second option, Tika, which can be obtained at '' | ||
- | Once we've downloaded Tika, we should set it up in a way where it's easily accessible to VuFind™. What I like to do is give it its own directory. We will call it /usr/local/tika. I'm going to copy the file from my download directory, and I like to create a symbolic link shortcut from the long Tika jar file to just `tika.jar`, which makes VuFind™ configuration easier because we can download new versions of the app as they are released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind™ configuration files. I'm just going to do a quick `sudo ln -s / | + | Once we've downloaded Tika, we should set it up in a way where it's easily accessible to VuFind. What I like to do is give it its own directory. We will call it /usr/local/tika. I'm going to copy the file from my download directory, and I like to create a symbolic link shortcut from the long Tika jar file to just '' | ||
- | Like all of VuFind™'s configuration files, once again there are lots of comments here explaining what it does. We just need to do two simple things: uncomment the general section and tell it to use Tika. If we didn't do that, it would try to auto-detect which tools are being used, which is fine, but telling it is a little bit faster. You can skip all the Aperture settings, and then we can just uncomment the Tika path so that VuFind™ knows where to find Tika. And as you can see, the default setting matches the symbolic link that I set up, so there' | + | Like all of VuFind's configuration files, once again there are lots of comments here explaining what it does. We just need to do two simple things: uncomment the general section and tell it to use Tika. If we didn't do that, it would try to auto-detect which tools are being used, which is fine, but telling it is a little bit faster. You can skip all the Aperture settings, and then we can just uncomment the Tika path so that VuFind knows where to find Tika. And as you can see, the default setting matches the symbolic link that I set up, so there' | ||
- | Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there' | + | Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there' |
- | So all that webcrawl.ini does is tell VuFind™'s sitemap indexer where to find site maps. You can create a list of as many site maps as you want, but for this example, the only thing we want to index is our locally created | + | So all that webcrawl.ini does is tell VuFind's sitemap indexer where to find site maps. You can create a list of as many site maps as you want, but for this example, the only thing we want to index is our locally created | ||
We could also turn on this verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not. | We could also turn on this verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not. | ||
- | So now we're all set up! VuFind™ knows where to find Tika for full-text extraction, and it knows where to find a site map to crawl, so all we need to do is run the crawler, which is `php import/ | + | So now we're all set up! VuFind knows where to find Tika for full-text extraction, and it knows where to find a site map to crawl, so all we need to do is run the crawler, which is '' | ||
So what the web crawler is actually doing is running the XML importer that was demonstrated last month. We have an XSLT that uses site map XML in combination with some custom PHP code. | So what the web crawler is actually doing is running the XML importer that was demonstrated last month. We have an XSLT that uses site map XML in combination with some custom PHP code. | ||
Line 67: | Line 67: | ||
Every time you run the web crawler, it captures a timestamp. It indexes all of the site maps you referred to, and then it deletes anything that's older than the time at which the process started. So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this though because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index, so use with caution. | Every time you run the web crawler, it captures a timestamp. It indexes all of the site maps you referred to, and then it deletes anything that's older than the time at which the process started. So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this though because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index, so use with caution. | ||
- | In any case, now that the indexing has completed, we can go back to our VuFind™ interface. And if we go to `http:// | + | In any case, now that the indexing has completed, we can go back to our VuFind interface. And if we go to '' |
That I now have two pages, both of the pages from my site map were indexed, and just to prove that the full-text searching works right. If I type the word " | That I now have two pages, both of the pages from my site map were indexed, and just to prove that the full-text searching works right. If I type the word " | ||
Line 73: | Line 73: | ||
One quick thing to demonstrate before we call it a day is that there are two configuration files that might be of interest. There' | One quick thing to demonstrate before we call it a day is that there are two configuration files that might be of interest. There' | ||
- | So, of possible interest is `config/ | + | So, of possible interest is '' | ||
- | Finally, if you want to configure the actual indexing process, there is an import/ | + | Finally, if you want to configure the actual indexing process, there is an import/ |
- | That is it for this month. Next month we are going to look at how you can combine different kinds of searches in ViewFind, which will be useful because it would be nice to be able to search our new website index and our regular bibliographic journal index at the same time. I will show you how to do that. Until then, have a good month, and thank you for listening. | + | That is it for this month. Next month we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular bibliographic journal index at the same time. I will show you how to do that. Until then, have a good month, and thank you for listening. | ||
//This is an edited version of an automated transcript. Apologies for any errors.// | //This is an edited version of an automated transcript. Apologies for any errors.// |
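
The transcript describes a hand-written sitemap.xml listing two pages, plus a sitemap index that points at several sitemaps. As a rough illustration of what those two XML documents look like, here is a minimal Python sketch. It is not VuFind's generator (which is PHP), and the `example.org` URLs and the `vufind-sitemap.xml` file name are placeholders invented for the example. Only the namespace and the `urlset`/`url`/`loc` and `sitemapindex`/`sitemap`/`loc` element names come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return sitemap XML (as a string) with one <url><loc> per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in page_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML with one <sitemap><loc> per sitemap file."""
    sitemapindex = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(sitemapindex, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    return ET.tostring(sitemapindex, encoding="unicode")


def extract_locations(xml_text):
    """Return every <loc> value found in a sitemap or a sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter("{%s}loc" % SITEMAP_NS)]


if __name__ == "__main__":
    # Placeholder URLs: a front page and one linked page, as in the demo site.
    site = build_sitemap(["http://example.org/", "http://example.org/link.html"])
    # Placeholder index: the hand-made sitemap plus a hypothetical generated one.
    index = build_sitemap_index([
        "http://example.org/sitemap.xml",
        "http://example.org/vufind-sitemap.xml",
    ])
    print(site)
    print(index)
    print(extract_locations(site))
```

The `extract_locations` helper shows the other direction: a crawler reading a sitemap only needs the `<loc>` values to know which pages to fetch.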
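
The transcript also explains that each crawler run captures a timestamp, reindexes everything in the configured site maps, and then deletes anything older than the start of the run. The following Python sketch models that logic on an in-memory dictionary. It is an assumption-laden illustration and not VuFind's actual code: the `crawl_and_prune` function, the dictionary-based index, and the page names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def crawl_and_prune(index, live_urls, run_started):
    """Reindex live_urls, then drop entries not refreshed during this run.

    index maps each URL to the time it was last indexed (hypothetical model).
    Returns the list of URLs that were removed as stale.
    """
    for url in live_urls:
        # Every page found in the site maps gets the current run's timestamp.
        index[url] = run_started
    # Anything still carrying an older timestamp was not seen in this run.
    stale = [url for url, indexed_at in index.items() if indexed_at < run_started]
    for url in stale:
        del index[url]
    return stale


if __name__ == "__main__":
    first_run = datetime(2023, 4, 25, tzinfo=timezone.utc)
    second_run = first_run + timedelta(days=1)

    index = {}
    crawl_and_prune(index, ["front-page", "link-page", "old-page"], first_run)

    # "old-page" has left the site map, so the second run removes it.
    print(crawl_and_prune(index, ["front-page", "link-page"], second_run))

    # If the site is offline and no pages are found, everything is pruned.
    print(crawl_and_prune(index, [], second_run + timedelta(days=1)))
```

The last call shows the caveat from the transcript: a run that finds no pages, for example because the site is temporarily offline, empties the index.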
videos/sitemaps_and_web_indexing.1682432057.txt.gz · Last modified: 2023/04/25 14:14 by crhallberg