====== Video 8: Sitemaps and Web Indexing ======
The eighth VuFind video is a discussion of sitemaps and web crawling: how VuFind can generate sitemaps of its own content, and how it can build a searchable index from other sites' sitemaps.

Video is available as an [[https://…]]
===== Transcript =====
So, this month's VuFind video is a discussion of sitemaps and web crawling, and it's sort of a logical follow-up to last month's video.

So, first, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it and crawl through it and make it searchable. But without a little bit of help, this can happen somewhat haphazardly, and that is where XML sitemaps come into play.

A sitemap is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let them find all of your content. This is certainly important for something like VuFind, where the interface is search-driven and there may not actually be a way to crawl into every single record without typing something into a box first.

So, by publishing a sitemap, we make it possible for all of our index records to be findable. And so VuFind includes facilities for generating sitemaps, which make it very search engine friendly and will make your content a lot more visible.

On the flip side of the coin, sitemaps are also a really useful tool for harvesting content. And so VuFind also includes tools for crawling all of the pages in a sitemap in order to build a website index.

So today I will show sort of both sides of that equation: how you can make sitemaps from VuFind, and how you can populate a web index from sitemaps. I have up here sitemaps.org, which is where the sitemap protocol specification can be found if you want to learn more about how these documents are structured.
As a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root. So I've got this front page, and I have a link that leads to this other page. And I have by hand generated a sitemap.xml file, which just lists both of these pages: the root of the site and the linked page.

So, suppose that I want this website I've just created to live in harmony with the VuFind instance I've been demonstrating for some time. What I would want to do is create a sitemap covering the content of VuFind, as well as a sitemap index; the sitemap protocol allows you to group together multiple sitemaps so that they can all be discovered as a bundle.
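As a concrete sketch of what such a hand-written sitemap might contain (the URLs and filename here are illustrative assumptions, not the actual demo files), a minimal ''sitemap.xml'' following the sitemaps.org protocol looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- The root of the demo site: -->
  <url>
    <loc>http://localhost/</loc>
  </url>
  <!-- The second, hand-linked page (filename assumed): -->
  <url>
    <loc>http://localhost/linked.html</loc>
  </url>
</urlset>
```

The full schema, including optional elements like ''lastmod'' and ''changefreq'', is documented at sitemaps.org.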
- | which you can submit to a search engine | + | Fortunately, |
- | in order to let them find all of your | + | |
- | content | + | The "site map index" |
- | for something like view find where it's | + | |
- | a search driven interface and there may | + | You can control the name of the index file that VuFind |
- | not actually be a way to crawl into | + | |
- | every single record without typing | + | And now VuFind |
- | something into a box first | + | |
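Putting those settings together, the relevant parts of ''local/config/vufind/sitemap.ini'' might look roughly like the following. This is a sketch: the key names reflect the settings discussed above, and the values are this example's assumptions, so check the comments in the distributed file for exact names and defaults.

```ini
[Sitemap]
frequency    = daily            ; crawl-frequency hint written into the sitemap
countPerPage = 10000            ; URLs per generated sitemap chunk (example value)
fileName     = vufindsitemap    ; avoid clobbering the hand-made sitemap.xml
fileLocation = /var/www/html    ; web root on this Ubuntu server

[SitemapIndex]
baseSitemapUrl      = http://localhost   ; URL corresponding to fileLocation
baseSitemapFileName = sitemap.xml        ; existing hand-made sitemap to include
```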
Now, I'm just running all of these tools using my own ''dkatz'' account. In a production environment, of course, you would want to have a dedicated user for running these utilities, and you would have ownership set accordingly, but for expediency I'm just going to use ''dkatz'' for all of these demonstrations. So I'm going to ''sudo chown dkatz /var/www/html'' so that I now own the web root and have permission to write files there.

And then all I need to do is say ''php util/sitemap.php'' from the VuFind home directory, and that will run the generator. It takes a few seconds, and now if I do a file listing of ''/var/www/html'', I will see that there is a ''vufindsitemap.xml'' and a ''sitemapIndex.xml'' that were not there before. We can also look at those through the browser: if I go to ''localhost/sitemapIndex.xml'', sure enough, it points us to two different files: the existing sitemap that was already there (which we told VuFind about as the base sitemap) and also the new ''vufindsitemap.xml'' that has been generated. If we look at that one, we will see that it contains a list of all the records in VuFind, so that they can all be easily crawled.
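The generated index file itself is tiny. Given a hand-made sitemap plus one generated sitemap, a ''sitemapIndex.xml'' will contain something along these lines (illustrative; the real output's formatting and URLs depend on your configuration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- The hand-made sitemap configured as the base sitemap: -->
  <sitemap>
    <loc>http://localhost/sitemap.xml</loc>
  </sitemap>
  <!-- The sitemap the generator just wrote: -->
  <sitemap>
    <loc>http://localhost/vufindsitemap.xml</loc>
  </sitemap>
</sitemapindex>
```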
- | make it very search engine friendly and | + | So one more small step that you might want to take with sitemaps is to publish for search engines where they can be found. You may be familiar with a file called |
- | will make your content a lot more | + | |
- | visible on the flip side of the coin | + | So if I edit ''/ |
- | sitemaps | + | |
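In this example the whole file can be a single line (a real robots.txt would often also contain User-agent and Disallow rules):

```
Sitemap: http://localhost/sitemapIndex.xml
```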
Now that we've seen how to create sitemaps with VuFind, let's talk about how VuFind can take advantage of other people's sitemaps to index external systems. As I mentioned earlier, VuFind has the capacity to index content from sitemaps into a web index so that you can search your own website. The first thing you need to do is set up a full-text extraction tool. There are several places in VuFind where it can take advantage of full-text extraction to make the content of a file searchable; for example, when indexing XML, you can set up custom rules that will use URLs within metadata to retrieve content and then index all of the text coming back from those URLs, and that same mechanism is used by VuFind's sitemap indexer.

VuFind supports two different full-text extraction tools. One is called Aperture, but it has not been updated in many years, so I strongly encourage you to use the other option, which is Apache Tika; that can be obtained at tika.apache.org. If you just go to the download page, there are a number of downloads available, but what we want is the tika-app jar file, which is all that you need to extract text content from a variety of different file formats, including various document types and, fortunately for us, web pages. I've actually already completed the download of this file to save a little bit of time.
Once we've downloaded Tika, we want to set it up in a way where it's easily accessible. What I like to do is give it its own directory; I'm going to call it ''/usr/local/tika'', and I'm going to copy the file from my download directory into ''/usr/local/tika''. I also like to create a symbolic link shortcut from the long Tika jar filename to just ''tika.jar'', which makes VuFind configuration easier: we can download new versions of the app as they're released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind configuration files. So I'm just going to do a quick ''sudo ln -s /usr/local/tika/tika-app-1.24.1.jar /usr/local/tika/tika.jar''. And that's all we have to do to install Tika: download a jar file and put it someplace nice and easy.
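The symlink convention can be sketched as follows. This sketch uses a scratch directory and an empty stand-in file so it is self-contained; in the video, the directory is ''/usr/local/tika'', the jar is really downloaded from tika.apache.org, and the commands are run with sudo.

```shell
# Scratch directory standing in for /usr/local/tika:
TIKA_DIR=$(mktemp -d)

# Stand-in for the downloaded tika-app jar:
touch "$TIKA_DIR/tika-app-1.24.1.jar"

# Point a stable name at the versioned jar; after a future upgrade,
# only this link needs to change, never VuFind's configuration:
ln -s "$TIKA_DIR/tika-app-1.24.1.jar" "$TIKA_DIR/tika.jar"

# The link resolves to the versioned jar:
readlink "$TIKA_DIR/tika.jar"
```

When a new Tika release comes out, replacing the link with ''ln -sf'' against the new jar is the only change needed.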
But we also have to tell VuFind where to find it, and for that there is a configuration file called ''fulltext.ini''. So let's copy ''config/vufind/fulltext.ini'' to ''local/config/vufind'', as we always do to override it, and then edit ''local/config/vufind/fulltext.ini''. Like all of VuFind's configuration files, once again, there are lots of comments here explaining what it does. We just need to do two simple things: uncomment the general setting and set it to use Tika (if we didn't do that, it would try to auto-detect which tool is being used, which is fine, but telling it is a little bit faster), then skip past the Aperture settings and just uncomment the Tika path so VuFind knows where to find Tika. As you can see, the default setting matches the symbolic link that I set up, so there's no need to change anything other than uncommenting the line.
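After those two changes, the uncommented parts of ''local/config/vufind/fulltext.ini'' look approximately like this (a sketch; the section and key names here are assumptions based on the description above, so verify them against the comments in the distributed file):

```ini
[General]
parser = Tika   ; name the tool explicitly rather than auto-detecting

[Tika]
path = "/usr/local/tika/tika.jar"   ; matches the symbolic link created earlier
```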
Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there's another configuration file for that. So we'll copy ''config/vufind/webcrawl.ini'' to ''local/config/vufind'', and we'll edit that file.

All webcrawl.ini really does is tell VuFind's sitemap indexer where to find sitemaps. You can create a list of as many sitemaps as you want, but for this example, the only thing we want to index is our locally created sitemap.xml, not the whole sitemap index. Our VuFind crawler is smart enough to crawl into all of the sitemaps referenced by an index, but we really don't want to index VuFind's record pages into VuFind's own web index; that would only be confusing. So we're going to focus on the content that exists outside of VuFind itself.

We could also turn on the verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not.
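The resulting ''local/config/vufind/webcrawl.ini'' is very short; roughly like this (a sketch: the sitemap URL is this example's, and the exact key names should be checked against the comments in the distributed file):

```ini
[General]
verbose = true   ; optional: extra feedback while the crawler runs

[Sitemaps]
; List as many sitemaps as you like; here, only the hand-made one,
; deliberately skipping the index that also references VuFind's own records:
url[] = "http://localhost/sitemap.xml"
```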
So now we're all set up! VuFind knows where to find Tika for full-text extraction, and it knows where to find a sitemap to index, so we can just run the crawler, which is ''php import/webcrawl.php'' from the VuFind home directory. It's reading the sitemap.xml file now and doing a bit of work, and in just a moment we should have our content indexed.
What the web crawler is actually doing is running the XML importer that was demonstrated last month: we have an XSLT that uses the sitemap XML, in combination with some custom PHP code, to run Tika to extract content from web pages and then index them into a special Solr core designed specifically for searching web pages. The other little piece the web crawler adds is that it keeps track of when it runs, and it deletes anything that was indexed on a prior run.

Every time you run the web crawler, it captures a timestamp, it indexes all of the sitemaps you referred to, and then it deletes anything that's older than the time at which the process started. So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this, though, because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index, so use with caution.
In any case, now that the indexing has completed, we can go back to our VuFind interface and go to ''/Web'' (with a capital W), which brings up the website search using the Solr website index I mentioned earlier. If I do a blank search to retrieve everything, we will see that I now have two pages: both of the pages from my sitemap were indexed. And just to prove that the full-text searching works: if I type the word "lazy" here, a word that appears on only one of these two pages, sure enough, there it is, and it highlights where the text matched. Everything is working.
One quick thing to demonstrate before we call it a day is that there are two configuration files that might be of interest. There's a ''website.ini'' file, which controls all the behavior of the web search; this is kind of like a combination of facets.ini and searches.ini for the main index, but these settings are applied to the website search. So if you want to customize recommendation modules, or change labels, sort options, facets, etc., it's all in here. This is how you can control the presentation of your website search.

Also of possible interest is ''config/vufind/websearchspecs.yaml''. Again, this is just like the regular searchspecs.yaml used for the Biblio core, but tuned to the website index. So if you want to change which fields are searched, or how relevancy ranking works, this is the place where you can do that.
Finally, if you want to configure the actual indexing process, there is an ''import/xsl/sitemap.xsl'' file, which gets applied to all of the sitemaps in order to index them; as you can see here, this is really just a wrapper around a PHP function called VuFindSitemap::getDocument. And ''import/sitemap.properties'' is the import configuration that sets up the custom class, specifies the XSLT, and so forth. It's beyond the scope of today's video to go deeper, but if you want to customize things, what you want to do is override the VuFind class with your own behavior, and you can do anything you like in that PHP. For example, you might want to extract values from particular HTML meta tags and use them for facets, or whatever you need to do.
So, that's it for this month. Next month, we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular index at the same time; I will show you how to do that. Until then, have a good month, and thank you for listening.

//This is an edited version of an automated transcript. Apologies for any errors.//
videos/sitemaps_and_web_indexing.1596807013.txt.gz · Last modified: 2020/08/07 13:30 by demiankatz