videos:sitemaps_and_web_indexing

Differences

This shows you the differences between two versions of the page.

videos:sitemaps_and_web_indexing [2020/07/07 23:45] – [Transcript] demiankatz
videos:sitemaps_and_web_indexing [2020/07/08 11:40] – [Transcript] demiankatz
Line 15:
 // This is a raw machine-generated transcript; it will be cleaned up in the near future. //
  
-// coming soon ... //+ So this month's video is going to be a
 +discussion of sitemaps and web crawling,
 +and it's sort of a logical follow-up to
 +last month's video about indexing XML. So
 +first, a quick introduction to what I'm
 +talking about. If you put a website up on
 +the internet, chances are that sooner or
 +later search engines will find it and
 +crawl through it and make it searchable,
 +but without a little bit of help
 +this can happen somewhat haphazardly, and
 +that is where XML sitemaps come into
 +play. A sitemap is just an XML document
 +that lists all of the pages on your site,
 +which you can submit to a search engine
 +in order to let them find all of your
 +content. Now, this is certainly important
 +for something like VuFind, where it's
 +a search-driven interface and there may
 +not actually be a way to crawl into
 +every single record without typing
 +something into a box first.
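As a rough illustration of "just an XML document that lists all of the pages," here is a minimal Python sketch that writes a sitemap in the sitemaps.org format. The function name and the example.org URLs are hypothetical, and this is not how VuFind itself generates sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML with one <url><loc> entry per page URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical two-page site used only for illustration.
sitemap_xml = build_sitemap([
    "https://example.org/",
    "https://example.org/about.html",
])
print(sitemap_xml)
```

The protocol also allows optional per-URL elements such as `lastmod` and `changefreq`, which are left out here to keep the sketch short.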
 +So by publishing a sitemap, we make it
 +possible for all of our index records to
 +be findable, and so VuFind includes
 +facilities for generating sitemaps, which
 +make it very search-engine friendly and
 +will make your content a lot more
 +visible. On the flip side of the coin,
 +sitemaps are also a really useful tool
 +for harvesting content, and so VuFind
 +also includes tools for crawling all
 +of the pages in a sitemap and creating
 +a website index. So today I will show
 +both sides of that equation: how
 +you can make sitemaps from VuFind and
 +how you can populate VuFind using
 +sitemaps. I have up here sitemaps.org,
 +which is where the sitemap specification
 +can be found if you want to learn more
 +about how these documents are structured.
 +So just as a really simple example, I've
 +created this beautiful website on my
 +virtual machine. It's just a couple of
 +HTML files I hand-edited in the web
 +root. So I've got this front page, and I
 +have a link that leads to this other
 +page, and I have by hand generated a
 +sitemap.xml file, which just lists both
 +of these pages: the root of the site
 +and the linked page. So suppose that I
 +want this website I've just created to
 +live in harmony with the VuFind
 +instance I've been demonstrating for
 +some time. What I would want to do is
 +create a sitemap containing all of the
 +content of VuFind, as well as a
 +sitemap index, which is another part of
 +the sitemap specification which allows
 +you to group together multiple sitemaps
 +so that they can all be discovered as a
 +bundle.
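To make the idea of a sitemap index concrete, the following Python sketch builds one in the sitemaps.org format. It is only an illustration: the function name and the two sitemap URLs are assumptions, not output from VuFind's own tool.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML pointing at each individual sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(index, encoding="unicode")


# Hypothetical URLs: one hand-made sitemap and one generated sitemap.
index_xml = build_sitemap_index([
    "https://example.org/sitemap.xml",
    "https://example.org/catalog-sitemap.xml",
])
print(index_xml)
```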
 +Fortunately, VuFind includes a command
 +line tool that will do all of this for
 +you, so I am just going to drop to the
 +terminal and go to my VuFind home
 +directory.
 +There is a configuration file called
 +sitemap.ini, so I'm going to copy
 +config/vufind/sitemap.ini into my
 +local/config/vufind directory, and then
 +I'm going to edit that configuration
 +file. So like all of VuFind's
 +configuration files, the sitemap.ini
 +file is full of comments explaining what
 +all of the settings do, and I won't go
 +through all of these right now. I will
 +just highlight the ones that are most
 +important to get things working. So we
 +have a top Sitemap section that's going
 +to control how VuFind generates
 +your sitemaps. There are some settings in
 +here like frequency, which will affect
 +the content of the generated sitemap and
 +which can impact how
 +frequently search engines will come back
 +and recrawl your pages. countPerPage is
 +going to control how many URLs
 +VuFind puts in each of the sitemap
 +files it generates, because VuFind
 +sites could potentially have millions of
 +records, and creating one sitemap with a
 +million records in it is probably
 +going to cause some problems. Thus
 +there's a mechanism for breaking that up
 +into chunks, by default 10,000 records
 +per chunk, and then VuFind will
 +generate a sitemap index file that
 +points to all of the chunks. This is
 +another part of the sitemap spec:
 +you can create lists of sitemaps.
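The chunking idea can be sketched in a few lines of Python. This is an illustration under assumptions, not VuFind's implementation: the function name and the record URL pattern are hypothetical, and the default of 10,000 is taken from the description above.

```python
def chunk_urls(urls, count_per_page=10000):
    """Split a list of URLs into consecutive chunks of at most count_per_page."""
    if count_per_page < 1:
        raise ValueError("count_per_page must be at least 1")
    return [
        urls[start:start + count_per_page]
        for start in range(0, len(urls), count_per_page)
    ]


# Hypothetical record URLs; the URL pattern is made up for this example.
records = [f"https://example.org/Record/{number}" for number in range(25000)]
pages = chunk_urls(records)
chunk_sizes = [len(page) for page in pages]
print(chunk_sizes)
```

Each chunk would then be written out as its own sitemap file, and the sitemap index would list one entry per chunk.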
 +The fileName setting is going to
 +control the name of the file that
 +VuFind generates for its sitemap. Of course,
 +it defaults to sitemap.xml, but in many
 +cases you don't want VuFind to overwrite
 +an existing sitemap.xml that was created
 +either by hand or by a different tool, so
 +this gives us the ability to give it a
 +more specific name. In this case, I'm
 +going to call it vufindSitemap. fileLocation
 +determines where VuFind
 +generates the sitemap files. By default
 +it uses the temporary directory, but
 +that's not going to be very useful in a
 +real-life situation, so I'm going to
 +change it to /var/www/html, which
 +happens to be the web root on the Ubuntu
 +server I am using for this example.
 +We can also control which indexes get
 +indexed, but we can stick with defaults
 +there. We can affect how VuFind
 +retrieves the URLs to put into the
 +sitemap. Again, we'll use the default
 +there, but you can tune that in some ways
 +if it's not performing quickly enough.
 +Next up is the SitemapIndex section, and
 +this controls how VuFind generates
 +that high-level index XML that I
 +mentioned. There are a couple of
 +important things that we need to set
 +here, one being the baseSitemapUrl. When
 +we set the directory where the files
 +will be generated, we also need to tell
 +VuFind what URL that directory
 +corresponds with. In this instance, it's
 +http://localhost.
 +Obviously, in a real-world scenario you
 +would not be using localhost here, but
 +for this example it will do. You can
 +control the name of the index file that
 +VuFind generates; we'll leave that as
 +the default. And we can tell it the name
 +of an existing sitemap that we want to
 +include in the index. The default of
 +baseSitemap is not a file that actually
 +exists in our example, so we will just
 +tell it to use the sitemap.xml that was
 +generated by hand for the demonstration
 +web page and incorporate that into the index
 +so that it's findable along with my
 +generated content. So now VuFind's
 +sitemap generator is fully configured.
 +There's just one more important detail,
 +which is that I need to be sure that the
 +user that runs the command line tool has
 +write access to the web root, or else it
 +won't be able to successfully write out
 +the sitemaps that it generates. So right
 +now I'm just running all of these tools
 +using my own dkatz account. In a
 +production environment, of course, you
 +would want to have
 +a dedicated user for running VuFind
 +utilities, and you would have ownership
 +set accordingly, but in the interest of
 +expediency I'm just going to use dkatz
 +for all of these demonstrations. So I'm
 +going to sudo chown dkatz /var/www/html
 +so that I now own the web
 +root and have permission to write files
 +there, and then all I need to do is run php
 +util/sitemap.php while I'm in the
 +VuFind directory, and that will run the
 +generator. That only took a couple of
 +seconds, and now if I do a file listing
 +of /var/www/html, I will see that
 +there is a vufindSitemap.xml and a
 +sitemapIndex.xml that were not
 +there before.
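The write-access requirement mentioned above can be checked before running a generator. The Python sketch below is a generic pre-flight check, not part of VuFind; the function name is hypothetical and a temporary directory stands in for the web root.

```python
import os
import tempfile


def can_write_files(directory):
    """Return True if the current user may create files in the directory."""
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


# A temporary directory stands in for the web root in this example.
with tempfile.TemporaryDirectory() as stand_in_web_root:
    writable = can_write_files(stand_in_web_root)

print(writable)
```

On a real server the argument would be the configured output directory, checked while logged in as the user that runs the command line tool.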
 +Looking at those through our web browser, if we
 +go to localhost/sitemapIndex.xml,
 +sure enough, this points us to two
 +different files: the existing sitemap.xml
 +that was already there that we told VuFind
 +about as the base sitemap, and also
 +this new vufindSitemap.xml, which
 +has been generated. If I go there, we
 +will see that this contains a list of
 +all the records in VuFind so they can
 +be easily crawled. There's one more small
 +step that you might want to take with
 +sitemaps, which is to publish for search
 +engines where they can be found. You may
 +be familiar with a file called
 +robots.txt, which you can use to tell
 +crawlers which parts of your site they
 +should or should not be crawling. You can
 +also use that file to specify where a
 +sitemap lives. So if I edit
 +/var/www/html/robots.txt (in this example
 +I'm just creating a new file, but you
 +might have an existing one in some
 +situations), all I need to do is say
 +"Sitemap:" and then the URL
 +of my sitemap, in this case
 +http://localhost/sitemapIndex.xml,
 +and now if a robot comes to the
 +site and looks at robots.txt and
 +supports the part of the protocol that
 +includes the sitemap specification, it
 +will know exactly where to look and it
 +can find all of your content. So now that
 +we've seen how to create sitemaps within
 +VuFind, let's talk about how VuFind
 +can take advantage of other
 +people's sitemaps, including our own
 +sitemaps from our content management
 +systems or websites. As I mentioned
 +earlier, VuFind has the capacity to
 +index content from sitemaps to create a
 +web index so that you can search your
 +own website. However, before it can do
 +that, you need to set up a full-text
 +extraction tool. There are several places
 +in VuFind where it can take advantage
 +of full-text extraction to make the
 +content of a file searchable. So for example,
 +when indexing MARC records or XML files,
 +you can use some custom rules that will
 +use URLs within metadata to retrieve
 +content and then index all of the text
 +coming back from those URLs, and that
 +same mechanism is used by VuFind's
 +sitemap indexer. VuFind supports
 +two different full-text extraction tools.
 +One is called Aperture and has not been
 +updated in many years, so I strongly
 +encourage that everyone use the second
 +option, which is Apache Tika, which can
 +be obtained at tika.apache.org. If
 +you just go to the download page, there
 +are a number of downloads available, but
 +what we want is the tika-app jar file,
 +which is all that you need to extract
 +text content from a variety of different
 +file formats, including PDF, Office
 +documents, and, fortunately for us, web
 +pages. I've actually already done the
 +download of this file to save a little
 +bit of time.
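For a sense of what text extraction with the Tika app jar looks like outside VuFind, here is a small Python sketch that builds and runs the jar's command line with its `--text` option. How VuFind itself invokes Tika is not shown in the video, so treat this as an assumption-laden illustration; the jar path and the page URL are hypothetical, and Java must be installed for `extract_text` to work.

```python
import subprocess


def tika_text_command(jar_path, target):
    """Build the command that asks the Tika app jar for plain text output."""
    return ["java", "-jar", jar_path, "--text", target]


def extract_text(jar_path, target):
    """Run Tika on a file path or URL and return the extracted plain text."""
    result = subprocess.run(
        tika_text_command(jar_path, target),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical jar location and page; only the command is built here.
command = tika_text_command("tika-app.jar", "https://example.org/index.html")
print(" ".join(command))
```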
 +So once we've downloaded this, we should
 +set it up in a way where it's easily
 +accessible to VuFind. What I like to
 +do is give it its own directory; we'll
 +call it /usr/local/tika, and I'm going to
 +copy the file from my download directory
 +into /usr/local/tika, and I like to
 +create a symbolic link shortcut from
 +the long Tika jar file name to just tika.jar,
 +which makes VuFind configuration
 +easier because we can download new
 +versions of the app as they're released
 +in the future, and we just have to
 +rewrite the symbolic link instead of
 +having to constantly edit VuFind
 +configuration files. So I'm just going to
 +do a quick sudo ln -s from
 +/usr/local/tika/tika-app-1.24.1.jar to
 +/usr/local/tika/tika.jar. So that's all we
 +have to do to install Tika: just download
 +a jar file and put it someplace nice and
 +easy. But we also have to tell VuFind
 +where to find it, and for that there is a
 +configuration file called fulltext.ini.
 +So let's copy config/vufind/fulltext.ini
 +to local/config/vufind, as
 +we always do to override it, and then
 +edit it in local/config/vufind. Okay, so
 +like all of VuFind's configuration
 +files, once again there are lots of
 +comments here explaining what it does. We
 +just need to do two simple things:
 +uncomment the General section and tell
 +it to use Tika. If we didn't do that, it
 +would try to auto-detect which tool is
 +being used, which is fine, but telling it
 +is a little bit faster. We can skip all
 +the Aperture settings, and then we can
 +just uncomment the Tika path so that
 +VuFind knows where to find Tika. As you
 +can see, the default setting matches the
 +symbolic link that I set up, so there's
 +no need to change anything other than
 +uncommenting the line. Now we're halfway
 +there: we've got the full-text extractor
 +set up, but now we need to set up the web
 +crawler, and there's another config file
 +for that. So we'll copy
 +config/vufind/webcrawl.ini to local/config/vufind
 +and we'll edit that file. All that
 +webcrawl.ini does is tell VuFind's
 +sitemap indexer where to find sitemaps. You can
 +create a list of as many sitemaps as you
 +want, but for this example the only thing
 +we want to index is our locally created
 +sitemap.xml. We could index the sitemap
 +index; the VuFind crawler is smart
 +enough to crawl into all of the sitemaps
 +referenced by an index. But we really
 +don't want to index VuFind's record
 +pages into VuFind's web index; that
 +would only be confusing. So we're going
 +to focus in on the content that exists
 +outside of VuFind itself. We could also
 +turn on this verbose setting just to
 +get some more feedback out of the
 +crawler while it runs, but it makes no
 +functional difference whether we do that
 +or not.
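The remark that a crawler can "crawl into all of the sitemaps referenced by an index" can be sketched as a short recursive function. This Python example is a generic illustration, not VuFind's crawler: the function name, the injected `fetch` callable, and the in-memory documents are all assumptions made so the sketch runs without a network.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for sitemaps.org elements.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(sitemap_url, fetch):
    """Return page URLs from a sitemap, following sitemap indexes recursively.

    fetch is any callable that maps a URL to its XML text.
    """
    root = ET.fromstring(fetch(sitemap_url))
    locations = [loc.text.strip() for loc in root.iter(NS + "loc")]
    if root.tag == NS + "sitemapindex":
        pages = []
        for child_url in locations:
            pages.extend(collect_page_urls(child_url, fetch))
        return pages
    return locations


XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

# Hypothetical in-memory documents standing in for files on a web server.
DOCUMENTS = {
    "http://localhost/index.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>http://localhost/site.xml</loc></sitemap>"
        "<sitemap><loc>http://localhost/records.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "http://localhost/site.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>http://localhost/</loc></url>"
        "<url><loc>http://localhost/linked.html</loc></url>"
        "</urlset>"
    ),
    "http://localhost/records.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>http://localhost/Record/1</loc></url>"
        "</urlset>"
    ),
}

all_pages = collect_page_urls("http://localhost/index.xml", DOCUMENTS.__getitem__)
site_pages = collect_page_urls("http://localhost/site.xml", DOCUMENTS.__getitem__)
print(all_pages)
```

Pointing the function at the site sitemap alone, rather than at the index, mirrors the choice made above to leave the record pages out of the web index.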
 +So now we're all set up: VuFind
 +knows where to find Tika for full-text
 +extraction, and it knows where to find a
 +sitemap to crawl. So all we need to do is
 +run the crawler, which is php
 +import/webcrawl.php from the VuFind home
 +directory. Now we see it's harvesting the
 +sitemap.xml file,
 +doing a bit of work, and in just a moment
 +we should have our content. What the
 +web crawler is actually doing is running
 +the XML importer that was demonstrated
 +last month. We have an XSLT that uses
 +the sitemap XML in combination with some
 +custom PHP code to use Tika to extract
 +content from web pages and then index
 +them into a special Solr core that was
 +designed specifically for searching
 +web pages. The other little piece that the
 +web crawler does is it keeps track of
 +when it runs, and it deletes anything
 +that was indexed on a prior run. So every
 +time you run the web crawler, it captures
 +a timestamp, it indexes all of the
 +sitemaps you've referred to, and then it
 +deletes anything that's older than the
 +time at which the process started.
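The timestamp-based cleanup just described can be modeled with a plain dictionary. This Python sketch is a simplified stand-in for the Solr index, with hypothetical names and integer timestamps chosen for readability; it is not VuFind's code.

```python
def run_crawl(index, live_pages, started_at):
    """Index every live page, then delete records older than this run's start.

    index maps URL -> {"text": ..., "last_indexed": ...} and is updated in place.
    Returns the list of URLs that were removed as stale.
    """
    for url, text in live_pages.items():
        index[url] = {"text": text, "last_indexed": started_at}
    stale = [
        url for url, record in index.items()
        if record["last_indexed"] < started_at
    ]
    for url in stale:
        del index[url]
    return stale


index = {}
# First run: both hypothetical pages are online.
run_crawl(
    index,
    {"http://localhost/": "front page", "http://localhost/linked.html": "linked page"},
    started_at=100,
)
# Second run: the linked page has been removed from the site.
removed = run_crawl(index, {"http://localhost/": "front page"}, started_at=200)

# A run that sees no pages at all (for example, the site is down) empties the index.
offline_index = dict(index)
wiped = run_crawl(offline_index, {}, started_at=300)
print(removed, wiped)
```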
 +So if web pages are removed, the indexer
 +will get rid of them on the next run. You
 +do have to be careful about this, though,
 +because if you run the web crawler at a
 +time when a website is temporarily
 +offline, it's going to wipe out parts of
 +your index, so use with caution. In any
 +case, now that the indexing has completed,
 +we can go back to our VuFind
 +interface, and if we go to vufind/Web
 +(with a capital W), that brings up the
 +website search, which uses the Solr
 +website index I mentioned. If I just do a
 +blank search,
 +we will see that I now have two pages;
 +both of the pages from my sitemap were
 +indexed. And just to prove that the
 +full-text searching works right, I'll type
 +the word "lazy" here. That word appears on
 +only one of these two pages, and sure
 +enough, there it is. It highlights where
 +the text matched. Everything is working.
 +So one quick thing to demonstrate before
 +we call it a day is that there are two
 +configuration files that might be of
 +interest. There's a website.ini file
 +which controls all the behavior of the
 +web search, and this is kind of like a
 +combination of facets.ini and
 +searches.ini for the main biblio
 +index, but these settings are applied to the
 +website search. So if you want to
 +customize recommendation modules or
 +change labels or sort options, facets, etc.,
 +it's all in here. So this is how you can
 +control the presentation of your website
 +search. Also of possible interest: in
 +config/vufind we have websearchspecs.yaml.
 +Again, this is just
 +like the regular searchspecs.yaml
 +for the biblio index, but this is
 +tuned to the website index. So if you
 +want to change which fields are searched
 +or how relevancy ranking works, this is
 +the place where you can do that. Finally,
 +if you want to configure the actual
 +indexing process, there is an
 +import/xsl/sitemap.xsl, which is the XSLT that
 +gets applied to all of the sitemaps in
 +order to index them, and as you can see
 +here, this is really just a wrapper
 +around a PHP function called
 +VuFindSitemap::getDocument,
 +and import/sitemap.properties is the
 +import configuration that sets up the
 +custom class, specifies the XSLT, and
 +so forth. It's beyond the scope of
 +today's video, but if you want to
 +customize things, what you want to do is
 +override the VuFindSitemap class with
 +your own behavior, and you can do
 +anything you like in that PHP. For
 +example, you might want to extract values
 +from particular HTML meta tags and use
 +them for facets, or whatever you need to
 +do. So that's it for this month. Next
 +month we are going to look at how you
 +can combine different kinds of searches
 +in VuFind, which will be useful
 +because it would be nice to be able to
 +search our new website index and our
 +regular biblio book and journal index at
 +the same time, so I will show you how to
 +do that. Until then, have a good month, and
 +thank you for listening.
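The customization idea mentioned near the end, pulling values out of HTML meta tags so they can feed facets, can be illustrated in Python with the standard library. In VuFind this logic would live in custom PHP; the class name, function name, and sample page below are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the content of <meta name="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Lowercase the name so "Author" and "author" land in one bucket.
            self.meta.setdefault(name.lower(), []).append(content)


def extract_meta(html):
    """Return a dict mapping each meta tag name to its list of content values."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.meta


# Hypothetical page used only for illustration.
sample_page = """
<html><head>
<meta name="author" content="Library Staff">
<meta name="keywords" content="hours" />
<meta name="keywords" content="locations" />
<meta charset="utf-8">
</head><body>The quick brown fox.</body></html>
"""
facet_values = extract_meta(sample_page)
print(facet_values)
```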
videos/sitemaps_and_web_indexing.txt · Last modified: 2023/04/26 13:29 by crhallberg