videos:sitemaps_and_web_indexing
// This is a raw machine-generated transcript; it will be cleaned up in the near future. //
So this month's topic is a discussion of sitemaps and web crawling, and it's a logical follow-up to last month's topic. First, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it, crawl through it, and make it searchable. But without a little bit of help this can happen somewhat haphazardly, and that is where XML sitemaps come into play. A sitemap is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let it find all of your content. This is certainly important for something like VuFind, where the interface is search-driven and there may not actually be a way to crawl to every single record without typing something into a box first.
So by publishing a sitemap we make it possible for all of our indexed records to be findable, and VuFind includes facilities for generating sitemaps, which makes it very search-engine friendly and will make your content a lot more visible. On the flip side of the coin, sitemaps are also a really useful tool for harvesting content, and so VuFind also includes tools for crawling all of the pages in a sitemap and creating a website index. Today I will show both sides of that equation: how you can make sitemaps from VuFind, and how you can populate VuFind using sitemaps. I have up here sitemaps.org, which is where the sitemap specification can be found if you want to learn more about how these documents are structured.
Just as a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root: I've got this front page, I have a link that leads to this other page, and I have generated by hand a sitemap.xml file which just lists both of these pages, the root of the site and the linked page.

Now suppose that I want this website I've just created to live in harmony with the VuFind instance I've been demonstrating for some time. What I would want to do is create a sitemap containing all of the content of VuFind, as well as a sitemap index, which is another part of the sitemap specification that allows you to group together multiple sitemaps so that they can all be discovered as a bundle.
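For reference, a hand-built sitemap for a two-page demo site like this one looks roughly as follows, per the sitemaps.org schema (the page name linked-page.html is made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://localhost/</loc></url>
  <url><loc>http://localhost/linked-page.html</loc></url>
</urlset>
```

Each <url> entry may also carry optional <lastmod>, <changefreq>, and <priority> elements; the specification at sitemaps.org describes all of them.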
Fortunately, VuFind includes a command line tool that will do all of this for you. I am just going to drop to the terminal and go to my VuFind home directory. There is a configuration file called sitemap.ini, so I'm going to copy config/vufind/sitemap.ini into my local/config/vufind directory, and then I'm going to edit that configuration file. Like all of VuFind's configuration files, the sitemap.ini file is full of comments explaining what all of the settings do, and I won't go through all of these right now; I will just highlight the ones that are most important to get things working.
We have a top [Sitemap] section that controls how VuFind generates your sitemaps. There are some settings in here, like frequency, which will affect the content of the generated sitemap and which can impact how frequently search engines come back and recrawl your pages. countPerPage controls how many URLs VuFind puts in each of the sitemap files it generates; VuFind sites could potentially have millions of records, and creating one sitemap with a million records in it is probably going to cause some problems, so the generator breaks the sitemap into chunks, by default 10,000 records per chunk, and then VuFind generates a sitemap index file that points to all of the chunks (the sitemap spec also allows lists of sitemaps like this).

The fileName setting controls the name of the file that VuFind generates for each sitemap. Of course it defaults to sitemap.xml, but in many cases you don't want VuFind to overwrite an existing sitemap.xml that was created either by hand or by a different tool, so this gives us the ability to give it a more specific name; in this case I'm going to call it vufindsitemap. fileLocation determines where VuFind generates the sitemap files. By default it uses the temporary directory, but that's not going to be very useful in a real-life situation, so I'm going to change it to /var/www/html, which happens to be the web root on the Ubuntu server I am using for this example.

We can also control which indexes get indexed, but we can stick with the defaults there. We can likewise affect how VuFind retrieves the URLs to put into the sitemap; again we'll use the default, but you can tune that in some ways if it's not performing quickly enough.
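Put together, the edited [Sitemap] section looks something like this (setting names come from the stock sitemap.ini; the comments in the file itself remain the authoritative reference):

```ini
[Sitemap]
; how often you expect content to change; copied into the generated XML
frequency = monthly
; split the output into chunks of this many URLs
countPerPage = 10000
; base name for generated files, so an existing sitemap.xml is not clobbered
fileName = vufindsitemap
; write the files into the Apache web root instead of the default /tmp
fileLocation = /var/www/html
```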
Next up is the [SitemapIndex] section, and this controls how VuFind generates that high-level index XML that I mentioned. There are a couple of important things that we need to set here, one being the base sitemap URL: when we set the directory where the files will be generated, we also need to tell VuFind what URL that directory corresponds with. In this instance it's http://localhost; obviously in a real-world scenario you would not be using localhost here, but for this example it will do. You can control the name of the index file that VuFind generates; we'll leave that as the default. And we can tell it the name of an existing sitemap that we want to include in the index. The default base sitemap is not a file that actually exists in our example, so we will just tell it to use the sitemap.xml that was generated by hand for the demonstration web page, incorporating that into the index so that it's findable along with my generated content.
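The [SitemapIndex] section then ends up along these lines (again, names per the stock file; verify against the comments in your copy):

```ini
[SitemapIndex]
; the public URL that corresponds to fileLocation above
baseSitemapUrl = "http://localhost"
; name of the generated index file; this is the default
indexFileName = "sitemapIndex"
; existing hand-made sitemap to fold into the index
baseSitemapFileName = "sitemap.xml"
```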
Now VuFind's sitemap generator is fully configured. There's one more thing I need to check, which is that the user that runs the command line tool has write access to the web root, or else it won't be able to successfully write out the sitemaps that it generates. Right now I'm just running all of these tools using my own katz account; in a production environment, of course, you would want a dedicated user for running VuFind utilities, and you would have ownership set accordingly, but in the interest of expediency I'm just going to use katz for all of these demonstrations. So I'm going to sudo chown katz /var/www/html so that I now own the web root and have permission to write files there.

Then all I need to do is run php util/sitemap.php while I'm in the VuFind directory, and that will run the generator. That only took a couple of seconds, and now if I do a file listing of /var/www/html I will see that there are a vufindsitemap.xml and a sitemapIndex.xml that were not there before.
If we look at those through our web browser and go to localhost/sitemapIndex.xml, sure enough this points us to two different files: the existing sitemap.xml that was already there, which we told VuFind about as the base sitemap, and also this new vufindsitemap.xml which has been generated. If I go there, we will see that it contains a list of all the records in VuFind, so they can be easily crawled.
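The generated sitemapIndex.xml we are looking at in the browser contains, in essence, something like this (exact chunk file names will vary with your fileName setting and record count):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://localhost/sitemap.xml</loc></sitemap>
  <sitemap><loc>http://localhost/vufindsitemap.xml</loc></sitemap>
</sitemapindex>
```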
There's one more step that you might want to take with sitemaps, which is to publish for search engines where they can be found. You may be familiar with a file called robots.txt, which you can use to tell crawlers which parts of your site they should or should not be crawling. You can also use that file to specify where a sitemap lives. So I'll edit /var/www/html/robots.txt; in this example I'm just creating a new file, but you might have an existing one in some situations. All I need to do is add a line reading "Sitemap:" followed by the URL of my sitemap, in this case http://localhost/sitemapIndex.xml. Now if a robot comes to the site, looks at robots.txt, and supports the part of the protocol that includes the sitemap specification, it will know exactly where to look and can find all of your content.
Now that we've seen how to create sitemaps within VuFind, let's talk about how VuFind can take advantage of other people's sitemaps, or sitemaps from our content management systems or websites. As I mentioned earlier, VuFind has the capacity to index content from sitemaps to create a web index, so that you can search your own website. However, before it can do that, you need to set up a full-text extraction tool. There are several places in VuFind where it can take advantage of full-text extraction to make the content of a file searchable; for example, when indexing MARC records or XML files, you can set up custom rules that will use URLs within the metadata to retrieve content and then index all of the text coming back from those URLs. That same mechanism is used by VuFind's sitemap indexer.

VuFind supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so I strongly encourage everyone to use the second option, which is Apache Tika. That can be obtained at tika.apache.org; if you go to the download page, there are a number of downloads available, but what we want is the tika-app JAR file, which is all that you need to extract text content from a variety of different file formats, including PDFs, Office documents and, fortunately for us, web pages. I've actually already downloaded this file to save a little bit of time.
Once we've downloaded this, we should set it up in a way where it's easily accessible to VuFind. What I like to do is give it its own directory; we'll call it /usr/local/tika, and I'm going to copy the file from my download directory into /usr/local/tika. I also like to create a symbolic link shortcut from the long Tika JAR file name to just tika.jar, which makes the VuFind configuration easier: we can download new versions of the app as they're released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind configuration files. So I'm just going to do a quick sudo ln -s from /usr/local/tika/tika-app-1.24.1.jar to /usr/local/tika/tika.jar.

That's all we have to do to install Tika: just download a JAR file and put it someplace nice and easy. But we also have to tell VuFind where to find it, and for that there is a configuration file called fulltext.ini. So let's copy config/vufind/fulltext.ini to local/config/vufind, as we always do to override it, and then edit local/config/vufind/fulltext.ini.
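The installation boils down to just a few commands. This sketch substitutes /tmp for /usr/local so it can run without root, and creates an empty placeholder instead of performing the real JAR download:

```shell
# give Tika its own directory
mkdir -p /tmp/tika
# stand-in for the real tika-app-1.24.1.jar downloaded from tika.apache.org
touch /tmp/tika/tika-app-1.24.1.jar
# stable name, so upgrading Tika later only means re-pointing this link
ln -sf tika-app-1.24.1.jar /tmp/tika/tika.jar
ls -l /tmp/tika/tika.jar
```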
Like all of VuFind's configuration files, once again there are lots of comments here explaining what everything does. We just need to do two simple things. First, uncomment the [General] section and tell it to use Tika; if we didn't do that, it would try to auto-detect which tool is being used, which is fine, but telling it explicitly is a little bit faster. You can skip all the Aperture settings. Then we can just uncomment the Tika path, so VuFind knows where to find Tika; as you can see, the default setting matches the symbolic link that I set up, so there's no need to change anything other than uncommenting the line. Now we're halfway there: we've got the full-text extractor set up.
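After those two edits, the active lines of local/config/vufind/fulltext.ini look roughly like this (setting names per the stock file; double-check against the comments in your copy):

```ini
[General]
; skip auto-detection and go straight to Tika
parser = tika

[Tika]
; matches the symbolic link created above
path = "/usr/local/tika/tika.jar"
```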
Next we need to set up the web crawler, and there's a configuration file for that too: we'll copy config/vufind/webcrawl.ini to local/config/vufind and edit that file. All webcrawl.ini does is tell VuFind's sitemap indexer where to find sitemaps. You can create a list of as many sitemaps as you want, but for this example the only thing we want to index is our locally created sitemap.xml. We could index the sitemap index (the VuFind crawler is smart enough to crawl into all of the sitemaps referenced by an index), but we really don't want to index VuFind's record pages into VuFind's web index; that would only be confusing. So we're going to focus on the content that exists outside of VuFind itself. We could also turn on the verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not.
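So webcrawl.ini for this demonstration comes down to just (layout per the stock file):

```ini
[General]
; chatty output while the crawler runs; purely informational
verbose = true

[Sitemaps]
; index only the hand-made site, not VuFind's own record pages
url[] = "http://localhost/sitemap.xml"
```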
Now we're all set up: VuFind knows where to find Tika for full-text extraction, and it knows where to find a sitemap to crawl. All we need to do is run the crawler, which is php import/webcrawl.php from the VuFind home directory. We see it's harvesting the sitemap.xml file and doing a bit of work, and in just a moment we should have our content.

What the web crawler is actually doing is running the XML importer that was demonstrated last month: we have an XSLT that processes sitemap XML in combination with some custom PHP code, using Tika to extract content from web pages and then indexing them into a special Solr core that was designed specifically for searching web pages. The other little piece the web crawler does is keep track of when it runs, and it deletes anything that was indexed on a prior run. Every time you run the web crawler, it captures a timestamp, it indexes all of the sitemaps you've referred to, and then it deletes anything that's older than the time at which the process started.
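That timestamp-based cleanup can be sketched in a few lines of Python (a conceptual model only, not VuFind's actual PHP code):

```python
import time

def crawl(index, pages):
    """Model of the webcrawl update strategy: re-index everything found in
    the sitemaps, then purge records untouched since the run began."""
    start = time.time()
    for url, text in pages.items():
        index[url] = {"text": text, "indexed": time.time()}
    # anything whose timestamp predates this run was not seen this time
    for url in [u for u, rec in index.items() if rec["indexed"] < start]:
        del index[url]
    return index
```

Running crawl() again without a previously seen URL in pages removes that record from the index.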
So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this, though, because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index; use it with caution.

In any case, now that the indexing has completed, we can go back to our VuFind interface, and if we go to vufind/Web (with a capital W), that brings up the website search, which uses the Solr website index I mentioned. If I just do a blank search, we will see that I now have two pages: both of the pages from my sitemap were indexed. And just to prove that the full-text searching works, I type the word "lazy" here (that word appears on only one of these two pages), and sure enough, there it is, with highlighting where the text matched. Everything is working.
One quick thing to demonstrate before we call it a day: there are two configuration files that might be of interest. There's website.ini, which controls all the behavior of the web search; this is kind of like a combination of facets.ini and searches.ini for the main biblio index, but with these settings applied to the website search. So if you want to customize recommendation modules, or change labels, sort options, facets, and so on, it's all in here; this is how you can control the presentation of your website search. Also of possible interest is config/vufind/websearchspecs.yaml; this is just like the regular searchspecs.yaml for the biblio index, but tuned to the website index, so if you want to change which fields are searched or how relevancy ranking works, this is the place where you can do that.
Finally, if you want to configure the actual indexing process, there is an import/xsl/sitemap.xsl, which is the XSLT that gets applied to all of the sitemaps in order to index them; as you can see here, this is really just a wrapper around a PHP function, VuFindSitemap::getDocument(). And import/sitemap.properties is the import configuration that sets up the custom class, specifies the XSLT, and so forth. It's beyond the scope of today's video, but if you want to customize things, what you want to do is override the VuFindSitemap class with your own behavior, and you can do anything you like in that PHP; for example, you might want to extract values from particular HTML meta tags and use them for facets, or whatever you need to do.
So that's it for this month. Next month we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular biblio book-and-journal index at the same time; I will show you how to do that. Until then, have a good month, and thank you for listening.
videos/sitemaps_and_web_indexing.txt · Last modified: 2023/04/26 13:29 by crhallberg