videos:sitemaps_and_web_indexing [2020/07/07 23:45] – created demiankatz
videos:sitemaps_and_web_indexing [2020/07/08 11:40] – [Transcript] demiankatz
// This is a raw machine-generated transcript; it will be cleaned up in the near future. //

So this month's video is going to be a discussion of sitemaps and web crawling, and it's sort of a logical follow-up to last month's video about indexing XML. So first, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it, crawl through it and make it searchable, but without a little bit of help this can happen somewhat haphazardly. That is where XML sitemaps come into play. A sitemap is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let it find all of your content. Now, this is certainly important for something like VuFind, where it's a search-driven interface and there may not actually be a way to crawl into every single record without typing something into a box first.

So by publishing a sitemap, we make it possible for all of our indexed records to be findable, and so VuFind includes facilities for generating sitemaps, which make it very search-engine friendly and will make your content a lot more visible. On the flip side of the coin, sitemaps are also a really useful tool for harvesting content, and so VuFind also includes tools for crawling all of the pages in a sitemap and creating a website index. So today I will show both sides of that equation: how you can make sitemaps from VuFind, and how you can populate VuFind using sitemaps. I have up here sitemaps.org, which is where the sitemap specification can be found if you want to learn more about how these documents are structured.

So just as a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root. I've got this front page, and I have a link that leads to this other page, and I have generated by hand a sitemap.xml file which just lists both of these pages: the root of the site and the linked page. So suppose that I want this website I've just created to live in harmony with the VuFind instance I've been demonstrating for some time. What I would want to do is create a sitemap containing all of the content of VuFind, as well as a sitemap index, which is another part of the sitemap specification that allows you to group together multiple sitemaps so that they can all be discovered as a bundle.

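
As a rough illustration of the two document types just described (this is not code from the video), here is a minimal Python sketch that builds a sitemap and a sitemap index using the element names and namespace from the sitemaps.org specification. The helper names and the `page2.html` URL are hypothetical; only `http://localhost/` and `sitemap.xml` come from the demonstration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return a <urlset> document listing each page in a <url><loc> entry."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in page_urls:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = page_url
    return ET.tostring(root, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Return a <sitemapindex> document pointing at several sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-page site: the root plus one linked page.
print(build_sitemap(["http://localhost/", "http://localhost/page2.html"]))
# An index that bundles the hand-made sitemap with others.
print(build_sitemap_index(["http://localhost/sitemap.xml"]))
```

A real sitemap can also carry optional elements such as `lastmod` and `changefreq`; they are left out here to keep the sketch short.
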
Fortunately, VuFind includes a command-line tool that will do all of this for you, so I am just going to drop to the terminal and go to my VuFind home directory.

There is a configuration file called sitemap.ini, so I'm going to copy config/vufind/sitemap.ini into my local/config/vufind directory, and then I'm going to edit that configuration file. Like all of VuFind's configuration files, the sitemap.ini file is full of comments explaining what all of the settings do. I won't go through all of these right now; I will just highlight the ones that are most important to get things working. We have a top [Sitemap] section that's going to control how VuFind generates your sitemaps. There are some settings in here, like frequency, which will affect the content of the generated sitemap and which can impact how frequently search engines will come back and recrawl your pages. countPerPage is going to control how many URLs VuFind puts in each of the sitemap files it generates. Because VuFind sites could potentially have millions of records, and creating one sitemap with a million records in it is probably going to cause some problems, there's a mechanism for breaking that up into chunks, by default 10,000 records per chunk. VuFind will then generate a sitemap index file that points to all of the chunks, and this is another part of the sitemap spec: you can create lists of sitemaps.

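
To make the chunking idea concrete, here is a small Python sketch (an illustration only, not VuFind's actual implementation). It splits a list of record URLs into groups of at most 10,000, mirroring the default chunk size mentioned above. The function name and the record URL pattern are assumptions made for the example.

```python
def chunk_urls(urls, count_per_page=10000):
    """Split urls into consecutive lists of at most count_per_page items."""
    return [
        urls[start:start + count_per_page]
        for start in range(0, len(urls), count_per_page)
    ]


# Hypothetical record URLs; the real URL pattern depends on the installation.
record_urls = [f"http://localhost/vufind/Record/{n}" for n in range(25000)]
chunks = chunk_urls(record_urls)
print([len(chunk) for chunk in chunks])
```

Each chunk would become one sitemap file, and the sitemap index would then list one entry per chunk.
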
The fileName setting is going to control the name of the file that VuFind generates for the sitemap. Of course, it defaults to sitemap.xml, but in many cases you don't want VuFind to overwrite an existing sitemap.xml that was created either by hand or by a different tool, so this gives us the ability to give it a more specific name. In this case, I'm going to call it vufindSitemap. fileLocation determines where VuFind generates the sitemap files. By default it uses the temporary directory, but that's not going to be very useful in a real-life situation, so I'm going to change it to /var/www/html, which happens to be the web root on the Ubuntu server I am using for this example.

We can also control which indexes get indexed, but we can stick with the defaults there. We can affect how VuFind retrieves the URLs to put into the sitemap; again, we'll use the default there, but you can tune that in some ways if it's not performing quickly enough.

Next up is the [SitemapIndex] section, and this controls how VuFind generates that high-level index XML that I mentioned. There are a couple of important things that we need to set here, one being the baseSitemapUrl. When we set the directory where the files will be generated, we also need to tell VuFind what URL that directory corresponds with. In this instance it's http://localhost. Obviously, in a real-world scenario you would not be using localhost here, but for this example it will do. You can control the name of the index file that VuFind generates; we'll leave that as the default. And we can tell it the name of an existing sitemap that we want to include in the index. The default of baseSitemap is not a file that actually exists in our example, so we will just tell it to use the sitemap.xml that was generated by hand for the demonstration web page and incorporate that into the index, so that it's findable along with my generated content.

So now VuFind's sitemap generator is fully configured. There's just one more important detail, which is that I need to be sure that the user that runs the command-line tool has write access to the web root, or else it won't be able to successfully write out the sitemaps that it generates. Right now I'm just running all of these tools using my own dkatz account. In a production environment, of course, you would want to have a dedicated user for running VuFind utilities, and you would have ownership set accordingly, but in the interest of expediency I'm just going to use dkatz for all of these demonstrations. So I'm going to run sudo chown dkatz /var/www/html so that I now own the web root and have permission to write files there. Then all I need to do is run php util/sitemap.php while I'm in the VuFind directory, and that will run the generator. That only took a couple of seconds, and now if I do a file listing of /var/www/html, I will see that there is a vufindSitemap.xml and a sitemapIndex.xml that were not there before.

If we look at those through our web browser and go to localhost/sitemapIndex.xml, sure enough, this points us to two different files: the existing sitemap.xml that was already there, which we told VuFind about as the base sitemap, and also this new vufindSitemap.xml which has been generated. If I go there, we will see that this contains a list of all the records in VuFind so they can be easily crawled. There's one more small step that you might want to take with sitemaps, which is to publish for search engines where they can be found. You may be familiar with a file called robots.txt, which you can use to tell crawlers which parts of your site they should or should not be crawling. You can also use that file to specify where a sitemap lives. So if I edit /var/www/html/robots.txt (in this example I'm just creating a new file, but you might have an existing one in some situations), all I need to do is say Sitemap: followed by the URL of my sitemap, in this case http://localhost/sitemapIndex.xml. Now if a robot comes to the site, looks at robots.txt, and supports the part of the protocol that includes the sitemap specification, it will know exactly where to look and it can find all of your content.

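
To show what a crawler might do with that robots.txt line, here is a short Python sketch using the standard library's `urllib.robotparser` (its `site_maps()` method requires Python 3.8 or later). The helper name is hypothetical, and the robots.txt content is supplied as a string rather than fetched, so this only illustrates the parsing step.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls_from_robots(robots_txt):
    """Return the sitemap URLs declared by Sitemap: lines in robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


# The one-line robots.txt from the demonstration.
robots_txt = "Sitemap: http://localhost/sitemapIndex.xml\n"
print(sitemap_urls_from_robots(robots_txt))
```
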
We've seen how to create sitemaps within VuFind, so let's talk about how VuFind can take advantage of other people's sitemaps, including our own sitemaps from our content management systems or websites. As I mentioned earlier, VuFind has the capacity to index content from sitemaps to create a web index so that you can search your own website. However, before it can do that, you need to set up a full-text extraction tool. There are several places in VuFind where it can take advantage of full-text extraction to make the content of a file searchable. So, for example, when indexing MARC records or XML files, you can write some custom rules that will use URLs within metadata to retrieve content and then index all of the text coming back from those URLs, and that same mechanism is used by VuFind's sitemap indexer. VuFind supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so I strongly encourage everyone to use the second option, which is Apache Tika, which can be obtained at tika.apache.org. If you just go to the download page, there are a number of downloads available, but what we want is the Tika app JAR file, which is all that you need to extract text content from a variety of different file formats, including PDF, Office documents and, fortunately for us, web pages. I've actually already done the download of this file to save a little bit of time.

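
To give a rough feel for what "extracting text from a web page" means, here is a deliberately simple stand-in written with Python's built-in `html.parser`. It is not Tika and does not reflect how Tika works internally; it only shows the general idea of reducing markup to searchable text. The class name, function name and sample page are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


# Hypothetical page content for illustration.
sample = (
    "<html><head><title>Home</title><style>p { color: red; }</style></head>"
    "<body><p>The quick brown fox jumps over the <b>lazy</b> dog.</p></body></html>"
)
print(extract_text(sample))
```
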
So once we've downloaded this, we should set it up in a way where it's easily accessible to VuFind. What I like to do is give it its own directory; we'll call it /usr/local/tika. I'm going to copy the file from my download directory into /usr/local/tika, and I like to create a symbolic link from the long Tika JAR file name to just tika.jar, which makes VuFind configuration easier, because we can download new versions of the app as they're released in the future and we just have to rewrite the symbolic link instead of having to constantly edit VuFind configuration files. So I'm just going to do a quick sudo ln -s from /usr/local/tika/tika-app-1.24.1.jar to /usr/local/tika/tika.jar. So that's all we have to do to install Tika: just download a JAR file and put it someplace nice and easy. But we also have to tell VuFind where to find it, and for that there is a configuration file called fulltext.ini. So let's copy config/vufind/fulltext.ini to local/config/vufind, as we always do to override it, and then edit the local copy.

Like all of VuFind's configuration files, once again there are lots of comments here explaining what it does. We just need to do two simple things. First, uncomment the [General] section and tell it to use Tika. If we didn't do that, it would try to auto-detect which tool is being used, which is fine, but telling it is a little bit faster. We can skip all the Aperture settings, and then we can just uncomment the Tika path so VuFind knows where to find Tika. As you can see, the default setting matches the symbolic link that I set up, so there's no need to change anything other than uncommenting the line.

Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there's another config file for that. So we'll copy config/vufind/webcrawl.ini to local/config/vufind, and we'll edit that file. All that webcrawl.ini does is tell VuFind's sitemap indexer where to find sitemaps. You can create a list of as many sitemaps as you want, but for this example the only thing we want to index is our locally created sitemap.xml. We could index the sitemap index; VuFind's crawler is smart enough to crawl into all of the sitemaps referenced by an index. But we really don't want to index VuFind's record pages into VuFind's web index; that would only be confusing. So we're going to focus in on the content that exists outside of VuFind itself. We could also turn on this verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not.

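
The idea of "crawling into" the sitemaps referenced by an index can be sketched in a few lines of Python. This is an illustration under stated assumptions, not VuFind's crawler: the function name is hypothetical, the `fetch` callable stands in for an HTTP request, and the documents are held in memory so the example runs without a web server. The page and record URLs are invented.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_page_urls(sitemap_url, fetch):
    """Return all page URLs reachable from a sitemap or sitemap index.

    fetch is any callable that takes a URL and returns the XML text.
    """
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # An index lists other sitemaps, so follow each one in turn.
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_page_urls(loc.text.strip(), fetch))
        return urls
    # A plain sitemap lists pages directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
documents = {
    "http://localhost/sitemapIndex.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>http://localhost/sitemap.xml</loc></sitemap>"
        "<sitemap><loc>http://localhost/vufindSitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "http://localhost/sitemap.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>http://localhost/</loc></url>"
        "<url><loc>http://localhost/page2.html</loc></url>"
        "</urlset>"
    ),
    "http://localhost/vufindSitemap.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>http://localhost/vufind/Record/1</loc></url>"
        "</urlset>"
    ),
}

# Crawling the index reaches every page; crawling sitemap.xml alone
# reaches only the hand-made site, which is the choice made in the video.
print(collect_page_urls("http://localhost/sitemapIndex.xml", documents.__getitem__))
print(collect_page_urls("http://localhost/sitemap.xml", documents.__getitem__))
```
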
So now we're all set up. VuFind knows where to find Tika for full-text extraction, and it knows where to find a sitemap to crawl, so all we need to do is run the crawler, which is php import/webcrawl.php from the VuFind home directory. Now we see it's harvesting the sitemap.xml file, doing a bit of work, and in just a moment we should have our content. What the web crawler is actually doing is running the XML importer that was demonstrated last month. We have an XSLT that uses the sitemap XML in combination with some custom PHP code to use Tika to extract content from web pages and then index them into a special Solr core that was designed specifically for searching web pages. The other little piece that the web crawler does is that it keeps track of when it runs, and it deletes anything that was indexed on a prior run. So every time you run the web crawler, it captures a timestamp, it indexes all of the sitemaps you've referred to, and then it deletes anything that's older than the time at which the process started.

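
The timestamp-and-delete behavior can be modeled with a small Python sketch. This is a toy model of the logic as described above, not VuFind's code: the index is a plain dictionary instead of a Solr core, the function and field names are hypothetical, and a simple counter stands in for the clock so the example is deterministic.

```python
from itertools import count


def run_crawl(index, pages, clock):
    """Index pages, then delete entries not refreshed during this run.

    index maps URL -> {"text": ..., "last_indexed": ...}; pages maps
    URL -> extracted text; clock returns an increasing timestamp.
    Returns the list of URLs that were deleted as stale.
    """
    started = clock()  # capture the start time before indexing anything
    for url, text in pages.items():
        index[url] = {"text": text, "last_indexed": clock()}
    # Anything last indexed before this run started was not seen this time.
    stale = [url for url, doc in index.items() if doc["last_indexed"] < started]
    for url in stale:
        del index[url]
    return stale


clock = count(1).__next__  # deterministic stand-in for a real clock
index = {}

# First run: both pages are present on the site.
run_crawl(index, {"http://localhost/": "front page",
                  "http://localhost/page2.html": "second page"}, clock)

# Second run: page2.html has been removed from the site, so it is deleted.
removed = run_crawl(index, {"http://localhost/": "front page"}, clock)
print(removed)
print(sorted(index))
```
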
So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this, though, because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index, so use with caution. In any case, now that the indexing has completed, we can go back to our VuFind interface, and if we go to vufind/Web, with a capital W, that brings up the website search, which uses the Solr website index I mentioned. If I just do a blank search, we will see that I now have two pages; both of the pages from my sitemap were indexed. And just to prove that the full-text searching works, if I type the word "lazy" here, that word appears on only one of these two pages, and sure enough, there it is. It highlights where the text matched. Everything is working.

So one quick thing to demonstrate before we call it a day is that there are two configuration files that might be of interest. There's a website.ini file, which controls all the behavior of the web search, and this is kind of like a combination of facets.ini and searches.ini for the main biblio index, but these settings are applied to the website search. So if you want to customize recommendation modules or change labels, sort options, facets, etc., it's all in here. This is how you can control the presentation of your website search. Also of possible interest is config/vufind/websearchspecs.yaml. Again, this is just like the regular searchspecs.yaml for the biblio index, but this is tuned to the website index, so if you want to change which fields are searched or how relevancy ranking works, this is the place where you can do that. Finally, if you want to configure the actual indexing process, there is an import/xsl/sitemap.xsl, which is the XSLT that gets applied to all of the sitemaps in order to index them. As you can see here, this is really just a wrapper around a PHP function called VuFindSitemap::getDocument.
And import/sitemap.properties is the import configuration that sets up the custom class, specifies the XSLT, and so forth. It's beyond the scope of today's video, but if you want to customize things, what you want to do is override the VuFindSitemap class with your own behavior, and you can do anything you like in that PHP. For example, you might want to extract values from particular HTML meta tags and use them for facets, or whatever you need to do.

So that's it for this month. Next month we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular biblio book and journal index at the same time. I will show you how to do that. Until then, have a good month, and thank you for listening.

  
videos/sitemaps_and_web_indexing.txt · Last modified: 2023/04/26 13:29 by crhallberg