videos:sitemaps_and_web_indexing
// This is a raw machine-generated transcript; it will be cleaned up in the near future. //
So this month's topic is a discussion of sitemaps and web crawling, and it's a logical follow-up to last month's topic. First, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it, crawl through it, and make it searchable. But without a little bit of help this can happen somewhat haphazardly, and that is where XML sitemaps come into play. A sitemap is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let it find all of your content. This is certainly important for something like VuFind, where the interface is search-driven and there may not actually be a way to crawl to every single record without typing something into a box first.
So by publishing a sitemap we make it possible for all of our indexed records to be findable, and VuFind includes facilities for generating sitemaps, which makes it very search-engine friendly and will make your content a lot more visible. On the flip side of the coin, sitemaps are also a really useful tool for harvesting content, and so VuFind also includes tools for crawling all of the pages in a sitemap and creating a website index. Today I will show both sides of that equation: how you can make sitemaps from VuFind, and how you can populate VuFind using sitemaps. I have up here sitemaps.org, which is where the sitemap specification can be found if you want to learn more about how these documents are structured.
Just as a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root: I've got this front page, I have a link that leads to this other page, and I have generated by hand a sitemap.xml file which just lists both of these pages, the root of the site and the linked page.

Now suppose that I want this website I've just created to live in harmony with the VuFind instance I've been demonstrating for some time. What I would want to do is create a sitemap containing all of the content of VuFind, as well as a sitemap index, which is another part of the sitemap specification that allows you to group together multiple sitemaps so that they can all be discovered as a bundle.
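For reference, a hand-built sitemap for a two-page demo site like this one looks roughly as follows, per the sitemaps.org schema (the page name linked-page.html is made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://localhost/</loc></url>
  <url><loc>http://localhost/linked-page.html</loc></url>
</urlset>
```

Each <url> entry may also carry optional <lastmod>, <changefreq>, and <priority> elements; the specification at sitemaps.org describes all of them.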
Fortunately, VuFind includes a command line tool that will do all of this for you. I am just going to drop to the terminal and go to my VuFind home directory. There is a configuration file called sitemap.ini, so I'm going to copy config/vufind/sitemap.ini into my local/config/vufind directory, and then I'm going to edit that configuration file. Like all of VuFind's configuration files, the sitemap.ini file is full of comments explaining what all of the settings do, and I won't go through all of these right now; I will just highlight the ones that are most important to get things working.
We have a top [Sitemap] section that controls how VuFind generates your sitemaps. There are some settings in here, like frequency, which will affect the content of the generated sitemap and which can impact how frequently search engines come back and recrawl your pages. countPerPage controls how many URLs VuFind puts in each of the sitemap files it generates; VuFind sites could potentially have millions of records, and creating one sitemap with a million records in it is probably going to cause some problems, so the generator breaks the sitemap into chunks, by default 10,000 records per chunk, and then VuFind generates a sitemap index file that points to all of the chunks (the sitemap spec also allows lists of sitemaps like this).

The fileName setting controls the name of the file that VuFind generates for each sitemap. Of course it defaults to sitemap.xml, but in many cases you don't want VuFind to overwrite an existing sitemap.xml that was created either by hand or by a different tool, so this gives us the ability to give it a more specific name; in this case I'm going to call it vufindsitemap. fileLocation determines where VuFind generates the sitemap files. By default it uses the temporary directory, but that's not going to be very useful in a real-life situation, so I'm going to change it to /var/www/html, which happens to be the web root on the Ubuntu server I am using for this example.

We can also control which indexes get indexed, but we can stick with the defaults there. We can likewise affect how VuFind retrieves the URLs to put into the sitemap; again we'll use the default, but you can tune that in some ways if it's not performing quickly enough.
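Put together, the edited [Sitemap] section looks something like this (setting names come from the stock sitemap.ini; the comments in the file itself remain the authoritative reference):

```ini
[Sitemap]
; how often you expect content to change; copied into the generated XML
frequency = monthly
; split the output into chunks of this many URLs
countPerPage = 10000
; base name for generated files, so an existing sitemap.xml is not clobbered
fileName = vufindsitemap
; write the files into the Apache web root instead of the default /tmp
fileLocation = /var/www/html
```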
Next up is the [SitemapIndex] section, and this controls how VuFind generates that high-level index XML that I mentioned. There are a couple of important things that we need to set here, one being the base sitemap URL: when we set the directory where the files will be generated, we also need to tell VuFind what URL that directory corresponds with. In this instance it's http://localhost; obviously in a real-world scenario you would not be using localhost here, but for this example it will do. You can control the name of the index file that VuFind generates; we'll leave that as the default. And we can tell it the name of an existing sitemap that we want to include in the index. The default base sitemap is not a file that actually exists in our example, so we will just tell it to use the sitemap.xml that was generated by hand for the demonstration web page, incorporating that into the index so that it's findable along with my generated content.
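The [SitemapIndex] section then ends up along these lines (again, names per the stock file; verify against the comments in your copy):

```ini
[SitemapIndex]
; the public URL that corresponds to fileLocation above
baseSitemapUrl = "http://localhost"
; name of the generated index file; this is the default
indexFileName = "sitemapIndex"
; existing hand-made sitemap to fold into the index
baseSitemapFileName = "sitemap.xml"
```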
Now VuFind's sitemap generator is fully configured. There's one more thing I need to check, which is that the user that runs the command line tool has write access to the web root, or else it won't be able to successfully write out the sitemaps that it generates. Right now I'm just running all of these tools using my own katz account; in a production environment, of course, you would want a dedicated user for running VuFind utilities, and you would have ownership set accordingly, but in the interest of expediency I'm just going to use katz for all of these demonstrations. So I'm going to sudo chown katz /var/www/html so that I now own the web root and have permission to write files there.

Then all I need to do is run php util/sitemap.php while I'm in the VuFind directory, and that will run the generator. That only took a couple of seconds, and now if I do a file listing of /var/www/html I will see that there are a vufindsitemap.xml and a sitemapIndex.xml that were not there before.
If we look at those through our web browser and go to localhost/sitemapIndex.xml, sure enough this points us to two different files: the existing sitemap.xml that was already there, which we told VuFind about as the base sitemap, and also this new vufindsitemap.xml which has been generated. If I go there, we will see that it contains a list of all the records in VuFind, so they can be easily crawled.
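The generated sitemapIndex.xml we are looking at in the browser contains, in essence, something like this (exact chunk file names will vary with your fileName setting and record count):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://localhost/sitemap.xml</loc></sitemap>
  <sitemap><loc>http://localhost/vufindsitemap.xml</loc></sitemap>
</sitemapindex>
```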
There's one more step that you might want to take with sitemaps, which is to publish for search engines where they can be found. You may be familiar with a file called robots.txt, which you can use to tell crawlers which parts of your site they should or should not be crawling. You can also use that file to specify where a sitemap lives. So I'll edit /var/www/html/robots.txt; in this example I'm just creating a new file, but you might have an existing one in some situations. All I need to do is add a line reading "Sitemap:" followed by the URL of my sitemap, in this case http://localhost/sitemapIndex.xml. Now if a robot comes to the site, looks at robots.txt, and supports the part of the protocol that includes the sitemap specification, it will know exactly where to look and can find all of your content.
Now that we've seen how to create sitemaps within VuFind, let's talk about how VuFind can take advantage of other people's sitemaps, or sitemaps from our content management systems or websites. As I mentioned earlier, VuFind has the capacity to index content from sitemaps to create a web index, so that you can search your own website. However, before it can do that, you need to set up a full-text extraction tool. There are several places in VuFind where it can take advantage of full-text extraction to make the content of a file searchable; for example, when indexing MARC records or XML files, you can set up custom rules that will use URLs within the metadata to retrieve content and then index all of the text coming back from those URLs. That same mechanism is used by VuFind's sitemap indexer.

VuFind supports two different full-text extraction tools. One is called Aperture and has not been updated in many years, so I strongly encourage everyone to use the second option, which is Apache Tika. That can be obtained at tika.apache.org; if you go to the download page, there are a number of downloads available, but what we want is the tika-app JAR file, which is all that you need to extract text content from a variety of different file formats, including PDFs, Office documents and, fortunately for us, web pages. I've actually already downloaded this file to save a little bit of time.
Once we've downloaded this, we should set it up in a way where it's easily accessible to VuFind. What I like to do is give it its own directory; we'll call it /usr/local/tika, and I'm going to copy the file from my download directory into /usr/local/tika. I also like to create a symbolic link shortcut from the long Tika JAR file name to just tika.jar, which makes the VuFind configuration easier: we can download new versions of the app as they're released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind configuration files. So I'm just going to do a quick sudo ln -s from /usr/local/tika/tika-app-1.24.1.jar to /usr/local/tika/tika.jar.

That's all we have to do to install Tika: just download a JAR file and put it someplace nice and easy. But we also have to tell VuFind where to find it, and for that there is a configuration file called fulltext.ini. So let's copy config/vufind/fulltext.ini to local/config/vufind, as we always do to override it, and then edit local/config/vufind/fulltext.ini.
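The installation boils down to just a few commands. This sketch substitutes /tmp for /usr/local so it can run without root, and creates an empty placeholder instead of performing the real JAR download:

```shell
# give Tika its own directory
mkdir -p /tmp/tika
# stand-in for the real tika-app-1.24.1.jar downloaded from tika.apache.org
touch /tmp/tika/tika-app-1.24.1.jar
# stable name, so upgrading Tika later only means re-pointing this link
ln -sf tika-app-1.24.1.jar /tmp/tika/tika.jar
ls -l /tmp/tika/tika.jar
```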
Like all of VuFind's configuration files, once again there are lots of comments here explaining what everything does. We just need to do two simple things. First, uncomment the [General] section and tell it to use Tika; if we didn't do that, it would try to auto-detect which tool is being used, which is fine, but telling it explicitly is a little bit faster. You can skip all the Aperture settings. Then we can just uncomment the Tika path, so VuFind knows where to find Tika; as you can see, the default setting matches the symbolic link that I set up, so there's no need to change anything other than uncommenting the line. Now we're halfway there: we've got the full-text extractor set up.
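After those two edits, the active lines of local/config/vufind/fulltext.ini look roughly like this (setting names per the stock file; double-check against the comments in your copy):

```ini
[General]
; skip auto-detection and go straight to Tika
parser = tika

[Tika]
; matches the symbolic link created above
path = "/usr/local/tika/tika.jar"
```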
Next we need to set up the web crawler, and there's a configuration file for that too: we'll copy config/vufind/webcrawl.ini to local/config/vufind and edit that file. All webcrawl.ini does is tell VuFind's sitemap indexer where to find sitemaps. You can create a list of as many sitemaps as you want, but for this example the only thing we want to index is our locally created sitemap.xml. We could index the sitemap index (the VuFind crawler is smart enough to crawl into all of the sitemaps referenced by an index), but we really don't want to index VuFind's record pages into VuFind's web index; that would only be confusing. So we're going to focus on the content that exists outside of VuFind itself. We could also turn on the verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not.
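So webcrawl.ini for this demonstration comes down to just (layout per the stock file):

```ini
[General]
; chatty output while the crawler runs; purely informational
verbose = true

[Sitemaps]
; index only the hand-made site, not VuFind's own record pages
url[] = "http://localhost/sitemap.xml"
```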
Now we're all set up: VuFind knows where to find Tika for full-text extraction, and it knows where to find a sitemap to crawl. All we need to do is run the crawler, which is php import/webcrawl.php from the VuFind home directory. We see it's harvesting the sitemap.xml file and doing a bit of work, and in just a moment we should have our content.

What the web crawler is actually doing is running the XML importer that was demonstrated last month: we have an XSLT that processes sitemap XML in combination with some custom PHP code, using Tika to extract content from web pages and then indexing them into a special Solr core that was designed specifically for searching web pages. The other little piece the web crawler does is keep track of when it runs, and it deletes anything that was indexed on a prior run. Every time you run the web crawler, it captures a timestamp, it indexes all of the sitemaps you've referred to, and then it deletes anything that's older than the time at which the process started.
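That timestamp-based cleanup can be sketched in a few lines of Python (a conceptual model only, not VuFind's actual PHP code):

```python
import time

def crawl(index, pages):
    """Model of the webcrawl update strategy: re-index everything found in
    the sitemaps, then purge records untouched since the run began."""
    start = time.time()
    for url, text in pages.items():
        index[url] = {"text": text, "indexed": time.time()}
    # anything whose timestamp predates this run was not seen this time
    for url in [u for u, rec in index.items() if rec["indexed"] < start]:
        del index[url]
    return index
```

Running crawl() again without a previously seen URL in pages removes that record from the index.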
So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this, though, because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index; use it with caution.

In any case, now that the indexing has completed, we can go back to our VuFind interface, and if we go to vufind/Web (with a capital W), that brings up the website search, which uses the Solr website index I mentioned. If I just do a blank search, we will see that I now have two pages: both of the pages from my sitemap were indexed. And just to prove that the full-text searching works, I type the word "lazy" here (that word appears on only one of these two pages), and sure enough, there it is, with highlighting where the text matched. Everything is working.
One quick thing to demonstrate before we call it a day: there are two configuration files that might be of interest. There's website.ini, which controls all the behavior of the web search; this is kind of like a combination of facets.ini and searches.ini for the main biblio index, but with these settings applied to the website search. So if you want to customize recommendation modules, or change labels, sort options, facets, and so on, it's all in here; this is how you can control the presentation of your website search. Also of possible interest is config/vufind/websearchspecs.yaml; this is just like the regular searchspecs.yaml for the biblio index, but tuned to the website index, so if you want to change which fields are searched or how relevancy ranking works, this is the place where you can do that.
Finally, if you want to configure the actual indexing process, there is an import/xsl/sitemap.xsl, which is the XSLT that gets applied to all of the sitemaps in order to index them; as you can see here, this is really just a wrapper around a PHP function, VuFindSitemap::getDocument(). And import/sitemap.properties is the import configuration that sets up the custom class, specifies the XSLT, and so forth. It's beyond the scope of today's video, but if you want to customize things, what you want to do is override the VuFindSitemap class with your own behavior, and you can do anything you like in that PHP; for example, you might want to extract values from particular HTML meta tags and use them for facets, or whatever you need to do.
So that's it for this month. Next month we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular biblio book-and-journal index at the same time; I will show you how to do that. Until then, have a good month, and thank you for listening.
videos/sitemaps_and_web_indexing.txt · Last modified: 2023/04/26 13:29 by crhallberg