====== Video 8: Sitemaps and Web Indexing ======
The eighth VuFind video is a discussion of sitemaps and web crawling: how VuFind can generate sitemaps of its own content, and how it can build a searchable index from other sites' sitemaps.

Video is available as an [[https://…]]
===== Transcript =====
So, this month's VuFind video is a discussion of sitemaps and web crawling, and it's sort of a logical follow-up to last month's video.

So, first, a quick introduction to what I'm talking about. If you put a website up on the internet, chances are that sooner or later search engines will find it and crawl through it and make it searchable. But without a little bit of help, this can happen somewhat haphazardly, and that is where XML sitemaps come into play.

A sitemap is just an XML document that lists all of the pages on your site, which you can submit to a search engine in order to let them find all of your content. This is certainly important for something like VuFind, where the interface is search-driven and there may not actually be a way to crawl into every single record without typing something into a box first.

So, by publishing a sitemap, we make it possible for all of our index records to be findable. And so VuFind includes facilities for generating sitemaps, which make it very search engine friendly and will make your content a lot more visible.

On the flip side of the coin, sitemaps are also a really useful tool for harvesting content. And so VuFind also includes tools for crawling all of the pages in a sitemap in order to build a website index.

So today I will show sort of both sides of that equation: how you can make sitemaps from VuFind, and how you can populate a web index from sitemaps. I have up here sitemaps.org, which is where the sitemap protocol specification can be found if you want to learn more about how these documents are structured.
As a really simple example, I've created this beautiful website on my virtual machine. It's just a couple of HTML files I hand-edited in the web root. So I've got this front page, and I have a link that leads to this other page. And I have by hand generated a sitemap.xml file, which just lists both of these pages: the root of the site and the linked page.

So, suppose that I want this website I've just created to live in harmony with the VuFind instance I've been demonstrating for some time. What I would want to do is create a sitemap covering the content of VuFind, as well as a sitemap index; the sitemap protocol allows you to group together multiple sitemaps so that they can all be discovered as a bundle.
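As a concrete sketch of what such a hand-written sitemap might contain (the URLs and filename here are illustrative assumptions, not the actual demo files), a minimal ''sitemap.xml'' following the sitemaps.org protocol looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- The root of the demo site: -->
  <url>
    <loc>http://localhost/</loc>
  </url>
  <!-- The second, hand-linked page (filename assumed): -->
  <url>
    <loc>http://localhost/linked.html</loc>
  </url>
</urlset>
```

The full schema, including optional elements like ''lastmod'' and ''changefreq'', is documented at sitemaps.org.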
- | which you can submit to a search engine | + | Fortunately, |
- | in order to let them find all of your | + | |
- | content | + | The "site map index" |
- | for something like view find where it's | + | |
- | a search driven interface and there may | + | You can control the name of the index file that VuFind |
- | not actually be a way to crawl into | + | |
- | every single record without typing | + | And now VuFind |
- | something into a box first | + | |
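Putting those settings together, the relevant parts of ''local/config/vufind/sitemap.ini'' might look roughly like the following. This is a sketch: the key names reflect the settings discussed above, and the values are this example's assumptions, so check the comments in the distributed file for exact names and defaults.

```ini
[Sitemap]
frequency    = daily            ; crawl-frequency hint written into the sitemap
countPerPage = 10000            ; URLs per generated sitemap chunk (example value)
fileName     = vufindsitemap    ; avoid clobbering the hand-made sitemap.xml
fileLocation = /var/www/html    ; web root on this Ubuntu server

[SitemapIndex]
baseSitemapUrl      = http://localhost   ; URL corresponding to fileLocation
baseSitemapFileName = sitemap.xml        ; existing hand-made sitemap to include
```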
Now, I'm just running all of these tools using my own ''dkatz'' account. In a production environment, of course, you would want to have a dedicated user for running these utilities, and you would have ownership set accordingly, but for expediency I'm just going to use ''dkatz'' for all of these demonstrations. So I'm going to ''sudo chown dkatz /var/www/html'' so that I now own the web root and have permission to write files there.

And then all I need to do is say ''php util/sitemap.php'' from the VuFind home directory, and that will run the generator. It takes a few seconds, and now if I do a file listing of ''/var/www/html'', I will see that there is a ''vufindsitemap.xml'' and a ''sitemapIndex.xml'' that were not there before. We can also look at those through the browser: if I go to ''localhost/sitemapIndex.xml'', sure enough, it points us to two different files: the existing sitemap that was already there (which we told VuFind about as the base sitemap) and also the new ''vufindsitemap.xml'' that has been generated. If we look at that one, we will see that it contains a list of all the records in VuFind, so that they can all be easily crawled.
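The generated index file itself is tiny. Given a hand-made sitemap plus one generated sitemap, a ''sitemapIndex.xml'' will contain something along these lines (illustrative; the real output's formatting and URLs depend on your configuration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- The hand-made sitemap configured as the base sitemap: -->
  <sitemap>
    <loc>http://localhost/sitemap.xml</loc>
  </sitemap>
  <!-- The sitemap the generator just wrote: -->
  <sitemap>
    <loc>http://localhost/vufindsitemap.xml</loc>
  </sitemap>
</sitemapindex>
```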
- | make it very search engine friendly and | + | So one more small step that you might want to take with sitemaps is to publish for search engines where they can be found. You may be familiar with a file called |
- | will make your content a lot more | + | |
- | visible on the flip side of the coin | + | So if I edit ''/ |
- | sitemaps | + | |
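In this example the whole file can be a single line (a real robots.txt would often also contain User-agent and Disallow rules):

```
Sitemap: http://localhost/sitemapIndex.xml
```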
Now that we've seen how to create sitemaps with VuFind, let's talk about how VuFind can take advantage of other people's sitemaps to index external systems. As I mentioned earlier, VuFind has the capacity to index content from sitemaps into a web index so that you can search your own website. The first thing you need to do is set up a full-text extraction tool. There are several places in VuFind where it can take advantage of full-text extraction to make the content of a file searchable; for example, when indexing XML, you can set up custom rules that will use URLs within metadata to retrieve content and then index all of the text coming back from those URLs, and that same mechanism is used by VuFind's sitemap indexer.

VuFind supports two different full-text extraction tools. One is called Aperture, but it has not been updated in many years, so I strongly encourage you to use the other option, which is Apache Tika; that can be obtained at tika.apache.org. If you just go to the download page, there are a number of downloads available, but what we want is the tika-app jar file, which is all that you need to extract text content from a variety of different file formats, including various document types and, fortunately for us, web pages. I've actually already completed the download of this file to save a little bit of time.
Once we've downloaded Tika, we want to set it up in a way where it's easily accessible. What I like to do is give it its own directory; I'm going to call it ''/usr/local/tika'', and I'm going to copy the file from my download directory into ''/usr/local/tika''. I also like to create a symbolic link shortcut from the long Tika jar filename to just ''tika.jar'', which makes VuFind configuration easier: we can download new versions of the app as they're released in the future, and we just have to rewrite the symbolic link instead of having to constantly edit VuFind configuration files. So I'm just going to do a quick ''sudo ln -s /usr/local/tika/tika-app-1.24.1.jar /usr/local/tika/tika.jar''. And that's all we have to do to install Tika: download a jar file and put it someplace nice and easy.
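The symlink convention can be sketched as follows. This sketch uses a scratch directory and an empty stand-in file so it is self-contained; in the video, the directory is ''/usr/local/tika'', the jar is really downloaded from tika.apache.org, and the commands are run with sudo.

```shell
# Scratch directory standing in for /usr/local/tika:
TIKA_DIR=$(mktemp -d)

# Stand-in for the downloaded tika-app jar:
touch "$TIKA_DIR/tika-app-1.24.1.jar"

# Point a stable name at the versioned jar; after a future upgrade,
# only this link needs to change, never VuFind's configuration:
ln -s "$TIKA_DIR/tika-app-1.24.1.jar" "$TIKA_DIR/tika.jar"

# The link resolves to the versioned jar:
readlink "$TIKA_DIR/tika.jar"
```

When a new Tika release comes out, replacing the link with ''ln -sf'' against the new jar is the only change needed.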
But we also have to tell VuFind where to find it, and for that there is a configuration file called ''fulltext.ini''. So let's copy ''config/vufind/fulltext.ini'' to ''local/config/vufind'', as we always do to override it, and then edit ''local/config/vufind/fulltext.ini''. Like all of VuFind's configuration files, once again, there are lots of comments here explaining what it does. We just need to do two simple things: uncomment the general setting and set it to use Tika (if we didn't do that, it would try to auto-detect which tool is being used, which is fine, but telling it is a little bit faster), then skip past the Aperture settings and just uncomment the Tika path so VuFind knows where to find Tika. As you can see, the default setting matches the symbolic link that I set up, so there's no need to change anything other than uncommenting the line.
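After those two changes, the uncommented parts of ''local/config/vufind/fulltext.ini'' look approximately like this (a sketch; the section and key names here are assumptions based on the description above, so verify them against the comments in the distributed file):

```ini
[General]
parser = Tika   ; name the tool explicitly rather than auto-detecting

[Tika]
path = "/usr/local/tika/tika.jar"   ; matches the symbolic link created earlier
```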
Now we're halfway there. We've got the full-text extractor set up, but now we need to set up the web crawler, and there's another configuration file for that. So we'll copy ''config/vufind/webcrawl.ini'' to ''local/config/vufind'', and we'll edit that file.

All webcrawl.ini really does is tell VuFind's sitemap indexer where to find sitemaps. You can create a list of as many sitemaps as you want, but for this example, the only thing we want to index is our locally created sitemap.xml, not the whole sitemap index. Our VuFind crawler is smart enough to crawl into all of the sitemaps referenced by an index, but we really don't want to index VuFind's record pages into VuFind's own web index; that would only be confusing. So we're going to focus on the content that exists outside of VuFind itself.

We could also turn on the verbose setting just to get some more feedback out of the crawler while it runs, but it makes no functional difference whether we do that or not.
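The resulting ''local/config/vufind/webcrawl.ini'' is very short; roughly like this (a sketch: the sitemap URL is this example's, and the exact key names should be checked against the comments in the distributed file):

```ini
[General]
verbose = true   ; optional: extra feedback while the crawler runs

[Sitemaps]
; List as many sitemaps as you like; here, only the hand-made one,
; deliberately skipping the index that also references VuFind's own records:
url[] = "http://localhost/sitemap.xml"
```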
So now we're all set up! VuFind knows where to find Tika for full-text extraction, and it knows where to find a sitemap to index, so we can just run the crawler, which is ''php import/webcrawl.php'' from the VuFind home directory. It's reading the sitemap.xml file now and doing a bit of work, and in just a moment we should have our content indexed.
What the web crawler is actually doing is running the XML importer that was demonstrated last month: we have an XSLT that uses the sitemap XML, in combination with some custom PHP code, to run Tika to extract content from web pages and then index them into a special Solr core designed specifically for searching web pages. The other little piece the web crawler adds is that it keeps track of when it runs, and it deletes anything that was indexed on a prior run.

Every time you run the web crawler, it captures a timestamp, it indexes all of the sitemaps you referred to, and then it deletes anything that's older than the time at which the process started. So if web pages are removed, the indexer will get rid of them on the next run. You do have to be careful about this, though, because if you run the web crawler at a time when a website is temporarily offline, it's going to wipe out parts of your index, so use with caution.
In any case, now that the indexing has completed, we can go back to our VuFind interface and go to ''/Web'' (with a capital W), which brings up the website search using the Solr website index I mentioned earlier. If I do a blank search to retrieve everything, we will see that I now have two pages: both of the pages from my sitemap were indexed. And just to prove that the full-text searching works: if I type the word "lazy" here, a word that appears on only one of these two pages, sure enough, there it is, and it highlights where the text matched. Everything is working.
One quick thing to demonstrate before we call it a day is that there are two configuration files that might be of interest. There's a ''website.ini'' file, which controls all the behavior of the web search; this is kind of like a combination of facets.ini and searches.ini for the main index, but these settings are applied to the website search. So if you want to customize recommendation modules, or change labels, sort options, facets, etc., it's all in here. This is how you can control the presentation of your website search.

Also of possible interest is ''config/vufind/websearchspecs.yaml''. Again, this is just like the regular searchspecs.yaml used for the Biblio core, but tuned to the website index. So if you want to change which fields are searched, or how relevancy ranking works, this is the place where you can do that.
Finally, if you want to configure the actual indexing process, there is an ''import/xsl/sitemap.xsl'' file, which gets applied to all of the sitemaps in order to index them; as you can see here, this is really just a wrapper around a PHP function called VuFindSitemap::getDocument. And ''import/sitemap.properties'' is the import configuration that sets up the custom class, specifies the XSLT, and so forth. It's beyond the scope of today's video to go deeper, but if you want to customize things, what you want to do is override the VuFind class with your own behavior, and you can do anything you like in that PHP. For example, you might want to extract values from particular HTML meta tags and use them for facets, or whatever you need to do.
So, that's it for this month. Next month, we are going to look at how you can combine different kinds of searches in VuFind, which will be useful because it would be nice to be able to search our new website index and our regular index at the same time; I will show you how to do that. Until then, have a good month, and thank you for listening.

//This is an edited version of an automated transcript. Apologies for any errors.//
videos/sitemaps_and_web_indexing.1596807013.txt.gz · Last modified: 2020/08/07 13:30 by demiankatz