Warning: This page has not been updated in over a year and may be outdated or deprecated.
Indexing a Website

Starting with release 2.1, VuFind can be used to create a website index separate from your main search index. Results from this index can then be used on their own or merged with catalog results using the combined search tools.

Getting Started

  1. Make sure that you have a full text extraction tool installed and configured.
  2. Copy config/vufind/webcrawl.ini into the config/vufind subdirectory of your local settings directory and edit the file to specify where your website's XML sitemap lives.
  3. Run the import/webcrawl.php tool to load your website's data into the index (this may take a long time).
  4. When crawling is done, go to http://vufind_server/vufind/Web/Results (substituting your own server name for vufind_server) and enter a search in the box there.

(In very old versions of VuFind, earlier than release 3.0, you must enable the website core before running the webcrawl.php tool: edit solr/solr.xml, uncomment the appropriate line, and restart Solr.)
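
Because the crawl can take a long time, it may be worth checking first that the sitemap named in webcrawl.ini lists the pages you expect. The following is a minimal Python sketch of such a check. It is not part of VuFind and does not reflect how webcrawl.php works internally. The function name list_sitemap_urls and the sample URLs are hypothetical, and the sketch assumes a plain urlset sitemap in the sitemaps.org format (sitemap index files are not handled).

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(xml_text):
    """Return the page URLs (<loc> values) listed in a urlset sitemap.

    Hypothetical helper for illustration only; not a VuFind API.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Made-up sample sitemap; in practice you would load the XML from the
# sitemap URL that you configured in webcrawl.ini.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://library.example.edu/</loc></url>
  <url>
    <loc>http://library.example.edu/hours</loc>
    <lastmod>2021-08-03</lastmod>
  </url>
</urlset>
"""

if __name__ == "__main__":
    # Print one URL per line so the list can be reviewed before crawling.
    for url in list_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

If the list comes back empty or is missing pages, fixing the sitemap before running the crawl avoids spending time indexing an incomplete site.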

Several things can be customized (with the help of your local settings directory) to adjust web search behavior and appearance:

  • You can customize the way web pages are indexed by creating a custom version of import/xsl/sitemap.xsl and/or import/sitemap.properties.
  • You can customize search behavior and options through config/vufind/website.ini and config/vufind/websearchspecs.yaml.
  • You can customize display behavior through the VuFind\RecordDriver\SolrWeb record driver and corresponding templates.

Notes

  • The current webcrawl.php tool works very much by brute force; we may want to build a more intelligent, flexible crawler at some point in the future.

You can learn more about web indexing through the Sitemaps and Web Indexing video.

indexing/websites.txt · Last modified: 2021/08/03 13:49 by demiankatz