
robots.txt

A robots.txt file specifies which parts of your site search engine crawlers are allowed to access. This is relevant to VuFind users for two reasons:

  • It may allow you to improve VuFind's performance by reducing the load placed on it by robots.
  • It may be necessary to comply with the licenses of third-party content providers (e.g. Summon).

Basic Syntax

The basics of robots.txt syntax can be found at The Web Robots Pages.

File Location

The most important thing to know about robots.txt is that it must exist at the root of your server. If VuFind is running in the root of your server, this means you can simply create a robots.txt file in VuFind's web folder (public/ in VuFind 2.x or later). If VuFind is running in a directory (the most common use case), you will need to place the robots.txt file in your Apache web root or manage it through your site's Content Management System (if applicable).

To summarize: the URL must be http://your-server/robots.txt. Crawlers will not look for the file at http://your-server/vufind/robots.txt.
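If VuFind runs under a /vufind path but you would like to keep the file in VuFind's own directory, one option is to have Apache serve it at the root with an Alias directive. This is a sketch, not the only approach; /usr/local/vufind is an assumed install path, so adjust it to your setup:

```apache
# Serve VuFind's copy of robots.txt at the server root
Alias /robots.txt /usr/local/vufind/public/robots.txt
```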

Example

Here is an excerpt from the robots.txt file currently used at Villanova:

User-agent: *
Disallow: /vufind/AJAX
Disallow: /vufind/Alphabrowse
Disallow: /vufind/Browse
Disallow: /vufind/Combined
Disallow: /vufind/EDS
Disallow: /vufind/EdsRecord
Disallow: /vufind/Search/Results
Disallow: /vufind/Summon
Disallow: /vufind/SummonRecord
Disallow: /vufind/AJAX/
Disallow: /vufind/Alphabrowse/
Disallow: /vufind/Browse/
Disallow: /vufind/Combined/
Disallow: /vufind/EDS/
Disallow: /vufind/EdsRecord/
Disallow: /vufind/Search/Results/
Disallow: /vufind/Summon/
Disallow: /vufind/SummonRecord/

This blocks access to the Combined, EDS, EdsRecord, Summon and SummonRecord controllers to prevent crawlers from reaching licensed third-party content. It also blocks the AJAX controller, since there is no reason for a search engine to request this type of dynamic content.

Note that entries are included both with and without trailing slashes – we found this helpful in ensuring compliance by some crawlers, though it may not be strictly necessary.

Also note that we have to include the full path to VuFind, which (in our case) includes the /vufind/ prefix.
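Before deploying rules like these, it can be useful to sanity-check them locally. Python's standard-library robotparser module applies the same prefix-matching logic most crawlers use; the host name below is a placeholder, and the rules are a trimmed version of the example above:

```python
from urllib import robotparser

# Trimmed version of the example rules above
rules = """\
User-agent: *
Disallow: /vufind/AJAX
Disallow: /vufind/Summon
Disallow: /vufind/Search/Results
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A disallowed controller is blocked...
print(parser.can_fetch("*", "http://your-server/vufind/Summon/Search"))  # False
# ...while record pages remain crawlable.
print(parser.can_fetch("*", "http://your-server/vufind/Record/12345"))   # True
```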

We recently added the Browse module to avoid redundant crawling. We also disallowed the Alphabrowse and Search/Results pages to comply with Google's crawling guidelines and to reduce server strain. Since this hides search results from crawlers, we recommend providing a sitemap of all records so that each of your records is still crawled. See here for more information: Search Engine Optimization
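Once you have generated a sitemap, you can advertise it to crawlers directly in robots.txt with a Sitemap directive. The file name below is an assumption; use whatever location your sitemap generator produces:

```
Sitemap: http://your-server/sitemap.xml
```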

More Information

Google offers answers to frequently asked questions, which explain some of Google's crawling behavior in more detail. Notably, a robots.txt "Disallow" directive may be ignored based on other criteria, and the "noindex" robots meta tag may be a stronger way to hide content.
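For reference, the meta tag mentioned above is placed in a page's &lt;head&gt; and looks like this:

```html
<meta name="robots" content="noindex">
```

Note that for this tag to take effect, the page must not be blocked in robots.txt, since a crawler that cannot fetch the page will never see the tag.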

administration/robots.txt.txt · Last modified: 2020/06/04 15:06 by demiankatz