A robots.txt file can be used to specify the way search engine crawlers are allowed to access your site. This is relevant to VuFind users for two reasons:
- It may allow you to improve VuFind's performance by reducing the load placed on it by robots.
- It may be necessary to comply with the licenses of third-party content providers (e.g. Summon).
The basics of robots.txt syntax can be found at The Web Robots Pages.
The most important thing to know about robots.txt is that it must exist at the root of your server. If VuFind is running in the root of your server, this means you can simply create a robots.txt file in VuFind's web folder (public/ in VuFind 2.x or later). If VuFind is running in a directory (the most common use case), you will need to place the robots.txt file in your Apache web root or manage it through your site's Content Management System (if applicable).
To summarize: the URL must be http://your-server/robots.txt. The file will not be found at http://your-server/vufind/robots.txt.
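If VuFind lives under a path prefix and you cannot (or prefer not to) place a file in the Apache document root, one common alternative is to alias the root-level URL to a file inside VuFind's public folder. This is only a sketch, and the filesystem path is a hypothetical example:

```apache
# Serve /robots.txt from VuFind's public/ directory even though
# VuFind itself is mounted under /vufind/.
# The path below is an example; adjust it to your installation.
Alias /robots.txt /usr/local/vufind/public/robots.txt
```

Either approach works; the only requirement is that the file is reachable at the root-level URL.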
Here is an excerpt from the robots.txt file currently used at Villanova:
User-agent: *
Disallow: /vufind/AJAX
Disallow: /vufind/Alphabrowse
Disallow: /vufind/Browse
Disallow: /vufind/Combined
Disallow: /vufind/EDS
Disallow: /vufind/EdsRecord
Disallow: /vufind/Search/Results
Disallow: /vufind/Summon
Disallow: /vufind/SummonRecord
Disallow: /vufind/AJAX/
Disallow: /vufind/Alphabrowse/
Disallow: /vufind/Browse/
Disallow: /vufind/Combined/
Disallow: /vufind/EDS/
Disallow: /vufind/EdsRecord/
Disallow: /vufind/Search/Results/
Disallow: /vufind/Summon/
Disallow: /vufind/SummonRecord/
This blocks the Combined, Summon and SummonRecord controllers to prevent crawling of licensed Summon content (the EDS and EdsRecord entries serve the same purpose for EBSCO content). It also blocks the AJAX controller, since there is no reason for a search engine to access this type of dynamic content.
Note that entries are included both with and without trailing slashes – we found this helpful in ensuring compliance by some crawlers, though it may not be strictly necessary.
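If you want to sanity-check your rules before deploying them, Python's standard-library `urllib.robotparser` interprets robots.txt rules the same way most well-behaved crawlers do. A minimal sketch, using an abbreviated version of the rules above (the hostname and record path are placeholders):

```python
from urllib.robotparser import RobotFileParser

# An abbreviated version of the Villanova rules shown above.
rules = """\
User-agent: *
Disallow: /vufind/AJAX/
Disallow: /vufind/Summon/
Disallow: /vufind/SummonRecord/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Blocked: the path falls under a Disallow prefix.
print(rp.can_fetch("*", "http://your-server/vufind/Summon/Search"))  # False

# Allowed: no rule matches this path.
print(rp.can_fetch("*", "http://your-server/vufind/Record/12345"))   # True
```

This is a quick way to confirm that a new Disallow line actually covers the URLs you intend it to, and nothing more.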
Also note that we have to include the full path to VuFind, which (in our case) includes the /vufind/ prefix.
We recently added the Browse module to this list to avoid crawling of redundant content. We have also disallowed the Alphabrowse and Search/Results pages, both to comply with Google's crawling guidelines and to reduce server strain. We recommend providing a sitemap of all of your records so that each record still gets crawled. See here for more information: Search Engine Optimization
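Once you have generated a sitemap, you can advertise it directly in robots.txt with the Sitemap directive. The URL below is a placeholder; the actual filename depends on how you configure VuFind's sitemap generator:

```
Sitemap: http://your-server/sitemap.xml
```

Sitemap lines apply to all crawlers regardless of which User-agent block they appear near, so they are conventionally placed at the top or bottom of the file.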
Google offers some answers to frequently asked questions. These explain some of Google's crawling behavior in more detail, including some important caveats: a robots.txt "Disallow" directive may be ignored based on other criteria, and the "noindex" robots meta tag may be a stronger way to hide content.
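For pages that must stay out of the index entirely, the meta tag mentioned above looks like this when added to a page's &lt;head&gt; (where you add it in your VuFind theme templates is up to you):

```html
<meta name="robots" content="noindex">
```

Note that a crawler has to be able to fetch a page in order to see this tag, so a page you want de-indexed this way should not also be blocked by a Disallow rule.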