Webmasters have been using the Robots Exclusion Protocol for years to tell crawlers and spiders what to index and what to skip. Although it is not an official standard, the top three search engines have recently agreed to support some new improvements to the protocol, so you should use it. It works with our Google Search Appliance, too. A robots.txt file can keep spiders from crawling parts of your site that you don’t want included in search engines.
You can also prevent individual pages from being indexed by adding this line to the head section:
<meta name="robots" content="noindex" />
To prevent spiders from following any of the links on the page, add this line:
<meta name="robots" content="nofollow" />
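You can also combine both directives in a single tag, separating the values with a comma:

<meta name="robots" content="noindex, nofollow" />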
But the simpler way to configure the crawling of your site is to use robots.txt. To see whether your site has one, visit http://your-site.tamu.edu/robots.txt. If you get a “File Not Found” error, ask your server administrator to put a “robots.txt” file in the root directory of your website. Server administrators like robots.txt because it can keep spiders from overworking their machines by blithely clicking on every link much faster than a human visitor would: “Print Version,” “Download Now,” “Send by Email,” “Sort by Topic,” “Sort by Date,” “Show Details,” “Hide Details,” and so on.
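For example, a robots.txt that keeps spiders out of such utility pages might look like this (the paths here are hypothetical; substitute the ones your site actually uses):

User-agent: *
Disallow: /print/
Disallow: /download/
Disallow: /email/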
The simplest robots.txt, consisting of two lines, allows spiders to crawl your entire site:

User-agent: *
Disallow:
An empty robots.txt will do the same thing. If you want to keep spiders away from your entire site (if your department is working in stealth mode), add a forward slash to the Disallow line.
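If you want to double-check how crawlers will interpret your rules, Python’s standard urllib.robotparser module applies the same protocol. This sketch tests the site-wide block just described; the URL is only a placeholder:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with a forward slash on the Disallow line blocks the whole site.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Any user agent asking about any page on the site is refused.
print(parser.can_fetch("*", "http://your-site.tamu.edu/secret-page.html"))  # prints False
```

Remember that well-behaved spiders consult these rules voluntarily; nothing forces them to.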
Now, this doesn’t provide security, just obscurity. If you want people to enter a password before they can view your sensitive content, you need to set that up separately.
Google’s instructions on Creating a robots.txt file explain some of the new features you can use, such as wildcards, pattern matching, and sitemaps.
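For instance, the wildcard extensions let you block by file type, and the Sitemap line points spiders at your sitemap (the sitemap URL below is illustrative):

User-agent: *
Disallow: /*.pdf$
Sitemap: http://your-site.tamu.edu/sitemap.xml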