
TAMU Webmaster's Blog


Information and insight from the A&M Webmasters

Controlling non-human website visitors with robots.txt

February 4th, 2009 by mdmcginnis

Webmasters have been using the Robots Exclusion Protocol for years to tell crawlers and spiders what to index and what to skip. It isn’t actually an official standard, but the top three search engines have recently agreed to support some new improvements to the protocol, so it’s worth using. It works on our Google Search Appliance, too. A robots.txt file can keep spiders from crawling the parts of your site that you don’t want included in search engines.

You can also prevent individual pages from being indexed by adding this line to the head section:

<meta name="robots" content="noindex" />

To prevent spiders from following any of the links on the page, add this line:

<meta name="robots" content="nofollow" />
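
The two directives can also be combined in a single tag if you want a page kept out of the index and its links left unfollowed:

<meta name="robots" content="noindex, nofollow" />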

But the simpler way to control the crawling of your site is to use robots.txt. To see if your site has one, visit http://your-site.tamu.edu/robots.txt. If it says “File Not Found,” ask your server administrator to put a “robots.txt” file in the root directory of your website. Server administrators like robots.txt because it can keep spiders from overworking their machines by blithely clicking on every link much faster than human visitors would: “Print Version,” “Download Now,” “Send by Email,” “Sort by Topic,” “Sort by Date,” “Show Details,” “Hide Details,” and so on.
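
For example, if those utility pages lived under paths such as /print/ and /email/ (hypothetical names; substitute whatever your site actually uses), a robots.txt could fence them off while leaving the rest of the site crawlable:

User-agent: *
# Hypothetical example paths; adjust to match your own site
Disallow: /print/
Disallow: /email/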

The simplest robots.txt, consisting of two lines, allows spiders to crawl your entire site:

User-agent: *
Disallow:

An empty robots.txt will do the same thing. If you want to keep spiders away from your entire site (if your department is working in stealth mode), add a forward slash to the Disallow line.
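
With the slash added, the stealth-mode version looks like this:

User-agent: *
Disallow: /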

Now, this doesn’t provide security, just obscurity. If you want people to enter a password before they can view your sensitive content, you need to set that up separately.

Google’s instructions on Creating a robots.txt file explain some of the new features you can use, such as wildcards, pattern matching, and sitemaps.
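
Here is a quick sketch of those extensions; the PDF rule and the sitemap URL are placeholders you would replace with your own:

User-agent: *
# * matches any run of characters and $ anchors the end of the URL,
# so this blocks every PDF on the site (placeholder rule)
Disallow: /*.pdf$

# Point crawlers at your XML sitemap (placeholder URL)
Sitemap: http://your-site.tamu.edu/sitemap.xml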

Next time, as we finish our discussion of the official Google tips for search engine optimization, we’ll talk about the best ways to promote your website.


