
TAMU Webmaster's Blog

Information and insight from the A&M Webmasters

Crawlers, blackholes and robots

March 7th, 2008 by mdmcginnis

Of course, most webmasters, like most restaurant owners, want as many visitors as possible, but there are times when you want to keep them away from parts of your site. And sometimes when you make it clear who your site is not intended for, you make it easier for your desired visitors to find you through the search engines.

Still, some of the least wanted visitors can be search engine spiders, or robots, which crawl Web sites adding pages to their indexes; that is why the robot exclusion standards were developed. Now we have our own Aggie crawler – tamu-googlebot – so we’re partial toward it. Most webmasters like their sites to be crawled; it means new information can be found.

The problem is that spiders add every page they find to their index, unless you tell them not to. A spider will discover that forgotten directory of research data from 1987 – all 800 pages of it. Or it will find the internal report that your staff was required to read three years ago, but only your staff, and it will make that report searchable, long after the problems it described have been fixed, long after you forgot about it, and long, long after you wanted anybody else to read it.
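For a single page like that forgotten internal report, a robots meta tag in the page’s head is the simplest fix: it tells compliant crawlers not to index the page or follow its links. A minimal sketch (the tag goes into each page you want kept out):

```html
<!-- In the <head> of the page you want kept out of search indexes -->
<meta name="robots" content="noindex, nofollow">
```

Note that this only works for well-behaved crawlers that honor the robot exclusion standards; it is not access control.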

Search engine crawlers can also cause problems for large sites with many pages or many large files. If poor coding practices have created a “black hole” by allowing your departmental calendar to expand to the year 2200 AD and beyond, rest assured that search engine crawlers will try to explore any events you may have planned for the 23rd century – and your Web server may become pretty busy while they do. (Most of our current students will have graduated by then.) Our own crawler also tries to convert PDF and Microsoft Office documents to plain HTML, which makes them available to more people but also puts a heavier load on your server. We don’t crawl media files themselves, but we do try to index their titles.
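To keep crawlers out of a black hole like that, a robots.txt file at the root of your site can exclude the whole directory. A sketch, assuming a hypothetical /calendar/ path (substitute your own problem directory):

```
# robots.txt, served from the root of your site
# "/calendar/" is a hypothetical example path
User-agent: *
Disallow: /calendar/
```

The `*` means the rule applies to all crawlers; a crawler checks this file before fetching pages and skips any URL whose path begins with the disallowed prefix.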

Fortunately, there are easy ways to control how a search engine spider crawls your Web site. Using robot meta tags and a robots.txt file, you can prevent our crawler and other search engine crawlers from taxing your server or your patience. By making your own decisions about what to exclude, you keep the crawler, or me, from making arbitrary ones. Some crawlers have chosen to avoid any URL with a question mark in it, simply because those dynamic pages can easily become black holes (perhaps into the 23rd century?). When we have found a black hole with thousands of identical or nearly identical pages, we have manually excluded it from our search engine. But we would rather not do that; we prefer to let your robots files control our crawler.
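You can also write robots.txt rules aimed at a single crawler, and some crawlers understand wildcard patterns for things like question-mark URLs. A hedged sketch (the crawler names and paths below are illustrative; wildcards are an extension some crawlers support, not part of the original exclusion standard):

```
# Rules for one specific crawler; a hypothetical "/reports/" path
User-agent: tamu-googlebot
Disallow: /reports/

# Crawlers that support wildcards can be told to skip all
# question-mark (dynamic) URLs:
User-agent: Googlebot
Disallow: /*?
```

Crawlers read the most specific User-agent section that matches them, so you can exclude our crawler from one directory while leaving the rest of your rules untouched.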
