On the Care and Feeding of Bots

article by michal wallace (sabren@manifestation.com)

Abstract

This article discusses web robots, how they work, and how to build them. You should have some basic understanding of programming and web concepts before reading this article. All of the example code uses Perl, but the commentary applies to bots written in any language.

Introduction

Web robots, or bots, make life easier by automating web-related tasks. Simple bots can monitor sites for changes, index pages for searching, check sites for broken links, or grab a list of headlines from a news site. More advanced bots gather information from multiple sites - perhaps to compare prices on books - or actually perform transactions such as ordering items that are out of stock.

Technically speaking, a true robot acts autonomously, perhaps as a scheduled task. For the sake of simplicity, however, we will refer to any scripted web agent as a bot.

A Simple Bot

The following simple bot, written in Perl, gets the current size of Mozilla's Open Directory Project:

#
# dmozbot.pl - a simple bot to fetch the dmoz.org vital stats
#

use LWP::UserAgent;
use HTTP::Request::Common;

$bot = new LWP::UserAgent;
$response = $bot->request( GET "http://www.dmoz.org" );
$content = $response->content;

for (split "\n", $content) {
   print "$1\n" if /(.*sites.*unreviewed.*editors.*categories)/;
}

__END__

When I run this (in March 1999), dmozbot prints out:

408,236 sites - 3,812 unreviewed - 8,285 editors - 61,866 categories

Assuming dmoz.org keeps its current format, and that you have a recent version of Perl with the LWP modules installed, running dmozbot will give you an up-to-the-minute status report on the size of the Open Directory project.

Caring for your Bot

Before we take a closer look at this code, realize that a simple redesign over at dmoz.org could break dmozbot at any moment. Perhaps no one will cry if this particular bot dies, but a mission-critical bot requires constant attention.

For this reason, bot creators should put careful thought into error handling and logging. Administrators may also find it useful if bots that run on a schedule have some sort of "check in" mechanism showing the time of the last successful run.
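
As a minimal sketch of that idea, a scheduled bot might append the outcome of each run to a log file. The log file name and message format here are my own invention; is_success and status_line are standard HTTP::Response methods:

#
# checkin.pl - a sketch of a "check in" mechanism for a scheduled bot
#

use LWP::UserAgent;
use HTTP::Request::Common;

$bot = new LWP::UserAgent;
$response = $bot->request( GET "http://www.dmoz.org" );

# append a timestamped status line to the log (hypothetical file name)
open LOG, ">>dmozbot.log" or die "can't append to log: $!";
if ($response->is_success) {
    print LOG scalar(localtime), " - run succeeded\n";
} else {
    print LOG scalar(localtime), " - run FAILED: ",
              $response->status_line, "\n";
}
close LOG;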

How it Works

Now we will walk through the program line by line and explain how dmozbot works.

#
# dmozbot.pl - a simple bot to fetch the dmoz.org vital stats
#

Obviously, these lines do nothing. The # symbol denotes a Perl comment.

use LWP::UserAgent;
use HTTP::Request::Common;

These two lines tell Perl to use the LWP::UserAgent and HTTP::Request::Common modules, respectively. These two modules offer a very high level of abstraction for dealing with HTTP requests. Specifically, LWP::UserAgent contains the user agent code, which handles connecting to the webserver, making the request, and fetching the result. HTTP::Request::Common lets you build simple HTTP GET request objects with a single line of code.

For simple scripts like dmozbot, where it doesn't really matter whether the request succeeded, you can even use LWP::Simple, which fetches data from a URL with a single subroutine call. For spiders that hit the same site many times, replace LWP::UserAgent with LWP::RobotUA, which obeys the robots.txt standard.
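
For instance, here is the entire dmozbot fetch using LWP::Simple. This is just a sketch, but get is the module's documented one-call interface; it returns undef if the request fails:

use LWP::Simple;

# fetch the page in one call; get returns undef on failure
$content = get("http://www.dmoz.org");
die "couldn't fetch the page\n" unless defined $content;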

$bot = new LWP::UserAgent;
$response = $bot->request( GET "http://www.dmoz.org" );
$content = $response->content;

Here we see the modules put to use. The first line creates the bot, the second line tells it to fetch the dmoz.org homepage, and the third line saves the actual content of the page in the variable called $content.

A more robust bot might verify that the request actually succeeded, and deal with cookies or various forms of authentication. Consult the LWP documentation for more information.
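
As a rough sketch of that kind of defensive code, the fragment below checks the response status and attaches a cookie jar. The flow of control is my own; is_success and status_line come from HTTP::Response, and the cookie jar from the standard HTTP::Cookies module:

use LWP::UserAgent;
use HTTP::Request::Common;
use HTTP::Cookies;

$bot = new LWP::UserAgent;
$bot->cookie_jar( new HTTP::Cookies );  # remember cookies between requests

$response = $bot->request( GET "http://www.dmoz.org" );

# bail out unless the server returned a successful status code
die "request failed: ", $response->status_line, "\n"
    unless $response->is_success;

$content = $response->content;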

Once we have the content, we can employ Perl's powerful regular expression handling to retrieve the information we need:

for (split "\n", $content) {
   print "$1\n" if /(.*sites.*unreviewed.*editors.*categories)/;
}

Here, split "\n" breaks the page down into individual lines, and the for loop moves through them. The middle line tests each line of HTML against our regular expression and prints any part that matches the dmoz.org status line.

How did I know what to look for? I looked for myself. I opened http://www.dmoz.org in my browser, selected "view source" from the menu, and examined the HTML. Near the bottom of the file, I saw some lines that looked like this:

<hr width="600"><small>
408,236 sites - 3,812 unreviewed - 8,285 editors - 61,866 categories<br>
Last update:   11:21:59 PDT, Sunday, March 28, 1999

Since I only wanted part of one line on this page, I used a regular expression to match the pattern I wanted.

Regular Expressions

I personally prefer Perl for this kind of task, because Perl has very powerful regular expressions built directly into the syntax of the language. Other languages, such as Python and Java, also have strong regular expression capabilities, but I have never used either for making bots, simply because I find Perl a more natural tool for the job.

To learn more about regular expressions, read the perlre documentation that comes with Perl, or type man regex on most any unix machine.
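
To make the technique concrete, here is a sketch that captures the individual numbers from the dmoz.org status line instead of printing the whole thing. The pattern assumes the March 1999 page format shown above:

# the status line as it appeared in March 1999
$line = "408,236 sites - 3,812 unreviewed - 8,285 editors - 61,866 categories";

# each ([\d,]+) captures one comma-formatted number
if ($line =~ /([\d,]+) sites.*?([\d,]+) unreviewed.*?([\d,]+) editors.*?([\d,]+) categories/) {
    ($sites, $unreviewed, $editors, $categories) = ($1, $2, $3, $4);
    print "dmoz.org lists $sites sites in $categories categories\n";
}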

XML

HTML authors don't always make it easy on bot authors. They design pages to make sense to humans, not to programs. HTML tells a browser what text should look like, but not what it means.

XML formats, such as Netscape's RDF Site Summary, make much more sense to bots, because they contain metadata explaining what the content actually means. However, most sites do not yet use XML, so for most bot tasks, we're stuck with regular expressions.
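
When a site does publish an RDF Site Summary, a bot can read headlines without any screen scraping. Here is a sketch, assuming the XML::RSS module from CPAN and a channel file you have already downloaded (the file name is hypothetical):

use XML::RSS;

$rss = new XML::RSS;
$rss->parsefile("headlines.rdf");   # hypothetical local copy of a channel

# each item carries its own title and link - no pattern matching needed
foreach $item (@{ $rss->{items} }) {
    print $item->{title}, "\n    ", $item->{link}, "\n";
}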

Legal Aspects

When building bots, you should always consider the legal and ethical implications of what you do. You probably do not want to violate someone else's copyrights, or make so many hits to a server that it slows things down for other users. If you have any reservations about sending your bot to someone else's site, simply ask permission from the webmaster. Most will not mind, so long as your bot respects their site.
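
One easy way to keep a bot polite is LWP::RobotUA, mentioned earlier, which obeys robots.txt and waits between requests to the same host. In the sketch below, the bot name is made up, and the constructor's second argument identifies you to the webmaster:

use LWP::RobotUA;
use HTTP::Request::Common;

# identify the bot and its owner so the webmaster can reach you
$bot = LWP::RobotUA->new('dmozbot/1.0', 'sabren@manifestation.com');
$bot->delay(1);   # wait at least one minute between requests to a host

$response = $bot->request( GET "http://www.dmoz.org" );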

Conclusion

Bots let you customize the web. Perl, or any other good scripting language, lets you customize your bots. So think of something useful for your bot to do and go do it!