A Perl Rogue Robot Trap

Some user agents wilfully disregard the access rules laid down in the robots.txt file. Typically these are agents that are up to no good: email address collectors, ruthless web site downloaders, and so on.

This Perl script, used in conjunction with the robots.txt file, is designed to identify such agents so that they can subsequently be banned from the site using the .htaccess file.

The basic mechanism used is to include a line in the robots.txt file which disallows access to a special directory, say robottrap.

User-agent: *
Disallow: /robottrap/

For some rogue robots, the existence of a forbidden directory is reason enough to visit it. Others may need to be tempted with a link on your index page that human visitors cannot see or follow, such as:

<a href="/robottrap/index.shtml"></a>

It doesn't matter what the /robottrap/index.shtml web page looks like. It is simply a vehicle for invoking a Perl CGI script called robottrap.cgi on the server. To do this, the web page must include the following line within the body:

<!--#include virtual="/cgi-bin/robottrap.cgi"-->

It is assumed here that the CGI scripts are held in a first level subdirectory called cgi-bin. You may have to change this to match the address of your script directory. Note that the page should have an extension of .shtml, which is the extension servers are conventionally configured to scan for include directives.

The robottrap.cgi script has two functions:

  1. It records the date and time of the hit, the name of the user agent, the originating IP address, and the domain name in a file. This file may be listed using the supplied page robotreport.shtml.
  2. It also sends the same details by email to the given email address.
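
The logging half of that job can be sketched as follows. This is a minimal illustration rather than the code in the download: the sub names, the tab-separated record layout, and the data/robottrap.dat path are all assumptions made for the example. The agent, address, and host name come from the standard CGI environment variables.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
use Fcntl qw(:flock);

# Build one tab-separated record. The field order and the separator are
# assumptions for this sketch, not the format of the real data file.
sub format_record {
    my ($epoch, $agent, $ip, $host) = @_;
    my $stamp = strftime('%Y-%m-%d %H:%M:%S', gmtime($epoch));
    for ($agent, $ip, $host) {
        # Missing values become "-"; tabs and newlines are flattened so a
        # hostile User-Agent string cannot forge extra fields or records.
        $_ = '-' unless defined $_ && length $_;
        s/[\t\r\n]+/ /g;
    }
    return join("\t", $stamp, $agent, $ip, $host) . "\n";
}

# Append a record under an exclusive lock so simultaneous hits do not interleave.
sub append_record {
    my ($file, $line) = @_;
    open(my $fh, '>>', $file) or die "Cannot open $file: $!";
    flock($fh, LOCK_EX)       or die "Cannot lock $file: $!";
    print {$fh} $line;
    close($fh)                or die "Cannot close $file: $!";
}

# Only act when running under a web server (CGI sets GATEWAY_INTERFACE).
if ($ENV{GATEWAY_INTERFACE}) {
    # REMOTE_HOST is often empty unless the server performs hostname lookups.
    my $line = format_record(time, $ENV{HTTP_USER_AGENT},
                             $ENV{REMOTE_ADDR}, $ENV{REMOTE_HOST});
    append_record('data/robottrap.dat', $line);   # hypothetical path
    print "Content-type: text/html\n\n";          # header required of CGI output
}
```

Keeping the record on a single line with a fixed separator makes it straightforward for a report page to split the file back into columns.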

There are three variables which need to be set up to allow the script to be tailored to local requirements:

  1. The directory where the data file should be stored. By default this is the data subdirectory of the directory where the script is held; it must be created beforehand.
  2. The email address to which the details of the rogue agents should be sent.
  3. The location of the local sendmail program. If the script doesn't work, check the location of this program with your system manager.

The three variables are all declared together near the top of the script, and are readily identifiable.
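
As a rough guide to what those settings control, the sketch below declares three such variables and uses two of them to send a report through sendmail. The variable names, the example values, and the subs build_message and send_report are hypothetical; the names in the actual script may differ.

```perl
use strict;
use warnings;

# Hypothetical names and values standing in for the script's three settings.
my $data_dir = './data';                  # where the data file is kept
my $mail_to  = 'webmaster@example.com';   # who receives the reports
my $sendmail = '/usr/sbin/sendmail';      # a common location, but it varies

# Assemble a minimal message: headers, a blank line, then the body.
sub build_message {
    my ($to, $details) = @_;
    return "To: $to\n"
         . "Subject: Robot trap hit\n"
         . "\n"
         . $details;
}

# Pipe the message to sendmail; -t tells it to read recipients from the headers.
sub send_report {
    my ($program, $message) = @_;
    die "sendmail not found at $program\n" unless -x $program;
    open(my $mail, '|-', $program, '-t') or die "Cannot run $program: $!";
    print {$mail} $message;
    close($mail) or warn "sendmail exited with status $?\n";
}

# Example call, left commented out so that loading this file sends nothing:
# send_report($sendmail, build_message($mail_to, "agent, IP and host go here\n"));
```

If no mail arrives, the -x check in send_report is the first thing to look at, since a wrong sendmail path is the failure the third setting exists to prevent.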

Finally, there is a web page, robotreport.shtml, that displays a full list of the robots that have fallen into your trap. Note that some so-called "web accelerators", which simply pre-fetch all the links on a page at the cost of your bandwidth, will also be caught by the trap.
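
Once the report has identified an offender, the ban itself is an access rule in the .htaccess file. The sketch below turns a list of addresses into such rules. It assumes Apache 2.4 syntax (older servers use Order/Deny directives instead) and a server that permits these directives in .htaccess; htaccess_ban_block is a hypothetical helper, not part of the download, and the addresses are documentation examples.

```perl
use strict;
use warnings;

# Turn a list of IP addresses into an Apache 2.4 access block.
# Strings that do not look like an IPv4 or IPv6 address are dropped,
# so stray text from a log cannot end up in the server configuration.
sub htaccess_ban_block {
    my @ips = grep { /\A[0-9A-Fa-f:.]+\z/ } @_;
    return join("\n",
        '<RequireAll>',
        '    Require all granted',
        (map { "    Require not ip $_" } @ips),
        '</RequireAll>',
    ) . "\n";
}

# Print rules for two example addresses; paste the output into .htaccess.
print htaccess_ban_block('192.0.2.15', '198.51.100.7');
```

Check each address before banning it, since a pre-fetching accelerator used by a genuine visitor would be locked out in the same way as a rogue robot.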

The functionality of these two scripts is also provided in PHP, which is a little easier to implement. The data file that is generated is compatible with both versions.

Download Zip File (5 Kb).