A Perl Rogue Robot Trap
Some user agents wilfully disregard the access rules laid down in the robots.txt file. Typically these are agents that are up to no good: email address collectors, ruthless web site downloaders, and so on.
This Perl script, in conjunction with the robots.txt file, is designed to identify such agents so that they can subsequently be banned from the site using the htaccess file.
The basic mechanism used is to include a line in the robots.txt file which disallows access to a special directory, say robottrap.
User-agent: *
Disallow: /robottrap/
For some rogue robots, the existence of a forbidden directory is reason enough to visit it. For others, you may need to tempt them with a link on your index page that human visitors cannot see or click, such as:
<a href="/robottrap/index.shtml"></a>
It doesn't matter what the /robottrap/index.shtml web page looks like. It is simply a vehicle for invoking a Perl CGI script called robottrap.cgi in the server. To do this, the web page must include the following line within the body:
<!--#include virtual="/cgi-bin/robottrap.cgi"-->
It is assumed here that the CGI scripts are held in a first-level subdirectory called cgi-bin. You may have to change this to match the address of your script directory. Note that the page should have an extension of .shtml so that the server processes the include directive.
The robottrap.cgi script has two functions:
There are three variables which need to be set up to allow the script to be tailored to local requirements:
The three variables are all declared together near the top of the script, and are readily identifiable.
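To make the mechanism concrete, here is a minimal sketch of what a trap script of this kind might look like. It is not the robottrap.cgi script itself: the log location, the ROBOTTRAP_LOG environment variable, the tab-separated record format, and the subroutine names trap_record and log_visit are all assumptions made for illustration. It simply records the time, IP address, and user agent of whoever requested the trap page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical log location (an assumption, not taken from the original script).
my $log_file = $ENV{ROBOTTRAP_LOG} || '/tmp/robottrap.log';

# Build one tab-separated record from a CGI-style environment hash.
sub trap_record {
    my ($env, $time) = @_;
    my $ip    = $env->{REMOTE_ADDR}     // 'unknown';
    my $agent = $env->{HTTP_USER_AGENT} // 'unknown';
    $agent =~ s/[\t\r\n]+/ /g;    # keep each record on a single line
    return join("\t", $time, $ip, $agent);
}

# Append the record under an exclusive lock so that concurrent hits
# do not interleave. Returns 1 on success, 0 if the file cannot be opened.
sub log_visit {
    my ($file, $record) = @_;
    open(my $fh, '>>', $file) or return 0;
    flock($fh, LOCK_EX);
    print {$fh} "$record\n";
    close($fh);
    return 1;
}

# When run directly as a CGI, log the visit and emit an empty response
# for the server-side include to insert into the page.
unless (caller) {
    log_visit($log_file, trap_record(\%ENV, time));
    print "Content-type: text/html\n\n";
}
```

Because the script is invoked through a server-side include, it only needs to print a header and an empty body; the visitor sees whatever the surrounding .shtml page contains.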
Finally, there is a web page, robotreport.shtml, that displays a full list of robots that have fallen into your trap. Note that some so-called "web accelerators", which simply pre-fetch all links on a page at a cost to your bandwidth, will also be caught by the trap.
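Once the trapped addresses have been reviewed, banning them through the htaccess file could be sketched as follows. This is an illustration only: the subroutine name deny_rules and the example addresses are hypothetical, and the rules use the Apache 2.2-style Order/Allow/Deny directives, which is an assumption about the server in use (Apache 2.4 prefers Require directives unless mod_access_compat is loaded).

```perl
use strict;
use warnings;

# Turn a list of trapped IP addresses into Apache 2.2-style deny rules.
# Duplicates are dropped, and anything that does not look like a dotted
# IPv4 address is skipped so a malformed entry cannot corrupt the file.
sub deny_rules {
    my @ips = @_;
    my %seen;
    my @lines = ('Order Allow,Deny', 'Allow from all');
    for my $ip (@ips) {
        next unless $ip =~ /\A\d{1,3}(?:\.\d{1,3}){3}\z/;
        next if $seen{$ip}++;
        push @lines, "Deny from $ip";
    }
    return join("\n", @lines) . "\n";
}

# Example addresses from the documentation ranges; the repeat is ignored.
print deny_rules('192.0.2.10', '198.51.100.7', '192.0.2.10');
```

Given the web accelerator caveat above, it is worth checking each entry by hand before adding it to the deny list, since not every visitor to the trap is malicious.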
The functionality of these two scripts is also provided in PHP, which is a little easier to implement. The generated file is compatible with both versions.