A PHP Rogue Robot Trap
Some user agents wilfully disregard the access rules laid down in the robots.txt file. Typically these are agents that are up to no good: email address collectors, ruthless web site downloaders, and so on.
This PHP script, in conjunction with the robots.txt file, is designed to identify such agents so that they can subsequently be banned from the site using the .htaccess file.
The basic mechanism is to include a rule in the robots.txt file that disallows access to a special directory, say robottrap:
User-agent: *
Disallow: /robottrap/
For some rogue robots, the existence of a forbidden directory is reason enough to visit it. Others may need to be tempted with a link on your index page that human visitors cannot see or click, such as this anchor with no link text:
<a href="/robottrap/robottrap.php"></a>
The robottrap.php script has three functions:
There are three variables which need to be set up to allow the script to be tailored to local requirements:
The three variables are all declared together near the top of the script, and are readily identifiable.
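The script itself is not reproduced on this page, so the following is only a rough sketch of how the logging part of such a trap might look, not the code from the download. It assumes PHP 7 or later; the function names `trap_format_entry` and `trap_record` and the tab-separated log format are hypothetical choices made for illustration.

```php
<?php
// Hypothetical helper: builds one tab-separated log line for a request.
// $server is expected to look like $_SERVER; $now is a Unix timestamp.
function trap_format_entry(array $server, int $now): string
{
    $ip    = $server['REMOTE_ADDR'] ?? 'unknown';
    $agent = $server['HTTP_USER_AGENT'] ?? 'unknown';

    // Collapse tabs and line breaks so a crafted User-Agent header
    // cannot forge extra fields or extra log lines.
    $agent = preg_replace('/[\t\r\n]+/', ' ', $agent);

    // gmdate() keeps the timestamp independent of the server time zone.
    return gmdate('Y-m-d H:i:s', $now) . "\t" . $ip . "\t" . $agent . "\n";
}

// Hypothetical helper: appends the entry to the log file.
// LOCK_EX guards against two simultaneous visits interleaving their lines.
function trap_record(string $logFile, array $server, int $now): bool
{
    $line = trap_format_entry($server, $now);
    return file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX) !== false;
}
```

A trap page built this way would call something like `trap_record(__DIR__ . '/robottrap.log', $_SERVER, time());` before sending any output; the log file name here is likewise an assumption.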
Finally, there is a web page, robotreport.php, that displays a full list of the robots that have fallen into your trap. Note that some so-called "web accelerators", which simply pre-fetch every link on a page at a cost to your bandwidth, will also be caught by the trap.
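As an illustration of what a report page of this kind could do, the sketch below parses a log, renders it as an HTML table, and derives ban directives for the .htaccess file. It is not the robotreport.php from the download. It assumes PHP 7 or later and a hypothetical log with one tab-separated line per visit (timestamp, IP address, user agent); the `report_*` function names are invented for the example, and the `Deny from` syntax assumes Apache 2.2-style access control (on Apache 2.4 this requires mod_access_compat).

```php
<?php
// Hypothetical helper: turns the raw log text into an array of rows.
// Lines that do not have exactly three tab-separated fields are skipped.
function report_parse_log(string $contents): array
{
    $rows = [];
    foreach (explode("\n", $contents) as $line) {
        $parts = explode("\t", $line, 3);
        if (count($parts) === 3) {
            $rows[] = ['time' => $parts[0], 'ip' => $parts[1], 'agent' => $parts[2]];
        }
    }
    return $rows;
}

// Hypothetical helper: renders the rows as an HTML table.
// Every field is escaped because the user agent string is attacker-controlled.
function report_render_html(array $rows): string
{
    $html = "<table>\n";
    foreach ($rows as $row) {
        $html .= '<tr>';
        foreach (['time', 'ip', 'agent'] as $field) {
            $html .= '<td>' . htmlspecialchars($row[$field], ENT_QUOTES, 'UTF-8') . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

// Hypothetical helper: one "Deny from" line per distinct, valid IP address.
function report_deny_lines(array $rows): array
{
    $lines = [];
    foreach ($rows as $row) {
        if (filter_var($row['ip'], FILTER_VALIDATE_IP) !== false) {
            // Using the directive as the array key removes duplicates.
            $lines['Deny from ' . $row['ip']] = true;
        }
    }
    return array_keys($lines);
}
```

The generated lines are meant to be reviewed before they are pasted into .htaccess, since pre-fetching tools used by legitimate visitors can end up in the log as well.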
Download Zip File (3138 bytes).