Some user agents wilfully disregard the access rules laid down in the robots.txt file. Typically these are agents that are up to no good: email address harvesters, ruthless web site downloaders, and so on.
This PHP script, in conjunction with the robots.txt file, is designed to identify such agents and ban them from the site using the .htaccess file.
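The ban itself is just a per-IP deny rule appended to the .htaccess file. A typical fragment (illustrative only; the IP address is an example) might look like this:

```
order allow,deny
allow from all
deny from 203.0.113.9
```

With `order allow,deny`, a request matching a `deny` line is refused even though `allow from all` is present, so each trapped robot's address can simply be appended as a new `deny from` line.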
The basic mechanism used is to include a line in the robots.txt file which disallows access to a special directory, say robottrap.
User-agent: *
Disallow: /robottrap/
For some rogue robots, the existence of a forbidden directory is reason enough to visit it. For others, you may need to tempt them with a link on your index page which is inaccessible to humans, such as:
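One common approach (an illustrative sketch; the exact markup is included in the download) is an anchor with no visible content, so humans never see or click it while link-following robots still find the href:

```html
<!-- Invisible to humans: the anchor has no visible content,
     so only agents that harvest raw href values will follow it. -->
<a href="/robottrap/robottrap.php"></a>
```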
The robottrap.php script performs three functions: it identifies the offending agent, records its visit, and bans it from the site via the .htaccess file.
Two variables need to be set to tailor the script to local requirements; full instructions on setting up the rogue robot trap are included in the header comments of the script.
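The core mechanism can be sketched as follows. This is a hypothetical outline, not the actual robottrap.php code: the variable and function names are illustrative, and the real script's two configurable variables serve a similar purpose to the paths shown here.

```php
<?php
// Hypothetical sketch of the trap's core logic (names are illustrative,
// not taken from the actual robottrap.php).

// Local settings -- analogous to the script's two configurable variables.
$htaccess_path = __DIR__ . '/.htaccess';   // assumed location
$log_path      = __DIR__ . '/robots.log';  // assumed location

// Build the .htaccess line that bans a single IP address.
function make_deny_line(string $ip): string {
    return "deny from " . $ip . "\n";
}

// Record the offending agent and ban it.
function trap_robot(string $ip, string $agent,
                    string $htaccess, string $log): void {
    // 1. Identify: merely arriving at this script marks the agent as rogue,
    //    since robots.txt forbids the directory it lives in.
    // 2. Record: log the IP and user-agent string for later display.
    file_put_contents($log, date('c') . "\t$ip\t$agent\n", FILE_APPEND);
    // 3. Ban: append a deny rule to the .htaccess file.
    file_put_contents($htaccess, make_deny_line($ip), FILE_APPEND);
}

// In the live script this would be driven by the request environment:
// trap_robot($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'],
//            $htaccess_path, $log_path);
```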
Finally, there is a second script, robots.php, which displays a full list of the robots that have fallen into your trap. Note that some so-called "web accelerators", which simply pre-fetch every link on a page at the cost of your bandwidth, will also be caught by the trap.
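A display page along these lines could be written as below. Again this is a hedged sketch, not the distributed robots.php: it assumes the trap appends tab-separated "timestamp, IP, user-agent" lines to a log file whose name is invented here.

```php
<?php
// Hypothetical sketch of robots.php: list every trapped robot.
// Assumes the trap writes tab-separated log lines
// ("timestamp<TAB>IP<TAB>user-agent") to robots.log (an assumed filename).

// Turn one log line into an HTML-safe display row.
function format_entry(string $line): string {
    [$when, $ip, $agent] = explode("\t", $line, 3);
    return htmlspecialchars("$when  $ip  $agent");
}

$log = __DIR__ . '/robots.log';   // assumed log file written by the trap
if (is_readable($log)) {
    foreach (file($log, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        echo format_entry($line), "<br>\n";
    }
} else {
    echo "No robots have fallen into the trap yet.\n";
}
```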
Arie Slob has kindly pointed out that there should be an interval between uploading the new robots.txt file and setting the trap, to ensure that valid robots are not working from an old cached copy of the file. Twenty-four hours should suffice.
Download Zip File (7,763 bytes).