I have developed Guestbooks (Guest Books) in PHP, and in common with many other people who have done so, I have found them inundated with spam. Some spam one can do little about - there are always small-minded people who will vandalise what they can, because they can. Most spam entries, however, have a purpose - to propagate web links as widely as possible so that the associated web sites attain a high search engine ranking. Such entries are usually inserted by software agents.
My initial reaction to the flood of spam was one of annoyance, but I then took it as a personal challenge to beat the spammers. This article shares the measures that I have found most useful.
You obviously don't want your Guestbook to be flooded with garbage without your knowledge. My Guestbook sends me a detailed email every time an entry is inserted, as well as every time an entry is rejected. The former is useful to ensure that your filters don't need upgrading, and the latter is useful to make sure that your filters are not stopping valid entries. For example:
Reasons for rejection:
Original referer incorrect: http://braemoor.co.uk/y.php
It appears to be a script adding this entry
IP address now banned - it has attempted to spam 3 times before
The Title field is more than 45 characters - 53 found

Date: 12 October 2011 03:15:21
Initial referer: http://braemoor.co.uk/y.php
IP address: 220.127.116.11
Domain: 18.104.22.168.justquaconnect.com
User agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FREE; .NET CLR 1.1.4322)
Contributor: Genesis_fromAA
Email address: firstname.lastname@example.org
Title: Chochqa goshti va salomatlik xaqida nima deya olasiz?
Elapsed: 0
Message: I think greater interaction among groups would result (mix up the tech bloggers and the marketers and the VCs and the usability people and, hell, even the sex bloggers) and really spark some awesome conversations.
Many spammers find guestbooks through search engines, normally using specially written software. Make sure, therefore, that your guestbook transactions are not indexed by including the following in your HTML header:
<meta name="robots" content="noindex, nofollow" />
However, it is not useful to exclude robots from accessing your guestbook pages using the robots.txt file. Whereas Good Bots will dutifully obey, such a request will be treated as an invitation to enter by the Bad Bots.
The most effective defence against your guestbook being spammed by software agents is a session check, although its use depends on how your guestbook is structured. The idea is to make sure that the Add Entry form is accessed only from a valid page, and is not invoked directly by a software agent. In theory one could check the HTTP referer ($_SERVER['HTTP_REFERER'] in PHP) for this, but in practice the value is supplied by the client and cannot be trusted.
In the case of my guestbook the Add Entry page should only be accessed from the View Entry page. To ensure this is so, I start a PHP session in the View Entry page, and load a session string variable with a random number. The Add Entry page incorporates the contents of the variable into a hidden field on the form, and when the completed form is returned for processing the value of the hidden field is compared with that of the session variable. If the session variable doesn't exist or the values differ then the entry is rejected. If the entry is valid, the current session is destroyed, and a return made to the View Entry page where a fresh session will be created.
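The session check described above might be sketched as follows. This is a minimal illustration, not the code the guestbook actually uses: the function names and the `gb_token` key are hypothetical, and the session array is passed in as a parameter (pass `$_SESSION` in a real page, after `session_start()`) so that the logic can be tried without a web server.

```php
<?php
// Hypothetical helpers; 'gb_token' is an illustrative key name.

// Called by the View Entry page: store a fresh random token in the session.
function issue_form_token(array &$session): string
{
    $session['gb_token'] = bin2hex(random_bytes(16));
    return $session['gb_token'];
}

// Called when the completed form is returned: the hidden field must exist
// and must match the value held in the session.
function form_token_is_valid(array $session, array $post): bool
{
    if (!isset($session['gb_token'], $post['gb_token'])) {
        return false;
    }
    if (!is_string($session['gb_token']) || !is_string($post['gb_token'])) {
        return false;
    }
    // hash_equals() compares the two strings in constant time.
    return hash_equals($session['gb_token'], $post['gb_token']);
}
```

The Add Entry form would carry the token in a hidden input named `gb_token`, and a successful entry would be followed by `session_destroy()` so that the next visit to the View Entry page starts afresh.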
My Add Entry transaction is responsible for both displaying the form and processing the form - and so it is entered twice. The first time is from the View Guestbook function, and the second is from itself. The original referer is remembered in a Session Variable when the transaction is first entered, and when the form details are being processed a check is made that the ORIGINAL referer is the View Guestbook transaction.
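Remembering the original referer could look something like the sketch below. The helper names and the `original_referer` key are assumptions made for illustration, and the comparison is a plain string match; a real check might need to allow for query strings or alternative host names.

```php
<?php
// Hypothetical helpers. $session stands in for $_SESSION, $server for $_SERVER.

// First entry to the Add Entry transaction: remember where the visitor came from.
// Later entries (the form posting back to itself) leave the stored value alone.
function remember_original_referer(array &$session, array $server): void
{
    if (!array_key_exists('original_referer', $session)) {
        $session['original_referer'] = $server['HTTP_REFERER'] ?? '';
    }
}

// When the form is processed: was the ORIGINAL referer the View Guestbook page?
function original_referer_is_valid(array $session, string $viewPageUrl): bool
{
    return ($session['original_referer'] ?? '') === $viewPageUrl;
}
```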
Assuming that your data is passed as Post variables, make sure that there aren't any Get variables. There is no valid reason for having any. Then check that the received Post variables are as expected - with the right number of variables, with the right names, and in the right order.
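A check of this kind can be quite short. In the sketch below the function name and the field names are hypothetical; pass `$_GET` and `$_POST` in a real page.

```php
<?php
// Hypothetical helper: true only if there are no Get variables and the Post
// variables have exactly the expected names, in the expected order.
function request_shape_is_valid(array $get, array $post, array $expectedFields): bool
{
    if ($get !== []) {
        return false;
    }
    // Comparing the key lists with === checks the count, the names and the order.
    return array_keys($post) === $expectedFields;
}

// Illustrative field list, in the order the form defines them.
$expected = ['contributor', 'xyzzy', 'title', 'message', 'email'];
```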
In the form definition HTML markup one specifies the maximum length for each field. Ensure that the form variables received aren't greater than this length.
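As a rough sketch, the length check might be written like this. The function name and the limits are illustrative; the message format simply mirrors the rejection email shown earlier.

```php
<?php
// Hypothetical helper: returns one message per field that exceeds its limit.
// $maxLengths maps field names to the maxlength values used in the form markup.
function fields_over_limit(array $post, array $maxLengths): array
{
    $problems = [];
    foreach ($maxLengths as $field => $max) {
        $value = $post[$field] ?? '';
        if (!is_string($value)) {
            // A browser never sends an array for a plain text field.
            $problems[] = sprintf('The %s field is not plain text', ucfirst($field));
            continue;
        }
        // Count characters rather than bytes when the mbstring extension is available.
        $length = function_exists('mb_strlen') ? mb_strlen($value, 'UTF-8') : strlen($value);
        if ($length > $max) {
            $problems[] = sprintf(
                'The %s field is more than %d characters - %d found',
                ucfirst($field),
                $max,
                $length
            );
        }
    }
    return $problems;
}
```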
If you have an email field in your form, don't make the name obvious - call it something like 'xyzzy'. Do have a hidden field, however, called 'email'. This field will get picked up by the spam programs and will get filled in. If it is filled in, then it can only be by a spambot.
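The trap field needs only a one-line test when the form comes back. The function name below is hypothetical, and the field names are the ones suggested above.

```php
<?php
// Hypothetical helper: the hidden 'email' field is a trap. A human never sees
// it, so any non-empty value means the form was filled in by software.
function honeypot_triggered(array $post, string $trapField = 'email'): bool
{
    return isset($post[$trapField]) && $post[$trapField] !== '';
}
```

The real address would then be read from the obscurely named field ('xyzzy' in this example).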
Spambots aren't one-fingered typists, and when they first access the Add Entry page they immediately send the form data, whereas a human requires time to type in a contribution. By remembering the time when the Add Entry page was first invoked in a Session Variable, and comparing that with when the data was provided, one can see how long it took to add the entry. If it is less than, say, 10 seconds it is safe to assume that it wasn't a human adding the entry.
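The timing check could be sketched along these lines. The names are hypothetical; `$session` stands in for `$_SESSION`, and the current time is passed in (use `time()` in a real page) so that the functions can be tried on their own.

```php
<?php
// Hypothetical helpers for the elapsed-time check.

// Called when the Add Entry form is first displayed.
function start_entry_timer(array &$session, int $now): void
{
    $session['form_shown_at'] = $now;
}

// Called when the form data arrives. A missing start time is treated as
// suspicious, because it means the form was never displayed in this session.
function filled_too_quickly(array $session, int $now, int $minSeconds = 10): bool
{
    if (!isset($session['form_shown_at']) || !is_int($session['form_shown_at'])) {
        return true;
    }
    return ($now - $session['form_shown_at']) < $minSeconds;
}
```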
The purpose of the vast majority of spam is to propagate web addresses. Tell your users that entries containing web addresses will be treated as spam, and filter out entries that contain them. Checks should be made for such things as "http://", "www.", "href=", and "[url=" in ALL fields - not just the message field. I incorporate such pieces of text in an array which can be easily extended as and when required. I also assume that <script> and <image> tags are similarly malicious.
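A filter array of this kind might be applied as in the sketch below. The function name and the message format are illustrative, and the list holds only the strings mentioned above.

```php
<?php
// Strings taken from the text above; extend the array as new patterns appear.
$bannedText = ['http://', 'www.', 'href=', '[url=', '<script', '<image'];

// Hypothetical helper: checks EVERY submitted field, not just the message,
// and returns a description of each hit. The comparison ignores case.
function find_banned_text(array $fields, array $banned): array
{
    $hits = [];
    foreach ($fields as $name => $value) {
        if (!is_string($value)) {
            continue; // non-text values are better rejected by a separate check
        }
        foreach ($banned as $needle) {
            if (stripos($value, $needle) !== false) {
                $hits[] = $name . ' contains "' . $needle . '"';
            }
        }
    }
    return $hits;
}
```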
Incidentally, I filter out all HTML tags from the entry, as I like to retain control over how my guestbooks look.
Once you have an array of strings of text as above, it is easy to add words like "porn", "viagra", and "cialis" to allow you to exclude entries containing words that you feel are inappropriate for your guestbook. I have four lists: sexual, pharmaceuticals, financial, and miscellaneous, and these get updated fairly regularly as new marketing campaigns emerge.
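One possible way to organise such lists is sketched below. The function name and the list contents are illustrative. The sketch matches whole words only, which is a design choice of this example and not something the text above specifies: a plain substring search for "cialis" would also reject an innocent "specialist".

```php
<?php
// Illustrative lists; the categories follow the four named above.
$wordLists = [
    'sexual'          => ['porn'],
    'pharmaceuticals' => ['viagra', 'cialis'],
    'financial'       => [],
    'miscellaneous'   => [],
];

// Hypothetical helper: returns the banned words found, grouped by category.
function find_banned_words(string $text, array $wordLists): array
{
    $found = [];
    foreach ($wordLists as $category => $words) {
        foreach ($words as $word) {
            // \b anchors the match to word boundaries; the i flag ignores case.
            $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
            if (preg_match($pattern, $text) === 1) {
                $found[$category][] = $word;
            }
        }
    }
    return $found;
}
```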
Because most spam is generated by software, you will sometimes find URL-encoded characters in unexpected places - such as "%20" in a name field or an email address field. These would never get typed in by a human user, so when you find such characters, add appropriate strings to your filter array.
Much of the spam comes from eastern European countries, Russia, and the Far East, and the messages are often full of encoded characters. I assume that any occurrence of such characters is indicative of a malicious post. The following regular expression seems to pick them up pretty well:
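The expression itself does not appear in this text. As a stand-in, the sketch below assumes that "encoded characters" means numeric HTML character references (such as `&#1055;`) and percent-encoded bytes (such as `%20`). Both the pattern and the function name are assumptions made for illustration, not the author's original.

```php
<?php
// Hypothetical pattern, NOT the original expression:
//   &#1055;  or  &#x41F;   numeric character references (semicolon optional)
//   %20                    percent-encoded bytes
const ENCODED_CHARS_PATTERN = '/&#(?:[0-9]+|x[0-9a-f]+);?|%[0-9a-f]{2}/i';

function has_encoded_characters(string $text): bool
{
    return preg_match(ENCODED_CHARS_PATTERN, $text) === 1;
}
```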
A common form of attempted vandalism of guestbooks is to provide long strings of text which cause the display to be widened, ruining the format. I regard any attempt to insert a word of more than 30 characters as malicious.
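A test for over-long words can be done with a single regular expression, as in this sketch. The function name is hypothetical, and a "word" is assumed to mean any run of non-whitespace characters.

```php
<?php
// Hypothetical helper: true if the text contains a run of non-whitespace
// characters longer than $maxLength.
function has_overlong_word(string $text, int $maxLength = 30): bool
{
    // The u flag counts characters rather than bytes. With that flag
    // preg_match() returns false for invalid UTF-8, which is treated here
    // as malicious too, hence the comparison with 0 rather than 1.
    $pattern = '/\S{' . ($maxLength + 1) . ',}/u';
    return preg_match($pattern, $text) !== 0;
}
```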
Once you have detected the spam, send details of it to your email address. This allows you to check that you haven't unintentionally trapped a benign entry, and also gives you the opportunity to look at the contents to see if the filtering processes can be strengthened. I also log the spam attempt in a database.
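Building the notification might look something like the following sketch, which mirrors the layout of the rejection email shown earlier. The function name is hypothetical, the recipient address is a placeholder, and `mail()` only works where a mail transport is configured.

```php
<?php
// Hypothetical helper: lists the reasons first, then the details of the entry.
function build_rejection_report(array $reasons, array $details): string
{
    $lines = ['Reasons for rejection:'];
    foreach ($reasons as $reason) {
        $lines[] = $reason;
    }
    $lines[] = '';
    foreach ($details as $label => $value) {
        $lines[] = $label . ': ' . $value;
    }
    return implode("\n", $lines) . "\n";
}

// Example call (placeholder address):
// mail('you@example.org', 'Guestbook entry rejected',
//      build_rejection_report($reasons, $details));
```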
I now allow an IP address to attempt to spam my guestbook twice. On the third occasion, I ban it by automatically updating the .htaccess file. This is made easier by having my guestbook software in a separate directory with its own .htaccess file that simply contains a list of banned IP addresses. I keep the list in sorted order to make it easier to remove an entry.
Deny from 22.214.171.124
Deny from 126.96.36.199
Deny from 188.8.131.52
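Updating such a file automatically could be sketched as below. The function names are hypothetical, the file is assumed to contain nothing but `Deny from` lines as described above, and the sort is plain string order, which may differ from the order the author uses.

```php
<?php
// Hypothetical helpers for the "third attempt" ban.

// Two attempts are tolerated; the third triggers the ban.
function should_ban(int $previousAttempts): bool
{
    return $previousAttempts >= 2;
}

// Returns the new .htaccess content with the address added, without
// duplicates, and with the lines kept in sorted order.
function add_ban(string $htaccess, string $ip): string
{
    // Never write unvalidated input into .htaccess.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return $htaccess;
    }
    $lines = preg_split('/\R/', $htaccess, -1, PREG_SPLIT_NO_EMPTY);
    $lines[] = 'Deny from ' . $ip;
    $lines = array_unique($lines);
    sort($lines, SORT_STRING);
    return implode("\n", $lines) . "\n";
}
```

In use, the result would be written back with something like `file_put_contents($path, add_ban(file_get_contents($path), $ip), LOCK_EX)`, where `$path` points at the guestbook directory's own .htaccess file.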
I have no wish to advertise the programs that are involved in spamming, but it is useful to be aware of what you are up against. XRumer is a bot that is specifically designed to spam guestbooks, and details may be found at http://www.botmasterlabs.net/index.php. There is a whole service industry built up around this product. Submit is designed to spam blogs, and details may be found in the Search Engine Journal. If you haven't come across such software before, you will probably be shocked by just how sophisticated it is.
Do these measures work? The short answer is "yes". My guestbooks haven't been successfully spammed for many years. I do, however, regularly review the spam that has been rejected to try to identify patterns that will help to improve the filtering, and I continue to experiment with new ideas. It is an arms race!
It can be interesting to look at the reasons why a spam attempt has been rejected, as it gives an insight into the set-up of the spambot script. In the example above, it will be seen that the attempt was not rejected due to the session variable not being set up. This implies that the script invoked the View Guestbook transaction before calling the Add Entry transaction. Someone has made a positive attempt to analyse the requirements of the Guestbook before generating the script.