Services_Trackback - Thoughts on trackback spam
A few weeks ago I announced the release of Services_Trackback 0.5.0, which has a new module system for integrating spam protection into your trackback mechanisms. The simplest filter (the bad word list) worked quite well at first, but as usual it did not take long for the spammers to work around it by using entity encoding. Of course, countering that from the anti-spam side is very simple, too: just decode the entities before running the bad word check. But that doesn't really solve the problem, because the spammers will not need long to get around this, too.
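To illustrate the "decode first, then check" step, here is a minimal sketch in Python (not the PHP code of Services_Trackback itself). The word list and the function name are made up for the example, and the repeated decoding is my own assumption about how double-encoded input would be handled.

```python
import html

# Hypothetical bad word list; a real one would be configured per site.
BAD_WORDS = {"casino", "poker"}


def contains_bad_word(text, bad_words=BAD_WORDS, max_rounds=5):
    """Return True if a bad word shows up after decoding HTML entities."""
    decoded = text
    # Decode a few times so that double-encoded input (&amp;#112;) is caught too.
    for _ in range(max_rounds):
        unescaped = html.unescape(decoded)
        if unescaped == decoded:
            break
        decoded = unescaped
    lowered = decoded.lower()
    return any(word in lowered for word in bad_words)
```

A call like `contains_bad_word("&#99;asino")` matches although the raw text never contains the word itself. As said above, this only wins one round of the arms race.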
So, basically, what I'm currently thinking about is how to build a (to some degree) reliable spam protection.
The great archetype for such a system is of course SpamAssassin. The question is whether to re-implement a similar system (rule based, regex based, ...) or simply try to interface with SpamAssassin itself. I talked to several people here at LinuxTag to get their opinion on this, and the consensus was to keep the module system as is and try to write a new module interfacing with SpamAssassin. That's what I will try to do in the near future.
Besides that, I shared some general thoughts on spam protection and tried to get some input on which methods might be sensible. Services_Trackback currently supports 4 spam modules, which are:
Bad word list
While the first 2 are pretty simple but somewhat effective, the remaining ones are more resource hungry and complex. The DNSBL module is of course effective when spammers come in through a dial-up connection, since most of those IP ranges are listed in DNSBLs (no one would really run a production web server over a dial-up connection, and trackbacks usually come from production websites). On the other hand, this method is quite ineffective when someone spams from a server with a static IP, since DNSBLs only list servers with open SMTP relays, and that is most likely not the case on such servers. The third method (SURBL) is in fact the most effective one, since it extracts the URLs from a trackback and checks their domain names against a DNS server. But that effectiveness is paid for with even more resource consumption, since the URLs have to be extracted and each one has to be checked through a DNS lookup.
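To show where the DNS cost comes from, here is a rough Python sketch of how such lookups are usually built: a DNSBL query reverses the octets of the sender IP and appends the list's zone, while a SURBL-style query prepends the domain found in a URL. The zone names in the usage note are placeholders, and the function names are invented for this example. They are not the Services_Trackback API.

```python
import ipaddress
import socket


def dnsbl_query_name(ipv4, zone):
    """Reverse the octets of an IPv4 address and append the blacklist zone."""
    address = ipaddress.IPv4Address(ipv4)  # raises ValueError on bad input
    octets = str(address).split(".")
    return ".".join(reversed(octets)) + "." + zone


def surbl_query_name(domain, zone):
    """Prepend the domain taken from a URL to the URI blacklist zone."""
    return domain.lower().rstrip(".") + "." + zone


def is_listed(query_name, resolve=socket.gethostbyname):
    """A listed entry resolves to some address, an unlisted one does not."""
    try:
        resolve(query_name)
        return True
    except socket.gaierror:
        return False
```

With a hypothetical zone, `dnsbl_query_name("192.0.2.10", "dnsbl.example.org")` gives `10.2.0.192.dnsbl.example.org`. Every URL in a trackback costs one such `is_listed()` call, which is exactly the resource problem mentioned above. Reducing a host name to its registered domain before the SURBL-style lookup is left out here.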
Please read the extended entry to get an impression of my thoughts and comment on them. I would also be happy to receive some more ideas on that topic!
Currently I don't have much of an idea what additional anti-spam methods could be implemented, but there are some:
1. Domain blacklist: A mixture of the bad word list and the SURBL plugin. Since the domain names used in spam URLs mostly stay the same (at least for a while), one can simply check them against a local blacklist, which is much faster (because no network access is necessary) and even faster than just adding the domains to the bad word list (you don't need to match them against the whole trackback).
This idea is ready for implementation and very simple to realize, so the next version of Services_Trackback will include it.
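A local domain blacklist could look roughly like the following Python sketch. It is an illustration of the idea only, not the module that will ship. The URL pattern, the function names and the rule that a blacklisted domain also covers its subdomains are assumptions of mine.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple URL pattern, good enough for a sketch.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(text):
    """Collect the lower-cased host names of all URLs found in the text."""
    hosts = set()
    for url in URL_PATTERN.findall(text):
        try:
            host = urlsplit(url).hostname
        except ValueError:  # malformed URL, e.g. a broken IPv6 literal
            continue
        if host:
            hosts.add(host)
    return hosts


def is_blacklisted(host, blacklist):
    """True if the host or one of its parent domains is on the blacklist."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))


def trackback_hits_blacklist(fields, blacklist):
    """Check all URLs in the trackback fields with set lookups only."""
    text = " ".join(fields.values())
    return any(is_blacklisted(host, blacklist) for host in extract_hosts(text))
```

Because the blacklist is a plain set, each check is a handful of hash lookups per URL instead of a DNS query or a scan of the whole trackback text.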
2. "Greylisting": This is a really early idea, which would definitely require extending the definition of trackbacks in general. What greylisting (for emails) does is reject an email on the first attempt and wait for a defined period of time; if the mail server tries to send the mail again after that time (which a non-spam MTA will definitely do), the mail is accepted. The reasoning behind this technique is that spammers usually do not use full-featured MTAs to send spam, but have simple scripts that just drop off a mail, and if that fails they give up and assume the address is invalid. Doing a resend attempt would simply cause too much work for them.
This idea is basically not portable to trackbacks as is, because real trackbacks are not retried at all. But parts of it are portable anyway. For example, you could send some data back to the weblog sending the trackback which says "please try again in X seconds", and if the trackback is sent again, it is accepted. That should work, because even the resend itself should be too much work for a spammer to deal with.
But of course this is not the solution against trackback spam, and I simply cannot imagine that it really works. So, let's look at another idea...
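The receiving side of such a "try again in X seconds" handshake could be sketched like this in Python. The class name, the in-memory dictionary and the two time limits are assumptions for the example; how the retry hint travels back to the sending weblog is left open, since the trackback definition does not cover it.

```python
import time


class TrackbackGreylist:
    """Reject the first attempt, accept a retry that comes after a delay."""

    def __init__(self, min_delay=60, max_age=3600, clock=time.time):
        self.min_delay = min_delay  # seconds a sender has to wait
        self.max_age = max_age      # seconds after which an attempt is forgotten
        self.clock = clock
        self._first_seen = {}       # (sender_ip, source_url) -> timestamp

    def check(self, sender_ip, source_url):
        """Return (accepted, seconds_to_wait_before_retrying)."""
        now = self.clock()
        key = (sender_ip, source_url)
        first = self._first_seen.get(key)
        if first is None or now - first > self.max_age:
            # First attempt (or a stale one): remember it and ask for a retry.
            self._first_seen[key] = now
            return False, self.min_delay
        waited = now - first
        if waited < self.min_delay:
            # Retry came too early, the sender did not honour the delay.
            return False, self.min_delay - waited
        del self._first_seen[key]
        return True, 0
```

A real module would need persistent storage instead of a dictionary, because each trackback ping arrives in its own request.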
3. Verification: A trackback can be verified by requiring either the weblog itself or the user in front of it to confirm the intent of sending a valid trackback. One possibility is to let the weblog system calculate something and make it resend the trackback including the calculated data. What this achieves is that the spammer's system has to become more complex, and it can no longer rely on just posting some data without caring whether the trackback succeeds or not.
Another kind of verification is even more effective: requiring a valid email address inside the posted trackback data and sending an email to the author, who has to click a link to approve the trackback. Surely some people will complain that nobody will give out a valid email address, for fear that it gets displayed on the website receiving the trackback and results in more email spam. But this is certainly a way to more or less cut off trackback spam for a longer time, since parsing those incoming mails (which of course will look different for every site you post a trackback to) and calling the URL automatically is far too much work for a spam system.
The third possibility does a similar thing, but without emails. The trackbacked website could answer with the location of a CAPTCHA and require the trackbacking user to submit the CAPTCHA value back.
So, basically that's what I have thought about trackback spam avoidance so far. I'm sure that the list above is quite incomplete, so I'd be happy to receive some comments on my ideas and of course on new methods to try to avoid spam.
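What "calculate something" means is left open above. One possible reading is a hashcash-style proof of work, sketched here in Python: the receiving site hands out a random challenge, and the sending weblog has to find a number whose hash together with the challenge starts with enough zero bits. The function names and the difficulty value are assumptions for this example, nothing of this exists in Services_Trackback.

```python
import hashlib
import itertools
import secrets


def new_challenge():
    """Random challenge the receiving site sends back on the first attempt."""
    return secrets.token_hex(16)


def leading_zero_bits(digest):
    """Count the zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def is_valid(challenge, nonce, difficulty_bits):
    """Cheap check for the receiver: one hash per submitted answer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode("utf-8")).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge, difficulty_bits):
    """Expensive part for the sender: try nonces until one fits."""
    for nonce in itertools.count():
        if is_valid(challenge, nonce, difficulty_bits):
            return nonce
```

The sender resends the trackback together with the challenge and the nonce returned by `solve()`, and the receiver only has to call `is_valid()` once. The receiver also has to remember which challenges it issued and expire them, otherwise an answer could be reused.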
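The approval link in such a mail needs a token that cannot be guessed. One common way to get that is an HMAC over the trackback ID and the email address, as in this Python sketch. The secret key, the parameter names and the URL layout are made up for the example.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical server-side secret; it must never leave the receiving site.
SECRET_KEY = b"change-me"


def approval_token(trackback_id, email, key=SECRET_KEY):
    """Token that ties one pending trackback to one email address."""
    message = f"{trackback_id}:{email}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def approval_link(base_url, trackback_id, email, key=SECRET_KEY):
    """Link to put into the mail that is sent to the trackback author."""
    query = urlencode({"id": trackback_id,
                       "token": approval_token(trackback_id, email, key)})
    return f"{base_url}?{query}"


def approve(trackback_id, email, token, key=SECRET_KEY):
    """Check the token from a clicked link in constant time."""
    expected = approval_token(trackback_id, email, key)
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

The pending trackback stays hidden until `approve()` succeeds, so the email address itself never has to be displayed on the receiving website.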