Services_Trackback - Thoughts on trackback spam

2008-11-27

Tobias Schlitt

A few weeks ago I announced the release of Services_Trackback 0.5.0, which has a new module system for integrating spam protection into your trackback mechanisms. The simplest filter (the bad word list) worked quite well at first, but as usual it did not take long for the spammers to work around it by using entity encoding. Of course, countering that from the anti-spam point of view is very simple, too: just decode the entities before running the bad word check. But that does not really solve the problem, because the spammers will not need long to get around this, too.
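To illustrate the decoding step, here is a minimal sketch in Python (not the PHP code of Services_Trackback). The function names `normalize` and `contains_bad_word` are hypothetical; the only library call used is `html.unescape` from the standard library.

```python
import html


def normalize(text):
    """Decode HTML entities and lowercase the text before filtering."""
    # html.unescape handles named (&amp;), decimal (&#118;) and hex (&#x76;) entities.
    return html.unescape(text).lower()


def contains_bad_word(text, bad_words):
    """Return True if any bad word occurs in the entity-decoded text."""
    normalized = normalize(text)
    return any(word.lower() in normalized for word in bad_words)
```

With this, `contains_bad_word("Cheap &#118;iagra here", ["viagra"])` matches even though the raw text never contains the word literally.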

So, basically, what I'm currently thinking about is how to build a (to some degree) reliable spam protection.

The great archetype for such a system is of course SpamAssassin. The question is whether to re-implement a similar system (rule based, regex based, ...) or simply to try to interface with SpamAssassin itself. I talked to several people here at LinuxTag to get their opinion on this, and the consensus was to keep the module system as is and to write a new module interfacing with SpamAssassin. That's what I will try to do in the near future.

Besides that, I shared some general thoughts on spam protection and tried to get some input on which methods may be sensible. Services_Trackback currently supports 4 spam modules, which are:

Bad word list
Regular expressions
DNSBL
SURBL

While the first 2 are pretty simple but somewhat effective, the remaining ones are more resource hungry and complex. The DNSBL is of course effective when spammers come through a dial-up connection, since most of those IP ranges are blocked by DNSBLs (no one would really run a production web server over a dial-up connection, and trackbacks usually come from production websites). On the other hand, this method is quite ineffective when someone spams from a server with a static IP, since DNSBLs only list servers with open SMTP relays, and that's most likely not the case on such servers. The last method (SURBL) is in fact the most effective one, since it extracts the URLs from a trackback and checks their domain names against a DNS server. But that effectiveness is paid for with even more resource consumption, since the URLs have to be extracted and each one has to be checked through a DNS lookup.
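The two DNS-based checks can be sketched roughly as follows, in Python rather than PHP and not as the actual Services_Trackback modules. A DNSBL query reverses the IPv4 octets and appends the list's zone; a SURBL-style query appends the zone to the host found in a URL. All function names and the zone names in the usage note are placeholders, and the sketch assumes that a name which resolves counts as "listed". Real SURBL clients also reduce the host to its registered domain first, which is skipped here.

```python
import re
import socket
from urllib.parse import urlparse

# Rough URL matcher for illustration only; real trackback data may need more care.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def extract_hosts(text):
    """Collect the (lowercased) host names of all URLs found in the text."""
    hosts = set()
    for url in URL_PATTERN.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # malformed URL, ignore it
        if host:
            hosts.add(host)
    return hosts


def is_listed(name, resolve=socket.gethostbyname):
    """Assumption: a query name that resolves at all is treated as listed."""
    try:
        resolve(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def surbl_hits(text, zone, resolve=socket.gethostbyname):
    """Return the hosts from the text that are listed in the given zone."""
    return {h for h in extract_hosts(text) if is_listed(h + "." + zone, resolve)}
```

For example, `is_listed(dnsbl_query_name(sender_ip, "dnsbl.example"))` costs one lookup per trackback, while `surbl_hits(excerpt, "surbl.example")` costs one lookup per distinct host, which is where the extra resource consumption comes from. The `resolve` parameter exists only so the lookup can be replaced in tests.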

Please read the extended entry to get an impression of my thoughts and comment on them. I would also be glad to receive some more ideas on this topic!

Currently I do not have many ideas about which additional anti-spam methods could be implemented, but there are some:

1. Domain blacklist: A mixture of the bad word list and the SURBL plugin. Since the domain names used in spam URLs mostly stay the same (at least for a while), one can simply check them against a local blacklist, which is much faster (because no network access is necessary) and even faster than just adding the domains to the bad word list (you don't need to match them against the whole trackback).

This idea is ready for implementation and very simple to realize, so the next version of Services_Trackback will include it.
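A minimal sketch of such a local check, written in Python for illustration and not taken from Services_Trackback: the hosts are extracted from the URLs once and then looked up in an in-memory set, so no network access and no scan of the whole trackback per domain is needed. The names `parent_domains` and `blacklisted_hosts` are hypothetical, as is the choice to also match subdomains of a blacklisted domain.

```python
import re
from urllib.parse import urlparse

# Rough URL matcher for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def parent_domains(host):
    """Yield the host and each parent domain: a.b.example, b.example, example."""
    labels = host.split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def blacklisted_hosts(text, blacklist):
    """Return hosts from URLs in the text that fall under a blacklisted domain."""
    hits = set()
    for url in URL_PATTERN.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # malformed URL, ignore it
        if host and any(domain in blacklist for domain in parent_domains(host)):
            hits.add(host)
    return hits
```

Because `blacklist` is a set, each lookup is a constant-time membership test, which is the speed advantage over both the DNS lookup and the bad word scan.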

2. "Greylisting": This is really an early idea, which would definitely require extending the definition of trackbacks in general. What greylisting (for email) does is reject an email on the first attempt, wait for a defined period of time, and accept the mail if the mail server tries to send it again after that time (which a non-spam MTA will definitely do). The idea behind this technique is that spammers usually do not use full-featured MTAs to send spam, but simple scripts that just drop off a mail and, if that fails, give up and assume the address is invalid. Doing a resend attempt would simply cause too much work for them.

This idea is basically not portable to trackbacks, because real trackbacks are not retried at all. But parts of it are portable anyway. For example, you can send some data back to the weblog sending the trackback which says "please try again in X seconds", and if the trackback is sent again, it is accepted. That should work, because even the resend itself should be too much work for a spammer to deal with.
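The receiving side of such a scheme could look roughly like the following Python sketch. It is not part of the trackback specification or of Services_Trackback; the class name, the in-memory dictionary and the choice of (sender IP, URL) as the key are all assumptions made for illustration.

```python
import time


class TrackbackGreylist:
    """Reject the first attempt of a (sender IP, URL) pair, accept a later retry."""

    def __init__(self, delay=60, clock=time.time):
        self.delay = delay      # seconds the sender has to wait before retrying
        self.clock = clock      # injectable so the behaviour can be tested
        self.first_seen = {}    # (sender_ip, url) -> time of the first attempt

    def check(self, sender_ip, url):
        """Return (accepted, retry_after_seconds)."""
        key = (sender_ip, url)
        now = self.clock()
        first = self.first_seen.get(key)
        if first is None:
            # First attempt: remember it and ask the sender to come back later.
            self.first_seen[key] = now
            return False, self.delay
        remaining = self.delay - (now - first)
        if remaining > 0:
            # Retried too early: still rejected, report the remaining wait.
            return False, remaining
        # Retried after the delay: accept and forget the entry.
        del self.first_seen[key]
        return True, 0
```

The second element of the result is what would be sent back as the "please try again in X seconds" hint. A real implementation would also need persistent storage and expiry of stale entries.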

But of course this is not the solution against trackback spam, and I simply cannot imagine that it really works. So, let's look at another idea...

3. Verification: A trackback can be verified by requiring either the weblog itself or the user in front of it to confirm the intent of sending a valid trackback. One possibility is to let the weblog system calculate something and make it resend the trackback including the calculated data. What this achieves is that the spammer's system has to become more complex and can no longer rely on just posting some data, no matter whether the trackback succeeds or not.
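What exactly the weblog should calculate is left open above. One possible choice, swapped in here purely as an illustration, is a hashcash-style puzzle: the receiver hands out a random challenge, and the sender has to find a nonce whose SHA-256 hash together with the challenge starts with a given number of zero bits. The Python sketch below uses hypothetical function names and an arbitrary difficulty; it is not an existing trackback extension.

```python
import hashlib
import secrets


def new_challenge():
    """Receiver side: create a random challenge to send back to the weblog."""
    return secrets.token_hex(16)


def leading_zero_bits(digest):
    """Count the zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def is_valid(challenge, nonce, difficulty):
    """Receiver side: verifying costs a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge, difficulty):
    """Sender side: brute-force a nonce, about 2**difficulty hashes on average."""
    nonce = 0
    while not is_valid(challenge, nonce, difficulty):
        nonce += 1
    return nonce
```

The asymmetry is the point: a legitimate weblog pays the cost once per trackback, while a spam script has to pay it for every single target.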

Another kind of verification is even more effective: requiring a valid email address inside the posted trackback data and sending an email to the author, who has to click a link to approve the trackback. Surely some people will complain that nobody will give out a valid email address, for fear that it gets displayed on the website receiving the trackback and results in more email spam. But this is certainly a way to more or less cut off trackback spam for a longer time, since parsing those incoming mails (which will of course look different for every site you post a trackback to) and calling the URL automatically is far too much work for a spam system.
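The approval link needs a token that cannot be guessed. A minimal Python sketch, assuming the receiving site keeps a server-side secret and identifies each pending trackback by some ID; the function names, the `id` and `token` query parameters and the URL layout are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def approval_token(secret, trackback_id):
    """Derive an unguessable token for a pending trackback from a server secret."""
    return hmac.new(secret, trackback_id.encode("utf-8"), hashlib.sha256).hexdigest()


def approval_link(base_url, secret, trackback_id):
    """Build the link that is mailed to the trackback's author."""
    query = urlencode({"id": trackback_id, "token": approval_token(secret, trackback_id)})
    return f"{base_url}?{query}"


def verify_approval(secret, trackback_id, token):
    """Check a clicked link; compare_digest avoids leaking timing information."""
    return hmac.compare_digest(approval_token(secret, trackback_id), token)
```

The trackback would stay unpublished until `verify_approval` succeeds for its ID.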

The third possibility does a similar thing, but without emails. The trackbacked website could answer with the location of a CAPTCHA and require the user sending the trackback to submit the CAPTCHA value back.

So, basically that's what I have thought about trackback spam avoidance so far. I'm sure that the list above is quite incomplete, so I'd be glad to receive some comments on my ideas and of course on new methods to try to avoid spam.

Comments

The CAPTCHA idea is interesting, but would require changing the way trackbacks work. If the software sending the trackback didn't know it needed to capture a verification process and present it, then the whole process would break down and the trackback would be dropped on the floor. There are other techniques, but I tend to favor this one despite the feature issues.
One of the reasons we've been reasonably successful in combating comment spam is because of the assumption that a human is involved. This allows us to use techniques that are easy for humans to process (like CAPTCHA) but tend to be harder for computers to figure out. With trackbacks (and email for that matter), the transaction is all done between two computers, so anything a computer can process will always be exploitable by another computer.
It is time to ask around about the trackback spec and discuss ways to require a human to be part of the process. Call it trackback 1.1 and start encouraging others to use it.

Joseph Scott at 2005-06-24

Tobias,
You should see about implementing something using the MTBlacklist Blacklist file/updates.
This is a user contributed blacklist consisting of URLs to block for spam.
It works by first importing the Master blacklist: http://www.jayallen.org/comment_spam/blacklist.txt
And then you can keep it up to date by syncing http://www.jayallen.org/comment_spam/blacklist_changes.txt or using the RSS Feeds (1.0) http://www.jayallen.org/comment_spam/feeds/blacklist-changes.rdf (2.0) http://www.jayallen.org/comment_spam/feeds/blacklist-changes.xml
This was pretty effective when I still used MT. It's all in PCRE regex format (without the delimiters IIRC), so it will run pretty quickly, certainly more so than any remote BL.
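A rough Python sketch of applying such a pattern list locally. The exact file format is not verified here: the sketch assumes one pattern per line, with blank lines and lines starting with "#" ignored. Python's `re` module is not fully PCRE compatible, so patterns it cannot compile are simply skipped. The function names are hypothetical.

```python
import re


def load_patterns(lines):
    """Compile one regex per non-empty, non-comment line."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            patterns.append(re.compile(line, re.IGNORECASE))
        except re.error:
            continue  # skip patterns Python's re cannot compile
    return patterns


def matches_blacklist(text, patterns):
    """Return True if any blacklist pattern is found in the text."""
    return any(pattern.search(text) for pattern in patterns)
```

The patterns are compiled once at load time, so checking a trackback only costs the regex searches themselves and no network round trip.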

Davey Shafik at 2005-06-24

Hi Tobias --
did someone mention SpamAssassin? ;)
SpamAssassin is indeed a good way to deal with this -- it has plenty of smarts about the various obfuscation methods used by spammers, and code written to deal with them. However, its default ruleset is quite email-spam-oriented, and I've observed that web spam is quite different.
You will need to synthesise a "fake" email message that contains the data found in the comment; this would be something like:
Note the faked Received line; that allows SpamAssassin to look up the IP address that really delivered the message.
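As an illustration of the idea (not the commenter's original example), such a synthesized message could be built in Python with the standard `email` package. The function name, the mapping of trackback fields to headers and the exact wording of the Received line are assumptions; which Received format SpamAssassin parses best is not verified here.

```python
from email.headerregistry import Address
from email.message import EmailMessage
from email.utils import formatdate


def one_line(value):
    """Collapse whitespace so a value is safe to use in a header."""
    return " ".join(value.split())


def trackback_to_message(remote_ip, blog_name, title, url, excerpt):
    """Wrap trackback data in a fake email message for a mail spam filter."""
    msg = EmailMessage()
    # Faked Received line carrying the IP that actually sent the trackback.
    msg["Received"] = f"from [{remote_ip}] by localhost with HTTP; {formatdate()}"
    # Hypothetical mapping: blog name as sender name, trackback title as subject.
    msg["From"] = Address(display_name=one_line(blog_name),
                          username="trackback", domain="localhost")
    msg["Subject"] = one_line(title)
    msg.set_content(f"{url}\n\n{excerpt}\n")
    return msg
```

`str(trackback_to_message(...))` gives the raw message text, which could then be handed to SpamAssassin; that step is left out of the sketch.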
You will also need to use a user preferences file that turns off the DYNABLOCK blocklists, since those are hosts that are not supposed to deliver email directly, but are allowed to originate HTTP traffic just fine (end-user broadband machines).
You may get good results with the default email ruleset, but if not, SpamAssassin can be told to use a different ruleset directory using the -C switch.
Good luck!

-- Justin Mason at 2005-06-24