Using Gmail's spam

Since I maintain my own server for web, mail and some other services, I do not use my Gmail account much. I originally created the account out of curiosity about its UI; now I use it to log into other Google services and, occasionally, when I need an account other than one of my main ones. What I like about Gmail is that it seems to have quite a good spam filter. In the past half year about 10 spam mails got through to my inbox, while more than 900 were filtered into the spam folder (in the past 30 days, if you believe Gmail).

So, what to do with all that filtered spam? Deleting it right away would be a shame, because some cute mail marketers might have wasted hours hacking together web spiders, mailing scripts or Windoze trojans for bot networks. Therefore I decided to put this nice large collection to use by training the Bayes filter of my SpamAssassin with it. ;)

If you would like to do the same, you just need a .fetchmailrc configured for Gmail and a small shell script that fetches the spam from Gmail and makes SpamAssassin learn it. The following are my settings and the script, which you can use as a starting point:

poll with proto IMAP user '' there with password 'very very secret' options keep ssl sslfingerprint '2E:52:DE:98:7F:07:A3:CB:43:9E:7B:77:51:60:0E:07'

This .fetchmailrc (note that I left my Gmail address in there intentionally, I want more spam for training! ;) configures Gmail IMAP access through SSL. You need to use IMAP, since POP does not know about folders on the remote host and would therefore fetch mail from your inbox instead of the spam folder.

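#!/bin/bash
# Fetch all mails from the Gmail spam folder and hand each one to sa-learn
# (used in place of an MDA). --no-sync defers the journal sync until the end.
/usr/bin/fetchmail -a -n -s \
    --folder '[Gmail]/Spam' \
    -m '/usr/bin/sa-learn -C /etc/spamassassin --no-sync --spam' \
    | awk '/Learned tokens from 1 message/ { learned++; }
           /1 message\(s\) examined/ { all++; }
           END { print "Learned " learned+0 " from " all+0 " messages. Thanks Google! ;)"; }'
# Merge the journal into the bayes database once, after all mails are learned.
/usr/bin/sa-learn --sync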
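If you want to try the counting part without touching Gmail, here is a minimal sketch of the awk step on its own. The function name summarize_sa_learn is made up for this example, and the sample lines assume the usual sa-learn output format when it is called once per message. The parentheses in "message(s)" are escaped so that awk matches them literally instead of treating them as a regex group.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): summarize sa-learn
# output read from stdin, one result line per learned message.
summarize_sa_learn() {
    awk '
        /Learned tokens from 1 message/ { learned++ }   # message was new to the database
        /1 message\(s\) examined/       { all++ }       # every message that was looked at
        END { printf "Learned %d from %d messages.\n", learned, all }
    '
}

# Sample input, assumed to match what sa-learn prints per message.
printf '%s\n' \
    'Learned tokens from 1 message(s) (1 message(s) examined)' \
    'Learned tokens from 0 message(s) (1 message(s) examined)' \
    'Learned tokens from 1 message(s) (1 message(s) examined)' \
    | summarize_sa_learn
```

A line with "0 message(s)" typically means the mail was already known to the database, so it counts as examined but not as learned.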

This bash script can be run from cron to fetch the spam mails from Gmail and feed them into the Bayes filter for learning. fetchmail calls the sa-learn command instead of a real MDA (thanks to this SpamAssassin wiki page for the hint). Note that I use a global Bayes database for my whole server, and that this database must be writable for the user who executes the cron job. That user also needs write access to the directory containing the Bayes database files, because the learning process creates a journal file there, which is then synced into the database. To avoid creating a new journal and syncing it after every single mail, the --no-sync switch is used. I expect the synchronization to recalculate the probabilities in the database. The final sa-learn --sync call completes the learning.
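Since missing write permissions are the most likely reason for the cron job to fail silently, a small pre-flight check might help. This is only a sketch: the function name bayes_dir_writable, the BAYES_DIR variable and its default path are assumptions of mine, so point it at the directory your own bayes_path setting refers to.

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeed only if the bayes directory exists
# and the current user may create files in it (needed for the journal).
bayes_dir_writable() {
    local dir="$1"
    [ -d "$dir" ] && [ -w "$dir" ]
}

# BAYES_DIR and its default are assumed values, adjust them to your setup.
BAYES_DIR="${BAYES_DIR:-/var/lib/spamassassin/bayes}"
if bayes_dir_writable "$BAYES_DIR"; then
    echo "ok: $BAYES_DIR is writable for $(id -un)"
else
    echo "warning: $BAYES_DIR is missing or not writable for $(id -un)" >&2
fi
```

Run it as the same user that owns the cron job, otherwise the result tells you nothing about the job itself.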

Looks like the spam in my Gmail account has found a nice new job. ;)

P.S.: I hope that the SpamAssassin Bayes filter does not use the To/CC/... headers for learning. Otherwise all my Gmail ham might soon be classified as spam, too. ;)
