Mail: handling spam

I've already described some of my mail setup, the ways in which I send and receive mail on my various systems. The final thing I want to cover is a more universal problem - not how to transfer mail, but how to limit how much unwanted mail you see on a daily basis.

Most people are aware of the spam problem, but fewer are aware of how and why it happens, or how to control it effectively.

Background

In order to understand how to tackle the spam problem, it's necessary to understand some more about it. I'll try to explain some of the terminology here, and some of the challenges involved in dealing with spam. If you want to know more, there are lots of very good discussions about spam and related topics out there on the 'net. Paul Graham has some particularly well-regarded resources.

What is spam?

Purely and simply, my definition is that it's mail that is sent to you that you did not ask for and do not want. There are other definitions if you go looking around the internet, but those two features are the important ones. Common spam includes:

Advertisements for products that you don't want or couldn't use. These are various and widespread: pharmaceutical products to make your sex life better, a new mortgage, diesel engine parts from China, etc.
Enticements to gamble your money away at online casinos, or to buy shares in companies that you've never heard of ("pump and dump").
Fake "phishing" emails purporting to come from Ebay, PayPal or your bank. These will claim that you need to login now at the provided URL to fix something wrong, but actually what they're hoping to do is to steal your login details so they can steal your money.
Virus-ridden emails asking you to click on the enclosed picture (or similar); viewing the "image" will instead infect your system if you're unlucky enough to be running on a Windows system without virus protection.

There are other types too, but these should be good examples of what we're trying to get rid of.

Where does spam come from?

There are many unscrupulous people in the world who make money from sending spam, and that's why it happens. You can generally work out quite quickly who might be responsible simply by following the money trail - spam is a lucrative business. Depressingly, it seems that almost no matter how stupid or unintelligible a spam message may be, there are enough naive/stupid/ignorant people in the world that it may still find targets. Spam works because of the very low costs involved in sending emails - if it costs a small fraction of a penny to send each email, then sending millions does not cost much either. You only need a very small rate of response to cover those costs and make a profit.
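To put some rough numbers on that, here is a small sketch of the break-even sum. The figures are entirely made up for illustration (I have no real data on what spammers pay or earn), and responses_needed is just a name I've picked for the helper.

```shell
#!/bin/sh
# Hypothetical figures only: neither number comes from a real spam campaign.

# responses_needed COST_PER_MAIL PROFIT_PER_RESPONSE [MAILS_SENT]
# Prints how many paying responses are needed to cover the sending costs.
# MAILS_SENT defaults to one million.
responses_needed () {
    awk -v cost="$1" -v profit="$2" -v mails="${3:-1000000}" \
        'BEGIN { printf "%.1f\n", (cost * mails) / profit }'
}

# A million mails at 0.0001 pounds each, with 20 pounds profit per sale
responses_needed 0.0001 20
```

With those invented numbers, a handful of sales per million mails sent would be enough to break even, which is why even very stupid spam can still pay.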

A long time ago, it used to be the case that people could reasonably expect to hide from spam. Unfortunately, those days are probably gone forever. The spammers have lots of ways to find out or guess your address, and many ways to try to get their mail to you.

Spammers use lists of email addresses to attack. These can be bought and sold in the shadier corners of the internet. They may originate from unscrupulous companies trying to make money out of selling your personal details, or found by "scraping" web pages and newsgroups looking for email addresses. In some cases, the lists are generated on the fly using a "dictionary attack": simply trying a large set of common names at each email domain that they can find.

They will try to fool you and your software in any way that they can to try to make you read the spam that you receive. This is very much an arms race - as new techniques are found to block spam, the spammers will try to find ways around them. Initially, spammers would simply rent computers at ISPs, using those machines to connect to email servers around the world to try and send spam. These machines and their addresses became easy to identify and block, so the spammers started to pretend to come from other machines. Then they started directly hacking into other people's systems to use them to send mail. Then they started using viruses to take control of increasingly large numbers of network-connected machines, spreading their payload of crap using these machines. As more and more home computers are left inadequately protected yet permanently connected via fast broadband connections, these "botnets" are rapidly growing in size. When the spammers have essentially unlimited computing resources at their control, their per-message costs spiral ever downwards.

What is NOT spam?

That's generally easy to work out - for a human. Non-spam mail (aka "ham") often comes from people you already know: it's mail from companies you already deal with, either telling you about the delivery status of the order you just placed or about their latest special offers (if you signed up for their newsletters etc.). It can also cover other types of mail, for example old friends and colleagues getting in touch if/when they find out your address, or people replying to things you've written in newsgroups or on your blog. This is the tricky part - if you don't expect these mails yourself, then you can't easily tell your software what to expect either. More about this later.

What can be done?

That's a pretty bleak picture, but there are a few things that normal people can do about spam. The ideal solution is something that will stop you receiving any spam without causing any "collateral damage", i.e. causing other people to do more work or receive more spam.

Can we stop the spam?

Ideally, it would be wonderful if we could stop the spammers altogether, but that's a very difficult proposition. It's possible to help fight the flood of crap, and that's a reasonable thing to do. If you are spammed and are sure you can identify the spammer, send an abuse report to their ISP. Ditto if you're being attacked from what looks to be a member of a botnet. But be very careful: spammers lie and it's very easy to mis-read email headers and blame the wrong person. That leads to false accusations and potential for collateral damage.

Blacklisting and whitelisting

The simplest way to control incoming mail is to just track senders. There are two ways to do this: either "blacklist" by default and only allow people on the "whitelist" to send mail to you (not very useful, as it's very difficult for new people to talk to you), or the (more common) opposite - allow mail by default unless its sender is blacklisted. You can track mail by sender address or by sending system, but tracking by system is more common.

To help in blacklisting, some organisations run systems called Real-time Blackhole Lists (RBLs). They keep track of mail servers that are known to be spam sources. You can then configure your own mail system to query the RBLs each time a new mail lands, and act accordingly. This can work well, but there are often problems with the RBLs - you're trusting other people to determine which mail servers you should listen to, and false positives are a common issue.
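For the curious, RBL queries are normally made over DNS: the octets of the connecting machine's IPv4 address are reversed and the list's zone name is appended, then that name is looked up. The sketch below only builds the name to query. rbl_query_name is a helper name I've made up, 192.0.2.25 is a documentation-only address and rbl.example.org is a placeholder zone, not a real list.

```shell
#!/bin/sh
# rbl_query_name IPV4 ZONE
# Prints the DNS name to look up for IPV4 in the blocklist ZONE.
# Prints nothing if the first argument is not four dot-separated fields.
rbl_query_name () {
    echo "$1" | awk -F. -v zone="$2" \
        'NF == 4 { printf "%s.%s.%s.%s.%s\n", $4, $3, $2, $1, zone }'
}

# Prints 25.2.0.192.rbl.example.org
rbl_query_name 192.0.2.25 rbl.example.org

# A mail server would then look up an A record for that name. Getting an
# answer back (conventionally an address in 127.0.0.0/8) means "listed";
# a "no such name" reply means the machine is not on the list.
```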

Challenge-response systems

Another option that people try in their attempts to kill spam is an idea called challenge-response, or C-R for short. In this setup, your mail system keeps track of everybody who has tried to send you mail and which of those people you have accepted mail from. When somebody new sends you mail for the first time, the C-R system will respond to ask them to confirm they're not a spammer (e.g. by sending another specially-formatted mail or by visiting a special web page). Once they have responded appropriately, they will be whitelisted. You may also blacklist senders so that they will never be able to talk with you.

This may initially sound like a reasonable solution (white/blacklisting without relying on external resources), but it leads to collateral damage. When spammers forge email From: headers, C-R systems almost invariably end up challenging the wrong people. Annoying other people is not (in my opinion) an acceptable way to go, and this is a common point of view. It's antisocial - C-R users are spreading their own spam problem to others. Make sure you tell them that when you interact with such a system.

Greylisting

As might be expected from the name, "greylisting" is somewhere in between whitelisting and blacklisting. It's an automated system that helps to work out whether new mail should be blacklisted or whitelisted. However, unlike a C-R system above, it does not depend on human interaction to determine how a new mail sender should be treated.

Most spamming systems are designed to use a scattershot approach - they spend as little time and effort as possible on each mail that they send in order to minimise the cost. That means that some of the niceties of "real" email systems are tossed out of the window in the name of this minimisation, in particular mail spooling and retrying. If a normal mail system encounters temporary errors when sending mail out, for the sake of reliability it will keep hold of that mail and retry it later. This may happen many times over an extended period, depending on configuration.

Greylisting depends on this difference in behaviour between typical spam and non-spam systems. It does that by always returning a temporary error code the first time a new mail system connects and remembering details about it for later. When a real mail is retried later on, the greylisting system will match up the old and new attempts, deliver the mail and put the sending system into the whitelist. If the mail is not retried for a long time, then eventually it will be purged from the greylisting database.
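The decision logic can be sketched in a few lines of shell. This is a toy under several assumptions of mine, not how any particular greylisting package works: it identifies an attempt by sending IP, sender address and recipient address, it keeps its "database" in a plain text file, it insists on a 5 minute wait, and the 451/250 strings merely stand in for real SMTP replies. Purging of old entries is left out.

```shell
#!/bin/sh
MIN_DELAY=300   # assumed: seconds a new sender must wait before a retry counts

# greylist_check DB NOW IP SENDER RECIPIENT
# DB is a text file of "key<TAB>first-seen-time" lines; NOW is in seconds.
greylist_check () {
    db=$1 now=$2 key="$3|$4|$5"
    touch "$db"
    first_seen=$(awk -F'\t' -v k="$key" '$1 == k { print $2; exit }' "$db")
    if [ -z "$first_seen" ]; then
        # Never seen before: remember the attempt and pretend to be busy
        printf '%s\t%s\n' "$key" "$now" >> "$db"
        echo "451 temporary failure, please retry"
    elif [ $((now - first_seen)) -lt "$MIN_DELAY" ]; then
        # Retried too quickly to look like a real spooling mail server
        echo "451 temporary failure, please retry"
    else
        # A patient retry: accept (a real system would also whitelist here)
        echo "250 accepted"
    fi
}

DB=$(mktemp)
greylist_check "$DB" 1000 192.0.2.25 alice@example.org me@example.net  # first try
greylist_check "$DB" 1060 192.0.2.25 alice@example.org me@example.net  # too soon
greylist_check "$DB" 1900 192.0.2.25 alice@example.org me@example.net  # proper retry
rm -f "$DB"
```

Passing the time in as an argument (rather than calling date) is only there to make the behaviour easy to try out by hand.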

Greylisting by issuing pretend temporary errors is technically allowed by the RFC standards for mail systems, but is a little frowned upon in some quarters. There is also growing evidence that this counter-measure in the arms race is starting to lose its effectiveness: as more people use the technique, more spammers have started or will start to retry mails that fail with temporary errors. Greylisting is therefore not a solution to spam on its own, but may be a helpful part of a complete system.

There can be some drawbacks to greylisting: legitimate mail is (clearly) often going to be delayed, plus some poorly-configured mail servers may not accept the concept of a temporary error and simply give up immediately without a retry. Hence, it's common to use greylisting with some sort of whitelist configuration - only greylist incoming messages that are considered dubious already (e.g. due to RBL warnings). Another suggested use for greylisting is to allow more time to do extra checks on a mail after the first temporary error but before any further delivery attempts. There are lots of options here, with potentially very complex interactions!

Filtering

The other common thing that people do to avoid spam is attempt to filter it out - distinguish between the ham and the spam in the mail system. There are multiple common ways of doing this:

Network blocking - some mail server admins simply block mail connections from networks that they believe to be spam havens. This can work quite well if configured correctly, but is also very likely to lead to lost legitimate mail when done badly.
Mail header checks - typically spammers play fast and loose with standards when sending their mail, and there are common fingerprints left by some of the most common spam tools. Simple header validity checks during the mail transfer itself can often pick up on the most obvious spam.
RBLs - use these as a guide for how spammy a mail is likely to be, rather than in a simple black or white choice.
Bayesian analysis - split a mail up into its component words/phrases, calculate the spam probability of each of those individually, then combine those individual probabilities (using Bayes' theorem) to give an overall likelihood of the message being ham or spam. The per-word spam probabilities need to be calculated ("trained") from each user's own mail in order to be effective - different people expect different topics and words in their mail.
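As a rough illustration of the Bayesian step, the sketch below combines some per-word spam probabilities using the simple formula popularised by Paul Graham's "A Plan for Spam": the product of the probabilities, divided by that product plus the product of their complements. The probabilities are invented, combine_probs is my own name for the helper, and real filters (SpamAssassin included) are free to use more refined combining methods.

```shell
#!/bin/sh
# combine_probs P1 P2 ...
# Each argument is the estimated probability (between 0 and 1) that a mail
# containing one particular word is spam. Prints the combined probability.
combine_probs () {
    printf '%s\n' "$@" | awk '
        BEGIN { spam = 1; ham = 1 }
        { spam *= $1; ham *= (1 - $1) }
        END { printf "%.3f\n", spam / (spam + ham) }'
}

combine_probs 0.9 0.8        # two "spammy" words
combine_probs 0.9 0.8 0.05   # the same, plus one word typical of my ham
```

The second call shows why per-user training matters: a single word that is strongly associated with your own ham pulls the overall score down noticeably.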

The most effective anti-spam systems use a combination of methods: the more spam cues that can be found in a given mail, the more likely it is to be spam.

My own setup

On my own mail server, I use 2 layers of software to protect against spam. Others may use more, but this is enough for me. Typically, the layers in systems are laid out in order of cost - whether that cost is in terms of network usage (via such things as RBL lookups, sender callbacks etc.) or CPU time (for bayesian statistics and the like) will depend very much on the local configuration. The earlier (and hence the more cheaply) that spam can be identified and rejected, the better. In mine, I don't have to worry too much.

Mail transport agent (MTA) - Exim

Firstly, I use Exim as my MTA. It directly does a fair number of checks on incoming mails and will reject many of them immediately due to errors it finds. Some minor errors are flagged as warnings in added headers. I have explicitly configured Exim to drop messages:

to non-existent local users (to kill dictionary attacks)
if the address of the sender is invalid (syntax errors, or the address will not accept mail in return - "sender callout")

and to add warning headers to messages:

if the sending machine does not have working reverse DNS

Anti-spam - SpamAssassin

After Exim, mails are passed through procmail (a mail delivery agent filter) to SpamAssassin, a dedicated piece of anti-spam software. It uses more checks to determine the spam score for each mail: RBL lookups, more header checks and (most importantly) a Bayesian analysis.

Depending on the results of those checks, SpamAssassin builds up a total score of the "spamminess" of the mail. The lower the score, the more the mail is desired. It starts off with a centrally-distributed set of scores for its known rules and these can be overridden locally by the user in $HOME/.spamassassin/user_prefs.

When SpamAssassin returns from checking a mail, it will add more headers to say what it has found:

X-Spam-Status: whether or not it believes the mail is spam (i.e. above a configurable threshold), and which rules matched.
X-Spam-Level: the (integer) score of the mail, represented in asterisks. This is useful for automated filtering (see later).
X-Spam-Checker-Version: details about the version of SpamAssassin used.

As SpamAssassin adds this extra information in headers, procmail can use it for later processing. For example, the easiest way to pick up on definite spam is to count the number of asterisks in the X-Spam-Level header. Some people simply split mail into ham or spam based on a threshold here. Instead, I choose two thresholds to split into three types of mail:

Mail with a score of 8 or more is just about guaranteed to be spam, and should be saved directly to my "spam" folder
Mail with a score of at least 5 but less than 8 is likely to be spam, but there is some doubt. Save it to "maybe-spam"
Mail with a score of less than 5 is most likely to be ham, and should be accepted - fall through for continued processing and delivery
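The three-way split boils down to matching on the start of the string of asterisks, most specific case first. Here is a small stand-alone sketch of that idea in shell, using the same two thresholds; classify_level is just an illustrative name and is not part of my real setup.

```shell
#!/bin/sh
# classify_level STARS
# STARS is the value of the X-Spam-Level header, e.g. "******".
# The quoted asterisks are literal; the trailing unquoted * matches the rest.
classify_level () {
    case "$1" in
        '********'*) echo "spam" ;;        # 8 or more asterisks
        '*****'*)    echo "maybe-spam" ;;  # 5, 6 or 7 asterisks
        *)           echo "ham" ;;         # 4 or fewer
    esac
}

for stars in '' '****' '*****' '*******' '**********'; do
    printf '%-12s -> %s\n' "[$stars]" "$(classify_level "$stars")"
done
```

The order of the tests matters: a mail with 8 asterisks also starts with 5 asterisks, so the stricter pattern has to be checked first.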

Delivering mail to the right place - procmail

The following rules in my .procmailrc cover passing mail to SpamAssassin, and picking up on the spam score afterwards. Procmail rules are not the most obvious of things to read - see the man page if you don't understand these.

# The condition line ensures that only messages smaller than 250 kB
# (250 * 1024 = 256000 bytes) are processed by SpamAssassin. Most spam
# isn't bigger than a few k and working with big messages can bring
# SpamAssassin to its knees.
#
# The lock file ensures that only 1 spamassassin invocation happens
# at 1 time, to keep the load down.
#
:0fw: spamassassin.lock
* < 256000
| spamassassin

# Mails with a score of 8 or higher are almost certainly spam (with 0.05%
# false positives according to rules/STATISTICS.txt). Let's dump them out
# of the way.
:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*
Inbox/.spam/

:0:
* ^X-Spam-Level: \*\*\*\*\*
Inbox/.maybe-spam/

# Anything that has not been delivered by now will go to $DEFAULT
# using LOCKFILE=$DEFAULT$LOCKEXT

Training SpamAssassin

That's the incoming mail dealt with. However, that's not the only part that's needed. As I mentioned earlier, SpamAssassin uses Bayesian statistics internally as part of its analysis of each mail. Bayesian methods work best when the database of scores of words and phrases is tuned specifically to match the characteristics of the user's own mail. The best way to do this is to feed SpamAssassin with the ham and spam mail you have received.

To do that, use the program sa-learn (from the SpamAssassin package). Feed it with your spam messages and your ham messages so that it can learn both what's bad and what's good. There are several ways to do this, but the exact details of your mail setup may decide the best way for you. If you are receiving mail directly to the computer where you read your email, then running sa-learn directly on that machine as you classify mails as ham or spam is a reasonable thing to do. However, in my case I want SpamAssassin to run on mails as they are initially received on the mail server before I transfer them across to my laptop or desktop machine via IMAP. As I read the mail on a different machine, building a SpamAssassin Bayes database there is not very useful.

The easiest way for me to do the training is to do something slightly different, therefore. Rather than delete mail or train directly, I save them into ham and spam folders locally. Then, by the magic of IMAP, the contents of those folders will be synchronised automatically back to the mail server. Once per day, I train SpamAssassin using the mail stored in the folders on the mail server. To do that, I run the following script (check_spam_folder) from cron:

#!/bin/bash
#
# Check_spam_folder
#
# (c) Steve McIntyre 2008
# GPL v2
#
# Train SpamAssassin with the contents of local mail folders
#
# Takes one argument: the root of the tree of local maildirs

MAILBOXES=$1
PROBABLE_SPAM=$MAILBOXES/Inbox/.maybe-spam
DEFINITE_SPAM=$MAILBOXES/Inbox/.spam
INBOX=$MAILBOXES/Inbox
HAM=$MAILBOXES/Inbox/.ham
PATH=/usr/local/bin:$PATH; export PATH

# Calculate some statistics for the specified folder
max_min () {
    HIGHEST=`find $1 -type f | xargs cat | \
             grep -a "^X-Spam-Status:.*hits" | \
             sed 's/^.*hits=//g;s/ .*$//g' | \
             sort -n | tail -1`
    LOWEST=`find $1 -type f | xargs cat | \
             grep -a "^X-Spam-Status:.*hits" | \
             sed 's/^.*hits=//g;s/ .*$//g' | \
             sort -n | head -1`
    echo "  highest score $HIGHEST and lowest score $LOWEST"
}

# Count the number of "possibly spam" messages
NUM_MAILS=`grep -rc ^From: $PROBABLE_SPAM/{cur,new} | wc -l`
if [ $NUM_MAILS -gt 0 ] ; then
    echo "Probable spam folder $PROBABLE_SPAM:" 
    echo "  currently contains $NUM_MAILS suspect message(s) for review" 
    max_min $PROBABLE_SPAM
else
    echo "No Probable spam found..."
fi

# If there are any definite spam messages, feed through sa-learn and
# then delete them
NUM_MAILS=`grep -rc ^From: $DEFINITE_SPAM/{cur,new} | wc -l`
if [ $NUM_MAILS -gt 0 ] ; then
    echo "Definite spam folder $DEFINITE_SPAM:" 
    echo "  currently contains $NUM_MAILS spam message(s)" 
    max_min $DEFINITE_SPAM
    echo "  Feeding them through SpamAssassin and deleting them..."
    sa-learn --spam --dir $DEFINITE_SPAM/{cur,new}
    find $DEFINITE_SPAM/{cur,new} -type f | xargs rm -f
else
    echo "No Definite spam found..."
fi

# If there are any definite ham messages, feed through sa-learn and
# then delete them
NUM_MAILS=`grep -rc ^From: $HAM/{cur,new} | wc -l`
if [ $NUM_MAILS -gt 0 ] ; then
    echo "Definite ham folder $HAM:" 
    echo "  currently contains $NUM_MAILS ham message(s)" 
    max_min $HAM
    echo "  Feeding them through SpamAssassin and deleting them..."
    sa-learn --ham --dir $HAM/{cur,new}
    find $HAM/{cur,new} -type f | xargs rm -f
else
    echo "No Definite ham found..."
fi

echo "Feeding current Inbox contents as ham..."
sa-learn --ham --dir $INBOX/{cur,new}

This script will deal with the ham and spam folders, clearing out their contents after SpamAssassin is done. As it runs from cron, its output is mailed to me: a summary of what it did, and a count of the messages I need to review in the "maybe-spam" folder.

An important thing to remember is: I never simply delete a mail - I'll either leave it in my inbox, file it in a folder to be kept, or move it to the ham or spam folder where most people might just delete it. To help with the last of these, I have added a couple of macros in the configuration of my mail program (mutt):

# Capital S saves a mail to the spam folder, capital H to ham
# Use instead of "d" for delete
macro index S <save-message>=spam\n
macro pager S <save-message>=spam\n
macro index H <save-message>=ham\n
macro pager H <save-message>=ham\n

This works well for me using mutt, but of course simply saving mails to spam/ham by hand would work just as well in any other mail program.

Reviewing the maybe-spam folder

By now, SpamAssassin is quite good at picking up on mails that are obviously spam or obviously ham. The more difficult mails will end up somewhere in the middle, in the "maybe-spam" folder. check_spam_folder will count the number of mails in there each day and issue a reminder. Once in a while (every few days or so), I simply navigate to "maybe-spam" and check through the mails there. For me today, the vast majority of them are likely to be spam that I can simply hit 'S' on, but the odd one may be a false positive. In those cases, I'll either hit 'H' (for mails that should have got to me but that I don't need to keep or respond to) or save them back to the Inbox. When I first started using SpamAssassin and the Bayes database was not so mature, more mails needed attention in "maybe-spam". Bayesian statistics will get better over time, as the "corpus" you have processed increases in size.

It's also not a great idea to leave the "maybe-spam" folder unattended for too long. Firstly, if you do have legitimate mail landing there then people may simply worry that you're ignoring them! Secondly, there are reports that SpamAssassin's Bayesian database can become unhappy and start misclassifying mail if you let it run too long without training it on the edge-cases that are landing here.

Other configuration

There are other tweaks that I've made in local SpamAssassin configuration over the years, editing $HOME/.spamassassin/user_prefs. The configuration syntax is obvious enough, and there are examples in the global SpamAssassin config files. I'm not going to list detailed examples here, as I don't want to give spammers obvious target addresses!

Some people who send me email are stuck using mail systems that may cause high SpamAssassin scores. For those specific people, I add "whitelist" entries.
I've also tweaked the default scores that SpamAssassin gives for a small number of its tests. In most of these cases, I've simply boosted the scores (e.g. for MICROSOFT_EXECUTABLE) after analysing the mails that get to my Inbox and maybe-spam folder.

Other options

Obviously, there are lots more options for software in the anti-spam stack. There are multiple different common MTAs (e.g. postfix, sendmail) that provide varying degrees of support for anti-spam hooks. SpamAssassin is just one of (possibly the best-known of) many pieces of software written specifically to detect and kill spam in email. Other common choices are dspam and crm114.

Another common thing to add to the mail stack is a virus checker, most commonly ClamAV. Depending on local circumstances, this can be very useful but also potentially very expensive in CPU and/or time. As I don't use Windows at all, I couldn't care less about picking up on the viruses. My SpamAssassin rules are good enough to catch bad content for me.

And I'm sure there are likely to be other things I've missed or forgotten about here. Please feel free to correct me... :-)

Revision history

v1 (2008-02-17)	Initial release
v2 (2008-02-17)	First update after comments: extra discussion about greylisting options; "maybe-spam" needs checking regularly