Spam Detection

January 25, 2010

Anyone who has had an E-mail account for more than a few hours is undoubtedly familiar with the problem of spam: junk E-mail.  Some of it is actively malicious, attempting to get the user to click on a toxic attachment or visit a compromised Web site.  Much of it is just annoying, like junk postal mail, only usually more stupid.  There are certain common genres: advertisements for bogus pharmaceuticals promising anatomically improbable results, get-rich-quick offers from new friends in Nigeria, and the like.  For the Internet as a whole, spam is an enormous waste of resources; it is estimated that something like 95% of all E-mail sent is spam.

Finding ways of filtering out spam has become a necessity, if E-mail is to be at all useful.  Most spam today is sent by networks of personal computers, so-called “botnets”, that have been “hijacked” by malware without their users’ knowledge.  This means that there are many, many sources for spam, so that blocking senders is not really practical or effective.  Contemporary spam filtering focuses on the content of the message, and is reasonably but not entirely successful.

According to a report in New Scientist, a team of computer scientists from the International Computer Science Institute in Berkeley, California, and the University of California, San Diego, has come up with a new technique for detecting spam, which they claim has better accuracy than any method currently used.  The researchers started with the observation that, when botnets generate spam, the generating software typically introduces small variations in the text template of the message, in an attempt to confuse spam detection algorithms.  (You can think of this as being somewhat similar to the process of “personalizing” form letters using the mail-merge function of a word processor.)  The new approach essentially uses a sample of messages to reverse-engineer the message template and variation rules being used.

The team reasoned that analysing such messages could reveal the template that created them. And since the spam template describes the entire range of the emails a bot will send, possessing it might provide a watertight method of blocking spam from that bot.
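To make the idea concrete, here is a toy sketch in Python.  It is not the researchers' algorithm, which the report does not spell out; it only illustrates the general notion of recovering a template from samples.  It assumes, quite unrealistically, that every sample has the same number of words, so that the fixed and varying slots line up position by position.  The function name `infer_template` and the sample messages are invented for illustration.

```python
import re


def infer_template(samples):
    """Build a regex from messages that share a word-for-word layout.

    Simplifying assumption: every sample has the same number of
    whitespace-separated tokens. Real spam would need proper alignment
    to cope with inserted or deleted words.
    """
    rows = [s.split() for s in samples]
    if len({len(r) for r in rows}) != 1:
        raise ValueError("samples must have the same number of tokens")

    parts = []
    for column in zip(*rows):
        variants = sorted(set(column))
        if len(variants) == 1:
            # Every sample agrees here: a fixed part of the template.
            parts.append(re.escape(variants[0]))
        else:
            # Samples differ here: a variable slot, limited to the
            # alternatives actually observed.
            alternatives = "|".join(re.escape(v) for v in variants)
            parts.append("(?:" + alternatives + ")")
    return re.compile(r"\s+".join(parts))


# Made-up sample messages, standing in for spam from a single bot.
samples = [
    "Buy cheap pills now",
    "Buy discount pills today",
    "Buy cheap pills today",
]
template = infer_template(samples)

# A combination never seen in the samples still fits the template.
print(template.fullmatch("Buy discount pills now") is not None)

# An unrelated message does not.
print(template.fullmatch("Lunch at noon tomorrow?") is not None)
```

The point of the sketch is that the inferred pattern matches messages the bot has not sent yet, as long as they come from the same template, while leaving unrelated mail alone.  A real system would presumably have to cope with much messier variation than word substitution.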

The resulting filters were found to have very high accuracy, and, perhaps more important, to have very few false positives (that is, legitimate messages misclassified as spam).

This is an interesting approach, especially if it can reduce the number of false positives.  Current filtering techniques identify almost all spam, but are sometimes over-zealous.  Anyone who uses E-mail seriously is aware of the dilemma: if you don’t check your spam filters’ results regularly, you risk missing an important message that was incorrectly flagged as spam.

Still, I’m sure the spam arms race will continue.  The real problem is an economic one: the cost of spam — which, given that it’s 95% of all E-mail, is non-trivial — is borne by everyone except the sender.  Junk postal mail (in the US) gets carried at a discounted rate, but it still costs something for postage, paper, printing, and so on.  (Incidentally, in other countries where junk mail pays “full freight”, there is much less of it.)  In contrast, the marginal cost of sending a spam message is essentially zero.  If we can devise a way to change that, we may have a chance of really cutting down on spam.

