I was wandering around the ’net yesterday looking for more topics to write about, and I ran across this great set of articles by Paul Graham: http://www.paulgraham.com/antispam.html

The basic principle is to use Bayesian filters to score each incoming message against two data sets: known good messages and known spam. Since each message is evaluated against the user's own email rather than some general catch-all rule set, the statistics tend to let through mail that this particular user wants, even if other people would consider it spam.

I tried this approach on a cache of messages that I keep around for analysis, and I was quite impressed with the outcome. I would definitely need different tools to do this on a wider scale than I did yesterday, but the basic principles held true.

The most interesting part of this for me is the idea that the filter would be customized for each and every user. I often talk about spam with several of my clients, and the word ‘mortgage’ has come up in those conversations several times. One of the clients is in the sales field, and deleting any message containing the word ‘mortgage’ does not impact that business. Another client of mine is in the financial sector, and filtering on that word would be an obvious mistake.

With the Bayesian filter approach, users who never get a good mail message with the word ‘mortgage’ in it will see that word get a very high spam probability, which pushes up the score of any message containing it. Users who have both good and bad messages containing the word will see an average or neutral rating for it, and the more good messages that contain the word, the better (lower) its rating will be.
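
To make the ‘mortgage’ example concrete, here is a minimal Python sketch of a per-user word score. It is a simplification and not the exact formula from the article: the function names, the tokenizer, the 0.01 to 0.99 clamp, the neutral 0.5 for unseen words, and the sample messages are all my own assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Return the set of simple lowercase word tokens in a message."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

def count_messages_with_word(messages):
    """Count, for each word, how many messages contain it at least once."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def word_spam_probability(word, good_messages, spam_messages):
    """Estimate how 'spammy' a word is for one user's mail.

    Assumes both message lists are non-empty. The clamp and the neutral
    value for unseen words are illustrative choices, not tuned values.
    """
    good = count_messages_with_word(good_messages)
    bad = count_messages_with_word(spam_messages)
    g = good[word] / len(good_messages)   # share of good mail with the word
    b = bad[word] / len(spam_messages)    # share of spam with the word
    if g + b == 0:
        return 0.5                        # never seen: treat as neutral
    return max(0.01, min(0.99, b / (g + b)))

# Hypothetical mailboxes for the two kinds of client described above.
spam = ["lowest mortgage rates today", "refinance your mortgage now"]
sales_good = ["meeting about the quarterly sales targets", "lunch on friday"]
finance_good = [
    "the mortgage application was approved",
    "mortgage rate sheet attached",
    "lunch on friday",
]

sales_p = word_spam_probability("mortgage", sales_good, spam)
finance_p = word_spam_probability("mortgage", finance_good, spam)
print(f"sales client:   {sales_p:.2f}")
print(f"finance client: {finance_p:.2f}")
```

The same word lands near the top of the scale for the sales client and much closer to neutral for the financial client, purely because their good mail differs.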

Also, since the article advocates looking only at the 15 to 20 most ‘interesting’ words, meaning the 15 to 20 words whose scores are furthest from neutral in either direction, any neutral words are effectively ignored when a message is evaluated.
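
The selection step can be sketched in a few lines of Python. This assumes each word already has a spam probability between 0.01 and 0.99, treats 0.5 as neutral, and combines the selected scores with the naive Bayes product formula that Graham describes in "A Plan for Spam". The function names, the default limit of 15, and the sample scores are hypothetical.

```python
import math

def most_interesting(word_probs, limit=15):
    """Pick the words whose scores are furthest from the neutral 0.5."""
    ranked = sorted(
        word_probs.items(),
        key=lambda item: abs(item[1] - 0.5),
        reverse=True,
    )
    return ranked[:limit]

def combined_spam_probability(probabilities):
    """Combine per-word scores into a single score for the message."""
    probabilities = list(probabilities)
    spam = math.prod(probabilities)
    ham = math.prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)

# Hypothetical per-word scores for one incoming message.
word_probs = {
    "mortgage": 0.99,
    "rates": 0.90,
    "the": 0.50,
    "meeting": 0.10,
    "today": 0.55,
}

selected = most_interesting(word_probs, limit=3)
score = combined_spam_probability(p for _, p in selected)
print([word for word, _ in selected])
print(f"message score: {score:.2f}")
```

With a limit of 3, the near-neutral words ‘the’ and ‘today’ never reach the combining step, which is the whole point of only looking at the extremes.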

It took me three or four readings of the article and some test programming to completely understand the concepts, but this is a solid approach and I will be adding it to my anti-spam arsenal as soon as I can find the right way to implement it.