While designing my anti-spam service I have been thinking about how I want to handle two things: really rare words and personalization. These two concerns sit at opposite ends of the spectrum for spam filters.

One of the best ways to deal with really rare words is collaborative filtering. This lets you use other people's classifications to evaluate a word that you have never seen before, or have seen only a few times. The problem is that other people may have a different idea of how to classify those words than you do.
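To make the trade-off concrete, here is a minimal sketch of a collaborative lookup. It is not the design of my service: the function name, the per-user tables, and the neutral default of 0.5 are all assumptions made up for illustration. It simply averages what other users think of a word.

```python
from statistics import mean

def community_spam_probability(word, other_users, default=0.5):
    """Average other users' spam probabilities for a word.

    other_users is a hypothetical list of dicts mapping word -> spam
    probability. Falls back to a neutral default if nobody has seen the word.
    """
    estimates = [table[word] for table in other_users if word in table]
    return mean(estimates) if estimates else default

# Made-up example data: three other users' word tables.
others = [
    {"viagra": 0.99, "meeting": 0.05},
    {"viagra": 0.97, "refinance": 0.90},
    {"refinance": 0.20},
]

print(community_spam_probability("viagra", others))
print(community_spam_probability("refinance", others))
```

Note how the two users who have seen "refinance" disagree strongly (0.90 versus 0.20), so the average lands near the middle and may match neither them nor you.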

Enter personalization. This is where Bayesian filters truly succeed in filtering spam, by looking at how you have previously classified messages and doing some wild statistical analysis on them. Their weak point is when they do not have enough information to produce a statistically significant result.
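The weak point is easy to see in a small sketch. This assumes a simple per-word estimate based on how often a word shows up in your own spam versus your own legitimate mail; the function name and the counts are hypothetical, and real filters add smoothing and combine many words.

```python
def word_spam_probability(spam_count, ham_count, total_spam, total_ham):
    """Estimate how spammy a word is from one user's own mail.

    spam_count / ham_count: messages of each kind containing the word.
    total_spam / total_ham: total messages of each kind (assumed > 0).
    Returns None when the word has never been seen.
    """
    spam_rate = spam_count / total_spam
    ham_rate = ham_count / total_ham
    if spam_rate + ham_rate == 0:
        return None  # no personal evidence at all
    return spam_rate / (spam_rate + ham_rate)

# A well-observed word: 400 of 1000 spam, 100 of 1000 legitimate messages.
common = word_spam_probability(400, 100, 1000, 1000)

# A rare word: seen in a single spam message, never in legitimate mail.
rare = word_spam_probability(1, 0, 1000, 1000)

print(common, rare)
```

The well-observed word gets a moderate score backed by hundreds of messages, while the rare word is pushed all the way to 1.0 on the strength of a single sighting. That is exactly the kind of result that is not statistically significant.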

I think I’ve found a way of combining these two theories into a working filter that allows for the best of both worlds.