A few days ago I had yet another hard drive go out on me. This time I was much better prepared than the previous time: I actually had a backup from the previous night at 11:30. So no code was lost, and the system was really only down for a few hours, mostly because I was deciding what direction to take the hardware and where to spend money on a permanent fix.

I still had the corpus (read: the database of words used to identify spam), but I no longer had any of the messages to look back at. In the end I decided to clear out my corpus and watch how well the system gets trained with my email and some of the training ideas that I have implemented in the code.

This is something that I haven’t talked much about, but it is integral to the way this system works. “I use as much data as I can to learn what is good and what is bad, and then fine-tune that knowledge into multiple levels of personalization.” As each message is broken up into its parts, the system decides what level of personalization to use, and I’ve built a system that is capable of doing that in under three seconds per message.

By adding more information about each message (I am thinking about trying Yahoo! DomainKeys as a test), the system learns better and faster, and the fine-tuned personalization becomes quicker and more accurate.

Some of you may be reading this and assuming that this is simply the way Bayesian filters work, but in this case I have changed it enough that the personalization is now a much larger feature.

So now, on day three of training the new corpus, I got my first good message. Out of 1,650 messages, 304 have been marked bad and 1 is now marked good. Many of these messages are truly good or bad, but the exciting news is that the system is learning how to tell the difference, and it is doing it well.

Each good message and each bad message redefines how the system scores the tokens in a message, but more importantly it also redefines how each user’s personalization works. So not only do you get the intelligence of the entire system, but you get personalization built in as well.
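
To make that idea concrete, here is a minimal sketch of one way a shared corpus and per-user corpora could be combined. It is not the code behind this service, whose details are deliberately left out of this post. The class names, the `prior_strength` parameter, and the blending rule (lean on a user's own counts only once that user has seen a token often enough) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class Corpus:
    """Counts how many good and bad messages each token appeared in."""

    def __init__(self):
        self.good = defaultdict(int)
        self.bad = defaultdict(int)
        self.n_good = 0
        self.n_bad = 0

    def train(self, tokens, is_bad):
        if is_bad:
            self.n_bad += 1
        else:
            self.n_good += 1
        counts = self.bad if is_bad else self.good
        for tok in set(tokens):  # count each token once per message
            counts[tok] += 1

    def seen(self, token):
        return self.good.get(token, 0) + self.bad.get(token, 0)

    def spam_prob(self, token):
        # Add-one smoothing keeps the result strictly between 0 and 1.
        p_bad = (self.bad.get(token, 0) + 1) / (self.n_bad + 2)
        p_good = (self.good.get(token, 0) + 1) / (self.n_good + 2)
        return p_bad / (p_bad + p_good)


class PersonalizedFilter:
    """Hypothetical blend of system-wide and per-user token statistics."""

    def __init__(self, prior_strength=2.0):
        self.shared = Corpus()
        self.per_user = {}
        # Assumed knob: how many sightings a user needs before their own
        # counts outweigh the shared corpus.
        self.prior_strength = prior_strength

    def train(self, user, tokens, is_bad):
        # Every message updates both the shared and the personal corpus.
        self.shared.train(tokens, is_bad)
        self.per_user.setdefault(user, Corpus()).train(tokens, is_bad)

    def token_prob(self, user, token):
        shared_p = self.shared.spam_prob(token)
        personal = self.per_user.get(user)
        if personal is None:
            return shared_p
        n = personal.seen(token)
        weight = n / (n + self.prior_strength)
        return weight * personal.spam_prob(token) + (1 - weight) * shared_p

    def score(self, user, tokens):
        # Combine per-token probabilities by summing their log-odds.
        log_odds = 0.0
        for tok in set(tokens):
            p = self.token_prob(user, tok)
            log_odds += math.log(p) - math.log(1 - p)
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)


f = PersonalizedFilter()
for _ in range(5):
    f.train("alice", ["newsletter", "sale"], True)   # alice marks these bad
    f.train("bob", ["newsletter", "sale"], False)    # bob marks the same good
alice_score = f.score("alice", ["newsletter", "sale"])
bob_score = f.score("bob", ["newsletter", "sale"])
carol_score = f.score("carol", ["newsletter", "sale"])  # user with no history
```

In this toy run the same two tokens score above 0.5 for alice and below 0.5 for bob, while a user with no history falls back to the shared corpus, which is evenly split here. That is the behaviour described above: everyone benefits from what the whole system has learned, but one person's preferences do not dominate another person's results.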

I’ve been purposefully vague and left out many details, as they are the secret sauce of the service and they change on a daily basis. The end result will be that newer types of spam messages will get marked faster, and one person’s personal preferences will not greatly affect how another person’s email is classified in the long run.

I’m excited about this, and I can’t wait to see how fast the filter will start to see the patterns in the chaos that I can’t see myself :-)

Note: 3 more messages have been classified as good since I started writing this blog entry …