I’m still working on setting up my development system. It’s taking longer than I wanted, partially because I keep leaving to go do real work :-)

As I’ve been creating this system I’ve started to think about how I want my anti-spam solution to be different. When it comes down to it, I want two things that I have not seen anywhere else: flexibility and control. I want to be able to know every reason why any particular email message might get blocked as spam, and I want the end users to be able to see those reasons as well.

To that end I have decided that the system needs to collect data about each email message, and only once all the data is collected will the message be classified as spam or not. This metadata will include RBL data, SPF data, and the results of tests on the content of the message. Each message that passes through the system will collect as much data as possible and will go through every test, even if it is already considered to be spam.
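A rough sketch of that idea in Python, not the actual implementation: every name here (`TestResult`, `MessageMetadata`, `collect_metadata`, the three test functions, and the hard-coded `LISTED_IPS` set standing in for a real RBL lookup) is a hypothetical placeholder. The point it illustrates is that every test runs and records its reason, and the spam decision is made only afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TestResult:
    name: str        # which test produced this result
    triggered: bool  # True if the test considers the message suspicious
    reason: str      # human-readable explanation that can be shown to the user


@dataclass
class MessageMetadata:
    results: List[TestResult] = field(default_factory=list)

    def reasons(self) -> List[str]:
        """Return the reasons from every test that triggered."""
        return [r.reason for r in self.results if r.triggered]


# Hypothetical stand-in for a real RBL query (no DNS lookup is done here).
LISTED_IPS = {"203.0.113.7"}


def rbl_test(message: dict) -> TestResult:
    ip = message["sender_ip"]
    hit = ip in LISTED_IPS
    return TestResult("rbl", hit, f"sender IP {ip} is on the example RBL")


def spf_test(message: dict) -> TestResult:
    # Assumes the SPF result was already computed and stored on the message.
    hit = message["spf_result"] == "fail"
    return TestResult("spf", hit, "SPF check failed for the sending domain")


def content_test(message: dict) -> TestResult:
    hit = "free money" in message["body"].lower()
    return TestResult("content", hit, "body contains the phrase 'free money'")


def collect_metadata(
    message: dict, tests: List[Callable[[dict], TestResult]]
) -> MessageMetadata:
    """Run every test, even if an earlier one already triggered."""
    meta = MessageMetadata()
    for test in tests:
        meta.results.append(test(message))
    return meta


def is_spam(meta: MessageMetadata, threshold: int = 1) -> bool:
    """Decide only after all data is collected (threshold is an assumption)."""
    return sum(r.triggered for r in meta.results) >= threshold
```

Calling `collect_metadata(msg, [rbl_test, spf_test, content_test])` always yields three results, and `meta.reasons()` is the list an end user would see for a blocked message.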

The reason for this is to better identify and describe what a spam message is, and to give feedback to the users on why a message might be a false positive or pass through the system when it truly is spam.

In my time working with spam filters they have tended to operate as black boxes that give no data back to the end user. The data that some of them do give back is near useless in describing an email message’s properties. I want to solve this part of the problem and give the users the tools to create better spam filters.

If a user sees that a spam message got through the filters, but it might not have gotten through if a new RBL was added, then the user will have the ability to add that RBL.

I also want to create reports that show exactly which features, alone or in combination, make the best filters. Perhaps filters would give fewer false positives if they required more than 3 RBLs to be triggered, or an exact combination of two of them. At this point I don’t have that information, but I plan on creating a way to get it.
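As a minimal sketch of what such a report could compute, assuming the stored metadata can be reduced to a set of triggered RBL names plus a label saying whether the message really was spam: the function names and the RBL names (`rbl_a`, `rbl_b`, `rbl_c`) are hypothetical, and the sample data is invented purely for illustration. Here the false positive rate means the share of legitimate messages that a rule would have flagged.

```python
from typing import List, Set, Tuple

# Each sample: (names of RBLs that triggered, True if the message really was spam)
Sample = Tuple[Set[str], bool]


def fp_rate_min_hits(samples: List[Sample], min_hits: int) -> float:
    """False positive rate of the rule 'flag if at least min_hits RBLs triggered'."""
    legit = [rbls for rbls, spam in samples if not spam]
    if not legit:
        return 0.0
    flagged = sum(1 for rbls in legit if len(rbls) >= min_hits)
    return flagged / len(legit)


def fp_rate_combo(samples: List[Sample], combo: Set[str]) -> float:
    """False positive rate of the rule 'flag if every RBL in combo triggered'."""
    legit = [rbls for rbls, spam in samples if not spam]
    if not legit:
        return 0.0
    flagged = sum(1 for rbls in legit if combo <= rbls)
    return flagged / len(legit)


# Invented example data, not real measurements.
samples: List[Sample] = [
    ({"rbl_a", "rbl_b", "rbl_c"}, True),
    ({"rbl_a"}, False),
    ({"rbl_a", "rbl_b"}, True),
    (set(), False),
    ({"rbl_b"}, False),
    ({"rbl_a", "rbl_c"}, False),
]

for k in (1, 2, 3):
    print(k, fp_rate_min_hits(samples, k))
print("rbl_a+rbl_b", fp_rate_combo(samples, {"rbl_a", "rbl_b"}))
```

Comparing these numbers across thresholds and combinations is the kind of evidence that would tell a user whether requiring more RBL hits is worth the spam it lets through.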

I also hope to release this information to the public in very generalized reports, covering things like which RBL performs best and how many domains are really using SPF.

In any case, all of this is built on the idea of creating metadata to describe each email message as it passes through this system. So this will be a large portion of the core of this system, which will in the end give the users the flexibility and control that I know I want from my spam filters.