Spam Free Email

Anti-spam ideas, tools and services

March 29th, 2006

Anti-Virus Stats

After doing the SPF stats at http://spf.spamfreeemail.com I kind of got into the mood of doing more stats pages. I have a few more planned, but I have the AntiVirus page done well enough to show people.

Of the messages I have received, viruses account for about 0.30%. Granted, we haven’t had a good outbreak of a hard-hitting virus in the past few weeks, but that number still seems pretty low to me.

You can find the AntiVirus stats page at http://antivirus.spamfreeemail.com

I think my next stats page will be a country-of-origin page for spam.

March 23rd, 2006

Training the Corpus

A few days ago I had yet another hard drive go out on me. This time I was much better prepared than the previous time: I actually had a backup from the previous night at 11:30. So no code was lost and the system was really only down for a few hours, mostly because I was deciding what direction to take the hardware and where to spend money to put in a permanent fix.

I still had the corpus (read: database of words for spam) but I no longer had any of the messages to look back at. In the end I decided to clear out my corpus and watch how well the system gets trained with my email and some of the training ideas that I have implemented in the code.

This is something that I haven’t talked much about, but it is integral to the way this system works. “I use as much data as I can to learn what is good and what is bad and then fine-tune that knowledge into multiple levels of personalization.” As each message is broken up into its parts, the system decides what level of personalization to use, and I’ve built a system that is capable of doing that in under three seconds per message.

By adding more information about each message (I am thinking about trying Yahoo! DomainKeys as a test), the system learns better and faster, and the fine-tuned personalization becomes faster and more accurate.

Some of you may be reading this and assuming that this is the way that Bayesian filters work, but in this case I have changed it enough that the personalization is now a much larger feature.

So now, on day three of training the new corpus, I got my first good message. Out of 1,650 messages, 304 have been marked bad and 1 is now marked good. Many of these messages are truly good or bad, but the exciting news is that the system is learning how to tell the difference and it is doing it well.

Each good message and each bad message redefines how the system defines the tokens in a message, but more importantly it also redefines how each user’s personalization works. So not only do you get the intelligence of the entire system, but you get personalization built in as well.
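To make the idea concrete, here is a minimal sketch of one way system-wide and per-user token statistics could be blended. This is an assumption for illustration only, not how SFE actually does it (this post deliberately leaves those details out). The function name, the `k` parameter and the weighting formula are all hypothetical, and the sketch is in Python rather than the language SFE is written in.

```python
def blended_spam_probability(global_spam, global_ham,
                             user_spam, user_ham, k=5.0):
    """Blend a token's system-wide spam ratio with one user's own ratio.

    The more often this user has seen the token, the more their own
    history counts. `k` (hypothetical) is how many personal sightings
    it takes before the user's history gets half of the weight.
    """
    def ratio(spam, ham):
        total = spam + ham
        # A token that was never seen is treated as neutral.
        return 0.5 if total == 0 else spam / total

    n_user = user_spam + user_ham
    weight = n_user / (n_user + k)  # 0.0 with no personal data, approaches 1.0
    return (weight * ratio(user_spam, user_ham)
            + (1 - weight) * ratio(global_spam, global_ham))


# A token that is spammy system-wide (90 spam, 10 good sightings):
new_user = blended_spam_probability(90, 10, 0, 0)    # no personal history
regular = blended_spam_probability(90, 10, 0, 20)    # user sees it in good mail
```

With no personal history the result is just the system-wide ratio. For the user who keeps receiving that token in good mail, the score is pulled well below it, without changing what anyone else sees.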

I’ve been purposefully vague and left out many details as they are the secret sauce of the service and they change on a daily basis. The end result will be that newer types of spam messages will get marked faster and one person’s personal preference will not greatly affect how another person’s email is classified in the long run.

I’m excited about this and I can’t wait to see how fast the filter will start to see the patterns in the chaos that I can’t see myself :-)

Note: 3 more messages have been classified as good since I started writing this blog entry …

March 20th, 2006

SPF Adoption

I’ve been working on and off on an SPF module for SFE. I’ve always intended it to be one of the major parts of the spam analysing process, but the specification is more complicated than I care for at times, so I’ve been going pretty slowly.

One of the things I have been doing is caching the SPF records for 24 hours so that I don’t have to do the DNS look-up every single time. I’ve written a few reports that are pretty interesting about SPF and how well it is being adopted by Internet users.
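The 24-hour caching described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual SFE module; the class name `SpfCache` and the `lookup` callback (standing in for whatever performs the real DNS TXT query) are assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DAY_SECONDS = 24 * 60 * 60  # cache lifetime mentioned in the post


class SpfCache:
    """Remember each domain's SPF record so DNS is asked at most once per TTL."""

    def __init__(self,
                 lookup: Callable[[str], Optional[str]],
                 ttl: float = DAY_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._lookup = lookup  # hypothetical: does the real DNS TXT query
        self._ttl = ttl
        self._clock = clock    # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, domain: str) -> Optional[str]:
        key = domain.lower()   # domain names are case-insensitive
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # still fresh: no DNS query needed
        # Missing or expired: look it up and remember the answer.
        # None is cached too, meaning "no SPF record published".
        record = self._lookup(key)
        self._entries[key] = (now, record)
        return record
```

Caching the "no record" answer as well matters here, since most domains seen so far do not publish SPF at all and would otherwise trigger a lookup on every message.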

At this point only 22% of domains that I have seen send email through SFE have SPF records.

You can find the real-time reports at http://spf.spamfreeemail.com. These reports are cached for 1 hour, the SPF records are updated every 24 hours, and they do reflect the real data for SpamFreeEmail.com.

I’ll create new reports as I find the time. If you have any ideas on reports to create feel free to leave them in the comments.

March 9th, 2006

SFE Status update

After I figured out what was causing the problems with multiple messages slowing each other down on the system, things have been working wonderfully. So well, in fact, that I don’t see the need to replace the current hardware in the near future. I’m hoping that adding more hardware will be necessary, but that will only be to handle additional load.

There are currently 5 permanent servers running SFE: three dual PIIs and two P4s, with lots of RAM between them. The database server that runs SFE and many of my other sites is a dual P4 Xeon with 4GB of RAM. With this current configuration my statistics say that I could process more than 450 times the spam messages that I currently do each day, and I suspect that number will hold true for a while.

Thanks to Erlang, it is easy to add more nodes to the system; it can be done in less than 15 minutes per server. In fact each server can easily handle 6 nodes, possibly more if I wanted. Each node is capable of handling more than 13,500 messages a day and that number is only going up for the next week or so.

I still have a few things that I need to work out before letting too many more people onto SFE. I am hoping to have the major issues resolved soon and then just work on the small or more cosmetic issues.

My current goal is to start letting people sign up for themselves to be part of the extended BETA testing (which will be free) before the first of April.

During the extended BETA test the system will still be working out some bugs, and the bigger thing is that I will need more samples of good email messages. Once I see neutral messages accounting for less than 5% of the total (it is currently just under 20%) for more than 1 week, I will feel confident that the system is good enough to charge money for :-)
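That readiness rule can be written down as a small check. The sketch below is in Python and reads "more than 1 week" as "the most recent run of days under the threshold is longer than 7 days", which is one possible interpretation; the function name and the input format are hypothetical.

```python
def ready_to_charge(daily_counts, threshold=0.05, min_days=7):
    """Return True once the neutral share has stayed under `threshold`
    for more than `min_days` consecutive days, counting back from today.

    daily_counts: list of (neutral_messages, total_messages) tuples,
    one per day, oldest first.
    """
    streak = 0
    for neutral, total in daily_counts:
        if total > 0 and neutral / total < threshold:
            streak += 1
        else:
            streak = 0  # a bad (or empty) day restarts the count
    return streak > min_days


# Eight good days in a row at 4% neutral pass the check.
week_plus = [(4, 100)] * 8
is_ready = ready_to_charge(week_plus)
```

A single day back at the current level of roughly 20% neutral resets the streak, so a brief dip under 5% is not enough to pass.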

March 8th, 2006

Tagging messages for the filters

One of the pieces of information that I have available to me for every message is what country the IP address is registered to. I have been trying to figure out exactly how to incorporate this into a filter while still giving the benefit of the doubt to messages that are going to be good for that country.

The idea that I just implemented was to add the name of the country to the informational line in the header that I tag onto each message. Adding an informational header like this is allowed by the email standards, although adding the name of the country is something I have never heard of anyone else doing before.

There are two things that this does. First, it allows the person reading any header to easily see what country the message is from based on my IP lookup. More importantly, it gives the Bayesian filters more words to work with.
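As an illustration of the tagging step, here is a minimal sketch using Python's standard `email` package. The header name `X-SFE-Info` and the `origin-country=` format are made up for this example; the real header SFE writes is not shown in this post.

```python
from email import message_from_string
from email.message import Message

HEADER_NAME = "X-SFE-Info"  # hypothetical name for the informational header


def tag_country(msg: Message, country: str) -> Message:
    """Record the country the sending IP is registered to in a header."""
    # Remove any earlier tag first; deleting a missing header is a no-op.
    del msg[HEADER_NAME]
    # Join multi-word names so the country stays a single token for the filter.
    msg[HEADER_NAME] = "origin-country=" + country.replace(" ", "_")
    return msg


raw = "From: sender@example.com\nSubject: hello\n\nMessage body\n"
tagged = tag_country(message_from_string(raw), "United States")
header_value = tagged[HEADER_NAME]
```

Keeping the country as one unbroken word is a design choice in this sketch: a filter that splits on whitespace then sees a single, distinctive token per country instead of generic words like "United".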

The basic idea came to me while I was looking at some messages that were in an inbox that is 100% spam. I was watching the score of each individual word as it was running through the Bayesian filter. One thing that stood out was that since every message was spam, certain words that were in the headers were considered spam markers. These words would normally have been considered neutral, but since every instance was in a spam message they were considered spam.

This gave me the idea that if the country of origin was explicitly stated in the header, then the countries that only send spam to you will give an additional marker for spam. Countries that send both good and bad mail will end up with an extra neutral marker, which tends not to adjust the combined statistics very much. And I’d be hard pressed to think of a country that only sends good mail, so we’ll leave that case alone.
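The effect on the token scores can be shown with a toy example. This Python sketch uses a deliberately simplified per-token score (the share of sightings that were spam, clamped away from 0 and 1), not SFE's actual formula; the class name and the placeholder countries "A" and "B" are hypothetical.

```python
from collections import Counter


class TokenStats:
    """Count how often each token shows up in spam and in good mail."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()

    def train(self, tokens, is_spam):
        # Count each token once per message, however often it repeats.
        (self.spam if is_spam else self.good).update(set(tokens))

    def spam_probability(self, token):
        spam, good = self.spam[token], self.good[token]
        if spam + good == 0:
            return 0.5  # never seen: neutral
        # Clamp so a single token can never be fully decisive.
        return min(0.99, max(0.01, spam / (spam + good)))


stats = TokenStats()
# Country A has only ever sent this user spam.
for _ in range(3):
    stats.train(["origin-country=A", "offer"], is_spam=True)
# Country B sends a mix of spam and good mail.
for _ in range(2):
    stats.train(["origin-country=B", "offer"], is_spam=True)
    stats.train(["origin-country=B", "meeting"], is_spam=False)

score_a = stats.spam_probability("origin-country=A")  # strong spam marker
score_b = stats.spam_probability("origin-country=B")  # neutral marker
```

Country A's token ends up at the top of the scale, while country B's sits at the neutral midpoint and so barely moves the combined score, which matches the behaviour described above.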

This, of course, will likely not be the only deciding factor in whether a message is spam or not, but every piece of information is helpful.

March 8th, 2006

Mnesia vs MySQL again …

I’ve been trying to figure out exactly why each message seems to slow down when I am profiling more than about 4 messages at once. I had been thinking that it was a problem with my MySQL server, that I was trying to access complicated data more often than the server was happy about.

If that was the case then the only solution I could see was more and better hardware. This was not an option at this point in development.

I wrote some code to try and push the MySQL server into a state where I could monitor what the problem was. I ended up seeing the problem reflected in the times to run my code, but the MySQL server was running along quite nicely.

So I ended up taking this to the next level. I used fprof to profile what my code was doing and Mnesia turned out to be the problem area. Or rather, once again, how I was using Mnesia.

The way I had been using Mnesia was to load large amounts of data from the MySQL server so that I could do a lot of math with Mnesia as my data store. It turns out that Mnesia didn’t like the large amounts of data I was loading, and when I was loading large amounts of data for 4 or more messages things started to slow down.

My solution was to switch back to using MySQL for this. After my crash a few weeks ago, I redesigned the database to where it was almost nothing like the database before the crash. During this redesign I reworked the areas that stored the data I needed for my Bayesian filters. This new design is much more efficient than the way I was using it before and it removes the need to use Mnesia as a cache.

In fact, by switching to many small queries instead of one large query that got imported into Mnesia, I seem to have reduced the average time to process a message by 40%. (Hopefully that number will hold over time.)

The moral of this little story is that optimization is not just in one place and it is not just at one time. Each change you make might need to be rethought after making other changes.

March 5th, 2006

Bayesian filter accuracy

This morning I got to thinking that knowing the overall accuracy of the Bayesian filter would give a good handle on how well the filter is learning.

The thing I am most interested in is what percentage of messages are getting marked neutral, or unclassified.

From the beginning of the data I have, which is about 21 days of messages, I see that about 30% of the messages have been marked neutral. Of those, 6.5% have been reclassified as good and 93.5% have been reclassified as bad.

Over the past 14 days I see 24.2% are neutral, while 8.6% have been reclassified as good and 91.3% have been reclassified as bad.

And in the past 2 days 17.2% have been marked neutral, with 3.7% reclassified as good and 96.2% reclassified as bad.

Overall I can see that the filter is learning, although not as fast as I would like it to. The really good news is that of all of the messages that were originally classified as good or bad, none have been reclassified.
