Spam Free Email

Anti-spam ideas, tools and services

March 8th, 2006

Tagging messages for the filters

One of the pieces of information that I have available to me for every message is what country the IP address is registered to. I have been trying to figure out exactly how to incorporate this into a filter while still giving the benefit of the doubt to the good messages that come from that country.

The idea that I just implemented was to add the name of the country to the informational line in the header that I tag on each message. Adding such a line is part of the SMTP standard, although adding the name of the country is something I have never heard of anyone else doing before.
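To make the tagging step concrete, here is a minimal sketch in Python. It is not the code running on my server: the header name X-Origin-Country, the COUNTRY_BY_IP table and the tag_country function are all made up for illustration, and a real setup would ask an IP registry database instead of a dictionary.

```python
from email import message_from_string

# Hypothetical lookup table (an assumption for this sketch); a real system
# would query an IP-to-country registry database instead.
COUNTRY_BY_IP = {"192.0.2.10": "Exampleland"}


def tag_country(raw_message, sender_ip):
    """Return the message text with a country header added.

    The header name X-Origin-Country is an illustrative choice, not a
    standard header.
    """
    msg = message_from_string(raw_message)
    country = COUNTRY_BY_IP.get(sender_ip, "unknown")
    msg["X-Origin-Country"] = country
    return msg.as_string()


if __name__ == "__main__":
    raw = "From: a@example.com\nSubject: hi\n\nbody\n"
    print(tag_country(raw, "192.0.2.10"))
```

Because the country name ends up as plain text in the header, a filter that tokenizes headers will pick it up as one more word without any other changes.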

There are two things that this does. First, it allows the person reading any header to easily see what country the message is from based on my IP lookup. More importantly, it gives the Bayesian filters more words to work with.

The basic idea came to me while I was looking at some messages that were in an inbox that is 100% spam. I was watching the score of each individual word as it was running through the Bayesian filter. One thing that stood out was that, since every message was spam, certain words in the headers were considered spam markers. These words would normally have been considered neutral, but since every instance was in a spam message they were considered spam.

This gave me the idea that if the country of origin was explicitly stated in the header, then the countries that only send spam to you would give an additional marker for spam. Countries that send both good and bad mail will end up with an extra neutral marker, which tends not to adjust the combined statistics very much. And I’d be hard pressed to think of a country that only sends good mail, so we’ll leave that case alone.
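The effect of a neutral marker can be checked with a little arithmetic. The sketch below assumes the common naive Bayes combining rule (product of the spam probabilities divided by that product plus the product of their complements); my filter may combine scores differently, and the per-word probabilities here are made-up numbers.

```python
from math import prod


def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities with the naive Bayes rule.

    This combining rule is an assumption for the sketch, not necessarily
    the one used by any particular filter.
    """
    spam = prod(word_probs)
    good = prod(1.0 - p for p in word_probs)
    return spam / (spam + good)


if __name__ == "__main__":
    base = [0.9, 0.2, 0.7]  # made-up scores for three words in a message
    print(combined_spam_probability(base))           # score without a country word
    print(combined_spam_probability(base + [0.5]))   # country sends good and bad mail
    print(combined_spam_probability(base + [0.99]))  # country only ever sends spam
```

Under this rule a word scored at exactly 0.5 multiplies both products by the same amount, so the combined score stays put, while a country word that has only ever appeared in spam pushes the score up.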

This, of course, will likely not be the only deciding factor if a message is spam or not, but every piece of information is helpful.

March 8th, 2006

Mnesia vs MySQL again …

I’ve been trying to figure out exactly why each message seems to slow down when I am processing more than about 4 messages at once. I had been thinking that it was a problem with my MySQL server, and that I was trying to access complicated data more often than the server was happy about.

If that was the case, then the only solution I could see was more and better hardware. This was not an option at this point in development.

I wrote some code to try to push the MySQL server into a state where I could monitor what the problem was. I ended up seeing the problem reflected in the time it took to run my code, but the MySQL server was running along quite nicely.

So I ended up taking this to the next level. I used fprof to profile what my code was doing, and Mnesia turned out to be the problem area. Or rather, once again, how I was using Mnesia.

The way I had been using Mnesia was to load large amounts of data from the MySQL server so that I could do a lot of math with Mnesia as my data store. It turns out that Mnesia didn’t like the large amounts of data I was loading, and when I was loading that much data for 4 or more messages at once, things started to slow down.

My solution was to switch back to using MySQL for this. After my crash a few weeks ago, I redesigned the database to the point where it was almost nothing like the database before the crash. During this redesign I reworked the areas that stored the data I needed for my Bayesian filters. This new design is much more efficient than the way I was using it before, and it takes away the need for using Mnesia as a cache.

In fact, by switching to many small queries instead of one large query that got imported into Mnesia, I seem to have reduced the average time to process a message by 40%. (Hopefully that number will hold over time.)
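The "many small queries" approach looks roughly like the sketch below. It is in Python with an in-memory SQLite database standing in for MySQL, and the word_stats table, its columns and the sample rows are all invented for the example; my real schema is different.

```python
import sqlite3


def make_db():
    """Build a tiny stand-in database (hypothetical table and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_stats ("
        "word TEXT PRIMARY KEY, spam_count INTEGER, good_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO word_stats VALUES (?, ?, ?)",
        [("viagra", 40, 0), ("meeting", 1, 30), ("hello", 10, 10)],
    )
    return conn


def lookup_counts(conn, words):
    """Fetch counts one word at a time instead of loading the whole table."""
    counts = {}
    for word in words:
        row = conn.execute(
            "SELECT spam_count, good_count FROM word_stats WHERE word = ?",
            (word,),
        ).fetchone()
        if row is not None:  # words never seen before have no row
            counts[word] = row
    return counts


if __name__ == "__main__":
    conn = make_db()
    print(lookup_counts(conn, ["meeting", "hello", "neverseen"]))
```

Each lookup hits the primary key, so only the rows for the words in the current message are touched and nothing has to be copied into a second data store first.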

The moral of this little story is that optimization is not just in one place and it is not just at one time. Each change you make might need to be rethought after making other changes.
