Spam Free Email

Anti-spam ideas, tools and services

February 26th, 2006

If it’s not one thing …

The system has been working pretty well for the past few days, but a problem that I knew about a month or two ago seems to be rearing its head more than usual lately.

I knew I was having a problem with large messages, in excess of 1MB. The small ones are fine, but the big ones are a problem. I suspect it has to do with writing the file from memory and the connection timing out in the meantime. I get the file and process it just fine, but the client seems determined to send the message over and over again …

So I’ve decided to create a much more elaborate insert queue. Till now I’ve been relying on a record being passed through a finite state machine. I think it is time to start writing files earlier and using them to recreate the state of a message if it fails.

I’ve looked at the way a few other systems work and I think I’ve come up with a directory structure, naming convention and a few states that will simplify the process.

Hopefully doing more writes, but with smaller amounts of data, will get rid of the timing issue. Either way, making the system better able to recover from a crash or shutdown will improve things quite a bit.
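To make the idea concrete, here is a minimal sketch of a file-backed queue in Python. It is not the actual design: the state names (incoming, processing, done), the .msg suffix and every function name are assumptions made up for illustration. The point is only that a message is written to disk before any work is done on it, that its state is the directory it sits in, and that a restart can find anything that was interrupted.

```python
import os
import tempfile
import uuid

# Hypothetical state names; the real directory layout is not described here.
STATES = ("incoming", "processing", "done")


def init_spool(root):
    """Create one subdirectory per state."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)


def enqueue(root, raw_message):
    """Write the message to disk before doing any work on it."""
    msg_id = uuid.uuid4().hex
    incoming = os.path.join(root, "incoming")
    final_path = os.path.join(incoming, msg_id + ".msg")
    # Write under a temporary name, then rename, so a crash never leaves
    # a half-written file under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=incoming, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        handle.write(raw_message)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, final_path)
    return msg_id


def move(root, msg_id, old_state, new_state):
    """Advance a message by renaming its file into another state directory."""
    os.replace(
        os.path.join(root, old_state, msg_id + ".msg"),
        os.path.join(root, new_state, msg_id + ".msg"),
    )


def recover(root):
    """After a restart, put interrupted messages back in the incoming queue."""
    recovered = []
    for name in sorted(os.listdir(os.path.join(root, "processing"))):
        if name.endswith(".msg"):
            msg_id = name[: -len(".msg")]
            move(root, msg_id, "processing", "incoming")
            recovered.append(msg_id)
    return recovered
```

Because a rename within one filesystem is atomic, a message is always in exactly one state directory, and the acknowledgement to the sending client could in principle go out as soon as enqueue returns.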

February 22nd, 2006

Corpus, Corpus, Corpus

I have been thinking about implementing more corpuses (corpii?) than just the standard Good, Bad and Neutral. Since ClamAV marks some messages as Phishing schemes, I have been thinking about creating a corpus where the known phishing schemes are marked bad by ClamAV and then I could pull out certain messages that are similar to evaluate them further by hand.

I was also thinking that it might be an effective way to categorize mail for different types of content. Sexual content will have a limited number of words, as will profanity. By marking some messages as sexual in nature and rerunning some statistics based on the corpus you could easily categorize the mail into almost any category you would want.
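As a rough illustration of one corpus per category, here is a small Python sketch of a multinomial naive Bayes classifier. The class name, the tokenizer and the add-one smoothing are assumptions chosen for the example; they are not the statistics the system actually uses.

```python
import math
import re
from collections import Counter, defaultdict


class CorpusClassifier:
    """Tiny multinomial naive Bayes with one word-count corpus per category."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.message_counts = Counter()          # category -> messages seen

    @staticmethod
    def tokenize(text):
        """Lowercase the text and split it into simple word tokens."""
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, category, text):
        """Add one hand-marked message to the corpus for a category."""
        self.word_counts[category].update(self.tokenize(text))
        self.message_counts[category] += 1

    def classify(self, text):
        """Return the most likely category, or None if nothing is trained."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best_category, best_score = None, float("-inf")
        for category, counts in self.word_counts.items():
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(self.message_counts[category] / total_messages)
            total_words = sum(counts.values())
            for word in words:
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category


classifier = CorpusClassifier()
classifier.train("mortgage", "refinance your mortgage at a low rate")
classifier.train("travel", "cheap flights and hotel deals")
print(classifier.classify("low mortgage rate today"))
```

Adding a category is just a matter of marking a few messages and training on them, which is what makes the approach attractive for user-defined categories.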

This could pull out 419/Nigeria scams as easily as mortgage offers. Then with some user preferences and statistics you could determine what types of mail each user likes and does not like and further customize their experience.

It might even be an effective way to sort mail in a web-based interface. If the user could set up their own categories and then specify that they want all mortgage-related emails to go into one folder and all travel-related emails to go into another one, they would have a much easier way to find messages by topic.

Of course this is just me blathering about things that I don’t have time to work on, but they are fascinating ideas on how to sort and classify messages …

February 20th, 2006

Less is more continued …

I figured out the problem where adding more nodes slowed down the system, and found a workaround that keeps the performance stable enough to add more nodes back in on a regular basis.

Turns out that the system as a whole could not cope when lots of nodes attempted to work on the same message at the same time. I had built in a throttling system which allowed only one node to work on one section of each message at one time, but the nodes were being so aggressive at trying to process messages that they were stepping on each other’s toes.

I added in another limiting factor: when more nodes are present, each node takes slightly longer before it tries to pull more data in. It has been working very well today.
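The limiting factor can be sketched in a few lines of Python. The constants and the function name are assumptions picked for illustration, and the random jitter is an addition of this sketch rather than something the system is known to do; it is there so that nodes started together do not all poll at the same instant.

```python
import random

BASE_DELAY_SECONDS = 0.5  # assumed wait when only one node is running
PER_NODE_SECONDS = 0.1    # assumed extra wait for each additional node


def poll_delay(active_nodes, rng=random):
    """Return how long a node should wait before pulling more work.

    The wait grows linearly with the number of active nodes, plus up to
    25% random jitter (an assumption of this sketch).
    """
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    delay = BASE_DELAY_SECONDS + PER_NODE_SECONDS * (active_nodes - 1)
    return delay * (1.0 + rng.uniform(0.0, 0.25))


for nodes in (1, 6, 50):
    print(nodes, round(poll_delay(nodes), 2))
```

With these made-up constants a node in a 6-node cluster waits about a second between polls and a node in a 50-node cluster waits five to seven seconds, so the total polling rate against the shared tables stays roughly flat as nodes are added.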

So the problem wasn’t Mnesia or MySQL, but in fact my overuse of them :-)

Right now the biggest bottleneck in the whole system is MySQL, and that is saying a lot as most messages are processing in under 5 seconds. The problem occurs when a burst of 30 to 50 messages comes in and they all try to process at the same time.

The solution to this is to create a master database server and have several replicated slave servers. The larger problem is that my current funding will not allow for this right now. As soon as I can afford to get two really beefy servers I’ll start breaking them out into a master and slave configuration.

I’m almost tempted to turn one of my servers that is running Erlang into a MySQL slave, but I don’t think any of them are powerful enough to create a performance increase.

Another option I have thought of is to add a MySQL slave to each server that runs 6 nodes. Then the data would be local and each MySQL server could never have more than 6 connections.

While this would improve the performance of MySQL, I worry about the performance of Erlang with the current configuration, so either way I need more and better hardware.

February 19th, 2006

When Less is More

Since my crash about a week ago I have been rebuilding the system and trying to improve upon it in the process. I got the major things fixed and working by Thursday, and last night I got the minor things fixed, so I am moving forward again instead of playing catch-up.

In the process I have removed hundreds of lines of duplicated code. Mostly I found ways of reducing several functions into one or two functions, usually giving them more functionality as well.

I have also taken the time to remove many tables from my MySQL database. I reduced the number of tables almost by half by combining what was in several tables into a few of them. This has greatly improved the performance of the database and in some cases I have seen a tenfold improvement. (Most of that is due to better indexes.)

All of that makes sense to me when I am talking about making less into more, since more messages being processed in less time is the ultimate goal. What is confusing me at the moment is that fewer Erlang nodes are processing each message in less time.

Right now I have six nodes running and messages are processing in 1 to 5 seconds. If I turn on all of my nodes, which is currently 50, the messages start to process in 50 to 200 seconds.

I’ve checked the database code and it is as streamlined as I can make it, but I think my current bottleneck is how I am using Mnesia.

I am using Mnesia to hold information about each message as it is being profiled, and then I throttle the nodes so that only one node can be working on one process for each message. (There are currently 5 different processes that each node could be working on for each message.)
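The throttle amounts to a claim table keyed by message and process. Here is a minimal in-memory sketch in Python; it is a stand-in for illustration only, not Mnesia and not the actual schema, and the class, method and stage names are all hypothetical.

```python
import threading


class ClaimTable:
    """In-memory stand-in for a shared table of (message, stage) claims."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claims = {}  # (message_id, stage) -> name of the owning node

    def try_claim(self, message_id, stage, node):
        """Claim a stage of a message; return False if another node has it."""
        key = (message_id, stage)
        with self._lock:
            if key in self._claims:
                return False
            self._claims[key] = node
            return True

    def release(self, message_id, stage, node):
        """Release a claim, but only if this node is the one holding it."""
        key = (message_id, stage)
        with self._lock:
            if self._claims.get(key) == node:
                del self._claims[key]
                return True
            return False


table = ClaimTable()
print(table.try_claim("msg-1", "headers", "node-a"))  # first claim succeeds
print(table.try_claim("msg-1", "headers", "node-b"))  # same stage is refused
print(table.try_claim("msg-1", "body", "node-b"))     # other stage is free
```

In this sketch every claim attempt is a round trip to shared state, so the more nodes there are asking, the more of the traffic is failed claims rather than useful work.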

I’m beginning to think that I am passing more information between the nodes than they are able to handle, so with fewer nodes Mnesia is running better.

I’m considering this a flaw in my own use of Mnesia, not in Mnesia itself. I think if I redesign the data I should be able to add all 50 nodes and make them work faster than just the 6 I have working now. The trick is figuring out the correct data structures in Mnesia to make it work.

February 13th, 2006

Problems with prototyping on a budget

Well, I had my first real disaster today. Last night really. Looks like at about 5AM I lost 2 drives in my RAID5 that was storing not only all the emails for my corpus, but also all of my Erlang code. I lost it all.

I do have some backups of the code, but they are more than two weeks old. Backups are not my forte and unfortunately this will not make me very much better at backups.

The reason for the failure is that I do not have all of my production hardware yet and that means I have been working on what I have lying around. Of course what I have lying around would be a veritable treasure trove for most people, but it is still prone to failures.

So now I have to recreate two weeks’ worth of code and I’m going to redesign parts of my database. I was going a bit overboard with third normal form and I need to move back to second normal form in a few areas for performance reasons.

I suspect that the system will be down for at least two days, one for the hardware and at least one for the code and databases. I’ll see what I can do about doing both at once, but I don’t have any of my personal email till the system is back up and running … So two days is a really long time :-)

February 9th, 2006

InnoDB vs MyISAM

For the past week I’ve known what I had to do and that is to get InnoDB working on my MySQL server.

The row locks vs table locks were the big selling point as the table locks were making it so that I could only process 4 messages at once.

The difference in disk space is a concern of mine, but the supposed performance increase currently outweighs that.

I say supposed performance increases because I have been converting one table for nearly three hours. I have no idea how far along the process is and at this point I am assuming that it is working at all. I can see the time stamps changing on the table space, but that is my only clue.

All of my other tables converted quickly, but this is the one huge table that I have. And of course it is the one that needs the InnoDB row locking the most. It is also the only table that didn’t have a good primary key.

I know this is what I need to do to make the system perform the way I want it to, but I just wish I had some kind of sign that it was still working …. something … anything ….
