Spam Free Email

Anti-spam ideas, tools and services

January 29th, 2006

Benefits of being obsessive compulsive :-)

Ever since I finished the telnet module to give me real time stats about the system I’ve been puzzled. I was watching messages as they went through the system and counting. I counted to 15 or so and the system was telling me they took 50 seconds.

Finally I saw one where I counted to 30 and the system told me it took 4,000 seconds.

Turns out I had a glitch in my math that I was using to generate my stats, or rather my SQL statement that was doing the math to generate my stats. Because of that my average time literally went from 280 seconds for each message to 89 seconds and it’s still going down slowly.

So what that means is that for the past week I’ve been trying to optimize the system with bad numbers. Which explains why things that I thought would help didn’t do as well as I thought they would.

But now, officially I think I have the back-end optimized the way it needs to be and I’ll be working on adding more features again.

January 29th, 2006

Push vs Pull

Over the past two days I’ve been experimenting with two distinctly different ways to handle message queues. Basically the easiest way for me to describe them is a Push vs Pull concept.

Originally I designed the system with a Push method. I would get a new message and I would push it through the profilers and then out to the client.

While this method was working, it had some problems when I tried pushing to a server that went down. So I’d have to re-profile and then that became a problem when I was re-profiling the same message on several machines. I was also ending up with multiple copies of the good messages being sent out when they were completed.

For the Pull method that I have switched to, I designed a throttle so that only one server can pull one section at a time, and all servers check the queue every 1 to 5 seconds. This makes sure that each server is only processing one thing at a time and all servers are working on something different.
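As a rough illustration of the pull idea, here is a minimal Python sketch. The real system is written in Erlang, and the `PullQueue` class, its `claim` method, and the worker names are all hypothetical. A lock plays the role of the throttle: only one worker can pull at a time, and an item that has been claimed is never handed out twice.

```python
import threading
from collections import deque


class PullQueue:
    """Shared queue that workers pull from; a lock acts as the throttle."""

    def __init__(self, items):
        self._items = deque(items)
        self._lock = threading.Lock()

    def claim(self):
        """Hand out at most one item per call; return None when empty."""
        with self._lock:  # only one worker may pull at a time
            return self._items.popleft() if self._items else None


def worker(name, queue, done):
    # Each worker handles one item at a time, then comes back for more.
    while True:
        item = queue.claim()
        if item is None:
            break
        done.append((name, item))  # list.append is thread-safe in CPython


def run(num_workers, items):
    """Start the workers, wait for them, and return who processed what."""
    queue = PullQueue(items)
    done = []
    threads = [
        threading.Thread(target=worker, args=(f"node{i}", queue, done))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

In this sketch the workers stop when the queue is empty. A long-running version would sleep for a few seconds and check the queue again, which is the polling interval described above.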

I have a lot of code from the push method still in place, so some of the system looks redundant at times, but in reality this makes the system more fault tolerant.

All in all I am much happier with the current pull system I have implemented.

January 27th, 2006

Server Statistics

One of the complaints I have heard from people using other anti-spam services is that they never know if/when the service is having problems.

One of the things that I am interested in is performance statistics so that I can monitor the way the servers are performing and catch any problems that might be happening. I need to have access to this information while I am at home and more importantly while I am away from home.

Both of these problems are solved with my latest feature, a telnet based stats page. The format of the page is based on the UNIX command vmstat. It has several columns of information that are constantly scrolling with updates.

If you want to see what it looks like you can drop to a command prompt and type '''telnet stats.spamfreeemail.net'''.

From there you will see the same stats that I look at to see how the system is performing.

There are 6 major columns, some with minor columns for details:

* Message Count for the past 24 hours
** G = Good
** B = Bad
** N = Neutral
** v = Virus
* Current Cache
** Profile Cache
** Mail Cache
* Active Processes or mail currently being profiled
* SMTPC Retries or the number of messages in the SMTP Client retry queue
* Average processing time
** Last Message
** Day = Average for the last 24 hours
** Week = Average for the past 7 days
* Node count or total number of nodes in the cluster

These are current real-time stats, possibly delayed by no more than 5 seconds, but updated closer to once per second.
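To make the three values in the average processing time column concrete, here is a small Python sketch of how they could be computed. It is only an assumption about the arithmetic, not the code behind the stats page, and the record layout (a finish time paired with a duration in seconds) and function names are hypothetical.

```python
from datetime import datetime, timedelta


def average_since(records, now, window):
    """Mean processing time (seconds) of messages finished within `window`."""
    cutoff = now - window
    times = [secs for finished, secs in records if finished >= cutoff]
    return sum(times) / len(times) if times else 0.0


def processing_columns(records, now):
    """Return the (last, day, week) values for the processing-time column."""
    # "Last Message" is the duration of the most recently finished message.
    last = max(records, key=lambda r: r[0])[1] if records else 0.0
    day = average_since(records, now, timedelta(days=1))
    week = average_since(records, now, timedelta(days=7))
    return last, day, week
```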

If this is ever down it simply means I am doing some maintenance, as this is integrated as part of the system as a whole now :-)

January 25th, 2006

Finally stabilized

For most of the month of January I have been working on optimization and performance issues concerning my anti-spam service. It was feeling like anything I was doing to improve performance was creating a bottleneck somewhere else, but I think I’m finally past that.

I have the profiler distributing the messages between all of the nodes in a weighted random round robin fashion. Then each part of the profiler runs concurrently on that machine.
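The weighted random part of that distribution can be sketched in a few lines of Python. This shows only plain weighted random selection, not the exact scheme used in the Erlang system, and the node names and weights are made up for illustration.

```python
import random


def pick_node(weights, rng=random):
    """Pick one node name at random, proportionally to its weight."""
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]


# Hypothetical cluster: the bigger server should get about 3x the messages.
example_weights = {"big_server": 3, "small_server": 1}
target = pick_node(example_weights)
```

A node with a higher weight is chosen more often, so faster machines can be given a larger share of the messages.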

Thanks to the native Erlang module for MySQL my communications with my database server have stabilized, mostly due to the connection pooling and the streamlining that the MySQL module does for the overall design.

Also thanks to the native Erlang module for MySQL I am again able to run the profiler on Windows workstations as well as my Linux servers. The only piece that does not currently run on Windows is the anti-virus, and I have set up the code to send the anti-virus work to another PC (hopefully Linux based) to have that part of the profiler completed. It’s fast so there is very little impact on anything.

I have one version of the profiler that distributes everything to different nodes instead of doing all the work concurrently on one node. The profiler evaluates the message with the same code whether it is distributed or concurrent. I might go back to the distributed design for testing, but last time it was slower than the concurrent model when I thought it should have been faster.

My statistics show that the messages are being processed in a reasonable amount of time (less than a minute for 90% of messages) depending on the size of the message. I might have some more optimization to do, but it’s time to move on to newer things.

I still have a decent amount of work to do on the front end before anyone else will be able to use it well, but I’m going to keep focusing on the major things on the back end for the next few days. I might do the front end over the weekend. It shouldn’t take me more than a few days anyway.

I’m hoping to still be ready to put a few more domains onto the system around the first week of February. If you think you might want to be part of the beta testing, drop me a note and mention '''SFE beta testing''' in the subject line or comment on this thread.

January 24th, 2006

Notes on native Erlang MySQL module

There were a few things that were worth mentioning that I did not like about the native Erlang MySQL module. But mostly I’m quite happy with it.

The biggest thing that I did not like was the fact that it returns everything as strings. It gives you the column information and tells you that it’s a long or a double or whatever, but the data is still a string.

I wrote a function into my wrapper that looks at the column information and casts it to an appropriate type based on what was in MySQL. Occasionally I have to recast it, but it is mostly when it is a string in the database and I want to use the information as an atom.

I also had most of my code based off the "list of tuples" design from the OTP ODBC application. The native MySQL module uses a "list of lists" design, so while I was recasting things I also changed each inner list into a tuple. This is minor to me, as it was really just to prevent rewriting a lot of my code.
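The two conversions described above (casting by column type, and turning a list of lists into a list of tuples) look roughly like this in Python. This is a sketch of the idea rather than the Erlang wrapper itself, and the function name and the type names in `casters` are assumptions.

```python
def cast_rows(columns, rows):
    """Cast string fields by column type and return a list of tuples.

    `columns` is a list of (name, type_name) pairs; `rows` is a list of
    lists of strings, as a driver that returns only strings might give.
    """
    casters = {"long": int, "double": float}  # hypothetical type names
    # Any column type not listed above is left as a string.
    funcs = [casters.get(type_name, str) for _, type_name in columns]
    return [
        tuple(None if value is None else f(value) for f, value in zip(funcs, row))
        for row in rows
    ]
```

For example, with columns `[("id", "long"), ("word", "varchar")]` the row `["7", "spam"]` becomes the tuple `(7, "spam")`.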

The single greatest advantage to using this module is connection pooling. That alone will save my MySQL server tons of new connections and logins.

January 24th, 2006

Native support for MySQL in Erlang

I was reading through my RSS feeds today when I noticed a new post at http://www.planeterlang.com. It mentioned a native Erlang application to communicate with MySQL. Since I have been having some issues with my ODBC connection to MySQL I thought I’d give it a try.

The native Erlang application actually sends binary data back and forth between your Erlang node and the MySQL server, instead of translating everything into text that the client sends and that then comes back from the server. It seems to be not only faster, but much more stable than using ODBC to connect to MySQL.

When I was doing a lot of ODBC commands from a Windows workstation (WinXP) I was having problems where the ODBC driver would give out. So far the native Erlang application for MySQL has been solid as a rock.

When using ODBC, the request was going from my application, to the OTP ODBC application, which started a program outside of Erlang to connect to the ODBC drivers that eventually made a connection to the MySQL server. In the case of Linux, I was using MyODBC and UnixODBC and that means that five programs had to work right before things would get to MySQL.

Now Erlang talks directly to the MySQL server. I have a wrapper in my application, which talks to the native Erlang application, and that talks directly to MySQL. That is two applications internal to Erlang instead of 2 inside and at least 2 outside of Erlang.

I’m a happy camper, as I have rewritten a complete version just to accommodate the new native Erlang application in my anti-spam server. Luckily I had isolated most of my SQL commands into one module :-)

http://support.process-one.net/doc/display/CONTRIBS/Yxa

January 23rd, 2006

SQL optimization

It seems that whenever I manage to get the system faster on one component another bottleneck appears to take its place. I suspect that is what beta testing is for ….

The latest bottleneck was a needed optimization. While designing the system I had a function that linked each word in a message to that message in the database. Originally I had it set up to use 1 SQL statement each time I linked a word to that message.

This morning when I woke up I was noticing that the system fell behind again, and like many times before it looked as if I was overloading my MySQL server.

I wrote a few quick functions to clear up some obvious problems, which didn’t solve anything, then I got to the part where it was slowing down, which was where I was linking the words to the message.

In MySQL you can put multiple inserts into one statement. I’m not sure if this is a MySQL specific feature or part of the SQL standard, but I have changed the code so that it builds one large SQL insert statement and only executes one statement per message.

So in the case that a message has 100 words in it, I just eliminated 99 SQL queries without removing any functionality.
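The change can be sketched in Python, using the standard library's `sqlite3` as a stand-in for MySQL since it also accepts multi-row `VALUES` lists. The table name `message_words`, its columns, and the `link_words` function are made up for the example; the real schema is not shown here.

```python
import sqlite3


def link_words(conn, message_id, words):
    """Link all words to a message with a single multi-row INSERT."""
    if not words:
        return
    # One "(?, ?)" group per word, so the whole message is one statement.
    placeholders = ", ".join(["(?, ?)"] * len(words))
    params = [value for word in words for value in (message_id, word)]
    conn.execute(
        f"INSERT INTO message_words (message_id, word) VALUES {placeholders}",
        params,
    )
```

Using placeholders rather than pasting the words into the SQL string keeps odd characters in a message from breaking the statement.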

Now it should be time to find a bottleneck on the Erlang side of things ….

January 19th, 2006

Distributed Concurrency and performance

I’ve been distracting myself from the things I really should be doing today by thinking of other ways to optimize the profiler, which is the piece that processes each mail message and determines if it is spam or not.

I found a few ways to avoid redundant disk reads and changed some SQL statements to reduce the number of ODBC connections. I’m still not overly happy with the performance of the system as a whole, but I think it is working great for a BETA stage product.

My numbers show that it takes 65 seconds to process the average mail message and I think I have maxed out at about 60 seconds per message on the current system with its current design.

Of course I am not fully implementing distributed processing and concurrency with the profiler. One of my next steps is to break the profiler into distinct sections that will process on randomly chosen nodes concurrently. Then once the entire profile is created the message will be evaluated and processed as needed.

This, of course, requires slightly different logic, but the code stays remarkably similar.

I know I will need to create a more advanced evaluate command, but other than that I think the only other thing I will need will be to distribute the individual processes to different computers.

I currently have 5 servers with a total of 25 nodes, and 22 of those nodes are dedicated to profiling. The profiler currently has about 6 distinct sections, so having 6 different nodes processing concurrently '''should''' mean that the messages will complete faster than if I do all six sections in order.
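The reasoning is that sections run in order take the sum of their times, while sections run concurrently take roughly the time of the slowest one, plus whatever it costs to hand the work out. A small Python sketch with threads standing in for nodes shows the difference; `run_section` just sleeps, and all names and durations are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_section(seconds):
    """Stand-in for one profiler section; sleeps to simulate work."""
    time.sleep(seconds)
    return seconds


def run_in_order(durations):
    """Run every section one after another; return (results, elapsed)."""
    start = time.perf_counter()
    results = [run_section(d) for d in durations]
    return results, time.perf_counter() - start


def run_concurrently(durations):
    """Run every section at once; return (results, elapsed)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(durations)) as pool:
        results = list(pool.map(run_section, durations))
    return results, time.perf_counter() - start
```

The sketch ignores the cost of moving a message between machines, which is one reason a distributed version can end up slower than expected.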

Plus I am using ClamAV, which I have working on my Linux servers, but not on my Windows computer. Right now I am unable to join my workstation into the cluster to process mail since it cannot do the AntiVirus, but if I write the AntiVirus process to where it will only run on the Linux servers I will be able to add my workstation back into the cluster :-)

It really has taken me longer than I thought it would to see the ease and benefits of distributed processing and concurrent processing in Erlang, but at this point I don’t see any other language I’d want to use to create a service like this one.

And I have a few more projects that I’ve thought up already :-)

January 18th, 2006

What country spams the most?

I asked this question in another blog post, most likely titled the same, about six months ago. Well, I have the answer now.

Here goes:

  1. UNITED STATES
  2. CHINA
  3. REPUBLIC OF KOREA
  4. POLAND
  5. FRANCE
  6. SPAIN
  7. BRAZIL
  8. NETHERLANDS
  9. JAPAN
  10. GERMANY

China and the US account for 46% of the spam (23% each right now) currently going through the system. Korea holds its own with 9% and all the rest have 6% or less.

As of now I’m considering making a filter that will mark all mail from any particular country as bad, unless the email address is on a white list. In that case it would get through just fine, unless it was a virus. The country list to block would be up to individual domain administrators.
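The order of those checks matters, so here is a minimal Python sketch of the decision. The filter does not exist yet, so the function name, the result labels, and the use of country codes are all assumptions made for illustration.

```python
def classify(sender, country, has_virus, blocked_countries, white_list):
    """Return "virus", "good", "bad", or "unfiltered" for one message."""
    if has_virus:
        return "virus"  # a virus is never let through, even if white-listed
    if sender in white_list:
        return "good"  # white-listed senders bypass the country block
    if country in blocked_countries:
        return "bad"
    return "unfiltered"  # left for the other filters to decide
```

Each domain administrator would supply their own `blocked_countries` set, so the same message could be blocked for one domain and pass through to the other filters for another.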

I’m not sure how many people would use it, but I think it would be a good way to block bad mail that would otherwise get through or be marked as neutral.

January 16th, 2006

98% spam …

In the statistics that I look at while I am developing this system I have been amazed that the percentage of spam messages has risen to 98.65% and viruses take up another 0.34%. Which means that 99% of all messages currently going through this system are bad.

While this is not a surprise to me, it surely does confirm the idea in the back of my mind that spam is totally out of control.

On the plus side, I’m seeing fewer and fewer false positives (good emails getting marked as spam) in the system, which is a function of training the filters. I think I have only had one false negative, or a spam message that got through the system marked as good, in the past week. That message was a work of art as well, but eventually even those will be caught by the system.

The optimizations that I did a week or so ago seem to be holding well, but the more messages in the system the slower the system is working. So I do know I need a few more rounds of optimizations or I might need to give up the real time aspect of the filters for a periodically updated version.

Basically now as each message is put into the system it is automatically incorporated into the other filters. I may need to update the filters once an hour, once a day or once a week with new messages instead. Not sure what will end up being best.