Spam Free Email

Anti-spam ideas, tools and services

August 31st, 2005

Automating the spam filter training process

One of the greatest features of modern spam filters is the ability to look at the content of a message and decide whether it is spam based on previous messages that have been categorized as spam.

The biggest downfall of this method is that you need a “statistically relevant” number of messages classified as either good or bad before these filters work well. That means you either need input from the user or you need to start with an overgeneralized set.

While reading about writing these filters I started to think about how to automate this while still keeping the user in the equation.

The idea that I’m toying with right now is to use the user’s white list to determine good messages and then a combination of RBLs and the user’s black list to determine bad messages. Once I have a statistically relevant number of messages, the filter itself will start to work in conjunction with everything else.

This will train directly on the verbiage that is used in the user’s own messages while only requiring them to create a white list. In the case of false positives and false negatives, users will be able to look through messages for a certain time period and reclassify them as good or bad. This will in effect retrain the filters and help prevent the false positive or false negative from recurring.
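As a rough sketch of the labelling idea above, here is how the automatic good/bad decision might look in Python. The function names, the "good"/"bad" labels and the threshold of 200 messages are all assumptions made for illustration; the post does not say what number counts as statistically relevant.

```python
from typing import Optional, Set

# Hypothetical threshold; the post does not give a concrete number.
MIN_TRAINING_MESSAGES = 200


def auto_label(sender: str, rbl_hits: int,
               whitelist: Set[str], blacklist: Set[str]) -> Optional[str]:
    """Return 'good', 'bad', or None when there is no automatic signal."""
    sender = sender.lower()
    if sender in whitelist:
        return "good"
    if sender in blacklist or rbl_hits > 0:
        return "bad"
    return None  # leave unlabelled; the user can classify it later


def filter_is_ready(good_count: int, bad_count: int) -> bool:
    """The content filter joins in once both classes have enough examples."""
    return min(good_count, bad_count) >= MIN_TRAINING_MESSAGES
```

Messages that come back as None are simply not used for training until the user reclassifies them.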

Building each user’s filters from their own email messages “should” make their spam filters more effective in the long run, with very little effort in the short run.

August 26th, 2005

Can’t see the spam through the processed meat products …

I’ve been looking at spam messages for years. I’ve been reading articles and essays about spam for years. I’ve been getting spam for years and the only thing that I know for sure is that no one has found the perfect way to prevent spam from getting into my in-box.

My latest thought on the problem is that we are looking too hard at the overall problem and at the individual spam message. We are seeing the two ends of the spectrum, but we do not have the information to fill in the gaps in between.

This is part of the reason that I am taking my current approach to dealing with spam. We need more statistical data about why spam messages are spam.

I don’t just want to know that a message triggered an RBL; I want to know every filter it triggers. I don’t just want to know that a word somewhere in the message was on a block list; I want to know what the word was, whether it was in the subject line, one of the headers or the body, and how many times it was used.
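To make that level of detail concrete, here is a minimal Python sketch that records which block-listed word was found, where it was found and how often. The function name, the part names ("subject", "body") and the dictionary layout are illustrative assumptions, not part of any existing filter.

```python
import re
from typing import Dict, List


def find_blocked_words(parts: Dict[str, str],
                       blocked_words: List[str]) -> List[dict]:
    """Report every blocked word with its location and number of uses."""
    hits = []
    for location, text in parts.items():
        for word in blocked_words:
            # Whole-word, case-insensitive match.
            pattern = r"\b" + re.escape(word) + r"\b"
            count = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if count:
                hits.append({"word": word, "location": location, "count": count})
    return hits
```

Each hit keeps the word, the part of the message it came from and the count, instead of collapsing everything into a single yes or no.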

And if a message is on a white list I still want to know what would have happened if it had been processed….

Without this level of information, I do not think we will be able to conceive of the next generation of filters. Once there is a pattern in the chaos, it is simple to filter it out. But first we need a complete picture of the chaos and the tools to find the pattern.

[tag]spam, spam filters[/tag]

August 22nd, 2005

LFS or FreeBSD

I’ve been working with the LFS software for about a week now, and even though I think it is a superior direction, I’m currently thinking of using FreeBSD instead.

Each time I try to build LFS it takes me 5 to 9 hours to get to the point where I can test it, which is turning out to be once a day. I’m going to keep working on getting it right, but until then I’m going to set up my test system using FreeBSD. I’m currently downloading the version 5.4 ISO images and I hope to have my test computer up and running this evening.

I’ll keep working on the LFS system with one of my slower systems, which takes 9 hours to get to the failure point. So I expect that progress there will be quite slow ….

[tag]Linux From Scratch, LFS, FreeBSD[/tag]

August 20th, 2005

The importance of DNS in anti-spam technology

As I’ve been thinking about the different things that I want to check on each email message, I keep noticing that the most important technology for checking information is DNS. RBLs, SPF, and even checking that the domain sending the email exists and has an MX record all rely on DNS.

Since the first four or five things that I want to check on each email message relate directly to DNS, the DNS server will need to perform well and cache information, and the DNS client doing the queries will need to perform well too.
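To show why everything leans on DNS, here is a small Python sketch of an RBL check with a client-side cache. An RBL lookup is just a DNS query for the sender’s IPv4 address with its octets reversed, followed by the list’s zone. The class name and the zone `rbl.example.org` are made up for illustration, and a real cache would also need to honor DNS TTLs, which this sketch ignores.

```python
import ipaddress
import socket


def rbl_query_name(ip: str, zone: str) -> str:
    """Build the name an RBL expects: reversed IPv4 octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


class CachingRblClient:
    """Hypothetical helper that remembers answers so repeats skip DNS."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve  # injectable so it can be faked in tests
        self._cache = {}

    def is_listed(self, ip: str, zone: str) -> bool:
        name = rbl_query_name(ip, zone)
        if name not in self._cache:
            try:
                self._resolve(name)        # any answer means "listed"
                self._cache[name] = True
            except OSError:                # lookup failed: not listed
                self._cache[name] = False
        return self._cache[name]
```

For example, `rbl_query_name("192.0.2.1", "rbl.example.org")` gives `1.2.0.192.rbl.example.org`, and asking about the same address twice only costs one DNS query.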

So I am currently looking into DNS technologies to see which one I am most interested in using.

August 19th, 2005

Slow progress

Alright, even with ALFS my system is taking 5+ hours to create each time I compile it. I’m still working through the (lack of) documentation, but I still think this is the best route to go so far.

While I’m waiting for the rebuilds, which I get 1 or 2 of a day right now, I’m working on my ALFS and thinking about how I want to use Beyond LFS (BLFS) as well. I know that the BLFS profiles are going to be of great help for Apache and a few other packages that I’m going to need, but I will most likely have to create a few of my own profiles for things like CLisp.

I’ve also been putting some more thought into the email meta data, a little into what I want to have in it and a bit into the structure. I’m not sure if I mentioned that I want it to end up in XML, but I do. Even though I always intend to keep the original MIME email messages (I’ll need them to pass to other systems), the anti-spam system that I am creating will use the XML meta data internally.
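I have not settled on a structure yet, so the following is only a guess at what the XML meta data might look like, written as a Python sketch using the standard library’s `xml.etree.ElementTree`. The element and attribute names (`message`, `test`, `type`, `name`, `triggered`) are placeholders, not a final schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_metadata(message_id: str, results: List[dict]) -> str:
    """Serialize per-message test results as a small XML document."""
    root = ET.Element("message", {"id": message_id})
    for result in results:
        ET.SubElement(root, "test", {
            "type": result["type"],        # e.g. "rbl", "spf", "content"
            "name": result["name"],
            "triggered": "true" if result["triggered"] else "false",
        })
    return ET.tostring(root, encoding="unicode")
```

The original MIME message would be stored alongside this and left untouched.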

August 18th, 2005

More on Linux From Scratch

I’ve been working with Linux from scratch for a few days now and I’m still impressed.

Today, after getting quite bored unpacking and compiling software, I decided to look at the automatic LFS profiles.

The LFS LiveCD has a profile on it set up to compile all the packages, just like the instructions say, and I’m even more impressed now than I was when I was doing everything by hand.

With just a few changes in a few scripts you can set up a PC to build an LFS system, and since the configuration files are all XML based, you can get the automatic Linux from scratch software to install just about anything else as well.

I can now imagine how one person could create a few scripts, make their own LFS LiveCD and be able to build any type of system that they wanted. Sure, it may take a few hours or overnight to get the results you want, nowhere near as fast as installing from a standard distribution, but from what I can see it will be more than worth it.

In fact you could create a single LFS LiveCD with multiple profiles and be able to recreate an entire server room in a day or two. I’m impressed and I’m going to be basing my servers off of this technology. Anytime I need to add capacity I will be able to boot off of a CD and run a few scripts, then I will have a server configured to my specifications and I’ll be able to support more servers by myself than ever before :-)

August 17th, 2005

Email Meta-Data

I’m still working on setting up my development system. It’s taking longer than I wanted, partially because I keep leaving to go do real work :-)

As I’ve been creating this system I’ve started to think about how I want my anti-spam solution to be different. When it comes down to it I want two things that I have not seen anywhere else: flexibility and control. I want to be able to know every reason why any particular email message might get blocked as spam and I want the end users to be able to see those reasons as well.

To that end I have decided that the system needs to collect data about each email message, and only once all the data is collected will the message be classified as spam or not. This meta data about each email message will include RBL data, SPF data and test data on the content of the message. Each message that passes through the system will collect as much data as possible and will go through every test, even if it is already considered to be spam.
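The important detail is that no test short-circuits the others. A minimal Python sketch of that idea follows; the function name is hypothetical, and the final rule (spam if any test triggers) is only a stand-in for whatever decision logic ends up being used.

```python
from typing import Callable, List, Tuple


def run_all_tests(message: str,
                  tests: List[Tuple[str, Callable[[str], bool]]]):
    """Run every test on the message and decide only after all have run."""
    results = []
    for name, test in tests:
        # No early exit: later tests still run after an earlier one triggers.
        results.append({"name": name, "triggered": bool(test(message))})
    # Placeholder decision rule, assumed for this sketch only.
    is_spam = any(result["triggered"] for result in results)
    return is_spam, results
```

The full list of results is what gets stored as meta data, so the user can see every reason and not just the first one.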

The reason for this is to better identify and describe what a spam message is, and to give feedback to the users on why a message might be a false positive or pass through the system when it truly is spam.

In my time working with spam filters they have tended to operate as black boxes that give no data back to the end user. The data that some of them do give back is near useless in describing an email message’s properties. I want to solve this part of the problem and give the users the tools to create better spam filters.

If a user sees that a spam message got through the filters, but it might not have gotten through if a new RBL was added, then the user will have the ability to add that RBL.

I also want to create reports to give an idea of exactly which features produce the best combination of filters. Perhaps filters would give fewer false positives if they required more than 3 RBLs to be triggered, or an exact combination of two of them. At this point I don’t have that information, but I plan on creating a way to get it.
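Once the meta data exists, a report like this becomes a simple calculation. The Python sketch below computes the false positive rate if a message were blocked whenever at least a given number of RBLs triggered. The record layout (`is_spam`, `rbls`) is an assumed shape for the collected data, not something that exists yet.

```python
from typing import List


def false_positive_rate(records: List[dict], min_rbl_hits: int) -> float:
    """Share of good messages that would be blocked at this threshold."""
    good = [r for r in records if not r["is_spam"]]
    if not good:
        return 0.0
    blocked = [r for r in good if len(r["rbls"]) >= min_rbl_hits]
    return len(blocked) / len(good)
```

Running it for thresholds of 1, 2, 3 and so on over the same records would show how quickly false positives drop as more RBLs are required to agree.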

I also hope to release this information to the public in very generalized reports. Information like the best RBL and how many domains are really using SPF.

In any case, all of this is built on the idea of creating meta data to describe each email message as it passes through this system. So this will be a large portion of the core of this system, which will in the end give the users the flexibility and control that I know I want from my spam filters.

August 14th, 2005

Initial setup

The first thing I am going to need for this process as a whole is a development environment. I am choosing Linux as the platform for the entire project and I will be using open source packages wherever I can, which should be most of the time.

Many times I will be rewriting software and jumping through hoops that may not seem to be necessary, but my long term plan is to have a fully integrated system, which I do believe will require reinventing the wheel (again) a few times.

I found a project called Linux from scratch that I am going to try to use for the base Linux systems that I need. My biggest problem with most Linux distributions has always been that they are overblown and take up more resources than they need to in order to accomplish the simple tasks that I want my servers for.

When I want a web server all I really need on it is Apache and a few extras to connect to the database server. When I want a database server I want MySQL and not much more on the system, so the idea of Linux from scratch is exactly what I have been looking for for longer than I can imagine :-)

So I’m off to build my LFS development environment …

August 13th, 2005

Time to get to work ….

For nearly a year now (maybe longer) I’ve been putting off the idea that is the basis for this site: creating an online anti-spam, anti-virus mail forwarding service aimed at the lower end of the market.

I think it has finally come down to the fact that I need to create this service as much for myself as for anyone else who would want to use it.

While I know that there are many different services that are already offering the exact feature set that I have planned, the real innovations will be the user interface and the reporting features. I am also planning on creating a service that will allow both POP3 and IMAP functionality instead of just mail forwarding.

… and all of this will be created from scratch. While I know that solutions like the mail toaster package open source software together to create a mail server that accomplishes all of the goals that I have, the real problem is interoperability. They work well enough cobbled together, but I want an elegant solution that will be expandable in the future as well.

I still have some hacking to do before I get moving much farther on this, but I think putting pen to paper would be motivational, so to speak.

August 8th, 2005

RSS and spam?

I’ve been doing a lot of work on RSS the past few weeks. I got my RSS feeds working and validated, I’ve installed and love my first RSS reader, and I’m starting to tell my friends and clients about RSS in general. Which means that I’ve become a fanatic and I’ll drive everyone nuts from now on.

In my reading of a few RSS feeds about RSS I found an interesting thought. The idea is that RSS creates a tightly focused, extremely anonymous feed that gives users control without making them give any details to the owner of the feed. To get this type of data before, a user would have to give up their email address. So if users are not giving their email addresses out as much, one would hope spam would end up going down, or at least not increase.

On the flip side, to properly validate an RSS feed, you need to have an email address. So while users may be limiting their exposure to email harvesters the content providers are increasing their exposure.

In addition, the content providers no longer have a direct way to access the users except the RSS feed, which goes to all of them at once.

There are a lot of things that will end up coming from this, but giving the control back to the user and forcing the content providers to produce better content are the two largest ones that I see immediately. So why do they have to produce better content? Because I can delete the RSS feed with a click or two and they have no way to get me back.

So the bottom line for me is that I happily will provide RSS feeds and I will never sign up for a newsletter again. Give me a feed and don’t ask for my email address if you want me to read.
