Spam Free Email

Anti-spam ideas, tools and services

September 27th, 2004

Wiki Pages

I just finished some code over the weekend to add Wiki Pages to spamfreeemail.com. I’m hoping to take advantage of the viral content nature of Wiki pages to advance the information and practices that are here to fight against spam.

You can get to the wiki page by clicking on the link on the left bar or going to http://wiki.spamfreeemail.com.

Remember, the more informed everyone is about spam and security, the less of a problem it will become.

September 9th, 2004

Regular Expression problems

While working with a client, trying to figure out why their spam filter was catching more messages than it should, we realized that we had accidentally added some regular expressions.

How do you accidentally add a regular expression, you ask?

In this case the system admin had gotten tired of seeing the word ‘on|ine’ spelled with a pipe. By copying and pasting that word out of an email message, they inadvertently added a regular expression to their spam filters.

This particular regular expression was pretty bad. It reads as “on OR ine”, so any instance of ‘on’ or ‘ine’ anywhere in a message was getting caught as a spam word. About 60% of all of their email was getting caught by this filter.
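To see how broad that pattern is, here is a small sketch in Python. The sample messages and the `flagged` helper are made up for illustration, and the client's filter may use a different regex engine, but the meaning of the pipe is the same in any common one. Escaping the pasted text (here with `re.escape`) turns the pipe back into a literal character.

```python
import re

# The pattern that ended up in the filter by accident: "on" OR "ine".
accidental = re.compile(r"on|ine")

# Escaping the pasted text makes the pipe a literal character again.
literal = re.compile(re.escape("on|ine"))

# Made-up sample messages, for illustration only.
samples = [
    "Please confirm the online order",
    "The meeting is on Monday",
    "Your machine is ready",
    "Buy on|ine today",
    "See you at lunch",
]

def flagged(pattern, messages):
    """Return the messages in which the pattern matches anywhere."""
    return [m for m in messages if pattern.search(m)]

print(len(flagged(accidental, samples)))  # messages caught by the accidental pattern
print(flagged(literal, samples))          # messages caught by the escaped pattern
```

In this sample the accidental pattern catches four of the five messages, while the escaped version catches only the one that really contains ‘on|ine’.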

Regular expressions can be wonderful things for filtering spam, when you truly mean to use them. For example, say you want to filter the word ‘penny’ and the plural ‘pennies’. (My example to my client was another word that starts with a P and has the same plural issue.) A simple regular expression for this would be /penn(y|ies)/. This reads “anything that starts with ‘penn’ and ends with ‘y’ or ‘ies’”. Note the parentheses: square brackets would create a character class that matches a single character, which is not the same thing.
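A quick way to check that a pattern means what you want is to run it against a few words. The sketch below, in Python with a made-up word list, compares grouped alternation with a bracketed character class, which looks similar but matches exactly one character out of ‘y’, ‘|’, ‘i’, ‘e’ and ‘s’.

```python
import re

# Grouped alternation: "penn" followed by "y" or by "ies".
grouped = re.compile(r"penn(?:y|ies)")

# Character class: "penn" followed by ONE of the characters y, |, i, e, s.
bracketed = re.compile(r"penn[y|ies]")

# Made-up test words, for illustration only.
words = ["penny", "pennies", "penne", "penns"]

def whole_word_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if pattern.fullmatch(w)]

print(whole_word_matches(grouped, words))
print(whole_word_matches(bracketed, words))
```

The grouped pattern accepts ‘penny’ and ‘pennies’ only. The bracketed one accepts ‘penny’, ‘penne’ and ‘penns’, and misses ‘pennies’ as a whole word.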

So be careful with those regular expressions. Know what they really mean and make sure they mean what you want them to.

September 8th, 2004

Phonetic Spam Checker

I have been rewriting some back end code lately, including a spell checker that I originally wrote about two years ago.

As I’m updating the code for the new capabilities on my server, I started thinking about using phonetic spell checking to identify words as spam words even if the characters are not correct.

The best example I have come up with so far is the word viagra and the spam word v|@gr@. Notice the usage of the pipe (|) and the at sign (@) to replace the letter i and the letter a respectively.

The spell checker algorithm that I use only looks at vowels if they are the first letter of the word, so the special characters standing in for the vowels would not even be evaluated unless one of them is the very first character.

The phonetic encoding for the two variations of the word are as follows:

viagra = FGR
v|@gr@ = FGR

After getting the phonetic encoding, it would be a matter of evaluating the word against a list of known spam words. The database of 47,000 words that I have in my spell checker has no other words that use the phonetic key of FGR. That means that both versions of the word would be considered a spam word, and there is little chance that this word would give a false positive.
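The idea can be sketched in a few lines of Python. This is not the encoder from my spell checker: `phonetic_key`, `SOUND_MAP`, `SPAM_KEYS` and `is_spam_word` are hypothetical names, and the rules are a deliberately simplified assumption (ignore non-letters, ignore vowels unless they come first, map V to F, collapse repeats) chosen only so that the example above produces FGR.

```python
VOWELS = set("AEIOU")

# Assumed sound substitutions. Only V -> F is implied by the FGR example;
# a real phonetic encoder has many more rules.
SOUND_MAP = {"V": "F"}

def phonetic_key(word):
    """Build a simplified phonetic key for a word (illustrative only)."""
    key = []
    for index, char in enumerate(word.upper()):
        if not char.isalpha():
            continue  # special characters such as | and @ are never evaluated
        if char in VOWELS and index != 0:
            continue  # vowels only count when they are the first letter
        code = SOUND_MAP.get(char, char)
        if not key or key[-1] != code:
            key.append(code)  # collapse repeated codes
    return "".join(key)

# Hypothetical list of known spam words, stored by phonetic key.
SPAM_KEYS = {phonetic_key("viagra")}

def is_spam_word(word):
    """Check a word against the known spam keys."""
    return phonetic_key(word) in SPAM_KEYS

print(phonetic_key("viagra"), phonetic_key("v|@gr@"))
print(is_spam_word("v|@gr@"), is_spam_word("garden"))
```

Because the lookup is a set membership test on a short key, the disguised spelling costs nothing extra to catch once the clean word is on the list.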

My only concern with this idea is that many words generate the same phonetic key. That has two consequences in my mind, one good and one bad. First the good: there would be far fewer possible phonetic keys than there could ever be word combinations. Fewer phonetic keys means faster lookups.

The bad side is that with fewer phonetic keys you are more likely to get false positives, or the ratings of the words sharing a key would average out toward neutral rather than bad.

Using Paul Graham’s “A Plan for Spam” approach (get a copy of Hackers & Painters), the filter would look for the 15 to 20 ‘most interesting’ words in the email message to generate the spam rating. If a spam word has the same phonetic key as a word that is generally benign, then the rating for that key would tend more toward neutral.
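A rough sketch of that effect, in Python, is below. The combining formula is the one from Graham’s essay (product of the probabilities over that product plus the product of their complements). The function names, the `count` default and all of the probabilities are made up for illustration, and each probability is assumed to lie strictly between 0 and 1.

```python
from math import prod

def most_interesting(probabilities, count=15):
    """Pick the tokens whose spam probability is farthest from a neutral 0.5."""
    ranked = sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:count]

def combined_spam_probability(probabilities):
    """Combine per-token probabilities into one rating for the message."""
    spam = prod(probabilities)
    ham = prod(1 - p for p in probabilities)
    return spam / (spam + ham)

# Made-up numbers: the first token is a clear spam word on its own (0.99),
# but scores only 0.6 once it shares a phonetic key with a benign word.
exact_tokens = [0.99, 0.2, 0.4]
keyed_tokens = [0.60, 0.2, 0.4]

print(combined_spam_probability(most_interesting(exact_tokens)))
print(combined_spam_probability(most_interesting(keyed_tokens)))
```

With these numbers the message rates above 0.9 when the spam word keeps its own score, and about 0.2 once the shared key dilutes it, which is the trade-off to weigh against catching the misspellings.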

The true benefit is to catch the intentional misspellings and usage of special characters in spam messages, and that might be enough to consider further investigation into this method.
