I have been rewriting some back-end code lately, including a spell checker that I originally wrote about two years ago.
While updating the code for the new capabilities on my server, I started thinking about using phonetic spell checking to identify spam words even when the characters are not correct.
The best example I have come up with so far is the word viagra and the spam word v|@gr@. Notice the use of the pipe (|) and the at sign (@) to replace the letter i and the letter a, respectively.
The spell checker algorithm that I use only looks at a vowel if it is the first letter of the word, so the special characters standing in for the vowels would not even be evaluated unless they were the very first character.
The phonetic encodings for the two variations of the word are as follows:
viagra = FGR
v|@gr@ = FGR
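To make the idea concrete, here is a minimal sketch of a key function that behaves the way I described. It is not my actual spell checker algorithm: the name phonetic_key and the small consonant map are assumptions made up for illustration, and a real phonetic encoder has many more rules.

```python
VOWELS = set("AEIOU")

# Hypothetical consonant substitutions, only loosely in the spirit of
# phonetic encoders. The real algorithm has many more rules.
CONSONANT_MAP = {"V": "F", "Q": "K", "Z": "S"}


def phonetic_key(word):
    """Return a simplified phonetic key for a word (illustrative only)."""
    key = []
    for i, ch in enumerate(word.upper()):
        if not ("A" <= ch <= "Z"):
            continue  # symbols such as | and @ are never evaluated
        if ch in VOWELS:
            if i == 0:
                key.append(ch)  # a vowel only counts as the first letter
            continue
        code = CONSONANT_MAP.get(ch, ch)
        if not key or key[-1] != code:  # collapse repeated sounds
            key.append(code)
    return "".join(key)


print(phonetic_key("viagra"))
print(phonetic_key("v|@gr@"))
```

Under these assumed rules both spellings reduce to the same key, because the vowels and the symbols that replace them are skipped in exactly the same places.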
After getting the phonetic encoding, it is a matter of evaluating the word against a list of known spam words. The 47,000-word database behind my spell checker has no other words with the phonetic key FGR. That means both versions of the word would be considered spam words, and there is little chance that this key would produce a false positive.
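The check against the word database could look something like the sketch below. The function names and the three sample entries are hypothetical stand-ins for the real 47,000-word database, and the keys are assumed to have been computed already.

```python
from collections import defaultdict


def build_key_index(keyed_words):
    """Map each phonetic key to the set of dictionary words that share it."""
    index = defaultdict(set)
    for word, key in keyed_words:
        index[key].add(word)
    return index


def is_safe_spam_key(key, index, spam_words):
    """True if every dictionary word sharing this key is a known spam word."""
    return index.get(key, set()) <= spam_words


# Hypothetical sample data: (word, precomputed phonetic key) pairs.
dictionary = [("viagra", "FGR"), ("hello", "HL"), ("halo", "HL")]
spam_words = {"viagra"}
index = build_key_index(dictionary)

print(is_safe_spam_key("FGR", index, spam_words))  # no benign word shares FGR
print(is_safe_spam_key("HL", index, spam_words))   # benign words share HL
```

A key that passes this check can be flagged outright. A key that fails it is shared with at least one benign word and needs more careful scoring.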
The only concern I have with this idea is that many words do generate the same phonetic key. This has two consequences in my mind, one good and one bad. First the good: there are far fewer possible phonetic keys than there could ever be word combinations. Fewer phonetic keys means faster lookups.
Now the bad: with fewer phonetic keys you are more likely to get false positives, or the averaged ratings of the words sharing a key would tend toward neutral more than toward bad.
Using Paul Graham's A Plan for Spam (get a copy of Hackers & Painters), the filter would look for the 15 to 20 ‘most interesting’ words in the email message to generate the spam rating. If a spam word has the same phonetic key as a word that is generally benign, then the rating for that key would tend more toward neutral.
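Here is a rough sketch of that scoring step, following the combining formula from A Plan for Spam, where the 'most interesting' tokens are the ones whose spam probability is farthest from a neutral 0.5. The function names are my own for illustration, and the per-key probabilities are assumed to come from some earlier training step that is not shown.

```python
from math import prod


def most_interesting(probs, n=15):
    """Pick the n probabilities farthest from neutral (0.5)."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combined_spam_probability(probs, n=15):
    """Combine per-token spam probabilities into one rating for the message."""
    # Clamp so a single 0.0 or 1.0 cannot force a division by zero.
    top = [min(max(p, 0.01), 0.99) for p in most_interesting(probs, n)]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


# A key used only by spam words keeps a strong rating...
print(combined_spam_probability([0.99, 0.5, 0.5]))
# ...while a key shared with benign words drifts toward neutral.
print(combined_spam_probability([0.6, 0.5, 0.5]))
```

The second call shows the dilution problem: once benign words pull a key's probability toward 0.5, that key is both less likely to be picked as interesting and less able to move the final rating.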
The true benefit is catching the intentional misspellings and use of special characters in spam messages, and that might be enough to justify further investigation into this method.