I have been thinking about implementing more corpuses (corpora?) than just the standard Good, Bad and Neutral. Since ClamAV marks some messages as phishing schemes, I have been thinking about creating a corpus out of the messages ClamAV flags as known phishing schemes. Then I could pull out other messages that are similar to them and evaluate those further by hand.

I was also thinking that it might be an effective way to categorize mail by type of content. Sexual content will use a limited number of words, as will profanity. By marking some messages as sexual in nature and rerunning the statistics against that corpus, you could sort mail into almost any category you want.
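A rough sketch of what I mean, in Python. This is not how any particular filter does it; it is a plain naive Bayes classifier with add-one smoothing, one word-count table per corpus. The class name `CorpusClassifier`, the category names and the sample messages are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class CorpusClassifier:
    """Hypothetical multi-corpus classifier: one word-count table per category."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.message_counts = Counter()          # category -> messages trained
        self.vocabulary = set()                  # every word seen in training

    def train(self, category, text):
        """Add one hand-marked message to the corpus for `category`."""
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.message_counts[category] += 1
        self.vocabulary.update(tokens)

    def scores(self, text):
        """Return a log-probability score per category (higher is more likely)."""
        tokens = tokenize(text)
        total_messages = sum(self.message_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Prior: how much of the training mail belongs to this category.
            score = math.log(self.message_counts[category] / total_messages)
            for token in tokens:
                # Add-one smoothing so an unseen word never zeroes the score.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def classify(self, text):
        """Return the best-scoring category (needs at least one trained message)."""
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Made-up training messages, two per corpus.
classifier = CorpusClassifier()
classifier.train("mortgage", "Refinance your mortgage today at a low fixed rate")
classifier.train("mortgage", "Low mortgage rates, refinance now")
classifier.train("travel", "Cheap flights and hotel deals for your vacation")
classifier.train("travel", "Book your flight and hotel today")
classifier.train("phishing", "Verify your account password or your account will be suspended")
classifier.train("phishing", "Urgent: confirm your bank account password now")

print(classifier.classify("refinance at a low rate"))
print(classifier.classify("hotel and flight deals"))
```

With a handful of training messages this is only a toy, but adding a new category is nothing more than training under a new name, which is the appeal of the idea.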

This could pull out 419/Nigerian scams as easily as mortgage offers. Then, with some user preferences and statistics, you could determine what types of mail each user likes and does not like, and further customize their experience.

It might even be an effective way to sort mail in a web-based interface. If users could set up their own categories and specify that they want all mortgage-related email to go into one folder and all travel-related email into another, they would have a much easier way to find messages by topic.
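The folder part could be as simple as a per-user lookup table from category to folder. Here is a minimal Python sketch. The names `sort_into_folders`, `keyword_category` and the folder names are hypothetical, and the keyword matcher is only a stand-in for whatever statistical classifier actually assigns the category.

```python
from collections import defaultdict


def keyword_category(message, keywords):
    """Stand-in categorizer: pick the category sharing the most words with the message."""
    words = set(message.lower().split())
    best, best_hits = None, 0
    for category, terms in keywords.items():
        hits = len(words & terms)
        if hits > best_hits:
            best, best_hits = category, hits
    return best  # None when no category matches at all


def sort_into_folders(messages, folder_rules, categorize, default_folder="Inbox"):
    """File each message under the folder the user chose for its category."""
    folders = defaultdict(list)
    for message in messages:
        category = categorize(message)
        # Categories the user has no rule for fall back to the default folder.
        folders[folder_rules.get(category, default_folder)].append(message)
    return dict(folders)


# Made-up keyword sets and one user's folder preferences.
KEYWORDS = {
    "mortgage": {"mortgage", "refinance", "loan"},
    "travel": {"flight", "hotel", "vacation"},
}
folder_rules = {"mortgage": "Mortgage Offers", "travel": "Travel"}

messages = [
    "Refinance your mortgage today",
    "Hotel and flight deals",
    "Lunch on Friday?",
]
sorted_mail = sort_into_folders(
    messages, folder_rules, lambda m: keyword_category(m, KEYWORDS)
)
for folder, items in sorted_mail.items():
    print(folder, items)
```

Keeping the rules separate from the categorizer means each user can rename or merge folders without anything being retrained.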

Of course this is just me blathering about things that I don’t have time to work on, but they are fascinating ideas on how to sort and classify messages …