To combat spam, I have installed a Bayesian filter on my home computer that does an excellent job of recognizing spam and sending it to a special folder I have set up in Outlook. It is very effective: 99.28% correct in determining the type of email coming in (personal, work, spam, etc.). Of the 5000+ total emails I have received over the last 6 weeks, my spam filter has mistakenly classified only 4 good emails as spam, so on that particular measure it is better than 99.9%.
I do, however, spend a few minutes each day combing through the spam folder on the 0.1% chance that a good email has been misidentified. What amazes me are the ineffective techniques spammers use to try to trick Bayesian filters into accepting their messages:
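The two percentages above measure different things (overall classification versus good mail misfiled as spam), so here is a quick check of the second one. This is only a sketch: it takes 5000 as the total, although the real count is somewhat higher, which would only lower the rate.

```python
total_emails = 5000    # lower bound; the actual count is "5000+"
false_positives = 4    # good emails mistakenly filed as spam

false_positive_share = false_positives / total_emails

print(f"good mail misfiled as spam: {false_positive_share:.2%}")
print(f"everything else:            {1 - false_positive_share:.2%}")
```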
* Totally nonsensical subject lines, such as “apprise”. I can see what the spammers are trying to do. They're trying to pick a dictionary word that will likely not be considered spam by any spam filters. Luckily, my spam filter catches those emails all the time.
* Breaking up spam words with funny symbols and characters. How many emails have tried to sell me “V1@gr@” or “C.i.a.l.i.s”? The use of these non-alphabetic characters is a dead giveaway, so that never works either.
* Adding random characters into the subject, as if that's supposed to fool any modern spam filter.
* Leaving the subject field blank, hoping I would be curious enough to open the email just to see what it's about. It's easy to delete an email with a subject advertising “We1ght Loss Pill” but harder to delete one with no subject.
* A subject line that pretends to belong to a non-spam email. “Re: Your order” is a favorite spammer subject line.
* Having only a little bit of spam (like a URL and a few words), and then a whole lot of unrelated non-spam-like text. I've seen spam that contains one bad sentence, and then 10-15 quotations from famous writers filling out the rest of the message.
None of the above types of messages ever make it into my Inbox.
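To see why padding and odd spellings may not help the spammer, here is a minimal sketch of one common way Bayesian filters score a message: look up a spam probability per token, keep only the tokens farthest from a neutral 0.5, and combine them. I don't know how my own filter works internally, so everything here is an assumption made for illustration: the token probabilities, the 0.4 value for unseen words, the cutoff of 15 tokens, and the names `TOKEN_PROBS`, `combine` and `spam_score`.

```python
from math import prod

# Hypothetical per-token spam probabilities a trained filter might hold.
TOKEN_PROBS = {
    "v1@gr@": 0.99,    # assumed: the obfuscated spelling only ever shows up in spam
    "pharmacy": 0.97,
    "click": 0.95,
}
UNKNOWN = 0.4   # assumed probability for tokens the filter has never seen
TOP_N = 15      # assumed number of "most interesting" tokens to combine


def combine(probs):
    """Naive Bayes combination of independent per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)


def spam_score(tokens):
    """Score a message using only the tokens farthest from the neutral 0.5."""
    probs = [TOKEN_PROBS.get(t, UNKNOWN) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:TOP_N])


pitch = ["click", "pharmacy", "v1@gr@"]
# Stand-in for a block of quotations: 200 distinct words the filter has not seen.
padding = [f"quote{i}" for i in range(200)]

bare = spam_score(pitch)
padded = spam_score(pitch + padding)
print(f"bare pitch:   {bare:.4f}")
print(f"with padding: {padded:.4f}")
```

With these made-up numbers the padded message scores a little lower than the bare pitch but still well above 0.9, because only a dozen of the 200 filler words make it into the calculation. Filler words the filter has actually seen in good mail would pull harder than unseen ones, so treat this as a sketch of the idea and not a guarantee.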
The type of spam people are somewhat worried about, since it could theoretically beat any Bayesian filter, is the kind that contains actual text that you (as the individual receiver) might be interested in. Say you are a computer programmer, and you send and receive emails regarding Linux and Oracle all day. Any spam that mentions Linux and Oracle a few times (and neglects to mention high-probability spam keywords) is likely to get through. But the problem is, each person's non-spam keywords will be slightly different. How can a spammer design an email message that gets through my filter without getting rejected by 99.999% of the other Bayesian filters out there?
Luckily for us, in 2004, they can't. In order to do such a thing, you would literally have to have not only a list of 25 million email addresses, but a bunch of context-sensitive keywords relating to each email address. I mean, you would have to link a spam database with Lexis-Nexis (or Google, I guess). They may be able to do something like that in a few years, but I don't see that the technology is available today. And even if it were, there are two main reasons it still might not work:
(a) anyone who employs an aggressive spam filter is not likely to order products advertised in spam, thus defeating the whole purpose; and
(b) in order for spam to be effective you have to mention the product you are advertising (mortgages, drugs, adult web sites, etc.), so those keywords will be automatically caught as spam anyway, no matter how many non-spam keywords are used. I would like to think that a spam email that mentions “online pharmacy” will always be caught regardless of how relevant the rest of the content is to me personally.
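The point about each person's non-spam keywords being different can be sketched with the same kind of scoring. Below, one message is run against two per-user token tables. Both tables, the 0.4 value for unseen words, the 0.9 cutoff, and the names `combine` and `spam_score` are assumptions for illustration, not taken from any real filter.

```python
from math import prod

UNKNOWN = 0.4     # assumed probability for tokens a user's filter has never seen
THRESHOLD = 0.9   # assumed cutoff above which a message is filed as spam

# Hypothetical tables learned from two different people's mail.
programmer = {"linux": 0.02, "oracle": 0.02, "click": 0.9, "offer": 0.9}
someone_else = {"click": 0.9, "offer": 0.9}   # never discusses Linux or Oracle


def combine(probs):
    """Naive Bayes combination of independent per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)


def spam_score(tokens, table):
    """Score a message against one user's token table."""
    return combine([table.get(t, UNKNOWN) for t in set(tokens)])


# A message tailored to the programmer, with only mildly spammy words.
message = ["linux", "oracle", "click", "offer"]

for name, table in [("programmer", programmer), ("someone else", someone_else)]:
    score = spam_score(message, table)
    verdict = "spam" if score > THRESHOLD else "inbox"
    print(f"{name:>12}: {score:.3f} -> {verdict}")
```

With these invented numbers the tailored message slips past the programmer's filter but is caught by the other one, which is why a spammer would need different bait for every recipient.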
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.