Painting by Edward Ruscha, Actual Size, 1962
This web page publishes an email filter designed to weed out SPAM (unwanted and unsolicited email). This mail filter is designed for UNIX/Linux shell email accounts. The mail filter is open source and can be adapted for other purposes.
The SPAM filter is "rule based". It relies on a set of rules to recognize what I define as SPAM. For example, as far as I'm concerned any email that only has an HTML section is probably SPAM. The mail filter also uses a user created dictionary of words and phrases. These words and phrases are divided into two categories: "SPAM words" and "kill words". When a SPAM word or phrase is recognized, the email will be moved to a junk_mail file. When a "kill word" or phrase is recognized the email will be deleted or moved to a garbage_mail file, depending on whether the keep_garbage flag is set in the parameter file.
The mail filter can be configured by the user using the parameter file. By setting flags in the parameter file the mail filter may be run in debug mode, in which case it will generate a detailed execution trace.
In the beginning there was ARPAnet, which was originally sponsored by the Defense Advanced Research Projects Agency (DARPA). ARPAnet connected scientists and engineers at government research labs, universities, defense contractors and computer manufacturers. ARPAnet was explicitly non-commercial. Those who insisted on sending commercial email risked getting their accounts canceled. In the beginning, there was no SPAM.
SPAM appeared along with the World Wide Web and general use of the Internet. When I first registered bearcave.com in 1995 there was little if any SPAM. But as the Internet burst into the general consciousness, the hucksters and scammers started to appear, promoting their "products" via SPAM. One of the most infamous, in those early days, was Sanford Wallace, known as "Spamford". He was finally driven from the Internet after he was successfully sued by a large ISP.
Many of the early SPAM lists came from the Network Solutions domain registration database. Network Solutions did nothing to protect this database from abuse by SPAMmers. In fact, it was only when they realized that they could sell the database themselves that they took action. Since I registered bearcave.com through Network Solutions relatively early in the Web's history, the SPAMmers got my e-mail address.
The computer science and engineering community has always had a strong libertarian faction. In the early days of the Web the whole idea of regulating the Internet and the Web was viewed with suspicion and disdain. The Internet, it was believed, would be a libertarian self-regulating paradise. SPAMmers would be driven from the Internet by the vast numbers of people they abused with their spew.
The Internet grew exponentially for a number of years. The huge size of the Internet and the software used by spammers defeated the Internet community dedicated to removing spammers from the Libertarian paradise. In retrospect the whole idea was naive.
For unethical hucksters, SPAM was just too attractive. The volume of SPAM kept growing, even as ISPs dedicated resources to combating SPAM. Since the Libertarian Internet solutions were obviously a failure, I hoped that anti-SPAM laws would succeed, just as junk FAX laws have been successful. But here again I was disappointed. Although laws have been proposed and a few actually passed, the laws have been ineffective. Many laws did not have enforceable penalties that were strong enough to make SPAM expensive for SPAMmers.
SPAM has now become international. Even if the United States is made unattractive to SPAMmers they can always move offshore. For a while I was getting several SPAMs a day in Chinese characters (Big5 character set). I recently saw a SPAM that originated from the Caribbean island of Aruba.
The mail filter published here represents a surrender on my part. Getting a SPAMmer's account or even domain canceled does little good. They pop up again elsewhere like some kind of stinking poisonous mushroom after a shower of acid rain. This e-mail filtering software will obviously not get rid of SPAM, but it will reduce its impact.
For two years I lived in a small rural town outside of Santa Fe, New Mexico. My house was about half a mile down a dirt road from the Santa Fe National Forest. The road continues into the forest as a jeep road. The jeep road goes through the pine trees, toward the Sangre de Cristo mountain range. I loved walking there. The only thing that marred the walk was that at the start of the jeep road "people" dumped garbage. Santa Fe is an area with low rainfall and garbage can survive in this dry climate for decades.
The same morality that leads people to dump garbage on the border of a National Forest is the morality (or lack of it) that animates the selfish minds of SPAMmers. Like those who dump garbage in the forest, SPAMmers are scum. Like the garbage dumpers, a SPAMmer has no regard for others. Their only thought is that SPAM allows them to send their bogus offers and get rich schemes to suckers at a very low cost. SPAMmers are liars ("to be removed send e-mail to notachance@yahoo.com") and in many cases thieves. Only a few idiots will answer a SPAM sent to millions, but apparently this is enough to turn a profit for a SPAMmer.
One example of the sort of "person" who becomes a SPAMmer is Charles F. Childs, who, according to "Avalanche of spam becomes a curse of the Internet" by Jim DeBrosse (Dayton Daily News), is "a former Dayton police officer who was fired in 1996 for selling drugs on the street". Childs likes the SPAMmer lifestyle, because it allows him to live like a drug dealer:
Childs said he doesn't rise most days now until 1 p.m. He works out of his apartment clad in silk kimono and leather slippers until he dresses at 6 p.m. to go out for the evening.
Childs also affects the sartorial splendor of the crack dealer:
Childs sports a gold bracelet and large gold ear loop and said he has his nails done twice a week. During a recent interview, his hand cupped a snifter of brandy.
Although SPAMmers steal vast resources from the Internet community, they face, at most, annoying civil legal repercussions (Childs has been sued by the Federal Trade Commission). So far Childs does not have to worry about prison. Some hoped this would change with the Federal "CAN SPAM act", but to date this law has had no impact.
Yet another SPAMmer is Davis Wolfgang Hawke, who was involved in "white power" and Nazi affiliated groups until it came out that his father was Jewish (Hawke was born Andrew Britt Greenbaum). After being booted out of the "white power" camp, Hawke became a SPAMmer (why am I not surprised). The whole story can be found in Brian McWilliams' Salon article "Meet the spam Nazi" (Salon, July 29, 2003).
The volume of SPAM continues to increase. In addition to all the body part enlargement and drug offers, there is an increasing volume of viruses. Apparently the huge increase in virus traffic is also tied to SPAM. Computers that are successfully infected by SPAMmer viruses become "zombies" that are used to send out SPAM without their owners' knowledge.
This software is in its third generation. The first generation was a very simple, basic piece of software. It used a compiled in list of words that it looked for to identify SPAM. At the time this simple filter was successful in separating most of the primordial SPAM from valid email. However, spammers got more sophisticated, so I wrote a slightly better second generation mail filter, based on the first version. It was this version that was originally published on Bearcave.com, so I refer to it as Version 1.0.
As SPAMmers got more sophisticated in disguising their spew it got to the point where about 50 percent of the SPAM was getting through into my inbox file. The tedium of looking through hundreds of email headers every day prompted me to develop the latest version of the mail filter, which I refer to as version 2.0. Version 2.0 reuses the file input/output code from Version 1.0, but otherwise it is a total rewrite and consists of about three times more source code.
Version 2.0 of the SPAM filter was designed to reduce the onslaught of SPAM produced by a more sophisticated generation of SPAMmers. The mail filter is designed for my wife and me. To avoid any danger of virus infection, we only read email on the Linux system that hosts bearcave.com. We also only use text email readers (Emacs for me and Elm for my wife). This mail filter will probably not be very useful for anyone who does not use UNIX/Linux. Nor will this mail filter be useful to people who use HTML formatted email.
I receive a few hundred emails a day, of which about 90 percent are SPAM. This creates a real challenge in designing an email filter (a challenge that SPAMmers make as difficult as possible). Unavoidably some valid email will be classified as suspect. There is also a non-zero probability that valid email will be discarded if the keep_garbage flag is not set (in practice this probability can be kept low). There is no question that SPAMmers have succeeded in making email more difficult to use. But without a mail filter a few valid emails will be buried in a vast sea of sewage. To deal with the current generation of SPAM the version 2.0 mail filter had the following design objectives:
Parse the email header and email body sections. This is needed to recognize SPAM which SPAMmers have tried to disguise. This allows improper addresses to be recognized and allows the sections to be identified. Properly identifying the email sections allows base64 encoded sections to be properly recognized.
Rapidly process a given email. The mail filter is implemented in C++ (as opposed to Java, Perl or Python, which execute more slowly). The logic in the mail filter also tries to recognize whether an email is valid or SPAM by parsing the header, avoiding the more time-consuming processing of the email body.
Allow the SPAM filter to be configured from a file. The previous version used word lists and controls that were statically compiled into the code. The SpamFilterParams file allows the user to customize the mail filter operation. By adding to the spam_words or kill_words lists filtering can be improved over time.
Support execution tracing to aid debugging of the mail filter. Without execution tracing it can be very difficult to understand what the mail filter is doing, since it runs as an independent process.
Support error reporting in case a runtime error is identified by the mail filter logic.
Reduce the volume of junk_mail headers that have to be sifted through. The mail filter will, optionally, throw away email that is identified as garbage. This seems to reduce the amount of email in the junk_mail file by about a third.
Optionally generate a garbage_trace which outputs part of the email header information for the email that is discarded. This trace can be viewed with the UNIX mail or elm software. If you expected an email and did not receive it, the garbage_trace file can tell you whether the email was discarded.
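The first objective, splitting an email into its header and body so that encoded sections can be recognized, can be illustrated with a minimal sketch. This is not the code from the mail filter source: the ParsedEmail type and the parseEmail and isBase64Encoded functions are hypothetical names, and the sketch only looks at the top-level header rather than at each section of a multi-part email.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for one parsed message (not a type from the
// actual mail filter source).
struct ParsedEmail {
    std::map<std::string, std::string> headers;  // field names are lower-cased
    std::string body;
};

static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Split a raw message at the first blank line. Lines that start with a
// space or tab continue the previous header field.
ParsedEmail parseEmail(const std::string& raw) {
    ParsedEmail email;
    std::istringstream in(raw);
    std::string line;
    std::string lastName;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) break;                         // blank line ends the header
        if (line.compare(0, 5, "From ") == 0) continue;  // mbox separator line
        if ((line[0] == ' ' || line[0] == '\t') && !lastName.empty()) {
            email.headers[lastName] += " " + trim(line);
            continue;
        }
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;        // skip malformed lines
        lastName = toLower(trim(line.substr(0, colon)));
        email.headers[lastName] = trim(line.substr(colon + 1));
    }
    // Everything after the blank line is the body.
    std::ostringstream rest;
    rest << in.rdbuf();
    email.body = rest.str();
    return email;
}

// True if the top-level header declares a base64 transfer encoding.
bool isBase64Encoded(const ParsedEmail& email) {
    const auto it = email.headers.find("content-transfer-encoding");
    return it != email.headers.end() && toLower(it->second) == "base64";
}
```

Because the header is parsed before the body is touched, a decision that can be made from the header alone never pays for scanning the body.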
A variety of techniques have been applied to separate email that people want to receive from the vast flow of SPAM in which it floats. Some of the filtering techniques are based on algorithms which were previously used in machine learning (see the references on Bayesian filtering below). Some SPAM filters apply multiple techniques: rule and dictionary based filtering, combined with Bayesian filtering.
Bayesian statistics and inferencing are fascinating. The application of these techniques to mail filtering is interesting, but may not always be effective. As Bayesian filters have been used more widely, SPAMmers have started to send out SPAM that contains sections of text that differ in each email. This can cause problems for Bayesian filters which will see each email as different, harming the convergence of the Bayesian algorithm which classifies an email as SPAM. Some filters, like DSPAM, seem to be able to avoid this problem. Perhaps by parsing the email into sections they can avoid the "mutating" part.
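To make the concern about padded text concrete, here is a minimal sketch of the naive Bayes combining rule commonly described for Bayesian SPAM filters (for example in Paul Graham's essays). It is not part of this mail filter, the function name is hypothetical, and the per-word probabilities used in the example are invented for illustration.

```cpp
#include <vector>

// Combine per-token SPAM probabilities into one score, treating the
// tokens as independent (the "naive" Bayes assumption):
//   score = prod(p_i) / (prod(p_i) + prod(1 - p_i))
double combineSpamProbabilities(const std::vector<double>& tokenProbs) {
    double spam = 1.0;  // product of p_i
    double ham = 1.0;   // product of (1 - p_i)
    for (double p : tokenProbs) {
        spam *= p;
        ham *= (1.0 - p);
    }
    if (spam + ham == 0.0) return 0.5;  // evidence cancels out completely
    return spam / (spam + ham);
}
```

With two strongly "spammy" words, combineSpamProbabilities({0.99, 0.99}) is above 0.99. Padding the same email with two innocent looking words, {0.99, 0.99, 0.01, 0.01}, pulls the score back to about 0.5, which is one way that varying filler text can blunt a purely statistical filter.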
This mail filter uses a set of rules in an attempt to separate valid email from SPAM. Email is divided into three categories: valid email, suspect email and garbage. Valid email is copied into the email inbox file. Suspect email is copied into the junk_mail file and garbage email may be copied into a garbage_mail file or discarded (depending on whether the keep_garbage flag is set in the SpamFilterParams file). Some of the rules used by this mail filter are outlined below:
Check the To:, Cc: and From: lines for addresses that are always allowed. For example, friends or mailing lists that you've subscribed to. Email with these allowed addresses is marked as valid.
Check the email addresses in the To: and Cc: lines of the email header. If the email contains an invalid domain address (e.g., fatbear@bearcave.com), then it is marked as "garbage".
Check the email From: address for a set of strings from known SPAMmers (e.g., yodude, tailwagging, funbenefits...). If one of these is found, the email is marked as "garbage".
Check to see if the email starts with an HTML section. If it does, mark it as suspect. A large amount of SPAM is HTML encoded and either has a blank text section or no text section. This is one of the features that makes this mail filter unusable for those who receive a lot of valid HTML encoded email.
Check to see if the <html> or <body> HTML tags are found in the text section. If so, the email is marked as suspect. SPAMmers count on the flexibility of email software to translate HTML, even though it is in a text section. This allows them to hide SPAM within HTML tags.
Check to see if the email body is Windoz character encoded. If so, the email is marked as suspect (you can't read this junk on UNIX/Linux anyway).
Check to see if the email contains base 64 (base64) encoded data. Base 64 encoding is used for viruses and other security hacks, such as .pif (program information) files and Microsoft Word and Excel documents which encapsulate executable commands. If a base64 encoded section is recognized and the kill_base64 flag is set in the SpamFilterParams file, the email will be marked as garbage. Email marked as garbage will be put in the garbage_mail file if the keep_garbage flag is set and discarded otherwise. If the kill_base64 flag is not set, the email will be placed in the junk_mail file.
The text section of the email is checked against the spam_words (for suspect email) and the kill_words (for email that will be marked as garbage). If the keep_garbage flag is not set, email that is marked as garbage will be discarded.
The HTML tags are filtered out of the HTML section and the resulting text is checked for spam_words and kill_words.
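The rules above can be sketched as a short cascade. This is an illustration only, not the logic from the mail filter source: the Verdict and FilterLists types and the function names are hypothetical, the order of the checks is an assumption, and only the From: address is examined where the real filter also looks at To: and Cc:.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

enum class Verdict { Valid, Suspect, Garbage };

// Hypothetical in-memory form of the lists described in the text.
struct FilterLists {
    std::vector<std::string> allowedAddresses;  // always accepted
    std::vector<std::string> killFrom;          // e.g. "yodude"
    std::vector<std::string> spamWords;         // a match marks the email suspect
    std::vector<std::string> killWords;         // a match marks the email garbage
};

static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// True if any entry of `needles` occurs in `haystack`, ignoring case.
static bool containsAny(const std::string& haystack,
                        const std::vector<std::string>& needles) {
    const std::string h = lower(haystack);
    return std::any_of(needles.begin(), needles.end(), [&](const std::string& n) {
        return !n.empty() && h.find(lower(n)) != std::string::npos;
    });
}

// Remove everything between '<' and '>' so that words split up by
// markup are rejoined before the word lists are applied.
std::string stripHtmlTags(const std::string& html) {
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (inTag) {
            if (c == '>') inTag = false;
        } else if (c == '<') {
            inTag = true;
        } else {
            text += c;
        }
    }
    return text;
}

Verdict classify(const std::string& from, const std::string& body,
                 bool bodyIsHtml, const FilterLists& lists) {
    // Header checks come first, so the body is only scanned when needed.
    if (containsAny(from, lists.allowedAddresses)) return Verdict::Valid;
    if (containsAny(from, lists.killFrom)) return Verdict::Garbage;

    const std::string text = bodyIsHtml ? stripHtmlTags(body) : body;
    if (containsAny(text, lists.killWords)) return Verdict::Garbage;
    if (bodyIsHtml) return Verdict::Suspect;  // email starts with an HTML section
    if (containsAny(text, lists.spamWords)) return Verdict::Suspect;
    if (containsAny(text, {"<html", "<body"})) return Verdict::Suspect;  // HTML hidden in text
    return Verdict::Valid;
}
```

Stripping the tags before matching means that an HTML body such as "<b>vi</b>agra" is still caught by a kill word, while the same call on an allowed sender returns Valid without looking at the body at all.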
Compared to the sophistication of SPAM filters like SpamAssassin and DSPAM, this mail filter is pretty modest and is probably much less accurate in separating the email from the dross. For me it has the advantage of running very fast, so that it does not use much in the way of computational resources on the shared system that we read email on. It also can be easily configured for my needs. And compared to tools like DSPAM and SpamAssassin, this mail filter is pretty simple. All that said, I may be guilty of the software engineer's vice of writing something myself where I could have used an available tool.
The version 2.0 mail filter processes all of the email received by my wife and me. It meets our needs. It may not be useful for others who receive valid HTML encoded email or email with base64 encoded sections.
Although I have worked hard to make the mail filter reliable and bug free, using it may result in the loss of important email. If you use the mail filter, you agree to accept the risk that you may lose email. The mail filter is published as Open Source. It is designed for software engineers, not end users.
You must be especially careful with the contents of the kill_words word list in the SpamFilterParams file. Unless the keep_garbage flag is set, email that contains words or phrases in this list will be discarded. Only words and phrases that are highly unlikely to occur in valid email should be included there. For example, I don't get email with the words "valium", "xanax" or "viagra" in it. But if I were a medical doctor this might not be the case.
Since I use the mail filter, I would like to know about any bugs, especially properly formatted emails that crash the mail filter or cause it to put out an error trace. I'd like to fix any bugs, but there are no guarantees. Finally, I "don't do requests". I am not willing to customize this software for you unless you are willing to pay for these changes on a time and materials basis.
The source is packaged as a UNIX tar file. To download click here [mail_filter.tar.gz].
This email filter was written by Ian Kaplan, Bear Products International. It is copyrighted by Ian Kaplan, 2004, www.bearcave.com.
You may use this software for any purpose, with the two conditions listed below.
You must preserve this copyright notice in this software and any software derived from it.
You accept any risk entailed in using this software. By using this software, you acknowledge that you have a sophisticated background in software engineering and understand the way this software functions. You further acknowledge that using this software may result in the irretrievable loss of important e-mail and you alone are responsible for this loss.
If either of these conditions is unacceptable, you may not use any part of this software.
The source for the mail filter is extensively commented. I have written the comments so that they could be processed by the doxygen C++ documentation generator. The doxygen generated web pages can be found here.
To install do the following:
Unzip the software. The software is packaged with GNU tar and is compressed using gzip. To unpackage:
tar xzvf mail_filter.tar.gz
The software will be unpackaged into the directory mail_filter
Build the software. The Makefile is targeted at UNIX make, or, on Linux, pmake. Simply enter the command pmake and the mail_filter executable should be built.
I developed the software on Windoz, so a Windoz Makefile (for nmake) is included as well (see Makefile_win). To build on Windoz enter:
nmake -f Makefile_win
The mail filter parameter file SpamFilterParams must be in your home directory. So copy the parameter file mail_filter/SpamFilterParams to your home directory.
Make a symbolic link from your email file to a file in your local directory named inbox. If your email file is /var/mail/iank then make the following symbolic link:
ln -s /var/mail/iank inbox
Set up mail forwarding. Unfortunately this differs on Linux and on UNIX (e.g., FreeBSD).
Linux
Mail forwarding is done via procmail. On my Linux system I set up the following .procmailrc file:
:0fw
* < 75000
| /home/iank/bin/mail_filter
where the mail_filter program is installed in /home/iank/bin
On FreeBSD UNIX I used a .forward file that contained:
"|~iank/bin/mail_filter"
That should do it. Sorry, I can't provide support if this does not work.
The SpamFilterParams file contains information that the mail filter uses in processing email. As noted above, this file must be in your home directory (the default current working directory which will be used by the mail_filter executable when it runs). The SpamFilterParams file is extensively commented. You should take a look at it and make sure that it contains the proper information before installing the mail filter.
Be very careful with the kill_words list of words and phrases. When strings in this list are found in email, the email will be marked as garbage. If the keep_garbage flag is not set, this email will be discarded. If the keep_garbage flag is set, the email will be copied into the garbage_mail file.
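The effect of the keep_garbage flag can be summarized in a few lines. This is a sketch under stated assumptions rather than the filter's actual code: the Verdict type and the destinationFile function are hypothetical names, and only the file names (inbox, junk_mail, garbage_mail) come from the text.

```cpp
#include <string>

enum class Verdict { Valid, Suspect, Garbage };

// Returns the name of the file an email should be appended to.
// An empty string means the email is discarded.
std::string destinationFile(Verdict verdict, bool keepGarbage) {
    switch (verdict) {
        case Verdict::Valid:
            return "inbox";
        case Verdict::Suspect:
            return "junk_mail";
        case Verdict::Garbage:
            // The only path on which email can be lost for good.
            return keepGarbage ? "garbage_mail" : "";
    }
    return "";  // not reached; keeps compilers quiet
}
```

The garbage branch with keep_garbage unset is the single place where an email disappears, which is why an overly broad kill_words entry is so costly.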
The junk_mail file will contain the email that is marked as "suspect". I use the UNIX mail tool to look at the email headers in this file. Most of the time I delete the email. From the mail tool command line you can delete blocks of email by entering the email number range. For example:
d 1-36
This will delete emails 1 to 36.
Including (or uncommenting) the trace_garbage flag in the SpamFilterParams file will enable tracing of emails that are marked as garbage and removed. This addresses the concern that the mail filter might have discarded a valid email. If this happens you at least have the option of contacting the sender and asking them to resend the email.
The garbage_trace file uses email header format which can be viewed with email tools like elm or UNIX/Linux mail. An example of the format is shown below:
From OWNER-NOLIST-DAILY*iank**bearcave*-com@FunBenefits.com Sat Mar 27 13:30:16 2004
From: FunBenefits
To: iank@bearcave.com
Subject: Sit back relax and get paid for what you think
Date: Sat, 27 Mar 2004 15:19:00 -0600
Reason-its-garbage: Found address "funbenefits", email marked as GARBAGE

From dfrederick_ye@sigma.ie Sat Mar 27 13:37:03 2004
From: "Doris Frederick"
To: iank@bearcave.com
Subject: Internet Pharmacy - Cheap Prices
Date: Sat, 27 Mar 2004 22:38:38 +0100
Reason-its-garbage: pharmacy

From b.YoDude.0-31eafca-5cf6.bearcave.com.-iank@05.moosq.com Sat Mar 27 13:39:54 2004
From: Blue Love Pill
To: iank@bearcave.com
Subject: Your love life is about to be awesome
Date: Sat, 27 Mar 2004 13:37:48 -0800 (PST)
Reason-its-garbage: Found address "yodude", email marked as GARBAGE

From qxmafla@hotmail.com Sat Mar 27 13:49:30 2004
From: "Hallie Donahue"
To: iank@bearcave.com
Subject: g
Date: Sat, 27 Mar 2004 19:48:07 -0200
Reason-its-garbage: m0rtgage

From ambarferrer@hotmail.com Sat Mar 27 14:35:59 2004
From: ambarferrer@hotmail.com
To: iank@bearcave.com
Subject: Re: Your text
Date: Sat, 27 Mar 2004 18:35:43 -0400
Reason-its-garbage: found base64 encoding
The garbage_trace file will never grow much beyond about 50K bytes in size. When its size becomes greater than 50K, the file will be truncated to zero and the new entry will be written. This means that an entry written just as the file reaches 50K will disappear when a new entry is added. The size limit for the garbage_trace file is hard coded into the software (it did not seem worth making this a parameter in the SpamFilterParams file).
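The truncate-then-write behavior described above can be sketched as follows. The function names and the exact value of the limit (50 * 1024 bytes) are assumptions made for illustration; the real constant and code are in the mail filter source.

```cpp
#include <fstream>
#include <string>

// "About 50K bytes" per the text; the exact value is an assumption.
const std::streamoff kTraceLimit = 50 * 1024;

// Current size of the file in bytes, or 0 if it cannot be opened.
std::streamoff fileSize(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return 0;
    return static_cast<std::streamoff>(in.tellg());
}

// Append one trace entry. If the file has already grown past the limit
// it is truncated to zero first, so older entries are lost.
bool appendTrace(const std::string& path, const std::string& entry,
                 std::streamoff limit = kTraceLimit) {
    const bool overLimit = fileSize(path) > limit;
    const std::ios::openmode mode =
        std::ios::binary | std::ios::out |
        (overLimit ? std::ios::trunc : std::ios::app);
    std::ofstream out(path, mode);
    if (!out) return false;
    out << entry;
    return static_cast<bool>(out.flush());
}
```

Because the check happens before the write, the file can exceed the limit by at most one entry, and the entry that pushed it over the limit is the first one to vanish on the next write.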
The massive SPAM problem has gotten a lot of attention in the computer science community. A variety of software solutions have been devised. As time goes on these solutions have become more sophisticated as SPAMmers design their SPAM to get around filters (perhaps applying the remarkable bit of logic that people who filter out SPAM would actually want to read the SPAM that made it through the filter). I've listed a few references to some of the SPAM filtering software. Some of these SPAM filters seem to be designed for ISP level email filtering. In contrast, the filter published here is designed for a single UNIX/Linux shell account.
CRM114 - the Controllable Regex Mutilator by William S. Yerazunis
CRM114 is an open source SPAM filter that uses Markovian discriminator based filtering. After glancing over Dr. Yerazunis' 2001 paper I was not left much wiser about what "Markovian" means in this context.
The DSPAM Project (on NuclearElephant.com). This SPAM filter seems to join Bayesian statistical techniques (which they refer to as Bayesian Dolby) with chained token sets (e.g., SPAM features), which seems to be similar to the Markovian technique above.
SpamAssassin is a super-rule-based SPAM filter. It not only does header and body analysis, but it can also reference blacklists and make use of collaborative filtering.
The DSPAM web page claims that the statistical algorithms that they use are very powerful and are actually better than humans at recognizing spam. They use a technique that they call Bayesian Noise Reduction, which they claim results in very accurate spam filtering. I have not seen a discussion of performance, however. Naively I'd assume that these techniques are computationally expensive.
Paul Graham's Web pages on filtering out SPAM
Paul Graham was one of the pioneers in applying Bayesian techniques to filtering out SPAM.
SpamBayes: A Bayesian anti-spam classifier written in Python
Gary Robinson's Rants: SPAM Detection
This is a link rich web page which discusses the technical details of various SPAM filtering techniques.
Brian McWilliams' Spam Kings blog, hosted on the O'Reilly site.
Brian McWilliams is the author of the book Spam Kings. This blog discusses the latest developments in the world of spammers, spyware hackers and other scum spawned by the Internet era.
Ian Kaplan, April 24, 2001
Revised: February 2006