#include <MailHeader.h>
| Public Member Functions | |
| MailHeader (SpamParameters &p, HeaderInfo &headInfo) | |
| MailFilter::classification | parseContentType (FILE *fp, char *buf) | 
| const char * | getBoundaryStr () | 
| MailFilter::classification | checkHeader (FILE *fp) | 
| Private Member Functions | |
| void | saveBoundary (const char *pBound) | 
| void | fillInSections (FILE *fp) | 
| MailFilter::classification | checkReceived (const char *buf, FILE *fp, size_t line) | 
| MailFilter::classification | checkSubject (const char *buf, FILE *fp) | 
| MailFilter::classification | checkFrom (const char *buf) | 
| bool | addrContinues (const char *buf) | 
| MailFilter::classification | checkDomainAddrs (const char *domainName, const char *pBuf) | 
| MailFilter::classification | checkAddressSection (const char *buf, FILE *fp) | 
| Private Attributes | |
| Logger | log | 
| SpamParameters & | mParams | 
| HeaderInfo & | mHeadInfo | 
| bool | foundValidAddress | 
| char | boundaryStr [128] | 
| char | pushBackBuf [1024] | 
| char * | pPushBack | 
The pushBackBuf
When processing the email header, it is sometimes necessary to read the next line to know what to do: for example, to know whether the header has ended with a blank line, or whether the subject or another part of the header continues on the next line. Reading ahead can also mean reading too far, consuming a line that still has to be processed. The pushBackBuf deals with this issue. The look-ahead line is read into the pushBackBuf. When subsequent logic needs another line, it checks the pushBackBuf (to see if it is non-zero length) before reading another line.
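The push-back idea described above can be sketched in isolation. This is a minimal illustration, not the MailHeader code: the LineReader class and its member names are hypothetical, and it keeps a boolean flag where the class itself uses the pPushBack pointer.

```cpp
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper, not part of MailHeader: a line reader with a
// one-line push-back slot, mirroring the pushBackBuf/pPushBack pair.
class LineReader {
public:
  explicit LineReader(FILE *fp) : mFp(fp), mPending(false) {
    assert(fp != 0);
    mBuf[0] = '\0';
  }

  // Return the next line, or 0 at end-of-file. A pushed-back line is
  // handed out again before anything new is read from the file.
  const char *next() {
    if (mPending) {
      mPending = false;
      return mBuf;
    }
    return fgets(mBuf, sizeof(mBuf), mFp);
  }

  // Mark the line most recently returned by next() as unconsumed.
  // Only meaningful after next() has returned a non-null line.
  void pushBack() { mPending = true; }

private:
  FILE *mFp;
  char mBuf[1024];
  bool mPending;
};
```

A caller that reads one line too far (for example, a line containing a colon while scanning a multi-line subject) calls pushBack(), and the next call to next() returns that same line again.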
The boundaryStr
A boundary string is used to separate the sections of a MIME formatted email. Email software (like this email filter) can skip between sections by looking for the boundary string. The boundary string is defined in the email header. The MailHeader code saves the boundary string (if it exists) in the boundaryStr buffer. The boundary string is then used in processing the email body.
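As an illustration of how a saved boundary string can be used when scanning the body, here is a small sketch. It assumes the RFC 2046 convention that a delimiter line starts with two hyphens followed by the boundary string. The function name isBoundaryLine is hypothetical and is not taken from the mail filter source.

```cpp
#include <assert.h>
#include <string.h>

// Hypothetical helper, not taken from the mail filter source.
// Returns true when `line` is a MIME delimiter for `boundary`, i.e. it
// starts with "--" followed by the boundary string (RFC 2046). The
// closing delimiter, which carries a trailing "--", also matches.
bool isBoundaryLine(const char *line, const char *boundary)
{
  assert(line != 0 && boundary != 0);
  const size_t len = strlen(boundary);
  if (len == 0) {
    return false;   // no boundary string was saved from the header
  }
  if (strncmp(line, "--", 2) != 0) {
    return false;
  }
  return strncmp(line + 2, boundary, len) == 0;
}
```

Body-processing code could call this with the value returned by getBoundaryStr() to decide where one section ends and the next begins.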
Definition at line 57 of file MailHeader.h.
| 
 | 
| addrContinues Return TRUE if it looks like the "To:" is spread across multiple lines. The "To:" continues on another line when the line ends with either a comma or a single quote followed by a double quote. This function looks at the end of the line. If the line ends with: 
 | 
| 
 | ||||||||||||
| Check either the To: or Cc: sections. The "to_list" addresses, defined in SpamFilterParams, will usually be mailing lists (for example, the ANTLR antlr-interest mailing list that is distributed via Yahoo). If one of these addresses (or part of an address) is found, then the function will return EMAIL and no further content checking will be done by the mail filter. This function checks for the domain name specified in the my_domain section of SpamFilterParams. If the domain is found it then checks to see if the user (i.e., the string to the left of the @) is listed in the valid_users section. This function limits user names to strings consisting of 'a'..'z' and '0'..'9' (case insensitive) plus the underscore character. Note that only one domain is allowed in the my_domain section. Before moving to my current ISP I would get any e-mail addressed to bearcave.com. This was a problem when the site bearcave.org existed since a number of bearcave.org users made the mistake of using .com when they should have used .org. Checking for valid users marks as garbage any email to a user that is not valid. Definition at line 334 of file MailHeader.C. References addrContinues(), checkDomainAddrs(), SpamParameters::getSection(), and Logger::log(). Referenced by checkHeader(). 
 00335 {
00336   log.log(Logger::DEBUG, "checkAddressSection", "enter");
00337 
00338   MailFilter::classification klass = MailFilter::UNKNOWN;
00339   vector<const char *> toAddrs = mParams.getSection(SpamParameters::to_list);
00340   vector<const char *> myDomain = mParams.getSection(SpamParameters::my_domain);
00341 
00342   const char *domainName = 0;
00343   if (myDomain.size() > 0)
00344     domainName = myDomain[0];
00345   
00346   const char *pBuf = buf;
00347   const size_t toAddrLen = toAddrs.size();
00348 
00349   char localBuf[1024];
00350   size_t i;
00351   bool done;
00352   do {
00353     done = true;
00354 
00355     SpamUtil().toLower(localBuf, pBuf, sizeof(localBuf));
00356     for (i = 0; i < toAddrLen; i++) {
00357       if (strstr(localBuf, toAddrs[i]) != 0) {
00358         const char *hit = toAddrs[i];
00359         char msg[128];
00360         sprintf(msg, "found \"%s\", marked as EMAIL", hit );
00361         log.log(Logger::DEBUG, "checkAddressSection", msg );
00362         klass = MailFilter::EMAIL;
00363         break;
00364       }
00365     } // for
00366 
00367     if (klass == MailFilter::UNKNOWN && domainName != 0) {
00368       klass = checkDomainAddrs( domainName, localBuf );
00369     }
00370 
00371     if (klass == MailFilter::UNKNOWN) {
00372       if (addrContinues(localBuf)) {
00373         if ((pBuf = fgets(localBuf, sizeof(localBuf), fp)) != 0) {
00374           done = false;
00375         }
00376       }
00377     }
00378 
00379   } while (!done);
00380 
00381   log.log(Logger::DEBUG, "checkAddressSection", "exit");
00382 
00383   return klass;
00384 } // checkAddressSection
 | 
| 
 | ||||||||||||
| Check the string for a user name associated with domainName. The domain name is defined in the my_domain section of SpamFilterParams. Valid user names for this domain are defined in the valid_users section. If valid user names are found then the foundValidAddress flag is set to true. If there are users that are not in the valid_users list then the classification GARBAGE is returned. Otherwise, UNKNOWN is returned (UNKNOWN is returned when a valid user is found as well). Definition at line 235 of file MailHeader.C. References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason(). Referenced by checkAddressSection(). 
 00237 {
00238   log.log(Logger::DEBUG, "checkDomainAddrs", "enter");
00239 
00240   assert( ((domainName != 0) && (pBuf != 0)) );
00241 
00242   vector<const char *> validUsers = mParams.getSection(SpamParameters::valid_users);
00243   const size_t numUsers = validUsers.size();
00244   MailFilter::classification klass = MailFilter::UNKNOWN;
00245   
00246   bool done = false;
00247   size_t domainNameLen = strlen( domainName );
00248   const char *domainPtr = strstr(pBuf, domainName );
00249   while (domainPtr) {
00250     if (domainPtr > pBuf+2) {
00251       domainPtr--;
00252       if (*domainPtr == '@') {
00253         // find the start and end of the user name
00254         const char *endPtr = domainPtr;
00255         domainPtr--;
00256         const char *beginPtr = domainPtr;
00257         while (beginPtr >= pBuf && isalnum( *beginPtr ))
00258           beginPtr--;
00259         if (!isalnum(*beginPtr)) {
00260           beginPtr++;
00261         }
00262 
00263         // Now check to see if the user name is in the valid_users list
00264         // Note that this function is used for both To: and Cc:, so 
00265         // foundValidAddress could have been set in a previous call.
00266         bool foundInList = false;
00267         for (size_t i = 0; i < numUsers; i++) {
00268           const char *word = validUsers[i];
00269           if (SpamUtil().match(beginPtr, endPtr, word)) {
00270             foundValidAddress = true;
00271             foundInList = true;
00272           }
00273         } // for
00274 
00275         if (!foundInList) {
00276           char msg[128];
00277           char user[128];
00278           size_t ix = 0;
00279           for (const char *pCh = beginPtr; pCh < endPtr; pCh++, ix++) {
00280             user[ix] = *pCh;
00281           }
00282           user[ix] = '\0';
00283           sprintf(msg, "Non-valid user \"%s\", email marked as GARBAGE", 
00284                   user );
00285           mHeadInfo.reason( msg );
00286           log.log(Logger::DEBUG, "checkDomainAddrs", msg );
00287           klass = MailFilter::GARBAGE;
00288         }
00289         // endPtr points to the '@'
00290         pBuf = (endPtr + 1);
00291       }
00292     }
00293     if (klass == MailFilter::UNKNOWN) {
00294       pBuf = pBuf + domainNameLen;
00295       domainPtr = strstr(pBuf, domainName);
00296     }
00297     else {
00298       break;  // exit the while loop
00299     }
00300   } // while
00301 
00302   log.log(Logger::DEBUG, "checkDomainAddrs", "exit");
00303 
00304   return klass;
00305 } // checkDomainAddrs
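The user-name scan in the listing above can be illustrated with a simplified, standalone sketch. It is not the MailHeader implementation: the names usersForDomain and isValidUser are hypothetical, std::string is used in place of raw pointers, and the input line is assumed to have been converted to lower case already.

```cpp
#include <assert.h>
#include <ctype.h>
#include <string>
#include <vector>

// Hypothetical sketch: collect the user names that appear immediately
// to the left of "@<domain>" in an already lower-cased address line.
std::vector<std::string> usersForDomain(const std::string &line,
                                        const std::string &domain)
{
  assert(!domain.empty());
  std::vector<std::string> users;
  const std::string needle = "@" + domain;

  std::string::size_type at = line.find(needle);
  while (at != std::string::npos) {
    // Walk left from the '@' over alphanumeric characters, the same
    // character class the listing above tests with isalnum().
    std::string::size_type begin = at;
    while (begin > 0 &&
           isalnum(static_cast<unsigned char>(line[begin - 1]))) {
      begin--;
    }
    if (begin < at) {
      users.push_back(line.substr(begin, at - begin));
    }
    at = line.find(needle, at + needle.size());
  }
  return users;
}

// True when `user` is one of the entries in `validUsers`.
bool isValidUser(const std::string &user,
                 const std::vector<std::string> &validUsers)
{
  for (std::vector<std::string>::size_type i = 0; i < validUsers.size(); i++) {
    if (validUsers[i] == user) {
      return true;
    }
  }
  return false;
}
```

A caller would mark the message as GARBAGE as soon as usersForDomain() yields a name for which isValidUser() returns false, which corresponds to the early exit from the while loop in the listing.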
 | 
| 
 | 
| Check to see if an e-mail address in the from_address section of the SpamParameters is in the "From:" field. If a "from_address" string is found then it is valid email and no further checking will be done by the mail filter. This allows people you know to send you e-mail that may have spam or kill words in it. The "From" is also checked against "from_kill" strings. This allows you to mark as garbage email from frequent spammers. For example, when I developed this software there was a spammer who used YoDude in the from line and another that used "TailWaggingOffers". Definition at line 123 of file MailHeader.C. References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason(). Referenced by checkHeader(). 
 00124 {
00125   MailFilter::classification klass = MailFilter::UNKNOWN;
00126   log.log(Logger::DEBUG, "checkFrom", "enter");
00127 
00128   vector<const char *> fromAddrs = mParams.getSection(SpamParameters::from_address);
00129   vector<const char *> killAddrs = mParams.getSection(SpamParameters::from_kill);
00130 
00131   char msg[128];
00132   char from[256];
00133 
00134   SpamUtil().toLower(from, buf, sizeof(from));
00135 
00136   size_t len;
00137   len = killAddrs.size();
00138   for (size_t i = 0; i < len; i++) {
00139     if (strstr(from, killAddrs[i]) != 0) {
00140       sprintf(msg, "Found address \"%s\", email marked as GARBAGE", 
00141               killAddrs[i] );
00142       mHeadInfo.reason( msg );
00143       log.log(Logger::DEBUG, "checkFrom", msg );
00144       klass = MailFilter::GARBAGE;
00145       break;
00146     }
00147   }
00148 
00149   if (klass == MailFilter::UNKNOWN) {
00150     len = fromAddrs.size();
00151     for (size_t i = 0; i < len; i++) {
00152       if (strstr(from, fromAddrs[i]) != 0) {
00153         sprintf(msg, "Found \"from address\" \"%s\", email marked as EMAIL", 
00154                 fromAddrs[i] );
00155         log.log(Logger::DEBUG, "checkFrom", msg );
00156         klass = MailFilter::EMAIL;
00157         break;
00158       }
00159     } // for
00160   }
00161 
00162   log.log(Logger::DEBUG, "checkFrom", "exit");
00163   return klass;
00164 } // checkFrom
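The precedence in the listing above (the from_kill strings are tested first, and the from_address strings only if nothing was killed) can be sketched on its own. The enum and function names below are hypothetical stand-ins, and the "From:" text is assumed to be lower case already.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for MailFilter::classification.
enum Classification { UNKNOWN, EMAIL, GARBAGE };

// True when any non-empty pattern occurs as a substring of `text`.
static bool containsAny(const std::string &text,
                        const std::vector<std::string> &patterns)
{
  for (std::vector<std::string>::size_type i = 0; i < patterns.size(); i++) {
    if (!patterns[i].empty() &&
        text.find(patterns[i]) != std::string::npos) {
      return true;
    }
  }
  return false;
}

// The kill list wins over the allow list, as in the listing above.
Classification classifyFrom(const std::string &from,
                            const std::vector<std::string> &killList,
                            const std::vector<std::string> &allowList)
{
  if (containsAny(from, killList)) {
    return GARBAGE;
  }
  if (containsAny(from, allowList)) {
    return EMAIL;
  }
  return UNKNOWN;
}
```

One consequence of this ordering is that a "From:" line matching both lists is classified as GARBAGE, so kill strings should be chosen so they do not occur in addresses you want to accept.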
 | 
| 
 | 
| Rules for processing the e-mail header: The "To:" and "Cc:" 
 
 The "From:" and "From" At least in the case of e-mail on Linux there is a "From" line which leads the e-mail file. This line has the following format: 
 
 
 Check the subject line for spam and kill words (e.g., penis, xanax). Processing of the email header ends when a blank line is found (all email headers must end with a blank line). When it comes to recognizing "spam_words" and "kill_words" in the subject line, the code below relies on the fact that the "From:" line precedes the "Subject:" line. This allows your lover, whose address will presumably be in the from_address part of the SpamFilterParams, to send you e-mail with the word "penis" in the subject, without having the mail discarded if "penis" is in the kill_words list. The subject line and other parts of the email header are copied into a HeaderInfo object. This information is used in generating debug trace information, the garbage trace (for discarded email), and error messages. Many emails (especially those that are MIME formatted) will have a boundary line (which usually follows the "Content-Type:" line). The boundary line has the format 
 | 
| 
 | ||||||||||||||||
| The received line in the email header spans multiple lines. The end is determined by the next line that contains a colon: something like "Message-ID:" or "From:" (or, perhaps, another "Received:"). Right now not much is done with this section except to search for the word "forged". If a section is added to the SpamFilterParams file for spammer addresses, then this function could recognize these. Right now it is not clear that this would be very profitable, since spammers move around so much. For use in future checking, the buf pointer points to the character that follows the colon. One complexity introduced by this function is that it reads the next line to see if this line has a colon header in it. There is no way to "unget" a line. So as a hack around this there is a "pushBackBuf" in the class (can you say global variable by another name) which contains this line. If the pPushBack pointer points at this buffer (i.e., is not NULL) then the pushBackBuf line will be used rather than reading a new line from stdin. Some mailers add a "may be forged" note on one of the received lines. This seems to happen when the address is given via SMTP, rather than in the "To:" line. In this case, the mail should be marked as SPAM (i.e., SUSPECT). The "line" argument is for debugging. Definition at line 416 of file MailHeader.C. References Logger::log(). Referenced by checkHeader(). 
 00419 {
00420   log.log(Logger::DEBUG, "checkReceived", "enter");
00421   MailFilter::classification klass = MailFilter::UNKNOWN;
00422 
00423   const char *pColon;
00424   do {
00425     pPushBack = fgets(pushBackBuf, sizeof(pushBackBuf), fp);
00426     if (pPushBack != 0) {
00427       if ((klass == MailFilter::UNKNOWN) && (strstr(pPushBack, "forged") != 0)) {
00428         klass = MailFilter::SUSPECT;
00429       }
00430       pColon = SpamUtil().findColon( pPushBack );
00431     }
00432   } while (pPushBack != 0 && !pColon);
00433 
00434   log.log(Logger::DEBUG, "checkReceived", "exit");
00435   return klass;
00436 } // checkReceived
 | 
| 
 | ||||||||||||
| Check the email header subject line for spam or kill words or phrases. Apparently some emails may have multi-line subjects. So after the subject line is found, we check to see if there is another line with a colon (something like "Reply-To:" for example) or if the line is blank (indicating an end to the header). In both cases we "push back" the line. If the line does not have a colon or is not blank, we skip it (since this is the subject line continuing on the next line). The subject line should not continue on more than one line after the "Subject:" line or something is really wrong with the email format. Definition at line 68 of file MailHeader.C. References Logger::log(), and HeaderInfo::reason(). Referenced by checkHeader(). 
 00069 {
00070   log.log(Logger::DEBUG, "checkSubject", "enter");
00071 
00072   char msg[256];
00073   char subject[256];
00074   char foundStr[128];
00075 
00076   // convert to lower case
00077   SpamUtil().toLower(subject, buf, sizeof(subject));
00078 
00079   foundStr[0] = '\0';
00080   MailFilter::classification klass = SpamUtil().checkLine(subject, 
00081                                                           mParams, 
00082                                                           foundStr, 
00083                                                           sizeof(foundStr));
00084   
00085   if (klass == MailFilter::SUSPECT || klass == MailFilter::GARBAGE) {
00086     if (klass == MailFilter::SUSPECT) {
00087       sprintf(msg, "Found \"spam\" word \"%s\", email marked as SUSPECT", 
00088               foundStr );
00089     }
00090     else if (klass == MailFilter::GARBAGE) {
00091       mHeadInfo.reason( foundStr );
00092       sprintf(msg, "Found \"kill\" word \"%s\", email marked as GARBAGE", 
00093               foundStr );
00094     }
00095     log.log(Logger::DEBUG, "checkSubject", msg );
00096   }
00097 
00098   pPushBack = 0;
00099   char *pBuf;
00100   if ((pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
00101     if (SpamUtil().isBlankLine(pBuf) || SpamUtil().findColon(pBuf) != 0) {
00102       pPushBack = pBuf;
00103     }
00104   }
00105   
00106   log.log(Logger::DEBUG, "checkSubject", "exit");
00107   return klass;
00108 } // checkSubject
 | 
| 
 | 
| Fill in the email header in the mHeadInfo class variable. The mHeadInfo object is used to encapsulate header information that is used in generating debug log messages and in generating the garbage trace (if the trace_garbage flag is set). Processing the email stops as soon as it can be determined that the email is valid, suspect or garbage. In some cases (for example an invalid domain address) the complete header will not have been processed and some fields in mHeadInfo have not been filled in. This function is called to read the rest of the header and fill in these fields. Definition at line 659 of file MailHeader.C. References HeaderInfo::date(), HeaderInfo::from(), Logger::log(), HeaderInfo::subject(), and HeaderInfo::to(). Referenced by checkHeader(). 
 00660 {
00661   static const char *TO = "to:";
00662   static const size_t TO_LEN = strlen( TO );
00663   static const char *FROM = "from:";
00664   static const size_t FROM_LEN = strlen( FROM );
00665   static const char *SUBJECT = "subject:";
00666   static const size_t SUBJECT_LEN = strlen( SUBJECT );
00667   static const char *DATE = "date:";
00668   static const size_t DATE_LEN = strlen( DATE );
00669   char buf[1024];
00670   char *pBuf = 0;
00671   size_t bufSize = 0;
00672 
00673   log.log(Logger::DEBUG, "fillInSections", "enter");
00674 
00675   do {
00676     if (pPushBack == 0) {
00677       pushBackBuf[0] = '\0';
00678       bufSize = sizeof(buf);
00679       pBuf = fgets(buf, bufSize, fp);
00680     }
00681     else {
00682       pBuf = pPushBack;
00683       bufSize = sizeof( pushBackBuf );
00684       pPushBack = 0;
00685     }
00686     if (pBuf != 0) {
00687       if (! SpamUtil().isBlankLine( pBuf )) {
00688         char *pCopy = 0;
00689         if (SpamUtil().match(pBuf, TO_LEN, TO)) {
00690           pCopy = pBuf + TO_LEN;
00691           mHeadInfo.to( pCopy );
00692         }
00693         else if (SpamUtil().match(pBuf, FROM_LEN, FROM)) {
00694           pCopy = pBuf + FROM_LEN;
00695           mHeadInfo.from( pCopy );
00696         }
00697         else if (SpamUtil().match(pBuf, SUBJECT_LEN, SUBJECT)) {
00698           pCopy = pBuf + SUBJECT_LEN;
00699           mHeadInfo.subject( pCopy );
00700         }
00701         else if (SpamUtil().match(pBuf, DATE_LEN, DATE)) {
00702           pCopy = pBuf + DATE_LEN;
00703           mHeadInfo.date( pCopy );
00704         }
00705       }
00706       else {
00707         // found a blank line which follows the mail header
00708         break;
00709       }
00710     }
00711   } while (pBuf != 0);
00712 
00713   if (pBuf == 0) {
00714     log.log(Logger::DEBUG, "fillInSections", "end-of-file reached");
00715   }
00716   log.log(Logger::DEBUG, "fillInSections", "exit");
00717 } // fillInSections
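The field matching in the listing above relies on SpamUtil().match(), whose definition is not shown on this page. As a rough sketch of the same idea, the hypothetical helper below compares the start of a header line against a lower-case field name, ignoring case, and returns a pointer to the text that follows the name.

```cpp
#include <assert.h>
#include <ctype.h>
#include <string.h>

// Hypothetical helper, not SpamUtil::match(). If `line` starts with the
// field name `name` (given in lower case, compared case-insensitively),
// return a pointer to the text after the name. Otherwise return 0.
const char *headerValue(const char *line, const char *name)
{
  assert(line != 0 && name != 0);
  size_t i = 0;
  for (; name[i] != '\0'; i++) {
    // A line shorter than the name fails here on its terminating NUL,
    // so the loop never reads past the end of `line`.
    if (tolower(static_cast<unsigned char>(line[i])) != name[i]) {
      return 0;
    }
  }
  return line + i;
}
```

With a helper of this shape, the chain of tests in fillInSections amounts to calling headerValue(pBuf, "to:"), headerValue(pBuf, "from:") and so on, and passing the first non-null result to the matching HeaderInfo setter.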
 | 
| 
 | ||||||||||||
| Return the content type for the email. Many emails (especially those which are MIME encoded, but others as well) include a Content-type section whose format is: 
 | 
| 
 | 
| Save the boundary string which may follow the "Content-Type:" line. The format for the boundary definition is 
 | 