#include <MailHeader.h>
Public Member Functions
MailHeader (SpamParameters &p, HeaderInfo &headInfo)
MailFilter::classification | parseContentType (FILE *fp, char *buf)
const char * | getBoundaryStr () |
MailFilter::classification | checkHeader (FILE *fp) |
Private Member Functions
void | saveBoundary (const char *pBound) |
void | fillInSections (FILE *fp) |
MailFilter::classification | checkReceived (const char *buf, FILE *fp, size_t line) |
MailFilter::classification | checkSubject (const char *buf, FILE *fp) |
MailFilter::classification | checkFrom (const char *buf) |
bool | addrContinues (const char *buf) |
MailFilter::classification | checkDomainAddrs (const char *domainName, const char *pBuf) |
MailFilter::classification | checkAddressSection (const char *buf, FILE *fp) |
Private Attributes
Logger | log |
SpamParameters & | mParams |
HeaderInfo & | mHeadInfo |
bool | foundValidAddress |
char | boundaryStr [128] |
char | pushBackBuf [1024] |
char * | pPushBack |
The pushBackBuf
When processing the email header it is sometimes necessary to read the next line to know what to do: for example, to find out whether the header has ended with a blank line, or whether the subject or another part of the header continues on the next line. Reading ahead can also mean that we have read too far, because the line just read is one that still needs to be processed. To deal with this, the pushBackBuf is used. The next line is read into the pushBackBuf. When subsequent logic needs another line, it checks the pushBackBuf first (to see whether it holds a line of non-zero length) before reading another line from the file.
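The push-back idea described above can be sketched in isolation. This is a minimal illustration and not the MailHeader code: PushBackReader and readContinued are hypothetical names, std::istream stands in for the FILE * the class uses so that the sketch can run on an in-memory string, and the "blank line or line containing a colon" test is a simplified stand-in for SpamUtil's isBlankLine and findColon.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MailHeader): a reader with one line of push-back.
class PushBackReader {
public:
    explicit PushBackReader(std::istream &in) : mIn(in), mHasPushBack(false) {}

    // Return the pushed-back line if there is one, otherwise read a new line.
    // Returns false at end of input.
    bool nextLine(std::string &line) {
        if (mHasPushBack) {
            line = mPushBack;
            mHasPushBack = false;
            return true;
        }
        return static_cast<bool>(std::getline(mIn, line));
    }

    // Hand a line back so that the next call to nextLine() returns it again.
    void pushBack(const std::string &line) {
        mPushBack = line;
        mHasPushBack = true;
    }

private:
    std::istream &mIn;
    std::string mPushBack;
    bool mHasPushBack;
};

// Append continuation lines to a header field. A blank line or a line that
// contains a colon is assumed to start something new, so it is pushed back.
std::string readContinued(PushBackReader &reader, const std::string &firstLine) {
    std::string value = firstLine;
    std::string next;
    while (reader.nextLine(next)) {
        const bool blank = next.find_first_not_of(" \t\r") == std::string::npos;
        const bool newHeader = next.find(':') != std::string::npos;
        if (blank || newHeader) {
            reader.pushBack(next);  // read too far: keep the line for the caller
            break;
        }
        value += next;
    }
    return value;
}
```

After readContinued returns, the caller's next call to nextLine() yields the line that ended the field, so no header line is lost.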
The boundaryStr
A boundary string is used to separate the sections of a MIME formatted email. Email software (like this email filter) can skip between sections by looking for the boundary string. The boundary string is defined in the email header. The MailHeader code saves the boundary string (if it exists) in the boundaryStr buffer. The boundary string is then used in processing the email body.
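As a rough sketch of how a boundary string can be pulled out of a header line and then recognized in the body: the function names extractBoundary and isBoundaryLine are hypothetical and are not the MailHeader API. The sketch assumes the usual MIME convention that the header carries a boundary= parameter (optionally quoted) and that each body separator line is the boundary string prefixed with two hyphens.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: return the value of the boundary= parameter,
// or an empty string when the line has none.
std::string extractBoundary(const std::string &headerLine) {
    // Lower-case copy so that "Boundary=" and "BOUNDARY=" are also found.
    std::string lower(headerLine);
    for (char &c : lower) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    const std::string key = "boundary=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos) {
        return "";
    }
    pos += key.size();
    if (pos < headerLine.size() && headerLine[pos] == '"') {
        // Quoted value: runs up to the closing quote (or the end of the line).
        const std::string::size_type end = headerLine.find('"', pos + 1);
        if (end == std::string::npos) {
            return headerLine.substr(pos + 1);
        }
        return headerLine.substr(pos + 1, end - pos - 1);
    }
    // Unquoted value: runs up to a semicolon or white space.
    const std::string::size_type end = headerLine.find_first_of("; \t\r\n", pos);
    if (end == std::string::npos) {
        return headerLine.substr(pos);
    }
    return headerLine.substr(pos, end - pos);
}

// Hypothetical helper: true when a body line is "--" followed by the boundary.
bool isBoundaryLine(const std::string &bodyLine, const std::string &boundary) {
    return !boundary.empty()
        && bodyLine.compare(0, 2, "--") == 0
        && bodyLine.compare(2, boundary.size(), boundary) == 0;
}
```

The original case of the boundary value is preserved, since body separator lines have to match it exactly.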
Definition at line 57 of file MailHeader.h.
bool MailHeader::addrContinues (const char *buf)
addrContinues Return TRUE if it looks like the "To:" is spread across multiple lines. The "To:" continues on another line when the line ends with either a comma or a single quote followed by a double quote. This function looks at the end of the line. If the line ends with:
MailFilter::classification MailHeader::checkAddressSection (const char *buf, FILE *fp)
Check either the To: or Cc: sections.
The "to_list" addresses, defined in SpamFilterParams, will usually be mailing lists (for example, the ANTLR antlr-interest mailing list that is distributed via Yahoo). If one of these addresses (or part of an address) is found, then the function will return EMAIL and no further content checking will be done by the mail filter.
This function checks for the domain name specified in the my_domain section of SpamFilterParams. If the domain is found, it then checks whether the user (i.e., the string to the left of the @) is listed in the valid_users section. This function limits user names to strings consisting of 'a'..'z' and '0'..'9' (case insensitive) plus the underscore character.
Note that only one domain is allowed in the my_domain section.
Before moving to my current ISP I would get any e-mail addressed to bearcave.com. This was a problem when the site bearcave.org existed, since a number of bearcave.org users made the mistake of using .com when they should have used .org. Checking for valid users marks as garbage any email addressed to a user that is not valid.
Definition at line 334 of file MailHeader.C.
References addrContinues(), checkDomainAddrs(), SpamParameters::getSection(), and Logger::log().
Referenced by checkHeader().
{
  log.log(Logger::DEBUG, "checkAddressSection", "enter");

  MailFilter::classification klass = MailFilter::UNKNOWN;
  vector<const char *> toAddrs = mParams.getSection(SpamParameters::to_list);
  vector<const char *> myDomain = mParams.getSection(SpamParameters::my_domain);

  const char *domainName = 0;
  if (myDomain.size() > 0)
    domainName = myDomain[0];

  const char *pBuf = buf;
  const size_t toAddrLen = toAddrs.size();

  char localBuf[1024];
  size_t i;
  bool done;
  do {
    done = true;

    SpamUtil().toLower(localBuf, pBuf, sizeof(localBuf));
    for (i = 0; i < toAddrLen; i++) {
      if (strstr(localBuf, toAddrs[i]) != 0) {
        const char *hit = toAddrs[i];
        char msg[128];
        sprintf(msg, "found \"%s\", marked as EMAIL", hit );
        log.log(Logger::DEBUG, "checkAddressSection", msg );
        klass = MailFilter::EMAIL;
        break;
      }
    } // for

    if (klass == MailFilter::UNKNOWN && domainName != 0) {
      klass = checkDomainAddrs( domainName, localBuf );
    }

    if (klass == MailFilter::UNKNOWN) {
      if (addrContinues(localBuf)) {
        if ((pBuf = fgets(localBuf, sizeof(localBuf), fp)) != 0) {
          done = false;
        }
      }
    }

  } while (!done);

  log.log(Logger::DEBUG, "checkAddressSection", "exit");

  return klass;
} // checkAddressSection
MailFilter::classification MailHeader::checkDomainAddrs (const char *domainName, const char *pBuf)
Check the string for a user name associated with domainName. The domain name is defined in the my_domain section of SpamFilterParams. Valid user names for this domain are defined in the valid_users section.
If a valid user name is found, the foundValidAddress flag is set to true. If there is a user that is not in the valid_users list, the classification GARBAGE is returned. Otherwise, UNKNOWN is returned (UNKNOWN is returned when a valid user is found as well).
Definition at line 235 of file MailHeader.C.
References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason().
Referenced by checkAddressSection().
{
  log.log(Logger::DEBUG, "checkDomainAddrs", "enter");

  assert( ((domainName != 0) && (pBuf != 0)) );

  vector<const char *> validUsers = mParams.getSection(SpamParameters::valid_users);
  const size_t numUsers = validUsers.size();
  MailFilter::classification klass = MailFilter::UNKNOWN;

  bool done = false;
  size_t domainNameLen = strlen( domainName );
  const char *domainPtr = strstr(pBuf, domainName );
  while (domainPtr) {
    if (domainPtr > pBuf+2) {
      domainPtr--;
      if (*domainPtr == '@') {
        // find the start and end of the user name
        const char *endPtr = domainPtr;
        domainPtr--;
        const char *beginPtr = domainPtr;
        while (beginPtr >= pBuf && isalnum( *beginPtr ))
          beginPtr--;
        if (!isalnum(*beginPtr)) {
          beginPtr++;
        }

        // Now check to see if the user name is in the valid_users list
        // Note that this function is used for both To: and Cc:, so
        // foundValidAddress could have been set in a previous call.
        bool foundInList = false;
        for (size_t i = 0; i < numUsers; i++) {
          const char *word = validUsers[i];
          if (SpamUtil().match(beginPtr, endPtr, word)) {
            foundValidAddress = true;
            foundInList = true;
          }
        } // for

        if (!foundInList) {
          char msg[128];
          char user[128];
          size_t ix = 0;
          for (const char *pCh = beginPtr; pCh < endPtr; pCh++, ix++) {
            user[ix] = *pCh;
          }
          user[ix] = '\0';
          sprintf(msg, "Non-valid user \"%s\", email marked as GARBAGE",
                  user );
          mHeadInfo.reason( msg );
          log.log(Logger::DEBUG, "checkDomainAddrs", msg );
          klass = MailFilter::GARBAGE;
        }
        // endPtr points to the '@'
        pBuf = (endPtr + 1);
      }
    }
    if (klass == MailFilter::UNKNOWN) {
      pBuf = pBuf + domainNameLen;
      domainPtr = strstr(pBuf, domainName);
    }
    else {
      break; // exit the while loop
    }
  } // while

  log.log(Logger::DEBUG, "checkDomainAddrs", "exit");

  return klass;
} // checkDomainAddrs
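The valid-user rule described for this function can also be sketched with std::string, which avoids the pointer arithmetic. This is only an illustration under stated assumptions, not the MailHeader implementation: usersAtDomain, classifyRecipients and RecipientVerdict are hypothetical names, the input line is assumed to be lower-cased already, and the underscore is accepted in user names as the documentation for checkAddressSection states.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the two MailFilter::classification values used here.
enum class RecipientVerdict { Unknown, Garbage };

// Collect the user names found immediately to the left of "@domain".
std::vector<std::string> usersAtDomain(const std::string &line,
                                       const std::string &domain) {
    std::vector<std::string> users;
    const std::string needle = "@" + domain;
    std::string::size_type at = line.find(needle);
    while (at != std::string::npos) {
        // Walk left from the '@' over letters, digits and underscores.
        std::string::size_type begin = at;
        while (begin > 0 &&
               (std::isalnum(static_cast<unsigned char>(line[begin - 1])) ||
                line[begin - 1] == '_')) {
            --begin;
        }
        if (begin < at) {
            users.push_back(line.substr(begin, at - begin));
        }
        at = line.find(needle, at + needle.size());
    }
    return users;
}

// Garbage if any user at the domain is missing from validUsers, else Unknown.
RecipientVerdict classifyRecipients(const std::string &line,
                                    const std::string &domain,
                                    const std::vector<std::string> &validUsers) {
    for (const std::string &user : usersAtDomain(line, domain)) {
        if (std::find(validUsers.begin(), validUsers.end(), user) == validUsers.end()) {
            return RecipientVerdict::Garbage;
        }
    }
    return RecipientVerdict::Unknown;
}
```

As in the function documented above, a line that only contains valid users still yields Unknown, so later checks (such as the subject line) continue to run.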
MailFilter::classification MailHeader::checkFrom (const char *buf)
Check to see if an e-mail address in the from_address section of the SpamParameters is in the "From:" field. If a "from_address" string is found then it is valid email and no further checking will be done by the mail filter. This allows people you know to send you e-mail that may have spam or kill words in it.
The "From" is also checked against "from_kill" strings. This allows you to mark as garbage email from frequent spammers. For example, when I developed this software there was a spammer who used "YoDude" in the From line and another who used "TailWaggingOffers".
Definition at line 123 of file MailHeader.C.
References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason().
Referenced by checkHeader().
{
  MailFilter::classification klass = MailFilter::UNKNOWN;
  log.log(Logger::DEBUG, "checkFrom", "enter");

  vector<const char *> fromAddrs = mParams.getSection(SpamParameters::from_address);
  vector<const char *> killAddrs = mParams.getSection(SpamParameters::from_kill);

  char msg[128];
  char from[256];

  SpamUtil().toLower(from, buf, sizeof(from));

  size_t len;
  len = killAddrs.size();
  for (size_t i = 0; i < len; i++) {
    if (strstr(from, killAddrs[i]) != 0) {
      sprintf(msg, "Found address \"%s\", email marked as GARBAGE",
              killAddrs[i] );
      mHeadInfo.reason( msg );
      log.log(Logger::DEBUG, "checkFrom", msg );
      klass = MailFilter::GARBAGE;
      break;
    }
  }

  if (klass == MailFilter::UNKNOWN) {
    len = fromAddrs.size();
    for (size_t i = 0; i < len; i++) {
      if (strstr(from, fromAddrs[i]) != 0) {
        sprintf(msg, "Found \"from address\" \"%s\", email marked as EMAIL",
                fromAddrs[i] );
        log.log(Logger::DEBUG, "checkFrom", msg );
        klass = MailFilter::EMAIL;
        break;
      }
    } // for
  }

  log.log(Logger::DEBUG, "checkFrom", "exit");
  return klass;
} // checkFrom
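The order of the two checks matters: in the listing above the from_kill strings are tested first, so a kill match wins even when a from_address string also appears in the line. A compact sketch of that precedence follows. FromVerdict, toLowerCopy, containsAny and classifyFrom are hypothetical names, and the lists are assumed to hold non-empty, lower-case strings.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the MailFilter::classification values used here.
enum class FromVerdict { Unknown, Email, Garbage };

// Return a lower-case copy of the text.
std::string toLowerCopy(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// True when any of the needles occurs somewhere in the haystack.
bool containsAny(const std::string &haystack,
                 const std::vector<std::string> &needles) {
    for (const std::string &needle : needles) {
        if (haystack.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// Kill list first, then the list of known senders, otherwise undecided.
FromVerdict classifyFrom(const std::string &fromLine,
                         const std::vector<std::string> &killList,
                         const std::vector<std::string> &allowList) {
    const std::string from = toLowerCopy(fromLine);
    if (containsAny(from, killList)) {
        return FromVerdict::Garbage;
    }
    if (containsAny(from, allowList)) {
        return FromVerdict::Email;
    }
    return FromVerdict::Unknown;
}
```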
MailFilter::classification MailHeader::checkHeader (FILE *fp)
Rules for processing the e-mail header:
The "To:" and "Cc:"
The "From:" and "From": At least in the case of e-mail on Linux there is a "From" line which leads the e-mail file. This line has the following format:
Check the subject line for spam and kill words (e.g., penis, xanax). Processing of the email header ends when a blank line is found (all email headers must end with a blank line).
When it comes to recognizing "spam_words" and "kill_words" in the subject line, the code relies on the fact that the "From:" line precedes the "Subject:" line. This allows your lover, whose address will presumably be in the from_address part of the SpamFilterParams, to send you e-mail with the word "penis" in the subject without having the mail discarded if "penis" is in the kill_words list.
The subject line and other parts of the email header are copied into a HeaderInfo object. This information is used in generating debug trace information, the garbage trace (for discarded email), and error messages.
Many emails (especially those that are MIME formatted) will have a boundary line, which usually follows the "Content-Type:" line. The boundary line has the format
MailFilter::classification MailHeader::checkReceived (const char *buf, FILE *fp, size_t line)
The received line in the email header spans multiple lines. The end is determined by the next line that contains a colon: something like "Message-ID:" or "From:" (or, perhaps, another "Received:").
Right now not much is done with this section except to search for the word "forged". If a section is added to the SpamFilterParams file for spammer addresses, then this function could recognize these. Right now it is not clear that this would be very profitable, since spammers move around so much. For use in future checking, the buf pointer points to the character that follows the colon.
One complexity introduced by this function is that it reads the next line to see if that line has a colon header in it. There is no way to "unget" a line. So as a hack around this there is a "pushBackBuf" in the class (can you say global variable by another name) which contains this line. If the pPushBack pointer points at this buffer (i.e., is not NULL) then the pushBackBuf line will be used rather than reading a new line from stdin.
Some mailers add a "may be forged" note on one of the received lines. This seems to happen when the address is given via SMTP, rather than in the "To:" line. In this case, the mail should be marked as SPAM (i.e., SUSPECT).
The "line" argument is for debugging.
Definition at line 416 of file MailHeader.C.
References Logger::log().
Referenced by checkHeader().
{
  log.log(Logger::DEBUG, "checkReceived", "enter");
  MailFilter::classification klass = MailFilter::UNKNOWN;

  const char *pColon;
  do {
    pPushBack = fgets(pushBackBuf, sizeof(pushBackBuf), fp);
    if (pPushBack != 0) {
      if ((klass == MailFilter::UNKNOWN) && (strstr(pPushBack, "forged") != 0)) {
        klass = MailFilter::SUSPECT;
      }
      pColon = SpamUtil().findColon( pPushBack );
    }
  } while (pPushBack != 0 && !pColon);

  log.log(Logger::DEBUG, "checkReceived", "exit");
  return klass;
} // checkReceived
MailFilter::classification MailHeader::checkSubject (const char *buf, FILE *fp)
Check the email header subject line for spam or kill words or phrases. Apparently some emails may have multi-line subjects. So after the subject line is found, we check to see if there is another line with a colon (something like "Reply-To:" for example) or if the line is blank (indicating an end to the header). In both cases we "push back" the line. If the line does not have a colon or is not blank, we skip it (since this is the subject line continuing on the next line). The subject line should not continue on more than one line after the "Subject:" line or something is really wrong with the email format. Definition at line 68 of file MailHeader.C. References Logger::log(), and HeaderInfo::reason(). Referenced by checkHeader().
{
  log.log(Logger::DEBUG, "checkSubject", "enter");

  char msg[256];
  char subject[256];
  char foundStr[128];

  // convert to lower case
  SpamUtil().toLower(subject, buf, sizeof(subject));

  foundStr[0] = '\0';
  MailFilter::classification klass = SpamUtil().checkLine(subject,
                                                          mParams,
                                                          foundStr,
                                                          sizeof(foundStr));

  if (klass == MailFilter::SUSPECT || klass == MailFilter::GARBAGE) {
    if (klass == MailFilter::SUSPECT) {
      sprintf(msg, "Found \"spam\" word \"%s\", email marked as SUSPECT",
              foundStr );
    }
    else if (klass == MailFilter::GARBAGE) {
      mHeadInfo.reason( foundStr );
      sprintf(msg, "Found \"kill\" word \"%s\", email marked as GARBAGE",
              foundStr );
    }
    log.log(Logger::DEBUG, "checkSubject", msg );
  }

  pPushBack = 0;
  char *pBuf;
  if ((pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
    if (SpamUtil().isBlankLine(pBuf) || SpamUtil().findColon(pBuf) != 0) {
      pPushBack = pBuf;
    }
  }

  log.log(Logger::DEBUG, "checkSubject", "exit");
  return klass;
} // checkSubject
void MailHeader::fillInSections (FILE *fp)
Fill in the email header in the mHeadInfo class variable. The mHeadInfo object is used to encapsulate header information that is used in generating debug log messages and in generating the garbage trace (if the trace_garbage flag is set). Processing the email stops as soon as it can be determined that the email is valid, suspect or garbage. In some cases (for example an invalid domain address) the complete header will not have been processed and some fields in mHeadInfo have not been filled in. This function is called to read the rest of the header and fill in these fields. Definition at line 659 of file MailHeader.C. References HeaderInfo::date(), HeaderInfo::from(), Logger::log(), HeaderInfo::subject(), and HeaderInfo::to(). Referenced by checkHeader().
{
  static const char *TO = "to:";
  static const size_t TO_LEN = strlen( TO );
  static const char *FROM = "from:";
  static const size_t FROM_LEN = strlen( FROM );
  static const char *SUBJECT = "subject:";
  static const size_t SUBJECT_LEN = strlen( SUBJECT );
  static const char *DATE = "date:";
  static const size_t DATE_LEN = strlen( DATE );
  char buf[1024];
  char *pBuf = 0;
  size_t bufSize = 0;

  log.log(Logger::DEBUG, "fillInSections", "enter");

  do {
    if (pPushBack == 0) {
      pushBackBuf[0] = '0';
      bufSize = sizeof(buf);
      pBuf = fgets(buf, bufSize, fp);
    }
    else {
      pBuf = pPushBack;
      bufSize = sizeof( pushBackBuf );
      pPushBack = 0;
    }
    if (pBuf != 0) {
      if (! SpamUtil().isBlankLine( pBuf )) {
        char *pCopy = 0;
        if (SpamUtil().match(pBuf, TO_LEN, TO)) {
          pCopy = pBuf + TO_LEN;
          mHeadInfo.to( pCopy );
        }
        else if (SpamUtil().match(pBuf, FROM_LEN, FROM)) {
          pCopy = pBuf + FROM_LEN;
          mHeadInfo.from( pCopy );
        }
        else if (SpamUtil().match(pBuf, SUBJECT_LEN, SUBJECT)) {
          pCopy = pBuf + SUBJECT_LEN;
          mHeadInfo.subject( pCopy );
        }
        else if (SpamUtil().match(pBuf, DATE_LEN, DATE)) {
          pCopy = pBuf + DATE_LEN;
          mHeadInfo.date( pCopy );
        }
      }
      else {
        // found a blank line which follows the mail header
        break;
      }
    }
  } while (pBuf != 0);

  if (pBuf == 0) {
    log.log(Logger::DEBUG, "fillInSections", "end-of-file reached");
  }
  log.log(Logger::DEBUG, "fillInSections", "exit");
} // fillInSections
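The field-filling loop can be sketched on its own as follows. This is an illustration only: HeaderFields, startsWithNoCase and collectFields are hypothetical names, HeaderFields is a plain struct standing in for HeaderInfo, std::istream replaces the FILE * argument, and the case-insensitive prefix test is an assumption about what SpamUtil's match does with the lower-case field names.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for HeaderInfo: just the four fields that are filled in.
struct HeaderFields {
    std::string to;
    std::string from;
    std::string subject;
    std::string date;
};

// Case-insensitive test for a field name at the start of a line.
bool startsWithNoCase(const std::string &line, const std::string &prefix) {
    if (line.size() < prefix.size()) {
        return false;
    }
    for (std::string::size_type i = 0; i < prefix.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(line[i])) !=
            std::tolower(static_cast<unsigned char>(prefix[i]))) {
            return false;
        }
    }
    return true;
}

// Read header lines until the blank line that ends the header (or end of
// input), copying the text after each recognized field name.
HeaderFields collectFields(std::istream &in) {
    HeaderFields fields;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") == std::string::npos) {
            break;  // blank line: the header is finished
        }
        if (startsWithNoCase(line, "to:")) {
            fields.to = line.substr(3);
        } else if (startsWithNoCase(line, "from:")) {
            fields.from = line.substr(5);
        } else if (startsWithNoCase(line, "subject:")) {
            fields.subject = line.substr(8);
        } else if (startsWithNoCase(line, "date:")) {
            fields.date = line.substr(5);
        }
    }
    return fields;
}
```

Like the listing above, the sketch keeps whatever follows the colon, including the leading space, and stops at the blank line so that body text resembling a header field is not picked up.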
MailFilter::classification MailHeader::parseContentType (FILE *fp, char *buf)
Return the content type for the email. Many emails (especially those which are MIME encoded, but others as well) include a Content-type section whose format is:
void MailHeader::saveBoundary (const char *pBound)
Save the boundary string which may follow the "Content-Type:" line. The format for the boundary definition is