
MailHeader Class Reference

#include <MailHeader.h>


Public Member Functions

MailHeader (SpamParameters &p, HeaderInfo &headInfo)

MailFilter::classification parseContentType (FILE *fp, char *buf)

const char * getBoundaryStr ()

MailFilter::classification checkHeader (FILE *fp)

Private Member Functions

void saveBoundary (const char *pBound)

void fillInSections (FILE *fp)

MailFilter::classification checkReceived (const char *buf, FILE *fp, size_t line)

MailFilter::classification checkSubject (const char *buf, FILE *fp)

MailFilter::classification checkFrom (const char *buf)

bool addrContinues (const char *buf)

MailFilter::classification checkDomainAddrs (const char *domainName, const char *pBuf)

MailFilter::classification checkAddressSection (const char *buf, FILE *fp)

Private Attributes

Logger log

SpamParameters & mParams

HeaderInfo & mHeadInfo

bool foundValidAddress

char boundaryStr [128]

char pushBackBuf [1024]

char * pPushBack

Detailed Description

Support for processing the email header (From:, To:, Subject:, and so on).

The pushBackBuf

When processing the email header it is sometimes necessary to read the next line to know what to do: for example, to know whether the header has ended with a blank line, or whether the subject or another part of the header continues on the next line. Reading the next line may also mean reading too far, consuming a line that still has to be processed. The pushBackBuf deals with this issue. The next line is read into the pushBackBuf. When subsequent logic needs another line, it checks the pushBackBuf (to see if it is non-zero length) before reading another line.
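The push-back idea can be sketched in isolation. The following is a minimal illustration, not the MailHeader implementation: the PushBackReader class and its method names are hypothetical, and it uses std::istream and std::string instead of FILE* and fixed-size buffers to keep the example short.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MailHeader): a reader that can
// "unget" one line, mirroring the role of pushBackBuf/pPushBack.
class PushBackReader {
public:
    explicit PushBackReader(std::istream &in) : mIn(in), mHasPushBack(false) {}

    // Return the pushed-back line if there is one, otherwise read a
    // new line from the stream. Returns false at end of input.
    bool nextLine(std::string &line) {
        if (mHasPushBack) {
            line = mPushBack;
            mHasPushBack = false;
            return true;
        }
        return !std::getline(mIn, line).fail();
    }

    // Save a line that was read too early so that the next call to
    // nextLine() returns it again.
    void pushBack(const std::string &line) {
        mPushBack = line;
        mHasPushBack = true;
    }

private:
    std::istream &mIn;
    std::string mPushBack;
    bool mHasPushBack;
};
```

A caller that reads one line too far calls pushBack() with that line; the next call to nextLine() then returns it instead of reading from the stream.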

The boundaryStr

A boundary string is used to separate the sections of a MIME formatted email. Email software (like this email filter) can skip between sections by looking for the boundary string. The boundary string is defined in the email header. The MailHeader code saves the boundary string (if it exists) in the boundaryStr buffer. The boundary string is then used in processing the email body.
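As a rough illustration of how a saved boundary string can be used on the email body, the sketch below tests whether a body line is a MIME delimiter. In MIME, a delimiter line consists of two hyphens followed by the boundary value, and the final delimiter has two more hyphens appended. The function name isBoundaryLine is hypothetical and does not appear in the mail filter source.

```cpp
#include <string>

// Hypothetical helper: return true if `line` starts with the MIME
// delimiter for `boundary`, that is, "--" followed by the boundary
// value. The closing delimiter ("--" + boundary + "--") also matches.
bool isBoundaryLine(const std::string &line, const std::string &boundary) {
    if (boundary.empty()) {
        return false;  // no boundary was defined in the header
    }
    const std::string delim = "--" + boundary;
    return line.compare(0, delim.size(), delim) == 0;
}
```

Code that processes the body could call this on each line with the value returned by getBoundaryStr() to find where one section ends and the next begins.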

Definition at line 57 of file MailHeader.h.

Member Function Documentation

bool MailHeader::addrContinues ( const char * buf ) [private]

addrContinues
Return TRUE if it looks like the "To:" is spread across multiple lines.
The "To:" continues on another line when the line ends with either a comma or a single quote followed by a double quote.
This function looks at the end of the line, ignoring trailing white space, and returns TRUE if the line ends with:
"," (comma)
"'\"" (a single quote followed by a double quote)
Note that in the code below the search is done in reverse, so the '"' is encountered before the '\'' (single quote).
Definition at line 188 of file MailHeader.C.
References Logger::log().
Referenced by checkAddressSection().

{
  bool rslt = false;

  log.log(Logger::DEBUG, "addrContinues", "enter");
  if (buf) {
    int end = strlen( buf );

    if (end > 0) {
      end--;
      const char *endPtr;
      // Skip trailing white space. Stop at buf so that *endPtr below
      // never reads before the start of the buffer.
      for (endPtr = &buf[end]; endPtr > buf && isspace(*endPtr); endPtr--)
        /* nada */;
      if (*endPtr == ',') {
        rslt = true;
      }
      else if (*endPtr == '"') {
        if (endPtr > buf && *(endPtr-1) == '\'') {
          rslt = true;
        }
      }
    }
  }

  char msgbuf[128];
  sprintf( msgbuf, "returns %s", (rslt) ? "TRUE" : "FALSE" );
  log.log(Logger::DEBUG, "addrContinues", msgbuf );

  log.log(Logger::DEBUG, "addrContinues", "exit");
  return rslt;
} // addrContinues

MailFilter::classification MailHeader::checkAddressSection ( const char * buf,

FILE * fp

) [private]

Check either the To: or Cc: sections.
The "to_list" addresses, defined in SpamFilterParams, will usually be mailing lists (for example, the ANTLR antlr-interest mailing list that is distributed via Yahoo). If one of these addresses (or part of an address) is found, then the function will return EMAIL and no further content checking will be done by the mail filter.
This function checks for the domain name specified in the my_domain section of SpamFilterParams. If the domain is found it then checks to see if the user (i.e., the string to the left of the @) is listed in the valid_users section. This function limits user names to strings consisting of 'a'..'z' and '0'..'9' (case insensitive) plus the underscore character. Note that only one domain is allowed in the my_domain section.
Before moving to my current ISP I would get any e-mail addressed to bearcave.com. This was a problem when the site bearcave.org existed since a number of bearcave.org users made the mistake of using .com when they should have used .org. Checking for valid users marks as garbage any email to a user that is not valid.
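The valid-user check described above can be sketched as follows. This is an illustrative sketch only: usersForDomain, isValidUser, and isUserChar are hypothetical names, the input line is assumed to be lowercased already, and the domain is matched as a plain substring. It follows the character set given in the description (letters, digits, underscore).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: characters allowed in a user name
// ('a'..'z', '0'..'9', case insensitive, plus underscore).
static bool isUserChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Hypothetical helper: collect every user name that appears directly
// to the left of "@domain" in the (already lowercased) line.
std::vector<std::string> usersForDomain(const std::string &line,
                                        const std::string &domain) {
    std::vector<std::string> users;
    const std::string needle = "@" + domain;
    std::string::size_type at = line.find(needle);
    while (at != std::string::npos) {
        // Walk backwards from the '@' to the start of the user name.
        std::string::size_type begin = at;
        while (begin > 0 && isUserChar(line[begin - 1])) {
            --begin;
        }
        if (begin < at) {
            users.push_back(line.substr(begin, at - begin));
        }
        at = line.find(needle, at + needle.size());
    }
    return users;
}

// Hypothetical helper: true if `user` is listed in `validUsers`.
bool isValidUser(const std::string &user,
                 const std::vector<std::string> &validUsers) {
    for (std::vector<std::string>::size_type i = 0; i < validUsers.size(); i++) {
        if (validUsers[i] == user) {
            return true;
        }
    }
    return false;
}
```

A caller would classify the mail as GARBAGE when usersForDomain() yields a name for which isValidUser() returns false.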
Definition at line 334 of file MailHeader.C.
References addrContinues(), checkDomainAddrs(), SpamParameters::getSection(), and Logger::log().
Referenced by checkHeader().

{
  log.log(Logger::DEBUG, "checkAddressSection", "enter");

  MailFilter::classification klass = MailFilter::UNKNOWN;
  vector<const char *> toAddrs = mParams.getSection(SpamParameters::to_list);
  vector<const char *> myDomain = mParams.getSection(SpamParameters::my_domain);

  const char *domainName = 0;
  if (myDomain.size() > 0)
    domainName = myDomain[0];

  const char *pBuf = buf;
  const size_t toAddrLen = toAddrs.size();

  char localBuf[1024];
  size_t i;
  bool done;
  do {
    done = true;

    SpamUtil().toLower(localBuf, pBuf, sizeof(localBuf));
    for (i = 0; i < toAddrLen; i++) {
      if (strstr(localBuf, toAddrs[i]) != 0) {
        const char *hit = toAddrs[i];
        char msg[128];
        sprintf(msg, "found \"%s\", marked as EMAIL", hit );
        log.log(Logger::DEBUG, "checkAddressSection", msg );
        klass = MailFilter::EMAIL;
        break;
      }
    } // for

    if (klass == MailFilter::UNKNOWN && domainName != 0) {
      klass = checkDomainAddrs( domainName, localBuf );
    }

    if (klass == MailFilter::UNKNOWN) {
      if (addrContinues(localBuf)) {
        if ((pBuf = fgets(localBuf, sizeof(localBuf), fp)) != 0) {
          done = false;
        }
      }
    }

  } while (!done);

  log.log(Logger::DEBUG, "checkAddressSection", "exit");

  return klass;
} // checkAddressSection

MailFilter::classification MailHeader::checkDomainAddrs ( const char * domainName,

const char * pBuf

) [private]

Check the string for a user name associated with domainName.
The domain name is defined in the my_domain section of SpamFilterParams. Valid user names for this domain are defined in the valid_users section.
If valid user names are found then the foundValidAddress flag is set to true. If there are users that are not in the valid_users list then the classification GARBAGE is returned. Otherwise, UNKNOWN is returned (UNKNOWN is returned when a valid user is found as well).
Definition at line 235 of file MailHeader.C.
References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason().
Referenced by checkAddressSection().

{
  log.log(Logger::DEBUG, "checkDomainAddrs", "enter");

  assert( ((domainName != 0) && (pBuf != 0)) );

  vector<const char *> validUsers = mParams.getSection(SpamParameters::valid_users);
  const size_t numUsers = validUsers.size();
  MailFilter::classification klass = MailFilter::UNKNOWN;

  bool done = false;
  size_t domainNameLen = strlen( domainName );
  const char *domainPtr = strstr(pBuf, domainName );
  while (domainPtr) {
    if (domainPtr > pBuf+2) {
      domainPtr--;
      if (*domainPtr == '@') {
        // find the start and end of the user name
        const char *endPtr = domainPtr;
        domainPtr--;
        const char *beginPtr = domainPtr;
        while (beginPtr >= pBuf && isalnum( *beginPtr ))
          beginPtr--;
        if (!isalnum(*beginPtr)) {
          beginPtr++;
        }

        // Now check to see if the user name is in the valid_users list
        // Note that this function is used for both To: and Cc:, so
        // foundValidAddress could have been set in a previous call.
        bool foundInList = false;
        for (size_t i = 0; i < numUsers; i++) {
          const char *word = validUsers[i];
          if (SpamUtil().match(beginPtr, endPtr, word)) {
            foundValidAddress = true;
            foundInList = true;
          }
        } // for

        if (!foundInList) {
          char msg[128];
          char user[128];
          size_t ix = 0;
          for (const char *pCh = beginPtr; pCh < endPtr; pCh++, ix++) {
            user[ix] = *pCh;
          }
          user[ix] = '\0';
          sprintf(msg, "Non-valid user \"%s\", email marked as GARBAGE",
                  user );
          mHeadInfo.reason( msg );
          log.log(Logger::DEBUG, "checkDomainAddrs", msg );
          klass = MailFilter::GARBAGE;
        }
        // endPtr points to the '@'
        pBuf = (endPtr + 1);
      }
    }
    if (klass == MailFilter::UNKNOWN) {
      pBuf = pBuf + domainNameLen;
      domainPtr = strstr(pBuf, domainName);
    }
    else {
      break; // exit the while loop
    }
  } // while

  log.log(Logger::DEBUG, "checkDomainAddrs", "exit");

  return klass;
} // checkDomainAddrs

MailFilter::classification MailHeader::checkFrom ( const char * buf ) [private]

Check to see if an e-mail address in the from_address section of the SpamParameters is in the "From:" field. If a "from_address" string is found then it is valid email and no further checking will be done by the mail filter. This allows people you know to send you e-mail that may have spam or kill words in it.
The "From" is also checked against "from_kill" strings. This allows you to mark as garbage email from frequent spammers. For example, when I developed this software there was a spammer who used YoDude in the from line and another that used "TailWaggingOffers".
Definition at line 123 of file MailHeader.C.
References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason().
Referenced by checkHeader().

{
  MailFilter::classification klass = MailFilter::UNKNOWN;
  log.log(Logger::DEBUG, "checkFrom", "enter");

  vector<const char *> fromAddrs = mParams.getSection(SpamParameters::from_address);
  vector<const char *> killAddrs = mParams.getSection(SpamParameters::from_kill);

  char msg[128];
  char from[256];

  SpamUtil().toLower(from, buf, sizeof(from));

  size_t len;
  len = killAddrs.size();
  for (size_t i = 0; i < len; i++) {
    if (strstr(from, killAddrs[i]) != 0) {
      sprintf(msg, "Found address \"%s\", email marked as GARBAGE",
              killAddrs[i] );
      mHeadInfo.reason( msg );
      log.log(Logger::DEBUG, "checkFrom", msg );
      klass = MailFilter::GARBAGE;
      break;
    }
  }

  if (klass == MailFilter::UNKNOWN) {
    len = fromAddrs.size();
    for (size_t i = 0; i < len; i++) {
      if (strstr(from, fromAddrs[i]) != 0) {
        sprintf(msg, "Found \"from address\" \"%s\", email marked as EMAIL",
                fromAddrs[i] );
        log.log(Logger::DEBUG, "checkFrom", msg );
        klass = MailFilter::EMAIL;
        break;
      }
    } // for
  }

  log.log(Logger::DEBUG, "checkFrom", "exit");
  return klass;
} // checkFrom

MailFilter::classification MailHeader::checkHeader ( FILE * fp )

Rules for processing the e-mail header:
The "To:" and "Cc:"

Check for items in the "to_list". This is where mailing list addresses go. If a "to_list" item is found it is classified as "EMAIL".

Check for a "my_address" address. In some cases spammers do not include your e-mail address in the "To:" or "Cc:" lines since they are using mailing lists or direct SMTP connections. Of course, if your address is found it may still be spam.

The "From:" and "From"
At least in the case of e-mail on Linux there is a "From" line which leads the e-mail file. This line has the following format:

From
Note that this "From" has no colon. A "From:" line follows which may or may not have the same e-mail address. In the case of SPAM it frequently does not, since SPAMmers forge the email address.

Check for an entry in the "from_address" section in the "From:" part of the header. The "from_address" section contains the email addresses (user name, domain, or both user and domain) of people you know. If a "from_address" item is found the mail is marked as valid email.

Check the subject line for spam and kill words (e.g., penis, xanax).
Processing of the email header ends when a blank line is found (all email headers must end with a blank line).
When it comes to recognizing "spam_words" and "kill_words" in the subject line, the code below relies on the fact that the "From:" line precedes the "Subject:" line. This allows your lover, whose address will presumably be in the from_address part of the SpamFilterParams, to send you e-mail with the word "penis" in the subject, without having the mail discarded if "penis" is in the kill_words list.
The subject line and other parts of the email header are copied into a HeaderInfo object. This information is used in generating debug trace information, the garbage trace (for discarded email), and error messages.
Many emails (especially those that are MIME formatted) will have a boundary line, which usually follows the "Content-Type:" line. The boundary line has the format
boundary=""
The boundary string is used to demarcate the bounds of the various sections. This string is saved in the class variable boundaryStr.
Header processing may terminate before the header is completely read since it may be determined at an early point that the email is either valid or SPAM. In these cases the rest of the mail header will be read to initialize the HeaderInfo object.
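The distinction between the leading "From" line (no colon) and the "From:" header field can be sketched as below. This is a simplified illustration, not the checkHeader logic: the FromKind enum and classifyFrom function are hypothetical, and the case-insensitive comparison is an assumption made for the example.

```cpp
#include <cctype>
#include <string>

// Hypothetical classification of a line that may start with "From".
enum FromKind { NOT_FROM, FROM_ENVELOPE, FROM_HEADER };

// Return FROM_HEADER for a "From:" header field, FROM_ENVELOPE for the
// leading "From " line that has no colon, and NOT_FROM otherwise.
FromKind classifyFrom(const std::string &line) {
    static const std::string FROM = "from";
    if (line.size() <= FROM.size()) {
        return NOT_FROM;
    }
    for (std::string::size_type i = 0; i < FROM.size(); i++) {
        const unsigned char ch = static_cast<unsigned char>(line[i]);
        if (std::tolower(ch) != FROM[i]) {
            return NOT_FROM;
        }
    }
    const char next = line[FROM.size()];
    if (next == ':') {
        return FROM_HEADER;
    }
    if (next == ' ') {
        return FROM_ENVELOPE;
    }
    return NOT_FROM;
}
```

With this split, a caller can record the two addresses separately, in the spirit of HeaderInfo::from() and HeaderInfo::fromNoColon().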
Definition at line 804 of file MailHeader.C.
References checkAddressSection(), checkFrom(), checkReceived(), checkSubject(), HeaderInfo::date(), fillInSections(), HeaderInfo::from(), HeaderInfo::fromNoColon(), HeaderInfo::klass(), Logger::log(), parseContentType(), HeaderInfo::subject(), and HeaderInfo::to().
Referenced by MailFilter::checkMail().

{
  log.log(Logger::DEBUG, "checkHeader", "enter");
  MailFilter::classification klass = MailFilter::BAD_VALUE;
  if (!feof(fp)) {
    klass = MailFilter::UNKNOWN;
    char *pBuf;

    // Skip any blank lines which start the e-mail message
    while ((pPushBack = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
      if (! SpamUtil().isBlankLine( pPushBack )) {
        break;
      }
    } // while

    if (pPushBack != 0) { // Loop through the e-mail header
      size_t line = 1;
      const char *RECEIVED = "received";
      static const size_t RECEIVED_LEN = strlen(RECEIVED);
      const char *SUBJECT = "subject";
      static const size_t SUBJECT_LEN = strlen(SUBJECT);
      const char *FROM = "from";
      static const size_t FROM_LEN = strlen(FROM);
      const char *CONTENT_TYPE = "content-type";
      const char *TO = "to";
      static const size_t TO_LEN = strlen(TO);
      const char *CC = "cc";
      const char *DATE = "date";
      const char *pBound = 0;
      const char *pColon = 0;
      char buf[1024];
      do { // DO
        if (pPushBack == 0) {
          pushBackBuf[0] = '\0';
          pBuf = fgets(buf, sizeof(buf), fp);
        }
        else {
          pBuf = pPushBack;
          pPushBack = 0;
        }
        if (!pBuf) {
          break;
        }

        if (! SpamUtil().isBlankLine( pBuf )) {
          pColon = SpamUtil().findColon( pBuf );
          if (SpamUtil().match(pBuf, FROM_LEN, FROM)) {
            pColon = pBuf + FROM_LEN;
            if (*pColon == ':') {
              pColon++;
              mHeadInfo.from( pColon );
            }
            else {
              mHeadInfo.fromNoColon( pColon );
            }
            klass = checkFrom( pColon );
          }
          else if (pColon != 0) {
            if (SpamUtil().match(pBuf, RECEIVED_LEN, RECEIVED)) {
              pColon = pBuf + RECEIVED_LEN + 1;
              klass = checkReceived( pColon, fp, line );
            } else if (SpamUtil().match(pBuf, SUBJECT_LEN, SUBJECT)) {
              pColon = pBuf + SUBJECT_LEN + 1;
              mHeadInfo.subject(pColon);
              klass = checkSubject( pColon, fp );
            }
            else if (SpamUtil().match(pBuf, pColon, CONTENT_TYPE)) {
              pColon++;
              klass = parseContentType(fp, pBuf);
            }
            else if (SpamUtil().match(pBuf, TO_LEN, TO)) {
              pColon = pBuf + TO_LEN + 1;
              mHeadInfo.to( pColon );
              klass = checkAddressSection(pColon, fp);
            }
            else if (SpamUtil().match(pBuf, pColon, CC)) {
              pColon++;
              klass = checkAddressSection(pColon, fp);
            }
            else if (SpamUtil().match(pBuf, pColon, DATE)) {
              pColon++;
              mHeadInfo.date(pColon);
            }
          } // has a colon (pColon != 0)
        } // is not a blank line
        else {
          // found a blank line
          break;
        }
      } while (klass == MailFilter::UNKNOWN && pBuf != 0);

      // if we have not finished on a blank line, fill in any sections that
      // have not been encountered yet.
      if (! SpamUtil().isBlankLine( pBuf )) {
        fillInSections(fp);
      }

      // If the email was not addressed to a known mailing list and
      // an address in the SpamFilterParams section my_address is
      // not found, then it is classified as SPAM.
      if (klass == MailFilter::UNKNOWN && (! foundValidAddress)) {
        log.log(Logger::DEBUG, "checkHeader", "Did not find a valid To: or Cc: address");
        klass = MailFilter::SUSPECT;
      }
    } // if pPushBack
  }

  mHeadInfo.klass( klass );

  char msg[128];
  sprintf(msg, "return value = %s", SpamUtil().classificationToStr( klass ));
  log.log(Logger::DEBUG, "checkHeader", msg );
  log.log(Logger::DEBUG, "checkHeader", "exit");
  return klass;
} // checkHeader

MailFilter::classification MailHeader::checkReceived ( const char * buf,

FILE * fp,

size_t line

) [private]

The received line in the email header spans multiple lines. The end is determined by the next line that contains a colon. Something like "Message-ID:" or "From:" (or, perhaps, another "Received:").
Right now not much is done with this section except to search for the word "forged". If a section is added to the SpamFilterParams file for spammer addresses, then this function could recognize these. Right now it is not clear that this would be very profitable, since spammers move around so much. For use in future checking, the buf pointer points to the character that follows the colon.
One complexity introduced by this function is that it reads the next line to see if this line has a colon header in it. There is no way to "unget" a line. So as a hack around this there is a "pushBackBuf" in the class (can you say global variable by another name) which contains this line. If the pPushBack pointer points at this buffer (i.e., is not NULL) then the pushBackBuf line will be used rather than reading a new line from the input file.
Some mailers add a "may be forged" note on one of the received lines. This seems to happen when the address is given via SMTP, rather than in the "To:" line. In this case, the mail should be marked as SPAM (i.e., SUSPECT).
The "line" argument is for debugging.
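The scan over a multi-line "Received:" field can be sketched as below. Note that this sketch swaps in a different end-of-field test: it uses the standard header folding rule (a continuation line starts with a space or tab) rather than searching the next line for a colon as described above. The names isContinuation and scanReceived are hypothetical, and the header is assumed to be held as a vector of lines without trailing newlines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: a folded header line starts with a space or tab.
static bool isContinuation(const std::string &line) {
    return !line.empty() && (line[0] == ' ' || line[0] == '\t');
}

// Hypothetical helper: scan the header field that starts at index
// `first` together with its continuation lines. Sets `forged` to true
// if any of those lines contains the word "forged". Returns the index
// of the first line that does not belong to the field.
std::size_t scanReceived(const std::vector<std::string> &lines,
                         std::size_t first,
                         bool &forged) {
    forged = false;
    std::size_t i = first;
    if (i >= lines.size()) {
        return i;
    }
    do {
        if (lines[i].find("forged") != std::string::npos) {
            forged = true;
        }
        ++i;
    } while (i < lines.size() && isContinuation(lines[i]));
    return i;
}
```

Because the function returns the index of the next unprocessed line, the caller does not need a push-back buffer in this variant; when reading from a FILE one line at a time, some form of push-back is still required.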
Definition at line 416 of file MailHeader.C.
References Logger::log().
Referenced by checkHeader().

{
  log.log(Logger::DEBUG, "checkReceived", "enter");
  MailFilter::classification klass = MailFilter::UNKNOWN;

  const char *pColon;
  do {
    pPushBack = fgets(pushBackBuf, sizeof(pushBackBuf), fp);
    if (pPushBack != 0) {
      if ((klass == MailFilter::UNKNOWN) && (strstr(pPushBack, "forged") != 0)) {
        klass = MailFilter::SUSPECT;
      }
      pColon = SpamUtil().findColon( pPushBack );
    }
  } while (pPushBack != 0 && !pColon);

  log.log(Logger::DEBUG, "checkReceived", "exit");
  return klass;
} // checkReceived

MailFilter::classification MailHeader::checkSubject ( const char * buf,

FILE * fp

) [private]

Check the email header subject line for spam or kill words or phrases.
Apparently some emails may have multi-line subjects. So after the subject line is found, we check to see if there is another line with a colon (something like "Reply-To:" for example) or if the line is blank (indicating an end to the header). In both cases we "push back" the line. If the line does not have a colon or is not blank, we skip it (since this is the subject line continuing on the next line).
The subject line should not continue on more than one line after the "Subject:" line or something is really wrong with the email format.
Definition at line 68 of file MailHeader.C.
References Logger::log(), and HeaderInfo::reason().
Referenced by checkHeader().

{
  log.log(Logger::DEBUG, "checkSubject", "enter");

  char msg[256];
  char subject[256];
  char foundStr[128];

  // convert to lower case
  SpamUtil().toLower(subject, buf, sizeof(subject));

  foundStr[0] = '\0';
  MailFilter::classification klass = SpamUtil().checkLine(subject,
                                                          mParams,
                                                          foundStr,
                                                          sizeof(foundStr));

  if (klass == MailFilter::SUSPECT || klass == MailFilter::GARBAGE) {
    if (klass == MailFilter::SUSPECT) {
      sprintf(msg, "Found \"spam\" word \"%s\", email marked as SUSPECT",
              foundStr );
    }
    else if (klass == MailFilter::GARBAGE) {
      mHeadInfo.reason( foundStr );
      sprintf(msg, "Found \"kill\" word \"%s\", email marked as GARBAGE",
              foundStr );
    }
    log.log(Logger::DEBUG, "checkSubject", msg );
  }

  pPushBack = 0;
  char *pBuf;
  if ((pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
    if (SpamUtil().isBlankLine(pBuf) || SpamUtil().findColon(pBuf) != 0) {
      pPushBack = pBuf;
    }
  }

  log.log(Logger::DEBUG, "checkSubject", "exit");
  return klass;
} // checkSubject

void MailHeader::fillInSections ( FILE * fp ) [private]

Fill in the email header in the mHeadInfo class variable.
The mHeadInfo object is used to encapsulate header information that is used in generating debug log messages and in generating the garbage trace (if the trace_garbage flag is set).
Processing the email stops as soon as it can be determined that the email is valid, suspect or garbage. In some cases (for example an invalid domain address) the complete header will not have been processed and some fields in mHeadInfo have not been filled in. This function is called to read the rest of the header and fill in these fields.
Definition at line 659 of file MailHeader.C.
References HeaderInfo::date(), HeaderInfo::from(), Logger::log(), HeaderInfo::subject(), and HeaderInfo::to().
Referenced by checkHeader().

{
  static const char *TO = "to:";
  static const size_t TO_LEN = strlen( TO );
  static const char *FROM = "from:";
  static const size_t FROM_LEN = strlen( FROM );
  static const char *SUBJECT = "subject:";
  static const size_t SUBJECT_LEN = strlen( SUBJECT );
  static const char *DATE = "date:";
  static const size_t DATE_LEN = strlen( DATE );
  char buf[1024];
  char *pBuf = 0;
  size_t bufSize = 0;

  log.log(Logger::DEBUG, "fillInSections", "enter");

  do {
    if (pPushBack == 0) {
      // mark the push-back buffer as empty (a null terminator, not the
      // character '0')
      pushBackBuf[0] = '\0';
      bufSize = sizeof(buf);
      pBuf = fgets(buf, bufSize, fp);
    }
    else {
      pBuf = pPushBack;
      bufSize = sizeof( pushBackBuf );
      pPushBack = 0;
    }
    if (pBuf != 0) {
      if (! SpamUtil().isBlankLine( pBuf )) {
        char *pCopy = 0;
        if (SpamUtil().match(pBuf, TO_LEN, TO)) {
          pCopy = pBuf + TO_LEN;
          mHeadInfo.to( pCopy );
        }
        else if (SpamUtil().match(pBuf, FROM_LEN, FROM)) {
          pCopy = pBuf + FROM_LEN;
          mHeadInfo.from( pCopy );
        }
        else if (SpamUtil().match(pBuf, SUBJECT_LEN, SUBJECT)) {
          pCopy = pBuf + SUBJECT_LEN;
          mHeadInfo.subject( pCopy );
        }
        else if (SpamUtil().match(pBuf, DATE_LEN, DATE)) {
          pCopy = pBuf + DATE_LEN;
          mHeadInfo.date( pCopy );
        }
      }
      else {
        // found a blank line which follows the mail header
        break;
      }
    }
  } while (pBuf != 0);

  if (pBuf == 0) {
    log.log(Logger::DEBUG, "fillInSections", "end-of-file reached");
  }
  log.log(Logger::DEBUG, "fillInSections", "exit");
} // fillInSections

MailFilter::classification MailHeader::parseContentType ( FILE * fp,

char * contentBuf

)

Return the content type for the email.
Many emails (especially those which are MIME encoded, but others as well) include a Content-type section whose format is:
"Content-type:" type
Examples of the type include "text/html", "text/plain", "multipart/alternative", "multipart/mixed", and "image/jpg".
This spam filter marks all email that _starts_ with an HTML section, instead of a text section, as SUSPECT, which will result in placing the email in the junk_mail file.
This spam filter also attempts to identify email with base64 encoded sections. If the "kill_base64" flag is set, email with base64 encoded sections will be discarded. This tends to weed out viruses and spam that attempts to hide behind the base64 encoding. If the "kill_base64" flag is not set, the email with base64 encoded data will be placed in the junk_mail file.
The Content-Type section may be followed by a charset or boundary definition. The boundary definition is discussed below.
The charset definition may be on the same line as the Content-Type definition (separated by a semicolon) or it may be on the following line. The case that causes difficulty is the one where it is on the following line:
Content-Type: text/html;
        charset="iso-8859-1"
Content-Transfer-Encoding: base64
In this case the charset line is skipped, setting up processing for the Content-Transfer-Encoding line, which in this case is base64. This avoids having an email marked as HTML when we really want to classify it as base64 encoded.
Multipart email is separated by boundary strings. This allows email programs (and this spam filter) to skip to a section by looking for the boundary string.
A boundary string definition may follow the Content-Type. This may be either on the same line, separated by a semicolon:
Content-Type: multipart/mixed;boundary="--SpammersAreScum--"
or on the following line:
Content-Type: multipart/mixed;
        boundary="--SpammersAreScum--"
The boundary definition is saved in a class variable.
Note that the boundary definition is not saved if the Content-Type is text. This is because the following sometimes appears:
Content-Type: text/plain; boundary="--09064944530622531466"
Here, even though a boundary section is defined, it is unused because the email Content-Type is text.
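The boundary rules above can be illustrated with a small string-based sketch. It is not the parseContentType or saveBoundary code: extractBoundary and toLowerCopy are hypothetical names, the Content-Type field is assumed to be passed as one string (with any continuation line already joined), and the text-type test is a simple substring check.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercase copy of a string.
static std::string toLowerCopy(const std::string &s) {
    std::string out(s);
    for (std::string::size_type i = 0; i < out.size(); i++) {
        out[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(out[i])));
    }
    return out;
}

// Hypothetical helper: return the boundary value from a Content-Type
// field, or an empty string if there is none or if the type is text
// (in which case the boundary is ignored).
std::string extractBoundary(const std::string &header) {
    const std::string lower = toLowerCopy(header);
    std::string::size_type pos = lower.find("boundary");
    if (pos == std::string::npos) {
        return "";
    }
    const std::string::size_type textPos = lower.find("text/");
    if (textPos != std::string::npos && textPos < pos) {
        return "";  // boundary defined but unused for text content
    }
    pos += 8;  // length of "boundary"
    if (pos >= header.size() || header[pos] != '=') {
        return "";
    }
    ++pos;
    // skip white space between the '=' and the value
    while (pos < header.size() &&
           std::isspace(static_cast<unsigned char>(header[pos]))) {
        ++pos;
    }
    // the opening quote is optional
    if (pos < header.size() && header[pos] == '"') {
        ++pos;
    }
    std::string::size_type end = pos;
    while (end < header.size() && header[end] != '"' &&
           !std::isspace(static_cast<unsigned char>(header[end]))) {
        ++end;
    }
    return header.substr(pos, end - pos);
}
```

For the multipart example above the function returns the string between the quotes; for the text/plain example it returns an empty string.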
Definition at line 561 of file MailHeader.C.
References SpamParameters::hasFlag(), Logger::log(), HeaderInfo::reason(), and saveBoundary().
Referenced by checkHeader().

{
  const char *BOUNDARY = "boundary";
  static size_t BOUNDARY_LEN = strlen( BOUNDARY );
  const char *CONTENT_ENCODE = "Content-Transfer-Encoding";
  static size_t CONTENT_ENCODE_LEN = strlen(CONTENT_ENCODE);

  log.log(Logger::DEBUG, "parseContentType", "enter");

  MailFilter::classification klass = MailFilter::UNKNOWN;

  SpamUtil::contentType type = SpamUtil().classifySection( contentBuf );

  {
    char *pBound = strstr(contentBuf, BOUNDARY);
    if (pBound && type != SpamUtil::TEXT) {
      saveBoundary( pBound + BOUNDARY_LEN );
    }
  }

  pPushBack = 0;
  char *pBuf;
  if ((pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
    // If the line that follows the Content-Type does not have a colon
    // (e.g., findColon() does not return a pointer) and it is not a
    // blank line, then it may be a boundary definition or a charset
    // definition. If it is a boundary we want to pick it up. Otherwise
    // we want to skip it.
    if (SpamUtil().isBlankLine(pBuf) || SpamUtil().findColon(pBuf) != 0) {
      pPushBack = pBuf;
    }
    else {
      char *pBound = strstr(pBuf, BOUNDARY);
      if (pBound && type != SpamUtil::TEXT) {
        saveBoundary( pBound + BOUNDARY_LEN );
      }
      // get the next line
      pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp);
      pPushBack = pBuf;
    }

    // Check for the Content-Transfer-Encoding line and see if it is base64.
    // pBuf is null here if the second fgets() above reached end-of-file.
    char *pEncode;
    if (pBuf != 0 && (pEncode = strstr(pBuf, CONTENT_ENCODE)) != 0) {
      pPushBack = 0;
      if (strstr(pEncode + CONTENT_ENCODE_LEN, "base64")) {
        type = SpamUtil::BASE64;
      }
    }

  }

  if (type == SpamUtil::BASE64) {
    if (mParams.hasFlag("kill_base64")) {
      mHeadInfo.reason("found base64 encoded information");
      klass = MailFilter::GARBAGE;
    }
    else {
      klass = MailFilter::SUSPECT;
    }
  }
  else if (type == SpamUtil::HTML) {
    klass = MailFilter::SUSPECT;
  }
  else if (type == SpamUtil::IMAGE || type == SpamUtil::AUDIO) {
    klass = MailFilter::SUSPECT;
  }

  const char *typeName;
  typeName = SpamUtil().typeToStr( type );

  char msg[128];
  sprintf(msg, "mail type = %s", typeName );
  log.log(Logger::DEBUG, "parseContentType", msg );

  log.log(Logger::DEBUG, "parseContentType", "exit");
  return klass;
} // parseContentType

void MailHeader::saveBoundary ( const char * pBound ) [private]

Save the boundary string which may follow the "Content-Type:" line. The format for the boundary definition is
boundary=""
The pBound argument should point to the '=' character.
There are times when the boundary string does not start with a quote. There are also times when there is white space between the "=" and the quote.
The boundary string is used in processing the mail body to move between the various sections of an email.
Definition at line 456 of file MailHeader.C.
References Logger::log().
Referenced by parseContentType().

{
  log.log(Logger::DEBUG, "saveBoundary", "enter");

  if (*pBound == '=') {
    pBound++;
    // skip any white space between the "=" and the quote
    pBound = SpamUtil().skipWhiteSpace( pBound );
    if (*pBound == '"') {
      pBound++;
    }
    const size_t len = sizeof(boundaryStr) - 1;
    size_t ix = 0;
    while (*pBound &&
           ix < len &&
           (! isspace(*pBound)) &&
           *pBound != '"') {
      boundaryStr[ix] = *pBound;
      pBound++;
      ix++;
    }
    boundaryStr[ix] = '\0';

    if (ix > 0) {
      char msg[128];
      sprintf(msg, "boundary str. = \"%s\"", boundaryStr );
      log.log(Logger::DEBUG, "saveBoundary", msg);
    }
  }
  else {
    log.log(Logger::ERROR, "saveBoundary", "'=' expected");
  }

  log.log(Logger::DEBUG, "saveBoundary", "exit");
} // saveBoundary

The documentation for this class was generated from the following files:

MailHeader.h
MailHeader.C

Generated on Sat Mar 27 13:07:38 2004 for Mail Filter by Doxygen 1.3.3

