Saturday, July 24, 2004

Spam Classification Results from an informal test

I'd been noticing that SpamAssassin, at a threshold of 4.5 and even with its built-in Bayesian scoring, was just not performing as well as Bogofilter, which has ONLY Bayesian scoring (though, of course, I tweaked its spam and ham cutoffs and other parameters around 3 months ago). I decided to do an informal test.

Procedure:

0. I used my already trained bogofilter and sa-learn setups. For about a month now, I've
been taking spam that bogofilter found but that spamassassin did not flag, and
feeding it to sa-learn in the hope that spamassassin's bayesian test would learn from it
and eventually score similar mail as spam. However, even after a month of this training,
I see the result documented below (i.e., spamassassin's bayesian component doesn't seem to
learn very well).

1. Get mboxes from various sources. The mboxes include both spam and ham.

2. Run the email through spamassassin and bogofilter. The bogofilter wordlist does not
include any spamassassin markup because all email is run through a filter that removes
such markup and performs other cleanup, e.g., removing all lines with too many
consecutive characters without whitespace. The main effect of that cleanup is to throw
away attachments that are encoded via MIME, Base64 or other encoding schemes.

3. Have evolution group the email into ham, mail that only bogofilter thought was spam,
mail that only spamassassin thought was spam, and mail that both thought was spam.

4. Eyeball all that email (very quickly, mainly looking at from and subject lines, and then
viewing the body of suspicious email).
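
The cleanup mentioned in step 2 (dropping lines with too many consecutive non-whitespace characters) could look roughly like the sketch below. This is not the actual filter I run: the function name and the cutoff of 60 characters are assumptions made up for illustration, and the real cutoff may differ.

```python
import re

# Assumed cutoff: the post does not say how many consecutive
# non-whitespace characters count as "too many".
MAX_RUN = 60


def drop_dense_lines(text, max_run=MAX_RUN):
    """Return text without any line that contains a run of more than
    max_run non-whitespace characters (typical of encoded attachments)."""
    too_dense = re.compile(r"\S{%d,}" % (max_run + 1))
    kept = [line for line in text.splitlines() if not too_dense.search(line)]
    return "\n".join(kept)
```

Base64 bodies are usually wrapped at 76 characters with no spaces, so a cutoff anywhere below that drops them while leaving ordinary prose lines alone.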

At the end of all that, I see the following numbers:

On the positive side for both:

  • 1339 spam correctly classified by bogofilter

  • 1337 spam correctly classified by both bogofilter and spamassassin

  • 697 non-spam correctly classified by both bogofilter and spamassassin

  • 0 false negatives by either bogofilter or spamassassin

  • 0 false positives misclassified by bogofilter


On the minus side:

  • 104 bogofilter false negatives (spam that bogofilter didn't flag; all of these were also missed by spamassassin)

  • 90 false positives misclassified by spamassassin only (bogofilter correctly said they were not spam)


SpamAssassin has too high a false positive rate for me. Any false positives are a major problem since, with so much spam overwhelming the nonspam, false positives are very likely to hide in the spam noise and thus get lost. And while the rate here is very low in terms of probability, that is still too high for me.

False negatives aren't such a big deal since basically, the amount of spam is cut down to 1/100th or less of the true spam volume and the little spam left in inboxes is merely a nuisance and not the productivity destroyer that it used to be.

Given these results, where fully half of the spam I found is not correctly classified by SpamAssassin, I cannot afford to use only SpamAssassin. Of course, possibly my threshold of 4.5 is too high, but with the already too high levels of false positives now, lowering the threshold to catch more spam will mean that there will be an increase in false positives too. I'll continue my current system where both spamassassin and bogofilter are in use.

  • Email that bogofilter doesn't flag as spam but spamassassin does, is examined and, if it's really spam, sent to bogofilter for training.

  • If it's not really spam, then it's sent to sa-learn for training as --ham, so that the bayesian component will eventually learn that it isn't spam and, hopefully, contribute to decreasing the spamassassin scores of similar email in the future.

  • Email that bogofilter flags as spam but spamassassin doesn't is examined and, if it's really spam, sent to sa-learn for training.

  • If it isn't spam, then it's sent to sa-learn for training as --ham.

  • Email that neither bogofilter nor SA classifies as spam but which *is* spam (a false negative) is trained as spam in both.

  • I generally just delete email that is flagged as spam by both, since my false positive rates there are zero: I haven't seen any false positives from bogofilter, or from bogofilter+spamassassin, in a year.
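
The routine in the list above boils down to a small decision table. Here is a minimal sketch of it; the function name and the action labels are made up for illustration (they are descriptions, not real commands), and the branches follow the bullets as written.

```python
def triage(bogo_spam, sa_spam, really_spam):
    """Return follow-up actions for one message.

    bogo_spam / sa_spam: what each filter said.
    really_spam: my verdict after eyeballing the message
    (not consulted when both filters agree it is spam).
    """
    if bogo_spam and sa_spam:
        # Flagged by both: deleted without a manual check.
        return ["delete"]
    if sa_spam:
        # Only spamassassin flagged it.
        if really_spam:
            return ["train bogofilter as spam"]
        return ["sa-learn as ham"]
    if bogo_spam:
        # Only bogofilter flagged it.
        if really_spam:
            return ["sa-learn as spam"]
        return ["sa-learn as ham"]  # as listed in the post
    # Neither filter flagged it.
    if really_spam:
        return ["train bogofilter as spam", "sa-learn as spam"]
    return []  # plain ham, nothing to do
```

For example, `triage(False, True, False)` is a spamassassin-only false positive, so the only action is to teach sa-learn that the message is ham.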

Friday, July 23, 2004

Mailbomb DDOS and Postfix solution

We're suddenly getting hit by a DDoS that's mailbombing our SMTP server with many simultaneous incoming emails for email addresses that don't exist. So we're getting a lot of errors in our logs about rejected email because of "User unknown in local recipient table". It took us a while to get a handle on this. We got part of the way with some hacks, but the server was still unstable. I posted questions on the Philippine Linux User's Group mailing list and the postfix-users mailing list, and I've got a recipe of things to mitigate the problem.
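
To get a feel for who is hitting us, something like the sketch below can tally the "User unknown in local recipient table" rejections per client IP. It is only a sketch: the sample log lines are invented (documentation-range IPs, example.com addresses), the function name is hypothetical, and the regex assumes the usual "reject: RCPT from host[ip]" shape of Postfix smtpd log lines, which may vary between versions.

```python
import re
from collections import Counter

# Assumed log shape: "... reject: RCPT from host[ip]: ... User unknown in
# local recipient table ..."
REJECT = re.compile(
    r"reject: RCPT from \S*\[(?P<ip>[^\]]+)\]"
    r".*User unknown in local recipient table"
)

# Invented sample lines, for illustration only.
SAMPLE = [
    "Jul 23 10:00:01 mail postfix/smtpd[1234]: NOQUEUE: reject: RCPT from "
    "unknown[192.0.2.10]: 550 <nobody@example.com>: User unknown in local "
    "recipient table; from=<a@example.net> to=<nobody@example.com>",
    "Jul 23 10:00:02 mail postfix/smtpd[1235]: NOQUEUE: reject: RCPT from "
    "host.example.net[192.0.2.11]: 550 <nope@example.com>: User unknown in "
    "local recipient table; from=<b@example.net> to=<nope@example.com>",
    "Jul 23 10:00:03 mail postfix/smtpd[1234]: NOQUEUE: reject: RCPT from "
    "unknown[192.0.2.10]: 550 <ghost@example.com>: User unknown in local "
    "recipient table; from=<a@example.net> to=<ghost@example.com>",
    "Jul 23 10:00:04 mail postfix/smtpd[1236]: connect from "
    "unknown[192.0.2.12]",
]


def count_unknown_user_rejects(lines):
    """Count unknown-recipient rejections per client IP."""
    counts = Counter()
    for line in lines:
        match = REJECT.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


counts = count_unknown_user_rejects(SAMPLE)
```

Feeding the real mail log into `count_unknown_user_rejects` and looking at `counts.most_common()` shows whether the rejections come from a handful of hosts or are spread thinly across many.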

Orly at mozcom says to do:

disable_vrfy_command = yes
smtpd_banner = $myhostname NO UCE ESMTP
smtpd_delay_reject = no

# slowing down bad clients [added recommendations from wietse]
# we NEED hard_error_limit in order for dictionary-attack stoppage to work
smtpd_error_sleep_time = 0s
smtpd_soft_error_limit = 5
smtpd_hard_error_limit = 10
smtpd_timeout = 30s

and Victor Duchovni on the postfix mailing list gave me the smtpd_error_sleep_time setting too. Thanks to both. We've checked with upstream and downstream mailservers and they're not getting bombed. So it's probably a targeted DDoS. Some competitor in CDO is sufficiently worried about us that they're willing to pay real money to have thousands of zombie computers out there attack us (many of the IPs resolve to DSL and cable companies in the States, so they're always-on, high-bandwidth, cracked-wide-open Windows boxes being orchestrated to attack us at the same time). We had a similar problem around midnight one night, with a very high volume of UDP packets coming in. Ah well, there's probably no way to trace this back to the person or company that commissioned this short of going and finding the person/persons who cracked those zombie machines and, well, dismembering them little by little until they squeal.