.NET email spam filter

Saturday, October 22nd, 2011

Mail.dll .NET email component includes a high-accuracy anti-spam filter.

It uses an enhanced naive Bayesian classifier, specifically modified to handle email messages. Bayesian filtering is a very powerful technique for dealing with spam.
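To give a feel for how a naive Bayesian classifier scores a message, here is a minimal textbook-style sketch. It is not Mail.dll's internal algorithm (the enhancements mentioned above are not reproduced), and all names in it (NaiveBayesSketch, SpamProbability, the count dictionaries) are invented for illustration. It assumes equal prior probability for spam and ham and uses add-one smoothing:

```csharp
using System;
using System.Collections.Generic;

// Generic naive Bayes scoring sketch; NOT the Mail.dll implementation.
public static class NaiveBayesSketch
{
    // spamCounts / hamCounts: for each token, the number of training
    // messages of that class in which the token appeared (hypothetical data).
    public static double SpamProbability(
        IEnumerable<string> tokens,
        IDictionary<string, int> spamCounts,
        IDictionary<string, int> hamCounts,
        int spamMessages,
        int hamMessages)
    {
        // Sum logarithms instead of multiplying probabilities to avoid underflow.
        double logSpam = 0.0;
        double logHam = 0.0;

        // Each distinct token contributes once.
        foreach (string token in new HashSet<string>(tokens))
        {
            int s, h;
            spamCounts.TryGetValue(token, out s); // s stays 0 for unseen tokens
            hamCounts.TryGetValue(token, out h);

            // Add-one (Laplace) smoothing keeps unseen tokens from zeroing the score.
            logSpam += Math.Log((s + 1.0) / (spamMessages + 2.0));
            logHam += Math.Log((h + 1.0) / (hamMessages + 2.0));
        }

        // With equal priors: P(spam | tokens) = 1 / (1 + exp(logHam - logSpam)).
        return 1.0 / (1.0 + Math.Exp(logHam - logSpam));
    }
}
```

With 10 spam and 10 ham training messages, a single token seen in 9 of the spam messages and in none of the ham messages yields a score of 10/11, roughly 0.91. A message with no tokens stays at the neutral 0.5.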

In our tests we achieved about 99.5% accuracy with a very low false positive rate (9 false positives in 54,972 emails tested, that is 0.016%).

Training

First, in the learning phase, you need to teach the classifier to recognize spam and non-spam (ham) messages. Prepare 100-200 messages of each kind.

I suggest using the following folder structure:

c:\bayes\learn\spam
c:\bayes\learn\ham
c:\bayes\test\spam
c:\bayes\test\ham

The “Learn” folder is used for training the filter. The spam and ham folders should each contain around 100-200 messages (the more the better), and the number of messages in both must be equal. You can find a spam archive at the bottom of the article.
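Because the two training folders must hold the same number of messages, a quick count before teaching can save a wasted run. A minimal sketch using only System.IO; the class and method names (TrainingSetCheck, IsBalanced) are made up for illustration and are not part of Mail.dll:

```csharp
using System;
using System.IO;

public static class TrainingSetCheck
{
    // Returns true when both folders contain the same, non-zero number of *.eml files.
    public static bool IsBalanced(string spamFolder, string hamFolder)
    {
        int spam = Directory.GetFiles(spamFolder, "*.eml").Length;
        int ham = Directory.GetFiles(hamFolder, "*.eml").Length;

        Console.WriteLine("spam: {0}, ham: {1}", spam, ham);
        return spam > 0 && spam == ham;
    }
}
```

It could be called as TrainingSetCheck.IsBalanced(@"c:\bayes\learn\spam", @"c:\bayes\learn\ham") before the teaching step.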

Messages must be in eml format with correct line endings (\r\n, that is bytes 0D 0A hex, or 13 10 decimal).
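If your sample messages come from a source that uses bare \n line endings or lacks the .eml extension, they can be converted up front. The following is a sketch under that assumption, using only System.IO; EmlNormalizer, ToCrLf and ConvertFolder are illustrative names, not Mail.dll API. It works on raw bytes so that message encodings are left untouched:

```csharp
using System;
using System.IO;

public static class EmlNormalizer
{
    // Converts bare LF and bare CR to CRLF; existing CRLF pairs are kept as they are.
    public static byte[] ToCrLf(byte[] input)
    {
        using (var output = new MemoryStream(input.Length + 64))
        {
            for (int i = 0; i < input.Length; i++)
            {
                byte b = input[i];
                if (b == 13) // CR
                {
                    output.WriteByte(13);
                    output.WriteByte(10);
                    // Skip the LF that already follows this CR, if any.
                    if (i + 1 < input.Length && input[i + 1] == 10) i++;
                }
                else if (b == 10) // bare LF
                {
                    output.WriteByte(13);
                    output.WriteByte(10);
                }
                else
                {
                    output.WriteByte(b);
                }
            }
            return output.ToArray();
        }
    }

    // Writes a normalized copy of every file in sourceFolder into targetFolder,
    // appending ".eml" to names that do not already end with it.
    public static void ConvertFolder(string sourceFolder, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);
        foreach (string path in Directory.GetFiles(sourceFolder))
        {
            string name = Path.GetFileName(path);
            if (!name.EndsWith(".eml", StringComparison.OrdinalIgnoreCase))
                name += ".eml";

            byte[] normalized = ToCrLf(File.ReadAllBytes(path));
            File.WriteAllBytes(Path.Combine(targetFolder, name), normalized);
        }
    }
}
```

Writing to a separate target folder keeps the original files intact in case the conversion needs to be repeated.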

Now we use the SpamFilterTeacher class to teach the BayesianMailFilter:

// C#
using Limilabs.Mail.Tools.Spam;

BayesianMailFilter filter = new BayesianMailFilter();
SpamFilterTeacher teacher = new SpamFilterTeacher(filter);
teacher.TeachSpam(@"c:\bayes\learn\spam");
teacher.TeachHam(@"c:\bayes\learn\ham");

Testing

The “Test” folder is used for testing our filter:

// C#

SpamTestResults r = teacher.Test(
    @"c:\bayes\test\spam",
    @"c:\bayes\test\ham");

Console.WriteLine(r);
r.FalsePositives.ForEach(Console.WriteLine);
r.NotMarkedAsSpam.ForEach(Console.WriteLine);

The results should be similar to this:

Accuracy=0.9949, False positives=9, Not marked as spam=271, Tests count=54972
c:\bayes\test\ham/16874.eml
...
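The figures in that output can be cross-checked by hand. The sketch below assumes that Accuracy is the share of all tested messages that were classified correctly, which the sample numbers bear out. FilterMetrics and its methods are illustrative names, not part of Mail.dll:

```csharp
using System;
using System.Globalization;

public static class FilterMetrics
{
    // Share of all tested messages that were classified correctly.
    public static double Accuracy(int tests, int falsePositives, int notMarkedAsSpam)
    {
        return (tests - falsePositives - notMarkedAsSpam) / (double)tests;
    }

    // Ham messages wrongly flagged as spam, as a share of all tested messages.
    public static double FalsePositiveShare(int tests, int falsePositives)
    {
        return falsePositives / (double)tests;
    }

    public static string Describe(int tests, int falsePositives, int notMarkedAsSpam)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "Accuracy={0:F4}, False positives={1:F3}%",
            Accuracy(tests, falsePositives, notMarkedAsSpam),
            FalsePositiveShare(tests, falsePositives) * 100.0);
    }
}
```

For the sample output, FilterMetrics.Describe(54972, 9, 271) returns "Accuracy=0.9949, False positives=0.016%": 280 of the 54,972 messages were misclassified, and 9 of those were false positives.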

When the filter is trained and the results are satisfactory, you can save it to disk:

// C#

filter.Save(@"c:\20111022.mbayes");

Using

You can load the filter from disk and check individual messages:

// C#
using Limilabs.Mail;
using Limilabs.Mail.Tools.Spam;

BayesianMailFilter filter = new BayesianMailFilter();
filter.Load(@"c:\20111022.mbayes");

// you can use Mail.dll to download a message from a POP3 or IMAP server:
var eml = ...

IMail email = new MailBuilder().CreateFromEml(eml);

SpamResult result = filter.Examine(email);
Console.WriteLine(result.Probability);
Console.WriteLine(result.IsSpam);

If the filter classifies a message incorrectly, you can train it further with that message:

// C#

filter.LearnSpam(email);
// - or -
filter.LearnHam(email);

filter.Save(@"c:\20111022.mbayes");

Spam archives

For the most recent spam, you can check this great archive: http://www.untroubled.org/spam/.
Unfortunately, its messages don’t have the correct extension (*.eml), and their line endings are incorrect.

You can download a spam archive containing 7,874 spam messages from October 2011 here:
/static/mail/spam/spam201110.zip