.NET email spam filter
The Mail.dll .NET email component includes a high-accuracy anti-spam filter.
It uses an enhanced naive Bayesian classifier, specifically modified to handle email messages. Bayesian spam filtering is a very powerful technique for dealing with spam.
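To give a feel for how a naive Bayesian classifier works in general, here is a minimal sketch. It is a toy example written for this article's readers, not the enhanced algorithm inside Mail.dll (whose internals are not described here). The class name `ToyBayesFilter`, the tokenizer, the smoothing constants and the assumption of equal priors are all illustrative choices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy word-based naive Bayes classifier. Illustration only: this is NOT
// the algorithm Mail.dll implements.
public class ToyBayesFilter
{
    private readonly Dictionary<string, int> _spamCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> _hamCounts = new Dictionary<string, int>();
    private int _spamMessages;
    private int _hamMessages;

    // Lower-cases the text and returns each distinct word once.
    private static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n', '.', ',', '!', '?', ':', ';' },
                   StringSplitOptions.RemoveEmptyEntries)
            .Distinct();
    }

    public void LearnSpam(string text) { Learn(text, _spamCounts); _spamMessages++; }
    public void LearnHam(string text) { Learn(text, _hamCounts); _hamMessages++; }

    // Counts in how many messages of a class each word appeared.
    private static void Learn(string text, Dictionary<string, int> counts)
    {
        foreach (string word in Tokenize(text))
        {
            counts.TryGetValue(word, out int n);
            counts[word] = n + 1;
        }
    }

    // Returns P(spam | words), assuming equal priors and independent words.
    public double SpamProbability(string text)
    {
        double logSpam = 0.0, logHam = 0.0;
        foreach (string word in Tokenize(text))
        {
            _spamCounts.TryGetValue(word, out int s);
            _hamCounts.TryGetValue(word, out int h);
            // Laplace smoothing so unseen words do not zero out the product.
            logSpam += Math.Log((s + 1.0) / (_spamMessages + 2.0));
            logHam += Math.Log((h + 1.0) / (_hamMessages + 2.0));
        }
        // Convert the log-odds back to a probability.
        return 1.0 / (1.0 + Math.Exp(logHam - logSpam));
    }
}
```

Words seen mostly in spam push the probability towards 1, and words seen mostly in ham push it towards 0. A real filter also looks at headers, attachments and other message features.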
In our tests we achieved 99.6% accuracy with a very low false positive rate (9 false positives in 54,972 emails tested, which is 0.016%).
First, in the learning phase, you need to teach the classifier to recognize spam and non-spam (ham) messages. You need to prepare 100-200 spam messages and 100-200 ham messages.
I suggest using the following folder structure: c:\bayes\learn\spam and c:\bayes\learn\ham for training, plus c:\bayes\test\spam and c:\bayes\test\ham for testing.
The “Learn” folder is used for training the filter. Both the spam and ham folders should contain around 100-200 messages each (the more the better). The number of messages in the spam and ham folders must be equal. You can find a spam archive at the bottom of the article.
Messages must be in eml format with correct line endings (\r\n, that is bytes 13 10 decimal, or 0D 0A hex).
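Before training, it may help to verify these two requirements (equal message counts and \r\n line endings). The following is a minimal sketch using only the .NET base class library. `TrainingSetCheck`, `HasOnlyCrLf` and `FindProblems` are hypothetical helper names and are not part of Mail.dll.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical pre-flight check for the learn folders (not a Mail.dll API).
public static class TrainingSetCheck
{
    // True when every LF (0x0A) is preceded by CR (0x0D)
    // and every CR is followed by LF.
    public static bool HasOnlyCrLf(byte[] data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0x0A && (i == 0 || data[i - 1] != 0x0D)) return false;
            if (data[i] == 0x0D && (i == data.Length - 1 || data[i + 1] != 0x0A)) return false;
        }
        return true;
    }

    // Returns a human-readable list of problems; an empty list means OK.
    public static List<string> FindProblems(string spamFolder, string hamFolder)
    {
        var problems = new List<string>();
        string[] spam = Directory.GetFiles(spamFolder, "*.eml");
        string[] ham = Directory.GetFiles(hamFolder, "*.eml");

        if (spam.Length != ham.Length)
            problems.Add($"Unequal counts: {spam.Length} spam vs {ham.Length} ham");

        foreach (string path in spam.Concat(ham))
        {
            if (!HasOnlyCrLf(File.ReadAllBytes(path)))
                problems.Add($"Bad line endings: {path}");
        }
        return problems;
    }
}
```

For example, `TrainingSetCheck.FindProblems(@"c:\bayes\learn\spam", @"c:\bayes\learn\ham")` could be called before teaching, and its messages printed with `Console.WriteLine`.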
Now we use the SpamFilterTeacher class to teach the BayesianMailFilter:
    // C#
    using Limilabs.Mail.Tools.Spam;

    BayesianMailFilter filter = new BayesianMailFilter();
    SpamFilterTeacher teacher = new SpamFilterTeacher(filter);

    // teach the filter from the two "learn" folders
    teacher.TeachSpam(@"c:\bayes\learn\spam");
    teacher.TeachHam(@"c:\bayes\learn\ham");
The “Test” folder is used for testing our filter:
    // C#
    SpamTestResults r = teacher.Test(
        @"c:\bayes\test\spam",
        @"c:\bayes\test\ham");

    // print the summary, then each misclassified message
    Console.WriteLine(r);
    r.FalsePositives.ForEach(Console.WriteLine);
    r.NotMarkedAsSpam.ForEach(Console.WriteLine);
The results should be similar to this:
Accuracy=0.9949, False positives=9, Not marked as spam=271, Tests count=54972
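The figures in this sample output can be related to each other with simple arithmetic. The sketch below assumes that accuracy is the share of correctly classified messages out of all tests. That assumption is mine, but it does reproduce the 0.9949 shown above. `TestSummary` is a hypothetical helper and not part of Mail.dll.

```csharp
using System;

// Hypothetical helper that recomputes the summary figures by hand.
public static class TestSummary
{
    // Share of messages classified correctly (assumed definition of accuracy).
    public static double Accuracy(int falsePositives, int notMarkedAsSpam, int testsCount)
    {
        int errors = falsePositives + notMarkedAsSpam;
        return (testsCount - errors) / (double)testsCount;
    }

    // Share of all tested messages that were wrongly marked as spam.
    public static double FalsePositiveRate(int falsePositives, int testsCount)
    {
        return falsePositives / (double)testsCount;
    }
}
```

With the numbers above, `TestSummary.Accuracy(9, 271, 54972)` is (54972 - 280) / 54972, roughly 0.9949, and `TestSummary.FalsePositiveRate(9, 54972)` is roughly 0.00016, i.e. 0.016%.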
When the filter is trained and the results are satisfactory, you can save it to disk:
    // C#
    filter.Save(@"c:\20111022.mbayes");
You can load the filter from disk and check individual messages:
    // C#
    BayesianMailFilter filter = new BayesianMailFilter();
    filter.Load(@"c:\20111022.mbayes");

    // you can use Mail.dll to download a message from a POP3 or IMAP server:
    var eml = ...
    IMail email = new MailBuilder().CreateFromEml(eml);

    SpamResult result = filter.Examine(email);
    Console.WriteLine(result.Probability);
    Console.WriteLine(result.IsSpam);
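As one possible way to obtain the message bytes, the sketch below downloads unseen messages over IMAP with Mail.dll and examines each one. The server name, credentials and filter path are placeholders, and error handling is left out, so treat it as an outline to adapt and not a finished program.

```csharp
using System;
using System.Collections.Generic;
using Limilabs.Client.IMAP;
using Limilabs.Mail;
using Limilabs.Mail.Tools.Spam;

public static class ExamineInbox
{
    public static void Main()
    {
        BayesianMailFilter filter = new BayesianMailFilter();
        filter.Load(@"c:\20111022.mbayes");

        using (Imap imap = new Imap())
        {
            // Placeholder server and credentials: replace with your own.
            imap.ConnectSSL("imap.example.com");
            imap.UseBestLogin("user", "password");
            imap.SelectInbox();

            // Examine every message that has not been read yet.
            List<long> uids = imap.Search(Flag.Unseen);
            foreach (long uid in uids)
            {
                byte[] eml = imap.GetMessageByUID(uid);
                IMail email = new MailBuilder().CreateFromEml(eml);

                SpamResult result = filter.Examine(email);
                Console.WriteLine("{0}: spam={1}, probability={2}",
                    email.Subject, result.IsSpam, result.Probability);
            }
            imap.Close();
        }
    }
}
```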
If the filter classifies a message incorrectly, you can train it again with that message:
    // C#
    filter.LearnSpam(email);
    // - or -
    filter.LearnHam(email);

    filter.Save(@"c:\20111022.mbayes");
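The correction step can be wrapped in a small helper that retrains only when the filter disagrees with the verdict you supply. This is a sketch built from the calls shown above (`Examine`, `LearnSpam`, `LearnHam`, `Save`). `FilterFeedback` and `Correct` are hypothetical names, and retraining only on disagreement is my design choice, not something Mail.dll requires.

```csharp
using Limilabs.Mail;
using Limilabs.Mail.Tools.Spam;

// Hypothetical helper (not part of Mail.dll).
public static class FilterFeedback
{
    // Retrains and saves the filter when its verdict differs from the
    // expected one. Returns true when retraining took place.
    public static bool Correct(
        BayesianMailFilter filter, IMail email, bool shouldBeSpam, string filterPath)
    {
        SpamResult result = filter.Examine(email);
        if (result.IsSpam == shouldBeSpam)
            return false; // the filter already agrees, nothing to learn

        if (shouldBeSpam)
            filter.LearnSpam(email);
        else
            filter.LearnHam(email);

        filter.Save(filterPath);
        return true;
    }
}
```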
For the most recent spam you can check this great archive: http://www.untroubled.org/spam/.
Unfortunately, its messages don’t have the correct extension (*.eml) and their line endings are incorrect.
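Both problems can be repaired in bulk before training. The following is a minimal sketch using only the .NET base class library. It rewrites bare LF or bare CR line endings as \r\n and writes each file to a separate target folder with an added .eml extension. `EmlConverter` and its methods are hypothetical names, not part of Mail.dll.

```csharp
using System;
using System.IO;

// Hypothetical converter for archive files (not a Mail.dll API).
public static class EmlConverter
{
    // Rewrites bare LF and bare CR as CRLF; existing CRLF pairs are kept.
    public static byte[] NormalizeToCrLf(byte[] data)
    {
        using (var output = new MemoryStream(data.Length + 64))
        {
            for (int i = 0; i < data.Length; i++)
            {
                byte b = data[i];
                if (b == 0x0D)
                {
                    output.WriteByte(0x0D);
                    output.WriteByte(0x0A);
                    // Skip the LF of an existing CRLF pair.
                    if (i + 1 < data.Length && data[i + 1] == 0x0A) i++;
                }
                else if (b == 0x0A)
                {
                    output.WriteByte(0x0D);
                    output.WriteByte(0x0A);
                }
                else
                {
                    output.WriteByte(b);
                }
            }
            return output.ToArray();
        }
    }

    // Converts every file in sourceFolder and writes it to targetFolder
    // as "<original file name>.eml". Returns the number of files written.
    public static int ConvertFolder(string sourceFolder, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);
        int converted = 0;
        foreach (string path in Directory.GetFiles(sourceFolder))
        {
            byte[] normalized = NormalizeToCrLf(File.ReadAllBytes(path));
            string target = Path.Combine(targetFolder, Path.GetFileName(path) + ".eml");
            File.WriteAllBytes(target, normalized);
            converted++;
        }
        return converted;
    }
}
```

Use a target folder different from the source folder, so converted files are not picked up again on a second run.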
You can download a spam archive containing 7874 spam messages from Oct 2011 here:
December 30th, 2015 at 07:27
Thanks. Can I download a complete folder?
January 2nd, 2016 at 16:43
Not sure what you mean by downloading a complete folder.
You can download all emails from a single folder:
May 6th, 2016 at 02:25
You provide a spam archive sample. Can you provide a ham archive sample of the same size or a “mbayes” file to use as a basis to train on top of?
Great software suite you guys have created!
May 6th, 2016 at 07:49
> Can you provide a ham archive[…]?
Not really. To achieve the best results you should use your own ham and spam messages.
Fortunately, spam messages are similar in many countries. Ham messages are not.
They should be in your language and contain the words your business uses.
For example, ham messages would be very different for a local pharmacy chain and for an international IT company.
> Great software suite you guys have created!
May 9th, 2016 at 08:02
I am creating a web mail client which can be used in many countries in many languages by many people from many companies who would receive mail from almost anywhere. I was wanting to automatically flag any suspected spam and automatically move it to their junk folder and from there the user can either delete it or mark it as not spam. I really love this software suite but I was really wanting a sort of plug-and-play spam filter that could be tweaked.
I was planning to have a base dictionary which would be duplicated for each user and when a user marked something as either spam (when something that should have in their opinion been marked as spam was not) or marked something as not spam (when something that was marked spam that in their opinion should not have been) their personal spam/ham dictionary is updated so it gets to learn what personally to them is considered spam.
I love that the spam archive is provided but I read in the documentation that the spam and ham folders should contain an equal number of emails. I don’t have the 7000 odd ham emails to match the spam archive. And now I realise that I would need spam and ham emails for each language.
Can you recommend a spam filter which provides a spam and ham dictionary to start with?
Sorry for all the questions. I am writing a CMS and I wanted to integrate an email client into it. I was going to use the inbuilt IMAP library, however I was wanting to include a spam filter into it and came across your library which has IMAP, POP3, SMTP and the SPAM Filter which would cut down quite a bit of development time as your library makes the job of connecting to mail server and retrieving and sending emails very easy. I have a lot of experience with web development but little with Bayesian filtering and other spam capturing techniques.
Do you possibly have an already existing “mbayes” english dictionary that you have used in your testing that you wouldn’t mind sharing to get me started?
May 9th, 2016 at 17:17
You can use far fewer spam messages for the initial learning process.
Thus you’ll need fewer ham messages.
There is no such thing as generic ham messages.
As I said before: ham messages are very different for different people.
I can’t provide you my emails for learning – they are confidential.
I don’t know of any publicly available archive of such messages.
You simply need to use your own messages.
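The advice above (use fewer spam messages so that they match the ham you actually have) can be sketched as a small helper that copies an equally sized random sample from each source folder into the learn folders. It uses only the .NET base class library. `BalancedSample` and the folder layout it creates (`spam` and `ham` under a learn root) are assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Mail.dll).
public static class BalancedSample
{
    // Copies the same number of *.eml files from each source folder into
    // learnRoot\spam and learnRoot\ham. Returns the number copied per class.
    public static int Copy(string spamSource, string hamSource, string learnRoot, int seed = 0)
    {
        string[] spam = Directory.GetFiles(spamSource, "*.eml");
        string[] ham = Directory.GetFiles(hamSource, "*.eml");

        // The smaller set decides how many messages each class gets.
        int count = Math.Min(spam.Length, ham.Length);

        var random = new Random(seed);
        CopySample(spam, count, Path.Combine(learnRoot, "spam"), random);
        CopySample(ham, count, Path.Combine(learnRoot, "ham"), random);
        return count;
    }

    // Copies 'count' randomly chosen files into the target folder.
    private static void CopySample(string[] files, int count, string target, Random random)
    {
        Directory.CreateDirectory(target);
        foreach (string path in files.OrderBy(_ => random.Next()).Take(count))
        {
            File.Copy(path, Path.Combine(target, Path.GetFileName(path)), true);
        }
    }
}
```

For example, with 7874 spam messages and 150 ham messages, each learn folder would receive 150 files.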