An evaluation of Naive Bayes variants in content-based learning for spam filtering

Alexander K. Seewald
2007 Intelligent Data Analysis  
We describe an in-depth analysis of the spam-filtering performance of a simple Naive Bayes learner and two current variants. A set of seven mailboxes comprising about 65,000 mails from seven different users, as well as a representative snapshot of 25,000 mails received over 18 weeks by a single user, was used for evaluation. Our main motivation was to test whether two variants of Naive Bayes learning, SpamAssassin and CRM114, were superior to simple Naive Bayes learning, represented by SpamBayes. Surprisingly, we found that the performance of these systems was remarkably similar and that the extended systems have significant weaknesses which are not apparent in the simpler Naive Bayes learner. The simpler Naive Bayes learner, SpamBayes, also offers the most stable performance, in that it deteriorates least over time. Overall, SpamBayes should be preferred over the more complex variants.

© 2005 Kluwer Academic Publishers. Printed in the Netherlands.

2 During the evaluation of the test version of BrightMail, around 700 megabytes of updates were received weekly. Around 7 megabytes of ham and spam email are received at our institute per user and week, so the break-even point, where the bandwidth for BrightMail equals the email bandwidth, would be reached at around 100 users (700 MB / 7 MB per user). Below 100 users a locally trained filter may be preferable.

3 See e.g.

4 Naive Bayes learning with the usual setting for text mining: splitting each document (mail) into words, and using each unique word as a feature in a word occurrence vector. The probabilities are estimated in the usual way, see Section 3.3.
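
The word-occurrence setup described above can be illustrated with a minimal sketch. It is not the SpamBayes, SpamAssassin, or CRM114 algorithm, and it does not reproduce the estimation in Section 3.3, which is not part of this excerpt. The add-one (Laplace) smoothing, the tokenizer regex, and the names `tokenize` and `NaiveBayes` are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(mail):
    """Return the set of unique lowercase words in a mail (occurrence, not count)."""
    return set(re.findall(r"[a-z0-9']+", mail.lower()))


class NaiveBayes:
    """Hypothetical Bernoulli-style Naive Bayes over word occurrence vectors."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.doc_count = {label: 0 for label in self.LABELS}
        # word_docs[label][w] = number of training mails of that label containing w
        self.word_docs = {label: Counter() for label in self.LABELS}
        self.vocab = set()

    def train(self, mail, label):
        words = tokenize(mail)
        self.doc_count[label] += 1
        self.word_docs[label].update(words)
        self.vocab |= words

    def log_score(self, mail, label):
        """Log of P(label) * prod_w P(w present or absent | label)."""
        words = tokenize(mail)
        n = self.doc_count[label]
        total = sum(self.doc_count.values())
        # Assumption: add-one smoothing for the prior and for each word.
        score = math.log((n + 1) / (total + len(self.LABELS)))
        for w in self.vocab:
            p_present = (self.word_docs[label][w] + 1) / (n + 2)
            score += math.log(p_present if w in words else 1.0 - p_present)
        return score

    def classify(self, mail):
        return max(self.LABELS, key=lambda label: self.log_score(mail, label))


# Toy training data, invented for this example only.
nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches now", "spam")
nb.train("meeting agenda for tomorrow", "ham")
nb.train("lunch tomorrow with the team", "ham")
```

Calling `nb.classify(text)` returns whichever label has the higher log score. Working in log space avoids numeric underflow when the vocabulary grows to the size of a real mailbox.
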
doi:10.3233/ida-2007-11505 fatcat:hqfr7tnfdrbnhe4ixruk3kp3vm