Latent Dirichlet allocation in web spam filtering

István Bíró, Jácint Szabó, András A. Benczúr
2008 Proceedings of the 4th international workshop on Adversarial information retrieval on the web - AIRWeb '08  
Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to web spam classification. We create a bag-of-words document for every Web site and run LDA separately on the corpus of sites labeled as spam and on the corpus of sites labeled as non-spam. In this way, collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus and reach a relative improvement of 11% in F-measure by a logistic-regression-based combination with strong link and content baseline classifiers.
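The test-phase decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The topic-word probabilities are made-up toy values standing in for topics that LDA would learn in the training phase. The function names (`infer_topic_mixture`, `spam_score`, `is_spam`) and the threshold of 0.5 are hypothetical. Inference over the fixed topics uses a simple EM fold-in as a stand-in, since the abstract does not state which inference procedure is used.

```python
from collections import Counter

# Toy topic-word distributions (made-up numbers, NOT from the paper).
# In the described method these would come from running LDA separately
# on the spam-labeled and the non-spam-labeled training corpora.
SPAM_TOPICS = [
    {"cheap": 0.4, "pills": 0.4, "buy": 0.15, "news": 0.025, "weather": 0.025},
]
HAM_TOPICS = [
    {"cheap": 0.025, "pills": 0.025, "buy": 0.05, "news": 0.5, "weather": 0.4},
]


def infer_topic_mixture(words, topics, iters=50, floor=1e-9):
    """Estimate a document's mixture over fixed topics with EM fold-in.

    This is an assumed, simplified inference step; words missing from a
    topic get a tiny floor probability so the weights never sum to zero.
    """
    counts = Counter(words)
    k = len(topics)
    theta = [1.0 / k] * k
    total = sum(counts.values())
    if total == 0:
        return theta  # empty document: keep the uniform mixture
    for _ in range(iters):
        new_theta = [0.0] * k
        for word, count in counts.items():
            # E-step: responsibility of each topic for this word.
            weights = [theta[j] * topics[j].get(word, floor) for j in range(k)]
            norm = sum(weights)
            for j in range(k):
                new_theta[j] += count * weights[j] / norm
        # M-step: renormalize expected counts into a mixture.
        theta = [value / total for value in new_theta]
    return theta


def spam_score(words, spam_topics=SPAM_TOPICS, ham_topics=HAM_TOPICS):
    """Total probability mass the document assigns to spam topics."""
    topics = spam_topics + ham_topics  # union of the two topic collections
    theta = infer_topic_mixture(words, topics)
    return sum(theta[: len(spam_topics)])


def is_spam(words, threshold=0.5):
    """Deem a site spam if its total spam topic probability exceeds the threshold."""
    return spam_score(words) > threshold
```

With these toy topics, a bag of words such as `["cheap", "pills", "buy", "cheap"]` puts most of its mass on the spam topic, while `["news", "weather", "news"]` does not. In practice the threshold would be tuned on held-out data, and the abstract reports that the score is combined with link and content baseline classifiers through logistic regression rather than used alone.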
doi:10.1145/1451983.1451991 dblp:conf/airweb/BiroSB08 fatcat:jxcgpgywxveuhl2b6tmea7bx3m