Latent Dirichlet allocation in web spam filtering
2008
Proceedings of the 4th international workshop on Adversarial information retrieval on the web - AIRWeb '08
Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to web spam classification. We create a bag-of-words document for every Web site and run LDA separately on the corpus of sites labeled as spam and on the corpus of sites labeled as non-spam. In this way, collections of spam and non-spam topics are created in the training phase. In the test
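The training phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a plain collapsed Gibbs sampler for LDA, and the function names (`train_lda`, `train_multi_corpus`), the hyperparameters (`alpha`, `beta`, `iters`), the topic counts, and the toy sites are all hypothetical choices made for illustration. Only the training phase is shown, since the abstract is cut off before the test phase is described.

```python
import random
from collections import Counter


def train_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed, generic sampler).

    docs is a list of bag-of-words documents (lists of tokens).
    Returns one word distribution (dict word -> probability) per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]         # count of topic k in doc d
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # total tokens in topic k
    assign = []                                          # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-topic word distributions.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]


def train_multi_corpus(spam_docs, nonspam_docs, k_spam, k_nonspam):
    """Run LDA separately on each labeled corpus and keep both topic collections."""
    spam_topics = train_lda(spam_docs, k_spam, seed=1)
    nonspam_topics = train_lda(nonspam_docs, k_nonspam, seed=2)
    return ([("spam", t) for t in spam_topics]
            + [("non-spam", t) for t in nonspam_topics])


# Toy bag-of-words "sites" (made-up data, one token list per site).
spam_sites = [["cheap", "pills", "casino", "cheap"],
              ["casino", "bonus", "pills", "win"]]
nonspam_sites = [["university", "research", "paper"],
                 ["research", "library", "paper", "course"]]

topics = train_multi_corpus(spam_sites, nonspam_sites, k_spam=2, k_nonspam=2)
for label, dist in topics:
    top_words = sorted(dist, key=dist.get, reverse=True)[:3]
    print(label, top_words)
```

Because each corpus is fitted on its own, every topic keeps the label of the corpus it came from, which is what yields the separate collections of spam and non-spam topics mentioned in the abstract.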
doi:10.1145/1451983.1451991
dblp:conf/airweb/BiroSB08
fatcat:jxcgpgywxveuhl2b6tmea7bx3m