Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity [chapter]

Hosein Azarbonyad, Mostafa Dehghani, Tom Kenter, Maarten Marx, Jaap Kamps, Maarten de Rijke
2017 Lecture Notes in Computer Science  
Azarbonyad, H.; Dehghani, M.; Kenter, T.M.; Marx, M.J.; Kamps, J.; de Rijke, M.
Abstract. A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models for measuring diversity of documents is suboptimal due to generality and impurity. General topics only include common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to get assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments.
doi:10.1007/978-3-319-56608-5_6 fatcat:xu7xbtl25nedbk55bdcc6kvroa