Semantic-based topic detection using Markov decision processes

Qian Chen, Xin Guo, Hexiang Bai
2017 Neurocomputing  
In the field of text mining, topic modeling and detection are fundamental problems in public opinion monitoring, information retrieval, social media analysis, and other activities. Document clustering has been used for topic detection at the document level. Probabilistic topic models treat a topic as a distribution over the term space, but this approach overlooks the semantic information hidden in the topic. Thus, representing topics without loss of semantic information as well as detecting the optimal topic is a challenging task. In this study, we built topics using a network called a topic graph, where the topics were represented as concept nodes and their semantic relationships were obtained using WordNet. Next, we extracted each topic from the topic graph to obtain a corpus by community discovery. In order to find the optimal topic to describe the related corpus, we defined a topic pruning process, which was used for topic detection. We then performed topic pruning using Markov decision processes, which transformed topic detection into a dynamic programming problem. Experimental results produced using a newsgroup corpus and a science literature corpus showed that our method obtained almost the same precision and recall as baseline models such as latent Dirichlet allocation and KeyGraph. In addition, our method performed better than the probabilistic topic model in terms of its explanatory power, and its runtime was lower than that of all three baseline methods, while it can also be optimized to adapt to the corpus better by using topic pruning.
... including text classification and clustering, information retrieval, and document summarization [1]. Topic detection plays an important role in information retrieval and data mining, and it is an effective tool for organizing and managing text data such as newswire archives and research literature. Unlike other existing applications in text mining and information retrieval, topic detection is an entirely unsupervised learning task without any topic classes or structure labels. In general, a topic is represented as a set of related keywords, and thus important descriptions can be given to topics or events. Many text clustering algorithms that typically compute similarities have been developed for topic detection, such as single pass incremental clustering algorithms [2] and incremental clustering algorithms [10].
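The abstract states that topic pruning is cast as a Markov decision process and solved by dynamic programming, but this record does not give the states, actions, or rewards the authors used. The following is therefore only a generic value-iteration sketch on a toy pruning chain. The state names (`g3`, `g2`, `g1`, `done`), the `prune`/`stop` actions, the reward values, and the discount factor are all hypothetical and chosen purely for illustration.

```python
# Generic value iteration on a toy "prune or stop" MDP.
# Everything below (states, actions, rewards, gamma) is a hypothetical
# illustration, not the formulation used in the paper.

def action_value(outcomes, values, gamma):
    """Expected return of one action: sum of p * (reward + gamma * V(next))."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)

def value_iteration(mdp, gamma=0.9, tol=1e-9, max_iter=10_000):
    """mdp[state][action] is a list of (probability, next_state, reward).

    A state with no actions is terminal. Returns (values, policy).
    """
    values = {s: 0.0 for s in mdp}
    for _ in range(max_iter):
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(action_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # values stopped changing: converged
            break
    policy = {
        s: max(actions, key=lambda a: action_value(actions[a], values, gamma))
        for s, actions in mdp.items()
        if actions
    }
    return values, policy

# Toy chain: "gK" is a topic graph with K candidate nodes left.
# Pruning a noisy node earns a positive reward; pruning the last
# (core) node is penalised, so the best policy should stop at g1.
TOY_MDP = {
    "g3": {"prune": [(1.0, "g2", 1.0)], "stop": [(1.0, "done", 0.0)]},
    "g2": {"prune": [(1.0, "g1", 0.5)], "stop": [(1.0, "done", 0.0)]},
    "g1": {"prune": [(1.0, "done", -2.0)], "stop": [(1.0, "done", 0.0)]},
    "done": {},
}

values, policy = value_iteration(TOY_MDP)
print(policy)
```

On this toy chain the computed policy prunes twice and then stops, which mirrors the idea of trimming a topic graph until further removal would hurt the topic description. A real formulation would need states and rewards derived from the topic graph and corpus.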
Since the latent Dirichlet allocation (LDA) method was proposed by Blei in 2003 [4], the probabilistic topic model (pTM) has attracted much attention in the fields of information retrieval, text mining, and other areas. Essentially, pTM is a type of probabilistic model used for topic modeling, including LSA, pLSA, LDA, and various extended versions of pTM, which treat a topic as a distribution over the term space. Despite the success of pTM, it has several drawbacks, as follows. (1) The inference algorithm used in the model can be too complex and much time is required to generate the topic word http://dx.
doi:10.1016/j.neucom.2017.02.020 fatcat:yvze46zl3jbyppnbvz4pkyvcqe