Latent Dirichlet learning for document summarization

Ying-Lang Chang, Jen-Tzung Chien
2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Automatic summarization extracts representative content or sentences from a large corpus of documents. This paper presents a new hierarchical representation of words, sentences and documents in a corpus, and infers Dirichlet distributions for latent topics at the word level and latent themes at the sentence level. Sentence-based latent Dirichlet allocation (SLDA) is accordingly established for document summarization. Different from the vector space ... n, SLDA is built to fit the fine structure of text documents and is specifically designed for sentence selection. SLDA acts as a sentence mixture model with a mixture of Dirichlet themes, which are used to generate the latent topics in observed words. The theme model is inherently suited to distinguishing sentences in a summarization system. In the experiments, the proposed SLDA outperforms other methods for document summarization in terms of precision, recall and F-measure.
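The abstract reports results in terms of precision, recall and F-measure. As a minimal sketch of how these metrics are commonly computed for extractive summarization, the snippet below compares a set of selected sentence indices against a reference set. The function name `sentence_selection_scores` and the example indices are hypothetical, and the balanced F-measure (harmonic mean of precision and recall) is an assumption; the record does not state the paper's exact evaluation protocol.

```python
def sentence_selection_scores(selected, reference):
    """Return (precision, recall, F-measure) for two collections of sentence indices.

    Assumption: a selected sentence counts as correct when its index also
    appears in the reference summary; F-measure is the balanced harmonic mean.
    """
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)

    # Guard against empty sets to avoid division by zero.
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0

    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical example: the system picks sentences 0, 2, 5 and 7,
# while the reference summary contains sentences 0, 2 and 3.
precision, recall, f_measure = sentence_selection_scores([0, 2, 5, 7], [0, 2, 3])
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

Here two of the four selected sentences are in the reference, so precision is 2/4 and recall is 2/3; the F-measure balances the two.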
doi:10.1109/icassp.2009.4959927 dblp:conf/icassp/ChangC09 fatcat:i7iflhvfonbrrcwof5sbj7d6wq