Semi-supervised learning of language model using unsupervised topic model

Shuanhu Bai, Chien-Lin Huang, Bin Ma, Haizhou Li
2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We present a semi-supervised learning (SSL) method for building domain-specific language models (LMs) from general-domain data using probabilistic latent semantic analysis (PLSA). The proposed technique first performs topic decomposition (TD) on the combined dataset of domain-specific and general-domain data. It then derives the latent topic distribution of the domain of interest and computes domain-specific word n-gram counts with a PLSA-style mixture model. Finally, it uses traditional n-gram modeling to construct domain-specific LMs from the domain-specific word n-gram counts. Experimental results show that this technique outperforms both state-of-the-art relative-entropy text selection and traditional supervised training methods.
Index Terms: semi-supervised learning, language model, topic model.
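The pipeline described in the abstract (topic decomposition, a target-domain topic distribution, topic-weighted n-gram counts, and standard n-gram estimation) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the dot-product weighting, function names, and toy data are assumptions, and the document-topic posteriors are assumed to come from an off-the-shelf PLSA/LDA tool.

```python
from collections import Counter
from typing import List, Sequence


def domain_weight(doc_topics: Sequence[float],
                  domain_topics: Sequence[float]) -> float:
    """Score a document by the overlap between its topic mixture and the
    target-domain topic distribution (a simple dot product here)."""
    return sum(p * q for p, q in zip(doc_topics, domain_topics))


def weighted_ngram_counts(docs: List[List[str]],
                          doc_topic_posteriors: List[Sequence[float]],
                          domain_topics: Sequence[float],
                          n: int = 3) -> Counter:
    """Accumulate fractional n-gram counts from general-domain documents,
    weighting each document by its affinity to the target domain's topics.
    The resulting counts can then be fed to a standard n-gram LM toolkit."""
    counts: Counter = Counter()
    for tokens, posterior in zip(docs, doc_topic_posteriors):
        w = domain_weight(posterior, domain_topics)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += w
    return counts


# Toy example: two documents with hypothetical PLSA posteriors over 3 topics;
# the target domain is assumed to concentrate on topic 0.
docs = [["stock", "prices", "rose"], ["the", "cat", "sat", "down"]]
posteriors = [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]]
domain = [0.9, 0.05, 0.05]
print(weighted_ngram_counts(docs, posteriors, domain, n=2))
```

Under these assumptions, n-grams from documents whose topic mixtures resemble the target domain contribute more to the counts, which is the intuition behind deriving domain-specific counts from general-domain text before standard LM estimation.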
doi:10.1109/icassp.2010.5494940 dblp:conf/icassp/BaiHML10 fatcat:nipc56dbdze2rdcus32fyhsrre