Joint-character-POC N-gram language modeling for Chinese speech recognition

Bin Wang, Zhijian Ou, Jian Li, Akinori Kawamura
2014 The 9th International Symposium on Chinese Spoken Language Processing  
The state-of-the-art language models (LMs) for Chinese speech recognition are word n-gram models. However, in Chinese, characters themselves carry morphological meaning and words are not consistently defined. There has been recent interest in building character n-gram LMs and combining them with word n-gram LMs. In this paper, in order to exploit both character-level and word-level constraints, we propose the joint n-gram LM, an n-gram model over joint states, where each joint state is a pair of a character and its position-of-character (POC) tag. We point out the pitfall in naively solving the smoothing and scoring problems for joint n-gram models, and provide corrected solutions. For experimental comparison, different LMs (including word 4-grams, character 6-grams and joint 6-grams) are tested for speech recognition, using a training corpus of 1.9 billion characters. The joint n-gram LM achieves performance improvements, especially in recognizing utterances containing OOV words.
doi:10.1109/iscslp.2014.6936588 dblp:conf/iscslp/WangOLK14 fatcat:oc2kh4h4ebasbnsn36qy3artqi
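
As an illustration of the joint state described in the abstract, the following minimal Python sketch turns a word-segmented sentence into a sequence of (character, POC tag) pairs and then extracts n-grams over that sequence. The four-tag scheme used here (B = word-initial, M = word-internal, E = word-final, S = single-character word) is an assumption for illustration; the abstract does not specify the tag set. The function names `to_joint_states` and `joint_ngrams` are hypothetical, and the sketch does not cover the smoothing and scoring corrections mentioned in the abstract.

```python
from typing import List, Tuple

JointState = Tuple[str, str]  # (character, POC tag)


def to_joint_states(words: List[str]) -> List[JointState]:
    """Map a word-segmented sentence to (character, POC tag) pairs.

    Assumed tag set (not taken from the paper):
    B = first character of a multi-character word,
    M = middle character, E = last character,
    S = single-character word.
    """
    states: List[JointState] = []
    for word in words:
        if len(word) == 1:
            states.append((word, "S"))
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == last:
                tag = "E"
            else:
                tag = "M"
            states.append((ch, tag))
    return states


def joint_ngrams(states: List[JointState], n: int) -> List[Tuple[JointState, ...]]:
    """Return all contiguous n-grams over the joint-state sequence."""
    return [tuple(states[i:i + n]) for i in range(len(states) - n + 1)]


if __name__ == "__main__":
    sentence = ["语言", "模型", "的"]  # a hypothetical segmented input
    states = to_joint_states(sentence)
    print(states)
    print(joint_ngrams(states, 2))
```

Because each unit is a character, such a model can still assign a tag sequence to a word that never appeared in the training vocabulary, while the POC tags retain word-boundary information.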