Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-switching Speech Recognition [article]

Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng
2020 arXiv   pre-print
In this paper, we conduct data selection analysis in building an English-Mandarin code-switching (CS) speech recognition (CSSR) system, which is aimed for a real CSSR contest in China. The overall training sets have three subsets, i.e., a code-switching data set, an English (LibriSpeech) and a Mandarin data set respectively. The code-switching data are Mandarin dominated. First of all, it is found using the overall data yields worse results, and hence data selection study is necessary. Then to
more » ... xploit monolingual data, we find data matching is crucial. Mandarin data is closely matched with the Mandarin part in the code-switching data, while English data is not. However, Mandarin data only helps on those utterances that are significantly Mandarin-dominated. Besides, there is a balance point, over which more monolingual data will divert the CSSR system, degrading results. Finally, we analyze the effectiveness of combining monolingual data to train a CSSR system with the HMM-DNN hybrid framework. The CSSR system can perform within-utterance code-switch recognition, but it still has a margin with the one trained on code-switching data.
arXiv:2006.07094v2 fatcat:g5pql34bozdhxgaj4e76jphyp4