Accent- and speaker-specific polyphone decision trees for non-native speech recognition

Dominic Telaar, Mark C. Fuhs
2013 Interspeech 2013   unpublished
Acoustic models in state-of-the-art LVCSR systems are typically trained on data from thousands of speakers and then adapted to a speaker using, e.g., various combinations of CM-LLR, MLLR and MAP. This adaptation step is particularly important for speakers with accents that are not well represented in the training set. The present study explores how to improve performance on South-Asian-accented speakers (SoA-accented) with the availability of thousands of US-accented, hundreds of SoA-accented, and tens of hours of speaker-specific training data. We employ a decision tree similarity measure to analyze how varying co-articulations across accents and people manifest themselves in the decision tree. Modeling these variations in addition to adapting the GMMs of an existing baseline system to a speaker improved WER for small systems (1k GMMs), but improvement for systems with larger trees (2k, 3k GMMs) was modest. Overall, GMM adaptation/retraining yields significant performance benefits, and training a SoA-accent-specific system is particularly worthwhile when lacking speaker adaptation data.
doi:10.21437/interspeech.2013-733 fatcat:nezgqoqcabaclloi6kt5oqaomy