Ensemble speaker and speaking environment modeling approach with advanced online estimation process

Yu Tsao, Jinyu Li, Chin-Hui Lee
2009 2009 IEEE International Conference on Acoustics, Speech and Signal Processing  
Recently, we proposed an ensemble speaker and speaking environment modeling (ESSEM) framework to characterize speaker variability and speaking environments. In contrast to multi-style training, ESSEM uses single-style training to prepare multiple sets of environment-specific acoustic models. The ensemble of these acoustic models forms a prior structure of the environment that allows flexible prediction of unknown environments during testing. In this study, we present methods to further improve the precision of model characterization. We first study a weighted N-best information technique that makes better use of the N-best transcription hypotheses in unsupervised adaptation. Next, we introduce cohort selection and environment space adaptation techniques that improve the resolution and coverage of the prior structure online. By integrating the proposed methods, we further improve ESSEM performance over our previous study. On the Aurora-2 task, ESSEM achieves an average word error rate (WER) of 4.64%, corresponding to a 15.64% relative WER reduction over our best baseline result obtained with multi-condition training (5.50% to 4.64% WER).
Index Terms: noise robustness, ensemble speaker and speaking environment modeling, N-best transcription
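The relative WER reduction quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes the usual definition, (baseline - new) / baseline, expressed in percent; the function name `relative_wer_reduction` is illustrative and does not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Assumes the common definition: 100 * (baseline - new) / baseline.
    Both arguments are WER values in the same unit (here, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures taken from the abstract: 5.50% baseline WER, 4.64% ESSEM WER.
reduction = relative_wer_reduction(5.50, 4.64)
print(f"{reduction:.2f}%")  # prints "15.64%"
```

The absolute reduction is 0.86 percentage points; dividing by the 5.50% baseline gives the 15.64% relative figure reported above.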
doi:10.1109/icassp.2009.4960463 dblp:conf/icassp/TsaoLL09 fatcat:zi4s2psw4vbs7gd3m2ejdlv4dm