Robustness Over Time-Varying Channels in DNN-HMM ASR Based Human-Robot Interaction

José Novoa, Jorge Wuth, Juan Pablo Escudero, Josué Fredes, Rodrigo Mahu, Richard M. Stern, Nestor Becerra Yoma
2017 Interspeech 2017   unpublished
This paper addresses the problem of time-varying channels in speech-recognition-based human-robot interaction using Locally-Normalized Filter-Bank (LNFB) features and training strategies that compensate for microphone response and room acoustics. Testing utterances were generated by re-recording the Aurora-4 testing database using a PR2 mobile robot equipped with a Kinect audio interface while the robot performed head rotations and movements toward and away from a fixed source. Three training strategies were evaluated: Clean, 1-IR, and 33-IR. With Clean training, the DNN-HMM system was trained using the Aurora-4 clean training database. With 1-IR training, the same training data were convolved with an impulse response estimated at one meter from the source with no rotation of the robot head. With 33-IR training, the Aurora-4 training data were convolved with impulse responses estimated at one, two, and three meters from the source and at 11 angular positions of the robot head (3 × 11 = 33 impulse responses). The 33-IR training method reduced WER by more than 50% compared with Clean training for both LNFB and conventional Mel filterbank (MelFB) features. Moreover, with 33-IR training, LNFB features yielded a WER 23% lower than MelFB features. Combining 33-IR training with LNFB features reduced WER by 64% compared with Clean training and MelFB features.
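The 33-IR idea of convolving clean training speech with impulse responses taken at three distances and 11 head angles can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the impulse responses are synthetic decaying noise standing in for measured ones, the 16 kHz sample rate is a placeholder, and the names `make_ir_bank` and `reverberate` are hypothetical.

```python
import numpy as np


def make_ir_bank(distances_m=(1, 2, 3), n_angles=11, ir_len=2048, seed=0):
    """Build a hypothetical bank of impulse responses keyed by (distance, angle index).

    Assumption: real IRs would be measured in the room; here each one is
    exponentially decaying noise with a unit direct-path tap.
    """
    rng = np.random.default_rng(seed)
    bank = {}
    for d in distances_m:
        for a in range(n_angles):
            decay = np.exp(-np.arange(ir_len) / (200.0 * d))
            ir = rng.standard_normal(ir_len) * decay
            ir[0] = 1.0  # direct path
            bank[(d, a)] = ir / np.max(np.abs(ir))
    return bank


def reverberate(speech, ir):
    """Convolve speech with an IR, trim to the input length, and match the input peak."""
    out = np.convolve(speech, ir)[: len(speech)]
    peak = np.max(np.abs(out))
    if peak == 0:
        return out
    return out / peak * np.max(np.abs(speech))


# 3 distances x 11 angles gives 33 impulse responses.
bank = make_ir_bank()

# Placeholder for one clean training utterance (assumed 1 s at 16 kHz).
rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)

# Pick one IR at random for this utterance, as a multi-condition set might.
keys = sorted(bank)
key = keys[rng.integers(len(keys))]
wet = reverberate(speech, bank[key])
```

Under this sketch, 1-IR training would correspond to always using a single fixed entry of the bank (one meter, no head rotation), while Clean training skips the convolution entirely. How the original work assigned impulse responses to utterances is not stated in the abstract, so the random selection here is only one plausible choice.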
doi:10.21437/interspeech.2017-1308 fatcat:rwmfq3lhwfghfaxeb5yoflnpxm