Weighting schemes for audio-visual fusion in speech recognition

H. Glotin, D. Vergyr, C. Neti, G. Potamianos, J. Luettin
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221)  
In this work we demonstrate an improvement in the state-of-theart large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information, in addition to the traditional audio one. We take a decision fusion approach for the audio-visual information, where the single-modality (audio-and visual-only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate
more » ... bination weights for each of the modalities. Two different techniques are described: The first uses an automatically extracted estimate of the audio stream reliability in order to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach where weights on pre-defined model classes are optimized to minimize WER (clean audio only results).
doi:10.1109/icassp.2001.940795 dblp:conf/icassp/GlotinVNPL01 fatcat:phollsva3zftnmpkpg2wphlsl4