Multimodal speech processing using asynchronous Hidden Markov Models

Samy Bengio
2004 Information Fusion  
This paper argues that, for some multimodal tasks involving more than one stream of data representing the same sequence of events, it can sometimes be beneficial to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performance under various noise conditions was obtained in both cases.
doi:10.1016/j.inffus.2003.04.001 fatcat:tzddmmp5s5d5pbmmghqy2xrzey