End-to-End Audiovisual Fusion with LSTMs

Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic
The 14th International Conference on Auditory-Visual Speech Processing (AVSP 2017)
Several end-to-end deep learning approaches have been recently presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and to perform classification of speech and nonlinguistic vocalisations. The model consists of multiple identical streams, one for each modality, which extract features directly from mouth regions and spectrograms. The temporal dynamics in each stream/modality are modelled by a BLSTM, and the fusion of multiple streams/modalities takes place via another BLSTM. An absolute improvement of 1.9% in the mean F1 of 4 nonlinguistic vocalisations over audio-only classification is reported on the AVIC database. At the same time, the proposed end-to-end audiovisual fusion system improves the state-of-the-art performance on the AVIC database, leading to a 9.7% absolute increase in the mean F1 measure. We also perform audiovisual speech recognition experiments on the OuluVS2 database using different views of the mouth, frontal to profile. The proposed audiovisual system significantly outperforms the audio-only model for all views considered when the acoustic noise is high.
doi:10.21437/avsp.2017-8 dblp:conf/avsp/PetridisWLP17
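The abstract describes a multi-stream layout: one stream per modality extracting features from mouth ROIs or spectrograms, a BLSTM modelling temporal dynamics within each stream, and a second BLSTM performing the fusion before classification. The following is a minimal PyTorch-style sketch of that layout only; the dense frame encoders, hidden sizes, aligned sequence lengths across modalities, and the last-time-step classifier are illustrative assumptions rather than the paper's actual configuration.

```python
# Minimal sketch of a two-stream BLSTM fusion model (illustrative, not the
# authors' exact architecture or hyperparameters).
import torch
import torch.nn as nn

class StreamBLSTM(nn.Module):
    """One modality stream: per-frame encoder followed by a BLSTM."""
    def __init__(self, input_dim, enc_dim=256, hidden_dim=256):
        super().__init__()
        # Stand-in frame encoder (the paper learns features from raw pixels
        # and spectrograms; a single dense layer is used here for brevity).
        self.encoder = nn.Sequential(nn.Linear(input_dim, enc_dim), nn.ReLU())
        self.blstm = nn.LSTM(enc_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, input_dim)
        h = self.encoder(x)
        out, _ = self.blstm(h)                 # (batch, time, 2 * hidden_dim)
        return out

class AVFusion(nn.Module):
    """Audio and visual streams fused by a second BLSTM, then classified."""
    def __init__(self, visual_dim, audio_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.visual = StreamBLSTM(visual_dim, hidden_dim=hidden_dim)
        self.audio = StreamBLSTM(audio_dim, hidden_dim=hidden_dim)
        # Fusion BLSTM over the concatenated per-stream outputs.
        self.fusion = nn.LSTM(4 * hidden_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, mouth_rois, spectrograms):
        # Assumes both modalities are already aligned to the same length.
        v = self.visual(mouth_rois)
        a = self.audio(spectrograms)
        fused, _ = self.fusion(torch.cat([v, a], dim=-1))
        return self.classifier(fused[:, -1])   # decision from last time step

# Hypothetical usage: flattened 64x64 mouth ROIs, 161-bin spectrogram frames,
# 30 aligned time steps, 4 target classes.
model = AVFusion(visual_dim=64 * 64, audio_dim=161, num_classes=4)
logits = model(torch.randn(2, 30, 64 * 64), torch.randn(2, 30, 161))
```

Keeping the streams identical in structure, as the abstract states, means the same stream module can be instantiated once per modality, with only the input dimensionality differing.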