Multi-pose lipreading and audio-visual speech recognition

Virginia Estellers, Jean-Philippe Thiran
2012 EURASIP Journal on Advances in Signal Processing  
In this paper we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of a changing pose of the speaker, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose relative to the camera. To handle these situations, we introduce a pose normalization block in a standard system and generate virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition and relies on linear regression to find an approximate mapping between images from different poses. We integrate the proposed pose normalization block at different stages of the speech recognition system and quantify the loss of performance related to pose changes and pose normalization techniques. In audio-visual experiments we also analyse the integration of the audio and visual streams. We show that an audio-visual system should account for non-frontal poses and normalization techniques in the weight assigned to the visual stream in the audio-visual classifier.

Audio-visual automatic speech recognition (AV-ASR) combines the audio and visual modalities of speech to improve the performance of audio-only ASR, especially in the presence of noise [1, 2]. In these situations we cannot trust the corrupted audio signal and must rely on the visual modality of speech to guide recognition. The major challenges that AV-ASR faces are, therefore, the definition of reliable visual features for speech recognition and the integration of the audio and visual cues when taking decisions about the speech classes. A general framework for AV-ASR [3] has been developed over the last years, but for practical deployment the systems still lack robustness against non-ideal working conditions. Research has particularly neglected the variability of the visual modality in real scenarios, i.e., non-uniform lighting and non-frontal poses caused by natural movements of the speaker. The first studies on genuine AV-ASR applications with realistic working conditions [4, 5] directly applied the systems developed for ideal visual conditions, obtaining poor performance and failing to exploit the visual modality in the multi-modal system. These works pointed out the need for new visual feature extraction methods robust to illumination and pose changes.
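The linear-regression mapping described above can be sketched as a regularized least-squares fit between paired feature vectors from two poses. This is a minimal illustration, not the paper's exact formulation: the function names, the ridge regularizer, and the assumption of vectorized, paired training images are ours.

```python
import numpy as np

def fit_pose_mapping(X_nonfrontal, X_frontal, reg=1e-3):
    """Estimate a linear map W such that X_frontal ~ X_nonfrontal @ W.

    X_nonfrontal, X_frontal: (n_samples, n_features) arrays of paired,
    vectorized mouth images captured simultaneously from the two poses.
    reg: hypothetical ridge term added for numerical stability.
    """
    d = X_nonfrontal.shape[1]
    # Regularized normal equations: (X'X + reg*I) W = X'Y
    A = X_nonfrontal.T @ X_nonfrontal + reg * np.eye(d)
    B = X_nonfrontal.T @ X_frontal
    return np.linalg.solve(A, B)

def normalize_pose(x_nonfrontal, W):
    """Generate a virtual frontal feature vector from a non-frontal one."""
    return x_nonfrontal @ W
```

At recognition time, `normalize_pose` would be applied to each non-frontal observation before it reaches the frontal visual models, which stay unchanged.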
In lipreading systems, the variations of the mouth's appearance caused by different poses are more significant than those caused by different speech classes and, therefore, recognition degrades dramatically when non-frontal poses are matched against frontal visual models. It is then necessary to develop an effective framework for pose-invariant lipreading. In particular, we are interested in pose-invariant methods which can easily be incorporated into the AV-ASR systems developed so far for ideal frontal conditions. In fact, the same problem exists in the face recognition task, and it is natural to apply the methods adopted in that field to the lipreading problem. We thus propose to introduce a pose normalization step in a system designed for frontal views, that is, we generate virtual frontal views from the non-frontal images and rely on the existing frontal visual models to recognize speech. The pose normalization block also has an effect on the fusion strategy, where the weight given to the visual stream should reflect its reliability. We can expect the virtual frontal features generated by the pose normalizer from lateral views to be less reliable than features extracted directly from frontal images and, therefore, the weight assigned to the pose-normalized visual stream in the audio-visual classifier should account for this. Previous work on this topic is limited to Lucey et al. [6-8], who projected the visual speech features of complete profile images to a frontal viewpoint with a linear transform. We introduce other projection techniques applied in face recognition to the lipreading task and justify their use on the different feature spaces involved in the lipreading system: the images themselves, a smooth and compact representation of the images in the frequency domain, or the final features used in the classifier. The effectiveness of the different
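The reliability-dependent stream weighting discussed above is commonly realized as a weighted combination of per-class log-likelihoods from the two streams. The sketch below assumes this standard formulation; the function names and the single global weight `lambda_v` are illustrative, not the paper's exact scheme.

```python
import numpy as np

def av_log_likelihood(ll_audio, ll_visual, lambda_v):
    """Weighted audio-visual combination of per-class log-likelihoods.

    ll_audio, ll_visual: arrays of per-class log-likelihoods from the
    audio and visual classifiers. lambda_v in [0, 1] is the visual
    stream weight; a pose-normalized lateral stream would typically
    receive a lower lambda_v than a true frontal stream.
    """
    return (1.0 - lambda_v) * np.asarray(ll_audio) + lambda_v * np.asarray(ll_visual)

def classify(ll_audio, ll_visual, lambda_v):
    """Pick the speech class maximizing the combined log-likelihood."""
    return int(np.argmax(av_log_likelihood(ll_audio, ll_visual, lambda_v)))
```

Setting `lambda_v = 0` reduces the system to audio-only recognition, while `lambda_v = 1` relies entirely on the visual stream; intermediate values trade off the two according to their reliability.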
doi:10.1186/1687-6180-2012-51