Local spatiotemporal descriptors for visual recognition of spoken phrases

Guoying Zhao, Matti Pietikäinen, Abdenour Hadid
Proceedings of the International Workshop on Human-Centered Multimedia (HCM '07), 2007
Visual speech information plays an important role in speech recognition under noisy conditions or for listeners with hearing impairment. In this paper, we propose local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input. Positions of the eyes, determined by a robust face and eye detector, are used for localizing the mouth regions in face images. Spatiotemporal local binary patterns extracted from these regions are used for describing phrase sequences. In our experiments with 817 sequences from ten phrases and 20 speakers, promising accuracies of 62% and 70% were obtained in speaker-independent and speaker-dependent recognition, respectively. In comparison with other methods on the Tulips1 audio-visual database, our method achieves 92.7% accuracy, clearly outperforming the others. Advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of moving lips is needed.
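The descriptor referenced in the abstract is a spatiotemporal extension of local binary patterns computed over a mouth-region video volume. Below is a minimal sketch of that idea (LBP histograms concatenated from the XY, XT, and YT planes), assuming only NumPy; the 8-neighbour, radius-1 sampling, the function names, and the per-plane histogram normalization are illustrative choices, not the exact configuration used in the paper.

```python
import numpy as np

def lbp_image(img):
    """Basic 8-neighbour, radius-1 LBP codes for a 2-D array (interior pixels only)."""
    c = img[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        # Compare each neighbour against the centre pixel and set one bit per neighbour.
        neigh = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= ((neigh >= c).astype(np.uint8) << bit)
    return codes

def lbp_top_histogram(volume):
    """Concatenate LBP histograms from the three orthogonal planes of a T x H x W volume."""
    T, H, W = volume.shape
    plane_sets = [
        [volume[t] for t in range(1, T - 1)],        # XY planes: appearance
        [volume[:, y, :] for y in range(1, H - 1)],  # XT planes: horizontal motion
        [volume[:, :, x] for x in range(1, W - 1)],  # YT planes: vertical motion
    ]
    hists = []
    for slices in plane_sets:
        codes = np.concatenate([lbp_image(s).ravel() for s in slices])
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        hists.append(hist / max(hist.sum(), 1))      # normalize each plane's histogram
    return np.concatenate(hists)                     # 3 x 256 feature vector

# Usage: a synthetic mouth-region sequence of 30 frames, 40 x 60 pixels (hypothetical sizes).
volume = np.random.randint(0, 256, (30, 40, 60), dtype=np.uint8)
feature = lbp_top_histogram(volume)
print(feature.shape)  # (768,)
```

In the paper's setting, such a histogram would be computed per cropped mouth-region sequence and fed to a classifier for phrase recognition; block-wise partitioning of the mouth region (as is common with LBP-based descriptors) would add local spatial information.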
doi:10.1145/1290128.1290138 dblp:conf/mm/ZhaoPH07