Deep complementary bottleneck features for visual speech recognition

Stavros Petridis, Maja Pantic
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep autoencoders. To the best of our knowledge, this is the first work that extracts DBNFs for visual speech recognition directly from pixels. We first train a deep autoencoder with a bottleneck layer in order to reduce the dimensionality of the image. Then the autoencoder's decoding layers are replaced by classification layers, which make the bottleneck features more discriminative. Discrete Cosine Transform (DCT) features are also appended in the bottleneck layer during training in order to make the bottleneck features complementary to the DCT features. Long Short-Term Memory (LSTM) networks are used to model the temporal dynamics, and performance is evaluated on the OuluVS and AVLetters databases. The extracted complementary DBNFs in combination with DCT features achieve the best performance, resulting in an absolute improvement of up to 5% over the DCT baseline.
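As a rough illustration of the pipeline the abstract describes, the sketch below computes baseline DCT features for a mouth region of interest (ROI), passes the raw pixels through a (here untrained, randomly initialized) bottleneck encoder, appends the DCT features to the bottleneck output, and feeds the joint vector to a classification head. The sizes (64×64 ROI, 50-dim bottleneck, 8×8 DCT block, 10 classes) and the tanh activations are illustrative assumptions, not the paper's exact configuration, and the per-frame outputs would in practice be fed to an LSTM for temporal modeling:

```python
import numpy as np

rng = np.random.default_rng(0)

def dct2(img):
    """Orthonormal 2-D DCT-II via separable 1-D transform matrices."""
    def dct_matrix(n):
        k = np.arange(n)[:, None]
        x = np.arange(n)[None, :]
        m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
        m[0] /= np.sqrt(2.0)  # DC row scaling for orthonormality
        return m
    return dct_matrix(img.shape[0]) @ img @ dct_matrix(img.shape[1]).T

def dct_features(roi, num=8):
    # Keep the low-frequency num x num coefficient block, flattened
    # (a common choice for the DCT baseline; the paper's selection may differ).
    return dct2(roi)[:num, :num].ravel()

# Hypothetical sizes: 64x64 mouth ROI, 50-dim bottleneck, 10 classes.
roi = rng.random((64, 64))
x = roi.ravel()                                      # 4096-dim pixel input
W1 = rng.standard_normal((4096, 1000)) * 0.01        # encoder layer (untrained)
W2 = rng.standard_normal((1000, 50)) * 0.01          # encoder -> bottleneck
W_cls = rng.standard_normal((50 + 64, 10)) * 0.01    # classification head

bottleneck = np.tanh(np.tanh(x @ W1) @ W2)           # the DBNF for this frame
joint = np.concatenate([bottleneck, dct_features(roi)])  # DBNF + appended DCT
logits = joint @ W_cls                               # per-frame class scores
print(bottleneck.shape, joint.shape, logits.shape)   # (50,) (114,) (10,)
```

In the actual method, the encoder weights come from autoencoder pre-training followed by discriminative fine-tuning once the decoder is replaced by classification layers; the random weights here only illustrate the data flow and dimensionalities.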
doi:10.1109/icassp.2016.7472088 dblp:conf/icassp/PetridisP16