Temporal Attention Convolutional Network for Speech Emotion Recognition with Latent Representation

Jiaxing Liu, Zhilei Liu, Longbiao Wang, Yuan Gao, Lili Guo, Jianwu Dang
Interspeech 2020
As a fundamental task in affective computing, speech emotion recognition (SER) has gained a lot of attention. Unlike common deep learning tasks, SER is restricted by the scarcity of emotional speech datasets. In this paper, the vector quantized variational autoencoder (VQ-VAE) was introduced and trained on massive unlabeled data in an unsupervised manner. Benefiting from its excellent invariant distribution encoding capability and discrete embedding space, the trained VQ-VAE could learn latent representations from labeled data. The extracted latent representations served as an additional data source, making training data more abundant. While addressing the data scarcity issue, sequence information modeling, which is considered useful for SER, was also taken into account. The proposed sequence model, the temporal attention convolutional network (TACN), is simple yet good at learning contextual information from limited data, a setting unfriendly to the complicated structures of recurrent neural network (RNN) based sequence models. To validate the effectiveness of the latent representation, t-distributed stochastic neighbor embedding (t-SNE) was introduced to analyze the visualizations. To verify the performance of the proposed TACN, quantitative classification results of all commonly used sequence models were provided. Our proposed model achieved state-of-the-art performance on IEMOCAP.
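The discrete embedding space the abstract refers to comes from the VQ-VAE's codebook lookup: each continuous encoder output is snapped to its nearest learned embedding vector. A minimal sketch of that quantization step, assuming a NumPy representation and a hypothetical `(K, D)` codebook (the paper's actual codebook size and training details are not given here):

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) learned discrete embedding vectors
    Returns the quantized vectors (N, D) and their codebook indices (N,).
    """
    # Squared Euclidean distance from every encoder vector to every entry.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)  # nearest-neighbour index per vector
    return codebook[idx], idx

# Toy example: 4 encoder vectors, a codebook of 8 entries, dimension 16.
rng = np.random.default_rng(0)
z_q, idx = quantize(rng.normal(size=(4, 16)), rng.normal(size=(8, 16)))
print(z_q.shape, idx.shape)  # (4, 16) (4,)
```

In the full model the gradient is passed straight through this non-differentiable lookup; the sketch above only shows the forward nearest-neighbour assignment.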
doi:10.21437/interspeech.2020-1520 dblp:conf/interspeech/LiuLWGGD20 fatcat:eidctr5wgjd6ba75p7e4kgp6s4
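The temporal attention idea behind TACN can be illustrated with a generic attention-pooling step over frame-level features: each time step gets a scalar score, the scores are softmax-normalized over time, and the frames are summed with those weights to form an utterance-level embedding. This is a hedged sketch of that general mechanism, not the paper's exact parameterisation; the scoring vector `w` here is a hypothetical stand-in for learned attention parameters:

```python
import numpy as np

def temporal_attention_pool(features, w):
    """Attention-weighted pooling over the time axis.

    features: (T, D) frame-level features (e.g. from a temporal conv stack)
    w:        (D,)   scoring vector (hypothetical learned parameter)
    Returns a (D,) utterance-level embedding.
    """
    scores = features @ w                          # (T,) one score per frame
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return alpha @ features                        # weighted sum of frames

rng = np.random.default_rng(1)
emb = temporal_attention_pool(rng.normal(size=(100, 32)), rng.normal(size=32))
print(emb.shape)  # (32,)
```

Because the pooled embedding is a convex combination of frames, emotionally salient frames can dominate the utterance representation, which is the usual motivation for attention over plain average pooling.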