Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks

Zhong-Qiu Wang, Ivan Tashev
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Accurately recognizing speaker emotion and age/gender from speech can provide a better user experience for many spoken dialogue systems. In this study, we propose to use deep neural networks (DNNs) to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time. The feature encoding process is designed to be jointly trained with the utterance-level classifier for better classification. A kernel extreme learning machine (ELM) is further trained on the encoded vectors for better utterance-level classification. Experiments on a Mandarin dataset demonstrate the effectiveness of our proposed methods on speech emotion and age/gender recognition tasks.
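The pooling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes mean pooling, which the abstract does not specify, and uses hand-written activation values in place of a real DNN; the function name `mean_pool` and the toy utterances are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def mean_pool(frame_activations: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level activations over time into one fixed-length vector.

    Assumption: mean pooling. The abstract only says activations are
    "pooled over time", so other pooling functions are equally plausible.
    """
    if not frame_activations:
        raise ValueError("utterance has no frames")
    dim = len(frame_activations[0])
    totals = [0.0] * dim
    for frame in frame_activations:
        if len(frame) != dim:
            raise ValueError("all frames must have the same dimension")
        for i, value in enumerate(frame):
            totals[i] += value
    return [total / len(frame_activations) for total in totals]


# Two hypothetical utterances of different lengths, each frame holding
# the activations of 3 last-hidden-layer units (values are made up).
short_utt = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
long_utt = [[1.0, 1.0, 1.0]] * 5

short_vec = mean_pool(short_utt)
long_vec = mean_pool(long_utt)
```

Both utterances map to vectors of the same length regardless of how many frames they contain, which is what allows an utterance-level classifier to be trained on the encoded vectors.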
doi:10.1109/icassp.2017.7953138 dblp:conf/icassp/WangT17 fatcat:mvqznfj5mrg5vknkznkwgfgmbm