A Bimodal Approach for Speech Emotion Recognition using Audio and Text
Journal of Internet Services and Information Security
This paper presents a novel bimodal speech emotion recognition system based on the analysis of acoustic and linguistic information. We propose a decision-level fusion strategy that leverages both emotions and sentiments extracted from audio and from text transcriptions of extemporaneous speech utterances. We perform an experimental study to demonstrate the effectiveness of the proposed methods on the emotional speech database RAMAS, reporting classification results for 7 emotional states (happy, surprised, angry, sad, scared, disgusted, neutral) and 3 sentiment categories (positive, negative, neutral). We compare the relative performance of unimodal and bimodal systems, analyze their effectiveness at different levels of annotation agreement, and discuss the effect of reducing the training data size on the overall performance of the systems. We also provide important insights into the contribution of each modality to optimal emotion classification performance, which reaches UAR=72.01% at the highest (fifth) level of annotation agreement.
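To make the idea of decision-level fusion concrete, the following is a minimal illustrative sketch, not the paper's actual method: it assumes each unimodal classifier (audio, text) outputs per-class posterior probabilities over the seven emotions, and fuses them with a hypothetical weighted average before taking the argmax. The weight value and helper names are assumptions for illustration only.

```python
import numpy as np

# The seven emotional states studied in the paper.
EMOTIONS = ["happy", "surprised", "angry", "sad", "scared", "disgusted", "neutral"]

def fuse_decisions(p_audio, p_text, w_audio=0.5):
    """Decision-level (late) fusion sketch: weighted average of the
    per-class posteriors from the audio and text classifiers,
    renormalized to sum to 1. The weight is a hypothetical parameter."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_text = np.asarray(p_text, dtype=float)
    fused = w_audio * p_audio + (1.0 - w_audio) * p_text
    return fused / fused.sum()

# Example: the audio model favors "angry", the text model favors "sad";
# the fused decision arbitrates between the two modalities.
p_a = [0.05, 0.05, 0.50, 0.20, 0.10, 0.05, 0.05]
p_t = [0.05, 0.05, 0.15, 0.55, 0.05, 0.05, 0.10]
fused = fuse_decisions(p_a, p_t, w_audio=0.6)
prediction = EMOTIONS[int(np.argmax(fused))]
```

With a slightly audio-weighted fusion (w_audio=0.6), the "angry" hypothesis narrowly wins here; shifting the weight toward the text modality would flip the decision to "sad", which is exactly the kind of modality-contribution trade-off the paper analyzes.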