Feature Augmenting Networks for Improving Depression Severity Estimation from Speech Signals

Le Yang, Dongmei Jiang, Hichem Sahli
2020 IEEE Access  
Depressive disorder has become one of the major psychological diseases endangering human health. Researchers in the affective computing community are supporting the development of reliable depression severity estimation systems, drawing on multiple modalities (speech, face, text), to assist doctors in their diagnosis. However, the limited amount of annotated data has become the main bottleneck restricting research on depression screening, especially when deep learning models are used. To alleviate this
issue, in this work we propose to use a Deep Convolutional Generative Adversarial Network (DCGAN) for feature augmentation to improve depression severity estimation from speech. To the best of our knowledge, this is the first attempt to apply a Generative Adversarial Network to depression severity estimation from speech. In addition, to measure the quality of the augmented features, we propose three measurement criteria, characterizing the spatial, frequency and representation learning aspects of the augmented features. Finally, the augmented features are used to train depression estimation models. Experiments are carried out on speech signals from the Audio Visual Emotion Challenge (AVEC2016) depression dataset, and the relationship between model performance and data size is explored. Our experimental results show that: 1) The combination of the three proposed evaluation criteria can effectively and comprehensively evaluate the quality of the augmented features. 2) As the size of the augmented data increases, the performance of depression severity estimation gradually improves and the model converges to a stable state. 3) The proposed DCGAN-based data augmentation approach effectively improves the performance of depression severity estimation, with the root mean square error (RMSE) reduced to 5.520 and the mean absolute error (MAE) reduced to 4.634, which is better than most of the state-of-the-art results on AVEC 2016.

INDEX TERMS Depression estimation, audio features, data augmentation, deep convolutional generative adversarial network, spatial domain, frequency domain, deep learning aspect.

HICHEM SAHLI is currently a Professor in computer vision and machine learning with the Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), and a Group-Coordinator with the Interuniversity Microelectronics Centre (IMEC). He coordinates the Joint VUB-NPU Audio-Visual Signal Processing (AVSP) Laboratory.
He has authored or coauthored over 310 refereed journal and conference papers. His research interests include theoretical and applied problems related to computer vision, machine learning, and signal, audio, and image processing, for applications linked to affective computing, multimodal interaction, and behavior analysis.
doi:10.1109/access.2020.2970496 fatcat:3nzusy7skndbjmhfpbwbyy56yq
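
The abstract reports performance as RMSE and MAE between predicted and reference depression severity scores. As a minimal sketch of how these two metrics are conventionally computed (this is not the authors' evaluation code; the function names and the sample scores below are hypothetical and chosen only for illustration):

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up severity scores, not taken from the AVEC2016 data.
true_scores = [2, 10, 15, 7]
pred_scores = [4, 8, 15, 11]

print(f"RMSE = {rmse(true_scores, pred_scores):.3f}")
print(f"MAE  = {mae(true_scores, pred_scores):.3f}")
```

Because RMSE squares each error before averaging, it penalizes large individual mistakes more heavily than MAE, which is why the two values are usually reported together.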