Deep Speaker Embedding Extraction with Channel-Wise Feature Responses and Additive Supervision Softmax Loss Function

Jianfeng Zhou, Tao Jiang, Zheng Li, Lin Li, Qingyang Hong
Interspeech 2019
In speaker verification, convolutional neural networks (CNNs) have been successfully leveraged to achieve strong performance. Most CNN-based models focus primarily on learning distinctive speaker embeddings along the horizontal direction (the time axis). However, the feature relationships between channels are usually neglected. In this paper, we first explore an alternative direction, recalibrating the channel-wise features by introducing the recently proposed ... on" (SE) module for image classification. We incorporate the SE blocks into deep residual networks (ResNet-SE) and demonstrate a slight improvement on the VoxCeleb corpora. Additionally, we propose a new loss function, additive supervision softmax (AS-Softmax), which makes full use of prior knowledge about misclassified samples at the training stage by imposing a larger penalty on them to regularize the training process. Experimental results on the VoxCeleb corpora demonstrate that the proposed loss further improves the performance of the speaker verification system, especially when ResNet-SE is combined with AS-Softmax.
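To make the idea of channel-wise recalibration concrete, here is a minimal sketch of an SE-style block as it is commonly described: each channel is pooled to a single scalar, the scalars pass through a small two-layer bottleneck ending in a sigmoid, and the resulting gates rescale the channels. This is not the authors' implementation. The function name `se_recalibrate`, the `[C][F][T]` layout, the bottleneck size, and the toy weights are all assumptions chosen for illustration; in a real network the weights are learned and the block sits inside each residual unit.

```python
import math


def se_recalibrate(x, w1, b1, w2, b2):
    """Rescale each channel of x by a gate in (0, 1).

    x  : feature map as nested lists, shape [C][F][T]
         (channel, frequency, time); this layout is an assumption
    w1 : [C_r][C] weights of the bottleneck layer (hypothetical sizes)
    b1 : [C_r] biases of the bottleneck layer
    w2 : [C][C_r] weights of the expansion layer
    b2 : [C] biases of the expansion layer
    """
    # Pool: average every channel over all frequency/time positions.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

    # Bottleneck dense layer followed by ReLU.
    h = [
        max(0.0, sum(w * zi for w, zi in zip(w_row, z)) + b)
        for w_row, b in zip(w1, b1)
    ]

    # Expansion dense layer followed by a sigmoid: one gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-(sum(w * hi for w, hi in zip(w_row, h)) + b)))
        for w_row, b in zip(w2, b2)
    ]

    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]


# Toy input: 2 channels, 1 frequency bin, 2 time steps.
x = [[[1.0, 2.0]], [[3.0, 4.0]]]
# Zero expansion weights and biases make every gate sigmoid(0) = 0.5.
y = se_recalibrate(x, w1=[[1.0, 1.0]], b1=[0.0], w2=[[0.0], [0.0]], b2=[0.0, 0.0])
print(y)  # [[[0.5, 1.0]], [[1.5, 2.0]]]
```

The gate depends only on a per-channel summary, so the block adds few parameters relative to the convolutional layers it modulates. No sketch is given for AS-Softmax because the abstract does not state its formula.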
doi:10.21437/interspeech.2019-1704 dblp:conf/interspeech/ZhouJLLH19 fatcat:6va5knr4cnf4lhh2mjlpecybua