Analysis of Complementary Information Sources in the Speaker Embeddings Framework

Mahesh Kumar Nandwana, Mitchell McLaren, Diego Castan, Julien van Hout, Aaron Lawson
Interspeech 2018
Deep neural network (DNN)-based speaker embeddings have resulted in new, state-of-the-art text-independent speaker recognition technology. However, very limited effort has been made to understand DNN speaker embeddings. In this study, we analyze how speaker recognition systems based on speaker embeddings behave with different front-end features, including the standard Mel-frequency cepstral coefficients (MFCC), power-normalized cepstral coefficients (PNCC), and perceptual linear prediction (PLP). Using a speaker recognition system based on DNN speaker embeddings and probabilistic linear discriminant analysis (PLDA), we compared different approaches to leveraging complementary information using score-, embedding-, and feature-level combination. We report results on the Speakers in the Wild (SITW) and NIST SRE 2016 datasets. We found that the first and second embedding layers are complementary in nature. By applying score- and embedding-level fusion, we demonstrate relative improvements in equal error rate of 17% on NIST SRE 2016 and 10% on SITW over the baseline system.
doi:10.21437/interspeech.2018-1102 dblp:conf/interspeech/NandwanaMCHL18 fatcat:s4xhvarvkfcjnkwsu6gvkwuzjy