End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu
Interspeech 2020
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. The recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. Then, the generated multiple attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69 % diarization error rate (DER) on simulated mixtures and an 8.07 % DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56 % and 9.54 %, respectively. In conditions with unknown numbers of speakers, our method attained a 15.29 % DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43 % DER.

Index Terms: speaker diarization, encoder-decoder, attractor calculation

3. End-to-end neural diarization: Review

Here we briefly introduce our end-to-end diarization framework named EEND [15, 16]. The EEND takes a T-length sequence
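The decoding step described in the abstract — multiplying the generated attractors with the embedding sequence to obtain per-frame speaker activities — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the exact shapes, and the use of a sigmoid to map dot products to posteriors are assumptions inferred from the description above.

```python
import numpy as np

def speaker_activities(embeddings: np.ndarray, attractors: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of the EDA output stage.

    embeddings: (T, D) frame-level speech embeddings (e.g. from SA-EEND).
    attractors: (S, D) attractors, one per detected speaker.
    Returns a (T, S) matrix of per-frame speaker-activity posteriors.
    """
    logits = embeddings @ attractors.T        # (T, S) dot products
    return 1.0 / (1.0 + np.exp(-logits))      # element-wise sigmoid

# Toy usage: T=5 frames, D=4 embedding dims, S=2 speakers.
rng = np.random.default_rng(0)
acts = speaker_activities(rng.standard_normal((5, 4)),
                          rng.standard_normal((2, 4)))
print(acts.shape)  # (5, 2): one activity value per frame per speaker
```

Because the number of rows in `attractors` is not fixed, the same multiplication yields as many activity streams as there are attractors, which is how the method stays flexible in the number of speakers.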
doi:10.21437/interspeech.2020-1022