ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification

Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu
Interspeech 2020
The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches capture time-sequential information well, they lack the delicate transformations needed for deep representation. To solve this problem, we propose two TDNN architectures: RET integrates shortcut connections into conventional time-delay blocks, and ARET adopts a split-transform-merge strategy to extract more discriminative representations.
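The two ideas named above, a shortcut connection around a time-delay transform and a split-transform-merge aggregation of parallel transforms, can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the abstract does not give layer sizes, context widths, or branch structure, so the three-frame context, the full-width branches (no bottleneck), and the names `tdnn_layer`, `ret_block`, and `aret_block` are all assumptions made for illustration.

```python
def tdnn_layer(frames, weights, dilation):
    """Toy time-delay layer (assumed form): each output frame is an affine map
    plus ReLU over the input frames at offsets -dilation, 0, +dilation.
    Frames outside the utterance are treated as zeros."""
    num_frames, dim = len(frames), len(frames[0])
    zero = [0.0] * dim
    out = []
    for t in range(num_frames):
        context = []
        for k in (-1, 0, 1):
            j = t + k * dilation
            context.extend(frames[j] if 0 <= j < num_frames else zero)
        # One weight row (length 3 * dim) per output dimension.
        out.append([max(0.0, sum(w * c for w, c in zip(row, context)))
                    for row in weights])
    return out


def add(a, b):
    """Element-wise sum of two frame sequences of identical shape."""
    return [[x + y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def ret_block(frames, weights, dilation):
    """Residual idea: output = input + time-delay transform of the input."""
    return add(frames, tdnn_layer(frames, weights, dilation))


def aret_block(frames, branch_weights, dilation):
    """Split-transform-merge idea: run several parallel transforms, sum them,
    then add the shortcut. Each branch here is full width (a simplification)."""
    dim = len(frames[0])
    merged = [[0.0] * dim for _ in frames]
    for weights in branch_weights:
        merged = add(merged, tdnn_layer(frames, weights, dilation))
    return add(frames, merged)
```

With all-zero weights, `ret_block` returns its input unchanged, which is the property that makes shortcut connections easy to stack: a block only has to learn a correction to the identity mapping.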
Experiments on VoxCeleb datasets without augmentation indicate that ARET realizes satisfactory performance on the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H, with 1.389%, 1.520%, and 2.614% equal error rate (EER), respectively. Compared to state-of-the-art results on these test sets, RET achieves a 23% to 43% relative reduction in EER, and ARET reaches 32% to 45%.
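For readers unfamiliar with the metrics, the EER is the operating point at which the false acceptance rate equals the false rejection rate, and a relative reduction is the drop in EER divided by the baseline EER. The sketch below shows one simple way to compute both. The threshold sweep, the function names, and the toy scores are illustrative assumptions, not the evaluation code used in the paper, and the baseline figure in the usage note is hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over every observed score.
    A trial is accepted when its score is >= the threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        # False rejection: a target (same-speaker) trial scored below threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above threshold.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Fraction by which new_eer improves on baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer
```

As a usage example with made-up numbers, `relative_reduction(2.0, 1.5)` gives 0.25, meaning a system at 1.5% EER is a 25% relative improvement over a hypothetical baseline at 2.0% EER.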
doi:10.21437/interspeech.2020-1626 dblp:conf/interspeech/ZhangWLWLZJX20 fatcat:by5pzsk46rhbpnuzifvb5sunku