A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit <a rel="external noopener" href="https://www.isca-speech.org/archive/Interspeech_2020/pdfs/2470.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
<a target="_blank" rel="noopener" href="https://fatcat.wiki/container/trpytsxgozamtbp7emuvz2ypra" style="color: black;">Interspeech 2020</a>
Recently, speaker verification systems using deep neural networks have shown their effectiveness on large-scale datasets. The widely used pairwise loss functions only consider discrimination within a mini-batch (short-term), so neither the speaker identity information nor the whole training dataset is fully exploited. Thus, these pairwise comparisons may suffer from the interference and variance introduced by speaker-unrelated factors. To tackle this problem, we introduce the<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.21437/interspeech.2020-2470">doi:10.21437/interspeech.2020-2470</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/interspeech/PengGZ20.html">dblp:conf/interspeech/PengGZ20</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/s6sq6ix3zjbe7hhr2xsfcjt5fy">fatcat:s6sq6ix3zjbe7hhr2xsfcjt5fy</a> </span>
identity information to form long-term speaker embedding centroids, which are determined by all the speakers in the training set. During training, each centroid dynamically accumulates the statistics of all samples belonging to a specific speaker. Since the long-term speaker embedding centroids are associated with a wide range of training samples, they have the potential to be more robust and discriminative. Finally, these centroids are employed to construct a loss function, named long short term speaker loss (LSTSL). The proposed LSTSL constrains the distances between samples and the centroid of the same speaker to be compact, while those to the centroids of different speakers are dispersed. Experiments are conducted on VoxCeleb1 and VoxCeleb2. Results on the VoxCeleb1 dataset demonstrate the effectiveness of our proposed LSTSL.
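The centroid accumulation and pull/push objective described in the abstract can be sketched as follows. This is a minimal illustration only: the momentum-style running-average update, the squared-Euclidean distance, the hinge margin, and all names (<code>update_centroid</code>, <code>lstsl_sketch</code>, <code>momentum</code>, <code>margin</code>) are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def update_centroid(centroid, embedding, momentum=0.9):
    # Dynamically accumulate a speaker's statistics across training
    # (momentum-style running average; illustrative, not the paper's rule).
    return momentum * centroid + (1.0 - momentum) * embedding

def lstsl_sketch(embeddings, labels, centroids, margin=1.0):
    """Pull each embedding toward its own speaker centroid (compact) and
    push it at least `margin` away from every other centroid (dispersed)."""
    loss = 0.0
    for x, y in zip(embeddings, labels):
        d = np.sum((centroids - x) ** 2, axis=1)  # squared distance to every centroid
        pull = d[y]                               # within-speaker term: keep compact
        mask = np.ones(len(centroids), dtype=bool)
        mask[y] = False
        push = np.maximum(0.0, margin - d[mask]).sum()  # hinge on other speakers
        loss += pull + push
    return loss / len(embeddings)
```

With embeddings that already sit on their own centroids and far from all others, both terms vanish and the sketch returns zero, matching the intuition that the loss rewards compact same-speaker clusters that are well separated from other speakers.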
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20201210223303/https://www.isca-speech.org/archive/Interspeech_2020/pdfs/2470.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/4a/69/4a694625ad4ca03bda0c23d6e24a642fb7016610.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.21437/interspeech.2020-2470"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> Publisher / doi.org </button> </a>