4,972 Hits in 3.5 sec

Learning Discriminative Features for Speaker Identification and Verification

Sarthak Yadav, Atul Rai
2018 Interspeech 2018  
We also propose a unified deep learning system for both Text-Independent Speaker Recognition and Speaker Verification, by training the proposed network architecture under the joint supervision of Softmax  ...  loss and Center loss [2] to obtain highly discriminative deep features that are suited for both Speaker Identification and Verification Tasks.  ...  Recognition. [20] studied the optimal CNN design for speaker identification and clustering, as well as elaborated on how to apply transfer learning, viz., transfer a network trained for speaker identification  ... 
doi:10.21437/interspeech.2018-1015 dblp:conf/interspeech/YadavR18 fatcat:caeplgy7efht3n3enprie2x34a
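The joint supervision described in the entry above — softmax loss plus a weighted center loss [2] — can be sketched in NumPy (a minimal illustration, not the authors' network; the λ = 0.003 weight and the explicit `centers` argument are assumptions made for the sketch):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Standard softmax cross-entropy, averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(features, labels, centers):
    """Center loss: mean squared distance of each deep feature to its class center."""
    diffs = features - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()

def joint_loss(logits, features, labels, centers, lam=0.003):
    """Joint supervision: softmax loss plus lambda-weighted center loss."""
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)
```

The intuition: softmax loss alone separates classes, while the center-loss term pulls deep features toward a per-speaker center, yielding embeddings discriminative enough for both identification and verification.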

Multi-instrument music transcription based on deep spherical clustering of spectrograms and pitchgrams

Keitaro Tanaka, Takayuki Nakatsuka, Ryo Nishikimi, Kazuyoshi Yoshii, Shigeo Morishima
2020 Zenodo  
To improve the performance of transcription, we propose a joint spectrogram and pitchgram clustering method based on the timbral and pitch characteristics of musical instruments.  ...  parts with a deep spherical clustering technique.  ...  To solve this permutation problem, a method called deep clustering has been proposed that treats speech separation for arbitrary speakers as a clustering problem, rather than a classification problem  ... 
doi:10.5281/zenodo.4245435 fatcat:ziaurfyjpfbx5fokuk3euooctm
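The spherical clustering step in the entry above can be illustrated with plain spherical k-means — clustering unit-normalised vectors by cosine similarity (a NumPy sketch under that assumption; the paper's deep embedding network and the joint spectrogram/pitchgram features are not reproduced here):

```python
import numpy as np

def spherical_kmeans(X, k, iters=50, seed=0):
    """Cosine-similarity k-means: embeddings and centroids live on the unit sphere."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # project onto the sphere
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = (X @ centroids.T).argmax(axis=1)       # assign by cosine similarity
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)    # renormalise the mean direction
    return labels, centroids
```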

Cross-lingual Text-independent Speaker Verification Using Unsupervised Adversarial Discriminative Domain Adaptation

Wei Xia, Jing Huang, John H.L. Hansen
2019 ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Further data analysis of the ADDA-adapted speaker embedding shows that the learned speaker embeddings can perform well on speaker classification for the target domain data, and are less dependent with respect  ...  Being able to improve a cross-lingual speaker verification system using unlabeled data can greatly increase the robustness of the system and reduce human labeling costs.  ...  The speaker classifier and domain classifier both take input from the joint feature extractor, and are optimized to excel in their own tasks.  ... 
doi:10.1109/icassp.2019.8682259 dblp:conf/icassp/XiaHH19 fatcat:ulwrq5klbbad7dfbzdtg3b6sga
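The three-part setup in the entry above — joint feature extractor, speaker classifier, domain classifier — trains the extractor to fool the domain classifier. A gradient-reversal-style combined objective illustrates the idea (a simplification: ADDA itself uses a GAN-style alternating objective, and the λ weight here is an assumed value):

```python
import numpy as np

def binary_ce(p, y):
    """Binary cross-entropy for domain labels (0 = source, 1 = target)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def adversarial_feature_loss(spk_loss, domain_probs, domain_labels, lam=0.1):
    """The feature extractor minimises the speaker loss while *maximising* the
    domain classifier's loss, pushing source and target embeddings toward
    being indistinguishable."""
    return spk_loss - lam * binary_ce(domain_probs, domain_labels)
```

When the domain classifier is reduced to chance (p = 0.5 everywhere), its loss is ln 2 and cannot be driven higher, which is the adversarial equilibrium the extractor aims for.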

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [article]

Yihan Wu, Xi Wang, Shaofei Zhang, Lei He, Ruihua Song, Jian-Yun Nie
2022 arXiv   pre-print
It leverages an emotion lexicon and uses contrastive learning and deep clustering. We further integrate the style representation as a conditioned embedding in a multi-style Transformer TTS.  ...  Expressive speech synthesis, like audiobook synthesis, is still challenging for style representation learning and prediction.  ...  Deep embedded clustering (DEC) [17] maps the observed data to a low-dimensional space and optimizes KL divergence as clustering objective.  ... 
arXiv:2206.12559v1 fatcat:nwpoxnylcvhnjbkltimr4d4zfu
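The DEC objective [17] mentioned in the snippet above — a Student's-t soft assignment q, a sharpened target p, and a KL clustering loss — can be written out directly (a NumPy sketch of the loss alone, without the encoder that DEC optimises jointly):

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """DEC soft assignment q_ij: Student's t kernel between embedding and cluster center."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p_ij = (q_ij^2 / f_j) / normaliser, with f_j the soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """Clustering objective KL(P || Q), minimised w.r.t. embeddings and centers."""
    return (p * np.log(p / q)).sum()
```

Minimising KL(P || Q) pulls each embedding toward the center it is already most confident about, which is how DEC refines clusters without labels.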

Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization

Ruiqing Yin, Hervé Bredin, Claude Barras
2018 Interspeech 2018  
Then, we propose to use affinity propagation on top of neural speaker embeddings for speech turn clustering, outperforming regular Hierarchical Agglomerative Clustering (HAC).  ...  Finally, all these modules are combined and jointly optimized to form a speaker diarization pipeline in which all but the clustering step are based on RNNs.  ...  We would like to thank Sylvain Meignier for providing us with the output of the LIUM's S4D system.  ... 
doi:10.21437/interspeech.2018-1750 dblp:conf/interspeech/YinBB18 fatcat:a6ttuydymzdatck6vjgrfaxkiy
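The clustering step in the entry above — affinity propagation on top of neural speaker embeddings — might look like this with scikit-learn (a sketch; the precomputed cosine-similarity affinity and the toy embeddings in the example are assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_speech_turns(embeddings):
    """Cluster speech-turn embeddings with affinity propagation on cosine similarity.

    Unlike Hierarchical Agglomerative Clustering, affinity propagation picks
    exemplar turns and the number of speakers automatically from the
    similarity matrix."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = X @ X.T                      # cosine similarity in [-1, 1]
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    return ap.fit_predict(similarity)
```

Not having to fix a distance threshold in advance (as HAC requires) is one plausible reason affinity propagation outperforms HAC here.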

End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [article]

Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu
2020 arXiv   pre-print
The most common approach to speaker diarization is clustering of speaker embeddings.  ...  However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has  ...  They include, for example, joint modeling of speaker embedding extraction and scoring [16] , [32] and joint modeling of SAD and speaker embedding [33] .  ... 
arXiv:2003.02966v1 fatcat:ca7kgnkbjbb5lc7svksaoxretm
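Reformulating diarization as multi-label classification, as in the entry above, requires a permutation-invariant loss, since the order of reference speakers is arbitrary. A minimal NumPy version of permutation-invariant binary cross-entropy (a sketch of the idea, not the authors' implementation, which operates on neural network outputs):

```python
import numpy as np
from itertools import permutations

def pit_bce(pred, target):
    """Permutation-invariant binary cross-entropy for end-to-end diarization.

    pred:   (T, S) per-frame speech-activity probabilities for S speakers
    target: (T, S) binary reference labels; overlap = multiple 1s in a frame"""
    eps = 1e-7
    p = np.clip(pred, eps, 1 - eps)
    best = np.inf
    for perm in permutations(range(target.shape[1])):
        t = target[:, perm]
        loss = -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()
        best = min(best, loss)
    return best
```

Because frames may have multiple active speakers, this loss handles overlap naturally — addressing problem (ii) that the snippet raises against clustering-based approaches.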

Speaker diarization using latent space clustering in generative adversarial network [article]

Monisankha Pal, Manoj Kumar, Raghuveer Peri, Tae Jin Park, So Hyun Kim, Catherine Lord, Somer Bishop, Shrikanth Narayanan
2019 arXiv   pre-print
In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network.  ...  It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker  ...  Recently, deep embedded clustering on d-vectors was introduced for speaker diarization [13] .  ... 
arXiv:1910.11398v1 fatcat:w33g4bxsw5fy7iodgfp766vmw4

A Review of Speaker Diarization: Recent Advances with Deep Learning [article]

Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan
2021 arXiv   pre-print
More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made for  ...  Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way of jointly modeling these  ...  Joint Optimization of Segmentation and Clustering This subsection introduces a VB-HMM-based diarization technique, which can be regarded as a joint optimization of segmentation and clustering, and thus  ... 
arXiv:2101.09624v4 fatcat:kvjhbg5axnc2rhhmt4bridt23q

Monaural Audio Speaker Separation with Source Contrastive Estimation [article]

Cory Stephenson, Patrick Callier, Abhinav Ganesh, Karl Ni
2017 arXiv   pre-print
Our approach is similar to recent deep neural network clustering and permutation-invariant training research; we use weighted spectral features and masks to augment individual speaker frequencies while  ...  Our approach involves deep recurrent neural network regression to a vector space that is descriptive of independent speakers.  ...  During training, the learning objective of deep clustering encourages the embedding model to generate similar vectors for each time-frequency bin associated with a particular speaker.  ... 
arXiv:1705.04662v1 fatcat:xb5au2ofknambjmp5kxrkbkhne
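The deep clustering objective referenced in the snippet above pushes embeddings of time-frequency bins from the same speaker together; it is usually written as a Frobenius norm between the estimated and ideal affinity matrices (a NumPy sketch of the loss alone; V and Y here are toy stand-ins for network outputs and reference assignments):

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2.

    V: (N, D) unit-norm embeddings for N time-frequency bins
    Y: (N, S) one-hot speaker membership per bin

    Bins of the same speaker are pulled to similar embeddings (affinity 1),
    bins of different speakers pushed apart (affinity 0)."""
    A_hat = V @ V.T      # estimated affinity between bins
    A = Y @ Y.T          # ideal binary affinity
    return ((A_hat - A) ** 2).sum()
```

In practice the equivalent expansion ||VᵀV||² − 2||VᵀY||² + ||YᵀY||² is used so that the N×N affinity matrices are never formed explicitly.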

Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection [article]

Aswin Sivaraman, Minje Kim
2021 arXiv   pre-print
The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal.  ...  In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set.  ...  Figure 2: Subplots comparing various choices of K for using k-means clustering on the speaker embeddings.  ... 
arXiv:2105.03542v1 fatcat:ntsjw3ty2bat7ninwddh5fjjru
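The grouping-plus-gating scheme in the entry above — k-means partitions over training-speaker embeddings, then nearest-centroid selection at test time — can be sketched as follows (function names and the toy embeddings are illustrative, not the paper's; the specialist denoising models themselves are omitted):

```python
import numpy as np

def train_partitions(embeddings, k, iters=50, seed=0):
    """Partition training speakers into K groups by k-means on their embeddings;
    one specialist enhancement model would then be trained per group."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        d = ((embeddings[:, None] - centroids[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = embeddings[labels == j].mean(0)
    return centroids

def select_specialist(test_embedding, centroids):
    """Gating step: route an unseen (zero-shot) speaker to the nearest group."""
    return int(((centroids - test_embedding) ** 2).sum(1).argmin())
```

The gating cost at test time is just K squared distances, which matches the snippet's claim that specialist selection is inexpensive.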

2021 Index IEEE/ACM Transactions on Audio, Speech, and Language Processing Vol. 29

2021 IEEE/ACM Transactions on Audio Speech and Language Processing  
The primary entry includes the coauthors' names, the title of the paper or other item, and its location, specified by the publication abbreviation, year, and inclusive pagination.  ...  Departments and other items may also be covered if they have been judged to have archival value. The Author Index contains the primary entry for each item, listed under the first author's name.  ...  Hsu, J., +, TASLP 2021 1675-1686 Affine transforms Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection.  ... 
doi:10.1109/taslp.2022.3147096 fatcat:7nl52k7sjfalbhpxtum3y5nmje

Online Speaker Diarization with Relation Network [article]

Xiang Li, Yucheng Zhao, Chong Luo, Wenjun Zeng
2020 arXiv   pre-print
Unlike conventional diarization systems which consist of several independently-optimized modules, RenoSD implements voice-activity-detection (VAD), embedding extraction, and speaker identity association  ...  The most striking feature of RenoSD is that it adopts a meta-learning strategy for speaker identity association.  ...  speech parts into short segments. (2) Embedding extraction: speaker embeddings such as i-vectors [6, 7] , d-vectors [8, 9] , or deep speaker embeddings [10, 11] are extracted for each short segment  ... 
arXiv:2009.08162v2 fatcat:2udhgcd6rzgadh53oby63i747u

Conference Program

2021 2021 18th International Joint Conference on Computer Science and Software Engineering (JCSSE)  
15.15-15.30 Classification of Abusive Thai Language Content in Social Media Using Deep Learning  ...  Incorporating Prior Scientific Knowledge Into Deep Learning for Precipitation Nowcasting  ...  Features from Light Curves for Automatic Classification of Variable Stars  ...  Deep Index Price Forecasting in Steel Industry (Prapaporn Techa-Angkoon, Thittaporn Ganokratanaa and Mahasak  ... 
doi:10.1109/jcsse53117.2021.9493806 fatcat:3bvg7qdgerf4toijffq75hvfym

End-to-End Multi-Speaker Speech Recognition Using Speaker Embeddings and Transfer Learning

Pavel Denisov, Ngoc Thang Vu
2019 Interspeech 2019  
Our experimental results on overlapped speech datasets show that joint conditioning on speaker embeddings and transfer learning significantly improves the ASR performance.  ...  This proposed framework does not require any parallel non-overlapped speech materials and is independent of the number of speakers.  ...  The first one [18] connects a pretrained deep clustering model and end-to-end ASR for subsequent joint fine-tuning for better ASR results.  ... 
doi:10.21437/interspeech.2019-1130 dblp:conf/interspeech/DenisovV19 fatcat:erd5qsf4ifegnmmwvjcn5il3fm

End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning [article]

Pavel Denisov, Ngoc Thang Vu
2019 arXiv   pre-print
Our experimental results on overlapped speech datasets show that joint conditioning on speaker embeddings and transfer learning significantly improves the ASR performance.  ...  This proposed framework does not require any parallel non-overlapped speech materials and is independent of the number of speakers.  ...  The first one [18] connects a pretrained deep clustering model and end-to-end ASR for subsequent joint fine-tuning for better ASR results.  ... 
arXiv:1908.04737v1 fatcat:d7xwjygqhndizlmlayb4xcnkw4
Showing results 1 — 15 out of 4,972 results