6,066 Hits in 4.0 sec

Supervised Contrastive Learning for Accented Speech Recognition [article]

Tao Han, Hantao Huang, Ziang Yang, Wei Han
2021 arXiv   pre-print
In this paper, we study the supervised contrastive learning framework for accented speech recognition.  ...  Neural network based speech recognition systems suffer from performance degradation due to accented speech, especially unfamiliar accents.  ...  Conclusion In this paper, we introduce a supervised contrastive learning framework for a robust speech recognition system.  ... 
arXiv:2107.00921v1 fatcat:ofxdq4x6yfbpxemod3tzfrshpm

Accent-Robust Automatic Speech Recognition Using Supervised and Unsupervised Wav2vec Embeddings [article]

Jialu Li, Vimal Manohar, Pooja Chitkara, Andros Tjandra, Michael Picheny, Frank Zhang, Xiaohui Zhang, Yatharth Saraf
2021 arXiv   pre-print
Speech recognition models often obtain degraded performance when tested on speech with unseen accents.  ...  We also illustrate that wav2vec embeddings have more advantages for building accent-robust ASR when no accent labels are available for training supervised embeddings.  ...  In contrast to DAT, multi-task learning (MTL) with accent recognition as an auxiliary task is another common approach for building accent-robust ASR.  ... 
arXiv:2110.03520v2 fatcat:tv3dzaiwgvf5bnbr35xrrxl6tq
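
The snippet above mentions multi-task learning (MTL) with accent recognition as an auxiliary task. As a rough illustration (not code from the paper), the combined objective is the primary ASR loss plus a weighted accent-classification loss; the weight `lam` below is a hypothetical value.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of a single example from unnormalized logits."""
    shifted = logits - logits.max()          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def mtl_loss(asr_logits, asr_label, accent_logits, accent_label, lam=0.1):
    """Multi-task objective sketch: primary ASR loss plus a weighted
    accent-recognition auxiliary loss (lam is a hypothetical weight)."""
    return (cross_entropy(asr_logits, asr_label)
            + lam * cross_entropy(accent_logits, accent_label))

# toy example: 5 ASR token classes, 3 accent classes
loss = mtl_loss(np.array([2.0, 0.1, -1.0, 0.3, 0.0]), 0,
                np.array([0.5, 1.5, -0.5]), 1, lam=0.1)
print(round(loss, 4))
```

In practice both heads share an encoder and are trained jointly; the auxiliary gradient nudges the shared features toward accent awareness.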

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset [article]

Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei
2021 arXiv   pre-print
In contrast, the industry usually uses tens of thousands of hours of labeled data to build high-accuracy speech recognition (ASR) systems for resource-rich languages.  ...  Recently, there has been vast interest in self-supervised learning (SSL), where the model is pre-trained on large-scale unlabeled data and then fine-tuned on a small labeled dataset.  ...  INTRODUCTION In the past decade, the speech recognition field has made huge progress owing to deep learning techniques [1].  ... 
arXiv:2107.05233v1 fatcat:u57hswn44faclghmlysfogurha

Domain Adversarial Training for Accented Speech Recognition [article]

Sining Sun, Ching-Feng Yeh, Mei-Yuh Hwang, Mari Ostendorf, Lei Xie
2018 arXiv   pre-print
Furthermore, we find that DAT is superior to multi-task learning for accented speech recognition.  ...  In this paper, we propose a domain adversarial training (DAT) algorithm to alleviate the accented speech recognition problem.  ...  DOMAIN ADVERSARIAL TRAINING FOR ACCENTED SPEECH RECOGNITION Accented speech recognition has long been of high interest in industry due to the high recognition error rates.  ... 
arXiv:1806.02786v1 fatcat:cmdie6p2jngblhyejfolefl2qy
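
The core of domain adversarial training (DAT) is a gradient reversal layer between the shared encoder and the accent classifier. A minimal numpy sketch of that layer's behavior (the value of `LAMBDA` is hypothetical, and real systems implement this inside an autograd framework):

```python
import numpy as np

LAMBDA = 0.3  # adversarial weight (hypothetical value)

def grl_forward(x):
    """Gradient reversal layer: identity in the forward pass."""
    return x

def grl_backward(grad_from_accent_classifier, lam=LAMBDA):
    """Backward pass: the incoming gradient is scaled by -lam, so the shared
    encoder is updated to *increase* the accent classifier's loss, pushing
    its features toward accent invariance."""
    return -lam * grad_from_accent_classifier

x = np.array([0.5, -1.2, 2.0])
assert np.allclose(grl_forward(x), x)   # forward: activations unchanged
g = np.array([0.1, 0.2, -0.3])
print(grl_backward(g))                  # reversed, scaled gradient
```

The ASR branch receives ordinary gradients; only the accent branch's gradient is reversed, which is what distinguishes DAT from plain multi-task learning.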

Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation

M.A. Tuğtekin Turan, Emmanuel Vincent, Denis Jouvet
2020 Interspeech 2020  
In addition, we leverage untranscribed accented training data by means of semi-supervised learning.  ...  Current automatic speech recognition (ASR) systems trained on native speech often perform poorly when applied to non-native or accented speech.  ...  M6-M9: Supervised training on native speech and 1 h of transcribed speech for all accents.  ... 
doi:10.21437/interspeech.2020-2742 dblp:conf/interspeech/Turan0J20 fatcat:yw6dswposzbdxet5wpmfbp433y

Accented Speech Recognition: A Survey [article]

Arthur Hinsvark
2021 arXiv   pre-print
Automatic Speech Recognition (ASR) systems generalize poorly on accented speech.  ...  We present a survey of current promising approaches to accented speech recognition and highlight the key challenges in the space.  ...  Thus, accent-robustness is needed for speech recognition to be solved in the wild. Generalizing speech recognition across dialects is a hard problem for real-world speech systems.  ... 
arXiv:2104.10747v2 fatcat:hmsi4ufbhnaifibk5gvuloxnv4

An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition

Shreya Kumar, Swarnalaxmi Thiruvenkadam
2021 International Journal of Recent Contributions from Engineering, Science & IT  
Feature extraction is an integral part of speech emotion recognition.  ...  The use of the spectral contrast feature has noticeably increased prediction accuracy in speech emotion recognition systems, as it performs well in distinguishing emotions with significant differences  ...  It uses a supervised learning technique called backpropagation for training. Experimental Results The most commonly used features for Speech Emotion Recognition are MFCC, MEL, and chroma.  ... 
doi:10.3991/ijes.v9i2.22983 doaj:b106a430bd6d4d60b74d6281c7ceda72 fatcat:tpoeqn2lnbhhtczvc7r3idf5tq
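
Spectral contrast measures, per frequency sub-band, the gap between spectral peaks and valleys. The following is a simplified numpy sketch of the idea only; the standard implementation (e.g. librosa's) uses octave-spaced bands and operates frame-by-frame on a spectrogram, and the band split and quantile here are illustrative choices.

```python
import numpy as np

def spectral_contrast(spectrum, n_bands=4, quantile=0.2):
    """Simplified spectral contrast: split a magnitude spectrum into
    sub-bands and return, per band, the log difference between the mean
    of the top `quantile` bins (peaks) and the bottom `quantile` (valleys)."""
    bands = np.array_split(np.asarray(spectrum, dtype=float), n_bands)
    contrast = []
    for band in bands:
        k = max(1, int(len(band) * quantile))
        s = np.sort(band)
        valley = s[:k].mean()
        peak = s[-k:].mean()
        contrast.append(np.log(peak + 1e-10) - np.log(valley + 1e-10))
    return np.array(contrast)

# toy spectrum: 16 bins with one strong peak in each of the 4 bands
spec = np.tile([0.1, 0.1, 5.0, 0.1], 4)
print(spectral_contrast(spec, n_bands=4))
```

A tonal band (sharp peak over a quiet floor) yields high contrast; broadband noise yields contrast near zero, which is why the feature helps separate emotions with different spectral textures.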

Sequence-Level Self-Learning with Multiple Hypotheses

Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng
2020 Interspeech 2020  
In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).  ...  We first demonstrate the effectiveness of our self-learning methods through ASR experiments in an accent adaptation task between the US and British English speech.  ...  The authors would like to thank Masaki Itagaki, Heiko Rahmel, Naoyuki Kanda, Lei He, Ziad Al Bawab, Jian Wu, and Xuedong Huang for their project support and technical discussions.  ... 
doi:10.21437/interspeech.2020-2020 dblp:conf/interspeech/KumataniDGGELZ20 fatcat:ok6cdndf2jbubh4jhpqfjqesei

Accent modification for speech recognition of non-native speakers using neural style transfer

Kacper Radzikowski, Le Wang, Osamu Yoshie, Robert Nowak
2021 EURASIP Journal on Audio, Speech, and Music Processing  
The results show that there is a significant relative improvement in terms of speech recognition accuracy.  ...  The main reason for this is specific pronunciation and accent features related to the speaker's mother tongue, which influence the pronunciation.  ...  Acknowledgements In our research, we use the English Speech Database Read by Japanese Students (UME-ERJ), which was provided by the Speech Resources Consortium at the National Institute of Informatics (NII-SRC)  ... 
doi:10.1186/s13636-021-00199-3 fatcat:536y6julaffmtmrrh3k4axselq

Speech Technology for Everyone: Automatic Speech Recognition for Non-Native English with Transfer Learning [article]

Toshiko Shibano
2021 arXiv   pre-print
., 2021) on L2-ARCTIC, a non-native English speech corpus (Zhao et al., 2018) under different training settings.  ...  We compare (a) models trained with a combination of diverse accents to ones trained with only specific accents and (b) results from different single-accent models.  ...  In the Accented English Speech Recognition Challenge 2020 (AESRC2020), many teams utilize transfer learning to tackle the L2 accent recognition task (Shi et al., 2021) .  ... 
arXiv:2110.00678v3 fatcat:kzee6azhorg4rfzcqnqahjihda

Speaker Identification using Speech Recognition [article]

Syeda Rabia Arshad, Syed Mujtaba Haider, Abdul Basit Mughal
2022 arXiv   pre-print
We proposed an unsupervised learning model where the model can learn speech representations with a limited dataset.  ...  This research provides a mechanism for identifying a speaker in an audio file based on human voice biometric features such as pitch, amplitude, and frequency.  ...  During training, we learn representations of speech audio by solving a contrastive task that requires identifying the true quantized latent speech representation for an already masked time  ... 
arXiv:2205.14649v1 fatcat:6u6owbzesvfpxcm4erco4mzm44
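
The contrastive task mentioned in the snippet (wav2vec 2.0-style) scores the true quantized latent for a masked time step against distractors. A minimal numpy sketch of that objective, not the paper's implementation; the temperature value is hypothetical.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """wav2vec-style contrastive objective sketch: identify the true
    quantized latent for a masked step among distractors, scored by
    cosine similarity with a softmax over candidates."""
    candidates = [true_latent] + list(distractors)
    sims = np.array([cosine(context, z) for z in candidates]) / temperature
    sims -= sims.max()                       # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                 # true latent is index 0

rng = np.random.default_rng(0)
true_z = np.array([1.0, 0.0, 0.0])
ctx_good = np.array([0.9, 0.1, 0.0])         # context aligned with true latent
distractors = [rng.standard_normal(3) for _ in range(5)]
print(contrastive_loss(ctx_good, true_z, distractors))
```

The loss is small when the context vector points at the true latent and large when it points at a distractor, which is what drives the representation learning.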

SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition [article]

Li Fu, Xiaoxiao Li, Runyu Wang, Lu Fan, Zhengchen Zhang, Meng Chen, Youzheng Wu, Xiaodong He
2022 arXiv   pre-print
To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems.  ...  End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss of the whole token sequence, while neglecting explicit phonemic-granularity supervision.  ...  To apply contrastive learning for accented speech recognition, the authors of [24] adopted SimCLR [25] from the computer vision domain, and then generated contrastive positive pairs from the model's  ... 
arXiv:2110.04187v2 fatcat:k33qfsxurnacjh47f2azjylzby
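
Supervised contrastive learning treats same-label examples as positives and everything else as negatives. A minimal numpy sketch of a SupCon-style loss (simplified from Khosla et al.; this is not SCaLa's exact phoneme-level formulation, and the temperature is a hypothetical value):

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss sketch: embeddings sharing a label are
    pulled together; all other pairs in the batch act as negatives."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature              # pairwise cosine similarities
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.arange(n) != i             # exclude the anchor itself
        log_denom = np.log(np.exp(sim[i][mask]).sum())
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        loss += -np.mean([sim[i, p] - log_denom for p in positives])
        count += 1
    return loss / max(count, 1)

# toy batch: two classes (e.g. two phonemes), well separated in 2-D
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = [0, 0, 1, 1]
print(supcon_loss(emb, labels))
```

Well-separated clusters with matching labels give a low loss; assigning labels across clusters raises it, which is the supervision signal the framework exploits.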

Neural Representations for Modeling Variation in Speech [article]

Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark Liberman, Martijn Wieling
2022 arXiv   pre-print
As an alternative, therefore, we investigate the extraction of acoustic embeddings from several self-supervised neural models.  ...  For comparison with several earlier studies, we evaluate how well these differences match human perception by comparing them with available human judgements of similarity.  ...  Acknowledgments The authors thank Hedwig Sekeres for creating the transcriptions of the Dutch speakers dataset, and Anna Pot for creating the visualization of the acoustic distance measure.  ... 
arXiv:2011.12649v3 fatcat:mifjjs23tbgmfc2bf67hr7mzhu

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment [article]

Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L. Hansen
2022 arXiv   pre-print
In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models.  ...  In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system recognition outputs show a strong correlation with human perception, based on accentedness and intelligibility.  ...  between its phoneme recognition performance and L2 speech accentedness and comprehensibility, i.e. a higher PER (more mispronunciations) corresponds to a heavier accent and lower comprehensibility.  ... 
arXiv:2203.15937v2 fatcat:n44b2x42qfd4xbasui63gsrkqq

Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT Based on the Quran Reciters Dataset [article]

Aly Moustafa, Salah A. Aly
2021 arXiv   pre-print
Such methods include audio speech recognition and eye and finger signatures. Recent tools utilize deep learning and transformers to achieve better results.  ...  In this paper, we develop a deep-learning-based model for Arabic speaker identification using the Wav2Vec2.0 and HuBERT audio representation learning tools.  ...  [17] developed a virtual learning recitation system for sighted and blind students and an efficient speech recognition engine that is speaker- and accent-independent.  ... 
arXiv:2111.06331v1 fatcat:y2xayleywvagfogradf6l5mep4
Showing results 1 — 15 out of 6,066 results