
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [article]

Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe
2021 arXiv   pre-print
In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. ... Self-supervised pretraining on speech data has made substantial progress. ... Thus, for ASR, speech representation extraction is an important module that condenses the information of the speech signal. ...
arXiv:2110.04590v1 fatcat:p4peb5urpzaxja62cgalfjnyuy

Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition [article]

Shaoshi Ling, Julian Salazar, Katrin Kirchhoff
2019 arXiv   pre-print
These representations come from the frame-wise intermediate representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken utterances. ... For speech, we propose contextual frame representations that capture phonetic information at the acoustic frame level and can be used for utterance-level language, speaker, and speech recognition. ... These can be adapted to downstream tasks, namely language and speaker recognition, in an utterance-level, end-to-end manner. ... A hedged pooling sketch follows this entry.
arXiv:1907.00457v1 fatcat:cqbrqf6bxfgr3frxyjko2rpkem
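The snippet above describes taking frame-wise intermediate representations from an end-to-end ASR encoder and adapting them to utterance-level language and speaker recognition. Below is a minimal, hypothetical PyTorch sketch of that general recipe; the mean pooling over time and the linear classification head are illustrative assumptions, not the architecture from the paper.

```python
# Hypothetical sketch: turn frame-level ASR encoder representations into an
# utterance-level embedding and classify it (e.g., language or speaker ID).
import torch
import torch.nn as nn

batch, frames, dim = 8, 200, 512
frame_reps = torch.randn(batch, frames, dim)   # frame-wise encoder outputs

utt_embedding = frame_reps.mean(dim=1)         # assumed: mean pooling over time

num_classes = 10                               # assumed: e.g., 10 target languages
classifier = nn.Linear(dim, num_classes)
logits = classifier(utt_embedding)             # (batch, num_classes)
```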

Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition [article]

Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda
2022 arXiv   pre-print
We investigate the performance of self-supervised pretraining frameworks on pathological speech datasets used for automatic speech recognition (ASR). ... Modern end-to-end models require thousands of hours of data to train well, but only a small number of pathological speech datasets are publicly available. ... With the successful use of previous pretraining methods, we explore the effectiveness of a new pretraining framework called self-supervised ...
arXiv:2203.15431v3 fatcat:53jgxgugfbg4bpudeh2zwtr7qi

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [article]

Abhinav Shukla, Stavros Petridis, Maja Pantic
2020 arXiv   pre-print
However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. ... Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations. ... However, we wanted our audio encoder to directly operate on the raw audio waveform and perform end-to-end self-supervised representation learning without starting from an intermediate feature like MFCCs ...
arXiv:2007.04134v1 fatcat:6vo2bcbyi5fkhcp2tmzpoq7rsa

Injecting Text in Self-Supervised Speech Pretraining [article]

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno
2021 arXiv   pre-print
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. ... The proposed method also serves as an effective strategy to compensate for the lack of transcribed speech, effectively matching the performance of 5000 hours of transcribed speech with just 100 hours of ... However, self-supervised pretraining needs to discover effective representations for speech recognition using only internally consistent representations. ...
arXiv:2108.12226v1 fatcat:mc55fw4pt5febcfyksuvm46hcq

Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation [article]

Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao
2021 arXiv   pre-print
Predicting the altered acoustic frames is an effective way of self-supervised learning for speech representation. However, it is challenging to prevent the pretrained model from overfitting. ... In this paper, we propose to introduce two dropout regularization methods into the pretraining of the transformer encoder: (1) attention dropout, (2) layer dropout. ... Self-Supervised Learning (SSL) is an approach of learning a data representation from unlabeled data and retraining the model on labeled data [9]. ... A minimal sketch of both dropout variants follows this entry.
arXiv:2107.04227v1 fatcat:kn4st2gdpnbhxkptmnrw7v3joy
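The snippet above names two concrete regularizers, attention dropout and layer dropout. The sketch below shows one plausible way to wire both into a generic Transformer encoder layer in PyTorch; the layer sizes, the LayerDrop-style skipping rule, and the masked-frame pretraining context are assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: a Transformer encoder layer with (1) attention dropout
# on the attention weights and (2) layer dropout that stochastically skips the
# whole layer during training.
import torch
import torch.nn as nn


class EncoderLayerWithDropout(nn.Module):
    def __init__(self, d_model=768, n_heads=12, attn_dropout=0.1, layer_dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=attn_dropout, batch_first=True)  # (1)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.layer_dropout = layer_dropout  # (2) probability of skipping this layer

    def forward(self, x):
        if self.training and torch.rand(1).item() < self.layer_dropout:
            return x  # (2) skip the layer entirely for this batch
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x


# Usage: acoustic frames through a stack of such layers.
encoder = nn.ModuleList(EncoderLayerWithDropout() for _ in range(12))
frames = torch.randn(4, 100, 768)  # (batch, time, feature)
for layer in encoder:
    frames = layer(frames)
```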

Visually Guided Self Supervised Learning of Speech Representations [article]

Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic
2020 arXiv   pre-print
This demonstrates the potential of visual supervision for learning audio representations as a novel form of self-supervised learning that has not been explored in the past. ... Self-supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. ... We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research and Amazon Web Services for providing computational resources for ...
arXiv:2001.04316v2 fatcat:owtnx4mgw5bbtgjh7sd2wlaji4

Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition [article]

Lodagala V S V Durga Prasad, Ashish Seth, Sreyan Ghosh, S. Umesh
2022 arXiv   pre-print
Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition (ASR) systems in low-resource settings. ... Their performance improves with an increase in the similarity and volume of pre-training data. ... On the other hand, self-supervised speech representation learning is the process of learning representations from raw speech signals for downstream speech tasks like Automatic Speech Recognition (ASR) and Speaker Verification ...
arXiv:2203.16973v2 fatcat:r2fny47jijf5joo42drqf3rjem

Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [article]

Cheng Yi, Shiyu Zhou, Bo Xu
2021 arXiv   pre-print
End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). ... Self-supervised acoustic pre-training has already shown impressive ASR performance, while transcriptions remain inadequate for language modeling in end-to-end models. ... For automatic speech recognition (ASR) tasks, self-supervised acoustic pre-training has achieved impressive recognition accuracy with as little as 10 hours of transcribed speech [1] [2] [3] [4] [5], demonstrating ...
arXiv:2101.06699v2 fatcat:b73t4uxicrafjkbb6f3sgigmiu

Explore wav2vec 2.0 for Mispronunciation Detection

Xiaoshuo Xu, Yueteng Kang, Songjun Cao, Binghuai Lin, Long Ma
2021 Conference of the International Speech Communication Association  
Unlike existing methods that use a speech recognition corpus to train models, we exploit unlabeled data and utilize a self-supervised learning technique, Wav2vec 2.0, for pretraining. ... This paper presents an initial attempt to use self-supervised learning for Mispronunciation Detection. ... Observing that self-supervised learning leads to results comparable to ASR pretraining, we explore how this technique contributes to this task. ... A hedged feature-extraction sketch follows this entry.
doi:10.21437/interspeech.2021-777 dblp:conf/interspeech/XuKCLM21 fatcat:xvvyldja5jg77nla2d4n3xidcm
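For readers who want to try the representation side of this idea, the sketch below extracts wav2vec 2.0 frame representations with the HuggingFace transformers library and attaches a hypothetical per-frame classification head; the checkpoint name and the binary head are assumptions for illustration, not the system described in the paper.

```python
# Hypothetical sketch: wav2vec 2.0 features for a frame-level classifier.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)  # placeholder: 1 second of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, frames, 768)

# Assumed binary head: correct vs. mispronounced, per frame.
head = torch.nn.Linear(hidden.size(-1), 2)
frame_logits = head(hidden)
```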

Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model [article]

Keqi Deng, Songjun Cao, Yike Zhang, Long Ma
2021 arXiv   pre-print
Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). ... However, it is still hard for the dominant sequence-to-sequence (S2S) E2E model to fully utilize self-supervised pre-training methods, because its decoder is conditioned on the acoustic representation and thus cannot ... Self-supervised pretraining has recently gained success in E2E ASR tasks. Wav2vec [9] learns representations of raw speech through a self-supervised context-prediction task. ...
arXiv:2112.07254v1 fatcat:urxpigftdffxrcidch7unenpu4

Joint Unsupervised and Supervised Training for Multilingual ASR [article]

Junwen Bai, Bo Li, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath
2021 arXiv   pre-print
Self-supervised training has shown promising gains in pretraining models and facilitating downstream finetuning for speech recognition, such as multilingual ASR. ... In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling losses. ... This work proposes a novel uniform multilingual ASR system for end-to-end speech recognition across multiple languages. ...
arXiv:2111.08137v1 fatcat:xd2nhyl6ozed7acgc6lm2njx24

Leveraging Multimodal Out-of-Domain Information to Improve Low-Resource Speech Translation

Wenbo Zhu, Hao Jin, WeiChang Yeh, Jianwen Chen, Lufeng Luo, Jinhai Wang, Aiyuan Li, Jian Su
2021 Security and Communication Networks  
First, we propose a low-resource ST framework to reconstruct large-scale label-free audio by combining self-supervised learning. ... Speech translation (ST) is a bimodal conversion task from source speech to the target text. ... And we analyze the effect of self-supervised learning on speech translation. (2) We utilize decoder fusion techniques to fine-tune the overall model by introducing an out-of-domain unlabeled text pretraining ...
doi:10.1155/2021/9915130 fatcat:a5yl5hi7b5cnnnbafkbn633ie4

Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [article]

Abhinav Shukla, Stavros Petridis, Maja Pantic
2020 arXiv   pre-print
This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes an audio-only self-supervision approach for speech representation ... Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative audio representations for ... For speech recognition on SPC, L1 is again the best self-supervised method with an accuracy of 0. ...
arXiv:2005.01400v2 fatcat:wzhbwisw5rgubmryky27nrbxry

Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning [article]

Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li
2021 arXiv   pre-print
Self-supervised visual pretraining has shown significant progress recently. ... The input feature representations for speech and visual tasks are both continuous, so it is natural to consider applying a similar objective to speech representation learning. ... PASE [17] tackled multiple self-supervised tasks jointly, using an ensemble of neural networks that cooperate to discover good speech representations. ... A hedged sketch of such a combined objective follows this entry.
arXiv:2010.13991v2 fatcat:uihbbdrghzghtahnavvgh33tcq
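As a rough illustration of the combination named in the title, the sketch below pairs a SimCLR-style contrastive (NT-Xent) loss over two augmented views with an L1 reconstruction term; the augmentation pipeline, encoder, and loss weighting are assumptions rather than the paper's exact recipe.

```python
# Hypothetical sketch: contrastive + reconstruction objective.
import torch
import torch.nn.functional as F


def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style loss over a batch of paired views z1[i] <-> z2[i]."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))        # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def joint_loss(z1, z2, reconstruction, target_spec, alpha=0.5):
    """Assumed weighting: contrastive term plus alpha * L1 reconstruction."""
    return nt_xent(z1, z2) + alpha * F.l1_loss(reconstruction, target_spec)


# Usage with placeholder tensors.
z1, z2 = torch.randn(16, 256), torch.randn(16, 256)        # projected views
recon, spec = torch.randn(16, 80, 100), torch.randn(16, 80, 100)
loss = joint_loss(z1, z2, recon, spec)
```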
Showing results 1 — 15 out of 2,417 results