25,351 Hits in 5.0 sec

fairseq S2T: Fast Speech-to-Text Modeling with fairseq [article]

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino
2020 arXiv   pre-print
We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation.  ...  We implement state-of-the-art RNN-based as well as Transformer-based models and open-source detailed training recipes.  ...  Tied multitask learning for neural speech translation.  ... 
arXiv:2010.05171v1 fatcat:tcdojkewtjfyhghjsbogbpgopq
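
To make the modeling concrete, here is a minimal, illustrative PyTorch sketch of the kind of Transformer-based S2T model fairseq S2T provides: log-mel frames, a strided convolutional subsampler, then an encoder-decoder Transformer. This is not fairseq's API; the module names and sizes are hypothetical, and positional encodings are omitted for brevity.

    import torch
    import torch.nn as nn

    class TinyS2TTransformer(nn.Module):
        def __init__(self, n_mels=80, d_model=256, vocab_size=1000):
            super().__init__()
            # Strided conv front-end: 4x temporal downsampling of log-mel frames.
            self.subsample = nn.Sequential(
                nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
                nn.GELU(),
                nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
                nn.GELU(),
            )
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=4,
                num_encoder_layers=6, num_decoder_layers=3, batch_first=True)
            self.embed = nn.Embedding(vocab_size, d_model)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, mels, prev_tokens):
            # mels: (batch, time, n_mels); prev_tokens: (batch, tgt_len)
            src = self.subsample(mels.transpose(1, 2)).transpose(1, 2)
            tgt = self.embed(prev_tokens)
            causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
            dec = self.transformer(src, tgt, tgt_mask=causal)
            return self.out(dec)  # logits over the target text vocabulary

    logits = TinyS2TTransformer()(torch.randn(2, 300, 80),
                                  torch.randint(0, 1000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 1000])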

Investigating Self-Supervised Pre-Training for End-to-End Speech Translation

Ha Nguyen, Fethi Bougares, N. Tomashenko, Yannick Estève, Laurent Besacier
2020 Interspeech 2020  
Self-supervised learning from raw speech has been proven beneficial for improving automatic speech recognition (ASR).  ...  Index Terms: self-supervised learning from speech, automatic speech translation, end-to-end models, low-resource settings.  ...  Fine-tuning and normalization of self-supervised representations also improve the soft  ...  (Figure 3: soft alignments between source speech features and target text for the sentence "A outra pessoa perde.")
doi:10.21437/interspeech.2020-1835 dblp:conf/interspeech/NguyenBTEB20 fatcat:c7v3pm4uqrd4nfhdfpzwz3ipdm
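
As an illustration of feeding self-supervised speech representations into a translation model, the sketch below extracts features with torchaudio's wav2vec 2.0 pipeline. This is an assumption for illustration (the paper investigates wav2vec-style pre-training; torchaudio is a stand-in), and the file name is hypothetical.

    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    model = bundle.get_model().eval()

    waveform, sr = torchaudio.load("utterance.wav")  # hypothetical file
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

    with torch.inference_mode():
        features, _ = model.extract_features(waveform)
    # features is a list of per-layer tensors (batch, frames, 768); the last
    # layer can replace log-mel filterbanks as the ST encoder input.
    print(features[-1].shape)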

Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR

Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
2020 Interspeech 2020  
On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large-scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised  ...  Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems.  ...  Self-labeled Speech Recognition: Self-labeling is one of the most effective methods of semi-supervised learning for speech recognition [20, 9, 5], where a teacher model with limited supervision extends  ...
doi:10.21437/interspeech.2020-1917 dblp:conf/interspeech/SinghMXEGLFSZM20 fatcat:ujyynrud2vhk5g7geunmbay7dq
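
The self-labeling loop referred to above is easy to state in code. The sketch below is a generic teacher-student recipe, not the paper's system; the model objects and their fit/decode methods are hypothetical stand-ins.

    # Generic self-labeling (pseudo-labeling) sketch under assumed interfaces.
    def self_label(teacher, student, labeled, unlabeled, threshold=0.9):
        """Train a student on teacher-generated pseudo-labels."""
        teacher.fit(labeled)                    # limited supervision
        pseudo = []
        for utt in unlabeled:
            text, score = teacher.decode(utt)   # 1-best hypothesis + confidence
            if score >= threshold:              # filter low-confidence labels
                pseudo.append((utt, text))
        student.fit(labeled + pseudo)           # train on the union
        return student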

Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
2020 Interspeech 2020  
This paper describes a simple and efficient pre-training method that uses a large number of external texts to enhance end-to-end automatic speech recognition (ASR).  ...  One issue caused by data scarcity is that ASR performs poorly on out-of-domain tasks that differ from the speech-to-text paired data, since the mapping from the speech information  ...  Our method can be regarded as self-supervised learning that defines the self-supervision task by utilizing the pronunciation dictionary.  ...
doi:10.21437/interspeech.2020-1930 dblp:conf/interspeech/MasumuraMITTO20 fatcat:3bqseh2v4zbyrdtv2jebvf6vma
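
A hedged sketch of how phoneme-to-grapheme (P2G) pre-training pairs can be manufactured from external text with a pronunciation dictionary; the toy lexicon and tokenization below are assumptions, not the paper's pipeline.

    # Build (phoneme sequence, grapheme sequence) pairs from plain text
    # using a pronunciation dictionary; toy ARPAbet-style entries.
    lexicon = {"speech": ["S", "P", "IY", "CH"],
               "text": ["T", "EH", "K", "S", "T"]}

    def text_to_p2g_pair(sentence):
        phonemes, graphemes = [], []
        for word in sentence.lower().split():
            if word not in lexicon:
                continue  # a real system would back off to a G2P model
            phonemes.extend(lexicon[word])
            graphemes.append(word)
        # (input, target) for a seq2seq P2G model whose decoder is later
        # shared with the end-to-end ASR decoder
        return phonemes, " ".join(graphemes)

    print(text_to_p2g_pair("Speech text"))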

Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering [article]

Chenyu You, Nuo Chen, Yuexian Zou
2021 arXiv   pre-print
In addition, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks.  ...  In this paper, we propose novel training schemes for spoken question answering, with a self-supervised training stage and a contrastive representation learning stage.  ...  In contrast, we focus on learning interactions between the speech and text modalities for spoken question answering tasks, and also introduce a set of auxiliary tasks on top of the former self-supervised training  ...
arXiv:2109.03381v1 fatcat:qdt3ufhby5ao7h2j6hj2nag4p4

CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [article]

Sameer Khurana, Antoine Laurent, James Glass
2020 arXiv   pre-print
In this work, we provide a multimodal machine learning framework for speech representation learning by exploiting the correlations between two modalities, namely speech and its corresponding text translation  ...  This time-consuming and painstaking process could benefit from machine learning.  ...  So far, the self-supervised learning approaches we discussed use only speech data.  ...
arXiv:2006.02814v2 fatcat:sz32yptl3beeffpkqona57mywi
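
The generic objective behind this kind of speech-text contrastive framework is an InfoNCE-style loss over paired embeddings. The sketch below shows that generic loss; CSTNet's exact formulation may differ.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(speech_emb, text_emb, temperature=0.07):
        # speech_emb, text_emb: (batch, dim); row i of each is a matched pair
        s = F.normalize(speech_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = s @ t.T / temperature          # (batch, batch) similarities
        targets = torch.arange(s.size(0))       # positives on the diagonal
        # symmetric: match speech->text and text->speech
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))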

Unsupervised Speech Recognition [article]

Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli
2022 arXiv   pre-print
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.  ...  Despite rapid progress in the recent past, current speech recognition systems still require labeled training data, which limits this technology to a small fraction of the languages spoken around the globe  ...  the setup of Chen et al. (2019), Marc'Aurelio Ranzato for general helpful discussions, and Ruth Kipng'eno, Ruth Ndila Ndeto as well as Mark Mutitu for error analysis of our Swahili model.  ...
arXiv:2105.11084v3 fatcat:tx63si7jpfdpxowaw7mkyg3vhi
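
To illustrate the adversarial mapping at a toy scale: a generator turns segment representations into phoneme distributions while a discriminator tries to distinguish them from phonemized unpaired text. Everything below (dimensions, single-layer models) is a simplification, not the paper's actual architecture.

    import torch
    import torch.nn as nn

    n_phones, feat_dim = 40, 512
    generator = nn.Linear(feat_dim, n_phones)          # features -> phoneme logits
    discriminator = nn.Sequential(nn.Linear(n_phones, 128), nn.ReLU(),
                                  nn.Linear(128, 1))   # real-vs-fake score

    segments = torch.randn(32, feat_dim)               # pooled segment features
    fake = torch.softmax(generator(segments), dim=-1)  # predicted distributions
    real = nn.functional.one_hot(                      # phonemized unpaired text
        torch.randint(0, n_phones, (32,)), n_phones).float()

    bce = nn.BCEWithLogitsLoss()
    d_loss = bce(discriminator(real), torch.ones(32, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(32, 1))
    g_loss = bce(discriminator(fake), torch.ones(32, 1))  # fool the discriminator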

Textless Speech-to-Speech Translation on Real Data [article]

Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu
2022 arXiv   pre-print
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another and can be built without any text data.  ...  The key to our approach is a self-supervised, unit-based speech normalization technique, which fine-tunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker  ...  Acknowledgements The authors would like to thank Adam Polyak and Felix Kreuk for initial discussions on accent normalization.  ...
arXiv:2112.08352v2 fatcat:clu34adr7je45p5rwu5zhno7ci

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [article]

Leyuan Qu, Cornelius Weber, Stefan Wermter
2021 arXiv   pre-print
The aim of this work is to investigate the impact of cross-modal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams  ...  Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English  ...  Self-supervised Learning: Lip reading, also known as visual speech recognition, is the task of predicting text transcriptions from silent videos  ...  As a form of unsupervised learning, self-supervised  ...
arXiv:2112.04748v1 fatcat:nkecrtplr5h3laiwpsd6gxjnqu

Defense for Black-Box Attacks on Anti-Spoofing Models by Self-Supervised Learning

Haibin Wu, Andy T. Liu, Hung-yi Lee
2020 Interspeech 2020  
High-performance anti-spoofing models for automatic speaker verification (ASV) have been widely used to protect ASV by identifying and filtering spoofing audio that is deliberately generated by text-to-speech  ...  In this work, we explore the robustness of self-supervised learned high-level representations by using them in the defense against adversarial attacks.  ...  Through pre-training models on speech, self-supervised learning based models are able to leverage the knowledge of unlabeled speech; the performance of downstream speech and language processing (SLP  ...
doi:10.21437/interspeech.2020-2026 dblp:conf/interspeech/WuLL20 fatcat:ovvrme7li5ahvhevaebxyei7xm

Exploring Deep Transfer Learning Techniques for Alzheimer's Dementia Detection

Youxiang Zhu, Xiaohui Liang, John A. Batsis, Robert M. Roth
2021 Frontiers in Computer Science  
Performance gains of the text models may be due to the high similarity between the pre-training text dataset and the CTP text dataset.  ...  Examination of speech datasets for detecting dementia, collected via various speech tasks, has revealed links between speech and cognitive abilities.  ...  Speech BERT: Speech BERT, similar to Text BERT, employs a self-supervised learning approach; its pre-training uses the MAM task.  ...
doi:10.3389/fcomp.2021.624683 pmid:34046588 pmcid:PMC8153512 fatcat:7s657y4q2jaf5a6absc2sjxdhm
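
Assuming MAM here denotes masked acoustic modeling (masking spans of input frames and training the encoder to reconstruct them), a minimal sketch of building such a training batch follows; the span length and masking rate are assumptions.

    import torch

    def mam_batch(frames, mask_prob=0.15, span=7):
        # frames: (batch, time, n_mels) acoustic features
        masked = frames.clone()
        target_mask = torch.zeros(frames.shape[:2], dtype=torch.bool)
        for b in range(frames.size(0)):
            for t in range(0, frames.size(1) - span):
                if torch.rand(()) < mask_prob / span:
                    masked[b, t:t + span] = 0.0       # zero out a span
                    target_mask[b, t:t + span] = True
        return masked, target_mask

    # training step (sketch): reconstruct only the masked positions, e.g.
    # pred = encoder(masked)
    # loss = F.l1_loss(pred[target_mask], frames[target_mask])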

A Survey on Machine Learning Techniques for Auto Labeling of Video, Audio, and Text Data [article]

Shikun Zhang, Omid Jafari, Parth Nagarkar
2021 arXiv   pre-print
In this survey paper, we provide a review of previous techniques that focus on optimized data annotation and labeling for video, audio, and text data.  ...  Data labeling has always been one of the most important tasks in machine learning. However, labeling large amounts of data increases the monetary cost of machine learning.  ...  Semi-supervised and Supervised Learning Approaches: In [82], exploiting the temporal consistency of semantic concepts in video sequences enhances two semi-supervised learning algorithms, which are self-training  ...
arXiv:2109.03784v1 fatcat:uu55zfmtajcvdjekxeaue76izy

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning [article]

Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein
2020 arXiv   pre-print
Recently, self-supervised multimodal methods that combine vision and language have been proposed to learn multimodal representations without annotation.  ...  : Video Question Answering and Text-To-Video Retrieval.  ...  This scenario is very common in the case of self-supervised multimodal learning and even when learning from unlabeled instructional videos.  ...
arXiv:2003.03186v3 fatcat:p576x72txrhuzgesvvgs7gbsui

Self-supervised discriminative training of statistical language models

Puyang Xu, Damianos Karakos, Sanjeev Khudanpur
2009 2009 IEEE Workshop on Automatic Speech Recognition & Understanding  
A novel self-supervised discriminative training method for estimating language models for automatic speech recognition (ASR) is proposed.  ...  Specifically, model parameters are estimated to maximize the likelihood ratio between words w in the text corpus and w's cohorts in the test speech, i.e. other words that w competes with in the test lattices  ...  ACKNOWLEDGMENT The authors are grateful to Denis Filimonov and Mary Harper for providing the n-best lists and for pre-processing the language model training text used in the experiments reported here.  ... 
doi:10.1109/asru.2009.5373401 dblp:conf/asru/XuKK09 fatcat:wxf3dqkoc5cbtl7xcbz4dyd3dm
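
The stated objective can be written out symbolically. The formula below is a hedged reconstruction from the abstract alone (the notation is mine, not necessarily the paper's): with h the word history and C(w) the cohort set of w drawn from the test lattices,

    \max_{\theta} \sum_{(w,h)} \log
      \frac{p_{\theta}(w \mid h)}
           {\sum_{w' \in C(w) \cup \{w\}} p_{\theta}(w' \mid h)}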

Adapting n-gram maximum entropy language models with conditional entropy regularization

Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur
2011 2011 IEEE Workshop on Automatic Speech Recognition & Understanding  
Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text).  ...  Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition.  ...
doi:10.1109/asru.2011.6163934 dblp:conf/asru/RastrowDK11a fatcat:wzjcpqn2gfhxllhgahcd64fts4
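
In symbols, the adaptation objective the abstract describes might look like the following (a sketch in my own notation, not the paper's exact formula): maximize the likelihood of the labeled out-of-domain text W_out while penalizing the conditional entropy of word sequences W given the unlabeled in-domain speech audio A_in,

    \max_{\theta} \; \log p_{\theta}(W_{\text{out}})
      \;-\; \gamma \, H_{\theta}(W \mid A_{\text{in}})

where gamma trades off fit to the out-of-domain data against the model's confidence on in-domain speech.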
Showing results 1 — 15 out of 25,351 results