fairseq S2T: Fast Speech-to-Text Modeling with fairseq
[article]
2020
arXiv
pre-print
We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation. ...
We implement state-of-the-art RNN-based as well as Transformer-based models and open-source detailed training recipes. ...
Tied multitask learning for neural speech translation. ...
arXiv:2010.05171v1
fatcat:tcdojkewtjfyhghjsbogbpgopq
Investigating Self-Supervised Pre-Training for End-to-End Speech Translation
2020
Interspeech 2020
Self-supervised learning from raw speech has been proven beneficial to improve automatic speech recognition (ASR). ...
Index Terms: self-supervised learning from speech, automatic speech translation, end-to-end models, low resource settings. ...
Fine-tuning and normalization of self-supervised representations also improve the soft ... (Figure 3: Soft alignments between source speech features and target text for the sentence "A outra pessoa perde.")
doi:10.21437/interspeech.2020-1835
dblp:conf/interspeech/NguyenBTEB20
fatcat:c7v3pm4uqrd4nfhdfpzwz3ipdm
Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR
2020
Interspeech 2020
On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large-scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised ...
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. ...
Self-labeled Speech Recognition: Self-labeling is one of the most effective methods of semi-supervised learning for speech recognition [20, 9, 5], where a teacher model with limited supervision extends ...
doi:10.21437/interspeech.2020-1917
dblp:conf/interspeech/SinghMXEGLFSZM20
fatcat:ujyynrud2vhk5g7geunmbay7dq
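The self-labeling (teacher-student) recipe described in the snippet above can be sketched in a few lines; this is a toy illustration in which strings stand in for audio and a lookup table stands in for an acoustic model — all names are hypothetical, not the paper's pipeline:

```python
# Minimal pseudo-labeling (teacher-student) sketch for semi-supervised ASR.
# Toy stand-ins: "utterances" are strings, the "model" is a memorized table.

def train(labeled):
    """Toy 'training': memorize exact utterance -> transcript pairs."""
    return dict(labeled)

def predict(model, utterance):
    """Toy 'decoding': pick the transcript of the closest memorized
    utterance, measured by shared-word overlap."""
    def overlap(a, b):
        return len(set(a.split()) & set(b.split()))
    return max(model.items(), key=lambda kv: overlap(kv[0], utterance))[1]

def self_label(labeled, unlabeled, keep=lambda u, t: True):
    """One teacher-student round: the teacher (trained on limited
    supervision) pseudo-labels unlabeled audio; confident pseudo-labels
    are added to the training set for the student."""
    teacher = train(labeled)
    pseudo = [(u, predict(teacher, u)) for u in unlabeled]
    kept = [(u, t) for u, t in pseudo if keep(u, t)]
    return train(labeled + kept)

labeled = [("hello world", "HELLO WORLD"), ("good morning", "GOOD MORNING")]
unlabeled = ["hello there", "good evening"]
student = self_label(labeled, unlabeled)
```

Real systems replace the lookup table with a neural acoustic model and the `keep` predicate with a confidence filter over decoder scores.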
Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition
2020
Interspeech 2020
This paper describes a simple and efficient pre-training method using a large number of external texts to enhance end-to-end automatic speech recognition (ASR). ...
One issue caused by data scarcity is that ASR performance is poor on out-of-domain tasks that differ from those covered by the speech-to-text paired data, since the mapping from the speech information ...
Our method can be regarded as self-supervised learning that defines the self-supervision task by utilizing the pronunciation dictionary. ...
doi:10.21437/interspeech.2020-1930
dblp:conf/interspeech/MasumuraMITTO20
fatcat:3bqseh2v4zbyrdtv2jebvf6vma
Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
[article]
2021
arXiv
pre-print
In addition, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. ...
In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. ...
In contrast, we focus on learning interactions between speech and text modalities for spoken question answering tasks, and also introduce a set of auxiliary tasks on top of the former self-supervised training ...
arXiv:2109.03381v1
fatcat:qdt3ufhby5ao7h2j6hj2nag4p4
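Contrastive representation learning over paired modalities, as in the entry above, is commonly built on an InfoNCE-style objective: each speech embedding should score highest against its own text embedding among all candidates in the batch. A minimal pure-Python sketch of such a loss (a generic formulation, not necessarily this paper's exact objective):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(speech_vecs, text_vecs, temperature=0.1):
    """InfoNCE-style contrastive loss over aligned speech/text embedding
    pairs: for each i, -log softmax of the positive pair's similarity
    among all text candidates. Lower is better; 0 is the floor."""
    loss = 0.0
    for i, s in enumerate(speech_vecs):
        logits = [dot(s, t) / temperature for t in text_vecs]
        m = max(logits)  # stabilize log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(speech_vecs)

aligned = [[1.0, 0.0], [0.0, 1.0]]   # speech embeddings
matched = [[1.0, 0.0], [0.0, 1.0]]   # correctly paired text embeddings
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairing
loss_matched = info_nce(aligned, matched)
loss_shuffled = info_nce(aligned, shuffled)
```

Correctly paired embeddings yield a much lower loss than shuffled ones, which is the signal that pulls the two modalities into a common space.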
CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning
[article]
2020
arXiv
pre-print
In this work, we provide a multimodal machine learning framework for speech representation learning by exploiting the correlations between the two modalities, namely speech and its corresponding text translation ...
This time-consuming and painstaking process could benefit from machine learning. ...
So far, the self-supervised learning approaches we discussed use only speech data. ...
arXiv:2006.02814v2
fatcat:sz32yptl3beeffpkqona57mywi
Unsupervised Speech Recognition
[article]
2022
arXiv
pre-print
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. ...
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data, which limits this technology to a small fraction of the languages spoken around the globe ...
the setup of Chen et al. (2019), Marc'Aurelio Ranzato for general helpful discussions, and Ruth Kipng'eno, Ruth Ndila Ndeto as well as Mark Mutitu for error analysis of our Swahili model. ...
arXiv:2105.11084v3
fatcat:tx63si7jpfdpxowaw7mkyg3vhi
Textless Speech-to-Speech Translation on Real Data
[article]
2022
arXiv
pre-print
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. ...
The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker ...
Acknowledgements The authors would like to thank Adam Polyak and Felix Kreuk for initial discussions on accent normalization. ...
arXiv:2112.08352v2
fatcat:clu34adr7je45p5rwu5zhno7ci
LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
[article]
2021
arXiv
pre-print
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams ...
Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English ...
Self-supervised Learning: Lip reading, also known as visual speech recognition, is the task of predicting text transcriptions from silent videos, such as ... As a form of unsupervised learning, self-supervised ...
arXiv:2112.04748v1
fatcat:nkecrtplr5h3laiwpsd6gxjnqu
Defense for Black-Box Attacks on Anti-Spoofing Models by Self-Supervised Learning
2020
Interspeech 2020
High-performance anti-spoofing models for automatic speaker verification (ASV) have been widely used to protect ASV by identifying and filtering spoofing audio that is deliberately generated by text-to-speech ...
In this work, we explore the robustness of self-supervised learned high-level representations by using them in the defense against adversarial attacks. ...
Through pre-training models on speech, self-supervised learning based models are able to leverage the knowledge of unlabeled speech, then the performance of downstream speech and language processing (SLP ...
doi:10.21437/interspeech.2020-2026
dblp:conf/interspeech/WuLL20
fatcat:ovvrme7li5ahvhevaebxyei7xm
Exploring Deep Transfer Learning Techniques for Alzheimer's Dementia Detection
2021
Frontiers in Computer Science
Performance gains of the text models may be due to the high similarity between the pre-training text dataset and the CTP text dataset. ...
Examination of speech datasets for detecting dementia, collected via various speech tasks, has revealed links between speech and cognitive abilities. ...
Speech BERT: Speech BERT, similar to Text BERT, employs a self-supervised learning approach. The pre-training process employs the MAM task. ...
doi:10.3389/fcomp.2021.624683
pmid:34046588
pmcid:PMC8153512
fatcat:7s657y4q2jaf5a6absc2sjxdhm
A Survey on Machine Learning Techniques for Auto Labeling of Video, Audio, and Text Data
[article]
2021
arXiv
pre-print
In this survey paper, we provide a review of previous techniques that focus on optimized data annotation and labeling for video, audio, and text data. ...
Data labeling has always been one of the most important tasks in machine learning. However, labeling large amounts of data increases the monetary cost of machine learning. ...
Semi-supervised and Supervised Learning Approaches: In [82], exploring the temporal consistency of semantic concepts in video sequences enhances two semi-supervised learning algorithms, which are self-training ...
arXiv:2109.03784v1
fatcat:uu55zfmtajcvdjekxeaue76izy
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
[article]
2020
arXiv
pre-print
Recently, self-supervised multimodal methods that combine vision and language were proposed to learn multimodal representations without annotation. ...
: Video Question Answering and Text-To-Video Retrieval. ...
This scenario is very common in the case of self-supervised multimodal learning and even when learning from unlabeled instructional videos. ...
arXiv:2003.03186v3
fatcat:p576x72txrhuzgesvvgs7gbsui
Self-supervised discriminative training of statistical language models
2009
2009 IEEE Workshop on Automatic Speech Recognition & Understanding
A novel self-supervised discriminative training method for estimating language models for automatic speech recognition (ASR) is proposed. ...
Specifically, model parameters are estimated to maximize the likelihood ratio between words w in the text corpus and w's cohorts in the test speech, i.e. other words that w competes with in the test lattices ...
ACKNOWLEDGMENT The authors are grateful to Denis Filimonov and Mary Harper for providing the n-best lists and for pre-processing the language model training text used in the experiments reported here. ...
doi:10.1109/asru.2009.5373401
dblp:conf/asru/XuKK09
fatcat:wxf3dqkoc5cbtl7xcbz4dyd3dm
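The likelihood-ratio criterion described in the snippet above — score each corpus word w against its cohorts, the competing words in the test lattices — can be illustrated with a toy log-linear unigram model. This is an illustrative sketch with hypothetical helper names, not the paper's estimator:

```python
import math

def likelihood_ratio_objective(theta, pairs):
    """Toy discriminative objective: sum over (word, cohorts) pairs of
    log [ p(word) / sum_c p(cohort_c) ], where p is proportional to
    exp(theta[token]) under a log-linear unigram 'language model'."""
    def score(tok):
        return math.exp(theta.get(tok, 0.0))
    return sum(
        math.log(score(word) / sum(score(c) for c in cohorts))
        for word, cohorts in pairs
    )

# Raising theta for a corpus word relative to its lattice cohorts
# increases the objective, which is what training exploits.
pairs = [("cat", ["bat", "hat"])]
flat = likelihood_ratio_objective({}, pairs)
boosted = likelihood_ratio_objective({"cat": 1.0}, pairs)
```

With flat parameters the ratio is 1/2 (one word against two equally scored cohorts); boosting the corpus word's weight raises the objective, mirroring how the discriminative update favors words from the text corpus over their lattice competitors.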
Adapting n-gram maximum entropy language models with conditional entropy regularization
2011
2011 IEEE Workshop on Automatic Speech Recognition & Understanding
Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text). ...
Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. ...
ACKNOWLEDGEMENTS This research was partially supported by National Science Foundation Grant No. 0963898 and by the JHU Human Language Technology Center of Excellence. ...
doi:10.1109/asru.2011.6163934
dblp:conf/asru/RastrowDK11a
fatcat:wzjcpqn2gfhxllhgahcd64fts4
Showing results 1 — 15 out of 25,351 results