
Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition [article]

Yonatan Belinkov, Ahmed Ali, James Glass
2020 arXiv   pre-print
End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions.  ...  In this paper, we analyze the learned internal representations in an end-to-end ASR model.  ...  Conclusion In this work, we analyzed an E2E speech recognition model in terms of phonetic and graphemic representations.  ... 
arXiv:1907.04224v2 fatcat:y7d6bt5awnbnjamyabt4awynd4

Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition

Yonatan Belinkov, Ahmed Ali, James Glass
2019 Interspeech 2019  
End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions.  ...  In this paper, we analyze the learned internal representations in an end-to-end ASR model.  ...  Conclusion In this work, we analyzed an E2E speech recognition model in terms of phonetic and graphemic representations.  ... 
doi:10.21437/interspeech.2019-2599 dblp:conf/interspeech/BelinkovAG19 fatcat:hhh7n3swyjg6jcu3rbeo333dzy

From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition [article]

Duc Le, Xiaohui Zhang, Weiyi Zheng, Christian Fügen, Geoffrey Zweig, Michael L. Seltzer
2019 arXiv   pre-print
There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to get competitive performance  ...  Our work provides an alternative for end-to-end ASR and establishes that hybrid systems can be improved by dropping the reliance on phonetic knowledge.  ...  INTRODUCTION In the past decade, neural network acoustic models have become a staple in automatic speech recognition (ASR).  ... 
arXiv:1910.01493v2 fatcat:vsaongotkjfjlid7mlp7zl7snu

Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
2020 Interspeech 2020  
This paper describes a simple and efficient pre-training method using a large number of external texts to enhance end-to-end automatic speech recognition (ASR).  ...  Generally, it is essential to prepare speech-to-text paired data to construct end-to-end ASR models, but it is difficult to collect a large amount of such data in practice.  ...  Introduction End-to-end automatic speech recognition (ASR) systems that directly convert input speech into text are among the most attractive technologies in speech-related fields.  ... 
doi:10.21437/interspeech.2020-1930 dblp:conf/interspeech/MasumuraMITTO20 fatcat:3bqseh2v4zbyrdtv2jebvf6vma

An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems [article]

Antoine Perquin, Erica Cooper, Junichi Yamagishi
2021 arXiv   pre-print
Thanks to this property, we show that grapheme embeddings learned by Tacotron models can be useful for tasks such as grapheme-to-phoneme conversion and control of the pronunciation in synthetic speech.  ...  End-to-end models, particularly Tacotron-based ones, are currently a popular solution for text-to-speech synthesis.  ...  training set. Table 2: Phoneme Error Rate [%] of all models after automatic speech recognition and phonetization of the transcripts.  ... 
arXiv:2010.10694v2 fatcat:5ctmlwlmr5bxbhhflhrvourpke

Phonetic pronunciations for arabic speech-to-text systems

F. Diehl, M.J.F. Gales, M. Tomalin, P.C. Woodland
2008 Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing  
First, the use of single pronunciation acoustic models in the context of Arabic large vocabulary Automatic Speech Recognition (ASR) is investigated.  ...  Here, pronunciations are automatically found by first deriving grapheme-to-phone rules, and associated rule probabilities.  ...  The experimental evaluation investigates the use of graphemic, phonetic MPron and phonetic SPron models with both 350k and 260k word vocabularies for recognition of both Arabic BN and broadcast conversation  ... 
doi:10.1109/icassp.2008.4517924 dblp:conf/icassp/DiehlGTW08 fatcat:vsgzbm32b5eobb7cfennrshun4

Lithuanian Speech Recognition Using Purely Phonetic Deep Learning

Laurynas Pipiras, Rytis Maskeliūnas, Robertas Damaševičius
2019 Computers  
Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field. A large majority of research in this area focuses on widely spoken languages such as English.  ...  The problems of automatic Lithuanian speech recognition have attracted little attention so far.  ...  To test to what extent encoder-decoder models can be used to perform automatic Lithuanian speech recognition, we decided to test selected models in isolated speech and long phrase recognition tasks.  ... 
doi:10.3390/computers8040076 fatcat:ugkfjr4xwfczxnwbp6wikso6c4

Analysis of Long-distance Word Dependencies and Pronunciation Variability at Conversational Russian Speech Recognition

Irina S. Kipyatkova, Alexey Karpov, Vasilisa Verkhodanova, Milos Zelezný
2012 Conference on Computer Science and Information Systems  
The key issues of conversational Russian speech processing at phonemic and language model levels are considered in the work.  ...  The word error rate of the developed speech recognition system was 33% for the collected conversational speech corpus.  ...  INTRODUCTION The majority of state-of-the-art automatic speech recognition systems can efficiently analyze isolated pronounced words or read phrases.  ... 
dblp:conf/fedcsis/KipyatkovaKVZ12 fatcat:fb5udyc2mjee7j3arytzo7eqcq

Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings

Eshete Derb Emiru, Shengwu Xiong, Yaxing Li, Awet Fesseha, Moussa Diallo
2021 Information  
This paper proposes a hybrid connectionist temporal classification with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition system (AASR) using its  ...  This algorithm helps to insert the epenthetic vowel እ[ɨ], which is not included in our Grapheme-to-Phoneme (G2P) conversion algorithm developed using consonant–vowel (CV) representations of Amharic graphemes  ...  Phoneme-Based End-to-End Models Phoneme-based end-to-end ASR models are vital to improve the speech recognition system by addressing the variations of graphemes for similar pronunciation representations  ... 
doi:10.3390/info12020062 fatcat:bsez5mc3ybbu3hdgn2yxvqnrgi

Significance of early tagged contextual graphemes in grapheme based speech synthesis and recognition systems

Gopala Krishna Anumanchipalli, Kishore Prahallad, Alan W. Black
2008 Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing  
We show that the early tagged contextual graphemes play a significant role in improving the performance of grapheme based speech synthesis and speech recognition systems.  ...  In this paper we present our argument that context information could be used in early stages, i.e., during the definition of the mapping of words into sequences of graphemes.  ...  Attempts in the use of grapheme as a modeling unit for speech recognition have been reported in phonetic or partially phonetic languages.  ... 
doi:10.1109/icassp.2008.4518692 dblp:conf/icassp/AnumanchipalliPB08 fatcat:3zytopg4cfb3xf5g5wuulu75qm

K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of Graphemes and Syllables [article]

Jounghee Kim, Pilsung Kang
2021 arXiv   pre-print
Wav2vec 2.0 is an end-to-end framework of self-supervised learning for speech representation that is successful in automatic speech recognition (ASR), but most of the work on the topic has been developed  ...  In this paper, we present K-Wav2Vec 2.0, which is a modified version of Wav2vec 2.0 designed for Korean automatic speech recognition by exploring and optimizing various factors of the original Wav2vec  ...  The Wav2vec 2.0 model is an end-to-end framework of self-supervised learning for automatic speech recognition (ASR), and it has recently been presented as an effective pre-training method to learn speech  ... 
arXiv:2110.05172v1 fatcat:62cmbc55mzblljklwtvwcgy2fe

Modeling of Pronunciation, Language and Nonverbal Units at Conversational Russian Speech Recognition

Irina S. Kipyatkova, Alexey Karpov, Vasilisa Verkhodanova, Milos Zelezný
2013 International Journal of Computer Science and Applications  
The main problems of a conversational Russian speech recognition system development are variability of pronunciation, free word-order in sentences and presence of speech disfluencies.  ...  The proposed methods of pronunciation variability modeling and syntactic-statistical language model creation were realized in the software complex for Russian speech recognition.  ...  ) and by the grant of the President of Russia (project No.  ... 
dblp:journals/ijcsa/KipyatkovaKVZ13 fatcat:crpbrybwbvdvlmgxw6euhalrlu

Large vocabulary Russian speech recognition using syntactico-statistical language modeling

Alexey Karpov, Konstantin Markov, Irina Kipyatkova, Daria Vazhenina, Andrey Ronzhin
2014 Speech Communication  
In this paper, we describe our efforts to build an automatic speech recognition (ASR) system for the Russian language with a large vocabulary.  ...  Speech is the most natural way of human communication, and in order to achieve convenient and efficient human-computer interaction, implementation of state-of-the-art spoken language technology is necessary  ...  Acknowledgements This research is supported by the Ministry of Education and Science of Russia (contract No. 07.514.11.4139), by the grant of the President of Russia (project No.  ... 
doi:10.1016/j.specom.2013.07.004 fatcat:hq2vkvwdlzgqlhyi44duyh44hq

Sinhala G2P Conversion for Speech Processing

Thilini Nadungodage, Chamila Liyanage, Amathri Prerera, Randil Pushpananda, Ruvan Weerasinghe
2018 The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages  
Grapheme-to-phoneme (G2P) conversion plays an important role in speech processing applications and other fields of computational linguistics.  ...  Sinhala requires grapheme-to-phoneme conversion for speech processing because the Sinhala writing system does not always reflect actual pronunciation.  ...  Acknowledgements The authors of this paper would like to thank the Language Technology Research Laboratory -University of Colombo School of Computing for the support given to make this work a success.  ... 
doi:10.21437/sltu.2018-24 dblp:conf/sltu/NadungodageLPPW18 fatcat:d6geyde6m5gnvpg7e4bmaxsqlu
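Several entries above center on grapheme-to-phoneme (G2P) conversion. As a rough illustration of the rule-based approach such systems build on, here is a minimal greedy longest-match converter. The rule table is an invented toy example for demonstration only; it is not the Sinhala (or any other) rule set from the papers listed.

```python
# Minimal greedy longest-match grapheme-to-phoneme (G2P) converter.
# Illustrative sketch only: the toy rule table below is invented and
# does NOT reproduce the G2P rules of any paper in this listing.

RULES = {
    "sh": "ʃ",   # digraph: must be tried before the single letters "s"/"h"
    "s": "s",
    "h": "h",
    "i": "i",
    "p": "p",
}

def g2p(word, rules):
    """Convert a grapheme string to a phoneme list, longest match first."""
    phones = []
    max_len = max(len(g) for g in rules)
    i = 0
    while i < len(word):
        # Try the longest grapheme chunk that still fits in the word.
        for n in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in rules:
                phones.append(rules[chunk])
                i += n
                break
        else:
            phones.append(word[i])  # pass unknown graphemes through unchanged
            i += 1
    return phones

print(g2p("ship", RULES))  # greedy match picks "sh" -> "ʃ", not "s" + "h"
```

Real systems replace the hand-written table with rules or joint-sequence models learned from a pronunciation lexicon, but the longest-match lookup above is the core mechanic that context-dependent rule sets refine.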

Zero-shot keyword spotting for visual speech recognition in-the-wild [article]

Themos Stafylakis, Georgios Tzimiropoulos
2018 arXiv   pre-print
We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), while it drastically improves over other recently proposed ASR-free KWS methods.  ...  To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence  ...  Introduction This paper addresses the problem of visual-only Automatic Speech Recognition (ASR), i.e., the problem of recognizing speech from video information only, in particular, from analyzing the spatiotemporal  ... 
arXiv:1807.08469v2 fatcat:4sjqd62esfgyvf4etnwm5osxda
Showing results 1 — 15 out of 1,158 results