604 Hits in 5.3 sec

Meta-Learning for improving rare word recognition in end-to-end ASR [article]

Florian Lux, Ngoc Thang Vu
2021 arXiv   pre-print
of combining their outcomes into an end-to-end automatic speech recognition system to improve rare word recognition.  ...  We propose a new method of generating meaningful embeddings for speech, changes to four commonly used meta-learning approaches to enable them to perform keyword spotting in continuous signals, and an approach  ...  While end-to-end (E2E) [1] deep learning (DL) models have brought great improvements to the field of automatic speech recognition (ASR) in recent years and reduced word error rates (WER) on benchmark  ... 
arXiv:2102.12624v1 fatcat:25roojpldfe3paqfxdifelq4wi

Multi-task Language Modeling for Improving Speech Recognition of Rare Words [article]

Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju, Denis Filimonov, Ivan Bulyko
2021 arXiv   pre-print
Our best ASR system with a multi-task LM shows a 4.6% WERR (word error rate reduction) over the RNN Transducer-only ASR baseline on rare word recognition.  ...  In this paper, we propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.  ...  We find that the improvement in WER is more pronounced for rare words, likely due to improvements in recognition of slot content.  ... 
arXiv:2011.11715v4 fatcat:wwlt4dvw75hvnh6vhxvcw4lngm

Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-switching Speech Recognition [article]

Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng
2020 arXiv   pre-print
In this paper, we conduct a data selection analysis in building an English-Mandarin code-switching (CS) speech recognition (CSSR) system, which is aimed at a real CSSR contest in China.  ...  Then, to exploit monolingual data, we find data matching is crucial: Mandarin data closely matches the Mandarin part of the code-switching data, while English data does not.  ...  Acknowledgements The computational work for this paper is partially performed on the resources of the National Supercomputing Centre (NSCC), Singapore (https://www.nscc.sg). References  ... 
arXiv:2006.07094v2 fatcat:g5pql34bozdhxgaj4e76jphyp4

Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition

Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng
2020 Interspeech 2020  
In this paper, we conduct a data selection analysis in building an English-Mandarin code-switching (CS) speech recognition (CSSR) system, which is aimed at a real CSSR contest in China.  ...  The CSSR system can perform within-utterance code-switch recognition, but it still lags behind one trained on code-switching data. 1 Here, data selection simply means how to reasonably exploit  ...  This has been extensively studied under the end-to-end (E2E) ASR framework [23].  ... 
doi:10.21437/interspeech.2020-1582 dblp:conf/interspeech/ZhangXPHC20 fatcat:oqtng33rr5cczaaoz3qndphxpq

Context-Aware Dialog Re-Ranking for Task-Oriented Dialog Systems [article]

Junki Ohmura, Maxine Eskenazi
2018 arXiv   pre-print
By using neural word embedding-based models and handcrafted or logistic regression-based ensemble models, we have improved the performance of a recently proposed end-to-end task-oriented dialog system  ...  Furthermore, no previous studies have analyzed whether response ranking can improve the performance of existing dialog systems in real human-computer dialogs with speech recognition errors.  ...  However, NN is not effective for ASR-Task 6 since it is quite rare for exactly the same pair to be found in the training dialog.  ... 
arXiv:1811.11430v1 fatcat:dky5fm4bkfh5pcas25kfo3e63u

Mitigating the Impact of Speech Recognition Errors on Chatbot using Sequence-to-Sequence Model [article]

Pin-Jung Chen, I-Hung Hsu, Yi-Yao Huang, Hung-Yi Lee
2017 arXiv   pre-print
We apply a sequence-to-sequence model to mitigate the impact of speech recognition errors on open-domain end-to-end dialog generation.  ...  The method shows that the sequence-to-sequence model can learn that an ASR transcription and its original text share the same meaning, and thereby eliminate the speech recognition errors.  ...  While abundant work focusing on spoken language understanding has hastened ASR failure management in modular dialog systems, ASR error handling in end-to-end chatbots is rarely seen.  ... 
arXiv:1709.07862v2 fatcat:klvr5w4iynd5jazba4t2sm65ii

Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition [article]

Christian Huber, Juan Hussain, Sebastian Stüker, Alexander Waibel
2021 arXiv   pre-print
Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition (ASR).  ...  To alleviate this problem we supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.  ...  In order to solve this problem, in this paper we extend an end-to-end ASR system by a memory for words and phrases.  ... 
arXiv:2107.02268v1 fatcat:2afway63wjdtdevkbr3rxktc5m

Multimodal machine translation through visuals and speech

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann
2020 Machine Translation  
This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance  ...  These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language.  ...  We would also like to thank Maarit Koponen for her valuable feedback and her help in establishing our discussions of machine translation evaluation.  ... 
doi:10.1007/s10590-020-09250-0 fatcat:jod3ghcsnnbipotcqp6sme4lna

System combination and score normalization for spoken term detection

Jonathan Mamou, Jia Cui, Xiaodong Cui, Mark J. F. Gales, Brian Kingsbury, Kate Knill, Lidia Mangu, David Nolden, Michael Picheny, Bhuvana Ramabhadran, Ralf Schluter, Abhinav Sethy (+1 others)
2013 2013 IEEE International Conference on Acoustics, Speech and Signal Processing  
Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information.  ...  First, we present a score normalization methodology that improves keyword search performance by 20% on average.  ...  In other words, the ATWV metric emphasizes recall of rare terms.  ... 
doi:10.1109/icassp.2013.6639278 dblp:conf/icassp/MamouCCGKKMNPRSSW13 fatcat:3xtqd6xr75dktkbneeoait44de
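The abstract above does not spell out the normalization itself; one widely used variant in the spoken-term-detection literature is per-keyword sum-to-one normalization, which rescales each keyword's detection scores so that rare and frequent terms become comparable under recall-oriented metrics such as ATWV. A minimal illustrative sketch (not the paper's exact method; the function name and `gamma` parameter are assumptions):

```python
def sum_to_one(scores, gamma=1.0):
    """Sum-to-one normalization of detection scores for a single keyword.

    gamma > 1 sharpens the score distribution, gamma < 1 flattens it;
    gamma = 1 is plain normalization by the total score mass.
    """
    powered = [s ** gamma for s in scores]
    total = sum(powered)
    return [p / total for p in powered]

# Raw confidence scores for three detections of the same keyword:
normalized = sum_to_one([0.9, 0.6, 0.3])
```

After normalization the scores for each keyword sum to one, so a single global detection threshold behaves more consistently across keywords of very different frequencies.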

Adaptive Feature Selection for End-to-End Speech Translation [article]

Biao Zhang, Ivan Titov, Barry Haddow, Rico Sennrich
2020 arXiv   pre-print
Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features.  ...  In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST.  ...  Acknowledgments We would like to thank Shucong Zhang for his great support on building our ASR baselines. IT acknowledges support of the European Research Council (ERC Starting grant 678254) and the  ... 
arXiv:2010.08518v2 fatcat:27fziwfdsnfffjvt2yasmg7p6e

Recent Advances in End-to-End Automatic Speech Recognition [article]

Jinyu Li
2022 arXiv   pre-print
Recently, the speech community has been seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).  ...  While E2E models achieve state-of-the-art results on most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time.  ...  In such a case, E2E models have not learned to map the rare words' acoustic signals to words.  ... 
arXiv:2111.01690v2 fatcat:6pktwep34jdvjklw4gkri4yn4y

Hierarchical Multi-Stage Word-to-Grapheme Named Entity Corrector for Automatic Speech Recognition

Abhinav Garg, Ashutosh Gupta, Dhananjaya Gowda, Shatrughan Singh, Chanwoo Kim
2020 Interspeech 2020  
In this paper, we propose a hierarchical multi-stage word-to-grapheme Named Entity Correction (NEC) algorithm.  ...  We evaluate our solution on two different test sets from the call and music domains, for both server as well as on-device speech recognition configurations.  ...  However, the misrecognition of rarely occurring words such as named entities (NEs) is a well-known shortcoming of end-to-end models [13].  ... 
doi:10.21437/interspeech.2020-3174 dblp:conf/interspeech/GargGGSK20 fatcat:njufs2tp4fhjvmyd5li7fbfstu

Speech Retrieval [chapter]

Ciprian Chelba, Timothy J. Hazen, Bhuvana Ramabhadran, Murat Saraçlar
2011 Spoken Language Understanding  
The primary technical challenges of speech retrieval lie in the retrieval system's ability to deal with imperfect speech recognition technology that produces errorful output due to misrecognitions caused by inadequate statistical models or out-of-vocabulary words.  ...  The weights in the index transducer correspond to expected counts that are used for ranking. Spoken Document Ranking in the Presence of Text Meta-Data Spoken documents rarely contain only speech.  ... 
doi:10.1002/9781119992691.ch15 fatcat:o36ulm7kh5dxvhm6alb4yz3qvy

Recent Progress in the CUHK Dysarthric Speech Recognition System

Shansong Liu, Shoukang Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jianwei Yu, Xunying Liu, Helen M. Meng
2021 IEEE/ACM Transactions on Audio Speech and Language Processing  
Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date.  ...  This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech  ...  ACKNOWLEDGMENT We thank Disong Wang for sharing their cross-  ... 
doi:10.1109/taslp.2021.3091805 fatcat:7ss4ldio3rdprfjkhufor6fkvu

Visual features for context-aware speech recognition

Abhinav Gupta, Yajie Miao, Leonardo Neves, Florian Metze
2017 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
We achieve good improvements in both cases and compare and analyze the respective reductions in word error rate.  ...  In this paper, we extend our earlier work on adapting the acoustic model of a DNN-based speech recognition system to an RNN language model and show how both can be adapted to the objects and scenes that  ...  In the long term, this work should help to improve fully end-to-end "video-to-text" approaches, which generate image or video "summaries" based on multi-modal embeddings, and reference "captions" [35,  ... 
doi:10.1109/icassp.2017.7953112 dblp:conf/icassp/GuptaMNM17 fatcat:kg3whgbgv5aevmx6rrdneatymu
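Word error rate, the metric reported throughout these results, is the word-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch of the standard dynamic-programming computation (illustrative, not drawn from any of the papers above):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why relative WER reduction (WERR) is often the headline number in the abstracts above.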