Acoustic-to-Word Recognition with Sequence-to-Sequence Models [article]

Shruti Palaskar, Florian Metze
2018 arXiv   pre-print
Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon.  ...  We finally show that the Acoustic-to-Word model also learns to segment speech into words with a mean standard deviation of 3 frames as compared with human annotated forced-alignments for the Switchboard  ...  We also thank the CMU speech group for many useful discussions. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPUs used for this research.  ... 
arXiv:1807.09597v2 fatcat:5yiise7hp5fafgyc3fvy4tmwj4

An efficient search space representation for large vocabulary continuous speech recognition

Kris Demuynck, Jacques Duchateau, Dirk Van Compernolle, Patrick Wambacq
2000 Speech Communication  
In pursuance of better performance, current speech recognition systems tend to use more and more complicated models for both the acoustic and the language component.  ...  In this paper, we present a memory-efficient search topology that enables the use of such detailed acoustic and language models in a one pass time-synchronous recognition system.  ... 
doi:10.1016/s0167-6393(99)00030-8 fatcat:mcbe43giazf3rbjeliyxziwmty

Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition

Edvin Pakoci, Branislav Popović, Darko Pekar
2019 Computational Intelligence and Neuroscience  
recognition system, often a wrong word ending occurs, which is nevertheless counted as an error.  ...  This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems, even when, due to good acoustical matching, the correct lemma is predicted by the automatic speech  ...  Training Method: Acoustic Model. The used acoustic models were subsampled time-delay neural networks (TDNNs), which are trained using cross-entropy training within the so-called "chain" training method [  ... 
doi:10.1155/2019/5072918 fatcat:osie2v55hfczzl5p2a4ybls5ja

Large-vocabulary recognition

Christian Dugast
1995 Philips Journal of Research  
It is essential in this application that the user is free to speak as he or she usually does and should be free to use his or her own wording and formulation.  ...  This implies speech recognition for large and open vocabularies, free syntax, continuous speech.  ...  In German, a vocabulary of more than 100000 words is needed to have a good coverage of newspaper article dictation, and 25000 words are necessary for radiology reporting.  ... 
doi:10.1016/0165-5817(96)81585-3 fatcat:2ihc2prnavc3jdat54f7jdkuw4

A Survey on Audio Synthesis and Audio-Visual Multimodal Processing [article]

Zhaofeng Shi
2021 arXiv   pre-print
This review focuses on text to speech (TTS), music generation and some tasks that combine visual and acoustic information.  ...  LRS The LRS dataset [15] is a dataset for visual speech recognition, which consists of over 100000 natural sentences from British television.  ...  Acoustic models Nowadays, acoustic features are usually used as the intermediate features in TTS tasks. As a result, we focus on the research work on acoustic models in this section.  ... 
arXiv:2108.00443v1 fatcat:5xkj7lf7pfgpppvfqwynoqkqjm

N-Best Re-scoring Approaches for Mandarin Speech Recognition

Xinxin Li, Xuan Wang, Jian Guan
2014 International Journal of Hybrid Information Technology  
In this paper, we first explore two n-best re-scoring approaches for Mandarin speech recognition. Both re-scoring methods are used to choose the optimal word sequence from n-best lists.  ...  However, training text for acoustic model might be insufficient and inappropriate for the discriminative model.  ...  POS Models POS information can be used to improve the performance for many natural language processing tasks, such as word segmentation and named entity recognition [29].  ... 
doi:10.14257/ijhit.2014.7.2.26 fatcat:zotvkrfyhfaobcsggblms5kvse

Effective representations for leveraging language content in multimedia event detection

Shuang Wu, Xiaodan Zhuang, Pradeep Natarajan
2014 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
However, sporadic occurrence, content that is unrelated to the events of interest, and high error rates of current speech and text recognition systems on consumer domain video make it difficult to exploit  ...  First, we utilize likelihood weighted word lattices obtained from a Hidden Markov Model (HMM) based decoding engine to encode many alternate hypotheses, rather than relying on noisy single best hypotheses  ...  To alleviate the negative impact on event detection, we use word lattices or confusion networks [17] instead of the 1-best transcripts. This has been widely used for keyword spotting [18] .  ... 
doi:10.1109/icassp.2014.6854982 dblp:conf/icassp/WuZN14 fatcat:ovnu5vzeifdmpk2vp5rdpmdj6q

Speech-to-Speech Translation Services for the Olympic Games 2008 [chapter]

Sebastian Stüker, Chengqing Zong, Jürgen Reichert, Wenjie Cao, Muntsin Kolss, Guodong Xie, Kay Peterson, Peng Ding, Victoria Arranz, Jian Yu, Alex Waibel
2006 Lecture Notes in Computer Science  
One of the objectives of the program is the use of artificial intelligence technology to overcome language barriers during the games.  ...  Acknowledgements The authors would like to thank Raquel Tato and Marta Tolos for their help in the development of the Spanish recognition system, Dorcas Alexander for her contribution to the development  ...  As a preprocessing step, the Chinese part of the corpus was segmented into words using a segmenter derived from the LDC segmenter.  ... 
doi:10.1007/11965152_27 fatcat:n6aetlvsjzd43m6eh4oxekoyam

An Optimum Database for Isolated Word in Speech Recognition System

Syifaun Nafisah, Oyas Wahyunggoro, Lukito Edi Nugroho
2016 TELKOMNIKA (Telecommunication Computing Electronics and Control)  
Speech recognition system (ASR) is a technology that allows computers to receive input using spoken words.  ...  Mel-scale frequency cepstral coefficients (MFCCs) are used to extract the characteristics of the speech signal, and a backpropagation neural network in quantized vector is used to evaluate likelihood the maximum  ...  Another sample is the word 'Jakarta', which will be segmented into 'Ja', 'Kar' and 'Ta'. From the recording process, this study has 13230 final words, and after the segmentation step, from 13230 final words, there  ... 
doi:10.12928/telkomnika.v14i2.2353 fatcat:iyvoaxzg5zaktjj37qrogbgubu

Application of machine learning method in optical molecular imaging: a review

Yu An, Hui Meng, Yuan Gao, Tong Tong, Chong Zhang, Kun Wang, Jie Tian
2019 Science China Information Sciences  
Optical molecular imaging (OMI) is an imaging technology that uses an optical signal, such as near-infrared light, to detect biological tissue in organisms.  ...  In recent years, machine learning (ML)-based artificial intelligence has been used in different fields because of its ability to perform powerful data processing.  ...  They reported that their network, which was trained with 100000 OCT B-scan images, achieved an area under curve (AUC) of 0.97 in the validation.  ... 
doi:10.1007/s11432-019-2708-1 fatcat:ju6k27sy3jbpxdtli7knopqdji

Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset [article]

Ankit Shah, Harini Kesavamoorthy, Poorva Rane, Pramati Kalwad, Alexander Hauptmann, Florian Metze
2018 arXiv   pre-print
Action recognition refers to the act of classifying the desired action/activity present in a given video.  ...  Accurate recognition of these moments is challenging due to the diverse and complex interpretation of the moments.  ...  Having access to Bridges computing resources enabled us to work efficiently and produce results on this challenging dataset.  ... 
arXiv:1809.00241v2 fatcat:767kewmknra4tbtipstaiyhyxu

Efficient data selection for ASR

Neil Taylor Kleynhans, Etienne Barnard
2014 Language Resources and Evaluation  
In this work, we propose a new data selection framework which can be used to design a speech recognition corpus.  ...  Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products.  ...  The decoding network was built using a flat word-loop grammar and contained only the words which occurred in the evaluation set.  ... 
doi:10.1007/s10579-014-9285-0 fatcat:lba6as2s2rcsddwijndujmq2ie

LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity [article]

Jordan J. Bird, Diego R. Faria, Anikó Ekárt, Cristiano Premebida, Pedro P. S. Ayrosa
2020 arXiv   pre-print
In speech recognition problems, data scarcity often poses an issue due to the willingness of humans to provide large amounts of data for learning and classification.  ...  Using character level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis.  ...  problems such as segments of the Hub500 problem [41] .  ... 
arXiv:2007.00659v2 fatcat:ykd453eagjgodbmmhpnjac6ou4

Punctuation Prediction Model for Conversational Speech

Piotr Żelasko, Piotr Szymański, Jan Mizgajski, Adrian Szymczak, Yishay Carmiel, Najim Dehak
2018 Interspeech 2018  
The neural networks are trained on Common Web Crawl GloVe embedding of the words in Fisher transcripts aligned with conversation side indicators and word time information.  ...  Our results constitute significant evidence that the distribution of words in time, as well as pre-trained embeddings, can be useful in the punctuation prediction task.  ...  Increasing the vocabulary size to 100000 words did not provide any significant performance gains.  ... 
doi:10.21437/interspeech.2018-1096 dblp:conf/interspeech/ZelaskoSMSCD18 fatcat:ymfh7qbf2vd6jj4qim7qnwgrmy

Predicting speech intelligibility from EEG in a non-linear classification paradigm [article]

Bernd Accou, Mohammad Jalilpour Monesi, Hugo Van hamme, Tom Francart
2021 arXiv   pre-print
Recently, brain imaging data has been used to establish a relationship between stimulus and brain response.  ...  Approach: We evaluated the performance of the model as a function of input segment length, EEG frequency band and receptive field size while comparing it to multiple baseline models.  ...  We propose a dilated convolutional network as the basis of an objective measure of speech intelligibility (in our case, word recognition accuracy in noise).  ... 
arXiv:2105.06844v4 fatcat:5nxjti5ufjdozoixeki2q4ymue