Acoustic-to-Word Recognition with Sequence-to-Sequence Models [article] · 2018 · arXiv pre-print
Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. ...
We finally show that the Acoustic-to-Word model also learns to segment speech into words with a mean standard deviation of 3 frames as compared with human annotated forced-alignments for the Switchboard ...
We also thank the CMU speech group for many useful discussions. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPUs used for this research. ...
arXiv:1807.09597v2
fatcat:5yiise7hp5fafgyc3fvy4tmwj4
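As a rough illustration of the acoustic-to-word idea in the entry above (the decoder emits whole-word tokens directly, so no lexicon or external language model is required), here is a minimal attention-based encoder-decoder sketch in PyTorch; the layer sizes, feature dimension, and word_vocab_size are placeholders, not the paper's configuration:

    import torch
    import torch.nn as nn

    class AcousticToWord(nn.Module):
        """Minimal attention-based encoder-decoder mapping acoustic frames
        directly to whole-word tokens (illustrative sizes only)."""
        def __init__(self, n_mel=40, hidden=256, word_vocab_size=10000):
            super().__init__()
            self.encoder = nn.LSTM(n_mel, hidden, num_layers=2,
                                   bidirectional=True, batch_first=True)
            self.embed = nn.Embedding(word_vocab_size, hidden)
            self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
            self.attn_query = nn.Linear(hidden, 2 * hidden)
            self.out = nn.Linear(hidden + 2 * hidden, word_vocab_size)

        def forward(self, feats, prev_words):
            # feats: (B, T, n_mel); prev_words: (B, U) shifted word targets
            enc, _ = self.encoder(feats)                        # (B, T, 2H)
            B, U = prev_words.shape
            h = feats.new_zeros(B, self.decoder.hidden_size)
            c = torch.zeros_like(h)
            ctx = enc.mean(dim=1)                               # initial context
            logits = []
            for u in range(U):
                emb = self.embed(prev_words[:, u])              # (B, H)
                h, c = self.decoder(torch.cat([emb, ctx], dim=-1), (h, c))
                scores = torch.bmm(enc, self.attn_query(h).unsqueeze(-1))
                attn = torch.softmax(scores.squeeze(-1), dim=-1)    # (B, T)
                ctx = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)  # (B, 2H)
                logits.append(self.out(torch.cat([h, ctx], dim=-1)))
            return torch.stack(logits, dim=1)                   # (B, U, vocab)

Training such a model would minimize cross-entropy against the reference word sequence, and decoding is a beam search over word tokens only, which is what removes the need for an external decoder or lexicon.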
An efficient search space representation for large vocabulary continuous speech recognition · 2000 · Speech Communication
In pursuance of better performance, current speech recognition systems tend to use more and more complicated models for both the acoustic and the language component. ...
In this paper, we present a memory-efficient search topology that enables the use of such detailed acoustic and language models in a one pass time-synchronous recognition system. ...
doi:10.1016/s0167-6393(99)00030-8
fatcat:mcbe43giazf3rbjeliyxziwmty
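The snippet above does not spell out which representation the paper uses; as a generic (not the paper's) illustration of how a large-vocabulary, one-pass search space is kept compact, words that share leading phones can share nodes in a lexical prefix tree:

    class LexiconTrie:
        """Lexical prefix tree: words sharing leading phones share nodes,
        which keeps a large-vocabulary search network compact."""
        def __init__(self):
            self.root = {}          # phone -> child node (a nested dict)
            self.WORD = "#word"     # key marking a word end at a node

        def add(self, word, phones):
            node = self.root
            for ph in phones:
                node = node.setdefault(ph, {})
            node.setdefault(self.WORD, []).append(word)

        def lookup(self, phones):
            node = self.root
            for ph in phones:
                if ph not in node:
                    return []
                node = node[ph]
            return node.get(self.WORD, [])

    lex = LexiconTrie()
    lex.add("speech", ["s", "p", "iy", "ch"])
    lex.add("speed",  ["s", "p", "iy", "d"])    # shares the s-p-iy prefix
    print(lex.lookup(["s", "p", "iy", "d"]))    # ['speed']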
Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition · 2019 · Computational Intelligence and Neuroscience
This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems—even when, due to good acoustical matching, the correct lemma is predicted by the automatic speech recognition system, often a wrong word ending occurs, which is nevertheless counted as an error. ...
Training Method-Acoustic Model. The acoustic models used were subsampled time-delay neural networks (TDNNs), which are trained using cross-entropy training within the so-called "chain" training method ...
doi:10.1155/2019/5072918
fatcat:osie2v55hfczzl5p2a4ybls5ja
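As a loose sketch of the subsampled TDNN acoustic model mentioned in the entry above: a TDNN layer can be written as a 1-D convolution over time, with stride for frame subsampling and dilation for wider temporal context. The layer widths, context offsets, and output size below are illustrative assumptions, not the paper's Kaldi configuration, and the "chain" objective is not shown:

    import torch
    import torch.nn as nn

    class TDNN(nn.Module):
        """A TDNN layer is a 1-D convolution over time; subsampling is
        expressed with stride, wider temporal context with dilation."""
        def __init__(self, n_feat=40, hidden=512, n_pdf=3000):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Conv1d(n_feat, hidden, kernel_size=3, dilation=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, stride=3),   nn.ReLU(),  # subsample x3
                nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
                nn.Conv1d(hidden, n_pdf, kernel_size=1),              # per-frame senone logits
            )

        def forward(self, feats):          # feats: (B, T, n_feat)
            return self.layers(feats.transpose(1, 2)).transpose(1, 2)

    x = torch.randn(2, 150, 40)            # 2 utterances, 150 frames of 40-dim features
    print(TDNN()(x).shape)                 # torch.Size([2, 43, 3000]): logits at ~1/3 frame rate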
Large-vocabulary recognition · 1995 · Philips Journal of Research
It is essential in this application that the user is free to speak as he or she usually does and should be free to use his or her own wording and formulation. ...
This implies speech recognition for large and open vocabularies, free syntax, continuous speech. ...
In German, a vocabulary of more than 100000 words is needed to have a good coverage of newspaper article dictation, and 25000 words are necessary for radiology reporting. ...
doi:10.1016/0165-5817(96)81585-3
fatcat:2ihc2prnavc3jdat54f7jdkuw4
A Survey on Audio Synthesis and Audio-Visual Multimodal Processing [article] · 2021 · arXiv pre-print
This review focuses on text-to-speech (TTS), music generation and some tasks that combine visual and acoustic information. ...
LRS. The LRS dataset [15] is a dataset for visual speech recognition, which consists of over 100000 natural sentences from British television. ...
Acoustic models. Nowadays, acoustic features are usually used as the intermediate features in TTS tasks. As a result, we focus on the research work on acoustic models in this section. ...
arXiv:2108.00443v1
fatcat:5xkj7lf7pfgpppvfqwynoqkqjm
N-Best Re-scoring Approaches for Mandarin Speech Recognition · 2014 · International Journal of Hybrid Information Technology
In this paper, we first explore two n-best re-scoring approaches for Mandarin speech recognition. Both re-scoring methods are used to choose the optimal word sequence from n-best lists. ...
However, training text for acoustic model might be insufficient and inappropriate for the discriminative model. ...
POS Models. POS information can be used to improve the performance for many natural language processing tasks, such as word segmentation and named entity recognition [29]. ...
doi:10.14257/ijhit.2014.7.2.26
fatcat:zotvkrfyhfaobcsggblms5kvse
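To make the re-scoring step in the entry above concrete, a minimal sketch: each first-pass hypothesis is re-ranked by a weighted combination of its acoustic score, a second-pass language-model score, and a word insertion penalty. The weights and the toy scores are invented for illustration:

    import math

    def rescore_nbest(nbest, lm_score, lm_weight=0.8, wip=-0.5):
        """nbest: list of (word_sequence, acoustic_logprob, first_pass_lm_logprob).
        lm_score: callable returning a log-probability from a stronger LM.
        Returns hypotheses sorted by the combined score (best first)."""
        scored = []
        for words, am, _ in nbest:
            total = am + lm_weight * lm_score(words) + wip * len(words)
            scored.append((total, words))
        return [w for _, w in sorted(scored, reverse=True)]

    # toy example with a fake unigram "LM"
    toy_lm = lambda ws: sum(math.log(0.1) for _ in ws)
    nbest = [(["we", "recognize", "speech"], -120.0, -9.5),
             (["we", "wreck", "a", "nice", "beach"], -118.0, -14.2)]
    print(rescore_nbest(nbest, toy_lm)[0])   # ['we', 'recognize', 'speech']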
Effective representations for leveraging language content in multimedia event detection · 2014 · 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
However, sporadic occurrence, content that is unrelated to the events of interest, and high error rates of current speech and text recognition systems on consumer domain video make it difficult to exploit ...
First, we utilize likelihood weighted word lattices obtained from a Hidden Markov Model (HMM) based decoding engine to encode many alternate hypotheses, rather than relying on noisy single best hypotheses ...
To alleviate the negative impact on event detection, we use word lattices or confusion networks [17] instead of the 1-best transcripts. This has been widely used for keyword spotting [18] . ...
doi:10.1109/icassp.2014.6854982
dblp:conf/icassp/WuZN14
fatcat:ovnu5vzeifdmpk2vp5rdpmdj6q
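A small sketch of the lattice/confusion-network idea from the entry above: instead of counting keywords in the noisy 1-best transcript, word posteriors are summed over confusion-network bins to obtain soft (expected) counts. The posteriors below are invented for illustration:

    from collections import defaultdict

    def expected_word_counts(confusion_network):
        """confusion_network: list of bins, each a dict {word: posterior}.
        Returns the expected count of every word under the ASR posterior,
        rather than a hard count from the 1-best hypothesis."""
        counts = defaultdict(float)
        for bin_posteriors in confusion_network:
            for word, p in bin_posteriors.items():
                if word != "<eps>":          # skip the empty/epsilon arc
                    counts[word] += p
        return dict(counts)

    cn = [{"birthday": 0.7, "bird": 0.2, "<eps>": 0.1},
          {"party": 0.6, "part": 0.4}]
    print(expected_word_counts(cn))   # {'birthday': 0.7, 'bird': 0.2, 'party': 0.6, 'part': 0.4}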
Speech-to-Speech Translation Services for the Olympic Games 2008 [chapter] · 2006 · Lecture Notes in Computer Science
One of the objectives of the program is the use of artificial intelligence technology to overcome language barriers during the games. ...
Acknowledgements The authors would like to thank Raquel Tato and Marta Tolos for their help in the development of the Spanish recognition system, Dorcas Alexander for her contribution to the development ...
As a preprocessing step, the Chinese part of the corpus was segmented into words using a segmenter derived from the LDC segmenter. ...
doi:10.1007/11965152_27
fatcat:n6aetlvsjzd43m6eh4oxekoyam
An Optimum Database for Isolated Word in Speech Recognition System · 2016 · TELKOMNIKA (Telecommunication Computing Electronics and Control)
A speech recognition system (ASR) is a technology that allows computers to receive input in the form of spoken words. ...
Mel-scale frequency cepstral coefficients (MFCCs) are used to extract the characteristics of the speech signal, and a backpropagation neural network over quantized vectors is used to evaluate the maximum likelihood ...
Another example is the word 'Jakarta', which will be segmented into 'Ja', 'Kar' and 'Ta'. From the recording process, this study has 13230 final words, and after the segmentation step, from the 13230 final words, there ...
doi:10.12928/telkomnika.v14i2.2353
fatcat:iyvoaxzg5zaktjj37qrogbgubu
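For the MFCC front end mentioned in the entry above, a minimal feature-extraction sketch using librosa; the file name, sampling rate, and window settings are placeholders (13 coefficients with 25 ms windows and a 10 ms shift is a common choice, not necessarily the paper's):

    import librosa

    # Load an isolated-word recording (placeholder path) and extract MFCCs.
    signal, sr = librosa.load("jakarta_sample.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)   # 25 ms windows, 10 ms shift
    print(mfcc.shape)   # (13, n_frames): one 13-dim vector per 10 ms frame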
Application of machine learning method in optical molecular imaging: a review · 2019 · Science China Information Sciences
Optical molecular imaging (OMI) is an imaging technology that uses an optical signal, such as near-infrared light, to detect biological tissue in organisms. ...
In recent years, machine learning (ML)-based artificial intelligence has been used in different fields because of its ability to perform powerful data processing. ...
They reported that their network, which was trained with 100000 OCT B-scan images, achieved an area under curve (AUC) of 0.97 in the validation. ...
doi:10.1007/s11432-019-2708-1
fatcat:ju6k27sy3jbpxdtli7knopqdji
Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset [article] · 2018 · arXiv pre-print
Action recognition refers to the act of classifying the desired action/activity present in a given video. ...
Accurate recognition of these moments is challenging due to the diverse and complex interpretation of the moments. ...
Having access to Bridges computing resources enabled us to work efficiently and produce results on this challenging dataset. ...
arXiv:1809.00241v2
fatcat:767kewmknra4tbtipstaiyhyxu
Efficient data selection for ASR · 2014 · Language Resources and Evaluation
In this work, we propose a new data selection framework which can be used to design a speech recognition corpus. ...
Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. ...
The decoding network was built using a flat word-loop grammar and contained only the words which occurred in the evaluation set. ...
doi:10.1007/s10579-014-9285-0
fatcat:lba6as2s2rcsddwijndujmq2ie
LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity [article] · 2020 · arXiv pre-print
In speech recognition problems, data scarcity often poses an issue due to the limited willingness of humans to provide large amounts of data for learning and classification. ...
Using character level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. ...
problems such as segments of the Hub500 problem [41] . ...
arXiv:2007.00659v2
fatcat:ykd453eagjgodbmmhpnjac6ou4
Punctuation Prediction Model for Conversational Speech · 2018 · Interspeech 2018
The neural networks are trained on Common Web Crawl GloVe embeddings of the words in Fisher transcripts, aligned with conversation side indicators and word time information. ...
Our results constitute significant evidence that the distribution of words in time, as well as pre-trained embeddings, can be useful in the punctuation prediction task. ...
Increasing the vocabulary size to 100000 words did not provide any significant performance gains. ...
doi:10.21437/interspeech.2018-1096
dblp:conf/interspeech/ZelaskoSMSCD18
fatcat:ymfh7qbf2vd6jj4qim7qnwgrmy
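A sketch of the input representation described in the entry above: each word is represented by its pre-trained GloVe vector concatenated with a conversation-side indicator and simple word-timing features. The dimensionalities and the exact timing features here are assumptions, not the paper's recipe:

    import numpy as np

    def word_features(words, starts, ends, sides, glove, dim=300):
        """words: list of tokens; starts/ends: word times in seconds;
        sides: 0/1 conversation-side flags; glove: dict token -> np.ndarray."""
        feats = []
        for i, w in enumerate(words):
            emb = glove.get(w.lower(), np.zeros(dim))          # OOV -> zero vector
            pause = starts[i] - ends[i - 1] if i > 0 else 0.0  # pause before the word
            duration = ends[i] - starts[i]
            feats.append(np.concatenate([emb, [sides[i], pause, duration]]))
        return np.stack(feats)      # (n_words, dim + 3), fed to the tagger

    # toy usage with a 4-dimensional "GloVe" table
    glove = {"so": np.ones(4), "yeah": np.full(4, 0.5)}
    X = word_features(["so", "yeah"], [0.0, 0.6], [0.3, 0.9], [0, 0], glove, dim=4)
    print(X.shape)   # (2, 7)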
Predicting speech intelligibility from EEG in a non-linear classification paradigm [article] · 2021 · arXiv pre-print
Recently, brain imaging data has been used to establish a relationship between stimulus and brain response. ...
Approach: We evaluated the performance of the model as a function of input segment length, EEG frequency band and receptive field size while comparing it to multiple baseline models. ...
We propose a dilated convolutional network as the basis of an objective measure of speech intelligibility (in our case, word recognition accuracy in noise). ...
arXiv:2105.06844v4
fatcat:5nxjti5ufjdozoixeki2q4ymue
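A minimal sketch in the spirit of the dilated convolutional network described in the entry above (an EEG segment goes in, a single intelligibility-related score comes out); the channel count, dilation schedule, and pooling are illustrative assumptions rather than the authors' architecture:

    import torch
    import torch.nn as nn

    class DilatedEEGNet(nn.Module):
        """Stacked dilated convolutions give a large receptive field over
        the EEG time axis without pooling away temporal detail."""
        def __init__(self, n_channels=64, hidden=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_channels, hidden, kernel_size=3, dilation=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2),     nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, dilation=4),     nn.ReLU(),
            )
            self.head = nn.Linear(hidden, 1)     # single score per segment

        def forward(self, eeg):                  # eeg: (B, n_channels, T)
            h = self.conv(eeg).mean(dim=-1)      # average over time -> (B, hidden)
            return torch.sigmoid(self.head(h))   # probability-like output in [0, 1]

    segment = torch.randn(8, 64, 640)            # e.g. 8 five-second segments at 128 Hz
    print(DilatedEEGNet()(segment).shape)        # torch.Size([8, 1])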
Showing results 1–15 of 165 results