The RWTH Large Vocabulary Arabic Handwriting Recognition System
2014
2014 11th IAPR International Workshop on Document Analysis Systems
Unsupervised writer adaptation is also performed using Constrained Maximum Likelihood Linear Regression (CMLLR) feature adaptation. ...
This paper describes the RWTH system for large vocabulary Arabic handwriting recognition. ...
ACKNOWLEDGMENT This work was partially supported by a Google Research Award and by the Quaero Program, funded by OSEO, French State agency for innovation. H. ...
doi:10.1109/das.2014.61
dblp:conf/das/HamdaniDKMN14
fatcat:du743657nvdadclulxln6tl4gm
Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition
[article]
2022
arXiv
pre-print
To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder fed from the acoustic sequences generated by the encoder, thus reducing operations. ...
For example, Wang et al. generate segmental boundaries via special segmentation gates [13] . ...
arXiv:2204.03889v1
fatcat:hbyfw57n2jhspdultq3h2xng2e
Introduction to the special issue. Advancing the state-of-the-science in reading research through modeling
2016
Scientific Studies of Reading
Even skilled reading may be characterized by detailed psycholinguistic investigations of multiple, somewhat dissociable, outcomes, such as generalization to novel pseudowords and reading for comprehension, and how they may put children at risk for reading difficulty. ...
Support for Jason Zevin is provided in part by P01 HD001994 and P01 HD070837 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development. ...
doi:10.1080/10888438.2015.1118480
pmid:26966346
pmcid:PMC4780422
fatcat:dz2lbkqazncaxj7czig3nwruxe
Symbol Grounding in Multimodal Sequences using Recurrent Neural Networks
2015
Neural Information Processing Systems
Our approach uses two Long Short-Term Memory (LSTM) networks for multimodal sequence learning and recovers the internal symbolic space using an EM-style algorithm. ...
We compared our model against LSTM on three different multimodal datasets: digit, letter and word recognition. Our model reached performance comparable to LSTM. ...
The audio component was generated similarly to the first dataset using the Festival Toolkit. In contrast to MNIST, this dataset does not have an explicit division into training and testing sets. ...
dblp:conf/nips/RaueBBL15
fatcat:drx7z7oatjailp4rlbgs2ya2hy
MLS: A Large-Scale Multilingual Dataset for Speech Research
[article]
2020
arXiv
pre-print
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. ...
We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org. ...
Acknowledgements We would like to thank Steven Garan for help in data preparation and text normalization and Mark Chou for helping with setting up the workflow for transcription verification. ...
arXiv:2012.03411v1
fatcat:krcmqjo2jzatfh6ahrlykqeooi
Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR
[article]
2021
arXiv
pre-print
Experiments on Voicebank + DEMAND dataset for enhancement show that this approach achieves a new state of the art for some objective enhancement scores. ...
To address this, we used a pre-trained acoustic model to generate a perceptual loss that makes speech enhancement more aware of the phonetic properties of the signal. ...
their efforts in creating the toolkit and developing recipes for VoiceBank + DEMAND and LibriSpeech, as well as the pretrained LibriSpeech seq2seq model. ...
arXiv:2112.06068v1
fatcat:qkptt3vegvhz5cwj6kkbvzic54
CTC Network with Statistical Language Modeling for Action Sequence Recognition in Videos
2017
Proceedings of the on Thematic Workshops of ACM Multimedia 2017 - Thematic Workshops '17
The proposed method combines Connectionist Temporal Classification (CTC) and a statistical language model. ...
We propose a method for recognizing an action sequence in which several actions are concatenated and their boundaries are not given. ...
that uniformly selects an action from 48 action types. ...
doi:10.1145/3126686.3126755
dblp:conf/mm/LinIS17
fatcat:bl6rwxal6jcanar7tupffqjsra
Training end-to-end speech-to-text models on mobile phones
[article]
2021
arXiv
pre-print
In addition, these models are trained on generic datasets that are not exhaustive in capturing user-specific characteristics. ...
Hence, this is vital for a successful deployment of on-device training onto a resource-limited environment like mobile phones. ...
EXPERIMENTAL SETUP This section presents the approach to investigating on-device training, from dataset generation to profiling memory consumption on the phone. ...
arXiv:2112.03871v1
fatcat:ozljk2lirbh6jfqqyjchvukr4y
Improving Mandarin Speech Recognition with Block-augmented Transformer
[article]
2022
arXiv
pre-print
Therefore we propose the Block-augmented Transformer for speech recognition, named Blockformer. ...
Experiments have proved that the Blockformer significantly outperforms the state-of-the-art Conformer-based models on AISHELL-1; our model achieves a CER of 4.35% without using a language model and 4.10 ...
The model achieved better results with a few extra parameters than previous work on the Mandarin dataset AISHELL-1, and achieved a new state-of-the-art performance of 4.35%/4.10% on the test set. ...
arXiv:2207.11697v1
fatcat:xxsdazufvfge5cyr7wgwxlr5ou
Speech Recognition Using Historian Multimodal Approach
2019
The Egyptian Journal of Language Engineering
The effectiveness of the proposed model is demonstrated on a multi-speaker AVSR benchmark dataset named GRID. ...
For noisy data, the highest recognition accuracy for integrated audio-visual features is 98.47%, an improvement of up to 12.05% over audio-only. ...
[38] performed labeling using CNN, LSTM and Connectionist Temporal Classification (CTC) [39], which reports strong speaker-independent performance on the constrained grammar and the 51-word vocabulary ...
doi:10.21608/ejle.2019.59164
fatcat:ylyu5apzuzakvefkxxycavcxei
Amanuensis: The Programmer's Apprentice
[article]
2018
arXiv
pre-print
This document provides an overview of the material covered in a course taught at Stanford in the spring quarter of 2018. ...
The course draws upon insight from cognitive and systems neuroscience to implement hybrid connectionist and symbolic reasoning systems that leverage and extend the state of the art in machine learning ...
Battaglia et al. [7] describe Graph Networks as a "new building block for the AI toolkit with a strong relational inductive bias, the graph network, which generalizes and extends various approaches for ...
arXiv:1807.00082v2
fatcat:piwexqa2xvgg5ec5xwkswstswy
Learning Multiscale Features Directly from Waveforms
2016
Interspeech 2016
Further, we find more efficient representations by simultaneously learning at multiple scales, leading to an overall decrease in word error rate on a difficult internal speech test set by 20.7% relative ...
Deep learning has dramatically improved the performance of speech recognition systems through learning hierarchies of features optimized for the task at hand. ...
Despite this, a filter bank is constrained by its window size to a single scale. ...
doi:10.21437/interspeech.2016-256
dblp:conf/interspeech/ZhuEH16
fatcat:nl5jbumpvbfctonygcblyslt7q
End-to-End Speech Recognition and Disfluency Removal
[article]
2020
arXiv
pre-print
We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a disfluency ...
The findings of this paper can serve as a benchmark for further research on the task of end-to-end speech recognition and disfluency removal in the future. ...
Acknowledgements We would like to thank the anonymous reviewers for their insightful comments and suggestions. ...
arXiv:2009.10298v3
fatcat:igk4kggpzvejpeihaojwj33ady
Intelligent data analysis: issues and challenges
1996
Knowledge engineering review (Print)
This paper attempts to discuss a wide range of problems that may appear while analysing the data, and suggests strategies to deal with them. ...
Some of these problems and suggestions are examined with the results of data analysis on a real-life example of risk assessment of level crossing data. ...
An iterative process first considers a selected number of attributes chosen by the user for analysis or using a feature selection algorithm, and then keeps adding other attributes for analysis until the ...
doi:10.1017/s0269888900008055
fatcat:bgxjn4qiz5afbi4fyjv4onu4hu
Understanding Audio Features via Trainable Basis Functions
[article]
2022
arXiv
pre-print
From our experiments, we can conclude that trainable basis functions are a useful tool to boost the performance when the model complexity is limited. ...
In our experiments, we allow for this tailoring directly as part of the network. ...
Therefore, the shape constraints might not work well for all speech-related tasks in general. ...
arXiv:2204.11437v1
fatcat:eagtfugkmbb4rbmy7w2qxrwuzu