The RWTH Large Vocabulary Arabic Handwriting Recognition System

Mahdi Hamdani, Patrick Doetsch, Michal Kozielski, Amr El-Desoky Mousa, Hermann Ney
2014 2014 11th IAPR International Workshop on Document Analysis Systems  
Unsupervised writer adaptation is also performed using the Constrained Maximum Likelihood Linear Regression (CMLLR) feature adaptation.  ...  This paper describes the RWTH system for large vocabulary Arabic handwriting recognition.  ...  ACKNOWLEDGMENT This work was partially supported by a Google Research Award and by the Quaero Program, funded by OSEO, French State agency for innovation. H.  ... 
doi:10.1109/das.2014.61 dblp:conf/das/HamdaniDKMN14 fatcat:du743657nvdadclulxln6tl4gm

Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition [article]

Nick J.C. Wang, Zongfeng Quan, Shaojun Wang, Jing Xiao
2022 arXiv   pre-print
To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder fed from the acoustic  ...  sequences generated by the encoder, thus reducing operations.  ...  For example, Wang et al. generate segmental boundaries via special segmentation gates [13] .  ... 
arXiv:2204.03889v1 fatcat:hbyfw57n2jhspdultq3h2xng2e

Introduction to the special issue. Advancing the state-of-the-science in reading research through modeling

Jason D. Zevin, Brett Miller
2016 Scientific Studies of Reading  
Even skilled reading may be characterized by detailed psycholinguistic investigations of multiple, somewhat dissociable, outcomes, such as generalization to novel pseudowords and reading for comprehension  ...  , and how they may put children at risk for reading difficulty.  ...  Support for Jason Zevin is provided in part by P01 HD001994 and P01 HD070837 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development.  ... 
doi:10.1080/10888438.2015.1118480 pmid:26966346 pmcid:PMC4780422 fatcat:dz2lbkqazncaxj7czig3nwruxe

Symbol Grounding in Multimodal Sequences using Recurrent Neural Networks

Federico Raue, Wonmin Byeon, Thomas M. Breuel, Marcus Liwicki
2015 Neural Information Processing Systems  
Our approach uses two Long Short-Term Memory (LSTM) networks for multimodal sequence learning and recovers the internal symbolic space using an EM-style algorithm.  ...  We compared our model against LSTM in three different multimodal datasets: digit, letter and word recognition. The performance of our model reached similar results to LSTM.  ...  The audio component was generated similar to the first dataset using Festival Toolkit. In contrast to MNIST, this dataset does not have an explicit division for the training set and the testing set.  ... 
dblp:conf/nips/RaueBBL15 fatcat:drx7z7oatjailp4rlbgs2ya2hy

MLS: A Large-Scale Multilingual Dataset for Speech Research [article]

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert
2020 arXiv   pre-print
This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research.  ...  We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at  ...  Acknowledgements We would like to thank Steven Garan for help in data preparation and text normalization and Mark Chou for helping with setting up the workflow for transcription verification.  ... 
arXiv:2012.03411v1 fatcat:krcmqjo2jzatfh6ahrlykqeooi

Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR [article]

Peter Plantinga, Deblin Bagchi, Eric Fosler-Lussier
2021 arXiv   pre-print
Experiments on Voicebank + DEMAND dataset for enhancement show that this approach achieves a new state of the art for some objective enhancement scores.  ...  To address this, we used a pre-trained acoustic model to generate a perceptual loss that makes speech enhancement more aware of the phonetic properties of the signal.  ...  their efforts in creating the toolkit and developing recipes for VoiceBank + DEMAND and LibriSpeech, as well as the pretrained LibriSpeech seq2seq model.  ... 
arXiv:2112.06068v1 fatcat:qkptt3vegvhz5cwj6kkbvzic54

CTC Network with Statistical Language Modeling for Action Sequence Recognition in Videos

Mengxi Lin, Nakamasa Inoue, Koichi Shinoda
2017 Proceedings of the on Thematic Workshops of ACM Multimedia 2017 - Thematic Workshops '17  
The proposed method combines Connectionist Temporal Classification (CTC) and a statistical language model.  ...  We propose a method for recognizing an action sequence in which several actions are concatenated and their boundaries are not given.  ...  that uniformly selects an action from 48 action types.  ... 
doi:10.1145/3126686.3126755 dblp:conf/mm/LinIS17 fatcat:bl6rwxal6jcanar7tupffqjsra

Training end-to-end speech-to-text models on mobile phones [article]

Zitha S, Raghavendra Rao Suresh, Pooja Rao, T. V. Prabhakar
2021 arXiv   pre-print
In addition, these models are trained on generic datasets that are not exhaustive in capturing user-specific characteristics.  ...  Hence, this is vital for a successful deployment of on-device training onto a resource-limited environment like mobile phones.  ...  EXPERIMENTAL SETUP This section presents the approach to investigate the on-device training from dataset generation to profiling memory consumption on phone.  ... 
arXiv:2112.03871v1 fatcat:ozljk2lirbh6jfqqyjchvukr4y

Improving Mandarin Speech Recogntion with Block-augmented Transformer [article]

Xiaoming Ren, Huifeng Zhu, Liuwei Wei, Minghui Wu, Jie Hao
2022 arXiv   pre-print
Therefore we propose the Block-augmented Transformer for speech recognition, named Blockformer.  ...  Experiments have proved that the Blockformer significantly outperforms the state-of-the-art Conformer-based models on AISHELL-1, our model achieves a CER of 4.35% without using a language model and 4.10  ...  The model achieved better results with a few extra parameters than previous work on the Mandarin dataset Aishell-1, and achieved a new state-of-the-art performance at 4.35%/4.10% for test dataset.  ... 
arXiv:2207.11697v1 fatcat:xxsdazufvfge5cyr7wgwxlr5ou

Speech Recognition Using Historian Multimodal Approach

Eslam Elmaghraby, Amr Gody, Mohamed Farouk
2019 The Egyptian Journal of Language Engineering  
The effectiveness of the proposed model is demonstrated on a multi-speakers AVSR benchmark dataset named GRID.  ...  While for noisy data, the highest recognition accuracy for integrated audio-visual features is 98.47% with enhancement up to 12.05% over audio-only.  ...  [38] performed labeling by using CNN, LSTM and Connectionist Temporal Classification (CTC) [39] which reports a strong speaker-independent performance on the constrained grammar and the 51 words vocabulary  ... 
doi:10.21608/ejle.2019.59164 fatcat:ylyu5apzuzakvefkxxycavcxei

Amanuensis: The Programmer's Apprentice [article]

Thomas Dean, Maurice Chiang, Marcus Gomez, Nate Gruver, Yousef Hindy, Michelle Lam, Peter Lu, Sophia Sanchez, Rohun Saxena, Michael Smith, Lucy Wang, Catherine Wong
2018 arXiv   pre-print
This document provides an overview of the material covered in a course taught at Stanford in the spring quarter of 2018.  ...  The course draws upon insight from cognitive and systems neuroscience to implement hybrid connectionist and symbolic reasoning systems that leverage and extend the state of the art in machine learning  ...  Battaglia et al [7] describe Graph Networks as a "new building block for the AI toolkit with a strong relational inductive bias the graph network, which generalizes and extends various approaches for  ... 
arXiv:1807.00082v2 fatcat:piwexqa2xvgg5ec5xwkswstswy

Learning Multiscale Features Directly from Waveforms

Zhenyao Zhu, Jesse H. Engel, Awni Hannun
2016 Interspeech 2016  
Further, we find more efficient representations by simultaneously learning at multiple scales, leading to an overall decrease in word error rate on a difficult internal speech test set by 20.7% relative  ...  Deep learning has dramatically improved the performance of speech recognition systems through learning hierarchies of features optimized for the task at hand.  ...  Despite this, a filter bank is constrained by its window size to a single scale.  ... 
doi:10.21437/interspeech.2016-256 dblp:conf/interspeech/ZhuEH16 fatcat:nl5jbumpvbfctonygcblyslt7q

End-to-End Speech Recognition and Disfluency Removal [article]

Paria Jamshid Lou, Mark Johnson
2020 arXiv   pre-print
We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a disfluency  ...  The findings of this paper can serve as a benchmark for further research on the task of end-to-end speech recognition and disfluency removal in the future.  ...  Acknowledgements We would like to thank the anonymous reviewers for their insightful comments and suggestions.  ... 
arXiv:2009.10298v3 fatcat:igk4kggpzvejpeihaojwj33ady

Intelligent data analysis: issues and challenges

Xiaohui Liu
1996 Knowledge engineering review (Print)  
This paper attempts to discuss a wide range of problems that may appear while analysing the data, and suggests strategies to deal with them.  ...  Some of these problems and suggestions are examined with the results of data analysis on a real-life example of risk assessment of level crossing data.  ...  An iterative process first considers a selected number of attributes chosen by the user for analysis or using a feature selection algorithm, and then keeps adding other attributes for analysis until the  ... 
doi:10.1017/s0269888900008055 fatcat:bgxjn4qiz5afbi4fyjv4onu4hu

Understanding Audio Features via Trainable Basis Functions [article]

Kwan Yee Heung, Kin Wai Cheuk, Dorien Herremans
2022 arXiv   pre-print
From our experiments, we can conclude that trainable basis functions are a useful tool to boost the performance when the model complexity is limited.  ...  In our experiments, we allow for this tailoring directly as part of the network.  ...  Therefore, the shape constraints might not work well for all speech-related tasks in general.  ... 
arXiv:2204.11437v1 fatcat:eagtfugkmbb4rbmy7w2qxrwuzu