Deep Speech: Scaling up end-to-end speech recognition
[article]
2014
arXiv
pre-print
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. ...
Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems. ...
Acknowledgments We are grateful to Jia Lei, whose work on DL for speech at Baidu has spurred us forward, for his advice and support throughout this project. ...
arXiv:1412.5567v2
fatcat:cfqvlbcrbbh23ingwt4zmnz2ka
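As a rough illustration of the recipe this entry describes, here is a minimal sketch of a Deep-Speech-style acoustic model in PyTorch: spectrogram frames pass through a recurrent stack, and CTC aligns the per-frame character outputs to the transcript. The layer sizes, the GRU choice, and the 29-character English output set are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_mels=80, hidden=512, n_chars=29):
        super().__init__()
        # 29 outputs: 26 letters + space + apostrophe + CTC blank (assumed set)
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, x):                   # x: (batch, time, n_mels)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)  # per-frame character log-probs

model = SpeechRNN()
ctc = nn.CTCLoss(blank=0)            # aligns frames to text without frame labels
x = torch.randn(4, 200, 80)          # dummy batch: 4 utterances, 200 frames
logp = model(x).transpose(0, 1)      # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, 29, (4, 20))
loss = ctc(logp, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 20))
loss.backward()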
Deep Audio-visual Speech Recognition
2018
IEEE Transactions on Pattern Analysis and Machine Intelligence
Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal ...
is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. ...
We are very grateful to Rob Cooper and Matt Haynes at BBC Research for help in obtaining the dataset. We would like to thank Ankush Gupta for helpful comments and discussion. ...
doi:10.1109/tpami.2018.2889052
fatcat:pyjz3cnvnvavbluisnp6cqkxyq
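To make the fusion idea concrete, the following is a hedged PyTorch sketch of a transformer encoder over concatenated audio and lip features. The feature dimensions and the simple concatenate-then-project fusion are assumptions, and the paper's two decoding variants (CTC and sequence-to-sequence) are not reproduced.

import torch
import torch.nn as nn

class AVEncoder(nn.Module):
    def __init__(self, d_audio=80, d_video=512, d_model=256):
        super().__init__()
        self.proj = nn.Linear(d_audio + d_video, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, audio, video):
        # audio: (B, T, 80) filterbanks; video: (B, T, 512) lip features,
        # assumed pre-aligned to a common frame rate
        fused = self.proj(torch.cat([audio, video], dim=-1))
        return self.encoder(fused)   # (B, T, d_model), ready for a CTC head

enc = AVEncoder()
out = enc(torch.randn(2, 100, 80), torch.randn(2, 100, 512))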
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
[article]
2015
arXiv
pre-print
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. ...
Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different ...
Acknowledgments We are grateful to Baidu's speech technology group for help with data preparation and useful conversations. ...
arXiv:1512.02595v1
fatcat:auol4dnoxrc5rmj2yrf2kxt5ya
Speech recognition with deep recurrent neural networks
2013
2013 IEEE International Conference on Acoustics, Speech and Signal Processing
Index Terms: recurrent neural networks, deep neural networks, speech recognition ...
When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge ...
Instead of combining RNNs with HMMs, it is possible to train RNNs 'end-to-end' for speech recognition [8, 9, 10]. ...
doi:10.1109/icassp.2013.6638947
dblp:conf/icassp/GravesMH13
fatcat:5f2ghhhs55f2rdi6cvjgt3a5km
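For reference, the "deep" part of this result is simply a stack of bidirectional LSTM layers over acoustic frames. A minimal PyTorch sketch follows; note that Graves et al. regularised with weight noise, whereas inter-layer dropout is used here as a stand-in, and the 40-dimensional filterbank input and 62 outputs (61 TIMIT phones plus a CTC blank) are assumptions.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=40, hidden_size=250, num_layers=3,
               bidirectional=True, dropout=0.3, batch_first=True)
head = nn.Linear(2 * 250, 62)   # 61 TIMIT phone labels + CTC blank

x = torch.randn(8, 300, 40)     # 8 utterances, 300 filterbank frames each
h, _ = lstm(x)
phone_logits = head(h)          # feed to a CTC loss as in the sketch above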
Speech Recognition with Deep Recurrent Neural Networks
[article]
2013
arXiv
pre-print
However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. ...
When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score. ...
Instead of combining RNNs with HMMs, it is possible to train RNNs 'end-to-end' for speech recognition [8, 9, 10]. ...
arXiv:1303.5778v1
fatcat:7lat3sue25hklb6z5dtcl6axcm
Speech Emotion Recognition Using Deep Learning Techniques
2016
ABC Journal of Advanced Research
Developments in neural networks and the high demand for accurate, near real-time speech emotion recognition in human-computer interfaces make it necessary to compare existing methods and datasets ...
The present study assessed deep learning methods for speech emotion recognition on the available datasets, followed by conventional machine learning methods for SER. ...
End-to-End Speech Emotion Recognition Using a Deep Convolutional Recurrent Network (Trigeorgis et al., 2016). ...
doi:10.18034/abcjar.v5i2.550
fatcat:z2rawtplfrfevagcykrlme3vka
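The Trigeorgis et al. (2016) model cited here learns features directly from the raw waveform with convolutions and models their dynamics with a recurrent layer. Below is a hedged PyTorch sketch of that convolutional-recurrent pattern; the kernel sizes and the four-class categorical output are illustrative assumptions rather than the original setup.

import torch
import torch.nn as nn

class ConvRecurrentSER(nn.Module):
    def __init__(self, n_emotions=4):
        super().__init__()
        self.conv = nn.Sequential(            # feature learning from raw audio
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.rnn = nn.LSTM(128, 128, batch_first=True)
        self.cls = nn.Linear(128, n_emotions)

    def forward(self, wav):                   # wav: (B, 1, samples)
        f = self.conv(wav).transpose(1, 2)    # (B, frames, 128)
        _, (h, _) = self.rnn(f)
        return self.cls(h[-1])                # utterance-level emotion logits

model = ConvRecurrentSER()
logits = model(torch.randn(2, 1, 16000))      # two 1-second clips at 16 kHz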
Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement
2019
Interspeech 2019
The use of deep learning (DL) architectures for speech enhancement has recently improved the robustness of voice applications under diverse noise conditions. ...
preserving enough information for an SER algorithm to accurately identify emotion in speech. ...
Secondly, we scale up the number of noise environments taken into consideration, essentially moving towards a production-ready speech enhancement algorithm that can work reliably for different SER applications ...
doi:10.21437/interspeech.2019-1811
dblp:conf/interspeech/Triantafyllopoulos19
fatcat:fmbbwj5hsvf6rfsb5spwjz6r4a
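As a rough picture of the enhancement front-end, the sketch below stacks residual convolutional blocks over a noisy spectrogram, assuming PyTorch; the channel counts and depth are illustrative, not the Interspeech 2019 system's.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        # skip connection preserves the detail the SER stage still needs
        return torch.relu(x + self.body(x))

denoiser = nn.Sequential(       # lift to 64 channels, denoise, project back
    nn.Conv2d(1, 64, 3, padding=1), ResBlock(), ResBlock(),
    nn.Conv2d(64, 1, 3, padding=1),
)
clean_est = denoiser(torch.randn(2, 1, 128, 80))   # (B, 1, time, freq)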
Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition
[article]
2020
arXiv
pre-print
Recent advances in deep learning have heightened interest among researchers in the field of visual speech recognition (VSR). ...
Currently, most existing methods equate VSR with automatic lip reading, which attempts to recognise speech by analysing lip motion. ...
We would like to thank Chenhao Wang and Mingshuang Luo for their extensive help with data processing. ...
arXiv:2003.03206v2
fatcat:7gmyhyka55dq3gwa6cgaybjs6i
Speech Denoising with Auditory Models
[article]
2021
arXiv
pre-print
The development of high-performing neural network sound recognition systems has raised the possibility of using deep feature representations as 'perceptual' losses with which to train denoising systems ...
Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform. ...
Listeners were provided with anchors corresponding to the ends of the rating scale (1 and 7). The anchor at the high end was always the original clean speech. ...
arXiv:2011.10706v3
fatcat:pi7ijik23nepto3auze5ocugdu
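The core trick described here, using a frozen recognition network's internal activations as the training signal, fits in a few lines. A hedged PyTorch sketch, where the recognizer and denoiser below are illustrative stand-ins rather than the paper's auditory models:

import torch
import torch.nn as nn

def deep_feature_loss(recognizer, enhanced, clean):
    # compare internal features of a frozen recognizer instead of waveforms
    with torch.no_grad():
        target = recognizer(clean)
    return nn.functional.l1_loss(recognizer(enhanced), target)

denoiser = nn.Sequential(nn.Linear(80, 80), nn.ReLU(), nn.Linear(80, 80))
recognizer = nn.Sequential(nn.Linear(80, 256), nn.ReLU()).requires_grad_(False)

noisy, clean = torch.randn(4, 80), torch.randn(4, 80)
loss = deep_feature_loss(recognizer, denoiser(noisy), clean)
loss.backward()    # gradients reach the denoiser; the recognizer stays fixed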
The Conversation: Deep Audio-Visual Speech Enhancement
[article]
2018
arXiv
pre-print
In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the ...
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. ...
We would like to thank Ankush Gupta for helpful comments.
arXiv:1804.04121v2
fatcat:bci7qoekrjfafchckb2xc55zqe
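The snippet ends at "both the magnitude and the ..." (presumably the phase of the target signal). The sketch below shows how a predicted magnitude mask and phase correction would be applied to a noisy STFT, assuming PyTorch; the video-conditioned networks that produce the masks in the paper are abstracted away.

import torch

def apply_masks(noisy_stft, mag_mask, phase_residual):
    # noisy_stft: complex (B, freq, time); masks come from the network
    mag = noisy_stft.abs() * mag_mask              # scale magnitudes
    phase = noisy_stft.angle() + phase_residual    # correct phases
    return torch.polar(mag, phase)                 # rebuild the complex STFT

noisy = torch.randn(2, 257, 100, dtype=torch.complex64)
mag_mask = torch.sigmoid(torch.randn(2, 257, 100))  # network output in [0, 1]
phase_res = torch.randn(2, 257, 100)
enhanced = apply_masks(noisy, mag_mask, phase_res)
wave = torch.istft(enhanced, n_fft=512, hop_length=160)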
Deep Factorization for Speech Signal
[article]
2017
arXiv
pre-print
A natural idea is to factorize each speech frame into independent factors, though it turns out to be even more difficult than decoding each individual factor. ...
Our experiment on an automatic emotion recognition (AER) task demonstrated that this approach can effectively factorize speech signals, and using these factors, the original speech spectrum can be recovered ...
Acknowledgments Many thanks to Ravichander Vipperla from Nuance, UK for many valuable suggestions. ...
arXiv:1706.01777v2
fatcat:na6u3xhe5nenloyw3lllsykz3y
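Read literally, factorizing each frame means encoding it into several independent embeddings from which the original spectrum can be recovered. A hedged autoencoder sketch in PyTorch; the three-way latent split and pure reconstruction training are illustrative assumptions, not necessarily the paper's actual training setup.

import torch
import torch.nn as nn

class FactorizedFrame(nn.Module):
    def __init__(self, n_freq=257, d_factor=32):
        super().__init__()
        self.enc = nn.Linear(n_freq, 3 * d_factor)  # one slot per factor
        self.dec = nn.Linear(3 * d_factor, n_freq)
        self.d = d_factor

    def forward(self, frame):                  # frame: (B, n_freq)
        z = self.enc(frame)
        factors = z.split(self.d, dim=-1)      # three factor embeddings
        return factors, self.dec(z)            # factors + recovered spectrum

model = FactorizedFrame()
(f1, f2, f3), recon = model(torch.randn(8, 257))
loss = nn.functional.mse_loss(recon, torch.randn(8, 257))  # reconstruction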
Deep Learning Approaches for Understanding Simple Speech Commands
[article]
2018
arXiv
pre-print
In this paper we consider several approaches to the problem of sound classification that we applied in the TensorFlow Speech Recognition Challenge organized by the Google Brain team on the Kaggle platform. ...
As a result we achieved good classification accuracy that allowed us to finish the challenge in 8th place among 1,315 teams. ...
arXiv:1810.02364v1
fatcat:cp4pwdkyencmzifgzhu333w47m
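One common baseline for this challenge is a small CNN over log-mel patches with a fixed 12-way output (10 commands plus "silence" and "unknown"). A minimal sketch assuming PyTorch and torchaudio; the layer sizes are assumptions, and the authors' actual ensemble is not reproduced.

import torch
import torch.nn as nn
import torchaudio

melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)

classifier = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 12),            # 10 commands + "silence" + "unknown"
)

wav = torch.randn(4, 16000)                 # four 1-second clips
feats = melspec(wav).log1p().unsqueeze(1)   # (B, 1, n_mels, frames)
logits = classifier(feats)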
Robust end-to-end deep audiovisual speech recognition
[article]
2016
arXiv
pre-print
This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function. ...
Multi-modal speech recognition, however, has not yet found widespread use, mostly because the temporal alignment and fusion of the different information sources are challenging. ...
CONCLUSIONS In this paper, we demonstrated that end-to-end Deep Learning can successfully be applied to the problem of audio-visual (multi-modal) speech recognition ...
arXiv:1611.06986v1
fatcat:yqhvju5jirflbcgirpvaegzpwe
DEEP DISCRIMINATIVE AND GENERATIVE MODELS FOR SPEECH PATTERN RECOGNITION
[chapter]
2015
Handbook of Pattern Recognition and Computer Vision
In this chapter we describe deep generative and discriminative models as they have been applied to speech recognition. ...
We focus on speech recognition but our analysis is applicable to other domains. ...
., 2011 could further help scale up such methods to even larger datasets. ...
doi:10.1142/9789814656535_0002
fatcat:2ovjgqq4njgohffzvdnnut6si4
Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition
[article]
2020
arXiv
pre-print
We propose a novel approach to semi-supervised automatic speech recognition (ASR). ...
The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. ...
INTRODUCTION Current state-of-the-art models for speech recognition require vast amounts of transcribed audio data to attain good performance. ...
arXiv:1912.01679v2
fatcat:lk434umwwbgephawolhms7khke
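The semi-supervised recipe sketched in this abstract has two stages: pretrain a bidirectional encoder on unlabeled audio to reconstruct hidden frame spans from context, then train a small CTC system on its outputs with the labeled data. A hedged PyTorch sketch of the first stage; DeCoAR's exact slice-prediction objective is simplified here to L1 reconstruction of masked positions.

import torch
import torch.nn as nn

encoder = nn.LSTM(80, 256, num_layers=2, bidirectional=True, batch_first=True)
reconstruct = nn.Linear(512, 80)

frames = torch.randn(4, 200, 80)            # unlabeled filterbank features
mask = torch.zeros(4, 200, dtype=torch.bool)
mask[:, 50:60] = True                       # hide a 10-frame span

ctx, _ = encoder(frames.masked_fill(mask.unsqueeze(-1), 0.0))
pred = reconstruct(ctx)
loss = (pred - frames)[mask].abs().mean()   # loss only on masked frames
loss.backward()
# the trained encoder then supplies features for the labeled CTC stage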
Showing results 1 — 15 out of 77,583 results