77,583 Hits in 5.7 sec

Deep Speech: Scaling up end-to-end speech recognition [article]

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates and Andrew Y. Ng
2014 arXiv   pre-print
We present a state-of-the-art speech recognition system developed using end-to-end deep learning.  ...  Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.  ...  Acknowledgments We are grateful to Jia Lei, whose work on DL for speech at Baidu has spurred us forward, for his advice and support throughout this project.  ... 
arXiv:1412.5567v2 fatcat:cfqvlbcrbbh23ingwt4zmnz2ka

Deep Audio-visual Speech Recognition

Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
2018 IEEE Transactions on Pattern Analysis and Machine Intelligence  
Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal  ...  is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television.  ...  We are very grateful to Rob Cooper and Matt Haynes at BBC Research for help in obtaining the dataset. We would like to thank Ankush Gupta for helpful comments and discussion.  ... 
doi:10.1109/tpami.2018.2889052 fatcat:pyjz3cnvnvavbluisnp6cqkxyq

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [article]

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel (+21 others)
2015 arXiv   pre-print
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages.  ...  Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different  ...  Acknowledgments We are grateful to Baidu's speech technology group for help with data preparation and useful conversations.  ... 
arXiv:1512.02595v1 fatcat:auol4dnoxrc5rmj2yrf2kxt5ya

Speech recognition with deep recurrent neural networks

Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton
2013 2013 IEEE International Conference on Acoustics, Speech and Signal Processing  
Index Terms: recurrent neural networks, deep neural networks, speech recognition  ...  When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.  ...  Instead of combining RNNs with HMMs, it is possible to train RNNs 'end-to-end' for speech recognition [8, 9, 10].  ... 
doi:10.1109/icassp.2013.6638947 dblp:conf/icassp/GravesMH13 fatcat:5f2ghhhs55f2rdi6cvjgt3a5km

Speech Recognition with Deep Recurrent Neural Networks [article]

Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton
2013 arXiv   pre-print
However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks.  ...  When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.  ...  Instead of combining RNNs with HMMs, it is possible to train RNNs 'end-to-end' for speech recognition [8, 9, 10].  ... 
arXiv:1303.5778v1 fatcat:7lat3sue25hklb6z5dtcl6axcm

Speech Emotion Recognition Using Deep Learning Techniques

Apoorva Ganapathy, Adobe Systems
2016 ABC Journal of Advanced Research  
Developments in neural systems and the high demand for accurate, near-real-time Speech Emotion Recognition in human-computer interfaces make it necessary to compare existing methods and datasets  ...  The present investigation assessed deep learning methods for speech emotion detection with accessible datasets, tracked by predictable machine learning methods for SER.  ...  End-To-End Speech Emotion Recognition Using A Deep Convolutional Recurrent Network (Trigeorgis et al., 2016).  ... 
doi:10.18034/abcjar.v5i2.550 fatcat:z2rawtplfrfevagcykrlme3vka

Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement

Andreas Triantafyllopoulos, Gil Keren, Johannes Wagner, Ingmar Steiner, Björn W. Schuller
2019 Interspeech 2019  
The use of deep learning (DL) architectures for speech enhancement has recently improved the robustness of voice applications under diverse noise conditions.  ...  preserving enough information for an SER algorithm to accurately identify emotion in speech.  ...  Secondly, we scale up the number of noise environments taken into consideration, essentially moving towards a production-ready speech enhancement algorithm that can work reliably for different SER applications  ... 
doi:10.21437/interspeech.2019-1811 dblp:conf/interspeech/Triantafyllopoulos19 fatcat:fmbbwj5hsvf6rfsb5spwjz6r4a

Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition [article]

Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen
2020 arXiv   pre-print
Recent advances in deep learning have heightened interest among researchers in the field of visual speech recognition (VSR).  ...  Currently, most existing methods equate VSR with automatic lip reading, which attempts to recognise speech by analysing lip motion.  ...  We would like to thank Chenhao Wang and Mingshuang Luo for their extensive help with data processing.  ... 
arXiv:2003.03206v2 fatcat:7gmyhyka55dq3gwa6cgaybjs6i

Speech Denoising with Auditory Models [article]

Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott
2021 arXiv   pre-print
The development of high-performing neural network sound recognition systems has raised the possibility of using deep feature representations as 'perceptual' losses with which to train denoising systems  ...  Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform.  ...  Listeners were provided with anchors corresponding to the ends of the rating scale (1 and 7). The anchor at the high end was always the original clean speech.  ... 
arXiv:2011.10706v3 fatcat:pi7ijik23nepto3auze5ocugdu

The Conversation: Deep Audio-Visual Speech Enhancement [article]

Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
2018 arXiv   pre-print
In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the  ...  Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.  ...  We would like to thank Ankush Gupta for helpful comments. References  ... 
arXiv:1804.04121v2 fatcat:bci7qoekrjfafchckb2xc55zqe

Deep Factorization for Speech Signal [article]

Dong Wang, Lantian Li, Ying Shi, Yixiang Chen, Zhiyuan Tang
2017 arXiv   pre-print
A natural idea is to factorize each speech frame into independent factors, though it turns out to be even more difficult than decoding each individual factor.  ...  Our experiment on an automatic emotion recognition (AER) task demonstrated that this approach can effectively factorize speech signals, and using these factors, the original speech spectrum can be recovered  ...  Acknowledgments Many thanks to Ravichander Vipperla from Nuance, UK for many valuable suggestions.  ... 
arXiv:1706.01777v2 fatcat:na6u3xhe5nenloyw3lllsykz3y

Deep Learning Approaches for Understanding Simple Speech Commands [article]

Roman A. Solovyev, Maxim Vakhrushev, Alexander Radionov, Vladimir Aliev, Alexey A. Shvets
2018 arXiv   pre-print
In this paper we consider several approaches to the problem of sound classification that we applied in the TensorFlow Speech Recognition Challenge organized by the Google Brain team on the Kaggle platform.  ...  As a result we achieved good classification accuracy that allowed us to finish the challenge in 8th place among 1315 teams.  ... 
arXiv:1810.02364v1 fatcat:cp4pwdkyencmzifgzhu333w47m

Robust end-to-end deep audiovisual speech recognition [article]

Ramon Sanabria, Florian Metze, Fernando De La Torre
2016 arXiv   pre-print
This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function.  ...  Multi-modal speech recognition however has not yet found wide-spread use, mostly because the temporal alignment and fusion of the different information sources is challenging.  ...  In this paper, we demonstrated that end-to-end Deep Learning can successfully be applied to the problem of audio-visual (multi-modal) speech recognition  ... 
arXiv:1611.06986v1 fatcat:yqhvju5jirflbcgirpvaegzpwe

DEEP DISCRIMINATIVE AND GENERATIVE MODELS FOR SPEECH PATTERN RECOGNITION [chapter]

Li Deng, Navdeep Jaitly
2015 Handbook of Pattern Recognition and Computer Vision  
In this chapter we describe deep generative and discriminative models as they have been applied to speech recognition.  ...  We focus on speech recognition but our analysis is applicable to other domains.  ...  could further help scale up such methods to even larger datasets.  ... 
doi:10.1142/9789814656535_0002 fatcat:2ovjgqq4njgohffzvdnnut6si4

Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition [article]

Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff
2020 arXiv   pre-print
We propose a novel approach to semi-supervised automatic speech recognition (ASR).  ...  The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data.  ...  INTRODUCTION Current state-of-the-art models for speech recognition require vast amounts of transcribed audio data to attain good performance.  ... 
arXiv:1912.01679v2 fatcat:lk434umwwbgephawolhms7khke
Showing results 1–15 of 77,583