
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [article]

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel (+21 others)
2015 arXiv   pre-print
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages.  ...  Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different  ...  Acknowledgments We are grateful to Baidu's speech technology group for help with data preparation and useful conversations.  ... 
arXiv:1512.02595v1 fatcat:auol4dnoxrc5rmj2yrf2kxt5ya

Deep Learning in Speech Recognition
(Original Japanese title: 音声認識におけるDeep Learningの活用, "Utilization of Deep Learning in Speech Recognition")

Ken-ichi Iso
2017 The Brain & Neural Networks  
$p(w_1 w_2 \dots w_N) = \prod_{n=1}^{N} p(w_n \mid w_{n-1} w_{n-2} \dots w_1) \approx \prod_{n=1}^{N} p(w_n \mid w_{n-1} w_{n-2})$ (5); the full conditional $p(w_n \mid w_{n-1} \dots w_1)$ is approximated by the 3-gram model $p(w_n \mid w_{n-1} w_{n-2})$. End-to-End Speech Recognition in English and Mandarin, arXiv:1512.02595. 22) Graves, A.,  ...  Acoustic Modeling from Raw Multichannel Waveforms, IEEE Automatic Speech Recognition and Understanding Workshop. 19) Graves, A., Jaitly, N. (2014): Towards End-to-End Speech Recognition with Recurrent  ... 
doi:10.3902/jnns.24.27 fatcat:2ioqodsou5fhvnwmyi3kj2iosu
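The snippet above quotes the standard trigram language-model factorization used in speech recognition: the probability of a word sequence is factored into per-word conditionals, each truncated to the two preceding words. A minimal Python sketch of a maximum-likelihood trigram model follows; the function names and the `<s>`/`</s>` padding convention are illustrative choices, not taken from the cited paper.

```python
from collections import defaultdict

def train_trigram_counts(sentences):
    """Count trigrams and their bigram contexts from tokenized sentences."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            ctx = (padded[i - 2], padded[i - 1])
            tri[ctx + (padded[i],)] += 1
            bi[ctx] += 1
    return tri, bi

def trigram_prob(tri, bi, w, prev1, prev2):
    """ML estimate of p(w | prev2 prev1); zero if the context is unseen."""
    ctx = (prev2, prev1)
    return tri[ctx + (w,)] / bi[ctx] if bi[ctx] else 0.0

def sentence_prob(tri, bi, words):
    """p(w1..wN) under the trigram approximation in equation (5)."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(tri, bi, padded[i], padded[i - 1], padded[i - 2])
    return p
```

In practice, production n-gram models add smoothing (e.g. Kneser–Ney) so unseen trigrams do not zero out the whole sentence probability; the sketch above omits that for brevity.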

Deep Learning for Emotional Speech Recognition [chapter]

Máximo E. Sánchez-Gutiérrez, E. Marcelo Albornoz, Fabiola Martinez-Licona, H. Leonardo Rufiner, John Goddard
2014 Lecture Notes in Computer Science  
The principal motivation lies in the success reported in a growing body of work employing these techniques as alternatives to traditional methods in speech processing and speech recognition.  ...  The present paper considers the application of restricted Boltzmann machines (RBM) and deep belief networks (DBN) to the difficult task of automatic Spanish emotional speech recognition.  ...  We also want to thank ELRA for supplying the Emotional speech synthesis database, catalogue reference: ELRA-S0329.  ... 
doi:10.1007/978-3-319-07491-7_32 fatcat:toc3ewet6nekvk4wtcxjurfcse

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Due to the limitations of high complexity and low efficiency of traditional speech synthesis technology, the current research focus is deep learning-based end-to-end speech synthesis technology, which  ...  Moreover, this paper also summarizes the open-source speech corpora of English, Chinese, and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and  ...  For example, end-to-end TTS technology based on deep learning has not yet been able to synthesize speech stably in real time, and the quality of the generated speech cannot be guaranteed.  ... 
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

Deep Discriminative Feature Learning for Accent Recognition [article]

Wei Wang, Chao Zhang, Xiaopei Wu
2021 arXiv   pre-print
In this paper, we borrow and improve the deep speaker identification framework to recognize accents; specifically, we adopt a Convolutional Recurrent Neural Network as the front-end encoder and integrate local  ...  Accent recognition with a deep learning framework is similar to deep speaker identification; both are expected to give the input speech an identifiable representation.  ...  The English speech accents in the dataset derive from 8 countries.  ... 
arXiv:2011.12461v4 fatcat:vstwbkyct5hdtpcli35rcx3hvq

Lithuanian Speech Recognition Using Purely Phonetic Deep Learning

Laurynas Pipiras, Rytis Maskeliūnas, Robertas Damaševičius
2019 Computers  
Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field. A large majority of research in this area focuses on widely spoken languages such as English.  ...  The performance of these models is evaluated on an isolated speech recognition task (with an accuracy of 0.993) and a long-phrase recognition task (with an accuracy of 0.992).  ...  Large corpora of speech data such as LibriSpeech [4] for English and AISHELL-1 [5] for Mandarin are available.  ... 
doi:10.3390/computers8040076 fatcat:ugkfjr4xwfczxnwbp6wikso6c4

Automated English Speech Recognition Using Dimensionality Reduction with Deep Learning Approach

Jing Yu, Nianhua Ye, Xueqin Du, Lu Han, Deepak Kumar Jain
2022 Wireless Communications and Mobile Computing  
This paper presents an automated English speech recognition approach using dimensionality reduction and deep learning (AESR-DRDL).  ...  Due to the advancements of deep learning (DL) models, speech recognition systems have received significant attention among researchers in several areas of speech recognition like mobile communication, voice  ...  In [9], three methods are examined to enhance speech recognition on Mandarin-English code-switching tasks.  ... 
doi:10.1155/2022/3597347 fatcat:mjirsojlhvbldo3r7o45hh2rbu

Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition [article]

Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen
2020 arXiv   pre-print
Recent advances in deep learning have heightened interest among researchers in the field of visual speech recognition (VSR).  ...  In this paper, we perform a comprehensive study to evaluate the effects of different facial regions with state-of-the-art VSR models, including the mouth, the whole face, the upper face, and even the cheeks  ...  We would like to thank Chenhao Wang and Mingshuang Luo for their extensive help with data processing.  ... 
arXiv:2003.03206v2 fatcat:7gmyhyka55dq3gwa6cgaybjs6i

Deep and Wide: Multiple Layers in Automatic Speech Recognition

Nelson Morgan
2012 IEEE Transactions on Audio, Speech, and Language Processing  
This article reviews a line of research carried out over the last decade in speech recognition assisted by discriminatively trained feedforward networks.  ...  Index Terms: machine learning, multilayer perceptrons, speech recognition  ...  In a number of large tasks (American English conversational telephone speech, American English broadcast news, Mandarin broadcast news, Arabic broadcast news), a further combination of HATs output and  ... 
doi:10.1109/tasl.2011.2116010 fatcat:7wkvohvwezbphhreo22suin5wi

Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks

Zhong-Qiu Wang, Ivan Tashev
2017 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Experiments on a Mandarin dataset demonstrate the effectiveness of our proposed methods on speech emotion and age/gender recognition tasks.  ...  In this study, we propose to use deep neural networks (DNNs) to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time.  ...  Finally, many previous studies on speech emotion recognition are focused on English.  ... 
doi:10.1109/icassp.2017.7953138 dblp:conf/icassp/WangT17 fatcat:mvqznfj5mrg5vknkznkwgfgmbm
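The Wang & Tashev snippet above describes encoding a variable-length utterance into a fixed-length vector by pooling the activations of a DNN's last hidden layer over time. A minimal NumPy sketch of that pooling idea follows; the function name, mean pooling (the paper also considers other pooling schemes), and the 256-unit hidden size are illustrative assumptions.

```python
import numpy as np

def utterance_embedding(hidden_activations):
    """Pool frame-level activations (shape T x D) into a fixed-length
    D-dimensional utterance vector.

    hidden_activations: last-hidden-layer outputs of a frame-level DNN,
    one row per frame. Mean pooling over time removes the dependence on
    the number of frames T, so utterances of any duration map to vectors
    of the same size.
    """
    return np.asarray(hidden_activations).mean(axis=0)

# Utterances of different lengths yield embeddings of identical dimension.
short = np.random.randn(50, 256)   # 50 frames, 256 hidden units
long_ = np.random.randn(400, 256)  # 400 frames, same hidden size
assert utterance_embedding(short).shape == (256,)
assert utterance_embedding(long_).shape == (256,)
```

The fixed-length vector can then be fed to a standard classifier for emotion or age/gender labels, which is what makes the pooling step the bridge between frame-level and utterance-level modeling.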

Deep Learning Based Automatic Speech Recognition for Turkish

2020 Sakarya University Journal of Science  
Turkish is an agglutinative and phoneme-based language. In this study, a Deep Belief Network (DBN) based Turkish phoneme and speech recognizer is developed.  ...  Although DNNs have been applied to the Automatic Speech Recognition (ASR) problem in some languages, DNN-based Turkish speech recognition has not been studied extensively.  ...  A larger lexicon size degrades the speed of the decoder in word-based speech recognition [24]. Due to high OOV, speech recognition methods designed for English give low recognition results when applied to Turkish.  ... 
doi:10.16984/saufenbilder.711888 fatcat:xvdani7y4nfelnnknpknrcn5oq

Arabic speech recognition using end‐to‐end deep learning

Hamzah A. Alsayadi, Abdelaziz A. Abdelhamid, Islam Hegazy, Zaki T. Fayed
2021 IET Signal Processing  
To the best of our knowledge, the end-to-end deep learning approach has not been used for the task of diacritised Arabic automatic speech recognition.  ...  In this work, the application of state-of-the-art end-to-end deep learning approaches is investigated to build a robust diacritised Arabic ASR.  ...  In contrast, the components of an end-to-end approach can be trained and manipulated as one package using deep learning methods.  ... 
doi:10.1049/sil2.12057 fatcat:jqzkk4f6xzch7gorjhv35dodwu

Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal [article]

Zhiyuan Peng, Siyuan Feng, Tan Lee
2019 arXiv   pre-print
A frame decoder serves to reconstruct speech features from the encoders' outputs. The mFAE is evaluated on a speaker verification (SV) task and an unsupervised subword modeling (USM) task.  ...  Speech signal is constituted and contributed by various informative factors, such as linguistic content and speaker characteristics.  ...  The dataset consists of three languages, namely English, French, and Mandarin. The amounts of training data for the three languages are 45, 24, and 2.5 hours, respectively.  ... 
arXiv:1911.01806v1 fatcat:qt6pc3dzifdqpabpgfczixbysq

Survey on Deep Neural Networks in Speech and Vision Systems [article]

Mahbubul Alam, Manar D. Samad, Lasitha Vidyaratne, Alexander Glandon, Khan M. Iftekharuddin
2019 arXiv   pre-print
This survey presents a review of state-of-the-art deep neural network architectures, algorithms, and systems in vision and speech applications.  ...  Recent advances in deep artificial neural network algorithms and architectures have spurred rapid innovation and development of intelligent vision and speech systems.  ...  Note that the views and findings reported in this work belong solely to the authors and not the NSF or NIH.  ... 
arXiv:1908.07656v2 fatcat:7acubicqzzac3dqemkiccoogm4

Improved language identification using deep bottleneck network

Yan Song, Ruilian Cui, Xinhai Hong, Ian Mcloughlin, Jiong Shi, Lirong Dai
2015 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Recently, several representations that employ a pre-trained deep neural network (DNN) as the front-end feature extractor have achieved state-of-the-art performance.  ...  Effective representation plays an important role in automatic spoken language identification (LID).  ...  , Dari, English-American, English-Indian, Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese).  ... 
doi:10.1109/icassp.2015.7178762 dblp:conf/icassp/SongCHMSD15 fatcat:ej55e4of4bhudgvbjvn5yyvt4q
Showing results 1–15 out of 3,644 results