79 Hits in 7.1 sec

Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM [article]

Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan
2017 arXiv   pre-print
We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network.  ...  The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder.  ...  Figure 1 shows the extended architecture, which includes joint decoding, a deep CNN encoder and an RNN-LM network.  ... 
arXiv:1706.02737v1 fatcat:cdswlmukebhnvepo6phfezmnae
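The joint training described in this abstract combines the CTC objective with the attention decoder objective in a single multi-task loss, typically L = λ·L_CTC + (1 − λ)·L_att. A minimal sketch of that interpolation (the function name and default weight are illustrative, not taken from the paper):

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Multi-task objective for joint CTC-attention training:
    interpolate the two losses with a scalar weight (lambda)."""
    assert 0.0 <= ctc_weight <= 1.0
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss
```

In practice both terms are computed on the same encoder output, so the CTC branch pushes the encoder toward monotonic alignments while the attention decoder handles the actual transcription.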

Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan
2017 Interspeech 2017   unpublished
We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network.  ...  The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder.  ...  Figure 1 shows the extended architecture, which includes joint decoding, a deep CNN encoder and an RNN-LM network.  ... 
doi:10.21437/interspeech.2017-1296 fatcat:cidkl3ehzzfqnpahb5mamecr7u

Multi-encoder multi-resolution framework for end-to-end speech recognition [article]

Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky
2018 arXiv   pre-print
Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end Automatic Speech Recognition (ASR).  ...  In this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework based on the joint CTC/Attention model.  ...  Two parallel encoders with heterogeneous structures, RNN-based and CNN-RNN-based, are mutually complementary in characterizing the speech signal.  ... 
arXiv:1811.04897v1 fatcat:5kpfcdtarbfcrch6dbzoq4lfyu

Multi-Stream End-to-End Speech Recognition [article]

Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky
2019 arXiv   pre-print
Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR).  ...  In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR with parallel streams represented by separate encoders aiming to capture diverse information.  ...  JOINT CTC/ATTENTION MECHANISM: In this section, we review the joint CTC/attention architecture, which takes advantage of both CTC and attention-based end-to-end ASR approaches during training and decoding  ...
arXiv:1906.08041v2 fatcat:xcsggemftfctzc4uioohemtpua

End-to-end Speech Recognition with Word-based RNN Language Models [article]

Takaaki Hori, Jaejin Cho, Shinji Watanabe
2018 arXiv   pre-print
In our prior work, we have proposed a multi-level LM, in which character-based and word-based RNN-LMs are combined in hybrid CTC/attention-based ASR.  ...  This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR).  ...  CONCLUSION In this paper, we proposed a word-based RNN language model (RNN-LM) including a look-ahead mechanism for end-to-end automatic speech recognition (ASR).  ... 
arXiv:1808.02608v1 fatcat:q7jqea25dver3aidalglagh3cy
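The look-ahead mechanism mentioned in this conclusion lets the decoder apply word-level LM scores before a word is complete, by aggregating the probability mass of all vocabulary words that share the current character prefix. A toy sketch with a unigram word LM (the vocabulary and probabilities are invented for illustration):

```python
def prefix_lookahead(prefix, word_probs):
    """Look-ahead score for an incomplete word: total unigram
    probability mass of all words starting with `prefix`."""
    return sum(p for word, p in word_probs.items() if word.startswith(prefix))
```

A real implementation would store the vocabulary in a prefix tree so that each character step costs a constant-time lookup rather than a scan over the whole vocabulary.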

A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

Chu-Xiong Qin, Wen-Lin Zhang, Dan Qu
2019 EURASIP Journal on Audio, Speech, and Music Processing  
A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance.  ...  A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model could force extra restrictions on alignments.  ...  It was prepared as a speech recognition corpus by Vassil Panayotov. Competing interests: The authors declare that they have no competing interests.  ...
doi:10.1186/s13636-019-0161-0 fatcat:22f5rozskbbpvmzw5o7lgy5tha

Recent Progresses in Deep Learning based Acoustic Models (Updated) [article]

Dong Yu, Jinyu Li
2018 arXiv   pre-print
...  and the attention-based sequence-to-sequence model.  ...  In this paper, we summarize recent progresses made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques.  ...  In [94], it was further advanced by jointly decoding with the scores from both the attention-based model and the CTC model.  ...
arXiv:1804.09298v2 fatcat:yfxzxu6qanbndcnmt3loikqeym

Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling [article]

Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, Takaaki Hori
2018 arXiv   pre-print
Incorporating an RNNLM also brings significant improvements in terms of %WER, and achieves recognition performance comparable to the models trained with twice as much training data.  ...  Sequence-to-sequence (seq2seq) approach for low-resource ASR is a relatively new direction in speech research. The approach benefits by performing model training without using lexicon and alignments.  ...  Previous studies applying CNN in seq2seq speech recognition [22] also showed that incorporating a deep CNN in the encoder could further boost the performance.  ...
arXiv:1810.03459v1 fatcat:alhb44umqbe47elpfmnzvukj7y

An Overview of End-to-End Automatic Speech Recognition

Dong Wang, Xiaodong Wang, Shaohe Lv
2019 Symmetry  
But recently, the HMM-deep neural network (DNN) model and the end-to-end model using deep learning have achieved performance beyond HMM-GMM. Both using deep learning techniques,  ...  Automatic speech recognition, especially large vocabulary continuous speech recognition, is an important issue in the field of machine learning.  ...  [31] introduced CNN and combined it with RNN for ASR. The designed structure consists of four CNN layers, two dense layers, two RNN layers, and a CTC layer.  ...
doi:10.3390/sym11081018 fatcat:ea3ohiy765clzbj7yulonvz7eu

Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level

Orken Mamyrbayev, Keylan Alimhan, Dina Oralbekova, Akbayan Bekarystankyzy, Bagashar Zhumazhanov
2022 Eastern-European Journal of Enterprise Technologies  
Many research papers have shown that deep learning methods make it easier to train automatic speech recognition systems that use an end-to-end approach.  ...  To increase efficiency and quickly solve the problem associated with a limited resource, transfer learning was used for the end-to-end model.  ...  Acknowledgments: This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP09259309).  ...
doi:10.15587/1729-4061.2022.252801 fatcat:drbererhabdzjilmcj5xpqwxrm

Recent Advances in End-to-End Automatic Speech Recognition [article]

Jinyu Li
2022 arXiv   pre-print
Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).  ...  In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.  ...  [Table fragment: key technologies | year | model | encoder | test-clean/other WER] Deep Speech 2 (more labeled data, curriculum learning [325]): 2016, CTC, bi-RNN, 5.3/13.2; (policy learning, joint training [326]): 2018, CTC, CNN+bi-GRU  ...
arXiv:2111.01690v2 fatcat:6pktwep34jdvjklw4gkri4yn4y

The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results [article]

Xian Shi, Qiangze Feng, Lei Xie
2020 arXiv   pre-print
Three tracks were set for advancing the AM and LM parts of the traditional DNN-HMM ASR system, as well as exploring the E2E models' performance.  ...  This paper describes the design and main outcomes of the ASRU 2019 Mandarin-English code-switching speech recognition challenge, which aims to improve the ASR performance in Mandarin-English code-switching  ...  Encoder-Decoder based systems: LAS [7] and transformer [6] use global attention and multi-head self-attention to generate implicit alignment.  ...
arXiv:2007.05916v1 fatcat:yjuo2ejtwbacph7tgcymjocmvu

Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model [article]

Alexander H. Liu, Hung-yi Lee, Lin-shan Lee
2018 arXiv   pre-print
In this paper we proposed a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM).  ...  Moreover, AT can be applied to any end-to-end ASR model using any deep-learning-based language modeling framework, and is compatible with any existing end-to-end decoding method.  ...  "+LM" refers to shallow fusion decoding jointly with RNN-LM [13], "+AT" refers to the adversarial training proposed here, "+Both" indicates training with AT and joint decoding with RNN-LM, and BT is  ...
arXiv:1811.00787v1 fatcat:c7jgqcmbtva6fdwcjoo3cwdlqq
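The "+LM" condition refers to shallow fusion, where the log-probability of an external RNN-LM is added to the ASR model's log-probability for each candidate token at every beam-search step, scaled by a tunable weight. A minimal per-step sketch (the function name and weight value are illustrative assumptions):

```python
def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Per-token decoding scores under shallow fusion:
    log p_asr(token) + lm_weight * log p_lm(token)."""
    return [a + lm_weight * l for a, l in zip(asr_log_probs, lm_log_probs)]
```

Because the combination happens only at decoding time, the ASR model and the LM can be trained independently, on paired speech data and on text-only data respectively.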

Stream attention-based multi-array end-to-end speech recognition [article]

Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky
2019 arXiv   pre-print
Motivated by the advances of joint Connectionist Temporal Classification (CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based multi-array framework is proposed in this work.  ...  In terms of attention, a hierarchical structure is adopted. On top of the regular attention networks, stream attention is introduced to steer the decoder toward the most informative encoders.  ...  Joint CTC/Attention Architecture for End-to-End ASR. Table 1: Description of the array configuration in the two-stream E2E experiments.  ...
arXiv:1811.04903v2 fatcat:q6fcqmmytrfopo35iqgqan5coa
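The hierarchical stream attention described in this abstract can be read as a softmax over per-stream relevance scores, followed by a weighted sum of each stream's context vector. A toy sketch over plain Python lists (the names and shapes are hypothetical, not the paper's notation):

```python
import math

def stream_attention(stream_contexts, stream_scores):
    """Fuse per-stream context vectors: softmax the stream-level
    scores, then return the weighted sum along with the weights."""
    m = max(stream_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in stream_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(stream_contexts[0])
    fused = [sum(w * c[i] for w, c in zip(weights, stream_contexts))
             for i in range(dim)]
    return fused, weights
```

With equal scores every stream contributes equally; as one stream's score grows, the decoder is steered toward that encoder's context.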

Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language

Abdinabi Mukhamadiyev, Ilyos Khujayarov, Oybek Djuraev, Jinsoo Cho
2022 Sensors  
In this paper, we propose an End-To-End Deep Neural Network-Hidden Markov Model speech recognition model and a hybrid Connectionist Temporal Classification (CTC)-attention network for the Uzbek language  ...  The proposed approach reduces training time and improves speech recognition accuracy by effectively using the CTC objective function in attention model training.  ...  The CTC weight for the joint training with the attention model was set to 0.3. During the test phase, the CTC weight for the joint decoding was set to 0.6.  ...
doi:10.3390/s22103683 fatcat:5iabrkpp2vhkjh6rcxrykfwei4
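The different CTC weights quoted here (0.3 for training, 0.6 for joint decoding) reflect that the same interpolation is applied twice: once over losses during training, and once over hypothesis scores at decoding time. A minimal sketch of ranking beam hypotheses by the joint score (the function name and tuple layout are illustrative):

```python
def rank_hypotheses(hyps, ctc_weight=0.6):
    """Sort (text, ctc_logp, att_logp) hypotheses by the joint score
    ctc_weight * ctc_logp + (1 - ctc_weight) * att_logp, best first."""
    def joint_score(h):
        _, ctc_logp, att_logp = h
        return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp
    return sorted(hyps, key=joint_score, reverse=True)
```

Tuning the two weights separately is common practice, since the value that best regularizes training need not be the value that best ranks finished hypotheses.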
Showing results 1 — 15 out of 79 results