On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu
Interspeech 2020
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. Because E2E models are more data-hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in the streaming mode if its encoder is properly initialized. Among all three E2E models, Transformer-AED achieves the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.

Index Terms: end-to-end, RNN-transducer, attention-based encoder-decoder, transformer

Popular End-to-End Models

In this section, we give a brief introduction to the current popular E2E models: RNN-T, RNN-AED, and Transformer-AED. All of these models have an acoustic encoder, which generates a high-level representation of the speech, and a decoder, which autoregressively generates output tokens in the linguistic domain. While the acoustic encoders can be the same, the decoders of RNN-T and AED operate in different ways.
doi:10.21437/Interspeech.2020-2846