Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems

Vikas Joshi, Amit Das, Eric Sun, Rupesh R. Mehta, Jinyu Li, Yifan Gong
2021 Conference of the International Speech Communication Association  
Multilingual end-to-end (E2E) automatic speech recognition (ASR) systems have manifold advantages. They simplify the training strategy, are easier to scale, and exhibit better performance than monolingual models. However, it is still challenging to use a single multilingual model to recognize multiple languages without knowing the input language, as most multilingual models assume that the input language is known. In this paper, we introduce a multi-softmax model to improve multilingual recurrent neural network transducer (RNN-T) models by having language-specific softmax, joint, and embedding layers while sharing the rest of the parameters. We extend the multi-softmax model to work without knowing the input language by integrating a language identification (LID) model that estimates the LID on the fly and performs recognition at the same time. The multi-softmax model outperforms monolingual models with an average relative word error rate reduction (WERR) of 4.65% on Indian languages. Fine-tuning further improves the WERR to 12.2%. The multi-softmax model with on-the-fly LID estimation shows a WERR of 13.86% compared to the multilingual baseline.
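The idea of language-specific output layers on top of shared parameters can be illustrated with a toy sketch. The code below is not the authors' implementation and is not an RNN-T: it reduces the model to one shared projection, one output (softmax) layer per language, and a small LID classifier, and it omits the language-specific joint and embedding layers entirely. The class name `MultiSoftmaxSketch`, the language codes `"hi"` and `"ta"`, all dimensions, and the rule of picking the head with the highest LID posterior are assumptions made for illustration; the abstract does not say how the LID estimate is combined with recognition.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultiSoftmaxSketch:
    """Toy model: shared projection plus one softmax head per language."""

    def __init__(self, vocab_sizes, feat_dim=8, hidden_dim=4, seed=0):
        rng = random.Random(seed)
        self.languages = list(vocab_sizes)
        # Shared parameters, used for every language.
        self.shared = make_matrix(hidden_dim, feat_dim, rng)
        # Language-specific output layers, one per language (hypothetical sizes).
        self.heads = {
            lang: make_matrix(size, hidden_dim, rng)
            for lang, size in vocab_sizes.items()
        }
        # LID classifier on the shared representation (an assumption).
        self.lid = make_matrix(len(self.languages), hidden_dim, rng)

    def encode(self, features):
        return [math.tanh(h) for h in linear(self.shared, features)]

    def lid_posteriors(self, hidden):
        probs = softmax(linear(self.lid, hidden))
        return dict(zip(self.languages, probs))

    def token_posteriors(self, features, language=None):
        """Return (language, token distribution) for one feature frame.

        If `language` is None, the head is chosen from the LID posterior
        (an assumed selection rule, not taken from the paper).
        """
        hidden = self.encode(features)
        if language is None:
            posteriors = self.lid_posteriors(hidden)
            language = max(posteriors, key=posteriors.get)
        return language, softmax(linear(self.heads[language], hidden))


model = MultiSoftmaxSketch({"hi": 5, "ta": 7})
frame = [0.1] * 8
known_lang, known_probs = model.token_posteriors(frame, language="ta")
guessed_lang, guessed_probs = model.token_posteriors(frame)
```

With a known language, the caller selects the head directly; without one, the sketch falls back to the LID estimate, so each output distribution always has the vocabulary size of the language whose head was used.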
doi:10.21437/interspeech.2021-1298 dblp:conf/interspeech/JoshiDSM0021 fatcat:7m72xcad6femxibpisgmuvxcgm