Research on Neural Machine Translation Model

Mengyao Chen, Yong Li, Runqi Li
2019 Journal of Physics: Conference Series  
In neural machine translation (NMT), recurrent neural networks, especially long short-term memory (LSTM) networks and gated recurrent networks, have long been regarded as the state-of-the-art approach to sequence modeling and transduction problems such as language modeling and machine translation. A recurrent neural network processes a sequence strictly in order, from left to right or from right to left, one word at a time, so its computation cannot be parallelized and it runs slowly. With the rapid development of NMT architectures, recurrent neural networks have been effectively replaced by convolutional networks and self-attention. Convolutional neural networks replaced recurrent neural networks because convolutions can be computed in parallel. The Transformer model replaces the LSTM network with a structure built entirely on self-attention: it abandons the traditional requirement that an encoder-decoder model be built on a convolutional or recurrent neural network and uses only the self-attention mechanism. Although the biggest innovation of the Transformer architecture is its use of full self-attention, there are several other contributing factors, such as multi-head attention and residual connections. Our model flexibly combines several common building blocks of the Transformer architecture with a recurrent neural network. By borrowing the framework of the Transformer architecture without using full self-attention, our experiments show that the recurrent model can come very close to the performance of the Transformer. Our model achieved 26.7 BLEU on the WMT 2014 English-to-German translation task and 37.8 BLEU on the WMT 2014 English-to-French translation task. These two scores are very close to those of the Transformer architecture with full self-attention, so a recurrent neural network used in place of full self-attention can still perform well on these data sets.
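The idea of keeping the Transformer's residual framework while swapping the self-attention sublayer for a recurrent one can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the recurrent cell, the layer sizes, or the normalization placement, so the plain tanh cell, the post-sublayer layer normalization without learned gain or bias, and every name below (`recurrent_sublayer`, `residual_recurrent_block`, `d_model`, the random weights) are assumptions chosen for illustration.

```python
import math
import random


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def recurrent_sublayer(inputs, w_in, w_rec):
    """Assumed stand-in for the recurrent part: a plain tanh cell.

    Each step depends on the previous hidden state, so the loop is
    inherently sequential (one position at a time).
    """
    hidden = [0.0] * len(w_rec)
    outputs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        outputs.append(hidden)
    return outputs


def residual_recurrent_block(inputs, w_in, w_rec):
    """Transformer-style wrapper: LayerNorm(x + Sublayer(x)),
    with a recurrent sublayer where self-attention would normally sit."""
    recurrent_out = recurrent_sublayer(inputs, w_in, w_rec)
    return [
        layer_norm([x_i + r_i for x_i, r_i in zip(x, r)])
        for x, r in zip(inputs, recurrent_out)
    ]


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical weight initialization, for the demo only."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model = 4  # illustrative size, not taken from the paper
w_in = random_matrix(d_model, d_model, rng)
w_rec = random_matrix(d_model, d_model, rng)
sequence = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]

outputs = residual_recurrent_block(sequence, w_in, w_rec)
print(len(outputs), len(outputs[0]))
```

The block returns one vector per input position with the same width as the input, so it could in principle be stacked like a Transformer layer. The explicit loop in `recurrent_sublayer` is the sequential dependency that the abstract identifies as the obstacle to parallel computation.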
doi:10.1088/1742-6596/1237/5/052020 fatcat:nghf3oryznatboysa2t4xswlmu