Latent Variable Model for Multi-modal Translation

Iacer Calixto, Miguel Rios, Wilker Aziz
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
License: Creative Commons (CC BY)
Abstract: In this work, we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language. It is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Finally, we show improvements due to (i) predicting image features in addition to only conditioning on them, (ii) imposing a constraint on the KL term to promote models with non-negligible mutual information between inputs and latent variable, and (iii) training on additional target-language image descriptions (i.e. synthetic data).
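The abstract mentions a constraint on the KL term but does not give its form. As a rough illustration only, the sketch below assumes diagonal Gaussian distributions for the latent variable and uses a simple floor on the KL (often called "free bits"), which is one common way to keep the latent variable informative. The function names, the `min_nats` parameter, and the way the two reconstruction terms are combined are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the latent dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


def constrained_kl(kl, min_nats):
    """Floor the KL so the optimiser gains nothing by pushing it below min_nats.

    Assumption: a 'free bits' style floor; the abstract only says a constraint
    is imposed on the KL term.
    """
    return np.maximum(kl, min_nats)


def negative_elbo(translation_nll, image_nll, kl, min_nats=2.0):
    """Hypothetical training loss: translation and image-feature reconstruction
    terms plus the constrained KL term."""
    return translation_nll + image_nll + constrained_kl(kl, min_nats)


# Posterior with unit mean shift from a standard normal prior, 4 latent dimensions.
kl_value = gaussian_kl(np.ones(4), np.zeros(4), np.zeros(4), np.zeros(4))
loss = negative_elbo(translation_nll=10.0, image_nll=5.0, kl=kl_value)
```

With a floor like this, a posterior that collapses onto the prior (KL near zero) is charged `min_nats` anyway, so the loss no longer rewards ignoring the inputs.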
doi:10.18653/v1/p19-1642 dblp:conf/acl/CalixtoRA19 fatcat:y3t5oh36x5bqfagucebi424rwq