A Semi-Supervised Batch-Mode Active Learning Strategy for Improved Statistical Machine Translation

Sankaranarayanan Ananthakrishnan, Rohit Prasad, David Stallard, Prem Natarajan
2010 Conference on Computational Natural Language Learning  
The availability of substantial, in-domain parallel corpora is critical for the development of high-performance statistical machine translation (SMT) systems. Such corpora, however, are expensive to produce due to the labor-intensive nature of manual translation. We propose to alleviate this problem with a novel, semi-supervised, batch-mode active learning strategy that attempts to maximize in-domain coverage by selecting sentences that represent a balance between domain match, translation difficulty, and batch diversity. Simulation experiments on an English-to-Pashto translation task show that the proposed strategy outperforms not only the random selection baseline, but also traditional active learning techniques based on dissimilarity to existing training data. Our approach achieves a relative improvement of 45.9% in BLEU over the seed baseline, while the closest competitor gained only 24.8% with the same number of selected sentences.
dblp:conf/conll/AnanthakrishnanPSN10 fatcat:gzyispsn75hwzeibo264kdd6b4