Submodularity for Data Selection in Machine Translation

Katrin Kirchhoff, Jeff Bilmes
2014 Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)  
We introduce submodular optimization to the problem of training data subset selection for statistical machine translation (SMT). By explicitly formulating data selection as a submodular program, we obtain fast scalable selection algorithms with mathematical performance guarantees, resulting in a unified framework that clarifies existing approaches and also makes both new and many previous approaches easily accessible. We present a new class of submodular functions designed specifically for SMT and evaluate them on two different translation tasks. Our results show that our best submodular method significantly outperforms several baseline methods, including the widely-used cross-entropy based data selection method. In addition, our approach easily scales to large data sets and is applicable to other data selection problems in natural language processing.
doi:10.3115/v1/d14-1014 dblp:conf/emnlp/KirchhoffB14 fatcat:6u37jujmffg77geszinbxsthcm
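
The abstract describes data selection as submodular maximization solved by fast algorithms with performance guarantees. The sketch below illustrates that general idea only; it is not taken from the paper. It assumes a cardinality budget (a fixed number of sentences), uses word unigrams and bigrams as features, and scores a subset with a square-root-of-counts objective, which is monotone submodular. The names `ngrams` and `greedy_select` are hypothetical. For monotone submodular objectives under a cardinality constraint, the standard greedy algorithm is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
import math
from collections import Counter


def ngrams(sentence, n_max=2):
    """Count word n-grams (n = 1..n_max) in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def greedy_select(sentences, budget):
    """Greedily pick up to `budget` sentence indices.

    Assumed objective (illustrative, not the paper's):
        f(S) = sum over n-grams u of sqrt(count of u in S)
    The concave square root gives diminishing returns, so sentences
    that repeat already-covered n-grams gain less.
    """
    feats = [ngrams(s) for s in sentences]
    totals = Counter()          # n-gram counts over the selected subset
    selected = []
    remaining = set(range(len(sentences)))

    while remaining and len(selected) < budget:
        best, best_gain = None, 0.0
        for i in sorted(remaining):
            # Marginal gain of adding sentence i to the current subset.
            gain = sum(
                math.sqrt(totals[u] + c) - math.sqrt(totals[u])
                for u, c in feats[i].items()
            )
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:        # no candidate adds any value
            break
        selected.append(best)
        remaining.remove(best)
        totals.update(feats[best])
    return selected


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat sat", "a dog barked"]
    print(greedy_select(corpus, budget=2))
```

On the three-sentence toy corpus above with a budget of 2, the sketch selects indices 0 and 2: after the first sentence is chosen, its exact duplicate has a smaller marginal gain than the unseen third sentence. This version recomputes every marginal gain in each round, which is quadratic in the corpus size; lazy evaluation with a priority queue is a common way to speed up greedy selection on large corpora.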