Survey of data-selection methods in statistical machine translation

Sauleh Eetemadi, William Lewis, Kristina Toutanova, Hayder Radha
2015 Machine Translation  
Statistical Machine Translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e. data-to-models takes an inordinate amount of time). Moreover, the training data has a wide quality spectrum. A variety of methods for data cleaning and data selection have been developed to address these issues. Each of these methods employs a search or filtering algorithm to select a subset of the data, given a defined set of feature functions. In this paper we provide a comparative overview of research in this area based on application scenario, feature functions and search method.
doi:10.1007/s10590-015-9176-1 fatcat:e3nogrn42bbwtj4hfcurt72wme