Approaches to improving corpus quality for statistical machine translation
2010 International Conference on Machine Learning and Cybernetics
The performance of a statistical machine translation (SMT) system depends heavily on the quantity and quality of its bilingual language resources. However, previous work has focused mainly on quantity, trying to collect more bilingual data. In this paper, we optimize the bilingual corpus to improve the performance of the translation system: we propose approaches that process the training corpus by filtering noise and selecting the more informative sentences. Also, to
... coordinate parameter tuning with the minimum error rate training (MERT) approach, we propose two methods for selecting sentences from a large development set, based on phrase structure and sentence structure respectively. Unlike existing methods, our methods do not need as much development data but still obtain effective and robust parameters while spending little time in the MERT process. The experimental results show that our methods achieve better translation performance in both translation quality and speed.