Domain Adaptation via Pseudo In-Domain Data Selection

Amittai Axelrod, Xiaodong He, Jianfeng Gao
2011 Conference on Empirical Methods in Natural Language Processing  
We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that the best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
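
The abstract does not spell out the three selection methods, so the following is only a minimal sketch of one plausible cross-entropy based criterion: rank each general-domain sentence by the difference between its cross-entropy under an in-domain language model and under a general-domain language model, then keep the lowest-scoring fraction. The add-one smoothed unigram models, the function names (`train_unigram`, `cross_entropy`, `select_pseudo_in_domain`), and the `fraction` parameter are all assumptions made for illustration; a real system would use stronger n-gram language models and may score both sides of the parallel corpus.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab_size):
    """Return a probability function for an add-one smoothed unigram LM.

    Assumption: a unigram model stands in for a real n-gram LM.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())

    def prob(token):
        # Counter returns 0 for unseen tokens, so smoothing covers them.
        return (counts[token] + 1) / (total + vocab_size)

    return prob


def cross_entropy(sentence, prob):
    """Per-token cross-entropy (in bits) of a sentence under an LM."""
    tokens = sentence.split()
    return -sum(math.log2(prob(t)) for t in tokens) / len(tokens)


def select_pseudo_in_domain(in_domain, general, fraction):
    """Keep the `fraction` of general sentences that look most in-domain.

    Score = H_in(s) - H_general(s); lower means closer to the in-domain
    data and less typical of the general corpus.
    """
    vocab = {tok for s in in_domain + general for tok in s.split()}
    p_in = train_unigram(in_domain, len(vocab))
    p_gen = train_unigram(general, len(vocab))

    def score(sentence):
        if not sentence.split():
            return float("inf")  # never select empty lines
        return cross_entropy(sentence, p_in) - cross_entropy(sentence, p_gen)

    keep = max(1, round(fraction * len(general)))
    return sorted(general, key=score)[:keep]
```

Calling `select_pseudo_in_domain(in_domain_sentences, general_sentences, 0.01)` would mirror the 1% subcorpus size mentioned in the abstract; the selected sentences would then serve as training data for a small domain-adapted system.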
dblp:conf/emnlp/AxelrodHG11 fatcat:raevgcmfyzdifhkenxbdqabpdu