An Unsupervised Method for Automatic Translation Memory Cleaning

Masoud Jalili Sabet, Matteo Negri, Marco Turchi, Eduard Barbu
2016 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)  
We address the problem of automatically cleaning a large-scale Translation Memory (TM) in a fully unsupervised fashion, i.e., without human-labelled data. We approach the task by: i) designing a set of features that capture the similarity between two text segments in different languages, ii) using them to induce reliable training labels for a subset of the translation units (TUs) contained in the TM, and iii) using the automatically labelled data to train an ensemble of binary classifiers. We apply our method to clean a test set composed of 1,000 TUs randomly extracted from the English-Italian version of MyMemory, the world's largest public TM. Our results show competitive performance not only against a strong baseline that exploits machine translation, but also against a state-of-the-art method that relies on human-labelled data.
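The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three features (length ratio, number overlap, punctuation overlap), the rule that labels the `k` highest- and lowest-scoring TUs, and the per-feature threshold voters are all assumptions chosen to keep the example self-contained, and the sample TUs are invented for illustration.

```python
import re
from statistics import mean

def features(src, tgt):
    """Hypothetical language-independent similarity features, each in [0, 1]."""
    ls, lt = len(src.split()), len(tgt.split())
    length_ratio = min(ls, lt) / max(ls, lt) if max(ls, lt) else 0.0
    ns, nt = set(re.findall(r"\d+", src)), set(re.findall(r"\d+", tgt))
    number_match = len(ns & nt) / len(ns | nt) if ns | nt else 1.0
    ps, pt = set(re.findall(r"[^\w\s]", src)), set(re.findall(r"[^\w\s]", tgt))
    punct_match = len(ps & pt) / len(ps | pt) if ps | pt else 1.0
    return [length_ratio, number_match, punct_match]

def induce_labels(vectors, k):
    """Label the k lowest-scoring TUs 0 and the k highest-scoring TUs 1.

    The score is the mean feature value; all other TUs stay unlabelled.
    Assumes len(vectors) >= 2 * k.
    """
    order = sorted(range(len(vectors)), key=lambda i: mean(vectors[i]))
    labels = {i: 0 for i in order[:k]}
    labels.update({i: 1 for i in order[-k:]})
    return labels

def train_ensemble(vectors, labels):
    """One threshold per feature, midway between the two class means."""
    thresholds = []
    for j in range(len(vectors[0])):
        pos = [vectors[i][j] for i, y in labels.items() if y == 1]
        neg = [vectors[i][j] for i, y in labels.items() if y == 0]
        thresholds.append((mean(pos) + mean(neg)) / 2)
    return thresholds

def predict(thresholds, vector):
    """Majority vote: a feature votes 'good' when it reaches its threshold."""
    votes = sum(1 for x, t in zip(vector, thresholds) if x >= t)
    return 1 if 2 * votes > len(thresholds) else 0

# Invented English-Italian TUs, for illustration only.
tus = [
    ("The invoice 42 is due on Monday.", "La fattura 42 scade lunedì."),
    ("Press the red button.", "Premere il pulsante rosso."),
    ("Page 3 of 10", "Pagina 3 di 10"),
    ("Click here to download the file now, please!", "OK"),
    ("Version 2.0 released", "Grazie"),
    ("Total: 100 euros", "Totale: 250 euro?"),
]
vectors = [features(s, t) for s, t in tus]
labels = induce_labels(vectors, k=2)
model = train_ensemble(vectors, labels)
keep = [tu for tu, v in zip(tus, vectors) if predict(model, v) == 1]
```

Only the confidently scored TUs receive induced labels; the ensemble is then applied to every TU, including those left unlabelled. A realistic system would replace the threshold voters with trained classifiers and use a much richer feature set.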
doi:10.18653/v1/p16-2047 dblp:conf/acl/SabetNTB16 fatcat:hsnrnty7qrh5nbphehseltfrsi