The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task

Raúl Vázquez, Umut Sulubacak, Jörg Tiedemann
2019 Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)  
This paper describes the University of Helsinki Language Technology group's participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove poor-quality sentence pairs. Then, we produced a score for each sentence pair by weighting the outputs of these filters as features in a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
doi:10.18653/v1/w19-5441 dblp:conf/wmt/VazquezST19 fatcat:7ee2dyld5jfd5amckq3y2h32eu
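
The two-step strategy in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not list the filters, features, or classifier used, so the three filters, the two features, and the hand-set logistic weights below are all hypothetical stand-ins. In a real setup the weights would presumably be learned from labelled clean and noisy pairs.

```python
import math

# Step 1: hard filters (all hypothetical). Each returns True if the pair passes.
def non_empty(src, tgt):
    return bool(src.strip()) and bool(tgt.strip())

def not_identical(src, tgt):
    # Untranslated copies are a common kind of noise in crawled corpora.
    return src.strip().lower() != tgt.strip().lower()

def length_ratio_ok(src, tgt, max_ratio=3.0):
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio

FILTERS = [non_empty, not_identical, length_ratio_ok]

# Step 2: features weighted by a logistic model (weights are assumptions).
def features(src, tgt):
    a, b = len(src.split()), len(tgt.split())
    ratio = min(a, b) / max(a, b) if max(a, b) > 0 else 0.0
    digits_src = sorted(c for c in src if c.isdigit())
    digits_tgt = sorted(c for c in tgt if c.isdigit())
    digits_match = 1.0 if digits_src == digits_tgt else 0.0
    return [ratio, digits_match]

WEIGHTS = [2.0, 1.5]  # hypothetical, set by hand for illustration
BIAS = -1.5           # hypothetical

def score_pair(src, tgt):
    """Return 0.0 for pairs rejected by any filter, else a score in (0, 1)."""
    if not all(f(src, tgt) for f in FILTERS):
        return 0.0
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(src, tgt)))
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    pairs = [
        ("The cat sat on 2 mats", "Kissa istui 2 matolla"),
        ("The cat sat on 2 mats", "Kissa istui 3 matolla"),
        ("same text", "Same text"),
    ]
    for src, tgt in pairs:
        print(f"{score_pair(src, tgt):.3f}\t{src}\t{tgt}")
```

Assigning a score of zero to filtered pairs is one possible way to combine the two steps; it keeps every input pair in the output while ranking rejected pairs last.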