Improving statistical machine translation of informal language: a rule-based pre-editing approach for French Forums

Johanna Gerlach, Pierrette Bouillon
Forums are increasingly used by online communities to share information about a wide range of topics. While this content is in theory available to anyone with internet access, it is in fact accessible only to those users who understand the language in which it was written. Machine translation (MT) seems the most practical solution to make this content more widely accessible, but forum data presents multiple challenges for machine translation. The central objective of the thesis is to ... the possibility of improving the outcome of statistical machine translation of French forum data through the application of pre-editing rules. In particular, our work aims at identifying which transformations are useful to improve translation and whether these transformations can be applied automatically or interactively with a rule-based technology. To evaluate the impact of these rules, we propose a human comparative evaluation methodology using crowdsourcing. Results show that pre-editing significantly improves the machine translation output. To assess the usefulness of these improvements, we perform an evaluation of temporal and technical post-editing effort. Findings show that improvements coincide with reduced effort. Another aspect we consider is whether the pre-editing task can concretely be performed in a forum context. Results of a pre-editing experiment with real forum users suggest that the interactive pre-editing process is accessible, with users producing only slightly less improvement than experts. Finally, to assess the portability of the developed pre-editing process, we perform evaluations with other MT systems, notably rule-based systems, as well as with data from forums from different domains. Findings indicate that, for the most part, the developed pre-editing rules are easily portable.
