Improving statistical machine translation of informal language: a rule-based pre-editing approach for French Forums

Johanna Gerlach, Pierrette Bouillon
Forums are increasingly used by online communities to share information about a wide range of topics. While this content is in theory available to anyone with internet access, it is in fact accessible only to those users who understand the language in which it was written. Machine translation (MT) seems the most practical way to make this content more widely accessible, but forum data presents multiple challenges for machine translation. The central objective of the thesis is to ... the possibility of improving the outcome of statistical machine translation of French forum data through the application of pre-editing rules. In particular, our work aims to identify which transformations are useful for improving translation and whether these transformations can be applied automatically or interactively with a rule-based technology. To evaluate the impact of these rules, we propose a human comparative evaluation methodology using crowdsourcing. Results show that pre-editing significantly improves the machine translation output. To assess the usefulness of these improvements, we evaluate temporal and technical post-editing effort. Findings show that improvements coincide with reduced effort. Another aspect we consider is whether the pre-editing task can be performed in practice in a forum context. Results of a pre-editing experiment with real forum users suggest that the interactive pre-editing process is accessible, with users producing only slightly less improvement than experts. Finally, to assess the portability of the developed pre-editing process, we perform evaluations with other MT systems, notably rule-based systems, as well as with data from forums in different domains. Findings indicate that, for the most part, the developed pre-editing rules are easily portable.

First and foremost, I am deeply grateful to my advisor Pierrette Bouillon, without whom this thesis would never have come into existence. Her experience and enthusiasm have been invaluable for the completion of this work. I am greatly indebted to Sabine Lehmann for taking the time to introduce me to the Acrolinx technology and rule development, as well as for the many thought-provoking discussions. I am also very grateful to the other members of my thesis committee, Ana Guerberof, Manny Rayner and Johann Roturier, for providing valuable comments and suggestions.
I would also like to thank Aurélie Picton for accepting the role of president of the jury, and for motivating me over these last years. Many thanks to the members of the ACCEPT project for providing a stimulating research context. In particular, I would like to thank Victoria Porro, who significantly contributed to rule development and experiment setup. Many thanks also to Liliana Gaspar for setting up the pre-editing experiment with Norton forum users, and to Philip Koehn, who kindly let me use his Amazon Mechanical Turk Requester account for all my evaluations. I would also like to thank Magdalena Freund, who took on the specialisation of Systran. I am also very grateful to the translators and AMT workers who completed countless translation evaluations.

I would like to express my gratitude to my colleagues at the FTI/TIM department, Donatella, Lucìa, Marianne, Nikos, Silvia and Violeta, who have been an invaluable source of advice and motivation. Special thanks to Claudia and Tobias, who shared an office with me and had to deal with all my doubts and complaints. I would like to acknowledge funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 288769.

Lastly, I would like to thank those closest to me, my partner Simon and the wonderful cats with whom we share our home, for bearing with me through all my ups and downs. I would like to dedicate this thesis to my parents, my mother Silke and my late father Dr. Dieter Gerlach.

Contents
List of Figures
List of Tables

List of Tables
4.2 Comparative evaluation results for grammar (agreement) rules
4.3 Comparative evaluation results for grammar (mood/tense) rules
4.4 Comparative evaluation results for grammar (sequence) rules
4.5 Comparative evaluation results for combined homophone rules
4.6 Comparative evaluation results for combined punctuation rules
4.7 Comparative evaluation results for combined informal language rules
doi:10.13097/archive-ouverte/unige:73226 fatcat:6gp322c7ujhevmvmyrqqumvuvi