A Statistical Machine Translation Primer
Learning Machine Translation
Preface
Foreign languages are all around us. Modern communication technologies give us access to a wealth of information in languages we do not fully understand. Millions of internet users are online at any time with whom we cannot communicate, essentially because of a language barrier. The dream of automatic translation has fuelled a sustained interest in automated or semi-automated approaches to translation. Despite some success in limited domains and applications, and the fact that millions of web pages are translated automatically on a daily basis, Machine Translation is often met with skepticism, usually by people who do not appreciate the challenges of producing fully automated translations. It is, however, a very dynamic and fertile field where statistical approaches seem to have become mainstream, at least at the moment.
This volume is a follow-up to the workshop on Machine Learning for Multilingual Information Access organised at the NIPS conference in 2006. Several of the contributors to this volume also presented their work there. However, a number of contributions were submitted to the book only, which means that roughly half of the content of the final book is newer material that was not presented at the workshop. Compared to the original workshop, this book is also firmly focused on Statistical Machine Translation (SMT). Its aim is to investigate how Machine Learning techniques can improve various aspects of SMT.
This volume is split into two roughly equal parts. The first part deals with enabling technologies, i.e., technologies that solve problems that are not Machine Translation proper, but are closely linked to the development of a Machine Translation system. For example, Chapter 2 deals with the acquisition of bilingual sentence-aligned data from comparable corpora, a crucial task for domains or language pairs where no parallel corpus is available. Chapters 3 and 4 address the problem of identifying multilingual equivalents of various named entities. One application for such a technology would be to improve the translation of named entities across languages. Chapter 5 deals with word alignment, an essential enabling technology for most SMT systems. It shows how to leverage multiple pre-processing schemes to improve the quality of the alignment.
Finally, Chapter 6 shows how word-sequence kernels may be used to combine various types of linguistic information, and suggests that this can improve discriminative language modelling.
The second part of the book presents either new statistical MT techniques or improvements over existing techniques, relying on statistics or Machine Learning. Chapter 7 addresses the problem of including syntactic information in the translation model, which is estimated in a discriminative training framework. Chapter 8 proposes a new approach to re-ranking the hypotheses output by an SMT system trained on a very large corpus. Chapter 9 presents a novel approach to MT that eschews the traditional log-linear combination of feature functions in favour of a kernel-based approach (to our knowledge the first of its kind in the context of MT). Chapters 10 and 11 focus on improving the selection of words or phrases in the translation models. Chapter 10 combines discriminative lexical selection, based on global source-language information, with lexical re-ordering in order to produce translations. In Chapter 11, a discriminative phrase selection model is integrated with a phrase-based SMT system. That chapter also provides an interesting analysis and comparison of a large number of automatic MT evaluation metrics. Chapter 12 explores the use of semi-supervised learning to improve MT output by leveraging large amounts of untranslated material in the source language. [...] systems may be combined in order to improve the overall translation quality. This approach allows collaborative projects where several partners contribute different MT systems. System combination currently yields the best results in international evaluation campaigns.
We intend this volume to be useful to both the Machine Learning and Statistical Machine Translation communities.
We hope that the Machine Learning researcher will get a good overview of various advanced aspects of SMT, and an idea of the different ways in which learning approaches can directly influence a very challenging and important problem. We also hope that the Statistical MT researcher will be interested in this volume, in part as a presentation of some advanced topics that may not be covered in introductory SMT textbooks, and in part as a presentation of some novel Machine Learning-inspired techniques which have the potential to yield new directions of research in this fertile field.
We wish to thank the MIT Press, who gave us the opportunity to publish this volume, and in particular Susan Buckley and Robert Prior for their support in preparing the manuscript. All the contributed chapters in this volume were reviewed, and we are grateful to the reviewers who took the time to provide comments and criticism which contributed to improving the overall quality of the volume.
This first chapter is a short introduction to the main aspects of Statistical Machine Translation. In particular, we cover the issues of automatic evaluation of Machine Translation output, language modelling, word-based and phrase-based translation models, and the use of syntax in Machine Translation. We also give a quick round-up of some more recent directions that we believe may gain importance in the future. We put this in the general context of Machine Learning research, and emphasise the similarities and differences with standard Machine Learning problems and practice.