Statistical machine translation

2010 ChoiceReviews  
Automatic translation from one human language to another using computers, better known as machine translation (MT), is a longstanding goal of computer science. In order to be able to perform such a task, the computer must "know" the two languages: synonyms for words and phrases, grammars of the two languages, and semantic or world knowledge. One way to incorporate such knowledge into a computer is to use bilingual experts to hand-craft the necessary information into the computer program. Another is
to let the computer learn some of these things automatically by examining large amounts of parallel text: documents which are translations of each other. The Canadian government produces one such resource, for example, in the form of parliamentary proceedings which are recorded in both English and French. Recently, statistical data analysis has been used to gather MT knowledge automatically from parallel bilingual text. Unfortunately, these techniques and tools have not been disseminated to the scientific community in very usable form, and new follow-on ideas have developed sporadically. In a six-week summer workshop at Johns Hopkins University, we constructed a basic statistical MT toolkit called Egypt, intended for distribution to interested researchers. We also used the toolkit as a platform for experimentation during the workshop. Our experiments included working with distant language pairs such as Czech-English, rapidly porting to new language pairs, coping with small bilingual data sets, speeding up algorithms for decoding and bilingual-text training, and incorporating morphology, syntax, dictionaries, and cognates. Late in the workshop, we built an MT system for a new language pair (Chinese-English) in a single day. We describe both the toolkit and the experiments in this report.

Our goals for the workshop were to:

1. Construct a statistical MT toolkit for distribution to interested researchers.

2. Build a Czech-English machine translation system during the workshop, using this toolkit.

3. Perform baseline evaluations. These evaluations would consist of both objective measures (statistical model perplexity) and subjective measures (human judgments of quality), as well as attempts to correlate the two. We would also produce learning curves that show how system performance changes when we vary the amount of bilingual training text.

4. Improve baseline results through the use of morphological and syntactic transducers.

5. Late in the workshop, build a translator for a new language pair in a single day.

We largely achieved these goals, as described in this report. We also had time to perform some unanticipated beyond-the-baseline experiments: speeding up bilingual-text training, using online dictionaries, and using language cognates. Finally, we built additional unanticipated tools to support these goals, including a sophisticated graphical interface for browsing word-by-word alignments, several corpus preparation and analysis tools, and a human-judgment evaluation interface.

A word alignment a between an English sentence e of length l and a French sentence f of length m is scored by four probability tables: word translation t, fertility n, null generation p, and distortion d, the last contributing the factor

    \prod_{j:\,a_j \neq 0} d(j \mid a_j, l, m)

Training is a matter of inducing those four probability tables from a bilingual corpus. This is what GIZA does. The basic idea is to bootstrap. For a given English word e, we initially pretend that all French words are equally likely translations. For a given sentence pair, all alignments will therefore look equally likely as well. Every time we see a certain word pair co-occurring in an alignment, we mark down a "count." After we have traversed the entire corpus, we normalize these counts to create a new word-translation table. The same goes for the fertility, null, and distortion tables. According to these new tables, some alignments will now be more probable than others. We collect counts again, but now weigh co-occurrences by the probability of the alignments that they occur in. This is the EM algorithm:

    set t, d, n, and p tables uniformly
    for several iterations:
        set up count tables tc, dc, nc, and pc with zero entries
        for each sentence pair (e, f) of lengths l, m:
            for each word alignment a of (e, f):
                compute the probability of a under the current tables
                add that probability to the tc, dc, nc, and pc entries used by a
        normalize tc, dc, nc, and pc to produce new t, d, n, and p tables
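As a concrete illustration of this count-and-normalize loop, here is a minimal sketch in Python for the word-translation table t alone, in the spirit of the simpler Model 1, where expected counts can be computed exactly without enumerating alignments. The function name, corpus format, and iteration count are illustrative assumptions, not part of the Egypt toolkit.

    from collections import defaultdict

    def train_word_translation(corpus, iterations=5):
        """EM bootstrap for the word-translation table t(f | e) only.

        corpus: list of (english_tokens, french_tokens) sentence pairs.
        Returns a dict-like table t[(f, e)] = P(f | e).
        """
        # Start uniform: every French word is an equally likely translation.
        french_vocab = {f for _, fs in corpus for f in fs}
        t = defaultdict(lambda: 1.0 / len(french_vocab))

        for _ in range(iterations):
            counts = defaultdict(float)   # expected co-occurrence counts (tc)
            totals = defaultdict(float)   # per-English-word normalizers

            for es, fs in corpus:
                es_null = [None] + es     # position 0 is the invisible NULL word
                for f in fs:
                    # Fractional count of (f, e), summed over all alignments.
                    norm = sum(t[(f, e)] for e in es_null)
                    for e in es_null:
                        frac = t[(f, e)] / norm
                        counts[(f, e)] += frac
                        totals[e] += frac

            # Normalize the counts into a new translation table.
            for (f, e), c in counts.items():
                t[(f, e)] = c / totals[e]
        return t

    # Toy usage: after a few iterations, t[("maison", "house")] should come
    # to dominate t[("maison", "the")], since "the" also co-occurs with
    # "le" and "livre" elsewhere.
    corpus = [(["the", "house"], ["la", "maison"]),
              (["the", "book"], ["le", "livre"]),
              (["a", "book"], ["un", "livre"])]
    t = train_word_translation(corpus)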
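In later iterations, the co-occurrence counts are weighed by alignment probabilities computed from all four tables. As a hedged sketch (not the toolkit's actual code), scoring a single alignment might look like the following; the combinatorial factors for fertility orderings and null-word placement are elided, and all table formats are illustrative assumptions.

    def score_alignment(e_words, f_words, a, t, n, d, p0, p1):
        """Score one word alignment under the four tables (sketch only;
        combinatorial fertility/null-placement factors are elided).

        e_words: English tokens, positions 1..l.
        f_words: French tokens, positions 1..m.
        a: a[j] is the English position aligned to French position j
           (1-based; a[0] is a dummy entry; 0 means the NULL word).
        t, n, d: dicts keyed as t[(f, e)], n[(phi, e)], d[(j, i, l, m)].
        p0, p1: null-generation probabilities.
        """
        l, m = len(e_words), len(f_words)
        fert = [0] * (l + 1)           # fertility phi_i of each English position
        for j in range(1, m + 1):
            fert[a[j]] += 1
        phi0 = fert[0]                 # French words generated by NULL

        score = (p1 ** phi0) * (p0 ** (m - 2 * phi0))   # null-generation terms
        for i in range(1, l + 1):
            score *= n[(fert[i], e_words[i - 1])]       # fertility table
        for j in range(1, m + 1):
            e = e_words[a[j] - 1] if a[j] != 0 else None
            score *= t[(f_words[j - 1], e)]             # word-translation table
            if a[j] != 0:
                score *= d[(j, a[j], l, m)]             # distortion table
        return score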
doi:10.5860/choice.47-6293