Tree-Based Statistical Machine Translation: Experiments with the English and Brazilian Portuguese Pair

Daniel Beck, Helena Caseli
2013 Learning and Nonlinear Models  
Abstract - Machine Learning paradigms have dominated recent research in Machine Translation. Current state-of-the-art approaches rely only on statistical methods that gather all necessary knowledge from parallel corpora. However, this lack of explicit linguistic knowledge makes them unable to model some linguistic phenomena. In this work, we focus on models that take into account syntactic information from the languages involved in the translation process. We follow a novel approach that preprocesses parallel corpora using syntactic parsers and uses translation models composed of Tree Transducers. We perform experiments with English and Brazilian Portuguese, providing the first known results in syntax-based Statistical Machine Translation for this language pair. These results show that this approach is better able to model phenomena such as long-distance reordering, and they point to directions for future improvements in building syntax-based translation models for this pair.

Introduction

Statistical Machine Translation (SMT) is the process of translating from one natural language to another using statistical models and machine learning techniques [1].
In the last twenty years, SMT has become the main research focus in Machine Translation, mainly due to the availability of massive parallel data on the web and improvements in computational performance. The idea of SMT is to take advantage of this data to automatically build statistical (language and translation) models that infer the linguistic knowledge needed for translation. Through improvements to the statistical models and training algorithms, many advances have been made in SMT since it was first proposed in [2]. Current state-of-the-art SMT systems implement phrase-based models (PB-SMT), which use phrases as the translation unit [3, 4]; a phrase in this context is defined as any sequence of words. These models do not use any explicit linguistic knowledge, relying only on the implicit knowledge provided by the corpus. In previous work, [5] performed PB-SMT experiments between Brazilian Portuguese and both English and Spanish. The results were promising: in some experiments, the PB-SMT systems outperformed handcrafted rule-based systems. In recent years, however, these advances have been slowing down: purely statistical changes have not brought any significant improvement in translation performance. The following example shows an English sentence translated into Brazilian Portuguese by a PB-SMT system, along with the reference translation produced by a human specialist:

Source sentence: The oldest poems are translated.
PB-SMT translation: O mais antigo poemas são traduzidos.
Reference translation: Traduzidos poemas mais antigos.
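
To make the notion of a "phrase" as a translation unit concrete, the following minimal Python sketch enumerates the candidate phrases of the example source sentence. It is an illustration only, not the extraction procedure of any particular PB-SMT system: the function name `extract_phrases`, the `max_len` limit, and the whitespace tokenization are assumptions introduced here, and a phrase is read as a contiguous sequence of words.

```python
def extract_phrases(sentence, max_len=3):
    """Return every contiguous word sequence of up to max_len words.

    Hypothetical helper for illustration; real PB-SMT systems extract
    phrase pairs from word-aligned parallel corpora instead.
    """
    words = sentence.split()  # naive whitespace tokenization (assumption)
    phrases = []
    for start in range(len(words)):
        # The end index is exclusive and capped at the sentence length.
        last_end = min(start + max_len, len(words))
        for end in range(start + 1, last_end + 1):
            phrases.append(" ".join(words[start:end]))
    return phrases


if __name__ == "__main__":
    # Example source sentence from the text, with the final period removed.
    for phrase in extract_phrases("The oldest poems are translated", max_len=2):
        print(phrase)
```

Note that units such as "The oldest" or "are translated" need not be syntactic constituents, which is why such models carry no explicit linguistic knowledge.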
doi:10.21528/lnlm-vol11-no1-art2 fatcat:3lrwss65bzhyfpjhhx5jqbva64