8 Hits in 1.6 sec

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB [article]

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin
2020 arXiv   pre-print
To evaluate the quality of the mined bitexts, we train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets.  ...  Using one unified approach for 38 languages, we were able to mine 4.5 billion parallel sentences, out of which 661 million are aligned with English. 20 language pairs have more than 30 million parallel  ...  Our experimental results seem to indicate that such an approach works surprisingly well: we are able to mine billions of parallel sentences which seem to be of high quality: NMT systems trained only on  ... 
arXiv:1911.04944v2 fatcat:pebxb3fh5namncmpdnzbeufeka

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, Angela Fan
2021 Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)   unpublished
Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, out of which only 2.9 billion are aligned with English.  ...  We illustrate the capability of our scalable mining system to create high quality training sets from one language to any other by training hundreds of different machine translation models and evaluating  ...  To the best of our knowledge, this makes CCMatrix the largest collection of high-quality mined parallel texts, with coverage over a wide variety of languages.  ... 
doi:10.18653/v1/2021.acl-long.507 fatcat:hqn7v55bfzekdf75sg2qc4stiq
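The margin-based scoring behind the CCMatrix entries above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the real system scores LASER sentence embeddings with FAISS indexes at billion-sentence scale, and the function names, the neighborhood size `k`, and the `1.06` threshold here are assumptions for the sketch.

```python
import numpy as np

def normalize(m):
    """L2-normalize each row so dot products become cosine similarities."""
    m = np.asarray(m, dtype=float)
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def mine_pairs(src_emb, tgt_emb, k=2, threshold=1.06):
    """Toy ratio-margin mining: score each candidate pair by
    cos(x, y) divided by the average similarity of x and y to their
    k nearest neighbors in the other language, then keep, for each
    source sentence, its best target if the score clears the threshold."""
    S, T = normalize(src_emb), normalize(tgt_emb)
    sim = S @ T.T                                   # all-pairs cosine similarity
    k_s, k_t = min(k, sim.shape[1]), min(k, sim.shape[0])
    nn_src = np.sort(sim, axis=1)[:, -k_s:].mean(axis=1)   # per source sentence
    nn_tgt = np.sort(sim, axis=0)[-k_t:, :].mean(axis=0)   # per target sentence
    margin = sim / ((nn_src[:, None] + nn_tgt[None, :]) / 2.0)
    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(margin[i]))
        if margin[i, j] >= threshold:
            pairs.append((i, j))
    return pairs
```

The ratio normalization is what distinguishes this from plain cosine thresholding: a pair only survives if the two sentences are markedly closer to each other than to their other near neighbors, which filters out generic "hub" sentences that are similar to everything.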

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages [article]

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep (+6 others)
2021 arXiv   pre-print
Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages.  ...  We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual  ...  We would like to thank the Robert Bosch Center for Data Science and Artificial Intelligence for supporting Sumanth and Gowtham through their Post Baccalaureate Fellowship Program.  ... 
arXiv:2104.05596v3 fatcat:zlq7dmu4w5hyljqz7slqsqf2rm

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases [article]

Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, Benoît Sagot
2021 arXiv   pre-print
Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English.  ...  We further present a method to mine such paraphrase data in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data.  ...  Acknowledgements This work was partly supported by Benoît Sagot's chair in the PRAIRIE institute, funded by the French national agency ANR as part of the "Investissements d'avenir" programme under the reference  ... 
arXiv:2005.00352v2 fatcat:m2dyquni35d7fi37q3rygoyzza
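Paraphrase mining of the kind the MUSS entry describes needs pairs that embeddings consider close but that are not mere copies of each other. A minimal filter in that spirit might look like the sketch below; the function name and both thresholds are made up for illustration, not taken from the paper.

```python
import difflib

def keep_paraphrase(a, b, cos_sim, min_cos=0.75, max_surface=0.8):
    """Keep a candidate pair if the sentence embeddings say the two
    strings are semantically close (cos_sim, computed upstream) while a
    cheap character-level check rejects near-duplicate surface forms."""
    surface = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return cos_sim >= min_cos and surface <= max_surface
```

The upper bound on surface similarity matters as much as the lower bound on embedding similarity: without it, the highest-scoring "paraphrases" mined from a web crawl tend to be boilerplate duplicates of the query sentence, which teach a simplification model nothing.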

A Survey on Low-Resource Neural Machine Translation [article]

Rui Wang and Xu Tan and Renqian Luo and Tao Qin and Tie-Yan Liu
2021 arXiv   pre-print
Neural approaches have achieved state-of-the-art accuracy on machine translation but suffer from the high cost of collecting large scale parallel data.  ...  Thus, a lot of research has been conducted for neural machine translation (NMT) with very limited parallel data, i.e., the low-resource setting.  ...  training data to ensure high-quality translation.  ... 
arXiv:2107.04239v1 fatcat:4la4zqfafzhk3l4chhmkaqrmwm

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation? [article]

En-Shiun Annie Lee, Sarubi Thillainathan, Shravan Nayak, Surangika Ranathunga, David Ifeoluwa Adelani, Ruisi Su, Arya D. McCarthy
2022 arXiv   pre-print
We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology.  ...  David Adelani acknowledges the support of the EU funded Horizon 2020 project ROXANNE under grant agreement No. 833635.  ... 
arXiv:2203.08850v3 fatcat:muqcpicbyjhsrj7a4czyvuj2r4

The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation [article]

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzman, Angela Fan
2021 arXiv   pre-print
The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations  ...  By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.  ... 
arXiv:2106.03193v1 fatcat:f6w5eadkqjdcdjlptrlvvcj6ci

A Survey on Low-Resource Neural Machine Translation

Rui Wang, Xu Tan, Renqian Luo, Tao Qin, Tie-Yan Liu
2021 Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence   unpublished
Neural approaches have achieved state-of-the-art accuracy on machine translation but suffer from the high cost of collecting large scale parallel data.  ...  Thus, a lot of research has been conducted for neural machine translation (NMT) with very limited parallel data, i.e., the low-resource setting.  ...  training data to ensure high-quality translation.  ... 
doi:10.24963/ijcai.2021/629 fatcat:jcbanv22ezeavbnm773fblctfu