
Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data

2018 Natural Language Engineering  
Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from Euronews website.  ...  This led to a multilingual (Arabic-English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words).  ...  For each month, we measure the similarity between each Arabic document and the English documents of the same month. This leads to numerous Arabic-English pairs with a similarity score for each one.  ... 
doi:10.1017/s1351324918000232 fatcat:qulmfx2ujbelbc2k5sivychmwq

Mining Documents and Sentiments in Cross-lingual Context

Motaz Saad
2016 Figshare  
First, we collect English, French and Arabic comparable corpora from Wikipedia and Euronews, and we align each corpus at the document level.  ...  Second, we present a cross-lingual document similarity measure to automatically retrieve and align comparable documents.  ...  measure is a special case of document similarity measure, The authors applied their work on French-English documents.  ... 
doi:10.6084/m9.figshare.3204040.v1 fatcat:5kb4k2kylnc7nhdumanxjw5wpe

Large Scale Parallel Document Mining for Machine Translation

Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe Dubiner
2010 International Conference on Computational Linguistics  
In contrast to other approaches which require specialized metadata, the system uses only the textual content of the documents.  ...  Results are presented for a corpus of over two billion web pages and for a large collection of digitized public-domain books.  ...  The resulting reference set contains documents in Arabic, Chinese, English, French, Russian and Spanish, however, for most English pages there is only one translation into one of the other languages.  ... 
dblp:conf/coling/UszkoreitPPD10 fatcat:etys6ivsbfh27b4aa37ac67q3a

The ADAPT Bilingual Document Alignment system at WMT16

Pintu Lohar, Haithem Afli, Chao-Hong Liu, Andy Way
2016 Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers  
We performed our experiments on English-to-French document alignments for this bilingual task.  ...  In this paper, we are interested in improving the quality of bilingual comparable corpora according to increased document alignment score.  ...  It is usually seen that on average a French translation of an English document has 1.2 words for every English word in the original.  ... 
doi:10.18653/v1/w16-2372 dblp:conf/wmt/LoharALW16 fatcat:vvj7ayuasnggdfy3usrqto7b5e

Inferring multilingual domain-specific word embeddings from large document corpora

Luca Cagliero, Moreno La Quatra
2021 IEEE Access  
It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source  ...  language (typically, English).  ...  Benchmark data consist of (i) a set of document corpora retrieved from Wikipedia and written in seven different languages (i.e., Italian, English, French, Spanish, German, Arabic, Russian), (ii) the per-language  ... 
doi:10.1109/access.2021.3118093 fatcat:pyxp6lre5naktagtgua4ucyyi4

A multimodal alignment framework for spoken documents

Dalila Mekhaldi, Denis Lalanne, Rolf Ingold
2011 Multimedia tools and applications  
Our framework that is language independent was evaluated on corpora in French and English, including meetings and scientific presentations.  ...  At the analysis level, the alignment framework was applied at several levels of granularity of documents, requiring specific document segmentation techniques.  ...  Test data The test data on which our work is based are from four different domains in two different languages: three meeting corpora, one in French and two in English, and an English scientific conference  ... 
doi:10.1007/s11042-011-0842-x fatcat:ajsquo2yefbxteagmv3aqokaie

LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [article]

Hongyu Gong, Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán
2021 arXiv   pre-print
Evaluated on cross-lingual document alignment, LAWDR demonstrates comparable performance to state-of-the-art models on benchmark datasets.  ...  However, there are two challenges: (1) these models impose high costs on long document processing and thus many of them have strict length limit; (2) model fine-tuning requires extra data and computational  ...  It identifies parallel documents, and mines massive data for machine translation ( We compare margin and cosine functions as measures of document similarity. Baselines.  ... 
arXiv:2106.03379v1 fatcat:wdaeurke5fgylnmioxybhcaipy

Subword Recognition in Historical Arabic Documents using C-GRUs

Hanadi Hassen, Somaya Al-Madeed, Ahmed Bouridane
2021 TEM Journal  
A comparison with existing techniques evaluated on the same datasets validates the effectiveness of our proposed model in characterizing Arabic subwords.  ...  Recognition of Arabic handwriting is challenging due to the highly cursive nature of the script and other challenges associated with historical documents (degradation etc.).  ...  The statements made herein are solely the responsibility of the authors.  ... 
doi:10.18421/tem104-19 fatcat:v4z4eyksevb4pedqqiozfyatky

Speech segmentation and spoken document processing

M. Ostendorf, B. Favre, R. Grishman, D. Hakkani-Tur, M. Harper, D. Hillard, J. Hirschberg, Heng Ji, J.G. Kahn, Yang Liu, S. Maskey, E. Matusov (+5 others)
2008 IEEE Signal Processing Magazine  
This article describes different levels of speech segmentation, approaches to automatically recovering segment boundary locations, and experimental results demonstrating impact on several language processing  ...  A key challenge in moving from text-based documents to such "spoken documents" is that spoken language lacks explicit punctuation and formatting, which can be crucial for good performance.  ...  [20] compared the effect of sentence segmentation quality on parsing of reference transcripts of conversational English.  ... 
doi:10.1109/msp.2008.918023 fatcat:jrry5lad2nbjpfd36vncqfyila

Paragraph text segmentation into lines with Recurrent Neural Networks

Bastien Moysset, Christopher Kermorvant, Christian Wolf, Jerome Louradour
2015 2015 13th International Conference on Document Analysis and Recognition (ICDAR)  
The main motivation is to be able to process either damaged documents, or flows of documents with a high variety of layouts and other characteristics.  ...  State-of-the-art methods to locate lines of text are based on handcrafted heuristics finetuned by the image processing community's experience.  ...  ACKNOWLEDGEMENT This work was partly funded by the French Grand Emprunt-Investissements d'Avenir program through the PACTE project.  ... 
doi:10.1109/icdar.2015.7333803 dblp:conf/icdar/MoyssetKWL15 fatcat:m5766frr25d4dlvekk6z6f2nwu

DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition [article]

Denis Coquenet and Clément Chatelain and Thierry Paquet
2022 arXiv   pre-print
For the first time, we propose an end-to-end segmentation-free architecture for the task of handwritten document recognition: the Document Attention Network.  ...  We achieve competitive results on the READ 2016 dataset at page level, as well as double-page level with a CER of 3.43% and 3.70%, respectively.  ...  This work was financially supported by the French Defense Innovation Agency and by the Normandy region.  ... 
arXiv:2203.12273v3 fatcat:dzjakyv53zeqrimgbkzrtccvm4

A Survey of Historical Document Image Datasets [article]

Konstantina Nikolaidou, Mathias Seuret, Hamam Mokayed, Marcus Liwicki
2022 arXiv   pre-print
We advocate for providing conversion tools to common formats (e.g., COCO format for computer vision tasks) and always providing a set of evaluation metrics, instead of just one, to make results comparable  ...  This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents, such as handwritten manuscripts and early prints.  ...  An evaluation of a transcription alignment system based on HMM is proposed in the paper and compared with three more reference systems.  ... 
arXiv:2203.08504v2 fatcat:ilgqqgylfzejnpccrsg7vfsncm

Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia

Alexandre Patry, Philippe Langlais
2011 Workshop on Building and Using Comparable Corpora  
We applied it on the French-English cross-language linked article pairs of Wikipedia in order to see whether parallel articles in this resource are available, and if our system is able to locate them.  ...  ., 2010) , seek for extracting parallel sentences from comparable corpora, we present PARADOCS, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts.  ...  corpus of United Nation texts, improved significantly an Arabic-to-English SMT system tested on news data.  ... 
dblp:conf/acl-bucc/PatryL11 fatcat:m6qmmaxserdu7k45g4y2s2kevi

Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents [article]

Biao Zhang, Ankur Bapna, Melvin Johnson, Ali Dabirmoghaddam, Naveen Arivazhagan, Orhan Firat
2022 arXiv   pre-print
Using simple concatenation-based DocNMT, we explore the effect of 3 factors on the transfer: the number of teacher languages with document level data, the balance between document and sentence level data  ...  We focus on the scenario of zero-shot transfer from teacher languages with document level data to student languages with no documents but sentence level data, and for the first time treat document-level  ...  In contrast, IWSLT-10 is collected from TED talks and covers translations between English and N =9 different languages, including Arabic, German, French, Italian, Japanese, Korean, Dutch, Romanian and  ... 
arXiv:2109.10341v2 fatcat:jymnbdc7ujeuple6ffauyr754a

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus [article]

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
2022 arXiv   pre-print
And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling.  ...  automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable to pre-train large generative language models as well as hopefully other applications in  ...  We will also distribute a deduplicated version of the English part of OSCAR 22.01, with a data layout similar to OSCAR 21.09 corpora.  ... 
arXiv:2201.06642v1 fatcat:n7xdk22ibngztnrgnque2625re