Set-Theoretic Alignment for Comparable Corpora

Thierry Etchegoyhen, Andoni Azpeitia
2016 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)  
We describe and evaluate a simple method to extract parallel sentences from comparable corpora. The approach, termed STACC, is based on expanded lexical sets and the Jaccard similarity coefficient.  ...  board and gives significantly better results on the noisiest datasets.  ...  We would like to thank MondragonLingua Translation & Communication as coordinator of these projects and the three anonymous reviewers for their helpful feedback and suggestions.  ... 
doi:10.18653/v1/p16-1189 dblp:conf/acl/EtchegoyhenA16 fatcat:lkb5kqkkozepvmjfr7he4nv6ai
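
As an illustration of the Jaccard similarity coefficient named in the STACC abstract above, here is a minimal sketch over plain token sets. It does not reproduce the paper's lexical expansion step; the function names `jaccard` and `token_set`, the whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|.

    Returns 0.0 when both sets are empty (a convention chosen here,
    not something taken from the paper).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def token_set(sentence: str) -> set[str]:
    """Hypothetical helper: lowercase and split on whitespace."""
    return set(sentence.lower().split())


# Two toy sentences sharing four of six distinct tokens.
src = token_set("the cat sat on the mat")
tgt = token_set("the cat lay on the mat")
score = jaccard(src, tgt)
print(f"{score:.3f}")
```

A sentence-pair extractor built on this idea would typically keep candidate pairs whose score exceeds some threshold; the threshold and any set expansion are left out of this sketch.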

Scalable Construction of High-Quality Web Corpora

Chris Biemann, Felix Bildhauer, Stefan Evert, Dirk Goldhahn, Uwe Quasthoff, Roland Schäfer, Johannes Simon, Leonard Swiezinski, Torsten Zesch
2013 Journal for Language Technology and Computational Linguistics  
We first focus on web crawling and the pros and cons of the existing crawling strategies.  ...  Finally, we show how the availability of extremely large, high-quality corpora opens up new directions for research in various fields of linguistics, computational linguistics, and natural language processing  ...  Acknowledgments The second evaluation study reported in Section 4.2 is based on joint work with Sabine Bartsch.  ... 
dblp:journals/ldvf/BiemannBEGQSSSZ13 fatcat:eciovvcvazewnfuhk7shiksuiy

Annotated Corpora and Annotation Tools [chapter]

Massimo Poesio, Sameer Pradhan, Marta Recasens, Kepa Rodriguez, Yannick Versley
2016 Anaphora Resolution  
In this Chapter we review the currently available corpora to study anaphoric interpretation, and the tools that can be used to create new ones.  ...  Acknowledgements This work was supported in part by a PhD studentship offered by Cogito / Expert Systems (Kepa Rodriguez), in part by the LIVEMEMORIES project (Poesio), and in part by the SENSEI project  ...  ] for German, ANCORA for Catalan and Spanish [66], and LIVEMEMORIES [67] for Italian.  ...
doi:10.1007/978-3-662-47909-4_4 dblp:series/tanlp/PoesioPRRV16 fatcat:4dnly5kz6zar7bfboduulhu66a

Knowledge-lean projection of coreference chains across languages

Yulia Grishina, Manfred Stede
2015 Proceedings of the Eighth Workshop on Building and Using Comparable Corpora  
We apply a direct projection algorithm on a multi-genre and multilingual corpus (English, German, Russian) to automatically produce coreference annotations for two target languages without exploiting any  ...  Our evaluation of the projected annotations shows promising results, and the error analysis reveals structural differences of referring expressions and coreference chains for the three languages, which  ...  In this paper, we report on experiments with projecting nominal coreference chains across bilingual corpora.  ... 
doi:10.18653/v1/w15-3403 dblp:conf/acl-bucc/GrishinaS15 fatcat:3bxejkvsdfctll6hgopqp5rnwm

Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features [article]

Thomas Haider
2021 arXiv   pre-print
In this work, we provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus driven neural models that enable robust large scale analysis.  ...  Poetry corpora do exist for a number of languages, but larger collections lack consistency and are encoded in various standards, while annotated corpora are typically constrained to a particular genre  ...  Acknowledgments We thank Gesine Fuhrmann and Debby Trzeciak for their annotations.  ... 
arXiv:2102.08858v2 fatcat:arcauxw5rjc2pcvl5mlg5tg6dm

Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment

Jason R. Smith, Chris Quirk, Kristina Toutanova
2010 North American Chapter of the Association for Computational Linguistics  
Results for both accuracy in sentence extraction and downstream improvement in an SMT system are presented.  ...  We also include features which make use of the additional annotation given by Wikipedia, and features using an automatically induced lexicon model.  ...  Experiments Data We annotated twenty Wikipedia article pairs for three language pairs: Spanish-English, Bulgarian-English, and German-English.  ... 
dblp:conf/naacl/SmithQT10 fatcat:xfpr2pq6pbcgzc6viycjydo4ne

A Survey of Available Corpora for Building Data-Driven Dialogue Systems [article]

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, Joelle Pineau
2017 arXiv   pre-print
We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.  ...  In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge.  ...  Canada Research Chairs, the Canadian Institute for Advanced Research (CIFAR) and Compute Canada.  ... 
arXiv:1512.05742v3 fatcat:lh34cnbvefcfxp2qwxfyiuuwhm

Building Arabic Corpora: Concepts, Methodologies, Tools, and Experiments

Imad Zeroual, Abdelhak Lakhouaja
2019 Zenodo  
The focus of corpora builders is essentially divided into three areas: corpus compilation, data processing, and corpus annotation.  ...  Corpora are essential resources for computational linguistics and Natural Language Processing (NLP) fields.  ...  As mentioned, Sinclair (2005) formulates the overall instructions proposed by the previous authors in ten fundamental criteria to follow in the design and the compilation of a general corpus:  ... 
doi:10.5281/zenodo.4441159 fatcat:nwix7lrzrbaxpgasing7mgdtwq

Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications

Ahmed Mahany, Heba Khaled, Nouh Sabri Elmitwally, Naif Aljohani, Said Ghoniemy
2022 Applied Sciences  
Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase.  ...  In this article, we review the corpora annotated with negation and speculation in various natural languages and domains.  ...  This section has explored corpora annotated for negation and speculation in 13 languages from six language families: Germanic languages (English, German, Swedish), Romance (French, Spanish), Uralic (Hungarian  ... 
doi:10.3390/app12105209 fatcat:jzm5hjhcqbbr5ck6cosat7n5zq

Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction [article]

Cristina España-Bonet, Alberto Barrón-Cedeño, Lluís Màrquez
2020 arXiv   pre-print
We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains.  ...  We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities.  ...  So, for the editions with more articles, we also extract all the articles from a larger sub-tree, and that favours even more the extraction of huge in-domain corpora for English and more modest ones for  ... 
arXiv:2005.01177v1 fatcat:i2xzqzsjjjadvnvrntutt43n3u

Linguistically-Based Comparison of Different Approaches to Building Corpora for Text Simplification: A Case Study on Italian

Dominique Brunato, Felice Dell'Orletta, Giulia Venturi
2022 Frontiers in Psychology  
In this paper, we present an overview of existing parallel corpora for Automatic Text Simplification (ATS) in different languages focusing on the approach adopted for their construction.  ...  To this end, we perform a two-level comparison on Italian corpora, since this is the only language, with the exception of English, for which there are large parallel resources derived through the two approaches  ...  , and German (Suter et al., 2016).  ...
doi:10.3389/fpsyg.2022.707630 pmid:35350726 pmcid:PMC8958033 fatcat:7iidq6myirappc3vsp4vnozwyu

Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora

Michael Bugert, Nils Reimers, Iryna Gurevych
2021 Computational Linguistics  
CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet.  ...  We conclude with recommendations on how to achieve generally applicable CDCR systems in the future — the most important being that evaluation on multiple CDCR corpora is strongly necessary.  ...  Special thanks are due to Jan-Christoph Klie and Nafise Sadat Moosavi for the frequent exchange of ideas. This work was supported by the German Research Foundation through the  ... 
doi:10.1162/coli_a_00407 fatcat:wuyugyp4kbbnlplpzmcni7rz7i

Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora [article]

Michael Bugert and Nils Reimers and Iryna Gurevych
2021 arXiv   pre-print
CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet.  ...  We conclude with recommendations on how to achieve generally applicable CDCR systems in the future -- the most important being that evaluation on multiple CDCR corpora is strongly necessary.  ...  This work was supported by the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1 and grant GU 798/17-1).  ... 
arXiv:2011.12249v2 fatcat:6mbqk25lvbdmzhpsjmicx6274a

Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora

Miguel Won, Patricia Murrieta-Flores, Bruno Martins
2018 Frontiers in Digital Humanities  
The identification and extraction of toponyms and spatial information mentioned in historical text collections has allowed its use in innovative ways, making possible the application of spatial analysis  ...  In addition, the results showed that these NER systems are not strongly dependent on preprocessing and translation to Modern English.  ...  Most of the NER tools that were considered for our study are based on supervised machine learning, but we nonetheless did not experiment with using the epistolary corpora for training new models, instead  ... 
doi:10.3389/fdigh.2018.00002 fatcat:73rk6pewyfccfe633ce4uem3xi


Piet Mertens
2021 Journal of Speech Sciences  
Provided additional modules for the detection of prominence and prosodic boundaries, the resulting annotation may serve as an input for a phonological annotation.  ...  This paper first proposes a labeling scheme for tonal aspects of speech and then describes an automatic annotation system using this transcription.  ...  extraction  ... 
doi:10.20396/joss.v4i2.15053 fatcat:cnp7p32aa5ck5as4266qwjyycq