Set-Theoretic Alignment for Comparable Corpora
2016
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We describe and evaluate a simple method to extract parallel sentences from comparable corpora. The approach, termed STACC, is based on expanded lexical sets and the Jaccard similarity coefficient. ...
board and gives significantly better results on the noisiest datasets. ...
We would like to thank MondragonLingua Translation & Communication as coordinator of these projects and the three anonymous reviewers for their helpful feedback and suggestions. ...
doi:10.18653/v1/p16-1189
dblp:conf/acl/EtchegoyhenA16
fatcat:lkb5kqkkozepvmjfr7he4nv6ai
Scalable Construction of High-Quality Web Corpora
2013
Journal for Language Technology and Computational Linguistics
We first focus on web crawling and the pros and cons of the existing crawling strategies. ...
Finally, we show how the availability of extremely large, high-quality corpora opens up new directions for research in various fields of linguistics, computational linguistics, and natural language processing ...
Acknowledgments: The second evaluation study reported in Section 4.2 is based on joint work with Sabine Bartsch. ...
dblp:journals/ldvf/BiemannBEGQSSSZ13
fatcat:eciovvcvazewnfuhk7shiksuiy
Annotated Corpora and Annotation Tools
[chapter]
2016
Anaphora Resolution
In this Chapter we review the currently available corpora to study anaphoric interpretation, and the tools that can be used to create new ones. ...
Acknowledgements: This work was supported in part by a PhD studentship offered by Cogito / Expert Systems (Kepa Rodriguez), in part by the LIVEMEMORIES project (Poesio), and in part by the SENSEI project ...
] for German, ANCORA for Catalan and Spanish [66], and LIVEMEMORIES [67] for Italian. ...
doi:10.1007/978-3-662-47909-4_4
dblp:series/tanlp/PoesioPRRV16
fatcat:4dnly5kz6zar7bfboduulhu66a
Knowledge-lean projection of coreference chains across languages
2015
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
We apply a direct projection algorithm on a multi-genre and multilingual corpus (English, German, Russian) to automatically produce coreference annotations for two target languages without exploiting any ...
Our evaluation of the projected annotations shows promising results, and the error analysis reveals structural differences of referring expressions and coreference chains for the three languages, which ...
In this paper, we report on experiments with projecting nominal coreference chains across bilingual corpora. ...
doi:10.18653/v1/w15-3403
dblp:conf/acl-bucc/GrishinaS15
fatcat:3bxejkvsdfctll6hgopqp5rnwm
Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features
[article]
2021
arXiv
pre-print
In this work, we provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus driven neural models that enable robust large scale analysis. ...
Poetry corpora do exist for a number of languages, but larger collections lack consistency and are encoded in various standards, while annotated corpora are typically constrained to a particular genre ...
Acknowledgments: We thank Gesine Fuhrmann and Debby Trzeciak for their annotations. ...
arXiv:2102.08858v2
fatcat:arcauxw5rjc2pcvl5mlg5tg6dm
Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment
2010
North American Chapter of the Association for Computational Linguistics
Results for both accuracy in sentence extraction and downstream improvement in an SMT system are presented. ...
We also include features which make use of the additional annotation given by Wikipedia, and features using an automatically induced lexicon model. ...
Experiments
Data: We annotated twenty Wikipedia article pairs for three language pairs: Spanish-English, Bulgarian-English, and German-English. ...
dblp:conf/naacl/SmithQT10
fatcat:xfpr2pq6pbcgzc6viycjydo4ne
A Survey of Available Corpora for Building Data-Driven Dialogue Systems
[article]
2017
arXiv
pre-print
We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective. ...
In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. ...
Canada Research Chairs, the Canadian Institute for Advanced Research (CIFAR) and Compute Canada. ...
arXiv:1512.05742v3
fatcat:lh34cnbvefcfxp2qwxfyiuuwhm
Building Arabic Corpora: Concepts, Methodologies, Tools, and Experiments
2019
Zenodo
The focus of corpora builders is essentially divided into three areas: corpus compilation, data processing, and corpus annotation. ...
Corpora are essential resources for computational linguistics and Natural Language Processing (NLP) fields. ...
As mentioned, Sinclair (2005) formulates the overall instructions proposed by the previous authors in ten fundamental criteria to follow in the design and the compilation of a general corpus: ...
doi:10.5281/zenodo.4441159
fatcat:nwix7lrzrbaxpgasing7mgdtwq
Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications
2022
Applied Sciences
Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase. ...
In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. ...
This section has explored corpora annotated for negation and speculation in 13 languages from six language families: Germanic languages (English, German, Swedish), Romance (French, Spanish), Uralic (Hungarian ...
doi:10.3390/app12105209
fatcat:jzm5hjhcqbbr5ck6cosat7n5zq
Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction
[article]
2020
arXiv
pre-print
We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. ...
We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities. ...
So, for the editions with more articles, we also extract all the articles from a larger sub-tree, and that favours even more the extraction of huge in-domain corpora for English and more modest ones for ...
arXiv:2005.01177v1
fatcat:i2xzqzsjjjadvnvrntutt43n3u
Linguistically-Based Comparison of Different Approaches to Building Corpora for Text Simplification: A Case Study on Italian
2022
Frontiers in Psychology
In this paper, we present an overview of existing parallel corpora for Automatic Text Simplification (ATS) in different languages focusing on the approach adopted for their construction. ...
To this end, we perform a two-level comparison on Italian corpora, since this is the only language, with the exception of English, for which there are large parallel resources derived through the two approaches ...
, and German (Suter et al., 2016). ...
doi:10.3389/fpsyg.2022.707630
pmid:35350726
pmcid:PMC8958033
fatcat:7iidq6myirappc3vsp4vnozwyu
Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora
2021
Computational Linguistics
CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet. ...
We conclude with recommendations on how to achieve generally applicable CDCR systems in the future — the most important being that evaluation on multiple CDCR corpora is strongly necessary. ...
Special thanks are due to Jan-Christoph Klie and Nafise Sadat Moosavi for the frequent exchange of ideas. This work was supported by the German Research Foundation through the ...
doi:10.1162/coli_a_00407
fatcat:wuyugyp4kbbnlplpzmcni7rz7i
Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora
[article]
2021
arXiv
pre-print
CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet. ...
We conclude with recommendations on how to achieve generally applicable CDCR systems in the future -- the most important being that evaluation on multiple CDCR corpora is strongly necessary. ...
This work was supported by the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1 and grant GU 798/17-1). ...
arXiv:2011.12249v2
fatcat:6mbqk25lvbdmzhpsjmicx6274a
Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora
2018
Frontiers in Digital Humanities
The identification and extraction of toponyms and spatial information mentioned in historical text collections has allowed its use in innovative ways, making possible the application of spatial analysis ...
In addition, the results showed that these NER systems are not strongly dependent on preprocessing and translation to Modern English. ...
Most of the NER tools that were considered for our study are based on supervised machine learning, but we nonetheless did not experiment with using the epistolary corpora for training new models, instead ...
doi:10.3389/fdigh.2018.00002
fatcat:73rk6pewyfccfe633ce4uem3xi
Polytonia
2021
Journal of Speech Sciences
Provided additional modules for the detection of prominence and prosodic boundaries, the resulting annotation may serve as an input for a phonological annotation. ...
This paper first proposes a labeling scheme for tonal aspects of speech and then describes an automatic annotation system using this transcription. ...
extraction ...
doi:10.20396/joss.v4i2.15053
fatcat:cnp7p32aa5ck5as4266qwjyycq