Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction [article]

Stefan Heid, Marcel Wever, Eyke Hüllermeier
2021 arXiv   pre-print
Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks.  ...  In this paper, we consider POS tagging within the framework of set-valued prediction, which allows the POS tagger to express its uncertainty via predicting a set of candidate POS tags instead of guessing  ...  Our approach of set-valued prediction for part-of-speech tagging is then presented in Section IV and evaluated empirically in Section V.  ... 
arXiv:2008.01377v3 fatcat:z4umy7tokfhjrmdxokr3klurxm

Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features [article]

Thomas Haider
2021 arXiv   pre-print
A prerequisite for the computational study of literature is the availability of properly digitized texts, ideally with reliable meta-data and ground-truth annotation.  ...  Poetry corpora do exist for a number of languages, but larger collections lack consistency and are encoded in various standards, while annotated corpora are typically constrained to a particular genre  ...  Accent Ratio of Part-of-Speech Previous research has noted that part-of-speech annotation provides a good signal for the stress of words (Nenkova et al., 2007; Greene et al., 2010).  ... 
arXiv:2102.08858v2 fatcat:arcauxw5rjc2pcvl5mlg5tg6dm

An automatic part-of-speech tagger for Middle Low German

Mariya Koleva, Melissa Farasyn, Bart Desmet, Anne Breitbarth, Véronique Hoste
2017 International Journal of Corpus Linguistics  
The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG).  ...  Such corpora have recently been developed for a variety of historical languages, or are still under development.  ...  Introduction Corpora of historical texts annotated with different levels of grammatical information, such as parts of speech, (inflectional) morphology, syntactic chunks, clausal syntax, provide an important  ... 
doi:10.1075/ijcl.22.1.05kol fatcat:kbfkv7wii5ajheiwvtq3zjnogq

Predicting the Direction of Derivation in English Conversion

Max Kisselew, Laura Rimell, Alexis Palmer, Sebastian Padó
2016 Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology  
We achieve the best overall account of the historical data by taking both frequency and semantic specificity into account.  ...  This paper investigates whether distributional information can be used to predict the diachronic direction of conversion for homophonous noun-verb pairs.  ...  AP acknowledges Leibniz Association grant SAS-2015-IDS-LWC and the Ministry of Science, Research, and Art of Baden-Württemberg.  ... 
doi:10.18653/v1/w16-2015 dblp:conf/sigmorphon/KisselewRPP16 fatcat:junhjcj3vvar7iqmavtsbkwize

Supervised collaboration for syntactic annotation of Quranic Arabic

Kais Dukes, Eric Atwell, Nizar Habash
2011 Language Resources and Evaluation  
The Quranic Arabic Corpus (http://corpus.quran.com) is a collaboratively constructed linguistic resource initiated at the University of Leeds, with multiple layers of annotation including part-of-speech  ...  The Quran also benefits from a large body of existing historical grammatical analysis, which may be leveraged during this review.  ...  We thank Wajdi Zaghouani at the Linguistic Data Consortium, University of Pennsylvania for assistance in devising the Amazon Mechanical Turk experiment for tagging the Quran via crowdsourcing.  ... 
doi:10.1007/s10579-011-9167-7 fatcat:oa5ocqu7kvdu3fsytzzhiwbrsi

Code-Switching Ubique Est - Language Identification and Part-of-Speech Tagging for Historical Mixed Text

Sarah Schulz, Mareike Keller
2016 Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities  
In this paper, we describe the development of a language identification system and a part-of-speech tagger for Latin-Middle English mixed text.  ...  To this end, we annotate data with language IDs and Universal POS tags (Petrov et al., 2012).  ...  We are greatly indebted to the Pontifical Institute of Mediaeval Studies (PIMS), Toronto, for their support and kind permission to use a searchable PDF version of the sermon transcripts.  ... 
doi:10.18653/v1/w16-2105 dblp:conf/latech/SchulzK16 fatcat:tzcf4u5s7bhxrn2i6mrs35eh6a

Domain and Task Adaptive Pretraining for Language Models

Leonard Konle, Fotis Jannidis
2020 Workshop on Computational Humanities Research  
All current state-of-the-art systems in NLP utilize transformer based language models trained on massive amounts of text.  ...  Training a model from scratch using Electra [5] is not competitive for our data sets.  ...  But their use in the Computational Humanities context is hindered by the lack of historical text corpora to train them from scratch.  ... 
dblp:conf/chr/KonleJ20 fatcat:7ve5pfelmjg7tn5vvi73ay5dgq

NLP-CIC @ DIACR-Ita: POS and Neighbor Based Distributional Models for Lexical Semantic Change in Diachronic Italian Corpora (short paper)

Jason Angel
2020 International Workshop on Evaluation of Natural Language and Speech Tools for Italian  
Our first model solely relies on part-of-speech usage and an ensemble of distance measures.  ...  We propose two models representing the target words across the periods to predict the changing words using threshold and voting schemes.  ...  Acknowledgments The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.  ... 
dblp:conf/evalita/Angel20 fatcat:ajyr2yabsvdntas4rux7st7qxy

Short term diachronic shifts in part-of-speech frequencies: A comparison of the tagged LOB and F-LOB corpora

Christian Mair, Marianne Hundt, Geoffrey N. Leech, Nicholas Smith
2002 International Journal of Corpus Linguistics  
Both corpora were tagged using a version of the CLAWS part-of-speech-tagger developed at Lancaster, and part of the material was post-edited manually in Freiburg to assess the accuracy of the automatic  ...  The paper presents a comparison of tag frequencies in two matching one-million word reference corpora of British standard English, the 1961 LOB-corpus and its 1991 "clone" produced at Freiburg.  ...  tackling all those problems in the investigation of which consistency of tagging across corpora is a greater priority than minimising errors within corpora through manual post-editing.  ... 
doi:10.1075/ijcl.7.2.05mai fatcat:z3oyb5w57rdfvpcjzb67vknjma

NLP-CIC @ DIACR-Ita: POS and Neighbor Based Distributional Models for Lexical Semantic Change in Diachronic Italian Corpora [article]

Jason Angel, Carlos A. Rodriguez-Diaz, Alexander Gelbukh, Sergio Jimenez
2020 arXiv   pre-print
Our first model solely relies on part-of-speech usage and an ensemble of distance measures.  ...  We propose two models representing the target words across the periods to predict the changing words using threshold and voting schemes.  ...  Acknowledgments The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.  ... 
arXiv:2011.03755v1 fatcat:jj7262eiqrebdgeyi2ec2r6lm4

Natural Language Processing Across Time: An Empirical Investigation on Italian [chapter]

Marco Pennacchiotti, Fabio Massimo Zanzotto
2008 Lecture Notes in Computer Science  
Indeed, while NLP tools for Italian achieve today good performance, it is not clear if they could be successfully used for the humanities, to support the critical study of historical works.  ...  The first goal is to understand to what extent such tools can be used "as they are" for the automatic analysis of old literary works.  ...  Rules consist in a triggering condition and an emitted part-of-speech tag.  ... 
doi:10.1007/978-3-540-85287-2_36 fatcat:yxfmekbyrngs3l7qiqm5atdgom

EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

Michael Beißwenger, Sabine Bartsch, Stefan Evert, Kay-Michael Würzner
2016 Proceedings of the 10th Web as Corpus Workshop  
The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed  ...  and Web corpora.  ...  the DFG network Empirikom for fruitful discussions in the design stage of the task.  ... 
doi:10.18653/v1/w16-2606 dblp:conf/aclwac/BeisswengerBEW16 fatcat:dae6g6utsbcd3fnkremmtdk3p4

Slavic Corpus and Computational Linguistics

Dagmar Divjak, Serge Sharoff, Tomaž Erjavec
2017 Journal of Slavic Linguistics  
First, we discuss why the corpus linguistic approach was discredited by generative linguists in the second half of the 20th century, how it made a comeback through advances in computing and was adopted  ...  Finally, we survey the types of research requiring corpora that Slavic linguists are involved in world-wide, and the resources they have at their disposal.  ...  Examples of such categories are things that most linguists take for granted, such as words or part-of-speech tags, for example.  ... 
doi:10.1353/jsl.2017.0008 fatcat:i7iqxtcd5vejtlssz3cezrczgy

Mixed-initiative development of language processing systems

David Day, John Aberdeen, Lynette Hirschman, Robyn Kozierok, Patricia Robinson, Marc Vilain
1997 Proceedings of the fifth conference on Applied natural language processing
These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data.  ...  This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to "bootstrapping" the manual tagging process, with the goal of reducing  ...  Some of the specific extensions to the user interface that we have already begun building include part-of-speech tagging (and "dense" markup more generally), and full parse syntactic tagging (where we  ... 
doi:10.3115/974557.974608 dblp:conf/anlp/DayAHKRV97 fatcat:zvhmz7d5ujd2zicejxr4ub64lq

2 Annotating Middle Welsh: POS tagging and chunk-parsing a corpus of native prose [chapter]

2020 Morphosyntactic Variation in Medieval Celtic Languages  
In order to obtain the maximally reliable tags, a wide range of parameter settings was tried, varying those features.  ...  I described how a combination of minimal pre-processing, a systematic extension of current tag sets for historical corpora and a hierarchical way of chunk-parsing can yield important information needed  ... 
doi:10.1515/9783110680744-003 fatcat:fynygulqhvhjxce675b4sjutim
Showing results 1 to 15 out of 2,215 results