Significance testing of word frequencies in corpora

Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, Heikki Mannila
2014 Digital Scholarship in the Humanities  
We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.  ...  We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the  ...  In this study, we focus on the assessment of the statistical significance of differences in word frequencies between corpora.  ... 
doi:10.1093/llc/fqu064 dblp:journals/lalc/LijffijtNSPPM16 fatcat:ojxizewy6jbd3ktd4wzbtjctze

Evolution of the Modern Phase of Written Bangla: A Statistical Study [article]

Paheli Bhattacharya, Arnab Bhattacharya
2013 arXiv   pre-print
We collect three different types of corpora---classical, newspapers and blogs---and test whether the differences in their features are statistically significant.  ...  in a word or of different words in a sentence.  ...  Although the classical corpus exhibits longer words in terms of syllables (due to the H < A test), the non-equality test (H ¬ A ) is not significant.  ... 
arXiv:1310.1590v1 fatcat:j7xitacj7resnixwd36dp7k55u

Getting rid of the Chi-square and Log-likelihood tests for analysing vocabulary differences between corpora

Yves Bestgen
2018 Quaderns de Filologia: Estudis Lingüístics  
However, because this specific use of the Chi-square test is not valid, it produces far too many significant results.  ...  Log-likelihood and Chi-square tests are probably the most popular statistical tests used in corpus linguistics, especially when the research is aiming to describe the lexical variations between corpora  ...  If a test claims that a given word is more frequent in one variety of English than it is in another because it finds a significant difference between the frequency of this word in the two corpora, it is  ... 
doi:10.7203/qf.22.11299 fatcat:3phqshkgibdjrc7lzbhfrpjjcu

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora [chapter]

Marina Santini, Wiktor Strandqvist, Mikael Nyström, Marjan Alirezai, Arne Jönsson
2018 Communications in Computer and Information Science  
In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora.  ...  Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.  ...  Both correlation tests confirm that the distributions of the word frequencies of the two corpora are not positively correlated and this difference is statistically significant.  ... 
doi:10.1007/978-3-319-99133-7_17 fatcat:ncso5ksl5vfwvkrdeqehfqbuze

Testing the Robustness of Laws of Polysemy and Brevity Versus Frequency [chapter]

Antoni Hernández-Fernández, Bernardino Casas, Ramon Ferrer-i-Cancho, Jaume Baixeries
2016 Lecture Notes in Computer Science  
The pioneering research of G. K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws.  ...  Here we focus on a couple of them: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be  ...  This research work has been supported by the SGR2014-890 (MACDA) project of the Generalitat de Catalunya, and MINECO project APCOM (TIN2014-57226-P). Bibliography  ... 
doi:10.1007/978-3-319-45925-7_2 fatcat:u2d5ndijiba2dd65o36gsky5oe

Corpus Linguistics for Vocabulary: A guide for Research by Paweł Szudarsk. Routledge Publications 2018. 239 pp. ISBN: 978-1-138-18721-4

Vahid Pahlevansadegh, Mehrdad Vasheghani Farahani
2020 Journal of Language and Education  
The first sub-section focuses on the significance of frequency, which is a rudimentary function of corpora.  ...  It also elaborates on different types of frequency in vocabulary, such as the frequency of spoken and written words and that of content and function words.  ... 
doi:10.17323/jle.2020.10554 fatcat:pyj6zw7kcbdjdk4pv3arn322y4

The Role of Native and Learner Corpora in Vocabulary Test Design

Eman Saleh Akeel
2016 English Language Teaching  
The article aims to illustrate how both native and learner corpora can be used in language testing in general and in the development of vocabulary tests in particular.  ...  It covers the benefits and limitations of using corpora in language testing and argues for the importance and usefulness of using native as well as learner corpora as tools for designing a vocabulary test  ...  Acknowledgments I would like to express my gratitude to the Ministry of Higher Education and to King Abdulaziz University for granting me a scholarship to pursue my postgraduate studies in the United Kingdom  ... 
doi:10.5539/elt.v9n7p10 fatcat:xemljrrtabexbktmu27vm2npda

Subtlex-pl: subtitle-based word frequency estimates for Polish

Paweł Mandera, Emmanuel Keuleers, Zofia Wodniecka, Marc Brysbaert
2014 Behavior Research Methods  
In addition to frequencies for word forms, SUBTLEX-PL includes measures of contextual diversity, part-of-speech-specific word frequencies, frequencies of associated lemmas, and word bigrams, providing  ...  Our results suggest that the two corpora may have unequal potential for explaining human performance for words in different frequency ranges and that corpora based on written materials severely overestimate  ...  We thank Jon Andoni Duñabeitia, Gregory Francis, and an anonymous reviewer for insightful comments on an earlier draft of the manuscript, Adam Przepiórkowski for providing access to the BS-NCP word frequencies  ... 
doi:10.3758/s13428-014-0489-4 pmid:24942246 fatcat:dwpcd7se5rd33dvpsp4cqh4uxm

Worldlex: Twitter and blog word frequencies for 66 languages

Manuel Gimenes, Boris New
2015 Behavior Research Methods  
High-frequency words are processed more accurately and more rapidly than low-frequency words, both in comprehension and in production (More recently, another source of corpora was found to be reliable:  ...  Lexical frequency is one of the strongest predictors of word processing time. The frequencies are often calculated from book-based corpora, or more recently from subtitle-based corpora.  ...  For each of the five new corpora, we compared the word frequency and the CD measure.  ... 
doi:10.3758/s13428-015-0621-0 pmid:26170053 fatcat:mdumvhzvxzdx5iiv3riovxajma

Frequency of Low-Frequency Words in Text Corpora

Pavel Rychlý
2010 Recent Advances in Slavonic Natural Languages Processing  
Low-frequency words, esp. words occurring only once in a text corpus, are very popular in text analysis. Also many lexicographers draw attention to such words.  ...  This paper lists a detailed statistical analysis of low-frequency words. The results provide important information for many practical applications, including lexicography and language modeling.  ...  Acknowledgements This work has been partly supported by the Ministry of Education, Youth and Sports of Czech Republic under the project LC536 and within the National Research Programme II project 2C06009  ... 
dblp:conf/raslan/Rychly10 fatcat:xan2fgzlabhrhbgl4m4ody4q6u

Quantitative Properties of Russian Adjective-Noun Collocations across Dictionaries and Corpora

Maria Khokhlova
2020 Workshop on Cognitive Modeling and Computational Linguistics  
We tested the following hypothesis, i.e. high collocation frequencies correspond to the fact that the item is represented in several dictionaries.  ...  The paper discusses the differences between collocations extracted from a number of Russian dictionaries paying attention to their frequency characteristics based on corpora.  ...  them in corpora are significant (p < 0.05 according to the Friedman test).  ... 
dblp:conf/acl-cmcl/Khokhlova20 fatcat:4f222snwh5fjvarws6e6n3ecna

Testing Zipf's meaning-frequency law with wordnets as sense inventories

Francis Bond, Arkadiusz Janz, Marek Maziarz, Ewa Rudnicka
2019 Global WordNet Conference  
Zipf, more frequent words have more senses. We have tested this law using corpora and wordnets of English, Spanish, Portuguese, French, Polish, Japanese, Indonesian and Chinese.  ...  On the other hand, the law disastrously fails in predicting the number of senses for a single lemma.  ...  Acknowledgments This research was financed by the National Science Centre, Poland, grant number 2018/29/B/HS2/02919, and supported by the Polish Ministry of Education and Science, Project CLARIN-PL, and  ... 
dblp:conf/wordnet/BondJMR19 fatcat:oimoscfvqjfkpdxs234gobra7a

Analyzing Idioms and Their Frequency in Three Advanced ILI Textbooks: A Corpus-Based Study

Sepideh Alavi, Aboozar Rajabpoor
2014 English Language Teaching  
Chi-square tests were then run to discover whether there were significant differences among the frequencies of occurrence of each idiom across each corpus.  ...  frequencies of the idioms across the three corpora.  ...  Another chi-square test was run to find the frequency of occurrence of each idiom in the MICASE, BNC and Brown corpora online to discover whether there were any differences among the frequencies of occurrence  ... 
doi:10.5539/elt.v8n1p170 fatcat:tjx7npzyu5d7jot342ezh57vr4

A multi-lingual and cross-domain analysis of features for text simplification

Regina Stodden, Laura Kallmeyer
2020 International Conference on Language Resources and Evaluation  
In this paper, we investigate their relevance for Czech, German, English, Spanish, and Italian text simplification corpora.  ...  In text simplification and readability research, several features have been proposed to estimate or simplify a complex text, e.g., readability scores, sentence length, or proportion of POS tags.  ...  Acknowledgments This research is part of the PhD-program "Online Participation", supported by the North Rhine-Westphalian funding scheme "Forschungskolleg".  ... 
dblp:conf/lrec/StoddenK20 fatcat:ux7wkyzqm5a7hm3wuipsuzutte

Corpus Statistics in Text Classification of Online Data [article]

Marina Sokolova, Victoria Bobicev
2018 arXiv   pre-print
In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results.  ...  Transformation of Machine Learning (ML) from a boutique science to a generally accepted technology has increased importance of reproduction and transportability of ML studies.  ...  The test indicated that difference between the five measures is not significant (P value = 0.9981); hence, the basic characteristics of the corpora are compatible.  ... 
arXiv:1803.06390v1 fatcat:y6w6a4ykxbc6rafavgnqveu44q