Worldlex: Twitter and blog word frequencies for 66 languages

Manuel Gimenes, Boris New
2015 Behavior Research Methods  
Lexical frequency is one of the strongest predictors of word processing time. The frequencies are often calculated from book-based corpora or, more recently, from subtitle-based corpora. We present new frequencies based on Twitter, blog posts, or newspapers for 66 languages. We show that these frequencies predict lexical decision reaction times as well as, or even better than, the existing frequencies. These new frequencies are freely available and may be downloaded from
... x.lexique.org.

Keywords: Word frequency · Cross-language frequency · Twitter · Blogs

The number of occurrences of a word within a corpus is one of the best predictors of word processing time (Howes & Solomon, 1951). High-frequency words are processed more accurately and more rapidly than low-frequency words, both in comprehension and in production (More recently, another source of corpora was found to be reliable: movie subtitles. The subtitle-based frequencies were first computed in French by New, Brysbaert, Véronis, and Pallier (2007). The authors reported two main results. First, the subtitle-based frequencies were a better predictor of reaction times than the book-based frequencies. Second, the subtitle-based frequencies were complementary to the book-based frequencies. For instance, typical words from everyday spoken language were much more frequent in the subtitle-based corpora than in the book-based corpora. Because the book-based and subtitle-based frequencies were shown to be complementary in the analyses (they explained more variance together than separately), the authors concluded that book-based frequencies could be good estimates of written language and that subtitle-based frequencies could be good estimates of spoken language. The subtitle-based frequencies were then
doi:10.3758/s13428-015-0621-0 pmid:26170053 fatcat:mdumvhzvxzdx5iiv3riovxajma