Multilingual Dictionary Based Construction of Core Vocabulary

Winston Wu, Garrett Nicolai, David Yarowsky
2020 International Conference on Language Resources and Evaluation  
We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries. Our newly developed core concept vocabulary list derived from these dictionary consensus methods achieves high overlap with existing widely used core vocabulary lists targeted at applications such as first and second language learning or field linguistics. Our in-depth analysis illustrates multiple desirable properties of our newly proposed core vocabulary set, including its non-compositionality. We employ a cognate prediction method to recover missing coverage of this core vocabulary in massively multilingual dictionary construction, and we argue that this core vocabulary should be prioritized for elicitation when creating new dictionaries for low-resource languages for multiple downstream tasks including machine translation and language learning.
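
The abstract does not spell out how coverage is scored, so the following is only a minimal sketch of the general idea: rank concepts by the fraction of bilingual dictionaries that contain an entry for them, and take the top of the ranking as a candidate core set. The function names `rank_by_coverage` and `core_vocabulary`, the input layout (language code mapped to a set of concepts), and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def rank_by_coverage(dictionaries):
    """Rank concepts by the share of dictionaries that list them.

    `dictionaries` maps a language identifier to the set of concepts
    (e.g. English glosses) that have an entry in that language's
    bilingual dictionary. This layout is a hypothetical simplification.
    """
    counts = Counter()
    for concepts in dictionaries.values():
        # Count each concept at most once per dictionary.
        counts.update(set(concepts))
    total = len(dictionaries)
    # Sort by descending count, breaking ties alphabetically for stability.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(concept, n / total) for concept, n in ranked]


def core_vocabulary(dictionaries, size):
    """Return the `size` concepts with the highest relative coverage."""
    return [concept for concept, _ in rank_by_coverage(dictionaries)[:size]]


# Toy data, invented purely for illustration.
toy = {
    "lang_a": {"water", "hand", "eat", "bicycle"},
    "lang_b": {"water", "hand", "eat"},
    "lang_c": {"water", "hand", "telephone"},
    "lang_d": {"water", "eat"},
}

print(core_vocabulary(toy, 3))
```

In this toy setting, concepts that appear in most dictionaries rise to the top, while concepts found in only one dictionary fall to the bottom of the ranking. A real system would also need to normalize glosses and handle dictionaries of very different sizes, which this sketch ignores.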
dblp:conf/lrec/WuNY20 fatcat:6rlgztjmhna7ji4k2der5a5wbu