Quantitative Comparative Linguistics based on Tiny Corpora: N-gram Language Identification of Wordlists of Known and Unknown Languages from Amazonia and Beyond

Frank Seifart, Roger Mundry
2015 Journal of Quantitative Linguistics  
Can an unknown Amazonian language be identified by statistical procedures based on n-gram frequencies if only a short list of words is available and, at the same time, the available data of the potential candidate languages are also limited to relatively short wordlists? In this paper we show that n-gram frequencies (specifically 1-grams and 2-grams) allow us to identify languages reliably based on as few as 20 words, as long as these are transcribed consistently, and as long as characteristic monogram and bigram frequencies for these languages have previously been established based on consistently transcribed data. If no such consistently transcribed data are available, as is the case of our Amazonian case study, such procedures clearly fail for wordlists with 50 or fewer words. Our study thus contributes to exploring the limits of such automated detection procedures, both in terms of corpus size and transcription quality.
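The general idea of identifying a wordlist from 1-gram and 2-gram frequencies can be illustrated with a minimal Python sketch. It is not the authors' statistical procedure, which the abstract does not specify: the cosine-similarity comparison, the function names (`ngram_profile`, `cosine`, `identify`), and the toy wordlists for the placeholder languages `lang_a` and `lang_b` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(words, n_values=(1, 2)):
    """Relative frequencies of character n-grams pooled over a wordlist."""
    counts = Counter()
    for word in words:
        w = word.lower()
        for n in n_values:
            # Slide a window of length n over the word.
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles (assumed metric)."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(unknown_words, reference_wordlists):
    """Return the best-matching reference language and all similarity scores."""
    target = ngram_profile(unknown_words)
    scores = {
        lang: cosine(target, ngram_profile(words))
        for lang, words in reference_wordlists.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: invented strings, not real language material.
references = {
    "lang_a": ["kaka", "taka", "kata"],
    "lang_b": ["lulo", "molu", "lomo"],
}
best, scores = identify(["tata", "kaka"], references)
print(best, scores)
```

With wordlists this small the profiles are very sparse, so the sketch also hints at why inconsistent transcription is damaging: if the same sound is written with different symbols in the unknown list and the reference lists, the corresponding n-grams no longer match at all.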
doi:10.1080/09296174.2015.1037161 fatcat:27c2umnmc5dvljxjplb2oaiz3q