Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
2020
Proceedings of the 28th International Conference on Computational Linguistics
unpublished
Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged
doi:10.18653/v1/2020.coling-main.579
fatcat:e5wzlagpozbatjv2vxepvv4mde