Non-linear Mapping for Improved Identification of 1300+ Languages

Ralf Brown
2014 Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)  
Non-linear mappings of the form P(ngram)^γ and log(1 + τ P(ngram)) / log(1 + τ) are applied to the n-gram probabilities in five trainable open-source language identifiers. The first mapping reduces classification errors by 4.0% to 83.9% over a test set of more than one million 65-character strings in 1366 languages, and by 2.6% to 76.7% over a subset of 781 languages. The second mapping improves four of the five identifiers by 10.6% to 83.8% on the larger corpus and 14.4% to 76.7% on the smaller corpus. The subset corpus and the modified programs are made freely available for download at
doi:10.3115/v1/d14-1069 dblp:conf/emnlp/Brown14 fatcat:d7naeus2q5hh3fuum5r4euoyba
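
The two mappings named in the abstract can be sketched in a few lines of Python. This is only an illustration of the formulas, not the code of any of the five identifiers: the function names power_map and log_map are hypothetical, and the values of γ and τ used below are arbitrary choices for demonstration, not settings reported in the paper.

```python
import math


def power_map(p: float, gamma: float) -> float:
    """Power mapping: P(ngram)^gamma."""
    return p ** gamma


def log_map(p: float, tau: float) -> float:
    """Logarithmic mapping: log(1 + tau * P(ngram)) / log(1 + tau).

    Assumes tau > 0 so that the denominator is non-zero.
    """
    return math.log1p(tau * p) / math.log1p(tau)


if __name__ == "__main__":
    # Illustrative parameter values only (assumptions, not from the paper).
    gamma, tau = 0.5, 100.0
    for p in (0.0, 0.0001, 0.01, 0.5, 1.0):
        print(f"p={p:<7} power={power_map(p, gamma):.4f} log={log_map(p, tau):.4f}")
```

Both mappings leave the endpoints 0 and 1 fixed. With γ < 1 or τ > 0 they raise every probability strictly between those endpoints, and the relative increase is largest for small probabilities, so rare n-grams carry more weight relative to frequent ones than they do with the raw values.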