Language identification based on string kernels

C. Kruengkrai, P. Srichaivattana, V. Sornlertlamvanich, H. Isahara
IEEE International Symposium on Communications and Information Technology, 2005. ISCIT 2005.  
In this paper, we propose a novel approach for automatically identifying the language of a given text based on the concept of string kernels. Our approach identifies the language directly from the text, regardless of its coding system. In particular, we view the text at a finer granularity, as a string of bytes. The similarity between two strings can be computed implicitly through an efficient dynamic alignment using suffix trees. We provide empirical evidence that applying string kernels to the language identification problem yields impressive performance with two different kernel classifiers: a kernelized version of the centroid-based method and support vector machines. Our experiments use a data set of reasonable scale in terms of the number of languages to be identified, covering 17 different languages.
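The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a fixed-length byte n-gram (spectrum) kernel computed with plain hash counts instead of the suffix-tree alignment the abstract mentions, and it classifies by distance to each language's centroid in the kernel-induced feature space. All function names, the n-gram length of 3, and the toy training sentences are hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def byte_ngrams(text: bytes, n: int = 3) -> Counter:
    """Count every contiguous run of n bytes in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a: bytes, b: bytes, n: int = 3) -> float:
    """Inner product of the two byte n-gram count vectors."""
    ca, cb = byte_ngrams(a, n), byte_ngrams(b, n)
    if len(ca) > len(cb):  # iterate over the smaller table
        ca, cb = cb, ca
    return float(sum(count * cb[gram] for gram, count in ca.items()))


def normalized_kernel(a: bytes, b: bytes, n: int = 3) -> float:
    """Cosine-normalized kernel, so text length does not dominate."""
    denom = sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
    return spectrum_kernel(a, b, n) / denom if denom else 0.0


def centroid_distance_sq(x: bytes, samples: list, n: int = 3) -> float:
    """Squared feature-space distance from x to the mean of the samples.

    Expanding ||phi(x) - mean||^2 needs only kernel evaluations:
    K(x, x) - (2/m) * sum_i K(x, s_i) + (1/m^2) * sum_ij K(s_i, s_j).
    """
    m = len(samples)
    cross = sum(normalized_kernel(x, s, n) for s in samples) / m
    within = sum(normalized_kernel(s, t, n)
                 for s in samples for t in samples) / (m * m)
    return normalized_kernel(x, x, n) - 2.0 * cross + within


def identify_language(x: bytes, training: dict, n: int = 3) -> str:
    """Return the label whose centroid is closest to x."""
    return min(training,
               key=lambda lang: centroid_distance_sq(x, training[lang], n))


# Toy training data (illustrative only, not the 17-language data set).
training = {
    "en": ["the quick brown fox jumps over the lazy dog".encode("utf-8"),
           "language identification is the task of this paper".encode("utf-8")],
    "de": ["der schnelle braune fuchs springt über den faulen hund".encode("utf-8"),
           "die sprache eines textes wird automatisch erkannt".encode("utf-8")],
}

english_guess = identify_language(b"the dog jumps over the fox", training)
german_guess = identify_language(b"der hund springt", training)
```

Because the kernel operates on raw bytes, the same code applies to text in any encoding, provided the training samples and the query use the same one. A practical system would precompute the `within` term once per language and would need far more training text than the two sentences used here.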
doi:10.1109/iscit.2005.1567018 fatcat:6bycoajo7zbw7aw6kv5fbezklm