Corpus Creation and Transformer based Language Identification for Code-Mixed Indian Language

S Thara, Prabaharan Poornachandran
2021 IEEE Access  
Social media users tend to write most content in under-resourced languages in code-mixed form. Code-mixing is the mixing of two or more languages in a single sentence. Research on code-mixed text helps detect security threats prevalent on social media platforms, and language identification is an essential task for such text. The focus of this paper is word-level language identification (WLLI) of Malayalam-English ... data from social media platforms such as YouTube. The study is centered on BERT, a transformer model, along with its variants CamemBERT and DistilBERT, for identifying the language at the word level. The proposed approach tags a Malayalam-English code-mixed data set with six labels: Malayalam (mal), English (eng), acronyms (acr), universal (univ), mixed (mix) and undefined (undef). A newly developed Malayalam-English corpus was used to assess the effectiveness of state-of-the-art models such as BERT. Evaluation of the proposed approach on other code-mixed language pairs such as Hindi-English showed a 9% increase in the F1-score.
doi:10.1109/access.2021.3104106 fatcat:5n4c3ih7ejaqhkztrct6juzvqq