Corpus Creation and Language Identification in Low-Resource Code-Mixed Telugu-English Text

Kusampudi Siva Subrahamanyam Varma, Anudeep Chaluvadi, Radhika Mamidi (Language Technologies Research Centre, IIIT Hyderabad, Telangana, India)
2021 Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications
Code-Mixing (CM) is a common phenomenon in multilingual societies. CM plays a significant role in technology and medical fields, where terminology in the native language is not available or not known. Language Identification (LID) of CM data helps solve NLP tasks such as Spell Checking, Named Entity Recognition, Part-of-Speech tagging, and Semantic Parsing. In the current era of machine learning, a problem common to these tasks is the availability of learning data to train models. In this paper, we introduce two manually annotated Telugu-English CM datasets (a Twitter dataset and a Blog dataset). The Twitter dataset contains more romanization variability and more misspelled words than the Blog dataset. We compare various classification models and perform extensive benchmarking of both Classical and Deep Learning models for LID against existing models. We propose two architectures for language classification (Telugu and English) in CM data: (1) Word Level Classification and (2) Sentence Level word-by-word Classification. We compare these approaches and present two strong baselines for LID on these datasets.
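The distinction between the two setups named in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes a toy naive Bayes scorer over character trigrams for the word-level case, and a simple neighbor-weighted variant for the sentence-level word-by-word case. The class `WordLevelLID`, the function `tag_sentence`, the `context_weight` parameter, the `te`/`en` labels, and the tiny romanized word lists are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class WordLevelLID:
    """Toy word-level classifier: each word is scored in isolation."""

    def __init__(self, labels=("te", "en")):
        self.labels = labels
        self.counts = {label: Counter() for label in labels}
        self.vocab_size = 0

    def fit(self, words, tags):
        for word, tag in zip(words, tags):
            self.counts[tag].update(char_ngrams(word))
        # Number of distinct trigrams seen in any language (for smoothing).
        self.vocab_size = len(set().union(*self.counts.values()))
        return self

    def log_scores(self, word):
        """Add-one smoothed log-likelihood of the word under each label."""
        scores = {}
        for label in self.labels:
            total = sum(self.counts[label].values())
            scores[label] = sum(
                math.log((self.counts[label][g] + 1) / (total + self.vocab_size))
                for g in char_ngrams(word)
            )
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


def tag_sentence(model, tokens, context_weight=0.25):
    """Toy sentence-level tagging: one label per word, but each word's
    score is adjusted by the scores of its immediate neighbors."""
    per_word = [model.log_scores(token) for token in tokens]
    tags = []
    for i, own in enumerate(per_word):
        combined = dict(own)
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens):
                for label in model.labels:
                    combined[label] += context_weight * per_word[j][label]
        tags.append(max(combined, key=combined.get))
    return tags


# Hypothetical toy training data (romanized Telugu vs. English words).
telugu = ["nenu", "meeru", "ela", "unnaru", "chala",
          "bagundi", "ledu", "kani", "cheppu", "vachindi"]
english = ["movie", "the", "is", "good", "today",
           "office", "phone", "very", "nice", "work"]

model = WordLevelLID().fit(
    telugu + english,
    ["te"] * len(telugu) + ["en"] * len(english),
)

word_label = model.predict("nenu")                  # word-level decision
sentence_labels = tag_sentence(model, ["nenu", "movie"])  # one tag per token
```

In the word-level setup each token is classified on its own, so misspellings and romanization variants are handled only through the character n-grams. In the sentence-level setup the output is still one label per word, but neighboring words influence the decision, which is the property a sequence model would exploit on real data.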
doi:10.26615/978-954-452-072-4_085 fatcat:nppwiemkcjekjc37dw7mxujcby