Simple But Not Naïve: Fine-Grained Arabic Dialect Identification Using Only N-Grams

Sohaila Eltanbouly, May Bashendy, Tamer Elsayed
2019 Proceedings of the Fourth Arabic Natural Language Processing Workshop  
This paper presents the participation of the Qatar University team in the MADAR shared task, which addresses the problem of sentence-level fine-grained Arabic Dialect Identification over 25 different Arabic dialects in addition to Modern Standard Arabic. Arabic Dialect Identification is not a trivial task, since different dialects share some features, e.g., the same character set and some vocabulary. We opted for a very simple approach in terms of extracted features and classification models: we use only word and character n-grams as features, and Naïve Bayes models as classifiers. Surprisingly, the simple approach achieved non-naïve performance. The official results, reported on a held-out test set, show that our best submitted run identifies the dialect of a given sentence with an accuracy of 64.58%.
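The general idea of pairing word and character n-gram features with a Naïve Bayes classifier can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' system: the n-gram orders, the add-one smoothing, the multinomial variant of Naïve Bayes, the names `extract_ngrams` and `NGramNaiveBayes`, and the six toy sentences with their labels are all assumptions made for this example and are not taken from the paper or from the MADAR data.

```python
import math
from collections import Counter, defaultdict


def extract_ngrams(sentence, word_orders=(1, 2), char_orders=(1, 2, 3)):
    """Return the word and character n-grams of a sentence as string features.

    The n-gram orders are illustrative defaults, not the paper's settings.
    """
    features = []
    words = sentence.split()
    for n in word_orders:
        for i in range(len(words) - n + 1):
            features.append("w:" + " ".join(words[i:i + n]))
    for n in char_orders:
        for i in range(len(sentence) - n + 1):
            features.append("c:" + sentence[i:i + n])
    return features


class NGramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed, not from the paper)

    def fit(self, sentences, labels):
        self.feature_counts = defaultdict(Counter)  # label -> n-gram counts
        self.class_counts = Counter(labels)         # label -> number of sentences
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            feats = extract_ngrams(sentence)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        self.totals = {c: sum(cnt.values()) for c, cnt in self.feature_counts.items()}
        return self

    def predict(self, sentence):
        # N-grams never seen in training are ignored.
        feats = [f for f in extract_ngrams(sentence) if f in self.vocab]
        n_sentences = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.class_counts.items():
            denom = self.totals[label] + self.alpha * vocab_size
            score = math.log(n / n_sentences)  # log prior
            for f in feats:
                # Smoothed log likelihood of each n-gram under this label.
                score += math.log((self.feature_counts[label][f] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data invented for this sketch; NOT drawn from the MADAR corpus.
TOY_SENTENCES = [
    "ازيك عامل ايه",
    "عايز اروح البيت دلوقتي",
    "كيفك شو عم تعمل",
    "بدي روح عالبيت هلق",
    "كيف حالك ماذا تفعل",
    "أريد أن أذهب إلى البيت الآن",
]
TOY_LABELS = ["Cairo", "Cairo", "Beirut", "Beirut", "MSA", "MSA"]

model = NGramNaiveBayes().fit(TOY_SENTENCES, TOY_LABELS)
print(model.predict("شو عم تعمل"))
```

Prefixing features with `w:` and `c:` keeps word and character n-grams in separate namespaces, so a one-letter word and the same single character are counted as distinct features. With only six training sentences the sketch says nothing about real accuracy; it only shows how the two feature types feed one probabilistic classifier.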
doi:10.18653/v1/w19-4624 dblp:conf/wanlp/EltanboulyBE19 fatcat:p2ofcnyuvbeu7kn7h2mety6nhi