Zero-resource Multi-dialectal Arabic Natural Language Understanding

Muhammad Khalifa, Hesham Hassan, Aly Fahmy
2021 International Journal of Advanced Computer Science and Applications  
A reasonable amount of annotated data is required for fine-tuning pre-trained language models (PLMs) on downstream tasks. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on modern standard Arabic (MSA) data only, identifying a significant performance drop when evaluating such models on DA. To remedy this drop, we propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data: improving zero-shot MSA-to-DA transfer by as much as ~10% F1 (NER), 2% accuracy (POS tagging), and 4.5% F1 (SRD). We conduct an ablation experiment and show that the observed performance boost results directly from the unlabeled DA examples used for self-training. Our work opens up opportunities for leveraging the relatively abundant labeled MSA datasets to develop DA models for zero- and low-resource dialects. We also report new state-of-the-art performance on all three tasks and open-source our fine-tuned models for the research community.
doi:10.14569/ijacsa.2021.0120369
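
As an illustration of the self-training recipe the abstract describes, the sketch below uses a simple TF-IDF plus logistic-regression classifier as a stand-in for the pre-trained language model; the classifier choice, confidence threshold, and number of rounds are assumptions for illustration, not the authors' implementation. The loop is the generic pseudo-labeling scheme: train on labeled MSA data, pseudo-label unlabeled DA data, keep only confident predictions, and retrain on the augmented set.

```python
# Hedged sketch of self-training for MSA-to-DA transfer.
# A lightweight classifier stands in for the PLM described in the paper;
# function name, threshold, and round count are illustrative assumptions.
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def self_train(msa_texts, msa_labels, da_texts, rounds=3, threshold=0.9):
    """Train on labeled MSA, then iteratively add confident pseudo-labeled DA."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X_lab = vec.fit_transform(msa_texts)          # labeled MSA features
    y_lab = np.asarray(msa_labels)
    X_unl = vec.transform(da_texts)               # unlabeled DA features

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_lab, y_lab)                         # teacher trained on MSA only

    for _ in range(rounds):
        probs = clf.predict_proba(X_unl)
        keep = probs.max(axis=1) >= threshold     # confident DA predictions only
        if not keep.any():
            break
        pseudo_y = clf.classes_[probs[keep].argmax(axis=1)]
        X_aug = sp.vstack([X_lab, X_unl[keep]])   # MSA gold + DA pseudo-labels
        y_aug = np.concatenate([y_lab, pseudo_y])
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

    return vec, clf
```

In practice the same loop would wrap PLM fine-tuning for NER, POS tagging, or sarcasm detection; the confidence threshold controls how aggressively noisy DA pseudo-labels enter the training set.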