Analysing cross-lingual transfer in lemmatisation for Indian languages

Kumar Saurav, Kumar Saunack, Pushpak Bhattacharyya
Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)
Lemmatisation aims to reduce the data sparsity problem by relating the inflected forms of a word to its dictionary form. However, most prior work on this topic has focused on high-resource languages. In this paper, we evaluate cross-lingual approaches for low-resource languages, especially in the context of morphologically rich Indian languages. We test our model on six languages from two different families and develop linguistic insights into each model's performance.

Models

We adapt the two-step attention process from the state of the art (Anastasopoulos and Neubig, 2019) on the SIGMORPHON 2019 morphological inflection task (McCarthy et al., 2019), switching the input and output to use it as a lemmatiser. The model has four parts: separate encoders for the tags and the input character sequence, an attention mechanism, and a decoder. The encoder for the input character sequence is a single-layer bidirectional LSTM. The morphological tags are also input to the model, for which we use self-attention encoders (Vaswani et al., 2017).

* These authors contributed equally to this work.

This work is licensed under a Creative Commons Attribution 4.0 International Licence.
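The two-step attention can be sketched as follows: at each decoding step, the decoder state first attends over the character encodings, and the resulting character context then conditions a second attention pass over the tag encodings; both contexts feed the decoder. This is a minimal numpy sketch, not the authors' implementation: the exact scoring functions, gating, and how the two contexts are combined are assumptions, and `two_step_attention` is a hypothetical helper name.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_step_attention(dec_state, char_enc, tag_enc):
    """One decoding step of a two-step attention sketch.

    dec_state: (d,)  current decoder hidden state
    char_enc:  (T_chars, d)  character-encoder outputs (e.g. from a BiLSTM)
    tag_enc:   (T_tags, d)   tag-encoder outputs (e.g. from self-attention)
    Returns a (2d,) context vector: [character context; tag context].
    """
    # Step 1: attend over character encodings with the decoder state
    # (dot-product scoring is an assumption for simplicity).
    char_weights = softmax(char_enc @ dec_state)       # (T_chars,)
    char_ctx = char_weights @ char_enc                 # (d,)

    # Step 2: attend over tag encodings, with the query conditioned
    # on both the decoder state and the character context.
    query = dec_state + char_ctx
    tag_weights = softmax(tag_enc @ query)             # (T_tags,)
    tag_ctx = tag_weights @ tag_enc                    # (d,)

    # The concatenated contexts would feed the decoder's next step.
    return np.concatenate([char_ctx, tag_ctx])         # (2d,)
```

In a full model the contexts would be projected and combined with the decoder's recurrent update; the sketch only shows the attention flow that distinguishes the two-step scheme from a single attention over the concatenated inputs.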
doi:10.18653/v1/2020.coling-main.534