Cross-corpus Native Language Identification via Statistical Embedding

Francisco Rangel, Paolo Rosso, Julian Brooke, Alexandra Uitdenbogerd
2018 Proceedings of the Second Workshop on Stylistic Variation   unpublished
In this paper, we approach the task of native language identification in a realistic cross-corpus scenario, where a model is trained on the available data and has to predict the native language from data in a different corpus. We propose a statistical embedding representation that reports a significant improvement over common single-layer state-of-the-art approaches when identifying Chinese, Arabic, and Indonesian in a cross-corpus scenario. The proposed approach is shown to be competitive even when the data is scarce and imbalanced.
doi:10.18653/v1/w18-1605 fatcat:lppobndm7zesndxthge7bixdqe