Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification

Aditya Mogadala, Achim Rettinger
2016 Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  
In many languages, sparse availability of resources causes numerous challenges for textual analysis tasks. Text classification is one of such standard tasks that is hindered due to limited availability of label information in lowresource languages. Transferring knowledge (i.e. label information) from high-resource to low-resource languages might improve text classification as compared to the other approaches like machine translation. We introduce BRAVE (Bilingual paRAgraph VEctors), a model to
more » ... earn bilingual distributed representations (i.e. embeddings) of words without word alignments either from sentencealigned parallel or label-aligned non-parallel document corpora to support cross-language text classification. Empirical analysis shows that classification models trained with our bilingual embeddings outperforms other stateof-the-art systems on three different crosslanguage text classification tasks.
doi:10.18653/v1/n16-1083 dblp:conf/naacl/MogadalaR16 fatcat:z2qri3ovsvhwrhjkqfkw4plxhi