Part-of-speech Taggers for Low-resource Languages using CCA Features

Young-Bum Kim, Benjamin Snyder, Ruhi Sarikaya
2015 Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing  
In this paper, we address the challenge of creating accurate and robust part-of-speech taggers for low-resource languages. We propose a method that leverages existing parallel data between the target language and a large set of resource-rich languages without ancillary resources such as tag dictionaries. Crucially, we use CCA to induce latent word representations that incorporate cross-genre distributional cues, as well as projected tags from a full array of resource-rich languages. We develop a probability-based confidence model to identify words with highly likely tag projections and use these words to train a multi-class SVM using the CCA features. Our method yields average performance of 85% accuracy for languages with almost no resources, outperforming a state-of-the-art partially observed CRF model.
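The confidence-filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the authors' actual confidence model, so the code below swaps in a simple stand-in: the relative frequency of the most common projected tag. The function name `confident_projections`, the input format, and the `threshold` and `min_votes` parameters are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def confident_projections(projected_tags, threshold=0.8, min_votes=2):
    """Keep words whose projected tags agree strongly.

    projected_tags maps a target-language word to the list of tags
    projected onto it from aligned resource-rich languages
    (an assumed input format, not taken from the paper).
    """
    selected = {}
    for word, tags in projected_tags.items():
        if len(tags) < min_votes:
            continue  # too little evidence to estimate a probability
        tag, count = Counter(tags).most_common(1)[0]
        confidence = count / len(tags)  # relative frequency of the top tag
        if confidence >= threshold:
            selected[word] = (tag, confidence)
    return selected


# Toy projections (invented data, for illustration only).
toy = {
    "dog": ["NOUN", "NOUN", "NOUN", "NOUN", "VERB"],
    "run": ["VERB", "NOUN", "VERB", "NOUN"],
    "the": ["DET"],
}
result = confident_projections(toy)
```

In a pipeline like the one the abstract describes, the words kept by such a filter would serve as training examples for the multi-class SVM, with the CCA-induced word representations as features.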
doi:10.18653/v1/d15-1150 dblp:conf/emnlp/KimSS15 fatcat:6faju5ipmjf7hojldmcm4lmaou