pioNER: Datasets and Baselines for Armenian Named Entity Recognition [article]

Tsolak Ghukasyan, Garnik Davtyan, Karen Avetisyan, Ivan Andrianov
2018 arXiv pre-print
In this work, we tackle the problem of Armenian named entity recognition, providing silver- and gold-standard datasets as well as establishing baseline results for popular models. We present a 163,000-token named entity corpus automatically generated and annotated from Wikipedia, and another 53,400-token corpus of news sentences with manual annotation of person, organization, and location named entities. The corpora were used to train and evaluate several popular named entity recognition models.
Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and encyclopedia.
arXiv:1810.08699v1 fatcat:p6fpka5gibcxxesbjha6ee235y