Combined Machine-Learning Approach to PoS-Tagging of Middle English Corpora

Raoul Karimov, Chelyabinsk State University
2018, Crossroads. A Journal of English Studies
This paper considers the problem of part-of-speech (PoS) tagging in Middle English corpora, and in historical corpora more generally. PoS tagging is now considered a solved problem for Modern English, where it is mainly achieved via hidden Markov models (HMMs) and matrix-based word-to-vector conversions, with every word in the dictionary being embedded into a single dimension. However, this approach relies on recurrent syntactic structures and context-free generative grammars and is therefore not applicable to older iterations of the English language due to their irregular word order. As such, we believe that Middle English could be better handled by a morphographemic encoding and instance-based machine learning algorithms such as SVM, random forests, and kNN. Using a moving-average method to generate multidimensional vectors that give a reliable numeric representation of character composition and sequences, we have achieved a precision and recall of 87.5% in classifying Middle English words by their part of speech, using a simplistic combined voting-based binary classifier. This result could be deemed satisfactory and encourages further research in the area.
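The abstract does not spell out the encoding, so the following is only a minimal sketch of what a moving-average character encoding and a voting step might look like. The alphabet, the window size, the fixed vector length, the right-alignment of the vector, and the function names `moving_average_vector` and `majority_vote` are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

# Assumed alphabet: Latin letters plus a few Middle English graphemes.
ALPHABET = "abcdefghijklmnopqrstuvwxyzþðȝæ"


def moving_average_vector(word, window=2, dims=6):
    """Encode a word as a fixed-length vector of moving averages of
    character codes (hypothetical scheme, not the paper's exact one)."""
    # Map each character to (index + 1) / alphabet size; unknown characters become 0.0.
    codes = [(ALPHABET.find(c) + 1) / len(ALPHABET) for c in word.lower()]
    if len(codes) < window:
        # Word shorter than the window: use a single overall average, if any.
        averages = [sum(codes) / len(codes)] if codes else []
    else:
        averages = [
            sum(codes[i:i + window]) / window
            for i in range(len(codes) - window + 1)
        ]
    # Right-align so word endings (often inflectional) keep fixed positions;
    # left-pad short words with zeros and drop the start of long ones.
    return ([0.0] * dims + averages)[-dims:]


def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Usage: encode a word, then combine labels from several (hypothetical) classifiers.
vector = moving_average_vector("knyght")
label = majority_vote(["NOUN", "VERB", "NOUN"])
```

In a fuller pipeline, vectors of this kind would be passed to several instance-based classifiers (for example SVM, random forest, and kNN), and their individual predictions would be combined by the vote.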
doi:10.15290/cr.2018.21.2.04 fatcat:3wxatuco7bf5zl3hnwaajm2lcm