Design and Development of Part of Speech Tagger for Ge'ez Language Using Hybrid Approach

Gebremeskel Hagos Gerbremedhin
2019 The International Journal of Science & Technoledge  
Introduction Language is one of the fundamental features of human behavior and a crucial component of our lives. In its written form, it serves as a means of recording information and knowledge on a long-term basis and transmitting it from one generation to the next. In its spoken form, it serves as a means of coordinating our day-to-day lives with others [1]. According to Noam Chomsky [2], a language is a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements. Language is a key aspect of human intelligence and can be categorized as natural or artificial. A natural language is an ordinary language that has evolved as the normal means of communication among people; examples include English, Ge'ez, Amharic, Afaan Oromo and Tigrigna. Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation. NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications [3]. NLP is also the means for accomplishing different types of tasks and applications, including part of speech (POS) tagging, named entity recognition (NER), information retrieval (IR), speech recognition, machine translation and question answering [3]. POS tagging is the process of assigning a part of speech such as noun, verb, preposition, pronoun, adverb or adjective, or another lexical class marker, to each word in a sentence or text. POS tagging is the first step to understanding a natural language.
Most other tasks and applications depend heavily on it [4]. The significance of POS (also known as word classes, morphological classes, or lexical tags) for language processing is that it gives a large amount of information about a word and its neighbors. POS tagging is therefore considered one of the basic, necessary tools, and the accuracy of many NLP applications depends on the accuracy of the POS tagger [5]. POS tagging can be used in text to speech (TTS), IR, shallow parsing, information extraction (IE) and linguistic research on corpora, and also as an intermediate step for higher-level NLP tasks such as parsing, semantic analysis, machine translation, and many more [6]. POS tagging is thus a necessary component of advanced NLP applications in Ge'ez or any other language.

Abstract: Part of speech tagging is the process of assigning a part of speech or other lexical class marker to each word in a sentence or text. It is the first step to understanding a natural language, and most other tasks and applications depend heavily on it. To the best of the researcher's knowledge, no POS tagger has so far been developed for Ge'ez. This work therefore proposes a hybrid approach to Ge'ez POS tagging: a Trigrams'n'Tags (TnT) tagger combined with a tagger based on human-written rules, regular expressions and morphological pattern analysis. Ge'ez literature on syntax, morphology and grammar was reviewed to understand the nature of the language and to identify possible tag sets. Since there was no ready-made standard corpus for Ge'ez, 26 broad tags were identified and 15,154 words from around 1,305 sentences were collected from a single genre, the Holy Bible. These words were then manually tagged by Ge'ez language professionals for training and testing purposes. Several techniques have been suggested for tagging words automatically with their POS tags. Among these, the hybrid of TnT with human-written rules, regex and morphological pattern analysis of Ge'ez was expected to perform better than the TnT tagger alone. Experiments were conducted with three taggers: the TnT tagger, the TnT with regex tagger, and the hybrid tagger, which achieved performances of 77.87%, 82.23% and 94.32% respectively. It is therefore possible to conclude that the hybrid tagger performs better than the TnT tagger and the TnT with regex tagger used individually.
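The layering described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. A simple frequency model with tag-trigram context stands in for a full TnT tagger, regular expressions handle unknown tokens, and a context guess or a default tag is the last resort. The tag names (N, V, ADJ, NUM, PUNC), the placeholder tokens w1 to w3, the function names and the default tag are all assumptions made for illustration, since the paper's tag set and rules are not listed here. The morphological pattern rules are omitted and would in principle be added as further entries in the rule list.

```python
import re
from collections import Counter, defaultdict

# Hypothetical regex rules for unknown tokens; tag names are assumptions.
REGEX_RULES = [
    (re.compile(r"^[\u1369-\u137C]+$"), "NUM"),   # Ethiopic numerals
    (re.compile(r"^[0-9]+$"), "NUM"),             # Arabic numerals
    (re.compile(r"^[\u1361-\u1368]+$"), "PUNC"),  # Ethiopic punctuation
]


def train(tagged_sentences):
    """Count word/tag pairs and tags seen after each pair of previous tags."""
    word_tags = defaultdict(Counter)
    context_tags = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev2, prev1 = "<s>", "<s>"
        for word, tag in sentence:
            word_tags[word][tag] += 1
            context_tags[(prev2, prev1)][tag] += 1
            prev2, prev1 = prev1, tag
    return word_tags, context_tags


def regex_tag(word):
    """Return the tag of the first matching regex rule, or None."""
    for pattern, tag in REGEX_RULES:
        if pattern.match(word):
            return tag
    return None


def hybrid_tag(words, model, default="N"):
    """Tag words: statistical model first, then regex rules, then fallbacks."""
    word_tags, context_tags = model
    result = []
    prev2, prev1 = "<s>", "<s>"
    for word in words:
        context = context_tags.get((prev2, prev1), Counter())
        if word in word_tags:
            # Known word: weight each observed tag by how often it followed
            # the two previous tags; ties are broken by tag name.
            candidates = word_tags[word]
            tag = max(candidates,
                      key=lambda t: (candidates[t] * (1 + context[t]), t))
        else:
            # Unknown word: regex rules, then context guess, then default.
            tag = regex_tag(word)
            if tag is None:
                tag = context.most_common(1)[0][0] if context else default
        result.append((word, tag))
        prev2, prev1 = prev1, tag
    return result


def accuracy(model, gold_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        predicted = hybrid_tag([word for word, _ in sentence], model)
        for (_, guess), (_, gold) in zip(predicted, sentence):
            correct += guess == gold
            total += 1
    return correct / total if total else 0.0


# Toy data with placeholder tokens; not taken from the paper's corpus.
train_data = [
    [("w1", "N"), ("w2", "V"), ("\u1362", "PUNC")],
    [("w3", "ADJ"), ("w1", "N"), ("w2", "V"), ("\u1362", "PUNC")],
]
model = train(train_data)
print(hybrid_tag(["w1", "w2", "\u136A", "\u1362"], model))
```

In this sketch the Ethiopic numeral is absent from the training data, so it is tagged by the regex layer. That is the kind of case where a rule-based layer can help a purely statistical tagger on a small corpus.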
doi:10.24940/theijst/2019/v7/i12/st1912-009 fatcat:gfnt4t5d5jgnheuqyppsmw3uue