A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging
Majdi Sawalha, Eric Atwell
Word Structure 6.1 (2013): 43-99
The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of Arabic grammar in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in analysis of morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared, and we review generic design criteria for corpus tag sets. For a morphologically rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and its possible values. In our analysis, a tag consists of 22 characters; each position represents a feature, and the letter at that position represents a value or attribute of the morphological feature; the dash '-' represents a feature not relevant to a given word. The first character shows the main Part of Speech, from: noun, verb, particle, punctuation, and Other (residual); these last two extend the traditional three classes to handle modern texts. 'Noun' in Arabic subsumes what are traditionally referred to in English as 'noun' and 'adjective'. Characters 2, 3, and 4 represent subcategories: traditional Arabic grammar recognizes 34 subclasses of noun (character 2), 3 subclasses of verb (character 3), and 21 subclasses of particle (character 4). Others (residuals) and punctuation marks are represented in characters 5 and 6 respectively. The next characters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally, there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22).
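
The positional layout described above can be illustrated with a minimal Python sketch. The feature names are taken from the list of 22 positions given here; the function name `decode_tag` and the example tag are hypothetical, and the letters in the example are placeholders rather than actual SALMA value codes, which this section does not enumerate.

```python
# Feature names in position order (1-22), as listed in the abstract.
SALMA_FEATURES = (
    "main part of speech",            # 1
    "noun subclass",                  # 2
    "verb subclass",                  # 3
    "particle subclass",              # 4
    "other (residual)",               # 5
    "punctuation mark",               # 6
    "gender",                         # 7
    "number",                         # 8
    "person",                         # 9
    "inflectional morphology",        # 10
    "case or mood",                   # 11
    "case and mood marks",            # 12
    "definiteness",                   # 13
    "voice",                          # 14
    "emphasized and non-emphasized",  # 15
    "transitivity",                   # 16
    "rational",                       # 17
    "declension and conjugation",     # 18
    "unaugmented and augmented",      # 19
    "number of root letters",         # 20
    "verb root",                      # 21
    "noun type by final letters",     # 22
)

NOT_APPLICABLE = "-"  # the dash marks a feature not relevant to the word


def decode_tag(tag):
    """Return {position: (feature name, value letter)} for relevant features only."""
    if len(tag) != len(SALMA_FEATURES):
        raise ValueError(
            f"expected {len(SALMA_FEATURES)} characters, got {len(tag)}"
        )
    return {
        position: (name, value)
        for position, (name, value) in enumerate(zip(SALMA_FEATURES, tag), start=1)
        if value != NOT_APPLICABLE
    }


# Hypothetical tag: the letters a, b, c, d are placeholders, not real SALMA values.
example_tag = "ab----cd" + "-" * 14

for position, (name, value) in decode_tag(example_tag).items():
    print(position, name, value)
```

Because every feature has a fixed position, a tag can be read or filtered by simple string indexing, and features that do not apply to a word stay visible as dashes rather than being omitted.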
The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.
1. Introduction: part-of-speech tagging and part-of-speech tag sets
Part-of-speech taggers are used to enrich a corpus by adding a part-of-speech category label to each word, showing the broad grammatical class of the word, and morphological features such as tense, number, gender, etc. The list of all grammatical category labels is called the tag set. The design of the tag set is an important prerequisite to this annotation task. The task requires a tagging scheme, where each tag or label is practically defined by showing the words and contexts where each tag applies; and a tagger, a program responsible for assigning a tag to each word in the corpus by implementing the tag set and tagging scheme in a tag-assignment algorithm (Atwell 2008).
Automatic taggers have been used from the early years of Corpus Linguistics. TAGGIT in 1971 achieved an accuracy of 77% when tested on the Brown corpus. In the late 1970s, CLAWS1, a data-driven statistical tagger, was built to carry out the annotation of the Lancaster-Oslo/Bergen corpus (LOB), and had an accuracy rate of 96-97%. Later tagger development included systems based on Hidden Markov Models (HMM); HMM taggers have been made for several languages. The Brill tagger (Brill 1995) is an example of a data-driven symbolic tagger. ENGCG and EngCG-2 are based on a framework known as Constraint Grammar (CG) (Voutilainen 2003). Recently, many new systems based on a variety of Markov Model and Machine Learning (ML) techniques have appeared for many languages. Hybrid solutions have also been investigated (Voutilainen 2003).
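
To make the relationship between a tag set, tagged training data and a tagger concrete, the following is a minimal Python sketch of a most-frequent-tag (unigram) baseline. It is not any of the taggers named in this section; the function names, the toy training sentences, the coarse tag labels and the default tag for unknown words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {
        word: tag_counts.most_common(1)[0][0]
        for word, tag_counts in counts.items()
    }


def tag_sentence(model, words, default_tag="NOUN"):
    """Assign the learned tag to each word; unknown words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


# Toy training corpus with an invented three-label tag set (DET, NOUN, VERB).
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

model = train_unigram_tagger(TRAINING)
tagged = tag_sentence(model, ["The", "cat", "barks", "loudly"])
print(tagged)
```

A baseline of this kind ignores context entirely, which is the gap that the statistical, rule-based and hybrid taggers surveyed here aim to close.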
ACOPOST, A Collection Of POS Taggers, consists of four taggers of different frameworks: Maximum Entropy Tagger (MET), Trigram Tagger (T3), Error-driven Transformation-Based Tagger (TBT) and Example-based Tagger (ET). The SNoW-based Part of Speech Tagger and the LBJ Part of Speech Tagger make use of the Sequential Model. NLTK, the Natural Language Toolkit, includes Python re-implementations of several POS taggers, such as the Regexp Tagger, N-Gram Tagger, Brill Tagger and HMM Tagger; in addition, NLTK includes tutorials and documentation on tagging. RelEx provides English-language part-of-speech tagging, entity tagging, as well as other types of tags (gender, date, money, etc.). Spejd (Shallow Parsing and Disambiguation Engine) is a tool for simultaneous rule-based morphosyntactic disambiguation and partial parsing. VISL Constraint Grammar is an example of rule-based disambiguation.
Enriching the source text samples of corpora with part-of-speech information for each word, as a first level of linguistic enrichment, results in more useful research resources. English corpora have been developed for a long time and in a variety of formats, types and genres. Several English corpora have been enriched with Part-of-Speech tagging, and a variety of different English corpus part-of-speech tag sets have been developed, including: the Brown corpus (BROWN), ... (SCRIBE), etc. (Atwell 2008). The AMALGAM multi-tagged corpus amalgamates all these tagging schemes in a common collection of English texts: in the AMALGAM corpus, the different part-of-speech tag sets used in these English general-purpose corpora are applied to illustrate the range of rival English corpus tagging schemes, and the texts are also parsed according to a range of rival parsing schemes, so each sentence has more than one parse-tree, called 'a forest' (Atwell, Demetriou, Hughes, Schiffrin, Souter & Wilcock 2000).
Part-of-speech tag sets and taggers have also been developed for other European languages. The EAGLES (Expert Advisory Group on Language Engineering Standards) project drew up standards for tag sets, morphological classes and codes for (western) European languages, including EAGLES Recommendations for the morphosyntactic annotation of corpora (Leech & Wilson 1999); a synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora: a common proposal and applications to European languages (Monachini & Calzolari 1996); and an EAGLES study of the relation between tag sets and taggers (Teufel, Schmid, Heid & Schiller 1996).
The potential uses of a part-of-speech tagged corpus are key factors in deciding the range and number of part-of-speech tags. Many linguistic analyses use part-of-speech tagged corpora to analyse text and extract information, where part-of-speech tags play an essential role in classifying text and directing searches to the actions, events, places, etc. described in the text. The most obvious applications are in lexicography, natural language processing (NLP) and computational linguistics. Further applications include using the tags in data compression (Teahan 1998), and as a possible guide in the search for extra-terrestrial intelligence (Elliott & Atwell 2000). Other generic applications that make use of part-of-speech tag information are: searching and concordancing, grammatical error detection in word processing, training neural networks for grammatical analysis of text, and training statistical language processing models (Atwell 2008). Part-of-Speech tagging is a key technology in discovering suspicious events from text (Zolfagharifard 2009), and processing Arabic is a key task in discovering these suspicious events.