A spelling corrector for Basque based on morphology

I. Aduriz, M. Urkia, I. Alegria, X. Artola, N. Ezeiza, K. Sarasola
1997 Literary and Linguistic Computing  
Introduction

This paper describes the components used in the elaboration of Xuxen, a commercial spelling checker/corrector for Basque. Because Basque is a highly inflected and agglutinative language, the spelling checker/corrector has been conceived as a by-product of a general-purpose morphological analyser/generator (Alegria et al., 96). The two-level model of morphology (Koskenniemi, 83) that we use is based on two main components (see Sproat, 1992):

• A lexicon where the ...s (lemmas and affixes) and the possible links among them (morphotactics) are defined.
• A set of rules which controls the mapping between the lexical level and the surface level due to the morphonological transformations (morphophonemics).

There are four kinds of rules: context restriction rules "=>" (the lexical character may be realized as the surface one in the given context), surface coercion rules "<=" (the lexical character must be realized as the surface one in the given context), composite rules "<=>" (the lexical character must be realized as the surface one in the given context, and this change is licit only in this context) and exclusion rules (the lexical character may not be realized as the surface one in the given context). The rules are independent of the morphotactics. The rules are compiled into transducers, so the system can be applied to both analysis and generation.

In order to increase coverage and robustness, the analyser has been designed incrementally and consists of three main modules: the standard analyser, the analyser of linguistic variants (due to dialectal uses and competence errors), and the analyser without lexicon, which can recognize word-forms whose lemmas are not in the lexicon. An important feature of the analyser is its homogeneity: all three steps are based on two-level morphology, in contrast to ad-hoc solutions. This analyser is a basic tool for current and future work on automatic processing of Basque, and its first application is the commercial spelling corrector named Xuxen that is presented here. First we describe the subsystem added to the analyser in order to significantly increase the coverage of competence errors [...]

Table 1. Precision of the corrector

Without changing the main idea of the correction method, the precision can be improved at the cost of speed (assuming the speed of morphological checking is constant).
For example, it would be possible, but very slow with our analyser, to generate and test all the possible words with an edit-distance higher than one from the original misspelling. Another option would be to pursue the approach proposed by Oflazer and Guzey (1994), based on flexible morphological decomposition, although so far we have encountered the same response-time problems.
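
The generate-and-test idea mentioned above can be illustrated with a minimal Python sketch. It is not the Xuxen implementation: the function names (`edits1`, `edits_up_to`, `suggest`), the plain a-z alphabet and the three-word toy lexicon are all assumptions made for illustration, and the `is_valid` callback merely stands in for a call to a morphological analyser. The sketch shows why raising the edit-distance bound is costly: the candidate set grows sharply with each extra step, and every candidate has to be checked.

```python
from typing import Callable, Set, List

# Assumption: a plain a-z alphabet; a real system would use the
# character set of the target language.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word: str, alphabet: str = ALPHABET) -> Set[str]:
    """All strings at edit distance exactly one from `word`
    (deletion, transposition, substitution, insertion)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    # Some operations can reproduce the input itself; drop it.
    return (deletes | transposes | substitutes | inserts) - {word}


def edits_up_to(word: str, max_distance: int,
                alphabet: str = ALPHABET) -> Set[str]:
    """All strings reachable from `word` in 1..max_distance edit steps."""
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {e for w in frontier for e in edits1(w, alphabet)} - seen
        seen |= frontier
    return seen - {word}


def suggest(word: str, is_valid: Callable[[str], bool],
            max_distance: int = 1, alphabet: str = ALPHABET) -> List[str]:
    """Generate candidates and keep those accepted by `is_valid`.
    `is_valid` is a hypothetical stand-in for the morphological checker."""
    candidates = edits_up_to(word, max_distance, alphabet)
    return sorted(c for c in candidates if is_valid(c))


# Toy stand-in for the analyser: a three-word set of Basque forms.
TOY_LEXICON = {"etxe", "etxea", "etxean"}

if __name__ == "__main__":
    print(suggest("etxaa", TOY_LEXICON.__contains__, max_distance=1))
    print(suggest("etxaa", TOY_LEXICON.__contains__, max_distance=2))
    print(len(edits1("etxaa")), len(edits_up_to("etxaa", 2)))
```

With the bound raised from one to two, the same misspelling yields more suggestions, but the number of candidates that must be passed to the checker grows by roughly two orders of magnitude, which is consistent with the response-time concern expressed in the text.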
doi:10.1093/llc/12.1.31 fatcat:vurdwyw4mrbr7nmk36lru4avim