A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification [chapter]

Yanbo J. Wang, Frans Coenen, Robert Sanderson
2009 Lecture Notes in Computer Science  
Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data in a data-mining-ready structure, where the most significant text-features that serve to differentiate between textcategories are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (i) linguistic, (ii) statistical, and (iii) hybrid (i) & (ii). With regard to language-independent TC, our study relates to the statistical aspect
more » ... . The nature of textual data pre-processing includes: Document-base Representation (DR) and Feature Selection (FS). In this paper, we propose a hybrid statistical FS approach that integrates two existing (statistical FS) techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti⋅Sebastiani⋅Simi Coefficient). Our proposed approach is presented under a statistical "bag of phrases" DR setting. The experimental results, based on the well-established associative text classification approach, demonstrate that our proposed technique outperforms existing mechanisms with respect to the accuracy of classification.
doi:10.1007/978-3-642-03348-3_33 fatcat:dlzofjfyvbflvhs6r7qbzalh34