A Novel Feature Selection Based Gravitation for Text Categorization

Jieming Yang, Zhiying Liu, Zhaoyang Qu
2016 International Journal of Database Theory and Application  
The high dimensionality of the feature space is a major hurdle in applying many sophisticated methods to text categorization. Feature selection is one of the methods that reduce this high dimensionality. In this paper, we propose a new feature selection algorithm based on gravitation, named GFS, which regards a feature occurring in one category as an object; all objects corresponding to a feature occurring in the various categories constitute a gravitational field, and the gravitation exerted by all objects in this field on a feature with an unknown category label is used for feature selection. We have evaluated GFS on three benchmark datasets (20-Newsgroups, Reuters-21578 and WebKB), using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, document frequency, orthogonal centroid feature selection and Poisson distribution). The experiments show that GFS performs significantly better than the other feature selection algorithms in terms of micro F1, macro F1 and accuracy.

… is as close as possible to the original class distribution [7]. Dash and Liu [3] considered the factors mentioned above and held that feature selection attempts to select a minimally sized subset of features according to two criteria: (1) the classification accuracy does not significantly decrease; and (2) the class distribution given the selected features is as close as possible to the original class distribution. There are four basic steps in a typical feature selection method: a generation procedure, an evaluation function, a stopping criterion and a validation procedure [3]. Of these four steps, the evaluation function is a vital one: it measures the discriminating ability of a feature or a subset of features to distinguish the different class labels [3].

Blum and Langley [8] grouped feature selection methods into three classes: embedded, wrapper, and filtering. The characteristic of the embedded approach is that the feature selection process is embedded in the basic induction algorithm itself. The wrapper approach selects a feature subset using the evaluation function as a wrapper around the learning algorithm, and the selected features are then used with that same learning algorithm [9, 10]. The filtering approach selects the feature subset using an evaluation function that is independent of the learning method [9]. The filtering approach is the most popular and the fastest computationally [4], and the GFS method proposed in this study is also a filtering approach.

There are numerous well-known feature selection algorithms, such as document frequency (DF), information gain (IG), the χ²-statistic [11], odds ratio (OR) [12], mutual information [11], bi-normal separation (BNS) [13], Best Terms [4], orthogonal centroid feature selection (OCFS) [14], the most relevant with category measure [15, 16], the improved Gini index [17], the class discriminating measure (CDM) [18], a measure using the Poisson distribution [19], and so on. Most of these algorithms calculate a score for each feature based on information theory, probability or mathematical statistics; all features in the training set are then ranked by this score, and the top k features are selected to form the reduced feature space.

In this paper, we propose a new feature selection method based on the theory of universal gravitation, named GFS. It assumes that a feature in each category of the training set is an object whose mass is the number of occurrences of that feature in the category. A feature occurring in all the categories thus forms a gravitational field, in which the gravitation acting on the feature can be calculated. The larger the gravitation that category c_i exerts on a feature, the more information the feature carries about category c_i.
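As a concrete illustration of the filtering paradigm described above, the Python sketch below scores every feature, ranks all features by score, and keeps the top k. The scoring function shown is plain document frequency (DF), one of the baselines listed above; the names score_df and select_top_k are illustrative choices of ours, not from the paper, and GFS would simply substitute its gravitation-based score for score_df. This is a minimal sketch, not the authors' implementation.

    from collections import Counter

    def score_df(docs):
        # Document frequency (DF): the number of training documents
        # in which each feature occurs at least once.
        df = Counter()
        for tokens in docs:
            df.update(set(tokens))  # count each feature once per document
        return df

    def select_top_k(docs, k):
        # Generic filter-style selection: rank all features by their
        # score and keep the top k to form the reduced feature space.
        scores = score_df(docs)
        return sorted(scores, key=scores.get, reverse=True)[:k]

    docs = [["ball", "team", "win"],
            ["team", "score"],
            ["stock", "market", "team"]]
    # 'team' (df=3) ranks first; the second pick among the df=1 ties is arbitrary.
    print(select_top_k(docs, 2))

Because the scorer is independent of any learning algorithm, the same skeleton works for IG, the χ²-statistic, or a gravitation-based score, which is exactly what makes filtering methods fast and popular.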
To evaluate the GFS method, we used two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), on three benchmark text corpora (20-Newsgroups, Reuters-21578 and WebKB), and compared GFS with four feature selection algorithms (information gain, document frequency, orthogonal centroid feature selection and Poisson distribution). The experiments show that GFS performs significantly better than the other feature selection algorithms in terms of micro F1, macro F1 and accuracy. The rest of this paper is organized as follows: Section 2 presents the state of the art in feature selection methods. Section 3 describes and analyzes the basic principle and implementation of the proposed method. The experimental details are given in Section 4, and the experimental results are listed in Section 5. The statistical analysis and discussion are presented in Section 6. Our conclusions and directions for future work are provided in the last section.

2. Related Work

2.1. Information Gain

Information Gain (IG) [21] is frequently used as a criterion in the field of machine learning [11]. The information gain of a given feature t_k with respect to the class c_i is the reduction in uncertainty about the value of c_i when the value of t_k is known.
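For reference, the standard multi-class information gain score used in text categorization (as defined by Yang and Pedersen [11]) can be written as below, where P(c_i) is the prior probability of category c_i, t_k and \bar{t}_k denote the presence and absence of the feature, and m is the number of categories. That the paper uses exactly this variant is an assumption based on the cited literature, not shown in this excerpt.

$$
IG(t_k) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
          + P(t_k)\sum_{i=1}^{m} P(c_i \mid t_k)\log P(c_i \mid t_k)
          + P(\bar{t}_k)\sum_{i=1}^{m} P(c_i \mid \bar{t}_k)\log P(c_i \mid \bar{t}_k)
$$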
doi:10.14257/ijdta.2016.9.3.21