Impact of feature selection techniques in Text Classification: An Experimental study
JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL SCIENCES
This work compares the effect of different feature selection techniques on the accuracy of text classification. Text categorization, also called document categorization, is a supervised learning task: the system learns the categorization job from labeled training instances (instances with predefined labels) and then automatically categorizes unseen test instances. Classification plays an important role in Information Retrieval (IR) and management tasks. The text categorization procedure consists of text pre-processing (cleaning, stop-word removal, and stemming), feature extraction, feature reduction, or feature selection, and then categorization. In this work, two machine learning classifiers, Naïve Bayes and K-Nearest Neighbor, are used for classification. The experimental results show that Naïve Bayes gives higher accuracy in many cases, i.e. with many of the feature selection techniques, whereas K-Nearest Neighbor works well only when the feature selection technique is either Information Gain (IG) or Mutual Information (MI). The reported results were obtained using a self-made corpus for training and the Reuters-21578 corpus for testing.
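The pipeline described above (pre-processing, feature selection, then classification) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the stop-word list, the function names, and the choice of binary term presence for Information Gain are all assumptions made for illustration. Stemming is omitted, and only Naïve Bayes is shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the paper): (text, label) pairs.
TRAIN = [
    ("the stock market fell as shares dropped", "finance"),
    ("investors bought shares in the bank", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the football match", "sport"),
    ("the striker scored a goal in the match", "sport"),
    ("fans cheered as the team scored", "sport"),
]
STOP_WORDS = {"the", "a", "as", "in", "of", "and"}  # illustrative stop list


def tokenize(text):
    """Lower-case, split on whitespace, drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def entropy(counts):
    """Shannon entropy in bits of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(term, docs):
    """IG(term) = H(C) - H(C | term present or absent in the document)."""
    labels = Counter(label for _, label in docs)
    with_t = Counter(label for toks, label in docs if term in toks)
    without_t = labels - with_t
    cond = sum(
        sum(part.values()) / len(docs) * entropy(part.values())
        for part in (with_t, without_t) if part
    )
    return entropy(labels.values()) - cond


def select_features(docs, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = {t for toks, _ in docs for t in toks}
    ranked = sorted(vocab, key=lambda t: (-round(information_gain(t, docs), 12), t))
    return ranked[:k]


def train_nb(docs, features):
    """Collect class priors and per-class counts of the selected terms."""
    priors = Counter(label for _, label in docs)
    term_counts = {label: Counter() for label in priors}
    for toks, label in docs:
        term_counts[label].update(t for t in toks if t in features)
    return priors, term_counts


def predict_nb(model, features, text):
    """Multinomial Naive Bayes prediction with add-one (Laplace) smoothing."""
    priors, term_counts = model
    n_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in sorted(priors):
        total = sum(term_counts[label].values())
        score = math.log(priors[label] / n_docs)
        for t in tokenize(text):
            if t in features:  # terms outside the selected set are ignored
                score += math.log(
                    (term_counts[label][t] + 1) / (total + len(features))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


DOCS = [(tokenize(text), label) for text, label in TRAIN]
FEATURES = select_features(DOCS, 5)
MODEL = train_nb(DOCS, set(FEATURES))
print(FEATURES)
print(predict_nb(MODEL, set(FEATURES), "the bank sold shares"))
```

A K-Nearest Neighbor classifier could be run on the same selected features by representing each document as a vector over `FEATURES` and voting among the closest training documents. Swapping `information_gain` for another scoring function is how the different feature selection techniques would be compared in this sketch.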