A Hybrid Attribute Selection Approach for Text Classification
Journal of the AIS
The application of text mining in organizations is growing. Text classification, an important type of text mining problem, is characterized by a large attribute space and therefore requires an efficient and effective attribute selection procedure. There are two general attribute selection approaches: the filter approach and the wrapper approach. While the wrapper approach is potentially more effective in finding the best attribute subset, it is cost-prohibitive in most text classification applications. In
... on applications. In this paper, we propose a hybrid attribute selection approach that is both efficient and effective for text classification problems. We apply the proposed approach to detect and prevent Internet abuse in the workplace, which is becoming a major problem in modern organizations. The empirical evaluations we conducted using a variety of classification algorithms, indexing schemes, and attribute selection methods demonstrate the utility of the proposed approach. We found that combining the filter and wrapper approaches not only improves the accuracy of text classifiers but also significantly reduces computational cost.

... semantic relatedness, so that the groups (or their centroids, or a representative of them), instead of the individual terms, may be used as dimensions of the vector space.

In this paper, we propose a hybrid attribute selection approach that is both efficient and effective for text classification problems. It first applies the filter approach to reduce the full attribute set to a much smaller subset and then applies the wrapper approach to further tune the attribute subset. We apply the proposed hybrid approach to address the organizational problem of Internet abuse and empirically demonstrate the utility of the approach.

The rest of the paper is organized as follows. We first review the text classification and attribute selection literature. We then propose and describe the hybrid attribute selection approach. Next, we describe how we empirically evaluate the proposed approach in the domain of workplace Internet abuse and discuss the findings. Finally, we conclude the paper and outline potential future research directions.
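
To make the two-stage idea described above concrete (a filter pass that shrinks the attribute set, followed by a wrapper pass that tunes the remaining subset), the following is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure evaluated in this paper: the chi-square filter score, the greedy forward search, the small naive Bayes classifier, the leave-one-out accuracy estimate, and the toy documents are all chosen here for brevity, and every function name is hypothetical.

```python
import math

def chi2_score(docs, labels, term):
    """Chi-square statistic of term presence vs. a binary class label (assumed filter metric)."""
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)      # present, class 1
    b = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)      # present, class 0
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)  # absent, class 1
    d0 = n - a - b - c                                                    # absent, class 0
    denom = (a + b) * (c + d0) * (a + c) * (b + d0)
    return 0.0 if denom == 0 else n * (a * d0 - b * c) ** 2 / denom

def filter_stage(docs, labels, k):
    """Filter: rank every term independently of any classifier and keep the top k."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-chi2_score(docs, labels, t), t))
    return ranked[:k]

def predict(train_docs, train_labels, terms, doc):
    """Bernoulli naive Bayes with Laplace smoothing, restricted to the given terms."""
    scores = {}
    for cls in (0, 1):
        cls_docs = [d for d, y in zip(train_docs, train_labels) if y == cls]
        score = math.log((len(cls_docs) + 1) / (len(train_docs) + 2))
        for t in terms:
            p = (sum(t in d for d in cls_docs) + 1) / (len(cls_docs) + 2)
            score += math.log(p if t in doc else 1 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

def loo_accuracy(docs, labels, terms):
    """Leave-one-out accuracy of the classifier when it sees only `terms`."""
    hits = 0
    for i in range(len(docs)):
        rest_docs = docs[:i] + docs[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += predict(rest_docs, rest_labels, terms, docs[i]) == labels[i]
    return hits / len(docs)

def wrapper_stage(docs, labels, candidates):
    """Wrapper: greedy forward selection, adding a term only if accuracy strictly improves."""
    selected, best = [], loo_accuracy(docs, labels, [])
    remaining = list(candidates)
    while remaining:
        acc, term = max((loo_accuracy(docs, labels, selected + [t]), t) for t in remaining)
        if acc <= best:
            break
        selected.append(term)
        remaining.remove(term)
        best = acc
    return selected

def hybrid_select(docs, labels, k):
    """Hybrid: cheap filter over the full vocabulary, then costly wrapper over k survivors."""
    filtered = filter_stage(docs, labels, k)
    return filtered, wrapper_stage(docs, labels, filtered)

# Toy data (invented for illustration): 1 = non-work-related page, 0 = work-related page.
DOCS = [
    {"casino", "poker", "bonus"}, {"poker", "jackpot", "the"}, {"casino", "jackpot", "bonus"},
    {"quarterly", "report", "the"}, {"budget", "report", "meeting"}, {"meeting", "agenda", "the"},
]
LABELS = [1, 1, 1, 0, 0, 0]

filtered, selected = hybrid_select(DOCS, LABELS, k=4)
print("after filter:", filtered)
print("after wrapper:", selected)
```

The point of the split is cost: the filter scores each term once, whereas the wrapper retrains and re-evaluates the classifier for every candidate subset it tries, so running the wrapper only over the k filtered terms keeps the number of classifier evaluations small.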