Feature selection on hierarchy of web documents

Dunja Mladenić, Marko Grobelnik
2003 Decision Support Systems  
The paper describes feature subset selection used in learning on text data (text-learning) and gives a brief overview of feature subset selection commonly used in machine learning. Several known and some new feature scoring measures appropriate for feature subset selection on large text data are described and related to each other. An experimental comparison of the described measures is given on real-world data collected from the Web. Machine learning techniques are used on data collected from ...o, a large text hierarchy of Web documents. Our approach includes some original ideas for handling a large number of features, categories and documents. The high number of features is reduced by feature subset selection and additionally by using a 'stop-list', pruning low-frequency features, and using a short description of each document given in the hierarchy instead of the document itself. Documents are represented as feature-vectors that include word sequences rather than only single words, as is common when learning on text data. An efficient approach to generating word sequences is proposed. Based on the hierarchical structure, we propose a way of dividing the problem into subproblems, each representing one of the categories included in the Yahoo hierarchy. In our learning experiments, a naive Bayesian classifier was used on text data for each of the subproblems. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Experimental evaluation on real-world data shows that the proposed approach gives good results. The best performance was achieved by feature selection based on Odds ratio, a feature scoring measure known from information retrieval, using a relatively small number of features.
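As a rough illustration of the Odds ratio scoring named in the abstract, the sketch below ranks words by the log odds ratio log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos)) P(W|neg))] for a single category. It is not the authors' implementation: the function names `odds_ratio` and `select_top`, the set-of-words document representation, and the `eps` clamping used to avoid division by zero are all assumptions made for this example.

```python
import math


def odds_ratio(docs, labels, eps=1e-6):
    """Score each word by its log odds ratio for the positive category.

    docs   : list of sets of words (one set per document)
    labels : list of bools, True if the document is in the category
    eps    : clamp for probabilities of 0 or 1 (an assumption of this
             sketch, not taken from the paper)
    """
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    vocab = set().union(*docs)

    scores = {}
    for w in vocab:
        # Fraction of positive / negative documents containing the word.
        p_pos = sum(w in d for d in pos) / len(pos)
        p_neg = sum(w in d for d in neg) / len(neg)
        # Keep both probabilities strictly inside (0, 1).
        p_pos = min(max(p_pos, eps), 1 - eps)
        p_neg = min(max(p_neg, eps), 1 - eps)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top(scores, k):
    """Return the k highest-scoring words."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: two documents in the category, two outside it.
docs = [
    {"learning", "text"},
    {"learning", "web"},
    {"football", "web"},
    {"football", "text"},
]
labels = [True, True, False, False]
scores = odds_ratio(docs, labels)
top = select_top(scores, 1)
```

Words that occur mostly in the category get large positive scores, words that occur mostly outside it get negative scores, and words spread evenly across both score near zero, so keeping only the top-ranked words yields a small feature set focused on the positive class.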
doi:10.1016/s0167-9236(02)00097-0 fatcat:qhverop3vzhfpk4k75wrfitvq4