Cost-Sensitive Feature Extraction and Selection in Genre Classification

Ryan Levering, Michal Cutler
2009 Journal for Language Technology and Computational Linguistics  
Automatic genre classification of Web pages is currently young compared to other Web classification tasks. Corpora are just starting to be collected and organized in a systematic way, feature extraction techniques are inconsistent and not well detailed, genres are constantly in dispute, and novel applications have not been implemented. This paper attempts to review and make progress in the area of feature extraction, an area that we believe can benefit all Web page classification, and genre classification in particular. We first present a framework for the extraction of various Web-specific feature groups from distinct data models based on a tree of potential models and the transformations that create them. Then we introduce the concept of cost-sensitivity to this tree and provide an algorithm for performing wrapper-based feature selection on this tree. Finally, we apply the cost-sensitive feature selection algorithm to two genre corpora and analyze the performance of the classification results.
dblp:journals/ldvf/LeveringC09 fatcat:umba2miytzfrpbzw26olrokpcq