Pruning the vocabulary for better context recognition

R.E. Madsen, S. Sigurdsson, L.K. Hansen, J. Larsen
2004 Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.  
Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high-dimensional, however, and contains many non-consistent words for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach, using neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
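As an illustration of one of the two relevance measures named in the abstract, the following is a minimal sketch of information-gain-based vocabulary pruning. It assumes binary term presence per document and a standard entropy-based definition of information gain; the paper's exact formulation may differ, and the neural network sensitivity maps are not covered. The function names (`entropy`, `information_gain`, `prune_vocabulary`) and the `keep_fraction` parameter are hypothetical and chosen for this example only.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Return {term: IG}, where IG = H(class) - H(class | term present/absent).

    docs   : list of token lists (assumed already tokenized)
    labels : list of class labels, one per document
    """
    n = len(docs)
    class_counts = Counter(labels)
    h_class = entropy(class_counts.values())
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)

    gains = {}
    for term in vocab:
        # Class counts among documents that contain the term.
        present = Counter(l for s, l in zip(doc_sets, labels) if term in s)
        # Class counts among the remaining documents.
        absent = class_counts - present
        n_present = sum(present.values())
        n_absent = n - n_present
        h_cond = (n_present / n) * entropy(present.values()) \
            + (n_absent / n) * entropy(absent.values())
        gains[term] = h_class - h_cond
    return gains


def prune_vocabulary(docs, labels, keep_fraction=0.1):
    """Keep the top `keep_fraction` of terms ranked by information gain."""
    gains = information_gain(docs, labels)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(gains, key=lambda t: (-gains[t], t))
    k = max(1, int(round(keep_fraction * len(ranked))))
    return ranked[:k]
```

A `keep_fraction` between 0.02 and 0.10 would correspond to the 90%-98% reduction reported in the abstract. In a full pipeline, the pruned vocabulary would then feed the latent semantic indexing step and the classifier, neither of which is sketched here.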
doi:10.1109/icpr.2004.1334270 dblp:conf/icpr/MadsenSHL04 fatcat:lto3ltt6src43hcv7buqkamnbi