Noise reduction in a statistical approach to text categorization

Yiming Yang
1995 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '95  
This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of "non-informative words" from texts before training; the use of a truncated singular value decomposition to cut off noisy "latent semantic structures" during training; and the elimination of non-influential components in the LLSF solution (a word-concept association matrix) after training. Text collections in different domains were used for evaluation. Significant improvements in computational efficiency without losing categorization accuracy were evident in the testing results.
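
The three stages named in the abstract (before, during, and after training) can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the matrix orientation (words by documents, categories by documents), the function names, the keep mask, the rank `k`, and the pruning threshold are all assumptions chosen for illustration, and the paper's actual selection criteria may differ.

```python
import numpy as np

def remove_rows(A, keep_mask):
    """Before training: drop word rows judged non-informative.
    keep_mask is a hypothetical boolean vector; the abstract does not
    say how such words are selected."""
    return A[keep_mask, :]

def llsf_fit(A, B, k):
    """During training: solve min_F ||F A - B||_F using a rank-k
    truncated SVD of A (words x docs). B is categories x docs.
    Returns F with shape categories x words."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Never keep more components than there are nonzero singular values.
    k = min(k, int(np.sum(s > 1e-12)))
    # Truncated pseudoinverse: V_k diag(1/s_k) U_k^T, shape docs x words.
    A_pinv = (Vt[:k, :].T * (1.0 / s[:k])) @ U[:, :k].T
    return B @ A_pinv

def prune_small(F, threshold):
    """After training: zero out entries of F whose magnitude falls
    below an assumed threshold."""
    return np.where(np.abs(F) >= threshold, F, 0.0)

# Toy data: 4 words, 3 training documents, 2 categories.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

F = llsf_fit(A, B, k=3)          # full rank: fits the training pairs exactly
F_low = llsf_fit(A, B, k=1)      # keep only the strongest latent component
F_sparse = prune_small(F, 0.5)   # sparser word-category matrix
```

A new document vector `x` (word weights) would then be scored as `F @ x`, with larger entries suggesting more relevant categories. A smaller `k` and a sparser `F` reduce the cost of training and scoring respectively, which is the kind of efficiency trade-off the abstract refers to.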
doi:10.1145/215206.215367 dblp:conf/sigir/Yang95 fatcat:7xixjc7pv5ggbh6ueopoys4lkm