CVS

Christina Yip Chung, Bin Chen
2002 Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02  
As information volume in enterprise systems and in the Web grows rapidly, how to accurately retrieve information is an important research area. Several corpus based smoothing techniques have been proposed to address the data sparsity and synonym problems faced by information retrieval systems. Such smoothing techniques are often unable to discover and utilize the correlations among terms. We propose CVS, a Correlation-Verification based Smoothing method, that considers co-occurrence information
more » ... in smoothing. Strongly correlated terms in a document are identified by their co-occurrence frequencies in the document. To avoid missing correlated terms with low co-occurrence frequencies but specific to the theme of the document, the joint distributions of terms in the document are compared with those in the corpus for statistical significance. A common approach to apply corpus based smoothing techniques to information retrieval is by refining the vector representations of documents. This paper investigates the effects of corpus based smoothing on information retrieval by query expansion using term clusters generated from a term clustering process. The results can also be viewed in light of the effects of smoothing on clustering. Empirical studies show that our approach outperforms previous corpus based smoothing techniques. It improves retrieval effectiveness by 14.6%. The results demonstrate that corpus based smoothing can be used for query expansion by term clustering.
doi:10.1145/775107.775115 fatcat:2ak44ejh6bfobhwgcqvidpdx7y