Narrowing the semantic gap - improved text-based web document retrieval using visual features

Rong Zhao, W.I. Grosky
2002 IEEE Transactions on Multimedia
In this paper, we present the results of our work that seeks to negotiate the gap between low-level features and high-level concepts in the domain of web document retrieval. This work concerns a technique, latent semantic indexing (LSI), which has been used for textual information retrieval for many years. In this environment, LSI determines clusters of co-occurring keywords, sometimes called concepts, so that a query which uses a particular keyword can then retrieve documents perhaps not ... this keyword, but containing other keywords from the same cluster. In this paper, we examine the use of this technique for content-based web document retrieval, using both keywords and image features to represent the documents. Two different approaches to image feature representation, namely, color histograms and color anglograms, are adopted and evaluated. Experimental results show that LSI, together with both textual and visual features, is able to extract the underlying semantic structure of web documents, thus helping to improve the retrieval performance significantly, even when querying is done using only keywords.
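The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes LSI in its common formulation as a truncated singular value decomposition, and it assumes that keyword counts and visual features are simply stacked as rows of one feature-by-document matrix. The feature names (`warm_bin`, `gray_bin`), the four toy documents, and the helper `lsi_similarities` are all hypothetical and invented for illustration.

```python
import numpy as np

# Toy feature-by-document matrix (all values invented for illustration).
# Rows 0-2 are keyword counts; rows 3-4 are hypothetical visual-feature bins
# standing in for color-histogram or anglogram components.
features = ["sunset", "beach", "engine", "warm_bin", "gray_bin"]
A = np.array([
    [1, 0, 0, 0],  # "sunset" appears only in doc 0
    [0, 1, 0, 0],  # "beach" appears only in doc 1
    [0, 0, 1, 1],  # "engine" appears in docs 2 and 3
    [1, 1, 0, 0],  # warm-colored images in docs 0 and 1
    [0, 0, 1, 1],  # gray-colored images in docs 2 and 3
], dtype=float)


def lsi_similarities(A, query, k):
    """Cosine similarity between a folded-in query and each document
    in a rank-k latent space obtained from the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs_k = Vt[:k].T                    # one row of latent coordinates per document
    q_k = (U_k.T @ query) / s_k          # fold the query into the same space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return docs_k @ q_k / norms


# Keyword-only query: the single term "sunset", with no visual features set.
query = np.zeros(len(features))
query[features.index("sunset")] = 1.0

sims = lsi_similarities(A, query, k=2)
for doc_id, sim in enumerate(sims):
    print(f"doc {doc_id}: similarity {sim:.2f}")
```

In this toy data, doc 1 does not contain the keyword "sunset", yet it should score about as high as doc 0, because the two documents share the hypothetical warm-color feature and therefore fall into the same latent cluster. Docs 2 and 3 should score near zero. The choice of `k=2` is arbitrary here and would need tuning on real data.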
doi:10.1109/tmm.2002.1017733 fatcat:52qgrls22fho7ogfoct3vt7zye