Narrowing the semantic gap - improved text-based web document retrieval using visual features

Rong Zhao, W.I. Grosky
2002 IEEE Transactions on Multimedia
In this paper, we present the results of our work that seeks to negotiate the gap between low-level features and high-level concepts in the domain of web document retrieval. This work concerns a technique, Latent Semantic Indexing (LSI), which has been used for textual information retrieval for many years. In this environment, LSI is used to determine clusters of co-occurring keywords, sometimes called concepts, so that a query which uses a particular keyword can then retrieve documents perhaps not containing this keyword, but containing other keywords from the same cluster. In this paper, we examine the use of this technique for content-based web document retrieval, using both keywords and image features to represent the documents. Two different approaches to image feature representation, namely, color histograms and color anglograms, are adopted and evaluated. Experimental results show that LSI, together with both textual and visual features, is able to extract the underlying semantic structure of web documents, thus helping to improve the retrieval performance significantly.
Fig. 6. Results of semantic-based retrieval using both keywords and image features (color anglogram).
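The keyword-cluster behavior that the abstract attributes to LSI can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary, the toy term-document matrix, the rank k=2, and the helper name lsi_similarities are all assumptions chosen for illustration, and only keyword rows are used. In the setting the abstract describes, image feature rows (for example color histogram bins) would presumably be added to the same matrix, but that step is not shown here.

```python
import numpy as np

# Hypothetical toy vocabulary and term-document matrix (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)


def lsi_similarities(A, query, k):
    """Cosine similarity between a query and each document in a rank-k LSI space."""
    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs_k = Vt[:k, :].T                 # one k-dimensional row per document
    q_k = (U_k.T @ query) / s_k          # fold the query into the same space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return (docs_k @ q_k) / norms


# Query containing only the keyword "car".
query = np.zeros(len(terms))
query[terms.index("car")] = 1.0

sims = lsi_similarities(A, query, k=2)
for j, sim in enumerate(sims):
    print(f"doc {j}: similarity {sim:.3f}")
```

In this toy matrix, document 1 never contains "car" but shares "engine" with document 0, so the two documents fall into the same latent concept and the query matches document 1 as strongly as document 0, while the flower documents score near zero.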
doi:10.1109/tmm.2002.1017733 fatcat:52qgrls22fho7ogfoct3vt7zye