Visual Reranking: From Objectives to Strategies

Xinmei Tian, Dacheng Tao
2011 IEEE Multimedia  
With the rapid development of recording and storage devices, as well as significant improvements in transmission and compression techniques, the amount of multimedia data (for example, images, video, and audio) on the Web is increasing, and video- and image-sharing sites are becoming more and more popular. There are hundreds of millions of videos on YouTube, Tudou, and Youku. Flickr hosts more than five billion images, and Facebook users have uploaded more than 50 billion photos. Many
users search and view images and videos on the Web every day. YouTube, for example, serves more than one billion videos every day. As a consequence, efficient and effective multimedia search tools are essential for Web surfing.
There are usually two ways to perform multimedia search: content-based and text-based. In content-based retrieval, which has been used extensively over the past two decades, a user provides example images or video clips, and similar images and videos are returned by querying a visual-representation index in a large-scale database. The content-based method suffers from three disadvantages. The first is the well-known semantic gap between low-level visual features and high-level semantic concepts, which leads to irrelevant images being returned. The second is redundancy in the search results. Content-based image and video retrieval (CBIVR) usually ranks images according to their visual similarity to the query examples. As a consequence, many images that vary only slightly from the examples are returned as the top results. Although these near-duplicate images are relevant, they give users little additional information.
The third disadvantage is the neglect of user experience. The example images and video clips that CBIVR requires might be unavailable to most users. Furthermore, with one or a few examples, a user's search intention cannot be clearly expressed. If the user provides an example image of a horse on grassland, what does the user actually want? It could be horse images with various backgrounds, any animal (for example, a goat, a cow, or a lion) on grassland, or something else altogether. Additionally, it could be difficult or even impossible for users to find proper examples to express complex intentions. Searching with textual queries is more natural for users, which leads to another important search style: text-based multimedia search.
Text-based multimedia search relies entirely on indexing the textual information associated with images, such as image tags, webpage filenames, and surrounding text. With textual information, well-understood search techniques can be applied directly to image and video search. Text-based multimedia search is efficient and has been widely used in practical applications. Most image and video search engines, such as Google and Bing, are built around this method. Despite these advantages, text-based multimedia search also suffers from several drawbacks. The first is the mismatch between images and videos and their associated textual descriptions. The second is that the textual representation can be ambiguous because of polysemy and synonymy. The third is that textual information is insufficient to distinguish images with different degrees of relevance, so some only slightly relevant samples are returned among the results.
To address these problems in current multimedia search, visual reranking has become a popular method in recent years. Visual reranking is an integrated framework (see Figure 1) that aims to obtain effective retrieval results efficiently. It leverages the advantages of both content-based and text-based retrieval.
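The idea of combining the two retrieval styles can be illustrated with a minimal sketch. It is not the method from the article, which this abstract does not specify. It assumes one simple strategy: the top text-based results are treated as pseudo-relevant, and every image is rescored by mixing its text score with its mean visual similarity to them. The function names (cosine, visual_rerank), the parameters (top_k, alpha), and the toy scores and two-dimensional features are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def visual_rerank(text_scores, features, top_k=2, alpha=0.5):
    """Rerank a text-based result list using visual similarity.

    text_scores: dict image_id -> score from the text-based search
    features:    dict image_id -> visual feature vector (list of floats)
    top_k:       number of top text results treated as pseudo-relevant (assumption)
    alpha:       weight of the text score in the final score (assumption)
    """
    # Initial ranking: text scores only.
    initial = sorted(text_scores, key=text_scores.get, reverse=True)
    anchors = initial[:top_k]

    final = {}
    for img in initial:
        # Compare against the pseudo-relevant images, excluding the image itself.
        others = [a for a in anchors if a != img]
        if others:
            visual = sum(cosine(features[img], features[a]) for a in others) / len(others)
        else:
            visual = 0.0
        final[img] = alpha * text_scores[img] + (1 - alpha) * visual

    return sorted(final, key=final.get, reverse=True)


# Toy data for the query "horse": "car" has a misleading text description,
# so the text-based search ranks it above a genuine horse image.
text_scores = {"horse_a": 0.9, "horse_b": 0.8, "car": 0.7, "horse_c": 0.6}
features = {
    "horse_a": [1.0, 0.0],
    "horse_b": [0.9, 0.1],
    "car": [0.0, 1.0],
    "horse_c": [0.8, 0.2],
}

ranked = visual_rerank(text_scores, features)
print(ranked)
```

With this toy data, the text-only order places "car" third, while the reranked list moves it to last place because it looks unlike the top text results. Setting alpha to 1.0 reproduces the text-only order.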
doi:10.1109/mmul.2011.36 fatcat:ti34miqu4vbqnhto3oa6q7znpu