Scalable ranked retrieval using document images
Document Recognition and Retrieval XXI
Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant role in some tasks. The best method for performing ranked retrieval on a large corpus of document images, however, remains an open research question. The most common approach has been to perform text retrieval using terms generated by optical character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm, which matches
... sub-images containing text or graphical objects, can provide additional benefit in satisfying a user's information needs on a large, real-world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection show that content-based image retrieval finds a substantial number of documents that text retrieval misses and that, when used as a basis for relevance feedback, it can yield improvements in retrieval effectiveness.

[Figure: SURF Accuracy; labels: L2 Distance, # Valid Matches, # Valid Hash Collisions]

... as well [3, 4]. To minimize the storage cost and computational requirements of this matching, the SURF feature vector is reduced to 8 dimensions using PCA. This indexing scheme is used to create the following inverted index: each index key points to the unique ID of the document it was computed from and to its associated feature vector. The X and Y coordinates and the orientation of the interest point are also stored for geometric filtering, discussed in the next section. This index reduces search complexity by a factor of more than 10^8 over the naïve approach.
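The index layout described above can be illustrated with a minimal Python sketch. It assumes descriptors have already been reduced to 8 dimensions by PCA. The key function shown here (one sign bit per dimension) is a hypothetical stand-in, since the actual hashing scheme is not given in this excerpt; the names Posting, hash_key, InvertedIndex, and l2_distance are likewise illustrative and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Posting:
    """One indexed interest point (field names are illustrative)."""
    doc_id: int                 # unique ID of the source document
    vector: Tuple[float, ...]   # PCA-reduced descriptor (8 values assumed)
    x: float                    # interest point X coordinate
    y: float                    # interest point Y coordinate
    orientation: float          # interest point orientation, in radians


def hash_key(vector: Sequence[float]) -> int:
    """Hypothetical key: one sign bit per dimension.

    This is an assumption made for illustration, not the paper's scheme.
    """
    key = 0
    for i, value in enumerate(vector):
        if value > 0.0:
            key |= 1 << i
    return key


class InvertedIndex:
    """Maps each key to the postings of every interest point that produced it."""

    def __init__(self) -> None:
        self._buckets: Dict[int, List[Posting]] = defaultdict(list)

    def add(self, doc_id: int, vector: Sequence[float],
            x: float, y: float, orientation: float) -> None:
        posting = Posting(doc_id, tuple(vector), x, y, orientation)
        self._buckets[hash_key(vector)].append(posting)

    def candidates(self, query_vector: Sequence[float]) -> List[Posting]:
        # Only postings sharing the query's key are returned, so the query
        # is never compared against every descriptor in the collection.
        return list(self._buckets.get(hash_key(query_vector), []))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, usable to rank the candidates of one bucket."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
```

In such a layout, a query descriptor is compared only against the postings that share its key, which is where the reduction over a naïve all-pairs comparison would come from. The stored coordinates and orientation travel with each posting so that a later geometric filtering step can use them.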