Reverse annotation based retrieval from large document image collections

Pramod Sankar K.
2010 Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval - SIGIR '10  
A number of projects are dedicated to creating digital libraries (DLs) from scanned books, such as Google Books, UDL, and the Digital Library of India (DLI). The ability to search the content of document images is essential for the usability and popularity of these DLs. In this work, we aim to build a retrieval system over 120K document images drawn from 1000 scanned books of Telugu literature. This is a very hard problem because: i) OCRs are not robust enough for Indian languages, especially the Telugu script, ii) the document images contain a large number of degradations and artifacts, and iii) scaling to large collections is hard. Moreover, users expect the search system to accept text-based queries and retrieve relevant results in interactive times. We propose a Reverse Annotation framework [1] that labels word-images with their equivalent text in an offline phase. Reverse Annotation applies a retrieval-based approach to recognition. It first identifies a set of keywords that are considered useful for labeling and retrieval. Exemplars are obtained for each keyword from a crude OCR or from human annotations. The labels are then propagated across the rest of the collection by matching words in the image-feature space. Since such matching is computationally expensive, scalability is achieved using a fast approximate nearest neighbor technique based on Hierarchical K-Means. Because the framework assigns text labels to document images offline, we can build a search index for quick online retrieval. An example query and the retrieved results are shown in Figure 1. We are unaware of any conventional OCRs that can recognize such images. There are three major contributions of our work: i) recognizing the entire document collection together, instead of one document at a time, ii) speeding up recognition by clustering multiple instances of a given word, and iii) recognizing at the word level, avoiding the pitfalls of character segmentation and recognition. Using the techniques developed in this work, we were able to successfully build a retrieval system over our chal-
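The label-propagation step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the word-image features, the tree parameters, or the acceptance rule, so the tuple feature vectors, the `branching` and `leaf_size` values, the `max_sq_dist` threshold, and all function names here are assumptions made for illustration. The sketch builds a small hierarchical k-means tree over labeled exemplars, then labels each unlabeled word-image feature by descending the tree to a single leaf and taking the closest exemplar found there.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(c) / n for c in zip(*vectors))

def nearest(centers, vec):
    """Index of the center closest to vec."""
    return min(range(len(centers)), key=lambda i: sq_dist(vec, centers[i]))

def kmeans(vectors, k, iters=10, rng=None):
    """Plain Lloyd's k-means; returns k centers (needs len(vectors) >= k)."""
    if rng is None:
        rng = random.Random(0)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(centers, v)].append(v)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def build_tree(items, branching=2, leaf_size=4, rng=None):
    """Hierarchical k-means tree over (vector, label) exemplars."""
    if rng is None:
        rng = random.Random(0)
    if len(items) <= leaf_size:
        return {"leaf": items}
    centers = kmeans([v for v, _ in items], branching, rng=rng)
    groups = [[] for _ in centers]
    for item in items:
        groups[nearest(centers, item[0])].append(item)
    pairs = [(c, g) for c, g in zip(centers, groups) if g]
    if len(pairs) < 2:
        # Could not split (for example, all vectors identical).
        return {"leaf": items}
    return {"children": [(c, build_tree(g, branching, leaf_size, rng))
                         for c, g in pairs]}

def query(tree, vec):
    """Approximate nearest exemplar: follow the closest center to one leaf."""
    node = tree
    while "children" in node:
        _, node = min(node["children"], key=lambda cn: sq_dist(vec, cn[0]))
    return min(node["leaf"], key=lambda item: sq_dist(vec, item[0]))

def propagate_labels(exemplars, unlabeled, max_sq_dist):
    """Give each unlabeled vector its nearest exemplar's label, or None."""
    tree = build_tree(exemplars)
    labels = []
    for vec in unlabeled:
        ex_vec, label = query(tree, vec)
        labels.append(label if sq_dist(vec, ex_vec) <= max_sq_dist else None)
    return labels
```

Descending to a single leaf is what makes the search approximate: each query costs roughly the tree depth times the branching factor plus one leaf scan, at the risk of missing a closer exemplar stored in a neighboring leaf. Returning `None` for distant matches is one possible way to leave uncertain word-images unlabeled.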
doi:10.1145/1835449.1835694 dblp:conf/sigir/Sankar10 fatcat:6xqmyuetpbfclps6ced75fbsbm