Content-based document image retrieval in complex document collections

G. Agam, S. Argamon, O. Frieder, D. Grossman, D. Lewis, Xiaofan Lin, Berrin A. Yanikoglu
2007 Document Recognition and Retrieval XIV  
We address the problem of content-based image retrieval in the context of complex document images. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype we are developing for the integrated retrieval and aggregation of diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images. This is in contrast to existing datasets for content-based image retrieval, which normally contain only thousands of images. We believe that the formation of such a large dataset is essential to understanding the problems associated with realistic applications.
doi:10.1117/12.703163 dblp:conf/drr/ArgamonFGL07 fatcat:v73nietpbvh6ldxxb6ouzi6u6i