Information Retrieval from Document Image Databases [chapter]

Shijian Lu, Chew Lim Tan
2014 Advances in Digital Document Processing and Retrieval  
With the proliferation of digital libraries, an increasing number of document images with different characteristics are being produced. Retrieving the text information within these archived document images has accordingly become an urgent problem. This chapter presents a word shape coding approach that retrieves document images without OCR (optical character recognition). Several word shape coding schemes are presented, which convert a word image into a word shape code by using a few topological word shape features such as character boundary extrema, character holes, and character water reservoirs. A document image can then be converted into a document vector that encodes the occurrence frequency of the contained word images. Document images can thus be retrieved based on the similarity between the converted document vectors. Experiments show that the word shape coding approach is fast, robust, and capable of retrieving document images efficiently without OCR.
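The retrieval step described in the abstract (document vectors of word shape code frequencies, compared by similarity) might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the code strings, document names, and function names are hypothetical placeholders rather than the chapter's actual coding schemes.

```python
from collections import Counter
from math import sqrt


def document_vector(word_codes):
    """Count how often each word shape code occurs in one document image.

    `word_codes` is a list of code strings, one per word image. The codes
    themselves are hypothetical; extracting them from images is not shown.
    """
    return Counter(word_codes)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors.

    Cosine similarity is an assumption made for this sketch, not a measure
    confirmed by the abstract.
    """
    dot = sum(a[code] * b[code] for code in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document matches nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_codes, collection):
    """Rank documents in `collection` (name -> list of codes) by similarity."""
    query = document_vector(query_codes)
    scored = [
        (name, cosine_similarity(query, document_vector(codes)))
        for name, codes in collection.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up codes standing in for the output of a word shape coder.
    collection = {
        "doc_a": ["A1", "B2", "A1", "C3"],
        "doc_b": ["D4", "D4", "E5"],
    }
    for name, score in rank_documents(["A1", "B2"], collection):
        print(name, round(score, 3))
```

Because the vectors are built from shape codes rather than recognized characters, no OCR step appears anywhere in this pipeline; how well such a ranking works in practice depends on how distinctive the chosen codes are.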
doi:10.1142/9789814368711_0004 fatcat:4tidgjsb3naxfniqmsubjlandm