Spotting Where to Read on Pages - Retrieval of Relevant Parts from Page Images [chapter]

Koichi Kise, Masaaki Tsujino, Keinosuke Matsumoto
2002 Lecture Notes in Computer Science  
This paper presents a new method of document image retrieval that is capable of spotting parts of page images relevant to a user's query. This enables us to improve the usability of retrieval, since a user can find where to read on retrieved pages. The effectiveness of retrieval can also be improved because the method is little influenced by irrelevant parts on pages. The method is based on the assumption that parts of page images which densely contain keywords in a query are relevant to it.
... characteristics of the proposed method are as follows: (1) two-dimensional density distributions of keywords are calculated for ranking parts of page images; (2) the method relies only on the distribution of characters, so that it is not affected by errors in layout analysis. Based on the experimental results of retrieving Japanese newspaper articles, we have shown that the proposed method is superior to a method that cannot deal with parts of pages, and sometimes equivalent to a method of electronic document retrieval that works on error-free text.

... it is necessary to cope with OCR errors. OCR errors are not limited to the misrecognition of individual characters; they also include errors in layout analysis and in the identification of reading order.

Retrieval. The effectiveness of document (or page) ranking is an important problem in retrieval, although most existing methods deal mainly with keyword spotting on page images. To obtain the ranking, we should define and use a measure of similarity between a user's query (a set of keywords) and a page image, based on the spotted keywords.

Presentation. The problem of presentation seems to be often overlooked. It is caused by the disparity between the size of page images and the size of images that can be displayed. For example, newspaper pages scanned at a resolution of, say, 200 dpi are too large for ordinary displays. Images of A4 pages could cause the same problem on PDAs. It is therefore important to locate where to read on pages, in addition to selecting pages that contain information relevant to a query.
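The assumption stated in the abstract, that parts of a page which densely contain query keywords are relevant, can be illustrated with a small sketch. This is not the authors' implementation: the grid cell size, the pyramid-shaped weighting window, and the function names `keyword_density` and `best_region` are all assumptions made for illustration. The input is a list of pixel positions where keyword characters were spotted, so no layout analysis is involved.

```python
from typing import List, Tuple


def keyword_density(hits: List[Tuple[int, int]], page_w: int, page_h: int,
                    cell: int = 100, radius: int = 1) -> List[List[int]]:
    """Return a 2-D grid of weighted keyword counts (hypothetical helper).

    hits: (x, y) pixel positions of spotted keyword characters.
    cell: side length of one grid cell in pixels (assumed value).
    radius: half-width of the weighting window, in cells (assumed value).
    """
    cols = (page_w + cell - 1) // cell
    rows = (page_h + cell - 1) // cell

    # Count the keyword hits that fall into each grid cell.
    counts = [[0] * cols for _ in range(rows)]
    for x, y in hits:
        if 0 <= x < page_w and 0 <= y < page_h:
            counts[y // cell][x // cell] += 1

    # Smooth the counts with a pyramid-shaped window: the centre cell gets
    # weight radius + 1, and the weight drops by 1 per cell of distance.
    density = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    weight = radius + 1 - max(abs(rr - r), abs(cc - c))
                    total += weight * counts[rr][cc]
            density[r][c] = total
    return density


def best_region(density: List[List[int]], cell: int = 100) -> Tuple[int, int, int]:
    """Return (x0, y0, score) for the top-left corner of the densest cell."""
    best = (0, 0, -1)
    for r, row in enumerate(density):
        for c, score in enumerate(row):
            if score > best[2]:  # strict '>' keeps the first maximum found
                best = (c * cell, r * cell, score)
    return best


# Four hits clustered near the top-left of the page and one isolated hit.
hits = [(120, 130), (140, 160), (110, 170), (180, 220), (900, 900)]
density = keyword_density(hits, page_w=1000, page_h=1000)
print(best_region(density))  # (100, 100, 7)
```

Ranking pages by the peak of such a map, rather than by a whole-page keyword count, is one way to keep irrelevant parts of a page from diluting the score, and the peak location tells the user where to read.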
doi:10.1007/3-540-45869-7_43 fatcat:kpakccjsu5dcrms5s7c4cg2gz4