Extraction of Layout Entities and Sub-layout Query-based Retrieval of Document Images [article]

Anukriti Bansal, Sumantra Dutta Roy, Gaurav Harit
2016 arXiv pre-print
Layouts and sub-layouts are an important clue when searching for a document on the basis of its structure, or when the textual content is unknown or irrelevant. A sub-layout specifies the arrangement of document entities within a smaller portion of the document. We propose an efficient graph-based matching algorithm, integrated with hash-based indexing, to prune a possibly large search space. A user can specify a combination of sub-layouts of interest using sketch-based queries. The system ... partial matching for unspecified layout entities. We handle segmentation pre-processing errors (for text/non-text blocks) with a symmetry maximization-based strategy and by accounting for multiple domain-specific plausible segmentation hypotheses. We show promising results for our system on a database of unstructured entities containing 4776 newspaper images.
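The abstract does not describe the index or the matching algorithm in detail, so the following is only a minimal sketch of the general idea of using a hash-based index to prune candidates before any graph matching. Everything in it is an assumption made for illustration: the block tuple format `(label, x0, y0, x1, y1)`, the coarse relations (`above`, `below`, `left`, `right`, `overlap`), and the function names `relation`, `relation_keys`, `build_index`, and `candidates`. It is not the authors' method, and the graph-matching stage itself is not shown.

```python
from collections import defaultdict

# Hypothetical block format: (label, x0, y0, x1, y1), with y growing downward
# as in image coordinates. This format is an assumption, not from the paper.

def relation(a, b):
    """Coarse spatial relation of block b relative to block a."""
    _, ax0, ay0, ax1, ay1 = a
    _, bx0, by0, bx1, by1 = b
    if by0 >= ay1:
        return "below"
    if by1 <= ay0:
        return "above"
    if bx0 >= ax1:
        return "right"
    if bx1 <= ax0:
        return "left"
    return "overlap"

def relation_keys(blocks):
    """Hashable keys (label_a, relation, label_b) for every ordered block pair."""
    keys = set()
    for i, a in enumerate(blocks):
        for j, b in enumerate(blocks):
            if i != j:
                keys.add((a[0], relation(a, b), b[0]))
    return keys

def build_index(docs):
    """Inverted index: relation key -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, blocks in docs.items():
        for key in relation_keys(blocks):
            index[key].add(doc_id)
    return index

def candidates(index, query_blocks):
    """Documents containing every relation key of the sketched sub-layout."""
    hits = [index.get(key, set()) for key in relation_keys(query_blocks)]
    return set.intersection(*hits) if hits else set()

# Two toy pages and two sketched sub-layout queries.
docs = {
    "page_a": [("headline", 0, 0, 100, 10), ("image", 0, 10, 50, 60), ("text", 50, 10, 100, 60)],
    "page_b": [("headline", 0, 0, 100, 10), ("text", 0, 10, 100, 60)],
}
index = build_index(docs)
image_beside_text = [("image", 0, 0, 10, 10), ("text", 10, 0, 20, 10)]
headline_over_text = [("headline", 0, 0, 10, 5), ("text", 0, 5, 10, 10)]
print(candidates(index, image_beside_text))
print(candidates(index, headline_over_text))
```

Only the documents that survive this lookup would need to be passed to a more expensive structural comparison. Because the keys ignore absolute position and size, a sub-layout sketched anywhere on the canvas can still hit pages where the same arrangement occurs in a different region.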
arXiv:1609.02687v1 fatcat:itvnsou35jfmnmhams4w4stxdy