A Realistic Dataset for Performance Evaluation of Document Layout Analysis

Apostolos Antonacopoulos, David Bridson, Christos Papadopoulos, Stefan Pletschacher
2009 10th International Conference on Document Analysis and Recognition
There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded.
doi:10.1109/icdar.2009.271 dblp:conf/icdar/AntonacopoulosBPP09 fatcat:bsv54ehuyjc23g7rixqjrdlzlq