Historical document digitization through layout analysis and deep content classification

Andrea Corbelli, Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
2016 23rd International Conference on Pattern Recognition (ICPR)
Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network merged in a Random Forest classifier. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
doi:10.1109/icpr.2016.7900272 dblp:conf/icpr/CorbelliBGC16 fatcat:qhpnmbhdrzdnppxxq7hert7ule