An Image Based Approach for Content Analysis in Document Collections [chapter]

Reinhold Huber-Mörk, Alexander Schindler
2013 Lecture Notes in Computer Science  
We consider the task of content based analysis and categorization in large-scale historical book scanning projects. Mixed content, deprecated language, noise and unexpected distortions suggest an image based approach. Keypoint extractors combined with the bag of features approach are applied to scanned text documents. To incorporate spatial information into the bag of features approach, we consider three methods of spatial verification. An approach based on comparison of
... tical properties of local keypoint properties such as size, orientation and scale showed comparable quality in content comparison while being computationally much more efficient. Cluster analysis delivers groups of pages characterized by common properties; in particular, duplicated page content is detected with high reliability.
doi:10.1007/978-3-642-41939-3_27 fatcat:vrb2pqj6czdhlod5sksuszkpha
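
The bag of features step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes keypoint descriptors have already been extracted for each page and that a visual vocabulary (for example, cluster centres of training descriptors) is available. The names `bag_of_features` and `cosine_similarity`, the toy 2-D descriptors, and the choice of cosine similarity as the comparison measure are all hypothetical, since the abstract does not specify them. Spatial verification is not covered here.

```python
import numpy as np

def bag_of_features(descriptors, vocabulary):
    """Return an L1-normalised histogram of nearest visual words."""
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    # Index of the closest visual word for each descriptor.
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

def cosine_similarity(a, b):
    """Cosine of the angle between two histograms (assumed measure)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: three visual words and 2-D descriptors for three pages.
vocabulary = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[0.1, 0.0], [0.9, 0.1], [1.1, -0.1], [0.0, 0.9]])
page_b = np.array([[0.0, 0.1], [1.0, 0.0], [0.9, 0.0], [0.1, 1.1]])
page_c = np.array([[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1], [0.0, 0.8]])

hist_a = bag_of_features(page_a, vocabulary)
hist_b = bag_of_features(page_b, vocabulary)
hist_c = bag_of_features(page_c, vocabulary)

sim_ab = cosine_similarity(hist_a, hist_b)  # pages with similar content
sim_ac = cosine_similarity(hist_a, hist_c)  # pages with different content
print(sim_ab, sim_ac)
```

In this toy setting, pages A and B map to the same visual word histogram and would be flagged as likely duplicates, while page C scores lower. Real descriptors are higher dimensional, and the vocabulary is typically much larger.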