Duplicate detection approaches for quality assurance of document image collections

Roman Graf, Reinhold Huber-Mörk, Alexander Schindler, Sven Schlarb
2013 Proceedings of the Fifth International Conference on Management of Emergent Digital EcoSystems - MEDES '13  
This paper presents an evaluation of different methods for automatic duplicate detection in digitized collections. These approaches are meant to support quality assurance and decision making for long-term preservation of digital content in libraries and archives. In this paper we demonstrate advantages and drawbacks of different approaches. Our goal is to select the most efficient method that satisfies the digital preservation requirements for duplicate detection in digital document image collections. Workflows of different complexity were designed in order to demonstrate possible duplicate detection approaches. Assessment of individual approaches is based on workflow simplicity, detection accuracy and acceptable performance, since image processing methods typically require significant computation. Applied image processing methods create expert knowledge that facilitates decision making for long-term preservation. We employ AI technologies such as expert rules and clustering for inferring explicit knowledge on the content of the digital collection. A statistical analysis of the aggregated information and the qualitative analysis of the aggregated knowledge are presented in the evaluation part of the paper.
doi:10.1145/2536146.2536157 dblp:conf/medes/GrafHSS13 fatcat:fnbtrmry4vcrriagcilhdwktrq