Scalable Feature Extraction from Noisy Documents

Loic Lecerf, Boris Chidlovskii
2009 10th International Conference on Document Analysis and Recognition
We address metadata recognition in layout-oriented documents. We treat the problem as a classification task and propose a method for automatically extracting relevant features in the presence of content and structural noise caused by scanning, OCR and segmentation problems. The method is based on automatic analysis of the documents and requires no particular preprocessing. It mines the documents and determines frequent patterns, which are both literal patterns and their
... n. We also propose a sampling technique that processes a sample of documents and uses the Chernoff bounds to estimate the pattern frequency in the entire dataset. As the number of frequent patterns serving as feature candidates grows, the method applies a scalable feature selection method to determine the features most relevant to a given classification task. A series of evaluations on two collections shows that the method performs comparably to the manual rule writing done by domain experts.
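The sampling idea can be illustrated with a minimal sketch. The abstract does not state which form of the Chernoff bound the authors use, so the code below assumes the common additive (Chernoff-Hoeffding) form, P(|p_hat - p| >= epsilon) <= 2 exp(-2 n epsilon^2), and solves it for the sample size n. The names `sample_size`, `estimate_frequency`, `contains_pattern`, `epsilon` and `delta` are hypothetical and do not come from the paper.

```python
import math
import random


def sample_size(epsilon, delta):
    """Smallest n such that 2 * exp(-2 * n * epsilon**2) <= delta.

    Assumes the additive Chernoff-Hoeffding bound; the paper may use
    a different variant.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_frequency(documents, contains_pattern, epsilon, delta, rng=None):
    """Estimate the fraction of documents that contain a pattern.

    Draws n documents uniformly with replacement, so the draws are
    independent and the bound applies. Returns (estimate, n).
    """
    rng = rng or random.Random(0)
    n = sample_size(epsilon, delta)
    sample = [rng.choice(documents) for _ in range(n)]
    hits = sum(1 for doc in sample if contains_pattern(doc))
    return hits / n, n


if __name__ == "__main__":
    # Toy collection: 30% of the documents contain the literal "Abstract".
    docs = ["Abstract ..."] * 300 + ["Body text ..."] * 700
    est, n = estimate_frequency(docs, lambda d: "Abstract" in d, 0.05, 0.05)
    print(f"sampled {n} documents, estimated frequency {est:.3f}")
```

Under this assumed bound the required sample size depends only on the tolerance `epsilon` and the failure probability `delta`, not on the size of the collection, which is what makes sampling attractive for large datasets.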
doi:10.1109/icdar.2009.227 dblp:conf/icdar/LecerfC09 fatcat:cihmizehgbfnriu2nopv3a2xai