Public domain optical character recognition
Document Recognition II
A public domain document processing system has been developed by the National Institute of Standards and Technology (NIST). The system is a standard reference form-based handprint recognition system for evaluating optical character recognition (OCR), and it is intended to provide a baseline of performance on an open application. The system's source code, training data, performance assessment tools, and types of forms processed are all publicly available. The system recognizes the handprint
... on Handwriting Sample Forms like the ones distributed with NIST Special Database 1. From these forms, the system reads handprinted numeric fields, uppercase and lowercase alphabetic fields, and unconstrained text paragraphs composed of words from a limited-size dictionary. The modular design of the system makes it useful for component evaluation and comparison, training and testing set validation, and multiple-system voting schemes. The system contains a number of significant contributions to OCR technology, including an optimized Probabilistic Neural Network (PNN) classifier that runs 20 times faster than traditional software implementations of the algorithm. The source code for the recognition system is written in C and is organized into 11 libraries. In all, there are approximately 19,000 lines of code supporting more than 550 subroutines. Source code is provided for form registration, form removal, field isolation, field segmentation, character normalization, feature extraction, character classification, and dictionary-based postprocessing. The recognition system has been successfully compiled and tested on a host of UNIX workstations, including computers manufactured by Digital Equipment Corporation, Hewlett Packard, IBM, Silicon Graphics Incorporated, and Sun Microsystems.* This paper gives an overview of the recognition system's software architecture, including descriptions of the various system components along with timing and accuracy statistics.

A standard reference form-based handprint recognition system for evaluating optical character recognition (OCR) has been developed.¹ The system has been developed as an open application; the system's source code, training data, performance assessment tools, and types of forms processed are all publicly available. The system architecture and software organization are completely documented for those interested in technology integration.
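For readers unfamiliar with the PNN classifier mentioned above, its textbook decision rule can be written as follows. This is the standard formulation, assuming equal class priors and a single smoothing parameter; the symbols are introduced here for illustration only, and the optimizations behind the reported 20-fold speedup are not described in this overview.

```latex
% x        : feature vector of the character to classify
% p_{c,i}  : i-th training prototype of class c (N_c prototypes in class c)
% \sigma   : kernel smoothing parameter
\hat{c}(x) \;=\; \arg\max_{c}\; \frac{1}{N_c} \sum_{i=1}^{N_c}
    \exp\!\left( -\,\frac{\lVert x - p_{c,i} \rVert^{2}}{2\sigma^{2}} \right)
```

Evaluated directly, the rule computes a distance from the input to every stored prototype, so classification cost grows with the size of the training set. This is why the speed of a software implementation matters in practice.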
The source code for the standard reference system is written in C and is organized into 11 libraries. In all, there are approximately 19,000 lines of code supporting more than 550 subroutines. Source code is provided for form registration, form removal, field isolation, field segmentation, character normalization, feature extraction, character classification, and dictionary-based postprocessing. Any portion of the system may be used without restriction in commercial products. Due to its modular design, a component of the system may be easily replaced by an alternative algorithm. The same set of input data can then be run through the augmented system, and the performance of the standard reference system and the augmented system can be compared. The system can be retrained and tested in a controlled way so that the impact of different training set profiles can be compared, and a training set that provides maximum robustness can be determined. Developers may find that the techniques used in the standard reference system provide results complementary to those of their own systems. If this is the case, then combining the recognition results from the two systems, or allowing the systems to vote, may improve overall recognition performance, as demonstrated in Reference 2.
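As an illustration of how results from several systems might be combined, the following minimal sketch implements a confidence-weighted plurality vote per character. It is not the scheme evaluated in Reference 2, and nothing in it is taken from the NIST libraries: the `Hypothesis` type, the functions `vote_character` and `vote_field`, and the `MAX_SYSTEMS` limit are all hypothetical. It also assumes that every system has segmented the field into the same number of characters and reports a non-negative confidence for each one.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYSTEMS 8   /* arbitrary cap chosen for this sketch */

/* Hypothetical per-character result: a label plus a non-negative
 * confidence. The source does not describe the actual result format. */
typedef struct {
    char  label;
    float confidence;
} Hypothesis;

/* Confidence-weighted plurality vote over one character position.
 * Ties go to the earliest system in the array. */
char vote_character(const Hypothesis *hyps, size_t n_systems)
{
    float score[256] = {0.0f};
    unsigned char best;
    size_t i;

    assert(hyps != NULL && n_systems > 0);

    /* Accumulate the total confidence given to each label. */
    for (i = 0; i < n_systems; i++) {
        assert(hyps[i].confidence >= 0.0f);
        score[(unsigned char)hyps[i].label] += hyps[i].confidence;
    }

    /* Pick the label with the highest total; the strict '>' keeps
     * the earlier system's label when totals are equal. */
    best = (unsigned char)hyps[0].label;
    for (i = 1; i < n_systems; i++) {
        unsigned char c = (unsigned char)hyps[i].label;
        if (score[c] > score[best])
            best = c;
    }
    return (char)best;
}

/* Vote over a whole field. `results` holds n_systems rows of n_chars
 * hypotheses each (row-major); `out` needs room for n_chars + 1 bytes. */
void vote_field(const Hypothesis *results, size_t n_systems,
                size_t n_chars, char *out)
{
    Hypothesis column[MAX_SYSTEMS];
    size_t s, k;

    assert(results != NULL && out != NULL);
    assert(n_systems > 0 && n_systems <= MAX_SYSTEMS);

    for (k = 0; k < n_chars; k++) {
        /* Gather every system's hypothesis for character k. */
        for (s = 0; s < n_systems; s++)
            column[s] = results[s * n_chars + k];
        out[k] = vote_character(column, n_systems);
    }
    out[n_chars] = '\0';
}
```

With only two systems the vote reduces to keeping the more confident label wherever the systems disagree. A third system allows two weaker but agreeing hypotheses to outvote a single stronger one.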