Extracting person names from diverse and noisy OCR text

Thomas L. Packer, Joshua F. Lutes, Aaron P. Stewart, David W. Embley, Eric K. Ringger, Kevin D. Seppi, Lee S. Jensen
2010 Proceedings of the fourth workshop on Analytics for noisy unstructured text data - AND '10  
Named entity recognition from scanned and OCRed historical documents can contribute to historical research. However, entity recognition from historical documents is more difficult than from natively digital data because of the presence of word errors and the absence of complete formatting information. We apply four extraction algorithms to various types of noisy OCR data found "in the wild" and focus on full name extraction. We evaluate the extraction quality with respect to hand-labeled test ... a and improve upon the extraction performance of the individual systems by means of ensemble extraction. We also evaluate the strategies with different applications in mind: the target applications (browsing versus retrieval) involve a trade-off between precision and recall. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
doi:10.1145/1871840.1871845 dblp:conf/and/PackerLSERSJ10 fatcat:6hph2wedg5ef7mzdwxi5g7qt4y