Optical Character Recognition and Parsing of Typeset Mathematics

Richard J. Fateman, Taku Tokuyasu, Benjamin P. Berman, Nicholas Mitchell
1996 Journal of Visual Communication and Image Representation  
There is a wealth of mathematical knowledge that could be potentially very useful in many computational applications, but is not available in electronic form. This knowledge comes in the form of mechanically typeset books and journals going back more than one hundred years. Besides these older sources, there are a great many current publications, filled with useful mathematical information, which are difficult if not impossible to obtain in electronic form. Our work intends to encode, for use by computer algebra systems, integral tables and other documents currently available in hardcopy only. Our strategy is to extract character information from these documents, which is then passed to higher-level parsing routines for further extraction of mathematical content (or any other useful two-dimensional semantic content). This information can then be output as, for example, a Lisp or TeX expression. We have also developed routines for rapid access to this information, specifically for finding matches with formulas in a table of integrals. This paper reviews our current efforts, and summarizes our results and the problems we have encountered.
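To make the final output step concrete, the following is a minimal sketch of how one parsed formula might be written out as either a Lisp or a TeX expression. It is not the authors' implementation: the `Node` class, the functions `to_lisp` and `to_tex`, and the operator names (`"+"`, `"/"`, `"^"`, `"integral"`) are all hypothetical, chosen only for illustration, and the expression tree is assumed to have already been produced by the OCR and parsing stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical expression-tree node: an operator applied to arguments."""
    op: str
    args: tuple


def to_lisp(e):
    """Serialize a tree as a prefix (Lisp-style) expression."""
    if isinstance(e, Node):
        return "(" + " ".join([e.op] + [to_lisp(a) for a in e.args]) + ")"
    return str(e)  # leaves are symbols or numbers


def to_tex(e):
    """Serialize the same tree as TeX; only a few operators are handled."""
    if not isinstance(e, Node):
        return str(e)
    a = [to_tex(x) for x in e.args]
    if e.op == "/":
        return r"\frac{" + a[0] + "}{" + a[1] + "}"
    if e.op == "^":
        return a[0] + "^{" + a[1] + "}"
    if e.op == "integral":
        return r"\int " + a[0] + r"\,d" + a[1]
    # Assumed fallback: treat any other operator as infix.
    return (" " + e.op + " ").join(a)


# Illustrative integrand 1/(x^2 + 1), integrated with respect to x.
expr = Node("integral",
            (Node("/", (1, Node("+", (Node("^", ("x", 2)), 1)))), "x"))

print(to_lisp(expr))  # (integral (/ 1 (+ (^ x 2) 1)) x)
print(to_tex(expr))   # \int \frac{1}{x^{2} + 1}\,dx
```

Keeping a single tree and deriving both output forms from it is one plausible design: the prefix form is convenient for a computer algebra system to read, while the TeX form can be checked visually against the scanned page.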
doi:10.1006/jvci.1996.0002 fatcat:llyu2ae2lzbu7fzymfouajx76a