Efficient and effective OCR engine training

Christian Clausner, Apostolos Antonacopoulos, Stefan Pletschacher
2019 International Journal on Document Analysis and Recognition  
Repository version (USIR, University of Salford): http://usir.salford.ac.uk/id/eprint/52696/

Abstract
We present an efficient and effective approach to train OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine's training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative incremental training to achieve best results. The widely used Tesseract OCR engine is used as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.
doi:10.1007/s10032-019-00347-8 fatcat:awnt5v62yrfyrcbflvgmiggeri