
Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin [article]

Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter
2018 arXiv   pre-print
In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcription.  ...  We also provide some pretrained OCRopus models for subcorpora of our dataset, yielding between 95% (early printings) and 98% (19th century Fraktur printings) character accuracy rates on unseen test cases  ...  Acknowledgments: We are grateful to our colleagues Phillip Beckenbauer for aligning the ground truth of the Early New High German Corpus to the printed text lines of the respective books and the training  ... 
arXiv:1809.05501v1 fatcat:spsdr5alarhebckhqcagyau7r4
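
Character accuracy rates like the 95% and 98% reported in this entry are conventionally computed from the edit distance between a ground-truth transcription and the OCR output, normalized by the length of the ground truth. Below is a minimal Python sketch of that metric; the function names are illustrative and not taken from any of the tools cited on this page.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(gt: str, ocr: str) -> float:
    """Character accuracy rate = 1 - (edit distance / number of GT characters)."""
    return 1.0 - levenshtein(gt, ocr) / max(len(gt), 1)

# example on one text line (long s in the ground truth, round s in the OCR output)
print(f"{character_accuracy('Dieſe Zeitung iſt alt', 'Diese Zeitung ist alt'):.3f}")
```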

Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs [article]

Matteo Romanello, Sven Najem-Meyer, Bruce Robertson
2021 arXiv   pre-print
As part of this paper, we also release GT4HistComment, a small dataset with OCR ground truth for 19th-century classical commentaries and Pogretra, a large collection of training data and pre-trained models for  ...  In this paper, we evaluate the performances of two pipelines suitable for the OCR of historical classical commentaries.  ...  Recurrent neural networks with Long Short-Term Memory (LSTM), such as implemented in OCRopus 2 and 3, led to good results both on historical documents in English and German Fraktur script [2], and  ... 
arXiv:2110.06817v1 fatcat:jdcursc7vjcitpgnh6p74id32e
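
The LSTM line recognizers mentioned here (as implemented in OCRopus and its successors) read a text-line image column by column and are trained with a CTC loss over the character alphabet. The PyTorch sketch below shows that architecture in minimal form; all sizes are illustrative and this is not the OCRopus implementation itself.

```python
import torch
import torch.nn as nn

class LineRecognizer(nn.Module):
    """Bidirectional LSTM over the columns of a text-line image, CTC output layer."""
    def __init__(self, img_height=48, hidden=128, num_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(input_size=img_height, hidden_size=hidden,
                            num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes + 1)   # +1 for the CTC blank

    def forward(self, x):
        # x: (batch, width, height) -- every image column is one time step
        out, _ = self.lstm(x)
        return self.head(out)                                # (batch, width, classes + 1)

model = LineRecognizer()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

x = torch.randn(4, 200, 48)                   # 4 line images, 200 columns, height 48
log_probs = model(x).log_softmax(dim=2)
targets = torch.randint(1, 101, (4, 30))      # integer-encoded transcriptions
loss = ctc(log_probs.permute(1, 0, 2),        # CTC expects (T, batch, classes)
           targets,
           torch.full((4,), 200, dtype=torch.long),
           torch.full((4,), 30, dtype=torch.long))
loss.backward()
```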

OCR of historical printings of Latin texts

Uwe Springmann, Dietmar Najock, Hermann Morgenroth, Helmut Schmid, Annette Gotscharek, Florian Fink
2014 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage - DATeCH '14  
Initial experiments with the OCR engines Tesseract and OCRopus show that some training on historical fonts and the application of lexical resources raise character accuracies beyond those of Finereader  ...  This paper deals with the application of OCR methods to historical printings of Latin texts.  ...  In the case of early modern Latin the situation is less dramatic than for, e.g., early German, where the lack of any orthographic standard meant that the same word could be spelled differently within the same document.  ... 
doi:10.1145/2595188.2595205 dblp:conf/datech/SpringmannNMSGF14 fatcat:y5alwyxxkvhkreg4elpv7vhrxa
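
One straightforward way to use "lexical resources" of the kind this entry mentions is dictionary-based post-correction: replace OCR tokens that are not in a lexicon by their closest lexicon entry. The sketch below uses Python's difflib and a tiny illustrative word list; it shows the general idea only and is not the correction method used in the paper.

```python
from difflib import get_close_matches

def correct_token(token: str, lexicon: set, cutoff: float = 0.75) -> str:
    """Replace an out-of-lexicon token by its closest lexicon entry, if any."""
    if token in lexicon or not token.isalpha():
        return token
    matches = get_close_matches(token, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_line(line: str, lexicon: set) -> str:
    return " ".join(correct_token(t, lexicon) for t in line.split())

lexicon = {"dominus", "domus", "omnibus"}        # tiny illustrative Latin word list
print(correct_token("dorninus", lexicon))        # rn/m confusion -> "dominus"
```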

New Approaches to OCR for Early Printed Books

Nikolaus Weichselbaumer, Mathias Seuret, Saskia Limbach, Rui Dong, Manuel Burghardt, Vincent Christlein
2020 DigItalia  
Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data.  ...  We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th centuries: the well-known Fraktur and the lesser-known Bastarda, Rotunda, Textura and Schwabacher  ...  One of the biggest obstacles to OCR use in early printed books is the fact that OCR engines are usually trained with modern-day fonts.  ... 
doi:10.36181/digitalia-00015 fatcat:rukrigshsjddxcvhdc6bcyxvpi

Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning [article]

Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe
2018 arXiv   pre-print
We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions  ...  Base Lines is the number of lines used to train the voters of the previous iteration. The lines added (Add. Lines) randomly (RDM)  ... 
arXiv:1802.10038v2 fatcat:oudtxdhjsjan3ijp2ykw4dsgbi
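
Methods (1) and (2) of this entry amount to splitting the ground truth into k folds and fine-tuning one model per fold from an existing mixed model rather than from scratch. A schematic sketch follows; `train_fn` is a hypothetical hook standing in for the actual OCR trainer (e.g. an OCRopus or Calamari call) and is not part of any real API.

```python
import random

def cross_fold_training(gt_pairs, train_fn, k=5, pretrained="fraktur_mixed.model"):
    """Train k models, each on k-1 folds of the (line image, transcription) pairs,
    each starting from a pretrained mixed model instead of from scratch."""
    pairs = list(gt_pairs)
    random.shuffle(pairs)
    folds = [pairs[i::k] for i in range(k)]
    models = []
    for held_out in range(k):
        train_set = [p for j, fold in enumerate(folds) if j != held_out for p in fold]
        models.append(train_fn(train_set, start_model=pretrained))
    return models

# usage (train_ocr_model is a hypothetical wrapper around the real trainer):
# voters = cross_fold_training(gt_pairs, train_fn=train_ocr_model, k=5)
```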

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus [article]

U. Springmann, A. Lüdeling
2017 arXiv   pre-print
The OCR results have been evaluated for accuracy against the ground truth of unseen test sets.  ...  Training specific OCR models was possible because the necessary *ground truth* is available as error-corrected diplomatic transcriptions.  ...  modern ones and the engines cannot be trained very well by end users on these historical fonts, training on modern fonts has only very limited value for the OCR of historical printings.  ... 
arXiv:1608.02153v2 fatcat:fkbrkdf7fvhm5mb6upxzukmjtm

Building an efficient OCR system for historical documents with little training data

Jiří Martínek, Ladislav Lenc, Pavel Král
2020 Neural computing & applications (Print)  
To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a little annotated training data.  ...  Therefore, this paper introduces a set of methods that allows performing OCR on historical document images using only a small amount of real, manually annotated training data.  ... 
doi:10.1007/s00521-020-04910-x fatcat:bj5roowsj5aa7ikdnrfntjjqg4
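
A common way to stretch a small amount of annotated data, and one of the ideas behind approaches of this kind, is to render additional synthetic text-line images from existing transcriptions. The Pillow sketch below illustrates that idea only; the Fraktur font path is an assumption and the code is not taken from the paper.

```python
from PIL import Image, ImageDraw, ImageFont

def render_line(text: str, font_path: str = "UnifrakturMaguntia.ttf",
                font_size: int = 32, pad: int = 8) -> Image.Image:
    """Render a transcription as a synthetic grayscale text-line image."""
    font = ImageFont.truetype(font_path, font_size)   # assumed Fraktur TTF on disk
    left, top, right, bottom = ImageDraw.Draw(Image.new("L", (1, 1))).textbbox(
        (0, 0), text, font=font)
    img = Image.new("L", (right - left + 2 * pad, bottom - top + 2 * pad), 255)
    ImageDraw.Draw(img).text((pad - left, pad - top), text, font=font, fill=0)
    return img

# synthetic (image, transcription) pairs can supplement scarce real ground truth
render_line("Von der Artzney beyder Glück").save("synthetic_line.png")
```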

Historical Document Processing: A Survey of Techniques, Tools, and Trends [article]

James P. Philips, Nasseh Tabrizi
2020 arXiv   pre-print
Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars.  ...  , to convert images of ancient manuscripts, letters, diaries, and early printed texts automatically into a digital format usable in data mining and information retrieval systems.  ...  For Western historical documents, research datasets exist for medieval Latin, medieval German and Spanish, a variety of early modern European languages, and eighteenth-century English.  ... 
arXiv:2002.06300v2 fatcat:nxufntuk7famfph6ownyuys2py

Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting

Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe
2018 2018 13th IAPR International Workshop on Document Analysis Systems (DAS)  
In this paper we introduce a method that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books.  ...  After allocating the available ground truth into different subsets, several training processes are performed, each resulting in a specific OCR model.  ...  But historical spelling patterns are much more variable than modern ones, and the same word is often spelled and printed in more than one form even within the same document.  ... 
doi:10.1109/das.2018.30 dblp:conf/das/ReulSWP18 fatcat:roujfrpycrdp3excggxibmz4ba
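
The voting step combines the outputs of the fold-trained models into one transcription per line. The paper uses character-level confidence voting; the sketch below is a deliberately crude stand-in that just picks the candidate most similar to all the others, included only to make the ensemble idea concrete.

```python
from difflib import SequenceMatcher

def consensus_line(candidates):
    """Return the candidate transcription most similar on average to the others."""
    def total_similarity(i):
        return sum(SequenceMatcher(None, candidates[i], candidates[j]).ratio()
                   for j in range(len(candidates)) if j != i)
    return candidates[max(range(len(candidates)), key=total_similarity)]

# e.g. outputs of five fold-trained models for the same line image
print(consensus_line(["Jn dem anfang", "In dem anfang", "In dem anfang",
                      "In dcm anfang", "In dem anfeng"]))   # -> "In dem anfang"
```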

Efficient and effective OCR engine training

Christian Clausner, Apostolos Antonacopoulos, Stefan Pletschacher
2019 International Journal on Document Analysis and Recognition  
We present an efficient and effective approach to train OCR engines using the Aletheia document analysis system.  ...  All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine's training processes themselves, text recognition, and quantitative evaluation of  ...  However, for the multitude of historical documents, and for documents written in the many smaller languages of the world, out-of-the-box OCR engines perform suboptimally or not at all.  ... 
doi:10.1007/s10032-019-00347-8 fatcat:awnt5v62yrfyrcbflvgmiggeri

OCR4all – An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings [article]

Christian Reul, Dennis Christ, Alexander Hartelt, Nico Balbach, Maximilian Wehner, Uwe Springmann, Christoph Wick, Christine Grundig, Andreas Büttner, Frank Puppe
2019 arXiv   pre-print
Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character  ...  In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow.  ...  the OCR4all workflow in close cooperation.  ... 
arXiv:1909.04032v1 fatcat:czzg6o6i5baxdcnsc2cacm5xmy

OCR4all—An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

Christian Reul, Dennis Christ, Alexander Hartelt, Nico Balbach, Maximilian Wehner, Uwe Springmann, Christoph Wick, Christine Grundig, Andreas Büttner, Frank Puppe
2019 Applied Sciences  
This is mostly due to the fact that the required ground truth for training stronger mixed models (for segmentation as well as text recognition) is not yet available, neither in the desired quantity nor  ...  Nevertheless, in the last few years, great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout analysis and segmentation, character  ...  Three early printed books, printed between 1476 and 1505 in German and Latin, were used as training and evaluation data.  ... 
doi:10.3390/app9224853 fatcat:3dd7pnyblrdq3e4lsjlodkd52y
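
At its core such a workflow chains preprocessing, layout segmentation, and text recognition over a set of page images. The sketch below only shows that shape; the stage functions are placeholders and none of them belong to the real OCR4all API.

```python
from pathlib import Path

def binarize(page_path: Path):
    return page_path            # placeholder: load and binarize the page image

def segment_lines(page):
    return []                   # placeholder: detect and return text-line regions

def recognize(line, model: str) -> str:
    return ""                   # placeholder: run a recognition model on one line

def run_workflow(image_dir: str, model: str) -> dict:
    """Sequential sketch: preprocessing -> segmentation -> recognition per page."""
    results = {}
    for page_path in sorted(Path(image_dir).glob("*.png")):
        page = binarize(page_path)
        results[page_path.name] = [recognize(l, model) for l in segment_lines(page)]
    return results
```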

From Text to Data [chapter]

2020 Digital Methods in the Humanities  
The code for the OCR pipeline, especially the pyFlow part, is based on the original work of Madis Rumming, a former member of the INF team.  ...  In addition to that, we can also compare our values to the ones published by Google. We focus on the values for modern English and Fraktur.  ...  In the context of OCR this transcription is called ground truth data. For these accuracy tests we will input the test data images into the pipeline and compare the output text with the ground truth  ... 
doi:10.14361/9783839454190-004 fatcat:ztai6dsg5fg2hlxkuh3e2vqowi

ICDAR 2013 Competition on Historical Book Recognition (HBR 2013)

A. Antonacopoulos, C. Clausner, C. Papadopoulos, S. Pletschacher
2013 2013 12th International Conference on Document Analysis and Recognition  
It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICDAR2013 and the 2nd International Workshop on Historical Document Imaging and Processing (HIP2013)  ...  However, there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical books, especially for OCR.  ...  in both Latin and Fraktur scripts.  ... 
doi:10.1109/icdar.2013.294 dblp:conf/icdar/AntonacopoulosCPP13a fatcat:66fixhe3rbgtjn2gxzj5w3apgi

Europeana Newspapers OCR Workflow Evaluation

Stefan Pletschacher, Christian Clausner, Apostolos Antonacopoulos
2015 Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing - HIP '15  
It gives a detailed overview of how the involved software performed on a representative dataset of historical newspaper pages (for which ground truth was created) with regard to general text accuracy as  ...  Specific types of errors are examined and evaluated in order to identify possible improvements related to the employed document image analysis and recognition methods.  ...  For a more realistic evaluation (current OCR engines are still limited with regard to the character sets they can recognise, especially for historical documents) both ground truth and result text  ... 
doi:10.1145/2809544.2809554 dblp:conf/icdar/PletschacherCA15 fatcat:qwe3fq74brbrdkfgiersdtw3cq
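
Restricting both ground truth and OCR output to the character set the engine can actually emit, as described in this entry, is a simple normalization pass applied before scoring. A sketch with an illustrative allowed set:

```python
def restrict_charset(text: str, allowed: set, replacement: str = "") -> str:
    """Drop (or replace) characters outside the engine's recognisable character set,
    applied to ground truth and OCR result alike before computing accuracy."""
    return "".join(c if c in allowed else replacement for c in text)

# illustrative character set: basic Latin letters, digits, space and punctuation
allowed = set("abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "0123456789 .,;:!?'\"()-")
print(restrict_charset("Dieſe Zeitung von 1865", allowed))   # the long s is dropped
```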
Showing results 1-15 of 60.