
Correcting OCR text by association with historical datasets

Susan E. Hauser, Jonathan Schlaifer, Tehseen F. Sabir, Dina Demner-Fushman, Scott Straughan, George R. Thoma, Tapas Kanungo, Elisa H. Barney Smith, Jianying Hu, Paul B. Kantor
2003 Document Recognition and Retrieval X  
This is done by comparing affiliations historically associated with the author name to the OCR text in the affiliation field.  ...  Our objective is to use the historical author and affiliation relationships from this large dataset to find potentially correct, complete affiliations based on the author text and the affiliation text  ... 
doi:10.1117/12.476046 dblp:conf/drr/HauserSSDST03 fatcat:r42rz4fiq5hgrmchazp5fpj5s4

Evaluating the Impact of OCR Errors on Topic Modeling [chapter]

Stephen Mutuvi, Antoine Doucet, Moses Odeo, Adam Jatowt
2018 Lecture Notes in Computer Science  
In this paper, we explore the impact of OCR errors on the identification of topics from a corpus comprising text from historical OCRed documents.  ...  Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms and we quantify the strength  ...  In the latter, documents may be strongly associated with a given topic in one run, but may be more closely associated with an alternative topic in another run [14].  ...
doi:10.1007/978-3-030-04257-8_1 fatcat:qikszebtf5gjbhz55corlljyfy

Enhancing Predictability of Handwritten Document Content using HTR and Word Substitution

2020 International Journal of Innovative Science and Modern Engineering  
Blur detection is performed on every word before segmentation, and blurred words are substituted with suitable words by our OCR algorithm to avoid false positive results.  ...  Handwritten Text Recognition (HTR) can degrade progressively when documents are damaged by smudges, blemishes and blurs. Recognition of such documents is a challenging task.  ...  Table I compares the mean text similarity for these documents with text recognized by HTR, the recognized text then corrected with word substitutions and the text recognized using state-of-the-art Google  ...
doi:10.35940/ijisme.g1240.056720 fatcat:4th3s5ukj5cehaczgtvfhfydua

Survey of Post-OCR Processing Approaches

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickaël Coustaty, Antoine Doucet
2021 Zenodo  
Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected.  ...  OCR engines can perform well on modern text; unfortunately, their performance is significantly reduced on historical materials.  ...  The OCR-GT is the OCR ground truth of a dataset of historical newspapers whose texts were published between 1700 and 1995. It comprises 2,000 pages processed by ABBYY FineReader versions 8.1, 9.0, and 10.0.  ...
doi:10.5281/zenodo.4635569 fatcat:x5qoluap7rgyxakv5lm5qcysya

Digitised historical text: Does it have to be mediOCRe?

Beatrice Alex, Claire Grover, Ewan Klein, Richard Tobin
2012 Conference on Natural Language Processing  
We analyse the quality of OCRed text compared to a gold standard and show how it can be improved by performing two automatic correction steps.  ...  This paper reports on experiments to improve the Optical Character Recognition (OCR) quality of historical text as a preliminary step in text mining.  ...  We would also like to thank Clare Llewellyn for her valuable input and help with the annotation.  ...
dblp:conf/konvens/AlexGKT12 fatcat:zojxhnpbufck5lxtkas3nddh7i

Survey of Post-OCR Processing Approaches

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickaël Coustaty, Antoine Doucet
2021 Zenodo  
Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected.  ...  OCR engines can perform well on modern text; unfortunately, their performance is significantly reduced on historical materials.  ...  Three recent competitions on post-OCR text correction, organised at ICDAR2017, ICDAR2019, and ALTA2017, are starting points for solving this problem by assessing submitted approaches on the same dataset with  ...
doi:10.5281/zenodo.4640070 fatcat:6jnyehazujadvejgls6vpnu6ta

Neural OCR Post-Hoc Correction of Historical Corpora [article]

Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu
2021 arXiv pre-print
Evaluation on a historical book corpus in German shows that our models are robust in capturing diverse OCR transcription errors and reduce an initial word error rate of 32.3% by more than 89%.  ...  Optical character recognition (OCR) is crucial for deeper access to historical collections.  ...  This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).  ...
arXiv:2102.00583v1 fatcat:splb3gorvba6rjpyzgakyol4oe

Neural OCR Post-Hoc Correction of Historical Corpora

Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu
2021 Transactions of the Association for Computational Linguistics  
Evaluation on a historical book corpus in German shows that our models are robust in capturing diverse OCR transcription errors and reduce an initial word error rate of 32.3% by more than 89%.  ...  Optical character recognition (OCR) is crucial for deeper access to historical collections.  ...  Acknowledgments: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).  ...
doi:10.1162/tacl_a_00379 fatcat:hahxk3vcsbgoraprqw5p5brsva

Automatic Assessment of OCR Quality in Historical Documents

Anshul Gupta, Ricardo Gutierrez-Osuna, Matthew Christy, Boris Capitanu, Loretta Auvil, Liz Grumbach, Richard Furuta, Laura Mandell
2015 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference
When evaluated on a dataset containing over 72,000 manually-labeled bounding boxes (BBs) from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall.  ...  Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools.  ...  A number of studies have focused on post-correcting errors in OCR outputs by modeling typographical variations in historical documents; see (Reynaert 2008; Reffle and Ringlstetter 2013) and references  ...
doi:10.1609/aaai.v29i1.9487 fatcat:packdzbakvcuxjjt36qjertc7m

ICDAR 2013 Competition on Historical Book Recognition (HBR 2013)

A. Antonacopoulos, C. Clausner, C. Papadopoulos, S. Pletschacher
2013 2013 12th International Conference on Document Analysis and Recognition  
However, there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical books, especially for OCR.  ...  It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICDAR2013 and the 2nd International Workshop on Historical Document Imaging and Processing (HIP2013)  ...  by creating a dataset with ground truth [4] and making it available to all researchers.  ...
doi:10.1109/icdar.2013.294 dblp:conf/icdar/AntonacopoulosCPP13a fatcat:66fixhe3rbgtjn2gxzj5w3apgi

Correcting noisy OCR

John Evershed, Kent Fitch
2014 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage - DATeCH '14  
We describe a system for automatic post-OCR text correction of digital collections of historical texts.  ...  Word correction candidates are generated by a deep heuristic search of weighted edit combinations guided by a trie. Testing shows good improvements in word error rate.  ...  For testing, raw OCR text from a relevant corpus, paired with ground-truth text, is needed.  ...
doi:10.1145/2595188.2595200 dblp:conf/datech/EvershedF14 fatcat:ll6ezhvvxnaprg4wvbuj3t6n5e

Multi-Input Attention for Unsupervised OCR Correction

Rui Dong, David Smith
2018 Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)  
We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding  ...  A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences.  ...  Acknowledgements This work was supported by NIH grant 2R01DC009834-06A1, the Andrew W.  ... 
doi:10.18653/v1/p18-1220 dblp:conf/acl/SmithD18 fatcat:uvapnkodyjbxhonp3k57u6auwy

Constructing a Recipe Web from Historical Newspapers [chapter]

Marieke van Erp, Melvin Wevers, Hugo Huurdeman
2018 Lecture Notes in Computer Science  
We provide OCR quality indicators and assess their impact on the extraction process. We enrich the recipes with links to information on the ingredients.  ...  Our research shows how natural language processing, machine learning, and semantic web technologies can be combined to construct a rich dataset from heterogeneous newspapers for the historical analysis of food culture  ...  We thank Jesse de Does for the OCR quality measure, Marten Postma and Emiel van Miltenburg for querying Open Dutch WordNet, and Richard Zijdeman for fruitful discussions on the dataset concept.  ...
doi:10.1007/978-3-030-00671-6_13 fatcat:u2ttyp37uzbr5ltxbgebacczhq

Efficient and effective OCR engine training

Christian Clausner, Apostolos Antonacopoulos, Stefan Pletschacher
2019 International Journal on Document Analysis and Recognition  
Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects.  ...  All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine's training processes themselves, text recognition, and quantitative evaluation of  ...  In [6], for instance, it is reported how recognition rates for non-mainstream documents (Polish historical texts) can be significantly improved by training the OCR engine used.  ...
doi:10.1007/s10032-019-00347-8 fatcat:awnt5v62yrfyrcbflvgmiggeri

A Dataset for Toponym Resolution in Nineteenth-Century English Newspapers

Mariona Coll Ardanuy, David Beavan, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, Katherine McDonough, Federico Nanni, Daniel van Strien, Daniel C. S. Wilson
2022 Journal of Open Humanities Data  
We present a new dataset for the task of toponym resolution in digitized historical newspapers in English.  ...  The dataset consists of 3,364 annotated toponyms, of which 2,784 have been provided with a link to Wikipedia.  ...  with data access, and to the members of Living with Machines who helped with the annotations.  ... 
doi:10.5334/johd.56 fatcat:2wenxxqnvvfdtd67y645vipg4e