Correcting OCR text by association with historical datasets
2003
Document Recognition and Retrieval X
This is done by comparing affiliations historically associated with the author name to the OCR text in the affiliation field. ...
Our objective is to use the historical author and affiliation relationships from this large dataset to find potentially correct, complete affiliations based on the author text and the affiliation text ...
doi:10.1117/12.476046
dblp:conf/drr/HauserSSDST03
fatcat:r42rz4fiq5hgrmchazp5fpj5s4
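The association idea above can be sketched minimally: look up affiliations historically linked to an author name and keep the one most similar to the noisy OCR text. This is an illustration only, not the paper's method; the `HISTORY` index, author names, and the similarity threshold are all hypothetical, and `difflib.SequenceMatcher` stands in for whatever matching the authors actually used.

```python
from difflib import SequenceMatcher

# Hypothetical historical index: author name -> affiliations previously seen
# with that author. All names and strings here are illustrative.
HISTORY = {
    "smith j": [
        "National Library of Medicine, Bethesda, MD",
        "University of Maryland, College Park, MD",
    ],
}

def correct_affiliation(author: str, ocr_affiliation: str, threshold: float = 0.6):
    """Return the historical affiliation most similar to the OCR text,
    or None if no candidate clears the similarity threshold."""
    best, best_score = None, threshold
    for cand in HISTORY.get(author.lower(), []):
        score = SequenceMatcher(None, ocr_affiliation.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best

# OCR mangled several characters; the historical match recovers the full string.
print(correct_affiliation("Smith J", "Natl0nal Librarv of Med1cine, Bethesda, MD"))
```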
Evaluating the Impact of OCR Errors on Topic Modeling
[chapter]
2018
Lecture Notes in Computer Science
In this paper, we explore the impact of OCR errors on the identification of topics from a corpus comprising text from historical OCRed documents. ...
Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms and we quantify the strength ...
In the latter, documents may be strongly associated with a given topic in one run, but may be more closely associated with an alternative topic in another run [14]. ...
doi:10.1007/978-3-030-04257-8_1
fatcat:qikszebtf5gjbhz55corlljyfy
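Experiments like the one above need controlled OCR noise. A minimal sketch of noise injection, assuming a hand-picked table of typical character confusions (the `CONFUSIONS` map and the vocabulary-overlap measure are illustrative choices, not the paper's setup):

```python
import random

# Illustrative OCR-style character confusions; real studies derive these
# from aligned OCR/ground-truth corpora.
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn"}

def add_ocr_noise(text: str, rate: float, seed: int = 0) -> str:
    """Replace confusable characters with a typical OCR error at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch.lower()])
        else:
            out.append(ch)
    return "".join(out)

clean = "the quality of historical documents varies considerably"
noisy = add_ocr_noise(clean, rate=0.3)
# One crude proxy for the degradation: how much of the clean vocabulary survives.
overlap = len(set(clean.split()) & set(noisy.split())) / len(set(clean.split()))
print(noisy, overlap)
```

Feeding clean and noised variants of the same corpus to a topic model then lets the stability effect be measured as a function of `rate`.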
Enhancing Predictability of Handwritten Document Content using HTR and Word Substitution
2020
International Journal of Innovative Science and Modern Engineering
Blur detection is performed on every word before segmentation; words detected as blurred are substituted with suitable words by our OCR algorithm to avoid false positive results. ...
Handwritten Text Recognition (HTR) accuracy can degrade severely when documents are damaged by smudges, blemishes and blurs. Recognition of such documents is a challenging task. ...
Table I compares the mean text similarity for these documents across text recognized by HTR, the recognized text corrected with word substitutions, and the text recognized using state-of-the-art Google ...
doi:10.35940/ijisme.g1240.056720
fatcat:4th3s5ukj5cehaczgtvfhfydua
Survey of Post-OCR Processing Approaches
2021
Zenodo
Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. ...
OCR engines can perform well on modern text; unfortunately, their performance is significantly reduced on historical materials. ...
The OCR-GT is the OCR ground truth of a dataset of historical newspapers whose texts were published between 1700 and 1995. It has 2,000 pages processed by Abbyy FineReader versions 8.1, 9.0, and 10.0. ...
doi:10.5281/zenodo.4635569
fatcat:x5qoluap7rgyxakv5lm5qcysya
Digitised historical text: Does it have to be mediOCRe?
2012
Conference on Natural Language Processing
We analyse the quality of OCRed text compared to a gold standard and show how it can be improved by performing two automatic correction steps. ...
This paper reports on experiments to improve the Optical Character Recognition (OCR) quality of historical text as a preliminary step in text mining. ...
We would also like to thank Clare Llewellyn for her valuable input and help with the annotation. ...
dblp:conf/konvens/AlexGKT12
fatcat:zojxhnpbufck5lxtkas3nddh7i
Survey of Post-OCR Processing Approaches
2021
Zenodo
Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. ...
OCR engines can perform well on modern text; unfortunately, their performance is significantly reduced on historical materials. ...
Three recent competitions on post-OCR text correction organised in ICDAR2017, ICDAR2019, and ALTA2017 are starting points of solving this problem by assessing submitted approaches on the same dataset with ...
doi:10.5281/zenodo.4640070
fatcat:6jnyehazujadvejgls6vpnu6ta
Neural OCR Post-Hoc Correction of Historical Corpora
[article]
2021
arXiv
pre-print
Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%. ...
Optical character recognition (OCR) is crucial for a deeper access to historical collections. ...
This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28). ...
arXiv:2102.00583v1
fatcat:splb3gorvba6rjpyzgakyol4oe
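The word error rate cited above is the standard word-level edit distance normalised by reference length. A self-contained sketch of that metric (the example sentence is illustrative; the 32.3% / 89% figures are reported by the paper, not derived here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (r != h)  # substitution (or match)
                           ))
        prev = cur
    return prev[-1] / len(ref)

# Two of four reference words are corrupted, so WER = 0.5.
print(word_error_rate("the old town hall", "the 0ld town ha11"))  # 0.5
```

An 89% relative reduction of a 32.3% WER would leave roughly 0.323 × (1 − 0.89) ≈ 3.6% WER on the corrected output.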
Neural OCR Post-Hoc Correction of Historical Corpora
2021
Transactions of the Association for Computational Linguistics
Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%. ...
Optical character recognition (OCR) is crucial for a deeper access to historical collections. ...
Acknowledgments This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28). ...
doi:10.1162/tacl_a_00379
fatcat:hahxk3vcsbgoraprqw5p5brsva
Automatic Assessment of OCR Quality in Historical Documents
2015
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference
When evaluated on a dataset containing over 72,000 manually-labeled BBs from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall. ...
Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. ...
A number of studies have focused on post-correcting errors in OCR outputs by modeling typographical variations in historical documents; see (Reynaert 2008; Reffle and Ringlstetter 2013) and references ...
doi:10.1609/aaai.v29i1.9487
fatcat:packdzbakvcuxjjt36qjertc7m
ICDAR 2013 Competition on Historical Book Recognition (HBR 2013)
2013
2013 12th International Conference on Document Analysis and Recognition
However, there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical books, especially for OCR. ...
It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICDAR2013 and the 2nd International Workshop on Historical Document Imaging and Processing (HIP2013 ...
by creating a dataset with ground truth [4] and making it available to all researchers. ...
doi:10.1109/icdar.2013.294
dblp:conf/icdar/AntonacopoulosCPP13a
fatcat:66fixhe3rbgtjn2gxzj5w3apgi
Correcting noisy OCR
2014
Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage - DATeCH '14
We describe a system for automatic post OCR text correction of digital collections of historical texts. ...
Word correction candidates are generated by a deep heuristic search of weighted edit combinations guided by a trie. Testing shows good improvements in word error rate. ...
TESTS AND RESULTS
Datasets Raw OCR text from a relevant corpus, paired with ground-truth text, is needed. ...
doi:10.1145/2595188.2595200
dblp:conf/datech/EvershedF14
fatcat:ll6ezhvvxnaprg4wvbuj3t6n5e
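The trie-guided candidate search mentioned above can be sketched in miniature: walk a trie built from a lexicon, spending an edit budget on substitutions, insertions, and deletions. The paper's search uses weighted edit costs and a large lexicon; this toy version uses a uniform cost of 1 and an invented three-word lexicon purely for illustration.

```python
def build_trie(words):
    """Nested-dict trie; '$' marks end of word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def candidates(trie, word, max_edits):
    """All lexicon words reachable from `word` within `max_edits` edits."""
    results = set()

    def dfs(node, i, edits, prefix):
        if edits > max_edits:
            return
        if i == len(word) and "$" in node:
            results.add(prefix)
        for ch, child in node.items():
            if ch == "$":
                continue
            if i < len(word):
                # match (cost 0) or substitution (cost 1)
                dfs(child, i + 1, edits + (ch != word[i]), prefix + ch)
            # insertion into the candidate
            dfs(child, i, edits + 1, prefix + ch)
        if i < len(word):
            # deletion: skip a character of the noisy word
            dfs(node, i + 1, edits + 1, prefix)

    dfs(trie, 0, 0, "")
    return results

trie = build_trie(["history", "historical", "mystery"])
print(candidates(trie, "histcry", 1))  # {'history'}
```

Bounding the search by edit budget is what keeps the traversal tractable: branches of the trie whose prefixes already exceed the budget are never explored.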
Multi-Input Attention for Unsupervised OCR Correction
2018
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding ...
A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. ...
Acknowledgements This work was supported by NIH grant 2R01DC009834-06A1, the Andrew W. ...
doi:10.18653/v1/p18-1220
dblp:conf/acl/SmithD18
fatcat:uvapnkodyjbxhonp3k57u6auwy
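The multi-input decoding idea, seeking consensus among several noisy witnesses of the same text, can be illustrated with a far simpler stand-in than the paper's multi-input attention: a per-position majority vote over transcriptions assumed to be already token-aligned. The example strings are invented.

```python
from collections import Counter

def consensus(witnesses):
    """Per-position majority vote over aligned, whitespace-tokenised witnesses.
    (The paper instead averages attention inside a seq2seq decoder, which
    needs no alignment assumption.)"""
    tokenized = [w.split() for w in witnesses]
    assert len({len(t) for t in tokenized}) == 1, "sketch assumes aligned inputs"
    out = []
    for position in zip(*tokenized):
        out.append(Counter(position).most_common(1)[0][0])
    return " ".join(out)

noisy = [
    "the qnick brown fox",
    "the quick hrown fox",
    "the quick brown f0x",
]
print(consensus(noisy))  # the quick brown fox
```

Each witness makes a different error, so every position still has a correct majority; that redundancy across repeated texts is exactly what the approach exploits.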
Constructing a Recipe Web from Historical Newspapers
[chapter]
2018
Lecture Notes in Computer Science
We provide OCR quality indicators and their impact on the extraction process. We enrich the recipes with links to information on the ingredients. ...
Our research shows how natural language processing, machine learning, and semantic web can be combined to construct a rich dataset from heterogeneous newspapers for the historical analysis of food culture ...
We thank Jesse de Does for the OCR quality measure, Marten Postma and Emiel van Miltenburg for querying Open Dutch WordNet, and Richard Zijdeman for fruitful discussions on the dataset concept. ...
doi:10.1007/978-3-030-00671-6_13
fatcat:u2ttyp37uzbr5ltxbgebacczhq
Efficient and effective OCR engine training
2019
International Journal on Document Analysis and Recognition
Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects. ...
All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine's training processes themselves, text recognition, and quantitative evaluation of ...
In [6], for instance, it is reported how recognition rates for non-mainstream documents (Polish historical texts) can be significantly improved by training the OCR engine used. ...
doi:10.1007/s10032-019-00347-8
fatcat:awnt5v62yrfyrcbflvgmiggeri
A Dataset for Toponym Resolution in Nineteenth-Century English Newspapers
2022
Journal of Open Humanities Data
We present a new dataset for the task of toponym resolution in digitized historical newspapers in English. ...
The dataset consists of 3,364 annotated toponyms, of which 2,784 have been provided with a link to Wikipedia. ...
with data access, and to the members of Living with Machines who helped with the annotations. ...
doi:10.5334/johd.56
fatcat:2wenxxqnvvfdtd67y645vipg4e