Information Retrieval Based on OCR Errors in Scanned Documents
2003
2003 Conference on Computer Vision and Pattern Recognition Workshop
The proposed algorithm consists of two basic steps. In the first step, we apply editing operations on OCR words that generate a collection of error-grams and correction rules. ...
An important proportion of documents are document images, i.e. scanned documents. For their retrieval, it is important to recognize their contents. ...
When the data is noisy or corrupted, as is the case with OCR text, exact string matching becomes inappropriate and another measure is needed to facilitate information retrieval on collections of OCR text. ...
doi:10.1109/cvprw.2003.10020
dblp:conf/cvpr/FataichaCNS03
fatcat:fmzhvw2bz5aepipgucu3xa5nsy
Improved string matching under noisy channel conditions
2001
Proceedings of the tenth international conference on Information and knowledge management - CIKM'01
This paper describes an enhanced string-matching algorithm for degraded text that improves recall, while keeping precision at acceptable levels. ...
The algorithm is more general than most approximate matching algorithms and allows string-to-string edits with arbitrary costs. ...
ACKNOWLEDGEMENTS The authors would like to thank John Platt, Rado Nickolov, and an anonymous reviewer for their suggestions on earlier drafts, and Henry Burgess and Stephen Robertson for helpful discussions ...
doi:10.1145/502585.502646
dblp:conf/cikm/Collins-ThompsonSD01
fatcat:fk62h25shrcanbofkfxa4pbgi4
Improved string matching under noisy channel conditions
2001
Proceedings of the tenth international conference on Information and knowledge management - CIKM'01
This paper describes an enhanced string-matching algorithm for degraded text that improves recall, while keeping precision at acceptable levels. ...
The algorithm is more general than most approximate matching algorithms and allows string-to-string edits with arbitrary costs. ...
ACKNOWLEDGEMENTS The authors would like to thank John Platt, Rado Nickolov, and an anonymous reviewer for their suggestions on earlier drafts, and Henry Burgess and Stephen Robertson for helpful discussions ...
doi:10.1145/502645.502646
fatcat:5zgt23nji5fw5eyw2cej3klnum
iOCR: Informed Optical Character Recognition for Election Ballot Tallies
[article]
2022
arXiv
pre-print
The purpose of this study is to explore the performance of Informed OCR or iOCR. iOCR was developed with a spell correction algorithm to fix errors introduced by conventional OCR for vote tabulation. ...
The results found that the iOCR system outperforms conventional OCR techniques. ...
Post-processing uses a string similarity algorithm like Levenshtein distance and a dictionary derived from a large corpus of text to detect and correct misspellings in the OCR output text [8]. ...
arXiv:2208.00865v1
fatcat:2fz4pcvzzffotgsei6yx24qbni
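The post-processing idea mentioned in the snippet above (a string similarity measure such as Levenshtein distance combined with a dictionary) can be sketched as follows. This is a minimal illustration, not the iOCR implementation: the function names `levenshtein` and `correct`, the `max_distance` threshold, and the sample dictionary are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word: str, dictionary: list[str], max_distance: int = 2) -> str:
    """Return the closest dictionary entry, or the word itself if none is close enough.

    `max_distance` is a hypothetical cut-off chosen for this sketch.
    """
    if not dictionary or word in dictionary:
        return word
    best = min(dictionary, key=lambda entry: levenshtein(word, entry))
    return best if levenshtein(word, best) <= max_distance else word


# Hypothetical dictionary; a real one would be derived from a large corpus.
words = ["candidate", "ballot", "tally"]
print(correct("ba1lot", words))
```

A linear scan over the dictionary is adequate for a small vocabulary such as a ballot's candidate list; larger dictionaries would likely need an index or a trie-based search.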
Enhancing the Searchability of Page-Image PDF Documents Using an Aligned Hidden Layer from a Truth Text
2016
Proceedings of the 2016 ACM Symposium on Document Engineering - DocEng '16
In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine. ...
The alignment of the truth text with the image is guided by using OCR-provided page-image co-ordinates, for those glyphs that are correctly recognised, as a set of fixed location points between which other ...
ACKNOWLEDGEMENTS We thank Clive Adams of the Institute for Mental Health, University of Nottingham, for making available to us PDF documents from the Cochrane Schizophrenia Group's collection, around which ...
doi:10.1145/2960811.2967157
dblp:conf/doceng/KnightB16
fatcat:eer6r2j4a5fx7k2gpbgnkkedry
A Fast Alignment Scheme for Automatic OCR Evaluation of Books
2011
2011 International Conference on Document Analysis and Recognition
In the final stage, an edit distance based alignment algorithm is used to align these short chunks of texts to generate the final alignment. ...
This process is recursively applied to each text segment in between matching unique words until the text segments become very small. ...
The recursive text alignment scheme (RETAS) is proposed for evaluating the OCR accuracy of books. The basic idea is to scale the whole string alignment problem into manageable size problems. ...
doi:10.1109/icdar.2011.157
dblp:conf/icdar/YalnizM11
fatcat:temqkfdqwbhzteygcckmobmzz4
Automatic Processing of Document Annotations
1998
Proceedings of the British Machine Vision Conference 1998
The author simply passes the annotated documents through a sheetfeed scanner and then brings up the electronic document in a text editor. ...
This procedure might have interesting applications in document database retrieval, since it allows an electronic document to be indexed by a printed version of itself. ...
The ASCII and printed text are represented by strings of word lengths (a), then matched using an approximate string matching algorithm (b). ...
doi:10.5244/c.12.44
dblp:conf/bmvc/StevensGD98
fatcat:hkd5trwywvhvvoamc4exwb2d7m
An approximate multi-word matching algorithm for robust document retrieval
2006
Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06
|S|) denote the number of nodes in the trees representing the word set (resp. the text), and |Q| denotes the number of the states of the model used for string similarity. ...
Given a set of words, such as a dictionary, this paper proposes an efficient dynamic programming (DP) algorithm to find the occurrences of each word in a text. ...
The main contribution of this paper is to develop an efficient algorithm for multi-word detection to apply the text error model to IR from OCR generated text. Now let us define the problem. ...
doi:10.1145/1183614.1183623
dblp:conf/cikm/Takasu06
fatcat:bmqx3v4pt5dape76sxqsmequ24
A Simple and Practical Approach to Improve Misspellings in OCR Text
[article]
2021
arXiv
pre-print
The focus of our paper is the identification and correction of non-word errors in OCR text. ...
In this paper, we develop an unsupervised method that can handle both errors. The method we develop leads to a sizable improvement in the correction rates. ...
word compound-splitter tools to suggest candidate corrections for errors in an input OCR text. ...
arXiv:2106.12030v1
fatcat:gbafkl5ilrcq7hzloo6fwmryva
Pattern matching techniques for correcting low-confidence OCR words in a known context
2000
Document Recognition and Retrieval VIII
The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification ...
A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals. ...
However, future research may be envisioned toward the design of an OCR phonetic word-matching algorithm. ...
doi:10.1117/12.410842
dblp:conf/drr/FordHLT01
fatcat:i33jiwk4x5bs7fcc5hvlvsnhem
Mobile-Based Word Matching Detection using Intelligent Predictive Algorithm
2019
International Journal of Interactive Mobile Technologies
This article aims to apply IP algorithm together with Optical Character Recognition (OCR) tool for mobile-based word matching detection. ...
Word matching is a string searching technique for information retrieval in Natural Language Processing (NLP). ...
In data preparation, Optical Character Recognition (OCR) is an electronic tool that can be used to prepare an electronic document for data analysis. ...
doi:10.3991/ijim.v13i09.10848
fatcat:z4l74a3dmndkfbckm4fd4u354a
Style-independent document labeling: design and performance evaluation
2003
Document Recognition and Retrieval XI
In this paper, we first describe a system (called ZoneMatch) for automated generation of crucial geometric and non-geometric features of important bibliographical fields based on string-matching and clustering ...
Experimental results show that the labeling performance of the rule-based algorithm is significantly improved when the generated features are used. ...
THE ALGORITHMS The core algorithms of the ZoneMatch module consist of two parts: a string-matching algorithm and a clustering algorithm for feature generation. ...
doi:10.1117/12.532039
dblp:conf/drr/MaoKT04
fatcat:lupc7il4c5dbdf5bikvj6gsxdi
Automatic Fax Routing
[chapter]
2004
Lecture Notes in Computer Science
For all these "noisy" words, a set of features is computed which include internal text features, location features, and relationship features. ...
The parameters of the word relevance function are learned from training data using the AdaBoost learning algorithm. Words are then compared to the database of recipients to find likely matches. ...
While there is an extremely efficient algorithm for finding the exact match between a word and a large database, a solution for the task of finding the best string under the string edit distance is not ...
doi:10.1007/978-3-540-28640-0_46
fatcat:fqgsv64lmjgd7hko5x7ao5l4mi
Linking multimedia presentations with their symbolic source documents
2003
Proceedings of the eleventh ACM international conference on Multimedia - MULTIMEDIA '03
An algorithm is presented that automatically matches images of presentation slides to the symbolic source file (e.g., PowerPoint™ or Acrobat™) from which they were generated. ...
The matching algorithm extracts features from the image data, including OCR output, edges, projection profiles, and layout and determines the symbolic file that contains the most similar collection of ...
ACKNOWLEDGEMENTS We would like to thank Jamey Graham for his support and advice, and Daniel Van Olst, Kim McCall, and Bradley Rhodes, for helping to collect test images. ...
doi:10.1145/957013.957122
dblp:conf/mm/ErolHL03
fatcat:64grtfopmfgb5bpogcvnijwoiy
Approximate String Matching for Detecting Keywords in Scanned Business Documents
2019
Recent Advances in Slavonic Natural Languages Processing
This paper presents an approximate string matching method using weighted edit distance for searching keywords in OCR-ed business documents. ...
Optical Character Recognition (OCR) is achieving higher accuracy. However, to decrease error rate down to zero is still a human desire. ...
Acknowledgements This work has been partly supported by Konica Minolta Business Solution Czech within the OCR Miner project. ...
dblp:conf/raslan/Ha19
fatcat:vfxrok7e2ncv5amlpg4x6dgyku