Information Retrieval Based on OCR Errors in Scanned Documents

Y. Fataicha, M. Cheriet, J. Y. Nie, C. Y. Suen
2003 2003 Conference on Computer Vision and Pattern Recognition Workshop  
The proposed algorithm consists of two basic steps. In the first step, we apply editing operations on OCR words that generate a collection of error-grams and correction rules.  ...  An important proportion of documents are document images, i.e. scanned documents. For their retrieval, it is important to recognize their contents.  ...  When the data is noisy or corrupted, as the case with OCR text, exact string matching becomes inappropriate and another measure is needed to facilitate information retrieval on collections of OCR text.  ... 
doi:10.1109/cvprw.2003.10020 dblp:conf/cvpr/FataichaCNS03 fatcat:fmzhvw2bz5aepipgucu3xa5nsy

Improved string matching under noisy channel conditions

Kevyn Collins-Thompson, Charles Schweizer, Susan Dumais
2001 Proceedings of the tenth international conference on Information and knowledge management - CIKM'01  
This paper describes an enhanced string-matching algorithm for degraded text that improves recall, while keeping precision at acceptable levels.  ...  The algorithm is more general than most approximate matching algorithms and allows string-to-string edits with arbitrary costs.  ...  ACKNOWLEDGEMENTS The authors would like to thank John Platt, Rado Nickolov, and an anonymous reviewer for their suggestions on earlier drafts, and Henry Burgess and Stephen Robertson for helpful discussions  ... 
doi:10.1145/502585.502646 dblp:conf/cikm/Collins-ThompsonSD01 fatcat:fk62h25shrcanbofkfxa4pbgi4

iOCR: Informed Optical Character Recognition for Election Ballot Tallies [article]

Kenneth U. Oyibo, Jean D. Louis, Juan E. Gilbert
2022 arXiv   pre-print
The purpose of this study is to explore the performance of Informed OCR or iOCR. iOCR was developed with a spell correction algorithm to fix errors introduced by conventional OCR for vote tabulation.  ...  The results found that the iOCR system outperforms conventional OCR techniques.  ...  Post-processing uses a string similarity algorithm like Levenshtein distance and a dictionary derived from a large corpus of text to detect and correct misspellings in the OCR output text [8] .  ... 
arXiv:2208.00865v1 fatcat:2fz4pcvzzffotgsei6yx24qbni

Enhancing the Searchability of Page-Image PDF Documents Using an Aligned Hidden Layer from a Truth Text

Ian A. Knight, David F. Brailsford
2016 Proceedings of the 2016 ACM Symposium on Document Engineering - DocEng '16  
In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine.  ...  The alignment of the truth text with the image is guided by using OCR-provided page-image co-ordinates, for those glyphs that are correctly recognised, as a set of fixed location points between which other  ...  ACKNOWLEDGEMENTS We thank Clive Adams of the Institute for Mental Health, University of Nottingham, for making available to us PDF documents from the Cochrane Schizophrenia Group's collection, around which  ... 
doi:10.1145/2960811.2967157 dblp:conf/doceng/KnightB16 fatcat:eer6r2j4a5fx7k2gpbgnkkedry

A Fast Alignment Scheme for Automatic OCR Evaluation of Books

Ismet Zeki Yalniz, R. Manmatha
2011 2011 International Conference on Document Analysis and Recognition  
In the final stage, an edit distance based alignment algorithm is used to align these short chunks of texts to generate the final alignment.  ...  This process is recursively applied to each text segment in between matching unique words until the text segments become very small.  ...  The recursive text alignment scheme (RETAS) is proposed for evaluating the OCR accuracy of books. The basic idea is to scale the whole string alignment problem into manageable size problems.  ... 
doi:10.1109/icdar.2011.157 dblp:conf/icdar/YalnizM11 fatcat:temqkfdqwbhzteygcckmobmzz4

Automatic Processing of Document Annotations

J. Stevens, A. Gee, C. Dance
1998 Proceedings of the British Machine Vision Conference 1998  
The author simply passes the annotated documents through a sheetfeed scanner and then brings up the electronic document in a text editor.  ...  This procedure might have interesting applications in document database retrieval, since it allows an electronic document to be indexed by a printed version of itself.  ...  The ASCII and printed text are represented by strings of word lengths (a), then matched using an approximate string matching algorithm (b).  ... 
doi:10.5244/c.12.44 dblp:conf/bmvc/StevensGD98 fatcat:hkd5trwywvhvvoamc4exwb2d7m

An approximate multi-word matching algorithm for robust document retrieval

Atsuhiro Takasu
2006 Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06  
|S|) denote the number of nodes in the trees representing the word set (resp. the text), and |Q| denotes the number of the states of the model used for string similarity.  ...  Given a set of words, such as a dictionary, this paper proposes an efficient dynamic programming (DP) algorithm to find the occurrences of each word in a text.  ...  The main contribution of this paper is to develop an efficient algorithm for multi-word detection to apply the text error model to IR from OCR generated text. Now let us define the problem.  ... 
doi:10.1145/1183614.1183623 dblp:conf/cikm/Takasu06 fatcat:bmqx3v4pt5dape76sxqsmequ24

A Simple and Practical Approach to Improve Misspellings in OCR Text [article]

Junxia Lin
2021 arXiv   pre-print
The focus of our paper is the identification and correction of non-word errors in OCR text.  ...  In this paper, we develop an unsupervised method that can handle both errors. The method we develop leads to a sizable improvement in the correction rates.  ...  word compound-splitter tools to suggest candidate corrections for errors in an input OCR text.  ... 
arXiv:2106.12030v1 fatcat:gbafkl5ilrcq7hzloo6fwmryva

Pattern matching techniques for correcting low-confidence OCR words in a known context

Glenn Ford, Susan E. Hauser, Daniel X. Le, George R. Thoma, Paul B. Kantor, Daniel P. Lopresti, Jiangying Zhou
2000 Document Recognition and Retrieval VIII  
The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification  ...  A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals.  ...  However, future research may be envisioned toward the design of an OCR phonetic word-matching algorithm.  ... 
doi:10.1117/12.410842 dblp:conf/drr/FordHLT01 fatcat:i33jiwk4x5bs7fcc5hvlvsnhem

Mobile-Based Word Matching Detection using Intelligent Predictive Algorithm

Hamidah Jantan, Nurul Aisyiah Baharudin
2019 International Journal of Interactive Mobile Technologies  
This article aims to apply IP algorithm together with Optical Character Recognition (OCR) tool for mobile-based word matching detection.  ...  Word matching is a string searching technique for information retrieval in Natural Language Processing (NLP).  ...  In data preparation, Optical Character Recognition (OCR) is an electronic tool that can be used to prepare an electronic document for data analysis.  ... 
doi:10.3991/ijim.v13i09.10848 fatcat:z4l74a3dmndkfbckm4fd4u354a

Style-independent document labeling: design and performance evaluation

Song Mao, Jong Woo Kim, George R. Thoma, Elisa H. Barney Smith, Jianying Hu, James Allan
2003 Document Recognition and Retrieval XI  
In this paper, we first describe a system (called ZoneMatch) for automated generation of crucial geometric and non-geometric features of important bibliographical fields based on string-matching and clustering  ...  Experimental results show that the labeling performance of the rule-based algorithm is significantly improved when the generated features are used.  ...  THE ALGORITHMS The core algorithms of the ZoneMatch module consist of two parts: a string-matching algorithm and a clustering algorithm for feature generation.  ... 
doi:10.1117/12.532039 dblp:conf/drr/MaoKT04 fatcat:lupc7il4c5dbdf5bikvj6gsxdi

Automatic Fax Routing [chapter]

Paul Viola, James Rinker, Martin Law
2004 Lecture Notes in Computer Science  
For all these "noisy" words, a set of features is computed which include internal text features, location features, and relationship features.  ...  The parameters of the word relevance function are learned from training data using the AdaBoost learning algorithm. Words are then compared to the database of recipients to find likely matches.  ...  While there is an extremely efficient algorithm for finding the exact match between a word and a large database, a solution for the task of finding the best string under the string edit distance is not  ... 
doi:10.1007/978-3-540-28640-0_46 fatcat:fqgsv64lmjgd7hko5x7ao5l4mi

Linking multimedia presentations with their symbolic source documents

Berna Erol, Jonathan J. Hull, Dar-Shyang Lee
2003 Proceedings of the eleventh ACM international conference on Multimedia - MULTIMEDIA '03  
An algorithm is presented that automatically matches images of presentation slides to the symbolic source file (e.g., PowerPoint TM or Acrobat TM ) from which they were generated.  ...  The matching algorithm extracts features from the image data, including OCR output, edges, projection profiles, and layout and determines the symbolic file that contains the most similar collection of  ...  ACKNOWLEDGEMENTS We would like to thank Jamey Graham for his support and advice, and Daniel Van Olst, Kim McCall, and Bradley Rhodes, for helping to collect test images.  ... 
doi:10.1145/957013.957122 dblp:conf/mm/ErolHL03 fatcat:64grtfopmfgb5bpogcvnijwoiy

Approximate String Matching for Detecting Keywords in Scanned Business Documents

Hien Thi Ha
2019 Recent Advances in Slavonic Natural Languages Processing  
This paper presents an approximate string matching method using weighted edit distance for searching keywords in OCR-ed business documents.  ...  Optical Character Recognition (OCR) is achieving higher accuracy. However, to decrease error rate down to zero is still a human desire.  ...  Acknowledgements This work has been partly supported by Konica Minolta Business Solution Czech within the OCR Miner project.  ... 
dblp:conf/raslan/Ha19 fatcat:vfxrok7e2ncv5amlpg4x6dgyku
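
Several of the results above rely on approximate string matching with a weighted edit distance to find keywords in noisy OCR output. The following is a minimal sketch of that general idea, not the method of any listed paper. The confusion table `CONFUSION_COST`, its cost values, the function names, and the `max_cost` threshold are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical substitution costs for visually confusable characters.
# These values are illustrative assumptions, not taken from any cited paper.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("l", "1"): 0.2, ("1", "l"): 0.2,
    ("O", "0"): 0.2, ("0", "O"): 0.2,
    ("I", "l"): 0.2, ("l", "I"): 0.2,
}


def substitution_cost(a: str, b: str) -> float:
    """Return 0 for identical characters, a reduced cost for known
    OCR confusions, and 1 for any other substitution."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), 1.0)


def weighted_edit_distance(source: str, target: str,
                           indel_cost: float = 1.0) -> float:
    """Dynamic-programming edit distance with weighted substitutions.
    Only two rows of the DP table are kept in memory at a time."""
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [i * indel_cost]
        for j, t in enumerate(target, 1):
            curr.append(min(
                prev[j] + indel_cost,                  # delete s
                curr[j - 1] + indel_cost,              # insert t
                prev[j - 1] + substitution_cost(s, t)  # substitute or match
            ))
        prev = curr
    return prev[-1]


def find_keyword(keyword: str, ocr_words: List[str],
                 max_cost: float = 1.0) -> List[Tuple[str, float]]:
    """Return OCR words whose weighted distance to the keyword is at
    most max_cost, cheapest first."""
    hits = []
    for word in ocr_words:
        cost = weighted_edit_distance(keyword, word)
        if cost <= max_cost:
            hits.append((word, cost))
    return sorted(hits, key=lambda hit: hit[1])
```

With these assumed costs, `find_keyword("Invoice", ["lnvoice", "Involce", "Total"])` ranks "lnvoice" ahead of "Involce", because the I/l confusion is cheaper than an arbitrary substitution. In practice the cost table would be estimated from aligned OCR and ground-truth text rather than written by hand.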