Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining [article]

Guowei Xu, Wenbiao Ding, Weiping Fu, Zhongqin Wu, Zitao Liu
2021 arXiv   pre-print
We believe that this work can greatly promote the application of NLP models in actual scenarios, although the algorithm we use is simple and straightforward.  ...  In order to improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. However, in most cases there are only labelled clean texts.  ...  [9] created noisy data using random character swaps, substitutions, insertions and deletions and improved model performance in machine translation under permuted inputs.  ... 
arXiv:2107.07113v1 fatcat:6x6gunu3azelvkuvqzafncouje

IDENTIFICATION AND SEGMENTATION OF TOUCHING BRAHMI CHARACTERS FROM DEGRADED DIGITAL ESTAMPAGE IMAGES USING ENSEMBLE CLASSIFIER

Aniket S. Nagane, Shankar M. Mali
2021 Indian Journal of Computer Science and Engineering  
The algorithm uses ensemble classification technique to identify the touching characters using 9 unique features.  ...  The results of proposed algorithm are significant to identify and correctly segment the touching Brahmi script characters from degraded digital estampage images.  ...  Patil for his academic support and valuable inputs in the research.  ... 
doi:10.21817/indjcse/2021/v12i6/211206110 fatcat:itcja33e5jf4zndj46y4cbqwze

Improved string matching under noisy channel conditions

Kevyn Collins-Thompson, Charles Schweizer, Susan Dumais
2001 Proceedings of the tenth international conference on Information and knowledge management - CIKM'01  
This paper describes an enhanced string-matching algorithm for degraded text that improves recall, while keeping precision at acceptable levels.  ...  We develop a method for evaluating our technique and use it to examine the relative effectiveness of each sub-component of the algorithm.  ...  ACKNOWLEDGEMENTS The authors would like to thank John Platt, Rado Nickolov, and an anonymous reviewer for their suggestions on earlier drafts, and Henry Burgess and Stephen Robertson for helpful discussions  ... 
doi:10.1145/502585.502646 dblp:conf/cikm/Collins-ThompsonSD01 fatcat:fk62h25shrcanbofkfxa4pbgi4

2D Morphable Feature Space for Handwritten Character Recognition

N. Shobha Rani, Vasudev T, Chandrajith M, Manohar N
2020 Procedia Computer Science  
A feature vector is generated from the normalized Gabor features are extracted from pincushion and distance transform models of a character image and classified using Ada boost classifier with a recognition  ...  In the proposed method, classification is performed using Ada boost classifier [21]. Ada boost classifier is defined of simple weak classifier and its boosting counter part.  ... 
doi:10.1016/j.procs.2020.03.280 fatcat:y3h7ric7z5bdnmtgjmymsqbthq

Towards a Robust OCR System for Indic Scripts

Praveen Krishnan, Naveen Sankaran, Ajeet Kumar Singh, C.V. Jawahar
2014 2014 11th IAPR International Workshop on Document Analysis Systems  
The current Optical Character Recognition (OCR) systems for Indic scripts are not robust enough for recognizing arbitrary collection of printed documents.  ...  Our system is designed to aid the continuous learning while being usable i.e., we capture the user inputs (say example images) for further improving the OCRs.  ...  Our web based system is designed to continuously improve the performance over sessions. We use the data provided by the user for improving the performance of the recognizer.  ... 
doi:10.1109/das.2014.74 dblp:conf/das/KrishnanSSJ14 fatcat:leydyfvr2vdyrpz4gpei4jiug4

MAPS

Deepak Kumar, M. N. Anil Prasad, A. G. Ramakrishnan
2012 Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing - ICVGIP '12  
The segmented text image is recognized using the trial version of Omnipage OCR. We have tested our method on ICDAR 2003 and ICDAR 2011 datasets.  ...  This approach, which is unique and distinct from the existing methods, results in improved segmentation.  ...  CONCLUSION AND FUTURE WORK We have proposed an algorithm for effective segmentation of words from different word image datasets.  ... 
doi:10.1145/2425333.2425348 dblp:conf/icvgip/KumarPR12 fatcat:d6dk45jazvflve553ambyvnhtq

A post-processing scheme for malayalam using statistical sub-character language models

Karthika Mohan, C. V. Jawahar
2010 Proceedings of the 8th IAPR International Workshop on Document Analysis Systems - DAS '10  
In this paper, we propose a post-processing scheme which uses statistical language models at the sub-character level to boost word level recognition results.  ...  We use a multi-stage graph representation and formulate the recognition task as an optimization problem.  ...  Performance on real data We had seen an improvement in accuracy at word level and character level when the SSLM was used.  ... 
doi:10.1145/1815330.1815394 dblp:conf/das/MohanJ10 fatcat:4mmiflgiazai5b52nw4aahak2a

Bootstrapped OCR error detection for a less-resourced language variant

Adrien Barbaresi
2016 Conference on Natural Language Processing  
As there are OCR errors throughout the corpus but no clean reference for this variant of German, automatic OCR correction implies to overcome data sparseness and nonstandard spelling, including compounds  ...  string search performs best for error correction.  ...  Acknowledgments This work has been supported by a CLARIN-D special interest group dedicated to late modern and contemporary digital history (FAG-9).  ... 
dblp:conf/konvens/Barbaresi16 fatcat:laj3dxcl3vfdddm3xfytoaodqy

Segmentation-Free Speech Text Recognition for Comic Books

Christophe Rigaud, Jean-Christophe Burie, Jean-Marc Ogier
2017 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)  
We compare the performances of pre-trained OCR and segmentation-free approach for speech text of comic books written in Latin script.  ...  We demonstrate that few good quality pre-trained OCR output samples, associated with other unlabeled data with the same writing style, can feed a segmentation-free OCR and improve text recognition.  ...  We are grateful to all authors and publishers of comics images from eBDtheque dataset for allowing us to use and share their works.  ... 
doi:10.1109/icdar.2017.288 dblp:conf/icdar/RigaudBO17 fatcat:s636kurrdzaajkpe6ixlibm7ma

A Survey on Deep learning based Document Image Enhancement [article]

Zahra Anvari, Vassilis Athitsos
2022 arXiv   pre-print
Document image enhancement plays a crucial role as a pre-processing step in many automated document analysis and recognition tasks such as character recognition.  ...  These document images could be degraded or damaged due to various reasons including poor lighting conditions, shadow, distortions like noise and blur, aging, ink stain, bleed-through, watermark, stamp,  ...  Through super-resolving these document images characters become more legible and it leads to OCR performance boost.  ... 
arXiv:2112.02719v4 fatcat:sznkn6vkr5fabag2pmaff6bhky

Automatic Assessment of OCR Quality in Historical Documents

Anshul Gupta, Ricardo Gutierrez-Osuna, Matthew Christy, Boris Capitanu, Loretta Auvil, Liz Grumbach, Richard Furuta, Laura Mandell
2015 PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE  
Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and improve OCR transcriptions  ...  Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools.  ...  Improving OCR transcriptions In a final step, we tested whether our algorithm could be used to improve the overall OCR performance.  ... 
doi:10.1609/aaai.v29i1.9487 fatcat:packdzbakvcuxjjt36qjertc7m

An Efficient Language-Independent Multi-Font OCR for Arabic Script [article]

Hussein Osman, Karim Zaghw, Mostafa Hazem, Seifeldin Elsehely
2020 arXiv   pre-print
This paper also proposes an improved font-independent character segmentation algorithm that outperforms the state-of-the-art segmentation algorithms.  ...  Lastly, the paper proposes a neural network model for the character recognition task.  ...  This paper proposes a complete language-independent Arabic OCR pipeline with an improved character segmentation algorithm based on word-level features and a bio-inspired character recognition model based  ... 
arXiv:2009.09115v1 fatcat:y4mb4rr2wnaphf4bhvkratyqve

A new PDE-based approach for singularity-preserving regularization: application to degraded characters restoration

Fadoua Drira, Frank LeBourgeois, Hubert Emptoz
2011 International Journal on Document Analysis and Recognition  
As a solution, we propose to tackle the problem of degraded text characters with PDE (Partial Differential Equation)-based approaches.  ...  Degradations harm the legibility of the digitized documents and limit their processings.  ...  The total improvement is less than 1% with 6.18% of improved characters and 5.22% of degraded characters (Tab.9).  ... 
doi:10.1007/s10032-011-0165-5 fatcat:lhlnpmi3cbfuvfluq3wiwqs34q

Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting

Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe
2018 2018 13th IAPR International Workshop on Document Analysis Systems (DAS)  
The OCR text generated by these models then gets voted to determine the final output by taking the recognized characters, their alternatives, and the confidence values assigned to each character into consideration  ...  After allocating the available ground truth in different subsets several training processes are performed, each resulting in a specific OCR model.  ...  Our goal is to improve the OCR accuracy with a given amount of GT by training different models and use voting to combine them.  ... 
doi:10.1109/das.2018.30 dblp:conf/das/ReulSWP18 fatcat:roujfrpycrdp3excggxibmz4ba