
Low-resource OCR error detection and correction in French Clinical Texts

Eva D'hondt, Cyril Grouin, Brigitte Grau
2016 Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis  
In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology  ...  While traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training material, these  ...  Conclusion In this paper we presented a method for the detection and correction of OCR errors in French patient files.  ... 
doi:10.18653/v1/w16-6108 dblp:conf/acl-louhi/DhondtGG16 fatcat:rnx665burnginkqeudoehozeh4

An Efficient Method for Automatic Recognizing Text Fields on Identification Card

Nguyen Thi Thanh Tan, Nguyen Ha Nam
2020 VNU Journal of Science Mathematics - Physics  
In this article, we propose an efficient method to recognize information fields for identification in ID card using Convolutional Neural Network (CNN) and Long Short-Term Memory networks (LSTM).  ...  The problem of optical character and handwriting recognition has long interested researchers. It has obtained great results in theory as well as practical applications.  ...  Acknowledgments This work has been sponsored and funded by Ho Chi Minh City University of Food Industry under Contract No. 149/ HD-DCT.  ... 
doi:10.25073/2588-1124/vnumap.4456 fatcat:scvt5ug4sjepvlipmzzd2pdmcq

A BLSTM Network for Printed Bengali OCR System with High Accuracy [article]

Debabrata Paul, Bidyut Baran Chaudhuri
2019 arXiv   pre-print
This paper presents a printed Bengali and English text OCR system developed by us using a single hidden BLSTM-CTC architecture having 128 units.  ...  Here, we did not use any peephole connection and dropout in the BLSTM, which helped us in getting better accuracy. This architecture was trained by 47,720 text lines that include English words also.  ...  We also plan to include Assamese script in our system. Moreover, we intend to enhance our system to recognize the Bengali text printed in obsolete Lino-Monotype fonts.  ... 
arXiv:1908.08674v1 fatcat:oi3tkwqndbgmzewpjufvu42rri

Vartani Spellcheck – Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance [article]

Aditya Pal, Abhijit Mustafi
2020 arXiv   pre-print
Automatic spelling error detection and context-sensitive error correction can be used to improve accuracy by post-processing the text generated by these OCR systems.  ...  We use a lookup dictionary and context-based named entity recognition (NER) for detection of possible spelling errors in the text.  ...  We would also like to thank the creators of publicly available Hindi datasets which were used extensively in our research.  ... 
arXiv:2012.07652v1 fatcat:u55lueliknbrhn3m4uvhxrw4qi

Auto-ML Deep Learning for Rashi Scripts OCR [article]

Shahar Mahpod, Yosi Keller
2020 arXiv   pre-print
In this work we propose an OCR scheme for manuscripts printed in Rashi font that is an ancient Hebrew font and corresponding dialect used in religious Jewish literature, for more than 600 years.  ...  In particular, we derive an AutoML scheme to optimize the CNN architecture, and a book-specific CNN training to improve the OCR accuracy.  ...  This is considered a Character classification error, and a correct Letter classification.  ... 
arXiv:1811.01290v2 fatcat:dxgzyoqwh5dwpk4ijpxigdwqz4

Open-source OCR engine integration with Greek dictionary

Alkiviadis Tsimpiris, Dimitris Varsamis, Charalampos Strouthopoulos, George Pavlidis
2021 Zenodo  
The training applied in the embedded LSTM deep learning model of Tesseract, before the integration of the new Greek dictionary.  ...  To achieve this goal, an open access dictionary was initially used which was enriched with words that exist in the Greek restaurant menus.  ...  and Innovation, under the call RESEARCH - CREATE - INNOVATE, project code: (T1EDK-02015).  ... 
doi:10.5281/zenodo.5887015 fatcat:woxrhvuwf5g6tfkfr3wsymnady

ICDAR 2019 Competition on Post-OCR Text Correction

Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
2019 Zenodo  
The present challenge consists of two tasks: 1) error detection and 2) error correction.  ...  Five teams submitted results, the error detection scores vary from 41 to 95% and the best error correction improvement is 44%.  ...  We are grateful to all libraries and institutions that allowed the use of their material for this competition.  ... 
doi:10.5281/zenodo.3459116 fatcat:vm533g5wizcmfakvxqpep7fxr4

Sequence-to-Label Script Identification for Multilingual OCR [article]

Yasuhisa Fujii, Karel Driesen, Jonathan Baccash, Ash Hurst, Ashok C. Popat
2017 arXiv   pre-print
Experiments on scanned books and photos containing 232 languages in 30 scripts show 16% reduction of script identification error rate compared to the baseline.  ...  Therefore we reframe line script identification as a sequence-to-label problem and solve it using two components, trained end-to-end: Encoder and Summarizer.  ...  ACKNOWLEDGMENT The authors would like to thank Thomas Deselaers, Reeve Ingle, Sergey Ioffe, Henry Rowley, and Ray Smith for comments on early versions of this paper.  ... 
arXiv:1708.04671v2 fatcat:pxibqflqyff63her263hroyvwq

End-to-End Optical Character Recognition for Bengali Handwritten Words [article]

Farisa Benta Safir, Abu Quwsar Ohi, M.F. Mridha, Muhammad Mostafa Monowar, Md. Abdul Hamid
2021 arXiv   pre-print
The proposed method achieves 0.091 character error rate and 0.273 word error rate performed using DenseNet121 model with GRU recurrent layer.  ...  Further, we experiment with two different recurrent neural networks (RNN) methods, LSTM and GRU.  ...  Error Rate (CER) indicates the number of erroneous predictions made by the OCR system.  ... 
arXiv:2105.04020v1 fatcat:h5dzzxbqn5bb5dmgf7a7rgfkuu
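
The snippet above reports a character error rate (CER) and a word error rate (WER) without showing how they are computed. The sketch below uses the common convention of dividing the Levenshtein edit distance by the length of the reference. This is an assumption, not necessarily the exact formula used in the paper. The function names `edit_distance`, `cer`, and `wer` are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this convention, a CER of 0.091 corresponds to roughly nine character edits per hundred reference characters. Both rates can exceed 1.0 when the hypothesis is much longer than the reference.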

ICDAR 2019 Competition on Post-OCR Text Correction

Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
2019 2019 International Conference on Document Analysis and Recognition (ICDAR)  
The present challenge consists of two tasks: 1) error detection and 2) error correction.  ...  Five teams submitted results, the error detection scores vary from 41 to 95% and the best error correction improvement is 44%.  ...  We are grateful to all libraries and institutions that allowed the use of their material for this competition.  ... 
doi:10.1109/icdar.2019.00255 dblp:conf/icdar/RigaudDCM19 fatcat:fzzvbgu3bfenjpjctzocgue2ai

Optical character recognition with neural networks and post-correction with finite state methods

Senka Drobac, Krister Lindén
2020 International Journal on Document Analysis and Recognition  
Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis.  ...  The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua).  ...  If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly  ... 
doi:10.1007/s10032-020-00359-9 fatcat:cjonawcrebec7n34iapdg7frqu

Reverse-engineering Bar Charts Using Neural Networks [article]

Fangfang Zhou, Yong Zhao, Wenjiang Chen, Yijing Tan, Yaqi Xu, Yi Chen, Chao Liu, Ying Zhao
2020 arXiv   pre-print
We further introduce an attention mechanism into the framework to achieve high accuracy and robustness. Synthetic and real-world datasets are used to evaluate the effectiveness of the method.  ...  We adopt a neural network-based object detection model to simultaneously localize and classify textual information. This approach improves the efficiency of textual information extraction.  ...  ACKNOWLEDGMENTS This work was supported in part by the National Natural Science and Technology Fundamental Resources Investigation Program of China (No. 2018FY10090002), the National Natural Science Foundation  ... 
arXiv:2009.02491v1 fatcat:g5tsoiqnqzhnfcv72kd7fdzesm

Improving OCR Accuracy on Early Printed Books using Deep Convolutional Networks [article]

Christoph Wick, Christian Reul, Frank Puppe
2018 arXiv   pre-print
While the standard model of line based OCR uses a single LSTM layer, we utilize a CNN- and Pooling-Layer combination in advance of an LSTM layer.  ...  This paper proposes a combination of a convolutional and a LSTM network to improve the accuracy of OCR on early printed books.  ...  However, this behaviour might not be desired since the method not only corrects OCR errors but also normalizes historical spellings.  ... 
arXiv:1802.10033v1 fatcat:terzx26rqjf3dg57663szuu5uy

Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR

Saman Idrees, Hossein Hassani
2021 Applied Sciences  
Tesseract LSTM is a popular Optical Character Recognition (OCR) engine that has been trained and used in various languages.  ...  We train Tesseract using an Arabic dataset, and then we use a considerably small amount of texts in Persian-Arabic to train the engine to recognize Sorani texts.  ...  We are also grateful to Lesley A T Gaj for her generous assistance in proofreading the manuscript.  ... 
doi:10.3390/app11209752 fatcat:ksog7uz7fngbvdevbfozmxnffq

Quranic Optical Text Recognition Using Deep Learning Models

Masnizah Mohd, Faizan Qamar, Idris Al-Sheikh, Ramzi Salah
2021 IEEE Access  
A better performance in word recognition rate (WRR) and character recognition rate (CRR) is achieved in the experiments. The LSTM and GRU are compared in the Arabic text recognition domain.  ...  In addition, a public database is built for research purposes in Arabic text recognition that contains the diacritics and the Uthmanic script, and is large enough to be used with the deep learning models  ...  Another comparison was made between two types of RNN (LSTM and GRU) to compare between LSTM and GRU in the OCR domain, where the previous works used LSTM in the OCR domain.  ... 
doi:10.1109/access.2021.3064019 fatcat:obe2pevoijdwpos3ejrnwo4afa