10 Hits in 8.3 sec

OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set [article]

Youssef Bassil, Mohammad Alwani
2012 arXiv   pre-print
Google Web 1T 5-gram data set.  ...  The cornerstone of this proposed approach is the use of Google Web 1T 5-gram data set as a dictionary of words to spell-check OCR text.  ...  Acknowledgment This research was funded by the Lebanese Association for Computational Sciences (LACSC), Beirut, Lebanon under the "Web-Scale OCR Research Project -WSORP2011".  ... 
arXiv:1204.0188v1 fatcat:rfudthyyk5hujiromjlru5hu5e

Statistical Learning for OCR Text Correction [article]

Jie Mei, Aminul Islam, Yajing Wu, Abidalrahman Moh'd, Evangelos E. Milios
2016 arXiv   pre-print
The evaluation results show that our model can correct 61.5% of the OCR-errors (considering the top 1 suggestion) and 71.5% of the OCR-errors (considering the top 3 suggestions), for cases where the theoretical  ...  for the characteristics of OCR errors.  ...  We use five-grams in Google Web 1T corpus for exact and relaxed context matching.  ... 
arXiv:1611.06950v1 fatcat:o732efklkrbkxgkmjwuf3rx6uu

Survey of Post-OCR Processing Approaches

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Antoine Doucet
2021 Zenodo  
OCR engines can perform well on modern text; unfortunately, their performance is significantly reduced on historical materials.  ...  Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones.  ...  Lastly, they choose the best alternative for each detected error relying on word 5-gram frequency. Soni et al. [147] concentrate on handling segmentation errors via Google 1T Web ngrams.  ... 
doi:10.5281/zenodo.4640070 fatcat:6jnyehazujadvejgls6vpnu6ta

Survey of Post-OCR Processing Approaches

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Antoine Doucet
2021 Zenodo  
OCR engines can perform well on modern text; unfortunately, their performance is significantly reduced on historical materials.  ...  Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones.  ...  Lastly, they choose the best alternative for each detected error relying on word 5-gram frequency. Soni et al. [147] concentrate on handling segmentation errors via Google 1T Web ngrams.  ... 
doi:10.5281/zenodo.4635569 fatcat:x5qoluap7rgyxakv5lm5qcysya

Exploring web scale language models for search query processing

Jian Huang, Jianfeng Gao, Jiangbo Miao, Xiaolong Li, Kuansan Wang, Fritz Behr, C. Lee Giles
2010 Proceedings of the 19th international conference on World wide web - WWW '10  
We apply these web scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing and long query segmentation.  ...  In this paper, we present an extensive study on this issue by examining the language model properties of search queries and the three text streams associated with each web document: the body, the title  ...  Our context-sensitive query speller is evaluated on this set of queries that need correction.  ... 
doi:10.1145/1772690.1772737 dblp:conf/www/HuangGMLWBG10 fatcat:i6gxpcm5tfenhn6k6p6ukwdbqy

Improving Open-Vocabulary Scene Text Recognition

Jacqueline L. Feild, Erik G. Learned-Miller
2013 2013 12th International Conference on Document Analysis and Recognition  
We also evaluate this full system on two standard data sets, ICDAR 2003 and ICDAR 2011, and show an increase in word recognition performance compared to the current state-of-the-art methods.  ...  We avoid this limitation by incorporating language information from a large web-based lexicon of around 13.5 million words.  ...  ACKNOWLEDGMENTS This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. S121000000211. It is also supported by NSF Grant IIS-0916555.  ... 
doi:10.1109/icdar.2013.125 dblp:conf/icdar/FeildL13 fatcat:il636romjrft5agelvmvssrf2m

Unsupervised Text Segmentation for Automated Error Reduction

Lenz Furrer
2014
The proposed approach is nearly knowledge-free, in that it does not rely on language-dependent, man-made resources.  ...  Bassil and Alwani (2012) perform corpus-based corrections with the Google Web 1T 5-Gram Data Set.  ... 
doi:10.5167/uzh-101471 fatcat:57izvxzcfbb7dncxlv2fkt43hu

Facial expression recognition in the wild : from individual to group [article]

Abhinav Dhall, The Australian National University
2018
Earlier methods were based on fiducial points. However, as fiducial points detection is an open problem for real-world images, HPN can be error-prone. A HPN method based on response [...]  ...  The database is constructed and labelled using a semi-automatic process based on closed caption subtitle based keyword search.  ...  SPI protocol represents the data on the web, where the chances are less that the subject in the train will ever appear in the test set.  ... 
doi:10.25911/5d4ea922db07c fatcat:g2yn7xzq5baxpfsuwdna574l2e

A coprocessor for fast searching in large databases: Associative Computing Engine [article]

Christophe Layer, Universität Ulm
2016
However, in order to retrieve a few relevant kilobytes from a large digital store, one moves up to hundreds of gigabytes of data between memory and processor over a bandwidth-restricted bus.  ...  In this work, the two most important features are the overall scalability of the design allowing the support of different bit widths and module depths along the data path, as well as the obtainment of  ...  Correction of the NLF using a LUT to approximate the arch shaped error curve after a first correction through the linear Λ-shaped error approximation.  ... 
doi:10.18725/oparu-891 fatcat:fddqtzupefdodgb32znkgqyk5e

A parallel workflow for online correlation and clique-finding : with applications to finance

Camilo Rostoker
2007
Finally, we embed our new algorithm within a data processing pipeline that performs high throughput correlation and clique-based clustering of thousands of variables from a high-frequency data stream.  ...  , real-life intra-day stock market data in order to determine clusters of stocks exhibiting highly correlated short-term trading patterns.  ...  ., t 60), there are 60 data points with Δt = 1, 30 data points with Δt = 2, 15 data points with Δt = 4, 5 data points with Δt = 12 and 1 data point with Δt = 60.  ... 
doi:10.14288/1.0051985 fatcat:y6det5taebbujf3easucrk7qpa