Retrieving poorly degraded OCR documents

Y. Fataicha, M. Cheriet, J. Y. Nie, C. Y. Suen
2005 International Journal on Document Analysis and Recognition  
The second step uses query terms and error-grams to generate additional query terms, identify appropriate matching terms, and determine the degree of relevance of retrieved document images to the user's  ...  The proposed approach has been trained on 979 document images to construct about 2,822 error-grams and tested on 100 scanned Web pages, Y.  ...  Error-grams and correction rules were first generated using the training set and then combined to extend query terms.  ... 
doi:10.1007/s10032-005-0147-6 fatcat:sh7rpswuqjfmdaptnbggyxtlxq

Explaining Data-Driven Document Classifications

David Martens, Foster Provost
2014 MIS Quarterly  
The main theoretical contribution of the work is the definition of a new sort of explanation as a minimal set of words (terms, more generally), such that removing all words within this set from the document  ...  The results show the explanations to be concise and document-specific, and to be capable of providing better understanding of the exact reasons for the classification decisions, of the workings of the  ...  We extend our gratitude to AdSafe Media and Josh Attenberg for many discussions into the problem of safe advertising.  ... 
doi:10.25300/misq/2014/38.1.04 fatcat:6felere3mfc2pa2zhzeu6cm74q

A survey of document image classification: problem statement, classifier architecture and performance evaluation

Nawei Chen, Dorothea Blostein
2006 International Journal on Document Analysis and Recognition  
Document image classification is an important step in Office  ...  Acknowledgements We gratefully acknowledge the financial support provided by the Xerox Foundation, and by NSERC, Canada's Natural Sciences and Engineering Research Council.  ...  Ittner et al. [28] Textual features from OCR results A fixed-length vector representing weights of index terms Rocchio's algorithm, a technique in text categorization Learn weights of index  ... 
doi:10.1007/s10032-006-0020-2 fatcat:2ssef27glvh7dik37emkr4zpd4

Document Retrieval: Expertise in Identifying Relevant Documents

Philip J. Smith
1990 IEEE Data Engineering Bulletin  
I would like to thank the authors for accepting my invitation to contribute to this special issue. Many of them have to make time from their busy schedules in order to meet our deadline.  ...  He then describes a simple term weight strategy for the analysis of local document structures.  ...  The research is supported by AFOSR, NSF and DEC. Acknowledgments The authors would like to acknowledge the contributions of Hong-Tai Chou to the design of the text search algorithm for ORION.  ... 
dblp:journals/debu/Smith90 fatcat:dlyur6m4wjdylanby4cy23jouu

Supervised Learning Methods for Bangla Web Document Categorization

Ashis Kumar Mandal, Rikta Sen
2014 International Journal of Artificial Intelligence & Applications  
Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents.  ...  For Bangla, empirical results support that all four methods produce satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy document feature vectors.  ...  SVM classifier results for five categories of BD corpus the number of training examples versus the accuracy in terms of average F-measure.  ... 
doi:10.5121/ijaia.2014.5508 fatcat:rewdaah5wzfy3i7usu7vt22xgi

A Review of Machine Learning Algorithms for Text-Documents Classification

Baharum Baharudin, Lam Hong Lee, Khairullah Khan
2010 Journal of Advances in Information Technology  
The aim of this paper is to highlight the important techniques and methodologies that are employed in text documents classification, while at the same time making awareness of some of the interesting challenges  ...  With the increasing availability of electronic documents and the rapid growth of the World Wide Web, the task of automatic categorization of documents became the key method for organizing the information  ...  However, the drawback of the decision rule method is the impossibility to assign a document to a category exclusively due to the rules from different rule sets is applicable to each other.  ... 
doi:10.4304/jait.1.1.4-20 fatcat:nx23oqf3gbgiha45s2enn2hqqq

Analysis and Interpretation of Graphical Documents [chapter]

Bart Lamiroy, Jean-Marc Ogier
2014 Handbook of Document Image Processing and Recognition  
Short introductory text This chapter is dedicated to the analysis and the interpretation of graphical documents, and as such, builds upon many of the topics covered in other parts of this handbook.  ...  It will therefore not focus on any of the technical issues related to graphical documents, such as low level filtering and binarization, primitive extraction and vectorization as developed in Chapters  ...  order to define the expectations of the user in terms of interpreted objects.  ... 
doi:10.1007/978-0-85729-859-1_19 fatcat:kcxhax4yijbkpk4a5wefxe7urq

Genre identification for office document search and browsing

Francine Chen, Andreas Girgensohn, Matthew Cooper, Yijuan Lu, Gerry Filby
2011 International Journal on Document Analysis and Recognition  
These results provide support for a topic-independent approach to identification of coarse office document genres.  ...  to improve the performance of genre identification.  ...  The results from training on A, tuning on B, and testing on C to produce binary genre classifications g 1 were combined with the results from training on B, tuning on A, and testing on C, to produce binary  ... 
doi:10.1007/s10032-011-0163-7 fatcat:ogby7vevq5h2lf5kwg55emlkb4

Representing Documents via Latent Keyphrase Inference

Jialu Liu, Xiang Ren, Jingbo Shang, Taylor Cassidy, Clare R. Voss, Jiawei Han
2016 Proceedings of the 25th International Conference on World Wide Web - WWW '16  
But these methods are not desirable when applied to vertical domains (e.g., literature, enterprise, etc.) due to low coverage of in-domain concepts in the general knowledge base and interference from out-of-domain  ...  Being aware of this, researchers have proposed concept-based models that rely on a human-curated knowledge base to incorporate other related concepts in the document representation.  ...  to Knowledge (BD2K) initiative (, and MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC.  ... 
doi:10.1145/2872427.2883088 pmid:28229132 pmcid:PMC5318165 dblp:conf/www/LiuRSCVH16 fatcat:7bnq3lg7areatavtfkjaw5pane

Characteristics of document similarity measures for compliance analysis

Asad Sayeed, Soumitra Sarkar, Yu Deng, Rafah Hosn, Ruchi Mahindru, Nithya Rajamani
2009 Proceeding of the 18th ACM conference on Information and knowledge management - CIKM '09  
This paper describes the use of document similarity measures - Cosine similarity and Latent Semantic Indexing - to identify the top candidate templates on which a more detailed (and expensive) compliance  ...  Comparison of results of using the different methods are presented.  ...  It uses a combination of statistical techniques (language models) and heuristic rules to construct the document tree.  ... 
doi:10.1145/1645953.1646106 dblp:conf/cikm/SayeedSDHMR09 fatcat:4udjsossize5jiyv55sv4aan74

Information extraction: beyond document retrieval

Robert Gaizauskas, Yorick Wilks
1998 Journal of Documentation  
In this paper we give a synoptic view of the growth text processing technology of information extraction (IE) whose function is to extract information about a pre-specified set of entities, relations or  ...  Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960's and 70's till the present, discuss the techniques being used to carry out the task,  ...  Acknowledgments Thanks to Beth Sundheim of the US Naval Command, Control, and Ocean Surveillance Center RDT&E Division (NRaD) for detailed comments during the preparation of this paper.  ... 
doi:10.1108/eum0000000007162 fatcat:diexgodbeng4lh6kfrn4kylrfm

Subword-based approaches for spoken document retrieval

Kenney Ng, Victor W. Zue
2000 Speech Communication  
First, what are suitable subword units and how well can they perform? Second, how can these units be reliably extracted from the speech signal?  ...  This thesis explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech  ...  An interesting question, then, is how to select the "best" set of models resulting from multiple training runs.  ... 
doi:10.1016/s0167-6393(00)00008-x fatcat:4jig4v5w25gqpjmbej6k2x2byq

Stemming Effectiveness in Clustering of Arabic Documents

Osama A. Ghanem, Wesam M. Ashour
2012 International Journal of Computer Applications  
From experiments, results show that light stemming achieved best results in terms of recall, precision and F-measure when compared with others stemming.  ...  Stemming is an important technique, used as feature selection to reduce many redundant features have the same root in root-based stemming and have the same syntactical form in light stemming.  ...  -The second stage is based on the introduction of weights associated to index terms in order to improve the retrieval relevant to the user.  ... 
doi:10.5120/7620-0674 fatcat:3uttk5fhu5av5fwd752xpm6yie

Front Matter: Volume 8658

Proceedings of SPIE, Richard Zanibbi, Bertrand Coüasnon
2013 Document Recognition and Retrieval XX  
Numbers in the index correspond to the last two digits of the six-digit CID number.  ...  Utilization of CIDs allows articles to be fully citable as soon as they are published online, and connects the same identifier to all online, print, and electronic versions of the publication.  ...  (United States) 8658 13 Rule-based versus training-based extraction of index terms from business documents: how to combine the results [8658-28] D. Schuster, M. Hanke, K. Muthmann, D.  ... 
doi:10.1117/12.2020094 fatcat:kvr4h3apybgzzcd2kqytooax6q

Summarization from medical documents: a survey

Stergos Afantenos, Vangelis Karkaletsis, Panagiotis Stamatopoulos
2005 Artificial Intelligence in Medicine  
It mainly focuses on the issue of scaling to large collections of documents in various languages and from different media, on personalization issues, on portability to new sub-domains, and on the integration  ...  Objective: The aim of this paper is to survey the recent work in medical documents summarization.  ...  Many thanks also to Ms. Eleni Kapelou and Ms. Irene Doura for checking the use of English.  ... 
doi:10.1016/j.artmed.2004.07.017 pmid:15811783 fatcat:n7u6ji5t2rgkvjktacjf4rdire