
Do Thesauri enhance rule-based categorization for OCR text?

Kazem Taghva, Jeffrey Coombs, Tapas Kanungo, Elisa H. Barney Smith, Jianying Hu, Paul B. Kantor
2003 Document Recognition and Retrieval X  
A rule-based automatic text categorizer was tested to see if two types of thesaurus expansion, called query expansion and Junker expansion respectively, would improve categorization.  ...  Thesauri used were domain-specific to an OCR test collection focussed on a single topic. Results show that neither type of expansion significantly improved categorization.  ...  We conclude that the use of a domain-specific thesaurus will not significantly improve the performance of a rule-based text categorizer on OCR text.  ... 
doi:10.1117/12.472835 dblp:conf/drr/TaghvaC03 fatcat:4lmc6xnm6badld2tvq7xzt35sy

Text-Mining: Application Development Challenges [chapter]

Sundar Varadarajan, Kas Kasravi, Ronen Feldman
2003 Applications and Innovations in Intelligent Systems X  
With focus on rule-based information extraction, and references to actual cases, the authors share their experiences from developing several text-mining applications in diverse industries.  ...  This paper reviews the best practices and challenges for project managers and developers involved in implementing text-mining applications.  ...  In a nutshell, information extraction was found to provide a much better infrastructure for text-mining than text categorization.  ... 
doi:10.1007/978-1-4471-0649-4_17 fatcat:ruq2vvv5xzbxtp5xl3bpuncjr4

All Text Considered: A Perspective on Mass Digitizing and Archival Processing

Larisa Miller
2013 The American Archivist  
Amid stagnant or diminishing resources, archival repositories are increasingly expected to digitize entire collections for the Internet.  ...  This article explores the idea of coupling robust collection-level descriptions to mass digitization and optical character recognition to provide full-text search of unprocessed and backlogged modern collections  ...  spoken-word video and audio recordings, speech-to-text software might produce searchable text comparable to that of OCR for text-based materials.  ... 
doi:10.17723/aarc.76.2.6q005254035w2076 fatcat:dqc2ucrqa5fcpd3ezzppfav3be

TEI2019 "What is Text, really? TEI and beyond" Book of Abstracts [article]

TEI2019 Local Organisers And Contributors
2019 Zenodo  
However, for object-based disciplines, like archaeology or museology, where text and its encoding is only a small part of their data modelling ecosystem, the value of TEI is not so [...]  ...  For text-centric disciplines the TEI offers a range of solutions that address core research needs.  ...  ACKNOWLEDGMENTS This work was partially supported by the "Patto per Catania" under the "Fondo Sviluppo e Coesione 2014-2020: Piano per il Mezzogiorno". 1 TTHUB: Text Technologies Hub for Extending  ... 
doi:10.5281/zenodo.3445894 fatcat:ttmfugaau5btjg2jgqlu4t6tia

Modeling and solving term mismatch for full-text retrieval

Le Zhao
2012 SIGIR Forum  
A part of the work has turned into this dissertation, and the rest prepared me well for my future adventures.  ...  However, it was not well understood how often term mismatch happens in retrieval, how important it is for retrieval, or how it affects retrieval performance.  ...  Many of them are OCR texts, and contain spelling and spacing errors.  ... 
doi:10.1145/2422256.2422277 fatcat:iboh56u5kvdrhcnt4uqyorwvp4

A word spotting framework for historical machine-printed documents

A. L. Kesidis, E. Galiotou, B. Gatos, I. Pratikakis
2010 International Journal on Document Analysis and Recognition  
morphological generator that enables searching in documents using only a base word-form for locating all the corresponding inflected word-forms and a synonym dictionary that further facilitates  ...  In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine.  ...  on OCR.  ... 
doi:10.1007/s10032-010-0134-4 fatcat:2vqu3k6qjzbclagqmebyszmt4y

Indexing and Indices: An Essential Component of Information Discovery

Donald T. Hawkins
2016 Against the Grain  
JSTOR did a pilot indexing project with three disciplines, built rule bases using their thesauri, and automatically indexed them.  ...  (AI) developed a taxonomic structure and indexing rules for AACR.  ... 
doi:10.7771/2380-176x.6230 fatcat:psj7726n6vaapiqfhfkvnxjn64

Solon: A Holistic Approach for Modelling, Managing and Mining Legal Sources

Marios Koniaris, George Papastefanatos, Ioannis Anagnostopoulos
2018 Algorithms  
It utilizes a novel method for extracting semantic representations of legal sources from unstructured formats, such as PDF and HTML text files, interlinking and enhancing them with classification features  ...  However, legal documents are mainly stored and offered in different sources and formats that do not facilitate semantic machine-readable techniques, thus making it difficult for legal stakeholders to acquire  ...  Solon utilizes the Apache Tika framework for converting PDF files into plain text. Tesseract OCR can be seamlessly integrated with Apache Tika so as to handle scanned text stored in PDF images.  ... 
doi:10.3390/a11120196 fatcat:vku3k3eahndelgx7yffmpctxr4

Semantic enrichment for enhancing LAM data and supporting digital humanities. Review article

Marcia Lei Zeng
2019 El Profesional de la Informacion  
This review article focuses on semantic enrichment for enhancing LAM data and supporting Digital Humanities.  ...  In order to enhance LAM data's quality and discoverability while enabling a self-sustaining ecosystem, "semantic enrichment" becomes a strategy increasingly used by LAMs during recent years.  ...  Acknowledgements The author would like to thank Dawn Sedor for providing valuable feedback and editorial assistance.  ... 
doi:10.3145/epi.2019.ene.03 fatcat:4ggte4plvfb6jlcszogwtscd4m

DMoG : A Data-Based Morphological Guesser

Vojtěch Kovář, Pavel Rychlý
2021 Zenodo  
We present a novel corpus-based approach to lemmatization of unknown words.  ...  The tool learns affix patterns from annotated data, and based on these patterns, it predicts other word forms that should be present in the corpus.  ...  The South Moravian Centre graciously funded the second author's work for International Mobility as a part of the Brno PhD. Talent project.  ... 
doi:10.5281/zenodo.6935329 fatcat:6jqt25fjcfe5fmcwfkmv2al4oe

Semantic Orientation of Sermons From the 1920s to 1980s—Areas of Consistency and Variability

Denise A. D. Bedford
2013 Journal of Cultural and Religious Studies  
Based on norms for white male adults, the score is moderately high for the Anxiety scale. It is between two and three standard deviations above the mean.  ...  For these reasons, two semantic technologies were selected to support data processing. The first technology is the SAS Content Categorization Suite.  ... 
doi:10.17265/2328-2177/2013.01.006 fatcat:ushs3oe3yrcrli3ayks22nr42m

Ontology Population and Enrichment: State of the Art [chapter]

Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Anastasia Krithara, Elias Zavitsanos
2011 Lecture Notes in Computer Science  
This breakdown of the learning process is used as a basis for the comparative analysis of existing tools and approaches.  ...  The purpose of this chapter is to present a survey of the most relevant methods, techniques and tools used for the task of ontology learning.  ...  SOBA uses a standard rule-based information extraction system, an enhanced version of SProUT [32], while [7] a part-of-speech tagger and a module for named entity recognition.  ... 
doi:10.1007/978-3-642-20795-2_6 fatcat:it6o6nsnnjhg5dzs3xglpjm6qy

SEMANTIC SEARCH BASED ON NATURAL LANGUAGE PROCESSING – A NUMISMATIC EXAMPLE

Karsten Tolle, Patricia Klinger, Sebastian Gampe, Ulrike Peter
2018 Journal of Ancient History and Archaeology  
Iconographic representations on ancient artifacts are described in many existing databases and literature as human readable text.  ...  As we show in our experiments based on numismatic datasets, the approach is generic in the sense that once the system is trained on one dataset, it can be applied without any further manual work also to  ...  Knowledge-based methods are used for domain-specific tasks in which a fixed set of relations are to be extracted. Usually these methods are based on pattern-matching rules.  ... 
doi:10.14795/j.v5i3.334 fatcat:ny2anbi3orfj3bcg6akp2exzsi

An analytical study of information extraction from unstructured and multidimensional big data

Kiran Adnan, Rehan Akbar
2019 Journal of Big Data  
First, a systematic review of existing techniques for IE subtasks for each data type, i.e. text, image, audio and video.  ...  "Information extraction from text" section presents detailed discussion on IE subtasks such as NER, RE, EE, their techniques and comparison of techniques for text data.  ...  So it is easy to incorporate domain knowledge [114] Heavily rely on domain thesauri [11] Generating training data is time consuming in learning-based approaches whereas rule-based approaches require  ... 
doi:10.1186/s40537-019-0254-8 fatcat:qy5l55um7feeblec4hxohr3pqa

Recent Advances In Roman Numismatics

Ethan Gruber, John Dobbins
2013 Zenodo  
It is paramount to record the thought processes which have informed the intellectual and technical decisions made during the development of these projects for the benefit of future generations of scholars  ...  Integrating CHRR Data into OCRE CHRR and the Future of Roman Numismatics Where do coin hoards go from here?  ...  The indexing of OCRE into these search engines broadens access to the collection. To enhance the user experience, when Cocoon builds the HTML representation of a NUDS/XML document for a coin type, links  ... 
doi:10.5281/zenodo.45328 fatcat:skzzs6nxavcrrjquixdbn3bz4e