Noise reduction in a statistical approach to text categorization

Yiming Yang
1995 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '95  
This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping.  ...  Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of "non-informative words" from texts before training; the use of a truncated singular value decomposition  ...  This work is supported in part by NIH Research Grant LM-05416 to Mayo Clinic, and National Library of Medicine Training Grant LM-07041 in Medical Informatics to the University of Minnesota.  ... 
doi:10.1145/215206.215367 dblp:conf/sigir/Yang95 fatcat:7xixjc7pv5ggbh6ueopoys4lkm

Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers [article]

H. Bahadir Sahin, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren, Ozan Sonmez
2017 arXiv   pre-print
Since automated processes are prone to ambiguity, we also introduce two new content specific noise reduction methodologies.  ...  Turkish Wikipedia Named-Entity Recognition and Text Categorization (TWNERTC) dataset is a collection of automatically categorized and annotated sentences obtained from Wikipedia.  ...  In Section 4, we explain how to use the gazetteers to automatically annotate and categorize Wikipedia texts to construct datasets along with dataset statistics, and noise reduction methodologies.  ... 
arXiv:1702.02363v2 fatcat:i7zt6gcncjhhxkaqiiyaarg2mi

Profile Categorization System based on Features Reduction

Olfa Mabrouk, Lobna Hlaoua, Mohamed Nazih Omri
2018 International Symposium on Artificial Intelligence and Mathematics  
In this regard, author profiling tries to determine the profile category of authors by analysing their published texts.  ...  This paper presents a profile categorization system to solve the multi-class categorization problem. The system consists of two modules: the processing module and the classifying module.  ...  The common feature reduction approach for text categorization is the feature selection.  ... 
dblp:conf/isaim/MabroukHO18 fatcat:fp4hwacat5cidps3irztw2aibq

Bootstrapping in Text Mining Applications

2016 International Journal of Science and Research (IJSR)  
Text mining involves analyzing large corpora of documents with thousands of words with a high level of noise content.  ...  The resulting noise-reduced dataset is the input to clustering algorithms.  ...  In the context of text categorization, examples are drawn from a heterogeneous set of text documents called a corpus, attributes are words and labels are broad topic areas of the document.  ... 
doi:10.21275/v5i1.nov152700 fatcat:rsiahnfozncyjc4korrt5tlaze

Multiclass text categorization for automated survey coding

Daniela Giorgetti, Fabrizio Sebastiani
2003 Proceedings of the 2003 ACM symposium on Applied computing - SAC '03  
Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer given in response to an open-ended question in a questionnaire (aka survey).  ...  We formulate the problem of automated survey coding as a text categorization problem, i.e. as the problem of learning, by means of supervised machine learning techniques, a model of the association between  ...  We are grateful to Tom Smith and Jennifer Berktold for providing these texts and for assisting us in their interpretation.  ... 
doi:10.1145/952686.952691 fatcat:bdumeactm5cjrorlmfrs7dzmki

Multiclass text categorization for automated survey coding

Daniela Giorgetti, Fabrizio Sebastiani
2003 Proceedings of the 2003 ACM symposium on Applied computing - SAC '03  
Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer given in response to an open-ended question in a questionnaire (aka survey).  ...  We formulate the problem of automated survey coding as a text categorization problem, i.e. as the problem of learning, by means of supervised machine learning techniques, a model of the association between  ...  We are grateful to Tom Smith and Jennifer Berktold for providing these texts and for assisting us in their interpretation.  ... 
doi:10.1145/952532.952691 dblp:conf/sac/GiorgettiS03 fatcat:thpb4ilmxvd4bbyrrfmsuvquum

A Cross-lingual Annotation Projection Approach for Relation Detection

Seokhwan Kim, Minwoo Jeong, Jonghoon Lee, Gary Geunbae Lee
2010 International Conference on Computational Linguistics  
In order to make our method more reliable, we introduce three simple projection noise reduction methods. The merit of our method is demonstrated through a novel Korean relation detection task.  ...  In this paper, we develop a cross-lingual annotation projection method that leverages parallel corpora to bootstrap a relation detector without significant annotation efforts for a resource-poor language  ...  In Section 2, we describe our cross-lingual annotation projection approach to relation detection task. Then, we present the noise reduction methods in Section 3.  ... 
dblp:conf/coling/KimJLL10 fatcat:3hfhfpvserhqnjviuyda5d5pbe

Analysis of an Automatic Text Content Extraction Approach in Noisy Video Images

C. P. Sumathi, N. Priya
2013 International Journal of Computer Applications  
The pre processing is done to de-noise the images through wavelet based approach by removing noise in the frequency field and reducing by the soft-threshold method.  ...  Low contrast, noise and poor quality are the main problems of text extraction in video images.  ...  In this paper the proposed method tends to provide an efficient and effective approach to the issue of text content extraction for a wider range of noisy video images.  ... 
doi:10.5120/11828-7529 fatcat:bgrnjkchcbfylm3vrwvxjloahm

A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification [chapter]

Yanbo J. Wang, Frans Coenen, Robert Sanderson
2009 Lecture Notes in Computer Science  
It aims to convert the original textual data in a data-mining-ready structure, where the most significant text-features that serve to differentiate between text categories are identified.  ...  In this paper, we propose a hybrid statistical FS approach that integrates two existing (statistical FS) techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti⋅Sebastiani  ...  Noise Words (N): Common and rare words are collectively defined to be noise words in a document-base.  ... 
doi:10.1007/978-3-642-03348-3_33 fatcat:dlzofjfyvbflvhs6r7qbzalh34

DACS Dewey index-based Arabic Document Categorization System

A. F. Alajmi, E. M. Saad, M. H. Awadalla
2012 International Journal of Computer Applications  
This paper is devoted to the development of Arabic Text Categorization System. First, a stop-words list is generated using statistical approach which captures the inflation of different Arabic words.  ...  Third, a semantic synonyms merge technique is presented for feature reduction. Finally a Dewey-Index Based Back-propagation Artificial Neural Network is developed for Arabic Document Categorization.  ...  Furthermore, it is necessary to filter out noise from important text; noise is an extraneous text that is not relevant to the task at hand [3]. An example of noise is stop-words.  ... 
doi:10.5120/7500-0634 fatcat:4fvnz7yqyrhkxb2vfrgksm7324

Text Data Mining: Theory and Methods

Jeffrey L. Solka
2008 Statistics Survey  
This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining.  ...  Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.  ...  One hopes that by applying dimensionality reduction, one can remove noise from the data and better apply our statistical data mining methods to discover subtle relationships that might exist between the  ... 
doi:10.1214/07-ss016 fatcat:rndtblcu7zaanjbmbtg5rqkkh4

A Multistage Feature Selection Model for Document Classification Using Information Gain and Rough Set

Mrs. Leena., Dr. Mohammed
2014 International Journal of Advanced Research in Artificial Intelligence (IJARAI)  
The number of documents is increasing rapidly; therefore, organizing them in digitized form makes text categorization a challenging issue.  ...  Hence, to overcome the issues of text categorization, feature selection is considered an efficient technique.  ...  Still, the problem that arises is redundancy in the selected terms. Redundant terms are equivalent to noise, which causes a reduction in the accuracy of the classifier.  ... 
doi:10.14569/ijarai.2014.031103 fatcat:rnqvqjzanvhurbyuxtpsllzose

Automatic Web Page Categorization using Principal Component Analysis

Richong Zhang, Michael Shepherd, Jack Duffy, Carolyn Watters
2007 2007 40th Annual Hawaii International Conference on System Sciences (HICSS'07)  
Today's search engines retrieve tens of thousands of web pages in response to fairly simple query articulations.  ...  This research investigates the automatic categorization of web pages using Principal Component Analysis.  ...  One approach to increasing the relevance of the results is to categorize the results in anticipation of the user need.  ... 
doi:10.1109/hicss.2007.98 dblp:conf/hicss/ZhangSDW07 fatcat:ucfrxwu7ajgsxev6pe64mponcm

An novel cluster based feature selection and document classification model on high dimension trek data

Lalitha Kumari, Ch. Satyanarayana
2017 International Journal of Engineering & Technology  
The main issues in the traditional text mining models in TREC repository include: 1) Each document is represented in vector form with many sparsity values 2) Failed to find the document semantic similarity  ...  Also, most of the traditional models are applicable to limited text document sets for text analysis.  ...  This approach emphasizes on abstract, unlike other text mining approaches in the biomedical domain. In recent days, XML is used as a standard format for information sharing on the web.  ... 
doi:10.14419/ijet.v7i1.1.10146 fatcat:bounsf7f45hfhl3zvkdaq4yr4y

Pruning the vocabulary for better context recognition

R.E. Madsen, S. Sigurdsson, L.K. Hansen, J. Larsen
2004 Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.  
The representation is high dimensional though, containing many nonconsistent words for text categorization.  ...  In this communication our aim is to study the effect of reducing the least relevant words from the bag-of-words representation.  ...  ACKNOWLEDGMENT The work is supported by the European Commission through the sixth framework IST Network of Excellence: Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL), contract  ... 
doi:10.1109/icpr.2004.1334270 dblp:conf/icpr/MadsenSHL04 fatcat:lto3ltt6src43hcv7buqkamnbi