Noise reduction in a statistical approach to text categorization
1995
Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '95
This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping. ...
Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of "non-informative words" from texts before training; the use of a truncated singular value decomposition ...
This work is supported in part by NIH Research Grant LM-05416 to Mayo Clinic, and National Library of Medicine Training Grant LM-07041 in Medical Informatics to the University of Minnesota. ...
doi:10.1145/215206.215367
dblp:conf/sigir/Yang95
fatcat:7xixjc7pv5ggbh6ueopoys4lkm
Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers
[article]
2017
arXiv
pre-print
Since automated processes are prone to ambiguity, we also introduce two new content specific noise reduction methodologies. ...
Turkish Wikipedia Named-Entity Recognition and Text Categorization (TWNERTC) dataset is a collection of automatically categorized and annotated sentences obtained from Wikipedia. ...
In Section 4, we explain how to use the gazetteers to automatically annotate and categorize Wikipedia texts to construct datasets along with dataset statistics, and noise reduction methodologies. ...
arXiv:1702.02363v2
fatcat:i7zt6gcncjhhxkaqiiyaarg2mi
Profile Categorization System based on Features Reduction
2018
International Symposium on Artificial Intelligence and Mathematics
In this regard, author profiling tries to determine the profile category of authors by analysing their published texts. ...
This paper presents a profile categorization system to solve the multi-class categorization problem. The system consists of two modules: the processing module and the classifying module. ...
The common feature reduction approach for text categorization is the feature selection. ...
dblp:conf/isaim/MabroukHO18
fatcat:fp4hwacat5cidps3irztw2aibq
Bootstrapping in Text Mining Applications
2016
International Journal of Science and Research (IJSR)
Text mining involves analyzing large corpora of documents with thousands of words with a high level of noise content. ...
The resulting noise-reduced dataset is the input to clustering algorithms. ...
In the context of text categorization, examples are drawn from a heterogeneous set of text documents called a corpus, attributes are words and labels are broad topic areas of the document. ...
doi:10.21275/v5i1.nov152700
fatcat:rsiahnfozncyjc4korrt5tlaze
Multiclass text categorization for automated survey coding
2003
Proceedings of the 2003 ACM symposium on Applied computing - SAC '03
Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer given in response to an open-ended question in a questionnaire (aka survey). ...
We formulate the problem of automated survey coding as a text categorization problem, i.e. as the problem of learning, by means of supervised machine learning techniques, a model of the association between ...
We are grateful to Tom Smith and Jennifer Berktold for providing these texts and for assisting us in their interpretation. ...
doi:10.1145/952686.952691
fatcat:bdumeactm5cjrorlmfrs7dzmki
Multiclass text categorization for automated survey coding
2003
Proceedings of the 2003 ACM symposium on Applied computing - SAC '03
Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer given in response to an open-ended question in a questionnaire (aka survey). ...
We formulate the problem of automated survey coding as a text categorization problem, i.e. as the problem of learning, by means of supervised machine learning techniques, a model of the association between ...
We are grateful to Tom Smith and Jennifer Berktold for providing these texts and for assisting us in their interpretation. ...
doi:10.1145/952532.952691
dblp:conf/sac/GiorgettiS03
fatcat:thpb4ilmxvd4bbyrrfmsuvquum
A Cross-lingual Annotation Projection Approach for Relation Detection
2010
International Conference on Computational Linguistics
In order to make our method more reliable, we introduce three simple projection noise reduction methods. The merit of our method is demonstrated through a novel Korean relation detection task. ...
In this paper, we develop a cross-lingual annotation projection method that leverages parallel corpora to bootstrap a relation detector without significant annotation efforts for a resource-poor language ...
In Section 2, we describe our cross-lingual annotation projection approach to relation detection task. Then, we present the noise reduction methods in Section 3. ...
dblp:conf/coling/KimJLL10
fatcat:3hfhfpvserhqnjviuyda5d5pbe
Analysis of an Automatic Text Content Extraction Approach in Noisy Video Images
2013
International Journal of Computer Applications
The preprocessing de-noises the images with a wavelet-based approach, removing noise in the frequency domain and reducing it by the soft-threshold method. ...
Low contrast, noise and poor quality are the main problems of text extraction in video images. ...
In this paper the proposed method tends to provide an efficient and effective approach to the issue of text content extraction for a wider range of noisy video images. ...
doi:10.5120/11828-7529
fatcat:bgrnjkchcbfylm3vrwvxjloahm
A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification
[chapter]
2009
Lecture Notes in Computer Science
It aims to convert the original textual data into a data-mining-ready structure, where the most significant text features that serve to differentiate between text categories are identified. ...
In this paper, we propose a hybrid statistical FS approach that integrates two existing (statistical FS) techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti⋅Sebastiani ...
Noise Words (N): Common and rare words are collectively defined to be noise words in a document-base. ...
doi:10.1007/978-3-642-03348-3_33
fatcat:dlzofjfyvbflvhs6r7qbzalh34
DACS Dewey index-based Arabic Document Categorization System
2012
International Journal of Computer Applications
This paper is devoted to the development of an Arabic Text Categorization System. First, a stop-words list is generated using a statistical approach which captures the inflation of different Arabic words. ...
Third, a semantic synonyms merge technique is presented for feature reduction. Finally a Dewey-Index Based Back-propagation Artificial Neural Network is developed for Arabic Document Categorization. ...
Furthermore, it is necessary to filter out noise from important text; noise is extraneous text that is not relevant to the task at hand [3]. An example of noise is stop-words. ...
doi:10.5120/7500-0634
fatcat:4fvnz7yqyrhkxb2vfrgksm7324
Text Data Mining: Theory and Methods
2008
Statistics Surveys
This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. ...
Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study. ...
One hopes that by applying dimensionality reduction, one can remove noise from the data and better apply our statistical data mining methods to discover subtle relationships that might exist between the ...
doi:10.1214/07-ss016
fatcat:rndtblcu7zaanjbmbtg5rqkkh4
A Multistage Feature Selection Model for Document Classification Using Information Gain and Rough Set
2014
International Journal of Advanced Research in Artificial Intelligence (IJARAI)
The number of documents is increasing rapidly; therefore, organizing them in digitized form makes text categorization a challenging issue. ...
Hence, to overcome the issues of text categorization, feature selection is considered an efficient technique. ...
Still, the problem that arises is redundancy in the selected terms. Redundant terms are equivalent to noise, which reduces the accuracy of the classifier. ...
doi:10.14569/ijarai.2014.031103
fatcat:rnqvqjzanvhurbyuxtpsllzose
Automatic Web Page Categorization using Principal Component Analysis
2007
2007 40th Annual Hawaii International Conference on System Sciences (HICSS'07)
Today's search engines retrieve tens of thousands of web pages in response to fairly simple query articulations. ...
This research investigates the automatic categorization of web pages using Principal Component Analysis. ...
One approach to increasing the relevance of the results is to categorize the results in anticipation of the user need. ...
doi:10.1109/hicss.2007.98
dblp:conf/hicss/ZhangSDW07
fatcat:ucfrxwu7ajgsxev6pe64mponcm
An novel cluster based feature selection and document classification model on high dimension trek data
2017
International Journal of Engineering & Technology
The main issues in the traditional text mining models in the TREC repository include: 1) each document is represented in vector form with many sparse values; 2) failure to find the document semantic similarity ...
Also, most of the traditional models are applicable to limited text document sets for text analysis. ...
This approach emphasizes on abstract, unlike other text mining approaches in the biomedical domain. In recent days, XML is used as a standard format for information sharing on the web. ...
doi:10.14419/ijet.v7i1.1.10146
fatcat:bounsf7f45hfhl3zvkdaq4yr4y
Pruning the vocabulary for better context recognition
2004
Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.
The representation is high dimensional though, containing many nonconsistent words for text categorization. ...
In this communication our aim is to study the effect of reducing the least relevant words from the bag-of-words representation. ...
ACKNOWLEDGMENT The work is supported by the European Commission through the sixth framework IST Network of Excellence: Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL), contract ...
doi:10.1109/icpr.2004.1334270
dblp:conf/icpr/MadsenSHL04
fatcat:lto3ltt6src43hcv7buqkamnbi