Experiments in high-dimensional text categorization

Fred J. Damerau, Tong Zhang, Sholom M. Weiss, Nitin Indurkhya
2002 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02  
We present results for automated text categorization of the Reuters-810000 collection of news stories.  ...  We divide the data into monthly groups and provide an initial benchmark of text categorization performance on the complete collection.  ...  In this paper, we make use of this data set to establish a new benchmark for evaluating text categorization performance in a high dimensional space.  ... 
doi:10.1145/564376.564442 dblp:conf/sigir/DamerauZWI02 fatcat:voguniipqzggdpmbtgsks6k3fe

Arabic Text Classification Based on Features Reduction Using Artificial Neural Networks

F. A. Zaghoul, S. Al-Dhaheri
2013 2013 UKSim 15th International Conference on Computer Modelling and Simulation  
Text categorization is one solution to tackle this problem.  ...  The system's primary source of knowledge is an Arabic text categorization (TC) corpus built locally at the University of Jordan and available at http://nlp.ju.edu.jo.  ...  MSE VS epochs for Arabic documents categorization Experiment I In the first experiment, we have experimented with the common used feature selection method TF_IDF, in order to reduce the high dimensionality  ... 
doi:10.1109/uksim.2013.135 dblp:conf/uksim/ZaghoulA13 fatcat:3h67ywpzhbdtrgey3stbuywhuu

A Centroid Based Text Categorization Method Using Mean Shift

Man Yuan
2013 Journal of Information and Computational Science  
In this paper, we propose a method for text categorization based on Mean Shift. Mean Shift algorithm is a well developed technique in computer vision researches.  ...  Text categorization is an important research topic in Information Retrieval area and it is one of the key techniques for handling and organizing the huge amount of text data available on the Internet and  ...  Related Work Dimension Reduction in Text Categorization The most critical challenge for text categorization is the high dimensionality of the natural language text, often referred to as the "curse of  ... 
doi:10.12733/jics20102921 fatcat:b4ohjpvvbbcxlibx5nijmq2gdu

A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization

Jingyang Li, Maosong Sun, Xian Zhang
2006 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL - ACL '06  
in text categorization systems.  ...  Words and character-bigrams are both used as features in Chinese text processing tasks, but no systematic comparison or analysis of their values as features for Chinese text categorization has been reported  ...  Few similar comparative studies have been reported for Text Categorization (Li et al., 2003) so far in literature.  ... 
doi:10.3115/1220175.1220244 dblp:conf/acl/LiSZ06 fatcat:cewpik3qq5bktn6oii5vizafzm

The Lao Text Classification Method Based on KNN

Zhuo Chen, Lan Jiang Zhou, Xuan Da Li, Jia Nan Zhang, Wen Jie Huo
2020 Procedia Computer Science  
Text categorization is a common application scenario in the NLP field, and has many applications in public opinion monitoring and news classification.  ...  Because the text is stored as a vector space, the dimension is high.  ... 
doi:10.1016/j.procs.2020.02.053 fatcat:zkixkrxc75cxnnj3zfbk4y6s3q

Enhancement of DTP Feature Selection Method for Text Categorization [chapter]

Edgar Moyotl-Hernández, Héctor Jiménez-Salazar
2005 Lecture Notes in Computer Science  
This paper studies the structure of vectors obtained by using term selection methods in high-dimensional text collection.  ...  Typically even a moderately sized collection of text has tens or hundreds of thousands of terms. Hence, the document vectors are high-dimensional.  ...  However, the vectors produced by DTP have a "sparse" behavior that is not commonly found in low-dimensional text collections.  ... 
doi:10.1007/978-3-540-30586-6_80 fatcat:h33dbbaj5zboperm2yrmffnkt4

Evaluating text categorization in the presence of OCR errors

Kazem Taghva, Thomas A. Nartker, Julie Borsack, Steven Lumos, Allen Condit, Ron Young, Paul B. Kantor, Daniel P. Lopresti, Jiangying Zhou
2000 Document Recognition and Retrieval VIII  
In this paper we describe experiments that investigate the effects of OCR errors on text categorization.  ...  We also observe that dimensionality reduction techniques eliminate a large number of OCR errors and improve categorization results.  ...  Our experiments show that OCR errors have little effect on text categorization once some form of dimensionality reduction has been applied.  ... 
doi:10.1117/12.410861 dblp:conf/drr/TaghvaNBLCY01 fatcat:cliqa7xqd5hrzkzdvnwnkc53o4

Random Subspace Method in Text Categorization

Mehrdad J. Gangeh, Mohamed S. Kamel, Robert P.W. Duin
2010 2010 20th International Conference on Pattern Recognition  
Due to the huge number of terms in even a moderate-size text corpus, high dimensional feature space is an intrinsic problem in TC.  ...  In text categorization (TC), which is a supervised technique, a feature vector of terms or phrases is usually used to represent the documents.  ...  However, it is not extensively investigated on strong classifiers such as SVM (that perform rather well in high dimensional feature space) nor in the area of text categorization.  ... 
doi:10.1109/icpr.2010.505 dblp:conf/icpr/GangehKD10 fatcat:u5nyw3mws5faxgv7sx4pz2yyhe

A Cluster Tree Method For Text Categorization

Zhaocai Sun, Yunming Ye, Weiru Deng, Zhexue Huang
2011 Procedia Engineering  
Experiments show that the cluster tree solves the high-dimensionality problem and outperforms C4.5 and CART on text data.  ...  Since more features are ignored, the classification accuracy is not high. To solve the problem, this paper uses a cluster tree for text categorization. Unlike familiar decision trees (e.g.  ...  However, previous works have found that the classification accuracy of decision tree is not high on text categorization. The difficulty of dealing with the text data is the high dimensionality [6].  ... 
doi:10.1016/j.proeng.2011.08.709 fatcat:fq4hkohe3rglvjtsr2djbczaxy

Text categorization with Support Vector Machines: Learning with many relevant features [chapter]

Thorsten Joachims
1998 Lecture Notes in Computer Science  
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples.  ...  It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings.  ...  With their ability to generalize well in high dimensional feature spaces, SVMs eliminate the need for feature selection, making the application of text categorization considerably easier.  ... 
doi:10.1007/bfb0026683 fatcat:e6wov4nsd5fbjkdl4oyllkgssi

Enhancing Text Categorization with Semantic-enriched Representation and Training Data Augmentation

X. Lu, B. Zheng, A. Velivelli, C. Zhai
2006 JAMIA Journal of the American Medical Informatics Association  
Design: We studied two approaches that enhance the text categorization performance on sparse and high data dimensionality: (1) semantic-preserving dimension reduction by representing text with semantic-enriched  ...  In the real world, many information retrieval tasks are difficult because of high data dimensionality and the lack of annotated examples to train a retrieval algorithm.  ...  Conclusion In summary, we have studied two approaches for enhancing text categorization under the scenario of high dimensionality and scarce training data: (1) semantic-preserving dimension reduction with  ... 
doi:10.1197/jamia.m2051 pmid:16799127 pmcid:PMC1561790 fatcat:iix6pfcutnhvdg3q4sd53xihyy

Some Investigations on Machine Learning Techniques for Automated Text Categorization

Bhagirath Prajapati, Sanjay Garg, N C Chauhan
2013 International Journal of Computer Applications  
The automated categorization (classification) of texts into predefined categories is one of the widely explored fields of research in text mining.  ...  Now-a-days, availability of digital data is very high, and to manage them in predefined categories has become a challenging task.  ...  Step 3: For TC high dimensionality of term space is not proper for many sophisticated algorithms (e.g. LLSF [8]). Hence, before classification, dimensionality reduction (DR) is applied.  ... 
doi:10.5120/12340-8617 fatcat:jom2wztpfrdghc6vphb6dmiqm4

Improving arabic text categorization using decision trees

Fouzi Harrag, Eyas El-Qawasmeh, Pit Pichappan
2009 2009 First International Conference on Networked Digital Technologies  
To test the effectiveness of the proposed model, experiments were conducted using an in-house collected Arabic corpus for text categorization.  ...  The results showed that the proposed model was able to achieve high categorization effectiveness as measured by precision, recall and F-measure.  ...  Related work in Arabic text categorization Many researchers have been working on text categorization in English and other European languages, however few researchers work on text categorization for Arabic  ... 
doi:10.1109/ndt.2009.5272214 fatcat:yzbqjpqnhbehnizct33a7rfnuy

An empirical evaluation of text classification and feature selection methods

Muazzam Ahmed Siddiqui
2016 Artificial intelligence research  
Support Vector Machine with linear kernel reigned supreme for text categorization tasks producing highest F measures and low training times even in the presence of high class skew.  ...  An extensive empirical evaluation of classifiers and feature selection methods for text categorization is presented.  ...  Text categorization can also be seen as the problem of establishing decision boundaries in the high dimensional feature space.  ... 
doi:10.5430/air.v5n2p70 fatcat:utpb25jxhreiflpge5dsvjetmm