
Improving document clustering in a learned concept space

Jean-François Pessiot, Young-Min Kim, Massih R. Amini, Patrick Gallinari
2010 Information Processing & Management  
We empirically show on four document collections, Reuters-21578, Reuters RCV2-French, 20Newsgroups and WebKB, that this new text representation noticeably increases the performance of the MM model.  ...  On the basis of this assumption we first find term clusters using a classification version of the EM algorithm.  ...  Acknowledgements This work was supported in part by the IST Program of the European Community, under the PASCAL Network of Excellence IST-2002-506778. This publication only reflects the authors' view.  ... 
doi:10.1016/j.ipm.2009.09.007 fatcat:jeaskh6gdfguzo66djeybb5hhi

Semi-Supervised Linear Discriminant Clustering

Chien-Liang Liu, Wen-Hoar Hsaio, Chia-Hoang Lee, Fu-Sheng Gou
2014 IEEE Transactions on Cybernetics  
We conduct experiments on three data sets. The experimental results indicate that the proposed method can generally outperform other semi-supervised methods.  ...  We use soft LDA with hard labels of labeled examples and soft labels of unlabeled examples to find a projection matrix. The clustering is then performed in the new feature space.  ...  Fig. 2(a) and (b) summarizes the experimental results on CiteULike and Reuters-21578 data sets, respectively.  ... 
doi:10.1109/tcyb.2013.2278466 pmid:23996591 fatcat:dpxxp6lcyraxhb2rzrbryy2pqa

Classifying non-gaussian and mixed data sets in their natural parameter space

Cecile Levasseur, Uwe F. Mayer, Ken Kreutz-Delgado
2009 2009 IEEE International Workshop on Machine Learning for Signal Processing  
GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace.  ...  Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques.  ...  Reuters-21578 data set The Reuters-21578 text categorization test collection Distribution 1.0 is considered as the standard benchmark for automatic document organization systems and consists of documents  ... 
doi:10.1109/mlsp.2009.5306227 fatcat:jjftmx3ffvhj3ieojh2upflhyq

Diverse Topic Phrase Extraction through Latent Semantic Analysis

Jilin Chen, Jun Yan, Benyu Zhang, Qiang Yang, Zheng Chen
2006 IEEE International Conference on Data Mining. Proceedings  
Keyword extraction is an efficient approach to managing an explosion of online text on the Web.  ...  To demonstrate the performance of our method, we conducted experiments on two open datasets: 20 Newsgroup and Reuters-21578. We design three novel evaluation metrics, based on which both qualitative and  ...  Therefore, we need to augment the clustering approach with text extraction techniques which can provide short and readable gist of topics covered in a text collection.  ... 
doi:10.1109/icdm.2006.61 dblp:conf/icdm/ChenYZYC06 fatcat:xh2adlzkcrcixeblt56j57oljy

Feature Selection Using Particle Swarm Optimization in Text Categorization

Mehdi Hosseinzadeh Aghdam, Setareh Heidari
2015 Journal of Artificial Intelligence and Soft Computing Research  
The performance of the proposed method is compared with performance of other methods on the Reuters-21578 data set. Experimental results display the superiority of the proposed method.  ...  The high dimensionality of feature space increases the complexity of text categorization process, because it plays a key role in this process.  ...  To show the utility of the proposed algorithm and to compare it with information gain and CHI, a set of experiments were carried out on Reuters-21578 data set.  ... 
doi:10.1515/jaiscr-2015-0031 fatcat:llcotgrf4jbyxb3uutjecfnb2m

User Ex Machina : Simulation as a Design Probe in Human-in-the-Loop Text Analytics [article]

Anamaria Crisan, Michael Correll
2021 arXiv   pre-print
These models remain challenging to optimize and often require a "human-in-the-loop" approach where domain experts use their knowledge to steer and adjust.  ...  We find that user interactions have impacts that differ in magnitude but often negatively affect the quality of the resulting modelling in a way that can be difficult for the user to evaluate.  ...  We also thank the reviewers for their feedback.  ... 
arXiv:2101.02244v1 fatcat:dge3niwcpncjdbo2q3estntdj4

Discovery of hierarchical thematic structure in text collections with adaptive resonance theory

Louis Massey
2008 Neural computing & applications (Print)  
We present experimental results with binary ART1 on the benchmark Reuter-21578 corpus.  ...  Such is the case with the first two clusters (those with "year" and "reuter" as unique attribute), each respectively containing 781 and 1,502 documents distributed among 74 of the 93 topics of the human  ...  Our experimental methodology based on the proven F1 quality measure, benchmark Reuter 21578 corpus, standard bag-of-words vector space representation and well-established pre-processing allows for easy  ... 
doi:10.1007/s00521-008-0178-2 fatcat:d2ztkflhwfabvavbb7eb4k6veu

Joint Image-Text News Topic Detection and Tracking with And-Or Graph Representation [article]

Weixin Li, Jungseock Joo, Hang Qi, Song-Chun Zhu
2015 arXiv   pre-print
Our method achieves superior performance compared to state-of-the-art methods on both a public dataset Reuters-21578 and a self-collected dataset named UCLA Broadcast News Dataset.  ...  The experimental results show that our method can explicitly describe the textual and visual data in news videos and produce meaningful topic trajectories.  ...  ACKNOWLEDGMENT This project is supported by the NSF CDI project CNS 1028381. The authors would like to thank Dr. Francis Steen and Tim Groeling at UCLA, and Dr.  ... 
arXiv:1512.04701v1 fatcat:uzj6cisl7jarhmi264wmeqizdy

An Automatic Text Document Classification using Modified Weight and Semantic Method

Term Frequency-Inverse Document Frequency (TF-IDF) method only assigned weight to the term based on the occurrence of the term.  ...  To analyze the performance of the proposed feature extraction methods, two benchmark datasets like Reuter-21578-R8 and 20 news group and two real time datasets like descriptive type answer dataset and  ...  NMF also used for blind source separating, acoustic signal processing and so on.  ... 
doi:10.35940/ijitee.k2123.1081219 fatcat:6h4rho2pvrdahbygtc47e65v2a

Compactly Supported Basis Functions as Support Vector Kernels for Classification

P. Wittek, Chew Lim Tan
2011 IEEE Transactions on Pattern Analysis and Machine Intelligence  
We argue that it is possible to order the features of a general data set so that consecutive features are statistically related to each other, thus enabling us to interpret the vector representation of  ...  By approximating the signal with compactly supported basis functions and employing the inner product of the embedding L2 space, we gain a new family of wavelet kernels.  ...  We used the Reuters-21578 and the 20News collections to benchmark the performance of the proposed new kernel. Both data sets were obtained from [27].  ... 
doi:10.1109/tpami.2011.28 pmid:21321366 fatcat:i5doawhnp5dmppgsztao2u4ocq

A review on feature selection and feature extraction for text classification

Foram P. Shah, Vibha Patel
2016 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)  
We examine performances of the classifiers applied on standard text categorization test collections and show the enhancements achieved by applying our extraction method.  ...  We project the high dimensional features of documents onto a new feature space having dimensions equal to the number of classes in order to form the abstract features.  ...  We examine performances of the classifiers on 3 standard and popular text collections: the Reuters-21578, 20 Newsgroups, and the ModApte-10 split of Reuters.  ... 
doi:10.1109/wispnet.2016.7566545 fatcat:hhfidimctrd5djjtmbuqmdju2e

Succinct and Informative Cluster Descriptions for Document Repositories [chapter]

Lijun Chen, Guozhu Dong
2006 Lecture Notes in Computer Science  
This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality or fidelity of CDs and how to construct high quality CDs.  ...  Human labeling of clusters is not viable when clustering is performed on demand or for very few users.  ...  validate the understandability of CDs, and (5) adapting previous ideas on the use of emerging patterns and contrasting patterns for building classifiers [23, 24, 25, 26] to construct succinct and informative  ... 
doi:10.1007/11775300_10 fatcat:quglhbvckzec5ggnbqqmtvqfae

Topic Extraction for Documents Based on Compressibility Vector

2012 IEICE transactions on information and systems  
We challenge our proposal with model documents, URCS and Reuters-21578 dataset, for relation analysis and topic extraction. The effectiveness of the proposed methods is shown by the simulations.  ...  Most of the methods being proposed need some processes such as stemming, stop words removal, etc.  ...  Reuters-21578 is currently one of the most widely used test collections in information retrieval, machine learning, and other corpus-based research.  ... 
doi:10.1587/transinf.e95.d.2438 fatcat:ksuy7rpf2zfjhfkmlatk4ggyam

A kernel-based feature weighting for text classification

Peter Wittek, Chew Lim Tan
2009 2009 International Joint Conference on Neural Networks  
Adding expansion terms to the vector representation can also improve effectiveness. However, existing semantic smoothing kernels do not employ term expansion.  ...  This paper proposes a new nonlinear kernel for text classification to exploit semantic relations between terms to add weighted expansion terms.  ...  The F1 measure is a composite measure of precision and recall. The most widely used benchmark corpus is the Reuters-21578 collection.  ... 
doi:10.1109/ijcnn.2009.5179022 dblp:conf/ijcnn/WittekT09 fatcat:e2ed6wltezgd7o7fexfyzuxmgy

Simultaneous Learning of Sentence Clustering and Class Prediction for Improved Document Classification

Minyoung Kim
2017 International Journal of Fuzzy Logic and Intelligent Systems  
In document classification it is common to represent a document as the so called bag-of-words form, which is essentially a global term distribution indicating how often certain terms appear in a text.  ...  a weighted term frequency vector that is aggregated from all sentences but weighed differently cluster-wise according to the prediction in the first model.  ...  Acknowledgements This study was supported by Seoul National University of Science & Technology.  ... 
doi:10.5391/ijfis.2017.17.1.35 fatcat:jpwhytonajeqrnmwf5kbxcixaq