Improving document clustering in a learned concept space
2010
Information Processing & Management
We empirically show on four document collections, Reuters-21578, Reuters RCV2-French, 20Newsgroups and WebKB, that this new text representation noticeably increases the performance of the MM model. ...
On the basis of this assumption we first find term clusters using a classification version of the EM algorithm. ...
Acknowledgements: This work was supported in part by the IST Program of the European Community, under the PASCAL Network of Excellence IST-2002-506778. This publication only reflects the authors' views. ...
doi:10.1016/j.ipm.2009.09.007
fatcat:jeaskh6gdfguzo66djeybb5hhi
Semi-Supervised Linear Discriminant Clustering
2014
IEEE Transactions on Cybernetics
We conduct experiments on three data sets. The experimental results indicate that the proposed method can generally outperform other semi-supervised methods. ...
We use soft LDA with hard labels of labeled examples and soft labels of unlabeled examples to find a projection matrix. The clustering is then performed in the new feature space. ...
Fig. 2(a) and (b) summarize the experimental results on the CiteULike and Reuters-21578 data sets, respectively. ...
doi:10.1109/tcyb.2013.2278466
pmid:23996591
fatcat:dpxxp6lcyraxhb2rzrbryy2pqa
Classifying non-gaussian and mixed data sets in their natural parameter space
2009
2009 IEEE International Workshop on Machine Learning for Signal Processing
GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace. ...
Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. ...
Reuters-21578 data set: The Reuters-21578 text categorization test collection Distribution 1.0 is considered the standard benchmark for automatic document organization systems and consists of documents ...
doi:10.1109/mlsp.2009.5306227
fatcat:jjftmx3ffvhj3ieojh2upflhyq
Diverse Topic Phrase Extraction through Latent Semantic Analysis
2006
IEEE International Conference on Data Mining. Proceedings
Keyword extraction is an efficient approach to managing an explosion of online text on the Web. ...
To demonstrate the performance of our method, we conducted experiments on two open datasets: 20 Newsgroups and Reuters-21578. We design three novel evaluation metrics, based on which both qualitative and ...
Therefore, we need to augment the clustering approach with text extraction techniques which can provide short and readable gist of topics covered in a text collection. ...
doi:10.1109/icdm.2006.61
dblp:conf/icdm/ChenYZYC06
fatcat:xh2adlzkcrcixeblt56j57oljy
Feature Selection Using Particle Swarm Optimization in Text Categorization
2015
Journal of Artificial Intelligence and Soft Computing Research
The performance of the proposed method is compared with the performance of other methods on the Reuters-21578 data set. Experimental results display the superiority of the proposed method. ...
The high dimensionality of the feature space increases the complexity of the text categorization process, because it plays a key role in this process. ...
To show the utility of the proposed algorithm and to compare it with information gain and CHI, a set of experiments was carried out on the Reuters-21578 data set. ...
doi:10.1515/jaiscr-2015-0031
fatcat:llcotgrf4jbyxb3uutjecfnb2m
User Ex Machina: Simulation as a Design Probe in Human-in-the-Loop Text Analytics
[article]
2021
arXiv
pre-print
These models remain challenging to optimize and often require a "human-in-the-loop" approach where domain experts use their knowledge to steer and adjust. ...
We find that user interactions have impacts that differ in magnitude but often negatively affect the quality of the resulting modelling in a way that can be difficult for the user to evaluate. ...
We also thank the reviewers for their feedback. ...
arXiv:2101.02244v1
fatcat:dge3niwcpncjdbo2q3estntdj4
Discovery of hierarchical thematic structure in text collections with adaptive resonance theory
2008
Neural computing & applications (Print)
We present experimental results with binary ART1 on the benchmark Reuters-21578 corpus. ...
Such is the case with the first two clusters (those with "year" and "reuter" as unique attribute), each respectively containing 781 and 1,502 documents distributed among 74 of the 93 topics of the human ...
Our experimental methodology based on the proven F1 quality measure, benchmark Reuters-21578 corpus, standard bag-of-words vector space representation and well-established pre-processing allows for easy ...
doi:10.1007/s00521-008-0178-2
fatcat:d2ztkflhwfabvavbb7eb4k6veu
Joint Image-Text News Topic Detection and Tracking with And-Or Graph Representation
[article]
2015
arXiv
pre-print
Our method achieves superior performance compared to state-of-the-art methods on both a public dataset Reuters-21578 and a self-collected dataset named UCLA Broadcast News Dataset. ...
The experimental results show that our method can explicitly describe the textual and visual data in news videos and produce meaningful topic trajectories. ...
ACKNOWLEDGMENT This project is supported by the NSF CDI project CNS 1028381. The authors would like to thank Dr. Francis Steen and Tim Groeling at UCLA, and Dr. ...
arXiv:1512.04701v1
fatcat:uzj6cisl7jarhmi264wmeqizdy
An Automatic Text Document Classification using Modified Weight and Semantic Method
2019
Volume 8, Issue 10, August 2019, Regular Issue
The Term Frequency-Inverse Document Frequency (TF-IDF) method only assigns a weight to a term based on the occurrence of the term. ...
To analyze the performance of the proposed feature extraction methods, two benchmark datasets, Reuters-21578-R8 and 20 Newsgroups, and two real-time datasets, such as a descriptive-type answer dataset and ...
NMF is also used for blind source separation, acoustic signal processing, and so on. ...
doi:10.35940/ijitee.k2123.1081219
fatcat:6h4rho2pvrdahbygtc47e65v2a
Compactly Supported Basis Functions as Support Vector Kernels for Classification
2011
IEEE Transactions on Pattern Analysis and Machine Intelligence
We argue that it is possible to order the features of a general data set so that consecutive features are statistically related to each other, thus enabling us to interpret the vector representation of ...
By approximating the signal with compactly supported basis functions and employing the inner product of the embedding L2 space, we gain a new family of wavelet kernels. ...
We used the Reuters-21578 and the 20News collections to benchmark the performance of the proposed new kernel. Both data sets were obtained from [27]. ...
doi:10.1109/tpami.2011.28
pmid:21321366
fatcat:i5doawhnp5dmppgsztao2u4ocq
A review on feature selection and feature extraction for text classification
2016
2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)
We examine performances of the classifiers applied on standard text categorization test collections and show the enhancements achieved by applying our extraction method. ...
We project the high dimensional features of documents onto a new feature space having dimensions equal to the number of classes in order to form the abstract features. ...
We examine performances of the classifiers on 3 standard and popular text collections: the Reuters-21578, 20 Newsgroups, and the ModApte-10 split of Reuters. ...
doi:10.1109/wispnet.2016.7566545
fatcat:hhfidimctrd5djjtmbuqmdju2e
Succinct and Informative Cluster Descriptions for Document Repositories
[chapter]
2006
Lecture Notes in Computer Science
This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality or fidelity of CDs and how to construct high quality CDs. ...
Human labeling of clusters is not viable when clustering is performed on demand or for very few users. ...
validate the understandability of CDs, and (5) adapting previous ideas on the use of emerging patterns and contrasting patterns for building classifiers [23, 24, 25, 26] to construct succinct and informative ...
doi:10.1007/11775300_10
fatcat:quglhbvckzec5ggnbqqmtvqfae
Topic Extraction for Documents Based on Compressibility Vector
2012
IEICE transactions on information and systems
We challenge our proposal with model documents, URCS and Reuters-21578 dataset, for relation analysis and topic extraction. The effectiveness of the proposed methods is shown by the simulations. ...
Most of the proposed methods need some processing steps such as stemming, stop-word removal, etc. ...
Reuters-21578 is currently one of the most widely used test collections in information retrieval, machine learning, and other corpus-based research. ...
doi:10.1587/transinf.e95.d.2438
fatcat:ksuy7rpf2zfjhfkmlatk4ggyam
A kernel-based feature weighting for text classification
2009
2009 International Joint Conference on Neural Networks
Adding expansion terms to the vector representation can also improve effectiveness. However, existing semantic smoothing kernels do not employ term expansion. ...
This paper proposes a new nonlinear kernel for text classification to exploit semantic relations between terms to add weighted expansion terms. ...
The F1 measure is a composite measure of precision and recall. The most widely used benchmark corpus is the Reuters-21578 collection. ...
doi:10.1109/ijcnn.2009.5179022
dblp:conf/ijcnn/WittekT09
fatcat:e2ed6wltezgd7o7fexfyzuxmgy
Simultaneous Learning of Sentence Clustering and Class Prediction for Improved Document Classification
2017
International Journal of Fuzzy Logic and Intelligent Systems
In document classification it is common to represent a document in the so-called bag-of-words form, which is essentially a global term distribution indicating how often certain terms appear in a text. ...
a weighted term frequency vector that is aggregated from all sentences but weighted differently cluster-wise according to the prediction in the first model. ...
Acknowledgements This study was supported by Seoul National University of Science & Technology. ...
doi:10.5391/ijfis.2017.17.1.35
fatcat:jpwhytonajeqrnmwf5kbxcixaq