Document Clustering with K-tree [chapter]

Christopher M. De Vries, Shlomo Geva
2009 Lecture Notes in Computer Science  
We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large scale problems exist in document clustering.  ...  K-tree scales well with large inputs due to its low complexity. It offers promising results both in terms of efficiency and quality.  ...  K-tree consistently found higher purity clusters than other submissions. Even with many small high purity clusters, K-tree achieved a high micro purity score.  ... 
doi:10.1007/978-3-642-03761-0_43 fatcat:ajzsw6lsyneljnhr7rksa3oxcq

K-tree

Christopher M. De Vries, Shlomo Geva
2009 Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval - SIGIR '09  
We introduce K-tree in an information retrieval context. It is an efficient approximation of the k-means clustering algorithm. Unlike k-means it forms a hierarchy of clusters.  ...  The K-tree has a low time complexity that is suitable for large document collections.  ...  MEDOID K-TREE: We propose an extension to K-tree where all cluster centres are document exemplars. This is inspired by the k-medoids algorithm [5].  ... 
doi:10.1145/1571941.1572094 dblp:conf/sigir/VriesG09 fatcat:ztrmvtlabjdjzoe6aqizk3mnve

Random Indexing K-tree [article]

Christopher M. De Vries and Lance De Vine and Shlomo Geva
2010 arXiv   pre-print
The results indicate that RI K-tree improves document cluster quality over the original K-tree algorithm.  ...  Random Indexing (RI) K-tree is the combination of two algorithms for clustering. Many large scale problems exist in document clustering.  ...  K-tree and Document Clustering The K-tree algorithm is well suited to clustering large document collections due to its low time complexity.  ... 
arXiv:1001.0833v2 fatcat:hyarhomkhnbsrlfdxfahu7qreq

Parallel Streaming Signature EM-tree

Christopher Michael De Vries, Lance De Vine, Shlomo Geva, Richi Nayak
2015 Proceedings of the 24th International Conference on World Wide Web - WWW '15  
The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than the existing algorithms.  ...  Fine grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size.  ...  We observed this with EM-tree in prior work [12]. The EM-tree can be seeded with k-means||.  ... 
doi:10.1145/2736277.2741111 dblp:conf/www/VriesVGN15 fatcat:htqbtmzvebbaxfng5benjvjpse

Clustering with Random Indexing K-tree and XML Structure [chapter]

Christopher M. De Vries, Shlomo Geva, Lance De Vine
2010 Lecture Notes in Computer Science  
The RI K-tree is a scalable approach to clustering large document collections. This approach has produced quality clustering when evaluated using two different methodologies.  ...  The Random Indexing (RI) K-tree has been used with a representation that is based on the semantic markup available in the INEX 2009 Wikipedia collection.  ...  The RI projection produces dense document vectors that work well with the K-tree algorithm. Cluster quality has been measured with two metrics this year.  ... 
doi:10.1007/978-3-642-14556-8_40 fatcat:t3kosfq3jreuhnjcnggffevsoa

Unsupervised style classification of document page images

Song Mao, Lan Nie, G.R. Thoma
2005 IEEE International Conference on Image Processing 2005  
Finally, the K-medoids algorithm is used to find an optimal grouping of the trees into K clusters, each of which corresponds to a distinct document style.  ...  We evaluate our algorithm on test datasets with different cluster sizes and degrees of style similarity.  ... 
doi:10.1109/icip.2005.1530104 dblp:conf/icip/MaoNT05 fatcat:dmhiqnadizgini2dwrrecg5v2q

An Analytical Assessment on Document Clustering

Pushplata, Ram Chatterjee
2012 International Journal of Computer Network and Information Security  
Clustering is related to data mining for information retrieval. Relevant information is retrieved quickly while doing the clustering of documents.  ...  Index Terms: Data mining, Document clustering, Suffix Tree Clustering (STC) steps, K-means, Agglomerative Hierarchical Clustering (AHC), cosine similarity. I. INTRODUCTION: Clustering is raised from the  ...  SUFFIX TREE CLUSTERING ALGORITHM: Suffix Tree Clustering [1, 12] uses the concept of document clustering for clustering the documents.  ... 
doi:10.5815/ijcnis.2012.05.08 fatcat:4nt2ryrsbjetzfpbzccloqwnmq

Hierarchical Document Clustering Using Frequent Itemsets [chapter]

Benjamin C.M. Fung, Ke Wang, Martin Ester
2003 Proceedings of the 2003 SIAM International Conference on Data Mining  
Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced.  ...  The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster.  ...  Acknowledgment The initial phase of this work benefited considerably from extensive discussions with Leo Chen and Linda Wu.  ... 
doi:10.1137/1.9781611972733.6 dblp:conf/sdm/FungWE03 fatcat:cb6xz4azhba7nfvzvo2poohuvm

Clustering XML Documents by Structure [chapter]

Theodore Dalamagas, Tao Cheng, Klaas-Jan Winkel, Timos Sellis
2004 Lecture Notes in Computer Science  
Modeling the XML documents as rooted ordered labeled trees, we explore the application of clustering algorithms using distances that estimate the similarity between those trees in terms of the hierarchical  ...  This paper presents a framework for clustering XML documents by structure.  ...  Clustering XML documents We deal with the problem of clustering XML documents using 1. structural summaries of their representative rooted ordered labeled trees, 2. tree edit distances between these summaries  ... 
doi:10.1007/978-3-540-24674-9_13 fatcat:6teg5mmjajcjbizz3llq6li2ky

A methodology for clustering XML documents by structure

Theodore Dalamagas, Tao Cheng, Klaas-Jan Winkel, Timos Sellis
2006 Information Systems  
Modeling the XML documents as rooted ordered labeled trees, we explore the application of clustering algorithms using distances that estimate the similarity between those trees in terms of the hierarchical  ...  This paper presents a framework for clustering XML documents by structure.  ...  Clustering XML documents We deal with the problem of clustering XML documents using 1. structural summaries of their representative rooted ordered labeled trees, 2. tree edit distances between these summaries  ... 
doi:10.1016/j.is.2004.11.009 fatcat:pxvevu7vafevtm4f5oich2cvim

Clustering XML Documents Using Structural Summaries [chapter]

Theodore Dalamagas, Tao Cheng, Klaas-Jan Winkel, Timos Sellis
2004 Lecture Notes in Computer Science  
Modeling XML documents with tree-like structures, we face the 'clustering XML documents by structure' problem as a 'tree clustering' problem, exploiting distances that estimate the similarity between those  ...  This work presents a methodology for grouping structurally similar XML documents using clustering algorithms.  ...  , D_{c_k} for every cluster C_1, C_2, ..., C_k, using the XML documents assigned to that cluster.  ... 
doi:10.1007/978-3-540-30192-9_54 fatcat:5y4s7zxbnva4pjupwcxjord7ji

Efficient retrieval of the top-k most relevant spatial web objects

Gao Cong, Christian S. Jensen, Dingming Wu
2009 Proceedings of the VLDB Endowment  
Web documents are being geo-tagged, and geo-referenced objects such as points of interest are being associated with descriptive text documents.  ...  This paper proposes a new indexing framework for location-aware top-k text retrieval. The framework leverages the inverted file for text retrieval and the R-tree for spatial proximity querying.  ...  The authors thank Xin Cao for pre-processing the text documents used in the experiments.  ... 
doi:10.14778/1687627.1687666 fatcat:gxltjvze55cbrggijk7idz252y

Topological Tree for Web Organisation, Discovery and Exploration [chapter]

Richard T. Freeman, Hujun Yin
2004 Lecture Notes in Computer Science  
Each chain fully adapts to a specific topic, where its number of subtopics is determined using entropy-based validation and cluster tendency schemes.  ...  The tree is generated using an algorithm called Automated Topological Tree Organiser, which uses a set of hierarchically organised self-organising growing chains.  ...  Fig. 2. ATTO Topological Tree (left) and bisecting k-means binary tree (right).  ... 
doi:10.1007/978-3-540-28651-6_70 fatcat:l74vfotkffgkxhi3etfzbwb2ye

XML Data Integration Based on Content and Structure Similarity Using Keys [chapter]

Waraporn Viyanon, Sanjay K. Madria, Sourav S. Bhowmick
2008 Lecture Notes in Computer Science  
Second, we measure the similarity degree based on data and structures of the two XML documents.  ...  This paper proposes a technique for approximately matching XML data based on the content and structure by detecting the similarity of subtrees clustered semantically using leaf-node parents.  ...  SLAX divides XML documents into smaller portions by parsing XML documents into K document trees.  ... 
doi:10.1007/978-3-540-88871-0_35 fatcat:evmlsnddqbfarkphvdvppke3j4

Multisets and Clustering XML Documents

Swami Iyer, Dan A. Simovici
2007 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007)  
We use operations on multisets of paths of document trees to define certain metrics on multisets.  ...  These metrics are used for clustering real and synthesized XML documents to produce high-quality clusterings.  ...  We model an XML document as a labeled rooted tree and represent the rooted labeled paths (a sequence of nodes of the tree starting with the root of the tree and ending with a leaf node) of the tree as  ... 
doi:10.1109/ictai.2007.18 dblp:conf/ictai/IyerS07 fatcat:2ibrazpm5bcudaxogeh5ffzfxi