A matrix density based algorithm to hierarchically co-cluster documents and words

Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru
2003 Proceedings of the twelfth international conference on World Wide Web - WWW '03  
This paper proposes an algorithm to hierarchically cluster documents. Each cluster is actually a cluster of documents and an associated cluster of words, thus a document-word co-cluster. Note that, the vector model for documents creates the document-word matrix, of which every co-cluster is a submatrix. One would intuitively expect a submatrix made up of high values to be a good document cluster, with the corresponding word cluster containing its most distinctive features. Our algorithm looks
more » ... r algorithm looks to exploit this. We have defined matrix density, and our algorithm basically uses matrix density considerations in its working. The algorithm is a partitional-agglomerative algorithm. The partitioning step involves the identification of dense submatrices so that the respective row sets partition the row set of the complete matrix. The hierarchical agglomerative step involves merging the most "similar" submatrices until we are down to the required number of clusters (if we want a flat clustering) or until we have just the single complete matrix left (if we are interested in a hierarchical arrangement of documents). It also generates apt labels for each cluster or hierarchy node. The similarity measure between clusters used for merging is based on the fact that the clusters here are co-clusters, and is a key point of difference from existing agglomerative algorithms. We will refer to the proposed algorithm as RPSA (Rowset Partitioning and Submatrix Agglomeration). We have compared it as a clustering algorithm with Spherical K-Means and Spectral Graph Partitioning. We have also evaluated some hierarchies generated by the algorithm.
doi:10.1145/775152.775225 dblp:conf/www/MandhaniJK03 fatcat:2i7hystiibejnimqbswed6ukrq