Frequent term-based text clustering

Florian Beil, Martin Ester, Xiaowei Xu
2002 Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02  
Text clustering methods can be used to structure large sets of text or hypertext documents. The well-known methods of text clustering, however, do not really address the special problems of text clustering: very high dimensionality of the data, very large size of the databases and understandability of the cluster description. In this paper, we introduce a novel approach which uses frequent item (term) sets for text clustering. Such frequent sets can be efficiently discovered using algorithms for association rule mining. To cluster based on frequent term sets, we measure the mutual overlap of frequent sets with respect to the sets of supporting documents. We present two algorithms for frequent term-based text clustering, FTC which creates flat clusterings and HFTC for hierarchical clustering. An experimental evaluation on classical text documents as well as on web documents demonstrates that the proposed algorithms obtain clusterings of comparable quality significantly more efficiently than state-of-the-art text clustering algorithms. Furthermore, our methods provide an understandable description of the discovered clusters by their frequent term sets.
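The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' FTC implementation: the abstract does not define the overlap measure, so the one below (the average number of other candidate sets that also cover a document) is an assumption made for illustration. Brute-force enumeration stands in for an association rule mining algorithm, and the names `frequent_term_sets` and `greedy_flat_clustering` are hypothetical.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to its cover (indices of supporting documents).

    Brute-force enumeration for illustration only; a real system would use
    an association rule mining algorithm instead.
    """
    vocabulary = sorted(set().union(*docs))
    result = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            cover = frozenset(i for i, d in enumerate(docs) if set(terms) <= d)
            # min_support is a fraction of the collection size.
            if len(cover) >= min_support * len(docs):
                result[frozenset(terms)] = cover
    return result


def greedy_flat_clustering(docs, min_support, max_size=2):
    """Greedily pick the candidate whose cover overlaps least with the others.

    The overlap measure is an assumed simplification, not the paper's definition.
    Returns a list of (term set, document indices) pairs.
    """
    candidates = frequent_term_sets(docs, min_support, max_size)
    clusters = []
    while candidates:
        def overlap(item):
            terms, cover = item
            shared = sum(len(cover & other)
                         for t, other in candidates.items() if t != terms)
            return shared / len(cover)

        # Ties are broken alphabetically so the result is deterministic.
        best_terms, best_cover = min(
            candidates.items(),
            key=lambda item: (overlap(item), sorted(item[0])),
        )
        clusters.append((set(best_terms), set(best_cover)))
        # Drop the chosen set and remove its documents from the remaining
        # covers, so that the resulting clusters do not overlap.
        candidates = {
            t: c - best_cover
            for t, c in candidates.items()
            if t != best_terms and c - best_cover
        }
    return clusters


if __name__ == "__main__":
    docs = [
        {"sun", "beach"},
        {"sun", "beach", "surf"},
        {"snow", "ski"},
        {"snow", "ski", "ice"},
    ]
    for terms, members in greedy_flat_clustering(docs, min_support=0.5):
        print(sorted(terms), sorted(members))
```

Each cluster is returned together with the term set that selected it, which mirrors the abstract's point that frequent term sets double as an understandable cluster description. Documents that support no frequent term set remain unassigned in this sketch.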
doi:10.1145/775047.775110 dblp:conf/kdd/BeilEX02 fatcat:cnkrq6xvwvao5ncwqqksgh6mse