COOLCAT

Daniel Barbará, Yi Li, Julia Couto
2002 Proceedings of the eleventh international conference on Information and knowledge management - CIKM '02  
In this paper we explore the connection between clustering categorical data and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes, as well as data streams. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable for different sample sizes and parameter settings. Also, the clustering criterion is a very intuitive one, since it is deeply rooted in the well-known notion of entropy. Most importantly, COOLCAT is well equipped to deal with clustering of data streams (continuously arriving streams of data points), since it is an incremental algorithm capable of clustering new points without having to look at every point that has been clustered so far. We demonstrate the efficiency and scalability of COOLCAT by a series of experiments on real and synthetic data sets.
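The entropy criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not spell out the exact formulas, so the following are assumptions made for illustration: attributes are treated as independent (a cluster's entropy is the sum of its per-attribute Shannon entropies), the quantity to minimize is the size-weighted average entropy over all clusters, and the names `cluster_entropy`, `expected_entropy`, and `assign` are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_entropy(records):
    """Sum of per-attribute Shannon entropies (in bits) of a list of tuples.

    Assumption: attributes are independent, so entropies simply add up.
    """
    if not records:
        return 0.0
    n = len(records)
    total = 0.0
    for column in zip(*records):  # one column per categorical attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def expected_entropy(clusters):
    """Size-weighted average entropy over all clusters (assumed criterion)."""
    n = sum(len(c) for c in clusters)
    if n == 0:
        return 0.0
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


def assign(clusters, point):
    """Add point to the cluster giving the lowest expected entropy.

    Returns the index of the chosen cluster.
    """
    best_index, best_score = 0, float("inf")
    for i in range(len(clusters)):
        clusters[i].append(point)          # tentatively place the point
        score = expected_entropy(clusters)
        clusters[i].pop()                  # undo the tentative placement
        if score < best_score:
            best_index, best_score = i, score
    clusters[best_index].append(point)
    return best_index


clusters = [
    [("red", "small"), ("red", "small")],
    [("blue", "large"), ("blue", "large")],
]
chosen = assign(clusters, ("red", "small"))
print(chosen, expected_entropy(clusters))
```

In this toy example the new record joins the first cluster, which keeps both clusters perfectly homogeneous. For brevity the sketch recomputes entropies from the stored records; a truly incremental version, as the abstract describes, would presumably keep only per-cluster value counts and update them as each point arrives.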
doi:10.1145/584792.584888 dblp:conf/cikm/BarbaraLC02 fatcat:bttsjzl4tna7hpfw64iyv23w6y