Scalable model-based clustering for large databases based on data summarization
IEEE Transactions on Pattern Analysis and Machine Intelligence
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources, such as memory and computation time. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. This EM algorithm is based on a pseudo mixture model that is defined on the summary statistics according to the aggregate behavior of these sub-clusters of data items under an original mixture model. Thus, using far fewer computational resources, a clustering algorithm based on the framework can achieve clustering accuracy similar to that of the original mixture model. Taking the Gaussian mixture model to exemplify the framework, we establish a pseudo mixture model and develop a provably convergent EM algorithm, EMADS, which explicitly embodies the cardinality, mean, and covariance information of each sub-cluster. Combined with BIRCH's and grid-based data summarization procedures, EMADS is used to construct two model-based clustering algorithms, bEMADS and gEMADS, respectively. A series of experiments is conducted on both real-life and synthetic data sets. The comparison results substantiate that bEMADS can run one or two orders of magnitude faster than the traditional EM algorithm for the Gaussian mixture model with little or no loss of clustering accuracy. Furthermore, bEMADS normally generates significantly more accurate clustering results than other model-based clustering algorithms using similar computational resources. Experiments on gEMADS also indicate that EMADS is not sensitive to data summarization procedures.
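The two-phase idea can be illustrated with a minimal Python sketch. It is not the paper's EMADS algorithm or its pseudo mixture model, whose details the abstract does not give. It assumes a simple equal-width grid for phase 1 and, for phase 2, a simplified EM that weights each sub-cluster by its cardinality, evaluates responsibilities at the sub-cluster mean, and adds the within-sub-cluster covariance back in the M-step. The names `grid_summarize` and `em_on_summaries`, the `bins` and `reg` parameters, and the farthest-point initialization are all hypothetical choices made for illustration.

```python
import numpy as np

def grid_summarize(X, bins=8):
    """Phase 1 (assumed grid scheme): one sub-cluster per non-empty grid cell."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    _, labels = np.unique(cells, axis=0, return_inverse=True)
    labels = labels.ravel()
    counts, means, covs = [], [], []
    for m in range(labels.max() + 1):
        pts = X[labels == m]
        mean = pts.mean(axis=0)
        diff = pts - mean
        counts.append(len(pts))               # cardinality
        means.append(mean)                    # mean
        covs.append(diff.T @ diff / len(pts)) # within-cell covariance
    return np.array(counts, dtype=float), np.array(means), np.array(covs)

def em_on_summaries(n, mu, S, k, iters=50, reg=1e-6):
    """Phase 2 (simplified, not EMADS): Gaussian-mixture EM that only sees
    the summary statistics (n, mu, S) and never the raw data items."""
    M, d = mu.shape
    # Farthest-point initialization over sub-cluster means (an assumption).
    centers = [mu[np.argmax(n)]]
    for _ in range(1, k):
        d2 = np.min([((mu - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(mu[np.argmax(d2)])
    centers = np.array(centers)
    # Start every component from the global covariance rebuilt from summaries.
    gmean = (n[:, None] * mu).sum(axis=0) / n.sum()
    second = (n[:, None, None] * (S + mu[:, :, None] * mu[:, None, :])).sum(axis=0)
    cov0 = second / n.sum() - np.outer(gmean, gmean) + reg * np.eye(d)
    covs = np.repeat(cov0[None], k, axis=0)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities evaluated at each sub-cluster mean.
        logr = np.empty((M, k))
        for j in range(k):
            diff = mu - centers[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum("md,de,me->m", diff, inv, diff)
            logr[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: cardinality-weighted updates; S restores within-cell spread.
        w = r * n[:, None]
        Nk = w.sum(axis=0) + 1e-12
        weights = Nk / Nk.sum()
        centers = (w.T @ mu) / Nk[:, None]
        for j in range(k):
            diff = mu - centers[j]
            outer = diff[:, :, None] * diff[:, None, :]
            covs[j] = (w[:, j, None, None] * (S + outer)).sum(axis=0) / Nk[j] + reg * np.eye(d)
    return weights, centers, covs

# Usage on synthetic data: two well-separated 2-D Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (2000, 2)), rng.normal(10.0, 1.0, (2000, 2))])
n, mu, S = grid_summarize(X, bins=8)
weights, centers, covs = em_on_summaries(n, mu, S, k=2)
print(len(X), "points summarized into", len(n), "sub-clusters")
print(centers)
```

In this sketch, each EM iteration costs time proportional to the number of sub-clusters rather than the number of data items, which is the source of the savings the framework targets. Evaluating responsibilities only at sub-cluster means is a cruder approximation than a model defined on the full summary statistics.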