K-means clustering via principal component analysis

Chris Ding, Xiaofeng He
2004 Twenty-first international conference on Machine learning - ICML '04  
Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for K-means objective function are derived, which is the total variance minus the eigenvalues of the data covariance matrix. These
more » ... ults indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insights to the observed effectiveness of PCA-based data reductions, beyond the conventional noisereduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.
doi:10.1145/1015330.1015408 dblp:conf/icml/DingH04a fatcat:smptjr22srgfhlhfjijejknku4