Feature Selection for Clustering [chapter]

Susan Dumais, Magdalena Balazinska, Jeong-Hyon Hwang, Mehul A. Shah, Raimondo Schettini, Gianluigi Ciocca, Isabella Gagliardi, Manoranjan Dash, Poon Wei Koot, Benjamin Bustos, Tobias Schreck, Vassilis Plachouras (+37 others)
2009 Encyclopedia of Database Systems  
Clustering is an important data mining task. Data mining often concerns large and high-dimensional data, but unfortunately most of the clustering algorithms in the literature are sensitive to data size, high dimensionality, or both. Different features affect clusters differently: some are important for clusters while others may hinder the clustering task. An efficient way of handling this is to select a subset of important features. It helps in finding clusters efficiently, understanding the data better, and reducing data size for efficient storage, collection and processing. The task of finding original important features for unsupervised data is largely untouched. Traditional feature selection algorithms work only for supervised data, where class information is available. For unsupervised data, without class information, principal components (PCs) are often used, but PCs still require all features and they may be difficult to understand. Our approach: first, features are ranked according to their importance for clustering, and then a subset of important features is selected. For large data we use a scalable method based on sampling. Empirical evaluation shows the effectiveness and scalability of our approach on benchmark and synthetic data sets.
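The rank-then-select idea with sampling can be sketched roughly as follows. The abstract does not state which importance score is used, so this sketch assumes one: a feature is scored by the entropy of pairwise similarities measured on the remaining features, on the reasoning that removing a feature that carries cluster structure leaves the data more disordered. The function names (`pairwise_entropy`, `rank_features`, `select_features`), the parameters `k` and `sample_size`, and the similarity formula are all illustrative assumptions, not the authors' stated method. Features are assumed to be numeric and sensibly scaled.

```python
import math
import random


def pairwise_entropy(rows, features):
    """Mean entropy of pairwise similarities, using only the given feature indices.

    Assumed score: similarity s = exp(-alpha * distance), with alpha chosen so
    that s = 0.5 at the mean distance. Pairs that are clearly close or clearly
    far contribute little entropy; ambiguous pairs contribute a lot.
    """
    dists = [
        math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))
        for i, a in enumerate(rows)
        for b in rows[i + 1:]
    ]
    mean_d = sum(dists) / len(dists)
    if mean_d == 0:
        return 0.0
    alpha = math.log(2) / mean_d
    total = 0.0
    for d in dists:
        s = math.exp(-alpha * d)
        s = min(max(s, 1e-12), 1 - 1e-12)  # keep both logarithms finite
        total -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return total / len(dists)


def rank_features(rows, sample_size=None, rng=None):
    """Return feature indices ordered from most to least important.

    A feature's score is the entropy of the data with that feature removed,
    so a higher score means the feature mattered more. If sample_size is
    given, the ranking is computed on a random sample of rows, which keeps
    the quadratic pairwise step affordable on large data.
    """
    if sample_size is not None and sample_size < len(rows):
        rng = rng or random.Random(0)
        rows = rng.sample(rows, sample_size)
    n_features = len(rows[0])
    scores = {}
    for f in range(n_features):
        rest = [g for g in range(n_features) if g != f]
        scores[f] = pairwise_entropy(rows, rest)
    return sorted(scores, key=scores.get, reverse=True)


def select_features(rows, k, sample_size=None, rng=None):
    """Keep the k top-ranked features (k is a hypothetical, user-chosen cutoff)."""
    return rank_features(rows, sample_size, rng)[:k]
```

On toy data where one feature separates two groups and the others are low-amplitude noise, `select_features(rows, k=1)` would be expected to return that separating feature. How to choose `k` is left open here, since the abstract does not describe the selection criterion.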
doi:10.1007/978-0-387-39940-9_613 fatcat:22mzosnsdrft5kdutbbdrbpadi