Turn Waste into Wealth

Shaoxu Song, Chunping Li, Xiaoquan Zhang
2015 Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15  
Dirty data commonly exist. Simply discarding a large number of inaccurate points (as noise) could greatly affect clustering results. We argue that dirty data can be repaired and utilized as strong supports in clustering. To this end, we study a novel problem of clustering and repairing over dirty data at the same time. Referring to the minimum change principle in data repairing, the objective is to find a minimum modification of inaccurate points such that the large amount of dirty data can enhance the clustering. We show that the problem can be formulated as an integer linear programming (ILP) problem. An efficient approximation is then devised by a linear programming (LP) relaxation. In particular, we illustrate that an optimal solution of the LP problem can be directly obtained without calling a solver. A quadratic-time approximation algorithm is developed based on the aforesaid LP solution. We further advance the algorithm to linear time cost, where a trade-off between effectiveness and efficiency is enabled. Empirical results demonstrate that both the clustering and cleaning accuracies can be improved by our approach of repairing and utilizing the dirty data in clustering.
doi:10.1145/2783258.2783317 dblp:conf/kdd/SongLZ15 fatcat:dghyivmv4ndufe7zhmr56jn2ea