NC-link: A New Linkage Method for Efficient Hierarchical Clustering of Large-Scale Data
In various disciplines, hierarchical clustering (HC) has been an effective tool for data analysis due to its ability to summarize hierarchical structures of data in an intuitive and interpretable manner. A run of HC requires multiple iterations, each of which needs to compute and update the pairwise distances between all intermediate clusters. This makes the exact algorithm for HC inevitably suffer from quadratic time and space complexities. To address large-scale data, various
... lel algorithms have been proposed to reduce the computational cost of HC. However, such algorithms still rely on conventional linkage methods (such as single, centroid, average, complete, or Ward's) for defining pairwise distances, mostly focusing on the approximation/parallelization of linkage computations. Given that the choice of linkage profoundly affects not only the quality but also the efficiency of HC, we propose a new linkage method named NC-link and design an exact algorithm for NC-link-based HC. To guarantee the exactness, the proposed algorithm maintains the quadratic nature in time complexity but exhibits only linear space complexity, thereby allowing us to address million-object data on a personal computer. To underpin the extensibility of our approach, we show that the algorithmic nature of NC-link enables single instruction multiple data (SIMD) parallelization and subquadratic-time approximation of HC. To verify our proposal, we thoroughly tested it with a number of large-scale real and synthetic data sets. In terms of efficiency, NC-link allowed us to perform HC substantially more space efficiently or faster than conventional methods: compared with average and complete linkages, using NC-link incurred only 0.7%-1.75% of the memory usage, and the NC-linkbased implementation delivered speedups of approximately 3.5 times over the centroid and Ward's linkages. With regard to clustering quality, the proposed method was able to retrieve hierarchical structures from input data as faithfully as in the popular average and centroid linkage methods. We anticipate that the existing approximation/parallel algorithms will be able to benefit from adopting NC-link as their linkage method for obtaining better clustering results and reduced time and space demands. INDEX TERMS Clustering algorithm, data mining, machine learning. 5594 2169-3536 2017 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. VOLUME 5, 2017 2 Where no confusion arises, we also refer to the proposed NC-link-based HC algorithm as NC-link for convenience.