Distributed, MapReduce-Based Nearest Neighbor and ε-Ball Kernel k-Means

Nikolaos Tsapanos, Anastasios Tefas, Nikos Nikolaidis, Ioannis Pitas
2015 IEEE Symposium Series on Computational Intelligence
Data clustering is an unsupervised learning task that has found applications in many scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is k-Means. It is very popular, but it cannot handle cases in which the clusters are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional space, thus overcoming the limitation of classic k-Means with respect to nonlinearly separable input data. Kernel k-Means typically computes the kernel matrix, which contains the value of the kernel function for every pair of samples. This matrix can be viewed as the weight matrix of a complete graph, where the samples are the vertices and each edge is weighted by the similarity, as measured by the kernel function, between the two samples it connects. In this context, it is possible to work on the Nearest Neighbor graph, where each sample is connected only to some of its closest samples, or to use only information from samples that are sufficiently close to each other, referred to as the ε-ball. Doing so reduces the size of the kernel matrix and can provide improved clustering results. In this paper, we present a MapReduce-based distributed implementation of Nearest Neighbor and ε-ball Kernel k-Means.
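
The two sparsification ideas described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' MapReduce implementation. The Gaussian (RBF) kernel, the symmetrization rule for the Nearest Neighbor graph, the zero-fill of dropped entries, and all function names (`rbf`, `knn_sparsify`, `eps_ball_sparsify`, `kernel_kmeans`) are assumptions made for illustration only. For brevity the matrices are kept as dense lists of lists; a real implementation would presumably store only the nonzero entries.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points (an assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_matrix(points, gamma=1.0):
    """Full n x n kernel matrix: one entry per pair of samples."""
    return [[rbf(p, q, gamma) for q in points] for p in points]


def knn_sparsify(K, k):
    """Keep, per row, the diagonal and the k largest off-diagonal entries.

    Assumed symmetrization: an entry survives if either endpoint
    selected the other as one of its k nearest neighbors.
    """
    n = len(K)
    keep = [[i == j for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: -K[i][j])
        for j in others[:k]:
            keep[i][j] = keep[j][i] = True
    return [[K[i][j] if keep[i][j] else 0.0 for j in range(n)] for i in range(n)]


def eps_ball_sparsify(K, points, eps):
    """Keep only entries whose samples lie within Euclidean distance eps."""
    n = len(K)
    return [[K[i][j] if math.dist(points[i], points[j]) <= eps else 0.0
             for j in range(n)] for i in range(n)]


def kernel_kmeans(K, labels, n_clusters, max_iter=50):
    """Kernel k-means on a (possibly sparsified) kernel matrix.

    Squared feature-space distance of sample i to the mean of cluster c:
        K[i][i] - 2/|c| * sum_j K[i][j] + 1/|c|^2 * sum_{j,l} K[j][l]
    with j and l ranging over the members of c.
    """
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c]
                   for c in range(n_clusters)]
        within = [sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
                  for m in members]
        new_labels = []
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # skip empty clusters
                d = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + within[c]
                if d < best_d:
                    best, best_d = c, d
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels
```

As a usage example under the same assumptions, one might build `K = kernel_matrix(points, gamma=0.5)`, sparsify it with `eps_ball_sparsify(K, points, eps=2.0)` or `knn_sparsify(K, k=1)`, and pass the result to `kernel_kmeans` together with an initial label assignment. Like classic k-Means, the result depends on that initialization.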
doi:10.1109/ssci.2015.81 dblp:conf/ssci/TsapanosTNP15 fatcat:djn6hiv2xnazjfngdrji26g7mm