A Parallel GPU-Based Approach to Clustering Very Fast Data Streams

Pengtao Huang, Xiu Li, Bo Yuan
2015 Proceedings of the 24th ACM International on Conference on Information and Knowledge Management - CIKM '15  
Clustering data streams has become a hot topic in the era of big data. Driven by the ever-increasing volume, velocity and variety of data, more efficient algorithms for clustering large-scale complex data streams are needed. In this paper, we present a parallel algorithm called PaStream, which is based on advanced Graphics Processing Units (GPUs) and follows the online-offline framework of CluStream. Our approach can achieve speedups of hundreds of times over CluStream on high-speed and high-dimensional data. It can also discover clusters with arbitrary shapes and handle outliers properly. The efficiency and scalability of PaStream are demonstrated through comprehensive experiments on synthetic and standard benchmark datasets with various problem factors.
doi:10.1145/2806416.2806545 dblp:conf/cikm/HuangLY15 fatcat:mrcsqqlwnvdnjpy3ax6fgoc7la