Towards efficient and scalable data mining using Spark

Jie Deng, Zhiguo Qu, G.-M. Muntean, Yongxu Zhu, Xiaojun Wang
2014 2014 International Conference on Information and Communications Technologies (ICT 2014)   unpublished
Driven by the need to discover valuable information in rapidly growing data, data mining technologies have attracted attention over the last decade. However, the big data era places even higher demands on data mining technologies in terms of both processing speed and data volume. A data mining algorithm on its own can hardly meet these requirements for effective processing of big data, so distributed systems are proposed. In this paper, a novel method of integrating a sequential pattern mining algorithm with the fast large-scale data processing engine Spark is proposed to mine patterns in big data. We use the well-known PrefixSpan algorithm as an example to demonstrate how this method helps handle massive data rapidly and conveniently. The experiments show that this method can make full use of cluster computing resources to accelerate the mining process, with better performance than the common platform Hadoop.
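For readers unfamiliar with PrefixSpan, the following is a minimal single-machine sketch of its core idea (grow a frequent prefix, then recurse on the projected database of suffixes). It is not the authors' implementation and does not use Spark; it assumes each sequence is a list of single items rather than itemsets, and the names `prefixspan`, `mine`, and `min_support` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern_tuple: support} for sequences of single items.

    Simplified sketch: itemset elements and any distributed execution
    are deliberately left out.
    """
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence_index, start_offset) pairs, i.e. the
        # suffixes that remain after matching the current prefix.
        counts = defaultdict(int)
        for sid, start in projected:
            # Count each item at most once per sequence.
            for item in set(sequences[sid][start:]):
                counts[item] += 1

        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support

            # Project each suffix past the first occurrence of the item.
            new_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue
                new_projected.append((sid, pos + 1))
            mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
patterns = prefixspan(db, min_support=2)
```

Each top-level branch (one per frequent item) works on its own projected database, which is one reason the algorithm is often considered a reasonable candidate for partitioned execution on a cluster; how the paper actually maps this onto Spark is not stated in the abstract.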
doi:10.1049/cp.2014.0616 fatcat:cds6qfkn4fhkpcuj5m2m5gg3h4