Pervasive parallelism in data mining

Srivatsava Daruru, Nena M. Marin, Matt Walker, Joydeep Ghosh
2009 Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09  
All Netflix Prize algorithms proposed so far are prohibitively costly for large-scale production systems. In this paper, we describe an efficient dataflow implementation of a collaborative filtering (CF) solution to the Netflix Prize problem [1] based on weighted co-clustering [5] . The dataflow library we use facilitates the development of sophisticated parallel programs designed to fully utilize commodity multicore hardware, while hiding traditional difficulties such as queuing, threading,
more » ... ory management, and deadlocks. The dataflow CF implementation first compresses the large, sparse training dataset into co-clusters. Then it generates recommendations by combining the average ratings of the co-clusters with the biases of the users and movies. When configured to identify 20x20 co-clusters in the Netflix training dataset, the implementation predicted over 100 million ratings in 16.31 minutes and achieved an RMSE of 0.88846 without any fine-tuning or domain knowledge. This is an effective real-time prediction runtime of 9.7 µs per rating which is far superior to previously reported results. Moreover, the implemented co-clustering framework supports a wide variety of other large-scale data mining applications and forms the basis for predictive modeling on large, dyadic datasets [4, 7] .
doi:10.1145/1557019.1557140 dblp:conf/kdd/DaruruMWG09 fatcat:tnloaxmvozf7ddauyvyy4f7t3i