Skew-resistant parallel processing of feature-extracting scientific user-defined functions

YongChul Kwon, Magdalena Balazinska, Bill Howe, Jerome Rolia
2010 Proceedings of the 1st ACM Symposium on Cloud Computing - SoCC '10
Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional [...]. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.
doi:10.1145/1807128.1807140 dblp:conf/cloud/KwonBHR10 fatcat:uycukwm6ebdlhffywcv7ypzdjy
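
The abstract's central idea, partitioning guided by a user-defined cost function rather than by input size alone, can be illustrated with a small sketch. This is not SkewReduce's optimizer or API: the names `density_cost` and `partition_by_cost`, the one-dimensional data, the median split, and the greedy "split the most expensive partition" strategy are all assumptions chosen for brevity.

```python
import heapq
from typing import Callable, List, Sequence


def density_cost(points: Sequence[float], radius: float = 1.0) -> float:
    """Hypothetical user-defined cost: one unit per point plus one unit
    per pair of points closer than `radius` (dense regions cost more)."""
    pts = sorted(points)
    pairs, lo = 0, 0
    for j in range(len(pts)):
        # Advance `lo` until pts[lo] is within `radius` of pts[j].
        while pts[j] - pts[lo] > radius:
            lo += 1
        pairs += j - lo
    return float(len(pts) + pairs)


def partition_by_cost(
    points: Sequence[float],
    cost_fn: Callable[[Sequence[float]], float],
    max_parts: int,
) -> List[List[float]]:
    """Greedy sketch: repeatedly split the partition with the highest
    estimated cost at its median until `max_parts` partitions exist."""
    ordered = sorted(points)
    if not ordered:
        return []
    # Max-heap via negated cost; the counter breaks ties between entries.
    heap = [(-cost_fn(ordered), 0, ordered)]
    counter = 1
    while len(heap) < max_parts:
        entry = heapq.heappop(heap)
        part = entry[2]
        if len(part) < 2:
            # The most expensive partition cannot be split further.
            heapq.heappush(heap, entry)
            break
        mid = len(part) // 2
        for half in (part[:mid], part[mid:]):
            heapq.heappush(heap, (-cost_fn(half), counter, half))
            counter += 1
    return sorted((entry[2] for entry in heap), key=lambda p: p[0])


# A dense cluster of 8 points followed by 4 sparse points.
data = [i / 10 for i in range(8)] + [10.0, 20.0, 30.0, 40.0]
parts = partition_by_cost(data, density_cost, max_parts=3)
print([len(p) for p in parts])           # partition sizes
print([density_cost(p) for p in parts])  # estimated cost per partition
```

On this toy input the partitions come out unequal in size (3, 3, and 6 points) but close in estimated cost (6, 6, and 7), which is the kind of balance a cost-aware partitioner aims for when runtime depends on more than input size.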