Improving Load Balance for Data-Intensive Computing on Cloud Platforms

Wei Dai, Ibrahim Ibrahim, Mostafa Bassiouni
2016 IEEE International Conference on Smart Cloud (SmartCloud)  
Big Data, such as terabyte- and petabyte-scale datasets, is rapidly becoming the new norm for organizations across a wide range of industries. The widespread need for data-intensive computing has inspired innovations in parallel and distributed computing, which has been an effective way to tackle massive computing workloads for decades. One significant example is MapReduce, which is both a programming model for expressing distributed computations on huge datasets and an execution framework for data-intensive computing on commodity clusters. Since it was originally proposed by Google, MapReduce has become the most popular technology for data-intensive computing. While Google owns its proprietary implementation of MapReduce, an open source implementation called Hadoop has gained wide adoption in the rest of the world. The combination of Hadoop and Cloud platforms has made data-intensive computing much more accessible and affordable than ever before. This dissertation addresses the performance of data-intensive computing on Cloud platforms from three different aspects: task assignment, replica placement, and straggler identification. Both task assignment and replica placement are closely related to load balancing, which is one of the key issues that can significantly affect the performance of parallel and distributed applications. Task assignment schemes strive to balance the data processing load among cluster nodes to minimize job completion time, while replica placement policies aim to assign block replicas to cluster nodes according to their processing capabilities so as to exploit data locality to the maximum extent. Straggler identification is also a crucial issue for data-intensive computing, as the overall performance of parallel and distributed applications is often determined by the node with the lowest performance. The results of extensive evaluation tests confirm that the schemes and policies proposed in this dissertation can improve the performance of data-intensive applications running on Cloud platforms.
doi:10.1109/smartcloud.2016.44 dblp:conf/smartcloud/DaiIB16 fatcat:x7vb3ku3srdcfj6xplajx4ugie