A Survey of Load Balancing Techniques for Data Intensive Computing [chapter]

Zhiquan Sui, Shrideep Pallickara
2011 Handbook of Data Intensive Computing  
Data volumes have increased substantially over the past several years. Such data is often processed concurrently on a distributed collection of machines to ensure reasonable completion times. Load balancing is one of the most important issues in data intensive computing. Often, the choice of load balancing strategy has implications not just for execution times, but also for energy usage, network overhead, and costs. Applications that must process large data volumes have a choice of relying on increasingly popular frameworks (often cloud-based) or designing algorithms suited to their application domain. Here, we cover both. Our focus is a survey of the frameworks, APIs, and schemes used to load balance the processing of voluminous data on a collection of machines, in settings such as analytics (MapReduce), stream-based processing, and discrete event simulations. In Sect. 2 we discuss several popular data intensive computing frameworks. APIs available for the development of cloud-scale applications are discussed in Sect. 3. In Sect. 4, we describe both static and dynamic load balancing schemes and how the latter are used in different settings. Section 5 outlines our conclusions.

Data Intensive Computing Frameworks

Google MapReduce Framework

MapReduce [1] is a framework introduced by Google that is well suited for concurrent processing of large datasets (usually more than 1 TB) on a collection
doi:10.1007/978-1-4614-1415-5_6 fatcat:aj7uycimt5f65aeku64lbc7chm