BIGhybrid: a simulator for MapReduce applications in hybrid distributed infrastructures validated with the Grid5000 experimental platform
Concurrency and Computation
Cloud computing has increasingly been used as a platform for running large business and data processing applications. Conversely, Desktop Grids have been successfully employed in a wide range of projects, because they are able to take advantage of a large number of resources provided free of charge by volunteers. A hybrid infrastructure created from the combination of Cloud and Desktop Grid infrastructures can provide a low-cost and scalable solution for Big Data analysis. Although frameworks like MapReduce have been designed to exploit commodity hardware, taking advantage of a hybrid infrastructure poses significant challenges because of its large resource heterogeneity and high churn rate. This paper proposes BIGhybrid, a simulator for two existing classes of MapReduce runtime environments: BitDew-MapReduce, designed for Desktop Grids, and BlobSeer-Hadoop, designed for Cloud computing. The goal is to carry out accurate simulations of MapReduce executions in a hybrid infrastructure composed of Cloud computing and Desktop Grid resources. This work describes the principles of the simulator and its validation with the Grid5000 experimental platform. With BIGhybrid, developers can investigate and evaluate new algorithms that enable MapReduce to be executed in hybrid infrastructures, including topics such as resource allocation and data splitting.
... storage, communication, and queue services to customers, for which they pay per resource usage. These resources can be used for deploying Hadoop clusters for data processing and analysis. In addition to Cloud computing, several other types of infrastructure are able to support data-intensive applications. Desktop Grids (DG), for instance, have a large number of users around the world who donate idle computing power to multiple projects. DGs have been applied in several domains such as bio-medicine, weather forecasting, and natural disaster prediction. Merging DG with Cloud Computing (Cloud) into Hybrid Infrastructures could provide a more affordable means of data processing. Nevertheless, although MapReduce (MR) has been designed to exploit the capabilities of commodity hardware, its use in a hybrid infrastructure is a complex task because of the large resource heterogeneity and high churn rate. Such conditions are usual for Desktop Grids but uncommon for Clouds.
In addition, hybrid infrastructures are environments that have geographically distributed resources in heterogeneous platforms such as Cloud, Grids and DG. The adaptation of an existing MR framework or the development of new software for hybrid infrastructures raises a number of research questions: how to create efficient strategies for data splitting and distribution, how to keep communications between the infrastructures to a minimum, and how to deal with failures, sabotage, and data privacy. Moreover, the use of real-world test beds to evaluate MR applications is almost impossible because of the lack of reproducibility in the experimental conditions for DG and the complexity of fine-tuning Cloud software stacks.
BIGhybrid is a toolkit for MR simulation in hybrid environments and was introduced in previous work, with a focus on Cloud and DG. The simulator itself is based on the SimGrid framework. The main purpose of this study is to demonstrate that the BIGhybrid simulator has features that allow it to carry out accurate simulations and that it is able to simulate the execution behavior of two types of middleware for two distinct infrastructures: BitDew-MR [7, 8] for Desktop Grid computing and Hadoop-BlobSeer for Cloud computing. BIGhybrid has several desirable features: a) it is built on top of SimGrid with two different simulators: MapReduce over SimGrid (MRSG), a validated Hadoop simulator, and MapReduce Adapted Algorithms to Heterogeneous Environments (MRA++), a simulator used for heterogeneous environments; b) it has a trace toolkit that supports analysis, monitoring, and graphical plotting of task executions; c) it is a trace-based simulator that is able to process real-world resource availability traces to implement realistic fault-tolerance scenarios.
These traces are available from the Failure Trace Archive (FTA) website, a centralized public repository of resource availability traces for various parallel and distributed systems; and d) its modular design allows for further extension. BIGhybrid can be used for evaluating scheduling strategies for MR applications in hybrid infrastructures. We believe that this kind of tool is of great value to researchers and practitioners who are working on big data applications and scheduling.
For validation purposes, the experiments are executed on Grid5000. Grid5000 is an experimental testbed supported by INRIA, CNRS, RENATER and several universities in France. This study demonstrates that the results of the BIGhybrid simulations are similar to those of real MapReduce experiments, which serves to validate the simulator.
The rest of this work is structured as follows. Section 2 examines related work and provides an overview of the MR framework together with the other systems used. Section 3 analyzes the characteristics of the hybrid MR environment in more detail. Section 4 introduces BIGhybrid, and Subsection 4.5 examines new features such as a volatile module and a communication model. Section 5 presents a more detailed evaluation with new experiments, including a statistical evaluation in Subsection 5.5, to make comparisons with a real-world environment in Grid5000. The conclusion and suggestions for future work are summarized in Section 6.
BACKGROUND AND RELATED WORK
This section presents the main concepts of the MapReduce framework and the other systems that have been used to compose a Big Data ecosystem in hybrid infrastructures. The related work demonstrates ...
BlobSeer is a DFS that manages huge amounts of data as flat sequences of bytes called BLOBs (Binary Large Objects). This data structure format allows fine-grained access control.
A workload imbalance arises in the Hadoop Distributed File System (HDFS) when it receives new data from incremental updates. The existing storage file system has limited throughput under heavy access concurrency. HDFS does not support concurrent writes to the same file, and the data cannot be overwritten or appended to. BlobSeer maintains the most recent version of a particular file in a DHT (Distributed Hash Table) to favor efficient concurrent access to metadata, which enables the incremental updating of database files and a high throughput with concurrent reading, writing and updating of data. This is the main reason for using another file system like BlobSeer. This data structure is completely transparent to Hadoop users. The fault-tolerance mechanism is a simple data replication across the machines, and enables the user to specify the replication level needed. The classical execution of MR on Hadoop is unchanged and exploits data locality in the same way as with HDFS. In view of this, BlobSeer was the best choice for implementing the incremental update features quickly, without having to develop a new MapReduce framework for the Cloud implementation. The incremental update is necessary for data management in a hybrid infrastructure.
BitDew-MapReduce
BitDew is a middleware that exploits protocols such as P2P, HTTP, BitTorrent and FTP. Its architecture is decentralized and has independent services. These services control the behavior of the data system, such as replication, fault-tolerance, data placement, incremental update, lifetime, protocols and event-driven programming facilities. The Data Catalog maintains a centralized and updated meta-data list for the whole system. The model includes both stable and volatile storage. Stable storage is provided by stable machines or Cloud storage services such as Dropbox and Google Drive, and volatile storage consists of the local disks of volatile nodes.
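The distinction between stable and volatile storage can be illustrated with a small replica-placement sketch. This is not the BitDew API: the `Node` type, the `place_replicas` function, and the policy of keeping one copy on stable storage before filling the remaining copies from volatile nodes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    stable: bool  # True: stable machine or Cloud storage; False: volatile desktop node


def place_replicas(nodes, replicas):
    """Pick up to `replicas` distinct nodes (hypothetical policy, not BitDew's).

    One copy goes to stable storage when any is available, the remaining
    copies go to volatile nodes, and extra stable nodes are used only as a
    fallback when there are not enough volatile ones.
    """
    stable = [n for n in nodes if n.stable]
    volatile = [n for n in nodes if not n.stable]
    ordered = stable[:1] + volatile + stable[1:]
    return ordered[:replicas]


# Example: two volatile desktops and one Cloud storage endpoint (made-up names).
nodes = [Node("pc1", False), Node("cloud", True), Node("pc2", False)]
two_copies = [n.name for n in place_replicas(nodes, 2)]
five_copies = [n.name for n in place_replicas(nodes, 5)]
```

Asking for more replicas than there are nodes simply returns every node once, so the sketch never places two copies on the same machine.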
The MR implementation is an API that controls the master and worker daemon programs. This MR API can handle the Map and Reduce functions through BitDew services. Result checking is controlled through a majority voting mechanism. In the Hadoop implementation, when the network becomes unavailable, a heartbeat mechanism signals to the master that the host is dead. In BitDew, by contrast, the network can be temporarily offline without this being treated as a failure. The fault tolerance system needs a synchronization schema, as pointed out in previous work, in which transient and permanent failures can be handled. A barrier-free computation is implemented to mitigate host churn. The computation of Reduce nodes starts as soon as the intermediate results are available. These properties of BitDew-MapReduce, such as data placement, incremental update and the fault-tolerance mechanism, are important for implementing a hybrid infrastructure. In addition, the computing power offered by the DG infrastructure is also of value for providing new infrastructures, starting from the allocation of free resources.
Related work
Big Data applications have several implementations; nevertheless, dispersed data can be found in biological research, where researchers need to investigate different databases, as in protein structure analysis. These applications seek a genetic mapping that requires a pre-existing reference genome to be employed for the read alignment of a gene. The data processing is characterized by its ability to compare input data with different databases. This processing consists of several phases of search-merge-reduce, in which the data receive incremental updates. Another point to consider is that several biological databases are dispersed across different institutions, such as Gene Report, Ensembl and others.
The solutions proposed for the hybrid infrastructure consider this heterogeneous scenario and are based on the scope of the MapReduce ANR project, in the context of biochemical research to produce medicines. Some researchers [21, 22, 23] have put forward Hadoop implementations based on a geo-distributed dataset in multiple data centers. The authors state that, for instance, it is possible to have multiple execution paths for carrying out a MapReduce job in this scenario, and the performance can vary a great deal. Nevertheless, a popular open-source MapReduce implementation such as Hadoop does not support this feature natively, and the major Cloud Service Providers (CSPs) do not usually provide a bandwidth guarantee.
The BlobSeer-Hadoop module reproduces the behavior of the MR framework, and invokes SimGrid operations whenever a network transfer or processing task must be performed. This simulation follows the Hadoop implementation, with a heartbeat mechanism to control task execution.
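The heartbeat-driven control described above can be sketched as follows. This is a minimal illustration in plain Python with simulated timestamps, not the BIGhybrid or SimGrid code; the class name `HeartbeatMaster`, its methods, and the 10-second timeout are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMaster:
    """Toy master that tracks worker liveness from heartbeats (simulated time)."""
    timeout: float                                  # silence (s) after which a worker is presumed dead
    last_seen: dict = field(default_factory=dict)   # worker -> time of last heartbeat
    running: dict = field(default_factory=dict)     # worker -> set of task ids
    pending: list = field(default_factory=list)     # tasks waiting to be reassigned

    def heartbeat(self, worker, now):
        """Record that `worker` reported in at simulated time `now`."""
        self.last_seen[worker] = now
        self.running.setdefault(worker, set())

    def assign(self, worker, task):
        """Mark `task` as running on `worker`."""
        self.running.setdefault(worker, set()).add(task)

    def check(self, now):
        """Declare silent workers dead, requeue their tasks, return the dead workers."""
        dead = [w for w, t in self.last_seen.items() if now - t > self.timeout]
        for w in dead:
            self.pending.extend(sorted(self.running.pop(w, set())))
            del self.last_seen[w]
        return dead


# Two workers start at t=0; only w1 keeps sending heartbeats.
master = HeartbeatMaster(timeout=10.0)
master.heartbeat("w1", now=0.0)
master.heartbeat("w2", now=0.0)
master.assign("w1", "map-0")
master.assign("w2", "map-1")
master.heartbeat("w1", now=8.0)
dead = master.check(now=12.0)   # w2 has been silent for 12 s > 10 s
```

After the check, `w2` is treated as dead and its task `map-1` is back in the pending queue, while `w1` keeps `map-0`. A middleware that tolerates temporary disconnections, as described for BitDew, would need a more lenient rule than this single fixed timeout.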