A System Architecture for Running Big Data Workflows in the Cloud

Andrey Kashlev, Shiyong Lu
2014 IEEE International Conference on Services Computing
Scientific workflows have become an important paradigm for domain scientists to formalize and structure complex data-intensive scientific processes. The ever-increasing volumes of scientific data motivate researchers to extend scientific workflow management systems (SWFMSs) to utilize the power of Cloud computing to perform big data analyses. Unlike workflows run in traditional on-premise environments such as stand-alone workstations or grids, Cloud workflows rely on dynamically provisioned computing, storage and network resources that are terminated when no longer used. This dynamic and volatile nature of cloud resources, together with other cloud-specific factors, introduces a new set of challenges for "Cloud-enabled" SWFMSs. Although a few SWFMSs have been integrated with Cloud infrastructures, providing some experience for future research and development, a comprehensive study from an architectural perspective is still missing. To this end, we conduct a hands-on study by running a big data workflow in Amazon EC2, FutureGrid Eucalyptus and OpenStack clouds. From this experience we 1) identify the key challenges for running big data workflows in the cloud, 2) propose a generic implementation-independent system architecture that addresses these challenges, and 3) develop a cloud-enabled SWFMS called DATAVIEW that delivers a specific implementation of the proposed architecture. Finally, to validate our proposed architecture, we conduct a case study in which we design and run a big data workflow towards addressing an EB-scale big data analysis problem in the automotive industry domain.
doi:10.1109/scc.2014.16 dblp:conf/IEEEscc/KashlevL14 fatcat:4mmgfb7lafdfrovnpmvtgp7gfi