Comparing FutureGrid, Amazon EC2, and Open Science Grid for Scientific Workflows

Gideon Juve, Mats Rynge, Ewa Deelman, Jens-S. Vockler, G. Bruce Berriman
2013 Computing in science & engineering (Print)  
Scientists have a number of computing infrastructures available to conduct their research, including grids and public or private clouds. This paper explores the use of these cyberinfrastructures to execute scientific workflows, an important class of scientific applications. It examines the benefits and drawbacks of cloud and grid systems using the case study of an astronomy application. The application analyzes data from the NASA Kepler mission in order to compute periodograms, which help astronomers detect the periodic dips in the intensity of starlight caused by exoplanets as they transit their host star. In this paper we describe our experiences modeling the periodogram application as a scientific workflow using Pegasus, and deploying it on the FutureGrid scientific cloud testbed, the Amazon EC2 commercial cloud, and the Open Science Grid. We compare and contrast the infrastructures in terms of setup, usability, cost, resource availability and performance.
As scientific data volumes grow, scientists will increasingly require data processing services that support easy access to distributed computing platforms such as grids and clouds. The astronomy community is undertaking surveys of unprecedented depth to understand the behavior of astrophysical sources with time, so much so that the 2010 Decadal Survey of Astronomy [4], which recommends national priorities for the field, declared the time domain "the last frontier" in astrophysics. These surveys are already beginning to produce data sets too large to be analyzed by local resources, and will culminate near the end of this decade with the Large Synoptic Survey Telescope (LSST) [33], which expects to produce petabyte-scale data sets each night. Scientific workflows are a useful tool for managing these large-scale analyses because they express declaratively the relationships between computational tasks and data. As a result, workflows enable scientists to easily define multistage computational and data processing pipelines that can be executed in parallel on distributed resources, which automates complex analyses, improves application performance, and reduces the time required to obtain scientific results.
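The idea that a workflow declares relationships between tasks and data, from which a parallel execution order follows, can be illustrated with a minimal sketch. This is not the Pegasus API and not the authors' periodogram workflow: the task names, file names, and the helpers `build_dependencies` and `parallel_levels` are hypothetical, and the example assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: every task and file name below is illustrative only.
# Each task declares the files it reads and writes; nothing states the order.
tasks = {
    "split": {"inputs": ["lightcurves.tbl"], "outputs": ["part1.tbl", "part2.tbl"]},
    "periodogram1": {"inputs": ["part1.tbl"], "outputs": ["pgram1.out"]},
    "periodogram2": {"inputs": ["part2.tbl"], "outputs": ["pgram2.out"]},
    "merge": {"inputs": ["pgram1.out", "pgram2.out"], "outputs": ["result.out"]},
}


def build_dependencies(tasks):
    """Derive task-to-task dependencies from the declared input and output files."""
    # Map each produced file to the task that writes it.
    producer = {f: name for name, t in tasks.items() for f in t["outputs"]}
    # A task depends on the producers of its inputs; files with no producer
    # (such as the raw input table) are assumed to exist already.
    return {
        name: {producer[f] for f in t["inputs"] if f in producer}
        for name, t in tasks.items()
    }


def parallel_levels(tasks):
    """Group tasks into levels; tasks within one level can run concurrently."""
    sorter = TopologicalSorter(build_dependencies(tasks))
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose inputs are available
        levels.append(ready)
        sorter.done(*ready)  # mark them finished so their successors become ready
    return levels


if __name__ == "__main__":
    for depth, level in enumerate(parallel_levels(tasks)):
        print(depth, level)
```

In this sketch the two periodogram tasks land in the same level because neither consumes the other's output, which is the kind of parallelism a workflow system can exploit on distributed resources without the scientist scheduling it by hand.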
Because scientific workflows often require resources beyond those available on a scientist's desktop, it is important to understand the benefits and drawbacks of the various cyberinfrastructures available, so that users can make informed decisions about the most suitable platform for their application. Today scientists have many options for deploying their applications, including grids such as Open Science Grid [24] or XSEDE [8], commercial clouds such as Amazon EC2 [1] or The Rackspace Cloud [34], and academic clouds like FutureGrid [11]. Traditionally, scientific workflows have been executed on campus clusters or grids. Recently, however, scientists have been investigating the use of commercial and academic clouds for these applications. In contrast to grids and other traditional HPC systems, which provide only best-effort service and are configured and maintained by resource
doi:10.1109/mcse.2013.44 fatcat:loobib3hfbh63l3xuevxdpvyue