Distributed computing practice for large-scale science and engineering applications
Concurrency and Computation
INTRODUCTION, DEFINITIONS AND OUTLINE

The process of developing and deploying large-scale distributed applications presents a critical and challenging agenda for researchers and developers working at the intersection of computer science, computational science, and a diverse range of application areas. In this paper, we review the state of the art and outline, through a gap analysis, our vision for the future. Our analysis is driven by examining the features and properties of a number of active
... scientific applications. Specifically, we identify and encapsulate recurring patterns within these applications. Before presenting the details of our analysis, we clarify a number of terms and set the context for the discussion that follows. The focus in this paper is on computational science and the use of distributed computing in computational science, and this influences our definitions and discussion.

S. JHA ET AL.

Context

Even though individual computers are becoming more powerful, there remains and will remain a need for aggregating distributed computational resources for scientific computing. In the simplest case, the demand for computing power at any time may exceed the capacity of the individual systems that are available, requiring the coupling of physically distributed resources. Alternatively, higher throughput may be achieved by aggregating resources, or there may be a need to use specialized hardware in conjunction with general purpose computing units. Similarly, application components may require specific hardware, may need to store large datasets across multiple resources, or may need to compute near data that is too large to be transferred efficiently. Finally, distributed computing may also be required to facilitate collaborations between physically separated groups.

However, despite the need, there is both a perceived and a genuine lack of distributed scientific computing applications that can seamlessly utilize distributed infrastructures in an extensible and scalable fashion. The reasons for this exist at several levels. We believe that at the root of the problem is the fact that developing large-scale distributed applications is fundamentally a difficult process. Commonly accepted and widely used models and abstractions remain elusive. Instead, many ad hoc solutions are used by application developers.
The range of proposed tools, programming systems, and environments is bewilderingly large, making integration, extensibility, and interoperability difficult. Against this backdrop, the set of distributed infrastructures available to scientists continues to evolve, in scale and capabilities as well as in complexity. Support for, and investments in, legacy applications need to be preserved, while at the same time facilitating the development of novel and architecturally different applications for new and evolving environments such as clouds. Whereas deployment and execution details should not complicate development, they should not be disjoint from the development process either; that is, tools that support the deployment and execution of applications should be cognizant of the approaches employed to develop those applications.

The modifier large-scale can be applied to several aspects. For data-intensive applications, large scales can arise when the amount of data to be analyzed is copious, or when the data generated is so large that it mandates real-time processing because intermediate storage is not feasible (e.g., the Large Hadron Collider (LHC)). Furthermore, data may be stored or generated in a large-scale, physically distributed fashion (e.g., the Low-Frequency Array (LOFAR) or the envisioned SKA distributed radio telescopes), imposing high-volume, wide-area data transfers. Alternatively, it may be that the amount of computation required is large-scale, for example, requiring billions of independent tasks to be completed in a limited window of time; such large-scale applications require the coupling of physically distributed resources. Large-scale is also used to describe the number of users, the number of resources, and geographical distribution at scale.
There exists a particular interest in large-scale distributed applications, as these reinforce the need for distributed computing and serve as a reminder of the extreme challenges involved in designing and programming distributed applications. In general, for large-scale distributed systems, issues of scalability, heterogeneity, fault-tolerance, and security prevail. In addition to these non-functional features of distributed systems, the need to manage application execution, possibly across administrative domains and in heterogeneous environments with variable deployment support, is an important characteristic.

In traditional high-performance computing, peak performance is typically paramount, and while peak performance remains critical for distributed systems, there are additional performance metrics of significance. For example, in many parameter sweep applications, which represent common, simple, and prototypical application candidates for distributed usage, a requirement is to sustain throughput during execution; thus, there is an emphasis on average performance over long periods. Large-scale distributed science applications must maintain performance metrics such as throughput, despite the presence of failures, heterogeneity, and variable deployments. The broad set of performance metrics for distributed applications reflects the typically broad requirements and objectives for distributed applications. Consistent with this is the fact that the execution modes and environments for distributed applications are important characteristics that often determine the requirements, performance, and scalability of the application; this will be a recurring theme throughout this paper. For example, an application kernel such as a molecular dynamics simulator will be packaged and tooled differently based upon the expected execution modes, and whether they are local or distributed.
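The throughput-oriented view of parameter sweep applications described above can be illustrated with a minimal sketch. It is not taken from any of the applications surveyed in this paper: the task function `simulate`, its artificial failure rule, and the helper `run_sweep` are hypothetical names introduced only for illustration, and a local thread pool stands in for physically distributed resources. The sketch resubmits failed tasks and reports sustained throughput (completed tasks per unit time) rather than peak performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulate(param, attempt):
    """Hypothetical task: square the parameter.

    To mimic transient failures, every third parameter fails on its
    first attempt (an assumption made purely for illustration).
    """
    if param % 3 == 0 and attempt == 0:
        raise RuntimeError(f"transient failure for parameter {param}")
    return param * param


def run_sweep(params, workers=4, max_attempts=3):
    """Run one task per parameter, resubmitting failed tasks.

    Returns the results, the number of failures observed, and the
    sustained throughput in completed tasks per second.
    """
    results = {}
    failures = 0
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future to the (parameter, attempt) pair it represents.
        pending = {pool.submit(simulate, p, 0): (p, 0) for p in params}
        while pending:
            retries = {}
            for future in as_completed(pending):
                param, attempt = pending[future]
                try:
                    results[param] = future.result()
                except RuntimeError:
                    failures += 1
                    if attempt + 1 < max_attempts:
                        retry = pool.submit(simulate, param, attempt + 1)
                        retries[retry] = (param, attempt + 1)
            pending = retries
    elapsed = time.perf_counter() - start
    throughput = len(results) / elapsed if elapsed > 0 else float("inf")
    return results, failures, throughput


if __name__ == "__main__":
    results, failures, throughput = run_sweep(range(12))
    print(f"{len(results)} tasks completed after {failures} failures, "
          f"{throughput:.0f} tasks/s sustained")
```

In this sketch the sweep still completes every task despite the injected failures; in a real distributed setting, the resubmission policy and the measurement window would depend on the execution environment.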
Additionally, the mechanism of distributed control of the application will be influenced by the specifics of the execution environment and the capabilities offered therein.

The distributed science and engineering applications that we discuss in this paper are mostly derived from the e-Science or Grid community of applications, which in turn emphasize traditional high performance computing (HPC) applications that have been modified to utilize distributed resources. The complexity of developing applications for such large-scale problems stems in part from combining the challenges inherent in HPC and large-scale distributed systems. Additionally, as the scale of operation increases, the complexity of developing and executing distributed applications increases both quantitatively and in qualitatively new ways. The reasons for multiple and concurrent resource usage are varied and often application specific, but in general, the applications analyzed are resource intensive and thus not effectively or efficiently solvable on a single machine. Although many of the application characteristics will be reminiscent of transactional and enterprise applications, we will focus on science and engineering applications that have been executed on general purpose and shared production grid infrastructure, such as the US TeraGrid and the European EGEE/EGI, and have not been tied to a specific execution environment. Striving for such extensibility and portability is an important functional requirement and thus a design constraint. Additionally, the science and engineering applications considered are typically used in single-user mode, that is, concurrent usage of the same instance by multiple users is highly constrained.
Before defining the terms used and the scope of the paper, it is also important to mention some application characteristics that do not typically influence the design and operating conditions: the need for anything beyond the most elementary security and privacy, as well as requirements of reliability and QoS, have not been first-order concerns, and thus they have not been specific determinants of execution environments or imposed constraints on deployment.

Definitions and concepts

This subsection defines the fundamental concepts used in the paper. Many terms have previously been defined or used in different contexts and with multiple meanings, and as a result, it is important to state how they are used in this paper. We keep the formalism and definitions to a minimum, and provide a common and consistent set of terms and definitions to support the paper. The key concepts of abstractions and patterns, and their relationship to distributed computational science applications, are shown in Figure 1.

Distributed applications: Distributed applications are those that (a) need multiple resources, or (b) would benefit from the use of multiple resources; examples of how they could benefit include increased peak performance, throughput, or reliability, or reduced time to solution.

Pattern: Patterns are formalizations of commonly occurring modes of computation, composition, and/or resource usage. Note that the use of the term patterns here is different from its use by the software engineering community. Patterns here refer to recurring requirements and usage, rather than just reusable solutions. Patterns can be discerned at each of the three stages of development, deployment, and execution for distributed applications. However, patterns often cannot be pigeon-holed into a specific category, but may have elements belonging to different categories. MapReduce is an interesting example: ... patterns they support can be used by new applications.
By this process, the work in an initial set of applications can influence the development of a much larger set of applications.

Scope of the paper and related work

The main aim of this paper is to explore and provide insight into the type of abstractions that exist, as well as those that should be developed so as to better support the effective development, deployment, and execution of large-scale (scientific) high-performance applications on a broad range of infrastructure. The advantages of such abstractions vary from better utilization of underlying resources to easier deployment of applications, as well as lowered development effort. As this paper relates to a number of themes in application development and distributed infrastructure, we attempt to better scope this work by relating it to real-world applications in scientific computing, along with an integrated analysis of application development (to include the tools and programming systems used) and their usage on distributed infrastructure. Consequently, our discussion of related work is aligned with this focus.

The categorization of programming models and systems in this paper was initially undertaken without reference to any particular existing effort, but was later compared with the published work of Lee and Talia, Parashar and Browne, Soh, and Rabhi. It was found to generally agree with these works, with some differences that were well understood. For example, previous classification attempts have been based around technology and have not considered the fundamental characteristics of applications. Lee and Talia discuss properties and capabilities of grid programming tools. Their work begins with a discussion of the 'fundamentals' of programming models and properties of parallel and distributed programming. There is a discussion of 'programming issues' (e.g., performance and configuration management); however, there is no discussion of these issues in the context of specific applications.
Parashar and Browne is primarily an analysis of grid systems, with some discussion of tools/languages and environments. The treatment of programming models (communication models and frameworks, distributed object models, component models, and service models) is not related to any specific application classes. Section 2 of Soh has a brief discussion of grid programming approaches; it not only focuses on grid programming models but also has a brief mention of resource composition and program decomposition. It has some mapping between the programming model and the environment, but does not provide any specific application (or application class). Rabhi is primarily focused on design patterns and skeletons as design abstractions. The editors motivate the book by the observation that, when faced with developing a new system, it is rare for a solution to be developed from scratch. Often, the solution to a previous closely related problem is adapted to the current context. They differentiate between a skeleton and a design pattern, and not only focus on the use of skeletons and patterns, primarily to support design, but also identify how such approaches can be used to support software development. The book shares many common themes with the overall aims of our work, with the key difference being the lack of emphasis on application development. Thus, this paper builds upon important work done by previous researchers, but makes an important and logical progression by providing a comprehensive overview of the approaches to development (programming), deployment, and execution (usage) in the context of real distributed applications in science and engineering.

Structure of the paper

We begin this paper (Section 2) by discussing six representative applications that we believe capture the major characteristics, challenges, and rationale behind large-scale distributed scientific applications.
Each application discussion is structured around the following points: (i) application overview; (ii) why the application needs distributed infrastructure; (iii) how the application utilizes distributed infrastructure; and (iv) challenges and opportunities for better utilization of distributed infrastructure. Later in the section, we introduce 'application vectors' to describe the distributed characteristics of the applications, and discuss how the values chosen for these vectors influence the applications' design and development.