Worldwide Computing: Adaptive Middleware and Programming Technology for Dynamic Grid Environments

Carlos A. Varela, Paolo Ciancarini, Kenjiro Taura
2005 Scientific Programming  
Towards a pervasive Grid: A research agenda
Future-generation cyber-infrastructure must enable the dynamic coordinated composition of computing and information services. Resulting applications need to provide high-performance distributed computing to end users in a scalable, reliable and secure manner. This pervasive and ubiquitous grid computing infrastructure will view physical devices and computational agents uniformly, enabling radical improvement of existing applications, and opening the door for applications in entirely new domains.
For example, wireless sensor-actuator networks deployed in buildings and bridges can semi-automatically and cooperatively react to natural or man-made disasters in order to prevent human losses. Temperature, pressure, and stress-sensing devices can inform authorities about the probability of structural damage leading to collapse, can automatically activate fire extinguishing equipment, can help locate survivors and devise exit strategies, and can divert traffic in emergency situations in semi-automated ways. Another example is virtual surgical planning, in which a surgeon simulates several surgery plans on a detailed computational model of a patient's body. The surgeon can interactively analyze the effect of different plans on the patient's body fluid dynamics. Another example is a distributed camera network coordinated by a real-time data mining component for airport surveillance and security. Unusual, irregular patterns of human behavior can be detected and communicated promptly to authorities. A final example is a manned and unmanned aerial vehicle network that can exchange sensed information and plans of action, and combine them with terrain and map databases for decentralized coordinated air traffic control. These applications will directly benefit from coordinated computational resources offered by the future pervasive grid.
To realize this pervasive grid vision, it is imperative to develop programming technology and modular middleware to facilitate systems development on highly heterogeneous and dynamic cyber-infrastructure. In closed grid environments, a centralized coordination module often needs to reserve dedicated network and computing resources for specific tasks. Furthermore, users are often expected to manually allocate resources and install any needed software on target computing environments.
In contrast, open dynamic grid environments require coordination and resource management protocols to be automated. Resources in open grid environments can be organized into peer-to-peer networks. These networks are scalable because information lives in individual nodes and communication is highly, if not completely, decentralized. In an envisioned grid computing scenario, a process or data item will be created by a human user in a single node. The process or data item will get replicated and propagated by middleware layers to a dynamic network of processing and storage resources. The purpose of the middleware will be to provide high performance through resource profiling and optimization, scalability through dynamic reconfiguration and adaptation, fault-tolerance and data availability through replication, and privacy through encryption. All of these must be provided in a completely decentralized and automated manner.
ISSN 1058-9244/05/$17.00 © 2005 - IOS Press and the authors. All rights reserved
Realistic grid computing environments will be large-scale, dynamic, and heterogeneous. Physical and computational agents will span the globe. New computational nodes will join at any time, for example, because a user points her web browser to a site coordinating a search for extraterrestrial intelligence from astronomical data. Nodes will also leave at any time, for example, because of user-directed actions or because of network or node failures. Furthermore, resources will have very different capabilities and constraints. For example, a small robot in a battlefield may have energy constraints. These will prevent the robot from communicating too frequently with its base, so it will prefer to send digested abstracts after local communication with nearby robots.
Grid computing middleware needs to coordinate such heterogeneous computational and physical agents. Middleware must dynamically adapt applications to changing needs and to evolving underlying computational resources. Research in decentralized coordination and global resource optimization is needed to provide cost-effective services with high reliability and performance to end users. The fundamental research problems the pervasive grid community faces are:
1. Decentralized coordination: Coordination algorithms and protocols must account for physical and computational agents entering, leaving and moving about the system. Decentralized coordination is needed for systems to scale. Decentralized coordination protocols form the foundation for other distributed resource management problems. Modular middleware is critical for plugging in different coordination protocols that adjust to diverse application communication topologies and computation-to-communication ratios.
2. Resource optimization: Grid computing middleware needs to continuously map computational agents to underlying network resources. Dynamic resource profiling and allocation (e.g., see [30]) need new mathematical models that optimize global resource usage with minimal profiling and communication overhead and in the presence of local, partial, and even inaccurate knowledge. One potentially interesting avenue of research is novel integrated static and dynamic program analyses for resource optimization.
3. Programmability: Usability is critical to the success of grid computing. Research is needed in new coordination models, abstractions, and languages that facilitate describing and reasoning about high-level specifications of distributed resource usage, composition, and constraints. For example, actor-oriented programming models, languages and infrastructure (e.g., see [32, 84]) provide intuitive high-level abstractions and efficient middleware components that support these abstractions.
4. Quality of Service: Quality of service (QoS) requirements in grid computing applications include scalability to world-wide computational environments, reliability in the presence of node and network failures, and enforcement of security policies, including data privacy, user authentication and authorization, and secure remote code execution. There is significant existing grid computing security research and software infrastructure (e.g., see [37, 41]), but as grid environments become more open and dynamic, new challenges arise.
Middleware and infrastructure for Grid computing
Globally distributed computing
Several research groups are trying to achieve distributed computing on a large scale. Wisconsin's Condor project studies high throughput computing by "hunting" for idle workstation cycles [58]. Berkeley's NOW project effectively distributes computation on a building-wide scale [8], and Berkeley's Millennium project exploits a hierarchical cluster structure to provide distributed computing on a campus-wide scale [19]. The Globus project seeks to enable the construction of larger computational grids [37]. Virginia's Legion meta-system integrates research in object-oriented parallel processing, distributed computing, and security to form a programmable world-wide virtual computer [47]. Caltech's Infospheres project envisions a world-wide pool of millions of objects (or agents) much like the pool of documents on the World-Wide Web today [23]. WebOS seeks to provide operating system services, such as client authentication, naming, and persistent storage, to wide area applications [83]. UIUC's 2K is an integrated operating system architecture addressing the problems of resource management in heterogeneous networks, dynamic adaptability, and configuration of component-based distributed applications [55].
Berkeley's SETI@Home project [77] uses idle cycles on the Internet to analyze astronomical data in search of patterns that might prove the existence of extraterrestrial life. Stanford's ProteinFolding@Home [66] has over a million participants worldwide volunteering their processing power in a computational attempt to understand protein folding.
Grid computing middleware
Grid computing has made great progress in the last few years. The basic mechanisms for accessing remote resources have been developed as part of the Globus Toolkit [38] and are now widely deployed and used. The grid community is now concentrating on middleware that enables the composability and interoperability of different service-oriented grid solution components by using Web Service standards. The emerging generation of grid middleware, based on the Open Grid Services Architecture (OGSA) [39, 40], is tackling interoperability and composability by using XML-based developments including WSDL for service definition, SOAP for data interchange, and UDDI for service registration and discovery. OGSA specifies the requirements for a layered set of protocols and services as follows: (i) Connectivity layer, concerned with communication and authentication, (ii) Resource layer, concerned with negotiating access to individual resources, and (iii) Collective layer, concerned with the coordinated use of multiple resources.
Process migration and replication
Research on process and data migration for grid applications includes [73, 81, 82]. A key idea behind these approaches is to detect service degradations and react to such events by dynamically reconfiguring application processes to effectively use available computational resources. Research on data and process replication includes [5, 24, 25, 64, 68, 71, 75, 85]. Most of these approaches only consider immutable data replication to avoid having to devise and implement expensive replica consistency protocols.
An adaptive middleware layer is needed, capable of migrating and replicating data and processes proactively based on the dynamically changing availability of resources on the grid [4, 30, 56, 57, 78]. While adaptive process and data migration and replication can have a large impact on the performance of grid computing applications, they both assume a reasonable underlying model of resource usage and expected future performance and availability of grid resources. Two mechanisms to predict performance based on profiling resource usage are the Network Weather Service (NWS) [90] and the Globus Meta Discovery Service (MDS) [28]. Recent research has devised and evaluated different mechanisms for resource management in dynamic heterogeneous grid environments (e.g., see [6, 7, 10, 28, 43, 49, 51, 92]). However, how to map application resource needs onto dynamically changing computational resources in a transparent and efficient manner is still an open problem. A promising approach is to use econometric market-driven resource exchange based on supply and demand: the more applications that want to use a resource, the more expensive it is, and conversely, the more resources that are available to applications, the cheaper they need to be to remain competitive. Research in this direction includes [88, 89].
An important consideration when adapting applications to dynamic grid environments through proactive data and process migration and replication is that the failure semantics of applications change considerably. Research on fault detection and recovery through checkpointing and replication includes [16, 17, 18, 33, 74]. Notice that an application checkpointing mechanism is necessary for adaptive application migration and can readily be used for fault tolerance as well. More fine-grained process-level rather than application-level checkpointing and migration requires logging messages in transit to properly restore a distributed application state upon failure [33].
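The observation that one checkpointing mechanism can serve both migration and fault tolerance can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `IterativeSum` application, the `run_with_migration` helper, and the use of `pickle` as the serialization format are not taken from any of the cited systems, and the "nodes" are simulated inside one process.

```python
import pickle


class IterativeSum:
    """Toy application (hypothetical): sums the integers 0..n-1 one step at a time."""

    def __init__(self, n):
        self.n = n
        self.i = 0        # next integer to add
        self.total = 0    # partial result so far

    def step(self):
        """Advance one iteration; return False once the work is done."""
        if self.i >= self.n:
            return False
        self.total += self.i
        self.i += 1
        return True

    def checkpoint(self):
        """Serialize the full application state to bytes."""
        return pickle.dumps({"n": self.n, "i": self.i, "total": self.total})

    @classmethod
    def restore(cls, blob):
        """Rebuild the application from a checkpoint, e.g. on another node.

        Note: unpickling is only safe for checkpoints from a trusted source.
        """
        state = pickle.loads(blob)
        app = cls(state["n"])
        app.i, app.total = state["i"], state["total"]
        return app


def run_with_migration(n, migrate_at):
    """Run `migrate_at` steps on a simulated node A, then finish on node B."""
    app = IterativeSum(n)
    for _ in range(migrate_at):
        app.step()
    blob = app.checkpoint()                 # would be shipped over the network
    resumed = IterativeSum.restore(blob)    # same call serves crash recovery
    while resumed.step():
        pass
    return resumed.total
```

Calling `run_with_migration(10, 4)` gives the same result as an uninterrupted run. The sketch covers application-level checkpointing only; it does not log messages in transit, which the text notes is required for process-level checkpointing of a distributed application.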
Application-level support for Grid application development
Although some advances have been made in the area of application-level support for grid applications, much work still needs to be done. An MPI implementation using the Globus Toolkit allows execution of cluster computing programs in grid environments [36, 52]. One system that targets numerical relativity in particular is Cactus [7]. Cactus allows users to write application-specific modules and interface them to the Cactus framework, which can run these modules on the grid. While enabling communication across heterogeneous environments provides an important step towards enabling grid execution of applications written for cluster environments, the latency incurred over wide area networks makes highly synchronized iterative programs run inefficiently in typical grid environments (e.g., see [13]).
Due to the economic need to reuse existing application code, most application-level support for grid computing tries to impose as few requirements as possible on developers [59]. The goal is that applications delegate distribution issues completely to middleware, and therefore remain largely unchanged. However, it is also important to devise higher-level programming models to enable faster software development cycles and more efficient and effective middleware-triggered dynamic application reconfiguration [9, 14, 45, 84]. An open research question is how to integrate application-level load balancing (as, e.g., in adaptive parallel applications [29, 34, 35, 69, 72]) with middleware-level load balancing [30] without compromising the efficiency of the former or the transparency of the latter.
Programming models and systems for adaptive parallelism
Adaptive parallelism refers to parallelism that may change at runtime based on resource availability.
Programs with adaptive parallelism are also said to be malleable. Malleability is desirable in grid environments, where availability of resources may change dynamically. It is clearly very difficult for the programmer to write malleable programs without suitable programming model and language support. Challenges in supporting such programs involve (1) designing programming models in which writing malleable programs and reasoning about their correctness as well as their locality is easy, and (2) implementing such models with high-performance communication and with scalability.
Message passing models such as MPI and PVM make it easy for the programmer to reason about locality (and thus performance) of parallel programs. On the other hand, directly adding malleability to this model adds significant programming complexity because all the burden of keeping track of memberships and the mapping of data to nodes is on the programmer. The Phoenix parallel programming model [79] improves this situation by introducing communication via abstract (virtual) node names. Message passing programs are difficult to make malleable because they directly expose "processes" at the programming model level.
High-level parallel programming models that hide the mapping between computation/data and physical resources are generally more amenable to adaptive parallelism. This is because the programmer does not use process names for communication and the task of mapping computation/data to resources is in principle up to the system, not the programmer. They include distributed shared memory and object-based models. The challenge for these models is then how to provide the programmer with a model of locality, with which s/he can reason about the communication performance of programs.
An important class of algorithms that has demonstrated an efficient utilization of adaptive parallelism is divide-and-conquer algorithms [15, 91]. Such algorithms divide a problem into smaller sub-tasks.
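As an illustration of why divide-and-conquer suits a changing pool of workers, the following Python sketch simulates summing a list with random task stealing. It is a single-threaded simulation under stated assumptions, not the scheduler of any cited system: the function name `parallel_sum`, the `threshold` parameter, the round-robin worker loop, and the shared `total` accumulator are all hypothetical simplifications.

```python
import random
from collections import deque


def parallel_sum(data, num_workers, threshold=4, seed=0):
    """Simulate divide-and-conquer summation with random task stealing.

    Workers are simulated round-robin in one thread. Each task is a (lo, hi)
    slice of `data`; sub-tasks are independent, so any worker may run them.
    Returns (sum of data, number of steals performed).
    """
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].append((0, len(data)))       # the whole problem starts on worker 0
    total = 0                              # simplification: one shared accumulator
    steals = 0
    while any(queues):
        for w, q in enumerate(queues):
            if not q:
                # Idle worker: pick a random victim and steal its oldest task,
                # which is the largest remaining piece of that victim's work.
                victims = [v for v in range(num_workers) if v != w and queues[v]]
                if not victims:
                    continue
                q.append(queues[rng.choice(victims)].popleft())
                steals += 1
            lo, hi = q.pop()               # owner works on its newest task
            if hi - lo <= threshold:
                total += sum(data[lo:hi])  # small enough: solve directly
            else:
                mid = (lo + hi) // 2       # divide into two independent sub-tasks
                q.append((lo, mid))
                q.append((mid, hi))
    return total, steals
```

The result does not depend on `num_workers`, which is the property that makes such programs malleable: for example, `parallel_sum(list(range(100)), 1)` and `parallel_sum(list(range(100)), 4)` return the same sum, and only the number of steals differs. A latency-aware variant would replace the uniform `rng.choice` with a choice biased towards nearby victims.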
Malleability can be achieved relatively easily by dynamic load balancing based on random or latency-aware task stealing. A nice property is that such dynamic load balancing schemes do not incur much communication overhead if sub-tasks do not have dependencies among them (as is typically the case). Furthermore, under this assumption, fault-tolerance can be achieved relatively simply, by redoing sub-tasks that are lost.
Programming and coordination technology
Models of computation
The complexity of distributed software development and the effectiveness of middleware optimizations are highly dependent on devising high-level programming and coordination models and languages for distributed computing. Nondeterminism, asynchrony, and partial failures make concurrent, distributed and mobile systems much harder to reason about and develop than sequential systems. Theoretical models of concurrency, mobility and distribution help reason about and implement open distributed systems. These models are critical to create appropriate programming abstractions, languages and middleware for global distributed computing.
Two major schools of thought for concurrent programming models are process algebras and the actor model. The main representative of the process algebra school is the π-calculus [63], a simple yet expressive and composable model consisting of processes communicating through shared channels. The model provides primitive operations for synchronously reading or writing from channels, for creating new channels with scoped names, and for composing concurrent processes.
The calculus has a well-developed theory, including a structural congruence that relates syntactically equivalent process expressions, an operational semantics that provides rules for the evolution of computations, and bisimulations that provide for the semantic equivalence of process expressions. The π-calculus has influenced work in several high-level programming languages, for example, Pict and Nomadic Pict [67, 87]. Other representative calculi include the join calculus [42] and mobile ambients [21].
The actor model of concurrent computation [1, 48] defines an actor as a unit of concurrency, mobility and distribution in open systems. An actor encapsulates state and reacts to asynchronous messages by modifying its internal state, by creating new actors, or by sending messages to other known actors. The model assumes guaranteed message delivery and fairness in scheduling computations. A leading theory of actor computation [3] views actors as functional components and models them as expressions in an extended call-by-value λ-calculus. Open systems are then viewed as composable actor configurations and an operational semantics defines valid transitions between configurations. Observational equivalence is used to equate actor systems. Among other applications, actor systems have been used for enterprise integration [80], real-time programming [70], fault-tolerance [2], coordination [20, 44], and distributed artificial intelligence [31]. The actor model has also influenced programming language design, e.g., ABCL, Erlang, THAL and SALSA [12, 54, 84, 93].
Coordination models and languages
There has been much recent interest in coordination models and languages [11, 26, 27, 46, 65]. Linda [22] is a coordination model that uses a globally shared tuple space as a means of coordination and communication.
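The tuple-space style of coordination can be sketched in a few lines of Python. This is a simplified, single-process illustration rather than the Linda API: the class name `TupleSpace`, the use of `None` as a wildcard, and the method name `take` (standing in for Linda's `in`, which is a reserved word in Python) are assumptions of this sketch, and unlike Linda's blocking operations, `rd` and `take` here return `None` when nothing matches.

```python
class TupleSpace:
    """Minimal in-memory tuple space (hypothetical, non-blocking, not thread-safe)."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Publish a tuple into the shared space."""
        self._tuples.append(tuple(fields))

    @staticmethod
    def _matches(pattern, tup):
        # A pattern matches a tuple of the same length whose fields are equal,
        # with None in the pattern acting as a wildcard.
        return len(pattern) == len(tup) and all(
            p is None or p == f for p, f in zip(pattern, tup)
        )

    def rd(self, *pattern):
        """Return the first matching tuple without removing it, or None."""
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def take(self, *pattern):
        """Remove and return the first matching tuple, or None."""
        tup = self.rd(*pattern)
        if tup is not None:
            self._tuples.remove(tup)
        return tup


# Usage: a producer and a consumer coordinate without knowing each other.
space = TupleSpace()
space.out("task", 1, "render")
space.out("task", 2, "encode")
first = space.take("task", None, None)   # consumer claims any pending task
```

Because producers and consumers only share the space and the tuple shapes, coordination code stays separate from computation code, which is the property the section emphasizes.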
Additional work in this direction includes adding types to tuple spaces for safety, using objects instead of tuples and making tuple spaces first-class entities [50, 53, 61]. Many coordination models rely on reflection, the ability to introspect and customize basic behavior (e.g., see [62, 76, 86, 93]). Critical characteristics of a coordination model and language are the separation of concerns it provides (not intermixing coordination and computation code) and its ability to scale to large systems through decentralized communication.
SALSA is a general-purpose actor-oriented coordination language, especially designed to facilitate the development of dynamically reconfigurable open distributed applications. In addition to the actor model's first-class support for unbounded concurrency, asynchronous message passing, and state encapsulation, SALSA follows a universal naming model with Internet- and Java-based support for actor migration and location-transparent message passing. Furthermore, to facilitate coordination of concurrent activities, SALSA provides three high-level abstractions for programmers: token-passing continuations, join blocks, and first-class continuations [84]. The SALSA platform provides an important research testbed to evaluate higher-level programming and coordination abstractions. SALSA, as a coordination language, can be used to develop middleware agents that profile the network, exchange information, make distributed resource management decisions, and reconfigure distributed application components written in other conventional programming languages such as C++ or Java [60].
While programming and coordination models and languages help raise the level of abstraction for building grid computing systems, they do not guarantee high performance in distributed execution. Compilation technology and middleware research is needed to ensure efficient execution of distributed applications in dynamic grid environments.
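The actor abstractions that SALSA builds on (encapsulated state and asynchronous message passing) can be illustrated with a small Python sketch. It is not SALSA code and does not show SALSA's continuations, join blocks, naming, or migration; the classes `ActorSystem`, `Counter`, and `Collector` and the message shapes are hypothetical. The single FIFO queue is also an assumption of the sketch: it delivers every message, but it imposes a global order that the actor model itself does not promise.

```python
from collections import deque


class ActorSystem:
    """Hypothetical single-threaded scheduler that delivers queued messages."""

    def __init__(self):
        self._pending = deque()    # (actor, message) pairs awaiting delivery

    def send(self, actor, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._pending.append((actor, message))

    def run(self):
        """Deliver messages one at a time until none remain."""
        while self._pending:
            actor, message = self._pending.popleft()
            actor.receive(message)


class Counter:
    """Actor whose state changes only in reaction to messages."""

    def __init__(self, system):
        self.system = system
        self.count = 0             # encapsulated state

    def receive(self, message):
        kind, *args = message
        if kind == "inc":
            self.count += args[0]                       # modify internal state
        elif kind == "report":
            (reply_to,) = args
            self.system.send(reply_to, ("value", self.count))  # message a known actor


class Collector:
    """Actor that simply records every message it receives."""

    def __init__(self):
        self.seen = []

    def receive(self, message):
        self.seen.append(message)


system = ActorSystem()
counter, collector = Counter(system), Collector()
system.send(counter, ("inc", 2))
system.send(counter, ("inc", 3))
system.send(counter, ("report", collector))
count_before_run = counter.count   # still 0: sends do not execute the receiver
system.run()
```

After `system.run()`, the counter holds 5 and the collector has received `("value", 5)`. The point of the sketch is that senders never touch the counter's state directly, which is what lets middleware move or replicate an actor without changing the code that talks to it.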
Introduction to the special issue on dynamic Grids and worldwide computing
Keahey, Foster, Freeman and Zhang introduce the concepts of virtual workspaces and resource capabilities as abstraction layers between grid resources and grid applications. These abstractions enable more efficient matchmaking of distributed resources and applications and also have the potential for improving quality of service (QoS) for grid applications. The authors present two potential instantiations of the virtual workspace abstraction: (i) a dynamic account, representing a leased UNIX account mapped to a grid user identity, and (ii) a virtual machine, encapsulating and virtualizing access to physical machines. The taxonomy of virtual workspaces, including an XML Schema for representing them, can prove to be an important contribution in the definition, discovery, matching, and deployment of distributed grid resources. The paper presents an authoritative futuristic vision of dynamic grid computing.
doi:10.1155/2005/132359 fatcat:tq2wrezzyzfevfdmj2bcazwgri