Reliability in grid computing systems

Christopher Dabrowski
2009 Concurrency and Computation  
cdabrowski@nist.gov

SUMMARY

In recent years, grid technology has emerged as an important tool for solving compute-intensive problems within the scientific community and in industry. To further the development and adoption of this technology, researchers and practitioners from different disciplines have collaborated to produce standard specifications for implementing large-scale, interoperable grid systems. The focus of this activity has been the Open Grid Forum, but other standards development organizations have also produced specifications that are used in grid systems. To date, these specifications have provided the basis for a growing number of operational grid systems used in scientific and industrial applications. However, if the growth of grid technology is to continue, it will be important that grid systems also provide high reliability. In particular, it will be critical to ensure that grid systems are reliable as they continue to grow in scale, exhibit greater dynamism, and become more heterogeneous in composition. Ensuring grid system reliability in turn requires that the specifications used to build these systems fully support reliable grid services. This study surveys work on grid reliability that has been done in recent years and reviews progress made toward achieving these goals. The survey identifies important issues and problems that researchers are working to overcome in order to develop reliability methods for large-scale, heterogeneous, dynamic environments. The survey also illuminates reliability issues relating to standard specifications used in grid systems, identifying existing specifications that may need to be evolved and areas where new specifications are needed to better support reliability.

organization or small group of cooperating organizations. As such, they will be managed in different administrative domains, rather than belonging to one, centrally managed domain. Different domains are likely to have dissimilar access and security policies, while the resources they manage will employ different processor architectures and operating systems software. Further, the environment in which these resources exist will be highly dynamic, due to the combined effects of enterprises continuously joining and leaving the grid, administrative policy changes, and component upgrades.
Even with standard interfaces and communications protocols in place, resource heterogeneity and dynamism will likely lead to component interactions that result in faults and failures which imperil executing user applications. Some faults encountered in current grid systems may prove hard to detect, as discussed in [10, 11], or wreak havoc by propagating through the grid [12]. Long-running applications that require many resources and must produce precise results are likely to be especially vulnerable [13]. Another potential source of faults will come from grid network services, which transport large datasets between grid resources. To do this, network services will need to coordinate many heterogeneous network components and maintain stable high-bandwidth connections for long periods [14], thus increasing the chance of faults. Another complicating factor will be the asynchronous nature of this environment, in which distributed components utilize independent clocks and messages may be subject to unbounded delay. As a result, it will be more difficult for distributed components to coordinate their computations or to know if a component has failed or is just responding slowly [15].

The need to manage large numbers of computational, data, and network resources under conditions of scale, heterogeneity, and dynamism distinguishes grid systems from other types of distributed systems. As others have argued [16], these differences motivate development of reliability methods that are designed specifically for the conditions that prevail in grid environments. Despite the distinguishing characteristics of grid systems, methods for ensuring reliability of grid systems are closely related to, and partly based on, reliability methods developed in other branches of distributed systems research.
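The difficulty noted above, that in an asynchronous environment a failed component cannot readily be told apart from a slow one, can be illustrated with a small timeout-based heartbeat detector. The sketch below is not taken from the survey or from any grid specification; the class name `HeartbeatDetector`, the node names, and the 5-second timeout are all hypothetical, and time is simulated with plain numbers rather than a real clock.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatDetector:
    """Illustrative timeout-based failure detector (hypothetical, not from the survey)."""
    timeout: float  # seconds of silence before a node is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the arrival time of a heartbeat message from `node`.
        self.last_seen[node] = now

    def suspected(self, node: str, now: float) -> bool:
        # Suspect a node that was never heard from or has been silent
        # for longer than the timeout.
        last = self.last_seen.get(node)
        return last is None or (now - last) > self.timeout


detector = HeartbeatDetector(timeout=5.0)
detector.heartbeat("node-a", now=0.0)
detector.heartbeat("node-b", now=0.0)

# Assumed scenario: node-a's next heartbeat is delayed and arrives at t=8;
# node-b has crashed and sends nothing further.
at_t6 = {n: detector.suspected(n, now=6.0) for n in ("node-a", "node-b")}
detector.heartbeat("node-a", now=8.0)
at_t9 = {n: detector.suspected(n, now=9.0) for n in ("node-a", "node-b")}

print(at_t6)
print(at_t9)
```

At t=6 the detector suspects both nodes, although only node-b has actually failed; the suspicion of node-a is withdrawn once its delayed heartbeat arrives. With unbounded message delay, no fixed timeout can rule out such false suspicions, which is one reason failure detection is harder in this setting.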
Reliability work in other areas of distributed systems has a long and rich history in comparison with grid computing, as evidenced by past work on wide-area networks [17-19], high-performance cluster computing [20-22], and distributed database systems [23]. Also important for grid systems is previous work on quantitative estimation algorithms that measure reliability in distributed systems [24, 25]. Peer-to-peer networks also influence grid systems, but because this is also a new technology, reliability has perhaps been less extensively researched [26]. The efforts described in these surveys provide a basis for grid reliability research, and where appropriate these links are discussed in this study.

RELIABILITY OF GRID RESOURCES

Because of its obvious importance, ensuring the reliability of computing hardware and software resources that comprise grid systems has probably been the object of more effort than other functional areas identified above. Grid resources include processor clusters, supercomputers, storage devices, and related hardware, together with operating system and other software for managing these resources. Grid resources can also include dedicated software components that may be engaged by users to perform various functions, such as
doi:10.1002/cpe.1410 fatcat:xih4uaq3unf7hcxa67ssxoh2jm