Reliable heterogeneous applications
IEEE Transactions on Reliability
This thesis presents the notion of computational resiliency to provide reliability in heterogeneous distributed applications. The notion provides both software fault tolerance and the ability to tolerate information warfare (IW) attacks. This technology seeks to strengthen a military mission, rather than protect its network infrastructure using static defense measures such as network security, intrusion sensors, and firewalls. Even if a failure or successful attack is never detected, it should
... be possible to continue information operations and achieve mission objectives. Computational resiliency involves the dynamic use of replicated software structures, guided by mission policy, to achieve reliable operation. It goes further, however, by automatically regenerating replication in response to a failure or attack, so that the level of system reliability is restored and maintained. Replicated structures can be protected through several techniques, such as camouflage, dispersion, and layered security policy. This thesis examines a prototype concurrent programming technology that supports computational resiliency in a heterogeneous distributed computing environment. The performance of the technology is explored through two example applications: concurrent sonar processing and remote sensing. We develop an associated analytical performance model and verify it against the experimental results. The overhead of computational resiliency on homogeneous and heterogeneous systems is investigated. Load balancing techniques are used to improve overall system performance, especially in heterogeneous computing environments.

Any system that operates in highly adverse environments, such as battlefield command and control, must be able to operate reliably by tolerating failures and attacks. Many distributed systems have sought to use state replication, in either hardware or software, as a mechanism to provide fault tolerance and recovery. These approaches provide graceful degradation of performance until no further replicas are available, at which point the system fails. This is not sufficient to assure information operations in adverse military situations where networked resources may become available dynamically through retasking.

General Approach
We are investigating an alternative model of distributed computation termed computational resiliency.
This model combines real-time attack assessment with process reconfiguration, dispersion, camouflage, on-the-fly replication, and layered security policy to reliably maintain information operations. To visualize how these concepts might operate, consider a distributed application as analogous to an apartment complex inhabited by a new strain of roach (a process or thread) [1]. The roaches are highly resilient: you can stamp on them, spray them, or strike them with a broom, but you never kill them all or keep them from their goal of finding food (resources). To foil your eradication ...

[1] Thanks to Cathy McCullum for providing this analogy.

Three principles were used to guide the development of the mechanisms described by the thesis statement. Each principle addresses part of the thesis statement, and together they form a basis for constructing resilient support mechanisms that fulfill the thesis.

Transparency
The methods that provide computational resiliency should be transparent to applications. The application programming interface (API) provides an abstract definition of the required reliability, and its realization remains transparent to applications in the presence of failures or attacks.

Scalability
The mechanism that provides computational resiliency should be scalable. Replication mechanisms, and the use of a local area network as the interconnection network, can limit the scalability of the system. The overhead associated with replication and network communication should be reduced to make the system scalable, which can be achieved by load balancing.

Portability
Distributed computing environments consist of a wide range of computers, from shared-memory multiprocessors and distributed-memory multicomputers to clusters of workstations. The software library should be portable across these systems and should recognize the underlying hardware capabilities so that the implementation can be optimized for each.
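To make the idea of regenerating replication after a failure or attack more concrete, the following is a minimal Python sketch. It is not the prototype library examined in this thesis. The class `ResilientGroup`, its fields, and the host names are hypothetical, and the sketch assumes that mission policy reduces to a required replica count and that failures are reported one host at a time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResilientGroup:
    """Hypothetical replicated task that restores its replication level."""

    required_replicas: int  # assumed: replication level set by mission policy
    hosts: list[str]        # hosts currently believed to be usable
    replicas: set[str] = field(default_factory=set)  # hosts running a replica

    def regenerate(self) -> None:
        """Start replicas on unused hosts until the required level is met."""
        spare = [h for h in self.hosts if h not in self.replicas]
        while len(self.replicas) < self.required_replicas and spare:
            self.replicas.add(spare.pop(0))

    def report_failure(self, host: str) -> None:
        """Drop a failed or compromised host, then restore replication."""
        self.replicas.discard(host)
        if host in self.hosts:
            self.hosts.remove(host)
        self.regenerate()


group = ResilientGroup(required_replicas=3, hosts=["a", "b", "c", "d", "e"])
group.regenerate()         # replicas start on hosts a, b, and c
group.report_failure("b")  # host b is lost; a replica is started on host d
```

Unlike plain replication, which degrades until no replicas remain, the group here returns to three replicas after each loss for as long as spare hosts exist. When the spare hosts run out, the sketch simply runs with fewer replicas.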
Contribution
The contributions of this research are:
1. A novel approach to providing fault tolerance and automatic recovery from attacks and failures.
2. A flexible software architecture that is application and platform independent.
3. Heterogeneous load balancing of replicated structures for performance improvement.
4. Demonstration of the technologies using typical real-world applications in various fields.
5. An associated analytical model expressed in terms of application-dependent parameters and resiliency requirements.
6. Experimental studies that reveal the overhead associated with computational resiliency.

Metrics of Success
The following metrics are used to assess the quality of the approach suggested in this thesis:
1. Overhead of resiliency: investigation of the overhead of replication and how to reduce it by means of load balancing.
2. Overhead of recovery: investigation of how fast the system can recover from failures and attacks, and how to reduce the overhead of the recovery process.
3. Accuracy of predictive models: investigation of how accurately the analytical model predicts performance as the number of processors, application-dependent factors, reliability factors, and so on are varied.
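As an illustration of how load balancing might reduce overhead on heterogeneous processors, the sketch below assigns work in proportion to relative processor speed. This is one simple scheme chosen for illustration, not necessarily the technique used in this thesis. The function `partition_work`, the host names, and the speed ratings are all hypothetical.

```python
from __future__ import annotations


def partition_work(total_units: int, speeds: dict[str, float]) -> dict[str, int]:
    """Split total_units across hosts in proportion to their relative speeds."""
    total_speed = sum(speeds.values())
    # Round each proportional share down to a whole number of work units.
    shares = {h: int(total_units * s / total_speed) for h, s in speeds.items()}
    # Hand the units lost to rounding to the fastest hosts, one unit each.
    leftover = total_units - sum(shares.values())
    for h in sorted(speeds, key=speeds.get, reverse=True)[:leftover]:
        shares[h] += 1
    return shares


# Assumed speed ratings: "fast" is four times as fast as "slow".
shares = partition_work(100, {"fast": 4.0, "medium": 2.0, "slow": 1.0})
print(shares)
```

With equal speed ratings the scheme reduces to an even split, so the same routine covers the homogeneous case.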