Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data [chapter]

Ravishankar K. Iyer, Zbigniew Kalbarczyk
2002 Lecture Notes in Computer Science  
The discussion in this paper focuses on the issues involved in analyzing the availability of networked systems using fault injection and the failure data collected by the logging mechanisms built into the system. In particular, we address: (1) analysis in the prototype phase using physical fault injection into an actual system. We use the example of a fault injection-based evaluation of a software-implemented fault tolerance (SIFT) environment (built around a set of self-checking processes called
... that provides error detection and recovery services to spaceborne scientific applications and (2) measurement-based analysis of systems in the field. We use the example of a LAN of Windows NT-based computers to present methods for collecting and analyzing failure data to characterize network system dependability. Both fault injection and failure data analysis enable us to study naturally occurring errors and to provide feedback to system designers on potential availability bottlenecks. For example, the study of failures in a network of Windows NT machines reveals that most of the problems that lead to reboots are software related and that, although the average availability evaluates to over 99%, a typical machine, on average, provides acceptable service only about 92% of the time.

3. Heap injections. The third set of experiments further broadens the failure scenarios by injecting errors into the dynamic heap data to maximize the possibility of error propagation. The results from these experiments are especially useful in evaluating how well intraprocess self-checks limit error propagation.

REE computational model. The REE computational model consists of a trusted, radiation-hardened (rad-hard) Spacecraft Control Computer (SCC) and a cluster of COTS processors that execute the SIFT environment and the scientific applications. The SCC schedules applications for execution on the REE cluster through the SIFT environment.

REE testbed configuration. The experiments were executed on a 4-node testbed consisting of PowerPC 750 processors running the Lynx real-time operating system. Nodes in the testbed are connected through 100 Mbps Ethernet. Between one and two megabytes of RAM on each processor were set aside to emulate local nonvolatile memory available to each node.
The nonvolatile RAM is expected to store temporary state information that must survive hardware reboots (e.g., checkpointing information needed during recovery). Nonvolatile memory visible to all nodes is emulated by a remote file system residing on a Sun workstation that stores program executables, application input data, and application output data.

SIFT Environment for REE

The REE applications are protected by a SIFT environment designed around a set of self-checking processes called ARMORs (Adaptive Reconfigurable Mobile Objects of Reliability) that execute on each node in the testbed. ARMORs control all operations in the SIFT environment and provide error detection and recovery to the application and to the ARMOR processes themselves. We provide a brief summary of the ARMOR-based SIFT environment as implemented for the REE applications; additional details of the general ARMOR architecture appear in [13].
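
The idea behind the heap injection experiments described above (corrupt dynamic data, then see whether a self-check catches the error before it propagates) can be illustrated with a small sketch. This is not the authors' injector and not ARMOR code: a Python `bytearray` stands in for heap data, a CRC-32 recorded before the injection stands in for an intraprocess self-check, and the names `flip_random_bit`, `self_check`, and `run_experiment` are hypothetical.

```python
import random
import zlib
from typing import Tuple


def flip_random_bit(buf: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one randomly chosen bit of buf in place.

    Returns (byte_index, bit_index) so the injection can be logged.
    """
    byte_index = rng.randrange(len(buf))
    bit_index = rng.randrange(8)
    buf[byte_index] ^= 1 << bit_index
    return byte_index, bit_index


def self_check(buf: bytearray, expected_crc: int) -> bool:
    """Assumed self-check: does buf still match the CRC recorded earlier?"""
    return zlib.crc32(buf) == expected_crc


def run_experiment(num_injections: int, size: int = 256, seed: int = 0) -> int:
    """Inject one bit flip per run and count how many runs the check detects."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(num_injections):
        # Simulated heap-resident state with arbitrary contents.
        state = bytearray(rng.randrange(256) for _ in range(size))
        crc = zlib.crc32(state)      # recorded while the state is known good
        flip_random_bit(state, rng)  # the injected error
        if not self_check(state, crc):
            detected += 1
    return detected


if __name__ == "__main__":
    runs = 100
    print(f"detected {run_experiment(runs)} of {runs} injected errors")
```

A CRC detects every single-bit error, so this toy check catches all injections. A real campaign targets the memory of a running process and also has to account for errors that stay latent or escape the checks, which this sketch does not model.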
doi:10.1007/3-540-45798-4_13 fatcat:uce7va7k6nerjehped7tlvxab4