Software-Implemented Hardware Fault Tolerance
[book]
2006
This technical report contains the text of Nahmsuk Oh's thesis "Software Implemented Hardware Fault Tolerance." Abstract: Transient errors in computer systems can cause abnormal behavior and degrade system reliability, data integrity and availability. This is especially true in a space environment, where transient errors are a major cause of concern. Fault avoidance techniques such as radiation hardening and shielding have been the major approaches to obtaining the required reliability.
Recently, unhardened Commercial Off-The-Shelf (COTS) components have been investigated for space applications because of their higher density, faster clock rate, lower power consumption and lower price. Since COTS components are not radiation hardened, and it is desirable to avoid shielding, Software-Implemented Hardware Fault Tolerance (SIHFT) has been proposed to increase the data integrity and availability of COTS systems. This dissertation presents three new SIHFT techniques for error detection: Control Flow Checking by Software Signatures (CFCSS), Error Detection by Duplicated Instructions (EDDI), and Error Detection by Diverse Data and Duplicated Instructions (ED4I). Previously studied software techniques are either inadequate or require assistance from special hardware, but CFCSS, EDDI and ED4I are pure software methods. In CFCSS, signatures are embedded into the program during compilation and compared with run-time signatures during execution. In EDDI, instructions are duplicated at compile time and scheduled by exploiting Instruction-Level Parallelism (ILP) to reduce performance overhead. CFCSS and EDDI detect transient errors but not permanent faults. In ED4I, however, a program is compiled into a new program with diverse data so that it can detect a permanent fault. Our fault injection experiment simulating bit flips in memory shows that, for the designs simulated, EDDI provides over 98% fault coverage without any extra hardware. Because of instruction duplication, code size overhead is approximately 100%, but by exploiting ILP, we reduce the performance overhead to 61% on average. In a control flow checking experiment simulating branching faults, CFCSS provides 97% fault coverage. In addition, when we duplicate programs or instructions, we can use ED4I to enhance data integrity in the system. Furthermore, for space experiments, we have implemented EDDI and CFCSS in sort and FFT programs running on the ARGOS satellite.
During a 136-day period, our techniques detected 198 of 203 errors, an error detection coverage of 98%. While traditional error detection and fault tolerance techniques require special dedicated hardware, our SIHFT techniques use time redundancy for error detection and significantly improve data integrity without requiring special hardware. Acknowledgements: I express my deep gratitude to my advisor, Professor Edward J. McCluskey, for his guidance and support during my study at Stanford. He modeled the high-quality teaching and research that I aspire to follow in my career. He taught me how to solve research problems and clearly present results. He encouraged me when I faced difficulties in research as well as in life. Many things I learned from him will be of great help to me.
doi:10.1007/0-387-32937-4
fatcat:yl6wjopjozbrjejarqxdy3sjz4