CrashTest'ing SWAT: Accurate, gate-level evaluation of symptom-based resiliency solutions

A. Pellegrini, R. Smolinski, L. Chen, X. Fu, S. K. S. Hari, J. Jiang, S. V. Adve, T. Austin, V. Bertacco
2012 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently, researchers have proposed low-cost reliability solutions that detect hardware faults by monitoring software-level symptoms. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated through microarchitecture-level simulations that it can provide high fault coverage and a Silent Data Corruption (SDC) rate of under 0.5% for both permanent and transient hardware faults for all but one hardware component studied. More accurate evaluations of SWAT require tests on an industry-strength processor, a commercial operating system, unmodified applications, and accurate low-level fault models. In this paper, we propose an FPGA-based evaluation platform that provides the software, hardware, and fault model accuracy to verify symptom-based fault detection schemes. Our platform targets an OpenSPARC T1 processor design running a commercial operating system, OpenSolaris, and leverages CrashTest, an accurate gate-level fault analysis framework, to model gate-level permanent faults. Furthermore, we modified the OpenSPARC core to support hardware checkpoint and restore, making a large volume of experiments feasible. With this platform we provide results for 30,620 fault injection experiments across the major components of the OpenSPARC T1 design, running five SPECInt 2000 benchmarks. With an overall conservative estimate of the SDC rate of 0.94%, the results are similar to previous microarchitecture-level evaluations of SWAT and are encouraging for the effectiveness of symptom-based software detectors.
doi:10.1109/date.2012.6176660 dblp:conf/date/PellegriniSCFHJAAB12 fatcat:k4ali6yqpreabmkigkvobjkcgy