Hardware performance counters for system reliability monitoring

Elena Woo Lai Leng, Mark Zwolinski, Basel Halak
2017 IEEE 2nd International Verification and Security Workshop (IVSW)
As technology scaling reaches nanometre scales, the error rate due to variations in temperature and voltage, single event effects and component degradation increases, making components less reliable. To ensure that a system continues to function correctly in the face of known reliability issues, it must have the means to detect errors caused by faults. A system that behaves normally (no error detected in the system) exhibits a [...] profile, and any deviation from this profile indicates an anomaly in the system. In this paper, we propose to use hardware performance counters (HPCs) to measure events that occur during the execution of a program. We explore the various counters available that could be used to identify anomalous behaviour in the system, and we develop a methodology to observe anomalies using HPCs by creating a fault-free pattern and observing any subsequent changes in that pattern. We evaluate the proposed technique using GemFI, an architectural simulator based on Gem5 with additional fault injection capabilities. We compare the results obtained at the end of the execution with data collected during a time interval. Our results show that HPCs can be used to identify anomalous behaviour in a system that would lead to failure.
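The idea of building a fault-free pattern and watching for changes can be illustrated with a minimal sketch. It is not the authors' implementation: the counter names, the sample values, and the tolerance rule (k standard deviations with a floor of `min_tol`) are all assumptions chosen for illustration, and the counts would in practice come from the simulator or the hardware rather than from literals.

```python
from statistics import mean, pstdev

def build_profile(golden_runs):
    """Summarise fault-free runs as {counter: (mean, std)}."""
    counters = golden_runs[0].keys()
    profile = {}
    for c in counters:
        values = [run[c] for run in golden_runs]
        profile[c] = (mean(values), pstdev(values))
    return profile

def find_anomalies(profile, sample, k=3.0, min_tol=1.0):
    """Return {counter: deviation} for counters outside the tolerance band.

    The band is max(k * std, min_tol); both k and min_tol are
    hypothetical tuning parameters, not values from the paper.
    """
    anomalies = {}
    for c, (mu, sigma) in profile.items():
        tol = max(k * sigma, min_tol)
        if abs(sample[c] - mu) > tol:
            anomalies[c] = sample[c] - mu
    return anomalies

# Hypothetical counter readings from three fault-free executions.
golden_runs = [
    {"instructions": 1000, "branch_mispredicts": 50, "dcache_misses": 20},
    {"instructions": 1000, "branch_mispredicts": 52, "dcache_misses": 20},
    {"instructions": 1000, "branch_mispredicts": 51, "dcache_misses": 20},
]
profile = build_profile(golden_runs)

# One reading that matches the profile and one that does not.
normal = {"instructions": 1000, "branch_mispredicts": 51, "dcache_misses": 20}
faulty = {"instructions": 1500, "branch_mispredicts": 51, "dcache_misses": 20}

normal_result = find_anomalies(profile, normal)
faulty_result = find_anomalies(profile, faulty)
```

The same check could be applied either once to the totals at the end of execution or repeatedly to counts sampled per time interval, with one profile per interval in the latter case.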
doi:10.1109/ivsw.2017.8031548 dblp:conf/ivsw/LengZH17 fatcat:xadhlfsr2vcdrbegwlcz5rtlzy