Cardio: Adaptive CMPs for reliability through dynamic introspective operation

Andrea Pellegrini, Valeria Bertacco
2011 IEEE International High Level Design Validation and Test Workshop
Current technology scaling enables the integration of tens of processing elements into a single chip, and future technology nodes will soon allow the integration of hundreds of cores per device. While these systems are very powerful, many experts agree that they will be prone to a significant number of permanent and transient faults during their lifetime. If not properly handled, the effects of runtime failures can be dramatic. In this work, we propose Cardio, a distributed architecture for reliable chip multiprocessors. Cardio, a novel approach for on-chip reliability, is based on hardware detectors that spot failures and on software routines that reorganize the system to work around faulty components. Compared to previous online reliability solutions, Cardio provides failure reactivity comparable to hardware-only reliable solutions while requiring a much lower area overhead. Cardio operates a distributed resource manager to collect health information about components and leverages a robust distributed control mechanism to manage system-level recovery. Our architecture is operational as long as at least one general-purpose processor is still functional in the chip. We evaluated our design using a custom simulator and estimate its runtime impact on the SPECMPI benchmarks to be lower than 3%. We estimate its dynamic reconfiguration time to be between 20 and 50 thousand cycles per failure. Hardware
doi:10.1109/hldvt.2011.6113983 dblp:conf/hldvt/PellegriniB11 fatcat:tshn2j45prc3hpovyuvrbcbihm