A systematic methodology to develop resilient cache coherence protocols

Konstantinos Aisopos, Li-Shiuan Peh
2011 Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11  
Aggressive transistor scaling continues to increase integration capacity with each new technology node, but technology downscaling also increases the vulnerability of semiconductor devices and causes silicon failures. Thus, fault-tolerant architectures are emerging to guarantee reliable functionality on unreliable silicon. While tolerating faults within a processor core has been extensively researched, the many-core era introduces the challenge of reliable on-chip communication in Chip Multi-Processors (CMPs). In CMP systems, an unreliable interconnection network can lose or corrupt coherence messages, causing the entire chip to deadlock. In this work, we argue for a system-level resiliency solution to tolerate an unreliable underlying Network-on-Chip (NoC). We introduce a systematic methodology to transform a coherence protocol to a resilient one, by extending its Finite State Machine (FSM) with safe states and incorporating additional handshaking messages into transactions. The modified protocol ensures coherent and reliable transactions over any lossy NoC. Our approach is generic and can be applied to a wide range of protocols. It requires minimal hardware modifications and introduces only a slight performance overhead (an average of 0.8% during fault-free operation, and 1.9% even at an aggressive fault rate of one fault per msec).
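The abstract does not spell out the protocol changes, so the following is only an illustrative sketch of the general idea and not the authors' design. It assumes a requester that remains in a hypothetical `WAIT_ACK` "safe" state until a handshake arrives, resends after a simulated timeout, and a responder that ignores duplicate requests. All names (`LossyChannel`, `Directory`, `run_transaction`) and the fixed drop pattern are invented for illustration.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    WAIT_ACK = auto()  # assumed "safe" state: the request is kept until acknowledged
    DONE = auto()


class LossyChannel:
    """Drops messages following a fixed, repeatable pattern (True means drop)."""

    def __init__(self, drop_pattern):
        self._pattern = list(drop_pattern)
        self._sent = 0

    def deliver(self, message):
        drop = self._pattern[self._sent] if self._sent < len(self._pattern) else False
        self._sent += 1
        return None if drop else message


class Directory:
    """Responder that tolerates resent requests by remembering handled ids."""

    def __init__(self):
        self.handled = set()
        self.grants = 0

    def on_request(self, txn_id):
        if txn_id not in self.handled:
            self.handled.add(txn_id)
            self.grants += 1  # the state change is applied only once
        return ("ACK", txn_id)  # duplicates are simply acknowledged again


def run_transaction(txn_id, channel, directory, max_attempts=10):
    """Resend the request until an acknowledgment gets through the channel."""
    state = State.IDLE
    attempts = 0
    while state is not State.DONE and attempts < max_attempts:
        attempts += 1
        state = State.WAIT_ACK
        request = channel.deliver(("REQ", txn_id))
        reply = None
        if request is not None:
            reply = channel.deliver(directory.on_request(request[1]))
        if reply == ("ACK", txn_id):
            state = State.DONE
        # A missing reply models a timeout: loop around and resend.
    return state, attempts


if __name__ == "__main__":
    # Drop the first request and the first acknowledgment, then deliver everything.
    channel = LossyChannel([True, False, True, False, False])
    directory = Directory()
    print(run_transaction(7, channel, directory), directory.grants)
```

In this toy model a lost request or a lost acknowledgment costs one extra round trip instead of leaving the requester waiting forever, which is the deadlock scenario the abstract describes.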
doi:10.1145/2155620.2155627 dblp:conf/micro/AisoposP11 fatcat:72myyyh7ufbihhrfvzmfvlxsfq