Balancing reliability, cost, and performance tradeoffs with FreeFault

Dong Wan Kim, Mattan Erez
2015 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)  
Memory errors have been a major source of system failures and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearlyfree resilience mechanism that is implemented entirely within a processor and can
more » ... olerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly-reliable and highly-available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability. Memory Controller (MC) #(0~N)
doi:10.1109/hpca.2015.7056053 dblp:conf/hpca/KimE15 fatcat:mcj3zvmqkrdatdziyvexyof6w4