Measuring the Impact of Memory Errors on Application Performance

Mark Gottscho, Mohammed Shoaib, Sriram Govindan, Bikash Sharma, Di Wang, Puneet Gupta
2017 IEEE Computer Architecture Letters
Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has focused on the performance overheads of memory fault-tolerance schemes when errors do not occur at all, and when detected but uncorrectable errors occur, which result in machine downtime and loss of availability. We focus on a common third scenario, namely, situations when hard but correctable faults exist in memory; these may cause an "avalanche" of errors to occur on affected hardware. We expose how the hardware/software mechanisms for managing and reporting memory errors can cause severe performance degradation in systems suffering from hardware faults. We inject faults in DRAM on a real cloud server and quantify the single-machine performance degradation for both batch and interactive workloads. We observe that for SPEC CPU2006 benchmarks, memory errors can slow down average execution time by up to 2.5×. For an interactive web-search workload, average query latency degrades by up to 2.3× for a light traffic load, and up to an extreme 3746× under peak load. Our analyses of the memory error-reporting stack reveal architecture, firmware, and software opportunities to improve performance consistency by mitigating the worst-case behavior on faulty hardware.
doi:10.1109/lca.2016.2599513 fatcat:lmnmtq2zdjdm5fak2zaieyzjsi