Addressing failures in exascale computing

Marc Snir, Robert W Wisniewski, Jacob A Abraham, Sarita V Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A Chien, Paul Coteus (+16 others)
2014 The International Journal of High Performance Computing Applications
Executive Summary

The current approach to resilience for large high-performance computing (HPC) machines is based on global application checkpoint/restart. The state of each application is checkpointed periodically; if the application fails, then it is restarted from the last checkpoint. Preserving this approach is highly desirable because it requires no change in application software. The success of this method depends crucially on the following assumptions:

1. The time to checkpoint is much smaller than the mean time before failure (MTBF).
2. The time to restart (which includes the time to restore the system to a consistent state) is much smaller than the MTBF.
3. The checkpoint is correct: errors that could corrupt the checkpointed state are detected before the checkpoint is committed.
4. Committed output data is correct (output is committed when it is read).

It is not clear that these assumptions are currently satisfied. In particular, can one ignore silent data corruptions (SDCs)? (A back-of-the-envelope illustration of assumption 1 follows below.)
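The weight of assumption 1 can be made concrete with Young's classical approximation for the near-optimal checkpoint interval, T_opt ≈ sqrt(2 · C · MTBF), where C is the time to write one checkpoint. The sketch below is not from the paper; the checkpoint cost and MTBF values are hypothetical, chosen only to show how checkpoint overhead grows as the MTBF shrinks toward C.

```c
/* Back-of-the-envelope sketch (not from the paper): Young's approximation
 * for the near-optimal checkpoint interval, T_opt = sqrt(2 * C * MTBF).
 * The checkpoint cost and MTBF values below are hypothetical. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double checkpoint_time = 600.0;                /* C: 10 min to write one checkpoint */
    const double mtbf[] = { 86400.0, 21600.0, 3600.0 };  /* MTBF: 1 day, 6 hours, 1 hour */

    for (size_t i = 0; i < sizeof mtbf / sizeof mtbf[0]; ++i) {
        double t_opt = sqrt(2.0 * checkpoint_time * mtbf[i]);
        /* Fraction of wall-clock time spent writing checkpoints alone,
         * ignoring the rework lost after each failure: C / (T_opt + C). */
        double overhead = checkpoint_time / (t_opt + checkpoint_time);
        printf("MTBF = %8.0f s -> T_opt = %7.0f s, checkpoint overhead ~ %4.1f%%\n",
               mtbf[i], t_opt, 100.0 * overhead);
    }
    return 0;
}
```

With these hypothetical numbers, an MTBF of one hour against a ten-minute checkpoint pushes the interval down to roughly half an hour and the machine spends over a fifth of its time checkpointing, which is why the report considers RAM-based and hybrid checkpointing.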
It is clear that satisfying these assumptions will be harder in the future, for the following reasons:

• MTBF is decreasing faster than disk checkpoint time.
• MTBF is decreasing faster than recovery time, especially recovery from global system failures.
• Silent data corruptions may become too frequent, and errors will not be detected in time.
• The output of the application may be used in real time.

Each of these obstacles can be overcome in a different way: (1) we can checkpoint in RAM rather than on disk; (2) we can build global operating systems that fail less frequently or recover faster; (3) we can design hardware with lower SDC rates or, alternatively, use software to detect or tolerate SDCs; and (4) we can use replication for the relatively rare real-time supercomputing applications. The different approaches are associated with different costs, risks, and uncertainties; we do not have enough information to choose one approach now. Therefore, we considered the following three design points: (1) business as usual, (2) system-level resilience, and (3) application-level resilience.

Design point 1: Business as usual. This approach continues to use global checkpoint/restart. Hybrid checkpoint methods (using DRAM or NVRAM, as well as disk) can provide fast checkpoint and application restart times and can accommodate failure rates that are an order of magnitude higher than today's failure rates. The additional power consumption is low, but the acquisition cost of platforms will rise because of the need for additional memory. Two key technologies are needed for this approach to be feasible: (1) low SDC frequency (the same as now) and (2) low frequency of system failures, or an order-of-magnitude improvement in system recovery time. Maintaining the current rate of hardware SDCs seems possible at the expense of less than 20% additional silicon and energy, and vendor research can further lower this overhead. However, supercomputing needs both low power and a low SDC rate. It is not clear that there is a large market for this combination; hence it is not clear that this combination will appear in lower-cost volume products. Silent hardware errors can also be masked in software; the simple approach is to duplicate computations and compare results (a sketch of this approach follows below). Since most compute time and compute energy are spent moving data, a good hardware/software combination should enable the duplication of computation at a cost that is much less than a factor of two.

• Application codes are becoming more complex. Multiphysics and multiscale codes couple an increasingly large number of distinct modules. Data assimilation, simulation, and analysis are coupled into increasingly complex workflows. Furthermore, the need to reduce communication, tolerate asynchrony, and tolerate failures results in more complex algorithms. The more complex libraries and application codes are more error-prone. Software error rates are discussed in more detail in Section 4.
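The "duplicate computations and compare results" approach mentioned above can be illustrated with a short sketch. This is a generic illustration rather than the paper's mechanism; the kernel, vector size, and tolerance are hypothetical.

```c
/* Sketch of software SDC detection by duplication (generic illustration,
 * not the paper's mechanism): run the same kernel twice on the same input
 * and flag a possible silent data corruption if the results disagree. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical compute kernel: sum of squares of a vector. */
static double kernel(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * x[i];
    return s;
}

int main(void) {
    enum { N = 1000000 };
    double *x = malloc(N * sizeof *x);
    if (!x) return 1;
    for (size_t i = 0; i < N; ++i)
        x[i] = (double)i / N;

    /* Execute twice; a mismatch beyond a small tolerance suggests an SDC
     * in one of the two executions (it does not tell us which one). */
    double r1 = kernel(x, N);
    double r2 = kernel(x, N);
    if (fabs(r1 - r2) > 1e-12 * fabs(r1)) {
        fprintf(stderr, "possible silent data corruption detected\n");
        free(x);
        return 2;   /* in practice: recover, e.g. recompute or roll back */
    }
    printf("results agree: %.15g\n", r1);
    free(x);
    return 0;
}
```

A disagreement only signals that one of the two executions went wrong; deciding which one, and recovering, still requires a third execution or a rollback to a checkpoint.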
6 Applicable Technologies

The solution to the problem of resilience at exascale will require a synergistic use of multiple hardware and software technologies:

• Avoidance: for reducing the occurrence of errors
• Detection: for detecting errors as soon as possible after their occurrence
• Containment: for limiting the impact of errors
• Recovery: for overcoming detected errors
• Diagnosis: for identifying the root cause of a detected error
• Repair: for repairing or replacing failed components

We discuss potential hardware approaches in Section 3 and potential software solutions to resilience in Section 5.
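As a minimal illustration of how detection and recovery can compose around an in-memory checkpoint (a generic sketch, not a design from the report; the update step and sanity check are hypothetical):

```c
/* Generic sketch of detection + recovery (not the report's design):
 * an iterative computation keeps its last good state in RAM and rolls
 * back to it whenever a sanity check on the new state fails. */
#include <math.h>
#include <stdio.h>
#include <string.h>

#define N 1000

static int state_is_sane(const double *s) {
    /* Hypothetical detector: reject NaN/Inf values in the state. */
    for (int i = 0; i < N; ++i)
        if (!isfinite(s[i])) return 0;
    return 1;
}

int main(void) {
    double state[N] = {0}, checkpoint[N];
    memcpy(checkpoint, state, sizeof state);   /* initial in-RAM checkpoint */

    for (int step = 0; step < 100; ++step) {
        /* One hypothetical update step. */
        for (int i = 0; i < N; ++i)
            state[i] += 0.01 * (i + 1);

        if (!state_is_sane(state)) {
            /* Recovery: restore the last known-good state and retry. */
            memcpy(state, checkpoint, sizeof state);
            fprintf(stderr, "step %d: error detected, rolled back\n", step);
            continue;
        }
        /* Commit: the new state becomes the checkpoint. */
        memcpy(checkpoint, state, sizeof state);
    }
    printf("final state[0] = %g\n", state[0]);
    return 0;
}
```

Containment in this sketch is implicit: a corrupted state never overwrites the checkpoint, so a detected error cannot propagate past the last committed step.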
doi:10.1177/1094342014522573