Performance under failures of high-end computing

Ming Wu, Xian-He Sun, Hui Jin
2007 Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07  
Modern high-end computers are unprecedentedly complex. Faults are inevitable when solving large-scale applications on future Petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of the influence of faults on application performance is necessary to use existing fault-tolerant methods wisely. In this study, we first introduce some practical and effective performance models to predict the application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms to optimize application performance under system failures. Finally, extensive simulations and experiments are conducted to evaluate our prediction models and scheduling strategies with actual failure traces.
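To make the quantities named in the abstract concrete, the following is a minimal Monte Carlo sketch of how failure rate, repair time, checkpointing period, and checkpointing cost could combine into a completion time for a single sequential task. It is not the paper's model. The exponential failure and repair distributions, the rule that a failure discards all work since the last checkpoint, the absence of failures during repair, and the names `simulate_completion_time` and `mean_completion_time` are all assumptions made for illustration.

```python
import math
import random


def simulate_completion_time(work, checkpoint_period, checkpoint_cost,
                             failure_rate, mean_repair, rng):
    """Simulate one run and return its wall-clock completion time.

    Assumptions (not taken from the paper): failures arrive with
    exponential inter-arrival times, repair times are exponential,
    no failure occurs during repair, and a failure discards all
    progress made since the last completed checkpoint.
    """
    n_segments = math.ceil(work / checkpoint_period)
    elapsed = 0.0
    completed = 0  # number of segments safely finished

    def draw_failure():
        # With a zero failure rate the task never fails.
        if failure_rate <= 0:
            return float("inf")
        return rng.expovariate(failure_rate)

    time_to_failure = draw_failure()
    while completed < n_segments:
        is_last = completed == n_segments - 1
        if is_last:
            length = work - completed * checkpoint_period
        else:
            length = checkpoint_period
        # No checkpoint is written after the final segment.
        cost = length + (0.0 if is_last else checkpoint_cost)

        if time_to_failure >= cost:
            # Segment (and its checkpoint) finished before the next failure.
            elapsed += cost
            time_to_failure -= cost
            completed += 1
        else:
            # Failure: the partial segment is lost, then the node is repaired.
            elapsed += time_to_failure
            if mean_repair > 0:
                elapsed += rng.expovariate(1.0 / mean_repair)
            time_to_failure = draw_failure()
    return elapsed


def mean_completion_time(n_runs, seed, **params):
    """Average the simulated completion time over n_runs independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_completion_time(rng=rng, **params)
                for _ in range(n_runs))
    return total / n_runs
```

With a zero failure rate the result reduces to the work plus the checkpoint overhead, which gives a simple sanity check. Sweeping `checkpoint_period` for a fixed nonzero `failure_rate` would show the trade-off between checkpoint overhead and lost work under these assumptions.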
doi:10.1145/1362622.1362687 dblp:conf/sc/WuSJ07 fatcat:udidiazmc5dc5poiixo4ektqxm