Optimizing HPC Fault-Tolerant Environment: An Analytical Approach
2010
2010 39th International Conference on Parallel Processing
The increasingly large ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the likelihood of failures. Performance under failures and its optimization have become timely and important issues facing the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and system failures on application performance, considering all the factors including workload, the
doi:10.1109/icpp.2010.80
dblp:conf/icpp/JinCZS10
fatcat:zaafiumudjdm7p4napqpthgicu