Evaluating the Error Resilience of Parallel Programs
2014
2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
As a consequence of increasing hardware fault rates, HPC systems face significant challenges in terms of reliability. Evaluating the error resilience of HPC applications is an essential step for building efficient fault-tolerant mechanisms for these applications. In this paper, we propose a methodology to characterize the resilience of OpenMP programs using fault-injection experiments. We find that the error resilience of OpenMP applications depends on the program structure and thread model;
doi:10.1109/dsn.2014.73
dblp:conf/dsn/FangPRG14
fatcat:emiejqiiknbgfd5vfemxxlh4ve