Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems

Christian Engelmann, Thomas Naughton
2013 42nd International Conference on Parallel Processing, IEEE, 2013
xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the [...], and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
doi:10.1109/icpp.2013.114, dblp:conf/icpp/EngelmannN13, fatcat:onbxzxytazaktejzc7e77gvwge