Modeling and optimization of non-blocking checkpointing for optimistic simulation on myrinet clusters

Francesco Quaglia, Andrea Santoro
2003 Proceedings of the 17th annual international conference on Supercomputing - ICS '03  
Checkpointing and Communication Library (CCL) is a recently developed software package that implements CPU-offloaded checkpointing in support of optimistic parallel simulation on myrinet clusters. Specifically, CCL implements a non-blocking execution mode for the memory-to-memory data copy associated with checkpoint operations, based on the data transfer capabilities of a programmable DMA engine on board myrinet network cards. Re-synchronization between CPU and DMA activities must sometimes be employed for several reasons, such as maintenance of data consistency, thus adding some overhead to (otherwise CPU cost-free) non-blocking checkpoint operations. In this paper we present a cost model for non-blocking checkpointing and derive a performance-effective re-synchronization semantic, which we call minimum cost re-synchronization (MC). With this semantic, an occurrence of re-synchronization either commits an on-going DMA-based checkpoint operation (causing suspension of CPU activities) or aborts the operation (with a possible increase in the expected rollback cost due to a reduced number of committed checkpoints), on the basis of a minimum overhead expectation evaluated through the cost model. We have implemented MC within CCL, and we also report experimental results demonstrating the performance benefits of this optimized re-synchronization semantic, in terms of increased execution speed, for a Personal Communication System (PCS) simulation application.
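The commit-or-abort choice described above can be illustrated with a minimal sketch. It is not the paper's cost model, which the abstract does not spell out: the function name, its parameters, and the simple linear estimate of the abort penalty are all hypothetical, chosen only to show the shape of a minimum-expected-overhead decision.

```python
def mc_resync_decision(remaining_copy_time, rollback_probability, extra_rollback_cost):
    """Pick the cheaper action when CPU and DMA must re-synchronize.

    All three inputs are hypothetical quantities, assumed to share one time unit:
      remaining_copy_time  - time the CPU would stall while the DMA copy finishes
      rollback_probability - estimated chance that a rollback would have needed
                             the checkpoint being taken
      extra_rollback_cost  - additional rollback time if that checkpoint is missing
    """
    # Committing suspends the CPU until the in-flight checkpoint copy completes.
    commit_cost = remaining_copy_time
    # Aborting costs nothing now, but raises the expected cost of a later rollback.
    abort_cost = rollback_probability * extra_rollback_cost
    return "commit" if commit_cost <= abort_cost else "abort"


if __name__ == "__main__":
    # A short stall is cheaper than the expected rollback penalty.
    print(mc_resync_decision(5.0, 0.5, 20.0))
    # A long stall outweighs the expected rollback penalty.
    print(mc_resync_decision(15.0, 0.5, 20.0))
```

Under these assumed inputs, a checkpoint that is nearly copied tends to be committed, while one that has barely started tends to be aborted unless a rollback is likely and expensive.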
doi:10.1145/782832.782834 fatcat:5uzhq3oo7vge5geggf6wgvfvlm