Efficient and flexible fault tolerance and migration of scientific simulations using CUMULVS
Proceedings of the SIGMETRICS symposium on Parallel and distributed tools - SPDT '98
Many practical scientific computer applications would benefit from a simple checkpointing mechanism that provides automatic restart or recovery in response to faults and failures, and enables dynamic load balancing and improved resource utilization using task migration. However, developing applications with such capabilities, especially in distributed, heterogeneous operating environments, is very challenging. CUMULVS is a middleware infrastructure for interacting with parallel scientific
... n programs and supports online visualization and computational steering. Using semantic information provided by user-level specifications of selected program variables, CUMULVS interprets distributed data decompositions across heterogeneous collections of computing resources. It extracts and assembles subsets of local decomposed application data to form global views of the data. The base CUMULVS system has been extended to provide a user-level mechanism that assists in the collection of checkpoints for parallel simulations or other calculations. Via the same semantic interface used to identify and describe data fields for visualization and parameters for steering, the user application selects the minimal program state necessary to restart or migrate an application task. The CUMULVS run-time system utilizes this information to efficiently recover fault-tolerant applications by restarting failed tasks. Application tasks can also be migrated, even across heterogeneous architecture boundaries, to achieve load balancing or to improve a task's locality with a required resource. CUMULVS handles the tedious and error-prone tasks involved, leaving the developer of fault-tolerant or migrating applications to focus on the application-specific design details. This paper describes the CUMULVS interface for checkpointing, the issues faced in utilizing this interface when developing fault-tolerant and migrating applications, and the direction of future research in this area.
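The idea of user-directed checkpointing described above can be illustrated with a minimal sketch. It is not the CUMULVS API: the names `CheckpointRegistry`, `HeatTask`, and `run_with_restart` are hypothetical, the example runs in a single process, and JSON stands in for an architecture-neutral encoding. It shows only the general pattern, in which the application registers the minimal state needed for restart, the run-time saves that state, and a replacement task is restored from it.

```python
import json


class CheckpointRegistry:
    """Hypothetical registry of user-selected state (not the CUMULVS API)."""

    def __init__(self):
        self._getters = {}
        self._setters = {}

    def register(self, name, getter, setter):
        # The application names each variable it needs for restart.
        self._getters[name] = getter
        self._setters[name] = setter

    def save(self):
        # Encode only the registered state in a text format that does not
        # depend on the host's byte order or word size.
        state = {name: get() for name, get in self._getters.items()}
        return json.dumps(state, sort_keys=True)

    def restore(self, blob):
        # Push each saved value back into the (possibly new) task.
        for name, value in json.loads(blob).items():
            self._setters[name](value)


class HeatTask:
    """Toy simulation task, assumed purely for illustration."""

    def __init__(self, n=4):
        self.step = 0
        self.temps = [0.0] * n
        self.scratch = [0.0] * n  # recomputed every step, so not checkpointed

    def advance(self):
        self.scratch = [t + 1.0 for t in self.temps]
        self.temps = list(self.scratch)
        self.step += 1

    def checkpoint_registry(self):
        reg = CheckpointRegistry()
        reg.register("step", lambda: self.step,
                     lambda v: setattr(self, "step", v))
        reg.register("temps", lambda: list(self.temps),
                     lambda v: setattr(self, "temps", list(v)))
        return reg


def run_with_restart():
    task = HeatTask()
    for _ in range(3):
        task.advance()
    blob = task.checkpoint_registry().save()

    # Simulate a failure: the original task is lost and a fresh task,
    # potentially on a different host, is restored from the checkpoint.
    replacement = HeatTask()
    replacement.checkpoint_registry().restore(blob)
    replacement.advance()
    return replacement.step, replacement.temps
```

Because `scratch` is never registered, it is excluded from the saved state, which mirrors the point that the application, rather than the system, decides what the minimal restart state is.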