A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2011; you can also visit the original URL.
The file type is
In order to adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. Common fault tolerance techniques in distributed systems are normally achieved with checkpoint-recovery and job replication on alternative resources, in cases of a system outage. The first approach depends on the system's MTTR while the latter approach depends on the availability of alternative sites to run replicas. There is a need for complementing these approachesdoi:10.1109/clustr.2005.347043 dblp:conf/cluster/LimayeLGSELC05 fatcat:gnefnaeno5bhxly53rxcu5k4sm