Supervised Workpools for Reliable Massively Parallel Computing [chapter]

Robert Stewart, Phil Trinder, Patrick Maier
2013 Lecture Notes in Computer Science  
The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPC-GAP project we aim to coordinate symbolic computations in architectures with 10^6 cores. At that scale, failures are a real issue. Functional languages are well known for their advantages for both parallelism and reliability, e.g. stateless computations can be scheduled and replicated freely. This paper presents a software-level reliability mechanism, namely supervised fault-tolerant workpools implemented in a Haskell DSL for parallel programming on distributed-memory architectures. The workpool hides task scheduling, failure detection and task replication from the programmer. To the best of our knowledge, this is a novel construct. We demonstrate how to abstract over supervised workpools by providing fault-tolerant instances of existing algorithmic skeletons. We evaluate the runtime performance of these skeletons both in the presence and absence of faults, and report low supervision overheads.
doi:10.1007/978-3-642-40447-4_16 fatcat:5j5e2ciccndhjo6lsogekogbxe