Combining FT-MPI with H2O: Fault-Tolerant MPI Across Administrative Boundaries

D. Kurzyniec, V. Sunderam
19th IEEE International Parallel and Distributed Processing Symposium  
We observe increasing interest in aggregating geographically distributed, heterogeneous resources to perform large-scale computations. MPI remains the most popular programming paradigm for such applications; however, as the size of computing environments increases, fault tolerance becomes critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We propose to overcome this limitation by combining FT-MPI with the H2O resource sharing framework. Our approach allows users to run fault-tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement of resource providers.
doi:10.1109/ipdps.2005.141 dblp:conf/ipps/KurzyniecS05 fatcat:fma5puvcibcmpmcgsmp6pwsczq