Fault-tolerant dynamic parallel schedules
Laboratoire de systèmes périphériques
SECTION D'INFORMATIQUE

Acknowledgements

First and foremost, I would like to thank my research director, Roger Hersch, for providing me the opportunity to work in his lab. The work involved not only the research and development leading to this thesis, but also a wide array of other activities and research. I am particularly happy to have had the possibility to be the lecturer for the peripherals course, and also to have worked on the lab's fantastic Visible Human projects.

Next I would like to thank all my colleagues: those who were in the lab when I started, Jean-Christophe Bessaud, Marc Mazzariol, and Patrick Emmel; those who came and went, Edouard Forler, Fabien Collaud, and Emzar Panikashvili; those who are still here, Emin Gabrielyan, Sylvain Chosson, Itzhak Amidror, and Basile Schaeli; and also very particularly Fabienne Allaire. All these people contributed to making the lab an entertaining place to be.

I would also like to thank all the students who have worked on projects related to Dynamic Parallel Schedules (DPS). In chronological order: Olivier Bruchez, Simon Chatelain, and Dany Lauener, who worked on a parallel video processing application for editing high resolution uncompressed video files; Stéphane Magnenat and Luc-Olivier de Charrière, who implemented a parallel virtual biotope; Leonardo Peparolo and Lorenzo Cantelli, who developed a prototype Java binding for DPS; Nicolas Frey, who worked on parallel implementations of genetic algorithms; Pascal Jermini, who developed the Java version of the DPS trace analyzer tool; and Pierre Dumas, who implemented a parallel version of the T-Coffee multiple sequence alignment application with DPS.

Thanks also to all the students who took the parallel programming course at EPFL, suffering through the first versions of DPS, followed by smoother sailing in the later years.

Of course, I wouldn't be here if not for my family. I am therefore very grateful to my parents, Silke and Dieter Gerlach, for their unyielding support.

Greetings go out to all those I know in the demoscene, in particular you Calodox guys! Start doing stuff again!

Summary

Dynamic Parallel Schedules (DPS) is a high-level framework for developing parallel applications on distributed memory computers such as clusters of PCs. DPS applications are defined by using directed acyclic flow graphs composed of user-defined operations.
These operations derive from basic concepts provided by the framework: split, merge, leaf and stream operations. Whereas a simple parallel application can be expressed with a split-leaf-merge sequence of operations, flow graphs of arbitrary complexity can be created. DPS provides run-time support for dynamically mapping flow graph operations onto the nodes of a cluster. The flow graph based application description used in DPS allows the framework to offer many additional features, most of these transparently to the application developer. In order to maximize performance, DPS applications benefit from automatic overlapping of computations and communications and from implicit pipelining. The framework provides simple primitives for flow control and load balancing. Applications can integrate flow graph parts provided by other applications as parallel components. Since the mapping of DPS applications to processing nodes can be dynamically changed at runtime, DPS provides a basis for developing malleable applications. The DPS framework provides a complete fault tolerance mechanism based on the dynamic mapping capabilities, ensuring continued execution of parallel applications even in the presence of multiple node failures. DPS is provided as an open-source, cross-platform C++ library allowing DPS applications and services to run on heterogeneous clusters.
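
To make the split-leaf-merge idea concrete, the following is a minimal sketch of the pattern written with the C++ standard library only. It is not the DPS API: the names `Chunk`, `split_op`, `leaf_op`, `merge_op`, and `sum_of_squares` are hypothetical and chosen purely for illustration, and the leaf operations run as threads on a single machine rather than being mapped onto cluster nodes by a framework.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical data object handed from the split to a leaf: a range of the input.
struct Chunk {
    const std::vector<int>* data;
    std::size_t begin;
    std::size_t end;
};

// Split: divide the input into at most `parts` contiguous chunks.
std::vector<Chunk> split_op(const std::vector<int>& data, std::size_t parts) {
    assert(parts > 0);
    std::vector<Chunk> chunks;
    const std::size_t step = (data.size() + parts - 1) / parts;  // ceiling division
    for (std::size_t b = 0; b < data.size(); b += step) {
        chunks.push_back({&data, b, std::min(b + step, data.size())});
    }
    return chunks;
}

// Leaf: process one chunk independently of all others (here: a sum of squares).
long long leaf_op(const Chunk& c) {
    long long sum = 0;
    for (std::size_t i = c.begin; i < c.end; ++i) {
        const long long v = (*c.data)[i];
        sum += v * v;
    }
    return sum;
}

// Merge: wait for every partial result and combine them into the final result.
long long merge_op(std::vector<std::future<long long>>& partials) {
    long long total = 0;
    for (auto& p : partials) {
        total += p.get();
    }
    return total;
}

// Wire the three operations together: split -> parallel leaves -> merge.
long long sum_of_squares(const std::vector<int>& data, std::size_t parts) {
    std::vector<std::future<long long>> partials;
    for (const Chunk& c : split_op(data, parts)) {
        // Each leaf runs asynchronously in its own thread.
        partials.push_back(std::async(std::launch::async, leaf_op, c));
    }
    return merge_op(partials);
}
```

Calling `sum_of_squares(values, 4)` on a vector `values` splits it into up to four chunks, processes them concurrently, and merges the partial sums. In this sketch the three stages are wired together by hand; the summary above describes DPS as deriving the same structure from a flow graph, which is what lets the framework handle mapping, pipelining, and fault tolerance on the application's behalf.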