Experiences parallelizing, configuring, monitoring, and visualizing applications for clusters and multi-clusters
Advances in Parallel Computing
To make it simpler to experiment with the impact different configurations can have on the performance of a parallel cluster application, we developed the PATHS system. The PATHS system uses a "wrapper" to provide a level of indirection to the actual run-time location of data, making the data available from wherever threads or processes are located. A wrapper specifies where data is located, how to get there, and which protocols to use. Wrappers are also used to add or modify methods for accessing data.
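The wrapper idea above can be sketched as follows. This is an illustrative sketch only, not the real PATHS API: all class and method names (Storage, Wrapper, CachingWrapper, read, write) are assumptions introduced for the example.

```python
class Storage:
    """Stands in for the actual run-time location of the data."""
    def __init__(self):
        self._data = {}
    def read(self, key):
        return self._data.get(key)
    def write(self, key, value):
        self._data[key] = value

class Wrapper:
    """A level of indirection: forwards accesses to the next stage.
    Subclasses add or modify the methods for accessing data."""
    def __init__(self, next_stage):
        self.next = next_stage
    def read(self, key):
        return self.next.read(key)
    def write(self, key, value):
        self.next.write(key, value)

class CachingWrapper(Wrapper):
    """Example of modifying an access method: cache reads locally."""
    def __init__(self, next_stage):
        super().__init__(next_stage)
        self.cache = {}
    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.next.read(key)
        return self.cache[key]

# A "path" is one or more wrappers ending at the data; a section of a
# path (here, `shared`) can be shared among two or more paths.
store = Storage()
shared = CachingWrapper(store)   # shared path section
path_a = Wrapper(shared)         # path used by one thread or process
path_b = Wrapper(shared)         # path used by another

path_a.write("x", 42)
print(path_b.read("x"))          # both paths reach the same data
```

Because each path stage only knows about the next stage, paths can be reconfigured (e.g., to change protocols or placement) without changing the code that reads and writes the data.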
Wrappers are specified dynamically. A "path" is comprised of one or more wrappers. Sections of a path can be shared among two or more paths. By reconfiguring the LAM-MPI Allreduce operation we achieved performance gains of 1.52, 1.79, and 1.98 on two-, four-, and eight-way clusters respectively. We also measured the performance of the unmodified Allreduce operation when using two clusters interconnected by a WAN link with 30-50ms roundtrip latency. Configurations which resulted in multiple messages being sent across the WAN did not add any significant performance penalty to the unmodified Allreduce operation for packet sizes up to 4KB. For larger packet sizes the performance of the Allreduce operation deteriorated rapidly. To log and visualize the performance data we developed EventSpace, a configurable data collection, management, and observation system used for monitoring low-level synchronization and communication behavior of parallel applications on clusters and multi-clusters. Event collectors detect events, create virtual events by recording timestamped data about the events, and then store the virtual events to a virtual event space. Event scopes provide different views of the application by combining and pre-processing the extracted virtual events. Online monitors are implemented as consumers using one or more event scopes. Event collectors, event scopes, and the virtual event space can be configured and mapped to the available resources to improve monitoring performance or reduce perturbation. Experiments demonstrate that a wind-tunnel application instrumented with event collectors has insignificant slowdown due to data collection, and that monitors can reconfigure event scopes to trade off between monitoring performance and perturbation. The visual views we generated allowed us to detect anomalous communication behavior and load-balancing problems.
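The EventSpace flow described above (collector, virtual event space, event scope, consumer) can be sketched as a minimal example. All names here are assumptions made for illustration, not the actual EventSpace interface.

```python
import time

class VirtualEventSpace:
    """Holds the timestamped virtual events stored by collectors."""
    def __init__(self):
        self.events = []
    def store(self, event):
        self.events.append(event)

class EventCollector:
    """Detects events and stores timestamped virtual events."""
    def __init__(self, space, source):
        self.space = space
        self.source = source
    def collect(self, kind, payload):
        self.space.store({"t": time.time(), "src": self.source,
                          "kind": kind, "data": payload})

class EventScope:
    """Provides a view by filtering and pre-processing virtual events."""
    def __init__(self, space, kind):
        self.space = space
        self.kind = kind
    def view(self):
        selected = [e for e in self.space.events if e["kind"] == self.kind]
        return sorted(selected, key=lambda e: e["t"])

space = VirtualEventSpace()
collector = EventCollector(space, source="node-0")
collector.collect("send", {"bytes": 4096})
collector.collect("recv", {"bytes": 4096})

send_scope = EventScope(space, "send")  # an online monitor consumes this scope
print(len(send_scope.view()))
```

A monitor built on such scopes never touches the raw events directly, which is what allows collectors, scopes, and the event space to be remapped to different resources without changing the monitor.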
Introduction
A current trend in parallel and distributed computing is that compute- and I/O-intensive applications are increasingly run on cluster and multi-cluster architectures. As we add computing resources to a parallel application, one of the fundamental questions is how well the application scales. There are two main ways of scaling an application when processors are added: speedup, where the goal is to solve a problem faster, and scaling up the problem, where the goal is to solve a larger problem (or get a more fine-grained solution to a given problem) in a fixed time by adding computing resources (see also Amdahl [1] vs. Gustafson [13]). As the complexity and problem size of parallel applications and the number of nodes in clusters increase, communication performance becomes increasingly important. Of eight scalable scientific applications investigated in [29], most would benefit from improvements to MPI's collective operations, and half would benefit from improvements in point-to-point message overhead and reduced latency. Scaling an application when it is mapped onto different cluster and multi-cluster architectures involves controlling factors such as balancing the workload between the processes in the system, controlling inter-process communication latency, and managing interaction between the processes. In doing so, one of the main challenges is understanding how an application is mapped to the given architecture. This requires an understanding of which computations are done where, where data is located, and when control and data flow through the system. We show that the performance of collective operations improves by a factor of 1.98 by using better mappings.
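The two scaling notions contrasted above can be made concrete with the standard formulas from Amdahl [1] and Gustafson [13]. The parallel fraction of 0.95 and the eight processors below are illustrative assumptions, not figures from this work.

```python
def amdahl_speedup(p, n):
    """Fixed problem size: speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Scaled problem size: scaled speedup with parallel fraction p."""
    return (1.0 - p) + p * n

p, n = 0.95, 8
print(round(amdahl_speedup(p, n), 2))     # fixed-size speedup, bounded by 1/(1-p)
print(round(gustafson_speedup(p, n), 2))  # scaled speedup, grows linearly in n
```

The contrast is the point: with a fixed problem size the serial fraction caps the achievable speedup, whereas growing the problem with the machine keeps added processors useful.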