Effects of communication latency, overhead, and bandwidth in a cluster architecture

Richard P. Martin, Amin M. Vahdat, David E. Culler, Thomas E. Anderson
1997 SIGARCH Computer Architecture News  
This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines result in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 µs. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.

This work focuses on a high performance cluster architecture, for which a fast Active Message layer has been developed for a low latency, high bandwidth network. We want to quantify the performance impact of our communication enhancements on applications and to understand whether they have gone far enough. Furthermore, we want to understand which aspects of communication performance are most important. The main contributions of this work are (i) a reproducible empirical apparatus for measuring the effects of variations in communication performance for clusters, (ii) a methodology for a systematic investigation of these effects, and (iii) an in-depth study of application sensitivity to latency, overhead, and bandwidth, quantifying application performance in response to changes in communication performance. Our approach is to determine application sensitivity to machine communication characteristics by running a benchmark suite on a large cluster in which the communication layer has been modified to allow the latency, overhead, per-message bandwidth, and per-byte bandwidth to be adjusted independently. This four-parameter characterization of communication performance is based on the LogP model [2, 14], the framework for our systematic investigation of the communication design space.
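To make the four-parameter characterization concrete, a small cost model in the spirit of LogP can be sketched as follows. The parameter values below are illustrative only, not measurements from the paper; the model charges a send and receive overhead o per message, a network transit latency L, and a gap g that limits the injection rate (the reciprocal of per-message bandwidth).

```python
# Illustrative LogP-style cost model. Parameter values in the example
# are hypothetical, not figures reported in the paper.
def small_message_time(L, o):
    """End-to-end time for one small message: send overhead,
    network transit latency, then receive overhead."""
    return o + L + o

def message_stream_time(n, L, o, g):
    """Time for a sender to push n small messages: the gap g (or the
    overhead o, whichever is larger) limits the injection rate; the
    final message still pays L plus the receive overhead to arrive."""
    return (n - 1) * max(o, g) + o + L + o

# Example with o = 3 us, L = 5 us, g = 6 us (all in microseconds):
print(small_message_time(5, 3))          # 11 us for one message
print(message_stream_time(10, 5, 3, 6))  # 65 us for a burst of ten
```

Varying one parameter at a time in such a model mirrors the paper's experimental methodology of adjusting each communication characteristic independently.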
By adjusting these parameters, we can observe changes in the execution time of applications on a spectrum of systems ranging from the current high-performance cluster to conventional LAN-based clusters. We measure a suite of applications with a wide range of program characteristics, e.g., coarse-grained vs. fine-grained and read-based vs. write-based, to enable us to draw conclusions about the effect of communication characteristics on classes of applications. Our results show that, in general, applications are most sensitive to communication overhead. This effect can easily be predicted from communication frequency. The sensitivity to message rate and data transfer bandwidth is less pronounced and more complex. Applications are least sensitive to the actual network transit latency, and the effects are qualitatively different from those exhibited for the other parameters. Overall, the trends indicate that the efforts to improve communication performance pay off. Further improvements will continue to improve application performance. However, these efforts should focus on reducing overhead.

We believe that there are several advantages to our approach of running real programs with realistic inputs on a flexible hardware prototype that can vary its performance characteristics. The interactions influencing a parallel program's overall performance can be very complex, so changing the performance of one aspect of the system may cause subtle changes to the program's behavior. For example, changing the communication overhead may change the load balance, the synchronization behavior, the contention, or other aspects of a parallel program. By measuring the full program on a modified machine, we observe the summary effect of the complex underlying interactions.
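The claim that overhead sensitivity can be predicted from communication frequency suggests a simple linear model: if the critical-path process sends m messages and each message is charged an extra Δo of overhead, runtime grows by roughly m·Δo. A minimal sketch of that reasoning, with hypothetical numbers chosen only to show the scale of the effect:

```python
# Hypothetical linear sensitivity model: runtime grows by one added
# overhead per message sent on the critical path. The numbers below
# are illustrative, not the paper's measured data.
def predicted_runtime(base_time, msgs_on_critical_path, delta_overhead):
    """base_time: runtime at baseline overhead (seconds);
    msgs_on_critical_path: messages sent by the slowest process;
    delta_overhead: added per-message overhead (seconds)."""
    return base_time + msgs_on_critical_path * delta_overhead

# E.g. a 1.0 s baseline run sending 600,000 messages, with overhead
# raised by 100 us per message:
t = predicted_runtime(1.0, 600_000, 100e-6)
print(f"{t:.1f} s")  # roughly 61 s, i.e. a slowdown of about 60x
```

Under such a model, a fine-grained application communicating every few microseconds is dominated by overhead, which is consistent with the large slowdowns the study reports when overhead is raised by two orders of magnitude.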
Also, we are able to run applications on realistic input sizes, so we escape the difficulties of attempting to scale the machine parameters down to levels appropriate for the small problems feasible on a simulator and then extrapolating to the real case [45]. These issues have driven a number of efforts to develop powerful simulators [38, 39], as well as to develop flexible hardware prototypes [24]. The drawback of a real system is that it is best suited to investigating design points that are "slower" than the base hardware. Thus, to perform the study we must use a prototype communication layer and network hardware with better performance than what is generally available. We are then able to scale back the performance to observe the "slowdown" relative to the initial, aggressive design point. By observing the slowdown as a function of network performance, we can extrapolate back from the initial design point to more aggressive hypothetical designs. We have constructed such an apparatus for clusters using commercially available hardware and publicly available research software.

The remainder of the paper is organized as follows. After providing the necessary background in Section 2, Section 3 describes the experimental setup and our methodology for emulating designs with a range of communication performance. In addition, we outline a microbenchmarking technique to calibrate the effective communication characteristics of our experimental apparatus. Section 4 describes the characteristics of the applications in our benchmark suite and reports their overall communication requirements, such as message frequency, and baseline performance on sample input sets. Section 5 shows the effects of varying each of the four LogP communication parameters for our applications and, where possible, builds simple models to explain the results. Section 6 summarizes some of the related work and Section 7 presents our conclusions.
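Calibrating effective communication characteristics is typically done with microbenchmarks such as a ping-pong test, in which the average round-trip time of a small request/reply pair bounds the sum of latency and overheads (roughly 2L + 4o under LogP). The sketch below is a generic illustration of that idea, not the paper's Active Message apparatus; the loopback transport at the bottom is a stand-in for a real network layer.

```python
# Sketch of a ping-pong calibration microbenchmark. The transport
# functions passed in are stand-ins, not the paper's communication layer.
import time
from collections import deque

def ping_pong(send, recv, iters=10000):
    """Average round-trip time of a small request/reply pair.
    Under LogP a round trip costs roughly 2L + 4o, so combining this
    with a separate measurement of overhead o isolates latency L."""
    t0 = time.perf_counter()
    for _ in range(iters):
        send(b"ping")
        recv()
    return (time.perf_counter() - t0) / iters

# Stand-in in-process loopback transport, for illustration only:
q = deque()
rtt = ping_pong(q.append, q.popleft)
print(f"avg RTT: {rtt * 1e6:.2f} us")
```

On a real network, separate tests (e.g., timing a send with independent work overlapped behind it) would be needed to distinguish overhead from transit latency, which is the kind of decomposition the calibration section of the paper addresses.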
doi:10.1145/384286.264146