Modeling parallel bandwidth
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures - SPAA '97
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., bsp, logp, and qsm) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., pram(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems such as broadcasting, parity, summation and sorting, and give several new upper and lower time bounds that demonstrate the advantage of globally-limited models over locally-limited models given the same aggregate bandwidth (i.e., p · (1/g) = m). In general, globally-limited models have a possible advantage whenever there is an imbalance in the number of messages sent/received by the processors. To exploit this advantage, the processors must schedule the sending of messages so as to respect the aggregate bandwidth limit. We present a new parallel scheduling algorithm for globally-limited models that enables an unknown, arbitrarily-unbalanced set of messages to be sent through the limited bandwidth within a (1 + ε) factor of the optimal offline schedule w.h.p., even if the penalty for overloading the network is an exponential function of the overload. We also present a near-optimal algorithm for the case where long messages must be sent as flits in consecutive time steps, as well as for the case where new messages to be sent arrive dynamically over an infinite time line. These results consider both message passing (distributed memory) and shared memory scenarios, and improve upon the best results for the locally-limited model by a factor of Θ(g). Finally, we present results quantifying the power of concurrent reads in a globally-limited bandwidth setting, including showing an Ω((p lg m)/(m lg p)) time separation between the exclusive-read and the concurrent-read pram(m) models, which, when m ≪ p, greatly improves upon the Θ(√(lg p)) separation known previously.

Recently there has been an increasing interest in high-level models for general-purpose parallel computing that account for the bandwidth limitations in communication networks.
Some models, such as the well-studied bsp [45] and logp [19] models, and the shared-memory qsm [24] model, assume that the primary bandwidth limitation in the network is captured by a local restriction on the rate at which an individual processor can send or receive messages. In the bsp model, processors communicate through h-relations, in which each processor sends and receives at most h messages, at a cost of g·h, where g is a bandwidth parameter. In the logp model, processors are charged an overhead o to send or receive a message and can only send a message every g steps. The qsm [24] is a shared-memory model with a bandwidth parameter g at each processor, i.e., a processor can issue a request to shared memory only once every g steps. Thus in these models, a large value for the parameter g models a per-processor restriction on network bandwidth.

Other models, such as the pram(m) model [40], assume that the primary bandwidth limitation in the network is captured by a global restriction on the rate at which messages can traverse the network. In the pram(m) model, there are m memory cells that can be used to communicate between the processors. A value for the parameter m that is much smaller than the number of processors models an aggregate restriction on network bandwidth. The logp model also provides a capacity constraint on the network, but this is modeled as a per-processor restriction bounding the number of messages simultaneously in transit to or from any one processor.

Whether a local or a global bandwidth limitation is more suitable depends on the communication network of the machine being modeled. Local bandwidth limitations seem more suitable for networks in which each processor has access to its "share" of the network bandwidth and no more. Also, if the primary bandwidth bottleneck is in the processor-network interface, then bandwidth should be modeled on a per-processor basis.
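The two charging schemes can be contrasted with a small sketch. This is our own illustration, not part of any model's formal definition: it ignores latency and overhead terms (e.g., L and o), and charges a bsp-style round g·h for the largest per-processor message count h, versus a pram(m)-style round ⌈total/m⌉ steps for the aggregate message count.

```python
from math import ceil

def bsp_round_cost(sent: list[int], g: int) -> int:
    # Local (bsp-style) limit: a round in which some processor
    # sends h = max(sent) messages is an h-relation, costing g * h.
    return g * max(sent)

def pram_m_round_cost(sent: list[int], m: int) -> int:
    # Global (pram(m)-style) limit: at most m messages can cross
    # the network per step, so ceil(total / m) steps are needed.
    return ceil(sum(sent) / m)

# An unbalanced round among p = 16 processors: one processor
# sends 100 messages, the other 15 send one each.
sent = [100] + [1] * 15

# Same aggregate bandwidth in both models: p * (1/g) = 16/4 = 4 = m.
print(bsp_round_cost(sent, g=4))    # 400
print(pram_m_round_cost(sent, m=4)) # 29
```

The imbalance is exactly what the abstract points to: the local model charges for the busiest processor, while the global model lets idle processors' unused bandwidth absorb the hot spot.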
Global bandwidth limitations seem more suitable for networks in which processors can "steal" unused bandwidth, by routing along alternative paths. If the primary network bottleneck is the bisection bandwidth, and this bandwidth can be divided among any subset of the processors, then bandwidth should be modeled on an aggregate basis.

As an example of the impact of local versus global bandwidth restrictions, consider the problem of a single processor sending a distinct message to each of the p − 1 other processors (one-to-all personalized communication [34]). Suppose that a processor can send at most one message per time step. Then with a per-processor bandwidth parameter g > 1, bandwidth restrictions impose a lower bound of g(p − 1) time. On the other hand, with an aggregate bandwidth parameter m, the bandwidth is not the bottleneck for any m ≥ 1, and we have a lower bound of only p − 1.

Contributions of this paper. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor limitation (locally-limited) versus an aggregate limitation (globally-limited). For concreteness, we consider the following four models:

- The bsp model [45], a message-passing model with a per-processor bandwidth parameter g, denoted in this paper as the bsp(g) model.
- The qsm model [25], a shared-memory model with a per-processor bandwidth parameter g, denoted in this paper as the qsm(g) model.
- The bsp(m) model (defined in this paper), similar to the bsp(g) but with an aggregate bandwidth parameter m.