Jesper Larsson Träff
2022 Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures  
We consider the following problem. In a network of p processors, one designated root processor has n indivisible blocks of data that have to be broadcast (transmitted) to all other processors. The processors are fully connected, and in one communication operation, a processor can simultaneously receive one block of data from one other processor and send one block of data to one other, possibly different, processor. This is the 1-ported, fully connected, bidirectional send-receive model [1],
for which the well-known broadcast lower bound is n − 1 + ⌈log₂ p⌉ communication rounds. A round-optimal broadcast schedule reaches this lower bound and explicitly specifies, for each processor and each round, which of the n blocks are to be sent to and received from which other processors. In a homogeneous, linear-cost communication model, where transferring a divisible message of m units between any two processors takes α + βm units of time, this lower bound gives a broadcast time of α⌈log₂ p − 1⌉ + 2√(⌈log₂ p − 1⌉αβm) + βm by dividing m appropriately into n blocks. It is well known that the problem can be solved optimally when p is a power of 2, and many hypercube and butterfly algorithms exist, e.g., [7]. These algorithms are typically difficult to generalize to arbitrary numbers of processors; an exception is the appealing hypercube-based construction in [6]. The preprocessing required for the schedule construction is O(log p) time steps per processor, but different processors have different roles and different communication patterns in the construction. A different, explicit, optimal construction with a symmetric communication pattern was given
doi:10.1145/3490148.3538560 fatcat:uye46lptjngy7p4uz7x66okrxy
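
The closed-form broadcast time quoted in the abstract can be checked numerically. The sketch below is not taken from the paper: the function names are hypothetical, and it assumes each of the n − 1 + ⌈log₂ p⌉ rounds costs α + βm/n when the message is split into n equal blocks. Minimizing that product over a real-valued n gives n = √((⌈log₂ p⌉ − 1)βm/α), which yields the stated bound. With an integer block count the time can only be equal or larger.

```python
import math


def ceil_log2(p: int) -> int:
    """Return ceil(log2(p)) for p >= 1 using exact integer arithmetic."""
    return (p - 1).bit_length()


def broadcast_time(n: int, p: int, m: float, alpha: float, beta: float) -> float:
    """Time of a round-optimal broadcast with n equal blocks.

    Assumption: every one of the n - 1 + ceil(log2 p) rounds
    transfers one block of m / n units at cost alpha + beta * m / n.
    """
    return (n - 1 + ceil_log2(p)) * (alpha + beta * m / n)


def best_block_count(p: int, m: float, alpha: float, beta: float) -> float:
    """Real-valued n that minimizes broadcast_time (set the derivative in n to zero)."""
    q1 = ceil_log2(p) - 1
    return math.sqrt(q1 * beta * m / alpha)


def closed_form(p: int, m: float, alpha: float, beta: float) -> float:
    """alpha*(q-1) + 2*sqrt((q-1)*alpha*beta*m) + beta*m with q = ceil(log2 p)."""
    q1 = ceil_log2(p) - 1
    return alpha * q1 + 2 * math.sqrt(q1 * alpha * beta * m) + beta * m


if __name__ == "__main__":
    p, m, alpha, beta = 17, 100.0, 1.0, 1.0   # illustrative values only
    print(best_block_count(p, m, alpha, beta))  # 20.0
    print(broadcast_time(20, p, m, alpha, beta))  # 144.0
    print(closed_form(p, m, alpha, beta))  # 144.0
```

For these illustrative values the optimal block count happens to be an integer, so the round-by-round cost and the closed form coincide exactly. In general the block count has to be rounded, and the closed form is a lower bound on what the formula with integer n achieves.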