Transposing Arrays on Multicomputers Using de Bruijn Sequences

Paul N. Swarztrauber
1998 Journal of Parallel and Distributed Computing  
Transposing an N × N array that is distributed row- or column-wise across P = N processors is a fundamental communication task that requires time-consuming interprocessor communication. It is the underlying communication task for the fast Fourier transform of long sequences and multi-dimensional arrays. It is also the key communication task for certain weather and climate models. A parallel transposition algorithm is presented for hypercube and mesh connected multicomputers with programmable
... orks. The optimal scheduling of network transmissions is not unique and is known to be nontrivial. Here, scheduling is determined by a single de Bruijn sequence of N bits. The elements in each processor are first preordered and then, in groups of log₂ P adjacent elements, either transmitted or not transmitted, depending on the corresponding bit in the de Bruijn sequence. The algorithm is optimal both in overall time and in the time that any individual element is in the network. The results are extended to other communication tasks including shuffles, bit reversal, index reversal, and general index-digit permutation. The case P ≠ N and rectangular arrays with non-power-of-two dimensions are also discussed. Algorithms for mesh connected multicomputers are developed by embedding the hypercube in the mesh. The optimal implementation of the algorithms requires certain architectural features that are not currently available in the marketplace.

However, single-port algorithms use only a fraction (log₂ P)⁻¹ of the total hypercube bandwidth and, as discussed below, a proportionate performance gain can be obtained with subsequent all-port algorithms, which assume that all ports and channels can be active simultaneously. The performance of communication algorithms is generally stated for one of two communication systems: packet-based systems, which are in common use, and element-based systems like those used in the machines built by Thinking Machines Corporation. For an element-based system, the time required to transmit l elements on a single channel is τl for all l. Saad and Schultz [13] call τ the elemental transfer time. For a packet-based system, the time required to transmit a packet of l elements on a single channel is γl + β, where β is the latency and γ⁻¹ is the channel bandwidth. The optimum time for an all-port all-to-all personalized communication (AAPC) on a hypercube with an element-based communication system is τN/2 [1, 4, 8, 9, 14, 20].
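A simple counting argument makes the τN/2 figure plausible. The sketch below is not taken from the paper. It assumes P = N, that every processor holds one element destined for each processor, that an element crosses one hypercube channel for each bit in which its source and destination addresses differ, and that every directed channel carries at most one element per τ. The function name is hypothetical.

```python
def transpose_time_lower_bound(P: int) -> float:
    """Lower bound on transpose time, in units of tau, for P = N processors.

    Assumption (not from the paper): total channel crossings divided by the
    number of directed hypercube channels bounds the time from below.
    """
    d = P.bit_length() - 1              # hypercube dimension, log2(P)
    if P < 2 or (1 << d) != P:
        raise ValueError("P must be a power of two and at least 2")

    # Element held by processor p and destined for q needs popcount(p XOR q) hops.
    total_hops = sum(bin(p ^ q).count("1") for p in range(P) for q in range(P))

    # Each of the P processors has d outgoing channels.
    directed_channels = P * d
    return total_hops / directed_channels


print(transpose_time_lower_bound(8))    # 4.0, i.e. N/2 for N = 8
print(transpose_time_lower_bound(16))   # 8.0, i.e. N/2 for N = 16
```

Under these assumptions the bound equals N/2 because the average Hamming distance between two d-bit addresses is d/2, so an algorithm that runs in τN/2 keeps every channel busy on every cycle.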
The optimum time with a packet-based communication system is γN/2 + β log₂ P [9, 14]. Ho and Johnsson [8] provided optimal and near-optimal algorithms depending on the value of N. Edelman [4] followed with an algorithm that is optimal for all values of N. Later, Ho and Johnsson [9] provided an optimal algorithm for all N that also has optimal (minimal) span equal to log₂ P. Span is the maximum time that any element spends in the network. Bertsekas et al. [1] provide optimal algorithms for several basic communication tasks. Varvarigos and Bertsekas [21] use additive matrix decomposition to develop a class of optimal algorithms for isotropic communication tasks, i.e., combinations of task and architecture that are symmetric about any node. This includes transposition on both hypercubes and wraparound mesh interconnection networks (tori).

The algorithm presented here also requires optimal time τN/2 using element-based communication and γN/2 + β log₂ P using packet-based communication. It also has optimal span log₂ P. It differs from previous work in the way that the transmissions are scheduled, which is also a key difference among much of the earlier work referenced above. In previous work, elements are explicitly scheduled using a table that specifies the channels that each element must traverse on any given communication cycle. If the communication task is homogeneous, the schedules for any processor p are usually derived from the schedule for p = 0 using the symmetry of the task and the interconnection network. Here, scheduling is determined by a single de Bruijn sequence of N bits. The elements are preordered in each processor and, in groups of log₂ P adjacent elements, either transmitted or not, depending on the corresponding bit in the de Bruijn sequence. The de Bruijn sequence is the same for all processors and all communication cycles. The packet-based version of the algorithm gathers the appropriate elements and transmits them as a packet.
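To make the object behind the schedule concrete, the following sketch generates a binary de Bruijn sequence of N = 2^n bits and checks its defining property, namely that every n-bit pattern occurs exactly once as a cyclic window. It uses the standard Lyndon-word construction, which is an assumption here: this excerpt does not say which de Bruijn sequence the paper uses, and the preordering and the mapping of element groups to hypercube channels are not reproduced. All function names are hypothetical.

```python
def de_bruijn_bits(n: int) -> list[int]:
    """Binary de Bruijn sequence of length 2**n (standard Lyndon-word construction)."""
    a = [0] * (n + 1)
    seq: list[int] = []

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])   # append a Lyndon word whose length divides n
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, 2):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq


def has_all_windows(bits: list[int], n: int) -> bool:
    """True if the 2**n cyclic windows of length n are pairwise distinct."""
    N = len(bits)
    windows = {tuple(bits[(i + k) % N] for k in range(n)) for i in range(N)}
    return N == 2 ** n and len(windows) == N


bits = de_bruijn_bits(3)                 # N = 8 bits
print(bits)                              # [0, 0, 0, 1, 0, 1, 1, 1]
print(has_all_windows(bits, 3))          # True

# Illustration only: locations i whose bit is 1, read as "transmit" in the text above.
transmit_starts = [i for i, b in enumerate(bits) if b == 1]
print(transmit_starts)                   # [3, 5, 6, 7]
```

Because exactly half of the bits in a binary de Bruijn sequence are 1, the mask marks N/2 starting locations, which matches the N/2 factor in the optimal time.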
The element-based transpose algorithm is developed in the next section, together with the packet-based version that follows as a corollary. In section 3 the results are extended to mesh connected multicomputers and several related communication tasks. The paper is summarized in section 4, which includes a brief description of architectural features that permit optimal implementation.

Transposing arrays on hypercubes

In this section we first develop the transposition algorithm for a hypercube with an element-based communication system. The packet-based version then follows as a corollary at the end of the section. Briefly, the transpose algorithm consists of three parts: (a) the elements in each processor are preordered; (b) the elements are then scheduled and transmitted: for i = 0, ..., N−1, log₂ P consecutive elements, beginning at location i, are either transmitted simultaneously or not, depending on whether the i-th bit in a de Bruijn sequence is 1 or 0; (c) the elements in each processor are
doi:10.1006/jpdc.1998.1476 fatcat:uf42yxexjnh23nscfets2pmcvi