Active messages

Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser
1992 Proceedings of the 19th Annual International Symposium on Computer Architecture - ISCA '92
The design challenge for large-scale multiprocessors is (1) to minimize communication overhead, (2) to allow communication to overlap computation, and (3) to coordinate the two without sacrificing processor cost/performance. We show that existing message-passing multiprocessors have unnecessarily high communication costs. Research prototypes of message-driven machines demonstrate low communication overhead but poor processor cost/performance. We introduce a simple communication mechanism, Active Messages, show that it is intrinsic to both architectures, allows cost-effective use of the hardware, and offers tremendous flexibility. Implementations on nCUBE/2 and CM-5 are described and evaluated using a split-phase shared-memory extension to C, Split-C. We further show that Active Messages are sufficient to implement the dynamically scheduled languages for which message-driven machines were designed. With this mechanism, latency tolerance becomes a programming/compiling concern. Hardware support for Active Messages is desirable, and we outline a range of enhancements to mainstream processors.

… the overhead of communication is greatly reduced and an overlap of the two is easily achieved. In this paradigm, the hardware designer can meaningfully address what balance is required between processor and network performance.

Algorithmic communication model

The most common cost model used in algorithm design for large-scale multiprocessors assumes that the program alternates between computation and communication phases and that communication requires time linear in the size of the message, plus a start-up cost [9]. Thus, the time to run a program is T = T_compute + T_communicate, where T_communicate = N_c (T_s + L_c T_b); here T_s is the start-up cost, T_b is the time per byte, L_c is the message length, and N_c is the number of communications. To achieve 90% of peak processor performance, the programmer must tailor the algorithm to achieve a sufficiently high ratio of computation to communication that T_compute > 9 T_communicate. A high-performance network is required to minimize the communication time, yet it sits 90% idle! If communication and computation are overlapped, the situation is very different. The time to run a program becomes T = max(T_compute + N_c T_s, N_c L_c T_b). Thus, to achieve high processor efficiency, the communication and compute times need only balance, and the compute time need only swamp the communication overhead, i.e., T_compute >> N_c T_s.
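The two cost models above can be sketched directly in code. Here is a minimal Python illustration of the phased model T = T_compute + N_c(T_s + L_c T_b) versus the overlapped model T = max(T_compute + N_c T_s, N_c L_c T_b); the parameter values at the bottom are hypothetical, chosen only for illustration:

```python
def phased_time(t_compute, n_c, t_s, l_c, t_b):
    """Program alternates between compute and communicate phases:
    T = T_compute + N_c * (T_s + L_c * T_b)."""
    t_communicate = n_c * (t_s + l_c * t_b)
    return t_compute + t_communicate

def overlapped_time(t_compute, n_c, t_s, l_c, t_b):
    """Communication overlaps computation; the processor is charged
    only the start-up cost: T = max(T_compute + N_c*T_s, N_c*L_c*T_b)."""
    return max(t_compute + n_c * t_s, n_c * l_c * t_b)

# Hypothetical parameters: 1 s of compute, 1000 messages,
# 100 us start-up per message, 1 KB messages, 10 ns per byte.
t_compute, n_c, t_s, l_c, t_b = 1.0, 1000, 100e-6, 1024, 10e-9

print(phased_time(t_compute, n_c, t_s, l_c, t_b))      # 1.11024
print(overlapped_time(t_compute, n_c, t_s, l_c, t_b))  # 1.1
```

Note that with these (assumed) numbers, overlapping helps only modestly because the start-up term N_c T_s still dominates — which is exactly the paper's argument for driving the start-up cost down.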
By examining the average time between communication phases (T_compute / N_c) and the time for message transmission, one can easily compute the per-processor bandwidth through the network required to sustain a given level of processor utilization. The hardware can be designed to reflect this balance. The essential properties of the communication mechanism are that the start-up cost must be low and that it must facilitate the overlap and coordination of communication with on-going computation.

© 1992 ACM 0-89791-509-7/92/0005/0256 $1.50
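The balance calculation described above can be made concrete with a small sketch. Assuming one L_c-byte message must drain during each average compute interval T_compute / N_c, the sustained per-processor bandwidth must be at least L_c divided by that interval (a simplification of the paper's argument; the parameter values are hypothetical):

```python
def required_bandwidth(t_compute, n_c, l_c):
    """Per-processor network bandwidth (bytes/s) needed so that one
    L_c-byte message completes within the average compute interval
    T_compute / N_c between communication events."""
    avg_gap = t_compute / n_c  # average time between communications
    return l_c / avg_gap       # equivalently N_c * L_c / T_compute

# Hypothetical: 1 s of compute, 1000 messages of 1 KB each
print(required_bandwidth(1.0, 1000, 1024))  # 1024000.0 bytes/s, ~1 MB/s
```

A machine designer can invert this relation: given the network's sustainable bandwidth, it bounds the communication intensity (N_c L_c / T_compute) at which full processor utilization is still possible.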
doi:10.1145/139669.140382 dblp:conf/isca/EickenCGS92