The reverse-acceleration model for programming petascale hybrid systems

S. Pakin, M. Lang, D. J. Kerbyson
2009 IBM Journal of Research and Development  
Current technology trends favor hybrid architectures, typically with each node in a cluster containing both general-purpose and specialized accelerator processors. The typical model for programming such systems is host-centric: the general-purpose processor orchestrates the computation, offloading performance-critical work to the accelerator, and data are communicated only among general-purpose processors. In this paper, we propose a radically different hybrid-programming approach, which we call the reverse-acceleration model. In this model, the accelerators orchestrate the computation, offloading work that cannot be accelerated to the general-purpose processors. Data are communicated among accelerators, not among general-purpose processors. Our thesis is that the reverse-acceleration model simplifies porting codes to hybrid systems and facilitates performance optimization. We present a case study of a legacy neutron-transport code that we modified to use reverse acceleration and ran across the full 122,400 cores (general-purpose plus accelerator) of the Los Alamos National Laboratory Roadrunner supercomputer. Results indicate a substantial performance improvement over the unaccelerated version of the code.
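As a rough illustration of the control inversion the abstract describes, the C sketch below shows an accelerator-side main loop that drives the computation, exchanges data with peer accelerators, and reverse-offloads a non-acceleratable step to the host. All function names here are hypothetical stand-ins invented for this summary, not the paper's actual API; on Roadrunner these roles would run on SPEs and PPEs rather than in a single process.

```c
/*
 * Hypothetical sketch of the reverse-acceleration control flow: the
 * accelerator (SPE) code owns the outer loop and offloads only
 * non-acceleratable work to the host (PPE).  Every function below is
 * an illustrative stub, not part of the paper's code.
 */
#include <stdio.h>

/* Accelerated kernel: in the real code this runs on the SPEs. */
static void accelerated_sweep_step(int iter) {
    printf("SPE: sweep step %d\n", iter);
}

/* Peer exchange: in the real code, surfaces move SPE-to-SPE. */
static void exchange_surfaces_with_peer_spes(int iter) {
    printf("SPE: exchanging surfaces, iteration %d\n", iter);
}

/* Non-acceleratable work (e.g., I/O) reverse-offloaded to the host. */
static void host_service_request(int iter) {
    printf("PPE: servicing request from iteration %d\n", iter);
}

int main(void) {
    /* The accelerator, not the host, orchestrates the computation. */
    for (int iter = 0; iter < 3; ++iter) {
        accelerated_sweep_step(iter);
        exchange_surfaces_with_peer_spes(iter);
        if (iter % 2 == 0)          /* occasional non-acceleratable step */
            host_service_request(iter);
    }
    return 0;
}
```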
Comparison of the accelerator and reverse-acceleration versions of Sweep3D:

Attribute | Accelerator version [18] | Reverse-acceleration version [43]
Code running on SPEs | Inner I loop | Entire program
Communication type | No inter-SPE communication | Intra-socket, inter-socket, and cross-cluster SPE communication
Data movement | Volume (I line and dependencies) moved twice | Surfaces moved directly to the SPEs that require them
PPE involvement | Controls SPE workers; manages data | Minimal involvement, and only internal to the CML message-passing library
Scale explored (cores) | Single socket (1 PPE + 8 SPEs) | Full Roadrunner (12,240 Opteron cores + 12,240 PPEs + 97,920 SPEs)

[...] performance of the hardware. Next, we show performance measurements of the lowest-level software communication layers, and then we present the CML performance. After completing the presentation of communication performance, we contrast the single-socket performance of the accelerator and reverse-acceleration versions of Sweep3D that were described in the previous section. Finally, we provide Sweep3D scaling data to demonstrate how well the reverse-acceleration model can perform at an extreme scale.

Primitive performance

Figure 5 plots the data rates achievable across each core boundary. Horizontal lines indicate the theoretical peak data rate across each type of interconnect in the Roadrunner system: the EIB [32] for intra-Cell SPE-to-SPE communication (i.e., within a Cell socket), FlexIO [33] for inter-Cell SPE-to-SPE communication (i.e., between Cell sockets on the same blade), the memory interface controller (MIC) [32] for SPE-to-PPE communication, the PCIe bus for PPE-to-Opteron communication [34], and the InfiniBand network for Opteron-to-Opteron communication [26]. Points on the curves represent measured data. Memory flow controller (MFC) DMA commands [43] are used for SPE-to-SPE and SPE-to-main-memory data transfers. These commands provide a Put/Get interface and were [...]
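The Put/Get interface mentioned above corresponds to the MFC DMA intrinsics in the Cell SDK's spu_mfcio.h. The minimal SPU-side sketch below assumes that SDK; BUFFER_SIZE, TAG, and the effective addresses are illustrative placeholders, and the paper's CML layer is not shown. It issues a Get, blocks until the tag group completes, and then issues the matching Put.

```c
/* SPU-side sketch of the MFC Put/Get interface, using the Cell SDK's
 * spu_mfcio.h intrinsics.  Compile with spu-gcc from the Cell SDK. */
#include <spu_mfcio.h>

#define BUFFER_SIZE 4096                    /* multiple of 16 bytes */
#define TAG         3                       /* DMA tag group, 0-31  */

static char buffer[BUFFER_SIZE] __attribute__((aligned(128)));

/* Pull a block from an effective address (e.g., another SPE's local
 * store mapped into the effective-address space, or main memory). */
static void dma_get(uint64_t ea_in) {
    mfc_get(buffer, ea_in, BUFFER_SIZE, TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);           /* select our tag group  */
    mfc_read_tag_status_all();              /* block until complete  */
}

/* Push the block back out; Put mirrors Get. */
static void dma_put(uint64_t ea_out) {
    mfc_put(buffer, ea_out, BUFFER_SIZE, TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}
```

Blocking on mfc_read_tag_status_all after every transfer keeps the sketch simple; production Cell codes typically defer the wait so that DMA overlaps with computation.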
doi:10.1147/jrd.2009.5429074