Enabling a highly-scalable global address space model for petascale computing

Vinod Tipparaju, Edoardo Aprà, Weikuan Yu, Jeffrey S. Vetter
Proceedings of the 7th ACM International Conference on Computing Frontiers (CF '10), 2010
Over the past decade, the trajectory to the petascale has been built on increased complexity and scale of the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of computational science teams using these new systems. In this regard, Global Address Space (GAS) programming models provide a straightforward and easy-to-use addressing model, which can lead to improved productivity. However, the scalability of GAS depends
directly on the design and implementation of the runtime system on the target petascale distributed-memory architecture. In this paper, we describe the design, implementation, and optimization of the Aggregate Remote Memory Copy Interface (ARMCI) runtime library on the 2.3-petaflop Cray XT5 at Oak Ridge National Laboratory. We optimized our implementation with the flow intimation technique that we introduce in this paper. Our optimized ARMCI implementation improves the scalability of both the Global Arrays (GA) programming model and a real-world chemistry application, NWChem, from small jobs up through 180,000 cores.

Introduction

Systems with unprecedented computational power are continuously pushing the frontier of high-performance computing (HPC) [2]. Several sites have deployed systems that can perform 10^15 floating-point operations per second (one petaflop): the Cray XT5 (a.k.a. Jaguar) at Oak Ridge National Laboratory (ORNL), the IBM Cell-based system at Los Alamos National Laboratory (LANL), and the BlueGene/P at Forschungszentrum Juelich (FZJ). These facilities are used to solve important computational science problems in areas such as climate modeling, life sciences, and energy production. Yet many challenges in scientific productivity and application efficiency continue to plague these systems as they grow to unprecedented numbers of processes and levels of complexity.

In this regard, GAS (Global Address Space) programming models - both the Partitioned and the Asynchronous Partitioned Global Address Space variants - are being considered as an alternative model for programming these complex machines to improve productivity and application efficiency. Briefly, a GAS model provides an abstraction that allows threads to access the remote memory of other nodes as if they were accessing local node memory through hardware shared memory. By virtue of this abstraction, Partitioned Global Address Space languages such as Unified Parallel C (UPC) [3] and Co-Array Fortran (CAF) [11], and Global Address Space libraries such as the Global Arrays (GA) Toolkit [1], have the unique ability to expose features of the underlying hardware, such as low-overhead communication or native global address space support. Systems that lack one or more of these features typically deliver poor performance for these models.

Conceptually, Global Address Space (GAS) models do not differentiate between local and remote accesses. By contrast, Partitioned Global Address Space (PGAS) is a category of GAS models that requires applications to explicitly distinguish between local and remote memory accesses, while providing simple mechanisms for reading, writing, and synchronizing remote memory. One benefit of this explicit separation is that the user is forced to consider and optimize the performance of remote memory accesses, while leaving the optimization of local memory accesses to the compiler. Recently, a slightly different category of PGAS model, termed the Asynchronous Partitioned Global Address Space model, has emerged to add capabilities such as remote method invocation. IBM's X10 language [5] and Asynchronous Remote Methods (ARM) [25] in UPC have pioneered this new model.

All of the above-mentioned GAS languages and libraries use the services of an underlying communication library, which we refer to as the GAS runtime, to serve their communication needs. GAS languages normally use this runtime as a compilation target for data transfers on distributed-memory architectures.
They have a translation layer that translates a GAS access into a corresponding data transfer on the underlying system using the GAS runtime. Two example GAS runtime libraries are GASNet [7] and ARMCI [20], both of which are used by numerous Global Address Space languages and libraries. Latency-tolerating features in these runtime systems, such as non-blocking data transfers and message aggregation, enable the GAS languages and libraries to obtain the best possible, close-to-the-hardware performance on clusters.

In this paper, we demonstrate the scalability of a specific Global Address Space model, Global Arrays, by designing and implementing a highly scalable port of its GAS runtime, ARMCI. This scalable GA/ARMCI ultimately enables the scaling of a real scientific application (the electronic structure methods of the chemistry code NWChem) to 180,000 cores on the 2.3-petaflop Cray XT5 at Oak Ridge National Laboratory. Our design and implementation of the Aggregate Remote Memory Copy Interface (ARMCI) on the Cray XT5 hardware uses the Portals communication layer. To achieve this scalability, we introduce the concept of flow intimation, a unique and useful technique that enables us to achieve performance at scale while using limited buffer space for one-sided communication. This end goal of performance at scale influenced every step of the project, which aims to efficiently exploit all of the system's hardware components: the high-speed network, the aggregate memory size, and the multi-core processing nodes of the Cray XT5.

The rest of the paper is structured as follows: we start with an overview of the structure of the Global Arrays library in Section 2; we discuss the validation benchmarks we used and the connection setup details in Section 3; we describe the issues we faced in scaling this model in relation to the features of the physical network interconnect (SeaStar2+) and the lowest-level API used to program it (Portals) in Section 4, which also introduces flow intimation; and finally, we discuss the achieved performance in the context of NWChem in Section 5 and conclude with future steps in Section 6.

The Global Arrays (GA) library provides an efficient and portable GAS-style shared-memory programming interface for distributed-memory computers. Each process in a parallel program can asynchronously access logical blocks of physically distributed, dense, multi-dimensional arrays, without the need for explicit cooperation by other processes. GA is a unique GAS model that provides explicit functionality to distinguish between local and non-local data accesses, supports asynchronous data accesses, provides interfaces that translate to remote procedure calls, and naturally supports load balancing. GA is equipped with the ARMCI runtime system to support blocking and non-blocking contiguous, vector, and strided data transfers.

[Figure 1: The structure of the Global Arrays library on the Cray XT5]

The structure of GA is shown in Figure 1. The application (in our case NWChem) uses only the GA interfaces, a message-passing wrapper (to initialize the message-passing library), and the MA layer. The rest of the elements shown in the figure are not exposed to the user application. The highlighted area in the figure shows the primary components of GA: the Distributed Array (DA) layer, ARMCI, and the Memory Allocator (MA). MA provides simple interfaces to allocate "local" memory. We will further describe the DA layer, the GA programming interfaces, and ARMCI below.
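To make this structure concrete, the sketch below (ours, not taken from the paper) uses the GA C bindings for the usage pattern just described: collective array creation, a one-sided get of a patch of a distributed array into a local buffer, and a collective data-parallel operation. The array dimensions, MA stack/heap sizes, and patch bounds are illustrative assumptions.

```c
/* Minimal GA usage sketch (illustrative only; sizes are assumptions). */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);             /* GA requires a message-passing runtime */
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);   /* MA supplies "local" memory to GA      */

    int dims[2]  = {1000, 1000};
    int chunk[2] = {-1, -1};            /* let GA choose the data distribution   */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);   /* collective creation   */
    int g_b = GA_Duplicate(g_a, "B");
    int g_c = GA_Duplicate(g_a, "C");
    GA_Zero(g_a); GA_Zero(g_b); GA_Zero(g_c);

    /* One-sided access: fetch a 100x100 patch of the distributed array into a
     * local buffer.  The DA layer maps (lo,hi) to the owning processes and
     * issues the corresponding ARMCI transfers; no cooperation from the data
     * owners is required. */
    double buf[100 * 100];
    int lo[2] = {0, 0}, hi[2] = {99, 99}, ld[1] = {100};
    NGA_Get(g_a, lo, hi, buf, ld);

    /* Data-parallel (collective) operation: C = alpha*A + beta*B. */
    double alpha = 1.0, beta = 2.0;
    GA_Add(&alpha, g_a, &beta, g_b, g_c);

    GA_Destroy(g_a); GA_Destroy(g_b); GA_Destroy(g_c);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```

The Fortran interfaces quoted in the next section (ga_create, ga_get, ga_add) are the equivalents of the C calls used here.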
Distributed Array (DA)

DA is the layer in GA that realizes the virtually shared memory access and translates it into actual process/virtual-address information. A simple shared-memory style access to a section of a GA array can translate to multiple blocks of physically distributed data. This is the layer that gives the GA operations the information about the actual location of the data. Such a translation subsequently results in calls to the one-sided ARMCI interfaces.

Programming Interfaces in GA

GA provides a plethora of interfaces that operate on the array abstractions. Most of the interfaces are described in [17]. There are three main categories of GA interfaces of interest here: array creation, one-sided, and data parallel. All the GA interfaces have both C and Fortran bindings.

The array creation interfaces result in the creation of data structures that are later used by the Distributed Array layer. Subsequently, the ARMCI memory allocation interface is used to allocate the actual memory for the array. An example of a 2D-array creation interface in Fortran is: logical function ga_create(type, dim1, dim2, array_name, chunk1, chunk2, g_a). The memory allocation, the data structures, and their sizes need to be handled carefully.

The GA one-sided operations, after the necessary index translation through the DA layer, result in calls to the ARMCI one-sided API. Access to a GA segment via a one-sided operation may result in multiple non-blocking ARMCI function calls, depending on the distribution of the physical array. Very efficient, low-latency, non-blocking calls are therefore important for GA: with the number of ARMCI calls made in a typical NWChem run (discussed in Section 5), even a sub-microsecond saving in each call collectively amounts to a noticeable performance difference in the application. An example of a GA one-sided operation to get a section of a remote array into a local buffer is: subroutine ga_get(g_a, ilo, ihi, jlo, jhi, buf, ld).

GA data parallel operations are collective in nature and may translate into several ARMCI one-sided and atomic function calls issued simultaneously across all the involved processes. An example of a data parallel operation to scale and add two arrays g_a and g_b into a third array g_c in Fortran is: subroutine ga_add(alpha, g_a, beta, g_b, g_c). Since several ARMCI function calls may be made simultaneously at the scale of the entire system, controlling the flow of these messages is a critical problem to address. GA is optimized to overlap intra-node data transfers in shared memory with inter-node data transfers using non-blocking ARMCI calls.

The ARMCI Runtime System

GA uses ARMCI as its primary communication layer. Neither GA nor ARMCI can work without a message-passing library and elements of the execution environment (job control, process creation, interaction with the resource manager). ARMCI, in addition to being the underlying communication interface for GA, has been used to implement other communication libraries and compilers [11, 23]. ARMCI offers an extensive set of functionality in the area of RMA communication: 1) data transfer operations (Put, Get, Accumulate); 2) atomic operations; 3) memory management and synchronization operations; and 4) locks. Communication in most of the non-collective GA operations is implemented as one or more ARMCI communication operations. ARMCI supports blocking and non-blocking versions of contiguous, strided, and vector data transfer operations, along with read-modify-write operations.
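To illustrate this RMA functionality, the sketch below (ours, not from the paper) shows the kind of ARMCI calls that GA operations ultimately resolve to: collective allocation of remotely accessible memory, a blocking contiguous put, and a non-blocking get completed through a request handle. The buffer size and the choice of peer process are illustrative assumptions.

```c
/* Minimal ARMCI one-sided communication sketch (illustrative only). */
#include <mpi.h>
#include <stdlib.h>
#include "armci.h"

#define NBYTES (1024 * sizeof(double))

int main(int argc, char **argv)
{
    int me, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    ARMCI_Init();                        /* ARMCI sits on top of MPI here        */

    /* Collective allocation of remotely accessible memory: after the call,
     * ptrs[p] holds the address of process p's segment. */
    void **ptrs = malloc(nproc * sizeof(void *));
    ARMCI_Malloc(ptrs, NBYTES);

    double *local = malloc(NBYTES);
    for (int i = 0; i < 1024; i++) local[i] = (double)me;

    int peer = (me + 1) % nproc;         /* illustrative choice of target        */

    /* Blocking contiguous put into the peer's segment. */
    ARMCI_Put(local, ptrs[peer], NBYTES, peer);
    ARMCI_Fence(peer);                   /* wait for remote completion           */

    /* Non-blocking contiguous get, which can be overlapped with local work. */
    armci_hdl_t hdl;
    ARMCI_INIT_HANDLE(&hdl);
    ARMCI_NbGet(ptrs[peer], local, NBYTES, peer, &hdl);
    /* ... unrelated local computation could run here ... */
    ARMCI_Wait(&hdl);                    /* local buffer now holds the data      */

    ARMCI_Barrier();
    ARMCI_Free(ptrs[me]);
    free(ptrs); free(local);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}
```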
ARMCI uses the fastest available communication mechanism underneath.

Design of ARMCI One-Sided Communication for a Scalable GA Programming Model

All contiguous ARMCI Put/Get interfaces were implemented directly on top of the Portals Put and Get calls. The categories of one-sided calls in ARMCI that had to be considered during the design are: a) non-contiguous ARMCI Put/Get; b) accumulate; c) read-modify-write (RMW); and d) lock/unlock. In addition, there are the collective memory allocation operations that prepare memory for one-sided communication. We started with a naive solution (described in [27]) that translates all of the above-mentioned categories into multiple contiguous Portals calls. Several techniques for transmitting non-contiguous data are discussed in Tipparaju et al. [28], and all of them can be applied here. However, preliminary benchmarking (cf. Section 4.2) demonstrated that the server-based technique was ideal for non-contiguous and atomic operations on the XT5. In this technique, a communication helper thread is spawned on each node. One-sided messages that do not have a directly corresponding Portals call are packed and sent to the communication helper thread, which receives, unpacks, and processes the messages on behalf of all the processes on the node. For the rest of this discussion, all the application processes are referred to as clients and the communication helper thread is referred to as the CHT. Before discussing the CHT, we first describe how the CHT can access the memory of any client on the node.

[Figure 7: Wall time to solution (seconds) versus number of processors for the DFT siosi8 benchmark (7,108 basis functions), comparing the naive approach with the CHT. The timing includes the complete calculation to convergence of the LDA wavefunction.]

CCSD(T)

As stated above, CCSD(T) is more expensive than the DFT methods: its cost scales roughly as N^7, while DFT scales between N and N^3, where N is the number of basis functions. Therefore it is a natural candidate for demonstrating peta-class performance once an efficient parallel implementation is in place. We report performance measurements based on the parallel implementation of CCSD(T) in NWChem by Kobayashi and Rendell [14], which was designed to effectively utilize massively parallel processors and to minimize the use of I/O resources. Previous CCSD(T) runs with the same NWChem implementation achieved a performance of 6.3 TFlops using 1,400 processors [24] on a cluster of Itanium2 processors with a Quadrics QsNetII network, while more recent runs at PNNL utilized 14,000 processors on an InfiniBand network of Opteron processors [10]. What distinguishes the benchmark numbers reported here is the unprecedented scale of the calculations and the floating-point performance achieved.

We ran a series of benchmarks with version 5.1 of NWChem [9]. We used the (H2O)18 water cluster with a modified cc-pVTZ [12] basis set, for a total of 918 basis functions. This benchmark was run on the Jaguar XT5 at ORNL. Figure 8 shows the wall time for (H2O)18 at different processor counts. The Jaguar supercomputer used for these tests was recently upgraded from quad-core to hex-core processors, increasing the number of cores per node from 8 to 12. The graph on the left side of Figure 8 shows the scaling for up to 90,000 cores on the quad-core Jaguar (before the upgrade). All 8 cores on each node were used for computation. The last data point, at 90,000 processes, reached a sustained 64-bit floating-point performance of 358 TFlops.
The graph on the right side of Figure 8 shows the scaling of (H2O)18 after the upgrade. In this case, only 10 of the 12 cores per node were used for computation: one core was reserved exclusively for the CHT, and core 0 on socket 0 was left unused (because of the amount of OS activity measured on that core; the details are outside the scope of this paper). The last data point, at 180,000 processes, reached a sustained 64-bit floating-point performance of 718 TFlops.
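For perspective, dividing the reported aggregate rates by the corresponding process counts (our arithmetic, using only the figures above) gives an essentially constant per-process rate, indicating near-linear scaling from 90,000 to 180,000 processes, even though the two runs used different per-node core configurations:

  358 TFlops / 90,000 processes  ≈ 3.98 GFlops per process
  718 TFlops / 180,000 processes ≈ 3.99 GFlops per process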
doi:10.1145/1787275.1787326