Runtime system design of decoupled execution paradigm for data-intensive high-end computing

Kun Feng, Yanlong Yin, Chao Chen, Hassan Eslami, Xian-He Sun, Yong Chen, Rajeev Thakur, William Gropp
2013 IEEE International Conference on Cluster Computing (CLUSTER)
High-performance computing (HPC) is widely used for scientific discovery through scientific computation programs. Many of these applications are becoming increasingly data intensive [1]. They generate or access huge amounts of data during some execution phases. However, traditional supercomputers are designed for compute-intensive tasks. They usually have high-density clusters of processing cores, and their storage systems are placed remotely and connected to the computing clusters with
... This separation of the computing system and the storage system causes a data input/output (I/O) performance bottleneck, especially in the data-intensive phases of HPC applications. This bottleneck degrades the efficiency of the HPC system. To ease this I/O bottleneck and improve the efficiency of HPC systems on data-intensive applications, we proposed a decoupled execution paradigm in our previous work [2]. We add two clusters of data-processing nodes on the I/O path between the computing nodes and the storage system. These data-processing nodes consist of compute-side data nodes and storage-side data nodes. Compute-side data nodes are compute nodes that are dedicated to data processing. Storage-side data nodes are specially designed nodes that are connected to the file servers through a fast network. Compute-side data nodes reduce the size of computation-generated data before sending it to the storage nodes. Storage-side data nodes reduce the size of data retrieved from storage before sending it to the compute-side data nodes [2]. To make the decoupled system easy to program, we need to design and implement a corresponding programming model and runtime system. The programming model is the interface that allows users to specify the operations to be decoupled. The runtime system provides runtime support and carries out the operations on the requested data nodes according to the user's needs. In this poster, we present our prototype design and implementation of the runtime system for the decoupled execution paradigm. We design two mechanisms to decouple data-intensive operations from computing nodes to data nodes, covering two types of MPI environment configuration. In the first configuration, the computing nodes and the data nodes belong to the same MPI runtime environment.
In other words, the computing processes and the decoupled data-handling processes are managed by the same MPI process manager, so they can communicate with each other through simple MPI messages to pass the operation parameters or return the operation results. The runtime system uses MPI_OP_CREATE to register user-defined functions as decoupled operations. It then applies the operations over the specified data sets using MPI_REDUCE or MPI_ALLREDUCE. In the second configuration, the computing nodes and the data nodes belong to separate MPI runtime environments. As a result, computing processes and data-handling processes cannot communicate with each other via MPI messages. We adopt a client-server RPC mechanism, in which an RPC server daemon runs in each data-node cluster. Each decoupled operation is implemented as an executable. When the user's program initiates a decoupled operation, the runtime system acts as the RPC client and issues an RPC call to the RPC server. The server daemon then spawns data-handling tasks on the data nodes with the predefined executable, using MPI_SPAWN. The results are returned in the return value of the RPC call or saved into intermediate result files. The programming interface is implemented as an extension of the standard MPI library (the MPI implementation we used is MPICH2). The interface allows the user to specify critical parameters such as the executable name, the files to be handled, the number of data-handling tasks, the task locations, the result formats, and so on. Our preliminary results show that our prototype runtime system achieves the design goal of decoupling with negligible overhead. In the future, we plan to evaluate the system with a wide range of HPC applications to ensure that the programming interface is comprehensive enough.
REFERENCES [1] A.
doi:10.1109/cluster.2013.6702642 dblp:conf/cluster/FengYCESCTG13 fatcat:2fwjihcrgve4pepd5s3ansri3i