Jcluster: an efficient Java parallel environment on a large-scale heterogeneous cluster

Bao-Yin Zhang, Guang-Wen Yang, Wei-Min Zheng
2006, Concurrency and Computation: Practice and Experience
In this paper, we present Jcluster, an efficient Java parallel environment that provides some critical services, in particular automatic load balancing and high-performance communication, for developing parallel applications in Java on a large-scale heterogeneous cluster. In the Jcluster environment, we implement a task scheduler based on a transitive random stealing (TRS) algorithm. Performance evaluations show that, on a large-scale cluster, the TRS-based scheduler lets an idle node obtain a task from another node with far fewer steal attempts than random stealing (RS), a well-known dynamic load-balancing algorithm. On the communication side, using asynchronously multithreaded transmission, we implement a high-performance PVM-like and MPI-like message-passing interface in pure Java. The communication performance is evaluated among the Jcluster environment, LAM-MPI and mpiJava on LAM-MPI, based on the Java Grande Forum's pingpong benchmark.

INTRODUCTION

A large-scale heterogeneous cluster has emerged as an attractive platform that allows heterogeneous resources to be harnessed for large-scale computation. Realizing the performance potential of large-scale clusters as a platform for incrementally scalable computing presents many challenges, among which load balancing and high-performance communication are two of the most important. Random stealing (RS) is a well-known dynamic load-balancing algorithm, used in both shared-memory and distributed-memory systems. When a node finds its own task queue empty, RS attempts to steal a task from a randomly selected node, repeating steal attempts until one succeeds. RS is provably efficient in terms of time, space, and communication for the class of fully strict computations [1, 2], and the natural algorithm is stable [3]. Communication is initiated only when nodes are idle; when the system load is high, no communication is needed, which causes the system to behave well under high loads. Systems that implement RS include Cilk [4], JAWS [5], and Satin [6, 7]. Cilk provides an efficient C-based runtime system for multithreaded parallel programming with an RS scheduler. JAWS schedules load over a dynamically varying computing infrastructure to provide a multithreaded programming environment on heterogeneous clusters. However, on a large-scale cluster, a node must steal many times at random before it obtains a task from another node.
This not only increases the idle time of all nodes, but also produces heavy network communication overhead. To address this problem, Shis, one of the load-balancing policies in the EARTH system [8, 9], slightly modifies RS by remembering the node from which a task was last received (history information) and sending steal requests directly to that node (the shortcut path). In the Jcluster environment, we implement a transitive random stealing (TRS) algorithm, which further improves Shis with a transitive policy. Using the random baseline technique, we experimentally compare the performance of TRS with Shis and RS for five different load distributions on the Tsinghua EastSun cluster. The results also show that a TRS-based scheduler lets an idle node obtain a task from another node with far fewer steal attempts on a large-scale cluster. This greatly reduces the idle time of all nodes and the network communication overhead, and thereby improves the scalability of the system.

On the communication side, previous efforts at Java-based message-passing frameworks have mainly focused on making the functionality of the message-passing interface available in Java, either through native-code wrappers to existing MPI libraries, such as mpiJava [10] and JavaMPI [11], or through pure Java implementations, such as JPVM [12], MPIJ (part of the DOGMA project at BYU [13]), and Collective Communication Java (CCJ) [14]. In the Jcluster environment, using asynchronously multithreaded transmission, we implement a reliable, high-performance PVM-like and MPI-like message-passing interface in pure Java over the lightweight UDP protocol, as in the Panda project [15]. A very useful MPI communication benchmark, MPIBench, which provides a more accurate measurement of the performance of MPI communication routines, was presented in [16]; however, we have not found a Java version of MPIBench.
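To make the contrast between the three policies concrete, the victim-selection step can be sketched as follows. All names and data structures here (the lastDonor hint array, the idle flags) are our own illustrative assumptions, not Jcluster's actual implementation.

```java
import java.util.Random;

// Illustrative sketch of victim selection under RS, Shis, and TRS.
// Method and field names are hypothetical, not the Jcluster API.
public class StealPolicies {
    static final Random rnd = new Random(42);

    // Random stealing (RS): pick any other node uniformly at random.
    static int chooseVictimRS(int self, int numNodes) {
        int v;
        do { v = rnd.nextInt(numNodes); } while (v == self);
        return v;
    }

    // Shis: prefer the node that last sent us a task (the shortcut path),
    // falling back to plain random stealing when no history exists.
    static int chooseVictimShis(int self, int numNodes, int[] lastDonor) {
        int hint = lastDonor[self];
        return (hint >= 0 && hint != self) ? hint : chooseVictimRS(self, numNodes);
    }

    // TRS: like Shis, but if the remembered donor is itself idle, the steal
    // request follows that node's own donor hint, transitively.
    static int chooseVictimTRS(int self, int numNodes, int[] lastDonor, boolean[] idle) {
        int v = chooseVictimShis(self, numNodes, lastDonor);
        int hops = 0;
        while (idle[v] && lastDonor[v] >= 0 && lastDonor[v] != v && hops < numNodes) {
            v = lastDonor[v];   // forward along the chain of donor hints
            hops++;
        }
        return v;
    }

    public static void main(String[] args) {
        int n = 8;
        int[] lastDonor = {3, -1, 0, 5, 3, -1, 5, 0};
        boolean[] idle  = {true, true, true, true, false, false, true, true};
        // Node 0 remembers donor 3; node 3 is idle, so TRS follows 3 -> 5 (busy).
        System.out.println("TRS victim for node 0: "
                + chooseVictimTRS(0, n, lastDonor, idle));
    }
}
```

The transitive forwarding is what lets an idle node reach a loaded node in a handful of hops where plain RS would need many random probes.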
In this paper, the communication performance is evaluated among the Jcluster environment, LAM-MPI and mpiJava on LAM-MPI, based on the Java Grande Forum's pingpong benchmark.

This paper is organized as follows. Related work is discussed in the next section. We present the design and implementation of the Jcluster environment in Section 3. In Section 4, we show performance evaluations of the load balancing and the communication. Finally, Section 5 concludes the paper with remarks on current and future work.

RELATED WORKS

The EARTH runtime system [9] supports several dynamic load-balancing policies, whose goal is to deliver good load distribution with minimum overhead for a fine-grain multithreaded execution model. JAWS [5] schedules load over a dynamically varying computing infrastructure to provide a multithreaded programming environment on heterogeneous clusters; however, it does not achieve optimal performance, owing to various overheads such as communication overhead. In [17], an agent-based infrastructure was introduced that provides software services and functions for developing and deploying high-performance programming models and applications on clusters in Java. As an example, the Java Object Passing Interface (JOPI) [18] has been developed, in which small messages incur noticeable communication overhead, mainly due to serialization. The Manta project [19] supports several interesting flavors of message-passing code in Java, including Collective Communication Java (CCJ) [14] and Satin [6, 7]. CCJ is an RMI-based collective communication library written entirely in Java. Satin is a system for running divide-and-conquer programs on wide-area systems with an efficient load-balancing algorithm, cluster-aware random stealing (CRS).
CRS focuses on performance optimization for wide-area networks with high latency and low bandwidth, while RS is used within a single cluster. JPVM [12] is a PVM-like library implemented in pure Java. It provides an explicit message-passing interface for distributed-memory MIMD parallel programming in Java. However, experiments measuring the overhead of task creation and communication yielded results that, according to the JPVM author, were not encouraging. MPIJ, part of the Distributed Object Groups Metacomputing Architecture (DOGMA) project at BYU [13], is a pure Java implementation of a large subset of MPI features. Its implementation is based on the proposed MPI bindings of Carpenter et al. [20]. Morin et al. provide an excellent overview of MPIJ's design [21]; however, MPIJ is not currently available for public download. JavaMPI [11] and mpiJava [10] are two efforts to provide native-method wrappers to existing MPI libraries. Both approaches give Java programmers access to the complete functionality of a well-supported MPI library. This hybrid approach, while simple, has a number of limitations on heterogeneous clusters; for instance, mpiJava based on MPICH 1.2.6 does not support communication between Windows and Linux.

DESIGN AND IMPLEMENTATION OF THE JCLUSTER ENVIRONMENT

Design of the Jcluster environment

The Jcluster environment is designed to have the following characteristics.
• Pure Java implementation that suits heterogeneous clusters.
• Automatic load balancing with the TRS algorithm on a large-scale cluster.
• A high-performance message-passing API that takes advantage of the lightweight UDP protocol and the Java thread facility.

With this interface, we implement a high-performance PVM-like and MPI-like message-passing interface. In addition, an object-passing interface is supported through Java object serialization.
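The asynchronously multithreaded transmission mentioned above can be sketched in pure Java as follows: a non-blocking isend-style call enqueues the packet and returns immediately, while a dedicated sender thread drains the queue over a UDP socket. This is a minimal sketch under our own naming (AsyncUdpSender, isend); Jcluster's actual transport adds a reliability layer on top of UDP, which is omitted here.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of asynchronously multithreaded UDP transmission.
public class AsyncUdpSender implements AutoCloseable {
    private final BlockingQueue<DatagramPacket> queue = new LinkedBlockingQueue<>();
    private final DatagramSocket socket;
    private final Thread sender;

    public AsyncUdpSender() throws Exception {
        socket = new DatagramSocket();
        sender = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    socket.send(queue.take());   // blocks until a packet is queued
                }
            } catch (Exception e) {
                // interrupted or socket closed: stop draining
            }
        });
        sender.setDaemon(true);
        sender.start();
    }

    // Non-blocking send in the spirit of pvm_isend: enqueue and return.
    public void isend(byte[] data, InetAddress host, int port) {
        queue.add(new DatagramPacket(data, data.length, host, port));
    }

    @Override
    public void close() {
        sender.interrupt();
        socket.close();
    }

    public static void main(String[] args) throws Exception {
        try (DatagramSocket receiver = new DatagramSocket();
             AsyncUdpSender tx = new AsyncUdpSender()) {
            byte[] msg = "ping".getBytes(StandardCharsets.UTF_8);
            tx.isend(msg, InetAddress.getLoopbackAddress(), receiver.getLocalPort());
            DatagramPacket in = new DatagramPacket(new byte[64], 64);
            receiver.receive(in);   // blocks until the sender thread delivers it
            System.out.println(new String(in.getData(), 0, in.getLength(),
                    StandardCharsets.UTF_8));
        }
    }
}
```

Decoupling the caller from the socket in this way is what allows the pvm_isend and MPI_isend variants described below to return without waiting on the network.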
Jcluster programming API

The Jcluster environment provides simple interfaces to ease programming. Four classes are used: JTasklet, JEnvironment, JMessage and JException. The JTasklet class is a base class: every application must inherit from it and implement the method work(), which is called by the environment. The class has a reference 'env', an instance of the class JEnvironment. The JEnvironment class provides all of the methods of the PVM-like and MPI-like message-passing interface. We implement the following methods for synchronous and asynchronous point-to-point message passing.

void pvm_ssend(JMessage message, int taskId, int tag)
Sends a message with a tag to the tasklet whose id is taskId, and waits until the message has been received by a receive method.

void pvm_send(JMessage message, int taskId, int tag)
The same as pvm_ssend, but waits only until the message has been delivered to the receiving tasklet's node.

void pvm_isend(JMessage message, int taskId, int tag)
The same as pvm_ssend, without any wait.

JMessage pvm_recv(int taskId, int tag)
Receives and returns a message with a tag from the tasklet whose id is taskId, waiting until the message is available.

JMessage pvm_nrecv(int taskId, int tag)
Receives and returns a message with a tag from the tasklet whose id is taskId if one is available; otherwise returns null.

In the same manner as the PVM-like interface, we implement the MPI-like point-to-point message-passing interface.

void MPI_ssend(Object buf, int start, int count, Datatype type, int dest, int tag)
Sends a message with a tag to the tasklet whose id is dest, and waits until the message has been received by a receive method.

void MPI_send(Object buf, int start, int count, Datatype type, int dest, int tag)
The same as MPI_ssend, but waits only until the message has been delivered to the receiving tasklet's node.

void MPI_isend(Object buf, int start, int count, Datatype type, int dest, int tag)
The same as MPI_ssend, without any wait.
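To illustrate the programming model described above, here is a single-JVM mock of the JTasklet/JEnvironment pattern, with two tasklets exchanging a pingpong-style message. The JMessage, JEnvironment and JTasklet classes below are simplified stand-ins of our own (tags and sender matching are ignored, and inboxes are in-memory queues), not the real Jcluster classes.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified single-JVM stand-ins for the Jcluster classes, for illustration only.
class JMessage {
    final Object payload;
    JMessage(Object payload) { this.payload = payload; }
}

class JEnvironment {
    // One inbox per task id; this mock ignores tags and sender matching.
    private static final Map<Integer, BlockingQueue<JMessage>> inboxes =
            new ConcurrentHashMap<>();
    private final int myId;

    JEnvironment(int myId) { this.myId = myId; }

    static BlockingQueue<JMessage> inboxOf(int taskId) {
        return inboxes.computeIfAbsent(taskId, k -> new LinkedBlockingQueue<>());
    }

    // Mock of pvm_send: enqueue into the destination tasklet's inbox.
    public void pvm_send(JMessage message, int taskId, int tag) {
        inboxOf(taskId).add(message);
    }

    // Mock of pvm_recv: block on our own inbox (sender id and tag ignored here).
    public JMessage pvm_recv(int taskId, int tag) throws InterruptedException {
        return inboxOf(myId).take();
    }
}

abstract class JTasklet implements Runnable {
    protected final JEnvironment env;
    JTasklet(int myId) { this.env = new JEnvironment(myId); }
    @Override public void run() { work(); }
    public abstract void work();   // in Jcluster, called by the environment
}

public class PingPong {
    public static void main(String[] args) throws Exception {
        JTasklet ping = new JTasklet(0) {
            @Override public void work() {
                try {
                    env.pvm_send(new JMessage("ping"), 1, 0);
                    System.out.println(env.pvm_recv(1, 0).payload);  // prints ping-pong
                } catch (InterruptedException ignored) { }
            }
        };
        JTasklet pong = new JTasklet(1) {
            @Override public void work() {
                try {
                    JMessage m = env.pvm_recv(0, 0);
                    env.pvm_send(new JMessage(m.payload + "-pong"), 0, 0);
                } catch (InterruptedException ignored) { }
            }
        };
        Thread t0 = new Thread(ping), t1 = new Thread(pong);
        t0.start(); t1.start();
        t0.join();  t1.join();
    }
}
```

In the real environment the tasklets would run on different nodes and work() would be invoked by the Jcluster runtime rather than through explicit threads.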
doi:10.1002/cpe.986