MinEX: a latency-tolerant dynamic partitioner for grid computing applications

Sajal K. Das, Daniel J. Harvey, Rupak Biswas
Future Generation Computer Systems, 2002
The Information Power Grid (IPG) being developed by NASA is designed to harness the power of geographically distributed computers, databases, and human expertise in order to solve large-scale realistic computational problems. This type of metacomputing infrastructure is necessary to present a unified virtual machine to application developers, one that hides the intricacies of a highly heterogeneous environment and yet maintains adequate security. In this paper, we present a novel latency-tolerant partitioning scheme, called MinEX, that dynamically balances processor workloads while minimizing data movement and runtime communication for applications that are executed in a parallel distributed fashion on the IPG.
The number of IPG nodes, the number of processors per node, and the interconnect speeds are parameterized in a simulation experiment to derive conditions under which the IPG would be suitable for solving such applications. Experimental results demonstrate that MinEX is an effective load balancer for the IPG when the nodes are connected by a high-speed asynchronous interconnection network.

[…] that is both ubiquitous and uniformly accessible through a convenient interface. Some other areas that would benefit from such a nationwide infrastructure include:

• desktop coupling to remote resources so as to provide access to large databases and high-end graphics facilities [10];
• user access to sophisticated instruments through remote connections utilizing virtual reality techniques [9];
• remote interactions with parallel and distributed supercomputer simulations [11,12].

The IPG is one of several approaches to developing what are called Computational Grid (in short, Grid) capabilities and/or implementations [16]; the term is not to be confused with computations on discretization grids. For example, Condor [23] was an early success in developing a distributed system to manage research studies at workstations around the world. However, it does not adequately deal with the security issues that are important for a general Grid implementation. Other Grid-based systems include Nimrod [1], NetSolve [4], NEOS [6], Legion [17], and CAVERN [22]. The Globus Metacomputing Infrastructure Toolkit [15] has been extremely successful in providing a portable virtual machine environment. Mechanisms exist within Globus to share remote resources, provide adequate security, and allow MPI-based message passing. Due to its general, portable, and modular nature, Globus has been chosen by NASA as the middleware to implement the IPG.

To date, only a few limited studies have been performed at NASA Ames Research Center to determine the viability of large-scale parallel and distributed computing on the IPG [2,13]. In [2], latency tolerance and load balancing modifications were implemented for a computational fluid dynamics (CFD) application to compensate for the slower communication speed between two IPG computers (nodes). Results showed that the application actually ran faster under Globus on two nodes of four processors each than on a single tightly coupled machine of eight processors. However, this result is clouded in that asynchronous message passing was supported over the wide area network but not within the single platform. The results presented in [13] demonstrated the feasibility of parallel distributed computing on homogeneous IPG testbeds, but performance was significantly affected by increased communication times.
The paper concluded that poorer connectivity and larger latencies due to geographical separation in a realistic IPG environment […]

In this paper, we propose a novel partitioner, called MinEX, that optimizes the two important steps of PLUM (namely, balancing and remapping) as part of the partitioning process. Instead of attempting to merely balance the load and reduce the runtime interprocessor communication like most other partitioners, the objective of MinEX is to minimize the total runtime of the application. This approach counters the possibility that perfectly balanced loads with minimal communication can still incur excessive redistribution costs for adaptive applications. MinEX is also used to experiment with latency-tolerant techniques for the IPG. Our experimental results show that MinEX reduces the workload migrated by PLUM, and lowers the communication cost over partitions generated by SBN. For example, for 32 partitions with our test case, PLUM showed an edge cut (reflecting the communication overhead) of 10.9% and redistributed 63,270 mesh elements. The corresponding numbers for the SBN-based approach were 36.5% and 19,446. In contrast, the MinEX partitioner values were 20.9% and 30,548, respectively, while maintaining comparable load balance. Thus, MinEX attempts to optimize […]

[…] {|V|, |E|, vTot, *VMap, *VList, *EList}, where |V| is the number of active vertices, |E| the number of edges, vTot the total vertex count (including merged vertices), *VMap is a pointer to the list of active vertices, *VList is a pointer to the complete list of vertices, and *EList is a pointer to the list of edges.
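To make the field descriptions above concrete, the following C fragment is a minimal sketch of such a structure. Only the quantities named in the text (|V|, |E|, vTot, VMap, VList, EList) come from the paper; the Vertex and Edge layouts, the type names, and the remaining field names are illustrative assumptions rather than the authors' implementation.

/* Hedged sketch of the graph/mesh structure described above.
 * Field names in Mesh follow the text; Vertex and Edge are assumptions. */
typedef struct {
    int wgt;      /* computational weight of the vertex (assumed)         */
    int eStart;   /* index of its first incident edge in EList (assumed)  */
    int eCount;   /* number of incident edges (assumed)                   */
    int owner;    /* partition currently owning the vertex (assumed)      */
} Vertex;

typedef struct {
    int dest;     /* index of the vertex at the other end (assumed) */
    int wgt;      /* communication weight of the edge (assumed)     */
} Edge;

typedef struct {
    int     numV;   /* |V|: number of active vertices                  */
    int     numE;   /* |E|: number of edges                            */
    int     vTot;   /* total vertex count, including merged vertices   */
    int    *VMap;   /* indices of the active vertices                  */
    Vertex *VList;  /* complete list of vertices                       */
    Edge   *EList;  /* list of edges                                   */
} Mesh;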
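The central idea stated above, that a vertex reassignment must be charged for data redistribution as well as for processing and communication, can likewise be illustrated with a small hypothetical gain function. This is not the metric defined by MinEX; it only sketches, using assumed weight fields, how folding a one-time migration cost into the comparison can make an otherwise balance-improving move unattractive.

/* Hypothetical illustration of a cost comparison that includes
 * redistribution, in the spirit of MinEX's objective; the actual
 * metric used by the partitioner is defined in the paper, not here. */
typedef struct {
    double proc_weight;    /* estimated compute time for the vertex          */
    double comm_weight;    /* estimated communication over cut edges         */
    double redist_weight;  /* estimated one-time cost of migrating its data  */
} MoveCost;

/* Estimated reduction in runtime if the vertex is moved; the move is
 * attractive only when the gain is positive. */
double move_gain(MoveCost before, MoveCost after)
{
    double old_cost = before.proc_weight + before.comm_weight;
    double new_cost = after.proc_weight + after.comm_weight
                      + after.redist_weight;   /* migration is paid once */
    return old_cost - new_cost;
}

In this sketch, a move that improves load balance but whose redist_weight exceeds the processing and communication savings yields a negative gain and would be rejected, which is the behavior the paragraph above argues a runtime-minimizing partitioner should exhibit.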
doi:10.1016/s0167-739x(01)00073-5