Linear Algebra Computation Benchmarks on a Model Grid Platform [chapter]

Loriano Storchi, Carlo Manuali, Osvaldo Gervasi, Giuseppe Vitillaro, Antonio Laganà, Francesco Tarantelli
2003 Lecture Notes in Computer Science  
The interest of the scientific community in Beowulf clusters and Grid computing infrastructures is continuously increasing. The present work reports on a customization of the Globus Software Toolkit 2 for a Grid infrastructure based on Beowulf clusters, aimed at analyzing and optimizing its performance. We illustrate the platform topology and the strategy we adopted to implement the various levels of process communication based on Globus and MPI. Communication benchmarks and computational tests based on parallel linear algebra routines widely used in computational chemistry applications have been carried out on a model Grid infrastructure composed of three Beowulf clusters connected through an ATM WAN (16 Mbps).

Intra-cluster communication is handled by local MPI (l-MPI) methods. The l-MPI we have used is LAM/MPI [9]; it provides for communication via TCP/IP among nodes in a dedicated network, or via shared memory for processes running on the same machine. In accordance with the MPICH-G2 communication hierarchy, we can thus essentially distinguish between two point-to-point communication levels: inter-cluster communication (Level 1) and intra-cluster communication (Level 4). As already mentioned, the effective bandwidth connecting GIZA to HPC and GRID is lower than that between HPC and GRID. This asymmetry may be thought of as simulating the communication inhomogeneity of a general Grid.

Consider now a typical broadcast, where one has to propagate some data to 24 machines, 8 in each cluster. For convenience, the machines in cluster HPC will be denoted p0, p1, ..., p7, those in GRID p8, p9, ..., p15, and those in GIZA p16, p17, ..., p23. The MPICH-G2 MPI_Bcast operation over the communicator MPI_COMM_WORLD, rooted at p0, produces a cascade of broadcasts, one for each communication level. In this case there is a broadcast at the WAN inter-cluster level, involving processes p0, p8 and p16, followed by three intra-cluster propagations, where the l-MPI takes over. We thus have just two inter-cluster point-to-point communication steps, one from p0 to p8 and another from p0 to p16, and then a number of intra-cluster communications. The crucial point is that communication over the slow links is minimized, while the three fast local (intra-cluster) broadcast propagations can take place in parallel.

In this prototype situation, the strategy adopted in our own implementation of the broadcast is essentially identical, but we have optimized the broadcast operation at the local level. The essential difference between our implementation and the LAM/MPI one is that in the latter, when a node propagates its data via TCP/IP, non-blocking (asynchronous) send operations over simultaneously open sockets are issued, whereas we opted for blocking operations. The local broadcast tree is depicted in Fig. 2 (caption: "Local broadcast tree in an 8-node cluster"). The LAM/MPI choice appears to be optimal on a high-bandwidth network where each node is connected independently ...
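MPICH-G2 performs this level-by-level decomposition of MPI_Bcast automatically. For illustration only, a minimal sketch of the same two-level scheme built by hand with MPI sub-communicators is given below; the function name two_level_bcast, the caller-supplied cluster_id (0 for HPC, 1 for GRID, 2 for GIZA), and the convention that the lowest-ranked process of each cluster acts as its leader are assumptions of the sketch, not details of the implementation described in the text.

#include <mpi.h>

/* Two-level broadcast: one broadcast among cluster leaders over the WAN
 * (Level 1), then one broadcast inside each cluster (Level 4).  The data
 * is assumed to originate on world rank 0 (p0 in cluster HPC). */
void two_level_bcast(void *buf, int count, MPI_Datatype type,
                     int cluster_id, MPI_Comm world)
{
    int world_rank, intra_rank;
    MPI_Comm intra, leaders;

    MPI_Comm_rank(world, &world_rank);

    /* Level 4 communicator: all processes belonging to the same cluster. */
    MPI_Comm_split(world, cluster_id, world_rank, &intra);
    MPI_Comm_rank(intra, &intra_rank);

    /* Level 1 communicator: the lowest-ranked process of each cluster
     * (p0, p8, p16 in the example) acts as that cluster's leader. */
    MPI_Comm_split(world, intra_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leaders);

    /* At most two point-to-point WAN transfers: p0 -> p8 and p0 -> p16. */
    if (leaders != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leaders);

    /* The three intra-cluster broadcasts then proceed in parallel. */
    MPI_Bcast(buf, count, type, 0, intra);

    if (leaders != MPI_COMM_NULL)
        MPI_Comm_free(&leaders);
    MPI_Comm_free(&intra);
}

The single MPI_Bcast over the leaders communicator accounts for the two slow WAN transfers, while the final MPI_Bcast runs concurrently within the three clusters, which is the behaviour described above.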
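The local optimization mentioned above, i.e. forwarding along the broadcast tree with blocking rather than non-blocking sends, can likewise be sketched. The binomial tree used here is a generic one and need not coincide with the tree of Fig. 2; the function name, the message tag and the choice of local rank 0 as the tree root are assumptions of the sketch.

#include <mpi.h>

/* Intra-cluster tree broadcast rooted at local rank 0 (the cluster
 * leader), forwarding with blocking MPI_Send as described in the text. */
void local_bcast_blocking(void *buf, int count, MPI_Datatype type,
                          MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Walk up the binomial tree: each non-root node receives once
     * from its parent. */
    while (mask < size) {
        if (rank & mask) {
            MPI_Recv(buf, count, type, rank ^ mask, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Walk down: forward to each child with a blocking send, so the
     * next transfer starts only after the previous one has completed. */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        mask >>= 1;
    }
}

Replacing the MPI_Send calls with MPI_Isend followed by MPI_Waitall would reproduce the LAM/MPI behaviour, in which a node keeps several sockets busy simultaneously instead of completing one transfer before starting the next.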
doi:10.1007/3-540-44862-4_32