Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity
Journal of Parallel and Distributed Computing
Hierarchical N-body methods, which are based on a fundamental insight into the nature of many physical processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their nonuniform, dynamically changing characteristics and their need for long-range communication. In this paper, we study the partitioning and scheduling techniques required to
... ain effective parallel performance on applications that use a range of hierarchical N-body applications. To obtain representative coverage, we examine applications that use the three most promising methods used today. Two of these, the Barnes-Hut method and the Fast Multipole Method, are the best methods known for classical N-body problems (such as molecular or galactic simulations). The third is a recent hierarchical method for radiosity calculations in computer graphics, which applies the hierarchical N-body approach to a problem with very different characteristics. We find that straightforward decomposition techniques which an automatic scheduler might implement do not scale well, because they are unable to simultaneously provide load balancing and data locality. However, all the applications yield very good parallel performance if appropriate partitioning and scheduling techniques are implemented by the programmer. For the Barnes-Hut and Fast Multipole applications, we show that simple yet effective partitioning techniques can be developed by exploiting some key insights into both the solution methods and the classical problems that they solve. Using a novel partitioning technique, even relatively small problems achieve 45-fold speedups on a 48-processor Stanford DASH machine (a cache-coherent, shared address space multiprocessor) and 118-fold speedups on a 128-processor simulated architecture. The very different characteristics of the radiosity application require a different partitioning/scheduling approach to be used for it; however, it too yields very good parallel performance.