An FMM Based on Dual Tree Traversal for Many-Core Architectures
Journal of Algorithms & Computational Technology
The present work attempts to integrate the independent efforts in the fast N-body community to create the fastest N-body library for many-core and heterogeneous architectures. Focus is placed on low-accuracy optimizations, in response to the recent interest in using FMM as a preconditioner for sparse linear solvers. A direct comparison with other state-of-the-art fast N-body codes demonstrates that an orders-of-magnitude increase in performance can be achieved by careful selection of the optimal
algorithm and low-level optimization of the code. The current N-body solver uses a fast multipole method with an efficient strategy for finding the list of cell-cell interactions by a dual tree traversal. A task-based threading model is used to maximize thread-level parallelism and intra-node load balancing. In order to extract the full potential of the SIMD units on the latest CPUs, the inner kernels are optimized using AVX instructions.

In a recent study we have shown that FMM becomes faster than FFT when scaling to thousands of GPUs. The comparative efficiency of FMM and FFT can be explained by the asymptotic amount of communication. On a distributed memory system with P nodes, a 3-D FFT requires two global transpose communications between √P processes, so the communication complexity is O(√P). On the other hand, the hierarchical nature of the FMM reduces the amount of communication to O(log P).

A preliminary feasibility study for Exascale machines indicates that the necessary bandwidth for FFT could only be provided by a fat-tree or hypercube network. However, constructing such network topologies for millions of nodes is prohibitively expensive, and the current trend of using torus networks is likely to continue. Therefore, network topology is another area where the trend in hardware is deviating from the requirements of common algorithms. Hierarchical methods are promising in this respect, and FMM is undoubtedly one of them.

It is clear from the above arguments that FMM could be an efficient alternative algorithm for many scientific applications on Exascale machines. One common objection is that FMM requires many more operations than other fast algorithms such as multigrid and FFT, and is therefore much slower. However, as future microarchitectures move toward ever lower Byte/flop ratios, the asymptotic constant of the arithmetic complexity becomes less of a concern.
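The gap between the two communication complexities quoted above can be made concrete with a small numerical sketch. The function names are hypothetical, and both constant factors are set to one purely for illustration; real constants depend on the network and the implementation, so only the growth of the ratio with P is meaningful here.

```python
import math


def fft_comm_cost(p: int) -> float:
    """Illustrative O(sqrt(P)) model for a 3-D FFT (constant factor assumed to be 1)."""
    return math.sqrt(p)


def fmm_comm_cost(p: int) -> float:
    """Illustrative O(log P) model for FMM (base-2 log, constant factor assumed to be 1)."""
    return math.log2(p)


def comm_ratio(p: int) -> float:
    """Ratio of the two model costs; it grows without bound as P increases."""
    return fft_comm_cost(p) / fmm_comm_cost(p)


if __name__ == "__main__":
    # Node counts from a thousand to roughly a million.
    for p in (2**10, 2**14, 2**20):
        print(f"P = {p:>8d}  sqrt(P) = {fft_comm_cost(p):8.1f}  "
              f"log2(P) = {fmm_comm_cost(p):5.1f}  ratio = {comm_ratio(p):6.1f}")
```

Under these assumptions the ratio rises from 3.2 at P = 2^10 to 51.2 at P = 2^20, which is the sense in which the hierarchical method scales more favorably in communication.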
Therefore, the advantage in communication complexity outweighs the disadvantage in operation count noted above.

Related Work

FMM is a relatively new algorithm compared to well-established linear algebra solvers and FFT, and has ample room for both mathematical and algorithmic improvement, not to mention the need to develop highly optimized libraries. We briefly summarize the recent efforts in this area by categorizing them into: CPU optimization, GPU optimization, MPI parallelization, algorithmic comparison, auto-tuning, and data-driven execution.