Scalability study of molecular dynamics simulation on Godson-T many-core architecture

Liu Peng, Guangming Tan, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta, Dongrui Fan, Hao Zhang, Fenglong Song
2013 Journal of Parallel and Distributed Computing  
Molecular dynamics (MD) simulation has broad applications, and an increasing amount of computing power is needed to satisfy the large scale of real-world simulations. The advent of the many-core paradigm brings unprecedented computing power, but harvesting that power remains a great challenge because of MD's irregular memory-access pattern. To address this challenge, this paper presents a joint application/architecture study to enhance the scalability of MD on a Godson-T-like many-core architecture. First, a preprocessing approach leveraging an adaptive divide-and-conquer framework is designed to exploit locality through a memory hierarchy with software-controlled memory. Then three incremental optimization strategies (a novel data layout to improve data locality, an on-chip locality-aware parallel algorithm to enhance data reuse, and a pipelining algorithm to hide latency to shared memory) are proposed to enhance on-chip parallelism for the Godson-T many-core processor. Experiments on a Godson-T simulator exhibit a strong-scaling parallel efficiency of 0.99 on 64 cores, which is confirmed by a field-programmable gate array emulator. In addition, the performance per watt of MD on Godson-T is much higher than that of MD on a 16-core Intel Core i7 symmetric multiprocessor (SMP) and 26 times higher than that of MD on an 8-core, 64-thread Sun T2 processor. Detailed analysis shows that optimizations utilizing architectural features to maximize data locality and to enhance data reuse benefit scalability most. Furthermore, a hierarchical parallelization scheme is designed to map the MD algorithm to a Godson-T many-core cluster, and a simple performance model is derived, which suggests that the optimization scheme is likely to scale well toward exascale. Certain architectural features are found essential for these optimizations, which could guide future hardware developments.
[...] (L. Peng), tgm@ict.ac.cn (G. Tan).
[...] makes the efficiency of on-chip parallelism increasingly important. The challenges in achieving an efficient on-chip parallel MD algorithm arise mainly from two aspects: (1) the MD application is characterized by irregular memory access, which makes locality optimization difficult; (2) many-core hardware limitations (volume of on-chip memory, bandwidth of on-chip networking, etc.) constrain the size of the working set per core, which makes on-chip parallelization difficult.
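For reference, the strong-scaling parallel efficiency quoted in the abstract is conventionally computed as the speedup at fixed problem size divided by the number of cores. The sketch below assumes that standard definition; the measurement procedure actually used in the paper is not shown in this excerpt, and the function name and timings are hypothetical, chosen only to illustrate the arithmetic.

```python
def strong_scaling_efficiency(t_serial: float, cores: int, t_parallel: float) -> float:
    """Return speedup on `cores` cores divided by `cores`, at fixed problem size."""
    if cores < 1 or t_serial <= 0 or t_parallel <= 0:
        raise ValueError("cores must be >= 1 and timings must be positive")
    speedup = t_serial / t_parallel
    return speedup / cores


# Hypothetical timings (not taken from the paper).
t1 = 640.0   # wall-clock seconds on 1 core
t64 = 10.1   # wall-clock seconds on 64 cores, same problem size
print(round(strong_scaling_efficiency(t1, 64, t64), 2))  # prints 0.99
```

An efficiency of 1.0 would mean that the runtime shrinks exactly in proportion to the number of cores; values below 1.0 reflect overheads such as memory latency and communication.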
To address these difficulties, this paper presents a joint study, from both the application and the architecture side, of how to achieve scalability and high performance for MD on a Godson-T-like emerging many-core architecture, where we map an MD algorithm to the architecture to achieve high on-chip parallel efficiency. We focus on MD simulation with nonbonded n-tuple interactions, which is common in materials simulations [26] and provides a broad context of computational characteristics for algorithmic design. The objective of this paper is not only to identify how application scientists can utilize new mechanisms provided in emerging many-core architectures to improve the performance of their applications, but also to compare the usefulness of various architectural mechanisms as evidenced by their impact on application performance, which could guide future hardware developments. This work thus serves as an example of architecture-algorithm codesign to inform the development of future exascale computing systems [29]. The main contributions of this paper are eight-fold:
doi:10.1016/j.jpdc.2012.07.007 fatcat:hfoousphcfcsnmxm2jjirllbpa