GROMACS: Fast, flexible, and free

David Van Der Spoel, Erik Lindahl, Berk Hess, Gerrit Groenhof, Alan E. Mark, Herman J. C. Berendsen
2005 Journal of Computational Chemistry  
This article describes the software suite GROMACS (GROningen MAchine for Chemical Simulation) that was developed at the University of Groningen, The Netherlands, in the early 1990s. The software, written in ANSI C, originates from a parallel hardware project, and is well suited for parallelization on processor clusters. By careful optimization of neighbor searching and of inner loop performance, GROMACS is a very fast program for molecular dynamics simulation. It does not have a force field of its own, but is compatible with GROMOS, OPLS, AMBER, and ENCAD force fields.
In addition, it can handle polarizable shell models and flexible constraints. The program is versatile, as force routines can be added by the user, tabulated functions can be specified, and analyses can be easily customized. Nonequilibrium dynamics and free energy determinations are incorporated. Interfaces with popular quantum-chemical packages (MOPAC, GAMESS-UK, GAUSSIAN) are provided to perform mixed MM/QM simulations. The package includes about 100 utility and analysis programs. GROMACS is in the public domain and distributed (with source code and documentation) under the GNU General Public License. It is maintained by a group of developers from the Universities of Groningen, Uppsala, and Stockholm, and the Max Planck Institute for Polymer Research in Mainz. Its Web site is http://www.gromacs.org.

GROMACS is an acronym for GROningen MAchine for Chemical Simulation. This sounds strange for a software package, but the name arises from a collaborative project of our group in Groningen, in the early 1990s, to construct a dedicated parallel computer system for molecular simulation. Until that time the Fortran package GROMOS had been developed in our group, mainly by van Gunsteren, who continued the development of GROMOS after he moved to the ETH in Zürich, Switzerland, in 1990 [1]. The hardware project concerned the construction of a system of 32 processors (Intel i860, a 40 MHz, 80 Mflop RISC processor), communicating in a ring structure by bidirectional dual-memory access, and with a theoretical peak performance of 2.5 Gflop/s. Effective communication required a different method of neighbor searching and a distribution of the neighbor list over processors. In fact, a complete software renewal was necessary, and it was decided to write the GROMACS software in C. Communication was done in C routines that made use of calls supplied by the PVM (Parallel Virtual Machine) library; later, this was replaced by MPI (Message Passing Interface). These have proven to be good choices, because ANSI C and MPI became standards, while most other languages popular with computer scientists at the time (such as Pascal and, for parallel systems, OCCAM) have become obsolete. This renewal gave the opportunity to redesign the software and incorporate several innovations that simplified and accelerated the computation. (See refs. [2-5] for GROMACS version 1.0 of late 1994.)

The hardware project was quite successful, and resulted in a prototype and a final machine (Amphisbaena [6], constructed by CHESS Engineering B.V., Haarlem, The Netherlands). Its performance exceeded that of the vector supercomputers of the time, the CRAY Y/MP and NEC SX-3, by a factor of 3 to 6, and it served the group well for many years. However, with the advent of mass-produced personal computers in the late 1990s and the rapid increase in processor speed, such systems quickly became obsolete, and the construction of special-purpose machines was no longer a cost-effective option. Since then, the GROMACS software has been further developed and optimized, especially for use on PC clusters.

An early design decision was the choice to work with particle decomposition rather than domain decomposition to distribute work over the processors. In the latter case, spatial domains are assigned to processors, which enables finding spatial neighbors quickly by local communication only, but the complications caused by particles that move across spatial boundaries are considerable.
Domain decomposition is a better choice only when the linear system size considerably exceeds the range of interaction, which is seldom the case in molecular dynamics. With particle decomposition, each processor computes the forces and coordinate/velocity updates for an assigned fraction of the particles, using a precomputed neighborlist distributed evenly over the processors [7]. The force F_ij arising from the pair interaction between particles i and j, which is needed for the velocity update of both particles i and j, is computed only once and communicated to the other processors. Every processor keeps the complete coordinate set of the system in its local memory, rather than restricting storage to the coordinates it needs. This is simpler and saves communication overhead, while the memory claim is usually not a limiting factor at all, even for millions of particles. The neighborlist, on the other hand, which can contain up to 1000 times the number of particles, is distributed over the processors. Communication is essentially restricted to sending coordinates and forces once per time step around the processor ring. These choices have proven to be robust over time and easily applicable to modern processor clusters (a minimal sketch of the scheme is given below).

While retaining algorithmic choices of proven validity and efficiency from GROMOS [8], such as leap-frog integration of the equations of motion [9], a double-range neighborlist using charge groups, incorporation of SHAKE [10,11] to maintain holonomic constraints, weak coupling to temperature and pressure baths [12], and stochastic dynamics algorithms [13,14], new concepts [15] could be incorporated. These included (1) the computation of the virial of the force as a single, rather than double, sum over particles, thus increasing the performance by 40% by avoiding an inner-loop calculation [3,16]; (2) the use of triclinic boxes, which encompass all possible periodic constructs [17]; and (3) the storage, in the neighborlist, of the translation vector that defines the image of each particle, thus avoiding conditional statements in the inner loop over particle pairs. Further efficiency-enhancing features, incorporated in version 3.0, are described in a 2001 review [18]: (4) the use of a special software routine to evaluate the inverse square root, (5) force/energy evaluation by cubic spline interpolation from tabulated values, (6) fast grid-based neighbor searching, (7) a new and more robust linear constraint solver, LINCS [19], and (8) the use of multimedia (3DNow! and SSE) instructions on Pentium (III and higher), Athlon, and Duron processors.
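
The particle-decomposition scheme described above can be illustrated with a minimal, self-contained sketch in C (hypothetical code, not the actual GROMACS source; the pair kernel and the names rvec, nb_pair, forces_local are placeholders). Every process keeps the full coordinate array, walks only its own share of the neighborlist, evaluates each pair force F_ij once, and applies it with opposite signs to both particles; the partial force arrays are afterwards summed over processes, in GROMACS once per step around the processor ring.

```c
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z; } rvec;
typedef struct { int i, j; } nb_pair;     /* one entry of the neighborlist */

/* Toy pair force (purely repulsive, |F| ~ 1/r^2), standing in for the
 * real nonbonded kernel. */
static void pair_force(const rvec *ri, const rvec *rj, rvec *fij)
{
    double dx = ri->x - rj->x, dy = ri->y - rj->y, dz = ri->z - rj->z;
    double r2 = dx * dx + dy * dy + dz * dz;
    double s  = 1.0 / (r2 * sqrt(r2));    /* 1/r^3: force vector = s * r_ij */
    fij->x = s * dx;  fij->y = s * dy;  fij->z = s * dz;
}

/* Each process calls this with its own slice of the distributed
 * neighborlist; x holds the complete coordinate set of the system. */
void forces_local(const rvec *x, rvec *f, const nb_pair *nlist, size_t n)
{
    for (size_t k = 0; k < n; ++k) {
        int i = nlist[k].i, j = nlist[k].j;
        rvec fij;
        pair_force(&x[i], &x[j], &fij);
        /* F_ij is computed once and used for both i and j (Newton's
         * third law); partial forces are summed over processes later. */
        f[i].x += fij.x;  f[i].y += fij.y;  f[i].z += fij.z;
        f[j].x -= fij.x;  f[j].y -= fij.y;  f[j].z -= fij.z;
    }
}
```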
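
Among the algorithms retained from GROMOS, the leap-frog integrator [9] keeps velocities at half-integer time steps and positions at integer steps. The following sketch of one leap-frog step is illustrative only, not the GROMACS implementation; constraint and coupling steps are omitted.

```c
typedef struct { double x, y, z; } rvec;

/* One leap-frog step:
 *   v(t + dt/2) = v(t - dt/2) + dt * F(t) / m
 *   x(t + dt)   = x(t) + dt * v(t + dt/2)
 * v holds velocities at t - dt/2 on entry and t + dt/2 on exit;
 * x and f are positions and forces at time t; invmass[i] is 1/m_i. */
void leapfrog_step(rvec *x, rvec *v, const rvec *f,
                   const double *invmass, int n, double dt)
{
    for (int i = 0; i < n; ++i) {
        v[i].x += dt * invmass[i] * f[i].x;
        v[i].y += dt * invmass[i] * f[i].y;
        v[i].z += dt * invmass[i] * f[i].z;
        x[i].x += dt * v[i].x;
        x[i].y += dt * v[i].y;
        x[i].z += dt * v[i].z;
    }
}
```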
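
Item (3), storing the periodic image with each neighborlist entry, can be sketched as follows (hypothetical code; the names nb_pair, shift_vec, and pair_vectors are illustrative). Each pair carries an index into a small table of precomputed box translation vectors, so the inner loop adds the shift unconditionally instead of testing which periodic image of particle j is nearest.

```c
#include <stddef.h>

typedef struct { double x, y, z; } rvec;
typedef struct { int i, j, shift; } nb_pair;  /* shift: index into shift_vec[] */

/* Compute the pair vectors r_ij for a slice of the neighborlist; the
 * stored shift index selects the correct periodic image of j, so the
 * inner loop contains no conditional statements. */
void pair_vectors(const rvec *x, const nb_pair *nlist, size_t n,
                  const rvec *shift_vec, rvec *dx)
{
    for (size_t k = 0; k < n; ++k) {
        const rvec *ri = &x[nlist[k].i];
        const rvec *rj = &x[nlist[k].j];
        const rvec *s  = &shift_vec[nlist[k].shift];
        dx[k].x = ri->x - (rj->x + s->x);
        dx[k].y = ri->y - (rj->y + s->y);
        dx[k].z = ri->z - (rj->z + s->z);
    }
}
```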
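
Item (4), a software routine for the inverse square root, typically combines a cheap initial guess with one Newton-Raphson refinement step. The sketch below shows a widely known single-precision variant of this idea that seeds the iteration from the exponent bits; it is not the GROMACS routine, which derives its seed from a lookup table, but it illustrates why such a routine can outperform a division plus a library sqrt() call.

```c
#include <stdint.h>

/* Approximate 1/sqrt(a) for a > 0: a crude seed from the IEEE-754 bit
 * pattern, refined by one Newton-Raphson step for f(y) = 1/y^2 - a.
 * (Illustrative variant only; not the table-based GROMACS routine.) */
static float fast_invsqrt(float a)
{
    union { float f; uint32_t u; } c;
    c.f = a;
    c.u = 0x5f3759dfu - (c.u >> 1);        /* initial guess from the exponent */
    float y = c.f;
    y = y * (1.5f - 0.5f * a * y * y);     /* one Newton-Raphson refinement */
    return y;                              /* relative error below about 0.2% */
}
```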
doi:10.1002/jcc.20291 pmid:16211538