Innovatives Supercomputing in Deutschland
inSiDE • Vol. 11 No. 1 • Spring 2013

SuperMUC boosts the largest molecular dynamics simulation by 4X in Number of Particles

MD simulation has become a recognized tool in engineering and the natural sciences, complementing theory and experiment. Despite more than half a century of development, scientists still strive for ever larger and longer simulation runs to cover processes on greater length and time scales. Because MD typically exhibits massive parallelism, it is a prime application for high-performance computing.

The simulation code ls1 mardyn has been enhanced within the BMBF-funded project IMEMO [1] in order to provide an efficient tool for large-scale MD simulations of phenomena in inhomogeneous systems in chemical engineering. Targeting condensation processes and flow phenomena on the nanoscale, it supports rigid-body electroneutral molecular models composed of an arbitrary number of Lennard-Jones (LJ) sites, point charges, point dipoles and point quadrupoles. As these scenarios typically require large particle numbers and also show heterogeneous density distributions of particles (see Fig. 1), sophisticated load balancing algorithms have been incorporated into the program to enable good scalability on large processor counts.

Together with colleagues from LRZ, the laboratories of Thermodynamics at TU Kaiserslautern and U Paderborn, and HLRS, we optimized our MD code on the micro-architecture level for a specific processor architecture: Intel Sandy Bridge EP, on which SuperMUC, operated at the Leibniz Supercomputing Centre in Munich, is based. This system features 147456 cores and is at present the biggest x86 system worldwide, with a theoretical peak performance of more than 3 PFLOPS in double precision or 6 PFLOPS in single precision; it was placed #6 on the Top500 list of November 2012. The system was assembled by IBM and features a highly efficient hot-water cooling solution. In contrast to supercomputers offered by Cray, SGI or even IBM's own BlueGene, the machine is based on a high-performance commodity network: an FDR-10 InfiniBand interconnect by Mellanox. The optimizations targeting SuperMUC include vectorization to exploit the new AVX instruction set of the underlying Intel Sandy Bridge architecture, see [2] for details. As a result, each call of the force calculation kernel always computes four particle interactions, which drastically reduces the number of clock cycles. Furthermore, we applied memory optimizations which allow us to reduce the number of bytes needed per particle to 32, see [3]. Because of the LJ potential used, we cannot optimally exploit the instruction-level parallelism of Sandy Bridge. Instead, we added a light-weight shared-memory parallelization, which accelerated the code by 12% when using Intel Hyper-Threading Technology on a per-core basis.
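The idea of evaluating four particle interactions per kernel call can be illustrated with a small scalar sketch. This is not the ls1 mardyn kernel and uses no AVX intrinsics; it only mimics the structure in plain Python, assuming reduced units (ε = σ = 1) and the standard 12-6 Lennard-Jones force. The function names `lj_force_batch4` and `lj_force`, and the default cutoff, are hypothetical choices made for illustration.

```python
BATCH_WIDTH = 4  # interactions per kernel call, as stated in the text


def lj_force_batch4(xi, batch, cutoff):
    """Force on the particle at xi from up to four neighbours.

    Assumes reduced units (epsilon = sigma = 1). Neighbours beyond the
    cutoff are skipped, which plays the role of a lane mask in a
    vectorized kernel.
    """
    fx = fy = fz = 0.0
    rc2 = cutoff * cutoff
    for xj in batch:
        dx, dy, dz = xi[0] - xj[0], xi[1] - xj[1], xi[2] - xj[2]
        r2 = dx * dx + dy * dy + dz * dz
        if r2 == 0.0 or r2 > rc2:
            continue  # masked-out lane
        inv_r2 = 1.0 / r2
        inv_r6 = inv_r2 ** 3
        # F = 24 * (2 / r^12 - 1 / r^6) / r^2 * (xi - xj)
        scale = 24.0 * inv_r2 * inv_r6 * (2.0 * inv_r6 - 1.0)
        fx += scale * dx
        fy += scale * dy
        fz += scale * dz
    return fx, fy, fz


def lj_force(xi, neighbours, cutoff=3.5):
    """Total force on xi, processing the neighbour list in batches of four."""
    total = [0.0, 0.0, 0.0]
    for start in range(0, len(neighbours), BATCH_WIDTH):
        f = lj_force_batch4(xi, neighbours[start:start + BATCH_WIDTH], cutoff)
        for k in range(3):
            total[k] += f[k]
    return tuple(total)
```

In a real vectorized kernel the inner loop over the batch would be replaced by SIMD instructions operating on all four lanes at once; the sketch only shows how the neighbour list is consumed four entries at a time and how the cutoff acts as a mask.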

Figure 1.: Scenario with a heterogeneous particle distribution as it occurs in the simulation of nucleation (with help from LRZ application support).

In order to evaluate the performance of the MD simulation code ls1 mardyn, we executed different tests on SuperMUC. Our test scenario is similar to the initial configuration of nucleation scenarios, where particles are distributed on a regular grid. The single-center Lennard-Jones particles, modeling e.g. argon, were arranged on a body-centered cubic lattice with a number density of ρσ³ = 0.78 in reduced units. The time step length was set to 1 fs. With respect to strong scaling behavior, we ran a scenario with N = 4.8 × 10⁹ particles, which fully utilizes the memory available on 8 nodes (128 cores), as 18 GB per node are needed for particle data. Fig. 2 shows that nearly perfect scaling was achieved for up to 146016 cores using 292032 threads, at a parallel efficiency of 42% when comparing 128 to 146016 cores. To better understand this performance, we investigate in Fig. 3 the influence of the decreasing particle number per core, as it occurs in this strong scaling experiment: here we measured the achievable GFLOPS depending on the number of particles simulated on 8 nodes. Furthermore, we show the influence of different cutoff radii, which determine the number of interactions per molecule, as this parameter also severely affects the FLOP rate. To allow a fair comparison with preceding publications, we conducted our runs with a cutoff radius of 3.5 σ. Already for N = 3 × 10⁸ particles, i.e. 2.3 × 10⁶ particles per core (approx. 8% of the available memory), we reach a performance of roughly 550 GFLOPS per 8 nodes, which we also obtained for N = 4.8 × 10⁹ (37.5 × 10⁶ particles per core).
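The initial configuration described above can be sketched as follows. This is a minimal illustration, not the generator used in ls1 mardyn: it assumes a cubic box, reduced units with σ = 1, and uses the fact that a body-centered cubic unit cell holds two particles, so the lattice constant follows from the number density as a = (2/ρ)^(1/3). The function name `bcc_lattice` and its parameters are hypothetical.

```python
def bcc_lattice(cells_per_side, number_density=0.78):
    """Return (positions, box_length) for a body-centered cubic lattice.

    Assumes reduced units (sigma = 1) and a cubic box. Each unit cell
    contributes one corner particle and one body-center particle.
    """
    # Two particles per unit cell of volume a^3 give density 2 / a^3.
    a = (2.0 / number_density) ** (1.0 / 3.0)
    positions = []
    for ix in range(cells_per_side):
        for iy in range(cells_per_side):
            for iz in range(cells_per_side):
                positions.append((ix * a, iy * a, iz * a))
                positions.append(((ix + 0.5) * a, (iy + 0.5) * a, (iz + 0.5) * a))
    return positions, cells_per_side * a


def particles_per_core(n_particles, cores):
    """Average particle count per core for an even distribution."""
    return n_particles / cores
```

For example, `bcc_lattice(3)` yields 54 particles whose count divided by the box volume reproduces the density of 0.78, and `particles_per_core(4_800_000_000, 128)` gives the 37.5 × 10⁶ particles per core quoted for the strong scaling scenario.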

Figure 2a.: FLOPS measured for strong and weak scaling experiment.

Moreover, we performed a weak scaling analysis with 28.25 × 10⁶ molecules per core. This allowed us to perform the, to our knowledge, largest MD simulation to date, simulating 4.125 × 10¹² particles on 146016 cores with one time step taking roughly 40 s. For this scenario an absolute performance of 591.2 TFLOPS with a speedup of 133183X in comparison to a single core was achieved, which corresponds to 9.4% of the single-precision peak performance.
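The weak scaling figures can be cross-checked with a few lines of arithmetic. The particle count, core count, speedup and measured FLOP rate below are taken from the text; the per-core peak is an assumption not stated in the article, namely a 2.7 GHz clock and 16 single-precision floating-point operations per cycle per core with AVX. The function names are hypothetical.

```python
CORES = 146016
PARTICLES_PER_CORE = 28.25e6
MEASURED_TFLOPS = 591.2
SPEEDUP_VS_ONE_CORE = 133183

# Assumptions (not from the article): core clock and single-precision
# FLOPs per cycle per core on Sandy Bridge with AVX.
ASSUMED_CLOCK_GHZ = 2.7
ASSUMED_SP_FLOPS_PER_CYCLE = 16


def total_particles(per_core, cores):
    """Total particle count in a weak scaling run."""
    return per_core * cores


def parallel_efficiency(speedup, cores, base_cores=1):
    """Speedup divided by the ideal speedup cores / base_cores."""
    return speedup / (cores / base_cores)


def peak_fraction(measured_tflops, cores, clock_ghz, flops_per_cycle):
    """Measured FLOP rate as a fraction of the theoretical peak."""
    peak_tflops = cores * clock_ghz * flops_per_cycle / 1000.0
    return measured_tflops / peak_tflops


if __name__ == "__main__":
    print(total_particles(PARTICLES_PER_CORE, CORES))
    print(parallel_efficiency(SPEEDUP_VS_ONE_CORE, CORES))
    print(peak_fraction(MEASURED_TFLOPS, CORES,
                        ASSUMED_CLOCK_GHZ, ASSUMED_SP_FLOPS_PER_CYCLE))
```

Under these assumptions the total comes out at about 4.125 × 10¹² particles, the weak scaling efficiency relative to a single core at roughly 91%, and the fraction of single-precision peak at roughly 9.4%, consistent with the numbers reported above.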

Figure 2b.: Speedup for strong and weak scaling experiment. For strong scaling, the speedup is normalized to 128 processes. For weak scaling, the speedup is relative to a single process.

Figure 3.: Flop rate on 128 cores in dependence of the number of particles and the cutoff radius (which determines the number of interaction partners of a particle).


[1] C. Niethammer, C. W. Glass, M. Bernreuther, S. Becker, T. Windmann, M. T. Horsch, J. Vrabec, W. Eckhardt. Innovative HPC Methods and Application to Highly Scalable Molecular Simulation (IMEMO). In inSiDE - Innovatives Supercomputing in Deutschland, Volume 10(1), April 2012.

[2] W. Eckhardt, A. Heinecke. An efficient Vectorization of Linked-Cell Particle Simulations. In ACM International Conference on Computing Frontiers, pp. 241–243, May 2012.

[3] W. Eckhardt, A. Heinecke. Memory-Efficient Implementation of a Rigid-Body Molecular Dynamics Simulation. In Proceedings of the 11th International Symposium on Parallel and Distributed Computing - ISPDC 2012, pp. 103–110. IEEE, Munich, June 2012.

• W. Eckhardt
A. Heinecke
Gauss Centre for Supercomputing (GCS)
