Innovatives Supercomputing in Deutschland
inSiDE • Vol. 11 No. 2 • Autumn 2013

Gadget3: Numerical Simulation of Structure Formation in the Universe

Figure 1: Visualization of a 1 Gpc³ cosmological box, showing the filamentary distribution of matter in the Universe and zooming in on a supercluster region. The simulation was performed with Gadget3.

Cosmological simulations play an important role in theoretical studies of structure formation in the Universe. They are an essential tool for accurately calculating theoretical predictions of the distribution and state of baryonic and dark matter, especially in the non-linear regime of gravitational dynamics and hydrodynamics, where galaxies and galaxy clusters form out of the large-scale structure. Realistic modeling must include the baryonic as well as the dark matter component and has to describe physical processes such as star formation, supernova and AGN feedback, transport processes, and magnetic fields. Numerical simulations must therefore be capable of properly resolving and following these phenomena using different numerical techniques simultaneously.

The KONWIHR-III project "Tuning a cosmo Gadget for SuperMUC", funded by the Federal State of Bavaria, aims at optimizing Gadget3 [1,2], an N-body Magneto-Smoothed-Particle-Hydrodynamics code for cosmological simulations, for which we have been granted computing time on SuperMUC at the Leibniz Rechenzentrum. From the astrophysical point of view, current cosmological simulations are limited by the number of tracer particles that can be used. On the one hand, large volumes need to be simulated to sample a representative part of the Universe. On the other hand, even with several 10⁹ particles (as for the visualization shown in Fig. 1), the resulting resolution is still poor (~30 kpc), and such simulations do not properly resolve individual galaxies. Hence, larger volumes and higher resolution are needed to provide a theoretical counterpart for interpreting the data coming from current and forthcoming astronomical surveys. To achieve this, future simulations must reach 10¹¹ particles and beyond while following various physical processes. Being able to perform such large hydrodynamical cosmological simulations will, for the first time, allow a detailed comparison with a variety of multi-wavelength observations.

From the numerical point of view, Gadget3 is a massively parallel code which employs various algorithms to treat different physical processes. Originally (with the exception of the gravity solver) it relied on a pure MPI implementation. However, on current and future hardware configurations like SuperMUC, the code is expected to run on hundreds of thousands of cores, which makes a pure MPI implementation challenging: previously subdominant parts of the code, as well as the communication, start to dominate strongly and eventually render the code unusable for large numbers of MPI tasks. Therefore, Gadget3 was converted into a hybrid code, making intensive use of either POSIX threads (as in the original gravity solver) or OpenMP (the focus of this project) for node-level multithreading. We aimed at identifying bottlenecks, extending the shared-memory parallelization to more parts of the code, and optimizing Gadget3 to be usable as a hybrid MPI/OpenMP code on HPC facilities like SuperMUC.

Figure 2: Cumulative CPU time as a function of simulation time, given as the dimensionless cosmological scale factor (1 is the present), broken down into the most computationally intensive parts of Gadget3. The dashed lines represent the Gadget3 version at project start and the solid lines the latest version after optimization. Individual parts show a very large performance gain (except for the gravity tree, where the correction of data races slightly slows down the algorithm), and overall we see a total improvement of almost 30 %.

Following some of the processes in detail (e.g. the evolution of black holes), as well as the on-the-fly post-processing (which is essential for the scientific analysis of such simulations), involves finding and identifying substructures at simulation run time. Substeps like the density estimation and the gravitational unbinding check of material within substructures are the most challenging and essential parts of the post-processing; their optimization and OpenMP parallelization is therefore a good example of the main work done within this KONWIHR-III project. Substructure identification typically has to be performed about 100 times during a simulation. Within this process, the density calculation (for all particles) is one of the most time-consuming parts (up to 75 % of the total execution time). Here it is possible to distribute particles over multiple threads and process many particles in parallel on each node. Additionally, various sort operations have to be performed, which benefit significantly from multithreaded sort algorithms. Doing so, we improved the overall performance of these parts of the algorithm by a factor of 1.8 for the typical setup of 2 MPI tasks per node and 16 OpenMP threads per task, which we use for large simulations on SuperMUC.

Figure 3: Strong scaling of Gadget3 on SuperMUC. The plot shows the number of particles processed per core, time step, and wall-clock time, broken down into the most computationally intensive parts of the calculation, and spans from 0.5 to 16 islands of SuperMUC. Solid and dashed lines indicate two different communication structures which can be used.

More generally, implementing multithreading in several parts of the code, especially in the most computationally intensive parts, has led so far to a total improvement in performance of almost 30 % (Fig. 2).

Furthermore, our team was the first to run Gadget3 on 16 islands of SuperMUC. We successfully ran Gadget3 on 0.5, 1, 2, 4, 8, and 16 islands (i.e. 131,072 cores) using the setup given above (Fig. 3), demonstrating that the code strong-scales very well up to a significant fraction of SuperMUC. The scaling is very close to ideal up to 4 islands. Beyond this point, it is encouraging to see that the improved version of Gadget3 still performs reasonably well and can effectively make use of 16 thin-node islands, even when various additional physical processes are switched on. This result is quite promising for future simulations on SuperMUC and for the further optimizations we continue to work on.


[1] Springel, V. 2005, MNRAS, 364, 1105-1134

[2] Springel, V., Yoshida, N., White, S.D.M. 2001, New Astronomy, 6, 79-117

• Gurvan Bazin
• Klaus Dolag
Universitätssternwarte, Ludwig-Maximilians-Universität München

• Nicolay Hammer
Leibniz Supercomputing Centre (LRZ)
