Innovatives Supercomputing in Deutschland
inSiDE • Vol. 5 No. 2 • Autumn 2007

Sustaining Tflop/s in Simulations of Quantum Chromodynamics

The first computer delivering more than one Tflop/s in the Linpack benchmark appeared on the TOP-500 list in June 1997 [1]. Sustaining a Tflop/s with a real application in everyday production runs is another story. Single-CPU performance has to be good, the programme must scale sufficiently well, and it should not need the whole computer. For German quantum chromodynamics researchers, sustaining one Tflop/s became reality with the installation of the current generation of supercomputers at LRZ and NIC, the SGI Altix 4700 and the IBM BlueGene/L. A third supercomputer offering comparable performance in single QCD applications is the apeNEXT at NIC/DESY Zeuthen.

Quantum chromodynamics (QCD) is the theory of strongly interacting elementary particles. The theory describes particle properties like masses and decay constants from first principles. The starting point of QCD is an infinite-dimensional integral. To treat the theory on a computer, the space-time continuum is replaced by a regular, finite four-dimensional lattice with (anti-)periodic boundary conditions. After this discretization the integral is finite-dimensional, though still of very high dimension. It is evaluated by Monte Carlo methods.
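The lattice geometry described above can be sketched in a few lines of Python. This is only an illustration: the function names, the lattice extents and the storage order are assumptions made here, and the sign change that anti-periodic boundary conditions impose on fermion fields is not modelled.

```python
def site_index(coords, dims):
    """Map 4D lattice coordinates to a linear index (first coordinate runs fastest)."""
    index = 0
    stride = 1
    for x, n in zip(coords, dims):
        index += (x % n) * stride
        stride *= n
    return index


def neighbours(coords, dims):
    """Return the eight nearest neighbours of a site, wrapping around periodically."""
    result = []
    for mu in range(len(dims)):          # four space-time directions
        for step in (+1, -1):            # forward and backward neighbour
            shifted = list(coords)
            shifted[mu] = (shifted[mu] + step) % dims[mu]
            result.append(tuple(shifted))
    return result


# Hypothetical lattice extents, chosen only for the demonstration.
dims = (4, 4, 4, 8)
for site in neighbours((0, 0, 0, 7), dims):
    print(site, site_index(site, dims))
```

A site on the boundary, such as (0, 0, 0, 7) here, has neighbours on the opposite side of the lattice, which is what periodic boundary conditions mean in practice.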

The basic building blocks of QCD are called quarks (matter particles) and gluons (particles mediating the interaction of quarks). The quark fields cannot be represented directly on a computer. In the computations they appear as large sparse matrices which describe systems of linear equations. QCD programmes spend most of their execution time solving these systems of linear equations. One research aim of the lattice QCD community is to find better algorithms that require fewer systems of equations to be solved or that improve the convergence of iterative solvers. In any solver, and in a QCD programme as a whole, the multiplication of the so-called hopping matrix with a vector is the dominant operation.
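Why the matrix-vector product dominates can be seen in a minimal conjugate gradient sketch, written here in Python. The operator is a toy stand-in chosen for this illustration (a one-dimensional periodic nearest-neighbour operator acting on real numbers), not the QCD hopping matrix, which acts on complex colour and spin components with gauge fields on the links. All names and parameters are assumptions.

```python
def apply_operator(v, mass_sq=0.5):
    """Toy sparse operator: (mass_sq + 2) * v[i] - v[i+1] - v[i-1] with periodic wrap."""
    n = len(v)
    return [(mass_sq + 2.0) * v[i] - v[(i + 1) % n] - v[(i - 1) % n]
            for i in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only as a function."""
    x = [0.0] * len(b)
    r = list(b)                  # residual b - A x for the start vector x = 0
    p = list(r)                  # first search direction
    rr = dot(r, r)
    matvecs = 0
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:
            break
        ap = apply_a(p)          # the one operator application per iteration
        matvecs += 1
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, matvecs


b = [1.0] + [0.0] * 15           # point source on a 16-site toy lattice
x, matvecs = conjugate_gradient(apply_operator, b)
print("operator applications:", matvecs)
```

Each iteration contains exactly one operator application plus a handful of vector updates and dot products, so the cost of the solver is essentially the cost of that multiplication.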

Sustaining Tflop/s in a QCD programme practically means sustaining Tflop/s in the hopping matrix multiplication. The prerequisite is a parallel computer with an excellent network. For example, in a Fortran/MPI implementation about 30 % of the compute time is needed for communication on the BlueGene/L.

How can one Tflop/s be sustained? At the single-CPU level QCD programmes benefit from the fact that the basic operations involve small complex matrices. One can perform on the order of ten floating point operations per memory access. As a rule of thumb the resulting performance is about 20-25 % of peak when programming in Fortran or C. The single-CPU performance can be considerably improved by employing low-level programming techniques like assembler, multimedia streaming functions, or the BlueGene double hummer routines.

QCD programmes are parallelized by a domain decomposition. If one aims at one Tflop/s the domains become so small that their surface-to-volume ratio is of order one or even larger. One consequence is that a domain fits completely into a large data cache, as is the case on the Altix. On the other hand, the data from the large domain surface has to be communicated to eight nearest-neighbour processes. On the remote processes that data will not be in the cache but has to be fetched from main memory. Performance benefits from data caches, but a substantial fraction of the data is never cached.
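The surface-to-volume argument can be made concrete with a small Python calculation. It is a sketch under stated assumptions: the local domain is a four-dimensional box, every face is exchanged with a neighbour, and sites on edges and corners are counted once per face they belong to. The local extents below are hypothetical examples, not values from the production runs.

```python
from math import prod


def surface_to_volume(local_dims):
    """Ratio of face sites (two faces per direction) to sites in the local domain."""
    volume = prod(local_dims)
    # A face perpendicular to direction mu holds volume / extent_mu sites.
    surface = sum(2 * (volume // extent) for extent in local_dims)
    return surface / volume


for dims in [(16, 16, 16, 16), (8, 8, 8, 8), (4, 4, 4, 4)]:
    print(dims, surface_to_volume(dims))
```

For a hypercubic domain of extent L the ratio is 8/L, so it reaches one at L = 8 and exceeds one for smaller domains, which is the regime described above.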

At the software level important optimization techniques are prefetching data from memory and overlapping communication and computation. These techniques could be used at a high level on LRZ's previous machine, the Hitachi SR8000-F1. On that machine prefetching instructions were inserted by the compiler, which led to a single-CPU performance of more than 40 % of peak for the hopping matrix multiplication implemented in Fortran. In a hybrid programming approach employing Fortran, OpenMP and MPI, communication and computation could be made to overlap. The resulting parallel performance was 30-40 % of peak [2].

On the BlueGene and the Altix hiding communication is not so straightforward to implement. On the eight-way SMP nodes of the SR8000 one could use one thread for communication while seven threads compute, resulting in 12.5 % communication overhead. Using one of the two cores of the BlueGene or Altix processors for communication would produce 50 % communication overhead. However, on both machines lower-level techniques are available by which one can try to hide communication overhead.

In our code BQCD (Berlin quantum chromodynamics programme) the hopping matrix multiplication was implemented in assembler for the BlueGene and the Altix [3]. On the BlueGene the network can be directly accessed using special load/store instructions. In the course of the computation each node needs to receive part of the data from the boundary of its neighbouring nodes, and likewise it has to send part of the data from its boundary to neighbouring nodes. In order to hide communication latency the assembler kernel always looks ahead a few iterations and sends data that will be needed by a remote node. When a CPU needs data from another node, it polls for arriving data packets. We can see that communication and computation really overlap by studying strong scaling: when going from one to eight racks the speed-up is 5.3 for the Fortran/MPI implementation but 7.3 for the assembler code (see figures). Optimizing computations alone would have decreased the speed-up of the Fortran/MPI programme because the communication part would be unchanged.
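The last remark can be illustrated with a deliberately simple timing model in Python. It is an assumption made for this sketch, not the performance model of the authors: the time on p racks is taken as T(p) = C/p + M, with perfectly scaling computation C and a communication time M that does not shrink with p. The compute gain of 1.3 is a hypothetical value.

```python
def speedup(compute, comm, racks=8, compute_gain=1.0):
    """Speed-up from 1 rack to `racks` racks in the toy model T(p) = C / (gain * p) + M."""
    t_one = compute / compute_gain + comm
    t_many = compute / (compute_gain * racks) + comm
    return t_one / t_many


def compute_to_comm_ratio(measured_speedup, racks=8):
    """C / M that reproduces a measured speed-up within the toy model."""
    return (measured_speedup - 1.0) / (1.0 - measured_speedup / racks)


def efficiency(measured_speedup, racks=8):
    """Parallel efficiency relative to ideal strong scaling."""
    return measured_speedup / racks


ratio = compute_to_comm_ratio(5.3)        # fit the model to the Fortran/MPI speed-up
print("parallel efficiency Fortran/MPI:", efficiency(5.3))
print("parallel efficiency assembler:  ", efficiency(7.3))
print("model speed-up, original code:  ", speedup(ratio, 1.0))
print("model speed-up, faster compute: ", speedup(ratio, 1.0, compute_gain=1.3))
```

Within this model, speeding up only the computation lowers the speed-up below 5.3, because the unchanged communication time becomes a larger share of the total. A speed-up of 7.3 therefore points to communication being hidden and not merely to faster arithmetic.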

On the Altix a promising method for hiding communication latency is using Altix Shmem-pointers, by which one can access remote memory directly without any function calls. The idea is to write directly to remote memory in the course of computations, similar to the approach taken on the BlueGene. We used Shmem-pointers in a Fortran/C and an assembler implementation. Unfortunately there was no gain in either case. Nevertheless, re-writing the hopping matrix multiplication in assembler improved the overall performance by about 30 %.

In production runs typically one BlueGene rack (2,048 cores) is used and a performance of 1.1 Tflop/s or 19 % of peak is sustained. On the Altix almost 1.5 Tflop/s or 23 % of the peak performance is measured when using 1,000 cores. Technically speaking, these values were obtained for the whole conjugate gradient solver in a lattice QCD formulation with clover improved Wilson fermions employing even/odd preconditioning. Other groups achieve similar performance figures with their implementations. In other words, people sustain one Tflop/s or more on one eighth of the BlueGene/L at NIC or an even smaller part of the Altix at LRZ, which makes the Tflop/s available as the normal sustained performance of QCD simulations.
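As a rough cross-check, the peak performance implied by these figures can be worked out in Python. The inputs are the rounded numbers quoted above, so the results are only approximate; the function names are chosen for this sketch.

```python
def implied_peak_tflops(sustained_tflops, fraction_of_peak):
    """Peak performance of the partition implied by a sustained rate and its share of peak."""
    return sustained_tflops / fraction_of_peak


def per_core_gflops(total_tflops, cores):
    """Convert a partition-wide Tflop/s figure into Gflop/s per core."""
    return total_tflops * 1000.0 / cores


# (machine, sustained Tflop/s, fraction of peak, cores) as quoted in the text
runs = [("BlueGene/L", 1.1, 0.19, 2048), ("Altix 4700", 1.5, 0.23, 1000)]

for name, sustained, fraction, cores in runs:
    peak = implied_peak_tflops(sustained, fraction)
    print(name,
          "implied peak per core [Gflop/s]:", round(per_core_gflops(peak, cores), 2),
          "sustained per core [Gflop/s]:", round(per_core_gflops(sustained, cores), 2))
```

The comparison shows that the Altix run needs fewer than half as many cores as the BlueGene run because each core has a higher peak and runs at a slightly higher fraction of it.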

 

References:

[1] www.top500.org

[2] Schierholz, G., Stüben, H. Optimizing the Hybrid Monte Carlo Algorithm on the Hitachi SR8000, in Wagner, S., Hanke, W., Bode, A., Durst, F. (Eds.), High Performance Computing in Science and Engineering, Munich, 2004, Springer-Verlag

[3] Streuer, T., Stüben, H. Simulations of QCD in the Era of Sustained Tflop/s Computing, Contribution to Parallel Computing 2007 (ParCo2007), Aachen and Jülich, September 4-7, 2007 (in preparation)

• Hinnerk Stüben

Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)

