Sustaining Tflop/s
in Simulations of
Quantum Chromodynamics
The first computer delivering a performance
of more than one Tflop/s in the
Linpack benchmark appeared on the
TOP-500 list in June 1997 [1]. Sustaining
a Tflop/s with a real application in
everyday production runs is another
story. The single-CPU performance has
to be good, the programme must scale
sufficiently well, and the run should not
need the whole computer. For German quantum
chromodynamics researchers, sustaining
one Tflop/s became reality
with the installation of the current generation
of supercomputers at LRZ and
NIC, the SGI Altix 4700 and the IBM
BlueGene/L. A third supercomputer offering
comparable performance in single
QCD applications is the apeNEXT at
NIC/DESY Zeuthen.
Quantum chromodynamics (QCD) is the
theory of strongly interacting elementary
particles. The theory describes
particle properties like masses and
decay constants from first principles.
The starting point of QCD is an infinite-dimensional
integral. To deal with the
theory on a computer, the space-time continuum
is replaced by a four-dimensional
regular finite lattice with (anti-)periodic
boundary conditions. After this discretization
the integral is finite-dimensional
but still very high-dimensional. It is
evaluated by Monte Carlo methods.
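As a toy illustration of evaluating a lattice integral by Monte Carlo, the sketch below applies the Metropolis algorithm to a free scalar field on a small one-dimensional periodic lattice. This is not QCD: the model, the function names and all parameter values are assumptions, chosen only because the exact answer is known and the estimate can be checked against it.

```python
import math
import random


def exact_phi_squared(n_sites, mass):
    """Exact <phi^2> of the free lattice field, summed over momentum modes."""
    total = 0.0
    for k in range(n_sites):
        total += 1.0 / (mass ** 2 + 4.0 * math.sin(math.pi * k / n_sites) ** 2)
    return total / n_sites


def local_action(value, left, right, mass):
    """Part of the action that depends on the field value at a single site."""
    return (0.5 * (value - left) ** 2 + 0.5 * (right - value) ** 2
            + 0.5 * mass ** 2 * value ** 2)


def metropolis_phi_squared(n_sites=8, mass=1.0, sweeps=20000,
                           thermalization=1000, step=1.0, seed=1):
    """Estimate <phi^2> by sampling field configurations with weight exp(-S)."""
    rng = random.Random(seed)
    phi = [0.0] * n_sites
    total = 0.0
    for sweep in range(thermalization + sweeps):
        for i in range(n_sites):
            left = phi[i - 1]                # index -1 wraps around (periodic)
            right = phi[(i + 1) % n_sites]
            proposal = phi[i] + rng.uniform(-step, step)
            delta = (local_action(proposal, left, right, mass)
                     - local_action(phi[i], left, right, mass))
            # Metropolis accept/reject step
            if delta <= 0.0 or rng.random() < math.exp(-delta):
                phi[i] = proposal
        if sweep >= thermalization:
            total += sum(value * value for value in phi) / n_sites
    return total / sweeps
```

With eight sites the integral is only eight-dimensional. In lattice QCD the same idea is applied to integrals with many orders of magnitude more variables, which is why direct quadrature is not an option.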
The basic building blocks of QCD are
called quarks (matter particles) and
gluons (particles mediating the interaction
of quarks). The quark fields cannot
be represented directly on a computer.
In the computations they appear as
large sparse matrices which describe
systems of linear equations. QCD programmes
spend most of their execution
time solving these systems of linear
equations. One research aim of the lattice
QCD community is to find better
algorithms that require fewer systems of
equations to be solved or that
improve the convergence of iterative solvers.
In any solver, and in a QCD
programme as a whole, the multiplication of the
so-called hopping matrix with a vector
is the dominant operation.
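The structure of such a matrix-vector product can be sketched as a nearest-neighbour stencil on a periodic four-dimensional lattice. The version below is deliberately simplified and rests on assumptions: it keeps only the colour index (3x3 link matrices acting on 3-component vectors) and omits the spin structure and any prefactors of a real Wilson hopping term. All function names are hypothetical.

```python
import itertools

NC = 3  # number of colours


def all_sites(dims):
    """Iterate over all lattice sites as coordinate tuples."""
    return itertools.product(*(range(n) for n in dims))


def neighbour(site, mu, step, dims):
    """Neighbouring site in direction mu with periodic boundary conditions."""
    shifted = list(site)
    shifted[mu] = (shifted[mu] + step) % dims[mu]
    return tuple(shifted)


def mat_vec(u, v):
    """Multiply a 3x3 complex matrix with a 3-component vector."""
    return [sum(u[a][b] * v[b] for b in range(NC)) for a in range(NC)]


def adj_mat_vec(u, v):
    """Multiply the Hermitian conjugate of u with a 3-component vector."""
    return [sum(u[b][a].conjugate() * v[b] for b in range(NC))
            for a in range(NC)]


def apply_hopping(links, field, dims):
    """Sum over all directions of forward and backward neighbour terms."""
    result = {}
    for site in all_sites(dims):
        acc = [0j] * NC
        for mu in range(len(dims)):
            fwd = neighbour(site, mu, +1, dims)
            bwd = neighbour(site, mu, -1, dims)
            forward_term = mat_vec(links[site][mu], field[fwd])
            backward_term = adj_mat_vec(links[bwd][mu], field[bwd])
            acc = [a + f + b
                   for a, f, b in zip(acc, forward_term, backward_term)]
        result[site] = acc
    return result


def unit_links(dims):
    """Trivial gauge field: every link is the identity matrix."""
    identity = [[1 + 0j if a == b else 0j for b in range(NC)]
                for a in range(NC)]
    return {site: [identity] * len(dims) for site in all_sites(dims)}
```

Each site touches eight neighbours and performs only small dense matrix operations, so the matrix itself is never stored. With unit links and a constant vector, every site simply receives eight times the input vector, which makes a convenient sanity check.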
Sustaining Tflop/s in a QCD programme
practically means sustaining Tflop/s in
the hopping matrix multiplication. The
prerequisite is a parallel computer with
an excellent network. For example, in a
Fortran/MPI implementation about 30 %
of the run time is spent on communication
on the BlueGene/L.
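Why solver performance reduces to matrix-vector performance can be seen in a minimal conjugate gradient sketch, in which the matrix enters only through a callback. The code below is an illustration under assumptions: it uses real numbers and a hypothetical 2x2 symmetric positive definite test matrix, whereas the systems in QCD are complex and vastly larger.

```python
def dot(a, b):
    """Scalar product of two real vectors."""
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_matrix, rhs, tol=1e-10, max_iter=1000):
    """Solve A x = rhs for symmetric positive definite A.

    Returns the solution and the number of matrix-vector products used.
    """
    x = [0.0] * len(rhs)
    r = list(rhs)          # residual for the start vector x = 0
    p = list(r)            # first search direction
    rr = dot(r, r)
    matvecs = 0
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        ap = apply_matrix(p)   # the only place where the matrix is used
        matvecs += 1
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, matvecs


def apply_example(v):
    """Hypothetical test matrix [[4, 1], [1, 3]] applied to a vector."""
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]
```

For the right-hand side `[1.0, 2.0]` the exact solution is `[1/11, 7/11]`. Apart from one matrix-vector product, each iteration contains only a few vector updates and scalar products.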
How can one Tflop/s be sustained? At
the single CPU level QCD programmes
benefit from the fact that the basic operations
involve small complex matrices.
One can perform on the order of ten
floating point operations per memory
access. As a rule of thumb the resulting
performance is about 20-25 % of peak
when programming in Fortran or C. The
single CPU performance can be considerably
improved by employing low level
programming techniques like assembler,
multimedia streaming functions, or the
BlueGene double hummer routines.
QCD programmes are parallelized by
a domain decomposition. If one aims
at one Tflop/s, the domains become so
small that their surface-to-volume ratio
is of order one or even larger.
As a result, a domain fits completely
into a large data cache,
as is the case on the Altix. On the
other hand the data from the large domain
surface has to be communicated
to eight nearest neighbour processes.
On the remote processes that data
will not be in the cache but has to be
fetched from main memory. Performance
benefits from data caches but a
substantial fraction of the data is never
cached.
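The surface-to-volume argument can be made concrete with a short calculation. The sketch below assumes that the lattice is split in all four directions and that a boundary site is counted once for every face it lies on. The function names and the example domain sizes are hypothetical.

```python
def surface_to_volume(local_dims):
    """Ratio of boundary sites (counted per face) to sites in a local domain."""
    volume = 1
    for extent in local_dims:
        volume *= extent
    # Two faces per direction, each containing volume / extent sites.
    surface = sum(2 * volume // extent for extent in local_dims)
    return surface / volume


def neighbour_count(local_dims):
    """Number of nearest-neighbour domains: two per decomposed direction."""
    return 2 * len(local_dims)


for dims in [(16, 16, 16, 16), (8, 8, 8, 8), (4, 4, 8, 8), (4, 4, 4, 4)]:
    print(dims, surface_to_volume(dims), neighbour_count(dims))
```

Under this counting convention the ratio is 0.5 for a 16^4 domain, reaches 1 for 8^4 and grows to 2 for 4^4, while the number of neighbouring processes stays at eight.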

At the software level important optimization
techniques are prefetching data
from memory and overlapping communication
and computation. These
techniques could be used at a high level
on LRZ's previous machine, the Hitachi
SR8000-F1. On that machine prefetching
instructions were inserted by the
compiler, which led to a single-CPU performance
of more than 40 % of peak
for the hopping matrix multiplication
implemented in Fortran. In a hybrid programming
approach, employing Fortran,
OpenMP and MPI, one could make
communication and computation
overlap. The resulting parallel performance
was 30-40 % of peak [2].
On the BlueGene and the Altix hiding
communication is not so straightforward
to implement. On the eight-way
SMP nodes of the SR8000 one could
use one thread for communication
while seven threads compute, resulting
in 12.5 % communication overhead.
Using one of the two cores of
the BlueGene or Altix processors for
communication would produce 50 %
communication overhead. However, on
both machines lower level techniques
are available by which one can try to
hide communication overhead.
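The overhead figures follow from simple arithmetic: if one of n execution units per node is reserved for communication, a fraction 1/n of the node's compute capacity is given up. A minimal sketch, with a hypothetical function name:

```python
def dedicated_comm_overhead(units_per_node, comm_units=1):
    """Fraction of compute capacity lost to units that only communicate."""
    return comm_units / units_per_node


print(dedicated_comm_overhead(8))  # eight-way SMP node of the SR8000
print(dedicated_comm_overhead(2))  # two cores, as on the BlueGene or Altix
```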
In our code BQCD (Berlin quantum
chromodynamics programme) the hopping
matrix multiplication was implemented
in assembler for the BlueGene
and the Altix [3].
On the BlueGene the network can be
directly accessed using special load/
store instructions. In the course of the
computation each node needs to receive
part of the data from the boundary
of its neighbouring nodes, and
likewise it has to send part of the data
from its boundary to neighbouring
nodes. In order to hide communication
latency the assembler kernel always
looks ahead a few iterations and sends
data that will be needed by a remote
node. When a CPU needs data from
another node, it polls for arriving data
packets. We can see that communication
and computation really overlap by
studying strong scaling: when going
from one to eight racks the speed-up
is 5.3 for the Fortran/MPI implementation
but it is 7.3 for the assembler
code (see figures). Optimizing computations
alone would have decreased
the speed-up of the Fortran/MPI programme
because the communication
part would be unchanged.
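The scaling argument can be illustrated with a simple timing model in which the run time is the sum of a compute part that shrinks with the number of racks and a communication part that is not hidden. The model and the timing numbers in the usage example are assumptions for illustration only. The speed-ups 5.3 and 7.3 on eight racks are the measured values quoted above.

```python
def parallel_efficiency(speedup, scale_factor):
    """Measured speed-up divided by the ideal speed-up."""
    return speedup / scale_factor


def model_speedup(compute_1, comm_1, comm_n, n, compute_gain=1.0):
    """Speed-up from 1 to n racks when communication is not overlapped.

    compute_1:    compute time on one rack (arbitrary units)
    comm_1:       communication time on one rack
    comm_n:       communication time on n racks
    compute_gain: factor by which the compute part alone is accelerated
    """
    t_1 = compute_1 / compute_gain + comm_1
    t_n = compute_1 / (n * compute_gain) + comm_n
    return t_1 / t_n


# Efficiencies implied by the measured speed-ups on eight racks.
print(parallel_efficiency(5.3, 8))  # Fortran/MPI
print(parallel_efficiency(7.3, 8))  # assembler with overlapped communication

# Hypothetical timings: 70 units compute, 30 units communication on one rack.
print(model_speedup(70.0, 30.0, 10.0, 8))
print(model_speedup(70.0, 30.0, 10.0, 8, compute_gain=2.0))
```

In this model, doubling the compute speed lowers the speed-up from about 5.3 to about 4.5, because the unchanged communication time makes up a larger share of the shorter run. A measured speed-up that rises to 7.3 therefore indicates that communication is really being hidden.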

On the Altix a promising method for
hiding communication latency is using
Altix Shmem-pointers by which one can
access remote memory directly without
any function calls. The idea is to
directly write to remote memory in the
course of computations similar to the
approach taken on the BlueGene. We
used Shmem-pointers in a Fortran/C
and an assembler implementation. Unfortunately,
there was no gain in either
case. Nevertheless, re-writing the
hopping matrix multiplication in assembler
improved the overall performance
by about 30 %.
In production runs typically one BlueGene
rack (2,048 cores) is used and a
performance of 1.1 Tflop/s or 19 %
of peak is sustained. On the Altix almost
1.5 Tflop/s or 23 % of the peak
performance is measured when
using 1,000 cores. Technically speaking
these values were obtained for the
whole conjugate gradient solver in a
lattice QCD formulation with clover
improved Wilson fermions employing
even/odd preconditioning. Other
groups achieve similar performance
figures with their implementations. In
other words, people sustain one Tflop/s
or more on one eighth of the BlueGene/L
at NIC or an even smaller part of the
Altix at LRZ, which makes the Tflop/s
available as the normal sustained performance
of QCD simulations.
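As a consistency check on the production figures, the peak performance per core implied by each pair of numbers can be computed directly. The sketch below uses only the values quoted above. Because these are rounded ("almost 1.5 Tflop/s"), the results are rough estimates and not hardware specifications. The function name is hypothetical.

```python
def implied_peak_per_core(sustained_tflops, fraction_of_peak, cores):
    """Peak performance per core in Gflop/s implied by the quoted figures."""
    peak_tflops = sustained_tflops / fraction_of_peak
    return 1000.0 * peak_tflops / cores


bluegene = implied_peak_per_core(1.1, 0.19, 2048)
altix = implied_peak_per_core(1.5, 0.23, 1000)
print(round(bluegene, 2), round(altix, 2))
```

The figures imply roughly 2.8 Gflop/s per BlueGene/L core and roughly 6.5 Gflop/s per Altix core, which explains why about half as many Altix cores sustain a higher performance.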
References:
[1] www.top500.org
[2] Schierholz, G., Stüben, H.
Optimizing the Hybrid Monte Carlo
Algorithm on the Hitachi SR8000,
in Wagner, S., Hanke, W., Bode, A.,
Durst, F. (Eds.), High Performance
Computing in Science and Engineering,
Munich, 2004, Springer-Verlag
[3] Streuer, T., Stüben, H.
Simulations of QCD in the Era of Sustained
Tflop/s Computing, Contribution to
Parallel Computing 2007 (ParCo2007),
Aachen and Jülich, September 4-7, 2007
(in preparation)
• Hinnerk Stüben
Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)