Massively Parallel Multilevel
Finite Element Solvers on the Altix 4700
In the most recent TOP-500 list, published
in June 2007, the HLRB II at the
Leibniz Computing Centre of the Bavarian
Academy of Sciences is ranked at
position 10 for solving a linear system
with 1,58 million unknowns at a rate
of 56,5 Teraflops in the Linpack benchmark.
However, this impressive result
is of little direct value for scientific applications.
There are few real life problems
that could profit from the solution
of a general dense system of equations
of such a size.
Typical supercomputer applications
today fall primarily in two classes. They
are either variants of molecular dynamics
simulations or they require the solution
of sparse linear systems as they
e.g. arise in finite element problems
for the solution of Partial Differential
Equations (PDEs). These two application
classes become apparent when
one reviews the history of the Gordon
Bell Prize, the most prestigious award
in high end computing. All Gordon Bell
Prizes fall in either of these two categories.
It is also interesting to see
the correlation between architecture
and application. For example, when the
Earth Simulator, a classical vector architecture,
was leading the TOP-500
list, the Bell Prize was warded to applications
with significant PDE content.
More recently, the prize has been
awarded for molecular dynamics-based
applications, since this is the realm of
the IBM/BlueGene systems that have
been leading the list in the past few
years. This, however, is not an indication
that the interest in fast PDE solvers
has declined, and therefore we
will report here on our results for a
massively parallel finite element solver
for elliptic PDEs. The HLRB II system is
an SGI-Altix that went into operation in
September 2006 with 4,096 processors
and an aggregate main memory
of 17,5 Terabytes (“phase 1”). In April
2007 this system was upgraded to
9,728 cores and 39 Terabytes of main
memory (“phase 2”). In particular in
terms of available main memory, this
is currently one of the largest computers
in the world. Though the HLRB II
is a general purpose supercomputer,
it is especially well suited for finite element
problems, since it has a large
main memory and a high bandwidth.
With this article we would like to demonstrate
the extraordinary power of
this system for solving finite element
problems, but also which algorithmic
choices and implementation techniques
are necessary to exploit this architecture
to its full potential.
The test problem reported in this article
is a finite element discretization
on tetrahedral 3D finite elements for
a linear, scalar, elliptic PDE in 3D, as
it could be used as a building block in
numerous more advanced applications.
We have selected this problem since it
has a wide range of applications, and
also, because it is an excellent test
example for any high performance
computer architecture.
Algorithms for very large
Scale Systems
In this article we focus on multigrid
algorithms [1,2], since these provide
mathematically the most efficient solvers
for systems originating from elliptic
PDEs. Since multigrid algorithms rely
on using a hierarchy of coarser Grids,
clever data structures must be used
and the parallel implementation must
be designed carefully so that the communication
overhead remains minimal.
This is not easy, but our results below
will demonstrate excellent performance
on solving linear systems with up to
3 x 1,011 unknowns and for up to almost
10,000 processors.
Before we turn to the techniques that
make this possible, we would like to
comment on why we do not use domain
decomposition methods that might be
suggested as an alternative to using
multigrid. In particular, it may be argued
that it is easier to implement parallel
domain decomposition efficiently than
parallel multigrid, since it avoids the
need of a Grid hierarchy. However, the
price for the ease of implementation is
a deterioration of the convergence rate
that makes plain domain decomposition
methods unsuitable on massively
parallel systems. Of course, they can be
improved by using an auxiliary coarse
space within each iteration of the algorithm.
In our view, however, with a
coarse space, domain decomposition
methods lose much of their advantage,
since they will suffer from essentially
the same bottleneck as multigrid.
We wish to point out that the need for
global communication is a fundamental
feature of elliptic PDEs and so there is
no hope to get around it by any algorithm.
Multigrid methods seem to use
the minimal amount of computation
and also the minimal amount of communication
that is necessary to solve
the problem. Using a hierarchical mesh
structure is essential not only to keep
the operation count optimal, but also to
keep for the amount of communication
minimal. The difficulty is to organize
and implement this efficiently. Using
a mesh hierarchy can only be avoided
if one is willing to pay by using more
iterations. In total this may therefore
lead to even more communication. We
believe that the computational results
below demonstrate that multigrid is the
method of choice for solving extremely
large PDE problems in parallel.
Hierarchical Hybrid Grids
HHG (“Hierarchical Hybrid Grids“) [1,2,3]
is a framework for the multigrid solution
for finite element (FE) problems.
FE methods are often preferred for
solving elliptic PDEs, since they permit
flexible, unstructured meshes. Among
the multigrid methods, algebraic multigrid
also supports unstructured Grids
automatically. Geometric multigrid, in
contrast, relies on a given hierarchy
of nested Grids. On the other hand,
geometric multigrid achieves a significantly
higher performance in terms of
unknowns computed per second. HHG
is designed to close this gap between
FE flexibility and geometric multigridperformance by using a compromise
between structured and unstructured
Grids: a coarse input FE mesh is organized
into the Grid primitives: vertices,
edges, faces, and volumes that are then
refined in a structured way as indicated
in Figure 1. This approach preserves
the flexibility of unstructured meshes,
while the regular internal structure allows
for an efficient implementation on
current computer architectures, especially
on parallel computers.

Parallelization
To exploit high-end computers, the
programs must be parallelized using
message passing. The HHG framework
is an ideal starting point for this, since
the mesh partitioning can be essentially
accomplished on the level of the
coarse input Grid, that is, with a Grid
size that can still be handled efficiently
by standard mesh partitioning software
like Metis. Figure 2a shows a simple
2D example of such a Grid distribution.
Two triangular elements are assigned
to the two processes P0 and P1. The
unknowns on the edge between the elements
are coupled to both elements
and are thus needed by both processes.
This introduces communication
(Figure 2b) and is equivalent to using
ghost nodes, as is typical in parallel
mesh algorithms. The edge data structure itself can be assigned to any one
of the two processors.
In order to avoid excessive latency, the
algorithmic details and communication
must be designed carefully. The multigrid
solver uses a Gauß-Seidel smoother
that traverses the Grid points in the
order of the primitives of the coarse
input mesh: vertices - edges - faces -
volumes. During the update of any
such group, no parallel communication
is necessary, because a vertex, for
example, can only be connected to another
vertex indirectly via an edge. This
means that, rather than sending many
small messages, each type of primitive
can have its parallel dependencies
updated as a single large message,
which greatly reduces communication
latency. However, a true Gauß-Seidel
sweep traversing over the Grid points
still requires too many communication
steps during each iteration, since
neighboring Grid points might belong to
different processes. The current HHG
implementation ignores a few such dependencies,
thus giving the smoother
the characteristics of a Jacobi Iteration
at the affected points. Numerically, this
leads to a slight deterioration of the
convergence rate, but the reduction in
execution speed more than outweighs
this effect.

World Record in linear
System Solving
In our lagest computation to date, we
have used 9,170 cores of HLRB II and
HHG to solve a finite element problem
with 307 billion unknowns in 93 seconds
run time. The problem itself is
artificially designed by meshing a cube.
This is necessary to ease the construction
of problems with varying mesh size
for our scalability study. However, the
HHG data structures would allow for an
arbitrary tetrahedral input mesh.
We believe that this is the largest finite
element system that has been solved
to date. Additionally, we point out that
the absolute times to solution are still
fast enough to leave room for using
this olver s a building block in e.g. a
time stepping scheme.
The results in Table 1 additionally
show the results of a scaling experiment
from 4 to 9,170 compute cores.
The amount of memory per core is
kept constant and the problem size is
chosen to fill as much of the available
memory as possible. If the program
were perfectly scalable, the average
time per V-cycle would stay constant
throughout the table, because the ratio
of problem size (i.e. workload) versus
number of cores (i.e. compute power)
stays constant. Near perfect scaling
is seen as measure of the quality of an
algorithm and its implementation. For
the HLRB II in installation phase 1 the
computation time increases only by a
factor of 2.2 when scaling from 4 to
3,825 cores. This is still not perfect
but in our view acceptable, especially
when compared to other algorithms
and especially in terms of the absolute
compute time. Note that perfect scalability
is the more difficult to achieve
the faster a code is.
While every core of HLRB II phase 1
still had its own memory and network
interface, the new dual-core configuration
provides less bandwidth per
core since two cores must share an
interface. Additionally, a part of the
installation is now configured as socalled
“high density partitions” where
two dual-core processors and thus four
cores share one interface. Benchmark
results including these high density
partitions are marked with an asterisk
in table 1. The timings for 64, 504 and
2,040 cores show that the dual-core
processors of phase 2 account for
approximately 39 % deterioration in
runtime compared to phase 1. Scaling
on the regular partitions shows a runtime
increase from 4.93 s on 64 cores
to 6.33 s on 6,120 cores. On the high
density partitions, the runtime deteriorates
to 7.06 s on just 128 cores but
then increases only slightly further to
7.75 s for our largest runs.

Conclusions
The HHG framework and the HLRB II
have been used to solve a finite element
problem of world-record size, exceeding
previous results by more than
an order of magnitude, see [2,5]. HHG
draws its power from using a multigrid
solver that is especially designed and
carefully optimized for current, massively
parallel high performance architectures.
The SGI Altix-architecture is
found to be well-suited for large scale
iterative FE solvers.
References:
[1] Bergen, B., Gradl, T.,
Hülsemann, F., Rüde, U.
A Massively Parallel Multigrid Method for
Finite Elements, Computing in Science and
Engineering, 8:56-62, 2006
[2] Bergen, B., Hülsemann, F., Rüde, U.
Is 1.7 x 1010 Unknowns the Largest Finite
Element System that can be solved today?
Proceedings of the ACM/IEEE Proceedings
of SC, Seattle, Nov. 12-18, 2005,
ISBN 1-59593-061-2
[3] Hager, G., Bergen, B.,
Lammers, P., Wellein, G.
Taming the Bandwidth Behemoth. First
Experiences on a Large SGI Altix System.
In: inSiDE, 3(2):24, 2005
[4] Bergen, B., Wellein, G.,
Hülsemann, F., Rüde, U.
Hierarchical Hybrid Grids – Achieving
TERAFLOP Performance on Large Scale
Finite Element Simulation, International
Journal of Parallel, Emergent and Distributed
Systems, 22(4):311-329, 2007
[5] Adams, M. F., Bayraktar, H. H.,
Keaveny, T. M., Papadopoulos, P.
Ultrascalable implicit finite element analyses
in solid mechanics with over a half a billion
degrees of freedom, ACM/IEEE Proceedings
of SC 2004: High Performance Networking
and Computing, 2004
• Tobias Gradl
• Ulrich Rüde
Chair for
System
Simulation,
University
Erlangen-
Nuremberg
top |
 |