The new Petaflop Class System at LRZ
In the last edition of Inside, an outlook was given on SuperMUC, the successor of the Altix 4700 at LRZ. Now the time has come to reveal more details of what users of this 3-Petaflop supercomputer can expect when running their applications on the new system. This article provides a more in-depth description of the architectural characteristics of SuperMUC.
SuperMUC is the first multi-petaflop system in the ambitious project called PetaGCS. Its initial operation is scheduled for June 2012, and it will allow LRZ to make a significant contribution to the Tier-0 level of the German Gauss Centre for Supercomputing (GCS) and the European PRACE Research Infrastructure. The new system, which will be delivered by IBM, pursues two design goals. First, it is explicitly designed to be as "general purpose" as possible, so that programs developed and built on a desktop system can run with minimal porting effort. Second, high energy efficiency is achieved through innovative warm water cooling of the compute nodes.
The new system consists of 18 thin node islands and one fat node island, which are connected with InfiniBand technology from Mellanox. The system architecture of SuperMUC is illustrated in Figure 1, and its main characteristics are given in Table 1.
The fat node island was already delivered in 2011 and is operated as a migration system; it has been described in a previous Inside article. After the thin node islands have been made fully operational, this migration system will be integrated into SuperMUC as its 19th island; it will be available for jobs that need the large shared memory capability of 256 GBytes per node.
The bulk of SuperMUC's performance is contained in the thin node islands; each of these consists of 518 nodes with two eight-core Intel Sandy Bridge-EP processors and 32 GBytes of main memory. 512 nodes will be used as compute nodes, delivering a core count of 8,192 cores per island. Four spare nodes allow this core count to be retained even when nodes fail, although they are generally available for computational purposes as well. The two remaining nodes are used as installation and management servers. All nodes are water-cooled down to the mainboard and the CPU itself.
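The node and core counts quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch; the constant and function names are illustrative and do not come from any LRZ tool.

```python
# Island layout as described in the text (thin node islands only).
THIN_ISLANDS = 18
NODES_PER_ISLAND = 518
COMPUTE_NODES = 512
SPARE_NODES = 4
MANAGEMENT_NODES = 2
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 8


def cores_per_island() -> int:
    """Cores provided by the compute nodes of one thin node island."""
    return COMPUTE_NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET


def thin_compute_cores() -> int:
    """Cores provided by the compute nodes of all thin node islands."""
    return THIN_ISLANDS * cores_per_island()


if __name__ == "__main__":
    print("cores per island:", cores_per_island())
    print("cores in all thin node islands:", thin_compute_cores())
```

The three node roles add up to the 518 nodes per island, and the per-island result reproduces the 8,192 cores stated in the text.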
All thin compute nodes in an individual island are connected via a fully non-blocking Mellanox InfiniBand FDR10 network. The large-scale island architecture has been designed to implement a graceful hierarchical degradation of relative bisection bandwidth: connectivity between islands is pruned by a factor of 4 in bandwidth, while retaining a fat tree topology. This enables a single application to use the whole system, provided its bandwidth requirements do not cause scaling failure. However, we expect that the majority of applications will be sized to use one or at most a few islands; even this is still considered a challenge for many applications.
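To get a feeling for when a job leaves the fully non-blocking part of the network, the island size and the pruning factor can be combined in a small helper. This is a rough sketch that assumes jobs are packed densely into as few islands as possible; the actual placement policy of the batch system is not described here, and the function names are hypothetical.

```python
CORES_PER_ISLAND = 8192  # 512 compute nodes x 16 cores, from the text
THIN_ISLANDS = 18
PRUNING_FACTOR = 4       # inter-island bandwidth reduction, from the text


def islands_needed(cores: int) -> int:
    """Smallest number of thin node islands that can hold `cores` cores."""
    if cores < 1 or cores > CORES_PER_ISLAND * THIN_ISLANDS:
        raise ValueError("core count outside the thin node partition")
    return -(-cores // CORES_PER_ISLAND)  # ceiling division


def relative_bisection_bandwidth(cores: int) -> float:
    """1.0 inside one island, 1/PRUNING_FACTOR once the job spans islands."""
    return 1.0 if islands_needed(cores) == 1 else 1.0 / PRUNING_FACTOR
```

A job with up to 8,192 cores stays within one island under this assumption; any larger job has to cross the pruned inter-island links for part of its communication.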
The Compute Node
The basic building block is a shared memory system with two 8-core Sandy Bridge-EP sockets; this recently released new CPU generation from Intel, the successor to the Nehalem and Westmere platforms, has a number of interesting characteristics that make it well suited for scientific large-scale processing tasks. The chipset it is integrated with has been given the name "Romley" and is illustrated in the following figure.
First, for very floating point intensive tasks, Sandy Bridge's new 256 bit wide Advanced Vector Extensions (AVX) allow up to 8 floating point operations with 64 bit precision to be performed per clock cycle on each core; this can nearly double the floating point performance compared to previous Intel and AMD processors. The following table illustrates this by comparing LINPACK performance numbers using 16 MPI tasks on as many cores as well as theoretical peak performance:
Figure 2: Water cooled SuperMUC node board
A problem size of N=62400 was chosen for all architectures except the AMD node, where N=42240 was used due to constraints of the node's memory. On the Sandy Bridge processor, the sequential Intel MKL library was linked into the LINPACK executable; it automatically dispatches matrix multiplication (DGEMM) calls to an AVX-based implementation. On the AMD Magny-Cours CPU, the GotoBLAS implementation was used for optimal performance. It is expected that only those parts of an application that are able to operate on data in the L1 or L2 caches most of the time will profit from AVX vectorization on Sandy Bridge processors; the Intel compilers support such vectorization via compiler switches and code directives.
Table 2: Peak and LINPACK performance
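Theoretical peak performance in such a comparison is the product of core count, floating point operations per cycle, and clock frequency. The sketch below applies this to a thin node, taking the 8 double precision operations as a per-core, per-cycle figure. The clock frequency is not stated in this article, so `ASSUMED_CLOCK_GHZ` is an illustrative assumption, and the function names are hypothetical.

```python
FLOPS_PER_CYCLE = 8   # double precision AVX operations per core and cycle
CORES_PER_SOCKET = 8
SOCKETS_PER_NODE = 2
COMPUTE_NODES_PER_ISLAND = 512
THIN_ISLANDS = 18

ASSUMED_CLOCK_GHZ = 2.7  # assumption for illustration, not from the article


def node_peak_gflops(clock_ghz: float) -> float:
    """Theoretical double precision peak of one thin node in GFlop/s."""
    cores = SOCKETS_PER_NODE * CORES_PER_SOCKET
    return cores * FLOPS_PER_CYCLE * clock_ghz


def thin_partition_peak_pflops(clock_ghz: float) -> float:
    """Theoretical peak of all thin compute nodes in PFlop/s."""
    nodes = THIN_ISLANDS * COMPUTE_NODES_PER_ISLAND
    return nodes * node_peak_gflops(clock_ghz) / 1.0e6


if __name__ == "__main__":
    print(node_peak_gflops(ASSUMED_CLOCK_GHZ))
    print(thin_partition_peak_pflops(ASSUMED_CLOCK_GHZ))
```

With the assumed 2.7 GHz this yields about 345.6 GFlop/s per node and about 3.19 PFlop/s for the thin node islands, which is in line with the "3-Petaflop" figure quoted in the introduction.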
Second, the memory subsystem of Sandy Bridge systems has been considerably enhanced compared to that of their predecessors. This is illustrated below by comparing the performance of the linked double precision vector triad as a function of the vector length for the various architectures mentioned above. The following figure shows these values as performance per core when a complete node is filled with as many tasks as there are cores in the system.
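For readers unfamiliar with the benchmark, the vector triad kernel and its working set can be sketched as follows. This is an illustration of the kernel only and not the benchmark code used for the measurements; an interpreted Python loop says nothing about hardware performance, and the helper names are hypothetical.

```python
from array import array

BYTES_PER_DOUBLE = 8
ARRAYS_IN_TRIAD = 4  # a, b, c and d


def triad(a, b, c, d) -> None:
    """One sweep of the vector triad: a[i] = b[i] + c[i] * d[i]."""
    for i in range(len(a)):
        a[i] = b[i] + c[i] * d[i]


def working_set_bytes(n: int) -> int:
    """Memory touched by one sweep over vectors of length n."""
    return ARRAYS_IN_TRIAD * BYTES_PER_DOUBLE * n


def make_vectors(n: int):
    """Four double precision vectors with simple constant contents."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    d = array("d", [3.0]) * n
    return a, b, c, d
```

The working set grows by 32 bytes per element, so the vector length decides whether the sweep runs out of a cache level or out of main memory; this is why performance is plotted as a function of the vector length.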
For in-cache as well as memory performance, the Sandy Bridge node shows a significant advantage over all other architectures displayed here. Converting the performance for vector lengths >250,000 to achievable per-socket bandwidths, one obtains 19.5 GB/s for AMD Magny-Cours, 21.0 GB/s for Intel Westmere-EX and 27.4 GB/s (of a theoretical maximum of 51.2 GB/s) for Intel Sandy Bridge, assuming three loads and one store. In all cases even higher bandwidths can be obtained by using streaming stores (e.g., via a compiler switch), but this feature must be used with care because in-cache performance is then much worse.
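The conversion from per-core triad performance to per-socket bandwidth follows from the traffic per loop iteration: two floating point operations move four 8-byte words (three loads and one store). A minimal sketch of this conversion and its inverse is given below; the function names are hypothetical, and GB is taken as 10^9 bytes.

```python
BYTES_PER_DOUBLE = 8
FLOPS_PER_ITERATION = 2  # one multiply and one add
WORDS_PER_ITERATION = 4  # three loads and one store


def socket_bandwidth_gb_s(mflops_per_core: float, cores_per_socket: int) -> float:
    """Per-socket memory bandwidth implied by a per-core triad rate."""
    iterations_per_s = mflops_per_core * 1.0e6 / FLOPS_PER_ITERATION
    bytes_per_s = iterations_per_s * WORDS_PER_ITERATION * BYTES_PER_DOUBLE
    return bytes_per_s * cores_per_socket / 1.0e9


def mflops_per_core(bandwidth_gb_s: float, cores_per_socket: int) -> float:
    """Inverse conversion: per-core triad rate implied by a socket bandwidth."""
    bytes_per_core_s = bandwidth_gb_s * 1.0e9 / cores_per_socket
    iterations_per_s = bytes_per_core_s / (WORDS_PER_ITERATION * BYTES_PER_DOUBLE)
    return iterations_per_s * FLOPS_PER_ITERATION / 1.0e6
```

Inverting the conversion for the 27.4 GB/s quoted for Sandy Bridge with its 8 cores per socket gives roughly 214 MFlop/s per core; this is a derived value, not a measured one. The 27.4 GB/s correspond to about 54% of the 51.2 GB/s theoretical maximum.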
Figure 3: Comparing vector triad performance
Figure 4: Performance degradation due to loss of spatial locality
While the above discussion gives an impression of the performance deterioration that occurs if temporal locality is lost, many scientific simulations use algorithms that at least partially require access to non-contiguous data and therefore incur a loss of spatial locality. The performance impact of this is illustrated by Figure 4; it shows the performance degradation for two different vector lengths (N=805 and N=1,000,000) corresponding to L1 cache and main memory accesses, respectively.
It is interesting to note that for non-contiguous access to main memory the Sandy Bridge processor loses its performance advantage over AMD's Magny-Cours for strides larger than 8, even though its cache line length is 64 bytes, corresponding to 8 double precision words, so that no further performance decrease should be expected for strides larger than 8. For all cache-based accesses, however, Sandy Bridge maintains a significant performance advantage over the other processors discussed here.
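The expectation that nothing should get worse beyond a stride of 8 follows from a simple cache line model: every access loads a full 64-byte line, and once the stride reaches 8 words only one word per line is used. The sketch below encodes this idealized model; it ignores hardware prefetching, TLB effects and alignment, and the function name is hypothetical.

```python
CACHE_LINE_BYTES = 64
BYTES_PER_DOUBLE = 8
WORDS_PER_LINE = CACHE_LINE_BYTES // BYTES_PER_DOUBLE  # 8 doubles per line


def useful_fraction(stride: int) -> float:
    """Fraction of each loaded cache line that is actually used.

    `stride` is measured in double precision words; 1 means contiguous access.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return 1.0 / min(stride, WORDS_PER_LINE)
```

In this model the useful fraction falls from 1 for contiguous access to 1/8 at a stride of 8 and stays there for larger strides, so any further slowdown observed in Figure 4 must stem from effects outside the model.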
A further feature of Sandy Bridge (although one already available on its predecessor Westmere) is a high-bandwidth ring interconnect between the cores, in particular their last level caches, which improves scalability for applications running in shared memory. This feature, together with the QPI (QuickPath Interconnect) technology that additionally allows cores on one socket to efficiently access the memory attached to another socket, reduces the well-known NUMA-related performance problems for many applications.
The Mellanox InfiniBand
The FDR10-based interconnect for the thin node islands will provide a higher per-link bandwidth than the QDR technology in the fat node island; together with the smaller core count of a thin node this will ensure a better overall balance of the thin node system, and should therefore also improve scalability for large parallel applications. It is expected that point-to-point message transfers of sufficient size will exceed a rate of 10 GBytes/s (double the QDR rate available on the migration system), and that a reduction of a single word across 65,536 MPI tasks (4,096 nodes) will complete within 26 µs.
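To put the reduction figure into perspective, one can estimate the time budget per communication stage. The sketch below assumes that the reduction is organized as a binary tree over all tasks. This is an assumption made for illustration only: the algorithm actually used by the MPI library is not described here, and stages inside a node are cheaper than stages across the network. The function names are hypothetical.

```python
CORES_PER_NODE = 16  # two 8-core sockets per thin node


def tasks_to_nodes(tasks: int) -> int:
    """Nodes needed when every core runs exactly one MPI task."""
    return -(-tasks // CORES_PER_NODE)  # ceiling division


def tree_stages(tasks: int) -> int:
    """Depth of a binary reduction tree over `tasks` participants."""
    if tasks < 2:
        raise ValueError("a reduction needs at least two tasks")
    return (tasks - 1).bit_length()  # ceil(log2(tasks)) for integers


def time_per_stage_us(total_us: float, tasks: int) -> float:
    """Average time budget per tree stage in microseconds."""
    return total_us / tree_stages(tasks)
```

Under the binary tree assumption, 65,536 tasks require 16 stages, which leaves an average of roughly 1.6 µs per stage within the expected 26 µs.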
Large-scale Data Handling
High-capacity and high-throughput disk storage is becoming ever more relevant for today's large-scale computing environments, since the enormous amounts of data produced must be stored and processed efficiently. 10 PBytes of disk space operated with IBM's General Parallel File System (GPFS) are available on SuperMUC for storing result files, with an aggregated I/O bandwidth of up to 200 GB/s; GPFS supports additional tuning for I/O intensive programs, especially if MPI-IO or specific I/O libraries are used. The users' HOME directories will be based on highly available and reliable NAS storage from NetApp, which provides a high level of safety for valuable data (e.g., program sources). For the NAS-based areas, snapshots and mirroring protect against disk crashes and against users unintentionally destroying their data. Result files residing on GPFS can be archived into the LRZ tape archive, a second copy of which will be kept at a remote site; this allows a similar level of data safety, although with the responsibility shifted to the owner of the data.
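The quoted capacity and bandwidth figures allow a quick lower-bound estimate of how long large data transfers take. The sketch below treats the "up to 200 GB/s" as sustained and uses decimal units (1 PByte = 10^6 GB); both are simplifying assumptions, so real transfers will take longer. The function names are hypothetical.

```python
GPFS_BANDWIDTH_GB_S = 200.0  # aggregated I/O bandwidth, upper bound from the text
GPFS_CAPACITY_GB = 10.0e6    # 10 PBytes, assuming decimal prefixes

THIN_ISLANDS = 18
COMPUTE_NODES_PER_ISLAND = 512
MEMORY_PER_NODE_GB = 32


def min_transfer_time_s(data_gb: float,
                        bandwidth_gb_s: float = GPFS_BANDWIDTH_GB_S) -> float:
    """Lower bound on the transfer time at full aggregated bandwidth."""
    return data_gb / bandwidth_gb_s


def thin_node_memory_gb() -> int:
    """Total main memory of all thin compute nodes."""
    return THIN_ISLANDS * COMPUTE_NODES_PER_ISLAND * MEMORY_PER_NODE_GB


if __name__ == "__main__":
    print(min_transfer_time_s(thin_node_memory_gb()))
    print(min_transfer_time_s(GPFS_CAPACITY_GB))
```

Under these assumptions, writing the complete main memory of the thin compute nodes to disk takes at least about 25 minutes, and filling the whole file system takes at least about 14 hours.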
Great effort is being invested to make the operation and usage of the system as energy-efficient as possible. Firstly, high-temperature water cooling on the system board level will be used to reduce the total energy consumption by up to 30%. Secondly, Intel's processor technology allows running the CPUs at different frequencies, thereby controlling energy consumption on the system level; in conjunction with monitoring and scheduling software techniques, the intention is to have users' jobs tuned by the system to run at the lowest possible frequency that has no significant impact on job performance.
• Matthias Brehm
• Reinhold Bader