Innovatives Supercomputing in Deutschland
inSiDE • Vol. 6 No. 2 • Autumn 2008

The QPACE Project: QCD parallel Computing on the Cell Broadband Engine

The building blocks of the atomic nucleus, protons and neutrons, have been known since the 1950s not to be elementary. Rather, they are composed of even smaller building blocks: quarks and gluons. The theory describing the interactions between quarks and gluons is called Quantum Chromodynamics (QCD), which is well established both experimentally and theoretically. However, some of the most important properties of the theory can only be studied by numerical simulations, using a discretized formulation of the theory known as Lattice Quantum Chromodynamics (LQCD). While this formulation renders numerical simulations possible, these require a huge amount of computational resources. To carry out such calculations, highly scalable massively parallel computers providing hundreds of TFlops of computing power are required. Scientific progress in this field is limited by the availability of suitable computing resources.

The QPACE project addresses this issue: its goal is to design and build a novel cost-efficient capability computer that is optimized for LQCD applications. This research area has a long tradition of developing such computers (see, e.g., [1,2]). Previous projects were based on system-on-a-chip designs, but due to the rising costs of custom ASICs the QPACE project pursues a different strategy: a powerful commercial multi-core processor is tightly coupled to a custom-designed network processor. The latter is implemented using a modern Field Programmable Gate Array (FPGA), which has several distinct advantages over a custom ASIC: shorter development time, lower cost and risk, and the possibility to modify the hardware design of the network processor even after the machine has been deployed.

Figure 1: QPACE node-card with a PowerXCell 8i processor, 4 GBytes of main memory, an FPGA and 6 high-speed network transceivers

The development of QPACE is a common effort of several academic institutions together with the IBM Research and Development Lab in Böblingen (Germany). The academic partners include the Universities of Regensburg and Wuppertal as well as the research labs DESY and Jülich and the Universities of Ferrara and Milano. The project is mainly funded by the Deutsche Forschungsgemeinschaft (DFG) in the framework of SFB/TR-55 and by IBM. First prototype hardware is already available, and testing of the final hardware configuration is expected to be completed at the end of 2008. In early 2009 we plan to start the manufacturing of several large machines with an aggregate peak performance of 200 TFlops (double precision). The ambitious goal of the project is to make these machines available for research in lattice QCD by the middle of 2009.

The QPACE Architecture

The building block of QPACE is a node-card based on IBM's PowerXCell 8i processor and a Xilinx Virtex-5 FPGA (see Figure 1). The PowerXCell 8i is the second implementation of the Cell Broadband Engine Architecture [3] and is very similar to the Cell processor used in Sony's PlayStation 3. The main reason for using this enhanced Cell processor is its support for high-performance double precision operations with IEEE-compliant rounding. The Cell processor contains one PowerPC Processor Element (PPE) and 8 Synergistic Processor Elements (SPEs). Each of the SPEs runs a single thread and has its own 256 kBytes of on-chip memory (local store, LS), which is accessible by direct memory access (DMA) or by local load/store operations to/from 128 general-purpose 128-bit registers. An SPE in the PowerXCell 8i processor can execute two instructions per cycle, performing up to 8 single precision (SP) or 4 double precision (DP) floating point (FP) operations. Thus, the total SP or DP peak performance of all 8 SPEs of a single processor is 204.8 GFlops or 102.4 GFlops, respectively (at a clock speed of 3.2 GHz).
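For reference, these peak numbers follow directly from the per-cycle issue rates and the clock frequency. The short C sketch below merely reproduces the arithmetic; it is illustrative only and does not use the Cell SDK.

```c
#include <stdio.h>

/* Peak-performance sanity check for the PowerXCell 8i figures quoted
 * above: 8 SPEs at 3.2 GHz, each completing up to 8 SP or 4 DP
 * floating point operations per cycle.                               */
int main(void)
{
    const double clock_ghz   = 3.2;  /* SPE clock frequency            */
    const int    num_spes    = 8;    /* SPEs per PowerXCell 8i         */
    const int    sp_flops_pc = 8;    /* SP flops per SPE per cycle     */
    const int    dp_flops_pc = 4;    /* DP flops per SPE per cycle     */

    printf("SP peak: %.1f GFlops\n", clock_ghz * num_spes * sp_flops_pc); /* 204.8 */
    printf("DP peak: %.1f GFlops\n", clock_ghz * num_spes * dp_flops_pc); /* 102.4 */
    return 0;
}
```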

The processor has an on-chip memory controller supporting a memory bandwidth of 25.6 GB/s and a configurable I/O interface (Rambus FlexIO) supporting a coherent as well as a non-coherent protocol with a total bidirectional bandwidth of up to 25.6 GB/s. Internally, all units of the processor are connected to the coherent element interconnect bus (EIB) by DMA controllers.

In QPACE the I/O interface is used to interconnect the PowerXCell 8i processor with the network processor (Xilinx V5-LX110T). This is possible because of a special feature of the RocketIO transceivers in the Xilinx Virtex-5 FPGAs. We will be using 2 FlexIO links between the multi-core compute processor and the network processor, with an aggregate bandwidth of 6 GB/s per direction.

The node-cards are connected in a three-dimensional torus with nearest-neighbor connections. The physical layer of the torus network links relies on commercial standards for which well-tested and cheap communication hardware is available. This allows us to move the most timing-critical logic out of the FPGA. Specifically, we are using the 10 Gbit/s transceiver PMC Sierra PM8358 (in XAUI mode). On top of this standard physical layer we have designed a lean custom protocol optimized for low latencies. Unlike in other existing Cell-based parallel machines, in QPACE it will be possible to perform communications directly from the local store (LS) of any SPE on one processor to the LS of any SPE on one of the 6 neighboring processors. The data do not have to be routed through main memory (thus reducing the pressure on the performance-critical memory controller) or through the PowerPC Processor Element. Rather, the data are moved via the EIB directly to or from the I/O interface. The tentative goal is to keep the latency for LS-to-LS copy operations on the order of 1 µs.
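The programming interface for this communication path is not described here, so the following C sketch is purely hypothetical: it illustrates what an LS-to-LS put to a neighboring node might look like from SPE code. The names qpace_torus_put, qpace_torus_wait, and qpace_dir_t are invented for illustration and are not part of any published QPACE or Cell SDK interface; the stub bodies only make the sketch compile.

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { DIR_XP, DIR_XM, DIR_YP, DIR_YM, DIR_ZP, DIR_ZM } qpace_dir_t;

/* Hypothetical: copy 'size' bytes from this SPE's local store directly
 * into the local store of an SPE on the neighboring node in direction
 * 'dir'. The data would travel over the EIB to the I/O interface and
 * across the torus link, bypassing main memory and the PPE.            */
static int qpace_torus_put(qpace_dir_t dir, const void *ls_src,
                           uint32_t remote_ls_offset, size_t size, int tag)
{
    (void)dir; (void)ls_src; (void)remote_ls_offset; (void)size; (void)tag;
    return 0;  /* stub: a real implementation would program the network
                  processor's DMA engines                               */
}

/* Hypothetical: block until the transfer identified by 'tag' is done.  */
static int qpace_torus_wait(int tag)
{
    (void)tag;
    return 0;  /* stub */
}

/* Usage pattern: push a halo buffer to the +x neighbor and overlap the
 * transfer with computation on interior lattice sites.                 */
void send_halo(const float *halo_buf, size_t bytes)
{
    const int tag = 1;
    qpace_torus_put(DIR_XP, halo_buf, 0x0 /* remote LS offset */, bytes, tag);
    /* ... compute on interior sites while the message is in flight ... */
    qpace_torus_wait(tag);   /* target LS-to-LS latency: order of 1 us  */
}
```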

32 node-cards are mounted on a single backplane. One dimension of the three-dimensional torus network is routed completely within the backplane. The nodes on a backplane can be arranged as one 1 x 4 x 8 partition or as multiple smaller partitions. For larger partitions, several backplanes can be interconnected by cables. 8 backplanes are integrated into a single rack, hosting a total of 256 node-cards with an aggregate peak performance of 26 TFlops (DP). A system consisting of n racks can be operated as a single partition of 2n x 16 x 8 nodes. To obtain smaller partitions without re-cabling we use a special feature of the PMC Sierra PM8358, which provides a redundant link interface. An example of how this feature can be used to partition the machine is shown in Figure 2. The properties of the physical layer of the network have been investigated in detail in a test setup (see Figure 3). Figure 4 shows an eye diagram for a lane with maximum distance between transmitter and receiver.
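The rack-level figures follow directly from the node-card numbers quoted above; the short C sketch below reproduces the arithmetic, including the 2n x 16 x 8 partition size of an n-rack system.

```c
#include <stdio.h>

/* Rack-level figures derived from the node-card numbers quoted above. */
int main(void)
{
    const int    cards_per_backplane = 32;
    const int    backplanes_per_rack = 8;
    const double dp_peak_per_card    = 102.4;   /* GFlops, DP peak */

    int cards_per_rack = cards_per_backplane * backplanes_per_rack;    /* 256  */
    double rack_peak_tflops = cards_per_rack * dp_peak_per_card / 1000.0;

    printf("node-cards per rack : %d\n", cards_per_rack);              /* 256  */
    printf("DP peak per rack    : %.1f TFlops\n", rack_peak_tflops);   /* 26.2 */

    /* An n-rack system forms a single (2n x 16 x 8) torus partition:  */
    for (int n = 1; n <= 4; n++)
        printf("%d rack(s): %2d x 16 x 8 = %4d nodes\n",
               n, 2 * n, 2 * n * 16 * 8);
    return 0;
}
```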

Figure 2: Using redundant links, 8 node-cards can be connected periodically as 1 x 8 or 2 x 4 node-cards.

On each backplane there are 2 root-cards, each of which manages and controls 16 node-cards (e.g., when booting the machine). Each root-card hosts a small Freescale MCF5271 microprocessor running uClinux [4]. The microprocessor can be accessed via Ethernet, and from it one can connect to various devices on the node-cards via serial links (e.g., UART).

The root-cards are also part of a global signal tree network. Via this network, signals and interrupts can be sent by any of the node-cards to the top of the tree. There the signals are reduced, and the result is propagated to all node-cards of a given partition. On each node-card the network processor is also connected to a Gbit-Ethernet transceiver. The Ethernet ports of all node-cards can be connected to standard Ethernet switches that are integrated in the QPACE rack. Depending on the I/O requirements, the Ethernet bandwidth between a QPACE rack and a front-end system can be adjusted by changing the bandwidth of the uplinks of the switches.

Each node-card consumes up to 130 Watts. To remove the generated heat a cost-efficient liquid cooling system is being developed, which enables us to reach high packaging densities. The power consumption of a single QPACE rack is about 35 kWatts. This translates into a power efficiency of about 1.5 Watts/GFlops (DP, peak).

Application Software and Performance

During an early phase of the project a performance analysis was carried out based on simple models which typically take only the bandwidth and throughput parameters of the hardware into account [5]. The overall performance of LQCD applications depends strongly on how efficiently one basic operation can be implemented, namely the product of a large but sparse matrix (the so-called lattice Dirac operator) and a vector (a quark field). For one particular version of this matrix and realistic parameters, we estimated a theoretical efficiency of about 30%. The main restrictions come from the performance of the memory controller. This estimate assumes a sophisticated strategy for reading data from and writing results back to main memory, such that external memory accesses are minimized. A real implementation of this application kernel has demonstrated that an efficiency of 25% can be achieved on a single processor [6].
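To make the origin of such an estimate more tangible, the following sketch sets up a simple bandwidth-balance model. The per-site figures (about 1320 floating point operations for a Wilson-type Dirac operator and, assuming ideal data reuse inside the local stores, roughly one input spinor, four gauge links and one output spinor of external memory traffic per site) are standard textbook numbers chosen for illustration; they are not the exact parameters used in [5].

```c
#include <stdio.h>

/* Illustrative bandwidth-balance model for a Wilson-type Dirac operator
 * on one PowerXCell 8i. The per-site figures are standard textbook
 * numbers under an idealized reuse assumption, NOT the parameters of [5]. */
int main(void)
{
    /* machine parameters quoted in the text */
    const double dp_peak_gflops = 102.4;   /* DP peak of 8 SPEs           */
    const double mem_bw_gbs     = 25.6;    /* main-memory bandwidth       */

    /* assumed application parameters (double precision)                  */
    const double flops_per_site = 1320.0;  /* Wilson Dirac operator, per site */
    const double bytes_per_site =          /* ideal reuse inside the LS:  */
          192.0      /* one input spinor  (24 doubles)                    */
        + 4 * 144.0  /* four gauge links  (18 doubles each)               */
        + 192.0;     /* one output spinor (24 doubles)                    */

    double app_intensity   = flops_per_site / bytes_per_site;  /* ~1.38 flops/byte */
    double machine_balance = dp_peak_gflops / mem_bw_gbs;      /*  4.0  flops/byte */
    double efficiency = app_intensity / machine_balance;       /* ~0.34             */
    if (efficiency > 1.0) efficiency = 1.0;

    printf("arithmetic intensity : %.2f flops/byte\n", app_intensity);
    printf("machine balance      : %.2f flops/byte\n", machine_balance);
    printf("memory-bound efficiency estimate: %.0f %%\n", 100.0 * efficiency);
    return 0;
}
```

With these assumptions the model yields roughly 34%, i.e., of the same order as the quoted 30% estimate, and it makes explicit why the memory controller is the limiting resource.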

Figure 3: QPACE backplane together with 2 prototype node-cards. The copper plate in the middle is liquid cooled. In the final system the node-cards will be mounted in a housing which is directly attached to this cold plate. Also visible are the (blue) communication cables.

Figure 4: Eye diagram measured for a signal which is routed through about 50 cm of board material and 50 cm of cable; in the case of QPACE this is the maximum possible distance. During this measurement the link was running at 3.125 GHz.

To relieve the programmer of the burden of porting efforts, we apply two strategies. For a number of particularly performance-relevant kernel routines we will provide highly optimized implementations which can be accessed through library calls. To facilitate the implementation of the remaining parts of the code, we plan to port or implement software layers that hide the hardware details.
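As a purely illustrative sketch of the first strategy, a hand-optimized kernel could be hidden behind a plain C interface so that application code never touches SPE, DMA, or network details. All names below (lqcd_*, spinor_t, gauge_t) are hypothetical and not part of any actual QPACE library.

```c
/* Hypothetical library interface hiding the hardware-specific kernels.
 * The names are invented for illustration only.                        */

typedef struct spinor_field  spinor_t;   /* quark field on the lattice   */
typedef struct gauge_field   gauge_t;    /* gluon (link) field           */

/* Apply the lattice Dirac operator: out = D[U] * in.
 * An optimized implementation would distribute the lattice over the
 * SPEs, stream data through the local stores and use the torus network
 * for the halo exchange, invisibly to the caller.                       */
void lqcd_apply_dirac(spinor_t *out, const gauge_t *U, const spinor_t *in);

/* Global reduction, e.g. the norm needed by an iterative solver.        */
double lqcd_global_norm2(const spinor_t *x);
```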

Summary

QPACE is a next-generation massively parallel computer optimized for LQCD applications. It leverages the power of modern multi-core processors by tightly coupling them within a custom high-bandwidth, low-latency network. The system is not only optimized with respect to procurement costs vs. performance but also in terms of power consumption, i.e., operating costs. The machines that will become available in 2009 will significantly increase the computing resources available for LQCD calculations in Germany.

Acknowledgements

We would like to thank all members of the QPACE development team at the academic sites and at the IBM labs in Böblingen, La Gaude and Rochester for their hard work making this endeavor possible. We also acknowledge the following companies which contribute to the project by various means: Eurotech (Italy), Knürr (Germany), Rambus (US), Xilinx (US), Zollner (Germany).

References

[1] Belletti, F. et al.
Computing for LQCD: apeNEXT, Computing in Science & Engineering, 8, p. 18, 2006

[2] Boyle, P.A. et al.
Overview of the QCDSP and QCDOC computers, IBM Journal of Research and Development, 49, 351, 2005

[3] IBM
Cell Broadband Engine Architecture, Version 1.0, 2005

[4] uClinux
http://www.uclinux.org

[5] Goldrian, G. et al.
QPACE: QCD parallel Computing on the Cell Broadband Engine, Computing in Science and Engineering, 2008 (accepted for publication)

[6] Nobile, A.
PhD thesis, Milano University, 2008

• Dirk Pleiter1
• Tilo Wettig2

DESY, Zeuthen Site1
University of Regensburg, Department of Physics2

