Innovatives Supercomputing in Deutschland
inSiDE • Vol. 11 No. 1 • Spring 2013

NVIDIA Application Lab at Jülich

Accelerating scientific HPC applications using GPUs has become popular, and for many applications it has indeed proved to be a very successful approach. Nevertheless, there is still a lot to learn about algorithms and methodologies for porting applications. Enabling more scientific applications for GPU-based architectures is a core goal of the NVIDIA Application Lab at Jülich. The lab has been jointly operated by the Jülich Supercomputing Centre (JSC) and NVIDIA since July 2012. For JSC it is a further step towards strengthening links to vendors that are actively pursuing an exascale strategy.

During its first half year of operation, the main focus of the lab was to establish a broad application portfolio encompassing computational neuroscience, high-energy physics, radio astronomy, data analytics and other fields. What these applications have in common is a high level of parallelism, ideally with few dependencies between tasks or task groups. One example is an application from the JuBrain project developed at the Jülich Institute of Neuroscience and Medicine (INM-1) [1]. The project will result in an accurate, highly detailed computer model of the human brain. This atlas is created by reconstructing fibre tracts from pictures of thousands of brain slices (see Fig. 1). The process of mapping the pictures, called registration, requires repeated computation of a metric that measures how well the pixels of two pictures map to each other, a computationally expensive process that maps well to the GPU.
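As a minimal sketch of this kind of similarity metric, the following CPU-only NumPy example computes a sum-of-squared-differences score between two image tiles. The metric choice, function name, and tile size are our assumptions for illustration, not the actual JuBrain registration code; the point is that each pixel contributes independently, which is why such metrics map well onto the many lightweight threads of a GPU.

```python
import numpy as np

def ssd_metric(fixed, moving):
    """Sum of squared per-pixel differences between two aligned tiles.

    Every pixel is an independent contribution, so on a GPU each
    thread can handle one pixel and a reduction combines the partial
    sums.
    """
    diff = fixed.astype(np.float64) - moving.astype(np.float64)
    return float(np.sum(diff * diff))

rng = np.random.default_rng(0)
img = rng.random((512, 512))

print(ssd_metric(img, img))                          # identical tiles: 0.0
print(ssd_metric(img, np.roll(img, 1, axis=0)) > 0)  # misaligned tiles: True
```

During registration such a metric is evaluated over and over while a transformation of the "moving" image is optimized, which multiplies the arithmetic cost and makes offloading worthwhile.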

In applications from experimental physics, a natural data decomposition may follow how the data leaves the detector. For instance, in high-energy physics experiments data is generated at such high rates that data for different time slices has to be distributed to different processing devices. Another example is the search for pulsars in radio astronomy, where data sets from different beams and different measurements have to be processed repeatedly, resulting in thousands or even millions of Fast Fourier Transforms.
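The resulting workload pattern can be sketched with NumPy's batched FFT. The array sizes below are hypothetical and far smaller than any real survey; on a GPU, the same pattern would typically be expressed through a batched cuFFT plan so that all independent time series are transformed in one call.

```python
import numpy as np

# Hypothetical survey dimensions: independent time series for each
# (beam, measurement) pair, each with n_samples points.
n_beams, n_meas, n_samples = 4, 8, 1024
rng = np.random.default_rng(1)
data = rng.standard_normal((n_beams, n_meas, n_samples))

# One batched call transforms every series along the last axis at once.
spectra = np.fft.rfft(data, axis=-1)
power = np.abs(spectra) ** 2   # power spectra, searched for periodic signals

print(power.shape)  # (4, 8, 513)
```

Because every series is independent, the batch dimension supplies exactly the kind of massive, dependency-free parallelism the GPU needs to stay busy.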

Large quantities of data which need further processing are created not only by experiments and observatories, but increasingly also by "computational experiments" such as Monte Carlo simulations of protein folding. Here clustering, a standard method of data analytics, is applied to identify regions of similar objects in multi-dimensional data sets. At the lab we have shown that subspace clustering algorithms can be implemented very efficiently on GPUs, opening the path to the analysis of high-dimensional data sets in other areas as well [2].
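The first step of grid-based subspace clustering (the CLIQUE/MAFIA family of algorithms) can be sketched as follows. This is a simplified CPU illustration with hypothetical bin counts and thresholds, not the GPUMAFIA implementation of [2]: each dimension is histogrammed independently and "dense units", bins holding more than a threshold fraction of the points, become the seeds from which higher-dimensional candidate subspaces are grown. Histogramming is embarrassingly parallel, which is what makes this step GPU-friendly.

```python
import numpy as np

def dense_units_per_dim(points, n_bins=10, density_threshold=0.2):
    """Return, per dimension, the indices of bins whose population
    exceeds density_threshold * n_points (the 1-D 'dense units')."""
    n, d = points.shape
    dense = []
    for dim in range(d):
        counts, _edges = np.histogram(points[:, dim], bins=n_bins)
        dense.append([i for i, c in enumerate(counts)
                      if c > density_threshold * n])
    return dense

rng = np.random.default_rng(2)
# A cluster hidden in dimension 0; dimension 1 is uniform noise.
points = np.column_stack([rng.normal(0.5, 0.02, 500),
                          rng.random(500)])
print(dense_units_per_dim(points))
```

Dimension 0 yields a few dense bins around the cluster, while the uniform dimension yields none, so the noise dimension is pruned away early; the adaptive-grid refinement that distinguishes MAFIA from CLIQUE is omitted here.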

For many applications, using a single GPU is not sufficient, either because more computing power is required or because the problem is too large to fit into the memory of a single device. This forces application developers not only to consider parallelization at the device level, but also to manage an additional level of parallelism. Depending on the application this may be challenging, since on most architectures device-to-device communication through the network is implemented as a split transaction via the host processor. Software solutions like CUDA-aware MPI implementations can help mitigate this problem, but ultimately better hardware support is needed to interconnect GPU and network devices.
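The split transaction can be modelled as follows; this is a plain-Python sketch with hypothetical helper names in which separate arrays stand in for GPU memories and the network, not real MPI or CUDA code. It shows the two extra copies that bracket every network transfer when the host must act as intermediary.

```python
import numpy as np

def staged_send(device_buf):
    """Device -> host copy (models a PCIe transfer); the host buffer
    is what an ordinary MPI_Send would then hand to the network."""
    return device_buf.copy()

def staged_recv(wire_buf, device_buf):
    """An ordinary MPI_Recv lands in host memory; a second PCIe copy
    moves the data on to the destination device."""
    device_buf[:] = wire_buf

src_device = np.arange(8, dtype=np.float32)   # buffer "on" GPU 0
dst_device = np.zeros(8, dtype=np.float32)    # buffer "on" GPU 1

wire = staged_send(src_device)    # copy 1: device -> host
staged_recv(wire, dst_device)     # copy 2: host -> device

print(np.array_equal(src_device, dst_device))  # True
```

A CUDA-aware MPI hides this staging behind the MPI interface and can pipeline the copies; hardware support along the lines of NVIDIA's GPUDirect technology aims to let the network adapter access GPU memory directly, removing the host from the data path altogether.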

Figure 1: Pictures of many slices of a human brain are required to build an accurate, highly detailed computer model [1]

The goal of the lab is not solely to provide a service to application developers and to improve the performance of their applications. To improve future architectures and their usability for scientific applications, it is necessary to better understand how well these applications map onto such architectures. The introduction of a new architecture, Kepler, or more specifically the GK110 GPU, is a good opportunity to study the effects of architectural changes by comparison with the previous architecture. How can application kernels cope with a significant increase in compute performance when the bandwidth to device memory grows only moderately? Do other changes in the memory hierarchy compensate for the increased flops-per-byte ratio? First experience shows that a careful analysis of how all levels of the memory hierarchy are utilized helps an application use the new architecture as efficiently as its predecessor. A better understanding at this level helps not only application developers but also processor and systems architects to improve GPU-based architectures for scientific computations.
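A back-of-envelope calculation makes the shift concrete. The peak numbers below are assumed here for illustration, using published double-precision figures for one Fermi-class and one Kepler-class card, and should be checked against the vendor data sheets; the ratio of peak flop/s to memory bandwidth is the arithmetic intensity a kernel must exceed to be compute-bound rather than bandwidth-bound.

```python
# Assumed peak values (double precision flop/s, memory bandwidth B/s).
fermi_flops,  fermi_bw  = 665e9,  177e9   # Fermi-class card (e.g. M2090)
kepler_flops, kepler_bw = 1310e9, 250e9   # Kepler-class card (e.g. K20X)

print(f"Fermi  flops/byte: {fermi_flops / fermi_bw:.1f}")
print(f"Kepler flops/byte: {kepler_flops / kepler_bw:.1f}")
# Compute roughly doubles while bandwidth grows by ~40%, so a kernel
# must perform more arithmetic per byte loaded from device memory --
# e.g. by reusing data in registers or shared memory -- to exploit
# the new architecture as fully as the old one.
```

This is exactly why the analysis of utilization at every level of the memory hierarchy mentioned above pays off: data reuse in caches, shared memory and registers is what lifts a kernel's effective flops-per-byte ratio.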


Figure 2: Clusters identified in data obtained by Monte Carlo simulations of protein folding using MAFIA [2].



[2] A. Adinetz, J. Kraus, J. Meinke, D. Pleiter, "GPUMAFIA: Efficient Subspace Clustering with MAFIA on GPUs", submitted to Euro-Par 2013.

• Andrew Adinetz
• Jiri Kraus
• Dirk Pleiter
Jülich Supercomputing Centre
