Computational Science Technical Note CSTN-115


eResearch Directions in Processing Accelerators

K. A. Hawick, A. Leist, D. P. Playne, A. Gilman and M. J. Johnson

Archived August 2010

Extended Abstract

Changes in the price points of commodity processing devices remain a strong influence on which platforms can be used for high-performance eResearch applications. Our group is interested in advanced simulations and applications that make use of considerable processing power within semi-interactive and semi-real-time paradigms. These paradigms are important for computational science applications where models are being developed, investigated and measured quantitatively. Processing power is required for: numerical integration; random number generation; and floating-point function evaluations. In many cases interactive graphics are also used. Memory requirements tend to be aggressive too, since more memory allows simulation of larger systems for various models, which often display their interesting behaviours on logarithmic size scales, thus necessitating simulation over many system lengths.

Computer clusters have proven a very cost-effective alternative to traditional large-scale supercomputers for batch- and throughput-oriented simulations. A relatively large number of separate "runs" are made using independent, hardly communicating parallelism, and the results are post-processed semi-interactively on desktop-scale processing resources. However, recent commodity pricing of processing accelerator devices has the potential to change the economics again. Multi-core CPUs are now widespread and this trend will likely continue, driven by heat dissipation and chip fabrication issues. In the longer term we can hope for software tools and support infrastructure to help exploit very many cores at an application level. At the time of writing, however, specialist processing devices of quite diverse architecture are already available and can provide a relatively cheap hardware platform for the applications developer able to exploit them.
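The batch pattern described above, many independent runs followed by a post-processing step, can be sketched as follows. This is only an illustrative sketch: the Monte Carlo estimate of pi stands in for a real simulation, and the names estimate_pi and run_batch are hypothetical, not taken from any code of the authors.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def estimate_pi(seed, samples=20_000):
    """One independent 'run': a Monte Carlo estimate of pi from its own seed."""
    rng = random.Random(seed)  # private generator, no state shared between runs
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


def run_batch(seeds, workers=4):
    """Farm the runs out to a pool; no run communicates with any other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(estimate_pi, seeds))


if __name__ == "__main__":
    results = run_batch(range(8))
    # Post-processing step: combine the independent results.
    print(statistics.mean(results), statistics.stdev(results))
```

A thread pool is used here only to keep the sketch short and portable. In CPython, threads do not speed up CPU-bound pure-Python work, so a real throughput run would more likely use a process pool or a cluster batch scheduler; the structure of independent seeds and a final reduction would stay the same.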

Conventional multi-core CPUs, such as those from Intel and AMD, can be programmed using multi-threading and shared-memory processing techniques. They can of course be grouped into compute clusters to form a hybrid applications platform that uses message-passing technology between CPUs and threading technologies between cores on the same CPU. An alternative, explored in an interesting way by the Sony/Toshiba/IBM collaboration in the CellBE, uses a heterogeneous mixture of cores within a single chip. This can also be programmed with software threading technologies. In both these paradigms, the individual cores typically have vendor-specific vector capabilities and other mechanisms to accelerate performance with highly localised parallelism. Ideally these low-level core architectural details would be known to the compiler/optimiser, which would take care of exploiting them for the applications programmer.

Another approach is to make use of data-parallelism at a large scale, in the manner of Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). Current-generation GPUs tend to have a relatively large number (several hundred) of cores that are each capable of running very lightweight threads in a synchronised or data-parallel manner. Software frameworks such as NVIDIA's Compute Unified Device Architecture (CUDA) and the non-proprietary Open Computing Language (OpenCL) are available to program these devices. Generally this is a less flexible approach to processing acceleration, and is focused on solving specific applications problems well. An even more specific approach is that offered by FPGAs, where a high-level application must be decomposed into known library gate array implementations that are laid out on the generic gate array. At the time of writing, although there are vendor-neutral high-level programming languages for these devices, a proprietary system is needed to realise them on actual gate array devices.
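The data-parallel style used on GPUs can be illustrated with a small serial emulation: every lightweight thread runs the same kernel function, and its thread index selects the data element it works on. The sketch below is not CUDA or OpenCL code; the names saxpy_kernel and launch are hypothetical, and the loop in launch merely stands in for a device launch.

```python
def saxpy_kernel(i, a, x, y, out):
    """Work done by one lightweight thread: a single element, index i."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n_threads, *args):
    """Serial stand-in for a device launch: run the kernel once per index.

    On a real GPU these invocations would run concurrently in groups,
    so correctness must not depend on the order in which they execute.
    """
    for i in range(n_threads):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because each invocation writes only to its own output element, the iterations are independent, which is the property that lets hundreds of cores execute them in a synchronised fashion.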

Amdahl's law manifests itself in an interesting manner for multi-core hybrid processors. The weakest links in the parallelism chain can be directly attacked by specialist accelerator processing elements, even when no single parallel paradigm suits the whole application code. Scientific simulations can often benefit from Gustafson's corollary or "work-around" to scalability limitations, in that the problem size or amount of intrinsic parallelism available in a simulation problem can be adjusted to fit the parallel architecture available. The accelerator effect is in some ways better than the Gustafson "scaling work-around" to the Amdahl limitation in a monolithic parallel architectural design, since it gives the designer the flexibility to add accelerators at various points in an architecture to focus resources where needed. However, the difficulty is handling the resulting diversity of architectural approaches, and the consequent systems complexity, within a single application program.
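The scaling arguments above can be made concrete with the standard textbook forms of the two laws, plus the generalised Amdahl form in which each stage of a program is sped up by its own factor. The following is a minimal sketch; the function names and the example stage fractions are illustrative assumptions, not measurements from the work described here.

```python
def amdahl_speedup(parallel_fraction, n):
    """Fixed problem size: the serial part (1 - p) bounds the speedup."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(parallel_fraction, n):
    """Scaled problem size: the parallel part grows with the n processors.

    Here p is the parallel fraction of run time on the parallel system.
    """
    p = parallel_fraction
    return (1.0 - p) + p * n


def accelerated_speedup(stages):
    """Generalised Amdahl form for (time_fraction, speedup_factor) pairs."""
    total = sum(f for f, _ in stages)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return 1.0 / sum(f / s for f, s in stages)


if __name__ == "__main__":
    print(amdahl_speedup(0.9, 10))     # about 5.26
    print(gustafson_speedup(0.9, 10))  # about 9.1
    # Hypothetical three-stage code with one stage accelerated 50-fold.
    print(accelerated_speedup([(0.1, 1), (0.6, 50), (0.3, 1)]))
    # Adding a second accelerator to the remaining weak link.
    print(accelerated_speedup([(0.1, 1), (0.6, 50), (0.3, 10)]))
```

In the hypothetical three-stage case, accelerating only the dominant stage leaves the overall speedup below 2.5, while adding a modest accelerator for the next weakest stage raises it to about 7. This is the sense in which accelerators can be placed at several points in an architecture to focus resources where needed.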

In this article we present a high-level summary of the main architectural features of: homogeneous-core CPUs; heterogeneous-core CPUs; GPUs; and FPGAs, from an eResearch applications perspective. We provide some commentary on various applications of interest to us in the eResearch community, including: complex systems simulations; image processing; and numerical equation solvers, and on the ways in which we believe processing accelerators and hybrids will develop to serve these.

Arguably the so-called crisis in Moore's Law has raised awareness of the importance of parallel computing at the whole-computer level as well as at the processing-device level. In addition, for the foreseeable future it will likely be worthwhile, and indeed necessary, for applications programmers to learn to program parallelism explicitly at both inter-device and intra-device levels. In summary, several processing accelerator approaches are being pursued, and each has relative advantages and disadvantages. It might be expected that further hybridisation will occur in the future, with a corresponding need for more sophisticated software tools to automate the choice and linkage of specialist algorithms to the right sub-device components. In the meantime there remains a fruitful arena for algorithm experimentation and performance optimisation at the applications and systems library levels.

Keywords: accelerator; GPU; multi-core; CellBE; FPGA; parallel computing.

Full Document Text: PDF version.

Citation Information: BiBTeX database for CSTN Notes.

BiBTeX reference:

  author = {K. A. Hawick and A. Leist and D. P. Playne and A. Gilman and M. J. Johnson},
  title = {eResearch Directions in Processing Accelerators},
  booktitle = {NZ eResearch Symposium 2010 Programme},
  year = {2010},
  pages = {Online},
  address = {Dunedin, New Zealand},
  month = {26-27 October},
  timestamp = {2011.05.16},
  url = {}
