Asynchronous Task Parallelism for exascale and beyond

Lead Research Organisation: University of Bristol
Department Name: Computer Science


Supercomputers have historically seen a 1000x performance increase every 10 years, but the push to
the next generation of machines will be the most difficult yet. Targeted to arrive in 2022, machines
able to achieve 1 exaFLOPS (10^18 floating-point operations per second) will mark the start of the
exascale era. Traditional sources of performance improvement, such as rising clock speeds, have been
exhausted, making increased parallelism the most viable route to further computational power.
This increased parallelism will come from both increasing the number of cores per node and
increasing the number of nodes in a machine. An exascale machine is expected to contain O(10^6)
nodes, each with O(10^3) cores, so this increase in parallelism is expected to occur at both the
inter- and intra-node levels.
Unfortunately, increasing parallelism alone will not be enough to reach exascale, due in part to
two factors. Firstly, current programming approaches make heavy use of synchronisation: at the
intra-node level, threads are subject to the fork/join model, and at the inter-node level the bulk
synchronous parallel (BSP) paradigm is commonly employed via barriers. Secondly, it is widely
accepted that core speeds will no longer be consistent across a single socket, let alone an entire
node. This hardware performance variability means that evenly dividing work between threads
will lead to load imbalance, causing the entire computation to run at the speed of the slowest core.
To solve these issues, the high-performance computing community has started to consider
asynchronous task-parallelism as a solution. As a programming model, asynchronicity allows
inter-node communication latency to be hidden, whilst dependencies between tasks enforce program
correctness and allow for finer-grained synchronisation. It is therefore thought that this
reduction in synchronisation could dramatically improve performance on machines with high degrees
of parallelism. Furthermore, tasking is a sensible approach to the problem of hardware performance
variability: it handles load balance implicitly, by executing work wherever capacity is available,
rather than statically dividing up the problem domain a priori, as commonly done today.
In addition to solving problems that will exist in upcoming supercomputers, tasking may also allow
more irregular and complex scientific programs to be developed: programs that have previously
performed poorly, or could not be written at all, for lack of a suitable programming model.
Much of this research is in its very early stages, and questions remain over the efficacy of
tasking on realistic scientific programs. In the early stages of my degree I will identify, and
potentially create, programs that require tasking, with the goal of forming a consensus in the
community on suitable applications from which benchmarks can be built to compare and test
different tasking approaches. This will lead on to determining which features are required and
which are already implemented well by existing frameworks, with the final aim of being able to
advise on, and contribute towards, a programming model for future supercomputers.

Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/N509619/1                                      01/10/2016   30/09/2021
1792447             Studentship    EP/N509619/1   01/10/2016   31/03/2020   Patrick Atkinson
Description Due to energy constraints, modern computer processor design has shifted from increasing the speed of processors to adding more processors to the same computer. Currently, the most well-supported and easiest way to program for these parallel computer architectures is to employ data-parallelism: making each processor perform the same operation concurrently on different operands. However, data-parallelism can lead to suboptimal performance for irregular algorithms and can suffer from load imbalance between processors. As an alternative, task-parallelism solves these problems by having each processor perform different work concurrently, fetching new work as needed. Unfortunately, task-parallelism is not well understood, or even supported, on many of the computer architectures used in the field of high-performance computing (HPC). Hence, the work I conducted in my PhD was first to evaluate and dissect task-parallel programming models on CPU architectures, where task-parallelism is well-supported, and then to evaluate the application of task-parallelism on GPU architectures, where it is not. Through the implementation of my own GPU task-parallel programming model, I found that task-parallelism is an effective way to program GPU architectures for a range of irregular computational science applications. Additionally, through my evaluation of task-parallelism on CPU architectures, I found several new mechanisms that improve the performance of certain HPC applications, in particular applications used in the fields of computational physics and computational chemistry.
Exploitation Route My work opens up the possibility for the efficient implementation of other irregular algorithms on many-core architectures. It also could allow for the re-formulation of existing methods from data-parallel implementations to task-parallel implementations, resulting in better performance. The software I developed could lead to performance improvements across a range of computational science applications.
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Electronics