Application Based Fault Tolerance in High Performance Computing Applications

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

As we move towards Exascale systems, the probability of faults occurring increases with the number of
components in the system. Some of these faults, such as Soft Errors (SE), can introduce noise into the data, which,
depending on the system, might be impossible to detect or correct, meaning that the computation is corrupted
and can potentially return invalid results. The most common fault tolerance techniques implemented in hardware
use Error Correcting Codes (ECC), which can correct a detected error only when it falls within the correction
capability of the code. A hardware implementation minimises the runtime performance overhead at the cost of
additional hardware complexity and memory bandwidth. These implementations also need more energy for the
additional computations and memory transfers, and as current supercomputers already consume over 10MW (enough
to power a small town), removing this additional hardware can improve the energy efficiency of Exascale systems.
We investigate high-performance software alternatives to hardware Fault Tolerance (FT) techniques. These have
the distinct advantage of not requiring additional hardware, which will prove highly beneficial at Exascale.
Application Based Fault Tolerance (ABFT) techniques also bring more flexibility to how faults are dealt with when
they occur, and this can lead to much greater performance. By investigating common High Performance Computing
(HPC) computation and communication patterns, we derive new methods for FT. ABFT techniques allow the
application to decide whether a particular error needs to be corrected or can be ignored. For example, after a
bit flip in the less significant bits of the mantissa of a double precision floating point number, the algorithm
may still converge to a correct value after a few iterations, and hence error correction is not required.
ABFT can also be used to provide FT to hardware that does not provide ECC capabilities, such as embedded
processors or consumer GPUs. Even if the hardware does provide FT, it can usually be turned off, and using
ABFT instead would free up the resources required by the hardware, such as memory and memory bandwidth,
which would improve the application's performance. In particular, we investigate new ABFT techniques for the
HPC dwarfs and apply Information and Coding theory to derive innovative methods that can detect and correct
(multiple) errors. We then look into techniques that apply to a subset of the dwarfs, which are highly optimised
and can be included in a software library so that they are ready to use out of the box. A big benefit of hardware
FT techniques is that they do not require users to change their code in order to protect their application from
faults, whereas ABFT techniques are application dependent and often require changes to the source code of the
application. To mitigate this problem, we investigate automatic detection of fault-vulnerable sections of the
application using techniques such as machine learning, and then apply the ABFT techniques during the
compilation of the application. This approach minimises the effort required from the programmer to adopt
these FT techniques and make their application fault tolerant.
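
The mantissa example in the abstract can be illustrated with a small sketch. This is not the project's method: the function names (flip_bit, relative_error, sqrt2_newton) are hypothetical, and a Newton iteration for the square root of 2 stands in for a generic self-correcting iterative algorithm. It assumes IEEE 754 double precision, where bit 0 is the least significant mantissa bit and bits 52 to 62 hold the exponent.

```python
import math
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer, flip, convert back.
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def relative_error(x: float, bit: int) -> float:
    """Relative error introduced by flipping a single bit of a non-zero x."""
    return abs(flip_bit(x, bit) - x) / abs(x)


def sqrt2_newton(iterations: int, fault_at=None, bit: int = 0) -> float:
    """Newton iteration x <- (x + 2/x) / 2 for sqrt(2).

    If fault_at is given, a single bit flip is injected into x just
    before that iteration (a simulated soft error, for illustration only).
    """
    x = 1.0
    for i in range(iterations):
        if i == fault_at:
            x = flip_bit(x, bit)
        x = 0.5 * (x + 2.0 / x)
    return x


if __name__ == "__main__":
    # A flip in the lowest mantissa bit perturbs 1.0 by one unit in the last place.
    print("relative error, bit 0 :", relative_error(1.0, 0))
    # A flip in the highest exponent bit of 1.0 is catastrophic.
    print("bit 62 flipped in 1.0 :", flip_bit(1.0, 62))
    # The iteration absorbs the low-order flip without any explicit correction.
    clean = sqrt2_newton(10)
    faulty = sqrt2_newton(10, fault_at=3, bit=0)
    print("clean vs faulty gap   :", abs(clean - faulty))
    print("distance to sqrt(2)   :", abs(faulty - math.sqrt(2.0)))
```

In this toy setting a low-order mantissa flip is harmless because the remaining iterations pull the value back to the fixed point, whereas the same flip in a high exponent bit would need to be detected and corrected. An application-level scheme can make that distinction; hardware ECC treats both cases the same.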

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/N509619/1                                   01/10/2016  30/09/2021
1834202            Studentship   EP/N509619/1  01/01/2017  24/05/2018  Grzegorz Pawelczak