Future-proof massively-parallel execution of multi-block applications
Lead Research Organisation:
UNIVERSITY OF OXFORD
Department Name: Oxford e-Research Centre
Abstract
For many years, increasing the clock frequency of microprocessors led to steady improvements in the performance of computer applications. This gave applications an almost free performance boost, with no need to rewrite software for each new generation of processors. However, increasing the performance of processors in this manner led to an unsustainable increase in energy consumption. Thus, to gain higher performance, chip developers now rely on multiple cores operating in parallel. The latest CPUs have up to 10 cores, each with a vector unit producing up to 8 single-precision floating-point results per clock cycle, while the latest graphics processors (GPUs) have up to 2688 much simpler cores operating in groups of 32.
This move into manycore computing has led to considerable hardware innovation, and it is likely that the next 10 years will see further rapid evolution in computer architectures. This poses huge challenges to application developers, who naturally wish to concentrate on their engineering and scientific applications and how best to model them, without having to worry about the details of modern computer architectures. To address this, there is a range of efforts within scientific computing to develop high-level software packages or frameworks so that the application developer can specify what they want to be computed at a high level, and the package then takes care of the implementation details.
Building on prior EPSRC-funded research to develop a framework called OP2 for unstructured grid applications, this proposal aims to develop a future-proof extension called OPS to handle the needs of multi-block structured grid applications. Developers' applications can be written in FORTRAN or C, using a carefully-designed application programming interface (API), and then OPS generates customised code for the implementation on different hardware target platforms.
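The separation described above, where the application states what is computed and the framework decides how, can be illustrated with a minimal sketch. This is not the OPS API: `par_loop`, its arguments, and the two backend names are hypothetical, and the sketch picks a backend at run time, whereas OPS generates customised code for each target platform.

```python
from concurrent.futures import ThreadPoolExecutor


def par_loop(kernel, n, out, *inputs, backend="sequential", chunks=4):
    """Hypothetical helper: set out[i] = kernel(i, *inputs) for i in range(n)."""

    def run_range(lo, hi):
        for i in range(lo, hi):
            out[i] = kernel(i, *inputs)

    if backend == "sequential":
        run_range(0, n)
    elif backend == "threads":
        step = max(1, -(-n // chunks))  # ceiling division
        with ThreadPoolExecutor(max_workers=chunks) as pool:
            futures = [pool.submit(run_range, lo, min(lo + step, n))
                       for lo in range(0, n, step)]
            for future in futures:
                future.result()  # re-raise any exception from a worker
    else:
        raise ValueError(f"unknown backend: {backend}")
    return out


# Application code: only the per-point computation is specified.
def average(i, a):
    left = a[max(i - 1, 0)]
    right = a[min(i + 1, len(a) - 1)]
    return (left + a[i] + right) / 3.0


a = [float(i) for i in range(8)]
seq = par_loop(average, len(a), [0.0] * len(a), a)
thr = par_loop(average, len(a), [0.0] * len(a), a, backend="threads")
```

The application-level function `average` is unchanged between the two runs; only the `backend` argument differs.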
As well as customising for the different hardware, two other optimisation approaches will be adopted. One is the use of "tiling" to overlap the execution of parallel loops which are usually executed sequentially. This improves both performance and energy efficiency by reusing data within the cache, cutting down on the number of times data is moved between the processor and the main memory. This is becoming increasingly important on modern architectures because the energy cost and time taken for data movement are much greater than for floating point operations.
The other optimisation is the use of run-time optimisation for applications which execute for a long time. The backend implementations are parameterised, with parameters controlling aspects such as the number of threads in a thread block, or the size of a "tile" in the tiling optimisation. The optimal values for these parameters are not known a priori, and they can significantly affect performance. By dynamically varying the values, and timing the consequential changes in performance, we can implement heuristics to iteratively improve the parameter values during the execution.
The new OPS framework will be assessed, both for performance and ease-of-use, by applying it to two important academic CFD codes: ROTOR, developed at Bristol by Prof. Chris Allen, and SBLI, developed at Southampton by Prof. Neil Sandham. As well as being important codes in their own right, these are also representative of the needs of other codes within CCP12 (Computational Engineering), the UK Turbulence Consortium, and the UK Applied Aerodynamics Consortium.
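The idea of tiling across two consecutive loops can be sketched as follows. This is an illustrative variant only (overlapped tiles with a one-point halo that is recomputed at tile edges), not necessarily the scheme OPS uses; the function names and the three-point stencil are assumptions made for the example. Python will not show the cache benefit, but the loop schedule is the point: the intermediate values for each tile are produced and consumed before moving on.

```python
def smooth(src, i):
    """Three-point average with clamped boundaries."""
    n = len(src)
    return (src[max(i - 1, 0)] + src[i] + src[min(i + 1, n - 1)]) / 3.0


def untiled(a):
    n = len(a)
    b = [smooth(a, i) for i in range(n)]  # loop 1 sweeps the whole array
    c = [smooth(b, i) for i in range(n)]  # loop 2 sweeps it again
    return c


def tiled(a, tile_size):
    n = len(a)
    c = [0.0] * n
    for lo in range(0, n, tile_size):
        hi = min(lo + tile_size, n)
        # Loop 1, restricted to this tile plus a one-point halo on each side.
        b_lo, b_hi = max(lo - 1, 0), min(hi + 1, n)
        b_local = {i: smooth(a, i) for i in range(b_lo, b_hi)}
        # Loop 2 on the same tile, while b_local is still "hot".
        for i in range(lo, hi):
            left = b_local[max(i - 1, 0)]
            right = b_local[min(i + 1, n - 1)]
            c[i] = (left + b_local[i] + right) / 3.0
    return c


data = [float(i * i % 7) for i in range(10)]
```

Calling `tiled(data, 4)` gives the same values as `untiled(data)`; only the order in which the work is done changes.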
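A heuristic of this kind can be sketched as a simple hill climb over an ordered list of candidate parameter values. Everything here is an assumption made for illustration: the names `tune`, `measure_seconds` and `synthetic_cost`, the candidate list, and the cost model, which stands in for a measured run time so that the example is deterministic.

```python
import time


def measure_seconds(run_step, value):
    """Time one application step executed with the given parameter value."""
    start = time.perf_counter()
    run_step(value)
    return time.perf_counter() - start


def tune(candidates, cost, iterations):
    """Hill-climb over an ordered list of candidate parameter values.

    cost(value) returns the observed cost (for example, seconds per step).
    """
    best = len(candidates) // 2
    best_cost = cost(candidates[best])
    direction = 1
    for _ in range(iterations):
        trial = best + direction
        if not 0 <= trial < len(candidates):
            direction = -direction  # ran off the end: search the other way
            continue
        trial_cost = cost(candidates[trial])
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost  # keep moving this way
        else:
            direction = -direction  # no gain: try the other neighbour
    return candidates[best]


def synthetic_cost(tile_size):
    """Stand-in for a timing measurement; cheapest at a tile size of 32."""
    return abs(tile_size - 32) + 1


tile_sizes = [8, 16, 32, 64, 128, 256, 512]
chosen = tune(tile_sizes, synthetic_cost, iterations=10)
```

In a real run, `cost` would wrap `measure_seconds` around one time step of the application, so that tuning proceeds while useful work is being done.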
Planned Impact
The Dominic Tildesley report on "A Strategic Vision for UK e-Infrastructure" includes many examples of the importance of computational modelling in a wide range of industries as well as in government, and it emphasises the importance of software, including the quote:
"This continuing growth in price/performance and power-efficiency comes at a cost: the new systems will be based on complex multi-core and accelerator-based architectures which are much more challenging to program than are today's systems, requiring a revolution in software design and development."
The importance and challenge of manycore computing is also reflected in its inclusion as one of EPSRC's five main priorities within the ICT area.
Historically, the UK has played a leading role in computational modelling, relative to its size, particularly in areas such as aeronautical CFD and weather prediction. It is very important that this position is maintained, and one of the key ways of achieving this is to ensure that the computational modellers can spend their time developing better models, not worrying about the details of novel computer architectures.
At the same time, we need to train a new generation of scientific computing experts who do understand thoroughly the details of novel computer architectures and how best to exploit them. If £100M is being spent annually in the UK on HPC hardware (a very conservative estimate), then the cost savings are very substantial if software improvements such as tiling and run-time optimisation can deliver a factor 2 increase in performance.
Our research will develop open-source software giving support specifically for multi-block structured-grid applications in engineering and science, but more generally it will contribute towards the domain of manycore parallel computing, which is vital to the health of the country's capability in computational modelling, which underlies so much of modern engineering and science.
Organisations
Publications

Giles M
(2014)
GPU Implementation of Finite Difference Solvers

Giles MB
(2014)
Trends in high-performance computing for engineering calculations.
in Philosophical transactions. Series A, Mathematical, physical, and engineering sciences

Jammy S
(2016)
Block-structured compressible Navier-Stokes solution using the OPS high-level abstraction
in International Journal of Computational Fluid Dynamics

Mudalige G
(2019)
Large-scale performance of a DSL-based multi-block structured-mesh application for Direct Numerical Simulation
in Journal of Parallel and Distributed Computing

Reguly I
(2016)
High Performance Computing

Reguly I
(2017)
Beyond 16GB

Reguly I
(2017)
Beyond 16GB: Out-of-Core Stencil Computations

Reguly I
(2018)
Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS
in IEEE Transactions on Parallel and Distributed Systems
Description | The objective of this work was to demonstrate an approach to the creation of future-proof software through separating the specification of what is to be computed from the details of the implementation which achieves this. Flexible code generation techniques were then used to create a number of different back-end implementations for different computer architectures, such as GPUs or many-core CPUs. AWE funded related work on a series of "mini-apps" which demonstrated there is no significant performance penalty in following our flexible approach rather than hand-crafting separate implementations for different platforms. |
Exploitation Route | AWE is now considering whether to adopt this approach in their own software development process. In addition, a UK company specialising in mathematical software is building on these ideas in developing their own software for a particular class of applications. The software itself is available on Github under an open source license: https://github.com/OP-DSL/OPS |
Sectors | Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Energy; Security and Diplomacy |
URL | http://www.oerc.ox.ac.uk/projects/ops |
Description | AWE is now considering whether to adopt this approach in their own software development process. In addition, a UK company specialising in mathematical software is building on these ideas in developing their own software for a particular class of applications. The software itself is available on Github under an open source license: https://github.com/OP-DSL/OPS |
First Year Of Impact | 2013 |
Sector | Security and Diplomacy |
Impact Types | Economic |
Description | AWE (2014) |
Amount | £24,984 (GBP) |
Organisation | Atomic Weapons Establishment |
Sector | Private |
Country | United Kingdom |
Start | 05/2014 |
End | 11/2014 |
Description | AWE (2015) |
Amount | £24,725 (GBP) |
Organisation | Atomic Weapons Establishment |
Sector | Private |
Country | United Kingdom |
Start | 06/2015 |
End | 12/2015 |
Description | Rolls-Royce (2014) |
Amount | £29,960 (GBP) |
Organisation | Rolls Royce Group Plc |
Sector | Private |
Country | United Kingdom |
Start | 09/2014 |
End | 12/2014 |
Description | Rolls-Royce (2015) |
Amount | £36,396 (GBP) |
Organisation | Rolls Royce Group Plc |
Sector | Private |
Country | United Kingdom |
Start | 01/2015 |
End | 12/2015 |
Description | CUDA Programming on NVIDIA GPUs |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | One week course on CUDA programming on NVIDIA GPUs, available to both academics and non-academics. Lots of the students have since gone on to use CUDA programming in their research. |
Year(s) Of Engagement Activity | 2008,2009,2010,2011,2012,2013,2014 |
URL | http://people.maths.ox.ac.uk/gilesm/cuda/ |