Exploration of In-Core Accelerators for High Performance Computing Applications
Lead Research Organisation:
University of Bristol
Department Name: Computer Science
Abstract
The field of High Performance Computing (HPC) aims to decrease the time taken to solve large problems by utilising the available hardware to its full potential and executing tasks in parallel. Some of the most common computationally intensive applications that utilise HPC are biochemical, engineering, and physics simulations, allowing many hypotheses to be tested without having to perform physical experiments. However, with the amount of data needing to be processed and the size of the problems needing to be solved ever increasing, new advancements within the HPC space are needed to ensure that solving these problems stays feasible.
Since its inception, the Top500 (a list of the 500 fastest supercomputers) has been dominated by x86-powered systems. Recently, however, there has been an increase in the number of Arm-based supercomputers in the Top500, and in 2020 the Fugaku supercomputer became the first Arm-based machine to place first. One of the main reasons Fugaku is so powerful is its use of the Fujitsu A64FX processor, the first in the world to implement Arm's Scalable Vector Extension (SVE). Although vector techniques such as SIMD have been around since the 1960s, SVE expands upon them by allowing chip designers to implement vectors of lengths ranging from 128 bits to 2048 bits, letting them optimise their hardware for their own targets. SVE therefore offers increased parallelism and more efficient loop unrolling whilst keeping things simple for developers, because the programming model adjusts dynamically to the available vector length. This removes the need to re-compile high-level languages or re-write hand-written assembly between different SVE platforms.
In mid-2021, Arm announced a new extension to their instruction set, the Scalable Matrix Extension (SME). SME builds upon SVE and adds new instructions to perform matrix-based computations more efficiently; for many scientific applications, such computations form a large part of the workload. High utilisation of SME instructions could lead to large performance gains in many HPC applications without having to port code to a GPU, a time-consuming and non-trivial process. Techniques like SME were first seen in NVIDIA's Tensor Cores, which have provided great performance gains in Machine and Deep Learning applications. Additionally, with companies such as Intel and Apple including in-core matrix accelerators in their most recent processors (Apple's M1 and Intel's Sapphire Rapids), it is clear that the die space required is worth the performance gains achieved, with Intel reporting up to a 7.8x speedup in certain cases. However, we are yet to see from independent research what benefit in-core pipelined matrix accelerators like SME provide to HPC applications.
I aim to begin my research by continuing the development of the Simulation Engine (SimEng) that the HPC group at Bristol is currently developing as part of its ASiMoV project. Using this architectural simulator, I will be able to configure and simulate various implementations of a CPU core with SME support. This will enable me to assess what kind of performance gains could be expected in HPC, Deep Learning, and Neural Network applications from a similar real-world design. Furthermore, I will be able to investigate the performance, development-time, and energy-consumption differences between porting benchmarks to GPUs and using an in-CPU solution such as SME. Although other architectural simulators exist, such as gem5, SimEng's core design goals are cycle accuracy and simulation speed. Being as accurate to real-world hardware as possible is essential to produce reliable and meaningful results, and as any simulation of SME, or performance results generated whilst utilising SME, will be completely novel, this accuracy will help ensure the validity of the work I aim to carry out.
This project is in part funded by and in collaboration with Arm Ltd, and falls within the EPSRC Architecture and Operating System
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/T51763X/1 | | | 30/09/2019 | 07/11/2025 |
2641219 | Studentship | EP/T51763X/1 | 08/11/2021 | 07/11/2025 | Finn Wilkinson