Application Customisation: Enhancing Design Quality and Developer Productivity

Lead Research Organisation: Imperial College London

Department Name: Computing

Abstract

Mainstream processor architectures have seen few shake-ups since von Neumann articulated their basic principles in 1945 and Hoff developed the microprocessor architecture in 1969. This is changing: field programmable technology has been adopted by major companies such as Microsoft and Intel for datacentre computing, and new architectures are expected which integrate processor cores and field programmable resources on the same chip. These developments are largely motivated by improvements in the performance and energy efficiency of field programmable technology, which are so promising that industrial adoption is taking place despite the significant challenge of developing applications for custom computing systems based on this technology.

Our vision is to address this challenge by advancing the foundation and applications of customisation, which involves developing hardware and software to fit design requirements. The proposed Platform project aims to pioneer new capabilities for enhancing the design quality and designer productivity of custom computing systems, with the potential to revolutionise many applications, including those that need big data processing or improved reliability and security. It builds on the success of disruptive research funded by our previous Platform (EP/I012036/1).

An example of such success is research in runtime reconfiguration of custom computing systems: we developed new analysis methods to enable reconfiguration to remove idle functions, and we showed how reconfiguration can benefit many applications such as genomic data processing and finite-difference computation. Our work is disruptive since, in contrast to the current focus on partial reconfiguration, it demonstrates that full reconfiguration can provide significant energy-efficient acceleration over conventional multicore and manycore processors, reducing, for example, the runtime of bisulfite sequence alignment from hours to minutes for non-invasive prenatal and cancer diagnosis. Moreover, we invented the first field programmable architecture capable of single-cycle on-chip configuration generation, while current commercial devices are based on off-chip configuration generation that can take hours.

Such exciting progress is only possible because the Platform Grant enabled high-risk research by researchers who would otherwise suffer from funding gaps: 12 Research Associates in our team enjoyed Platform support before they found permanent positions. Renewed Platform support will allow continuing development of our dynamic and ambitious research team to explore next-generation computer systems and their applications.

The flexibility of the renewed Platform Grant will be used to address three new strategic areas, on which we are uniquely capable of making major impacts; we will conduct exploratory research to identify promising projects for responsive mode or other forms of funding:

1. Multi-level tradeoff-aware design automation, which includes investigating customisation strategies and the associated tradeoffs, automation of effective customisation strategies, and developing reusable demonstration facilities and testbeds.

2. Reconfigurable big data and cloud architectures, which include customisable big data processing, runtime design generation and optimisation, and domain-specific cloud optimisation.

3. Reliable system development life cycle, which includes codesign of reliable and resilient systems, high-coverage testing and verification strategies, and reliability and resilience life cycle management.

The added-value aspects of this Platform Grant proposal include: (a) ensuring a critical mass of researchers in key areas, (b) exploring significant strategic areas, (c) contributing to research infrastructure, (d) attracting fresh talent, (e) pioneering and strengthening international collaborations, and (f) accelerating technology transfer.

Planned Impact

Energy-efficient acceleration with custom computing is a critical technology which can benefit:

1. society, which faces global challenges such as climate change, healthcare and security

2. organisations with products or services based on high performance computer systems such as CHREC and Maxeler, and cloud service providers such as Microsoft

3. FPGA vendors such as Altera and Xilinx

4. related silicon device vendors, such as ARM and Imagination Technologies

5. companies with products or services which rely on systems that would benefit from the above devices, such as Moortec and ThoughtWorks

6. individuals or organisations who use such products or services, especially those who would benefit from enhanced reliability

7. the Research Associates working on this project and students who work on related projects

8. students and others studying related courses e.g. hardware design, high-performance computing, embedded systems

This project has significant potential for 3 kinds of transformational impact:

(i) novel computer systems with energy-efficient acceleration, and their tools,
(ii) improved productivity of their designers and users, and
(iii) new or improved applications and services enabled by them. Hence:

(a) Society will benefit from better understanding and modelling of climate change, and from enhanced healthcare and security

(b) Companies with products or services relying on high-performance computing systems would be able to offer more powerful systems with enhanced security in a shorter time and at a lower cost

(c) FPGA vendors will benefit from more efficient architectures and from higher designer productivity

(d) Other silicon vendors will benefit from better prototyping capabilities, and will be able to adapt particular techniques (e.g. those related to energy reduction or security enhancement) where applicable

(e) Companies with products or services relying on embedded systems would be able to speed up implementing better and cheaper real-time systems with lower energy and enhanced security

(f) Users of such products or services would be able to enjoy improvements in a more timely manner and at a lower price

(g) The environment will benefit from reduced energy usage; society will benefit from improved reliability

In addition, FPGA technology has the potential to benefit many more applications. Examples include improving:

(a) the internet by making it more efficient and secure through FPGA-based message routers and intrusion detection engines

(b) cloud computing systems by significantly reducing their power and energy consumption, and need for cooling

(c) healthcare provision by accelerating, for instance, medical robotics

(d) scientific understanding through experimental facilities such as the Large Hadron Collider at CERN

(e) simulation facilities for a wide range of applications, from chip design to climate effects to gaming, by lowering the effort of prototyping such systems

We will work with our Project Partners and Visiting Researchers to take into account their suggestions for key challenges that next-generation systems would need to meet, so that the project can produce as many useful results as possible, as early as possible. We will also explore dissemination and use of the project results for a wide range of research and development efforts, together with initial exploitation measures either through the project's industrial partners, or through exploitation routes recommended by Imperial Innovations.

The means of dissemination include publishing papers in relevant journals and conferences, providing a project web portal with access to publications and open-source tools and benchmarks, developing tutorial and teaching material, and liaising with related projects.

Funded Value:

£1,263,356

Funded Period:

Mar 17 - Feb 22

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/P010040/1

Principal Investigator:

Wayne Luk

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Computer Sys. & Architecture (15%)

Electronic Devices & Subsys. (70%)

Parallel Computing (15%)

Organisations

People	ORCID iD
Wayne Luk (Principal Investigator)
David Thomas (Co-Investigator)	http://orcid.org/0000-0002-9671-0917
Christos Bouganis (Co-Investigator)
Peter Pietzuch (Co-Investigator)
Paul Kelly (Co-Investigator)
Cristian Cadar (Co-Investigator)
Nobuko Yoshida (Co-Investigator)
George Constantinides (Co-Investigator)	http://orcid.org/0000-0002-0201-310X
Peter Y K Cheung (Co-Investigator)

Publications

Russell F (2017) Exploiting the chaotic behaviour of atmospheric models with reconfigurable architectures in Computer Physics Communications

Bolten M (2017) Algebraic description and automatic generation of multigrid methods in SPIRAL in Concurrency and Computation: Practice and Experience

Stow E (2022) Convolutional kernel function algebra in Frontiers in Computer Science

Gan L (2017) Solving Mesoscale Atmospheric Dynamics Using a Reconfigurable Dataflow Architecture in IEEE Micro

Vespa E (2018) Efficient Octree-Based Volumetric SLAM Supporting Signed-Distance and Occupancy Mapping in IEEE Robotics and Automation Letters

McInerney I (2023) Horizon-Independent Preconditioner Design for Linear Predictive Control in IEEE Transactions on Automatic Control

Todman T (2022) Custom Instructions for Networked Processor Templates in IEEE Transactions on Circuits and Systems II: Express Briefs

Fan H (2022) FPGA-Based Acceleration for Bayesian Convolutional Neural Networks in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Cheng J (2022) DASS: Combining Dynamic & Static Scheduling in High-Level Synthesis in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Liu J (2018) Polyhedral-Based Dynamic Loop Pipelining for High-Level Synthesis in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Key Findings
Further Funding


Description	1. A new deconvolution architecture for efficient FPGA implementation. FPGA-based accelerators are proposed for both deconvolutional and convolutional neural network (CNN) algorithms. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space in order to achieve optimal processing speed of the system and improve power efficiency. On a Xilinx Zynq ZC706 board, the proposed deconvolution accelerator achieves a performance of 90.1 GOPS at a 200 MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization, which significantly outperforms previous designs on FPGAs.

2. CROSSBOW is a new single-server multi-GPU system for training deep learning models that enables choice of batch size while scaling to multiple GPUs. CROSSBOW uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. CROSSBOW achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. Experiments show that CROSSBOW improves the training time of deep learning models on an 8-GPU server by 1.3-4x compared to TensorFlow.

3. Hipernetch is a novel FPGA-based design for performing high-bandwidth network switching. FPGAs have recently become popular in data centres due to their promising capabilities for a wide range of applications. With the recent surge in transceiver bandwidth, FPGAs can benefit the implementation and refinement of network switches used in data centres. Hipernetch replaces the crossbar with a "combined parallel round-robin arbiter". Unlike a crossbar, the combined parallel round-robin arbiter is easy to pipeline, and does not require centralised iterative scheduling algorithms that try to fit too many steps in a single or a few FPGA cycles. The result is a network switch implementation on FPGAs operating at a high clock frequency and with a low port-to-port latency. Our proposed Hipernetch architecture additionally provides a competitive switching performance approaching output-queued crossbar switches. Our implemented Hipernetch designs exhibit a throughput that exceeds 100 Gbps per port for switches of up to 16 ports, reaching an aggregate throughput of around 1.7 Tbps.

4. Recent advances in algorithm-hardware co-design for deep neural networks (DNNs) have demonstrated their potential in automatically designing neural architectures and hardware designs. Nevertheless, it is still a challenging optimization problem due to the expensive training cost and the time-consuming hardware implementation, which make exploration of the vast design space of neural architecture and hardware design intractable. Our proposed approach to automating hardware-accelerated DNNs is capable of locating designs on the Pareto frontier. This capability is enabled by a novel three-phase co-design framework, with the following new features: (a) decoupling DNN training from the design space exploration of hardware architecture and neural architecture, (b) providing a hardware-friendly neural architecture space by considering hardware characteristics in constructing the search cells, (c) adopting a Gaussian process to predict accuracy, latency and power consumption to avoid time-consuming synthesis and place-and-route processes. In comparison with the manually-designed ResNet101, InceptionV2 and MobileNetV2, we can achieve up to 5% higher accuracy with up to 3 times speed-up on the ImageNet dataset. Compared with other state-of-the-art co-design frameworks, the network and hardware configuration found by our framework can achieve up to 6% higher accuracy, 26 times lower latency, and 8.5 times higher energy efficiency.

5. Neural networks (NNs) have demonstrated their potential in a variety of domains ranging from computer vision to natural language processing. Among various NNs, two-dimensional (2D) and three-dimensional (3D) convolutional neural networks (CNNs) have been widely adopted for a broad spectrum of applications such as image classification and video recognition, due to their excellent capabilities in extracting 2D and 3D features. However, standard 2D and 3D CNNs are not able to capture their model uncertainty, which is crucial for many safety-critical applications including healthcare and autonomous driving. In contrast, Bayesian convolutional neural networks (BayesCNNs), as a variant of CNNs, have demonstrated their ability to express uncertainty in their prediction via a mathematical grounding. Nevertheless, BayesCNNs have not been widely used in industrial practice due to their compute requirements, which stem from sampling and from multiple subsequent forward passes through the whole network. These processes significantly increase the amount of computation and memory consumption in comparison to standard CNNs. We propose a novel FPGA-based hardware architecture to accelerate both 2D and 3D BayesCNNs based on Monte Carlo Dropout. Compared with other state-of-the-art accelerators for BayesCNNs, the proposed design can achieve up to 4 times higher energy efficiency and 9 times better compute efficiency. An automatic framework capable of supporting partial Bayesian inference is developed to explore the trade-off between algorithm and hardware performance. Extensive experiments are conducted to demonstrate that our framework can effectively find the optimal implementations in the design space.

6. Our research has led to novel reconfigurable architectures for reducing the latency of recurrent neural networks (RNNs) that are used for detecting gravitational waves. Gravitational interferometers such as the LIGO detectors capture cosmic events such as black hole mergers, which happen at unknown times and have varying durations, producing time-series data. We have developed a new architecture capable of accelerating RNN inference for analyzing time-series data from LIGO detectors. This architecture is based on optimizing the initiation intervals (II) in a multi-layer LSTM (Long Short-Term Memory) network, by identifying appropriate reuse factors for each layer. A customizable template for this architecture has been designed, which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools. The proposed approach has been evaluated based on two LSTM models, targeting a ZYNQ 7045 FPGA and a U250 FPGA. Experimental results show that with balanced II, the number of DSPs can be reduced by up to 42% while achieving the same IIs. When compared to other FPGA-based LSTM designs, our design can achieve about 4.92 to 12.4 times lower latency.
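As a rough illustration of what "locating designs on the Pareto frontier" means in finding 4, the sketch below filters a handful of candidate designs down to those that no other candidate beats on accuracy, latency and power at once. It is not the project's co-design framework: the `Design` record, the function names and all of the numbers are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    """One candidate design; every value used below is made up."""
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    power_w: float     # lower is better


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy
                and a.latency_ms <= b.latency_ms
                and a.power_w <= b.power_w)
    strictly_better = (a.accuracy > b.accuracy
                       or a.latency_ms < b.latency_ms
                       or a.power_w < b.power_w)
    return no_worse and strictly_better


def pareto_frontier(designs: List[Design]) -> List[Design]:
    """Keep only the designs that no other design dominates."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]


# Hypothetical candidates, not measurements from the project.
candidates = [
    Design("A", accuracy=0.76, latency_ms=12.0, power_w=9.0),
    Design("B", accuracy=0.74, latency_ms=6.0, power_w=7.0),
    Design("C", accuracy=0.73, latency_ms=8.0, power_w=8.0),  # dominated by B
    Design("D", accuracy=0.78, latency_ms=20.0, power_w=11.0),
]

frontier = pareto_frontier(candidates)
print([d.name for d in frontier])
```

Design C drops out because B is better on all three metrics; A, B and D remain because each wins on at least one metric against every other candidate. In a setting like the one described, the metric values would presumably come from a predictor rather than from synthesis runs, but that part is outside this sketch.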
Exploitation Route	1. Organisations with products or services based on high performance computer systems, and cloud service providers such as Microsoft, especially those based on deep learning technologies. Microsoft is supporting a PhD student for further research.

2. FPGA vendors such as Intel and Xilinx, and related silicon device vendors, such as ARM and Imagination Technologies; also companies with products or services which rely on systems that would benefit from the above devices. Intel and Xilinx are sponsoring further research.

3. Researchers, including the Research Associates working on this project and students who work on related projects, working on high-performance systems related to various applications such as those involving deep learning.
Sectors	Digital/Communication/Information Technologies (including Software), Electronics, Financial Services, and Management Consultancy


Description	Advanced concepts and novel technologies for the study of the impact of ionising radiation on tissue
Amount	£78,532 (GBP)
Funding ID	ST/T002638/1
Organisation	Science and Technology Facilities Council (STFC)
Sector	Public
Country	United Kingdom
Start	10/2019
End	03/2022


Description	AppControl: Enforcing Application Behaviour through Type-Based Constraints
Amount	£1,483,020 (GBP)
Funding ID	EP/V000462/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	09/2020
End	06/2024


Description	Centre for Spatial Computational Learning
Amount	£1,211,769 (GBP)
Funding ID	EP/S030069/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	11/2019
End	10/2022


Description	CloudCAP: Capability-based Isolation for Cloud Native Applications
Amount	£879,242 (GBP)
Funding ID	EP/V000365/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	10/2020
End	09/2023


Description	DART: Design Accelerators by Regulating Transformations
Amount	£613,910 (GBP)
Funding ID	EP/V028251/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	09/2021
End	09/2024


Description	Efficient Cross-Domain DSL Development for Exascale
Amount	£430,061 (GBP)
Funding ID	EP/W007789/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	08/2021
End	08/2024


Description	Gen X: ExCALIBUR working group on Exascale continuum mechanics through code generation.
Amount	£159,456 (GBP)
Funding ID	EP/V001493/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	04/2020
End	11/2021


Description	POST: Protocols, Observabilities and Session Types
Amount	£1,462,802 (GBP)
Funding ID	EP/T006544/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	04/2020
End	03/2025


Description	Session Types for Reliable Distributed Systems (STARDUST)
Amount	£697,651 (GBP)
Funding ID	EP/T014709/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	10/2020
End	09/2024


Description	SysGenX: Composable software generation for system-level simulation at Exascale
Amount	£458,761 (GBP)
Funding ID	EP/W026066/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	12/2021
End	11/2024
