ENergy Efficient Adaptive Computing with multi-grain heterogeneous architectures (ENEAC)

Lead Research Organisation: University of Bristol

Department Name: Electrical and Electronic Engineering

Abstract

Energy efficiency is one of the primary design constraints for modern processing systems. Hardware accelerators are seen as a key technology for delivering high performance within a limited energy budget. In addition, the arrival of computing languages such as OpenCL offers programmers a route to target different types of multi-core accelerators using a single source code. Performance portability is a significant challenge, especially if the accelerators have different microarchitectures, as is the case in CPU-GPU-FPGA systems. This research addresses the energy and performance challenge by investigating how a device formed by processing units with different granularities, ranging from coarse-grain CPU cores of different complexity, through medium-grain general-purpose GPU cores, to fine-grain FPGA logic cells, can be dynamically programmed. The challenge is to be able to program all these resources with a single programming model and create a run-time system that can automatically tune the software to the best execution resource from energy and performance points of view. The results from this research are expected to deliver new fundamental insights into the question: how can future computers obtain orders of magnitude higher performance with limited energy budgets?

Planned Impact

The potential beneficiaries of this research are the electronics and semiconductor companies involved in the creation of the multi-core processing platforms that will be at the center of future super-computing devices. A good indication of the challenges that this industry faces over the next 20 years is available in the International Technology Roadmap for Semiconductors (ITRS). This report identified energy as a fundamental challenge that future integrated circuits will face.
The objective set by the ITRS in terms of energy requirements is to maintain the static and dynamic power at current or decreasing levels despite the exponential growth in logic complexity and throughput. Heterogeneous processing enables a better match between the processing requirements and the hardware, but understanding the optimal match and making this transparent to the programmer is a significant research challenge. The proposed technology is highly relevant to servers dealing with offline analytics, web applications and large-data-set applications in areas including meteorology, seismology, genomics and complex physics simulations. These applications offer a high level of parallelism, so multiple cores can work on the problem at the same time, but the power requirements of these large collections of server processors become very high. It is highly likely that the optimal core type depends on the required performance level or energy budget.
To make sure that the knowledge and know-how permeates adequately into the industry environment the following actions will be taken:
1. The hardware demonstrators will be presented in a series of visits to the industrial collaborators Altera and ARM and to other companies working in the area of interest. These visits will be organized at key stages of the project to ensure that adequate feedback is received. Short secondments of academic staff to industry will be arranged together with the visits. Additional contacts will be made with companies developing server systems around ARM technology to demonstrate the potential energy improvements of the approach. The technology could also extend the design flows of companies targeting high-performance computing with accelerators, such as IBM and Microsoft.
2. In the HIPEAC/EACO workshops we intend to organise a series of seminars/demonstrations to bring the project in contact with industry and academia. We believe that an interactive demonstration around a drug-discovery or astronomy algorithm should be able to attract plenty of general interest.
3. The traditional avenues of journal and conference publications will also be fully utilized. We have identified AHS (Adaptive Hardware Systems), FPL (Field Programmable Logic) and DAC (Design Automation Conference) as high-quality conferences suitable for this work. Journals such as IEEE TVLSI, IEEE Computers and IET CDT will be targeted.
The expected impacts of this research can be summarised as:
1. To understand how available heterogeneous devices with architecturally different processing resources can be programmed efficiently using a single parallel language.
2. To deliver one order of magnitude better energy/performance operation exploiting adaptation of the algorithm to these different computing resources at run-time.
3. To show how this approach can be successfully applied to high-performance and embedded computing systems used in industrial applications.

Funded Value:

£567,204

Funded Period:

Jan 16 - Jan 20

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/N002539/1

Principal Investigator:

Jose Nunez-Yanez

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Electronic Devices & Subsys. (40%)

Fundamentals of Computing (60%)

Organisations

People	ORCID iD
Jose Nunez-Yanez (Principal Investigator)
Simon McIntosh-Smith (Co-Investigator)

Publications

Brad Hall A (2022) Identification of a branchial cleft anomaly via handheld point-of-care ultrasound. in Journal of ultrasonography

Zhang Y (2019) Adaptive event-triggered anomaly detection in compressed vibration data in Mechanical Systems and Signal Processing

Nunez-Yanez J (2017) Adaptive voltage scaling in a heterogeneous FPGA device with memory and logic in-situ detectors in Microprocessors and Microsystems

Galindo Sanchez F (2017) Energy proportional streaming spiking neural network in a reconfigurable system in Microprocessors and Microsystems

Mohammad Hosseinabady (2018) Pipelined Streaming Computation of Histogram in FPGA OpenCL in Parallel Computing is Everywhere

Jose Nunez-Yanez (2018) Simultaneous Multiprocessing on a FPGA+CPU Heterogeneous System-On-Chip in Parallel Computing is Everywhere

Nunez-Yanez J (2018) Simultaneous multiprocessing in a software-defined heterogeneous FPGA in The Journal of Supercomputing

Rodríguez A (2019) Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platform in The Journal of Supercomputing

McEwan D (2019) Machine Learning, Optimization, and Data Science - 5th International Conference, LOD 2019, Siena, Italy, September 10-13, 2019, Proceedings

Zhang Y (2017) Optimal compression of vibration data with lifting wavelet transform and context-based arithmetic coding

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Description	When a heterogeneous chip has computing resources of different capabilities and speeds, there are cases in which using all these units in parallel is more efficient than using only the most powerful unit and putting the others to sleep. This type of heterogeneous parallel computing is viable if an energy-aware scheduler can adjust the amount of work given to each unit according to its capacity. Discovering this capacity at run-time is not trivial, since it depends on the type of work that is required. The ENEAC platform uses scheduling algorithms to analyse the performance and power behaviour of the different compute engines at run-time and distributes the workload to optimize energy and performance. The system can be used to keep power below a given cap or to maximize throughput. The schedulers are fully automatic, need no user intervention, and have been applied to encryption, machine learning and fluid dynamics tasks.
Exploitation Route	The technology could be deployed in a large-scale data center or high-performance computing system to accelerate algorithms used in digital economy areas. At the moment, further funding has been obtained through the MINET (Machine learning at the network edge) award and the Accelerating Data Analytics on Energy Efficient Heterogeneous Architectures FEDER (European Fund of Regional Development) award to explore the application of the results to deep learning tasks.
Sectors	Aerospace, Defence and Marine,Digital/Communication/Information Technologies (including Software),Electronics,Environment
URL	https://github.com/eejlny/ENEAC
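The capacity-based work distribution described above can be illustrated with a minimal Python sketch. It is not taken from the ENEAC code base: the function names, the device names and the throughput and power figures are hypothetical, and the model assumes that device throughputs and powers simply add when units run in parallel.

```python
from itertools import combinations


def split_work(total_items, throughputs):
    """Split total_items across devices in proportion to measured throughput."""
    total_rate = sum(throughputs.values())
    shares = {name: int(total_items * rate / total_rate)
              for name, rate in throughputs.items()}
    # Integer rounding can leave a few items unassigned; give them to the fastest device.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += total_items - sum(shares.values())
    return shares


def best_under_cap(throughputs, powers, cap):
    """Return (devices, combined throughput) for the fastest subset within the power cap, or None."""
    best = None
    names = list(throughputs)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            if sum(powers[n] for n in combo) <= cap:
                rate = sum(throughputs[n] for n in combo)
                if best is None or rate > best[1]:
                    best = (combo, rate)
    return best


# Hypothetical measurements: items per second and watts for each compute engine.
throughputs = {"fpga": 300.0, "cpu": 100.0}
powers = {"fpga": 2.0, "cpu": 1.0}

print(split_work(1000, throughputs))
print(best_under_cap(throughputs, powers, cap=2.5))
```

With a proportional split, every unit is expected to finish at roughly the same time, which is why the throughput of each unit has to be measured at run-time for the actual workload.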


Description	The ENEAC platform is being used in deep learning applications in the MINET project and in the Accelerating Data Analytics on Energy Efficient Heterogeneous Architectures FEDER (European Fund of Regional Development) project. It has also been used in the TeamPlay H2020 European project to understand the energy requirements of a heterogeneous architecture formed by GPU and CPU processing resources (Run-time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling) and to compare this with an architecture that uses FPGA and CPU resources in parallel (Sparse Matrix-Dense Matrix multiplication on Heterogeneous CPU+FPGA Embedded System). These results were presented at the PARMA DITAM conference https://parma-ditam-workshop.github.io/Program.pdf. The international collaboration in the FEDER project has resulted in the paper "Lightweight asynchronous scheduling in heterogeneous reconfigurable systems", which can be downloaded at https://www.sciencedirect.com/science/article/pii/S1383762122000042. The ENEAC system being used in the MINET project can be seen here https://www.mdpi.com/2079-9292/9/11/1765
First Year Of Impact	2018
Sector	Education,Electronics
Impact Types	Economic


Description	Royal Society Industrial fellowships
Amount	£66,000 (GBP)
Funding ID	INF\R2\192044
Organisation	The Royal Society
Sector	Charity/Non Profit
Country	United Kingdom
Start	01/2020
End	12/2021


Title	ENEAC
Description	Hardware framework and analysis algorithms for using the ENEAC platform in a parallel system combining FPGA and CPU resources
Type Of Material	Data analysis technique
Year Produced	2019
Provided To Others?	Yes
Impact	Further funding for the application of ENEAC to deep learning
URL	https://github.com/eejlny/ENEAC


Title	GPUtx1
Description	Power modelling technology for CPU and GPU devices
Type Of Material	Computer model/algorithm
Year Produced	2019
Provided To Others?	Yes
Impact	Being used in the H2020 Teamplay project
URL	https://github.com/kranik/ARMPM_BUILDMODEL/tree/GPU_tx1


Description	HARP collaboration with Intel and University of Paderborn
Organisation	Intel Corporation
Country	United States
Sector	Private
PI Contribution	Exploration of using the energy aware scheduling algorithms part of ENEAC in a Xeon+FPGA device developed by Intel. This is part of the Hardware Accelerator Research Program (https://software.intel.com/en-us/hardware-accelerator-research-program)
Collaborator Contribution	Intel has donated access and support for their HARP systems.
Impact	Initial results were presented in the ARM research summit in 2017. A journal paper "Hardware Accelerator Research Program" has been submitted in 2018 but it is still under review.
Start Year	2017


Title	Energy aware scheduling in heterogeneous processors
Description	General Purpose Heterogeneous Computing Units. Heterogeneous chips that combine CPUs and FPGAs can distribute processing so that the algorithm tasks are mapped onto the most suitable processing element. New software-defined high-level design environments for these chips use general-purpose languages such as C++ and OpenCL for hardware and interface generation without the need for register transfer language expertise. These advances in hardware compilers have resulted in significant increases in FPGA design productivity. In this research, we investigate how to enhance an existing software-defined framework to reduce overheads and enable the utilization of all the available CPU cores in parallel with the FPGA hardware accelerators. Instead of selecting the best processing element for a task and simply offloading onto it, we introduce two schedulers, Dynamic and LogFit, which distribute the tasks among all the resources in an optimal manner. A new interrupt-based platform is created that removes spin-locks and allows the processing cores to sleep when not performing useful work. For a compute-intensive application, we obtained up to 45.56% more throughput and 17.89% less energy consumption when all devices of a Zynq-7000 SoC collaborate in the computation, compared against FPGA-only execution.
Type Of Technology	New/Improved Technique/Technology
Year Produced	2019
Impact	A new collaborative research proposal has been established with staff members in the Department of Electronic Engineering at Imperial College, and a new collaboration on energy efficiency has been established with Universidad Rio Grande do Norte, Brazil.
URL	https://github.com/eejlny/gphcu/tree/master
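The idea of distributing tasks among all resources, with workers that block instead of spinning, can be sketched in Python as follows. This is a generic self-scheduling loop written for illustration only; it is not the Dynamic or LogFit scheduler from the repository above, and the function name, the device names and the per-device chunk sizes are assumptions.

```python
import threading


def run_dynamic(n_items, chunk_sizes, process):
    """Hand out index ranges to one worker thread per device until all items are done.

    chunk_sizes maps a device name to the number of items it takes per request;
    process(device, start, end) performs the work on items [start, end).
    """
    lock = threading.Lock()
    next_index = 0
    done = {name: 0 for name in chunk_sizes}

    def worker(name, size):
        nonlocal next_index
        while True:
            # A blocking lock: a waiting thread sleeps rather than busy-waiting.
            with lock:
                start = next_index
                if start >= n_items:
                    return
                end = min(start + size, n_items)
                next_index = end
            process(name, start, end)
            done[name] += end - start  # each worker only updates its own entry

    threads = [threading.Thread(target=worker, args=(name, size))
               for name, size in chunk_sizes.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


# Hypothetical chunk sizes: the accelerator takes larger chunks than a CPU core.
counts = run_dynamic(1000, {"fpga": 64, "cpu": 8}, lambda dev, s, e: None)
print(sum(counts.values()))
```

In this sketch a faster device returns for new chunks more often, so the split adapts to device speed without any prior knowledge of it. Choosing the chunk size for each device is the part that a more elaborate scheduler would tune at run-time.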


Description	Evaluation of Heterogeneous execution on an HPC-oriented CPU-FPGA System-on-Chip.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Current HPC systems require highly specialized hardware to continue improving performance because the gains offered by technology scaling have diminished significantly. One promising approach to specialization is the integration of CPUs and FPGAs in the same socket, so that programmers can write highly optimized kernels that deliver excellent performance. However, following this approach requires overcoming several obstacles. First, programming FPGAs with hardware description languages is very challenging and error-prone. Second, maximizing the utilization of all the computing power of CPU and FPGA devices requires high-level programming frameworks that help with the burden of scheduling the work and managing the data. This work presents an analysis of a C++ template-based framework that enables programmers to run OpenCL code on any heterogeneous platform, including Intel HARP, with ease. First, it goes over the hardware platform and the interconnection network between the devices, since performance largely depends on them. Second, the talk comments on how High Level Synthesis (HLS) tools can be the substrate for the heterogeneous framework and briefly reviews the usual preference of FPGAs for very deep pipelines with single-task kernels over the single-instruction multiple-thread model of GPUs.
Year(s) Of Engagement Activity	2019
URL	https://sites.google.com/view/hlpgpu2019/home


Description	Heterogeneous FPGA+GPU Embedded Systems: Challenges and Opportunities
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The edge computing paradigm has emerged to handle cloud computing issues such as scalability, security and low response time. This new computing trend relies heavily on ubiquitous embedded systems at the edge. Performance and energy consumption are two main factors that should be considered during the design of such systems. Focusing on performance and energy consumption, this paper studies the opportunities and challenges that a heterogeneous embedded system consisting of embedded FPGAs and GPUs (as accelerators) can provide for applications. We study three challenges throughout the paper: design, modeling and scheduling. We also propose three techniques to cope with these challenges. Applying the proposed techniques to three applications (image histogram, dense matrix-vector multiplication and sparse matrix-vector multiplication) shows 1.79x and 2.29x improvements in performance and energy consumption, respectively, when both FPGA and GPU execute the corresponding application in parallel.
Year(s) Of Engagement Activity	2019
URL	https://eehpcwg.llnl.gov/events.html#jan-22


Description	Invited talk Marionet workshop HiPEAC Manchester 2018
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The talk on "High productivity and low power parallel computing with logic, voltage and frequency scaling in reconfigurable devices" presented the application of voltage, frequency and logic scaling to reconfigurable devices to obtain energy proportional computing. Two case studies consisting of convolutional neural networks and video fusion were created to demonstrate the gains and feasibility of adjusting the operating point beyond nominal and near the operational limits. The fusion application merges infrared and visible light frames using wavelet transforms and splits the algorithm between an ARM processor and the reconfigurable logic automatically balancing performance to obtain the most energy efficient point. The talk included a lively discussion and the idea to host a future Marionet workshop on these topics in 2018 at Bristol.
Year(s) Of Engagement Activity	2018
URL	https://www.hipeac.net/events/activities/7543/marionet/#fndtn-program


Description	Keynote in the Edge Computing Symposium PARCO conference Bologna Sept 2017
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The keynote was given on the topic of "FPGAs for high-productivity and low-power edge computing" with results extracted from the grant activities. The keynote was followed by an informal discussion of the importance, potential gains and shortcomings of the approach. A number of new contacts were established that are currently engaged with the PI on the topic.
Year(s) Of Engagement Activity	2017
URL	http://www.hpc.cineca.it/content/parco-2017-mini-symposium-edge-computing


Description	Keynote talk WRC workshop HiPEAC conference Manchester 2018
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The keynote presented the application of an extended voltage and frequency scaling framework called Elongate to a high-performance and reconfigurable binarized neural network. The neural network is coupled to a multiprocessor system-on-chip that acts as a host controlling the operational point to obtain energy proportionality. Elongate instruments a design netlist by inserting timing detectors to enable the operating margins of a design to be extended reliably. The results show that Elongate can obtain new performance and energy points that are up to 80% better than nominal at the same level of classification accuracy. The results also indicate that the built-in neural network robustness allows operation beyond the first point of error while leaving the classification accuracy largely unaffected. This was followed by participation in a panel, and new contacts were established for future collaboration.
Year(s) Of Engagement Activity	2018
URL	https://web.fe.up.pt/~specs/events/wrc2018/index.php?page=keynotes


Description	a talk at the 1st EMERGING DEEP LEARNING ACCELERATORS (EDLA) workshop
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	An invited talk and paper were given at this workshop, presenting how energy efficiency can be obtained in standard silicon beyond what is nominally possible. A number of contacts were established to push this further into new research proposals.
Year(s) Of Engagement Activity	2019
URL	http://workshops.inf.ed.ac.uk/edla/


Description	workshop involving industry and academia
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Industry/Business
Results and Impact	This workshop brought together industry participants from companies such as Intel, Huawei and start-ups such as Coregraph and UltraSoC, at both the national and international levels. There was also a strong presence from universities and research institutes in the UK. The attendees presented their latest developments, and a discussion took place on the state of the industry and the nation in the area of high performance computing and multiprocessor development. A number of possible common projects and synergies were identified, and this has led, for example, to a recent research proposal submission with Intel in the area of neural networks for machine health monitoring.
Year(s) Of Engagement Activity	2018
URL	https://seis.bristol.ac.uk/~eejlny/nghpc/eehco.htm
