ENergy Efficient Adaptive Computing with multi-grain heterogeneous architectures (ENEAC)

Lead Research Organisation: University of Bristol
Department Name: Electrical and Electronic Engineering

Abstract

Energy efficiency is one of the primary design constraints for modern processing systems. Hardware accelerators are seen as a key technology for delivering high performance within a limited energy budget. In addition, the arrival of computing languages such as OpenCL offers programmers a route to target different types of multi-core accelerators from a single source code. Performance portability is a significant challenge, especially if the accelerators have different microarchitectures, as is the case in CPU-GPU-FPGA systems. This research addresses the energy and performance challenge by investigating how a device formed by processing units of different granularities, ranging from coarse-grain CPU cores of different complexity, through medium-grain general-purpose GPU cores, to fine-grain FPGA logic cells, can be dynamically programmed. The challenge is to program all these resources with a single programming model and to create a run-time system that automatically maps the software to the best execution resource from the energy and performance points of view. The results of this research are expected to deliver new fundamental insights into the question: how can future computers obtain orders of magnitude higher performance within limited energy budgets?

Planned Impact

The potential beneficiaries of this research are the electronics and semiconductor companies involved in the creation of the multi-core processing platforms that will be at the center of future super-computing devices. A good indication of the challenges that this industry faces over the next 20 years is available in the International Technology Roadmap for Semiconductors (ITRS). This report identified energy as a fundamental challenge that future integrated circuits will face.
The objective set by the ITRS in terms of energy requirements is to maintain static and dynamic power at current or decreasing levels despite the exponential growth in logic complexity and throughput. Heterogeneous processing enables a better match between processing requirements and hardware, but understanding the optimal match and making it transparent to the programmer is a significant research challenge. The proposed technology is highly relevant to servers dealing with offline analytics, web applications and large-data-set applications in areas including meteorology, seismic processing, genomics and complex physics simulations. These applications offer a high level of parallelism, so that multiple cores can work on the problem at the same time, but the power requirements of these large collections of server processors become very high. It is highly likely that the optimal core type depends on the required performance level or energy budget.
To make sure that the knowledge and know-how permeates adequately into the industry environment the following actions will be taken:
1. The hardware demonstrators will be presented in a series of visits to the industrial collaborators Altera and ARM and to other companies working in the area of interest. These visits will be organized at key stages of the project to ensure that adequate feedback is received. Short secondments of academic staff to industry will be arranged together with the visits. Additional contacts will be made with companies developing server systems around ARM technology to demonstrate the potential energy improvements of the approach. The technology could also extend the design flows of companies targeting high-performance computing with accelerators, such as IBM and Microsoft.
2. In the HIPEAC/EACO workshops we intend to organise a series of seminars/demonstrations to bring the project in contact with industry and academia. We believe that an interactive demonstration around a drug-discovery or astronomy algorithm should be able to attract plenty of general interest.
3. The traditional avenues of journal and conference publications will also be fully utilized. We have identified conferences such as AHS (Adaptive Hardware Systems), FPL (Field Programmable Logic) and DAC (Design Automation Conference) as high-quality venues suitable for this work. Journals such as IEEE TVLSI, IEEE Computers and IET CDT will be targeted.
The expected impacts of this research can be summarised as:
1. To understand how available heterogeneous devices with architecturally different processing resources can be programmed efficiently using a single parallel language.
2. To deliver one order of magnitude better energy/performance operation exploiting adaptation of the algorithm to these different computing resources at run-time.
3. To show how this approach can be successfully applied to high-performance and embedded computing systems used in industrial applications.

Publications

Galindo Sanchez F (2017) Energy proportional streaming spiking neural network in a reconfigurable system in Microprocessors and Microsystems

Hosseinabady M (2018) Dynamic Energy Management of FPGA Accelerators in Embedded Systems in ACM Transactions on Embedded Computing Systems

Jose Nunez-Yanez (2018) Simultaneous Multiprocessing on a FPGA+CPU Heterogeneous System-On-Chip in Parallel Computing is Everywhere

Mohammad Hosseinabady (2018) Pipelined Streaming Computation of Histogram in FPGA OpenCL in Parallel Computing is Everywhere

Nunez-Yanez J (2018) Simultaneous multiprocessing in a software-defined heterogeneous FPGA in The Journal of Supercomputing

Nunez-Yanez J (2016) Applied Reconfigurable Computing

 
Description When a heterogeneous chip has computing resources of different capabilities and speeds, there are cases in which using all these units in parallel is more efficient than using only the most powerful unit and putting the others to sleep. This type of heterogeneous parallel computing is viable if an energy-aware scheduler can adjust the amount of work given to each unit according to its capacity. Discovering this capacity at run-time is not trivial, since it depends on the type of work that is required.
Exploitation Route The technology could be deployed in a large scale data center or high performance computing system to accelerate algorithms used in digital economy areas.
Sectors Aerospace, Defence and Marine, Digital/Communication/Information Technologies (including Software), Electronics, Environment

URL http://www.oerc.ox.ac.uk/Jose-Nunez-Yanez-Many-core
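The work split described in the key finding above can be illustrated with a minimal sketch. It assumes that the throughput of each unit has already been measured at run-time on a small probe of the workload; the function name `proportional_split`, the unit names and the throughput figures are hypothetical and are not taken from the project's scheduler.

```python
def proportional_split(total_items, throughputs):
    """Split total_items across units in proportion to measured throughput.

    throughputs maps a unit name to items per second, as measured on a
    short probe run (hypothetical values, for illustration only).
    """
    total_rate = sum(throughputs.values())
    if total_rate <= 0:
        raise ValueError("at least one unit must have positive throughput")

    # Integer share for each unit, rounded down.
    shares = {
        name: int(total_items * rate / total_rate)
        for name, rate in throughputs.items()
    }

    # Items lost to rounding go to the fastest unit.
    leftover = total_items - sum(shares.values())
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += leftover
    return shares
```

For example, `proportional_split(1000, {"cpu": 100.0, "gpu": 300.0, "fpga": 600.0})` assigns 100, 300 and 600 items respectively, so all units would finish at roughly the same time. This sketch balances execution time only; an energy-aware scheduler would also have to weigh the power drawn by each unit.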
 
Description The results have been made available to FPGA manufacturer and project partner Xilinx, which is currently evaluating how the proposed technologies could be used within its design flows. This was discussed during meetings that took place at HiPEAC 2018 (Manchester). The results were also made available to Intel and Huawei during the 2019 workshop at Bristol and presented at HiPEAC (2019). Huawei is evaluating how this technology could be combined with its own multiprocessor ARM server.
First Year Of Impact 2018
Sector Education, Electronics
Impact Types Economic

 
Description HARP collaboration with Intel and University of Paderborn 
Organisation Intel Corporation
Country United States 
Sector Private 
PI Contribution Exploration of how the energy-aware scheduling algorithms that are part of ENEAC can be used on a Xeon+FPGA device developed by Intel. This is part of the Hardware Accelerator Research Program (https://software.intel.com/en-us/hardware-accelerator-research-program)
Collaborator Contribution Intel has donated access and support for their HARP systems.
Impact Initial results were presented at the ARM Research Summit in 2017. A journal paper, "Hardware Accelerator Research Program", was submitted in 2018 and is still under review.
Start Year 2017
 
Title Energy aware scheduling in heterogeneous processors 
Description General Purpose Heterogeneous Computing Units. Heterogeneous chips that combine CPUs and FPGAs can distribute processing so that the algorithm tasks are mapped onto the most suitable processing element. New software-defined high-level design environments for these chips use general-purpose languages such as C++ and OpenCL for hardware and interface generation without the need for register transfer language expertise. These advances in hardware compilers have resulted in significant increases in FPGA design productivity. In this research, we investigate how to enhance an existing software-defined framework to reduce overheads and enable the utilization of all the available CPU cores in parallel with the FPGA hardware accelerators. Instead of selecting the best processing element for a task and simply offloading onto it, we introduce two schedulers, Dynamic and LogFit, which distribute the tasks among all the resources in an optimal manner. A new interrupt-based platform is created that removes spin-locks and allows the processing cores to sleep when not performing useful work. For a compute-intensive application, we obtained up to 45.56% more throughput and 17.89% less energy consumption when all devices of a Zynq-7000 SoC collaborate in the computation, compared against FPGA-only execution.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2019 
Impact A new collaborative research proposal has been established with staff members in the Department of Electronic Engineering at Imperial College, and a new collaboration on energy efficiency has been established with Universidad Rio Grande do Norte, Brazil.
URL https://github.com/eejlny/gphcu/tree/master
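The idea of sharing work among all devices, rather than offloading to a single one, can be sketched as a generic self-scheduling loop. This is an illustrative assumption, not the project's Dynamic or LogFit scheduler: Python threads stand in for the CPU cores and the FPGA accelerator, and `run_dynamic`, the device names and the chunk sizes are hypothetical. Each thread claims the next unprocessed chunk under a lock, so faster devices naturally end up processing more of the iteration space.

```python
import threading


def run_dynamic(n_items, chunk_sizes, process):
    """Distribute the range [0, n_items) among devices by self-scheduling.

    chunk_sizes maps a (hypothetical) device name to the number of items
    it claims at a time; process(name, begin, end) does the actual work.
    Returns the number of items each device ended up processing.
    """
    lock = threading.Lock()
    next_index = 0
    done = {name: 0 for name in chunk_sizes}

    def worker(name, chunk):
        nonlocal next_index
        while True:
            # Claim the next chunk atomically; exit when nothing is left.
            with lock:
                begin = next_index
                if begin >= n_items:
                    return
                end = min(begin + chunk, n_items)
                next_index = end
            process(name, begin, end)
            done[name] += end - begin  # each thread updates only its own key

    threads = [
        threading.Thread(target=worker, args=(name, chunk))
        for name, chunk in chunk_sizes.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

A call such as `run_dynamic(1000, {"cpu0": 16, "cpu1": 16, "fpga": 128}, process)` covers every item exactly once, whichever device happens to claim each chunk. The larger chunk for the accelerator is an assumption that reflects the usual need to amortise offload overhead.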
 
Description Evaluation of Heterogeneous execution on an HPC-oriented CPU-FPGA System-on-Chip. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Current HPC systems require highly specialized hardware to continue improving performance because the gains offered by technology scaling have reduced significantly. One promising approach to specialization is the integration of CPUs and FPGAs in the same socket, so that programmers can write highly optimized kernels that deliver excellent performance. However, following this approach requires overcoming several obstacles. First, programming FPGAs with hardware description languages is very challenging and error prone. Second, maximizing the utilization of all the computing power of the CPU and FPGA devices requires high-level programming frameworks that help with the burden of scheduling the work and managing the data. This work presents an analysis of a C++ template-based framework that enables programmers to run OpenCL code on any heterogeneous platform, including Intel HARP, with ease. First, it goes over the hardware platform and the interconnection network between the devices, since performance largely depends on them. Second, the talk comments on how High Level Synthesis (HLS) tools can be the substrate for the heterogeneous framework and briefly reviews the usual preference of FPGAs for very deep pipelines with single-task kernels over the single-instruction multiple-thread model of GPUs.
Year(s) Of Engagement Activity 2019
URL https://sites.google.com/view/hlpgpu2019/home
 
Description Heterogeneous FPGA+GPU Embedded Systems: Challenges and Opportunities 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The edge computing paradigm has emerged to handle cloud computing issues such as scalability, security and low response time. This new computing trend relies heavily on ubiquitous embedded systems on the edge. Performance and energy consumption are two main factors that should be considered during the design of such systems. Focusing on performance and energy consumption, this paper studies the opportunities and challenges that a heterogeneous embedded system consisting of embedded FPGAs and GPUs (as accelerators) can provide for applications. We study three challenges throughout the paper: design, modeling and scheduling. We also propose three techniques to cope with these three challenges. Applying the proposed techniques to three applications, namely image histogram, dense matrix-vector multiplication and sparse matrix-vector multiplication, shows 1.79x and 2.29x improvements in performance and energy consumption, respectively, when both FPGA and GPU execute the corresponding application in parallel.
Year(s) Of Engagement Activity 2019
URL https://eehpcwg.llnl.gov/events.html#jan-22
 
Description Invited talk Marionet workshop HiPEAC Manchester 2018 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The talk on "High productivity and low power parallel computing with logic, voltage and frequency scaling in reconfigurable devices" presented the application of voltage, frequency and logic scaling to reconfigurable devices to obtain energy proportional computing. Two case studies, consisting of convolutional neural networks and video fusion, were created to demonstrate the gains and feasibility of adjusting the operating point beyond nominal and near the operational limits. The fusion application merges infrared and visible light frames using wavelet transforms and splits the algorithm between an ARM processor and the reconfigurable logic, automatically balancing performance to obtain the most energy-efficient point. The talk included a lively discussion and raised the idea of hosting a future Marionet workshop on these topics in 2018 at Bristol.
Year(s) Of Engagement Activity 2018
URL https://www.hipeac.net/events/activities/7543/marionet/#fndtn-program
 
Description Keynote in the Edge Computing Symposium PARCO conference Bologna Sept 2017 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The keynote was given on the topic of "FPGAs for high-productivity and low-power edge computing", with results extracted from the grant activities. The keynote was followed by an informal discussion of the importance, potential gains and shortcomings of the approach. A number of new contacts were established that are currently engaged with the PI on the topic.
Year(s) Of Engagement Activity 2017
URL http://www.hpc.cineca.it/content/parco-2017-mini-symposium-edge-computing
 
Description Keynote talk WRC workshop HiPEAC conference Manchester 2018 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The keynote presented the application of an extended voltage and frequency scaling framework called Elongate to a high-performance, reconfigurable binarized neural network. The neural network is coupled to a multiprocessor system-on-chip that acts as a host, controlling the operating point to obtain energy proportionality. Elongate instruments a design netlist by inserting timing detectors so that the operating margins of a design can be extended reliably. The results show that Elongate can obtain new performance and energy points that are up to 80% better than nominal at the same level of classification accuracy. The results also indicate that the built-in robustness of the neural network allows operation beyond the first point of error while leaving the classification accuracy largely unaffected. The keynote was followed by participation in a panel, and new contacts were established for future collaboration.
Year(s) Of Engagement Activity 2018
URL https://web.fe.up.pt/~specs/events/wrc2018/index.php?page=keynotes
 
Description A talk at the 1st Emerging Deep Learning Accelerators (EDLA) workshop
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact An invited talk and paper were given at this workshop, presenting how energy efficiency can be obtained in standard silicon beyond what is nominally possible. A number of contacts were established to push this further into new research proposals.
Year(s) Of Engagement Activity 2019
URL http://workshops.inf.ed.ac.uk/edla/
 
Description workshop involving industry and academia 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact This workshop brought together industry participants from companies such as Intel and Huawei and from start-ups such as Coregraph and UltraSoC, at both the national and international levels. There was also a strong presence from universities and research institutes in the UK. The attendees presented their latest developments, and a discussion took place on the state of the industry and the nation in the area of high-performance computing and multiprocessor development. A number of possible common projects and synergies were identified; this has led, for example, to a recent research proposal submission with Intel in the area of neural networks for machine health monitoring.
Year(s) Of Engagement Activity 2018
URL https://seis.bristol.ac.uk/~eejlny/nghpc/eehco.htm