Continuous on-line adaptation in many-core systems: From graceful degradation to graceful amelioration

Lead Research Organisation: University of York
Department Name: Electronics

Abstract

Until recently, the ever-increasing demand for computing power has been met on the one hand by increasing the operating frequency of processors and on the other by designing more and more complex processors capable of executing more than one instruction at the same time. However, both these approaches seem to be reaching (or possibly have already reached) their practical limits, mainly due to issues related to design complexity and cost-effectiveness.
The current trend in computer design seems to favour a shift to systems where computational power is achieved not by a single very fast and very complex processor, but through the parallel operation of several on-chip processors, each executing a single thread. This kind of approach is implemented commercially today through multi-core processors and in research through the Network On Chip (NoC) or the Chip Multi-Processors (CMP) paradigms. The natural evolution of these approaches sees the number of cores increasing constantly and it is generally accepted that the next few decades will witness the introduction of many-core systems, that is, systems that integrate hundreds or thousands of cores.
This shift introduces problems common to all massively parallel systems, ranging from the design of applications that can exploit large numbers of processors to technological challenges related to the implementation of such cores in silicon substrates that are increasingly error-prone, due to their size and to the increasing sensitivity to faults of next-generation technologies, and to the dissipation of heat generated by the computational activity in the cores. Current architectures are not suitable for this kind of system and there is a strong need to devise novel mechanisms and technologies that will allow the development of many-core systems and eventually their commercialization as consumer products.
Imagine then a many-core system with thousands or millions of processors that gets better and better with time at executing an application, "gracefully" providing optimal power usage while maximizing performance levels and tolerating component failures. The proposed project aims at investigating how such mechanisms can represent crucial enabling technologies for many-core systems.
Specifically, this project focuses on how to overcome three critical issues related to the implementation of many-core systems: reliability, energy efficiency, and on-line optimisation. The need for reliability is an accepted challenge for many-core systems, considering the large number of components and the increasing likelihood of faults of next-generation technologies, as is the requirement to reduce the heat dissipation related to energy consumption. On the other hand, on-line optimisation, that is, the ability of the system to improve over time without the need for external intervention (including becoming better at reliability and energy efficiency), is a mechanism that could be vital to enable the implementation of these properties in systems that cannot be managed centrally due to the vast number of cores involved.
The proposed approach is centred around two basic processes. Graceful degradation implies that the system will be able to cope with faults (permanent or temporary) or potentially damaging power consumption peaks by lowering its performance. Graceful amelioration implies that the system will constantly seek alternative, better ways to execute an application.
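The two processes can be illustrated with a minimal sketch. It is not the project's implementation: the class `ManyCoreSystem`, its methods `degrade` and `ameliorate`, and the cost model (the load of the busiest core) are all hypothetical names and assumptions chosen for illustration. The sketch only models core faults, not power consumption peaks.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ManyCoreSystem:
    """Toy model (hypothetical): tasks are mapped to cores and the cost of a
    mapping is the total load on the busiest core (lower is better)."""
    n_cores: int
    task_loads: list[float]
    mapping: list[int] = field(default_factory=list)  # mapping[t] = core of task t
    failed: set[int] = field(default_factory=set)

    def __post_init__(self):
        if not self.mapping:
            # Assumed starting point: round-robin placement of tasks on cores.
            self.mapping = [t % self.n_cores for t in range(len(self.task_loads))]

    def healthy_cores(self):
        return [c for c in range(self.n_cores) if c not in self.failed]

    def cost(self, mapping=None):
        mapping = self.mapping if mapping is None else mapping
        loads = [0.0] * self.n_cores
        for task, core in enumerate(mapping):
            loads[core] += self.task_loads[task]
        return max(loads)

    def degrade(self, core):
        """Graceful degradation: after a fault on `core`, keep running by moving
        its tasks to healthy cores, accepting a possibly higher cost."""
        self.failed.add(core)
        healthy = self.healthy_cores()
        if not healthy:
            raise RuntimeError("no healthy cores left")
        for task, assigned in enumerate(self.mapping):
            if assigned == core:
                self.mapping[task] = healthy[task % len(healthy)]

    def ameliorate(self, rng):
        """Graceful amelioration: try moving one random task to a random healthy
        core and keep the change only if it lowers the cost."""
        candidate = list(self.mapping)
        candidate[rng.randrange(len(candidate))] = rng.choice(self.healthy_cores())
        if self.cost(candidate) < self.cost():
            self.mapping = candidate
            return True
        return False


# Usage: a fault forces degradation, then repeated local search improves the mapping.
demo = ManyCoreSystem(n_cores=4, task_loads=[3.0, 1.0, 2.0, 2.0, 1.0, 3.0])
demo.degrade(0)
demo_rng = random.Random(0)
for _ in range(200):
    demo.ameliorate(demo_rng)
print("cost after amelioration:", demo.cost())
```

The design choice worth noting is that both steps act locally, on one core or one task at a time, which matches the abstract's point that such systems cannot be managed centrally; a real system would use hardware monitors and distributed decisions rather than a single Python object.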

Planned Impact

The principal intended beneficiaries of the research carried out in the project are industries in the domain of processor and embedded system design and reconfigurable logic.
To achieve this impact, we have established contacts with some potential industrial collaborators, including Intel, Xilinx, and STMicroelectronics. The established long-term relationships between the partners and ARM (for example, Southampton has particularly strong links with ARM Ltd. through the ARM-ECS Research Centre, co-directed by Al-Hashimi), as well as the focus on ARM technology in the project platforms, will also ensure their involvement in the project from the start.
Outside the scope of our direct industrial collaborators, we will continually survey the system design industry ourselves in order to identify hotspots and niches where the hardware, tools and services developed in this project could contribute improvements, and will invite the companies concerned to become involved in the project.
This approach, which prioritizes companies active in the UK, will allow these to exploit the project results to position themselves competitively in a market that is likely to prove highly relevant in the next decades, with a beneficial impact on the UK economy.
The timeliness of the project is crucial in this context: the path to commercialisation for many-core systems is currently at a stage where the development of fundamental mechanisms is highly relevant. The choice of implementing our approach on two platforms at different levels of development reflects the aim to maximise impact: the SpiNNaker implementation will illustrate the effectiveness of the mechanisms on existing technology (ARM cores), showing potential benefits within a shorter timescale, whereas the custom board implementation will investigate more novel architectures to illustrate a longer-term path to the design of many-core systems.
The skills developed by the RAs and academics involved in the project will also represent a strong knowledge base in the domain that will be transmitted to future generations of engineers through teaching at PhD, taught postgraduate and undergraduate level.

To reach a more general public, we will create a website that will promote and explain the project. Material from project-related events, such as meetings and workshops, as well as project documents, will be made available for download. In addition, we will provide media-related material, such as diagrams, abstracts and animations, which aim to explain to the public the challenges of current electronic design as well as our approaches to overcome them.
In order to reach a broad audience, we will use existing platforms and events to promote our research, such as national science and engineering events (e.g. House of Lords annual event, National Science week), open science events and science at schools.
 
Description The objective of the project was to develop hardware-based monitors and actuators for run-time management of many-core systems. These were to be tested on multiple platforms and ultimately implemented on a custom hardware system (the Graceful platform) specifically designed to prototype these kinds of mechanisms and algorithms.
All of the individual components of the proposed systems were developed (as summarised in the publications associated with the grant): Manchester implemented runtime management mechanisms on the SpiNNaker platform; Southampton, also through a collaboration with the PRiME project, focused on approaches for run-time power management of multi-core and multi-processor systems, with fine-grained adaptation of power consumption and budgeting of energy resources; York developed the custom hardware platforms and mechanisms for fine- and coarse-grain runtime hardware optimisations, together with software tools and APIs to allow prototyping.
Where the project did not meet the original plan (see section below) is in the integration of these results into a single demonstrator merging all the mechanisms and approaches into one system. For various reasons, this objective was not achieved within the time-frame of the project. However, research towards the planned objectives is continuing beyond the end of the project.
Exploitation Route The hardware platform developed in the project is being used, and will continue to be used, for additional research. Three PhD students (one finished, two ongoing) are exploiting the platform for their thesis research, with two additional candidates being evaluated. A collaborative research project (outlined in the appropriate section of the report) has started with the Federal University of Sao Carlos in Brazil, in the domain of image processing through mathematical morphology operators. Additional funding proposals (all in collaboration with other UK universities and industry) have been submitted to exploit the platform for impact applications (e.g. healthcare) and further research in the domain of many-core systems.
Sectors Aerospace, Defence and Marine, Digital/Communication/Information Technologies (including Software), Electronics, Healthcare, Manufacturing, including Industrial Biotechnology

 
Description Brazil 
Organisation Federal University of Sao Carlos
Country Brazil 
Sector Academic/University 
PI Contribution We are providing access to the GRACEFUL hardware platform and software tools, as well as enabling knowledge transfer related to the project outcomes.
Collaborator Contribution The partners are providing contributions in the domain of application development for research on many-core run-time management, specifically in the area of mathematical morphology for image processing. Funding for a one-year (2018-19) visit to the University of York by Prof. Emerson Carlos Pedrino has been secured through a FAPESP-funded proposal.
Impact Outputs will consist of publications and demonstrators. A demonstrator of a mathematical morphology-based image filtering application running on the GRACEFUL platform with run-time management is planned for Summer 2019. The collaborative research is coming to fruition right now with a series of journal articles (the first to be submitted at the end of March 2019).
Start Year 2018
 
Description NESUS 
Organisation Network for Sustainable Ultrascale Computing (NESUS)
Country Global 
Sector Public 
PI Contribution Dr. Tempesti acted as substitute MC for the UK and workgroup co-leader for the NESUS COST action from 2014 to 2017.
Collaborator Contribution The invitation to participate in the COST action originated from contacts related to the steering of the GRACEFUL project and the activities of the action informed the development of the research in the project.
Impact A journal paper was published jointly by the members of the fault-tolerance workgroup (WG3) of the COST action.
Start Year 2014
 
Title GRACEFUL many-core platform 
Description The core of the project was the development of a hardware platform for the prototyping of run-time control of many-core systems. The platform has been built and is available for researchers interested in carrying out further research. It consists of a cabinet-size system comprising 64 full-custom PCBs implementing a fully reconfigurable many-core system. 
Type Of Technology Physical Model/Kit 
Year Produced 2018 
Impact The system has been proposed as the experimentation platform for four research funding proposals in application areas ranging from healthcare to medieval manuscript transcription (all collaborations between multiple institutions). It has also given rise to a scientific collaboration with the Federal University of Sao Carlos, Brazil (see appropriate section). The impact on the many-core research community is therefore ongoing. The impact on wider areas, including cultural and societal impacts, will occur through the application domains of further research. 
 
Title PRiME framework 
Description Research done in the GRACEFUL project (particularly in Southampton) has contributed to the development of the PRiME Framework (released on GitHub). The PRiME Framework is a collection of software modules that support a novel API, specifically designed to enable the agnostic runtime management of software applications and hardware platforms. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Both Graceful and PRiME ended in 2018 and resources have been made available. We expect impacts to occur at a later stage, as the technologies spread to potential users. 
URL https://github.com/PRiME-project/PRiME-Framework
 
Title XL-STaGe 
Description XL-STaGe is a tool designed to automatically create large task graphs for many-core systems, with a back-end that can generate code specific to one architecture. The full/final software has not yet been released in the public domain as a significant upgrade is under way. The details of the base system were published in 2016 and are available on request. 
Type Of Technology Software 
Year Produced 2016 
Impact We expect that the final version will have an impact on researchers interested in modelling large-scale many-core systems, but this will occur when the final version is released (expected end of 2019). 
 
Description GRACEFUL workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The Adaptive Many-Core Architectures and Systems workshop was held at the University of York on 13-15 June 2018. The workshop aimed to highlight and discuss emerging trends and future directions in the field of many-core system design (and beyond), and featured invited position papers from world-leading researchers and industrialists across the field.
The workshop featured several keynote speakers from industry and academia and was aimed primarily at postgraduate students across the UK involved in research in many-core systems. In the workshop, the outcomes of the GRACEFUL research project (including a partial demonstrator) were put on display and discussed. Some of the contacts made at the workshop are ongoing and will prove valuable for future collaborations.
Year(s) Of Engagement Activity 2018
URL http://york.ac.uk/manycoreworkshop/