Continuous on-line adaptation in many-core systems: From graceful degradation to graceful amelioration

Lead Research Organisation: University of York
Department Name: Electronics

Abstract

Until recently, the ever-increasing demand for computing power has been met on one hand by increasing the operating frequency of processors and on the other by designing more and more complex processors capable of executing more than one instruction at the same time. However, both these approaches seem to be reaching (or possibly have already reached) their practical limits, mainly due to issues related to design complexity and cost-effectiveness.
The current trend in computer design seems to favour a shift to systems where computational power is achieved not by a single very fast and very complex processor, but through the parallel operation of several on-chip processors, each executing a single thread. This kind of approach is implemented commercially today through multi-core processors and in research through the Network-on-Chip (NoC) or Chip Multi-Processor (CMP) paradigms. The natural evolution of these approaches sees the number of cores increasing constantly, and it is generally accepted that the next few decades will witness the introduction of many-core systems, that is, systems that integrate hundreds or thousands of cores.
This shift introduces problems common to all massively parallel systems. These range from the design of applications that can exploit large numbers of processors, to the technological challenge of implementing such cores on silicon substrates that are increasingly error-prone (owing to their size and to the growing sensitivity to faults of next-generation technologies), to the dissipation of heat generated by the computational activity in the cores. Current architectures are not suitable for this kind of system, and there is a strong need to devise novel mechanisms and technologies that will allow the development of many-core systems and, eventually, their commercialisation as consumer products.
Imagine then a many-core system with thousands or millions of processors that gets better and better with time at executing an application, "gracefully" providing optimal power usage while maximizing performance levels and tolerating component failures. The proposed project aims to investigate how the mechanisms that enable such behaviour can represent crucial enabling technologies for many-core systems.
Specifically, this project focuses on how to overcome three critical issues related to the implementation of many-core systems: reliability, energy efficiency, and on-line optimisation. The need for reliability is an accepted challenge for many-core systems, considering the large number of components and the increasing likelihood of faults in next-generation technologies, as is the requirement to reduce the heat dissipation related to energy consumption. On the other hand, on-line optimisation, that is, the ability of the system to improve over time without the need for external intervention (including becoming better at reliability and energy efficiency), is a mechanism that could be vital to enable the implementation of these properties in systems that cannot be managed centrally due to the vast number of cores involved.
The proposed approach is centred around two basic processes. Graceful degradation implies that the system will be able to cope with faults (permanent or temporary) or potentially damaging power consumption peaks by lowering its performance. Graceful amelioration implies that the system will constantly seek alternative, better ways to execute an application.

Planned Impact

The principal intended beneficiaries of the research carried out in the project are industries in the domain of processor and embedded system design and reconfigurable logic.
To achieve this impact, we have established contacts with some potential industrial collaborators, including Intel, Xilinx, and STMicroelectronics. The established long-term relationships between the partners and ARM (for example, Southampton has particularly strong links with ARM Ltd. through the ARM-ECS Research Centre, co-directed by Al-Hashimi), as well as the focus on ARM technology in the project platforms, will also ensure their involvement in the project from the start.
Beyond our direct industrial collaborators, we will continually survey the system design industry ourselves in order to identify hotspots and niches where the hardware, tools and services developed in this project could contribute improvements, and we will invite the relevant companies to become involved in the project.
This approach, which prioritizes companies active in the UK, will allow them to exploit the project results to position themselves competitively in a market that is likely to prove highly relevant in the next decades, with a beneficial impact on the UK economy.
The timeliness of the project is crucial in this context: the path to commercialisation for many-core systems is currently at a stage where the development of fundamental mechanisms is highly relevant. The choice of implementing our approach on two platforms at different levels of development reflects the aim to maximise impact: the SpiNNaker implementation will illustrate the effectiveness of the mechanisms on existing technology (ARM cores), showing potential benefits within a shorter timescale, whereas the custom board implementation will investigate more novel architectures to illustrate a longer-term path to the design of many-core systems.
The skills developed by the RAs and academics involved in the project will also represent a strong knowledge base in the domain that will be transmitted to future generations of engineers through teaching at PhD, taught postgraduate and undergraduate levels.

To reach a more general public, we will create a website that will promote and explain the project. Material from project-related events, such as meetings and workshops, as well as project documents, will be made available for download. In addition, we will provide media-related material, such as diagrams, abstracts and animations, which aims to explain to the public the challenges of current electronic design as well as our approaches to overcoming them.
In order to reach a broad audience, we will use existing platforms and events to promote our research, such as national science and engineering events (e.g. House of Lords annual event, National Science Week), open science events and science at schools.