DOME: Delaying and Overcoming Microprocessor Errors

Lead Research Organisation: University of Cambridge
Department Name: Computer Laboratory

Abstract

Modern-day computer systems have benefited from being designed and manufactured using an ever-increasing budget of transistors with very reliable integrated circuits. However, this "free lunch" is now over, and the forgotten nightmares faced by computer pioneers are coming back to haunt us. Not so long ago, unreliable valves were the basic building blocks for computers, and research focussed on how to compute successfully despite this underlying weakness (e.g. von Neumann, 1956, "Probabilistic logics and the synthesis of reliable organisms from unreliable components").

State-of-the-art integrated circuit technologies have now reached the range of 40-22 nanometers, posing significant reliability challenges. Hard or permanent errors can manifest themselves at any point during a processor's lifetime. During manufacturing, errors can render a proportion of a chip incapable of computing, thus decreasing yield and profit.
As we move towards smaller and smaller components, transistors take less and less time to wear out, becoming more prone to failure in the field. Traditional reliability solutions apply high-cost redundancy to the hardware structures within the processor, providing backup spares for when errors occur. On the application side, solutions also rely on redundancy, running multiple copies of each piece of software.

A common criticism of current reliability solutions is that they do not consider how the software and hardware can be co-designed synergistically to tackle this challenge. Redesigning and reimplementing general purpose software applications will incur an unaffordable price tag. Our hypothesis is that virtualization technologies (a layer that transparently hides the underlying platform from the application software) have an important role to play. In particular, managed runtime environments (MREs) have become pervasive for high-productivity software developers and represent a promising vehicle for providing reliability mechanisms. Within these systems, applications can be monitored and morphed without user intervention.

There are two complementary strands to our proposed research, focused around a co-designed MRE and multicore computer architecture. Firstly, we will consider wearout mitigation schemes to slow processor ageing and lengthen a chip's lifetime before a hard fault occurs. Secondly, given that an error will occur at some point during a system's life, we will develop error-tolerance approaches that maintain execution on faulty hardware.

If successful, we believe this project will be seen as a significant milestone in the development of wearout-conscious and error-tolerant multicore architectures over the next decade. This research programme will advance our understanding of the field, tackling the UK Microelectronics Grand Challenge of Moore for Less that has been signposted by EPSRC. It is also important to highlight that this proposal tackles a key aspect of the new EPSRC ICT capability priority on "Many-core architectures and concurrency in distributed and embedded systems".

Publications

Kanev S (2013) Measuring Code Optimization Impact on Voltage Noise in Workshop on Silicon Errors in Logic - System Effects (SELSE)

Mitropoulou K (2016) Lynx

Valero A (2016) Enhancing the L1 Data Cache Design to Mitigate HCI in IEEE Computer Architecture Letters

Valero A (2017) On Microarchitectural Mechanisms for Cache Wearout Reduction in IEEE Transactions on Very Large Scale Integration (VLSI) Systems

 
Description We have three main findings. First, many permanent errors within processors can be overcome through the addition of a small logic unit capable of re-executing instructions. Second, applications cause different amounts of transistor ageing depending on the operations they perform. Third, errors in a large core can be both detected and corrected using an array of small, power-efficient cores that run in parallel.
Exploitation Route Our work can be used by industry to develop schemes that combat processor ageing and overcome permanent processor faults.
Sectors Digital/Communication/Information Technologies (including Software), Electronics

URL https://www.cl.cam.ac.uk/~tmj32/
 
Description We have held discussions with Arm about deploying this technology in their R-class processors. These are for real-time systems and require strong reliability guarantees. In the meantime, we have further developed some of the techniques from this work, which strengthens the case for adopting this research.
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software), Electronics
Impact Types Economic

 
Title Research data supporting "High Performance Fault Tolerance Through Predictive Instruction Re-Execution" 
Description Source code for simulator modules to implement schemes in the paper. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Description HiPEAC 
Organisation European Commission
Department Seventh Framework Programme (FP7)
Country European Union (EU) 
Sector Public 
PI Contribution Attending meetings to disseminate results and interact with other researchers in the same area.
Collaborator Contribution A four-month visit by a postdoctoral researcher from another member of the network.
Impact The network focuses on High-Performance and Embedded Architectures and Compilers.
Start Year 2011
 
Description NESUS 
Organisation Network for Sustainable Ultrascale Computing (NESUS)
Country Global 
Sector Academic/University 
PI Contribution Visit to initial kick-off meeting to discuss our work.
Collaborator Contribution The whole consortium aims to meet the challenges of sustainable ultrascale computing.
Impact The collaboration crosses disciplines within Computer Science. These are programming languages, compilers, runtimes, computer architecture and networks.
Start Year 2014
 
Title The Lynx Queue 
Description Lynx is a very fast single-producer, single-consumer software queue. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact We have used this queue to develop faster soft-error detection techniques. It has been downloaded 21 times by others. 
URL http://www.cl.cam.ac.uk/~tmj32/data/
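
Illustrative sketch: this record does not describe how Lynx is built, so the code below is not the Lynx implementation. It is a minimal, generic single-producer, single-consumer bounded ring buffer, included only to show what this class of queue looks like. All names (`SpscRingQueue`, `try_push`, `try_pop`, `run_demo`) are hypothetical. The sketch assumes CPython, whose global interpreter lock keeps the index updates ordered; a native implementation would need atomic operations with explicit memory ordering.

```python
import threading
import time


class SpscRingQueue:
    """Bounded FIFO for exactly one producer thread and one consumer thread."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # One slot stays unused so that head == tail always means "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_push(self, item):
        """Producer side: store item, or return False if the queue is full."""
        next_tail = (self._tail + 1) % len(self._slots)
        if next_tail == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = next_tail  # publish only after the slot is filled
        return True

    def try_pop(self):
        """Consumer side: return (True, item), or (False, None) if empty."""
        if self._head == self._tail:
            return False, None
        item = self._slots[self._head]
        self._slots[self._head] = None  # drop the reference held by the slot
        self._head = (self._head + 1) % len(self._slots)
        return True, item


def run_demo(count=1000, capacity=8):
    """Send count integers from a producer thread to a consumer thread."""
    queue = SpscRingQueue(capacity)
    received = []

    def producer():
        for value in range(count):
            while not queue.try_push(value):
                time.sleep(0)  # queue full: yield to the consumer

    def consumer():
        while len(received) < count:
            ok, value = queue.try_pop()
            if ok:
                received.append(value)
            else:
                time.sleep(0)  # queue empty: yield to the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received
```

Because each index is written by only one thread, no lock is needed: the producer owns the tail and the consumer owns the head. Calling `run_demo()` returns the integers in the order they were pushed.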