Probabilistic Power-Fault Co-Optimisation (PFCO) for Cryogenic Logic Technologies

Lead Research Organisation: Liverpool John Moores University
Department Name: School of Engineering

Abstract

High-performance computing (HPC) systems and the CMOS chips used in today's large data centres face two major challenges:

(1) Energy: Achieving a low-energy, low-carbon, greener footprint is critical. It has been predicted that the total energy consumed by data centres could increase 15-fold by 2030, accounting for up to 8% of global electricity demand. New technologies that can reduce HPC energy consumption by orders of magnitude are therefore crucial.

(2) Performance: Because conventional scaling has become difficult to sustain, particularly for power reduction, HPC performance improvement has slowed dramatically, from 32-fold per decade in the 1990s to only 1.5-fold per decade, as up to 80% of the transistors on a chip cannot be used simultaneously without overheating.

The power efficiency of logic technologies is limited by VDD, which cannot be scaled much below 0.8 V at room temperature because the large sub-threshold swing (SS) of MOSFETs forces a large Vth to keep off-state leakage acceptable. Liquid nitrogen cooling can drastically improve the performance-to-energy ratio, enabling greener data centres. The much smaller SS at the cryogenic temperature (CT) of 77 K allows Vth and VDD to be scaled to ~0.1 V and ~0.3 V, respectively, delivering a >7.5-fold power reduction. DARPA recently called for ground-breaking innovations in cryogenic logic technologies and set a target of a 25-fold improvement in performance-to-power ratio (PPR).
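The scaling argument above can be checked with a back-of-envelope calculation. This is a sketch under two textbook simplifications that are assumptions here, not the project's models: SS takes its ideal Boltzmann-limited value ln(10)·kT/q (real cryogenic MOSFETs typically saturate above this value), and switching energy scales as C·VDD². The function names are illustrative only.

```python
import math

K_B_OVER_Q = 8.617333262e-5  # Boltzmann constant / elementary charge, in V/K


def ideal_ss(temp_k: float) -> float:
    """Ideal (Boltzmann-limited) sub-threshold swing in V/decade: ln(10)*kT/q."""
    return math.log(10) * K_B_OVER_Q * temp_k


def leakage_decades(vth: float, temp_k: float) -> float:
    """Decades the drain current falls between VGS = Vth and VGS = 0,
    assuming a purely exponential sub-threshold region with ideal SS."""
    return vth / ideal_ss(temp_k)


def dynamic_energy_ratio(vdd_hi: float, vdd_lo: float) -> float:
    """Reduction in C*VDD^2 switching energy when VDD drops from vdd_hi to vdd_lo."""
    return (vdd_hi / vdd_lo) ** 2


ss300 = ideal_ss(300.0)
ss77 = ideal_ss(77.0)
decades_77 = leakage_decades(0.1, 77.0)
print(f"SS(300 K) = {ss300 * 1e3:.1f} mV/dec, SS(77 K) = {ss77 * 1e3:.1f} mV/dec")
print(f"Vth = 0.1 V at 77 K gives {decades_77:.1f} decades of leakage suppression")
print(f"Same suppression at 300 K needs Vth of about {decades_77 * ss300:.2f} V")
print(f"C*VDD^2 energy ratio, 0.8 V -> 0.3 V: {dynamic_energy_ratio(0.8, 0.3):.2f}x")
```

Under these idealisations SS falls from about 59.5 mV/dec at 300 K to about 15.3 mV/dec at 77 K, so a 0.1 V threshold at 77 K suppresses leakage as strongly as a roughly 0.39 V threshold would at room temperature. The C·VDD² term alone gives about a 7.1x energy reduction; the abstract's >7.5x figure is not derived by this sketch.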

However, nanoscale devices operating at a low VDD near Vth are susceptible to logic and timing faults caused by false transistor on/off switching, which can be induced by the charging and discharging of a single defect. The steeper I-V characteristics at CT amplify the risk of false switching. Existing device fault and reliability models cannot be applied at low VDD and CT, where probabilistic processes dominate, because they are based on conventional deterministic methodologies: they set failure criteria from the accumulated or average effect of multiple defects. Such models are overly pessimistic and force designers into worst-case methodologies with unnecessarily large design margins, which hinders achievement of the 25x Cryo-CMOS optimisation targets. A novel probabilistic criterion, the Fault rate Per Million Operations (FPMO), should therefore be used instead to identify the lowest usable VDD at CT.
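To make the FPMO criterion concrete, here is a toy sketch of how it could drive a VDD choice: compute a per-operation fault probability, scale it to faults per million operations, and pick the lowest VDD on a sweep that meets a target. Everything in the fault model is a hypothetical assumption for illustration (a single defect charged with probability `P_OCC`, an exponentially distributed Vth shift with mean `ETA`, a fixed overdrive margin `V_MARGIN`); it is not the project's FPMO model, and only the ~0.1 V Vth comes from the text.

```python
import math
from typing import List, Optional

# All parameters below are hypothetical, chosen only to illustrate the criterion.
VTH = 0.1        # cryogenic threshold voltage (V), the ~0.1 V quoted in the abstract
V_MARGIN = 0.05  # hypothetical minimum gate overdrive for a correct transition (V)
P_OCC = 0.5      # hypothetical probability the single defect is charged during an operation
ETA = 0.01       # hypothetical mean Vth shift caused by the charged defect (V)


def fault_probability(vdd: float) -> float:
    """Per-operation fault probability under the toy single-defect model.

    A fault occurs when the defect is charged AND its Vth shift, drawn from an
    exponential distribution with mean ETA, exceeds the remaining headroom.
    """
    headroom = vdd - VTH - V_MARGIN
    if headroom <= 0:
        return 1.0  # not enough overdrive even without the defect: always a fault
    return P_OCC * math.exp(-headroom / ETA)


def fpmo(vdd: float) -> float:
    """Fault rate per million operations at supply voltage vdd."""
    return 1e6 * fault_probability(vdd)


def lowest_vdd(target_fpmo: float, vdd_grid: List[float]) -> Optional[float]:
    """Smallest VDD on the grid whose FPMO does not exceed the target."""
    for vdd in sorted(vdd_grid):
        if fpmo(vdd) <= target_fpmo:
            return vdd
    return None  # no grid point meets the target


grid = [round(0.15 + 0.01 * i, 2) for i in range(66)]  # 0.15 V to 0.80 V in 10 mV steps
print(lowest_vdd(1.0, grid))
```

With these made-up numbers and a target of one fault per million operations, the sweep settles at 0.29 V; the value is meaningless, but the procedure shows how a probabilistic rate, rather than a worst-case guard band, sets the supply voltage.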

This project aims to break this bottleneck and fill the knowledge gap in cryogenic power-fault co-optimisation (PFCO) at ultra-low VDD. To achieve ultra-low-power, ultra-low-fault cryogenic HPC, it will develop SPICE-level compact models of probabilistic and quantum phenomena in defects and of their correlation with random faults in devices and circuits.
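The charging and discharging of a single defect is commonly described as random telegraph noise: a two-state Markov process whose dwell times in the empty and charged states are exponentially distributed. The sketch below simulates that textbook description; `tau_c`, `tau_e` and the helper names are illustrative assumptions, and nothing here represents the project's compact models or their cryogenic or quantum extensions.

```python
import random
from typing import List, Tuple

Segment = Tuple[float, float, bool]  # (start time, end time, defect charged?)


def rtn_trace(tau_c: float, tau_e: float, t_end: float, seed: int = 0) -> List[Segment]:
    """Simulate one defect as a two-state (empty/charged) Markov process.

    tau_c: mean time spent empty before a carrier is captured
    tau_e: mean time spent charged before the carrier is emitted
    Dwell times are exponentially distributed, so the process is memoryless.
    The defect starts empty at t = 0.
    """
    rng = random.Random(seed)
    t, charged, segments = 0.0, False, []
    while t < t_end:
        mean = tau_e if charged else tau_c
        dwell = rng.expovariate(1.0 / mean)  # expovariate takes the rate, 1/mean
        segments.append((t, min(t + dwell, t_end), charged))
        t += dwell
        charged = not charged
    return segments


def charged_fraction(segments: List[Segment]) -> float:
    """Fraction of the simulated time during which the defect is charged."""
    total = sum(end - start for start, end, _ in segments)
    charged = sum(end - start for start, end, c in segments if c)
    return charged / total


def charged_at(segments: List[Segment], t: float) -> bool:
    """Whether the defect is charged at time t, e.g. at the instant of an operation."""
    for start, end, c in segments:
        if start <= t < end:
            return c
    raise ValueError("t lies outside the simulated window")


trace = rtn_trace(tau_c=1.0, tau_e=3.0, t_end=1e5, seed=1)
print(charged_fraction(trace), 3.0 / (1.0 + 3.0))
```

Over a long trace the charged fraction approaches tau_e / (tau_c + tau_e), and sampling `charged_at` at operation times gives the per-operation occupancy that a probabilistic fault criterion such as FPMO would consume.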

The project is built on the holistic integration of three key research pillars and objectives, which together demonstrate the probabilistic PFCO technology and EDA tools at the defect, device, circuit, and system modelling levels:

1) Measure and characterise different probabilistic defects/faults at CT.

2) Develop probabilistic physical and compact device/fault modelling.

3) Demonstrate the new probabilistic Power and Fault Co-optimisation strategy.

The integrated pursuit of these objectives across the project's four work packages (WPs) aims to deliver a major breakthrough, deployable for HPC at low VDD and CT, through probabilistic fault modelling and Device-Technology-System co-optimisation to achieve unprecedented PPR; the approach is transferable to many other applications in low-power design and green ICT technologies. The results will be implemented in cryogenic process design kits (PDKs) and commercial design tools, and disseminated to leading industry and academic partners through collaborations with ARM, IBM, Synopsys, IMEC and Semiwise.
