Resilient and Testable Energy-Efficient Digital Hardware

Lead Research Organisation: University of Southampton
Department Name: Electronics and Computer Science

Abstract

The UK is home to several world-leading electronics companies, including semiconductor IP suppliers of low-power microprocessors (ARM) and of multimedia and communications cores (Imagination Technologies), whose products are at the heart of today's and future consumer electronics and home entertainment. Power management is an essential enabling technology in such electronics and will become more prominent in future electronic systems. The downside of power management is that it decreases the reliability and increases the testability cost of energy-efficient hardware, as demonstrated by recent academic and industrial research, including that reported by the investigation team. This is because energy-efficient hardware often has no provision for tolerating run-time soft errors (except in safety-critical applications), and current methods for testing such hardware for manufacturing defects do not explicitly target power management circuitry. There are currently no fault models or test methods for power distribution networks and power management circuitry, and no on-line soft error monitoring and correction methods for power management hardware. This grant application is focused on developing new fault models, methods and circuits, and on their validation (simulation, FPGA and ASIC), to quantify and improve the resilience and testability of energy-efficient digital hardware. Particular emphasis is placed upon cost-effectiveness through joint consideration of reliability and test, and through re-use of on-chip hardware to minimise silicon area, power consumption and impact on functional performance. This is a three-year project involving two post-doctoral researchers (one for three years and the other for two years), with ARM (Cambridge) as an industrial partner. The project will be carried out in collaboration with Prof. F. Kurdahi (Uni. of California, Irvine) and Prof. M. Tehranipoor (Uni. of Connecticut), both acknowledged world experts in the proposed research.

This project will significantly advance the present state-of-the-art in reliable and testable energy-efficient hardware and will lead to the following research deliverables:
1. New fault models for power distribution networks and power management circuitry (PDMC) that capture their logic and timing behaviour under soft errors and manufacturing defects;
2. New methods and circuits, and their practical validation, for improving testability and diagnosis (against manufacturing defects) and reliability (against soft errors) through online monitoring and correction;
3. A design automation methodology for automatically embedding into an energy-efficient design the circuitry required for enhanced reliability and testability, using existing EDA tools.

Planned Impact

Impact on Knowledge: There are currently no fault models or test methods for power distribution networks and power management circuitry, and no on-line soft error monitoring and correction methods for power management hardware. This research will provide high academic impact through significant advances in the present state-of-the-art design methods and automation of energy-efficient digital hardware. This will further raise the profile and presence of our internationally leading research in energy-efficient and dependable hardware through the dissemination of research results in high-impact publications, the organization of workshops, and invited talks at major international conferences.

Impact on the Economy: The microelectronics industry is highly competitive, particularly for consumer products. To remain ahead of the field, companies must continually develop and improve their products, with a drive towards increased energy efficiency and reliability at reduced cost. The importance of the proposed research derives from its contribution to ensuring that future hardware is low-power and dependable across a range of application areas, from healthcare and transport to electronics design and safety-critical systems. This research will demonstrate the potential of new fault models, methods and circuits for designing resilient and testable energy-efficient digital hardware, which will be highly attractive for commercial development. We believe that this proposal has two key areas of novelty that offer opportunities for exploitation. The first area is the development of fault models for power distribution networks and power management circuitry that are aware of soft errors and manufacturing defects. The second area is the methods and circuits which, together with the fault models, will help to establish the scientific foundation required for the development of design flows and synthesis tools that support dependable and testable energy-efficient hardware. At present, there are no commercial tools for reliable and yield-aware low-power designs; therefore the research results will be attractive for commercial development. Exploitable results arising from the research will be dealt with through the University of Southampton Research and Innovation Services and, where appropriate, through other routes (e.g. a start-up company). The project will directly benefit its industrial partner ARM (Cambridge), the market leader in the design of low-power IP processor cores for consumer electronics. ARM fully appreciates the advantages of the developed methods, circuits and tools, which support the design of future energy-efficient computing platforms.

Impact on People: The research programme will provide the two named research fellows with high-quality, industry-relevant training in the general area of Electronics Design and EDA tools, and with the opportunity to further increase their international research presence and leadership, improving their employability prospects.

Impact on Society: The School of Electronics and Computer Science (ECS) places a high value on communicating its research through the media. It has its own experienced Marketing and Communications Manager, Joyce Lewis, responsible for all corporate marketing communications, public relations and media liaison. Joyce set up the ECS video podcasting news service in 2006, which was the first in any UK academic institution. The video podcasts are a regular feature of our communications programme. Over the last five years, Joyce has organised a large-scale media training programme for postgraduate students and researchers, with over 100 people taking part. ECS issues around 80 news releases each year to international and national media and receives outstanding coverage. High-quality media training of researchers is recognized as an essential part of ECS's success in public engagement, and is provided by a number of experienced journalist trainers.

Publications

Gutierrez M (2017) Susceptible Workload Evaluation and Protection using Selective Fault Tolerance in Journal of Electronic Testing

Kufel J (2016) Sequence-Aware Watermark Design for Soft IP Embedded Processors in IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Rossi D (2017) Aging Benefits in Nanometer CMOS Designs in IEEE Transactions on Circuits and Systems II: Express Briefs

Rossi D (2016) Reliable Power Gating With NBTI Aging Benefits in IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Rossi D (2018) Exploiting Aging Benefits for the Design of Reliable Drowsy Cache Memories in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

 
Description Power management is an essential enabling technology in today's and future low-power devices. The downside of power management is that it decreases the reliability and increases the testability cost of energy-efficient hardware, as demonstrated by recent academic and industrial research, including that reported by the investigation team.

We showed that existing delay-based testing techniques for power gating are based on simplified models of power-distribution-networks (PDNs) and suffer both fault coverage loss and yield loss, because the distributed nature of PDNs introduces deviations in the charging delay. To restore this loss, which could reach up to 67.7% of false passes and 25% of false fails due to stuck-open faults, we proposed a design-for-testability (DFT) logic that accounts for a distributed PDN. The proposed logic is optimized for flexibility between test application time and hardware overhead, and reuses silicon already available on the devices. Through SPICE simulation of the physical layout, we showed complete fault coverage and yield recovery for single stuck-open faults. This research was the first analysis of the impact of the PDN on test quality and offers a unified test solution for both ring and grid power gating styles. The analysis of the problem, the proposed solutions and the validation simulation results for grid style power gating were presented at the 23rd Asian Test Symposium in November 2014 under the title "High Quality Testing of Grid Style Power Gating". The complete research on testing of ring and grid power gating styles is published in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems under the title "DFT architecture with power-distribution-network consideration for delay-based power gating test".

Moreover, we have investigated the impact of the power-distribution-network on the diagnosis of power gating designs. The analysis results and the proposed solutions were presented at the 20th European Test Symposium under the title "Diagnosis of power switches with power-distribution-network consideration". The complete research on the diagnosis of power gating considering the power-distribution-network has been submitted to the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) and is under review.

During this project, we have also investigated the impact of Bias Temperature Instability (BTI), the most dominant aging mechanism of devices manufactured in CMOS technologies below 65 nm, on the long-term reliability of low-power designs. We demonstrated that, as logic circuits and memories age due to BTI, leakage power reduction techniques become more effective, because the sub-threshold current decreases with aging. The results are published in the IEEE Transactions on Circuits and Systems II under the title "Aging benefits in Nanometer CMOS designs". In order to quantify and harvest the low-power benefits of aging, we developed a technique for exploring the power gating design of logic circuits and a technique for exploring the Dynamic-Voltage-Scaling (DVS) design of memories. The analysis of this phenomenon in logic circuit designs, together with the proposed solutions for power gated logic designs, was presented at the 20th European Test Symposium under the title "NBTI and leakage aware sleep transistor design for reliable and energy efficient power gating", and the complete work has been published in the IEEE Transactions on Very Large Scale Integration (VLSI) Systems under the title "Reliable power gating with NBTI aging benefits". The analysis of this phenomenon for memories, together with the proposed solutions for drowsy memories, was presented at the 21st IEEE International On-Line Testing Symposium under the title "BTI and leakage aware dynamic voltage scaling for reliable low power cache memories", and the overall work on how the aging of drowsy memories affects power gated designs is under major revision for publication in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).
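
The link between aging and leakage can be illustrated with the standard first-order sub-threshold model, in which the leakage current scales as exp(-Vth / (n * kT/q)), so a BTI-induced increase in threshold voltage magnitude lowers leakage exponentially. The sketch below is only an illustration under assumed values: the function name, the slope factor n = 1.5, the 300 K temperature and the 30 mV shift are hypothetical and are not taken from the project's publications.

```python
import math

BOLTZMANN_OVER_Q = 8.617333e-5  # k/q in volts per kelvin


def leakage_ratio(delta_vth_mv, n=1.5, temp_k=300.0):
    """Return aged/fresh sub-threshold current for a threshold-voltage shift.

    First-order model: I_sub is proportional to exp(-Vth / (n * kT/q)).
    delta_vth_mv is the assumed increase in |Vth| (in mV) caused by aging;
    n is an assumed sub-threshold slope factor.
    """
    thermal_voltage_mv = BOLTZMANN_OVER_Q * temp_k * 1000.0
    return math.exp(-delta_vth_mv / (n * thermal_voltage_mv))


if __name__ == "__main__":
    # Hypothetical 30 mV shift: leakage drops to roughly 0.46 of its fresh value.
    print(f"{leakage_ratio(30.0):.2f}")
```

Under these assumptions, even a shift of a few tens of millivolts removes a large fraction of the leakage, which is consistent with the qualitative observation above that leakage reduction techniques become more effective as devices age.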

Additionally, we have shown that the supply current quality both affects and is affected by the aging status of the devices. We proposed a technique for monitoring the supply current locally on the devices, which feeds the state of the supply network back to the power management circuitry in order to monitor the aging of the devices at runtime. We have also investigated reusing for aging monitoring the power network sensor that was developed as a DFT for power gating. Using measurements from actual chips provided by ARM, which is an industrial partner of the project, we developed and validated an approach for online monitoring of the BTI-induced aging of power gated cores that does not require any distributed sensors. Instead, it reuses the virtual rail, which is already distributed, and therefore has negligible area cost. The research has been published in the IEEE Transactions on Very Large Scale Integration (VLSI) Systems under the title "Coarse-grained Online Monitoring of BTI Aging by Reusing Power Gating Infrastructure".

Moreover, we conducted a comprehensive evaluation of the effects of BTI on level shifters. The research outcomes have been published in the Microelectronics Reliability journal under the title "The impact of BTI aging on the reliability of level shifters in Nano-scale CMOS technology".
Exploitation Route It is likely that companies working on testing, or on the design of sensors for power consumption, reliability and power integrity, will benefit from the results of this project. Manufacturing testing is conducted after the integrated circuits are manufactured, to ensure that only defect-free devices are shipped to customers. The effectiveness of the developed DFT architectures has already been demonstrated for manufacturing testing. As a result, they could be adopted by industry for testing such low-power designs. Moreover, the DFT is integrated in the devices and can be reused in the field to increase their resilience. These expectations are supported by the interest shown in our work by the following companies:
• The company Moortech Semiconductor IP has expressed its interest in the developed power network sensors for aging monitoring.
• The company Broadcom has shown interest in using the DFT architecture for power gating in runtime applications.
• ARM Ltd. was also interested in the developed aging monitoring sensors, mainly for online applications, and has provided the test chips for the validation of the technique.

We observe that the above interactions of our research team are related to online/runtime applications of the sensors. Therefore, we have concluded that the necessary drivers and software need to be developed for the proposed architectures in order to move this technology forward and make it available to a broader range of online applications and end users. As a further step, the proposed reliability monitoring and enhancement solutions should be evaluated on more complex systems and workloads.

To meet these goals, we have already taken the first steps towards combining the technology developed by the Resilient project with that of the PRiME project. The latter targets the development of runtime software for multi-core heterogeneous platforms. First, the reliability analysis of data obtained from power network sensors is already used by the PRiME project. Additionally, we have designed and implemented a prototype of the proposed power network sensors by combining power network monitoring circuitry that is already available on actual platforms with custom designs on FPGA fabric; the prototype provides an indication of the power integrity status of the system. Our first results on commercial development platforms have identified potential problems that negatively affect the power integrity of those platforms. Our next step is to mitigate these problems without compromising performance, by using the online sensors developed by the Resilient project to monitor the power network status and the PRiME runtime software to control the DVFS policies appropriately.

Additional outcomes and ongoing pieces of this project include:
• A susceptible-workload-driven selective fault tolerance technique using a probabilistic fault model for enhancing the reliability of combinational circuits, which was presented at the 22nd IEEE International On-Line Testing Symposium (IOLTS'16) and has been submitted to Springer's Journal of Electronic Testing.
• A novel low-power online error monitoring technique that produces an alarm signal when systematic erroneous behaviour has occurred over a pre-defined time interval, which has been accepted as a poster at the 22nd IEEE European Test Symposium (ETS) 2017 under the title "Low Power Probabilistic Online Monitoring of Systematic Erroneous Behaviour".
• A BTI-aware thermal management technique for enhancing the reliability of DVFS designs, which was presented at the IEEE Defect and Fault Tolerance in VLSI and Nanotechnology Systems Symposium (DFT'16), where it was a best paper award nominee. We are currently extending this work for submission to IEEE Transactions on Emerging Topics in Computing, Special Issue on Reliability-aware Design and Analysis Methods for Digital Systems.
Sectors Aerospace, Defence and Marine; Electronics; Energy; Manufacturing, including Industrial Biotechnology

URL http://rhea.ecs.soton.ac.uk
 
Description Our research on aging monitoring was validated using chips provided by ARM Ltd, which is one of our industrial partners. The PRiME project is using key findings of this project to enhance the reliability of complex heterogeneous systems using data provided by prototype power network sensors. First, the circuit reliability analysis of data obtained from power network sensors is being used to build system-level reliability models for the PRiME project. Second, prototypes of the power network monitoring circuitry, which has already been shown to monitor the reliability of circuits successfully, are used for the reliability monitoring of actual platforms. The Resilient outcomes have already been cited by numerous prestigious publications in IEEE conferences and Transactions, as well as by the press (IEEE Spectrum Magazine).
 
Description ARM Ltd
Organisation ARM Holdings
Country United Kingdom 
Sector Private 
Start Year 2006