7.5 Reliability Modeling and Mitigation

Date: Wednesday 21 March 2018
Time: 14:30 - 16:00
Location / Room: Konf. 3

Chair:
Said Hamdioui, TU Delft, NL

Co-Chair:
Bram Kruseman, NXP, NL

This session covers various reliability modeling, characterization and mitigation approaches at different abstraction levels. The first paper uses deep learning for variability characterization. The second paper provides aging mitigation schemes for voltage regulators. The third paper addresses program vulnerability in GPU applications.

Time    Label    Presentation Title
Authors
14:30    7.5.1    (Best Paper Award Candidate)
LOW-COST HIGH-ACCURACY VARIATION CHARACTERIZATION FOR NANOSCALE IC TECHNOLOGIES VIA NOVEL LEARNING-BASED TECHNIQUES
Speaker:
Zhijian Pan, Tsinghua University, CN
Authors:
Zhijian Pan¹, Miao Li², Jian Yao², Hong Lu², Zuochang Ye¹, Yanfeng Li² and Yan Wang¹
¹Tsinghua University, CN; ²Platform Design Automation, Inc., CN
Abstract
Faster and more accurate variation characterizations of semiconductor devices/circuits are in great demand as process technologies scale down to the FinFET era. Traditional methods with intensive data testing are extremely costly. In this paper, we propose a novel learning-based high-accuracy data prediction framework inspired by learning methods from computer vision to efficiently characterize variabilities of device/circuit behaviors induced by manufacturing process variations. The key idea is to adaptively learn the underlying data pattern among data with variations from a small set of already obtained data and utilize it to accurately predict the unmeasured data with minimum physical measurement cost. To realize this idea, novel regression modeling techniques based on Gaussian process regression and partial least squares regression with feature extraction and matching are developed. We applied our approach to real-time variation characterization for transistors with multiple geometries from a foundry 28nm CMOS process. The results show that the framework achieves about 14x speed-up with 0.1% error on average for variation data prediction and under 0.3% error for statistical extraction compared to traditional physical measurements, which demonstrates the efficacy of the framework for accurate and fast variation analysis and statistical modeling.

15:00    7.5.2    MITIGATION OF NBTI INDUCED PERFORMANCE DEGRADATION IN ON-CHIP DIGITAL LDOS
Speaker:
Longfei Wang, University of South Florida, US
Authors:
Longfei Wang¹, S. Karen Khatamifard², Ulya Karpuzcu² and Selcuk Kose¹
¹University of South Florida, US; ²University of Minnesota, US
Abstract
On-chip digital low-dropout voltage regulators (LDOs) have recently gained impetus and drawn significant attention for integration within both mobile devices and microprocessors. Although digital LDOs surpass analog LDOs and other voltage regulator types in ease of integration and response speed, NBTI-induced performance degradation is typically overlooked. The conventional bi-directional shift-register-based controller can even exacerbate the degradation, which has been demonstrated theoretically and through practical applications. In this paper, a novel uni-directional shift register is proposed to evenly distribute the electrical stress and mitigate the NBTI effects under arbitrary load conditions with nearly no extra power and area overhead. The benefits of the proposed design as well as reliability-aware design considerations are explored and highlighted through simulation of an IBM POWER8-like processor under several benchmark applications. It is demonstrated that the proposed NBTI-aware design can achieve up to 43.2% performance improvement as compared to a conventional one.
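
The stress-balancing idea in this abstract can be illustrated with a toy model. The sketch below is not the authors' controller: it assumes that a bi-directional controller always switches power transistors on and off from the same end of the array, that a uni-directional controller turns transistors on at a head pointer and off at a tail pointer that both advance in the same circular direction, and that on-time is a usable proxy for NBTI stress. The function names and the `demand` trace (number of active transistors per time step) are hypothetical.

```python
def bidirectional_stress(demand, n):
    """On-time per transistor when the first k transistors are always the active ones."""
    stress = [0] * n
    for k in demand:
        for i in range(k):
            stress[i] += 1
    return stress


def unidirectional_stress(demand, n):
    """On-time per transistor when on/off events both advance circularly.

    Assumed model: 'head' marks the next transistor to turn on and
    'tail' the next one to turn off; both only move forward (mod n).
    """
    stress = [0] * n
    on = [False] * n
    head = tail = active = 0
    for k in demand:
        while active < k:          # load increased: turn on at head
            on[head] = True
            head = (head + 1) % n
            active += 1
        while active > k:          # load decreased: turn off at tail
            on[tail] = False
            tail = (tail + 1) % n
            active -= 1
        for i in range(n):         # accumulate one time step of stress
            if on[i]:
                stress[i] += 1
    return stress
```

For a hypothetical load that alternates between 2 and 4 active transistors out of 8 for 400 steps, the bi-directional model leaves half of the array unstressed while the first two transistors are on the whole time; the uni-directional model spreads the same total on-time equally over all eight.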

15:30    7.5.3    EVALUATING THE IMPACT OF EXECUTION PARAMETERS ON PROGRAM VULNERABILITY IN GPU APPLICATIONS
Speaker:
Fritz Previlon, Northeastern University, US
Authors:
Fritz Previlon¹, Charu Kalra¹, Paolo Rech² and David Kaeli¹
¹Northeastern University, US; ²Universidade Federal do Rio Grande do Sul, BR
Abstract
While transient faults continue to be a major concern for the High Performance Computing (HPC) community, we still lack a clear understanding of how these faults propagate in applications. This paper addresses two particular aspects of the vulnerabilities of HPC applications as run on Graphics Processing Units (GPUs): their dependence on input data and on thread-block size. To characterize fault propagation as a function of input parameters, we leverage an ISA-level fault injection framework and carry out an extensive fault injection campaign to characterize the vulnerability of a suite of GPU applications. Our results show that the vulnerability of most of the programs studied is insensitive to changes in input values, except in less common cases when input values were highly biased, i.e., values that exhibit a special vulnerability behavior. For example, the multiplication property of any value with a zero value (zero times any number is equal to zero) makes it a biased input for multiplication operations. Our study also examines the effects of changing the GPU thread-block size and its impact on vulnerability. We found that, similar to performance, the vulnerability of an application can depend on the block size of the kernels in the application. In some applications, we found that the silent data corruption rate can vary by as much as 8% when changing the block size of a kernel.
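
The zero-operand example in this abstract can be reproduced with a few lines of host code. The sketch below is only an illustration of the masking effect, not the ISA-level fault injection framework used in the paper: it flips single bits in the binary32 encoding of one multiplication operand and counts how often the product changes. The helper names `flip_bit` and `corruption_rate` are hypothetical, and only the 23 mantissa bits are flipped so that no NaN or infinity is produced.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its IEEE-754 binary32 encoding flipped."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def corruption_rate(a: float, b: float, bits=range(23)) -> float:
    """Fraction of single-bit flips in operand a that change the product a * b.

    Default bits 0..22 are the binary32 mantissa bits (an assumption made
    here to keep every faulty operand finite).
    """
    golden = a * b
    corrupted = sum(1 for bit in bits if flip_bit(a, bit) * b != golden)
    return corrupted / len(bits)
```

With a zero as the other operand, every injected flip is masked because the product stays zero, whereas with a nonzero operand such as 2.0 every mantissa flip reaches the output. This is the sense in which zero acts as a biased input for multiplication.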

16:00    IP3-12, 387    AN EFFICIENT NBTI-AWARE WAKE-UP STRATEGY FOR POWER-GATED DESIGNS
Speaker:
Yu-Guang Chen, Yuan Ze University, TW
Authors:
Kun-Wei Chiu¹, Yu-Guang Chen² and Ing-Chao Lin¹
¹National Cheng Kung University, TW; ²Yuan Ze University, TW
Abstract
The wake-up process of a power-gated design may induce an excessive surge current and threaten signal integrity. A proper wake-up sequence should be carefully designed to avoid surge current violations. On the other hand, PMOS sleep transistors may suffer from the negative-bias temperature instability (NBTI) effect, which results in decreased driving current. Conventional wake-up sequence decision approaches do not consider the NBTI effect, which may result in a longer or unacceptable wake-up time after circuit aging. Therefore, in this paper, we propose a novel NBTI-aware wake-up strategy to reduce the average wake-up time within a circuit lifetime. Our strategy first finds a set of proper wake-up sequences for different aging scenarios (i.e., after a certain period of aging), and then dynamically reconfigures the wake-up sequences at runtime. The experimental results show that compared to a traditional fixed wake-up sequence approach, our strategy can reduce average wake-up time by as much as 45.04% with only 3.7% extra area overhead for the reconfiguration structure.

16:01    IP3-13, 835    DESIGNING RELIABLE PROCESSOR CORES IN ULTIMATE CMOS AND BEYOND: A DOUBLE SAMPLING SOLUTION
Speaker:
Nacer-Eddine Zergainoh, TIMA, FR
Authors:
Thierry Bonnoit, Fraidy Bouesse, Nacer-Eddine Zergainoh and Michael Nicolaidis, TIMA, FR
Abstract
The double sampling paradigm is an efficient method to protect circuits against soft errors, but data leaving the area protected by double sampling are still vulnerable. To eliminate this weakness without additional constraints on the datapaths, the most common solution adds a contaminable buffer stage between the two areas. This stage prevents potentially corrupted data from propagating further into the circuit when an error is detected in the double sampling area. The issue is that this stage must itself be protected against soft errors, which drastically increases the cost of the solution. In this paper we characterize the additional implementation constraints due to this vulnerability. We propose an architectural solution that uses three latches to remove those constraints and protect the area outside the double sampling domain without adding a buffer stage. We present an implementation of this solution on the LEON3 processor, and we compare the results in terms of additional cost and efficiency with other solutions.

16:00    End of session
Coffee Break in Exhibition Area


