5.6 Reliability improvement and evaluation techniques


Date: Wednesday 21 March 2018
Time: 08:30 - 10:00
Location / Room: Konf. 4

Chair:
Stefano Di Carlo, Politecnico di Torino, IT

Co-Chair:
Vasileios Tenentes, University of Southampton, GB

This session introduces reliability improvement approaches using dynamic recovery, redundant multithreading, aging mitigation and optimization of metastability effects, spanning from the system to the circuit layer. Also, cross-layer resilience evaluation via fault injection for complex microprocessors is presented.

Time  Label  Presentation Title
Authors
08:30  5.6.1  IMPROVING RELIABILITY FOR REAL-TIME SYSTEMS THROUGH DYNAMIC RECOVERY
Speaker:
Yue Ma, University of Notre Dame, US
Authors:
Yue Ma1, Tam Chantem2, Robert P. Dick3 and Xiaobo Sharon Hu1
1University of Notre Dame, US; 2Virginia Tech, US; 3University of Michigan, US
Abstract
Technology scaling has increased concerns about transient faults due to soft errors and permanent faults due to lifetime wear processes. Although researchers have investigated related problems, they have either considered only one of the two reliability concerns or presented simple recovery allocation algorithms that cannot effectively use available time slack to improve soft-error reliability. This paper introduces a framework for improving soft-error reliability while satisfying lifetime reliability and real-time constraints. We present a dynamic recovery allocation technique that guarantees recovery of any failed task if the remaining slack is adequate. Based on this technique, we propose two scheduling algorithms for task sets with different characteristics to improve system-level soft-error reliability. Lifetime reliability requirements are satisfied by reducing core frequencies for appropriate tasks, thereby reducing wear due to temperature and thermal cycling. Simulation results show that the proposed framework reduces the probability of failure by at least 8%, and by 73% on average, compared to existing approaches.
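
As an illustration only, slack-gated recovery can be sketched as follows. This is not the paper's allocation technique or either of its scheduling algorithms. It assumes a hypothetical model in which tasks run back to back, all tasks share one slack pool (the deadline minus the sum of worst-case execution times), and a failed task is re-executed once if the remaining slack covers its worst-case execution time. The names `Task`, `run_with_dynamic_recovery`, and `failed` are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    wcet: float  # worst-case execution time (hypothetical units)


def run_with_dynamic_recovery(tasks, deadline, failed):
    """Walk the tasks in order and re-run each failed one if slack allows.

    `failed` is the set of task names hit by a transient fault on their
    first execution (an assumed fault model, not taken from the paper).
    Returns (recovered names, unrecovered names, leftover slack).
    """
    slack = deadline - sum(t.wcet for t in tasks)
    if slack < 0:
        raise ValueError("task set does not fit before the deadline")
    recovered, lost = [], []
    for task in tasks:
        if task.name not in failed:
            continue
        if slack >= task.wcet:
            # Enough shared slack remains: spend it on one re-execution.
            slack -= task.wcet
            recovered.append(task.name)
        else:
            lost.append(task.name)
    return recovered, lost, slack


tasks = [Task("A", 2), Task("B", 3), Task("C", 1)]
print(run_with_dynamic_recovery(tasks, deadline=10, failed={"A", "B"}))
```

With a deadline of 10, the pool starts at 4 units of slack. Recovering A leaves 2, which is too little to re-run B, so B stays unrecovered. Sharing the pool at run time, rather than reserving a fixed recovery slot per task, is the design choice the sketch is meant to show.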

09:00  5.6.2  OPTIMAL METASTABILITY-CONTAINING SORTING NETWORKS
Speaker:
Johannes Bund, Saarland University, DE
Authors:
Johannes Bund1, Christoph Lenzen2 and Moti Medina3
1Saarland University, Saarland Informatics Campus, DE; 2Max Planck Institute for Informatics, Saarland Informatics Campus, DE; 3Department of Electrical and Computer Engineering, Ben-Gurion University, IL (from 1/10/2017)
Abstract
When setup/hold times of bistable elements are violated, they may become metastable, i.e., enter a transient state that is neither digital 0 nor 1 [Marino 81]. In general, metastability cannot be avoided, a problem that manifests whenever taking discrete measurements of analog values. Metastability of the output then reflects uncertainty as to whether a measurement should be rounded up or down to the next possible integral measurement outcome. Surprisingly, Lenzen & Medina (ASYNC 2016) showed that metastability can be contained, i.e., measurement values can be correctly sorted without resolving metastability first. However, both their work and the state of the art by Bund et al. (DATE 2017) leave open whether such a solution can be as small and fast as standard sorting networks. We show that this is indeed possible, by providing a circuit that sorts Gray code inputs (possibly containing a metastable bit) and has asymptotically optimal depth and size. Concretely, for 10-channel sorting networks and 16-bit wide inputs, we improve by 48.46% in delay and by 71.58% in area over Bund et al. Our simulations indicate that straightforward transistor-level optimization is likely to result in performance on par with standard (non-containing) solutions.
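
To make "sorting without resolving metastability" concrete, here is a brute-force software model of what a metastability-containing 2-sort element should output. It is not the circuit from the paper. It assumes binary reflected Gray code, writes an unresolved bit as `M`, and defines the expected output by trying every resolution of `M` and keeping only the output bits on which all resolutions agree. All function names are hypothetical.

```python
from itertools import product


def gray(x, width):
    """Binary reflected Gray code of x as a bit string."""
    return format(x ^ (x >> 1), f"0{width}b")


def gray_to_int(bits):
    """Decode a stable Gray code bit string back to an integer."""
    g, x = int(bits, 2), 0
    while g:
        x ^= g
        g >>= 1
    return x


def resolutions(word):
    """All stable words obtained by replacing each 'M' with 0 or 1."""
    slots = [("0", "1") if b == "M" else (b,) for b in word]
    return ["".join(p) for p in product(*slots)]


def superpose(words):
    """Keep bits on which all words agree; mark the others 'M'."""
    return "".join(c[0] if len(set(c)) == 1 else "M" for c in zip(*words))


def contained_max_min(a, b):
    """Expected (max, min) outputs of a 2-sort on possibly metastable inputs."""
    width = len(a)
    maxs, mins = [], []
    for ra in resolutions(a):
        for rb in resolutions(b):
            x, y = gray_to_int(ra), gray_to_int(rb)
            maxs.append(gray(max(x, y), width))
            mins.append(gray(min(x, y), width))
    return superpose(maxs), superpose(mins)


# "01M1" is a measurement caught between 5 (0111) and 6 (0101).
print(contained_max_min("01M1", "0010"))  # other input is 3
print(contained_max_min("01M1", "0101"))  # other input is 6
```

Because adjacent Gray code words differ in exactly one bit, a measurement between 5 and 6 has a single uncertain bit. In the first call the uncertain word is the larger value, so the maximum output carries the `M` and the minimum is the stable code of 3. In the second call the maximum is the stable code of 6 and the uncertainty moves to the minimum output. In both cases the outputs contain no more uncertainty than the inputs.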

09:30  5.6.3  MAUI: MAKING AGING USEFUL, INTENTIONALLY
Speaker:
Shou-Chun Li, Department of Computer Science, National Chiao Tung University, TW
Authors:
Kai-Chiang Wu1, Tien-Hung Tseng2 and Shou-Chun Li1
1National Chiao Tung University, TW; 2National Chiao Tung University, Taiwan, TW
Abstract
Device aging, which causes significant loss in circuit performance and lifetime, has been a primary factor in reliability degradation of nanoscale designs. In this paper, we propose to take advantage of aging-induced clock skews (i.e., make them useful for aging tolerance) by manipulating these time-varying skews to compensate for the performance degradation of logic networks. The goal is to assign achievable/reasonable aging-induced clock skews in a circuit, such that its overall performance degradation due to aging can be minimized, that is, the lifespan can be maximized. On average, 25% aging tolerance can be achieved with insignificant design overhead.

09:45  5.6.4  EXPERT: EFFECTIVE AND FLEXIBLE ERROR PROTECTION BY REDUNDANT MULTITHREADING
Speaker:
HwiSoo So, Yonsei University, KR
Authors:
HwiSoo So1, Moslem Didehban2, Yohan Ko1, Aviral Shrivastava2 and Kyoungwoo Lee1
1Yonsei University, KR; 2Arizona State University, US
Abstract
Resiliency is a first-order design concern in modern microprocessor design. Compiler-level Redundant MultiThreading (RMT) schemes are promising because of their capability to detect the manifestation of hardware transient and permanent faults. In this work, we propose EXPERT, a compiler-level RMT scheme which can detect the manifestation of hardware faults in all hardware components. The EXPERT transformation generates a checker thread for the program's main execution thread. These redundant threads execute simultaneously on two physically different cores of a multi-core processor. They perform mostly the same computations; however, after each memory write operation committed by the main thread, the checker thread loads the written data back from memory and checks it against its own locally computed values. If they match, execution continues. Otherwise, the error flag is raised. Our processor-wide statistical transient and permanent fault injection experiments show that EXPERT error coverage is ~65x better than the state-of-the-art scheme.
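
The load-back check described above can be pictured with a small two-thread sketch. It is not the EXPERT compiler transformation. It assumes a hypothetical workload `compute`, a dictionary standing in for memory, and a queue through which the main thread announces each committed write. Python threads do not guarantee execution on separate physical cores, so only the structure of the check is shown.

```python
import queue
import threading


def compute(i):
    """Hypothetical workload executed redundantly by both threads."""
    return i * i + 1


def main_thread(n, memory, committed, corrupt_index):
    for i in range(n):
        value = compute(i)
        if i == corrupt_index:
            value ^= 1  # injected single-bit fault (assumed fault model)
        memory[i] = value  # memory write committed by the main thread
        committed.put(i)   # tell the checker which address was written
    committed.put(None)    # end-of-program marker


def checker_thread(memory, committed, errors):
    while True:
        addr = committed.get()
        if addr is None:
            break
        # Load the written data back and compare it with the checker's
        # own locally computed value.
        if memory[addr] != compute(addr):
            errors.append(addr)  # stands in for raising the error flag


def run_redundant(n, corrupt_index=None):
    memory, committed, errors = {}, queue.Queue(), []
    workers = [
        threading.Thread(target=main_thread,
                         args=(n, memory, committed, corrupt_index)),
        threading.Thread(target=checker_thread,
                         args=(memory, committed, errors)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return errors


print(run_redundant(8))                   # fault-free run
print(run_redundant(8, corrupt_index=3))  # fault injected at address 3
```

The checker compares against the value read back from memory rather than a value forwarded by the main thread, so a corruption anywhere on the path to memory shows up as a mismatch. The sketch records mismatching addresses and keeps running, whereas the abstract describes raising an error flag.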

10:00  IP2-11, 579  ETISS-ML: A MULTI-LEVEL INSTRUCTION SET SIMULATOR WITH RTL-LEVEL FAULT INJECTION SUPPORT FOR THE EVALUATION OF CROSS-LAYER RESILIENCY TECHNIQUES
Speaker:
Martin Dittrich, Technical University of Munich, DE
Authors:
Daniel Mueller-Gritschneder1, Martin Dittrich1, Josef Weinzierl1, Eric Cheng2, Subhasish Mitra2 and Ulf Schlichtmann1
1Technical University of Munich, DE; 2Stanford University, US
Abstract
ETISS is an instruction set simulator (ISS) for Virtual Prototypes (VPs) modeled with SystemC/TLM. In this paper, we propose the extension ETISS-ML, which enables a multi-level simulation that switches between the ISS level and the register transfer level (RTL) to accurately evaluate the impact of soft errors in the pipeline of a RISC processor. ETISS-ML achieves close-to-RTL-accurate fault injection simulation results with close-to-ISS simulation performance, with a speedup of up to 100x compared to RTL. For this, we propose an approach to dynamically determine the length of the RTL simulation period. The high simulation performance of ETISS-ML enables an ultra-efficient and accurate evaluation of cross-layer resiliency techniques for embedded applications, which requires running a large number of fault injections for long simulation scenarios. This is demonstrated on a case study of a Microcontroller Unit (MCU) executing a control algorithm for adaptive cruise control.

10:01  IP2-12, 806  PRECISE EVALUATION OF THE FAULT SENSITIVITY OF OOO SUPERSCALAR PROCESSORS
Speaker:
Antonio Carlos Schneider Beck, Federal University of Rio Grande do Sul, BR
Authors:
Rafael Tonetto1, Gabriel Luca Nazar2 and Antonio Carlos Schneider Beck2
1Federal University of Rio Grande do Sul, BR; 2Universidade Federal do Rio Grande do Sul, BR
Abstract
Since superscalar processors lead the market, their resiliency evaluation by means of fault injection grows in importance. Fault injection strategies usually trade off accuracy against cost: low-level HW-based methods are accurate, but very expensive, need special equipment and the actual hardware, and lack controllability; high-level simulation-based strategies are flexible, fast, easily accessible and have high controllability, but are not accurate, since they are based on models that do not always reflect the low-level implementation, especially for complex designs like out-of-order multiple-issue processors. In this work, we propose a cycle-accurate fault injection platform for superscalar processors, which has a smart checkpointing mechanism to accelerate injection time, attenuating the shortcomings imposed by the aforementioned fault injection methods while providing the same level of abstraction as detailed RTL models. Leveraging this new platform, we evaluate a complex and parameterizable out-of-order processor (BOOM) by experimenting with different issue widths and analyzing the sensitivity of several hardware structures of the processor.
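
The abstract does not describe how its checkpointing mechanism works, so the following is only a generic sketch of why checkpoints shorten a fault injection campaign. It assumes a toy deterministic machine (`step`) in place of a cycle-accurate processor model, checkpoints taken at a fixed interval during the fault-free run, and a single bit flip in one register. All names are hypothetical.

```python
import copy


def step(state):
    """One cycle of a toy 8-bit machine (stand-in for a real simulator)."""
    state["acc"] = (state["acc"] * 5 + state["pc"]) & 0xFF
    state["pc"] += 1


def golden_run(cycles, interval):
    """Fault-free run that saves a checkpoint every `interval` cycles."""
    state = {"pc": 0, "acc": 1}
    checkpoints = {0: copy.deepcopy(state)}
    for cycle in range(1, cycles + 1):
        step(state)
        if cycle % interval == 0:
            checkpoints[cycle] = copy.deepcopy(state)
    return state, checkpoints


def inject(cycles, interval, checkpoints, cycle, bit):
    """Flip `bit` of the accumulator at `cycle`, then run to the end.

    The run resumes from the nearest earlier checkpoint instead of from
    cycle 0. Returns the final state and the number of cycles that were
    re-simulated before the injection point.
    """
    start = (cycle // interval) * interval
    state = copy.deepcopy(checkpoints[start])
    for _ in range(start, cycle):
        step(state)
    state["acc"] ^= 1 << bit  # the injected fault
    for _ in range(cycle, cycles):
        step(state)
    return state, cycle - start


golden, checkpoints = golden_run(cycles=100, interval=10)
faulty, warmup = inject(100, 10, checkpoints, cycle=37, bit=0)
print(warmup, faulty != golden)
```

Injecting at cycle 37 restarts from the checkpoint at cycle 30, so 7 cycles are replayed before the fault instead of 37. Comparing the final state against the golden run classifies the outcome of each injection.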

10:00  End of session
Coffee Break in Exhibition Area



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks in the exhibition area (Terrace Level of the ICCD) at the times listed below.

Lunch Breaks (Großer Saal + Saal 1)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the rooms "Großer Saal" and "Saal 1" (Saal Level of the ICCD) to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 20, 2018

  • Coffee Break 10:30 - 11:30
  • Lunch Break 13:00 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:50 - 14:20
  • Coffee Break 16:00 - 17:00

Wednesday, March 21, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:30 - 14:20
  • Coffee Break 16:00 - 17:00

Thursday, March 22, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:00
  • Keynote Lecture in "Saal 2" 13:20 - 13:50
  • Coffee Break 15:30 - 16:00