6.3 Anti-aging and Error Protection using Checkpointing and DVFS


Date: Wednesday 16 March 2016
Time: 11:00 - 12:30
Location / Room: Konferenz 1

Chair:
Antonio Rosario Miele, Polimi, IT

Co-Chair:
Jose L. Ayala, Complutense University of Madrid, ES

As reliability becomes a major concern for both designers and technologists, techniques such as error protection are needed to keep the best known state and preserve it for subsequent operations. This session presents various methods of checkpointing at register level and at memory level that relieve systems from aging. It also presents various combinations of DVFS and checkpointing techniques, including techniques that exploit application-level tolerance to errors.

Time | Label | Presentation Title
Authors
11:00 | 6.3.1 | AGING-AWARE VOLTAGE SCALING
Speaker:
Victor M. van Santen, Karlsruhe Institute of Technology (KIT), DE
Authors:
Victor M. van Santen (1), Hussam Amrouch (1), Narendra Parihar (2), Souvik Mahapatra (2) and Jörg Henkel (1)
(1) Karlsruhe Institute of Technology (KIT), DE; (2) Indian Institute of Technology Bombay, IN
Abstract
As feature sizes of transistors began to approach atomic levels, aging effects have become one of the major concerns when it comes to reliability. Recently, aging effects have become subject to voltage scaling as the latter entered the sub-micron regime. Hence, aging has shifted from a solely long-term challenge (as treated by the state of the art) to both a short-term and a long-term reliability challenge. This paper interrelates aging and voltage scaling to explore and quantify, for the first time, the short-term effects of aging. We propose "aging-awareness" with respect to voltage scaling, which is indispensable to sustain runtime reliability. Otherwise, transient errors caused by the short-term effects of aging may occur. Compared to the state of the art, our aging-aware voltage scaling optimizes for both short-term and long-term aging effects at marginal guardband overhead.

Download Paper (PDF; Only available from the DATE venue WiFi)
11:30 | 6.3.2 | RECORD: REDUCING REGISTER TRAFFIC FOR CHECKPOINTING IN RELIABLE EMBEDDED PROCESSORS
Speaker:
Sri Parameswaran, University of New South Wales, AU
Authors:
Tuo Li (1), Jude Angelo Ambrose (2) and Sri Parameswaran (1)
(1) University of New South Wales, AU; (2) Canon Information Systems Research Australia, AU
Abstract
Checkpoint/recovery, as a classic method, has been widely used for overcoming transient faults in computing systems. The basic function of checkpoint/recovery is to save the system states periodically and to restore the system states by using the saved states if a fault occurs. With the hardware-implemented checkpointing mechanism executing at runtime, a processor will have substantially increased register-file reads. For embedded processors, which typically have restricted design constraints on area, power, and performance, such increases might compromise the quality of the application greatly. In this paper, we present a checkpointing method, ReCoRD, aimed at reducing the resultant register traffic at runtime, by leveraging register data dependencies. The proposed checkpointing method can reduce redundant executions of register-file checkpointing. The experiments show that ReCoRD achieves improved register traffic reduction (20%) along with reduced dynamic power consumption (approximately 20%) in comparison to the state of the art with minimal area overhead. The leakage power increases marginally (about 2%), but is more than compensated by the decrease in dynamic power.
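The basic save-and-restore function described in this abstract can be illustrated with a minimal sketch. It is not the ReCoRD mechanism: it takes a full register-file checkpoint every few instructions and counts the register reads this costs. The toy machine model, the class name `CheckpointedMachine`, the add-immediate instruction format and the `fault_at` parameter are all assumptions made for illustration.

```python
class CheckpointedMachine:
    """Toy machine (hypothetical model): a register list plus a program counter."""

    def __init__(self, num_regs=4, interval=3):
        self.regs = [0] * num_regs
        self.pc = 0
        self.interval = interval      # instructions between checkpoints
        self.checkpoint_reads = 0     # register reads spent on checkpointing
        self._saved = None
        self.save_checkpoint()

    def save_checkpoint(self):
        # Full checkpoint: every register is read once and copied.
        self._saved = (list(self.regs), self.pc)
        self.checkpoint_reads += len(self.regs)

    def restore_checkpoint(self):
        # Roll registers and program counter back to the last saved state.
        regs, pc = self._saved
        self.regs = list(regs)
        self.pc = pc

    def run(self, program, fault_at=None):
        """Run (dst, value) add-immediate instructions.

        If fault_at is given, one transient fault is injected and detected
        when the program counter first reaches that index.
        """
        fault_pending = fault_at is not None
        while self.pc < len(program):
            if fault_pending and self.pc == fault_at:
                self.regs[0] ^= 0xFF          # corrupt state (assumed fault model)
                fault_pending = False
                self.restore_checkpoint()     # fault detected: roll back
                continue
            dst, value = program[self.pc]
            self.regs[dst] += value
            self.pc += 1
            if self.pc % self.interval == 0:
                self.save_checkpoint()
        return self.regs


program = [(0, 1), (1, 2), (0, 3), (2, 4), (1, 5)]
clean = CheckpointedMachine().run(program)
recovered = CheckpointedMachine().run(program, fault_at=4)
```

In this sketch every checkpoint reads all registers, which is the kind of register-file traffic the abstract describes as a cost for embedded processors.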

12:00 | 6.3.3 | ERROR RESILIENCE AND ENERGY EFFICIENCY: AN LDPC DECODER DESIGN STUDY
Speaker:
Philipp Schläfer, University of Kaiserslautern, DE
Authors:
Philipp Schläfer (1), Chu-Hsiang Huang (2), Clayton Schoeny (2), Christian Weis (1), Yao Li (3), Norbert Wehn (1) and Lara Dolecek (2)
(1) University of Kaiserslautern, DE; (2) University of California, Los Angeles, US; (3) Akamai Inc., US
Abstract
Iterative decoding algorithms for low-density parity-check (LDPC) codes have an inherent fault tolerance. In this paper, we exploit this robustness and optimize an LDPC decoder for high energy efficiency: we reduce energy consumption by opportunistically increasing error rates in decoder memories, while still achieving successful decoding in the final iteration. We develop a theory-guided unequal error protection (UEP) technique. UEP is implemented using dynamic voltage scaling that controls the error probability in the decoder memories on a per-iteration basis. Specifically, via a density evolution analysis of an LDPC decoder, we first formulate the optimization problem of choosing an appropriate error rate for the decoder memories to achieve successful decoding under minimal energy consumption. We then propose a low-complexity greedy algorithm to solve this optimization problem and map the resulting error rates to the corresponding supply voltage levels of the decoder memories in each iteration of the decoding algorithm. We demonstrate the effectiveness of our approach via ASIC synthesis results of a decoder for the LDPC code in the IEEE 802.11ad standard, implemented in 28 nm FD-SOI technology. The proposed scheme achieves an increase in energy efficiency of up to 40% compared to the state-of-the-art solution.

12:15 | 6.3.4 | RUNTIME INTERVAL OPTIMIZATION AND DEPENDABLE PERFORMANCE FOR APPLICATION-LEVEL CHECKPOINTING
Speaker:
Dimitrios Rodopoulos, ICCS/NTUA, GR
Authors:
Apostolos Kokolis (1), Alexandros Mavrogiannis (1), Dimitrios Rodopoulos (2), Christos Strydis (3) and Dimitrios Soudris (1)
(1) NTUA, GR; (2) ICCS/NTUA, GR; (3) Erasmus MC, NL
Abstract
As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.
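The idea of adapting the checkpoint interval to a changing failure rate can be sketched as follows. The abstract does not say how DepMan computes its interval, so this sketch substitutes Young's classical first-order approximation, T = sqrt(2 * C * MTBF), as an assumption. The function name and the example numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, failure_rate_per_s):
    """Checkpoint interval from Young's first-order approximation.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    failure_rate_per_s: current failure rate (1 / MTBF), in failures per second.
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("cost and failure rate must be positive")
    mtbf_s = 1.0 / failure_rate_per_s
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Recompute the interval as the observed failure rate changes at runtime
# (illustrative values only).
observed_rates = [1e-4, 4e-4, 1e-3]
intervals = [young_interval(2.0, rate) for rate in observed_rates]
```

With a fixed checkpoint cost, a higher failure rate yields a shorter interval, so a runtime manager that tracks the failure rate would checkpoint more often when errors become frequent.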

12:30 | IP3-1, 631 | A FLEXIBLE INEXACT TMR TECHNIQUE FOR SRAM-BASED FPGAS
Speaker:
Akash Kumar, Technische Universität Dresden, DE
Authors:
Shyamsundar Venkataraman (1), Rui Santos (1) and Akash Kumar (2)
(1) National University of Singapore, SG; (2) Technische Universität Dresden, DE
Abstract
Single Event Upsets (SEUs) inadvertently change the logic memory and thereby the configuration of Field Programmable Gate Arrays (FPGAs), leading to their incorrect functioning. Traditional methods to tolerate such faults include Triple Modular Redundancy (TMR). However, such a method has a high overhead in terms of power and area. Moreover, the inexact methods used in ASICs to overcome this problem are not efficient when applied to FPGAs. Therefore, this paper proposes a novel heuristic-based technique to tolerate faults in SRAM-based FPGAs by using inexact modules in conjunction with TMR, thus reducing the area and power overhead of the design. Experiments run on various MCNC benchmark circuits show the accuracy of the proposed technique. They also show that the design solutions found through this technique differ by only 0.52% on average from the optimal ones, and that savings of up to 84.4% in computation time can be reached.
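For readers unfamiliar with TMR, the following minimal sketch shows classic, exact TMR with a bitwise majority voter. It does not model the inexact modules or the heuristic proposed in the paper. The function names and the `upset_mask` parameter, which models an SEU flipping bits in one replica's output, are assumptions for illustration.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority over three integer outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, upset_mask=0):
    """Evaluate three replicas of `module` and vote on the result.

    upset_mask (hypothetical) flips bits in the first replica's output
    to model a single event upset.
    """
    r0 = module(x) ^ upset_mask   # replica hit by the upset
    r1 = module(x)
    r2 = module(x)
    return majority_vote(r0, r1, r2)


def increment8(x):
    """Example module: 8-bit increment."""
    return (x + 1) & 0xFF


golden = increment8(41)
voted = tmr(increment8, 41, upset_mask=0b0000_0100)
```

Because only one replica is corrupted, the voter masks the flipped bit and `voted` matches `golden`. The price is three copies of the module plus the voter, which is the area and power overhead the abstract refers to.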

12:30 | End of session
Lunch Break in Großer Saal + Saal 1
Keynote Lecture in "Saal 2" 14:00 - 14:30