3.3 Methods and Characterisation techniques for Reliability

Time

Label

Presentation Title
Authors

14:30

3.3.1

(Best Paper Award Candidate)
NEW METHOD FOR THE AUTOMATED MASSIVE CHARACTERIZATION OF BIAS TEMPERATURE INSTABILITY IN CMOS TRANSISTORS
Speaker:
Pablo Sarazá Canflanca, Universidad de Sevilla, ES
Authors:
Pablo Saraza-Canflanca¹, Javier Diaz-Fortuny², Rafael Castro-Lopez¹, Elisenda Roca¹, Javier Martin-Martinez², Rosana Rodriguez², Montserrat Nafria² and Francisco Vidal Fernandez¹
¹Instituto de Microelectrónica de Sevilla, ES; ²Universitat Autonoma de Barcelona UAB, ES
Abstract
Bias Temperature Instability (BTI) has become a critical issue for circuit reliability. This phenomenon has been found to have a stochastic and discrete nature in nanometer-scale CMOS technologies. To account for this random nature, massive experimental characterization is necessary so that the extracted model parameters are sufficiently accurate. However, automated analysis tools for extracting the BTI parameters from the large amount of data generated in those massive characterization tests are lacking. This paper presents a novel algorithm that allows precise and fully automated parameter extraction from experimental BTI recovery current traces. The algorithm is based on Maximum Likelihood Estimation principles and is able to extract, in a robust and exact manner, the threshold voltage shifts and emission times associated with oxide trap emissions during BTI recovery, which are required to properly model the phenomenon.
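
As a toy illustration of the maximum-likelihood idea only (this is not the algorithm of the paper), assume the emission times of a single oxide trap are independent and exponentially distributed with time constant tau. Under that assumption the maximum-likelihood estimate of tau is simply the sample mean. The function names, the value of `true_tau` and the synthetic data below are all hypothetical.

```python
import math
import random


def mle_emission_time_constant(emission_times):
    """MLE of tau for i.i.d. exponential samples, which is the sample mean."""
    if not emission_times:
        raise ValueError("need at least one emission time")
    return sum(emission_times) / len(emission_times)


def exponential_log_likelihood(emission_times, tau):
    """Log-likelihood of the samples under an exponential law with mean tau."""
    n = len(emission_times)
    return -n * math.log(tau) - sum(emission_times) / tau


# Synthetic emission times (seconds) drawn from an assumed time constant.
rng = random.Random(0)
true_tau = 2.0e-3
samples = [rng.expovariate(1.0 / true_tau) for _ in range(5000)]

tau_hat = mle_emission_time_constant(samples)
print(f"estimated tau = {tau_hat:.3e} s (assumed true value {true_tau:.3e} s)")
```

The log-likelihood helper is included so that the estimate can be checked against neighbouring values of tau; a real extraction flow would first have to detect the discrete steps in the measured recovery trace, which this sketch does not attempt.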

15:00

3.3.2

GUILTY AS CHARGED: COMPUTATIONAL RELIABILITY THREATS POSED BY ELECTROSTATIC DISCHARGE-INDUCED SOFT ERRORS
Speaker:
Keven Feng, University of Illinois at Urbana Champaign, US
Authors:
Keven Feng, Sandeep Vora, Rui Jiang, Elyse Rosenbaum and Shobha Vasudevan, ECE at Univ. of Illinois at Urbana-Champaign, US
Abstract
Electrostatic discharge (ESD) has been shown to cause severe reliability hazards at the physical level, resulting in permanent and transient failures. We present the first analysis of the effects of ESD-induced errors on instruction-level computation. Our data was measured on a microcontroller test chip fabricated for this study, with discharges from a controlled ESD gun. Cosmic-ray-induced soft errors have been widely researched and modeled as single event upsets (SEUs). Our observations across multiple trials on 3 test chips show that, in contrast to radiation-induced errors, ESD can cause much more widespread errors than SEUs. In our trials, we observe system hangs and clock glitches, which are serious errors. We also observe errors in the following categories: Category A, multiple bit corruptions across multiple registers; Category B, multiple bit corruptions in the same register; and Category C, single bit corruptions across multiple registers. At the instruction level, these errors manifest as system hangs and serious malfunctioning of I/O operations, interrupt operations, and data and program memory. We demonstrate that ESD-induced errors form a significant reliability threat to higher-level functionality, warranting modeling and mitigation techniques.
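
The three corruption categories can be made concrete with a small classifier over register snapshots taken before and after a discharge. This is only a sketch of one possible reading of the categories: the register names, the snapshot format (a dictionary of integers with the same keys in both snapshots) and the tie-breaking rule (Category A as soon as any of several corrupted registers has more than one flipped bit) are assumptions, not details taken from the paper.

```python
def flipped_bits_per_register(before, after):
    """Map each corrupted register name to the number of bits that changed."""
    return {
        name: bin(before[name] ^ after[name]).count("1")
        for name in before
        if before[name] != after[name]
    }


def classify_corruption(before, after):
    """Return 'A', 'B', 'C', 'single-bit' or 'none' for a pair of snapshots."""
    flips = flipped_bits_per_register(before, after)
    if not flips:
        return "none"
    multi_bit = any(count > 1 for count in flips.values())
    if len(flips) > 1:
        # Several registers corrupted: A if any has multiple flips, else C.
        return "A" if multi_bit else "C"
    # Exactly one register corrupted: B if multiple bits flipped.
    # A lone single-bit flip falls outside categories A to C.
    return "B" if multi_bit else "single-bit"


before = {"r0": 0b0000, "r1": 0b1111, "r2": 0b1010}
print(classify_corruption(before, {"r0": 0b0011, "r1": 0b1110, "r2": 0b1010}))
print(classify_corruption(before, {"r0": 0b0110, "r1": 0b1111, "r2": 0b1010}))
print(classify_corruption(before, {"r0": 0b0001, "r1": 0b0111, "r2": 0b1010}))
```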

15:30

3.3.3

METHODOLOGY FOR APPLICATION-DEPENDENT DEGRADATION ANALYSIS OF MEMORY TIMING
Speaker:
Daniel Kraak, Delft University of Technology, NL
Authors:
Daniel Kraak¹, Innocent Agbo¹, Mottaqiallah Taouil¹, Said Hamdioui¹, Pieter Weckx², Stefan Cosemans² and Francky Catthoor²
¹Delft University of Technology, NL; ²imec vzw., BE
Abstract
Memory designs typically contain design margins to compensate for aging. As aging impact becomes more severe with technology scaling, it is crucial to accurately predict such impact to prevent overestimation or underestimation of the margins. This paper proposes a methodology to accurately and efficiently analyze the impact of aging on the memory's digital logic (e.g., timing circuit and address decoder) while considering realistic workloads extracted from applications. To demonstrate the superiority of the methodology, we analyzed the degradation of the L1 data and instruction caches for an ARM v8-a processor using both our methodology and state-of-the-art methods. The results show that the existing methods may significantly over- or underestimate the impact (e.g., the decoder margin by up to 221% and the access time by up to 20%) compared with the proposed scheme. In addition, the results show that in general the instruction cache has the highest degradation. For example, its access time degrades by up to 9% and its decoder margin by up to 44%.

16:00

IP1-14, 303

CHIP HEALTH TRACKING USING DYNAMIC IN-SITU DELAY MONITORING
Speaker:
Hadi Ahmadi Balef, Eindhoven University of Technology, NL
Authors:
Hadi Ahmadi Balef¹, Kees Goossens² and José Pineda de Gyvez¹
¹Eindhoven University of Technology, NL; ²Eindhoven University of Technology, NL
Abstract
Tracking the gradual effect of silicon aging on circuit delays requires fine-grained slack monitoring. Conventional slack monitoring techniques aim to measure the worst-case static slack, i.e. the slack of the longest timing path. In sharp contrast to the conventional techniques, we propose a novel technique that is based on dynamic excitation of in-situ delay monitors (i.e. the dynamic excitation of timing paths that are monitored). As delays degrade, path delays increase and the monitors are excited more frequently. With the proposed technique, a fine-grained signature of delay degradation is extracted from the excitation rate of the monitors. The in-situ monitors are inserted at intermediate points along timing paths to increase the sensitivity of the signature to delay degradation. A new efficient monitor insertion algorithm is also proposed that reduces the number of monitors by ~2.1X compared to other works for an ARM Cortex M0 processor.
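
The link between delay degradation and monitor excitation rate can be illustrated with a very small Monte Carlo model. Everything here is an assumption made for illustration: the path delay exercised in each cycle is drawn uniformly between 0.4 and 0.8 clock periods, aging is modelled as a single multiplicative factor, and a monitor counts as excited when the aged delay exceeds the clock period minus a guard window. None of these numbers come from the paper.

```python
import random


def excitation_rate(aging_factor, clock_period=1.0, guard_window=0.25,
                    cycles=20000, seed=0):
    """Fraction of cycles in which a hypothetical in-situ monitor is excited."""
    rng = random.Random(seed)
    threshold = clock_period - guard_window
    excited = 0
    for _ in range(cycles):
        # Data-dependent delay of the path exercised in this cycle (unaged).
        fresh_delay = rng.uniform(0.4, 0.8)
        if fresh_delay * aging_factor > threshold:
            excited += 1
    return excited / cycles


aging_factors = (1.00, 1.05, 1.10, 1.15)
rates = [excitation_rate(k) for k in aging_factors]
for k, rate in zip(aging_factors, rates):
    print(f"aging factor {k:.2f}: excitation rate {rate:.3f}")
```

Because the same seed is reused for every aging factor, the same sequence of fresh delays is compared against the threshold each time, so the printed rates can only stay equal or grow as the aging factor increases.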

16:01

IP1-15, 541

PCFI: PROGRAM COUNTER GUIDED FAULT INJECTION FOR ACCELERATING GPU RELIABILITY ASSESSMENT
Speaker:
Paolo Rech, UFRGS, BR
Authors:
Fritz Previlon, Charu Kalra, Devesh Tiwari and David Kaeli, Northeastern University, US
Abstract
Reliability has become a first-class design objective for GPU devices due to the increasing soft-error rate. To assess the reliability of GPU programs, researchers rely on software fault-injection methods. Unfortunately, the software fault-injection process is prohibitively expensive, requiring multiple days to complete a statistically sound fault-injection campaign. To address this challenge, this paper proposes a novel fault-injection method, PCFI, that reduces the number of fault injections by exploiting the predictability of the fault-injection outcome based on the program counter of the instruction affected by the soft error. Evaluation on a variety of GPU programs covering a wide range of application domains shows that PCFI reduces the time to complete fault-injection campaigns by 22% on average without sacrificing accuracy.
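
One way to picture how program-counter predictability can cut the number of injections is the pruning loop below. It is a hypothetical sketch, not the PCFI algorithm: fault sites are grouped by program counter, a few pilot injections are run per group, and when all pilots agree the remaining sites of that group are assumed to share the outcome. The `inject` callback, the pilot count and the synthetic outcome model are all invented for the example.

```python
from collections import Counter, defaultdict


def pc_guided_campaign(fault_sites, inject, pilot_per_pc=3):
    """Estimate outcome counts while skipping injections for predictable PCs.

    fault_sites: iterable of (pc, site_id) tuples.
    inject: callable taking a site and returning an outcome label.
    Returns (estimated outcome Counter, number of injections actually run).
    """
    by_pc = defaultdict(list)
    for site in fault_sites:
        by_pc[site[0]].append(site)

    estimated = Counter()
    injections = 0
    for sites in by_pc.values():
        pilots = sites[:pilot_per_pc]
        remaining = sites[pilot_per_pc:]
        pilot_outcomes = [inject(site) for site in pilots]
        injections += len(pilots)
        if len(set(pilot_outcomes)) == 1:
            # Pilots agree: extrapolate the outcome to every site of this PC.
            estimated[pilot_outcomes[0]] += len(sites)
        else:
            # Pilots disagree: fall back to injecting every remaining site.
            estimated.update(pilot_outcomes)
            for site in remaining:
                estimated[inject(site)] += 1
            injections += len(remaining)
    return estimated, injections


def synthetic_inject(site):
    """Stand-in for a real fault injector; outcomes are made up."""
    pc, site_id = site
    if pc % 3 == 0:
        return "masked"
    if pc % 3 == 1:
        return "sdc"
    return "masked" if site_id % 2 == 0 else "crash"


sites = [(pc, i) for pc in range(10) for i in range(20)]
estimated, injections = pc_guided_campaign(sites, synthetic_inject)
print(dict(estimated), f"{injections} of {len(sites)} injections run")
```

In this synthetic setup seven of the ten program counters always give the same outcome, so only their pilots are injected. The extrapolation is exact here only because the toy outcomes are deterministic per program counter; with real programs it would introduce an estimation error that has to be evaluated.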

16:02

IP1-16, 696

CHARACTERIZING THE RELIABILITY AND THRESHOLD VOLTAGE SHIFTING OF 3D CHARGE TRAP NAND FLASH
Speaker:
Weihua Liu, Huazhong University of Science and Technology, CN
Authors:
Weihua Liu¹, Fei Wu¹, Meng Zhang¹, Yifei Wang¹, Zhonghai Lu², Xiangfeng Lu³ and Changsheng Xie¹
¹Huazhong University of Science and Technology, CN; ²KTH Royal Institute of Technology, SE; ³Beijing Memblaze Technology Co., Ltd., CN
Abstract
3D charge trap (CT) triple-level cell (TLC) NAND flash is gradually becoming a mainstream storage component due to its high storage capacity and performance, but it raises concerns about reliability. Fault tolerance and data management schemes are capable of improving reliability. Designing a more efficient solution, however, requires an understanding of the reliability characteristics of 3D CT TLC NAND flash. To facilitate such understanding, we use a real-world testing platform to investigate the reliability characteristics, including the raw bit error rate (RBER) and the threshold voltage (Vth) shifting features, after exposure to various disturbances. We give an analysis of why these characteristics exist in 3D CT TLC NAND flash. We hope these observations can guide designers in proposing highly efficient solutions to the reliability problem.
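
For reference, the raw bit error rate is the number of bits read back incorrectly divided by the total number of bits read, measured before any error correction. A minimal sketch follows; the function name and the byte patterns are hypothetical, and a real measurement would read pages from the device under test.

```python
def raw_bit_error_rate(written: bytes, read_back: bytes) -> float:
    """Fraction of bits that differ between the written and read-back data."""
    if len(written) != len(read_back):
        raise ValueError("buffers must have the same length")
    if not written:
        raise ValueError("buffers must not be empty")
    bit_errors = sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))
    return bit_errors / (8 * len(written))


written = bytes([0xFF, 0x00, 0xAA, 0x55])
read_back = bytes([0xFE, 0x00, 0xAA, 0x54])  # two flipped bits in 32
print(raw_bit_error_rate(written, read_back))
```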

16:03

IP1-17, 882

HIDDEN DELAY FAULT SENSOR FOR TEST, RELIABILITY AND SECURITY
Speaker:
Giorgio Di Natale, CNRS - TIMA, FR
Authors:
Giorgio Di Natale¹, Elena Ioana Vatajelu², Kalpana Senthamarai Kannan² and Lorena Anghel³
¹LIRMM, FR; ²TIMA, FR; ³Grenoble-Alpes University, FR
Abstract
In this paper we present a novel hidden-delay-fault sensor design and a preliminary analysis of its circuit integration and applicability. In our proposed method, delay sensing is achieved by sampling data on both rising and falling clock edges and using a variable duty cycle to control the range of the sensed delay fault. The main advantages of our proposed method are that it works at nominal frequency, covers a wide range of delay faults, and is versatile in its applicability. It can be used (i) during testing to perform user-defined hidden-delay-fault tests, (ii) for estimating reliability degradation due to process and environmental variations and ageing, and (iii) in security to detect the insertion of Trojan horses that alter the path delay.

16:03

IP1-18, 219

EFFECT OF DEVICE VARIATION ON MAPPING BINARY NEURAL NETWORK TO MEMRISTOR CROSSBAR ARRAY
Speaker:
Wooseok Yi, POSTECH, KR
Authors:
Wooseok Yi, Yulhwa Kim and Jae-Joon Kim, Pohang University of Science and Technology, KR
Abstract
In memristor crossbar array (MCA)-based neural network hardware, it is generally assumed that all wordlines (WLs) are simultaneously enabled for parallel matrix-vector multiplication (MxV) operation. However, the error probability of MxV in an MCA increases as the resistance ratio (R-ratio) of a memristor decreases and as the resistance variation and the number of simultaneously activated WLs increase. In this paper, we analyze the effect of R-ratio and variation of memristor devices on the read sense margin and inference accuracy of MCA-based Binary Neural Network (BNN) hardware. We first show that only a limited number of WLs should be enabled to ensure correct MxV output when the R-ratio is small. On the other hand, we also show that, if the resistance variation becomes higher than a certain level, simultaneous activation of a large number of WLs produces higher accuracy even when the R-ratio is small. Based on the analysis, we propose the Accuracy Estimation (AE) factor to find the optimal number of WLs that are simultaneously activated.
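
The dependence of the MxV error probability on the R-ratio and on the number of activated wordlines can be reproduced qualitatively with a small Monte Carlo model of one crossbar column. The model is an assumption made for illustration and is not the analysis or the AE factor from the paper: every activated cell contributes its conductance to the column current at unit read voltage, the on-state conductance is 1 and the off-state conductance is 1 / R-ratio, each conductance is multiplied by a lognormal factor with shape parameter `sigma`, and an ideal sense circuit rounds the current to the nearest number of on-state cells.

```python
import random


def mxv_error_probability(num_wordlines, r_ratio, sigma, trials=2000, seed=0):
    """Estimate how often one column's on-cell count is decoded wrongly."""
    rng = random.Random(seed)
    g_on = 1.0
    g_off = g_on / r_ratio  # r_ratio must be greater than 1
    errors = 0
    for _ in range(trials):
        weights = [rng.randint(0, 1) for _ in range(num_wordlines)]
        # Column current at unit read voltage with device-to-device variation.
        current = sum(
            (g_on if w else g_off) * rng.lognormvariate(0.0, sigma)
            for w in weights
        )
        # Ideal sensing: remove the nominal off-state baseline, then round.
        decoded = round((current - num_wordlines * g_off) / (g_on - g_off))
        if decoded != sum(weights):
            errors += 1
    return errors / trials


for n in (4, 16, 64):
    p = mxv_error_probability(n, r_ratio=5, sigma=0.1)
    print(f"{n} wordlines, R-ratio 5: error probability {p:.3f}")
for r in (2, 50):
    p = mxv_error_probability(16, r_ratio=r, sigma=0.1)
    print(f"16 wordlines, R-ratio {r}: error probability {p:.3f}")
```

In this simplified model the accumulated conductance variation grows with the number of activated wordlines while the current step per on-state cell stays fixed, and a smaller R-ratio both shrinks that step and enlarges the off-state contribution to the variation. The sketch covers only the column read error, not BNN inference accuracy.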

16:00

End of session
Coffee Break in Exhibition Area

Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Lunch Breaks (Lunch Area)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the Lunch Area to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 26, 2019

Coffee Break 10:30 - 11:30
Lunch Break 13:00 - 14:30
Keynote Lecture "Leonardo da Vinci, Humanism and Engineering between Florence and Milan" by Claudio Giorgione in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Wednesday, March 27, 2019

Coffee Break 10:00 - 11:00
Lunch Break 12:30 - 14:30
Keynote Lecture "Heterogeneous, High Scale Computing in the Era of Intelligent, Cloud-Connected" by David Pellerin, Amazon, US in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Thursday, March 28, 2019

Coffee Break 10:00 - 11:00
University Booth Best Demo Award Presentation at the University Booth 10:30
Lunch Break 12:30 - 14:00
Keynote Lecture "A Fundamental Look at Models and Intelligence" by Edward A. Lee, University of California, Berkeley, US in room 1 13:20 - 13:50
Coffee Break 15:30 - 16:00