9.6 Intelligent Dependable Systems

Time

Label

Presentation Title
Authors

08:30

9.6.1

THERMAL-CYCLING-AWARE DYNAMIC RELIABILITY MANAGEMENT IN MANY-CORE SYSTEM-ON-CHIP
Speaker:
Mohammad-Hashem Haghbayan, University of Turku, FI
Authors:
Mohammad-Hashem Haghbayan¹, Antonio Miele², Zhuo Zou³, Hannu Tenhunen¹ and Juha Plosila¹
¹University of Turku, FI; ²Politecnico di Milano, IT; ³Nanjing University of Science and Technology, CN
Abstract
Dynamic Reliability Management (DRM) is a common approach to mitigating aging and wear-out effects in multi-/many-core systems. State-of-the-art DRM approaches apply fine-grained control to resource management to increase or balance chip reliability while respecting other system constraints, e.g., performance and power budget. Such approaches, acting on various knobs such as workload mapping and scheduling, Dynamic Voltage/Frequency Scaling (DVFS) and Per-Core Power Gating (PCPG), have been shown to work well for various aging mechanisms, such as electromigration and Negative-Bias Temperature Instability (NBTI). However, we claim that they do not suffice for thermal cycling. We therefore propose a novel thermal-cycling-aware DRM approach for shared-memory many-core systems running multi-threaded applications. The approach applies fine-grained control capable of reducing both temperature levels and temperature variations. Experimental evaluations show that the proposed approach achieves a 39% longer lifetime than past approaches.
Download Paper (PDF; Only available from the DATE venue WiFi)

09:00

9.6.2

DETERMINISTIC CACHE-BASED EXECUTION OF ON-LINE SELF-TEST ROUTINES IN MULTI-CORE AUTOMOTIVE SYSTEM-ON-CHIPS
Speaker:
Andrea Floridia, Politecnico di Torino, IT
Authors:
Andrea Floridia¹, Tzamn Melendez Carmona¹, Davide Piumatti¹, Annachiara Ruospo¹, Ernesto Sanchez¹, Sergio De Luca², Rosario Martorana² and Mose Alessandro Pernice²
¹Politecnico di Torino, IT; ²STMicroelectronics, IT
Abstract
Traditionally, the use of caches and the deterministic execution of on-line self-test procedures have been considered mutually exclusive. At the same time, software executed in a multi-core context suffers from limited timing predictability due to higher system bus contention. For self-test procedures, this higher contention might lead to fluctuating fault coverage or even the failure of some test programs. This paper presents a cache-based strategy for achieving both deterministic behaviour and stable fault coverage from the execution of self-test procedures in multi-core systems. The proposed strategy is applied to two representative modules negatively affected by multi-core execution: the synchronous imprecise interrupts logic and the pipeline hazard detection unit. The experiments illustrate that it is possible to achieve stable execution while also improving on state-of-the-art approaches for the on-line testing of embedded microprocessors. The effectiveness of the methodology was assessed on all three cores of a multi-core industrial System-on-Chip intended for automotive ASIL D applications.
Download Paper (PDF; Only available from the DATE venue WiFi)

09:30

9.6.3

FT-CLIPACT: RESILIENCE ANALYSIS OF DEEP NEURAL NETWORKS AND IMPROVING THEIR FAULT TOLERANCE USING CLIPPED ACTIVATION
Authors:
Le-Ha Hoang¹, Muhammad Abdullah Hanif² and Muhammad Shafique²
¹TU Wien, AT; ²TU Wien, AT
Abstract
Deep Neural Networks (DNNs) are being widely adopted for safety-critical applications, e.g., healthcare and autonomous driving. Inherently, they are considered to be highly error-tolerant. However, recent studies have shown that hardware faults that affect the parameters of a DNN (e.g., weights) can have a drastic impact on its classification accuracy. In this paper, we perform a comprehensive error resilience analysis of DNNs subjected to hardware faults (e.g., permanent faults) in the weight memory. The outcome of this analysis is leveraged to propose a novel error mitigation technique that squashes high-intensity faulty activation values to alleviate their impact. We achieve this by replacing the unbounded activation functions with their clipped versions. We also present a method to systematically define the clipping values of the activation functions so as to increase the resilience of the networks against faults. We evaluate our technique on the AlexNet and VGG-16 DNNs trained for the CIFAR-10 dataset. The experimental results show that our mitigation technique significantly improves the networks' resilience to faults. For example, the proposed technique offers on average a 68.92% improvement in the classification accuracy of the resilience-optimized VGG-16 model at a fault rate of 1 × 10⁻⁵, compared to the base network without any fault mitigation.
Download Paper (PDF; Only available from the DATE venue WiFi)
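
As a rough illustration of the clipped-activation idea in the abstract above, the following minimal Python sketch compares an unbounded ReLU with a clipped variant. The function names, the threshold value and the sample activations are hypothetical, and mapping out-of-range values to zero is an assumption of this sketch (saturating at the threshold would be another option); the abstract does not specify these details.

```python
def relu(x: float) -> float:
    """Standard unbounded ReLU."""
    return max(0.0, x)


def clipped_relu(x: float, threshold: float) -> float:
    """Clipped ReLU (assumed form): values above `threshold` are mapped to 0.

    Assumption: a very large activation is treated as the likely result of a
    faulty weight and is suppressed instead of being propagated.
    """
    if x < 0.0 or x > threshold:
        return 0.0
    return x


# Hypothetical pre-activation values; the last one mimics the effect of a
# high-order bit fault in a stored weight.
activations = [-1.5, 0.7, 3.2, 1.0e30]
threshold = 6.0  # hypothetical clipping value

unbounded = [relu(a) for a in activations]
clipped = [clipped_relu(a, threshold) for a in activations]

print(unbounded)
print(clipped)
```

With the unbounded function the faulty value passes through unchanged and can dominate later layers, whereas the clipped version removes it while leaving in-range activations untouched.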

10:00

IP4-16, 221

AN APPROXIMATION-BASED FAULT DETECTION SCHEME FOR IMAGE PROCESSING APPLICATIONS
Speaker:
Antonio Miele, Politecnico di Milano, IT
Authors:
Matteo Biasielli, Luca Cassano and Antonio Miele, Politecnico di Milano, IT
Abstract
Image processing applications exhibit an intrinsic resilience to faults. In this application field the classical Duplication with Comparison (DWC) scheme, where output images are discarded as soon as the two replicas' outputs differ in at least one pixel, may be over-conservative. This paper introduces a novel lightweight fault detection scheme for image processing applications: i) it extends the DWC scheme by substituting one of the two exact replicas with a faster approximated one; and ii) it features a Neural Network-based checker designed to distinguish between usable and unusable images instead of faulty and non-faulty ones. The application of the hardening scheme to a case study has shown an execution time reduction of 27% to 34% with respect to DWC, while guaranteeing a comparable fault detection capability.
Download Paper (PDF; Only available from the DATE venue WiFi)
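
To make the contrast in the abstract above concrete, here is a small Python sketch of a pixel-exact DWC comparison next to a usability check on an exact/approximate pair. Note the substitution: the paper uses a Neural Network-based checker, whereas this sketch stands in a simple mean-absolute-difference threshold purely for illustration. All names, pixel values and the tolerance are hypothetical.

```python
from typing import List

Image = List[int]  # flattened grayscale image, one int per pixel


def dwc_exact_match(out_a: Image, out_b: Image) -> bool:
    """Classical DWC: accept only if both replicas agree on every pixel."""
    return out_a == out_b


def mean_abs_diff(out_a: Image, out_b: Image) -> float:
    """Average per-pixel absolute difference between two images."""
    return sum(abs(a - b) for a, b in zip(out_a, out_b)) / len(out_a)


def usable_by_threshold(exact: Image, approx: Image, tolerance: float) -> bool:
    """Hypothetical stand-in for the paper's Neural Network-based checker.

    Assumption: an output is 'usable' if it stays close to the approximate
    replica on average; the real checker is a trained network.
    """
    return mean_abs_diff(exact, approx) <= tolerance


approx = [10, 21, 30, 39]    # output of the faster approximate replica
correct = [10, 20, 30, 40]   # fault-free output of the exact replica
faulty = [10, 20, 255, 40]   # exact replica output with one corrupted pixel

print(dwc_exact_match(correct, approx))           # strict comparison
print(usable_by_threshold(correct, approx, 2.0))  # tolerant check
print(usable_by_threshold(faulty, approx, 2.0))   # corrupted output
```

A strict pixel-wise comparison cannot be used once one replica is approximate, since it would reject even fault-free outputs; this is why the scheme needs a checker that judges usability instead.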

10:00

End of session