9.6 Intelligent Dependable Systems

Date: Thursday 12 March 2020
Time: 08:30 - 10:00
Location / Room: Lesdiguières

Chair:
Rishad Shafik, Newcastle University, GB

This session ranges from dependability approaches for multi-core SoCs, covering intelligent reliability management and on-line software-based self-test, to error-resilient AI systems, where the AI system is either redesigned to tolerate critical faults or used for error detection.

Time  Label  Presentation Title / Authors
08:30  9.6.1  THERMAL-CYCLING-AWARE DYNAMIC RELIABILITY MANAGEMENT IN MANY-CORE SYSTEM-ON-CHIP
Speaker:
Mohammad-Hashem Haghbayan, University of Turku, FI
Authors:
Mohammad-Hashem Haghbayan1, Antonio Miele2, Zhuo Zou3, Hannu Tenhunen1 and Juha Plosila1
1University of Turku, FI; 2Politecnico di Milano, IT; 3Nanjing University of Science and Technology, CN
Abstract
Dynamic Reliability Management (DRM) is a common approach to mitigate aging and wear-out effects in multi-/many-core systems. State-of-the-art DRM approaches apply fine-grained control on resource management to increase or balance chip reliability while considering other system constraints, e.g., performance and power budget. Such approaches, acting on various knobs such as workload mapping and scheduling, Dynamic Voltage/Frequency Scaling (DVFS) and Per-Core Power Gating (PCPG), have been shown to work well for various aging mechanisms, such as electromigration and Negative-Bias Temperature Instability (NBTI). However, we claim that they do not suffice for thermal cycling. Thus, we propose a novel thermal-cycling-aware DRM approach for shared-memory many-core systems running multi-threaded applications. The approach applies fine-grained control capable of reducing both temperature levels and temperature variations. The experimental evaluation demonstrated that the proposed approach achieves a 39% longer lifetime than past approaches.

09:00  9.6.2  DETERMINISTIC CACHE-BASED EXECUTION OF ON-LINE SELF-TEST ROUTINES IN MULTI-CORE AUTOMOTIVE SYSTEM-ON-CHIPS
Speaker:
Andrea Floridia, Politecnico di Torino, IT
Authors:
Andrea Floridia1, Tzamn Melendez Carmona1, Davide Piumatti1, Annachiara Ruospo1, Ernesto Sanchez1, Sergio De Luca2, Rosario Martorana2 and Mose Alessandro Pernice2
1Politecnico di Torino, IT; 2STMicroelectronics, IT
Abstract
Traditionally, the use of caches and the deterministic execution of on-line self-test procedures have been considered two mutually exclusive concepts. At the same time, software executed in a multi-core context suffers from limited timing predictability due to higher system bus contention. When dealing with self-test procedures, this higher contention might lead to fluctuating fault coverage or even the failure of some test programs. This paper presents a cache-based strategy for achieving both deterministic behaviour and stable fault coverage from the execution of self-test procedures in multi-core systems. The proposed strategy is applied to two representative modules negatively affected by multi-core execution: the synchronous imprecise interrupts logic and the pipeline hazard detection unit. The experiments illustrate that it is possible to achieve stable execution while also improving on state-of-the-art approaches for the on-line testing of embedded microprocessors. The effectiveness of the methodology was assessed on all three cores of a multi-core industrial System-on-Chip intended for automotive ASIL D applications.

09:30  9.6.3  FT-CLIPACT: RESILIENCE ANALYSIS OF DEEP NEURAL NETWORKS AND IMPROVING THEIR FAULT TOLERANCE USING CLIPPED ACTIVATION
Authors:
Le-Ha Hoang1, Muhammad Abdullah Hanif2 and Muhammad Shafique2
1TU Wien (TU Wien), AT; 2TU Wien, AT
Abstract
Deep Neural Networks (DNNs) are being widely adopted for safety-critical applications, e.g., healthcare and autonomous driving. Inherently, they are considered to be highly error-tolerant. However, recent studies have shown that hardware faults that impact the parameters of a DNN (e.g., weights) can have a drastic impact on its classification accuracy. In this paper, we perform a comprehensive error resilience analysis of DNNs subjected to hardware faults (e.g., permanent faults) in the weight memory. The outcome of this analysis is leveraged to propose a novel error mitigation technique which squashes the high-intensity faulty activation values to alleviate their impact. We achieve this by replacing the unbounded activation functions with their clipped versions. We also present a method to systematically define the clipping values of the activation functions that result in increased resilience of the networks against faults. We evaluate our technique on the AlexNet and VGG-16 DNNs trained for the CIFAR-10 dataset. The experimental results show that our mitigation technique significantly improves the network's resilience to faults. For example, the proposed technique offers on average a 68.92% improvement in the classification accuracy of the resilience-optimized VGG-16 model at a fault rate of 1 × 10⁻⁵, when compared to the base network without any fault mitigation.
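
The effect described in this abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes float32 weights, a single bit flip in the exponent field, and a clipped ReLU that maps any value above the threshold to zero (saturating at the threshold would be another plausible variant). The helper names `flip_bit`, `clipped_relu` and `neuron`, and the value of `THRESHOLD`, are hypothetical; the abstract does not specify how the clipping values are selected.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the float32 encoding of `value` (bit 0 is the LSB)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


def relu(x: float) -> float:
    """Unbounded activation: a huge faulty value passes straight through."""
    return max(0.0, x)


def clipped_relu(x: float, threshold: float) -> float:
    """Assumed clipped variant: values outside [0, threshold] become 0."""
    return x if 0.0 <= x <= threshold else 0.0


def neuron(inputs, weights, activation):
    """Weighted sum of the inputs followed by the given activation."""
    return activation(sum(i * w for i, w in zip(inputs, weights)))


inputs = [1.0, 2.0]
weights = [0.5, 0.25]
# Fault model for illustration only: flip the most significant exponent bit
# of the first weight, turning 0.5 into 2**127.
faulty_weights = [flip_bit(weights[0], 30), weights[1]]

THRESHOLD = 6.0  # hypothetical clipping value


def clip(x: float) -> float:
    return clipped_relu(x, THRESHOLD)


clean_out = neuron(inputs, weights, clip)
faulty_relu_out = neuron(inputs, faulty_weights, relu)
faulty_clipped_out = neuron(inputs, faulty_weights, clip)

print("fault-free, clipped ReLU:", clean_out)
print("faulty, plain ReLU:      ", faulty_relu_out)
print("faulty, clipped ReLU:    ", faulty_clipped_out)
```

In this toy case the fault-free output is unchanged by clipping, while the faulty activation of roughly 1.7e38 is suppressed instead of propagating to later layers.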

10:00  IP4-16, 221  AN APPROXIMATION-BASED FAULT DETECTION SCHEME FOR IMAGE PROCESSING APPLICATIONS
Speaker:
Antonio Miele, Politecnico di Milano, IT
Authors:
Matteo Biasielli, Luca Cassano and Antonio Miele, Politecnico di Milano, IT
Abstract
Image processing applications exhibit an intrinsic resilience to faults. In this application field the classical Duplication with Comparison (DWC) scheme, where output images are discarded as soon as the two replicas' outputs differ in at least one pixel, may be overly conservative. This paper introduces a novel lightweight fault detection scheme for image processing applications: i) it extends the DWC scheme by substituting one of the two exact replicas with a faster approximated one; and ii) it features a Neural Network-based checker designed to distinguish between usable and unusable images instead of faulty and fault-free ones. The application of the hardening scheme to a case study has shown an execution time reduction of between 27% and 34% w.r.t. DWC, while guaranteeing a comparable fault detection capability.
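
The difference between strict DWC and a usability-oriented check can be sketched as follows. This is an illustration under stated assumptions, not the scheme from the paper: the Neural Network-based checker is replaced here by a plain mean-absolute-difference threshold, images are small lists of grey-level rows, and the names `dwc_accepts`, `relaxed_accepts` and `TOLERANCE` are hypothetical.

```python
def dwc_accepts(out_a, out_b):
    """Classical DWC: accept only if the two replicas match in every pixel."""
    return out_a == out_b


def mean_abs_diff(img_a, img_b):
    """Average per-pixel absolute difference between two equally sized images."""
    total = sum(abs(pa - pb)
                for row_a, row_b in zip(img_a, img_b)
                for pa, pb in zip(row_a, row_b))
    count = sum(len(row) for row in img_a)
    return total / count


def relaxed_accepts(exact_out, approx_out, tolerance):
    """Stand-in checker (assumption): accept the image as usable when it stays
    close to the output of the faster, approximated replica."""
    return mean_abs_diff(exact_out, approx_out) <= tolerance


TOLERANCE = 1.0  # hypothetical usability threshold, in grey levels

exact = [[10, 20], [30, 40]]
approx = [[10, 21], [30, 40]]      # small deviation caused by approximation
corrupted = [[10, 20], [255, 40]]  # exact replica hit by a large fault

print(dwc_accepts(exact, approx))                    # strict comparison
print(relaxed_accepts(exact, approx, TOLERANCE))     # tolerant comparison
print(relaxed_accepts(corrupted, approx, TOLERANCE))
```

Strict comparison rejects the slightly different pair, whereas the tolerant check accepts it and still rejects the heavily corrupted image. A learned checker, as described in the abstract, would take the place of the fixed threshold.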

10:00  End of session