7.3 CPU and GPU microarchitecture dependability


Date: Wednesday 27 March 2019
Time: 14:30 - 16:00
Location / Room: Room 3

Chair:
Michail Maniatakos, NYU Abu Dhabi, AE

Co-Chair:
Nikolaos Foutris, University of Manchester, UK

This session first addresses processor dependability, specifically in the register renaming sub-system of out-of-order cores and in the L1 data cache. It then analyzes the main requirements to enable ISO26262 ASIL-D compliance for Commercial Off-The-Shelf (COTS) GPUs.

Time  Label  Presentation Title
Authors
14:30  7.3.1  (Best Paper Award Candidate)
ERROR-SHIELDED REGISTER RENAMING SUBSYSTEM FOR A DYNAMICALLY SCHEDULED OUT-OF-ORDER CORE
Authors:
Ron Gabor¹, Yiannakis Sazeides², Arkady Bramnik¹, Alexandros Andreou², Chrysostomos Nicopoulos², Karyofyllis Patsidis³, Dimitris Konstantinou³ and Giorgos Dimitrakopoulos³
¹Intel, IL; ²University of Cyprus, CY; ³Democritus University of Thrace, GR
Abstract
Emerging mission-critical and functional safety applications require high-performance processors that meet strict reliability requirements against random hardware failures. These requirements extend even to sub-systems within the core that, so far, may have been considered low-significance contributors to the processor failure rate. This paper identifies the register renaming sub-system of an out-of-order core as a prime example of where cost-efficient and non-intrusive protection can enable future processors to meet their reliability goals. We propose two hardware schemes that guard against failures in the register renaming sub-system of a core: a technique for the detection of random hardware errors in the physical register identifiers, and a method to recover from the detected errors.

15:00  7.3.2  LAEC: LOOK-AHEAD ERROR CORRECTION CODES IN EMBEDDED PROCESSORS L1 DATA CACHE
Authors:
Pedro Benedicte¹, Carles Hernandez², Jaume Abella² and Francisco Cazorla²
¹Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; ²Barcelona Supercomputing Center, ES
Abstract
As implementation technology shrinks, the presence of errors in cache memories is becoming an increasing issue in all computing domains. Critical systems, e.g. space and automotive, are especially exposed and susceptible to reliability issues. Furthermore, hardware designs in these systems are migrating to multicore systems with multi-level caches, in which write-through first-level data (DL1) caches have been shown to heavily harm average and guaranteed performance. While write-back DL1 caches solve this problem, they come with their own challenges: they need Error Correction Codes (ECC) to tolerate soft errors, but implementing DL1 ECC in simple embedded micro-controllers requires either complex hardware to squash instructions consuming erroneous data, or delayed delivery of data to correct potential errors, which impacts performance even if the process is pipelined. In this paper we present a low-complexity hardware mechanism to anticipate data fetch and error correction in DL1 so that (1) correct data is always delivered and (2) additional delays are avoided in most cases. This achieves both high guaranteed performance and an effective solution against errors.

15:30  7.3.3  HIGH-INTEGRITY GPU DESIGNS FOR CRITICAL REAL-TIME AUTOMOTIVE SYSTEMS
Speaker:
Sergi Alcaide Portet, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES
Authors:
Sergi Alcaide¹, Leonidas Kosmidis², Carles Hernandez² and Jaume Abella²
¹Universitat Politècnica de Catalunya - Barcelona Supercomputing Center (BSC), ES; ²Barcelona Supercomputing Center, ES
Abstract
Autonomous Driving (AD) imposes the use of high-performance hardware, such as GPUs, to perform object recognition and tracking in real time. However, unlike in the consumer electronics market, critical real-time AD functionalities require a high degree of resilience against faults, in line with the requirements of the automotive ISO26262 functional safety standard. ISO26262 imposes the use of some source of independent redundancy for the most critical functionalities so that a single fault cannot lead to a failure, with diverse dual-core lockstep (DCLS) being the preferred choice for computing devices. Unfortunately, GPUs do not support diverse DCLS by construction, thus failing to meet ISO26262 requirements efficiently. In this paper we propose lightweight modifications to GPUs to enable diverse DCLS for critical real-time applications without diminishing their performance for non-critical applications. In particular, we show how enabling specific mechanisms for software-controlled kernel scheduling in the GPU makes it possible to guarantee that redundant kernels are executed on different resources so that a single fault cannot lead to a failure, as imposed by ISO26262. Our results on a GPU simulator and an NVIDIA GPU demonstrate the viability of the approach and its effectiveness on high-performance GPU designs needed for AD systems.
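
Editorial illustration (not from the paper): the abstract gives no implementation details, so the following is only a minimal sketch of the general idea of running redundant copies of a kernel on disjoint resources and comparing the results. It assumes a toy model in which a GPU is a list of integer SM ids and a fault is a single-bit flip; the names `split_resources`, `run_on`, `redundant_launch` and `faulty_sm` are hypothetical and do not correspond to any real GPU API.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def split_resources(num_sms: int) -> Tuple[List[int], List[int]]:
    """Partition hypothetical SM ids into two disjoint halves."""
    if num_sms < 2:
        raise ValueError("need at least two SMs for redundant execution")
    half = num_sms // 2
    return list(range(half)), list(range(half, num_sms))


def run_on(
    sms: Sequence[int],
    kernel: Callable[[int], int],
    data: Sequence[int],
    faulty_sm: Optional[int] = None,
) -> List[int]:
    """Toy stand-in for a kernel launch restricted to the given SMs.

    Elements are assigned to SMs round-robin. If an element lands on the
    (assumed) faulty SM, its result gets a single-bit flip.
    """
    out = []
    for i, x in enumerate(data):
        sm = sms[i % len(sms)]
        y = kernel(x)
        if sm == faulty_sm:
            y ^= 1  # model a single-bit corruption
        out.append(y)
    return out


def redundant_launch(
    kernel: Callable[[int], int],
    data: Sequence[int],
    num_sms: int,
    faulty_sm: Optional[int] = None,
) -> Tuple[List[int], bool]:
    """Run the kernel twice on disjoint SM sets and compare the outputs.

    Returns the first copy's output and a flag that is True when both
    copies agree. Because the SM sets are disjoint, a fault in one SM
    can corrupt at most one of the two copies.
    """
    primary, shadow = split_resources(num_sms)
    a = run_on(primary, kernel, data, faulty_sm)
    b = run_on(shadow, kernel, data, faulty_sm)
    return a, a == b


if __name__ == "__main__":
    square = lambda x: x * x
    print(redundant_launch(square, [1, 2, 3, 4], num_sms=4))
    print(redundant_launch(square, [1, 2, 3, 4], num_sms=4, faulty_sm=1))
```

In this toy model the second call reports a mismatch because the corrupted SM belongs to only one of the two partitions; a real design would additionally need the diversity and scheduling guarantees the abstract refers to.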

16:00  IP3-14, 703  A FINE-GRAINED SOFT ERROR RESILIENT ARCHITECTURE UNDER POWER CONSIDERATIONS
Speaker:
Sajjad Hussain, Chair for Embedded Systems, KIT, Karlsruhe, DE
Authors:
Sajjad Hussain¹, Muhammad Shafique² and Joerg Henkel¹
¹Karlsruhe Institute of Technology, DE; ²Vienna University of Technology (TU Wien), AT
Abstract
Besides limited power budgets and the dark-silicon issue, soft errors are one of the most critical reliability issues in computing systems fabricated using nano-scale devices. During execution, different applications have varying performance, power/energy consumption and vulnerability properties. Different trade-offs can be devised to provide the required resiliency within the allowed power constraints. To exploit this behavior, we propose a novel soft error resilient architecture and the corresponding run-time system that enables power-aware fine-grained resiliency for different processor components. It selectively determines the reliability state of various components, such that the overall application reliability is improved under a given power budget. Our architecture reduces power by up to 16% and reliability degradation by up to 11% compared to state-of-the-art techniques.

16:01  IP3-15, 188  FINE-GRAINED HARDWARE MITIGATION FOR MULTIPLE LONG-DURATION TRANSIENTS ON VLIW FUNCTION UNITS
Speaker:
Angeliki Kritikakou, University of Rennes 1 - IRISA/INRIA, FR
Authors:
Rafail Psiakis¹, Angeliki Kritikakou¹ and Olivier Sentieys²
¹Univ Rennes/IRISA/INRIA, FR; ²INRIA, FR
Abstract
Technology scaling makes hardware more susceptible to radiation, which can cause multiple long-duration transient faults. In these cases, the affected function unit is usually considered faulty and is no longer used. To reduce the resulting performance degradation, the proposed hardware mechanism detects the faults that are still active during execution and re-schedules the instructions to use the fault-free components of the affected function units. The results show that multiple long-duration faults are mitigated with low performance, area, and power overhead.

16:00  End of session
Coffee Break in Exhibition Area


