10.3 System-level Dependability for Multicore and Real-time Systems


Time

Label

Presentation Title
Authors

11:00

10.3.1

IDENTIFYING THE MOST RELIABLE COLLABORATIVE WORKLOAD DISTRIBUTION IN HETEROGENEOUS DEVICES
Speaker:
Paolo Rech, UFRGS, BR
Authors:
Gabriel Piscoya Dávila, Daniel Oliveira, Philippe Navaux and Paolo Rech, UFRGS, BR
Abstract
The constant need for higher performance and reduced power consumption has led vendors to design heterogeneous devices that embed a traditional CPU and an accelerator, like a GPU or FPGA. When the CPU and the accelerator are used collaboratively, the device's computational performance reaches its peak. However, the larger amount of resources employed for computation has, potentially, the side effect of increasing the soft error rate. In this paper, we evaluate the reliability behaviour of AMD Kaveri Accelerated Processing Units executing a set of heterogeneous applications. We distribute the workload between the CPU and GPU and evaluate which configuration provides the lowest error rate or allows the computation of the highest amount of data before experiencing a failure. We show that, in most cases, the most reliable workload distribution is the one that delivers the highest performance. As experimentally proven, choosing the correct workload distribution can increase device reliability by up to 9x.

11:30

10.3.2

CE-BASED OPTIMIZATION FOR REAL-TIME SYSTEM AVAILABILITY UNDER LEARNED SOFT ERROR RATE
Speaker:
Liying Li, East China Normal University, CN
Authors:
Liying Li¹, Tongquan Wei¹, Junlong Zhou², Mingsong Chen¹ and X. Sharon Hu³
¹East China Normal University, CN; ²Nanjing University of Science and Technology, CN; ³University of Notre Dame, US
Abstract
As the density of integrated circuits continues to increase, the possibility that real-time systems suffer from transient and permanent failures rises significantly, resulting in degraded availability of system functionality. In this paper, we investigate the dynamic modeling of the transient failure rate based on a Back Propagation (BP) neural network, and propose an optimization strategy for system availability based on Cross Entropy (CE). Specifically, the neural network is trained using cross-layer simulation data obtained from SPICE simulation, while the CE-based optimization of system functionality availability is achieved by judiciously selecting an optimal supply voltage for processors under timing constraints. Simulation results show that the proposed method can improve system availability by up to 32% compared to benchmarking methods.
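The general idea of cross-entropy optimization over discrete supply voltages can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the voltage levels, the cycle counts, the deadline, and in particular the toy `failure_rate` model, which simply stands in for a learned soft error rate plus a permanent-failure term. The function `cross_entropy_select` samples voltage assignments from a per-processor categorical distribution, discards assignments that miss the deadline, and shifts the distribution toward the best-scoring samples.

```python
import random

# Hypothetical toy model (not from the paper): three processors, five voltage levels.
LEVELS = [0.8, 0.9, 1.0, 1.1, 1.2]   # candidate supply voltages (assumed)
CYCLES = [4.0, 5.0, 6.0]             # work per processor, arbitrary units (assumed)
DEADLINE = 15.5                      # timing constraint, arbitrary units (assumed)

def failure_rate(v):
    # Assumed shape: transient failures fall with voltage, permanent ones rise.
    return 10 ** (-5 * (v - 0.8)) + 0.2 * (v / 0.8) ** 4

def score(cand):
    # Higher is better: negated total failure rate as an availability proxy.
    return -sum(failure_rate(v) for v in cand)

def feasible(cand):
    # Toy timing model: execution time is inversely proportional to voltage.
    return sum(c / v for c, v in zip(CYCLES, cand)) <= DEADLINE

def cross_entropy_select(levels, n_items, score, feasible, iters=30,
                         samples=200, elite_frac=0.1, smooth=0.7, seed=0):
    rng = random.Random(seed)
    k = len(levels)
    # One categorical distribution per processor, initially uniform.
    probs = [[1.0 / k] * k for _ in range(n_items)]
    best, best_score = None, float("-inf")
    for _ in range(iters):
        pop = []
        for _ in range(samples):
            choice = [rng.choices(range(k), weights=probs[i])[0]
                      for i in range(n_items)]
            cand = [levels[c] for c in choice]
            if feasible(cand):
                pop.append((score(cand), choice))
        if not pop:
            continue  # no feasible sample this round; keep the distribution
        pop.sort(key=lambda t: t[0], reverse=True)
        elite = pop[:max(1, int(elite_frac * samples))]
        if elite[0][0] > best_score:
            best_score = elite[0][0]
            best = [levels[c] for c in elite[0][1]]
        # Move each distribution toward the empirical frequencies of the elite set.
        for i in range(n_items):
            counts = [0] * k
            for _, choice in elite:
                counts[choice[i]] += 1
            for j in range(k):
                probs[i][j] = (smooth * counts[j] / len(elite)
                               + (1 - smooth) * probs[i][j])
    return best, best_score

best, best_score = cross_entropy_select(LEVELS, len(CYCLES), score, feasible)
print(best, best_score)
```

Rejecting infeasible samples is only one way to handle the timing constraint; a penalty term in the score would be an alternative design.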

12:00

10.3.3

A DETERMINISTIC-PATH ROUTING ALGORITHM FOR TOLERATING MANY FAULTS ON WAFER-LEVEL NOC
Speaker:
Ying Zhang, Tongji University, CN
Authors:
Zhongsheng Chen¹, Ying Zhang², Zebo Peng³ and Jianhui Jiang¹
¹Tongji University, CN; ²Tongji University, Shanghai, China, CN; ³Linkoping University, SE
Abstract
Wafer-level NoC has emerged as a promising fabric to further improve supercomputer performance, but this new fabric may suffer from the many-fault problem. This paper presents a deterministic-path routing algorithm for tolerating many faults on wafer-level NoCs. The proposed algorithm generates routing tables using a breadth-first traversal strategy, and stores one routing table in each NoC switch. Each switch then transmits packets online according to its routing table. We use the Tarjan algorithm to dynamically reconfigure the routes to avoid the faulty nodes, and develop deprecated link/node rules to ensure deadlock-free communication in the NoC. Experimental results demonstrate that the proposed algorithm not only tolerates the effects of many faults, but also maximizes the number of available nodes in the reconfigured NoC. The proposed algorithm also outperforms existing solutions in terms of average latency, throughput, and energy consumption.
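As a rough illustration of breadth-first routing-table generation around faulty nodes, the sketch below builds one next-hop table per healthy switch of a 2D mesh. The mesh topology, the names `build_routing_tables` and `route`, and the neighbour order are assumptions for this example. It does not implement the Tarjan-based reconfiguration or the deprecated link/node rules mentioned in the abstract, so it makes no deadlock-freedom guarantee.

```python
from collections import deque

def build_routing_tables(width, height, faulty):
    """Return {switch: {destination: next_hop}} for a width x height mesh."""
    faulty = set(faulty)
    healthy = [(x, y) for y in range(height) for x in range(width)
               if (x, y) not in faulty]

    def neighbours(node):
        x, y = node
        # Fixed neighbour order keeps the generated paths deterministic.
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in faulty:
                yield (nx, ny)

    tables = {node: {} for node in healthy}
    for dest in healthy:
        # Breadth-first traversal outward from the destination: the node that
        # discovers a switch is that switch's next hop toward the destination.
        seen = {dest}
        queue = deque([dest])
        while queue:
            cur = queue.popleft()
            for nb in neighbours(cur):
                if nb not in seen:
                    seen.add(nb)
                    tables[nb][dest] = cur
                    queue.append(nb)
    return tables

def route(tables, src, dest):
    """Follow the per-switch tables; return the path, or None if unreachable."""
    path = [src]
    while path[-1] != dest:
        nxt = tables[path[-1]].get(dest)
        if nxt is None:
            return None
        path.append(nxt)
    return path

tables = build_routing_tables(3, 3, faulty={(1, 1)})
print(route(tables, (0, 1), (2, 1)))
```

Because each table is computed once and then only looked up, every source/destination pair always uses the same path, which is the sense of "deterministic-path" illustrated here.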

12:30

IP5-1, 705

THERMAL-AWARENESS IN A SOFT ERROR TOLERANT ARCHITECTURE
Speaker:
Sajjad Hussain, Chair for Embedded Systems, KIT, DE
Authors:
Sajjad Hussain¹, Muhammad Shafique² and Joerg Henkel¹
¹Karlsruhe Institute of Technology, DE; ²Vienna University of Technology (TU Wien), AT
Abstract
It is crucial to provide soft error reliability in a power-efficient manner such that the maximum chip temperature remains within safe operating limits. Different execution phases of an application have diverse performance, power, temperature and vulnerability behavior that can be leveraged to fulfill the resiliency requirements within the allowed thermal constraints. We propose a soft error tolerant architecture with fine-grained redundancy for different architectural components, such that their reliable operation can be activated selectively at fine granularity to maximize reliability under a given thermal constraint. Compared with the state of the art, our temperature-aware fine-grained reliability manager provides up to 30% higher reliability within the thermal budget.

12:31

IP5-2, 547

A SOFTWARE-LEVEL REDUNDANT MULTITHREADING FOR SOFT/HARD ERROR DETECTION AND RECOVERY
Speaker:
Hwisoo So, Yonsei University, KR
Authors:
Moslem Didehban¹, HwiSoo So², Aviral Shrivastava¹ and Kyoungwoo Lee²
¹Arizona State University, US; ²Yonsei University, KR
Abstract
Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. In such environments, error resiliency is one of the main design concerns. Software-level Redundant MultiThreading (RMT) is one of the most promising error resilience strategies because it can potentially serve as an inexpensive and flexible solution for hardware unreliability issues, i.e., soft and hard errors. However, the error coverage of existing software-level RMT solutions is limited to soft error detection, and they rely on external schemes for error recovery. In this paper, we investigate the potential of software-level RMT schemes for complete soft and hard error detection and recovery. First, we pinpoint the main reasons behind the ineffectiveness of basic software-level triple redundant multithreading (STRMT) in protecting against soft and hard errors. Then we introduce FISHER (FlexIble Soft and Hard Error Resiliency), a software-only RMT scheme that can achieve comprehensive error resiliency against both soft and hard errors. Rather than performing centralized voting operations on the operands of critical instructions, FISHER distributes and intertwines error detection and recovery operations between redundant threads. To evaluate the effectiveness of the proposed solution, we performed more than 135,000 soft and hard error injection experiments on different hardware components of an ARM Cortex-A53-like μ-architecturally simulated microprocessor. The results demonstrate that FISHER can reduce program failure rate by around 261× and 162× compared to the original and basic STRMT-protected versions of programs, respectively.
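For readers unfamiliar with the baseline, the following minimal sketch shows the centralized majority voting that a basic triple-redundant scheme relies on. It is not FISHER, which per the abstract distributes detection and recovery across threads instead. The names `run_redundant`, `checksum`, and `faulty_checksum` are hypothetical, and the "fault" is simulated by a deliberately wrong replica rather than by real error injection.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_redundant(replicas, *args):
    """Run each replica in its own thread and majority-vote on the results.

    Returns (voted_value, error_detected). Raises RuntimeError when no value
    is produced by a strict majority of replicas.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, *args) for replica in replicas]
        results = [f.result() for f in futures]
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    # A disagreement that was outvoted counts as a detected, recovered error.
    return value, votes < len(results)

def checksum(data):
    # Example workload: a simple additive checksum.
    return sum(data) % 256

def faulty_checksum(data):
    # Simulated corrupted replica: flips the lowest bit of the result.
    return checksum(data) ^ 1

data = [3, 14, 15, 92, 65]
print(run_redundant([checksum, checksum, faulty_checksum], data))
```

With three replicas the voter masks one wrong result; two replicas that fail in different ways leave no majority, and the sketch raises an error.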

12:32

IP5-3, 317

COMMON-MODE FAILURE MITIGATION: INCREASING DIVERSITY THROUGH HIGH-LEVEL SYNTHESIS
Speaker:
Farah Naz Taher, University of Texas at Dallas, US
Authors:
Farah Naz Taher¹, Matthew Joslin¹, Anjana Balachandran², Zhiqi Zhu¹ and Benjamin Carrion Schaefer¹
¹The University of Texas at Dallas, US; ²The Hong Kong Polytechnic University, HK
Abstract
Fault tolerance is vital in many domains. One popular way to increase fault tolerance is through hardware redundancy. However, basic redundancy cannot cope with Common Mode Failures (CMFs). One way to address CMFs is to use diversity in combination with traditional hardware redundancy. This work proposes an automatic design space exploration (DSE) method that, starting from a single behavioral description for High-Level Synthesis (HLS), generates optimized redundant hardware accelerators with maximum diversity to protect against CMFs. For this purpose, this work exploits one of the main advantages of C-based VLSI design over traditional RT-level design based on low-level Hardware Description Languages (HDLs): the ability to generate micro-architectures with unique characteristics from the same behavioral description. Experimental results show that the proposed method provides a significant increase in diversity compared to using traditional RTL-based exploration to generate diverse designs.

12:30

End of session
Lunch Break in Lunch Area

Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Lunch Breaks (Lunch Area)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the Lunch Area to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 26, 2019

Coffee Break 10:30 - 11:30
Lunch Break 13:00 - 14:30
Keynote Lecture "Leonardo da Vinci, Humanism and Engineering between Florence and Milan" by Claudio Giorgione in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Wednesday, March 27, 2019

Coffee Break 10:00 - 11:00
Lunch Break 12:30 - 14:30
Keynote Lecture "Heterogeneous, High Scale Computing in the Era of Intelligent, Cloud-Connected" by David Pellerin, Amazon, US in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Thursday, March 28, 2019

Coffee Break 10:00 - 11:00
University Booth Best Demo Award Presentation at the University Booth 10:30
Lunch Break 12:30 - 14:00
Keynote Lecture "A Fundamental Look at Models and Intelligence" by Edward A. Lee, University of California, Berkeley, US in room 1 13:20 - 13:50
Coffee Break 15:30 - 16:00