doi: 10.3850/978-3-9815370-4-8_0085


Soft-Error Reliability and Power Co-Optimization for GPGPUs Register File using Resistive Memory


Jingweijia Tan 1,a, Zhi Li 2 and Xin Fu 1,b

1 ECE Department, University of Houston, Houston, TX, USA.

a jtan12@uh.edu, b xfu8@central.uh.edu

2 EECS Department, University of Kansas, Lawrence, KS, USA.

zli@ku.edu

ABSTRACT

The increasing adoption of graphics processing units (GPUs) for high-performance computing raises a reliability challenge that traditional GPUs have generally ignored. GPUs usually support thousands of parallel threads and therefore require a sizable register file. Such a large register file is both highly susceptible to soft errors and power-hungry. Although ECC has been adopted for the register file in modern GPUs, it incurs considerable power overhead, which further increases the power stress. An energy-efficient soft-error protection mechanism is therefore desirable. Besides its extremely low leakage power consumption, resistive memory (e.g., spin-transfer torque RAM) is also immune to radiation-induced soft errors due to its magnetic-field-based storage. In this paper, we propose to LEverage reSistive memory to enhance the Soft-error robustness and reduce the power consumption (LESS) of registers in General-Purpose computing on GPUs (GPGPUs). Since resistive memory has a longer write latency than SRAM, we explore the unique characteristics of GPGPU applications to obtain win-win gains: near-full soft-error protection for the register file together with substantially reduced energy consumption, at negligible performance loss. Our experimental results show that LESS reduces the register file's soft-error vulnerability by 86% and achieves 60% energy savings with negligible (e.g., 4%) performance loss.

Keywords: GPGPU, Register file, Soft error, Reliability, Energy efficiency, Resistive memory.


