LAD-ECC: Energy-Efficient ECC Mechanism for GPGPUs Register File

Xiaohui Wei (a), Hengshan Yue (b) and Jingweijia Tan (c)

College of Computer Science and Technology, Jilin University, Changchun, China
(a) weixh@jlu.edu.cn
(b) yuehs18@mails.jlu.edu.cn
(c) jtan@jlu.edu.cn

ABSTRACT

Graphics Processing Units (GPUs) are widely used in general-purpose high-performance computing applications (i.e., GPGPUs), which require reliable execution in the presence of soft errors. To support massive thread-level parallelism, GPUs adopt a sizeable register file, which is highly vulnerable to soft errors. Although modern commercial GPUs provide single-error-correction double-error-detection (SEC-DED) ECC for the register file, this protection consumes a considerable amount of energy due to frequent register accesses and the leakage power of ECC storage.
In this paper, we propose to Leverage the Approximation and Duplication characteristics of register values to build an energy-efficient ECC mechanism (LAD-ECC) for GPGPUs, which consists of APproximation-aware ECC (AP-ECC) and Duplication-Aware ECC (DA-ECC). Leveraging the inherent error tolerance of applications, AP-ECC protects only the significant bits of registers to guard against critical errors. Observing that same-named registers across threads usually hold the same data, DA-ECC avoids unnecessary ECC generation and verification for duplicate register values. Experimental results demonstrate that LAD-ECC reduces the energy consumption of traditional SEC-DED ECC by 69.72%.

Keywords: GPGPUs, Reliability, Soft Error, Energy-Efficiency
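
The two ideas named in the abstract can be illustrated with a small software sketch. This is not the authors' hardware design: the 16-bit protected width, the 32-thread warp size, and the use of a set to find duplicate values are all assumptions made here for illustration, and every function name is hypothetical. The sketch shows only why protecting fewer bits needs fewer check bits, and why duplicate values across threads need fewer ECC generations.

```python
from typing import List

WARP_SIZE = 32        # assumption: 32 threads per warp
PROTECTED_BITS = 16   # assumption: upper 16 bits count as "significant"


def significant_part(value: int, protected_bits: int = PROTECTED_BITS) -> int:
    """Return the upper `protected_bits` bits of a 32-bit register value."""
    return (value & 0xFFFFFFFF) >> (32 - protected_bits)


def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming-based SEC-DED code over `data_bits` bits."""
    r = 0
    # Smallest r with 2**r >= data_bits + r + 1 gives single-error correction.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


def ecc_generations_needed(warp_values: List[int]) -> int:
    """ECC generations needed if threads holding equal values share one code."""
    return len(set(warp_values))


if __name__ == "__main__":
    # Approximation-aware idea: protect only the significant bits.
    print(secded_check_bits(32), secded_check_bits(PROTECTED_BITS))
    # Duplication-aware idea: every thread holds the same register value.
    print(ecc_generations_needed([0x3F800000] * WARP_SIZE))
```

Under these assumptions, full 32-bit protection needs 7 check bits while protecting 16 bits needs 6, and a warp whose threads all hold the same value needs a single ECC generation rather than 32.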
