Stealth ECC: A Data-Width Aware Adaptive ECC Scheme for DRAM Error Resilience

Young Seo Lee1,a, Gunjae Koo1,b, Young-Ho Gong2, and Sung Woo Chung1,c
1Department of Computer Science and Engineering, Korea University, Seoul 02841, South Korea
aleeyoungseo@korea.ac.kr
bgunjaekoo@korea.ac.kr
cswchung@korea.ac.kr
2School of Computer and Information Engineering, Kwangwoon University, Seoul 01897, South Korea
yhgong@kw.ac.kr

ABSTRACT


As DRAM process technology scales down and DRAM density continues to grow, DRAM errors have become a primary concern in modern data centers. Data centers have typically adopted memory systems protected by a single-error-correction, double-error-detection (SECDED) code. However, the SECDED code is no longer sufficient to meet DRAM reliability demands as memory systems become more vulnerable. Although servers in data centers employ strong ECC schemes, such schemes incur substantial performance and/or storage overhead.

In this paper, we propose Stealth ECC, a cost-effective memory protection scheme that provides stronger error correctability than the conventional SECDED code, with negligible performance overhead and no storage overhead. Stealth ECC adaptively selects an ECC scheme depending on the data width (either narrow-width or full-width). For narrow-width values, Stealth ECC provides multi-bit error correctability by storing additional parity bits on the MSB side, in place of the zeros that would otherwise occupy it. Furthermore, with bitwise interleaved data placement across x4 DRAM chips, Stealth ECC is robust against a single DRAM chip error for narrow-width values. For full-width values, on the other hand, Stealth ECC adopts the SECDED code, which maintains DRAM reliability comparable to that of the conventional SECDED code. As a result, thanks to the improved reliability of narrow-width values, Stealth ECC enhances overall DRAM reliability while incurring negligible performance overhead and no storage overhead. Our simulation results show that Stealth ECC reduces the probability of system failure (caused by DRAM errors) by 47.9%, on average, with only 0.9% performance overhead compared to the conventional SECDED code.
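The width-dependent selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 64-bit word size, the 32-bit narrow-width threshold, and the names `is_narrow` and `select_scheme` are assumptions made only for illustration, since the abstract does not specify them. The sketch shows only the classification step, that is, how a value whose upper bits are all zero frees those bit positions for extra parity. It does not model the parity codes themselves or how a decoder would tell the two cases apart.

```python
WORD_BITS = 64    # assumed data word size (not stated in the abstract)
NARROW_BITS = 32  # hypothetical narrow-width threshold (not stated in the abstract)


def is_narrow(value: int, narrow_bits: int = NARROW_BITS) -> bool:
    """Return True if every bit above the low `narrow_bits` bits is zero."""
    if not 0 <= value < (1 << WORD_BITS):
        raise ValueError("value must fit in an unsigned 64-bit word")
    return (value >> narrow_bits) == 0


def select_scheme(value: int) -> tuple[str, int]:
    """Pick a protection mode and report how many MSB-side bits are free.

    Narrow-width values leave WORD_BITS - NARROW_BITS zero bits on the MSB
    side, which could hold additional parity; full-width values fall back
    to SECDED with no spare bits.
    """
    if is_narrow(value):
        return ("multi-bit", WORD_BITS - NARROW_BITS)
    return ("secded", 0)


if __name__ == "__main__":
    for v in (0x1234, (1 << 32) - 1, 1 << 32, 1 << 40):
        print(hex(v), select_scheme(v))
```

Under these assumptions, a small value such as `0x1234` is classified as narrow-width and leaves 32 MSB-side bits available, while any value with a bit set at position 32 or above is treated as full-width.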

Keywords: Safety, Embedded, Arm, Trustzone, TEE.


