Thermal-Cycling-aware Dynamic Reliability Management in Many-Core System-on-Chip

Mohammad-Hashem Haghbayan (1, a), Antonio Miele (2), Zhuo Zou (3), Hannu Tenhunen (1, b) and Juha Plosila (1, c)
(1) Department of Future Technologies – University of Turku – Finland
(a) mohhag@utu.fi
(b) hannu.tenhunen@utu.fi
(c) juplos@utu.fi
(2) Dip. Elettronica, Informazione e Bioingegneria – Politecnico di Milano – Italy
antonio.miele@polimi.it
(3) Fudan University – China
zhuo@fudan.edu.cn

ABSTRACT


Dynamic Reliability Management (DRM) is a common approach to mitigating aging and wear-out effects in multi-/many-core systems. State-of-the-art DRM approaches apply fine-grained resource-management control to increase or balance chip reliability while considering other system constraints, e.g., performance and power budget. These approaches act on various knobs, such as workload mapping and scheduling, Dynamic Voltage/Frequency Scaling (DVFS), and Per-Core Power Gating (PCPG), and have been shown to work properly for aging mechanisms such as electromigration and Negative-Bias Temperature Instability (NBTI). However, we argue that they are insufficient to address thermal cycling. We therefore propose a novel thermal-cycling-aware DRM approach for shared-memory many-core systems running multi-threaded applications. The approach applies fine-grained control capable of reducing both temperature levels and temperature variations. Experimental evaluation demonstrates that the proposed approach achieves a 39% longer lifetime than past approaches.
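Why lowering temperature levels alone may not be enough for thermal cycling can be illustrated with a minimal sketch. It assumes a Coffin-Manson cycles-to-failure model with an Arrhenius term on each cycle's peak temperature (in the spirit of Norris-Landzberg) and Miner's rule for accumulating damage. This is not necessarily the model used in the paper. The function names and all parameter values (`c0`, `q`, `delta_t0`, `ea_ev`) are hypothetical placeholders, not calibrated values.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def cycles_to_failure(delta_t, t_max_k, c0=1.0, q=2.35, delta_t0=0.0, ea_ev=0.5):
    """Hypothetical Coffin-Manson cycles-to-failure with an Arrhenius peak-temperature term.

    delta_t: cycle amplitude in K; t_max_k: peak temperature of the cycle in K.
    All default parameter values are illustrative assumptions, not calibrated data.
    """
    effective = delta_t - delta_t0
    if effective <= 0:
        return math.inf  # cycles at or below the threshold amplitude cause no damage
    # Larger swings (power-law term) and hotter peaks (Arrhenius term) both shorten life.
    return c0 * effective ** (-q) * math.exp(ea_ev / (BOLTZMANN_EV * t_max_k))


def miner_damage(cycles):
    """Accumulated damage under Miner's rule over (delta_t, t_max_k) cycles; failure at 1.0."""
    return sum(1.0 / cycles_to_failure(dt, t_max) for dt, t_max in cycles)


# Three hypothetical cycle histories, 100 cycles each:
small_swing = [(20.0, 350.0)] * 100   # 330 K <-> 350 K (mean 340 K)
large_swing = [(40.0, 360.0)] * 100   # 320 K <-> 360 K (same 340 K mean)
cooler_swing = [(20.0, 330.0)] * 100  # 310 K <-> 330 K (lower level, same swing)

print(miner_damage(large_swing) > miner_damage(small_swing) > miner_damage(cooler_swing))
```

Under these assumptions, the two histories with the same mean temperature accumulate different damage because their swings differ, while the cooler history with the same swing accumulates the least. This is why a controller that limits both temperature levels and temperature variations could extend lifetime more than one that targets the average temperature alone.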

Keywords: Lifetime Reliability, Thermal Cycling, Resource Management


