Thermal-Awareness in a Soft Error Tolerant Architecture

Sajjad Hussain (1,a), Muhammad Shafique (2) and Jörg Henkel (1,b)
(1) Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
(a) sajjad.hussain@kit.edu
(b) henkel@kit.edu
(2) Department of Computer Engineering, Vienna University of Technology, Austria
muhammad.shafique@tuwien.ac.at

ABSTRACT


It is crucial to provide soft error reliability in a power-efficient manner so that the maximum chip temperature remains within safe operating limits. Different execution phases of an application exhibit diverse performance, power, temperature, and vulnerability behavior, which can be leveraged to meet resiliency requirements within the allowed thermal constraints. We propose a soft error tolerant architecture with fine-grained redundancy for different architectural components, whose reliable operation can be activated selectively at fine granularity to maximize reliability under a given thermal constraint. Compared with the state of the art, our temperature-aware fine-grained reliability manager improves reliability by up to 30% within the thermal budget.
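
The idea of selectively activating redundancy under a thermal constraint can be illustrated with a minimal sketch. This is not the reliability manager proposed in the paper; it is a simple greedy heuristic written under stated assumptions. The `Component` type, the `select_protected` function, the component names, and all numeric values are hypothetical, and the sketch assumes that the temperature increases of individual components add up linearly, which real chips do not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical per-component estimates for one execution phase."""
    name: str
    reliability_gain: float  # assumed benefit of enabling redundancy (unitless)
    temp_rise_c: float       # assumed temperature increase in degrees Celsius


def select_protected(components, current_temp_c, thermal_limit_c):
    """Greedily enable redundancy for the components with the best
    reliability gain per degree, while the estimated temperature
    (assumed additive) stays within the thermal limit."""
    headroom = thermal_limit_c - current_temp_c
    ranked = sorted(
        components,
        # Guard against a zero temperature rise when computing the ratio.
        key=lambda c: c.reliability_gain / max(c.temp_rise_c, 1e-9),
        reverse=True,
    )
    chosen = []
    for comp in ranked:
        if comp.temp_rise_c <= headroom:
            chosen.append(comp.name)
            headroom -= comp.temp_rise_c
    return chosen


# Illustrative values only; none of these numbers come from the paper.
components = [
    Component("register_file", 0.40, 2.0),
    Component("alu", 0.25, 1.0),
    Component("cache", 0.30, 3.0),
]
print(select_protected(components, current_temp_c=76.0, thermal_limit_c=80.0))
```

With 4 degrees of headroom, the sketch enables redundancy for the two components with the highest gain per degree and leaves the third unprotected because it would exceed the limit. A per-phase manager would rerun such a selection whenever the estimates change between execution phases.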


