PCFI: Program Counter Guided Fault Injection for Accelerating GPU Reliability Assessment

Fritz G. Previlon, Charu Kalra, Devesh Tiwari and David R. Kaeli
Northeastern University, Boston, MA, USA

ABSTRACT


Reliability has become a first-class design objective for GPU devices due to increasing soft-error rates. To assess the reliability of GPU programs, researchers rely on software fault-injection methods. Unfortunately, the software fault-injection process is prohibitively expensive, requiring multiple days to complete a statistically sound fault-injection campaign.

To address this challenge, this paper proposes a novel fault-injection method, PCFI, that reduces the number of fault injections by exploiting the predictability of fault-injection outcomes based on the program counter of the instruction affected by the soft error. An evaluation on a variety of GPU programs covering a wide range of application domains shows that PCFI reduces the time to complete fault-injection campaigns by 22% on average, without sacrificing accuracy.

Keywords: Reliability, Fault injection, GPU, Soft errors, Transient faults.
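
The idea of using the program counter to prune injections can be illustrated with a small sketch. This is not the authors' implementation, and the abstract does not specify the algorithm. The sketch assumes one plausible reading: fault sites are grouped by program counter, a small pilot sample is injected per group, and when the pilot outcomes are unanimous the remaining sites in that group are assigned the same outcome without being injected. The function names, the `pilot` parameter, the outcome labels ("masked", "sdc", "crash"), and the toy injector are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def pc_guided_campaign(fault_sites, inject, pilot=3, seed=0):
    """Illustrative PC-guided pruning (an assumption, not the paper's algorithm).

    fault_sites: list of (pc, site_id) tuples.
    inject: callable taking a fault site and returning an outcome label.
    Returns (estimated outcome counts over all sites, injections performed).
    """
    rng = random.Random(seed)
    by_pc = defaultdict(list)
    for site in fault_sites:
        by_pc[site[0]].append(site)

    outcomes = Counter()
    injections = 0
    for pc, sites in by_pc.items():
        rng.shuffle(sites)
        observed = [inject(s) for s in sites[:pilot]]
        injections += len(observed)
        remaining = sites[pilot:]
        if len(observed) == pilot and len(set(observed)) == 1:
            # Unanimous pilot: assume every site at this PC behaves the same
            # and skip the remaining injections.
            outcomes[observed[0]] += len(sites)
        else:
            # Mixed pilot (or too few sites): inject everything that is left.
            observed += [inject(s) for s in remaining]
            injections += len(remaining)
            outcomes.update(observed)
    return outcomes, injections

def toy_inject(site):
    """Hypothetical stand-in for a real fault injector."""
    pc, site_id = site
    if pc == 0x10:
        return "masked"                       # fully predictable PC
    return ("masked", "sdc", "crash")[site_id % 3]  # unpredictable PC

# 10 sites at a predictable PC, 6 sites at an unpredictable one.
sites = [(0x10, i) for i in range(10)] + [(0x20, i) for i in range(6)]
outcomes, injections = pc_guided_campaign(sites, toy_inject)
print(dict(outcomes), injections, "of", len(sites))
```

In this toy setup the predictable program counter needs only the pilot injections, while the unpredictable one is injected exhaustively, so the saving depends entirely on how many program counters behave predictably. A unanimous pilot can also be wrong by chance, which is why a real campaign would need a statistically justified pilot size.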


