RetroDMR: Troubleshooting Non-Deterministic Faults with Retrospective DMR

Ting Wang¹, Yannan Liu¹, Qiang Xu¹, Zhaobo Zhang², Zhiyuan Wang² and Xinli Gu²
¹The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
²Huawei Technologies, Inc., Santa Clara, CA

ABSTRACT


Among the faults that are hardest to diagnose in post-silicon validation are those that manifest non-deterministically under system-level functional tests: errors appear at random, even when the same workload is applied repeatedly. In this work, we propose RetroDMR, a novel diagnostic framework that uses dual-modular redundancy (DMR) to troubleshoot non-deterministic faults. Specifically, we log the essential events of the faulty run (e.g., the sequence of thread migrations) to record the mapping between threads and the execution units they ran on. In the subsequent diagnosis runs, we apply the redundant multithreading (RMT) technique to reduce error detection latency while following the thread migration sequence of the original run whenever possible. By doing so, RetroDMR significantly improves the reproduction rate and diagnosis resolution for non-deterministic faults, as demonstrated by our experimental results.
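The log-then-replay idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names MigrationEvent, core_at, diagnose, faulty_run_step and the spare-core arrangement are hypothetical, and the fault is modeled as deterministic purely so the example is reproducible. The sketch replays a logged thread-to-core mapping, runs a redundant copy of each step on a second core, and reports the first step at which the two results disagree.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class MigrationEvent:
    """Hypothetical log record: from `step` on, `thread_id` runs on `core_id`."""
    step: int
    thread_id: int
    core_id: int


def core_at(log: List[MigrationEvent], thread_id: int, step: int) -> Optional[int]:
    """Return the core the thread occupied at `step` in the logged run."""
    core = None
    for ev in sorted(log, key=lambda e: e.step):
        if ev.thread_id == thread_id and ev.step <= step:
            core = ev.core_id
    return core


def diagnose(
    log: List[MigrationEvent],
    thread_id: int,
    num_steps: int,
    run_step: Callable[[int, int], int],
    spare_core: int,
) -> Optional[Tuple[int, int]]:
    """Replay the logged mapping with a redundant copy on `spare_core`.

    Returns (step, core) for the first mismatch, or None if none is seen.
    """
    for step in range(num_steps):
        core = core_at(log, thread_id, step)
        if core is None:
            continue  # thread not yet scheduled in the logged run
        leading = run_step(step, core)         # follows the original mapping
        trailing = run_step(step, spare_core)  # redundant execution
        if leading != trailing:
            return step, core
    return None


# Assumed toy scenario: thread 0 starts on core 1 and migrates to core 2 at step 4.
log = [MigrationEvent(0, 0, 1), MigrationEvent(4, 0, 2)]


def faulty_run_step(step: int, core: int) -> int:
    """Toy workload: core 2 corrupts its result from step 5 onward (assumption)."""
    value = step * 2
    return value + 1 if core == 2 and step >= 5 else value


result = diagnose(log, thread_id=0, num_steps=8,
                  run_step=faulty_run_step, spare_core=0)
print(result)  # (5, 2)
```

In this toy scenario the mismatch is detected at the very step where the faulty core first corrupts a result, which is the sense in which redundant execution shortens detection latency, and following the logged mapping is what places the thread back on the suspect core.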


