OR-ML: Enhancing Reliability for Machine Learning Accelerator with Opportunistic Redundancy

Bo Dong1,2, Zheng Wang1, Wenxuan Chen1,2, Chao Chen1, Yongkui Yang1 and Zhibin Yu1
1Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
2School of Microelectronics, Xidian University, Xi’an, China

ABSTRACT


Reliability plays a central role in deep sub-micron and nanometre IC fabrication technology and has recently been reported to be one of the key issues affecting the inference phase of neural networks. State-of-the-art machine learning (ML) accelerators exploit the massive computing parallelism observed in neural networks to achieve high energy efficiency. The computing fabric of ML engines, which consists of large arrays of processing elements (PEs), has grown dramatically to accommodate the huge size and heterogeneity of rapidly evolving ML algorithms. However, it is commonly observed that activations of zero value lead to reduced PE utilization. In this work, we present a novel and low-cost approach, named OR-ML, that enhances the reliability of generic ML accelerators by Opportunistically exploiting the runtime Redundancy offered by neighbouring PEs. In contrast to conventional redundancy techniques, the proposed technique introduces no additional computing resources; it therefore significantly reduces the implementation overhead while still achieving a clear level of protection. The design prototype is evaluated using emulated fault injection on an FPGA, executing mainstream neural networks for object classification and detection.

Keywords: Fault Tolerance, Machine Learning Accelerator, Opportunistic Redundancy.
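
The idea summarized in the abstract can be illustrated with a small software model. The sketch below is not the OR-ML hardware design; it is a minimal illustration under stated assumptions: a one-dimensional row of PEs, a PE counted as idle when its activation is zero, an idle PE that recomputes the product of one busy neighbour, and a single-bit flip emulating a fault in a busy PE's result. The names `mac`, `run_row`, `faulty_pe` and `flip_bit` are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple


def mac(activation: int, weight: int) -> int:
    """Product computed by one PE (accumulation omitted for brevity)."""
    return activation * weight


def run_row(
    acts: List[int],
    weights: List[int],
    faulty_pe: Optional[int] = None,
    flip_bit: int = 0,
) -> Tuple[List[int], List[int]]:
    """Simulate one row of PEs; return (outputs, indices of detected faulty PEs).

    Assumptions of this sketch (not taken from the paper):
    - a PE whose activation is zero is idle and free to do redundant work;
    - an idle PE re-executes the product of its first busy neighbour;
    - the redundant computation itself is fault-free.
    """
    n = len(acts)

    # Primary computation, with an optional emulated single-bit fault.
    outputs = []
    for i in range(n):
        out = mac(acts[i], weights[i])
        if i == faulty_pe:
            out ^= 1 << flip_bit
        outputs.append(out)

    # Opportunistic check: only idle PEs contribute redundancy.
    detected = []
    for i in range(n):
        if acts[i] != 0:
            continue  # PE i is busy with its own work
        for j in (i - 1, i + 1):
            if 0 <= j < n and acts[j] != 0:
                shadow = mac(acts[j], weights[j])  # redundant re-execution
                if shadow != outputs[j]:
                    detected.append(j)
                break  # each idle PE shadows at most one neighbour
    return outputs, detected


acts = [3, 0, 5, 7]
weights = [2, 4, 1, 3]
print(run_row(acts, weights, faulty_pe=0))  # PE 0 has an idle neighbour
print(run_row(acts, weights, faulty_pe=3))  # PE 3 has no idle neighbour
```

In this toy row, a fault in PE 0 is caught because the idle PE 1 happens to shadow it, while a fault in PE 3 goes unnoticed because no idle PE is adjacent to it. Under the assumptions above, the protection depends on where zero activations occur, which is the sense in which the redundancy is opportunistic, and no extra PEs are added to obtain it.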


