Learn-to-Scale: Parallelizing Deep Learning Inference on Chip Multiprocessor Architecture

Kaiwei Zou 1,2,a, Ying Wang 1,2,b, Huawei Li 1,2,c and Xiaowei Li 1,2,d
1 SKLCA, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
a zoukaiwei@ict.ac.cn
b wangying2009@ict.ac.cn
c lihuawei@ict.ac.cn
d lxw@ict.ac.cn

ABSTRACT


Accelerating deep neural networks on resource-constrained embedded devices is becoming increasingly important for real-time applications. However, in contrast to the intensive research on specialized neural network inference architectures, there is little study of the acceleration and parallelization of deep learning inference on embedded chip-multiprocessor architectures, which many real-time applications favor for their superb energy efficiency and scalability. In this work, we investigate strategies for parallelizing single-pass deep neural network inference on embedded on-chip multi-core accelerators. These methods exploit the elasticity and noise tolerance of deep learning algorithms to circumvent the bottleneck of on-chip inter-core data movement and to reduce the communication overhead that grows as the core number scales up. The experimental results show that the communication-aware sparsified parallelization method improves system performance by 1.6×-1.1× and achieves 4×-1.6× better interconnect energy efficiency for different neural networks.
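The abstract only names the communication-aware sparsified parallelization method; as a minimal sketch of the general idea (not the paper's actual scheme), the example below drops small-magnitude activations before they cross the interconnect and packs the survivors as (index, value) pairs, relying on the noise tolerance mentioned above. The threshold value and the packing layout are illustrative assumptions.

```python
# Illustrative sketch only: magnitude-based sparsification of a layer's output
# activations before inter-core transfer. Threshold and encoding are hypothetical,
# not taken from the paper.
import numpy as np

def sparsify_for_transfer(activations: np.ndarray, threshold: float = 0.05):
    """Drop small-magnitude activations and pack survivors as (index, value)
    pairs so only non-zero entries are sent over the on-chip interconnect."""
    flat = activations.ravel()
    keep = np.abs(flat) >= threshold            # noise-tolerant drop of small values
    indices = np.nonzero(keep)[0].astype(np.int32)
    values = flat[keep].astype(np.float32)
    return indices, values, activations.shape

def restore_on_receiver(indices, values, shape):
    """Rebuild the dense activation tensor on the receiving core."""
    flat = np.zeros(int(np.prod(shape)), dtype=np.float32)
    flat[indices] = values
    return flat.reshape(shape)

if __name__ == "__main__":
    acts = (np.random.randn(64, 32) * 0.1).astype(np.float32)
    idx, val, shape = sparsify_for_transfer(acts)
    dense_bytes = acts.nbytes
    sparse_bytes = idx.nbytes + val.nbytes
    print(f"transfer volume: {sparse_bytes}/{dense_bytes} bytes "
          f"({sparse_bytes / dense_bytes:.0%} of dense)")
```

In this toy setup, the reduction in bytes moved stands in for the interconnect traffic saved; the paper's reported speedups and energy savings come from its own parallelization scheme, which the full text describes.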

Keywords: Parallelization, Multi-core, Inference, Neural network, Embedded devices.


