SmartShuttle: Optimizing Off‐Chip Memory Accesses for Deep Learning Accelerators

Jiajun Li, Guihai Yan, Wenyan Lu, Shuhao Jiang, Shijun Gong, Jingya Wu and Xiaowei Li
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
{lijiajun, yan, luwenyan, jiangshuhao, gongshijun, wujingya, lxw}@ict.ac.cn

ABSTRACT


Convolutional Neural Network (CNN) accelerators are rapidly growing in popularity as a promising solution for deep learning based applications. Although optimizations on computation have been intensively studied, the energy efficiency of such accelerators remains limited by off‐chip memory accesses, since their energy cost is orders of magnitude higher than that of other operations. Minimizing off‐chip memory access volume, therefore, is the key to further improving energy efficiency. However, we observe that sticking to minimizing the accesses of one data type, as much prior work does, cannot fit the varying shapes of convolutional layers in CNNs; there is thus a dilemma over which data type's accesses to minimize. To overcome this problem, this paper proposes an adaptive layer partitioning and scheduling scheme, called SmartShuttle, to minimize off‐chip memory accesses for CNN accelerators. SmartShuttle can adaptively switch among different data reuse schemes and the corresponding tiling factor settings to dynamically match different convolutional layers. Moreover, SmartShuttle thoroughly investigates the impact of data reusability and sparsity on the memory access volume. The experimental results show that SmartShuttle processes the convolutional layers at 434.8 multiply‐and‐accumulate operations (MACs) per DRAM access for VGG16 (batch size = 3), and 526.3 MACs per DRAM access for AlexNet (batch size = 4), which outperforms the state‐of‐the‐art approach (Eyeriss) by 52.2% and 52.6%, respectively.
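The dilemma the abstract describes, that no single data reuse scheme minimizes DRAM traffic across all layer shapes, can be illustrated with a toy cost model. The Python sketch below is our own simplified assumption (two resident-tensor schemes, a single ceiling-based refetch term, 2-byte words, and a hypothetical 108 KB on-chip buffer), not the paper's actual cost model or tiling formulation; layer names and formulas are illustrative only.

import math
from dataclasses import dataclass

@dataclass
class ConvLayer:
    n: int  # batch size
    c: int  # input channels
    k: int  # output channels
    h: int  # output height
    w: int  # output width
    r: int  # kernel height
    s: int  # kernel width

    @property
    def macs(self) -> int:
        # Total multiply-and-accumulate operations for the layer.
        return self.n * self.k * self.c * self.h * self.w * self.r * self.s

    def footprints(self):
        # DRAM footprints in bytes, assuming stride 1 and 2-byte words;
        # the ifmap is approximated by the ofmap spatial dimensions.
        ifmap = 2 * self.n * self.c * self.h * self.w
        weight = 2 * self.k * self.c * self.r * self.s
        ofmap = 2 * self.n * self.k * self.h * self.w
        return ifmap, weight, ofmap

def dram_traffic(layer: ConvLayer, buffer_bytes: int) -> dict:
    """Estimated DRAM bytes under two illustrative reuse schemes."""
    ifmap, weight, ofmap = layer.footprints()
    # Scheme A: keep weights resident; each weight tile re-streams all ifmaps.
    a = weight + ifmap * math.ceil(weight / buffer_bytes) + ofmap
    # Scheme B: keep ifmaps resident; each ifmap tile re-streams all weights.
    b = ifmap + weight * math.ceil(ifmap / buffer_bytes) + ofmap
    return {"weight-resident": a, "ifmap-resident": b}

def pick_scheme(layer: ConvLayer, buffer_bytes: int = 108 * 1024):
    # Per layer, choose the scheme with the lower estimated traffic and
    # report the resulting MACs per DRAM byte (higher is better).
    traffic = dram_traffic(layer, buffer_bytes)
    scheme = min(traffic, key=traffic.get)
    return scheme, layer.macs / traffic[scheme]

# Early VGG16-style layer (large ifmaps, few weights) vs. a late one
# (small ifmaps, many weights): the preferred scheme flips.
early = ConvLayer(n=1, c=64, k=64, h=224, w=224, r=3, s=3)
late = ConvLayer(n=1, c=512, k=512, h=14, w=14, r=3, s=3)
for layer in (early, late):
    print(pick_scheme(layer))

Even this crude model selects different schemes for the two layer shapes, which is the behavior SmartShuttle's adaptive partitioning and scheduling exploits; the paper's actual formulation additionally accounts for tiling factors, data reusability, and sparsity.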


