Improving the DRAM Access Efficiency for Matrix Multiplication on Multicore Accelerators
Sheng Ma^a, Yang Guo^b, Shenggang Chen^c, Libo Huang^d and Zhiying Wang^e
State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha, Hunan, China
^a masheng@nudt.edu.cn
^b guoyang@nudt.edu.cn
^c shgchen@nudt.edu.cn
^d libohuang@nudt.edu.cn
^e zywang@nudt.edu.cn
ABSTRACT
The parallelization of matrix multiplication on multicore accelerators divides a matrix into several partitions. The existing design deploys an independent DMA transfer for each core to access its own partition from DRAM. This design has poor memory access efficiency, since the memory access streams of multiple concurrent DMA transfers interfere with each other. We propose Distributed-DMA (D-DMA), which invokes a single transfer to serve all cores. D-DMA accesses data in row-major order, exploiting inter-partition locality to improve DRAM access efficiency. Compared with a baseline design, D-DMA improves bandwidth by 84.8% and reduces DRAM energy consumption by 43.1% for micro-benchmarks. It also achieves higher performance on the GEMM benchmark. At a much lower hardware cost, D-DMA significantly outperforms an out-of-order memory controller.
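The interference effect described above can be illustrated with a toy model. The sketch below is not the paper's simulator: it assumes a row-major matrix split into equal column blocks (one per core), a single DRAM bank with an open-row policy, and a controller that services the per-core DMA streams round-robin, one word at a time. The function names and the parameter `row_buffer_words` are hypothetical and chosen only for illustration.

```python
def per_core_dma_stream(rows, cols, cores):
    """Address order when every core issues its own DMA transfer.

    Assumption: each core owns a block of cols // cores columns, walks it
    in row-major order, and the controller interleaves the cores
    round-robin, one word per turn.
    """
    width = cols // cores
    streams = [
        [r * cols + core * width + j for r in range(rows) for j in range(width)]
        for core in range(cores)
    ]
    # zip(*streams) yields one word from each core per round.
    return [addr for group in zip(*streams) for addr in group]


def single_row_major_stream(rows, cols):
    """Address order of one transfer that walks the whole matrix row-major."""
    return list(range(rows * cols))


def row_buffer_misses(stream, row_buffer_words):
    """Count row activations for a single bank with an open-row policy."""
    misses = 0
    open_row = None
    for addr in stream:
        dram_row = addr // row_buffer_words
        if dram_row != open_row:  # a different DRAM row must be activated
            misses += 1
            open_row = dram_row
    return misses


if __name__ == "__main__":
    rows, cols, cores, row_buffer_words = 4, 8, 4, 4
    separate = per_core_dma_stream(rows, cols, cores)
    shared = single_row_major_stream(rows, cols)
    print("per-core DMA misses:", row_buffer_misses(separate, row_buffer_words))
    print("single row-major misses:", row_buffer_misses(shared, row_buffer_words))
```

In this toy configuration both streams touch the same 32 words, but the interleaved per-core streams activate 16 rows while the single row-major stream activates 8. The absolute numbers depend entirely on the assumed partitioning, interleaving and row-buffer size; only the direction of the effect is meant to carry over.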