Performance Evaluation and Optimization of HBM-Enabled GPU for Data-intensive Applications

Maohua Zhu (1,a), Youwei Zhuo (2), Chao Wang (3), Wenguang Chen (4) and Yuan Xie (1,b)
(1) University of California, Santa Barbara
    (a) maohuazhu@ece.ucsb.edu
    (b) yuanxie@ece.ucsb.edu
(2) University of Southern California
    youweizh@usc.edu
(3) University of Science and Technology of China
    cswang@ustc.edu.cn
(4) Tsinghua University
    cwg@tsinghua.edu.cn

ABSTRACT

Graphics Processing Units (GPUs) are widely used to accelerate data-intensive applications, whose performance benefits from higher GPU memory bandwidth. Traditional GDDR memories achieve higher bandwidth by raising the clock frequency, which leads to excessive power consumption. Recently, high-bandwidth memory (HBM), a new memory technology based on 3D die stacking, has been adopted in the latest generation of GPUs; its in-package stacked DRAM provides both high bandwidth and low power consumption. However, the capacity of such in-package stacked memory is limited (e.g., only 4GB on the state-of-the-art HBM-enabled GPU, the AMD Radeon Fury X). In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) training and breadth-first search (BFS), on an HBM-enabled GPU to evaluate the improvement brought by the adoption of HBM, and we investigate techniques to fully unleash the benefits of such an HBM-enabled GPU. Based on the evaluation results, we first propose a software pipeline to alleviate the capacity limitation of the HBM for CNN. We then design two programming techniques to improve the utilization of memory bandwidth for the BFS application. Experimental results demonstrate that our pipelined CNN training achieves a 1.63x speedup on an HBM-enabled GPU over the best high-performance GPU on the market, and that the two optimization techniques make BFS up to 24.5x faster (9.8x and 2.5x for each technique, respectively) than conventional implementations.
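For readers who want a concrete picture of the software pipeline mentioned above: the abstract does not spell out the implementation, so the CUDA sketch below is only an illustration under assumed names and sizes (the trainStep kernel, batch size, and buffer layout are all hypothetical). It overlaps the host-to-device transfer of the next minibatch with computation on the current one, using pinned host memory, two streams, and a double buffer in device (HBM) memory, so that only batch-sized slices of the training data occupy the limited HBM capacity at any time.

    // A minimal sketch, NOT the paper's implementation: the kernel, batch
    // size, and all names here are illustrative assumptions.
    #include <cuda_runtime.h>
    #include <cstdio>

    // Stand-in for the per-batch training computation (e.g., one CNN layer).
    __global__ void trainStep(float* batch, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) batch[i] = batch[i] * 0.5f + 1.0f;   // dummy arithmetic
    }

    int main() {
        const int batchElems = 1 << 20;                 // assumed minibatch size
        const int numBatches = 8;
        const size_t bytes = batchElems * sizeof(float);

        // Pinned host memory is required for truly asynchronous copies.
        float* hostData;
        cudaMallocHost(&hostData, bytes * numBatches);
        for (int i = 0; i < batchElems * numBatches; ++i) hostData[i] = 1.0f;

        // Double buffer in device (HBM) memory: only two batch-sized slices
        // of the training set occupy the limited HBM capacity at any time.
        float* devBuf[2];
        cudaMalloc(&devBuf[0], bytes);
        cudaMalloc(&devBuf[1], bytes);

        cudaStream_t stream[2];
        cudaStreamCreate(&stream[0]);
        cudaStreamCreate(&stream[1]);

        for (int b = 0; b < numBatches; ++b) {
            int s = b & 1;                              // alternate buffer/stream
            // Stream ordering makes each kernel wait for its own copy, while
            // the copy for batch b+1 overlaps the compute for batch b.
            cudaMemcpyAsync(devBuf[s], hostData + (size_t)b * batchElems,
                            bytes, cudaMemcpyHostToDevice, stream[s]);
            trainStep<<<(batchElems + 255) / 256, 256, 0, stream[s]>>>(
                devBuf[s], batchElems);
        }
        cudaDeviceSynchronize();                        // error checks omitted for brevity
        printf("processed %d batches\n", numBatches);

        cudaFree(devBuf[0]);
        cudaFree(devBuf[1]);
        cudaFreeHost(hostData);
        return 0;
    }

With two streams, the GPU's copy engine and its compute units can stay busy simultaneously, which is the general idea behind hiding capacity-induced transfer cost behind computation; the paper's actual pipeline for CNN training is described in the full text.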


