LoSCache: Leveraging Locality Similarity to Build Energy-Efficient GPU L2 Cache

Jingweijia Tan, Jilin University, Changchun, China (jtan@jlu.edu.cn)
Kaige Yan, Jilin University, Changchun, China (yankaige@jlu.edu.cn)
Shuaiwen Leon Song, HPC Group, Pacific Northwest National Laboratory, Richland, USA (shuaiwen.song@pnnl.gov)
Xin Fu, University of Houston, Houston, USA (xfu8@central.uh.edu)

ABSTRACT


This paper presents a novel energy-efficient cache design for massively parallel, throughput-oriented architectures such as GPUs. Unlike the L1 data caches on modern GPUs, the L2 cache shared by all streaming multiprocessors is not the primary performance bottleneck, yet it consumes a large amount of chip energy. We observe that the L2 cache is significantly underutilized, spending 95.6% of its time storing useless data. If such "dead time" in L2 is identified and reduced, L2's energy efficiency can be drastically improved. Fortunately, we discover that the SIMT programming model of GPUs provides a unique feature across threads: instruction-level data locality similarity, which can be used to accurately predict data re-reference counts at the L2 cache block level. We propose a simple design that leverages this locality similarity to build an energy-efficient GPU L2 cache, named LoSCache. Specifically, LoSCache uses the data locality information from a small group of CTAs to dynamically predict the L2-level data re-reference counts of the remaining CTAs. Individual L2 cache lines can then be powered off once they are predicted to be "dead", i.e., once they have served their predicted number of accesses. Experimental results on a wide range of applications demonstrate that our proposed design reduces L2 cache energy by an average of 64% with only 0.5% performance loss.
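The abstract only outlines the mechanism at a high level. As a rough software sketch of that idea (not the paper's actual microarchitecture), one could imagine a hypothetical per-instruction (PC-indexed) predictor table trained on a few sampler CTAs, plus a per-line counter of remaining predicted references that triggers power gating when it reaches zero; all names and structures below are illustrative assumptions.

    // Illustrative sketch of the LoSCache idea described in the abstract.
    // Re-reference counts observed for sampler CTAs are recorded per load/store
    // instruction (PC) and reused to predict when an L2 line filled by the
    // remaining CTAs becomes dead, so it can be power-gated.
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    struct PredictorEntry {
        uint32_t observed_rerefs = 0;  // re-reference count learned from sampler CTAs
        bool     trained = false;
    };

    struct L2Line {
        uint64_t tag = 0;
        bool     valid = false;
        bool     powered_off = false;
        int32_t  remaining_refs = -1;  // predicted references left before the line is dead
    };

    class LoSCacheModel {
    public:
        // Training phase: record how often a line filled by instruction `pc`
        // was re-referenced while resident in L2 (observed from sampler CTAs).
        void train(uint64_t pc, uint32_t rerefs_seen) {
            auto &e = table_[pc];
            e.observed_rerefs = rerefs_seen;
            e.trained = true;
        }

        // Fill: attach the predicted re-reference budget to the new line.
        void on_fill(L2Line &line, uint64_t tag, uint64_t pc) {
            line.tag = tag; line.valid = true; line.powered_off = false;
            auto it = table_.find(pc);
            line.remaining_refs = (it != table_.end() && it->second.trained)
                                  ? static_cast<int32_t>(it->second.observed_rerefs)
                                  : -1;  // unknown: never power-gate early
        }

        // Hit: spend one predicted reference; gate the line once the budget is exhausted.
        void on_hit(L2Line &line) {
            if (line.remaining_refs > 0 && --line.remaining_refs == 0) {
                line.powered_off = true;  // predicted dead: stop retaining the data
                line.valid = false;
            }
        }

    private:
        std::unordered_map<uint64_t, PredictorEntry> table_;  // indexed by instruction PC
    };

    int main() {
        LoSCacheModel model;
        model.train(/*pc=*/0x400a10, /*rerefs_seen=*/2);  // sampler CTAs re-referenced twice

        L2Line line;
        model.on_fill(line, /*tag=*/0xdeadbeef, /*pc=*/0x400a10);
        model.on_hit(line);  // 1st re-reference
        model.on_hit(line);  // 2nd re-reference -> predicted dead, line gated
        std::printf("powered_off = %d\n", line.powered_off);
        return 0;
    }

In hardware, the per-line counter would presumably be a few bits of state and "powering off" would correspond to gating the line's storage cells; the unknown (-1) case in this sketch simply falls back to normal cache behavior.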

Keywords: GPU, Cache, Energy-efficiency, Locality similarity.


