Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications

Arian Maghazeh1, Sudipta Chattopadhyay2, Petru Eles1 and Zebo Peng1
1Department of Computer and Information Science, Linköping University, Sweden
2Singapore University of Technology and Design, Singapore

ABSTRACT


We present a software approach to address the data latency issue for certain GPU applications. Each application is modeled as a kernel graph, in which the nodes represent individual GPU kernels and the edges capture data dependencies. Our technique exploits the GPU L2 cache to accelerate parameter passing between kernels. The key idea is that, instead of having each kernel process the entire input in one invocation, we subdivide the input into fragments (which fit in the cache) and, ideally, process each fragment in one continuous sequence of kernel invocations. Our technique is oblivious to kernel functionality and requires minimal source-code modification. We demonstrate it on a full-fledged image-processing application and improve performance by an average of 30% across various settings.
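A minimal CUDA sketch of the key idea, under assumptions of our own: the kernel pair (scaleKernel, offsetKernel), the tiled driver processTiled, and the tile size are hypothetical illustrations, not the authors' implementation. It shows how a chain of dependent kernels can be launched back-to-back on one cache-sized fragment at a time, so that the intermediate data is likely still resident in the L2 cache when the next kernel consumes it.

#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical kernel pair: the first produces an intermediate buffer,
// the second consumes it. Bodies are illustrative placeholders.
__global__ void scaleKernel(const float *in, float *tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 2.0f * in[i];
}

__global__ void offsetKernel(const float *tmp, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + 1.0f;
}

// Tiled driver: instead of running each kernel once over the whole input,
// split the input into fragments sized to fit in the L2 cache and run the
// whole kernel chain on one fragment before moving to the next, so the
// intermediate buffer is (ideally) still cache-resident when it is reused.
void processTiled(const float *d_in, float *d_tmp, float *d_out,
                  int total, int tileElems /* chosen to fit in L2 */) {
    const int threads = 256;
    for (int off = 0; off < total; off += tileElems) {
        int n = std::min(tileElems, total - off);
        int blocks = (n + threads - 1) / threads;
        // Back-to-back launches on the same fragment.
        scaleKernel<<<blocks, threads>>>(d_in + off, d_tmp + off, n);
        offsetKernel<<<blocks, threads>>>(d_tmp + off, d_out + off, n);
    }
}

In the untiled baseline, scaleKernel would stream the entire intermediate buffer through L2 before offsetKernel reads it back from DRAM; with fragments small enough to stay cached, the second kernel's reads are served from L2 instead.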


