CMRC: Comprehensive Microarchitectural Register Coalescing for GPGPUs

Ahmad M. Radaideh¹ and Paul V. Gratz²
¹Qualcomm Technologies, Inc., Austin, TX, USA
ahmadr@qti.qualcomm.com
²Texas A&M University, College Station, TX, USA
pgratz@gratz1.com

ABSTRACT


Graphics processing units (GPUs) deploy a large register file (RF) to achieve high compute throughput. This RF, however, consumes a large portion of the total dynamic power in the GPU. Additionally, the RF banks and operand collectors (OCs) are designed with a limited number of ports, which serializes accesses and degrades performance. In this work, we introduce CMRC, a coalescing-aware RF organization that takes advantage of the narrow-width data frequently present in general-purpose applications to increase performance and reduce energy for GPGPUs. CMRC is a low-cost, comprehensive approach to register coalescing that combines narrow-width read and write accesses from the same or different warp instructions into fewer accesses, reducing port contention and access pressure. On general-purpose applications, CMRC reduces RF accesses by 31.8%, achieves a performance speedup of 16.5%, and reduces overall GPU energy by 32.2% on average, outperforming best-in-class prior work by ∼1.8x without requiring compiler support.
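As a rough illustration of why narrow-width data enables coalescing, the following Python sketch models the idea in software. It is not the CMRC hardware design: the 16-bit narrowness threshold, the two-registers-per-access packing, and all function names (is_narrow, pack_pair, unpack_pair, count_accesses) are assumptions made for this example only.

```python
# Illustrative model only: names and the 16-bit threshold are assumptions,
# not details taken from the CMRC design.

def is_narrow(value: int, bits: int = 16) -> bool:
    """True if a 32-bit two's-complement value fits in `bits` bits,
    i.e. its upper bits are a pure sign extension."""
    value &= 0xFFFFFFFF
    upper = value >> (bits - 1)                 # sign bit plus all higher bits
    all_ones = (1 << (32 - bits + 1)) - 1
    return upper == 0 or upper == all_ones


def pack_pair(lo: int, hi: int, bits: int = 16) -> int:
    """Pack two narrow values into one 32-bit word."""
    mask = (1 << bits) - 1
    return ((hi & mask) << bits) | (lo & mask)


def unpack_pair(word: int, bits: int = 16) -> tuple[int, int]:
    """Recover both sign-extended values from a packed word."""
    mask = (1 << bits) - 1

    def sign_extend(v: int) -> int:
        return v - (1 << bits) if v & (1 << (bits - 1)) else v

    return sign_extend(word & mask), sign_extend((word >> bits) & mask)


def count_accesses(registers: list[list[int]]) -> tuple[int, int]:
    """Compare access counts for a list of warp registers (one value per lane).

    Baseline: one access per register.
    Coalesced (assumed model): a register is narrow only if every lane is
    narrow, and two narrow registers can share a single access.
    """
    narrow = sum(1 for reg in registers if all(is_narrow(v) for v in reg))
    wide = len(registers) - narrow
    return len(registers), wide + (narrow + 1) // 2
```

Under this simplified model, three warp registers of which two are narrow need two accesses instead of three. The actual savings reported above depend on the real coalescing rules, port structure, and workload, none of which this sketch captures.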


