DATE 2021

Value Similarity Extensions for Approximate Computing in General-Purpose Processors

Younghoon Kim^1,a, Swagath Venkataramani^2,a, Sanchari Sen^2,b and Anand Raghunathan^1,b
^1School of Electrical and Computer Engineering, Purdue University
^akim1606@purdue.edu
^braghunathan@purdue.edu
^2IBM T. J. Watson Research Center
^aswagath.venkataramani@ibm.com
^bsanchari.sen@ibm.com

ABSTRACT

Approximate Computing (AxC) is a popular design paradigm wherein selected computations are executed approximately to gain efficiency with minimal impact on application-level quality. Most efforts in AxC target specialized accelerators and domain-specific processors, with relatively limited focus on General-Purpose Processors (GPPs). However, GPPs are still broadly used to execute applications that are amenable to AxC, making AxC for GPPs a critical challenge.

A key bottleneck in applying AxC to GPPs is that their execution units account for only a small fraction of total energy, requiring a holistic approach targeting compute, memory and control front-ends. This paper proposes such an approach that leverages the application property of value similarity, i.e., input operands to computations that occur close in time take similar values. Such similar computations are dynamically pre-detected, and the fetch-decode-execute of entire instruction sequences is skipped to benefit performance. To this end, we propose a set of lightweight micro-architectural and ISA extensions called VSX that enable: (i) similarity detection amongst values in a cache-line, (ii) skipping of pre-defined instructions and/or loop iterations when similarity is detected, and (iii) substituting outputs of skipped instructions with saved results from previously executed computations. We also develop compiler techniques, guided by user annotations, to benefit from VSX in the context of common Machine Learning (ML) kernels. Our RTL implementation of VSX for a low-power RISC-V processor incurred a 2.13% area overhead and yielded 1.19× to 3.84× speedup with <0.5% accuracy loss on 6 ML benchmarks.
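
The three capabilities listed above (detect similarity within a cache-line, skip work, substitute a saved result) can be illustrated with a small software analogy. This is only a sketch under stated assumptions, not the VSX hardware or ISA: the names `LINE_WIDTH`, `is_similar` and `approx_map`, the max-minus-min similarity test, and the threshold parameter are all hypothetical choices made for illustration.

```python
LINE_WIDTH = 8  # hypothetical number of values held in one cache-line


def is_similar(line, threshold):
    """Assumed similarity test: the spread of values in the line is at most threshold."""
    return max(line) - min(line) <= threshold


def approx_map(func, values, threshold, line_width=LINE_WIDTH):
    """Apply func to every value, one line at a time.

    When a line is similar, func runs once on the first value and the saved
    result is substituted for the remaining values of that line.
    Returns (outputs, skipped), where skipped counts the evaluations avoided.
    """
    outputs = []
    skipped = 0
    for start in range(0, len(values), line_width):
        line = values[start:start + line_width]
        if len(line) > 1 and is_similar(line, threshold):
            saved = func(line[0])                # execute once
            outputs.extend([saved] * len(line))  # substitute the saved result
            skipped += len(line) - 1
        else:
            outputs.extend(func(v) for v in line)  # exact path, nothing skipped
    return outputs, skipped
```

For instance, with a line width of 4 and a threshold of 0.05, the input `[1.0, 1.01, 1.02, 1.0, 5.0, 9.0, 1.0, 2.0]` has one similar line and one dissimilar line, so three of the eight evaluations are skipped while the second line is computed exactly. The threshold sets the trade-off between work saved and output error.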
