Activation Density based Mixed-Precision Quantization for Energy Efficient Neural Networks

Karina Vasquez1, Yeshwanth Venkatesha2, Abhiroop Bhattacharjee2, Abhishek Moitra2, and Priyadarshini Panda2
1Department of Electrical Engineering, UTEC, Peru
karina.vasquez@utec.edu.pe
2Department of Electrical Engineering, Yale University, USA
yeshwanth.venkatesha@yale.edu
abhiroop.bhattacharjee@yale.edu
abhishek.moitra@yale.edu
priya.panda@yale.edu

ABSTRACT


As neural networks gain widespread adoption in embedded devices, there is a growing need for model compression techniques to facilitate seamless deployment in resource-constrained environments. Quantization is one of the go-to methods for achieving state-of-the-art model compression. Most quantization approaches take a fully trained model, apply heuristics to determine the optimal bit-precision for each layer of the network, and finally retrain the network to recover any drop in accuracy. We propose a novel in-training quantization method based on Activation Density, the proportion of non-zero activations in a layer. Our method calculates the optimal bit-width/precision for each layer during training, yielding an energy-efficient mixed-precision model with competitive accuracy. Since we train progressively lower-precision models during training, our approach yields the final quantized model at lower training complexity and also eliminates the need for retraining. We run experiments on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImagenet with VGG19/ResNet18 architectures and report the corresponding accuracy and energy estimates. We achieve up to a 4.5× benefit in terms of estimated multiply-and-accumulate (MAC) reduction while reducing the training complexity by 50% in our experiments. To further evaluate the energy benefits of our proposed method, we develop a mixed-precision, scalable Process In-Memory (PIM) hardware accelerator platform. The hardware platform incorporates shift-add functionality for handling multi-bit precision neural network models. Evaluating the quantized models obtained with our proposed method on the PIM platform yields about 5× energy reduction compared to baseline 16-bit models. Additionally, we find that integrating activation-density-based quantization with activation-density-based pruning (both conducted during training) yields up to ∼198× and ∼44× energy reductions for the VGG19 and ResNet18 architectures, respectively, on the PIM platform compared to baseline 16-bit precision, unpruned models.

Keywords: Neural Networks, Quantization, Activation Density, Process In-Memory.
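As an illustration of the quantity the abstract is built around, the following minimal PyTorch sketch measures per-layer activation density (the fraction of non-zero post-ReLU activations) with forward hooks and maps it to a bit-width. This is not the paper's code: the density_to_bits mapping, the toy model, and the hook names are hypothetical placeholders standing in for the paper's actual policy.

import torch
import torch.nn as nn

def activation_density(act: torch.Tensor) -> float:
    # Fraction of non-zero entries in a layer's (post-ReLU) output.
    return (act != 0).float().mean().item()

def density_to_bits(density: float, max_bits: int = 16, min_bits: int = 2) -> int:
    # Hypothetical mapping: sparser layers (lower density) get fewer bits.
    # The paper's actual bit-width selection rule may differ.
    bits = int(round(density * max_bits))
    return max(min_bits, min(max_bits, bits))

densities = {}

def make_hook(name):
    # Record the activation density of the hooked layer on each forward pass.
    def hook(module, inputs, output):
        densities[name] = activation_density(output)
    return hook

# Toy model used only for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(8, 3, 32, 32))  # one forward pass to populate densities
layer_bits = {name: density_to_bits(d) for name, d in densities.items()}
print(densities, layer_bits)

In an in-training scheme along the lines described above, such per-layer densities would be tracked periodically during training and the assigned bit-widths updated as training progresses.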
