Hydra: A Near Hybrid Memory Accelerator for CNN Inference

Palash Das1,a, Ajay Joshi2 and Hemangee K. Kapoor1,b
1Department of CSE, IIT Guwahati, Guwahati, India
apalash.das@iitg.ac.in
bhemangee@iitg.ac.in
2Department of ECE, Boston University, Boston, USA
joshi@bu.edu

ABSTRACT


Convolutional neural network (CNN) accelerators often suffer from limited off-chip memory bandwidth and on-chip capacity constraints. One solution to this problem is near-memory or in-memory processing. Non-volatile memory, such as phase-change memory (PCM), has emerged as a promising DRAM alternative. It is also used in combination with DRAM, forming a hybrid memory. Though near-memory processing (NMP) has been used to accelerate CNN inference, the feasibility and efficacy of NMP have remained unexplored for a hybrid main memory system. Additionally, PCM is known to have low write endurance, so the tremendous number of writes generated by accelerators can drastically reduce the longevity of PCM memory. In this work, we propose Hydra, a near hybrid memory accelerator integrated close to the DRAM to execute inference. The PCM banks store the models, which are only read by the memory controller during inference. Throughout the entire forward propagation (inference), Hydra directs all intermediate writes to DRAM, eliminating PCM writes and thereby extending PCM lifetime. Unlike other in-DRAM processing-based works, Hydra does not eliminate multiplication operations by resorting to binary or ternary neural networks, making it more suitable for applications that require high accuracy. We also exploit inter- and intra-chip (DRAM chip) parallelism to improve the system's performance. On average, Hydra achieves around a 20x performance improvement over state-of-the-art in-DRAM processing-based works while accelerating CNN inference.
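The write policy described above can be illustrated with a minimal sketch. This toy model is purely illustrative (the class names and the two-layer "network" are assumptions, not the paper's implementation): weights reside in a PCM region that is only read during inference, while every intermediate activation write is redirected to DRAM, so the PCM write counter stays at zero.

```python
# Hypothetical sketch of Hydra's hybrid-memory write policy (illustrative
# names only): weights are read from PCM; all inference-time writes go to DRAM.

class HybridMemory:
    def __init__(self, weights):
        self.pcm = dict(weights)   # model parameters, read-only at inference
        self.dram = {}             # intermediate activations
        self.pcm_writes = 0
        self.dram_writes = 0

    def read_weight(self, key):
        return self.pcm[key]

    def write_activation(self, key, value):
        # Every inference-time write is redirected to DRAM,
        # preserving PCM endurance.
        self.dram[key] = value
        self.dram_writes += 1


def infer(mem, x):
    """Toy two-'layer' forward pass: each layer scales by a stored weight."""
    for layer in ("w0", "w1"):
        x = x * mem.read_weight(layer)
        mem.write_activation(f"act_{layer}", x)
    return x


mem = HybridMemory({"w0": 2.0, "w1": 3.0})
out = infer(mem, 1.5)
print(out, mem.pcm_writes, mem.dram_writes)  # 9.0 0 2
```

After a full forward pass, `pcm_writes` remains zero while all activation traffic lands in DRAM, which is the lifetime-preserving property the abstract claims.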

Keywords: Convolutional Neural Network, Accelerator, Near-memory Processing, Hybrid Memory.


