In-Memory Computing based Accelerator for Transformer Networks for Long Sequences

Ann Franchesca Laguna, Arman Kazemi, Michael Niemier and X. Sharon Hu
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana, USA
alaguna@nd.edu, akazemi@nd.edu, mniemier@nd.edu, shu@nd.edu

ABSTRACT


Transformer networks have outperformed recurrent neural networks and convolutional neural networks on various sequential tasks. However, scaling transformer networks to long sequences has been challenging because of memory and compute bottlenecks. Transformer networks are impeded by memory bandwidth limitations due to their low operations-per-byte ratio, which results in low utilization of a GPU's computing resources. In-memory processing can mitigate memory bottlenecks by eliminating the transfer time between memory and compute units. Furthermore, transformer networks use neural attention mechanisms to characterize the relationships between sequence elements. Hardware solutions based on ternary content addressable memories (TCAMs), crossbar arrays (XBars), and processing in-memory (PIM) have been proposed to implement efficient attention mechanisms; however, these solutions do not implement a multi-head self-attention mechanism. We propose using a combination of XBars and CAMs to accelerate transformer networks. We improve the speed of transformer networks by (1) computing in memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, (3) exploiting the parallelism available in the attention mechanism, and (4) using locality-sensitive hashing (LSH) to filter sequence elements by their importance. Our approach achieves a 200x speedup and 41x energy improvement for a sequence length of 4098.
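To illustrate item (4), the sketch below shows one way LSH-style bucketing can restrict attention to the sequence elements most likely to be relevant for each query. This is a minimal software sketch only: the random-projection hashing, the bucket fallback, and all function names are illustrative assumptions, not the paper's CAM/XBar hardware implementation.

```python
# Illustrative sketch (not the paper's CAM/XBar design): filter attention
# candidates with random-projection LSH so each query only scores keys that
# fall into the same hash bucket, instead of scoring the full sequence.
import numpy as np

def lsh_signatures(x, planes):
    """Hash each row of x to an integer bucket id via random-projection signs."""
    bits = (x @ planes) > 0                                   # sign pattern per vector
    return bits.astype(np.int64) @ (1 << np.arange(planes.shape[1]))  # pack bits

def lsh_filtered_attention(q, k, v, n_bits=4, seed=0):
    """Softmax attention where each query attends only to keys in its bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((q.shape[-1], n_bits))       # hyperplanes shared by q and k
    q_ids, k_ids = lsh_signatures(q, planes), lsh_signatures(k, planes)
    out = np.zeros((q.shape[0], v.shape[-1]))
    for i, qid in enumerate(q_ids):
        idx = np.flatnonzero(k_ids == qid)                    # candidate keys only
        if idx.size == 0:                                     # empty bucket: fall back to all keys
            idx = np.arange(k.shape[0])
        scores = q[i] @ k[idx].T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())               # numerically stable softmax
        out[i] = (weights / weights.sum()) @ v[idx]
    return out

# Example: a 4098-token sequence with 64-dimensional heads (sizes chosen arbitrarily).
q = np.random.randn(4098, 64); k = np.random.randn(4098, 64); v = np.random.randn(4098, 64)
attn_out = lsh_filtered_attention(q, k, v)                    # shape (4098, 64)
```

In this sketch, filtering reduces the per-query work from scoring all 4098 keys to scoring only the keys sharing a bucket, which is the software analogue of the importance-based filtering described above.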

Keywords: Transformers, Crossbars, TCAM, LSH, Parallelization, In-Memory Computing.


