TCX: A Programmable Tensor Processor

Tailin Liang (1,2,a), Lei Wang (1,b), Shaobo Shi (1,2,b), John Glossner (1,3,c), and Xiaotong Zhang (1,d)
1) School of Computer Science and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
2) Hua Xia General Processor Technologies, Beijing 100080, China
3) General Processor Technologies, Tarrytown, NY 10591, United States
a) tailin.liang@xs.ustb.edu.cn
b) wanglei@ustb.edu.cn
c) jglossner@ustb.edu.cn
d) zxt@ustb.edu.cn

ABSTRACT

Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational requirements of deep learning algorithms. This paper proposes a new instruction set extension for tensor computing, TCX, with RISC-style instructions and variable-length tensor extensions. It features a multidimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC ISAs and provides software compatibility across scalable hardware implementations. We present an implementation of the TCX tensor computing accelerator using an out-of-order microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register renaming mechanism is described that allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements based on tensor dimensions. Implementations may balance data bandwidth and computation utilization for different types of tensor computations, such as element-wise, depth-wise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 Tera operations per second (TOPS) using a compute unit with 4096 multiply-accumulate (MAC) units at up to 98.83% MAC utilization. It occupies 12.8 square millimeters and dissipates 0.46 Watts per TOPS in TSMC 28 nm technology.
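For context (a back-of-the-envelope check, not a claim made in the abstract itself), the sustained figure is consistent with the peak throughput implied by the stated configuration, counting each multiply-accumulate as two operations:

% Peak throughput from 4096 MAC units at 1 GHz, 2 ops (multiply + add) per MAC per cycle:
\[
  4096\ \text{MACs} \times 1\,\text{GHz} \times 2\,\frac{\text{ops}}{\text{MAC}\cdot\text{cycle}}
  = 8.192\ \text{TOPS} \approx 8.2\ \text{TOPS}.
\]

At the reported 0.46 W per TOPS, this corresponds to a total power of roughly 8.2 x 0.46, or about 3.8 W.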

Keywords: Neural Network Accelerator, Convolutional Neural Network, ASIC Design.


