3.4 Accelerating Neural Networks and Vision Workloads

Date: Tuesday 10 March 2020
Time: 14:30 - 16:00
Location / Room: Stendhal

Chair:
Leonidas Kosmidis, BSC, ES

Co-Chair:
Georgios Keramidas, Aristotle University of Thessaloniki/Think Silicon S.A., GR

This session presents different solutions for accelerating emerging applications. The papers cover various microarchitectural techniques as well as complete SoC- and RISC-V-based solutions, along with finer-grained techniques such as fast computation on sparse matrices. Vision applications are represented by the popular vSLAM, while various types and forms of emerging Neural Networks (such as Recurrent, Quantized, and Siamese NNs) are considered.

Time | Label | Presentation Title / Authors
14:30 | 3.4.1 | PSB-RNN: A PROCESSING-IN-MEMORY SYSTOLIC ARRAY ARCHITECTURE USING BLOCK CIRCULANT MATRICES FOR RECURRENT NEURAL NETWORKS
Speaker:
Nagadastagiri Challapalle, Pennsylvania State University, US
Authors:
Nagadastagiri Challapalle1, Sahithi Rampalli1, Makesh Tarun Chandran1, Gurpreet Singh Kalsi2, John (Jack) Sampson1, Sreenivas Subramoney2 and Vijaykrishnan Narayanan1
1Pennsylvania State University, US; 2Intel Labs, IN
Abstract
Recurrent Neural Networks (RNNs) are widely used in Natural Language Processing (NLP) applications as they inherently capture contextual information across spatial and temporal dimensions. Compared to other classes of neural networks, RNNs have more weight parameters as they primarily consist of fully connected layers. Recently, several techniques such as weight pruning, zero-skipping, and block circulant compression have been introduced to reduce the storage and access requirements of RNN weight parameters. In this work, we present a ReRAM crossbar based processing-in-memory (PIM) architecture with systolic dataflow incorporating block circulant compression for RNNs. The block circulant compression decomposes the operations in a fully connected layer into a series of Fourier transforms and point-wise operations resulting in reduced space and computational complexity. We formulate the Fourier transform and point-wise operations into in-situ multiply-and-accumulate (MAC) operations mapped to ReRAM crossbars for high energy efficiency and throughput. We also incorporate systolic dataflow for communication within the crossbar arrays, in contrast to broadcast and multicast communications, to further improve energy efficiency. The proposed architecture achieves average improvements in compute efficiency of 44x and 17x over a custom FPGA architecture and conventional crossbar based architecture implementations, respectively.
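
As a concrete illustration of the compression step described above, the following numpy sketch (purely illustrative; it models the math, not the authors' ReRAM crossbar mapping) shows how a block-circulant fully connected layer replaces a dense matrix-vector product with FFTs and point-wise multiplies:

import numpy as np

def circulant_matvec(c, x):
    # C @ x where C is the k-by-k circulant matrix with first column c:
    # O(k log k) via FFT instead of O(k^2) for a dense product.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_fc(first_cols, x, k):
    # first_cols[i][j] is the length-k vector defining circulant block (i, j).
    # Each output chunk is a sum of circulant matvecs -- the "Fourier
    # transforms and point-wise operations" the abstract refers to.
    p, q = len(first_cols), len(first_cols[0])
    xs = x.reshape(q, k)
    y = np.zeros((p, k))
    for i in range(p):
        for j in range(q):
            y[i] += circulant_matvec(first_cols[i][j], xs[j])
    return y.reshape(p * k)

# Sanity check for one block: explicit circulant matrix vs. FFT path.
k = 4
rng = np.random.default_rng(0)
c, x = rng.standard_normal(k), rng.standard_normal(k)
C = np.stack([np.roll(c, j) for j in range(k)], axis=1)
assert np.allclose(C @ x, circulant_matvec(c, x))

Storage drops from (p*k)*(q*k) weights to p*q*k defining values, which is what makes the in-situ crossbar mapping compact.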

15:00 | 3.4.2 | XPULPNN: ACCELERATING QUANTIZED NEURAL NETWORKS ON RISC-V PROCESSORS THROUGH ISA EXTENSIONS
Speaker:
Angelo Garofalo, Università di Bologna, IT
Authors:
Angelo Garofalo1, Giuseppe Tagliavini1, Francesco Conti2, Davide Rossi1 and Luca Benini2
1Università di Bologna, IT; 2ETH Zurich, CH / Università di Bologna, IT
Abstract
Strongly quantized fixed-point arithmetic is considered the key enabler for CNN inference on low-power, resource-constrained edge devices. However, the deployment of highly quantized Neural Networks at the extreme edge of the IoT, on fully programmable MCUs, is currently limited by the lack of Instruction Set Architecture (ISA) support for sub-byte fixed-point data types. Running low-bitwidth (i.e., 2- and 4-bit) QNN kernels therefore requires numerous extra instructions for packing and unpacking data, creating a bottleneck for the performance and energy efficiency of QNN inference. In this work we present a set of extensions to the RISC-V ISA aimed at boosting the energy efficiency of low-bitwidth QNNs on low-power microcontroller-class cores. The microarchitecture supporting the new extensions is built on top of a RISC-V core featuring instruction set extensions targeting energy-efficient digital signal processing. To evaluate the extensions, we integrated the core into a full microcontroller system, synthesized and placed-and-routed in 22nm FDX technology. QNN convolution kernels implemented on the new core run 5.3× and 8.9× faster for 4- and 2-bit data operands, respectively, compared to a baseline processor supporting only 8-bit SIMD instructions. With a peak of 279 GMAC/s/W, the proposed solution achieves 9× better energy efficiency than the baseline and two orders of magnitude better energy efficiency than state-of-the-art microcontrollers.
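
To see why sub-byte ISA support matters, the following Python model (an illustrative software emulation, not the proposed RISC-V extension) spells out the shift/mask/sign-extend work a core without sub-byte SIMD must perform for every byte of a 2-bit QNN kernel:

def unpack_int2(byte):
    # Extract four signed 2-bit lanes from one byte (sign-extend from 2 bits).
    lanes = [(byte >> (2 * i)) & 0x3 for i in range(4)]
    return [v - 4 if v & 0x2 else v for v in lanes]

def qnn_dot(packed_a, packed_b):
    # Dot product over byte arrays, each byte packing four 2-bit operands.
    # In pure software every byte pair costs several unpacking instructions;
    # a sub-byte SIMD dot-product instruction collapses this into one op.
    acc = 0
    for a, b in zip(packed_a, packed_b):
        for x, y in zip(unpack_int2(a), unpack_int2(b)):
            acc += x * y
    return acc

# Example: two bytes per operand, i.e. eight 2-bit values each.
print(qnn_dot([0b11100100, 0b01010101], [0b00011011, 0b11111111]))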

15:30 | 3.4.3 | SNA: A SIAMESE NETWORK ACCELERATOR TO EXPLOIT THE MODEL-LEVEL PARALLELISM OF HYBRID NETWORK STRUCTURE
Speaker:
Xingbin Wang, Chinese Academy of Sciences, CN
Authors:
Xingbin Wang, Boyan Zhao, Rui Hou and Dan Meng, Chinese Academy of Sciences, CN
Abstract
The Siamese network is a compute-intensive learning model with growing applicability in a wide range of domains. However, state-of-the-art deep neural network (DNN) accelerators do not work efficiently for Siamese networks, as their designs do not account for the algorithmic properties of these models. In this paper, we propose a Siamese network accelerator called SNA, the first Simultaneous Multi-Threading (SMT) hardware architecture to perform Siamese network inference with high performance and energy efficiency. We devise an adaptive inter-model computing-resource partition and a flexible on-chip buffer management mechanism based on model parallelism and the SMT design philosophy. Our architecture is implemented in Verilog and synthesized in a 65nm technology using Synopsys design tools. We also evaluate it on several typical Siamese networks. Compared to a state-of-the-art accelerator, the SNA architecture offers, on average, a 2.1x speedup and a 1.48x energy reduction.
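
The model-level parallelism in question comes from the twin, weight-sharing branches of a Siamese network, sketched here in plain numpy (illustrative only; SNA's resource-partitioning and buffer-management policies are not modeled):

import numpy as np

def branch(x, weights):
    # One twin branch: a tiny shared-weight feature extractor.
    for W in weights:
        x = np.maximum(W @ x, 0.0)   # fully connected layer + ReLU
    return x

def siamese_distance(x1, x2, weights):
    # The two branches share the SAME weights and have no data dependence on
    # each other, so an SMT-style accelerator can co-schedule them on
    # partitioned compute resources instead of running them back-to-back.
    return np.linalg.norm(branch(x1, weights) - branch(x2, weights))

# Hypothetical usage: a 2-layer extractor on random inputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 32)), rng.standard_normal((8, 16))]
print(siamese_distance(rng.standard_normal(32), rng.standard_normal(32), weights))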

15:45 | 3.4.4 | HCVEACC: A HIGH-PERFORMANCE AND ENERGY-EFFICIENT ACCELERATOR FOR TRACKING TASK IN VSLAM SYSTEM
Speaker:
Meng Liu, Chinese Academy of Sciences, CN
Authors:
Li Renwei, Wu Junning, Liu Meng, Chen Zuding, Zhou Shengang and Feng Shanggong, Chinese Academy of Sciences, CN
Abstract
Visual SLAM (vSLAM) is a critical computer vision technology that builds a map of an unknown environment while simultaneously localizing the camera within it, leveraging the partially built map. Although several software SLAM processing frameworks exist, general-purpose processors can still hardly achieve real-time SLAM at a reasonably low cost. In this paper, we propose HcveAcc, the first specialized CMOS-based hardware accelerator that optimizes the tracking task in a vSLAM system for high performance and energy efficiency. HcveAcc targets the dominant sources of time overhead in the tracking process, namely high-density feature extraction and high-precision descriptor generation, and provides a configurable hardware architecture that handles higher-resolution image data. We have implemented HcveAcc in a 28nm CMOS technology using commercial EDA tools and evaluated it on the EuRoC and TUM datasets to demonstrate its robustness and accuracy in the SLAM tracking procedure. Our results show that HcveAcc achieves a 4.3X speedup while consuming much less energy than state-of-the-art FPGA solutions.
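
The descriptor-generation stage singled out above is dominated by simple, massively repeated pixel comparisons, as this BRIEF-style sketch suggests (a generic binary descriptor assumed here for illustration; the paper's exact feature pipeline may differ):

import numpy as np

def binary_descriptor(patch, pairs):
    # Compare pixel intensities at pre-chosen point pairs and pack the
    # outcomes into bits: bit-level, comparison-heavy work that maps
    # naturally onto fixed-function hardware.
    bits = [1 if patch[y1, x1] < patch[y2, x2] else 0
            for (y1, x1, y2, x2) in pairs]
    return np.packbits(bits)

# Hypothetical usage: 256 comparison pairs inside a 31x31 patch
# around a detected feature, yielding a 32-byte descriptor.
rng = np.random.default_rng(0)
pairs = rng.integers(0, 31, size=(256, 4))
patch = rng.integers(0, 256, size=(31, 31))
desc = binary_descriptor(patch, pairs)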

16:01 | IP1-13, 55 | ASCELLA: ACCELERATING SPARSE COMPUTATION BY ENABLING STREAM ACCESSES TO MEMORY
Speaker:
Bahar Asgari, Georgia Tech, US
Authors:
Bahar Asgari, Ramyad Hadidi and Hyesoon Kim, Georgia Tech, US
Abstract
Sparse computations dominate a wide range of applications, from scientific problems to graph analytics. The main characteristic of sparse computations, indirect memory accesses, prevents them from achieving high performance on general-purpose processors. Hardware accelerators have therefore been proposed for sparse problems. For these accelerators, the storage format and the decompression mechanism are crucial, but both have received little attention in prior work. To address this gap, we propose Ascella, an accelerator for sparse computations that, besides enabling a smooth stream of data and parallel computation, provides a fast decompression mechanism. Our implementation on a ZYNQ FPGA shows that, on average, Ascella executes sparse problems up to 5.1x as fast as prior work.
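
The indirect accesses in question are easy to see in a CSR sparse matrix-vector product; the gather on x below is the irregular pattern that a streaming format and fast decompressor must work around (a textbook CSR kernel, not Ascella's actual format):

import numpy as np

def spmv_csr(indptr, indices, data, x):
    # 'data' and 'indices' stream sequentially, but x[indices[k]] is an
    # indirect (gather) access that defeats caches and prefetchers.
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

# 2x3 example: [[1, 0, 2], [0, 3, 0]] times [1, 1, 1] -> [3, 3].
print(spmv_csr([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], np.ones(3)))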

16:02 | IP1-14, 645 | ACCELERATION OF PROBABILISTIC REASONING THROUGH CUSTOM PROCESSOR ARCHITECTURE
Speaker:
Nimish Shah, KU Leuven, BE
Authors:
Nimish Shah, Laura I. Galindez Olascoaga, Wannes Meert and Marian Verhelst, KU Leuven, BE
Abstract
Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models that combine Deep Learning with probabilistic reasoning for safety-critical applications like self-driving vehicles and autonomous drones. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper therefore proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning kernel. The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.
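
A sum-product network is a DAG of weighted sums and products over leaf probabilities; the recursive sketch below (a generic SPN evaluator, not the paper's processor datapath) shows the irregular, low-arithmetic-intensity kernel being accelerated:

def eval_spn(node, leaf_values):
    # Bottom-up evaluation: leaves return probabilities, product nodes
    # multiply children, sum nodes form weighted mixtures of children.
    kind, payload = node
    if kind == "leaf":
        return leaf_values[payload]
    if kind == "product":
        out = 1.0
        for child in payload:
            out *= eval_spn(child, leaf_values)
        return out
    # sum node: payload is a list of (weight, child) pairs
    return sum(w * eval_spn(child, leaf_values) for w, child in payload)

# Hypothetical tiny SPN: 0.3*(X0*X1) + 0.7*(X2*X3) over four leaf inputs.
spn = ("sum", [(0.3, ("product", [("leaf", 0), ("leaf", 1)])),
               (0.7, ("product", [("leaf", 2), ("leaf", 3)]))])
print(eval_spn(spn, [0.9, 0.2, 0.1, 0.8]))  # 0.3*0.18 + 0.7*0.08 = 0.11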

16:00 | End of session