6.5 Efficient Data Representations in Neural Networks


Date: Wednesday 11 March 2020
Time: 11:00 - 12:30
Location / Room: Bayard

Chair:
Brandon Reagen, Facebook and New York University, US

Co-Chair:
Sebastian Steinhorst, TU Munich, DE

The large processing requirements of ML models strain the capabilities of low-power embedded systems. Addressing this challenge, the first presentation proposes a robust co-design that leverages stochastic computing for highly accurate and efficient inference. Next, a structural optimization is proposed to counter faults at low voltage levels. Then, the authors present a method for sharing results in binarized CNNs to reduce computation. The session concludes with a talk on implementing binary networks on mobile GPUs.

Time  Label  Presentation Title / Authors
11:00  6.5.1  ACOUSTIC: ACCELERATING CONVOLUTIONAL NEURAL NETWORKS THROUGH OR-UNIPOLAR SKIPPED STOCHASTIC COMPUTING
Speaker:
Puneet Gupta, University of California, Los Angeles, US
Authors:
Wojciech Romaszkan, Tianmu Li, Tristan Melton, Sudhakar Pamarti and Puneet Gupta, University of California, Los Angeles, US
Abstract
As privacy and latency requirements force a move towards edge Machine Learning inference, resource-constrained devices are struggling to cope with large and computationally complex models. For Convolutional Neural Networks, those limitations can be overcome by taking advantage of enormous data-reuse opportunities and amenability to reduced precision. To do that, however, a level of compute density unattainable for conventional binary arithmetic is required. Stochastic Computing can deliver such density, but it has not lived up to its full potential because of multiple underlying precision issues. We present ACOUSTIC: Accelerating Convolutions through Or-Unipolar Skipped sTochastIc Computing, an accelerator framework that enables fully stochastic, high-density CNN inference. Leveraging a split-unipolar representation, OR-based accumulation and a novel computation-skipping approach, ACOUSTIC delivers server-class parallelism within a mobile area and power budget: a 12 mm² accelerator can be as much as 38.7x more energy efficient and 72.5x faster than conventional fixed-point accelerators. It can also be up to 79.6x more energy efficient than state-of-the-art stochastic accelerators. At the lower end, ACOUSTIC achieves an 8x-120x inference throughput improvement with similar energy and area when compared to recent mixed-signal/neuromorphic accelerators.
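
A minimal NumPy sketch of the unipolar stochastic-computing primitives the abstract refers to: values in [0, 1] are encoded as random bitstreams, multiplication becomes a bitwise AND, and OR-based accumulation approximates a saturating sum. The stream length, helper names and toy values below are illustrative assumptions, not details of the ACOUSTIC design.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096  # bitstream length (illustrative; longer streams reduce variance)

def to_stream(p, n=N):
    """Encode a probability p in [0, 1] as a unipolar stochastic bitstream."""
    return rng.random(n) < p

def from_stream(s):
    """Decode a bitstream back to a value by counting ones."""
    return s.mean()

a, b = 0.6, 0.3
# Unipolar multiplication: AND of two independent streams estimates a * b.
prod = from_stream(to_stream(a) & to_stream(b))
# OR-based accumulation: OR of streams estimates 1 - (1 - a)(1 - b),
# a saturating approximation of a + b that avoids costly adders.
acc = from_stream(to_stream(a) | to_stream(b))

print(f"a*b ~ {prod:.3f} (exact {a * b:.3f})")
print(f"a|b ~ {acc:.3f} (exact {1 - (1 - a) * (1 - b):.3f})")
```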

11:30  6.5.2  ACCURACY TOLERANT NEURAL NETWORKS UNDER AGGRESSIVE POWER OPTIMIZATION
Speaker:
Yi-Wen Hung, National Tsing Hua University, TW
Authors:
Xiang-Xiu Wu¹, Yi-Wen Hung¹, Yung-Chih Chen² and Shih-Chieh Chang¹
¹National Tsing Hua University, TW; ²Yuan Ze University, Taoyuan, Taiwan, TW
Abstract
With the success of deep learning, many neural network models have been proposed and applied to various applications. In several applications, the devices used to implement these complicated models have limited power resources, so aggressive optimization techniques are often applied to save power. However, some optimization techniques, such as voltage scaling and multiple threshold voltages, may increase the probability of errors due to slow signal propagation, which increases the path delay in a circuit and causes some input patterns to fail. Although neural network models are considered to have some error tolerance, the prediction accuracy can be significantly affected when there are a large number of errors. Thus, in this paper, we propose a scheme to mitigate the errors caused by slow signal propagation. Since the delay of the multipliers dominates the critical path of the circuit, we identify the patterns most affected by slow signal propagation through the multipliers and prevent those patterns from failing by adjusting the neural network and its parameters. The proposed scheme modifies the neural network on the software side, so it is unnecessary to re-design the hardware structure. The experimental results show that the proposed scheme is effective for several neural network models: it can improve accuracy by up to 27% when the device under consideration is subjected to aggressive power optimization techniques.

12:00  6.5.3  A CONVOLUTIONAL RESULT SHARING APPROACH FOR BINARIZED NEURAL NETWORK INFERENCE
Speaker:
Chia-Chun Lin, National Tsing Hua University, TW
Authors:
Ya-Chun Chang¹, Chia-Chun Lin¹, Yi-Ting Lin¹, Yung-Chih Chen² and Chun-Yao Wang¹
¹National Tsing Hua University, TW; ²Yuan Ze University, TW
Abstract
The binary-weight-binary-input binarized neural network (BNN) allows a much more efficient way to implement convolutional neural networks (CNNs) on mobile platforms. During inference, the multiply-accumulate operations in BNNs can be reduced to XNOR-popcount operations, which therefore dominate most of the computation in BNNs. To reduce the number of operations required in the convolution layers of BNNs, we decompose 3-D filters into 2-D filters and exploit repeated filters, inverse filters, and similar filters to share results. By sharing results, the number of operations in the convolution layers of BNNs can be reduced effectively. Experimental results show that the number of operations can be reduced by about 60% for CIFAR-10 on BNNs while keeping the accuracy loss within 1% of the originally trained network.
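
As a quick illustration of the XNOR-popcount formulation the abstract builds on: with ±1 weights and activations packed into bit words, a dot product of length n reduces to n - 2·popcount(a XOR w). The packing width and helper names below are illustrative; the paper's result-sharing scheme on top of this primitive is not reproduced.

```python
import numpy as np

def pack_bits(signs):
    """Pack a ±1 vector into a bit array (bit value 1 encodes +1)."""
    return np.packbits((signs > 0).astype(np.uint8), bitorder="little")

def binary_dot(a_signs, w_signs):
    """Binary dot product via XNOR-popcount: sum(a*w) = n - 2*popcount(a XOR w)."""
    n = len(a_signs)
    xor = np.bitwise_xor(pack_bits(a_signs), pack_bits(w_signs))
    mismatches = int(np.unpackbits(xor, bitorder="little", count=n).sum())
    return n - 2 * mismatches

a = np.random.choice([-1, 1], size=64)
w = np.random.choice([-1, 1], size=64)
assert binary_dot(a, w) == int(np.dot(a, w))  # matches the full-precision dot product
```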

12:15  6.5.4  PHONEBIT: EFFICIENT GPU-ACCELERATED BINARY NEURAL NETWORK INFERENCE ENGINE FOR MOBILE PHONES
Speaker:
Gang Chen, Sun Yat-sen University, CN
Authors:
Gang Chen¹, Shengyu He², Haitao Meng² and Kai Huang¹
¹Sun Yat-sen University, CN; ²Northeastern University, CN
Abstract
In recent years, deep neural networks (DNNs) have achieved great success in computer vision and other fields. However, performance and power constraints still make it challenging to deploy DNNs on mobile devices due to their high computational complexity. Binary neural networks (BNNs) have been demonstrated as a promising solution to achieve this goal by using bit-wise operations to replace most arithmetic operations. Currently, existing GPU-accelerated implementations of BNNs are tailored only for desktop platforms. Due to architecture differences, merely porting such implementations to mobile devices yields suboptimal performance or is impossible in some cases. In this paper, we propose PhoneBit, a GPU-accelerated BNN inference engine for Android-based mobile devices that fully exploits the computing power of BNNs on mobile GPUs. PhoneBit provides a set of operator-level optimizations, including a locality-friendly data layout, bit packing with vectorization, and layer integration for efficient binary convolution. We also provide detailed implementation and parallelization optimizations for PhoneBit to optimally utilize the memory bandwidth and computing power of mobile GPUs. We evaluate PhoneBit with the binary versions of AlexNet, YOLOv2 Tiny and VGG16. Our experimental results show that PhoneBit achieves significant speedup and energy efficiency compared with state-of-the-art frameworks for mobile devices.
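
A rough sketch of the kind of bit packing such an engine performs before binary convolution: 32 consecutive channel values are reduced to their signs and packed into one uint32 word, so a single XNOR plus popcount then covers 32 multiply-accumulates. The channels-last layout and the padding assumption below are illustrative, not PhoneBit's actual data layout.

```python
import numpy as np

def pack_channels(x):
    """Binarize an (H, W, C) float tensor by sign and pack every 32 channels
    into one uint32 word, yielding an (H, W, C // 32) array of packed words."""
    h, w, c = x.shape
    assert c % 32 == 0, "pad channels to a multiple of 32 before packing"
    bits = (x >= 0).astype(np.uint64).reshape(h, w, c // 32, 32)
    weights = (2 ** np.arange(32)).astype(np.uint64)  # bit position of each channel
    return (bits * weights).sum(axis=-1).astype(np.uint32)

x = np.random.randn(8, 8, 64).astype(np.float32)
packed = pack_channels(x)
print(packed.shape, packed.dtype)  # (8, 8, 2) uint32
```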

12:30  IP3-3, 140  HARDWARE ACCELERATION OF CNN WITH ONE-HOT QUANTIZATION OF WEIGHTS AND ACTIVATIONS
Speaker:
Gang Li, Chinese Academy of Sciences, CN
Authors:
Gang Li, Peisong Wang, Zejian Liu, Cong Leng and Jian Cheng, Chinese Academy of Sciences, CN
Abstract
In this paper, we propose a novel one-hot representation for weights and activations in CNN models and demonstrate its benefits for hardware accelerator design. Specifically, rather than merely reducing the bitwidth, we quantize both weights and activations into n-bit integers that contain only one non-zero bit per value. In this way, the massive number of multiply-accumulate operations (MACs) becomes equivalent to additions of powers of two, which can be calculated efficiently with histogram-based computations. Experiments on the ImageNet classification task show that our proposed One-Hot Networks (OHN) obtain accuracy comparable to conventional fixed-point networks. As case studies, we evaluate the efficacy of the one-hot data representation on two state-of-the-art CNN accelerators on FPGAs; our preliminary results show that resource savings of 50% and 68.5% can be achieved on DaDianNao and Laconic, respectively. Moreover, the one-hot-optimized Laconic can further achieve an average speedup of 4.94x on AlexNet.
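
A small sketch of the arithmetic consequence described above: if every weight is rounded to a signed power of two (a single non-zero bit), each multiplication collapses into a shift of the activation, and a dot product becomes a series of shift-and-adds. The exponent range and rounding rule below are illustrative assumptions; the paper's exact quantizer and its histogram-based evaluation are not reproduced.

```python
import numpy as np

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round a weight to a nearby signed power of two (rounding in the log domain)."""
    if w == 0:
        return 0.0, None
    exp = int(np.clip(np.round(np.log2(abs(w))), min_exp, max_exp))
    return float(np.sign(w)) * 2.0 ** exp, exp

def shift_add_dot(acts, weights):
    """Dot product where each multiply is replaced by a scale-by-2^exp (a shift)."""
    total = 0.0
    for a, w in zip(acts, weights):
        q, exp = quantize_pow2(w)
        if q == 0.0:
            continue
        total += np.sign(q) * np.ldexp(a, exp)  # a * 2^exp, i.e. a shift for integer activations
    return total

acts = np.array([3.0, 1.0, 4.0, 2.0])
weights = np.array([0.26, -0.52, 0.12, 0.95])
print(shift_add_dot(acts, weights))                   # 2.75, using shifts and adds only
print(float(np.dot(acts, [0.25, -0.5, 0.125, 1.0])))  # 2.75, same with pre-rounded weights
```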

12:31  IP3-4, 729  BNNSPLIT: BINARIZED NEURAL NETWORKS FOR EMBEDDED DISTRIBUTED FPGA-BASED COMPUTING SYSTEMS
Speaker:
Luca Stornaiuolo, Politecnico di Milano, IT
Authors:
Giorgia Fiscaletti, Marco Speziali, Luca Stornaiuolo, Marco D. Santambrogio and Donatella Sciuto, Politecnico di Milano, IT
Abstract
In the past few years, Convolutional Neural Networks (CNNs) have seen massive improvement, outperforming other visual recognition algorithms. Since they play an increasingly important role in fields such as face recognition, augmented reality and autonomous driving, there is a growing need for fast and efficient systems to perform their redundant and heavy computations. This trend has led researchers towards heterogeneous systems equipped with hardware accelerators, such as GPUs and FPGAs. The vast majority of CNNs are implemented with floating-point parameters and operations, but research has shown that high classification accuracy can also be obtained by reducing the floating-point activations and weights to binary values. This setting is well suited to FPGAs, which are known to stand out in terms of performance when dealing with binary operations, as demonstrated by Finn, the state-of-the-art framework for building Binarized Neural Network (BNN) accelerators on FPGAs. In this paper, we propose a framework that extends Finn to a distributed scenario, enabling BNN implementation on embedded multi-FPGA systems.

12:32  IP3-5, 147  L2L: A HIGHLY ACCURATE LOG_2_LEAD QUANTIZATION OF PRE-TRAINED NEURAL NETWORKS
Speaker:
Salim Ullah, TU Dresden, DE
Authors:
Salim Ullah¹, Siddharth Gupta², Kapil Ahuja², Aruna Tiwari² and Akash Kumar¹
¹TU Dresden, DE; ²IIT Indore, IN
Abstract
Deep neural networks are among the machine learning techniques increasingly used in a variety of applications. However, their significantly high memory and computation demands often limit their deployment on embedded systems. Many recent works have addressed this problem by proposing different types of data quantization schemes. However, most of these techniques either require post-quantization retraining of the deep neural network or incur a significant loss in output accuracy. In this paper, we propose a novel quantization technique for the parameters of pre-trained deep neural networks. Our technique largely preserves the accuracy of the parameters and does not require retraining of the networks. Compared to a single-precision floating-point implementation, our proposed 8-bit quantization technique incurs only ∼1% and ∼0.4% loss in top-1 and top-5 accuracy, respectively, for the VGG16 network on the ImageNet dataset.
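
The exact L2L format is not spelled out in the abstract, so the sketch below only illustrates one plausible reading of a "log2 lead" scheme: keep the position of the leading one (a log2 exponent) and a few bits immediately after it, which amounts to rounding each pre-trained weight to a low-precision floating-point-like value without any retraining. The function name, bit widths and clipping threshold are assumptions for illustration only.

```python
import numpy as np

def l2l_like_quantize(w, frac_bits=3, min_exp=-8):
    """Hypothetical 'leading one plus a few trailing bits' quantizer:
    keep the exponent of the leading one and frac_bits bits after it."""
    if w == 0:
        return 0.0
    exp = int(np.floor(np.log2(abs(w))))
    if exp < min_exp:
        return 0.0                                  # too small to represent; flush to zero
    mantissa = abs(w) / 2.0 ** exp                  # in [1, 2)
    mantissa = np.round(mantissa * 2 ** frac_bits) / 2 ** frac_bits
    return float(np.sign(w) * mantissa * 2.0 ** exp)

weights = np.array([0.3712, -0.0421, 0.9983, 0.0007])
quantized = np.array([l2l_like_quantize(w) for w in weights])
print(quantized)                    # quantized weights, obtained without retraining
print(np.abs(quantized - weights))  # per-weight quantization error
```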

12:30  End of session