6.5 Efficient Data Representations in Neural Networks


Date: Wednesday 11 March 2020
Time: 11:00 - 12:30
Location / Room: Bayard

Chair:
Brandon Reagen, Facebook and New York University, US

Co-Chair:
Sebastian Steinhorst, TU Munich, DE

The large processing requirements of ML models strain the capabilities of low-power embedded systems. Addressing this challenge, the first presentation proposes a robust co-design that leverages stochastic computing for highly accurate and efficient inference. Next, a structural optimization is proposed to counter faults at low voltage levels. Then, the authors present a method for sharing results in binarized CNNs to reduce computation. The session concludes with a talk on implementing binary networks on mobile GPUs.

Time  Label  Presentation Title / Authors
11:00  6.5.1  ACOUSTIC: ACCELERATING CONVOLUTIONAL NEURAL NETWORKS THROUGH OR-UNIPOLAR SKIPPED STOCHASTIC COMPUTING
Speaker:
Puneet Gupta, University of California, Los Angeles, US
Authors:
Wojciech Romaszkan, Tianmu Li, Tristan Melton, Sudhakar Pamarti and Puneet Gupta, University of California, Los Angeles, US
Abstract
As privacy and latency requirements force a move towards edge Machine Learning inference, resource-constrained devices are struggling to cope with large and computationally complex models. For Convolutional Neural Networks, those limitations can be overcome by taking advantage of enormous data-reuse opportunities and amenability to reduced precision. To do that, however, a level of compute density unattainable for conventional binary arithmetic is required. Stochastic Computing can deliver such density, but it has not lived up to its full potential because of multiple underlying precision issues. We present ACOUSTIC: Accelerating Convolutions through Or-Unipolar Skipped sTochastIc Computing, an accelerator framework that enables fully stochastic, high-density CNN inference. Leveraging a split-unipolar representation, OR-based accumulation and a novel computation-skipping approach, ACOUSTIC delivers server-class parallelism within a mobile area and power budget: a 12 mm² accelerator can be as much as 38.7x more energy efficient and 72.5x faster than conventional fixed-point accelerators. It can also be up to 79.6x more energy efficient than state-of-the-art stochastic accelerators. At the lower end, ACOUSTIC achieves an 8x-120x inference throughput improvement with similar energy and area when compared to recent mixed-signal/neuromorphic accelerators.
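
A minimal NumPy sketch of the unipolar stochastic-computing primitives the abstract refers to: values in [0, 1] are encoded as random bitstreams, multiplication becomes a bitwise AND, and OR-based accumulation approximates a saturating sum. The stream length, helper names and toy values below are illustrative assumptions, not details of the ACOUSTIC design.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096  # bitstream length (illustrative; longer streams reduce variance)

def to_stream(p, n=N):
    """Encode a probability p in [0, 1] as a unipolar stochastic bitstream."""
    return rng.random(n) < p

def from_stream(s):
    """Decode a bitstream back to a value by counting ones."""
    return s.mean()

a, b = 0.6, 0.3
# Unipolar multiplication: AND of two independent streams estimates a * b.
prod = from_stream(to_stream(a) & to_stream(b))
# OR-based accumulation: OR of streams estimates 1 - (1 - a)(1 - b),
# a saturating approximation of a + b that avoids costly adders.
acc = from_stream(to_stream(a) | to_stream(b))

print(f"a*b ~ {prod:.3f} (exact {a * b:.3f})")
print(f"a|b ~ {acc:.3f} (exact {1 - (1 - a) * (1 - b):.3f})")
```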

11:30  6.5.2  ACCURACY TOLERANT NEURAL NETWORKS UNDER AGGRESSIVE POWER OPTIMIZATION
Speaker:
Yi-Wen Hung, National Tsing Hua University, TW
Authors:
Xiang-Xiu Wu¹, Yi-Wen Hung¹, Yung-Chih Chen² and Shih-Chieh Chang¹
¹National Tsing Hua University, TW; ²Yuan Ze University, Taoyuan, Taiwan, TW
Abstract
With the success of deep learning, many neural network models have been proposed and applied to various applications. In several applications, the devices used to implement these complicated models have limited power resources, so aggressive optimization techniques are often applied to save power. However, some optimization techniques, such as voltage scaling and multiple threshold voltages, may increase the probability of errors due to slow signal propagation, which increases the path delay in a circuit and causes some input patterns to fail. Although neural network models are considered to have some error tolerance, the prediction accuracy can be significantly affected when there are a large number of errors. Thus, in this paper, we propose a scheme to mitigate the errors caused by slow signal propagation. Since the delay of the multipliers dominates the critical path of the circuit, we identify the patterns most affected by slow signal propagation through the multipliers and prevent those patterns from failing by adjusting the neural network and its parameters. The proposed scheme modifies the neural network on the software side, so it is unnecessary to re-design the hardware structure. The experimental results show that the proposed scheme is effective for several neural network models: it can improve accuracy by up to 27% when the device under consideration is subjected to aggressive power optimization techniques.

12:00  6.5.3  A CONVOLUTIONAL RESULT SHARING APPROACH FOR BINARIZED NEURAL NETWORK INFERENCE
Speaker:
Chia-Chun Lin, National Tsing Hua University, TW
Authors:
Ya-Chun Chang¹, Chia-Chun Lin¹, Yi-Ting Lin¹, Yung-Chih Chen² and Chun-Yao Wang¹
¹National Tsing Hua University, TW; ²Yuan Ze University, TW
Abstract
The binary-weight-binary-input binarized neural network (BNN) allows a much more efficient way to implement convolutional neural networks (CNNs) on mobile platforms. During inference, the multiply-accumulate operations in BNNs can be reduced to XNOR-popcount operations, which therefore dominate most of the computation in BNNs. To reduce the number of operations required in the convolution layers of BNNs, we decompose 3-D filters into 2-D filters and exploit repeated filters, inverse filters, and similar filters to share results. By sharing results, the number of operations in the convolution layers of BNNs can be reduced effectively. Experimental results show that the number of operations can be reduced by about 60% for CIFAR-10 on BNNs while keeping the accuracy loss within 1% of the originally trained network.
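
As a quick illustration of the XNOR-popcount formulation the abstract builds on: with ±1 weights and activations packed into bit words, a dot product of length n reduces to n - 2·popcount(a XOR w). The packing width and helper names below are illustrative; the paper's result-sharing scheme on top of this primitive is not reproduced.

```python
import numpy as np

def pack_bits(signs):
    """Pack a ±1 vector into a bit array (bit value 1 encodes +1)."""
    return np.packbits((signs > 0).astype(np.uint8), bitorder="little")

def binary_dot(a_signs, w_signs):
    """Binary dot product via XNOR-popcount: sum(a*w) = n - 2*popcount(a XOR w)."""
    n = len(a_signs)
    xor = np.bitwise_xor(pack_bits(a_signs), pack_bits(w_signs))
    mismatches = int(np.unpackbits(xor, bitorder="little", count=n).sum())
    return n - 2 * mismatches

a = np.random.choice([-1, 1], size=64)
w = np.random.choice([-1, 1], size=64)
assert binary_dot(a, w) == int(np.dot(a, w))  # matches the full-precision dot product
```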

12:15  6.5.4  PHONEBIT: EFFICIENT GPU-ACCELERATED BINARY NEURAL NETWORK INFERENCE ENGINE FOR MOBILE PHONES
Speaker:
Gang Chen, Sun Yat-sen University, CN
Authors:
Gang Chen¹, Shengyu He², Haitao Meng² and Kai Huang¹
¹Sun Yat-sen University, CN; ²Northeastern University, CN
Abstract
In recent years, deep neural networks (DNNs) have achieved great success in computer vision and other fields. However, performance and power constraints still make it challenging to deploy DNNs on mobile devices due to their high computational complexity. Binary neural networks (BNNs) have been demonstrated as a promising solution to achieve this goal by using bit-wise operations to replace most arithmetic operations. Currently, existing GPU-accelerated implementations of BNNs are tailored only for desktop platforms. Due to architecture differences, merely porting such implementations to mobile devices yields suboptimal performance or is impossible in some cases. In this paper, we propose PhoneBit, a GPU-accelerated BNN inference engine for Android-based mobile devices that fully exploits the computing power of BNNs on mobile GPUs. PhoneBit provides a set of operator-level optimizations, including a locality-friendly data layout, bit packing with vectorization, and layer integration for efficient binary convolution. We also provide detailed implementation and parallelization optimizations for PhoneBit to optimally utilize the memory bandwidth and computing power of mobile GPUs. We evaluate PhoneBit with the binary versions of AlexNet, YOLOv2 Tiny and VGG16. Our experimental results show that PhoneBit achieves significant speedup and energy efficiency compared with state-of-the-art frameworks for mobile devices.
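
A rough sketch of the kind of bit packing such an engine performs before binary convolution: 32 consecutive channel values are reduced to their signs and packed into one uint32 word, so a single XNOR plus popcount then covers 32 multiply-accumulates. The channels-last layout and the padding assumption below are illustrative, not PhoneBit's actual data layout.

```python
import numpy as np

def pack_channels(x):
    """Binarize an (H, W, C) float tensor by sign and pack every 32 channels
    into one uint32 word, yielding an (H, W, C // 32) array of packed words."""
    h, w, c = x.shape
    assert c % 32 == 0, "pad channels to a multiple of 32 before packing"
    bits = (x >= 0).astype(np.uint64).reshape(h, w, c // 32, 32)
    weights = (2 ** np.arange(32)).astype(np.uint64)  # bit position of each channel
    return (bits * weights).sum(axis=-1).astype(np.uint32)

x = np.random.randn(8, 8, 64).astype(np.float32)
packed = pack_channels(x)
print(packed.shape, packed.dtype)  # (8, 8, 2) uint32
```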

12:30  IP3-3, 140  HARDWARE ACCELERATION OF CNN WITH ONE-HOT QUANTIZATION OF WEIGHTS AND ACTIVATIONS
Speaker:
Gang Li, Chinese Academy of Sciences, CN
Authors:
Gang Li, Peisong Wang, Zejian Liu, Cong Leng and Jian Cheng, Chinese Academy of Sciences, CN
Abstract
In this paper, we propose a novel one-hot representation for weights and activations in CNN models and demonstrate its benefits for hardware accelerator design. Specifically, rather than merely reducing the bitwidth, we quantize both weights and activations into n-bit integers that contain only one non-zero bit per value. In this way, the massive number of multiply-accumulate operations (MACs) becomes equivalent to additions of powers of two, which can be calculated efficiently with histogram-based computations. Experiments on the ImageNet classification task show that our proposed One-Hot Networks (OHN) obtain accuracy comparable to conventional fixed-point networks. As case studies, we evaluate the efficacy of the one-hot data representation on two state-of-the-art CNN accelerators on FPGAs; our preliminary results show that resource savings of 50% and 68.5% can be achieved on DaDianNao and Laconic, respectively. Moreover, the one-hot-optimized Laconic can further achieve an average speedup of 4.94x on AlexNet.
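
A small sketch of the arithmetic consequence described above: if every weight is rounded to a signed power of two (a single non-zero bit), each multiplication collapses into a shift of the activation, and a dot product becomes a series of shift-and-adds. The exponent range and rounding rule below are illustrative assumptions; the paper's exact quantizer and its histogram-based evaluation are not reproduced.

```python
import numpy as np

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round a weight to a nearby signed power of two (rounding in the log domain)."""
    if w == 0:
        return 0.0, None
    exp = int(np.clip(np.round(np.log2(abs(w))), min_exp, max_exp))
    return float(np.sign(w)) * 2.0 ** exp, exp

def shift_add_dot(acts, weights):
    """Dot product where each multiply is replaced by a scale-by-2^exp (a shift)."""
    total = 0.0
    for a, w in zip(acts, weights):
        q, exp = quantize_pow2(w)
        if q == 0.0:
            continue
        total += np.sign(q) * np.ldexp(a, exp)  # a * 2^exp, i.e. a shift for integer activations
    return total

acts = np.array([3.0, 1.0, 4.0, 2.0])
weights = np.array([0.26, -0.52, 0.12, 0.95])
print(shift_add_dot(acts, weights))                   # 2.75, using shifts and adds only
print(float(np.dot(acts, [0.25, -0.5, 0.125, 1.0])))  # 2.75, same with pre-rounded weights
```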

12:31  IP3-4, 729  BNNSPLIT: BINARIZED NEURAL NETWORKS FOR EMBEDDED DISTRIBUTED FPGA-BASED COMPUTING SYSTEMS
Speaker:
Luca Stornaiuolo, Politecnico di Milano, IT
Authors:
Giorgia Fiscaletti, Marco Speziali, Luca Stornaiuolo, Marco D. Santambrogio and Donatella Sciuto, Politecnico di Milano, IT
Abstract
In the past few years, Convolutional Neural Networks (CNNs) have seen massive improvement, outperforming other visual recognition algorithms. Since they play an increasingly important role in fields such as face recognition, augmented reality and autonomous driving, there is a growing need for fast and efficient systems to perform their redundant and heavy computations. This trend has led researchers towards heterogeneous systems equipped with hardware accelerators, such as GPUs and FPGAs. The vast majority of CNNs are implemented with floating-point parameters and operations, but research has shown that high classification accuracy can also be obtained by reducing the floating-point activations and weights to binary values. This setting is well suited to FPGAs, which are known to stand out in terms of performance when dealing with binary operations, as demonstrated by Finn, the state-of-the-art framework for building Binarized Neural Network (BNN) accelerators on FPGAs. In this paper, we propose a framework that extends Finn to a distributed scenario, enabling BNN implementation on embedded multi-FPGA systems.

12:32  IP3-5, 147  L2L: A HIGHLY ACCURATE LOG_2_LEAD QUANTIZATION OF PRE-TRAINED NEURAL NETWORKS
Speaker:
Salim Ullah, TU Dresden, DE
Authors:
Salim Ullah¹, Siddharth Gupta², Kapil Ahuja², Aruna Tiwari² and Akash Kumar¹
¹TU Dresden, DE; ²IIT Indore, IN
Abstract
Deep neural networks are among the machine learning techniques increasingly used in a variety of applications. However, their significantly high memory and computation demands often limit their deployment on embedded systems. Many recent works have addressed this problem by proposing different types of data quantization schemes. However, most of these techniques either require post-quantization retraining of the deep neural network or incur a significant loss in output accuracy. In this paper, we propose a novel quantization technique for the parameters of pre-trained deep neural networks. Our technique largely preserves the accuracy of the parameters and does not require retraining of the networks. Compared to a single-precision floating-point implementation, our proposed 8-bit quantization technique incurs only ∼1% and ∼0.4% loss in top-1 and top-5 accuracy, respectively, for the VGG16 network on the ImageNet dataset.
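
The exact L2L format is not spelled out in the abstract, so the sketch below only illustrates one plausible reading of a "log2 lead" scheme: keep the position of the leading one (a log2 exponent) and a few bits immediately after it, which amounts to rounding each pre-trained weight to a low-precision floating-point-like value without any retraining. The function name, bit widths and clipping threshold are assumptions for illustration only.

```python
import numpy as np

def l2l_like_quantize(w, frac_bits=3, min_exp=-8):
    """Hypothetical 'leading one plus a few trailing bits' quantizer:
    keep the exponent of the leading one and frac_bits bits after it."""
    if w == 0:
        return 0.0
    exp = int(np.floor(np.log2(abs(w))))
    if exp < min_exp:
        return 0.0                                  # too small to represent; flush to zero
    mantissa = abs(w) / 2.0 ** exp                  # in [1, 2)
    mantissa = np.round(mantissa * 2 ** frac_bits) / 2 ** frac_bits
    return float(np.sign(w) * mantissa * 2.0 ** exp)

weights = np.array([0.3712, -0.0421, 0.9983, 0.0007])
quantized = np.array([l2l_like_quantize(w) for w in weights])
print(quantized)                    # quantized weights, obtained without retraining
print(np.abs(quantized - weights))  # per-weight quantization error
```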

12:30  End of session