2.5 Pruning Techniques for Embedded Neural Networks


Date: Tuesday 10 March 2020
Time: 11:30 - 13:00
Location / Room: Bayard

Chair:
Marian Verhelst, KU Leuven, BE

Co-Chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE

Network pruning has been applied successfully to reduce the computational and memory footprint of neural network processing. This session presents three innovations that better exploit pruning in embedded processing architectures: an enhanced bit-level pruning technique based on CSD representations that extends the sparsity concept down to the bit level, a novel group-level pruning technique that demonstrates an improved trade-off between hardware execution cost and accuracy loss, and a sparsity-aware cache architecture that reduces cache miss rate and execution time.

Time  Label  Presentation Title / Authors
11:30  2.5.1  DEEPER WEIGHT PRUNING WITHOUT ACCURACY LOSS IN DEEP NEURAL NETWORKS
Speaker:
Byungmin Ahn, Seoul National University, KR
Authors:
Byungmin Ahn and Taewhan Kim, Seoul National University, KR
Abstract
This work overcomes an inherent limitation of bit-level weight pruning: the maximal computation speedup is bounded by the total number of non-zero bits of the weights, and this bound is usually treated as "uncontrollable" (i.e., constant) for the neural network to be pruned. Specifically, based on the canonical signed digit (CSD) encoding, this work (1) proposes a transformation technique that converts the two's complement representation of every weight into a set of CSD representations with the minimal or near-minimal number of essential (i.e., non-zero) bits, (2) formulates the selection of CSD representations that maximizes the parallelism of bit-level multiplication on the weights as a multi-objective shortest-path problem and solves it efficiently with an approximation algorithm, and (3) proposes a supporting acceleration architecture that requires no non-trivial additional hardware. Experiments show that the proposed approach reduces the number of essential bits by 69% on AlexNet and 74% on VGG-16, by which the accelerator reduces inference time by 47% on AlexNet and 50% on VGG-16 over conventional bit-level weight pruning.
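For readers unfamiliar with CSD encoding, the following is a minimal Python sketch of the standard two's-complement-to-CSD conversion that the paper builds on. It only illustrates how CSD reduces the count of essential (non-zero) bits; it does not reproduce the paper's representation-selection, scheduling, or accelerator design.

```python
def to_csd(value):
    """Convert an integer to canonical signed digit (CSD) form.

    Returns digits in {-1, 0, +1}, least-significant first, with no two
    adjacent non-zero digits -- the form with the fewest non-zero
    ("essential") digits for the given value.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # pick +1 or -1 so the next two bits clear
            digits.append(d)
            value -= d
        else:
            digits.append(0)
        value >>= 1
    return digits


def essential_bits(value, nbits=8):
    """Count non-zero bits in the nbits two's complement encoding of value."""
    return bin(value & ((1 << nbits) - 1)).count("1")


w = 59                               # 0b00111011: 5 non-zero bits
csd = to_csd(w)                      # [-1, 0, -1, 0, 0, 0, 1] -> 64 - 4 - 1
print(essential_bits(w), sum(d != 0 for d in csd))   # prints "5 3"
```

For the example weight 59, plain binary needs five non-zero bits while CSD needs only three, which is the kind of reduction the paper exploits at the architecture level.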

12:00  2.5.2  FLEXIBLE GROUP-LEVEL PRUNING OF DEEP NEURAL NETWORKS FOR ON-DEVICE MACHINE LEARNING
Speaker:
Dongkun Shin, Sungkyunkwan University, KR
Authors:
Kwangbae Lee, Hoseung Kim, Hayun Lee and Dongkun Shin, Sungkyunkwan University, KR
Abstract
Network pruning is a promising compression technique to reduce the computation and memory access cost of deep neural networks. Pruning techniques fall into two types: fine-grained and coarse-grained. Fine-grained pruning eliminates individual connections that are insignificant and thus usually produces irregular networks, so it can fail to reduce inference time. Coarse-grained pruning, such as filter-level and channel-level techniques, yields hardware-friendly networks but can suffer from low accuracy. In this paper, we focus on group-level pruning to accelerate deep neural networks on mobile GPUs, where several adjacent weights are pruned as a group to mitigate the irregularity of pruned networks while preserving high accuracy. Although several group-level pruning techniques have been proposed, previous techniques select the weight groups to be pruned only at group-size-aligned locations to reduce the problem space. In this paper, we propose an unaligned approach to improve the accuracy of the compressed model; the optimal solution of the unaligned group selection problem is found with dynamic programming. Our technique also generates balanced sparse networks to achieve load balance across parallel computing units. Experiments demonstrate that 2D unaligned group-level pruning achieves a 3.12% lower error rate for the ResNet-20 network on CIFAR-10 compared to previous 2D aligned group-level pruning at 95% sparsity.
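As a point of reference, here is a minimal NumPy sketch of the aligned group-level pruning baseline that the paper improves on: groups of adjacent weights are ranked by L2 norm and the weakest groups are zeroed until a target sparsity is reached. The paper's actual contribution, unaligned group selection solved with dynamic programming plus load balancing, is not reproduced here; the group size and the 1-D row-wise grouping are illustrative assumptions.

```python
import numpy as np


def aligned_group_prune(weights, group_size=4, sparsity=0.95):
    """Zero out the lowest-L2-norm groups of `group_size` adjacent weights
    per row until roughly `sparsity` of all groups are pruned. Groups start
    at group-size-aligned offsets (the aligned baseline)."""
    rows, cols = weights.shape
    assert cols % group_size == 0, "pad columns to a multiple of group_size"
    groups = weights.copy().reshape(rows, cols // group_size, group_size)
    norms = np.linalg.norm(groups, axis=2).ravel()    # one score per group
    n_prune = int(norms.size * sparsity)
    keep = np.ones(norms.size, dtype=bool)
    keep[np.argsort(norms)[:n_prune]] = False         # drop weakest groups
    groups *= keep.reshape(rows, cols // group_size)[..., None]
    return groups.reshape(rows, cols)


w = np.random.randn(8, 16).astype(np.float32)
pruned = aligned_group_prune(w, group_size=4, sparsity=0.75)
print(float((pruned == 0).mean()))                    # ~0.75 sparsity
```

Because whole groups of adjacent weights are removed, the surviving weights stay contiguous in memory, which is what makes this family of methods friendlier to mobile GPUs than unstructured fine-grained pruning.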

12:30  2.5.3  SPARSITY-AWARE CACHES TO ACCELERATE DEEP NEURAL NETWORKS
Speaker:
Vinod Ganesan, IIT Madras, IN
Authors:
Vinod Ganesan¹, Sanchari Sen², Pratyush Kumar¹, Neel Gala¹, Kamakoti Veezhinatha¹ and Anand Raghunathan²
¹IIT Madras, IN; ²Purdue University, US
Abstract
Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and represent the state of the art in many machine learning tasks. There is considerable interest in using DNNs to realize edge intelligence in highly resource-constrained devices such as wearables and IoT sensors. Unfortunately, the high computational requirements of DNNs pose a serious challenge to their deployment in these systems. Moreover, due to tight cost (and hence area) constraints, these devices often cannot accommodate hardware accelerators, requiring DNNs to execute on the General Purpose Processor (GPP) cores that they contain. We address this challenge through lightweight micro-architectural extensions to the memory hierarchy of GPPs that exploit a key attribute of DNNs, namely sparsity, or the prevalence of zero values. We propose SparseCache, an enhanced cache architecture that uses a null cache based on a Ternary Content Addressable Memory (TCAM) to compactly store zero-valued cache lines, while storing non-zero lines in a conventional data cache. By storing only addresses, rather than values, for zero-valued cache lines, SparseCache increases the effective cache capacity, thereby reducing the overall miss rate and execution time. SparseCache uses a Zero Detector and Approximator (ZDA) and an Address Merger (AM) to perform reads and writes to the null cache. We evaluate SparseCache on four state-of-the-art DNNs programmed with the Caffe framework. SparseCache achieves a 5-28% reduction in miss rate, which translates into a 5-21% reduction in execution time, with only 0.1% area and 3.8% power overhead compared to a low-end Intel Atom Z-series processor.
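To make the idea concrete, below is a behavioral Python sketch of a sparsity-aware cache: all-zero lines are recorded by address only in a small null store (standing in for the TCAM), so the conventional data cache is left free for non-zero lines. The class name, sizes, replacement policy, and interface are assumptions for illustration, not the SparseCache micro-architecture.

```python
class SparseCacheSketch:
    """Behavioral model: a zero-valued line costs only an address entry in
    the null cache, so the data cache can hold more distinct non-zero lines."""

    def __init__(self, data_lines=256, null_entries=64, line_bytes=64):
        self.line_bytes = line_bytes
        self.data_lines = data_lines
        self.null_entries = null_entries
        self.data = {}       # line tag -> line contents (conventional cache)
        self.nulls = set()   # tags of all-zero lines (TCAM stand-in)
        self.hits = self.misses = 0

    def access(self, addr, fetch_line):
        """Look up one address; `fetch_line(tag)` models a memory read on a miss."""
        tag = addr // self.line_bytes
        if tag in self.nulls or tag in self.data:
            self.hits += 1
            return
        self.misses += 1
        line = fetch_line(tag)
        if not any(line):                               # zero detection (ZDA role)
            if len(self.nulls) >= self.null_entries:
                self.nulls.pop()                        # naive replacement
            self.nulls.add(tag)                         # store the address only
        else:
            if len(self.data) >= self.data_lines:
                self.data.pop(next(iter(self.data)))    # naive FIFO-like eviction
            self.data[tag] = line


cache = SparseCacheSketch()
zeros = bytes(64)
cache.access(0x1000, lambda tag: zeros)   # miss, recorded in the null cache
cache.access(0x1000, lambda tag: zeros)   # hit by address match alone
print(cache.hits, cache.misses)           # prints "1 1"
```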

13:00  IP1-7, 429  TFAPPROX: TOWARDS A FAST EMULATION OF DNN APPROXIMATE HARDWARE ACCELERATORS ON GPU
Speaker:
Zdenek Vasicek, Brno University of Technology, CZ
Authors:
Filip Vaverka, Vojtech Mrazek, Zdenek Vasicek and Lukas Sekanina, Brno University of Technology, CZ
Abstract
Energy efficiency of hardware accelerators for deep neural networks (DNNs) can be improved by introducing approximate arithmetic circuits. To quantify the error introduced by these circuits and avoid expensive hardware prototyping, a software emulator of the DNN accelerator is usually executed on a CPU or GPU. However, this emulation is typically two or three orders of magnitude slower than a software DNN implementation running on a CPU or GPU with standard floating-point arithmetic instructions and common DNN libraries. The reason is that common CPUs and GPUs have no hardware support for approximate arithmetic operations, so these operations must be expensively emulated. To address this issue, we propose an efficient GPU emulation method for the approximate circuits used in a given DNN accelerator. All relevant approximate circuits are implemented as look-up tables and accessed through the texture memory mechanism of CUDA-capable GPUs. We exploit the fact that texture memory is optimized for irregular read-only access and, in some GPU architectures, is even implemented as a dedicated cache. This technique allowed us to reduce the inference time of the emulated DNN accelerator by approximately 200x with respect to an optimized CPU version on complex DNNs such as ResNet. The proposed approach extends the TensorFlow library and is available online at https://github.com/ehw-fit/tf-approximate.
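The core trick, replacing every multiplication with a pre-tabulated approximate product, can be sketched in a few lines of Python/NumPy; the table plays the role that the texture-memory LUT plays on the GPU. The toy truncated multiplier, function names, and array shapes below are assumptions for illustration, not the circuits or CUDA kernels used by TFApprox.

```python
import numpy as np


def truncated_mult(x, y, dropped_bits=4):
    """Toy approximate 8-bit multiplier: exact product with low bits zeroed."""
    return ((x * y) >> dropped_bits) << dropped_bits


def build_lut(mult, width=8):
    """Tabulate the multiplier for all operand pairs (256 x 256 for 8 bits)."""
    ops = np.arange(1 << width)
    return np.array([[mult(int(a), int(b)) for b in ops] for a in ops],
                    dtype=np.int64)


def approx_dot(acts, weights, lut):
    """Dot product with every multiply replaced by a table lookup, the same
    substitution TFApprox performs via GPU texture fetches."""
    return int(lut[acts, weights].sum())


lut = build_lut(truncated_mult)
acts = np.random.randint(0, 256, size=128)
wts = np.random.randint(0, 256, size=128)
print(approx_dot(acts, wts, lut), int(np.dot(acts, wts)))   # approximate vs. exact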

13:00  End of session