2.5 Pruning Techniques for Embedded Neural Networks


Date: Tuesday 10 March 2020
Time: 11:30 - 13:00
Location / Room: Bayard

Chair:
Marian Verhelst, KU Leuven, BE

Co-Chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE

Network pruning has been applied successfully to reduce the computational and memory footprint of neural network processing. This session presents three innovations that better exploit pruning in embedded processing architectures: an enhanced bit-level pruning technique based on CSD representations that extends the sparsity concept down to the bit level, a novel group-level pruning technique that demonstrates an improved trade-off between hardware execution cost and accuracy loss, and a sparsity-aware cache architecture that reduces cache miss rate and execution time.

Time  Label  Presentation Title / Authors
11:30  2.5.1  DEEPER WEIGHT PRUNING WITHOUT ACCURACY LOSS IN DEEP NEURAL NETWORKS
Speaker:
Byungmin Ahn, Seoul National University, KR
Authors:
Byungmin Ahn and Taewhan Kim, Seoul National University, KR
Abstract
This work overcomes an inherent limitation of bit-level weight pruning: the maximal computation speedup is bounded by the total number of non-zero bits of the weights, and this bound is usually treated as "uncontrollable" (i.e., constant) for the neural network to be pruned. Specifically, based on the canonical signed digit (CSD) encoding, this work (1) proposes a transformation technique that converts the two's complement representation of every weight into a set of CSD representations with the minimal or near-minimal number of essential (i.e., non-zero) bits, (2) formulates the selection of CSD representations that maximizes the parallelism of bit-level multiplication on the weights as a multi-objective shortest-path problem and solves it efficiently with an approximation algorithm, and (3) proposes a supporting acceleration architecture that requires no non-trivial additional hardware. Experiments show that the proposed approach reduces the number of essential bits by 69% on AlexNet and 74% on VGG-16, by which the accelerator reduces inference time by 47% on AlexNet and 50% on VGG-16 over conventional bit-level weight pruning.
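For readers unfamiliar with CSD encoding, the following is a minimal Python sketch of the standard two's-complement-to-CSD conversion that the paper builds on. It only illustrates how CSD reduces the count of essential (non-zero) bits; it does not reproduce the paper's representation-selection, scheduling, or accelerator design.

```python
def to_csd(value):
    """Convert an integer to canonical signed digit (CSD) form.

    Returns digits in {-1, 0, +1}, least-significant first, with no two
    adjacent non-zero digits -- the form with the fewest non-zero
    ("essential") digits for the given value.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # pick +1 or -1 so the next two bits clear
            digits.append(d)
            value -= d
        else:
            digits.append(0)
        value >>= 1
    return digits


def essential_bits(value, nbits=8):
    """Count non-zero bits in the nbits two's complement encoding of value."""
    return bin(value & ((1 << nbits) - 1)).count("1")


w = 59                               # 0b00111011: 5 non-zero bits
csd = to_csd(w)                      # [-1, 0, -1, 0, 0, 0, 1] -> 64 - 4 - 1
print(essential_bits(w), sum(d != 0 for d in csd))   # prints "5 3"
```

For the example weight 59, plain binary needs five non-zero bits while CSD needs only three, which is the kind of reduction the paper exploits at the architecture level.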

12:00  2.5.2  FLEXIBLE GROUP-LEVEL PRUNING OF DEEP NEURAL NETWORKS FOR ON-DEVICE MACHINE LEARNING
Speaker:
Dongkun Shin, Sungkyunkwan University, KR
Authors:
Kwangbae Lee, Hoseung Kim, Hayun Lee and Dongkun Shin, Sungkyunkwan University, KR
Abstract
Network pruning is a promising compression technique to reduce the computation and memory access cost of deep neural networks. Pruning techniques fall into two types: fine-grained and coarse-grained. Fine-grained pruning eliminates individual connections that are insignificant and thus usually produces irregular networks, so it can fail to reduce inference time. Coarse-grained pruning, such as filter-level and channel-level techniques, yields hardware-friendly networks but can suffer from low accuracy. In this paper, we focus on group-level pruning to accelerate deep neural networks on mobile GPUs, where several adjacent weights are pruned as a group to mitigate the irregularity of pruned networks while preserving high accuracy. Although several group-level pruning techniques have been proposed, previous techniques select the weight groups to be pruned only at group-size-aligned locations to reduce the problem space. In this paper, we propose an unaligned approach to improve the accuracy of the compressed model; the optimal solution of the unaligned group selection problem is found with dynamic programming. Our technique also generates balanced sparse networks to achieve load balance across parallel computing units. Experiments demonstrate that 2D unaligned group-level pruning achieves a 3.12% lower error rate for the ResNet-20 network on CIFAR-10 compared to previous 2D aligned group-level pruning at 95% sparsity.
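As a point of reference, here is a minimal NumPy sketch of the aligned group-level pruning baseline that the paper improves on: groups of adjacent weights are ranked by L2 norm and the weakest groups are zeroed until a target sparsity is reached. The paper's actual contribution, unaligned group selection solved with dynamic programming plus load balancing, is not reproduced here; the group size and the 1-D row-wise grouping are illustrative assumptions.

```python
import numpy as np


def aligned_group_prune(weights, group_size=4, sparsity=0.95):
    """Zero out the lowest-L2-norm groups of `group_size` adjacent weights
    per row until roughly `sparsity` of all groups are pruned. Groups start
    at group-size-aligned offsets (the aligned baseline)."""
    rows, cols = weights.shape
    assert cols % group_size == 0, "pad columns to a multiple of group_size"
    groups = weights.copy().reshape(rows, cols // group_size, group_size)
    norms = np.linalg.norm(groups, axis=2).ravel()    # one score per group
    n_prune = int(norms.size * sparsity)
    keep = np.ones(norms.size, dtype=bool)
    keep[np.argsort(norms)[:n_prune]] = False         # drop weakest groups
    groups *= keep.reshape(rows, cols // group_size)[..., None]
    return groups.reshape(rows, cols)


w = np.random.randn(8, 16).astype(np.float32)
pruned = aligned_group_prune(w, group_size=4, sparsity=0.75)
print(float((pruned == 0).mean()))                    # ~0.75 sparsity
```

Because whole groups of adjacent weights are removed, the surviving weights stay contiguous in memory, which is what makes this family of methods friendlier to mobile GPUs than unstructured fine-grained pruning.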

12:30  2.5.3  SPARSITY-AWARE CACHES TO ACCELERATE DEEP NEURAL NETWORKS
Speaker:
Vinod Ganesan, IIT Madras, IN
Authors:
Vinod Ganesan¹, Sanchari Sen², Pratyush Kumar¹, Neel Gala¹, Kamakoti Veezhinatha¹ and Anand Raghunathan²
¹IIT Madras, IN; ²Purdue University, US
Abstract
Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and represent the state of the art in many machine learning tasks. There is considerable interest in using DNNs to realize edge intelligence in highly resource-constrained devices such as wearables and IoT sensors. Unfortunately, the high computational requirements of DNNs pose a serious challenge to their deployment in these systems. Moreover, due to tight cost (and hence area) constraints, these devices often cannot accommodate hardware accelerators, requiring DNNs to execute on the General Purpose Processor (GPP) cores that they contain. We address this challenge through lightweight micro-architectural extensions to the memory hierarchy of GPPs that exploit a key attribute of DNNs, namely sparsity, or the prevalence of zero values. We propose SparseCache, an enhanced cache architecture that uses a null cache based on a Ternary Content Addressable Memory (TCAM) to compactly store zero-valued cache lines, while storing non-zero lines in a conventional data cache. By storing only addresses, rather than values, for zero-valued cache lines, SparseCache increases the effective cache capacity, thereby reducing the overall miss rate and execution time. SparseCache uses a Zero Detector and Approximator (ZDA) and an Address Merger (AM) to perform reads and writes to the null cache. We evaluate SparseCache on four state-of-the-art DNNs programmed with the Caffe framework. SparseCache achieves a 5-28% reduction in miss rate, which translates into a 5-21% reduction in execution time, with only 0.1% area and 3.8% power overhead compared to a low-end Intel Atom Z-series processor.
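To make the idea concrete, below is a behavioral Python sketch of a sparsity-aware cache: all-zero lines are recorded by address only in a small null store (standing in for the TCAM), so the conventional data cache is left free for non-zero lines. The class name, sizes, replacement policy, and interface are assumptions for illustration, not the SparseCache micro-architecture.

```python
class SparseCacheSketch:
    """Behavioral model: a zero-valued line costs only an address entry in
    the null cache, so the data cache can hold more distinct non-zero lines."""

    def __init__(self, data_lines=256, null_entries=64, line_bytes=64):
        self.line_bytes = line_bytes
        self.data_lines = data_lines
        self.null_entries = null_entries
        self.data = {}       # line tag -> line contents (conventional cache)
        self.nulls = set()   # tags of all-zero lines (TCAM stand-in)
        self.hits = self.misses = 0

    def access(self, addr, fetch_line):
        """Look up one address; `fetch_line(tag)` models a memory read on a miss."""
        tag = addr // self.line_bytes
        if tag in self.nulls or tag in self.data:
            self.hits += 1
            return
        self.misses += 1
        line = fetch_line(tag)
        if not any(line):                               # zero detection (ZDA role)
            if len(self.nulls) >= self.null_entries:
                self.nulls.pop()                        # naive replacement
            self.nulls.add(tag)                         # store the address only
        else:
            if len(self.data) >= self.data_lines:
                self.data.pop(next(iter(self.data)))    # naive FIFO-like eviction
            self.data[tag] = line


cache = SparseCacheSketch()
zeros = bytes(64)
cache.access(0x1000, lambda tag: zeros)   # miss, recorded in the null cache
cache.access(0x1000, lambda tag: zeros)   # hit by address match alone
print(cache.hits, cache.misses)           # prints "1 1"
```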

13:00  IP1-7, 429  TFAPPROX: TOWARDS A FAST EMULATION OF DNN APPROXIMATE HARDWARE ACCELERATORS ON GPU
Speaker:
Zdenek Vasicek, Brno University of Technology, CZ
Authors:
Filip Vaverka, Vojtech Mrazek, Zdenek Vasicek and Lukas Sekanina, Brno University of Technology, CZ
Abstract
Energy efficiency of hardware accelerators for deep neural networks (DNNs) can be improved by introducing approximate arithmetic circuits. To quantify the error introduced by these circuits and avoid expensive hardware prototyping, a software emulator of the DNN accelerator is usually executed on a CPU or GPU. However, this emulation is typically two or three orders of magnitude slower than a software DNN implementation running on a CPU or GPU with standard floating-point arithmetic instructions and common DNN libraries. The reason is that common CPUs and GPUs have no hardware support for approximate arithmetic operations, so these operations must be expensively emulated. To address this issue, we propose an efficient GPU emulation method for the approximate circuits used in a given DNN accelerator. All relevant approximate circuits are implemented as look-up tables and accessed through the texture memory mechanism of CUDA-capable GPUs. We exploit the fact that texture memory is optimized for irregular read-only access and, in some GPU architectures, is even implemented as a dedicated cache. This technique allowed us to reduce the inference time of the emulated DNN accelerator by approximately 200x with respect to an optimized CPU version on complex DNNs such as ResNet. The proposed approach extends the TensorFlow library and is available online at https://github.com/ehw-fit/tf-approximate.
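The core trick, replacing every multiplication with a pre-tabulated approximate product, can be sketched in a few lines of Python/NumPy; the table plays the role that the texture-memory LUT plays on the GPU. The toy truncated multiplier, function names, and array shapes below are assumptions for illustration, not the circuits or CUDA kernels used by TFApprox.

```python
import numpy as np


def truncated_mult(x, y, dropped_bits=4):
    """Toy approximate 8-bit multiplier: exact product with low bits zeroed."""
    return ((x * y) >> dropped_bits) << dropped_bits


def build_lut(mult, width=8):
    """Tabulate the multiplier for all operand pairs (256 x 256 for 8 bits)."""
    ops = np.arange(1 << width)
    return np.array([[mult(int(a), int(b)) for b in ops] for a in ops],
                    dtype=np.int64)


def approx_dot(acts, weights, lut):
    """Dot product with every multiply replaced by a table lookup, the same
    substitution TFApprox performs via GPU texture fetches."""
    return int(lut[acts, weights].sum())


lut = build_lut(truncated_mult)
acts = np.random.randint(0, 256, size=128)
wts = np.random.randint(0, 256, size=128)
print(approx_dot(acts, wts, lut), int(np.dot(acts, wts)))   # approximate vs. exact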

13:00  End of session