11.3 Exploiting Heterogeneity for Big Data Computing


Date: Thursday 30 March 2017
Time: 14:00 - 15:30
Location / Room: 2BC

Chair:
Georgios Keramidas, Think Silicon S.A./Technological Educational Institute of Western Greece, GR

Co-Chair:
Houman Homayoun, George Mason University, US

This session introduces new approaches for building reconfigurable accelerators and heterogeneous architectures that integrate big-little cores, FPGA hardware, and coarse-grained reconfigurable array architectures, targeting emerging applications such as the Hadoop MapReduce framework and neural networks.

Time | Label | Presentation Title / Authors
14:00 | 11.3.1 | A NOVEL ZERO WEIGHT/ACTIVATION-AWARE HARDWARE ARCHITECTURE OF CONVOLUTIONAL NEURAL NETWORK
Speaker:
Dongyoung Kim, Seoul National University, KR
Authors:
Dongyoung Kim, Junwhan Ahn and Sungjoo Yoo, Seoul National University, KR
Abstract
It is imperative to accelerate convolutional neural networks (CNNs) due to their ever-widening application areas, from servers and mobile devices to IoT devices. Based on the fact that CNNs exhibit a significant number of zero values in both kernel weights (under quality-preserving pruning) and activations (when rectified linear units are applied), we propose a novel hardware accelerator architecture for CNNs that exploits zero values in both weights and activations. We also report a zero-induced load imbalance problem encountered in the zero-aware parallel architecture and present a zero-aware kernel allocation to address it. In our experiments, we developed a cycle-accurate model as well as RTL and layout designs of the proposed architecture. In our evaluations with two real deep CNNs, pruned AlexNet and VGG, the proposed architecture offers 4x/1.8x (AlexNet [1]) and 5.2x/2.1x (VGG-16 [2]) speedups compared with state-of-the-art zero-agnostic/zero-activation-aware architectures.
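As a rough software illustration of the zero-skipping idea (a sketch of the general technique, not the authors' accelerator; all names are illustrative), the following Python snippet skips multiply-accumulate work whenever either the weight or the activation is zero:

    import numpy as np

    def zero_aware_conv1d(activations, weights):
        """1-D convolution that skips MACs where either operand is zero.

        Illustrative software analogue of zero weight/activation skipping;
        a real accelerator would do this with hardware indexing, not loops.
        """
        out_len = len(activations) - len(weights) + 1
        out = np.zeros(out_len)
        # Precompute indices of nonzero weights (pruned kernels are sparse).
        nz_w = [(k, w) for k, w in enumerate(weights) if w != 0.0]
        for i in range(out_len):
            acc = 0.0
            for k, w in nz_w:                 # skip zero weights
                a = activations[i + k]
                if a != 0.0:                  # skip zero activations (post-ReLU)
                    acc += a * w
            out[i] = acc
        return out

    # Example: ReLU outputs and a pruned kernel are both sparse.
    acts = np.maximum(np.random.randn(16), 0.0)   # roughly half zeros after ReLU
    kern = np.array([0.5, 0.0, -0.25, 0.0])       # pruned weights
    print(zero_aware_conv1d(acts, kern))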

14:30 | 11.3.2 | A MECHANISM FOR ENERGY-EFFICIENT REUSE OF DECODING AND SCHEDULING OF X86 INSTRUCTION STREAMS
Speaker:
Antonio Carlos S. Beck, Universidade Federal do Rio Grande do Sul, BR
Authors:
Marcelo Brandalero and Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul, BR
Abstract
Current superscalar x86 processors decompose each CISC instruction (variable-length and with multiple addressing modes) into multiple RISC-like µops at runtime so they can be pipelined and scheduled for concurrent execution. This challenging and power-hungry process, however, is usually repeated several times on the same instruction sequence, inefficiently producing the very same decoded and scheduled µops. We therefore propose a transparent mechanism that saves the decoding and scheduling transformation for later reuse, so that the next time the same instruction sequence is encountered, it can automatically bypass the costly pipeline stages involved. We use a coarse-grained reconfigurable array as the means to save this transformation, since its structure enables the recovery of µops already allocated in time and space and also enables larger ILP exploitation than superscalar processors. The technique can reduce the energy consumption of a powerful 8-issue superscalar by 31.4% at low area cost, while also improving performance by 32.6%.
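A minimal Python sketch of the reuse idea follows, assuming a simple cache keyed by basic-block address; the decode step and µop names are stand-ins, not the paper's CGRA mapping:

    # Minimal sketch of decode/schedule reuse: cache the expensive
    # decode+schedule result per basic-block address and bypass those
    # pipeline stages on a hit.

    decoded_cache = {}  # basic-block start address -> scheduled micro-ops

    def decode_and_schedule(block_addr, fetch_bytes):
        """Stand-in for the costly CISC-to-uop decode and scheduling stages."""
        return [f"uop_{block_addr:x}_{i}" for i, _ in enumerate(fetch_bytes)]

    def execute_block(block_addr, fetch_bytes):
        if block_addr in decoded_cache:           # reuse: skip decode/schedule
            uops = decoded_cache[block_addr]
        else:                                     # first encounter: pay the cost
            uops = decode_and_schedule(block_addr, fetch_bytes)
            decoded_cache[block_addr] = uops
        return uops

    # A loop body is decoded once, then reused on every later iteration.
    for _ in range(3):
        execute_block(0x400A10, b"\x48\x89\xd8\x04\x05")
    print(len(decoded_cache))  # 1 cached transformation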

15:00 | 11.3.3 | UNDERSTANDING THE IMPACT OF PRECISION QUANTIZATION ON THE ACCURACY AND ENERGY OF NEURAL NETWORKS
Speaker:
Sherief Reda, Brown University, US
Authors:
Soheil Hashemi, Nicholas Anthony, Hokchhay Tann, Iris Bahar and Sherief Reda, Brown University, US
Abstract
Deep neural networks are gaining in popularity as they generate state-of-the-art results for a variety of computer vision and machine learning applications. At the same time, these networks have grown in depth and complexity in order to solve harder problems. Given the limited power budgets dedicated to these networks, the importance of low-power, low-memory solutions has been stressed in recent years. While a large number of dedicated hardware designs using different precisions have recently been proposed, there exists no comprehensive study of different bit precisions and arithmetic in both inputs and network parameters. In this work, we address this issue and perform a study of different bit precisions in neural networks (from floating point to fixed point, powers of two, and binary). In our evaluation, we consider and analyze the effect of precision scaling on both network accuracy and hardware metrics, including memory footprint, power and energy consumption, and design area. We also investigate training-time methodologies to compensate for the reduction in accuracy due to limited bit precision, and we demonstrate that, in most cases, precision scaling can deliver significant benefits in design metrics at the cost of very modest decreases in network accuracy. In addition, we propose that a small portion of the benefits achieved when using lower precisions can be forfeited to increase the network size and therefore the accuracy. We evaluate our approach using three well-recognized networks and datasets to show its generality. We investigate the trade-offs and highlight the benefits of using lower precisions in terms of energy and memory footprint.
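The precision axes the study covers (fixed-point, power-of-two, binary) can be mimicked with generic textbook quantizers such as the Python sketch below; these are illustrative, not the exact quantizers used in the paper:

    import numpy as np

    def to_fixed_point(x, frac_bits):
        """Round to a fixed-point grid with `frac_bits` fractional bits."""
        scale = 2.0 ** frac_bits
        return np.round(x * scale) / scale

    def to_power_of_two(x):
        """Snap magnitudes to the nearest power of two, keeping the sign."""
        mag = np.abs(x)
        q = np.where(mag > 0,
                     2.0 ** np.round(np.log2(np.maximum(mag, 1e-12))),
                     0.0)
        return np.sign(x) * q

    def to_binary(x):
        """Binarize to {-a, +a} with a = mean magnitude (XNOR-Net style)."""
        return np.sign(x) * np.mean(np.abs(x))

    # Compare how far each precision drifts from the original weights.
    w = np.random.randn(1000) * 0.1
    for name, q in [("fixed8", to_fixed_point(w, 8)),
                    ("pow2", to_power_of_two(w)),
                    ("binary", to_binary(w))]:
        print(name, "mean abs error:", np.mean(np.abs(w - q)))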

15:15 | 11.3.4 | BIG VS LITTLE CORE FOR ENERGY-EFFICIENT HADOOP COMPUTING
Speaker:
Houman Homayoun, George Mason University, US
Authors:
Maria Malik1, Katayoun Neshatpour1, Tinoosh Mohsenin2, Avesta Sasan1 and Houman Homayoun1
1George Mason University, US; 2University of Maryland Baltimore County, US
Abstract
The rapid growth of data poses challenges for processing it efficiently on current high-performance server architectures such as big Xeon cores. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factor for scaling out servers. Heterogeneous architectures that combine big Xeon cores with little Atom cores have emerged as a promising solution for enhancing energy efficiency by allowing each application to run on an architecture that matches its resource needs more closely than a one-size-fits-all architecture. The question of whether to map an application to big Xeon or little Atom cores in a heterogeneous server architecture therefore becomes important. In this paper, we characterize Hadoop-based applications and their corresponding MapReduce tasks on big Xeon- and little Atom-based server architectures to understand how the choice of big vs little cores is affected by various parameters at the application, system, and architecture levels, and by the interplay among these parameters. Furthermore, we evaluate the operational and capital costs to understand how performance, power, and area constraints for big data analytics affect the choice of big- vs little-core servers as the more cost- and energy-efficient architecture.
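As a hedged illustration of the big-vs-little mapping question (a toy cost model, not the paper's characterization methodology), the Python sketch below picks the core with the lower energy-delay product; all power and speedup numbers are hypothetical:

    # Toy big-vs-little core choice by energy-delay product (EDP); the power
    # and speedup numbers are hypothetical, not measurements from the paper.

    CORES = {
        "big_xeon":    {"power_w": 95.0, "speedup": 1.0},   # baseline runtime
        "little_atom": {"power_w": 8.0,  "speedup": 0.25},  # ~4x slower
    }

    def edp(core, base_runtime_s):
        """Energy-delay product: (power * time) * time."""
        t = base_runtime_s / CORES[core]["speedup"]
        return CORES[core]["power_w"] * t * t

    def choose_core(base_runtime_s):
        return min(CORES, key=lambda c: edp(c, base_runtime_s))

    # The little core wins on EDP only when power_big/power_little exceeds
    # its slowdown squared; here 95/8 = 11.9 < 4**2 = 16, so the big Xeon
    # wins despite its higher power draw.
    print(choose_core(60.0))  # -> big_xeon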

15:30 | IP5-9, 763 | LESS: BIG DATA SKETCHING AND ENCRYPTION ON LOW POWER PLATFORM
Speaker:
Amey Kulkarni, University of Maryland Baltimore County, US
Authors:
Amey Kulkarni1, Colin Shea2, Houman Homayoun3 and Tinoosh Mohsenin2
1University of Maryland, Baltimore County, US; 2University of Maryland Baltimore County, US; 3George Mason University, US
Abstract
The ever-growing IoT demands big data processing and cognitive computing on mobile and battery-operated devices. However, big data processing on low-power embedded cores is challenging due to their limited communication bandwidth and on-chip storage. Additionally, IoT and cloud-based computing demand a low-overhead security kernel to avoid data breaches. In this paper, we propose a Light-weight Encryption using Scalable Sketching (LESS) framework for big data sketching and encryption using One-Time Random Linear Projections (OTRLP). The OTRLP encoding matrix makes Known Plaintext Attacks (KPA) ineffective, as attackers cannot gain significant information from a plaintext-ciphertext pair. The LESS framework can reduce data by up to 67% with a 3.81 dB signal-to-reconstruction error rate (SRER). The framework has two important kernels, "sketching" and "sketch reconstruction"; the latter is computationally intensive and costly. We propose to accelerate the sketch reconstruction using Orthogonal Matching Pursuit (OMP) on a domain-specific many-core hardware platform named Power Efficient Nano Cluster (PENC), designed by the authors. Detailed performance and power analysis shows that the PENC platform consumes 15x and 200x less energy and achieves 8x and 177x faster reconstruction time compared to a low-power ARM CPU and a K1 GPU, respectively. To demonstrate the efficiency of the LESS framework, we integrate it with the Hadoop MapReduce platform for an object and scene identification application. The full hardware integration consists of tiny ARM cores, which perform task scheduling and the object identification application, while PENC acts as an accelerator for sketch reconstruction. The full hardware integration results show that the LESS framework achieves a 46% reduction in data transfers with a very low execution overhead of 0.11% and a negligible energy overhead of 0.001% when tested on 2.6 GB of streaming input data. The heterogeneous LESS framework requires 2x less transfer time and achieves 2.25x higher throughput per watt compared to the MapReduce platform.
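The sketching step is a random linear projection y = Phi x, and OMP greedily recovers a sparse signal from it. The Python sketch below uses generic compressive-sensing versions of both (not the LESS/PENC implementation); the random seed loosely stands in for the one-time key:

    import numpy as np

    rng = np.random.default_rng(42)      # seed stands in for the one-time key

    n, m, k = 256, 96, 8                 # signal length, sketch length, sparsity
    phi = rng.standard_normal((m, n)) / np.sqrt(m)  # random projection matrix

    x = np.zeros(n)                      # k-sparse test signal
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

    y = phi @ x                          # "sketch": compressed and obscured data

    def omp(phi, y, k):
        """Orthogonal Matching Pursuit: greedy k-sparse recovery of x from y."""
        residual, support = y.copy(), []
        for _ in range(k):
            j = int(np.argmax(np.abs(phi.T @ residual)))  # best-matching column
            support.append(j)
            sub = phi[:, support]
            coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
            residual = y - sub @ coef
        x_hat = np.zeros(phi.shape[1])
        x_hat[support] = coef
        return x_hat

    x_hat = omp(phi, y, k)
    print("relative reconstruction error:",
          np.linalg.norm(x - x_hat) / np.linalg.norm(x))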

15:31 | IP5-10, 656 | TRUNCAPP: A TRUNCATION-BASED APPROXIMATE DIVIDER FOR ENERGY EFFICIENT DSP APPLICATIONS
Speaker:
Shaghayegh Vahdat, University of Tehran, IR
Authors:
Shaghayegh Vahdat1, Mehdi Kamal1, Ali Afzali-Kusha1, Zainalabedin Navabi1 and Massoud Pedram2
1University of Tehran, IR; 2University of Southern California, US
Abstract
In this paper, we present a high-speed yet energy-efficient approximate divider in which the division operation is performed by multiplying the dividend by the inverse of the divisor. In this structure, the truncated value of the dividend is multiplied exactly (or approximately) by the approximate inverse of the divisor. To assess the efficacy of the proposed divider, its design parameters are extracted and compared to those of a number of prior-art dividers in a 45nm CMOS technology. Results reveal that this structure provides 66% and 52% improvements in area and energy consumption, respectively, compared to the most advanced prior-art approximate divider. In addition, the delay and energy consumption of the division operation are reduced by about 94.4% and 99.93%, respectively, compared to those of an exact SRT radix-4 divider. Finally, the efficacy of the proposed divider in an image processing application is studied.
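The core idea, replacing division with multiplication by an approximate inverse built from a few leading bits of the divisor, can be sketched in Python as below; the truncation widths are illustrative guesses, not TruncApp's actual design parameters:

    # Sketch of division by reciprocal with truncation: approximate 1/B from
    # the leading bits of B, then multiply by a truncated A.

    def approx_inverse(b, lead_bits=6):
        """Approximate 1/b keeping only `lead_bits` leading bits of b."""
        assert b > 0
        shift = b.bit_length() - lead_bits
        if shift > 0:
            b = (b >> shift) << shift     # drop low-order bits of divisor
        return 1.0 / b

    def approx_div(a, b, a_keep_bits=8):
        shift = max(a.bit_length() - a_keep_bits, 0)
        a_trunc = (a >> shift) << shift   # truncate the dividend too
        return a_trunc * approx_inverse(b)

    # Error stays small because only low-order bits are discarded.
    for a, b in [(1000, 7), (123456, 321), (89, 12)]:
        approx, exact = approx_div(a, b), a / b
        print(f"{a}/{b}: approx={approx:.2f} exact={exact:.2f} "
              f"err={abs(approx - exact) / exact:.2%}")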

15:30 | End of session
Coffee Break in Exhibition Area
