8.5 Learning and Resilience Techniques for Green Computing

Date: Wednesday 29 March 2017
Time: 17:00 - 18:30
Location / Room: 3C

Chair:
Muhammad Shafique, Vienna University of Technology (TU Wien), AT

Co-Chair:
Andreas Burg, EPFL, CH

The papers in this session discuss the use of learning techniques as well as energy-efficient circuit-level implementation techniques for neural networks and for green computing in general.

Time  Label  Presentation Title / Authors

17:00  8.5.1  REVAMPING TIMING ERROR RESILIENCE TO TACKLE CHOKE POINTS AT NTC SYSTEMS
Speaker:
Aatreyi Bal, USU Bridge Lab, Utah State University, US
Authors:
Aatreyi Bal, Shamik Saha, Sanghamitra Roy and Koushik Chakraborty, Utah State University, US
Abstract
In this paper, we identify "choke points" as a critical consequence of process variation in the Near Threshold Computing (NTC) domain. Choke points are sensitized logic gates that exhibit increased delay deviation due to process variation.
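The notion of a choke point can be illustrated with a small Monte Carlo experiment (a sketch of the concept, not the authors' method): sample per-gate delay deviations under an assumed process-variation model and flag the gates whose delay dominates the paths they sensitize. All constants below (gate count, delay spreads, threshold) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GATES = 1000        # gates on sensitized paths (assumed)
SIGMA_NOM = 0.05      # delay spread at nominal voltage (assumed)
SIGMA_NTC = 0.25      # the same variation, amplified at near-threshold (assumed)
CHOKE_DELAY = 1.5     # flag gates >50% slower than nominal (assumed)

# Process variation gives each gate a fixed delay deviation around a
# normalized nominal delay of 1.0; the same underlying variation
# produces a much wider delay spread at near-threshold voltage.
delay_nominal = rng.lognormal(mean=0.0, sigma=SIGMA_NOM, size=N_GATES)
delay_ntc = rng.lognormal(mean=0.0, sigma=SIGMA_NTC, size=N_GATES)

for name, d in [("nominal", delay_nominal), ("NTC", delay_ntc)]:
    chokes = np.flatnonzero(d > CHOKE_DELAY)
    print(f"{name}: {chokes.size} of {N_GATES} gates act as choke points")
```

Under these assumptions essentially no gate crosses the threshold at nominal voltage, while a noticeable fraction does at NTC, which is the intuition behind choke points emerging specifically in the near-threshold regime.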

Download Paper (PDF; Only available from the DATE venue WiFi)
17:30  8.5.2  EFFICIENT NEURAL NETWORK ACCELERATION ON GPGPU USING CONTENT ADDRESSABLE MEMORY
Speaker:
Tajana Rosing, University of California at San Diego, US
Authors:
Mohsen Imani1, Daniel Peroni1, Yeseong Kim1, Abbas Rahimi2 and Tajana Rosing3
1University of California San Diego, US; 2University of California Berkeley, US; 3UCSD, US
Abstract
Recently, neural networks have been demonstrated to be effective models for image processing, video segmentation, speech recognition, computer vision and gaming. However, high computation energy and low performance are the primary bottlenecks of running neural networks. In this paper, we propose an energy/performance-efficient network acceleration technique for the General Purpose GPU (GPGPU) architecture which utilizes specialized resistive nearest content addressable memory blocks, called NNCAM, by exploiting the computation locality of learning algorithms. NNCAM stores high-frequency patterns corresponding to neural network operations and searches for the most similar patterns to reuse the computation results. To improve NNCAM computation efficiency and accuracy, we propose layer-based associative update and selective approximation techniques. The layer-based update improves the data locality of NNCAM blocks by filling NNCAM values based on the frequent computation patterns of each neural network layer. To guarantee an appropriate level of computation accuracy while providing maximum energy savings, our design adaptively allocates neural network operations to either NNCAM or the GPGPU floating point units (FPUs). The selective approximation relaxes computation on neural network layers by considering the impact on accuracy. In our evaluation, we integrate NNCAM blocks with the modern AMD Southern Islands GPU architecture. Our experimental evaluation shows that the enhanced GPGPU achieves 68% energy savings and a 40% speedup on four popular convolutional neural networks (CNNs), with an acceptable quality loss of less than 2%.
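The computation-reuse idea behind NNCAM can be sketched in software (this models the concept only, not the resistive CAM hardware or the paper's exact policies): cache operand patterns together with their results and, when a query is close enough to a stored pattern, return the stored result instead of recomputing. The distance metric, tolerance, capacity and eviction policy below are all assumptions.

```python
import numpy as np

class NearestCAM:
    """Software stand-in for a nearest-match CAM: stores operand
    patterns with their results and returns the stored result when a
    query is within `tol` of its nearest stored pattern (L1 distance
    and FIFO eviction are assumptions)."""

    def __init__(self, capacity=128, tol=0.5):
        self.capacity, self.tol = capacity, tol
        self.keys, self.values = [], []

    def search(self, x):
        if not self.keys:
            return None
        dist = np.abs(np.asarray(self.keys) - x).sum(axis=1)
        i = int(dist.argmin())
        return self.values[i] if dist[i] <= self.tol else None

    def insert(self, x, y):
        if len(self.keys) >= self.capacity:     # evict oldest entry
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(x)
        self.values.append(y)

def neuron(x, w):                     # exact FPU path
    return float(np.tanh(x @ w))

rng = np.random.default_rng(1)
w, cam, hits = rng.normal(size=4), NearestCAM(), 0
for _ in range(1000):
    x = np.round(rng.normal(size=4))  # inputs cluster on few patterns
    y = cam.search(x)
    if y is None:
        y = neuron(x, w)              # compute exactly, then memoize
        cam.insert(x, y)
    else:
        hits += 1                     # approximate reuse, FPU skipped
print(f"reuse rate: {hits / 1000:.0%}")
```

The reuse rate is what an associative memory converts into energy savings: every hit replaces a floating-point evaluation with a (cheaper) similarity search.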

Download Paper (PDF; Only available from the DATE venue WiFi)
18:00  8.5.3  CHAIN-NN: AN ENERGY-EFFICIENT 1D CHAIN ARCHITECTURE FOR ACCELERATING DEEP CONVOLUTIONAL NEURAL NETWORKS
Speaker:
Shihao Wang, Waseda University, JP
Authors:
Shihao Wang, Dajiang Zhou, Xushen Han and Takeshi Yoshimura, Waseda University, JP
Abstract
Deep convolutional neural networks (CNNs) have demonstrated strong performance in many computer vision tasks. However, the high computational complexity of CNNs involves a huge amount of data movement between the processor core and the memory hierarchy, which accounts for the majority of the power consumption. This paper presents Chain-NN, a novel energy-efficient 1D chain architecture for accelerating deep CNNs. Chain-NN consists of dedicated dual-channel processing engines (PEs). In Chain-NN, convolutions are performed by 1D systolic primitives composed of groups of adjacent PEs. These systolic primitives, together with the proposed column-wise scan input pattern, can fully reuse input operands to reduce the memory bandwidth requirement for energy saving. Moreover, the 1D chain architecture allows the systolic primitives to be easily reconfigured for specific CNN parameters with lower design complexity. Chain-NN is synthesized and laid out in a TSMC 28nm process, costing 3751k logic gates and 352KB of on-chip memory. The results show that a 576-PE Chain-NN can be scaled up to 700MHz, achieving a peak throughput of 806.4GOPS at 567.5mW and accelerating the five convolutional layers of AlexNet at a frame rate of 362.2fps. Its power efficiency of 1421.0GOPS/W is at least 2.5x to 4.1x better than state-of-the-art works.
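A behavioral sketch of a 1D systolic chain may help fix the idea: assume each PE holds one filter weight, multiplies it with the input sample it sees, adds the partial sum arriving from its left neighbor, and forwards the result to the right. The code below is a software equivalent of that dataflow for a 1D convolution; the dual-channel PEs and column-wise scan of the actual design are omitted.

```python
# Behavioral model of a 1D systolic chain: PE i holds weight w[i],
# multiplies it with the input sample it sees, adds the partial sum
# arriving from PE i-1, and forwards the result to PE i+1.
def chain_conv(x, w):
    K = len(w)
    y = []
    # Each output flows through the K-PE chain over K cycles; this
    # loop nest is the software equivalent of that pipeline.
    for j in range(len(x) - K + 1):
        partial = 0
        for i in range(K):                       # PE i in the chain
            partial = partial + w[i] * x[j + i]  # MAC, then forward
        y.append(partial)
    return y

x = [1, 2, 3, 4, 5, 6]
w = [1, 0, -1]                 # assumed 1x3 filter
print(chain_conv(x, w))        # [-2, -2, -2, -2]
```

Because adjacent outputs share K-1 input samples, a chain that streams inputs past stationary weights reuses each operand K times, which is the source of the memory-bandwidth reduction the abstract describes.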

Download Paper (PDF; Only available from the DATE venue WiFi)
18:15  8.5.4  CONTINUOUS LEARNING OF HPC INFRASTRUCTURE MODELS USING BIG DATA ANALYTICS AND IN-MEMORY PROCESSING TOOLS
Speaker:
Francesco Beneventi, Università di Bologna, IT
Authors:
Francesco Beneventi1, Andrea Bartolini2, Carlo Cavazzoni3 and Luca Benini2
1DEI - University of Bologna, IT; 2Università di Bologna, IT; 3Cineca, IT
Abstract
Exascale computing represents the next leap in the HPC race. Reaching this level of performance is subject to several engineering challenges such as energy consumption, equipment cooling, reliability and massive parallelism. Model-based optimization is an essential tool in the design process and control of energy-efficient, reliable and thermally constrained systems. However, in the Exascale domain, model learning techniques tailored to a specific supercomputer require real measurements and must therefore handle and analyze a massive amount of data coming from the HPC monitoring infrastructure. This rapidly becomes a "big data" scale problem. The common approach, where measurements are first stored in large databases and then processed, is no longer affordable due to increasing storage costs and the lack of real-time support. Instead, cloud-based machine learning techniques nowadays aim to build on-line models using real-time approaches such as "stream processing" and "in-memory" computing, which avoid storage costs and enable fast data processing. Moreover, the fast delivery and adaptation of the models to quick data variations make the decision stage of the optimization loop more effective and reliable. In this paper we leverage scalable, lightweight and flexible IoT technologies, such as the MQTT protocol, to build a highly scalable HPC monitoring infrastructure able to handle the massive sensor data produced by next-gen HPC components. We then show how state-of-the-art tools for big data computing and analysis, such as Apache Spark, can be used to manage the huge amount of data delivered by the monitoring layer and to build adaptive models in real time using on-line machine learning techniques.
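A minimal sketch of the monitoring-plus-online-learning loop, assuming a paho-mqtt subscriber (1.x-style API) and a simple recursive least-squares power model updated per sample rather than store-then-process; the broker address, topic layout and JSON payload fields are hypothetical, and the paper's Spark-based pipeline is not reproduced here.

```python
import json
import numpy as np
import paho.mqtt.client as mqtt   # assumed client library (1.x-style API)

# On-line model: node power ~ theta . [load, temperature, 1], updated
# one sample at a time by recursive least squares (RLS).
theta = np.zeros(3)
P = np.eye(3) * 1e3               # RLS covariance (large = uninformed prior)

def rls_update(x, y):
    global theta, P
    x = np.asarray(x, dtype=float)
    k = P @ x / (1.0 + x @ P @ x)
    theta = theta + k * (y - x @ theta)
    P = P - np.outer(k, x @ P)

def on_message(client, userdata, msg):
    s = json.loads(msg.payload)   # payload fields are hypothetical
    rls_update([s["load"], s["temp"], 1.0], s["power"])
    print("model:", theta)

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.org")      # hypothetical broker
client.subscribe("hpc/nodes/+/sensors")   # hypothetical topic layout
client.loop_forever()                     # stream-processing loop
```

The point of the sketch is that the model lives in memory and tracks the stream directly, so no measurement ever needs to be stored in a database before it contributes to the model.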

Download Paper (PDF; Only available from the DATE venue WiFi)
18:30  IP4-2, 463  LAANT: A LIBRARY TO AUTOMATICALLY OPTIMIZE EDP FOR OPENMP APPLICATIONS
Speaker:
Arthur Francisco Lorenzon, Federal University of Rio Grande do Sul, BR
Authors:
Arthur Lorenzon, Jeckson Dellagostin Souza and Antonio Carlos Schneider Beck Filho, Universidade Federal do Rio Grande do Sul, BR
Abstract
Efficiently exploiting thread-level parallelism on new multicore systems has been challenging for software developers. While blindly increasing the number of threads may lead to performance gains, it can also result in a disproportionate increase in energy consumption. For this reason, choosing the right number of threads is essential to reach the best compromise between the two. However, this task is extremely difficult: besides the huge number of variables involved, many of them change according to different aspects of the system at hand and can only be determined at run-time. To address this complex scenario, we propose LAANT, a novel library that automatically finds the optimal number of threads for OpenMP applications by dynamically considering their particular characteristics, the input set, and the processor architecture. Executing nine well-known benchmarks on three real multicore processors, LAANT improves the EDP (Energy-Delay Product) by up to 61% compared to the standard OpenMP execution, and by 44% when OpenMP's dynamic adjustment of the number of threads is activated.
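The EDP trade-off can be illustrated with a naive exhaustive search over thread counts (not LAANT's run-time algorithm): run the application at each count, measure delay and energy, and keep the count with the lowest product. The sketch below assumes a Linux machine exposing Intel RAPL energy counters, and ./omp_app is a hypothetical OpenMP binary.

```python
import os
import subprocess
import time

def read_energy_joules():
    """Hypothetical energy probe: reads the package energy counter
    exposed by Intel RAPL on Linux (counter wrap-around ignored)."""
    with open("/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj") as f:
        return int(f.read()) / 1e6

def edp(n_threads, cmd):
    # OpenMP reads the thread count from the environment.
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    e0, t0 = read_energy_joules(), time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    delay = time.perf_counter() - t0
    energy = read_energy_joules() - e0
    return energy * delay              # EDP = energy x delay

best = min(range(1, os.cpu_count() + 1),
           key=lambda n: edp(n, ["./omp_app"]))   # hypothetical binary
print("best thread count by EDP:", best)
```

An exhaustive sweep like this is exactly what is too expensive to do per input and per machine, which is why a library that converges on the right count at run-time is useful.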

Download Paper (PDF; Only available from the DATE venue WiFi)
18:31  IP4-3, 68  IMPROVING THE ACCURACY OF THE LEAKAGE POWER ESTIMATION OF EMBEDDED CPUS
Speaker:
Shiao-Li Tsao, National Chiao Tung University, TW
Authors:
Ting-Wu Chin, Shiao-Li Tsao, Kuo-Wei Hung and Pei-Shu Huang, National Chiao Tung University, TW
Abstract
Previous studies have used on-chip thermal sensors (diodes) to estimate the leakage power of a CPU. However, an embedded CPU is equipped with only a few thermal sensors and may suffer from considerable spatial temperature variation across the CPU core, and leakage power estimation based on insufficient temperature information introduces errors. According to our experiments, conventional leakage power models can have up to 22.9% estimation error for a 70-nm embedded CPU. In this study, we first evaluated the accuracy of leakage power estimates based on thermal sensors at different locations of a CPU and suggested locations that can reduce the error to 0.9%. We then proposed temperature-referred and counter-tracked estimation (TRACE), which relies on temperature sensors and hardware activity counters to estimate leakage power. Simulation results demonstrate that TRACE can reduce the error to 3.4%. Experiments were also conducted on a real platform to verify our findings.
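A common temperature-dependent leakage approximation, P_leak ≈ A · T² · exp(−B/T), can be fitted from sensor readings by linear least squares in log space. The sketch below uses synthetic data and illustrates this general class of model, not the paper's TRACE formulation (which additionally tracks activity counters).

```python
import numpy as np

# Leakage approximation: P_leak = A * T^2 * exp(-B / T), T in kelvin.
# Taking logs makes it linear in the unknowns (log A, B):
#   log P - 2*log T = log A - B * (1/T)
T = np.array([300.0, 310.0, 320.0, 330.0, 340.0, 350.0])  # synthetic sensor data
noise = 1 + np.random.default_rng(0).normal(0, 0.01, T.size)
P = 0.8 * T**2 * np.exp(-3000.0 / T) * noise              # "measured" leakage

y = np.log(P) - 2 * np.log(T)
X = np.column_stack([np.ones_like(T), -1.0 / T])
(logA, B), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"fitted A = {np.exp(logA):.3f}, B = {B:.0f} K (ground truth: 0.8, 3000 K)")
```

The strong, nonlinear dependence of leakage on temperature is also why sensor placement matters: feeding such a model a temperature that is off by a few kelvin translates into a large error in the estimated power.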

Download Paper (PDF; Only available from the DATE venue WiFi)
18:30  End of session