6.4 Bridging the Moore's Law Gap with Application-Specific Architectures


Date: Wednesday 11 March 2015
Time: 11:00 - 12:30
Location / Room: Chartreuse

Chair:
Cristina Silvano, Politecnico di Milano, IT

Co-Chair:
Akash Kumar, National University of Singapore, SG

This session focuses on approximation, low-power, and high-performance optimization techniques for application-specific architectures, including neural networks, multicores and GPUs.

Time | Label | Presentation Title / Authors
11:00 | 6.4.1 | A ULTRA-LOW-ENERGY CONVOLUTION ENGINE FOR FAST BRAIN-INSPIRED VISION IN MULTICORE CLUSTERS
Speakers:
Francesco Conti1 and Luca Benini2
1Università di Bologna, IT; 2Università di Bologna, IT / ETH Zürich, CH
Abstract
State-of-the-art brain-inspired computer vision algorithms such as Convolutional Neural Networks (CNNs) are reaching accuracy and performance rivaling that of humans; however, the gap in terms of energy consumption is still many orders of magnitude wide. Many-core architectures using shared-memory clusters of power-optimized RISC processors have been proposed as a possible solution to help close this gap. In this work, we propose to augment these clusters with Hardware Convolution Engines (HWCEs): ultra-low-energy coprocessors for accelerating convolutions, the main building block of many brain-inspired computer vision algorithms. Our synthesis results in ST 28nm FDSOI technology show that the HWCE can perform a convolution spending as little as 35 pJ/pixel on average in its lowest-energy state, with a best case of 6.5 pJ/pixel. Furthermore, we show that augmenting a cluster with an HWCE can boost energy efficiency in convolutional workloads by 40x or more on average.
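
For context, the operation the HWCE accelerates is a plain 2-D convolution. The C reference below is a minimal scalar sketch of that workload, not the HWCE design itself; the 5x5 kernel size, 16-bit fixed-point pixels, and rescaling shift are illustrative assumptions.

    #include <stdint.h>

    /* Scalar reference for the 2-D convolution an HWCE-class coprocessor
     * accelerates. Kernel size and data types are illustrative only. */
    #define K 5
    void conv2d(const int16_t *in, int16_t *out,
                const int16_t w[K][K], int rows, int cols)
    {
        for (int r = 0; r <= rows - K; r++) {
            for (int c = 0; c <= cols - K; c++) {
                int32_t acc = 0;
                for (int i = 0; i < K; i++)
                    for (int j = 0; j < K; j++)
                        acc += in[(r + i) * cols + (c + j)] * w[i][j];
                out[r * cols + c] = (int16_t)(acc >> 8); /* fixed-point rescale */
            }
        }
    }

At the reported average of 35 pJ/pixel, one such pass over a 640x480 frame costs roughly 307,200 x 35 pJ ≈ 10.8 µJ.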

11:30 | 6.4.2 | ELIMINATING INTRA-WARP CONFLICT MISSES IN GPU
Speakers:
Bin Wang, Zhuo Liu, Xinning Wang and Weikuan Yu, Auburn University, US
Abstract
Cache indexing functions play a key role in reducing conflict misses by spreading accesses evenly among all sets of cache blocks. Although various methods have been proposed, little effort has been devoted to the behavior of conflict misses in GPUs, where threads are organized into warps and execute in lock-step. When intra-warp accesses cannot be coalesced into one or two cache blocks, a situation often referred to as memory divergence, a warp incurs up to SIMD-width (e.g., 32) independent cache accesses. Such a burst of divergent accesses not only increases contention for cache capacity, but also incurs intra-warp associativity conflicts when the accesses are pathologically concentrated in a few cache sets. Because of the lock-step execution, the GPU Load/Store units stall whenever intra-warp concentration exceeds the available cache associativity. Through an in-depth analysis of GPU access patterns, we find that column-major strided accesses are likely to incur high intra-warp concentration. Based on this analysis, we propose a Full Permutation (FUP) based indexing method that adapts to both large and medium strides in this pattern. Across the 10 highly cache-sensitive GPU applications we evaluated, FUP eliminates intra-warp associativity conflicts and outperforms two state-of-the-art indexing methods by 22% and 15%, respectively.
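
The abstract does not give the FUP function itself, but the idea behind permutation-based indexing can be sketched. The C fragment below contrasts conventional modulo set indexing with a generic XOR-permutation index that folds tag bits into the set bits; the 128-set, 128-byte-block geometry is a hypothetical GPU-like configuration, and the XOR scheme is a stand-in for this class of functions, not the paper's exact FUP design.

    #include <stdint.h>

    #define LOG2_SETS  7                        /* 128 sets (assumed)        */
    #define LOG2_BLOCK 7                        /* 128-byte blocks (assumed) */
    #define SET_MASK   ((1u << LOG2_SETS) - 1)

    /* Conventional index: low address bits above the block offset. */
    static uint32_t index_modulo(uint64_t addr)
    {
        return (addr >> LOG2_BLOCK) & SET_MASK;
    }

    /* Generic XOR-permutation index: folding tag bits into the set bits
     * spreads large power-of-two strides across many sets instead of
     * concentrating a warp's accesses in a few. Illustrative only. */
    static uint32_t index_xor(uint64_t addr)
    {
        uint32_t set  = (addr >> LOG2_BLOCK) & SET_MASK;
        uint32_t tag1 = (addr >> (LOG2_BLOCK + LOG2_SETS)) & SET_MASK;
        uint32_t tag2 = (addr >> (LOG2_BLOCK + 2 * LOG2_SETS)) & SET_MASK;
        return set ^ tag1 ^ tag2;
    }

With the modulo index and a column-major stride of 16 KiB (sets x block size), all 32 accesses of a warp land in the same set, far beyond any realistic associativity; the XOR variant disperses the same burst across distinct sets.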

12:00 | 6.4.3 | RNA: A RECONFIGURABLE ARCHITECTURE FOR HARDWARE NEURAL ACCELERATION
Speakers:
Fengbin Tu1, Shouyi Yin1, Peng Ouyang1, Leibo Liu2 and Shaojun Wei1
1Tsinghua University, CN; 2Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, CN
Abstract
As energy has become a major concern in digital system design, one promising solution is to combine the core processor with a multi-purpose accelerator targeting high-performance applications. Many modern applications can be approximated by multi-layer perceptron (MLP) models with little quality loss. However, current MLP accelerators have several drawbacks, such as an imbalance between performance and flexibility. In this paper, we propose a scheduling framework that guides the mapping of MLPs onto limited hardware resources with high performance. The framework addresses the main constraints of hardware neural acceleration. Furthermore, we implement a reconfigurable neural architecture (RNA) based on this framework, whose computing pattern can be reconfigured for different MLP topologies. The RNA achieves performance comparable to application-specific accelerators and greater flexibility than other hardware MLPs.
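
The computation the RNA schedules is the standard MLP forward pass. A minimal C sketch of one fully connected layer (sigmoid activation and float arithmetic are illustrative choices) shows the per-neuron multiply-accumulate pattern that a reconfigurable array must map onto its processing elements.

    #include <math.h>

    /* One fully connected MLP layer: out = sigmoid(W * in + b).
     * Each output neuron is an n_in-long dot product; layer sizes and
     * the activation are illustrative, not the RNA's fixed choices. */
    void mlp_layer(const float *in, float *out, const float *W,
                   const float *b, int n_in, int n_out)
    {
        for (int o = 0; o < n_out; o++) {
            float acc = b[o];
            for (int i = 0; i < n_in; i++)
                acc += W[o * n_in + i] * in[i];
            out[o] = 1.0f / (1.0f + expf(-acc)); /* sigmoid activation */
        }
    }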

12:15 | 6.4.4 | APPROXANN: AN APPROXIMATE COMPUTING FRAMEWORK FOR ARTIFICIAL NEURAL NETWORK
Speakers:
Qian Zhang, Ting Wang, Ye Tian, Feng Yuan and Qiang Xu, The Chinese University of Hong Kong, HK
Abstract
Artificial Neural Networks (ANNs) are one of the most well-established machine learning techniques and have a wide range of applications, such as Recognition, Mining, and Synthesis (RMS). As many of these applications are inherently error-tolerant, we propose a novel approximate computing framework for ANNs, named ApproxANN. Compared to existing solutions, ApproxANN considers approximation not only for the computational units but also for memory accesses. Specifically, ApproxANN characterizes the impact of each neuron on output quality in an effective and efficient manner, and judiciously determines how to approximate the computation and memory accesses of less critical neurons to achieve the maximum energy-efficiency gain under a given quality constraint. Experimental results on various ANN applications with different datasets demonstrate the efficacy of the proposed solution.
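
The abstract's central mechanism, ranking neurons by their impact on output quality and approximating the least critical ones first, can be sketched as a greedy selection loop. Everything in the fragment below (the criticality score, the single on/off approximation knob, the quality-loss callback) is a hypothetical placeholder, not ApproxANN's actual algorithm.

    /* Greedy sketch: approximate neurons in ascending criticality order
     * until the quality budget would be exceeded. All names hypothetical. */
    typedef struct {
        int   id;
        float criticality;  /* estimated impact on output quality */
        int   approximate;  /* 1: reduced-precision compute and memory accesses */
    } neuron_t;

    void select_approx(neuron_t *n, int count, float quality_budget,
                       float (*quality_loss)(const neuron_t *, int))
    {
        /* assumes n[] is pre-sorted by ascending criticality */
        for (int k = 0; k < count; k++) {
            n[k].approximate = 1;
            if (quality_loss(n, count) > quality_budget) {
                n[k].approximate = 0;  /* revert: constraint violated */
                break;
            }
        }
    }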

12:30 | IP3-3 | A HARDWARE IMPLEMENTATION OF A RADIAL BASIS FUNCTION NEURAL NETWORK USING STOCHASTIC LOGIC
Speakers:
Yuan Ji1, Feng Ran1, Cong Ma2 and David Lilja2
1Shanghai University, CN; 2University of Minnesota - Twin Cities, US
Abstract
Hardware implementations of artificial neural networks typically require significant amounts of hardware resources. This paper proposes a novel radial basis function artificial neural network using stochastic computing elements, which greatly reduces the required hardware. The Gaussian function used for the radial basis function is implemented with a two-dimensional finite state machine. The norm between the input data and the center point is optimized using simple logic gates. Results from two pattern recognition case studies, the standard Iris flower and MICR font benchmarks, show that the difference in average mean squared error between the proposed stochastic network and the corresponding traditional deterministic network is only 1.3% when the stochastic stream length is 10 kbits. The recognition accuracy varies with the stream length, which gives the designer tremendous flexibility to trade off speed, power, and accuracy. FPGA implementation results show that the hardware resource requirement of the proposed stochastic hidden neuron is only a few percent of that of the corresponding deterministic hidden neuron. The proposed stochastic network can be scaled up to larger networks for complex tasks with simple hardware architectures.
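
For readers unfamiliar with stochastic logic: a value x in [0, 1] is encoded as the probability that a bit in a stream is 1, so a single AND gate multiplies two independent streams. The self-contained C sketch below demonstrates that encoding and the accuracy-versus-length trade-off the abstract describes; it illustrates the general technique only, not the paper's FSM-based Gaussian function.

    #include <stdio.h>
    #include <stdlib.h>

    /* Draw one bit of a Bernoulli stream encoding x: P(bit = 1) = x. */
    static int sc_bit(double x) { return ((double)rand() / RAND_MAX) < x; }

    int main(void)
    {
        const int LEN = 10000;  /* stream length: longer = more accurate */
        double a = 0.6, b = 0.5;
        int ones = 0;
        for (int t = 0; t < LEN; t++)
            ones += sc_bit(a) & sc_bit(b);   /* AND gate acts as multiplier */
        printf("a*b ~= %f (exact %f)\n", (double)ones / LEN, a * b);
        return 0;
    }

The estimate's standard error shrinks roughly as 1/sqrt(LEN), which is exactly the speed/power/accuracy trade-off that the stream length exposes.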

12:31 | IP3-4 | SODA: SOFTWARE DEFINED FPGA BASED ACCELERATORS FOR BIG DATA
Speakers:
Chao Wang, Xi Li and Xuehai Zhou, University of Science and Technology of China, CN
Abstract
FPGAs have emerged as a platform for novel big data architectures and systems, owing to their high efficiency and low power consumption, and they enable researchers to deploy massive numbers of accelerators within a single chip. In this paper, we present SODA, a software-defined FPGA-based accelerator framework for big data that can reconstruct and reorganize its acceleration engines according to the requirements of various data-intensive applications. SODA decomposes large and complex applications into coarse-grained, single-purpose RTL code libraries that perform specialized tasks in out-of-order hardware. We built a prototype system and evaluated the SODA framework with constrained shortest path finding (CSPF) case studies. SODA achieves up to a 43.75x speedup on a 128-node application. Furthermore, the hardware cost of the SODA framework shows that it achieves this speedup with moderate hardware utilization.
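
As a point of reference for the case study, constrained shortest path finding asks for the cheapest path whose accumulated resource stays within a budget. The C sketch below is a compact CPU-side dynamic-programming reference for that workload, assuming an adjacency-matrix graph with INF marking absent edges; it says nothing about SODA's RTL libraries or hardware mapping.

    #include <string.h>

    #define N    8           /* node count (illustrative) */
    #define RMAX 16          /* resource budget (illustrative) */
    #define INF  0x3f3f3f3f  /* "no edge" / "unreached" sentinel */

    /* dist[v][r] = min cost to reach v having spent exactly r resource. */
    int cspf(int cost[N][N], int res[N][N], int src, int dst)
    {
        static int dist[N][RMAX + 1];
        memset(dist, 0x3f, sizeof dist);            /* fill with INF */
        dist[src][0] = 0;
        for (int iter = 0; iter < N - 1; iter++)    /* Bellman-Ford sweeps */
            for (int u = 0; u < N; u++)
                for (int v = 0; v < N; v++)
                    if (cost[u][v] < INF)
                        for (int r = 0; r + res[u][v] <= RMAX; r++)
                            if (dist[u][r] + cost[u][v] < dist[v][r + res[u][v]])
                                dist[v][r + res[u][v]] = dist[u][r] + cost[u][v];
        int best = INF;
        for (int r = 0; r <= RMAX; r++)
            if (dist[dst][r] < best) best = dist[dst][r];
        return best;  /* INF if no path satisfies the budget */
    }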

12:30 | End of session
Lunch Break; keynote lectures from 12:50 - 14:20 (Room Oisans). Lunch boxes are served in front of the session room Salle Oisans and in the Exhibition area.

Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served in the exhibition area at the times listed below.

Lunch Break

On Tuesday and Wednesday, lunch boxes will be served in front of the session room Salle Oisans and in the exhibition area for fully registered delegates (a voucher will be given upon registration on-site). On Thursday, lunch will be served in Room Les Ecrins (for fully registered conference delegates only).

Tuesday, March 10, 2015

Coffee Break 10:30 - 11:30

Lunch Break 13:00 - 14:30; Keynote session from 13:20 - 14:20 (Room Oisans) sponsored by Mentor Graphics

Coffee Break 16:00 - 17:00

Wednesday, March 11, 2015

Coffee Break 10:00 - 11:00

Lunch Break 12:30 - 14:30, Keynote lectures from 12:50 - 14:20 (Room Oisans)

Coffee Break 16:00 - 17:00

Thursday, March 12, 2015

Coffee Break 10:00 - 11:00

Lunch Break 12:30 - 14:00, Keynote lecture from 13:20 - 13:50

Coffee Break 15:30 - 16:00