3.3 Loop Acceleration

Date: Tuesday 10 March 2015
Time: 14:30 - 16:00
Location / Room: Stendhal

Chair:
Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE

Co-Chair:
Benjamin Schafer, Hong Kong Polytechnic University, HK

This session reveals novel loop optimization techniques in high-level synthesis for resolving area overhead and communication bottlenecks in nested loops and/or multidimensional arrays. The first talk leverages loop-array dependencies for loop partitioning to reduce the dimension of the design space in order to ease the design complexity. The second talk quantifies a relationship between loop unrolling and partitioning, based on which area reduction methods are proposed by controlling the degree of loop unrolling. The third talk then resolves communication bottlenecks in embedded accelerators through inter-tile data reuse on loop optimizations.

Time	Label	Presentation Title Authors
14:30	3.3.1	EXPLOITING LOOP-ARRAY DEPENDENCIES TO ACCELERATE THE DESIGN SPACE EXPLORATION WITH HIGH LEVEL SYNTHESIS Speakers: Nam Khanh Pham¹, Amit Kumar Singh², Akash Kumar³ and Mi Mi Aung Khin⁴ ¹ECE Department, National University of Singapore, SG; ²University of York, GB; ³National University of Singapore, SG; ⁴Data Storage Institute (DSI), ASTAR, Singapore., SG Abstract* Recently, the requirement of shortened design cycles has led to rapid development of High Level Synthesis (HLS) tools that convert system level descriptions in a high level language into efficient hardware designs. Due to the high level of abstraction, HLS tools can easily provide multiple hardware designs from the same behavioral description. Therefore, they allow designers to explore various architectural options for different design objectives. However, such exploration has exponential complexity, making it practically impossible to explore the entire design space. The conventional approaches to reduce the design space exploration (DSE) complexity do not analyze the structure of the design space to limit the number of design points. To fill such a gap, we explore the structure of the design space by analyzing the dependencies between loops and arrays. We represent these dependencies as a graph that is used to reduce the dimensions of the design space. Moreover, we also examine the access pattern of the array and utilize it to find the efficient partition of arrays for each loop optimization parameter set. The experimental results show that our approach provides almost the same quality of result as the exhaustive DSE approach while significantly reducing the exploration time with an average of speed-up of 14x. Download Paper (PDF; Only available from the DATE venue WiFi)
15:00	3.3.2	INTERPLAY OF LOOP UNROLLING AND MULTIDIMENSIONAL MEMORY PARTITIONING IN HLS Speakers: Alessandro Cilardo and Luca Gallo, University of Naples Federico II, IT Abstract This paper deals with memory partitioning in the context of high-level synthesis for FPGA technologies. In particular, the work focuses on the area overhead caused by partitioning and sheds light on the interplay with a technique commonly used in HLS, i.e., loop unrolling. As a practical outcome, the study proposes a solution to reduce the area overhead by appropriately controlling the degree of loop unrolling. The experimental results confirm the significance of the analysis as well as the effectiveness of the proposed optimization technique. Download Paper (PDF; Only available from the DATE venue WiFi)
15:30	3.3.3	INTER-TILE REUSE OPTIMIZATION APPLIED TO BANDWIDTH CONSTRAINED EMBEDDED ACCELERATORS Speakers: Maurice Peemen, Bart Mesman and Henk Corporaal, Eindhoven University of Technology, NL Abstract The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale-up performance accelerators require huge amounts of data, and are often limited by interconnect resources. In addition, the energy spent by the accelerator is dominated by the transfer of data, either in the form of memory references or data movement on interconnect. In this paper we drastically reduce accelerator communication by exploration of computation reordering and local buffer usage. Consequently, we present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations like interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel-i7 processor. Download Paper (PDF; Only available from the DATE venue WiFi)
16:00		End of session Coffee Break in Exhibition Area Coffee Break in Exhibition Area On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area. Lunch Break On Tuesday and Wednesday, lunch boxes will be served in front of the session room Salle Oisans and in the exhibition area for fully registered delegates (a voucher will be given upon registration on-site). On Thursday, lunch will be served in Room Les Ecrins (for fully registered conference delegates only). Tuesday, March 10, 2015 Coffee Break 10:30 - 11:30 Lunch Break 13:00 - 14:30; Keynote session from 13:20 - 14:20 (Room Oisans) sponsored by Mentor Graphics Coffee Break 16:00 - 17:00 Wednesday, March 11, 2015 Coffee Break 10:00 - 11:00 Lunch Break 12:30 - 14:30, Keynote lectures from 12:50 - 14:20 (Room Oisans) Coffee Break 16:00 - 17:00 Thursday, March 12, 2015 Coffee Break 10:00 - 11:00 Lunch Break 12:30 - 14:00, Keynote lecture from 13:20 - 13:50 Coffee Break 15:30 - 16:00