9.3 Advances in Reconfigurable Computing


Date: Thursday 22 March 2018
Time: 08:30 - 10:00
Location / Room: Konf. 1

Chair:
Jürgen Teich, Friedrich-Alexander Universität, DE

Co-Chair:
Florent de Dinechin, INSA-Lyon, FR

This session presents four papers advancing the current state of the art in Coarse-Grained Reconfigurable Architectures (CGRAs), plus two interactive presentations dealing with posit arithmetic and convolutional neural networks.

Time  Label  Presentation Title / Authors
08:30  9.3.1  LASER: A HARDWARE/SOFTWARE APPROACH TO ACCELERATE COMPLICATED LOOPS ON CGRAS
Speaker:
Shail Dave, Arizona State University, US
Authors:
Mahesh Balasubramanian1, Shail Dave1, Aviral Shrivastava1 and Reiley Jeyapaul2
1Arizona State University, US; 2ARM Research, GB
Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) are popular accelerators predominantly used in streaming, filtering, and decoding applications. Thanks to their high performance and power efficiency, CGRAs are also a promising way to accelerate the loops of general-purpose applications. However, loops in general-purpose applications are often complicated: they may be perfectly or imperfectly nested and may contain nested if-then-else statements (conditionals). We argue that existing hardware/software solutions for executing branches and conditionals are inefficient. To execute such complicated loops efficiently on CGRAs, we present a hardware/software hybrid solution: LASER, a comprehensive technique to accelerate the compute-intensive loops of applications. In LASER, the compiler transforms complex loops, maps them to the CGRA, and lays them out in memory in such a way that the hardware can fetch and execute the instructions from the correct path at run time. LASER achieves a geomean performance improvement of 40.91% and utilization of 43.43% with 46% lower energy consumption.
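To make concrete what the abstract means by a "complicated" loop, the following is a small illustrative example (not taken from the paper): an imperfect loop nest, where the outer loop body does work outside the inner loop, containing a nested if-then-else. This is the kind of structure that is hard to software-pipeline onto a CGRA.

```python
def smooth(img, rows, cols, thresh):
    """Hypothetical kernel with an imperfect nest and a nested conditional."""
    for i in range(1, rows - 1):          # outer loop body has its own work...
        row_max = 0
        for j in range(1, cols - 1):      # ...so the nest is imperfect
            v = img[i][j]
            if v > thresh:                # nested if-then-else (conditional)
                v = (img[i - 1][j] + img[i + 1][j]) // 2
            else:
                v = (v + img[i][j - 1]) // 2
            img[i][j] = v
            row_max = max(row_max, v)
        img[i][0] = row_max               # imperfect-nest epilogue
    return img
```

On a CGRA, both sides of the conditional would normally occupy mapping slots (full predication) or require branch support in hardware; LASER's point is that neither pure approach is efficient.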

Download Paper (PDF; Only available from the DATE venue WiFi)
09:00  9.3.2  A TIME-MULTIPLEXED FPGA OVERLAY WITH LINEAR INTERCONNECT
Speaker:
Xiangwei Li, Nanyang Technological University, SG
Authors:
Xiangwei Li1, Abhishek Kumar Jain2, Douglas L. Maskell1 and Suhaib A. Fahmy3
1Nanyang Technological University, SG; 2Lawrence Livermore National Laboratory, US; 3University of Warwick, GB
Abstract
Coarse-grained overlays improve FPGA design productivity by providing fast compilation and software-like programmability. Soft-processor-based overlays with well-defined ISAs are attractive to application developers because of their ease of use, but they incur significant FPGA resource overheads. Time-multiplexed (TM) CGRA-like overlays are an interesting alternative, as they can change their behavior on a cycle-by-cycle basis while a compute kernel executes. This reduces the FPGA resources needed, at the cost of a higher initiation interval (II) and hence reduced throughput. The fully flexible routing network of current CGRA-like overlays results in high FPGA resource usage; however, many application kernels are acyclic and can be implemented with a much simpler linear feed-forward routing network. This paper examines a DSP-block-based TM overlay with linear interconnect, where the overlay architecture takes account of the application kernels' characteristics and the underlying FPGA architecture so as to minimize the II and the FPGA resource usage. We examine a number of architectural extensions to the DSP-block-based functional unit that improve the II, throughput, and latency. The results show an average 70% reduction in II, with corresponding improvements in throughput and latency.

09:30  9.3.3  URECA: A COMPILER SOLUTION TO MANAGE UNIFIED REGISTER FILE FOR CGRAS
Speaker:
Shail Dave, Arizona State University, US
Authors:
Shail Dave, Mahesh Balasubramanian and Aviral Shrivastava, Arizona State University, US
Abstract
A coarse-grained reconfigurable array (CGRA) is a promising solution for accelerating loops with loop-carried dependencies or low trip counts. One challenge in compiling for CGRAs is to efficiently manage both recurring (repeatedly written and read) and nonrecurring (read-only) variables of loops. Prior works manage recurring variables in a rotating register file (RF) but access nonrecurring variables through the on-chip memory, which increases memory accesses and degrades performance. Alternatively, the two kinds of variables can be managed in separate rotating and nonrotating RFs, but this increases code size and makes effective utilization of the registers challenging. Instead, this paper proposes URECA, a compiler solution that manages both kinds of variables in a single nonrotating RF. While mapping loop operations onto the CGRA, the compiler allocates the necessary registers and splits the RF into rotating and nonrotating parts. It also pre-loads read-only values into the unified RF, where they are directly accessed at run time. Evaluation on compute-intensive benchmarks from MiBench shows that URECA provides a geomean speedup of 11.41x over sequential loop execution and improves loop acceleration through CGRAs by 1.74x at 32% reduced energy consumption over the state of the art.
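The split the abstract describes can be pictured with a toy software model (all names and sizes here are illustrative, not the paper's design): one physical register file whose lower indices rotate each iteration, so a value produced in one iteration is visible at an older logical index in the next, while the upper indices hold pre-loaded read-only values.

```python
class UnifiedRF:
    """Toy model of a unified register file split by the compiler into a
    rotating part (recurring, per-iteration values) and a nonrotating part
    (read-only values pre-loaded before the loop). Illustrative only."""

    def __init__(self, size, rotating):
        self.regs = [0] * size
        self.rotating = rotating      # registers 0..rotating-1 rotate
        self.base = 0                 # rotation offset, bumped per iteration

    def _index(self, r):
        if r < self.rotating:         # rotating part: logical -> physical
            return (r + self.base) % self.rotating
        return r                      # nonrotating part: direct mapping

    def read(self, r):
        return self.regs[self._index(r)]

    def write(self, r, v):
        self.regs[self._index(r)] = v

    def next_iteration(self):
        self.base = (self.base + 1) % self.rotating
```

After `next_iteration()`, the value written to logical register 0 in the previous iteration is readable at logical register `rotating - 1`, which is how rotating RFs keep per-iteration copies alive without extra copy instructions; the nonrotating part is untouched by rotation.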

09:45  9.3.4  OPTIMIZING THE DATA PLACEMENT AND TRANSFORMATION FOR MULTI-BANK CGRA COMPUTING SYSTEM
Speaker:
Zhongyuan Zhao, Shanghai Jiao Tong University, CN
Authors:
Zhongyuan Zhao1, Yantao Liu1, Weiguang Sheng1, Tushar Krishna2, Qin Wang1 and Zhigang Mao1
1Shanghai Jiao Tong University, CN; 2Georgia Institute of Technology, US
Abstract
This paper provides a data placement optimization approach for Coarse-Grained Reconfigurable Architecture (CGRA) based computing platforms that simultaneously optimizes the performance of CGRA execution and of data transformation between main memory and the multi-bank memory. To achieve this goal, we have developed a performance model that evaluates the efficiency of data transformation and CGRA execution, and we use it to compare the performance of different data placement strategies. We search for the optimal data placement by first choosing, among the candidates that achieve the best data transformation efficiency, the one that yields the best CGRA execution efficiency. We then select the final data placement strategy by comparing the performance of this candidate against the one generated by an existing multi-bank optimization algorithm. Evaluation shows that, when both data transformation and CGRA execution efficiency are considered, our approach improves performance to 2.76x that of the state-of-the-art method.

10:00  IP4-7, 120  UNIVERSAL NUMBER POSIT ARITHMETIC GENERATOR ON FPGA
Speaker:
Hayden K.-H. So, The University of Hong Kong, HK
Authors:
Manish Kumar Jaiswal and Hayden So, The University of Hong Kong, HK
Abstract
The posit number format includes a run-time-varying exponent component, defined by a combination of a regime field (whose length varies at run time) and an exponent field (of up to ES bits, the exponent size). This also leads to run-time variation in the size and position of the mantissa field, which poses a hardware design challenge. Being a recent development, posit arithmetic still lacks adequate hardware architectures. This paper is therefore aimed at the development of posit arithmetic algorithms and a generic hardware generator for them. It focuses on basic posit arithmetic (floating-point to posit conversion, posit to floating-point conversion, addition/subtraction, and multiplication), which is also demonstrated on an FPGA platform. The goal is to develop an open-source solution for generating basic posit arithmetic architectures with parameterized choices. This contribution will enable further exploration and evaluation of the posit system.
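The run-time-varying field split the abstract refers to can be illustrated with a short software decoder (a simplified sketch of standard posit decoding, not the paper's hardware): the regime is a run of identical bits of data-dependent length, and whatever bits remain after it and the exponent form the fraction.

```python
def decode_posit(bits, n, es):
    """Decode an n-bit posit (given as an unsigned int) to a float,
    illustrating the run-time-varying regime/exponent/fraction split."""
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):           # NaR (not-a-real) pattern 100...0
        return float('nan')
    mask = (1 << n) - 1
    sign = (bits >> (n - 1)) & 1
    if sign:
        bits = (-bits) & mask          # two's-complement negation
    # Regime: run of identical bits after the sign, ended by the opposite bit
    first = (bits >> (n - 2)) & 1
    run, i = 0, n - 2
    while i >= 0 and ((bits >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first else -run     # regime value
    i -= 1                             # skip the terminating bit
    # Exponent: up to es bits; may be truncated in short encodings
    e, e_bits = 0, max(0, min(es, i + 1))
    if e_bits:
        e = ((bits >> (i + 1 - e_bits)) & ((1 << e_bits) - 1)) << (es - e_bits)
        i -= e_bits
    # Fraction: whatever bits remain, with a hidden leading 1
    f = 1.0 + (bits & ((1 << (i + 1)) - 1)) / (1 << (i + 1)) if i >= 0 else 1.0
    value = 2.0 ** (k * (1 << es) + e) * f
    return -value if sign else value
```

Because the regime length is only known after scanning the bits, the exponent and fraction positions shift at run time, which is exactly why fixed-layout floating-point datapaths do not carry over directly to posit hardware.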

10:01  IP4-8, 347  BLOCK CONVOLUTION: TOWARDS MEMORY-EFFICIENT INFERENCE OF LARGE-SCALE CNNS ON FPGA
Speaker:
Gang Li, Institute of Automation, Chinese Academy of Sciences, CN
Authors:
Gang Li, Fanrong Li, Tianli Zhao and Jian Cheng, Institute of Automation, Chinese Academy of Sciences, CN
Abstract
FPGA-based CNN accelerators have been gaining popularity in recent years due to their high energy efficiency and great flexibility. However, as networks grow in depth and width, the volume of intermediate data becomes too large to store on chip, so data must frequently be transferred between on-chip and off-chip memory, incurring off-chip memory access latency and energy consumption. In this paper, we propose block convolution, a memory-efficient, simple yet effective block-based convolution scheme that completely avoids streaming intermediate data out to off-chip memory during network inference. Experiments on the very large VGG-16 network show that the proposed approach achieves an improved top-1/top-5 accuracy of 72.60%/91.10% on the ImageNet classification task. As a case study, we implement the VGG-16 network with block convolution on a Xilinx Zynq ZC706 board, achieving a frame rate of 12.19 fps at a 150 MHz working frequency with all intermediate data staying on chip.
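A toy single-channel sketch of the idea (not the paper's implementation, and deliberately unoptimized): the feature map is partitioned into tiles, and each tile is convolved independently with zero padding at its own borders instead of reading halo pixels from neighbouring tiles. No cross-tile data is ever needed, so a tile's intermediates can stay on chip; the price is that outputs near tile borders differ slightly from standard convolution.

```python
def conv2d_same(x, k):
    """Plain 'same' zero-padded 2-D convolution on nested lists."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, z = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= z < w:   # zero padding
                        s += x[y][z] * k[a][b]
            out[i][j] = s
    return out

def block_conv2d(x, k, bh, bw):
    """Block convolution: each bh x bw tile is convolved independently,
    zero-padding at tile borders instead of fetching neighbouring-tile
    (halo) data, so intermediate data never leaves the tile."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for bi in range(0, h, bh):
        for bj in range(0, w, bw):
            tile = [row[bj:bj + bw] for row in x[bi:bi + bh]]
            t_out = conv2d_same(tile, k)
            for a, row in enumerate(t_out):
                out[bi + a][bj:bj + len(row)] = row
    return out
```

With a 1x1-equivalent (identity) kernel the two schemes agree everywhere; with larger kernels they diverge only near tile boundaries, which is the accuracy effect the paper addresses by training with block convolution in the loop.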

10:00  End of session
Coffee Break in Exhibition Area



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area (Terrace Level of the ICCD).

Lunch Breaks (Großer Saal + Saal 1)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the rooms "Großer Saal" and "Saal 1" (Saal Level of the ICCD) to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 20, 2018

  • Coffee Break 10:30 - 11:30
  • Lunch Break 13:00 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:50 - 14:20
  • Coffee Break 16:00 - 17:00

Wednesday, March 21, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:30 - 14:20
  • Coffee Break 16:00 - 17:00

Thursday, March 22, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:00
  • Keynote Lecture in "Saal 2" 13:20 - 13:50
  • Coffee Break 15:30 - 16:00