6.4 High-performance Reconfigurable Computing


Date: Wednesday 29 March 2017
Time: 11:00 - 12:30
Location / Room: 3A

Chair:
Philip Brisk, University of California, Riverside, US

Co-Chair:
Mirjana Stojilovic, EPFL, CH

Reconfigurable architectures are seeing increased usage in high-performance and scientific applications. This session addresses challenges in this space, including optimizing arithmetic datapaths, developing an in-memory architecture for database processing, and a case study on building a high-throughput Smith-Waterman accelerator.

Time  Label  Presentation Title / Authors
11:00  6.4.1  (Best Paper Award Candidate)
AUTOMATING THE PIPELINE OF ARITHMETIC DATAPATHS
Speaker:
Florent de Dinechin, INSA-Lyon, FR
Authors:
Matei Istoan1 and Florent de Dinechin2
1INRIA, FR; 2INSA-Lyon, FR
Abstract
This article presents the new framework for semi-automatic circuit pipelining that will be used in future releases of the FloPoCo generator. From a single description of an operator or datapath, optimized implementations are obtained automatically for a wide range of FPGA targets and a wide range of frequency/latency trade-offs. Compared to previous versions of FloPoCo, the level of abstraction has been raised, enabling easier development, shorter generator code, and better pipeline optimization. The proposed approach is also more flexible than fully automatic pipelining approaches based on retiming: In the proposed technique, the incremental construction of the pipeline along with the circuit graph enables architectural design decisions that depend on the pipeline.
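The core idea of constructing the pipeline incrementally along with the circuit graph can be illustrated with a small sketch. This is not FloPoCo code; it is a hypothetical, simplified model of frequency-directed pipelining of a linear datapath, where operator delays and the target frequency are assumed inputs:

```python
# Illustrative sketch (not FloPoCo's actual algorithm): walk the operators of a
# linear datapath in evaluation order, track the combinational delay accumulated
# since the last register, and insert a pipeline register whenever adding the
# next operator would exceed the cycle budget implied by the target frequency.

def pipeline_datapath(op_delays_ns, target_freq_mhz):
    """op_delays_ns: combinational delay of each operator, in evaluation order.
    Returns (register_positions, latency_in_cycles)."""
    cycle_budget = 1000.0 / target_freq_mhz  # cycle period in ns
    registers = []      # operator indices after which a register is placed
    accumulated = 0.0   # combinational delay since the last register
    for i, delay in enumerate(op_delays_ns):
        if accumulated + delay > cycle_budget and accumulated > 0.0:
            registers.append(i - 1)  # register before this operator
            accumulated = 0.0
        accumulated += delay
    return registers, len(registers)

# A 4-operator datapath pipelined for a 400 MHz target (2.5 ns cycle budget):
regs, latency = pipeline_datapath([1.2, 1.0, 0.8, 1.5], target_freq_mhz=400)
```

Because register placement is decided while the datapath is being built, a generator can make architectural decisions (such as choosing a different adder structure) that depend on the emerging pipeline, which is the flexibility the abstract contrasts with retiming-based approaches.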

Download Paper (PDF; Only available from the DATE venue WiFi)
11:30  6.4.2  OPERAND SIZE RECONFIGURATION FOR BIG DATA PROCESSING IN MEMORY
Speaker:
Luigi Carro, UFRGS, BR
Authors:
Paulo Cesar Santos1, Geraldo Francisco de Oliveira Junior2, Diego Gomes Tomé3, Marco Antonio Zanata Alves3, Eduardo Cunha de Almeida3 and Luigi Carro4
1UFRGS - Universidade Federal do Rio Grande do Sul, BR; 2Universidade Federal do Rio Grande do Sul, BR; 3UFPR, BR; 4UFRGS, BR
Abstract
Nowadays, applications that predominantly perform lookups over large databases are becoming more popular, with column-stores as the database system architecture of choice. For these applications, Hybrid Memory Cubes (HMCs) can provide bandwidth of up to 320 GB/s and represent the best choice for sustaining throughput as these databases keep growing. However, even with this high available memory bandwidth and processing power, data movement through the memory hierarchy consumes an unnecessary amount of time and energy and prevents peak performance from being reached. In order to accelerate database operations and reduce the energy consumption of the system, this paper presents the Reconfigurable Vector Unit (RVU), which enables massive and adaptive in-memory processing, extending the native HMC instructions and increasing their effectiveness. RVU lets the programmer reconfigure it to perform as one large vector unit or as multiple small vector units, to better match the application's needs during different computation phases. Due to its adaptability, RVU achieves a performance increase of 27x on average and reduces DRAM energy consumption by 29% when compared to an x86 processor with 16 cores. Compared with the state-of-the-art mechanism capable of performing fixed-size large vector operations inside the HMC, RVU performs up to 12% better and improves energy consumption by 53%.
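The reconfiguration idea at the heart of RVU can be pictured with a small software model. This is a conceptual sketch, not the RVU hardware or its instruction set: the lane grouping and the reduction operator are assumptions chosen for illustration.

```python
# Conceptual sketch (not the RVU hardware): a vector unit whose lanes can be
# regrouped as one wide unit or as several narrow, independent units,
# mirroring the paper's idea of adapting operand size per computation phase.

def run_vector_op(data, op, lane_width):
    """Apply a reduction `op` per group of `lane_width` elements, emulating
    several small vector units; lane_width == len(data) emulates one large unit."""
    return [op(data[i:i + lane_width]) for i in range(0, len(data), lane_width)]

values = [1, 2, 3, 4, 5, 6, 7, 8]
wide   = run_vector_op(values, sum, lane_width=8)  # one 8-wide unit
narrow = run_vector_op(values, sum, lane_width=2)  # four 2-wide units
```

A wide configuration suits a phase that scans one long column, while several narrow units suit phases operating on many short, independent fields, which is the adaptability the abstract claims pays off across database query phases.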

12:00  6.4.3  ARCHITECTURAL OPTIMIZATIONS FOR HIGH PERFORMANCE AND ENERGY EFFICIENT SMITH-WATERMAN IMPLEMENTATION ON FPGAS USING OPENCL
Speaker:
Lorenzo Di Tucci, Politecnico di Milano, IT
Authors:
Lorenzo Di Tucci1, Kenneth O'Brien2, Michaela Blott2 and Marco D. Santambrogio1
1Politecnico di Milano, IT; 2Xilinx Inc, IE
Abstract
Smith-Waterman is a dynamic programming algorithm that plays a key role in the modern genomics pipeline, as it is guaranteed to find the optimal local alignment between two strings of data. The state of the art presents many hardware acceleration solutions that exploit the high degree of parallelism available in this algorithm. The majority of these implementations use heuristics to increase the performance of the system at the expense of the accuracy of the result. In this work, we present an implementation of the pure version of the algorithm. We include the key architectural optimizations to achieve the highest possible performance for a given platform, and leverage the Berkeley roofline model to track the performance and guide the optimizations. To achieve scalability, our custom design comprises systolic arrays, data compression features and shift registers, while a custom port mapping strategy aims to maximize performance. Our designs are built using an OpenCL-based design entry, namely Xilinx SDAccel, in conjunction with Xilinx Virtex 7 and Kintex UltraScale platforms. Our final design achieves a performance of 42.47 GCUPS (giga cell updates per second) with an energy efficiency of 1.6988 GCUPS/W. This represents an improvement of 1.72x in performance and energy efficiency over previously published FPGA implementations, and 8.49x better energy efficiency than comparable GPU implementations.
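The "pure" (non-heuristic) recurrence the accelerator implements is the classic Smith-Waterman dynamic program. The reference sketch below shows it in software; the scoring values (match=2, mismatch=-1, gap=-1) are illustrative assumptions, not taken from the paper:

```python
# Reference sketch of the pure Smith-Waterman recurrence (linear gap penalty).
# H[i][j] is the best local-alignment score ending at a[i-1], b[j-1];
# the max with 0 is what makes the alignment local.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the optimal local-alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

score = smith_waterman("GATTACA", "GCATGCU")
```

Each cell depends only on its left, upper and upper-left neighbors, so all cells on an anti-diagonal are independent; this is the parallelism that systolic-array implementations like the one in this paper exploit, with one processing element per column updating cells wavefront by wavefront.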

12:30  IP3-3, 348  DOUBLE MAC: DOUBLING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS ON MODERN FPGAS
Speaker:
Jongeun Lee, UNIST, KR
Authors:
Dong Nguyen1, Daewoo Kim1 and Jongeun Lee2
1UNIST, KR; 2Ulsan National Institute of Science and Technology (UNIST), KR
Abstract
This paper presents a novel method to double the computation rate of convolutional neural network (CNN) accelerators by packing two multiply-and-accumulate (MAC) operations into one DSP block of off-the-shelf FPGAs (called Double MAC). While a general SIMD MAC using a single DSP block seems impossible, our solution is tailored to the kind of MAC operations required by a convolution layer. Our preliminary evaluation shows that not only can our Double MAC approach double the computation throughput of a CNN layer with essentially the same resource utilization, but the network-level performance can also be improved by 14-84% over a highly optimized state-of-the-art accelerator solution, depending on the CNN hyper-parameters.
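The general principle behind packing two products into one multiplier can be shown with a simplified, unsigned sketch: two multiplications that share one operand can ride through a single wide multiply if their partial products cannot overlap. This is an assumption-laden illustration of the idea only; the paper's actual scheme works within DSP-block bit widths and handles signed operands and accumulation, which this sketch does not.

```python
# Simplified unsigned illustration of product packing: (a1 << k | a2) * w
# equals (a1*w) << k + (a2*w), so one wide multiply yields both products,
# provided a2*w fits below bit k. The shift amount here is a made-up example.

SHIFT = 16  # separation chosen so the low product a2*w stays below 2**SHIFT

def packed_double_multiply(a1, a2, w):
    """Compute a1*w and a2*w with a single multiplication (small unsigned inputs)."""
    assert a2 * w < (1 << SHIFT), "low product must not overflow into the high half"
    packed = (a1 << SHIFT) | a2         # pack both multiplicands into one word
    product = packed * w                # one wide multiply produces both products
    high = product >> SHIFT             # recovers a1 * w
    low = product & ((1 << SHIFT) - 1)  # recovers a2 * w
    return high, low

h, l = packed_double_multiply(200, 113, 77)
```

In a convolution layer, two input activations multiplied by the same weight fit this pattern naturally, which is why the technique applies there even though a general SIMD MAC in one DSP block does not.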

12:31  IP3-4, 138  BITMAN: A TOOL AND API FOR FPGA BITSTREAM MANIPULATIONS
Speaker:
Dirk Koch, University of Manchester, GB
Authors:
Khoa Pham, Edson Horta and Dirk Koch, University of Manchester, GB
Abstract
To fully support the partial reconfiguration capabilities of FPGAs, this paper introduces BitMan, a tool and API for generating and manipulating configuration bitstreams. BitMan supports recent Xilinx FPGAs handled by the ISE and Vivado tool suites of the FPGA vendor Xilinx, including the latest Virtex-6, 7 Series, UltraScale and UltraScale+ FPGAs. The functionality includes high-level commands, such as cutting out regions of a bitstream and placing or relocating modules on an FPGA, as well as low-level commands for modifying primitives and for routing clock networks or rerouting signal connections at run-time. All this is possible without the vendor CAD tools, allowing BitMan to be used even on embedded CPUs. The paper describes the capabilities, API and performance evaluation of BitMan.

12:30  End of session
Lunch Break in Garden Foyer

Keynote Lecture session 7.0 in "Garden Foyer", 13:50 - 14:20

Lunch Break in the Garden Foyer
On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Please note that this is restricted to conference delegates holding a lunch voucher. When entering the lunch break area, delegates will be asked to present the corresponding lunch voucher of the day. Once a delegate has left the lunch area, re-entry is not permitted for that lunch.