6.4 High-performance Reconfigurable Computing


Date: Wednesday 29 March 2017
Time: 11:00 - 12:30
Location / Room: 3A

Chair:
Philip Brisk, University of California, Riverside, US

Co-Chair:
Mirjana Stojilovic, EPFL, CH

Reconfigurable architectures are seeing increased usage in high-performance and scientific applications. This session addresses challenges in this space, including optimizing arithmetic datapaths, developing an in-memory architecture for database processing, and a case study on building a high-throughput Smith-Waterman accelerator.

Time  Label  Presentation Title / Authors
11:00  6.4.1  (Best Paper Award Candidate)
AUTOMATING THE PIPELINE OF ARITHMETIC DATAPATHS
Speaker:
Florent de Dinechin, INSA-Lyon, FR
Authors:
Matei Istoan1 and Florent de Dinechin2
1INRIA, FR; 2INSA-Lyon, FR
Abstract
This article presents the new framework for semi-automatic circuit pipelining that will be used in future releases of the FloPoCo generator. From a single description of an operator or datapath, optimized implementations are obtained automatically for a wide range of FPGA targets and a wide range of frequency/latency trade-offs. Compared to previous versions of FloPoCo, the level of abstraction has been raised, enabling easier development, shorter generator code, and better pipeline optimization. The proposed approach is also more flexible than fully automatic pipelining approaches based on retiming: In the proposed technique, the incremental construction of the pipeline along with the circuit graph enables architectural design decisions that depend on the pipeline.
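The core idea of constructing the pipeline incrementally along with the circuit graph can be illustrated with a small sketch. This is not FloPoCo code; it is a hypothetical, simplified model of frequency-directed pipelining of a linear datapath, where operator delays and the target frequency are assumed inputs:

```python
# Illustrative sketch (not FloPoCo's actual algorithm): walk the operators of a
# linear datapath in evaluation order, track the combinational delay accumulated
# since the last register, and insert a pipeline register whenever adding the
# next operator would exceed the cycle budget implied by the target frequency.

def pipeline_datapath(op_delays_ns, target_freq_mhz):
    """op_delays_ns: combinational delay of each operator, in evaluation order.
    Returns (register_positions, latency_in_cycles)."""
    cycle_budget = 1000.0 / target_freq_mhz  # cycle period in ns
    registers = []      # operator indices after which a register is placed
    accumulated = 0.0   # combinational delay since the last register
    for i, delay in enumerate(op_delays_ns):
        if accumulated + delay > cycle_budget and accumulated > 0.0:
            registers.append(i - 1)  # register before this operator
            accumulated = 0.0
        accumulated += delay
    return registers, len(registers)

# A 4-operator datapath pipelined for a 400 MHz target (2.5 ns cycle budget):
regs, latency = pipeline_datapath([1.2, 1.0, 0.8, 1.5], target_freq_mhz=400)
```

Because register placement is decided while the datapath is being built, a generator can make architectural decisions (such as choosing a different adder structure) that depend on the emerging pipeline, which is the flexibility the abstract contrasts with retiming-based approaches.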

Download Paper (PDF; Only available from the DATE venue WiFi)
11:30  6.4.2  OPERAND SIZE RECONFIGURATION FOR BIG DATA PROCESSING IN MEMORY
Speaker:
Luigi Carro, UFRGS, BR
Authors:
Paulo Cesar Santos1, Geraldo Francisco de Oliveira Junior2, Diego Gomes Tomé3, Marco Antonio Zanata Alves3, Eduardo Cunha de Almeida3 and Luigi Carro4
1UFRGS - Universidade Federal do Rio Grande do Sul, BR; 2Universidade Federal do Rio Grande do Sul, BR; 3UFPR, BR; 4UFRGS, BR
Abstract
Nowadays, applications that predominantly perform lookups over large databases are becoming more popular, with column-stores as the database system architecture of choice. For these applications, Hybrid Memory Cubes (HMCs) can provide bandwidth of up to 320 GB/s and represent the best choice for sustaining throughput as these databases keep growing. However, even with this high available memory bandwidth and processing power, data movement through the memory hierarchy consumes an unnecessary amount of time and energy and prevents peak performance from being reached. In order to accelerate database operations and reduce the energy consumption of the system, this paper presents the Reconfigurable Vector Unit (RVU), which enables massive and adaptive in-memory processing, extending the native HMC instructions and increasing their effectiveness. RVU lets the programmer reconfigure it to perform as one large vector unit or as multiple small vector units, to better match the application's needs during different computation phases. Due to its adaptability, RVU achieves a performance increase of 27x on average and reduces DRAM energy consumption by 29% when compared to an x86 processor with 16 cores. Compared with the state-of-the-art mechanism capable of performing fixed-size large vector operations inside the HMC, RVU performs up to 12% better and improves energy consumption by 53%.
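The reconfiguration idea at the heart of RVU can be pictured with a small software model. This is a conceptual sketch, not the RVU hardware or its instruction set: the lane grouping and the reduction operator are assumptions chosen for illustration.

```python
# Conceptual sketch (not the RVU hardware): a vector unit whose lanes can be
# regrouped as one wide unit or as several narrow, independent units,
# mirroring the paper's idea of adapting operand size per computation phase.

def run_vector_op(data, op, lane_width):
    """Apply a reduction `op` per group of `lane_width` elements, emulating
    several small vector units; lane_width == len(data) emulates one large unit."""
    return [op(data[i:i + lane_width]) for i in range(0, len(data), lane_width)]

values = [1, 2, 3, 4, 5, 6, 7, 8]
wide   = run_vector_op(values, sum, lane_width=8)  # one 8-wide unit
narrow = run_vector_op(values, sum, lane_width=2)  # four 2-wide units
```

A wide configuration suits a phase that scans one long column, while several narrow units suit phases operating on many short, independent fields, which is the adaptability the abstract claims pays off across database query phases.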

12:00  6.4.3  ARCHITECTURAL OPTIMIZATIONS FOR HIGH PERFORMANCE AND ENERGY EFFICIENT SMITH-WATERMAN IMPLEMENTATION ON FPGAS USING OPENCL
Speaker:
Lorenzo Di Tucci, Politecnico di Milano, IT
Authors:
Lorenzo Di Tucci1, Kenneth O'Brien2, Michaela Blott2 and Marco D. Santambrogio1
1Politecnico di Milano, IT; 2Xilinx Inc, IE
Abstract
Smith-Waterman is a dynamic programming algorithm that plays a key role in the modern genomics pipeline, as it is guaranteed to find the optimal local alignment between two strings of data. The state of the art presents many hardware acceleration solutions that exploit the high degree of parallelism available in this algorithm. The majority of these implementations use heuristics to increase the performance of the system at the expense of the accuracy of the result. In this work, we present an implementation of the pure version of the algorithm. We include the key architectural optimizations to achieve the highest possible performance for a given platform, and leverage the Berkeley roofline model to track the performance and guide the optimizations. To achieve scalability, our custom design comprises systolic arrays, data compression features and shift registers, while a custom port mapping strategy aims to maximize performance. Our designs are built using an OpenCL-based design entry, namely Xilinx SDAccel, in conjunction with Xilinx Virtex 7 and Kintex UltraScale platforms. Our final design achieves a performance of 42.47 GCUPS (giga cell updates per second) with an energy efficiency of 1.6988 GCUPS/W. This represents an improvement of 1.72x in performance and energy efficiency over previously published FPGA implementations, and 8.49x better energy efficiency than comparable GPU implementations.
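The "pure" (non-heuristic) recurrence the accelerator implements is the classic Smith-Waterman dynamic program. The reference sketch below shows it in software; the scoring values (match=2, mismatch=-1, gap=-1) are illustrative assumptions, not taken from the paper:

```python
# Reference sketch of the pure Smith-Waterman recurrence (linear gap penalty).
# H[i][j] is the best local-alignment score ending at a[i-1], b[j-1];
# the max with 0 is what makes the alignment local.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the optimal local-alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

score = smith_waterman("GATTACA", "GCATGCU")
```

Each cell depends only on its left, upper and upper-left neighbors, so all cells on an anti-diagonal are independent; this is the parallelism that systolic-array implementations like the one in this paper exploit, with one processing element per column updating cells wavefront by wavefront.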

12:30  IP3-3, 348  DOUBLE MAC: DOUBLING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS ON MODERN FPGAS
Speaker:
Jongeun Lee, UNIST, KR
Authors:
Dong Nguyen1, Daewoo Kim1 and Jongeun Lee2
1UNIST, KR; 2Ulsan National Institute of Science and Technology (UNIST), KR
Abstract
This paper presents a novel method to double the computation rate of convolutional neural network (CNN) accelerators by packing two multiply-and-accumulate (MAC) operations into one DSP block of off-the-shelf FPGAs (called Double MAC). While a general SIMD MAC using a single DSP block seems impossible, our solution is tailored to the kind of MAC operations required by a convolution layer. Our preliminary evaluation shows that not only can our Double MAC approach double the computation throughput of a CNN layer with essentially the same resource utilization, but the network-level performance can also be improved by 14-84% over a highly optimized state-of-the-art accelerator solution, depending on the CNN hyper-parameters.
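The general principle behind packing two products into one multiplier can be shown with a simplified, unsigned sketch: two multiplications that share one operand can ride through a single wide multiply if their partial products cannot overlap. This is an assumption-laden illustration of the idea only; the paper's actual scheme works within DSP-block bit widths and handles signed operands and accumulation, which this sketch does not.

```python
# Simplified unsigned illustration of product packing: (a1 << k | a2) * w
# equals (a1*w) << k + (a2*w), so one wide multiply yields both products,
# provided a2*w fits below bit k. The shift amount here is a made-up example.

SHIFT = 16  # separation chosen so the low product a2*w stays below 2**SHIFT

def packed_double_multiply(a1, a2, w):
    """Compute a1*w and a2*w with a single multiplication (small unsigned inputs)."""
    assert a2 * w < (1 << SHIFT), "low product must not overflow into the high half"
    packed = (a1 << SHIFT) | a2         # pack both multiplicands into one word
    product = packed * w                # one wide multiply produces both products
    high = product >> SHIFT             # recovers a1 * w
    low = product & ((1 << SHIFT) - 1)  # recovers a2 * w
    return high, low

h, l = packed_double_multiply(200, 113, 77)
```

In a convolution layer, two input activations multiplied by the same weight fit this pattern naturally, which is why the technique applies there even though a general SIMD MAC in one DSP block does not.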

12:31  IP3-4, 138  BITMAN: A TOOL AND API FOR FPGA BITSTREAM MANIPULATIONS
Speaker:
Dirk Koch, University of Manchester, GB
Authors:
Khoa Pham, Edson Horta and Dirk Koch, University of Manchester, GB
Abstract
To fully support the partial reconfiguration capabilities of FPGAs, this paper introduces BitMan, a tool and API for generating and manipulating configuration bitstreams. BitMan supports recent Xilinx FPGAs handled by the ISE and Vivado tool suites of the FPGA vendor Xilinx, including the latest Virtex-6, 7 Series, UltraScale and UltraScale+ FPGAs. The functionality includes high-level commands, such as cutting out regions of a bitstream and placing or relocating modules on an FPGA, as well as low-level commands for modifying primitives and for routing clock networks or rerouting signal connections at run-time. All this is possible without the vendor CAD tools, allowing BitMan to be used even on embedded CPUs. The paper describes the capabilities, API and performance evaluation of BitMan.

12:30  End of session
Lunch Break in Garden Foyer

Keynote Lecture session 7.0 in "Garden Foyer", 13:50 - 14:20

Lunch Break in the Garden Foyer
On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Please note that this is restricted to conference delegates holding a lunch voucher. When entering the lunch break area, delegates will be asked to present the corresponding lunch voucher of the day. Once a delegate has left the lunch area, re-entry is not permitted for that lunch.