12.3 Reconfigurable Systems for Machine Learning


Date: Thursday 12 March 2020
Time: 16:00 - 17:30
Location / Room: Autrans

Chair:
Bogdan Pasca, Intel, FR

Co-Chair:
Smail Niar, Université Polytechnique Hauts-de-France, FR

Machine learning continues to attract significant research attention, and reconfigurable systems offer ample flexibility for exploring new approaches to accelerating these workloads. This session explores how FPGAs can serve a variety of machine learning workloads: memory optimisations for 3D convolutional neural networks (CNNs), the design and implementation of binarised neural networks, and a cascade of hybrid-precision datapaths that improves CNN classification latency.

Time | Label | Presentation Title / Authors
16:00 | 12.3.1 | EXPLORATION OF MEMORY ACCESS OPTIMIZATION FOR FPGA-BASED 3D CNN ACCELERATOR
Speaker:
Teng Tian, University of Science and Technology of China, CN
Authors:
Teng Tian, Xi Jin, Letian Zhao, Xiaotian Wang, Jie Wang and Wei Wu, University of Science and Technology of China, CN
Abstract
Three-dimensional convolutional neural networks (3D CNNs) are used effectively in various video recognition applications. Compared to traditional 2D CNNs, the extra temporal dimension makes 3D CNNs more computationally intensive and gives them a larger memory footprint, so memory optimization is crucial. This paper presents a design space exploration of memory access optimization for an FPGA-based 3D CNN accelerator. We present a non-overlapping data tiling method for contiguous off-chip memory access and explore on-chip data reuse opportunities by leveraging different loop ordering strategies. We propose a hardware architecture that can flexibly support a different loop ordering strategy for each 3D CNN layer. With the help of hardware/software co-design, we can provide the optimal configuration for an energy-efficient and high-performance accelerator design. In experiments on AlexNet, VGG16, and C3D, our optimal model reduces DRAM accesses by up to 84% and energy consumption by up to 55% on C3D compared to a baseline model, and demonstrates state-of-the-art performance compared to prior FPGA implementations.
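
The trade-off the paper explores, which loop order minimises off-chip traffic for a given layer, can be illustrated with a toy cost model. The Python sketch below is a minimal illustration assuming a simplified two-strategy comparison (weight-stationary vs. input-stationary) with hypothetical tile counts; it is not the paper's actual analytical model or architecture.

# Hypothetical traffic model: all names, tile counts, and cost formulas
# below are illustrative assumptions, not the paper's actual model.
def dram_traffic(ifmap, weights, ofmap, n_oc_tiles, n_sp_tiles, order):
    """Approximate number of elements moved to/from DRAM for one 3D-CNN layer.

    ifmap, weights, ofmap  -- element counts of the three operands
    n_oc_tiles, n_sp_tiles -- output-channel / spatial tile counts of the layer
    order                  -- which operand stays resident on-chip
    """
    if order == "weight_stationary":
        # Weights fetched once; the input is re-fetched per output-channel tile.
        return weights + ifmap * n_oc_tiles + ofmap
    # "input_stationary": input fetched once; weights re-fetched per spatial tile.
    return ifmap + weights * n_sp_tiles + ofmap

# Toy C3D-style layer: 16x112x112 frames, 64 -> 128 channels, 3x3x3 kernels.
ifmap, weights, ofmap = 16 * 112 * 112 * 64, 128 * 64 * 27, 16 * 112 * 112 * 128
for order in ("weight_stationary", "input_stationary"):
    print(order, dram_traffic(ifmap, weights, ofmap, 4, 49, order))

Evaluating such a model per layer is one way a hardware/software co-design flow could pick the cheaper loop order for each layer, which is why per-layer flexibility in the architecture matters.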

16:30 | 12.3.2 | A THROUGHPUT-LATENCY CO-OPTIMISED CASCADE OF CONVOLUTIONAL NEURAL NETWORK CLASSIFIERS
Speaker:
Alexandros Kouris, Imperial College London, GB
Authors:
Alexandros Kouris1, Stylianos Venieris2 and Christos Bouganis1
1Imperial College London, GB; 2Samsung AI, GB
Abstract
Convolutional Neural Networks constitute a prominent AI model for classification tasks, serving a broad span of diverse application domains. To enable their efficient deployment in real-world tasks, the inherent redundancy of CNNs is frequently exploited to eliminate unnecessary computational costs. Driven by the fact that not all inputs require the same amount of computation to yield a confident prediction, multi-precision cascade classifiers have recently been introduced. FPGAs comprise a promising platform for deploying such input-dependent computation models due to their enhanced customisation capabilities. Current literature, however, is limited to throughput-optimised cascade implementations, which employ large batching at the expense of a substantial increase in latency that prohibits their deployment in real-time scenarios. In this work, we introduce a novel methodology for throughput-latency co-optimised cascaded CNN classification, deployed on a custom FPGA architecture tailored to the target application and deployment platform with respect to a set of user-specified requirements on accuracy and performance. Our experiments indicate that the proposed approach achieves throughput gains comparable to related state-of-the-art works with substantially reduced latency overhead, enabling its deployment in latency-sensitive applications.
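
To make the cascade idea concrete, here is a minimal Python sketch of confidence-driven early exit: a cheap low-precision classifier handles inputs it is confident about, and only the remainder is forwarded to a full-precision model. The function names, dummy models, and max-probability confidence test are illustrative assumptions, not the paper's architecture.

# Hypothetical cascade: names, models, and the confidence test are illustrative.
def cascade_predict(x, fast_model, precise_model, threshold=0.9):
    """Return a class index, exiting early when the cheap model is confident."""
    probs = fast_model(x)                 # low-precision, low-latency pass
    if max(probs) >= threshold:           # confident enough: early exit
        return probs.index(max(probs))
    probs = precise_model(x)              # fall back to the full-precision CNN
    return probs.index(max(probs))

# Toy stand-ins for two precision variants of the same classifier.
fast = lambda x: [0.05, 0.92, 0.03]       # confident: precise model is skipped
precise = lambda x: [0.10, 0.85, 0.05]
print(cascade_predict("some input", fast, precise))   # -> 1

Since most inputs take only the cheap path, average latency drops without the large batches that throughput-only cascades rely on.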

17:00 | 12.3.3 | ORTHRUSPE: RUNTIME RECONFIGURABLE PROCESSING ELEMENTS FOR BINARY NEURAL NETWORKS
Speaker:
Nael Fasfous, TU Munich, DE
Authors:
Nael Fasfous1, Manoj-Rohit Vemparala2, Alexander Frickenstein2 and Walter Stechele1
1TU Munich, DE; 2BMW Group, DE
Abstract
Recent advancements in Binary Neural Networks (BNNs) have yielded promising results, bringing them a step closer to their full-precision counterparts in terms of prediction accuracy. These advancements were brought about by additional arithmetic and binary operations, in the form of scale and shift operations (fixed-point) and convolutions with multiple weight and activation bases (binary). In this paper, we propose OrthrusPE, a runtime reconfigurable processing element (PE) which is capable of executing all the operations required by modern BNNs while improving resource utilization and power efficiency. More precisely, we exploit DSP48 blocks on off-the-shelf FPGAs to compute binary Hadamard products (for binary convolutions) and fixed-point arithmetic (for scaling, shifting, batch norm, and non-binary layers), thereby utilizing the same hardware resource for two distinct, critical modes of operation. Our experiments show that, compared to an OrthrusPE implementation, common PE implementations increase dynamic power consumption by 67% while requiring 39% more lookup tables.
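
The binary mode of such a PE maps naturally to a simple functional model: with weights and activations in {-1, +1} encoded as single bits, the Hadamard product reduces to an XNOR and the accumulation to a popcount. The Python sketch below, under those assumptions, models the functionality only; how the paper packs these operations into DSP48 slices is not shown, and all names are illustrative.

# Hypothetical functional model of a binary dot product (XNOR + popcount).
def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1,+1} vectors packed one bit per lane
    (bit 1 encodes +1, bit 0 encodes -1)."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # 1 wherever the signs agree
    matches = bin(xnor).count("1")              # popcount of agreements
    return 2 * matches - n                      # map back to +/-1 arithmetic

# Check against the naive +/-1 dot product for a 4-element example.
a, w = 0b1010, 0b1110        # a = (+1,-1,+1,-1), w = (+1,+1,+1,-1), MSB first
print(binary_dot(a, w, 4))   # -> 2 = (+1)(+1) + (-1)(+1) + (+1)(+1) + (-1)(-1)

The fixed-point mode (scale, shift, batch norm) is ordinary multiply-accumulate arithmetic, which is the DSP48's native operation; time-sharing the same block between the two modes is what saves LUTs and power.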

17:30 | End of session