12.3 Reconfigurable Systems for Machine Learning


Date: Thursday 12 March 2020
Time: 16:00 - 17:30
Location / Room: Autrans

Chair:
Bogdan Pasca, Intel, FR

Co-Chair:
Smail Niar, Université Polytechnique Hauts-de-France, FR

Machine learning continues to attract significant research attention, and reconfigurable systems offer ample flexibility for exploring new approaches to accelerating these workloads. This session explores how FPGAs can serve a variety of machine learning workloads: memory optimisations for 3D convolutional neural networks (CNNs), the design and implementation of binarised neural networks, and a cascade of hybrid-precision datapaths that improves CNN classification latency.

Time | Label | Presentation Title / Authors
16:00 | 12.3.1 | EXPLORATION OF MEMORY ACCESS OPTIMIZATION FOR FPGA-BASED 3D CNN ACCELERATOR
Speaker:
Teng Tian, University of Science and Technology of China, CN
Authors:
Teng Tian, Xi Jin, Letian Zhao, Xiaotian Wang, Jie Wang and Wei Wu, University of Science and Technology of China, CN
Abstract
Three-dimensional convolutional neural networks (3D CNNs) are used effectively in various video recognition applications. Compared to traditional 2D CNNs, the extra temporal dimension makes 3D CNNs more computationally intensive and gives them a larger memory footprint, so memory optimization is crucial. This paper presents a design space exploration of memory access optimization for an FPGA-based 3D CNN accelerator. We present a non-overlapping data tiling method for contiguous off-chip memory access and explore on-chip data reuse opportunities by leveraging different loop ordering strategies. We propose a hardware architecture that can flexibly support a different loop ordering strategy for each 3D CNN layer. With the help of hardware/software co-design, we can provide the optimal configuration for an energy-efficient and high-performance accelerator design. In experiments on AlexNet, VGG16, and C3D, our optimal model reduces DRAM accesses by up to 84% and energy consumption by up to 55% on C3D compared to a baseline model, and demonstrates state-of-the-art performance compared to prior FPGA implementations.
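
The trade-off the paper explores, which loop order minimises off-chip traffic for a given layer, can be illustrated with a toy cost model. The Python sketch below is a minimal illustration assuming a simplified two-strategy comparison (weight-stationary vs. input-stationary) with hypothetical tile counts; it is not the paper's actual analytical model or architecture.

# Hypothetical traffic model: all names, tile counts, and cost formulas
# below are illustrative assumptions, not the paper's actual model.
def dram_traffic(ifmap, weights, ofmap, n_oc_tiles, n_sp_tiles, order):
    """Approximate number of elements moved to/from DRAM for one 3D-CNN layer.

    ifmap, weights, ofmap  -- element counts of the three operands
    n_oc_tiles, n_sp_tiles -- output-channel / spatial tile counts of the layer
    order                  -- which operand stays resident on-chip
    """
    if order == "weight_stationary":
        # Weights fetched once; the input is re-fetched per output-channel tile.
        return weights + ifmap * n_oc_tiles + ofmap
    # "input_stationary": input fetched once; weights re-fetched per spatial tile.
    return ifmap + weights * n_sp_tiles + ofmap

# Toy C3D-style layer: 16x112x112 frames, 64 -> 128 channels, 3x3x3 kernels.
ifmap, weights, ofmap = 16 * 112 * 112 * 64, 128 * 64 * 27, 16 * 112 * 112 * 128
for order in ("weight_stationary", "input_stationary"):
    print(order, dram_traffic(ifmap, weights, ofmap, 4, 49, order))

Evaluating such a model per layer is one way a hardware/software co-design flow could pick the cheaper loop order for each layer, which is why per-layer flexibility in the architecture matters.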

16:30 | 12.3.2 | A THROUGHPUT-LATENCY CO-OPTIMISED CASCADE OF CONVOLUTIONAL NEURAL NETWORK CLASSIFIERS
Speaker:
Alexandros Kouris, Imperial College London, GB
Authors:
Alexandros Kouris1, Stylianos Venieris2 and Christos Bouganis1
1Imperial College London, GB; 2Samsung AI, GB
Abstract
Convolutional Neural Networks constitute a prominent AI model for classification tasks, serving a broad span of diverse application domains. To enable their efficient deployment in real-world tasks, the inherent redundancy of CNNs is frequently exploited to eliminate unnecessary computational costs. Driven by the fact that not all inputs require the same amount of computation to yield a confident prediction, multi-precision cascade classifiers have recently been introduced. FPGAs comprise a promising platform for deploying such input-dependent computation models due to their enhanced customisation capabilities. Current literature, however, is limited to throughput-optimised cascade implementations, which employ large batching at the expense of a substantial increase in latency that prohibits their deployment in real-time scenarios. In this work, we introduce a novel methodology for throughput-latency co-optimised cascaded CNN classification, deployed on a custom FPGA architecture tailored to the target application and deployment platform with respect to a set of user-specified requirements on accuracy and performance. Our experiments indicate that the proposed approach achieves throughput gains comparable to related state-of-the-art works with substantially reduced latency overhead, enabling its deployment in latency-sensitive applications.
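
To make the cascade idea concrete, here is a minimal Python sketch of confidence-driven early exit: a cheap low-precision classifier handles inputs it is confident about, and only the remainder is forwarded to a full-precision model. The function names, dummy models, and max-probability confidence test are illustrative assumptions, not the paper's architecture.

# Hypothetical cascade: names, models, and the confidence test are illustrative.
def cascade_predict(x, fast_model, precise_model, threshold=0.9):
    """Return a class index, exiting early when the cheap model is confident."""
    probs = fast_model(x)                 # low-precision, low-latency pass
    if max(probs) >= threshold:           # confident enough: early exit
        return probs.index(max(probs))
    probs = precise_model(x)              # fall back to the full-precision CNN
    return probs.index(max(probs))

# Toy stand-ins for two precision variants of the same classifier.
fast = lambda x: [0.05, 0.92, 0.03]       # confident: precise model is skipped
precise = lambda x: [0.10, 0.85, 0.05]
print(cascade_predict("some input", fast, precise))   # -> 1

Since most inputs take only the cheap path, average latency drops without the large batches that throughput-only cascades rely on.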

17:00 | 12.3.3 | ORTHRUSPE: RUNTIME RECONFIGURABLE PROCESSING ELEMENTS FOR BINARY NEURAL NETWORKS
Speaker:
Nael Fasfous, TU Munich, DE
Authors:
Nael Fasfous1, Manoj-Rohit Vemparala2, Alexander Frickenstein2 and Walter Stechele1
1TU Munich, DE; 2BMW Group, DE
Abstract
Recent advancements in Binary Neural Networks (BNNs) have yielded promising results, bringing them a step closer to their full-precision counterparts in terms of prediction accuracy. These advancements were brought about by additional arithmetic and binary operations, in the form of scale and shift operations (fixed-point) and convolutions with multiple weight and activation bases (binary). In this paper, we propose OrthrusPE, a runtime reconfigurable processing element (PE) which is capable of executing all the operations required by modern BNNs while improving resource utilization and power efficiency. More precisely, we exploit DSP48 blocks on off-the-shelf FPGAs to compute binary Hadamard products (for binary convolutions) and fixed-point arithmetic (for scaling, shifting, batch norm, and non-binary layers), thereby utilizing the same hardware resource for two distinct, critical modes of operation. Our experiments show that, compared to an OrthrusPE implementation, common PE implementations increase dynamic power consumption by 67% while requiring 39% more lookup tables.
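
The binary mode of such a PE maps naturally to a simple functional model: with weights and activations in {-1, +1} encoded as single bits, the Hadamard product reduces to an XNOR and the accumulation to a popcount. The Python sketch below, under those assumptions, models the functionality only; how the paper packs these operations into DSP48 slices is not shown, and all names are illustrative.

# Hypothetical functional model of a binary dot product (XNOR + popcount).
def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1,+1} vectors packed one bit per lane
    (bit 1 encodes +1, bit 0 encodes -1)."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # 1 wherever the signs agree
    matches = bin(xnor).count("1")              # popcount of agreements
    return 2 * matches - n                      # map back to +/-1 arithmetic

# Check against the naive +/-1 dot product for a 4-element example.
a, w = 0b1010, 0b1110        # a = (+1,-1,+1,-1), w = (+1,+1,+1,-1), MSB first
print(binary_dot(a, w, 4))   # -> 2 = (+1)(+1) + (-1)(+1) + (+1)(+1) + (-1)(-1)

The fixed-point mode (scale, shift, batch norm) is ordinary multiply-accumulate arithmetic, which is the DSP48's native operation; time-sharing the same block between the two modes is what saves LUTs and power.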

17:30 | End of session