5.8 Special Session: HLS for AI HW

Date: Wednesday 11 March 2020
Time: 08:30 - 10:00
Location / Room: Exhibition Theatre

Chair:
Massimo Cecchetti, Mentor, A Siemens Business, US

Co-Chair:
Astrid Ernst, Mentor, A Siemens Business, US

One of the fastest-growing areas of hardware and software design is artificial intelligence (AI)/machine learning (ML), fueled by the demand for more autonomous systems such as self-driving vehicles and voice recognition for personal assistants. Many of these algorithms rely on convolutional neural networks (CNNs) to implement deep learning systems. While the concept of convolution is relatively straightforward, the application of CNNs to the ML domain has yielded dozens of different neural network approaches. Although these networks can be executed in software on CPUs/GPUs, the power requirements of these solutions make them impractical for most inferencing applications, the majority of which involve portable, low-power devices. To improve power/performance, hardware teams are forming to create ML hardware acceleration blocks. However, taking any one of these compute-intensive networks into hardware, especially energy-efficient hardware, is time-consuming if the team employs a traditional RTL design flow. Consider all of these interdependent activities in a traditional flow:
• Expressing the algorithm correctly in RTL.
• Choosing the optimal bit-widths for kernel weights and local storage to meet the memory budget.
• Designing the microarchitecture to have a low enough latency to be practical for the target application, while determining how the accelerator communicates across the system bus without killing the latency the team just fought for.
• Verifying the algorithm early on and throughout the implementation process, especially in the context of the entire system.
• Optimizing for power for mobile devices.
• Getting the product to market on time.
This domain is in desperate need of a productivity-boosting methodology shift away from an RTL flow.
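As a rough illustration of the bit-width trade-off listed above, the following C++ sketch quantizes kernel weights to a chosen fixed-point width, runs a small CNN-style convolution in integer arithmetic, and reports the weight storage that the chosen width implies. It is a minimal sketch under stated assumptions, not material from the session: the function names (quantize, conv2d_valid, weight_storage_bits) are hypothetical, and plain int32_t stands in for the bit-accurate datatypes that an HLS flow would normally use.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: quantize a real-valued weight to a signed fixed-point
// integer with `total_bits` bits, `frac_bits` of which are fractional.
// Out-of-range values saturate instead of wrapping.
inline int32_t quantize(double w, int total_bits, int frac_bits) {
    assert(total_bits >= 2 && total_bits <= 16);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const long max_q = (1L << (total_bits - 1)) - 1;
    const long min_q = -(1L << (total_bits - 1));
    long scaled = std::lround(w * static_cast<double>(1L << frac_bits));
    if (scaled > max_q) scaled = max_q;
    if (scaled < min_q) scaled = min_q;
    return static_cast<int32_t>(scaled);
}

// Hypothetical helper: bits of storage needed for the kernel weights alone.
inline std::size_t weight_storage_bits(std::size_t num_weights, int total_bits) {
    return num_weights * static_cast<std::size_t>(total_bits);
}

// CNN-style 2D convolution (no kernel flip), stride 1, no padding.
// `in` is an H x W row-major image, `k` is a K x K row-major kernel.
// Accumulators carry the kernel's fixed-point scaling (2^frac_bits).
inline std::vector<int32_t> conv2d_valid(const std::vector<int32_t>& in,
                                         std::size_t H, std::size_t W,
                                         const std::vector<int32_t>& k,
                                         std::size_t K) {
    assert(K >= 1 && K <= H && K <= W);
    assert(in.size() == H * W && k.size() == K * K);
    const std::size_t OH = H - K + 1;
    const std::size_t OW = W - K + 1;
    std::vector<int32_t> out(OH * OW, 0);
    for (std::size_t r = 0; r < OH; ++r) {
        for (std::size_t c = 0; c < OW; ++c) {
            int32_t acc = 0;
            for (std::size_t i = 0; i < K; ++i) {
                for (std::size_t j = 0; j < K; ++j) {
                    acc += in[(r + i) * W + (c + j)] * k[i * K + j];
                }
            }
            out[r * OW + c] = acc;
        }
    }
    return out;
}
```

For example, with 8-bit weights and 4 fractional bits, a weight of 0.5 is stored as 8, and each accumulator must be divided by 2^4 to recover the real-valued result. Narrowing total_bits shrinks the value returned by weight_storage_bits but increases quantization error and the chance of saturation, which is the trade-off a design team has to explore.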

Time	Label	Presentation Title / Authors
08:30	5.8.1	INTRODUCTION TO HLS CONCEPTS OPEN-SOURCE IP AND REFERENCES DESIGNS ENABLING BUILDING AI ACCELERATION HARDWARE Author: Mike Fingeroff, Mentor, A Siemens Business, US Abstract: HLS provides a hardware design solution for algorithm designers that generates high-quality RTL from C++ and/or SystemC descriptions targeting ASIC, FPGA, or eFPGA implementations. With the following elements of the HLS solution, teams can quickly develop high-quality, high-performance, low-power hardware implementations: • Enable late-stage changes: easily change C++ algorithms at any time and regenerate RTL code or target a new technology. • Rapidly explore options for power, performance, and area without changing source code. • Reduce design and verification time from one year to a few months and add new features in days, not weeks, all using C/C++ code that contains 5x fewer lines than RTL.
09:00	5.8.2	EARLY SOC PERFORMANCE VERIFICATION USING SYSTEMC WITH NVIDIA MATCHLIB AND HLS Author: Stuart Swan, Mentor, A Siemens Business, US Abstract: NVIDIA MatchLib is a new open-source library that enables much faster design and verification of SOCs using High-Level Synthesis. One of the primary objectives of MatchLib is to enable performance-accurate modeling of SOCs in SystemC/C++. With these models, designers can identify and resolve issues involving bus and memory contention, arbitration strategies, and interconnect structure at a much higher level of abstraction than RTL. In addition, much of the system-level verification of the SOC can occur in SystemC/C++, before RTL is even created. This presentation will introduce NVIDIA MatchLib and its flow (Figure 3), along with its usage with Catapult HLS, using some demonstration examples. Key components of MatchLib: • Connections: synthesizable message-passing framework; SystemC/C++ used to accurately model the concurrent I/O that the synthesized HW will have; automatic stall injection that enables the interconnect to be stress-tested in SystemC • Parameterized AXI4 fabric components: router/splitter; arbiter; AXI4 <-> AXI4Lite; automatic burst segmentation and last-bit generation • Parameterized banked memories, crossbar, reorder buffer, cache • Parameterized NOC components
09:30	5.8.3	CUSTOMER CASE STUDIES OF USING HLS FOR ULTRA-LOW POWER AI HARDWARE ACCELERATION Author: Ellie Burns, Mentor, A Siemens Business, US Abstract: This presentation will review three customer case studies in which HLS has been used for designs and applications that use AI/ML-accelerated HW. All case studies are available as full customer-authored white papers that detail the design, the HLS use, the design experience, and the lessons learned. The three case studies are: NVIDIA - High-productivity IC Design for Machine Learning Accelerators; FotoNation/Xperi - A Designer Life with HLS Faster Computer Vision Neural Networks; Chips&Media - Deep Learning Accelerator Using HLS
10:00		End of session