10.3 Special Session: Next Generation Arithmetic for Edge Computing


Date: Thursday 12 March 2020
Time: 11:00 - 12:30
Location / Room: Autrans

Chair:
Farhad Merchant, RWTH Aachen University, DE

Co-Chair:
Akash Kumar, TU Dresden, DE

Arithmetic is ubiquitous in today's digital world, from embedded to high-performance computing systems. With machine learning at the fore in application domains ranging from wearables, automotive, and avionics to weather prediction, sufficiently accurate yet low-cost arithmetic is the need of the day. There have recently been several advances in computer arithmetic, such as high-precision anchored numbers from ARM, posit arithmetic by John Gustafson, and bfloat16, as alternatives to IEEE 754-2008 compliant arithmetic. Optimizations of fixed-point and integer arithmetic are also actively pursued for low-power computing architectures. Furthermore, approximate computing and transprecision/mixed-precision computing have long been active research areas. While academic research in computer arithmetic has a long history, industrial adoption of some of these new data types and techniques is in its early stages and expected to grow; bfloat16 is an excellent example. In this special session, we bring academia and industry together to discuss the latest results and future directions for research in next-generation computer arithmetic.

Time   Label   Presentation Title / Authors
11:00   10.3.1   PARADIGM ON APPROXIMATE COMPUTE FOR COMPLEX PERCEPTION-BASED NEURAL NETWORKS
Authors:
Andre Guntoro and Cecilia De la Parra, Robert Bosch GmbH, DE
Abstract
The rise of machine learning drives massive compute power requirements, especially on edge devices performing real-time inference. One established approach for reducing power usage is to move down to integer inference (such as 8-bit) instead of relying on the higher computation accuracy of floating-point. Squeezing into even lower bit representations, as in binary weight networks or binary neural networks, requires complex training methods and additional effort to recover the precision loss, and it typically works only on simple classification tasks. One promising alternative for further reducing power consumption and computation latency is to use approximate compute units. This is a promising paradigm for mitigating the computation demand of neural networks by taking advantage of their inherent resilience. Thanks to the developments in approximate computing over the last decade, we have abundant options for choosing the best available approximate units without re-developing or re-designing them. Nonetheless, adaptation during the training phase is required. First, we need to adapt the training methods for neural networks to take into account the inaccuracy introduced by approximate compute, without sacrificing training speed (considering that training is performed on GPUs with floating-point). Second, we need to define new metrics for assessing and selecting the best-fit approximation units on a per-use-case basis. Lastly, we need to exploit the advantages of approximation in the neural networks themselves, such as overfitting mitigation by design and resiliency, so that networks trained for and designed with approximation perform better than their exact-computing counterparts. For these steps, we evaluate on small tasks first and further validate on complex tasks that are more relevant in the automotive domain.
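As a rough illustration of the kind of behavioural model such a flow relies on, the sketch below uses a generic truncated 8-bit multiplier as a stand-in for whichever approximate unit is actually selected (the function names and the drop parameter are illustrative, not from the talk), and measures the error it introduces into an integer dot product, the basic operation of a neural-network layer:

    import numpy as np

    def approx_mul8(a, b, drop=4):
        # Truncated multiplier: compute the exact 8x8-bit product but discard
        # the `drop` least significant result bits, a common way to model
        # cheaper approximate hardware multipliers.
        exact = int(a) * int(b)
        return (exact >> drop) << drop

    def approx_dot(acts, weights, drop=4):
        # 8-bit dot product built from the approximate multiplier, as it would
        # appear in the accumulation of one neuron.
        return sum(approx_mul8(a, w, drop) for a, w in zip(acts, weights))

    rng = np.random.default_rng(0)
    acts = rng.integers(0, 256, size=64, dtype=np.uint8)
    wgts = rng.integers(0, 256, size=64, dtype=np.uint8)
    exact = int(np.dot(acts.astype(np.int64), wgts.astype(np.int64)))
    approx = approx_dot(acts, wgts)
    print(f"relative error of the approximate dot product: {abs(exact - approx) / exact:.4%}")

In a training flow, a behavioural model of this kind would replace the exact multiply in the forward pass so the network can adapt to the approximation error, while the backward pass stays in floating-point on the GPU.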
11:22   10.3.2   NEXT GENERATION FPGA ARITHMETIC FOR AI
Author:
Martin Langhammer, Intel, GB
Abstract
The most recent FPGA architectures have introduced new levels of embedded floating-point performance, with tens of TFLOPs now available across a wide range of device sizes. The last two generations of FPGAs have introduced IEEE 754 single-precision (FP32) arithmetic, containing up to 10 TFLOPs. The emergence of AI/machine learning as the highest-profile FPGA application has shifted the focus from signal processing and embedded calculations supported by FP32 to smaller floating-point precisions, such as BFLOAT16 for training and FP16 for inference. In this talk, we will describe the architecture and development of the Intel Agilex DSP Block, which contains an FP32 multiplier-adder pair that can be decomposed into two smaller-precision pairs supporting FP16, BFLOAT16, and a third proprietary format that can be used for both training and inference. At the edge, where even lower-precision arithmetic is required for inference, new FPGA EDA flows can implement 100+ TFLOPs of soft-logic-based compute power. In the second half of our talk, we will describe new synthesis, clustering, and packing methodologies - collectively known as Fractal Synthesis - that allow an unprecedented near-100% logic use of the FPGA for arithmetic, while maintaining the clock rates of a small example design. The soft-logic and embedded arithmetic capabilities can be used simultaneously, making the FPGA the most flexible, and among the highest-performing, AI platforms.
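The property that makes BFLOAT16 attractive alongside FP32 is simply that a bfloat16 value is the top 16 bits of an IEEE 754 binary32 word. A minimal Python sketch of that conversion is given below; it is illustrative only, not a model of the Agilex DSP block, and NaN handling is omitted:

    import struct

    def float32_to_bfloat16_bits(x):
        # Keep the binary32 sign and 8-bit exponent, round the 23-bit fraction
        # down to 7 bits using round-to-nearest-even (NaN payloads not handled).
        (bits,) = struct.unpack("<I", struct.pack("<f", x))
        lsb = (bits >> 16) & 1
        return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF

    def bfloat16_bits_to_float32(b):
        # Re-expand a bfloat16 bit pattern by zero-padding the fraction.
        (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
        return x

    print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159265)))  # 3.140625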
11:44   10.3.3   APPLICATION-SPECIFIC ARITHMETIC DESIGN
Author:
Florent de Dinechin, INSA Lyon, FR
Abstract
General-purpose processor manufacturers face the difficult task of deciding the best arithmetic systems to commit to silicon. An alternative, particularly relevant to FPGA computing and ASIC design, is to keep this choice as open as possible, designing tools that enable different arithmetic systems to be mixed and matched in an application-specific way. To achieve this, a productive paradigm has emerged from the FloPoCo project: open-ended generation of over-parameterized operators that compute just right thanks to last-bit accuracy at all levels. This talk reviews this paradigm, along with some of the arithmetic tools recently developed for this purpose: the generic bit-heap framework of FloPoCo, and the integration of arithmetic optimization inside HLS tools in the Marto project.
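To give a flavour of the "compute just right" sizing argument, the Python sketch below chooses just enough guard bits for an internal accumulator so that a sum of truncated terms remains accurate to the last bit of the output format. The helper names and the textbook error bound used here are illustrative assumptions; FloPoCo's generators apply analogous word-length reasoning but are not reproduced by this snippet:

    import math

    def guard_bits(num_terms):
        # Each addend truncated at weight 2^-(lsb+g) loses less than 2^-(lsb+g),
        # so g = ceil(log2(K)) + 1 keeps the total truncation error of K terms
        # below half an ulp (2^-(lsb+1)) of the output format.
        return math.ceil(math.log2(num_terms)) + 1

    def accumulator_width(input_msb, output_lsb, num_terms):
        # The MSB grows by log2(K) when summing K terms; the LSB is extended by
        # the guard bits. The returned width gives a last-bit-accurate accumulator.
        g = guard_bits(num_terms)
        sum_msb = input_msb + math.ceil(math.log2(num_terms))
        return sum_msb - (output_lsb - g) + 1

    # Example: 128-term sum, inputs below 2, output accurate to 2^-23.
    print(guard_bits(128), accumulator_width(input_msb=0, output_lsb=-23, num_terms=128))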
12:06   10.3.4   A COMPARISON OF POSIT AND IEEE 754 FLOATING-POINT ARITHMETIC THAT ACCOUNTS FOR EXCEPTION HANDLING
Author:
John Gustafson, National University of Singapore, SG
Abstract
The posit number format has advantages over the decades-old IEEE 754 floating-point standard along many dimensions: accuracy, dynamic range, simplicity, bitwise reproducibility, resiliency, and resistance to side-channel security attacks. In making comparisons, it is essential to distinguish between an IEEE 754 implementation that handles all the exceptions in hardware, and one that either ignores the exceptions of the Standard or handles them with software or microcode that takes hundreds of clock cycles to execute. Ignoring the exceptions quickly leads to egregious problems such as different values comparing as equal; handling exceptions with microcode creates a massive data dependency in timing that permits side-channel attacks like the well-known Spectre and Meltdown security weaknesses. Many microprocessors, such as current x86 architectures, use the exception-trapping approach for cases such as denormalized floats, which makes them unsuitable for secure use. Posit arithmetic provides data-independent and fast execution times with less complexity than a data-independent IEEE 754 float environment for the same data size.
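For reference, the sketch below decodes a posit bit pattern (sign, run-length-encoded regime, exponent, fraction) in Python. It is a minimal software reference, not a hardware model; the default width and exponent-size parameters are example choices, and the only exceptional values a posit decoder has to recognise are zero and the single NaR pattern:

    def decode_posit(word, nbits=8, es=1):
        # Decode an nbits-wide posit, given as an unsigned integer, into a float.
        mask = (1 << nbits) - 1
        word &= mask
        if word == 0:
            return 0.0
        if word == 1 << (nbits - 1):
            return float("nan")                 # NaR, the single exception value
        sign = -1.0 if word >> (nbits - 1) else 1.0
        if sign < 0:
            word = (-word) & mask               # negative posits: two's complement first
        # Regime: run length of identical bits after the sign bit.
        pos = nbits - 2
        r0 = (word >> pos) & 1
        run = 0
        while pos >= 0 and (word >> pos) & 1 == r0:
            run += 1
            pos -= 1
        regime = run - 1 if r0 == 1 else -run
        pos -= 1                                # skip the terminating regime bit
        # Exponent: next es bits; bits pushed off the end of the word act as zero.
        exponent = 0
        for _ in range(es):
            exponent <<= 1
            if pos >= 0:
                exponent |= (word >> pos) & 1
                pos -= 1
        # Fraction: remaining bits with an implicit leading 1.
        frac_bits = max(pos + 1, 0)
        fraction = word & ((1 << frac_bits) - 1)
        significand = 1.0 + fraction / (1 << frac_bits)
        return sign * 2.0 ** (regime * (1 << es) + exponent) * significand

    print(decode_posit(0b01000000), decode_posit(0b11000000))  # 1.0 -1.0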
12:30   End of session