4.7 Adaptive Reliable Computing Using Memristive and Reconfigurable Hardware


Date: Tuesday 20 March 2018
Time: 17:00 - 18:30
Location / Room: Konf. 5

Chair:
Walter Weber, NAMLAB, DE

Co-Chair:
Alessandro Cilardo, University of Naples Federico II, IT

This session discusses reliability analysis and enhancement of memristive computing, addressing non-linear device behavior and the development of a logic synthesis flow for defect tolerance. The session also focuses on adapting the precision of heterogeneous hardware to fit application requirements.

Time  Label  Presentation Title / Authors
17:00  4.7.1  RESCUING MEMRISTOR-BASED COMPUTING WITH NON-LINEAR RESISTANCE LEVELS
Speaker:
Jilan Lin, Tsinghua University, CN
Authors:
Jilan Lin1, Lixue Xia1, Zhenhua Zhu1, Hanbo Sun1, Yi Cai1, Hui Gao1, Ming Cheng1, Xiaoming Chen2, Yu Wang1 and Huazhong Yang1
1Tsinghua University, Beijing, CN; 2University of Notre Dame, US
Abstract
Emerging metal-oxide resistive random access memory (RRAM) devices and RRAM crossbars have shown great potential for computing matrix-vector multiplication. However, due to the nonlinear distribution of resistance levels in RRAM devices, state-of-the-art multi-bit RRAM cannot accomplish multi-bit computing tasks accurately. In this paper, we propose fault-tolerant schemes to rescue RRAM-based computation with nonlinear resistance levels. We classify the resistance-level distributions of RRAM devices into three types and propose corresponding models to analyze their computation characteristics. We derive two theoretical conditions on the resistance levels that determine whether an RRAM device can support multi-bit matrix computation. For the linear model, the least-squares method is used to reduce the computing error. When the resistance distribution obeys the proposed power model, a logarithmic operation is used to decode the multiplication results and achieve accurate computation. For the exponential model, since the device cannot complete typical matrix-vector multiplication at the hardware level, we propose online and offline quantization methods to make neural computing algorithms friendly to RRAM devices. Simulation results show that the root-mean-square error improves by around 4% with the linear model and by more than 99% with the power model. After quantization, ResNet-18 using RRAM with exponential resistance levels can reach the same accuracy as with ideal linear RRAM devices.
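
The logarithmic decoding for the power model can be illustrated with a small sketch. Everything below (the base conductance g0, the level ratio r, the 3-bit level count) is an assumed toy parameterization, not the paper's device model: if conductance levels follow a power law g_i = g0·r^i, a logarithm of the read-out current recovers the stored level.

```python
import numpy as np

# Hypothetical power-law conductance model: level i maps to g0 * r**i.
g0, r = 1e-6, 2.0        # base conductance (S) and level ratio (assumed)
levels = np.arange(8)    # a 3-bit device with 8 resistance levels
g = g0 * r ** levels     # nonlinear (power-model) conductance levels

# Read out the cell current at a fixed voltage, then decode the stored
# level with a logarithm, as the power-model decoding suggests.
v = 0.5                  # read voltage (V)
current = g * v          # Ohm's law per cell
decoded = np.log(current / (g0 * v)) / np.log(r)

print(np.round(decoded).astype(int))   # recovers levels 0..7
```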

Download Paper (PDF; Only available from the DATE venue WiFi)
17:30  4.7.2  PX-CGRA: POLYMORPHIC APPROXIMATE COARSE-GRAINED RECONFIGURABLE ARCHITECTURE
Speaker:
Omid Akbari, University of Tehran, IR
Authors:
Omid Akbari1, Mehdi Kamal1, Ali Afzali-Kusha1, Massoud Pedram2 and Muhammad Shafique3
1University of Tehran, IR; 2University of Southern California, US; 3TU Wien, AT
Abstract
Coarse-Grained Reconfigurable Architectures (CGRAs) provide a tradeoff between the energy efficiency of Application-Specific Integrated Circuits (ASICs) and the flexibility of General-Purpose Processors (GPPs). State-of-the-art CGRAs support only exact architectures and precise application execution. However, the majority of streaming applications amenable to CGRAs, such as multimedia and digital signal processing, are inherently error resilient. These applications can therefore greatly benefit from the emerging trend of Approximate Computing, which leverages this error resiliency to provide energy efficiency proportional to the tolerable (and even constrainable) accuracy loss. This paper, for the first time, introduces the novel concept of a Polymorphic Approximate CGRA (PX-CGRA) that employs heterogeneous tiles of Polymorphic-Approximated ALU Clusters (PACs) connected in a 2-D mesh. Depending on their selected configuration, these PACs can implement different approximate as well as accurate modes to match the run-time requirements of the executing applications. For designing an efficient PX-CGRA, we propose a bottom-up design flow, and we discuss the flow of application mapping on the PX-CGRA, including its accuracy-level mapping, scheduling, and binding steps. To comprehensively evaluate the efficacy of the proposed CGRA, complete PX-CGRA architectures of different sizes and with different PAC configurations are synthesized using a 15-nm FinFET technology. Our results show up to 15%-45% energy-efficiency improvement for 5%-35% output quality degradation, respectively, compared to a state-of-the-art exact-mode CGRA. Our proposed architecture and design methodology enable a new era of accuracy-configurable CGRAs with significant energy gains.
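
The idea of an ALU that switches between accurate and approximate modes can be sketched in a few lines. The lower-part-OR style addition below is a common approximation pattern chosen purely for illustration; the actual PAC circuits are not described here, and the function name and truncation width are assumptions.

```python
def alu_add(a, b, mode="exact", trunc_bits=4):
    """Toy model of a polymorphic ALU: exact addition, or an approximate
    mode that skips carry propagation through the low-order bits (a
    lower-part-OR style approximation, used here only as an example)."""
    if mode == "exact":
        return a + b
    mask = ~((1 << trunc_bits) - 1)
    upper = (a & mask) + (b & mask)   # add the upper parts exactly
    lower = (a | b) & ~mask           # OR the lower parts: no carries
    return upper | lower

print(alu_add(1003, 2005))            # exact mode: 3008
print(alu_add(1003, 2005, "approx"))  # approximate mode: 3007
```

Selecting the mode at run time, per tile, is what lets an accuracy-configurable fabric trade output quality for energy.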

18:00  4.7.3  MULTI-PRECISION CONVOLUTIONAL NEURAL NETWORKS ON HETEROGENEOUS HARDWARE
Speaker:
Mohammad Hosseinabady, University of Bristol, GB
Authors:
Moslem Amiri, Mohammad Hosseinabady, Simon McIntosh-Smith and Jose Nunez-Yanez, University of Bristol, GB
Abstract
Fully binarised convolutional neural networks (CNNs) deliver very high inference performance using single-bit weights and activations, together with XNOR-type operators for the kernel convolutions. Current research shows that full binarisation degrades accuracy, and different approaches to tackle this issue are being investigated, such as using more complex models to compensate for the loss. This paper proposes an alternative based on a multi-precision CNN framework that combines a binarised and a floating-point CNN in a pipeline configuration deployed on heterogeneous hardware. The binarised CNN is mapped onto an FPGA device and performs inference over the whole input set, while the floating-point network is mapped onto a CPU device and performs re-inference only when the classification confidence level is low. A lightweight confidence mechanism enables a flexible trade-off between accuracy and throughput. To demonstrate the concept, we choose a Zynq 7020 device as the hardware target and show that the multi-precision network increases the BNN accuracy from 78.5% to 82.5% and the CPU inference speed from 29.68 to 90.82 images/sec.
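
The confidence-gated pipeline can be sketched as follows. The callable names and the max-softmax confidence rule are assumptions for illustration, not the paper's exact mechanism: the fast binarised network classifies every input, and the floating-point network re-infers only the low-confidence ones.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(x, bnn, fp32_cnn, threshold=0.9):
    """Two-stage pipeline sketch: accept the BNN label when its softmax
    confidence clears the threshold, otherwise re-infer in float32."""
    probs = softmax(bnn(x))
    if probs.max() >= threshold:                 # confident: keep BNN label
        return int(probs.argmax()), "bnn"
    return int(np.argmax(fp32_cnn(x))), "fp32"   # re-infer on the CPU

# Stand-in models: any callables returning per-class scores would do.
confident_bnn = lambda x: np.array([4.0, 0.1, 0.2])
unsure_bnn = lambda x: np.array([0.5, 0.4, 0.3])
fp32 = lambda x: np.array([0.1, 0.3, 0.9])

print(classify(None, confident_bnn, fp32))   # (0, 'bnn')
print(classify(None, unsure_bnn, fp32))      # (2, 'fp32')
```

Raising the threshold shifts work toward the slower, more accurate network; lowering it favors throughput.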

18:15  4.7.4  LOGIC SYNTHESIS AND DEFECT TOLERANCE FOR MEMRISTIVE CROSSBAR ARRAYS
Speaker:
Onur Tunali, Istanbul Technical University, TR
Authors:
Onur Tunali and Mustafa Altun, Istanbul Technical University, TR
Abstract
In contrast to the abundant memory-related studies of memristive crossbar structures, logic-oriented applications have only gained popularity in recent years. In this paper, we study logic synthesis, covering both two-level and multi-level designs, and defect aspects of memristor-based crossbar architectures. First, we introduce our two-level and multi-level logic synthesis techniques and elaborate on the advantages and disadvantages of both approaches with experimental results on area cost. We then devise a defect model aligned with the conventional stuck-at-open and stuck-at-closed paradigm and determine the effects of defects on the operational capacity of the crossbar. Furthermore, we propose a preliminary defect-tolerant Boolean logic mapping approach. To evaluate our approach, we conduct extensive Monte Carlo simulations with industrial benchmarks. Finally, we discuss future directions concerning both existing two-level and prospective multi-level logic designs, as well as defect tolerance through area redundancy.
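
A Monte Carlo experiment over a stuck-at defect model can be sketched as below. The defect rates, crossbar size, and the simplified feasibility test (every required cell must land on a functional device) are illustrative assumptions, not the paper's mapping algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def defect_map(rows, cols, p_open=0.05, p_closed=0.02):
    """Random stuck-at defects on a crossbar (rates are illustrative):
    0 = functional, 1 = stuck-at-open, 2 = stuck-at-closed."""
    u = rng.random((rows, cols))
    m = np.zeros((rows, cols), dtype=int)
    m[u < p_open] = 1
    m[(u >= p_open) & (u < p_open + p_closed)] = 2
    return m

def mapping_feasible(required, defects):
    """A mapping is feasible here if every cell the logic function needs
    is defect-free (a deliberately simplified test)."""
    return not np.any(required & (defects != 0))

# Monte Carlo yield estimate for one fixed 8x8 usage pattern.
required = rng.random((8, 8)) < 0.5
trials = 1000
ok = sum(mapping_feasible(required, defect_map(8, 8)) for _ in range(trials))
print(f"yield ~ {ok / trials:.2f}")
```

Defect-tolerant mapping and area redundancy aim precisely at raising this yield without discarding defective crossbars outright.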

18:30  IP2-2, 199  (Best Paper Award Candidate)
A CO-DESIGN METHODOLOGY FOR SCALABLE QUANTUM PROCESSORS AND THEIR CLASSICAL ELECTRONIC INTERFACE
Speaker:
Jeroen van Dijk, Delft University of Technology, NL
Authors:
Jeroen van Dijk1, Andrei Vladimirescu2, Masoud Babaie1, Edoardo Charbon1 and Fabio Sebastiano1
1Delft University of Technology, NL; 2University of California, Berkeley, US
Abstract
A quantum computer fundamentally comprises a quantum processor and a classical controller. The classical electronic controller is used to correct and manipulate the qubits, the core components of a quantum processor. To enable quantum computers that scale to the millions of qubits required by practical applications, the classical electronic and quantum systems must be optimized simultaneously. In this paper, a co-design methodology is proposed for obtaining optimized qubit performance while considering practical trade-offs in the control circuits, such as power consumption, complexity, and cost. The SPINE (SPIN Emulator) toolset is introduced for the co-design and co-optimization of electronic/quantum systems. It comprises a circuit simulator enhanced with a Verilog-A model emulating the quantum behavior of single-electron spin qubits. Design examples show the effectiveness of the proposed methodology in the optimization, design, and verification of a complete electronic/quantum system.

18:31  IP2-3, 757  APPROXIMATE QUATERNARY ADDITION WITH THE FAST CARRY CHAINS OF FPGAS
Speaker:
Philip Brisk, University of California, Riverside, US
Authors:
Sina Boroumand1, Hadi P. Afshar2 and Philip Brisk3
1University of Tehran, IR; 2Qualcomm Research, US; 3University of California, Riverside, US
Abstract
A heuristic is presented to efficiently synthesize approximate adder trees on Altera and Xilinx FPGAs using their carry chains. The mapper constructs approximate adder trees using an approximate quaternary adder as the fundamental building block. The approximate adder trees are smaller than exact adder trees, allowing more operators to fit into a fixed-area device, trading off arithmetic accuracy for higher throughput.
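
The accuracy cost of approximating a four-operand addition can be seen in a bit-level sketch. The clamped-carry scheme below is an illustration of the general idea, not the paper's circuit: an exact four-input add can produce a per-column carry of up to 3, and limiting it to 1 (so it fits a single carry chain) loses accuracy.

```python
def approx_quaternary_add(a, b, c, d, width=8):
    """Add four operands column by column, clamping each column's
    carry-out to 1 (approximate; the exact carry can reach 3)."""
    result, carry = 0, 0
    for i in range(width):
        col = carry + sum((x >> i) & 1 for x in (a, b, c, d))
        result |= (col & 1) << i
        carry = min(col >> 1, 1)   # approximation: drop excess carries
    return result

print(approx_quaternary_add(1, 2, 4, 8))   # 15: exact when columns don't clash
print(approx_quaternary_add(3, 3, 3, 3))   # 6: exact sum is 12
```

In an adder tree, many such cheap blocks trade per-node accuracy for fitting more operators into a fixed-area device.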

18:32  IP2-4, 424  NN COMPACTOR: MINIMIZING MEMORY AND LOGIC RESOURCES FOR SMALL NEURAL NETWORKS
Speaker:
Seongmin Hong, Hongik University, KR
Authors:
Seongmin Hong1, Inho Lee1 and Yongjun Park2
1Hongik University, KR; 2Hanyang University, KR
Abstract
Specialized neural accelerators are an appealing hardware platform for machine learning systems because they provide both high performance and energy efficiency. Although various neural accelerators have recently been introduced, they are difficult to adapt to embedded platforms because current neural accelerators require high memory capacity and bandwidth for the fast preparation of synaptic weights. Embedded platforms are often unable to meet these memory requirements because of their limited resources. In FPGA-based IoT (internet-of-things) systems, the problem becomes even worse, since computation units generated from logic blocks cannot be fully utilized due to the small size of block memory. To overcome this problem, we propose a novel dual-track quantization technique that reduces synaptic weight width based on the magnitude of the value while minimizing accuracy loss. In this value-adaptive technique, large- and small-value weights are quantized differently. In this paper, we present a fully automatic framework called NN Compactor that generates a compact neural accelerator by minimizing the memory requirements of synaptic weights through dual-track quantization and minimizing the logic requirements of processing units (PUs) with minimum recognition accuracy loss. For the three widely used datasets MNIST, CNAE-9, and Forest, experimental results demonstrate that our compact neural accelerator achieves an average performance improvement of 6.4x over a baseline embedded system using minimal resources with minimal accuracy loss.
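
The magnitude-split idea behind dual-track quantization can be sketched as below. The threshold, bit widths, and uniform rounding are illustrative assumptions; the paper's exact dual-track scheme is not reproduced here.

```python
import numpy as np

def dual_track_quantize(w, threshold=0.5, bits_small=4, bits_large=8):
    """Quantize small- and large-magnitude weights on separate tracks,
    so each track's codebook covers only its own value range."""
    small = np.abs(w) < threshold
    q = np.empty_like(w)
    # Track 1: small weights, fine steps over [-threshold, threshold].
    s1 = threshold / (2 ** (bits_small - 1))
    q[small] = np.round(w[small] / s1) * s1
    # Track 2: large weights, a scale matched to their full range.
    s2 = (np.abs(w[~small]).max() / (2 ** (bits_large - 1))
          if (~small).any() else 1.0)
    q[~small] = np.round(w[~small] / s2) * s2
    return q

w = np.array([0.01, -0.2, 0.45, 1.7, -3.2])
print(dual_track_quantize(w))   # close to w on both tracks
```

A single uniform scale sized for the largest weight would waste most codes on the (rare) large values; splitting the tracks keeps the small-weight resolution fine.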

18:30  End of session
Exhibition Reception in Exhibition Area
The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks for all conference delegates and exhibition visitors will be offered. All exhibitors are welcome to also provide drinks and snacks for the attendees.