8.4 Architectural and Circuit Techniques toward Energy-efficient Computing

Time

Label

Presentation Title
Authors

17:00

8.4.1

TRANSPIRE: AN ENERGY-EFFICIENT TRANSPRECISION FLOATING-POINT PROGRAMMABLE ARCHITECTURE
Speaker:
Rohit Prasad, Lab-STICC, UBS, France & DEI, UniBo, Italy, FR
Authors:
Rohit Prasad¹, Satyajit Das², Kevin Martin³, Giuseppe Tagliavini⁴, Philippe Coussy⁵, Luca Benini⁶ and Davide Rossi⁴
¹Université Bretagne Sud, FR; ²IIT Palakkad, IN; ³Université Bretagne Sud, FR; ⁴Università di Bologna, IT; ⁵Université Bretagne Sud / Lab-STICC, FR; ⁶Università di Bologna and ETH Zurich, IT
Abstract
In recent years, Coarse Grain Reconfigurable Architecture (CGRA) accelerators have been increasingly deployed in Internet-of-Things (IoT) end nodes. A modern CGRA has to support and efficiently accelerate both integer and floating-point (FP) operations. In this paper, we propose an ultra-low-power tunable-precision CGRA architectural template, called TRANSprecision floating-point Programmable archItectuRE (TRANSPIRE), and its associated compilation flow supporting both integer and FP operations. TRANSPIRE employs transprecision computing and multiple Single Instruction Multiple Data (SIMD) to accelerate FP operations while also boosting energy efficiency. Experimental results show that, when executing applications for near-sensor computing and embedded machine learning, TRANSPIRE achieves a maximum performance gain of 10.06x and consumes 12.91x less energy than a RISC-V based CPU with an enhanced ISA supporting SIMD-style vectorization and FP data types, with an area overhead of only 1.25x.

17:30

8.4.2

MODELING AND DESIGNING OF A PVT AUTO-TRACKING TIMING-SPECULATIVE SRAM
Speaker:
Shan Shen, Southeast University, CN
Authors:
Shan Shen, Tianxiang Shao, Ming Ling, Jun Yang and Longxing Shi, Southeast University, CN
Abstract
In the low supply voltage region, the performance of 6T-cell SRAM degrades seriously because it takes more time to develop a sufficient voltage difference on the bitlines. Timing-speculative techniques have been proposed to boost SRAM frequency and throughput by speculatively reading data with aggressive timing and correcting timing failures in one or more extended cycles. However, the throughput gains of timing-speculative SRAM are affected by process, voltage and temperature (PVT) variations, which cause the timing design of speculative SRAM to be either too aggressive or too conservative. This paper first proposes a statistical model that abstracts the characteristics of speculative SRAM and shows that there is an optimal sensing time that maximizes the overall throughput. Then, guided by this performance model, a PVT auto-tracking speculative SRAM is designed and fabricated, which can dynamically self-tune the bitline sensing to the optimal time as the working conditions change. According to the measurement results, the maximum throughput gain of the proposed 28 nm SRAM is 1.62X compared to the baseline at 0.6 V VDD.
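The trade-off behind an optimal sensing time can be illustrated with a toy model. The sketch below is not the statistical model from the paper: the exponential failure probability, the one-cycle correction penalty, and all parameter values (`t_min`, `tau`, the search range) are hypothetical choices made only to show why throughput peaks at an intermediate sensing time.

```python
import math

def failure_probability(t_sense, t_min=0.5, tau=0.1):
    """Hypothetical model: a speculative read always fails at or below t_min
    and the failure probability decays exponentially beyond it."""
    if t_sense <= t_min:
        return 1.0
    return math.exp(-(t_sense - t_min) / tau)

def throughput(t_sense, penalty_cycles=1):
    """Reads per unit time when each failed read costs extra cycles.

    Expected time per read = t_sense * (1 + penalty_cycles * p_fail).
    """
    p_fail = failure_probability(t_sense)
    return 1.0 / (t_sense * (1.0 + penalty_cycles * p_fail))

def best_sensing_time(t_lo=0.5, t_hi=2.0, steps=1500):
    """Grid search for the sensing time that maximizes modeled throughput."""
    candidates = [t_lo + (t_hi - t_lo) * i / steps for i in range(steps + 1)]
    return max(candidates, key=throughput)

t_opt = best_sensing_time()
print(f"modeled optimum: t_sense = {t_opt:.3f}, throughput = {throughput(t_opt):.3f}")
```

In this toy setting, sensing too early wastes cycles on corrections and sensing too late wastes time on every read, so the maximum lies strictly between the two extremes. A shift in the failure curve, as PVT variation would cause, moves that maximum, which is the motivation for tracking it at run time.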

18:00

8.4.3

SOLVING CONSTRAINT SATISFACTION PROBLEMS USING THE LOIHI SPIKING NEUROMORPHIC PROCESSOR
Speaker:
Chris Yakopcic, University of Dayton, US
Authors:
Chris Yakopcic¹, Nayim Rahman¹, Tanvir Atahary¹, Tarek M. Taha¹ and Scott Douglass²
¹University of Dayton, US; ²Air Force Research Laboratory, US
Abstract
In many cases, low-power autonomous systems need to make decisions extremely efficiently. However, as a potential solution space becomes more complex, finding a solution quickly becomes nearly impossible using traditional computing methods. Thus, in this work we present a constraint satisfaction algorithm based on the principles of spiking neural networks. To demonstrate the validity of this algorithm, we show successful execution of the Boolean satisfiability problem (SAT) on the Intel Loihi spiking neuromorphic research processor. Power consumption in this spiking processor is due primarily to the propagation of spikes, which are the key drivers of data movement and processing. Thus, this system is inherently efficient for many types of problems. However, algorithms must be redesigned in a spiking neural network format to achieve the greatest efficiency gains. To the best of our knowledge, the work in this paper is the first implementation of constraint satisfaction on a low-power embedded neuromorphic processor. With this result, we aim to show that embedded spiking neuromorphic hardware is capable of executing general problem-solving algorithms with great areal and computational efficiency.

18:15

8.4.4

ACCURATE POWER DENSITY MAP ESTIMATION FOR COMMERCIAL MULTI-CORE MICROPROCESSORS
Speaker:
Sheldon Tan, University of California, Riverside, US
Authors:
Jinwei Zhang, Sheriff Sadiqbatcha, Wentian Jin and Sheldon Tan, University of California, Riverside, US
Abstract
In this work, we propose an accurate full-chip steady-state power density map estimation method for commercial multi-core microprocessors. The new approach is based on measured steady-state thermal maps (images) from an advanced infrared (IR) thermal imaging system to ensure its accuracy. The method consists of a few steps. First, based on the first principles of heat transfer, a 2D spatial Laplace operation is performed on the given thermal map to obtain a so-called raw power density map, which contains both positive and negative values due to the steady-state nature and boundary conditions of the microprocessors. Then, based on the total power of the microprocessor from the online CPU tool, we develop a novel scheme to generate the actual positive-only power density map from the raw power density map. At the same time, we develop a novel approach to estimate the effective thermal conductivity of the microprocessors. To further validate the power density map and the estimated thermal conductivity, we construct a thermal model in COMSOL that mimics the real experimental measurement setup used in the IR imaging system. We then compute thermal maps from the estimated power density maps using the finite element method (FEM) and check that they match the measured thermal maps. Experimental results on an Intel i7-8650U 4-core processor show a 1.8 °C root-mean-square error (RMSE) and 96% similarity (2D correlation) between the computed and measured thermal maps.
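The first step described above follows from steady-state heat conduction, where the power density q relates to temperature T as q = -k ∇²T. A minimal sketch of that step on a pixel grid is given below, using a standard 5-point finite-difference Laplacian. The function names, the unit conductivity `k`, and the grid spacing `h` are assumptions for illustration. The abstract does not describe how the positive-only map is produced, so `positive_power_map` uses a plain stand-in (clip negatives, then rescale to the known total power) and should not be read as the authors' scheme.

```python
def raw_power_density(thermal_map, k=1.0, h=1.0):
    """Raw power density q = -k * laplacian(T) on interior pixels.

    Uses the 5-point stencil; border pixels are left at 0.0 because the
    stencil needs all four neighbours.
    """
    rows, cols = len(thermal_map), len(thermal_map[0])
    q = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            lap = (thermal_map[i - 1][j] + thermal_map[i + 1][j]
                   + thermal_map[i][j - 1] + thermal_map[i][j + 1]
                   - 4.0 * thermal_map[i][j]) / (h * h)
            q[i][j] = -k * lap
    return q

def positive_power_map(q, total_power):
    """Hypothetical post-processing: clip negative values, then rescale so
    the map sums to the measured total power."""
    clipped = [[max(v, 0.0) for v in row] for row in q]
    total = sum(sum(row) for row in clipped)
    if total == 0.0:
        return clipped
    return [[v * total_power / total for v in row] for row in clipped]

# A single hot pixel in the middle of a 5x5 map.
T = [[0.0] * 5 for _ in range(5)]
T[2][2] = 1.0
q_raw = raw_power_density(T)
p_map = positive_power_map(q_raw, total_power=10.0)
```

Even this tiny case shows why the raw map has negative entries: the pixels next to the hot spot get a negative value from the stencil, although no physical heat sink is located there.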

18:31

IP4-5, 168

WHEN SORTING NETWORK MEETS PARALLEL BITSTREAMS: A FAULT-TOLERANT PARALLEL TERNARY NEURAL NETWORK ACCELERATOR BASED ON STOCHASTIC COMPUTING
Speaker:
Yawen Zhang, Peking University, CN
Authors:
Yawen Zhang¹, Sheng Lin², Runsheng Wang¹, Yanzhi Wang², Yuan Wang¹, Weikang Qian³ and Ru Huang¹
¹Peking University, CN; ²Northeastern University, US; ³Shanghai Jiao Tong University, CN
Abstract
Stochastic computing (SC) has been widely used in neural networks (NNs) due to its low hardware cost and high fault tolerance. Conventionally, SC-based NN accelerators adopt a hybrid stochastic-binary format, using an accumulative parallel counter to convert bitstreams into a binary number. This method, however, sacrifices fault tolerance and incurs a high hardware cost. In order to fully exploit the superior fault tolerance of SC, taking a ternary neural network (TNN) as an example, we propose a parallel SC-based NN accelerator that uses bitstream computation only. We apply a bitonic sorting network to implement the accumulation and the activation function simultaneously on parallel bitstreams. The proposed design not only has high fault tolerance, but also achieves at least a 2.8x energy efficiency improvement over the binary computing counterpart.
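One way to see how a sorting network can accumulate and threshold bits without a binary counter is sketched below. It is a generic illustration, not the accelerator from the paper: the ternary encoding and the actual activation function are not reproduced, and the function names are hypothetical. On single bits a compare-exchange element reduces to an AND gate (minimum) and an OR gate (maximum), so sorting the parallel bits yields a thermometer code of their sum, and a threshold test becomes a read of one sorted bit.

```python
def bitonic_sort_bits(bits):
    """Sort a power-of-two-length list of 0/1 values in ascending order
    with a bitonic network whose comparators are AND/OR gates."""
    n = len(bits)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(bits)
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    lo = a[i] & a[partner]   # minimum of two bits
                    hi = a[i] | a[partner]   # maximum of two bits
                    if i & k == 0:           # ascending sub-block
                        a[i], a[partner] = lo, hi
                    else:                    # descending sub-block
                        a[i], a[partner] = hi, lo
            j //= 2
        k *= 2
    return a

def at_least(bits, threshold):
    """True if at least `threshold` inputs are 1 (1 <= threshold <= len(bits)).

    After an ascending sort, the ones occupy the last sum(bits) positions,
    so the test reads a single output bit.
    """
    sorted_bits = bitonic_sort_bits(bits)
    return sorted_bits[len(bits) - threshold] == 1

inputs = [1, 0, 1, 1, 0, 0, 1, 0]        # four ones among eight parallel bits
print(bitonic_sort_bits(inputs))          # [0, 0, 0, 0, 1, 1, 1, 1]
print(at_least(inputs, 4), at_least(inputs, 5))  # True False
```

Because the result stays in a unary form, a single flipped input bit shifts the thermometer code by one position at most, which is consistent with the fault-tolerance argument made for pure bitstream computation.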

18:32

IP4-6, 452

WAVEPRO: CLOCK-LESS WAVE-PROPAGATED PIPELINE COMPILER FOR LOW-POWER AND HIGH-THROUGHPUT COMPUTATION
Speaker:
Yehuda Kra, Bar-Ilan University, IL
Authors:
Yehuda Kra, Adam Teman and Tzachi Noy, Bar-Ilan University, IL
Abstract
Clock-less Wave-Propagated Pipelining is a long-known approach to achieving high throughput without the overhead of costly sampling registers. However, due to many design challenges, which have only increased with technology scaling, this approach has never been widely accepted and has generally been limited to small and very specific demonstrations. This paper addresses this barrier by presenting WavePro, a generic and scalable algorithm capable of skew balancing any combinatorial logic netlist for the application of wave pipelining. The algorithm was implemented in the WavePro Compiler automation utility, which interfaces with industry delay extraction and standard timing analysis tools to produce a sign-off quality result. The utility is demonstrated on a dot-product accelerator in a 65 nm CMOS technology, using a vendor-provided standard cell library and commercial timing analysis tools. By reducing the worst-case output skew by over 70%, the test case achieved throughput equivalent to that of an 8-stage sequentially pipelined implementation with power savings of almost 3X.

18:30

End of session