12.4 Design and Optimization for Low-Power Applications

Time	Label	Presentation Title / Authors
16:00	12.4.1	DYNAMIC SCHEDULING ON HETEROGENEOUS MULTICORES
Speaker: Ann Franchesca Laguna, University of Notre Dame, US
Authors: Ayobami Edun, Ruben Vazquez, Ann Gordon-Ross and Greg Stitt, University of Florida, US
Abstract
Heterogeneous multicore systems help meet design goals by using different architectural components that are suitable for different application needs. The individual cores may also have different tunable architectural parameters for additional specialization. However, this creates a challenge in mapping applications to the cores that offer the best configuration for each application's needs. This decision can be made by performing a sample run of the application on each core type and configuration, or by using heuristics to explore the design space; however, for complex systems, these methods may be infeasible. In this paper, we present a methodology for dynamic scheduling of applications on heterogeneous multicore systems using predictive methods for reduced energy consumption. We use an artificial neural network (ANN) to train our predictive model using hardware counters in the system. The trained network can then predict the best configuration. Our scheduler uses this prediction to schedule the application to the best core (the core that offers the best configuration) and configures that core to the best configuration. If the best core is busy, alternative idle cores are considered for scheduling or the application is stalled. This decision is made based on which option meets the energy advantage considerations. Our experiments show that the total energy of a system can be reduced by 28% on average compared to a system that uses the same fixed cache configuration for all cores.
Download Paper (PDF; Only available from the DATE venue WiFi)
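
The fallback rule in abstract 12.4.1 (run on the best core, otherwise choose between an idle alternative and stalling) can be illustrated with a minimal sketch. This is not the authors' scheduler: the function name, the per-core energy estimates, and the stall cost are all hypothetical, and the predictor that would produce the estimates is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Decision:
    core: Optional[str]  # core to run on, or None when the application stalls
    energy: float        # estimated energy of the chosen option (arbitrary units)


def schedule(predicted_energy: Dict[str, float],
             busy: Set[str],
             stall_energy: float) -> Decision:
    """Pick a core for one application.

    predicted_energy: hypothetical per-core energy estimates (lower is better),
        standing in for the output of a trained predictor.
    busy: cores that are currently occupied.
    stall_energy: hypothetical total energy of waiting for the best core
        and then running on it.
    """
    best = min(predicted_energy, key=predicted_energy.get)
    if best not in busy:
        return Decision(best, predicted_energy[best])

    # Best core is busy: compare the cheapest idle alternative with stalling.
    idle = {c: e for c, e in predicted_energy.items() if c not in busy}
    if idle:
        alt = min(idle, key=idle.get)
        if idle[alt] <= stall_energy:
            return Decision(alt, idle[alt])
    return Decision(None, stall_energy)


# Illustrative numbers only.
energies = {"big": 9.0, "mid": 6.0, "little": 7.5}
print(schedule(energies, busy={"mid"}, stall_energy=8.0))
```

With these placeholder values the best core ("mid") is busy, and the idle "little" core is chosen because its estimate is below the assumed stall cost.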

16:30	12.4.2	SELECTING THE OPTIMAL ENERGY POINT IN NEAR-THRESHOLD COMPUTING
Speaker: Sami Salamin, Karlsruhe Institute of Technology (KIT), DE
Authors: Sami Salamin, Hussam Amrouch and Joerg Henkel, Karlsruhe Institute of Technology, DE
Abstract
Near-Threshold Computing (NTC) has recently emerged as an attractive paradigm as it allows devices to operate close to their optimal energy point (OEP). This work demonstrates, for the first time, that determining where the OEP of a processor lies is challenging because the standard cells forming the processor's netlist profit unevenly w.r.t. power and degrade unevenly w.r.t. delay when the voltage approaches the near-threshold region. To precisely explore, at design time, where the OEP is, we create voltage-aware cell libraries that enable designers to seamlessly employ the standard tool flows, even though they were not designed for that purpose, to perform voltage-aware timing and power analysis. Besides determining where the OEP is, we also demonstrate how providing logic synthesis tool flows with voltage-aware cell libraries results in 35% higher performance at NTC. In addition, we investigate how the performance loss at NTC can be compensated through parallelized computing, demonstrating, for the first time, that the OEP moves far from NTC as the number of cores increases. Our proposed methodology enables designers to jointly select the maximum number of cores and the optimal operating voltage such that a specific power budget is met. Finally, we show how voltage-aware design for parallelized NTC provides a 40%-50% performance increase compared to traditional (i.e., voltage-unaware) parallelized NTC design.
Download Paper (PDF; Only available from the DATE venue WiFi)
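
The joint choice of core count and operating voltage under a power budget, mentioned in abstract 12.4.2, can be pictured as a small search. The sketch below is only an illustration under strong assumptions: the operating-point table is made up, throughput is assumed to scale ideally with the number of cores, and none of it reflects the paper's voltage-aware libraries or results.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-core operating points: voltage (V) -> (frequency in MHz, power in mW).
# Placeholder values for illustration, not data from the paper.
OPERATING_POINTS: Dict[float, Tuple[float, float]] = {
    0.4: (50.0, 1.0),
    0.5: (120.0, 3.0),
    0.6: (220.0, 7.0),
    0.8: (450.0, 25.0),
}


def best_configuration(power_budget_mw: float,
                       max_cores: int) -> Optional[Tuple[float, int, float]]:
    """Return (voltage, cores, throughput) with the highest throughput in budget.

    Throughput is modeled as cores * frequency (ideal parallel scaling, an
    assumption). Returns None when not even one core fits the budget.
    """
    best = None
    for voltage, (freq_mhz, power_mw) in OPERATING_POINTS.items():
        # Largest core count at this voltage that stays within the budget.
        cores = min(max_cores, int(power_budget_mw // power_mw))
        if cores == 0:
            continue
        throughput = cores * freq_mhz
        if best is None or throughput > best[2]:
            best = (voltage, cores, throughput)
    return best


print(best_configuration(power_budget_mw=28.0, max_cores=16))
```

In this toy table a 28 mW budget favors nine cores at 0.5 V, while a larger budget shifts the choice to a higher voltage; the real trade-off depends on the cell-level power and delay data the abstract describes.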

17:00	12.4.3	EXPLORATION AND DESIGN OF LOW-ENERGY LOGIC CELLS FOR 1 KHZ ALWAYS-ON SYSTEMS
Speaker: Maxime Feyerick, ESAT-MICAS, KU Leuven, BE
Authors: Maxime Feyerick, Jaro De Roose and Marian Verhelst, KU Leuven, BE
Abstract
A standard cell library targeting always-on operation at 1 kHz is designed at circuit level. This paper proposes a design methodology to achieve robust operation with minimum energy. Such minimum energy per operation for always-on systems is achieved by one specific combination of supply voltage and threshold voltage Vth. As Vth is discrete in a practical bulk technology, this minimum cannot, however, be achieved through simple voltage tuning. In the considered 90 nm CMOS technology, Vth is too low, resulting in leakage-dominated systems and preventing the minimum energy point in subthreshold from being attained. Three circuit techniques are optimally combined to fight leakage: stacking, reverse body biasing, and optimal transistor dimensioning relying on second-order effects of the dimensions on Vth. They jointly allow logic gates to achieve the best balance between dynamic and leakage power. Moreover, the paper presents modified flip-flop topologies that also reliably operate at 0.27 V along with the gates. The benefits of the improved logic gates and flip-flops are demonstrated on a small always-on feature-extraction system calculating running average and variance on a 1 kSample/s data stream. The resulting system consumes 162 pW in simulation, which is two orders of magnitude less than a commercial library at its 1 V nominal voltage and one order of magnitude less than the commercial library at the same 0.27 V operating voltage.
Download Paper (PDF; Only available from the DATE venue WiFi)
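
To put the 162 pW figure from abstract 12.4.3 in energy terms, the short calculation below divides power by clock frequency. It assumes the demonstrator is clocked at the 1 kHz the library targets, which the abstract implies but does not state explicitly.

```python
POWER_W = 162e-12  # 162 pW, simulated system power quoted in the abstract
CLOCK_HZ = 1e3     # 1 kHz; assumed clock rate of the demonstrator


def energy_per_cycle(power_w: float, clock_hz: float) -> float:
    """Average energy drawn per clock cycle, in joules."""
    return power_w / clock_hz


energy_j = energy_per_cycle(POWER_W, CLOCK_HZ)
print(f"{energy_j * 1e15:.0f} fJ per cycle")  # prints "162 fJ per cycle"
```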

17:15	12.4.4	ENABLING ENERGY-EFFICIENT UNSUPERVISED MONOCULAR DEPTH ESTIMATION ON ARMV7-BASED PLATFORMS
Speaker: Antonio Cipolletta, Politecnico di Torino, IT
Authors: Valentino Peluso¹, Antonio Cipolletta¹, Andrea Calimera¹, Matteo Poggi², Fabio Tosi² and Stefano Mattoccia²
¹Politecnico di Torino, IT; ²Università di Bologna, IT
Abstract
This work deals with the implementation of energy-efficient monocular depth estimation using a low-cost CPU for low-power embedded systems. The paper first describes the PyD-Net depth estimation network, which consists of a lightweight CNN able to approach state-of-the-art accuracy with ultra-low resource usage. Then it proposes an accuracy-driven complexity reduction strategy based on a hardware-friendly fixed-point quantization. Finally, it introduces the low-level optimization enabling effective use of integer neural kernels. The objective is threefold: (i) prove the efficiency of the new quantization flow on a depth estimation network, that is, its capability to retain the accuracy reached by floating-point arithmetic using 16- and 8-bit integers; (ii) demonstrate the portability of the quantized model to a general-purpose 32-bit RISC architecture of the ARM Cortex family; (iii) quantify the accuracy-energy tradeoff of unsupervised monocular estimation to establish its use in the embedded domain. The experiments were run on a Raspberry Pi board powered by a Broadcom BCM2837 chipset. A parametric analysis conducted over the KITTI dataset shows marginal accuracy loss with 16-bit (8-bit) integers and energy savings up to 6.55x (9.23x) w.r.t. floating-point. Compared to high-end CPUs and GPUs, the proposed solution improves scalability.
Download Paper (PDF; Only available from the DATE venue WiFi)
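
As a rough illustration of what fixed-point quantization to 16- or 8-bit integers means in abstract 12.4.4, the sketch below maps floating-point values to signed integers with a power-of-two scale. It is a generic textbook scheme, not the accuracy-driven flow proposed in the paper; the function names and the choice of fractional bits are assumptions.

```python
from typing import List


def quantize(values: List[float], bits: int, frac_bits: int) -> List[int]:
    """Map floats to signed `bits`-wide integers with `frac_bits` fractional bits.

    Values outside the representable range saturate to the nearest bound.
    """
    scale = 1 << frac_bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]


def dequantize(codes: List[int], frac_bits: int) -> List[float]:
    """Convert integer codes back to floats for error inspection."""
    return [c / (1 << frac_bits) for c in codes]


# Illustrative weights, not taken from any real network.
weights = [0.7312, -0.052, 1.5, -2.0]
for bits, frac_bits in ((16, 13), (8, 5)):
    codes = quantize(weights, bits, frac_bits)
    restored = dequantize(codes, frac_bits)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, codes, f"max error = {worst:.6f}")
```

For values inside the representable range, the rounding error is at most half of one least-significant step, which is why the 16-bit setting tracks the original values more closely than the 8-bit one.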

17:30	End of session