9.2 Emerging architectures and technologies for ultra-low-power and efficient embedded systems


Date: Thursday 22 March 2018
Time: 08:30 - 10:00
Location / Room: Konf. 6

Chair:
Johanna Sepulveda, Technical University of Munich, DE

Co-Chair:
Paolo Amato, Micron Technology, IT

New waves of architectures and technologies are emerging with the potential to bring high efficiency and ultra-low power to future embedded systems. On one hand, this session focuses on the datapath, with advances in neural networks, deep learning, and mixed-precision accelerators. On the other hand, it presents new technologies for non-volatile memories and bus coding techniques for volatile memories.

Time  Label  Presentation Title / Authors
08:30  9.2.1  (Best Paper Award Candidate)
FFT-BASED DEEP LEARNING DEPLOYMENT IN EMBEDDED SYSTEMS
Speaker:
Sheng Lin, Syracuse University, US
Authors:
Sheng Lin1, Ning Liu1, Mahdi Nazemi2, Hongjia Li1, Caiwen Ding1, Yanzhi Wang3 and Massoud Pedram3
1Syracuse University, US; 2USC, US; 3University of Southern California, US
Abstract
Deep learning has demonstrated its power in many application domains, especially image and speech recognition. As the backbone of deep learning, deep neural networks (DNNs) consist of multiple layers of various types with hundreds to thousands of neurons. Embedded platforms are becoming essential for deep learning deployment due to their portability, versatility, and energy efficiency. The large model size of DNNs, while providing excellent accuracy, also burdens embedded platforms with intensive computation and storage. Researchers have investigated reducing DNN model size with negligible accuracy loss. This work proposes a Fast Fourier Transform (FFT)-based DNN training and inference model suitable for embedded platforms, with reduced asymptotic complexity of both computation and storage, distinguishing our approach from existing ones. We develop training and inference algorithms with the FFT as the computing kernel and deploy the FFT-based inference model on embedded platforms with extraordinary processing speed.
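At its core, the FFT-based approach exploits the convolution theorem: a matrix-vector product with a (block-)circulant weight matrix becomes an element-wise product in the frequency domain, cutting compute from O(n^2) to O(n log n) and storage from n^2 to n values per block. The NumPy sketch below is a minimal illustration of that general technique, not the authors' implementation.

```python
# Minimal sketch of the FFT trick (not the paper's code): a circulant weight
# block turns a matrix-vector product into circular convolution via the FFT.
import numpy as np

def circulant_matvec_fft(w, x):
    """y = C(w) @ x, where C(w) is the circulant matrix with first column w."""
    return np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)))

# Reference check against the explicit O(n^2) circulant matrix.
n = 8
rng = np.random.default_rng(0)
w, x = rng.standard_normal(n), rng.standard_normal(n)
C = np.stack([np.roll(w, j) for j in range(n)], axis=1)  # column j = w shifted by j
assert np.allclose(C @ x, circulant_matvec_fft(w, x))
```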

09:00  9.2.2  A TRANSPRECISION FLOATING-POINT PLATFORM FOR ULTRA-LOW POWER COMPUTING
Speaker:
Giuseppe Tagliavini, Università di Bologna, IT
Authors:
Giuseppe Tagliavini1, Stefan Mach2, Davide Rossi1, Andrea Marongiu2 and Luca Benini1
1Università di Bologna, IT; 2IIS, ETH Zurich, CH
Abstract
In modern low-power embedded platforms, the execution of floating-point (FP) operations emerges as a major contributor to the energy consumption of compute-intensive applications with large dynamic range. Experimental evidence shows that 50% of the energy consumed by a core and its data memory is related to FP computations. The adoption of FP formats requiring fewer bits is an interesting opportunity to reduce energy consumption, since it simplifies the arithmetic circuitry and, by enabling vectorization, reduces the memory bandwidth required to transfer data between memory and registers. From a theoretical point of view, the adoption of multiple FP types fits perfectly with the principle of transprecision computing, allowing fine-grained control of approximation while meeting specified constraints on the precision of final results. In this paper we propose an extended FP type system with complete hardware support to enable transprecision computing on low-power embedded processors, including two standard formats (binary32 and binary16) and two new formats (binary8 and binary16alt). First, we introduce a software library that enables exploration of FP types by tuning both the precision and the dynamic range of program variables. Then, we present a methodology to integrate our library with an external tool for precision tuning, and experimental results that highlight the clear benefits of introducing the new formats. Finally, we present the design of a transprecision FP unit capable of handling 8-bit and 16-bit operations in addition to standard 32-bit operations. Experimental results on FP-intensive benchmarks show that up to 90% of FP operations can be safely scaled down to 8-bit or 16-bit formats. Thanks to precision tuning and vectorization, execution time is decreased by 12% and memory accesses are reduced by 27% on average, leading to a reduction of energy consumption of up to 30%.
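The effect of narrow FP formats can be previewed in software by quantizing values to a reduced mantissa width. The sketch below emulates that effect in plain Python; the bit widths and rounding scheme are assumptions for illustration and do not reproduce the paper's binary8/binary16alt definitions.

```python
# Hedged sketch: emulate a narrow FP format by rounding the mantissa of a
# binary64 value to m explicit bits (exponent-range clamping omitted).
import math

def round_mantissa(x, m_bits):
    """Round x to m_bits explicit mantissa bits, keeping the full exponent."""
    if x == 0.0 or not math.isfinite(x):
        return x
    mant, exp = math.frexp(x)        # x = mant * 2**exp with 0.5 <= |mant| < 1
    scale = 2.0 ** (m_bits + 1)      # +1 accounts for the implicit leading bit
    return math.ldexp(round(mant * scale) / scale, exp)

print(round_mantissa(math.pi, 10))  # binary16-like precision: 3.140625
print(round_mantissa(math.pi, 2))   # binary8-like precision: 3.0
```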

09:30  9.2.3  A PERIPHERAL CIRCUIT REUSE STRUCTURE INTEGRATED WITH A RETIMED DATA FLOW FOR LOW POWER RRAM CROSSBAR-BASED CNN
Speaker:
Keni Qiu, Capital Normal University, CN
Authors:
Keni Qiu1, Weiwen Chen1, Yuanchao Xu1, Lixue Xia2, Yu Wang2 and Zili Shao3
1Capital Normal University, CN; 2Tsinghua University, CN; 3The Hong Kong Polytechnic University, HK
Abstract
Convolutional computations implemented in an RRAM crossbar-based Computing System (RCS) demonstrate the outstanding advantages of high performance and low power. However, current designs are energy-unbalanced among their three parts: RRAM crossbar computation, peripheral circuits, and memory accesses, and the latter two factors can significantly limit the potential gains of RCS. Addressing the high power overhead of peripheral circuits in RCS, this paper proposes a Peripheral Circuit Unit (PeriCU)-Reuse scheme to meet power budgets in energy-constrained embedded systems. The underlying idea is to put the expensive ADCs/DACs in the spotlight and arrange for multiple convolution layers to be served sequentially by the same PeriCU. The first step of the solution is to determine the number of PeriCUs, which are organized into cycle frames. Inside a cycle frame, layers are computed in parallel across PeriCUs but sequentially within each PeriCU. Furthermore, a layer retiming technique is exploited to further improve the energy of RCS by assigning two adjacent layers to the same PeriCU so as to bypass energy-consuming memory accesses. Experiments with five convolutional applications validate that the PeriCU-Reuse scheme, integrated with the retiming technique, can efficiently meet variable power budgets and further reduce energy consumption.
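The scheduling idea can be pictured with a toy model: group layers into units (pairs, when retiming is applied), then serve the units round-robin on a fixed pool of PeriCUs, so PeriCUs work in parallel within a cycle frame while each PeriCU serves its own layers sequentially. The sketch below is a simplified illustration under these assumptions, not the paper's algorithm.

```python
# Toy PeriCU-reuse schedule (illustrative assumptions, not the paper's
# algorithm). With pairing, two adjacent layers share a PeriCU, so their
# intermediate activations can bypass energy-hungry memory accesses.
def schedule_layers(num_layers, num_pericus, pair_adjacent=True):
    step = 2 if pair_adjacent else 1
    units = [list(range(i, min(i + step, num_layers)))
             for i in range(0, num_layers, step)]
    frames = []
    for start in range(0, len(units), num_pericus):   # one cycle frame per pass
        frames.append({p: units[start + p]
                       for p in range(num_pericus) if start + p < len(units)})
    return frames

print(schedule_layers(5, 2))
# [{0: [0, 1], 1: [2, 3]}, {0: [4]}]  -> frame 0 runs two PeriCUs in parallel
```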

09:45  9.2.4  OPTIMAL DC/AC DATA BUS INVERSION CODING
Speaker:
Jan Lucas, TU Berlin, DE
Authors:
Jan Lucas, Sohan Lal and Ben Juurlink, TU Berlin, DE
Abstract
GDDR5 and DDR4 memories use data bus inversion (DBI) coding to reduce termination power and decrease the number of output transitions. Two main strategies exist for encoding data using DBI: DBI DC minimizes the number of outputs transmitting a zero, while DBI AC minimizes the number of signal transitions. We show that neither strategy is optimal and that a reduction of interface power of up to 6% can be achieved by taking both the number of zeros and the number of signal transitions into account when encoding the data. We then demonstrate that a hardware implementation of optimal DBI coding is feasible, reduces system power, and requires only an insignificant amount of additional die area.
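The decision rule is easy to prototype: for each data word, compare the cost of sending it as-is versus inverted, where the cost combines zeros (termination power) and transitions (switching power) rather than counting only one of them. The sketch below uses illustrative unit weights and, for brevity, ignores the contribution of the DBI line itself; it is not the paper's hardware encoder.

```python
# Hedged sketch of combined DC/AC DBI coding (weights are assumptions; the
# DBI signal's own power is ignored for brevity).
W_ZERO, W_TRANS = 1.0, 1.0           # assumed relative energy weights

def popcount(v):
    return bin(v).count("1")

def cost(tx, prev):
    zeros = 8 - popcount(tx)         # POD termination burns power on zeros
    transitions = popcount(tx ^ prev)
    return W_ZERO * zeros + W_TRANS * transitions

def encode(word, prev):
    """Return (byte_on_the_wire, dbi_bit) minimizing the combined cost."""
    inv = ~word & 0xFF
    return (inv, 1) if cost(inv, prev) < cost(word, prev) else (word, 0)

print(encode(0x03, prev=0xFF))       # (252, 1): inverting wins on both metrics
```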

10:00  IP4-5, 778  ENERGY-PERFORMANCE DESIGN EXPLORATION OF A LOW-POWER MICROPROGRAMMED DEEP-LEARNING ACCELERATOR
Speaker:
Andrea Calimera, Politecnico di Torino, IT
Authors:
Andrea Calimera1, Mario R. Casu2, Giulia Santoro1, Valentino Peluso1 and Massimo Alioto3
1Politecnico di Torino, IT; 2Politecnico di Torino, Department of Electronics and Telecommunications, IT; 3National University of Singapore, SG
Abstract
This paper presents the design space exploration of a novel microprogrammable accelerator in which PEs are connected with a Network-on-Chip and benefit from low-power features enabled through a practical implementation of a Dual-Vdd assignment scheme. An analytical model, fitted with post-layout data obtained with a 28nm FDSOI design kit, returns implementations with an optimal energy-performance tradeoff by taking all the key design-space variables into consideration. The resulting Pareto analysis helps us infer optimization rules aimed at improving the quality of the design.
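A Pareto analysis of this kind boils down to filtering out design points that some other point beats on both axes. The sketch below shows that filter on made-up (execution time, energy) pairs; the paper's points come from its fitted analytical model, not from this code.

```python
# Illustrative Pareto-front filter over (time, energy) design points, both
# minimized. The points are made up; the paper uses a fitted analytical model.
def pareto_front(points):
    front = []
    for t, e in sorted(points):            # ascending time (ties by energy)
        if not front or e < front[-1][1]:  # strictly better energy survives
            front.append((t, e))
    return front

designs = [(1.0, 9.0), (1.2, 7.5), (1.5, 8.0), (2.0, 5.0), (2.5, 5.5)]
print(pareto_front(designs))  # [(1.0, 9.0), (1.2, 7.5), (2.0, 5.0)]
```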

10:01  IP4-6, 178  GENPIM: GENERALIZED PROCESSING IN-MEMORY TO ACCELERATE DATA INTENSIVE APPLICATIONS
Speaker:
Tajana Rosing, UC San Diego, US
Authors:
Mohsen Imani, Saransh Gupta and Tajana Rosing, University of California, San Diego, US
Abstract
Big data has become a serious problem as data volumes have been skyrocketing for the past few years. Storage and CPU technologies are overwhelmed by the amount of data they have to handle. Traditional computer architectures show poor performance when processing such huge volumes of data. Processing in-memory (PIM) is a promising technique to address the data movement issue by processing data locally inside memory. However, there are two main issues with stand-alone PIM designs: (i) PIM is not always computationally faster than CMOS logic, and (ii) PIM cannot process all operations in many applications. Thus, not many applications can benefit from PIM. To generalize the use of PIM, we designed GenPIM, a general processing in-memory architecture consisting of conventional processors as well as PIM accelerators. GenPIM supports basic PIM functionalities in specialized non-volatile memory, including bitwise operations, search, addition, and multiplication. For each application, GenPIM identifies the parts that use PIM operations and processes the remaining non-PIM or non-data-intensive parts on general-purpose cores. GenPIM also enables configurable PIM approximation by relaxing in-memory computation. We test the efficiency of the proposed design on different emerging machine learning, compression, and security applications. Our experimental evaluation shows that our design can achieve a 10.9x improvement in energy efficiency and a 6.4x speedup as compared to processing data in conventional cores. The results can be further improved by 21.0% in energy consumption and 30.6% in performance by enabling PIM approximation while ensuring less than 2% quality loss.
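The dispatch policy GenPIM describes can be sketched as a simple routing rule: operations the PIM fabric supports natively run in memory, and everything else falls back to the general-purpose cores. The names and the supported-operation set below are assumptions for illustration, not GenPIM's actual interface.

```python
# Hedged sketch of a PIM/CPU dispatch rule (op set and names are assumed,
# not GenPIM's real interface).
PIM_OPS = {"and", "or", "xor", "search", "add", "mul"}   # per the abstract

def execute(op, args, pim_backend, cpu_backend):
    """Route supported ops to PIM (no data movement), the rest to the CPU."""
    backend = pim_backend if op in PIM_OPS else cpu_backend
    return backend(op, args)

pim = lambda op, args: f"PIM: {op}{args}"
cpu = lambda op, args: f"CPU: {op}{args}"
print(execute("add", (3, 4), pim, cpu))   # PIM: add(3, 4)
print(execute("div", (3, 4), pim, cpu))   # CPU: div(3, 4)
```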

10:00  End of session
Coffee Break in the Exhibition Area



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks in the exhibition area (Terrace Level of the ICCD) at the times listed below.

Lunch Breaks (Großer Saal + Saal 1)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the rooms "Großer Saal" and "Saal 1" (Saal Level of the ICCD) to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 20, 2018

  • Coffee Break 10:30 - 11:30
  • Lunch Break 13:00 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:50 - 14:20
  • Coffee Break 16:00 - 17:00

Wednesday, March 21, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:30 - 14:20
  • Coffee Break 16:00 - 17:00

Thursday, March 22, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:00
  • Keynote Lecture in "Saal 2" 13:20 - 13:50
  • Coffee Break 15:30 - 16:00