8.5 CNN Dataflow Optimizations

Date: Wednesday 11 March 2020
Time: 17:00 - 18:30
Location / Room: Bayard

Chair:
Mario Casu, Politecnico di Torino, IT

Co-Chair:
Wanli Chang, University of York, GB

This session focuses on efficient dataflow approaches for reducing CNN runtime on embedded hardware platforms. The papers demonstrate techniques that enhance parallelism to improve CNN performance, leverage output prediction to reduce inference runtime for time-critical embedded applications, and present a Keras-based DNN framework for real-time cyber-physical systems.

Time | Label | Presentation Title / Authors
17:00 | 8.5.1 | ANALYSIS AND SOLUTION OF CNN ACCURACY REDUCTION OVER CHANNEL LOOP TILING
Speaker:
Yesung Kang, Pohang University of Science and Technology, KR
Authors:
Yesung Kang1, Yoonho Park1, Sunghoon Kim1, Eunji Kwon1, Taeho Lim2, Mingyu Woo3, Sangyun Oh4 and Seokhyeong Kang1
1Pohang University of Science and Technology, KR; 2SK Hynix, KR; 3University of California, San Diego, US; 4UNIST, KR
Abstract
Owing to the growing size of convolutional neural networks (CNNs), quantization and loop tiling (also called loop breaking) are mandatory for implementing CNNs on embedded systems. However, channel loop tiling of quantized CNNs induces unexpected errors. We explain why these errors arise and how they affect the accuracy of state-of-the-art CNNs. We also propose a method to recover accuracy under channel tiling by compressing and decompressing the most significant bits of partial sums. Using the proposed method, we can recover accuracy by 12.3% with only 1% circuit-area overhead and 2% additional power consumption.
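
To see where the tiling-induced error comes from, the following minimal numpy sketch contrasts quantizing one full-depth accumulation against quantizing per channel tile. The tile size and fixed-point format are illustrative assumptions, and the paper's MSB compression/decompression scheme is not reproduced here.

```python
# Minimal numpy sketch of why channel loop tiling can change quantized
# CNN outputs. The 8-fraction-bit format and tile size are illustrative
# assumptions, not the paper's actual hardware configuration.
import numpy as np

rng = np.random.default_rng(0)
C, TILE, FRAC_BITS = 256, 32, 8          # channels, tile size, fraction bits

def quantize(x, frac_bits=FRAC_BITS):
    """Round to a fixed-point grid with 2**-frac_bits resolution."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

acts = rng.normal(size=C)
wts = rng.normal(size=C)

# Untiled: accumulate across all channels, quantize the final sum once.
untiled = quantize(np.dot(acts, wts))

# Tiled: each channel tile produces a partial sum that is quantized
# (e.g. when spilled to a narrow on-chip buffer) before being re-read
# and added to the running total -- rounding error now accrues per tile.
tiled = 0.0
for t in range(0, C, TILE):
    tiled = quantize(tiled + np.dot(acts[t:t+TILE], wts[t:t+TILE]))

print(f"untiled={untiled:.6f}  tiled={tiled:.6f}  diff={abs(untiled - tiled):.6f}")
```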

17:30 | 8.5.2 | DC-CNN: COMPUTATIONAL FLOW REDEFINITION FOR EFFICIENT CNN INFERENCE THROUGH MODEL STRUCTURAL DECOUPLING
Speaker:
Xiang Chen, George Mason University, US
Authors:
Fuxun Yu1, Zhuwei Qin1, Di Wang2, Ping Xu1, Chenchen Liu3, Zhi Tian1 and Xiang Chen1
1George Mason University, US; 2Microsoft, US; 3University of Maryland, Baltimore County, US
Abstract
Thanks to their excellent accuracy and feasibility, Convolutional Neural Networks (CNNs) have been widely applied in novel intelligent applications and systems. However, CNN computation performance is significantly hindered by the conventional computation flow, which evaluates the model structure sequentially, layer by layer, with massive convolution operations. Such a layer-wise sequential computation flow is imposed by the inter-layer data dependency and causes performance issues such as resource under-utilization and significant computation overhead. To solve these problems, we propose a novel CNN structural decoupling method, which decouples CNN models along "critical paths" and eliminates the inter-layer data dependency. Based on this method, we redefine the CNN computation flow into parallel and cascade computing paradigms, which can significantly enhance CNN computation performance on both multi-core and single-core CPUs. Experiments show that our DC-CNN framework reduces latency by up to 33% on multi-core CPUs for both CIFAR and ImageNet. On small-capacity mobile platforms, cascade computing reduces latency by 24% on average on ImageNet and by 42% on CIFAR10, while the memory reduction reaches 21% and 64% on average, respectively.
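
As a rough illustration of the decoupling idea (not the authors' DC-CNN implementation), the sketch below splits one layer's filters into independent groups that share only the layer input, so each group can be evaluated on its own CPU worker and the outputs merged afterwards.

```python
# Illustrative sketch: once a layer's filters are split into independent
# "paths", each path depends only on the layer input, not on the other
# paths, so they can run concurrently instead of as one monolithic layer.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 64))              # toy input features

# One dense "layer" whose 128 output units are split into 4 paths.
W = rng.normal(size=(64, 128))
paths = np.split(W, 4, axis=1)            # 4 independent filter groups

def run_path(Wp):
    # Inter-path data dependency is eliminated: only x is shared.
    return np.maximum(x @ Wp, 0.0)        # ReLU

with ThreadPoolExecutor(max_workers=4) as pool:
    outs = list(pool.map(run_path, paths))

y = np.concatenate(outs, axis=1)          # merge paths; same shape as x @ W
assert y.shape == (1, 128)
```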

18:00 | 8.5.3 | ABC: ABSTRACT PREDICTION BEFORE CONCRETENESS
Speaker:
Jung-Eun Kim, Yale University, US
Authors:
Jung-Eun Kim1, Richard Bradford2, Man-Ki Yoon1 and Zhong Shao1
1Yale University, US; 2Collins Aerospace, US
Abstract
Learning techniques are advancing the utility and capability of modern embedded systems. However, the challenge of incorporating learning modules into embedded systems is that computing resources are scarce. For such a resource-constrained environment, we have developed a framework that learns abstract information early and learns more concretely as time allows. The intermediate results can be used to prepare for early decisions/actions as needed. To apply this framework to a classification task, the datasets are categorized in an abstraction hierarchy; the framework then classifies intermediate labels from the most abstract level to the most concrete. Our proposed method outperforms existing approaches and reference baselines in terms of accuracy. We evaluate our framework with different architectures and on various benchmark datasets: CIFAR-10, CIFAR-100, and GTSRB. We also measure prediction times on GPU-equipped embedded computing platforms.
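
The anytime flavor of the approach can be sketched schematically as below; the two-level hierarchy, the deadline threshold, and the placeholder classifier heads are all hypothetical stand-ins for the paper's learned models.

```python
# Schematic sketch of abstract-before-concrete prediction. The coarse
# head can be read out early; the fine head refines it if time allows.
import numpy as np

HIERARCHY = {"vehicle": ["car", "truck"], "animal": ["cat", "dog"]}

def coarse_head(features):
    # Placeholder for a cheap classifier over abstract labels.
    return "vehicle"

def fine_head(features, coarse_label):
    # Placeholder for the concrete classifier, restricted to the
    # children of the abstract label already committed to.
    return HIERARCHY[coarse_label][0]

def classify(features, time_left_ms):
    abstract = coarse_head(features)       # cheap, always computed
    if time_left_ms < 5.0:                 # illustrative deadline check
        return abstract                    # act on the abstract label
    return fine_head(features, abstract)   # refine when time allows

print(classify(np.zeros(16), time_left_ms=2.0))   # -> "vehicle"
print(classify(np.zeros(16), time_left_ms=20.0))  # -> "car"
```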

18:15 | 8.5.4 | A COMPOSITIONAL APPROACH USING KERAS FOR NEURAL NETWORKS IN REAL-TIME SYSTEMS
Speaker:
Xin Yang, University of Auckland, NZ
Authors:
Xin Yang, Partha Roop, Hammond Pearce and Jin Woo Ro, University of Auckland, NZ
Abstract
Real-time systems are designed using model-driven approaches, where a complex system is represented as a set of interacting components. Such a compositional approach facilitates the design of simpler components, which are easier to validate and to integrate with the overall system. In contrast, data-driven systems like neural networks are designed as monolithic black boxes that capture the non-linear relationship from inputs to outputs. Increasingly, such systems are being used in safety-critical real-time systems, where a compositional approach would be ideal. However, to the best of our knowledge, such a compositional approach is lacking for designing data-driven components based on neural networks. This paper formalises the problem by developing the concept of Composed Neural Networks (CpNNs) as an extension of the well-known Keras Python framework. CpNNs formalise the synchronous composition of several interacting neural networks in Keras. Further, using the developed semantics, we enable modular compilation from a given CpNN to C code suitable for Worst-Case Execution Time (WCET) analysis. Using several benchmarks, we demonstrate the superiority of the developed approach over a recently proposed approach using Esterel, as well as over the popular TensorFlow Lite package. For the given benchmarks, our approach reduces the average WCET by 64.06% relative to Esterel and the average measured WCET by 62.08% relative to TensorFlow Lite.
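
A minimal Keras sketch of the underlying composition idea follows: two small networks are wired together with the functional API so that one evaluation runs component A and then component B over a fixed, analyzable dataflow. The CpNN formalism, its synchronous semantics, and the modular C-code generation go well beyond this illustration, and all layer names and shapes here are invented.

```python
# Sketch: composing two small Keras networks into one deterministic
# component. This is plain functional-API composition, not the paper's
# CpNN semantics or its WCET-oriented C backend.
import tensorflow as tf
from tensorflow import keras

# Component A: maps sensor features to an intermediate signal.
a_in = keras.Input(shape=(8,), name="sensors")
a_out = keras.layers.Dense(4, activation="relu", name="a_dense")(a_in)
net_a = keras.Model(a_in, a_out, name="component_a")

# Component B: consumes A's output plus its own input channel.
b_in = keras.Input(shape=(2,), name="setpoint")
b_cat = keras.layers.Concatenate()([net_a.output, b_in])
b_out = keras.layers.Dense(1, name="actuator")(b_cat)

# The composition is itself an ordinary Keras model: one "tick"
# evaluates A then B with a fixed dataflow between them.
composed = keras.Model([a_in, b_in], b_out, name="composed")
composed.summary()
```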

18:00 | IP4-7, 935 | DEEPNVM: A FRAMEWORK FOR MODELING AND ANALYSIS OF NON-VOLATILE MEMORY TECHNOLOGIES FOR DEEP LEARNING APPLICATIONS
Speaker:
Ahmet Inci, Carnegie Mellon University, US
Authors:
Ahmet Inci, Mehmet M Isgenc and Diana Marculescu, Carnegie Mellon University, US
Abstract
Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 4.2x and 5x energy-delay product (EDP) reduction and 2.4x and 3x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide 2.3x EDP reduction on average across all workloads when compared to SRAM. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPU platforms for deep learning applications.
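
For readers unfamiliar with the reported metric, the energy-delay product is simply energy multiplied by delay. The toy numbers below are made up solely to show how a 4x-class EDP reduction can arise from moderate gains in both terms; they are not outputs of the DeepNVM models.

```python
# Back-of-the-envelope illustration of the energy-delay product (EDP).
# All values are normalized, hypothetical design points.
def edp(energy_j, delay_s):
    return energy_j * delay_s

sram_edp = edp(energy_j=1.0, delay_s=1.0)   # normalized SRAM baseline
mram_edp = edp(energy_j=0.4, delay_s=0.6)   # hypothetical NVM cache point

print(f"EDP reduction: {sram_edp / mram_edp:.1f}x")  # -> 4.2x
```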

18:01 | IP4-8, 419 | EFFICIENT EMBEDDED MACHINE LEARNING APPLICATIONS USING ECHO STATE NETWORKS
Speaker:
Rolando Brondolin, Politecnico di Milano, IT
Authors:
Luca Cerina1, Giuseppe Franco2, Claudio Gallicchio3, Alessio Micheli3 and Marco D. Santambrogio4
1Politecnico di Milano, IT; 2Scuola Superiore Sant'Anna / Università di Pisa, IT; 3Università di Pisa, IT; 4Politecnico di Milano, IT
Abstract
The increasing role of Artificial Intelligence (AI) and Machine Learning (ML) in our lives has brought a paradigm shift in how and where computation is performed. Stringent latency requirements and congested bandwidth have moved AI inference from the cloud towards end devices. This change has required a major simplification of Deep Neural Networks (DNNs), with memory-efficient libraries or co-processors that perform fast inference with minimal power. Unfortunately, many applications such as natural language processing, time-series analysis, and audio interpretation are built on a different type of Artificial Neural Network (ANN), the so-called Recurrent Neural Networks (RNNs), which, due to their intrinsic architecture, remain too complex and heavy to run efficiently on embedded devices. To solve this issue, the Reservoir Computing paradigm proposes sparse, untrained, non-linear networks, the Reservoir, that can embed temporal relations without some of the hindrances of RNN training and with lower memory usage. Echo State Networks (ESNs) and Liquid State Machines are the most notable examples. In this scenario, we propose a performance comparison of an ESN, designed and trained using Bayesian Optimization techniques, against current RNN solutions. We aim to demonstrate that ESNs offer comparable accuracy, require minimal training time, and are better optimized in terms of memory usage and computational efficiency. Preliminary results show that ESNs are competitive with RNNs on a simple benchmark, and both training and inference are faster, with maximum speed-ups of 2.35x and 6.60x, respectively.
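
A minimal numpy sketch of an Echo State Network helps make the comparison concrete: the input and recurrent weights stay fixed and random (the Reservoir), and only a linear readout is trained. All hyperparameters below are illustrative; the paper tunes them with Bayesian Optimization.

```python
# Minimal Echo State Network: fixed random sparse reservoir plus a
# trained linear readout (fit here via least squares).
import numpy as np

rng = np.random.default_rng(0)
N_RES, N_IN, SPECTRAL_RADIUS = 100, 1, 0.9

# Fixed (untrained) weights: only the readout below is learned.
W_in = rng.uniform(-0.5, 0.5, size=(N_RES, N_IN))
W = rng.uniform(-0.5, 0.5, size=(N_RES, N_RES))
W *= rng.random((N_RES, N_RES)) < 0.1                   # sparsify (~10% dense)
W *= SPECTRAL_RADIUS / np.max(np.abs(np.linalg.eigvals(W)))  # echo-state scaling

def run_reservoir(u_seq):
    x, states = np.zeros(N_RES), []
    for u in u_seq:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)    # untrained dynamics
        states.append(x.copy())
    return np.array(states)

# Toy task: one-step-ahead prediction of a sine wave.
t = np.linspace(0, 8 * np.pi, 400)
u, y = np.sin(t[:-1]), np.sin(t[1:])
X = run_reservoir(u)
W_out, *_ = np.linalg.lstsq(X, y, rcond=None)           # train readout only
print("train MSE:", np.mean((X @ W_out - y) ** 2))
```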

18:30 End of session