8.3 Optimizing System-Level Design for Machine Learning


Date: Wednesday 11 March 2020
Time: 17:00 - 18:30
Location / Room: Autrans

Chair:
Luciano Lavagno, Politecnico di Torino, IT

Co-Chair:
Yuko Hara-Azumi, Tokyo Institute of Technology, JP

In recent years, the use of ML techniques such as deep neural networks has become a trend in system-level design, either to help the design flow find promising solutions or to deploy ML-based applications. This session presents various approaches to optimizing several aspects of system-level design, such as the mapping of applications onto heterogeneous platforms, the inference of CNNs, and file-system usage.

Time  Label  Presentation Title / Authors
17:00  8.3.1  ESP4ML: PLATFORM-BASED DESIGN OF SYSTEMS-ON-CHIP FOR EMBEDDED MACHINE LEARNING
Speaker:
Davide Giri, Columbia University, US
Authors:
Davide Giri, Kuan-Lin Chiu, Giuseppe Di Guglielmo, Paolo Mantovani and Luca Carloni, Columbia University, US
Abstract
We present ESP4ML, an open-source system-level design flow to build and program SoC architectures for embedded applications that require the hardware acceleration of machine learning and signal processing algorithms. We realized ESP4ML by combining two established open-source projects (ESP and HLS4ML) into a new, fully-automated design flow. For the SoC integration of accelerators generated by HLS4ML, we designed a set of new parameterized interface circuits synthesizable with high-level synthesis. For accelerator configuration and management, we developed an embedded software runtime system on top of Linux. With this HW/SW layer, we addressed the challenge of dynamically shaping the data traffic on a network-on-chip to activate and support the reconfigurable pipelines of accelerators that are needed by the application workloads currently running on the SoC. We demonstrate our vertically-integrated contributions with the FPGA-based implementations of complete SoC instances booting Linux and executing computer-vision applications that process images taken from the Google Street View database.
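
The abstract's key software contribution is a runtime layer that composes accelerators into pipelines on demand by reprogramming the network-on-chip. The Python sketch below only illustrates that idea at a high level; the class and field names (Accelerator, Pipeline, noc_tile) are hypothetical and are not the ESP4ML API, which is a C/Linux runtime described in the paper.

```python
# Illustrative sketch only: dynamically composing accelerator pipelines at run time.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Accelerator:
    name: str       # e.g., an HLS4ML-generated classifier or a noise-filtering kernel
    noc_tile: int   # network-on-chip tile hosting the accelerator (hypothetical field)

@dataclass
class Pipeline:
    stages: List[Accelerator] = field(default_factory=list)

    def activate(self):
        # In the real system this step would be performed by the runtime on Linux,
        # programming the NoC so each stage streams its output directly to the next
        # stage's tile instead of bouncing through off-chip memory.
        for src, dst in zip(self.stages, self.stages[1:]):
            print(f"route tile {src.noc_tile} -> tile {dst.noc_tile}")

# Example: a two-stage computer-vision pipeline (denoise, then classify).
Pipeline([Accelerator("denoiser", 3), Accelerator("cnn_classifier", 5)]).activate()
```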

17:30  8.3.2  PROBABILISTIC SEQUENTIAL MULTI-OBJECTIVE OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS
Speaker:
Zixuan Yin, McGill University, CA
Authors:
Zixuan Yin, Warren Gross and Brett Meyer, McGill University, CA
Abstract
With the advent of deeper, larger and more complex convolutional neural networks (CNN), manual design has become a daunting task, especially when hardware performance must be optimized. Sequential model-based optimization (SMBO) is an efficient method for hyperparameter optimization on highly parameterized machine learning (ML) algorithms, able to find good configurations with a limited number of evaluations by predicting the performance of candidates before evaluation. A case study on MNIST shows that SMBO regression model prediction error significantly impedes search performance in multi-objective optimization. To address this issue, we propose probabilistic SMBO, which selects candidates based on probabilistic estimation of their Pareto efficiency. With a formulation that incorporates error in accuracy prediction and uncertainty in latency measurement, probabilistic Pareto efficiency quantifies a candidate's quality in two ways: its likelihood of being Pareto optimal, and the expected number of current Pareto optimal solutions that it will dominate. We evaluate our proposed method on four image classification problems. Compared to a deterministic approach, probabilistic SMBO consistently generates Pareto optimal solutions that perform better, and that are competitive with state-of-the-art efficient CNN models, offering tremendous speedup in inference latency while maintaining comparable accuracy.
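
The abstract defines probabilistic Pareto efficiency through two quantities: the likelihood that a candidate is Pareto optimal and the expected number of current Pareto-optimal solutions it would dominate. As a rough illustration only, the sketch below computes both for a two-objective accuracy/latency front, assuming independent Gaussian error on the predicted accuracy and Gaussian uncertainty on the measured latency; the function names and independence assumptions are ours, not the paper's exact formulation.

```python
import math

def _phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pareto_scores(acc_mu, acc_sigma, lat_mu, lat_sigma, pareto_front):
    """Score a candidate whose predicted accuracy ~ N(acc_mu, acc_sigma) and whose
    latency estimate ~ N(lat_mu, lat_sigma) against the current Pareto front, given
    as (accuracy, latency) pairs (higher accuracy and lower latency are better).

    Returns (probability the candidate is non-dominated, expected number of front
    points it dominates), assuming independent Gaussian errors."""
    p_nondominated = 1.0
    expected_dominated = 0.0
    for acc_i, lat_i in pareto_front:
        p_acc_better = 1.0 - _phi((acc_i - acc_mu) / acc_sigma)  # P(candidate acc >= acc_i)
        p_lat_better = _phi((lat_i - lat_mu) / lat_sigma)        # P(candidate lat <= lat_i)
        expected_dominated += p_acc_better * p_lat_better        # candidate dominates this point
        p_dominated_by_i = (1.0 - p_acc_better) * (1.0 - p_lat_better)
        p_nondominated *= 1.0 - p_dominated_by_i                 # candidate survives this point
    return p_nondominated, expected_dominated

# Example: one candidate scored against a two-point front.
print(pareto_scores(0.92, 0.01, 8.0, 0.5, [(0.90, 7.0), (0.94, 12.0)]))
```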

18:00  8.3.3  ARS: REDUCING F2FS FRAGMENTATION FOR SMARTPHONES USING DECISION TREES
Speaker:
Lihua Yang, Huazhong University of Science & Technology, CN
Authors:
Lihua Yang, Fang Wang, Zhipeng Tan, Dan Feng, Jiaxing Qian and Shiyun Tu, Huazhong University of Science & Technology, CN
Abstract
File and free space fragmentation are well known to negatively affect file system performance. F2FS is a file system designed for flash memory. However, it suffers from severe fragmentation due to its out-of-place updates and the highly synchronous, multi-threaded writing behaviors of mobile applications. We observe that the running time of fragmented files is 2.36X longer than that of continuous files and that F2FS's in-place update scheme is incapable of reducing fragmentation. A fragmented file system leads to a poor user experience. Reserving space to prevent fragmentation is an intuitive approach. However, reserving space for all files wastes space, since there are a large number of files. To deal with this dilemma, we propose an adaptive reserved space (ARS) scheme that chooses specific files to update in the reserved space. How to effectively select reserved files is critical to performance. We collect file characteristics associated with fragmentation to construct data sets and use decision trees to accurately pick reserved files. In addition, an adjustable reserved space and a dynamic reservation strategy are adopted. We implement ARS on a HiKey960 development platform and a commercial smartphone with only slight space and file-creation-time overheads. Experimental results show that ARS reduces file and free space fragmentation dramatically, improves file I/O performance, and reduces garbage collection overhead compared to traditional F2FS and F2FS with in-place updates. Furthermore, ARS delivers up to 1.26X the transactions per second of traditional F2FS under SQLite and reduces the running time of Facebook and Twitter by up to 41.72% compared to F2FS with in-place updates.
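
The core mechanism is a decision tree that picks which files should be updated in the reserved space, trained on file characteristics associated with fragmentation. The sketch below shows the general shape of such a selection step; the feature set, training data, and labels are hypothetical placeholders rather than the ones used by ARS.

```python
# Minimal sketch of a decision-tree file-selection step (hypothetical features/labels).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: [file size (KB), overwrite count, append ratio, sync writes per minute]
X = np.array([
    [120,  45, 0.10, 30],   # small, frequently overwritten database file
    [2048,  2, 0.90,  1],   # large, append-mostly media file
    [300,  60, 0.20, 25],
    [5000,  1, 0.95,  0],
])
y = np.array([1, 0, 1, 0])  # 1 = update this file in the reserved space

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

def should_reserve(file_features):
    """Return True if the tree predicts the file is worth placing in reserved space."""
    return bool(clf.predict([file_features])[0])

print(should_reserve([250, 50, 0.15, 20]))
```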

18:30  IP4-3, 22  A MACHINE LEARNING BASED WRITE POLICY FOR SSD CACHE IN CLOUD BLOCK STORAGE
Speaker:
Yu Zhang, Huazhong University of Science & Technology, CN
Authors:
Yu Zhang (1), Ke Zhou (1), Ping Huang (2), Hua Wang (1), Jianying Hu (3), Yangtao Wang (1), Yongguang Ji (3) and Bin Cheng (3)
(1) Huazhong University of Science & Technology, CN; (2) Temple University, US; (3) Tencent Technology (Shenzhen) Co., Ltd., CN
Abstract
Nowadays, SSD caches play an important role in cloud storage systems. The associated write policy, which enforces an admission control policy regarding filling data into the cache, has a significant impact on the performance of the cache system and the amount of write traffic to SSD caches. Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to blocks that are not read during a certain time window. Naively writing the write-only data to the SSD cache unnecessarily introduces a large number of harmful writes to the SSD cache without any contribution to cache performance. On the other hand, it is a challenging task to identify and filter out those write-only data in a real-time manner, especially in a cloud environment running changing and diverse workloads. In this paper, to alleviate the above cache problem, we propose ML-WP, a Machine Learning based Write Policy, which reduces write traffic to SSDs by avoiding writing write-only data. The main challenge in this approach is to identify write-only data in a real-time manner. To realize ML-WP and achieve accurate write-only data identification, we use machine learning methods to classify data into two groups (i.e., write-only and normal data). Based on this classification, the write-only data is directly written to backend storage without being cached. Experimental results show that, compared with the write-back policy widely deployed in industry, ML-WP decreases write traffic to the SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.
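
The policy hinges on classifying each incoming write as write-only or normal in real time and admitting only normal data into the SSD cache. The sketch below illustrates that admission decision with a toy classifier; the features, training data, and the cache/backend interfaces are hypothetical, and the sketch assumes a simple write-through path rather than the write-back setup evaluated in the paper.

```python
# Minimal sketch of ML-guided cache admission (hypothetical features and interfaces).
from sklearn.linear_model import LogisticRegression

# Each row: [recent read count for the block, recent write count, hour of day]
X = [[5, 1, 10], [0, 8, 2], [3, 2, 14], [0, 12, 3]]
y = [0, 1, 0, 1]            # 1 = write-only, 0 = normal
clf = LogisticRegression().fit(X, y)

def handle_write(block_features, data, ssd_cache, backend):
    """Route a write: predicted write-only data bypasses the SSD cache."""
    write_only = clf.predict([block_features])[0] == 1
    if write_only:
        backend.write(data)      # skip the cache to save SSD write traffic
    else:
        ssd_cache.write(data)    # cache normal data for future reads
        backend.write(data)      # simple write-through path, assumed for this sketch
```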

18:31  IP4-4, 47  YOU ONLY SEARCH ONCE: A FAST AUTOMATION FRAMEWORK FOR SINGLE-STAGE DNN/ACCELERATOR CO-DESIGN
Speaker:
Weiwei Chen, Chinese Academy of Sciences, CN
Authors:
Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN
Abstract
DNN/accelerator co-design has shown great potential in improving QoR and performance. Typical approaches separate the design flow into two stages: (1) designing an application-specific DNN model with high accuracy; (2) building an accelerator considering the DNN's specific characteristics. However, this may fail to deliver the highest composite score, which combines the goal of accuracy with other hardware-related constraints (e.g., latency, energy efficiency), when building a specific neural-network-based system. In this work, we present a single-stage automated framework, YOSO, aiming to generate an optimal software-and-hardware solution that flexibly balances accuracy, power, and QoS. Compared with the two-stage method on the baseline systolic array accelerator and the CIFAR-10 dataset, we achieve 1.42x~2.29x energy reduction or 1.79x~3.07x latency reduction at the same level of precision, for different user-specified energy and latency optimization constraints, respectively.
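
The single-stage idea is to score joint (DNN, accelerator) configurations with one composite objective instead of fixing the network first and the hardware second. The sketch below shows an exhaustive search over a tiny hypothetical design space with a penalty-based composite score; YOSO's actual search algorithm, cost models, and scoring function are described in the paper.

```python
# Illustrative single-stage co-search over a toy design space (all names hypothetical).
import itertools
import random

dnn_space = {"depth": [8, 14, 20], "width": [16, 32, 64]}
hw_space = {"pe_rows": [8, 16], "pe_cols": [8, 16], "buffer_kb": [64, 128]}

def evaluate(dnn, hw):
    """Stand-in for the real accuracy/energy/latency estimators used during the search."""
    rng = random.Random(str(sorted(dnn.items())) + str(sorted(hw.items())))
    return rng.uniform(0.85, 0.95), rng.uniform(1.0, 5.0), rng.uniform(2.0, 10.0)

def composite_score(acc, energy, latency, energy_budget=4.0, latency_budget=8.0):
    """Single objective: accuracy minus a penalty for violating user-specified budgets."""
    penalty = max(0.0, energy - energy_budget) + max(0.0, latency - latency_budget)
    return acc - 0.1 * penalty

best, best_score = None, float("-inf")
for d in itertools.product(*dnn_space.values()):
    for h in itertools.product(*hw_space.values()):
        dnn, hw = dict(zip(dnn_space, d)), dict(zip(hw_space, h))
        score = composite_score(*evaluate(dnn, hw))
        if score > best_score:
            best, best_score = (dnn, hw), score

print(best, round(best_score, 3))
```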

18:30  End of session