8.5 Don't Forget the Memory


Date: Wednesday 27 March 2019
Time: 17:00 - 18:30
Location / Room: Room 5

Chair:
Christian Pilato, Politecnico di Milano, IT

Co-Chair:
Olivier Sentieys, INRIA, FR

Multi-core systems demand new solutions to overcome the widening memory gap, and emerging memory technologies still need to find a suitable place in the traditional memory hierarchy. This session showcases different proposals covering memory, storage, and the OS. The first presentation proposes a new method to improve directory entry lookup in deep directory structures. The second proposes a method to orchestrate multicore memory requests so as to preserve main-memory locality. The third improves the parallelism of the Open-Channel SSD Linux implementation. Two interactive presentations complete the session: a write-efficient cache replacement algorithm for NVM read caches and an application-aware aging analysis for SRAM design.

Time  Label  Presentation Title
Authors
17:00  8.5.1  DS-CACHE: A REFINED DIRECTORY ENTRY LOOKUP CACHE WITH PREFIX-AWARENESS FOR MOBILE DEVICES
Speaker:
Zhaoyan Shen, Shandong University, CN
Authors:
Lei Han1, Bin Xiao1, Xuwei Dong2, Zhaoyan Shen3 and Zili Shao4
1The Hong Kong Polytechnic University, HK; 2Northwestern Polytechnical University, CN; 3Shandong University, CN; 4The Chinese University of Hong Kong, HK
Abstract
Modern mobile devices are filled with files organized in deep directory hierarchies, and applications generate heavy I/O activity on them. The directory cache is adopted to accelerate file lookup operations in the virtual file system. However, the original directory cache recursively walks every component of a path for each lookup, leading to inefficient lookups and a low cache hit ratio. In this paper, we fully investigate, for the first time, the characteristics of directory entry lookup on mobile devices. Based on our findings, we propose a new directory cache scheme, called Dynamic Skipping Cache (DS-Cache), which adopts an ASCII-based hash table to reduce path lookup complexity by skipping the common prefixes of paths. We also design a novel lookup scheme to improve the directory cache hit ratio. We have implemented and deployed DS-Cache on a Google Nexus 6P smartphone. Experimental results show that DS-Cache reduces the latency of system calls by up to 57.4% and the completion time of real-world mobile applications by up to 64%.
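
As a concrete illustration of the prefix-skipping idea, the following minimal C sketch hashes whole directory paths so that a lookup can validate a long common prefix with one probe per candidate instead of walking it component by component. All structures and names here are invented for illustration; they are not the paper's in-kernel implementation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

struct entry {                          /* one cached directory path */
    char path[256];
    struct entry *next;
};

static struct entry *table[NBUCKETS];

static unsigned hash_path(const char *s)
{
    unsigned h = 5381;                  /* djb2 over the whole path */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static void cache_insert(const char *path)
{
    struct entry *e = malloc(sizeof *e);
    unsigned b = hash_path(path);
    snprintf(e->path, sizeof e->path, "%s", path);
    e->next = table[b];
    table[b] = e;
}

static int cache_contains(const char *path)
{
    for (struct entry *e = table[hash_path(path)]; e; e = e->next)
        if (strcmp(e->path, path) == 0)
            return 1;
    return 0;
}

/* Find the longest cached prefix of 'path' with one hash probe per
 * candidate (peeling components off the right end), so the recursive
 * walk over the skipped components is avoided.  Returns its length. */
static size_t longest_cached_prefix(const char *path)
{
    char buf[256];
    snprintf(buf, sizeof buf, "%s", path);
    for (char *p = strrchr(buf, '/'); p && p != buf; p = strrchr(buf, '/')) {
        *p = '\0';                      /* drop the last component */
        if (cache_contains(buf))
            return strlen(buf);         /* whole prefix skipped */
    }
    return 0;                           /* no cached prefix: full walk */
}

int main(void)
{
    cache_insert("/data/app/com.example.pkg");
    printf("skipped prefix of %zu bytes\n",
           longest_cached_prefix("/data/app/com.example.pkg/base.apk"));
    return 0;
}

In a real directory cache the skipped prefix would map directly to a cached dentry, and only the remaining suffix components would still need a per-component walk.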

17:30  8.5.2  (Best Paper Award Candidate)
IMPROVING THE DRAM ACCESS EFFICIENCY FOR MATRIX MULTIPLICATION ON MULTICORE ACCELERATORS
Speaker:
Sheng Ma, National University of Defense Technology, CN
Authors:
Sheng Ma, Yang Guo, Shenggang Chen, Libo Huang and Zhiying Wang, National University of Defense Technology, CN
Abstract
The parallelization of matrix multiplication on multicore accelerators divides a matrix into several partitions. The existing design deploys an independent DMA transfer for each core to access its own partition in DRAM. This design has poor memory access efficiency, since the memory access streams of the concurrent DMA transfers interfere with one another. We propose Distributed-DMA (D-DMA), which invokes a single transfer to serve all cores. D-DMA accesses data in a row-major manner to exploit inter-partition locality and thereby improve DRAM access efficiency. Compared with the baseline design, D-DMA improves bandwidth by 84.8% and reduces DRAM energy consumption by 43.1% on micro-benchmarks, and it achieves higher performance on the GEMM benchmark. With much lower hardware cost, D-DMA significantly outperforms an out-of-order memory controller.
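
The locality argument behind D-DMA can be made tangible with a toy DRAM row-buffer model: interleaved per-core DMA streams keep jumping between the cores' partitions and re-activating rows, while a single row-major transfer streams through each row once. The C sketch below counts row activations under both access orders; the matrix size, column partitioning, and single-row-buffer model are simplifying assumptions, not the paper's setup.

#include <stdio.h>

#define N       64        /* N x N matrix of 8-byte elements, row-major */
#define CORES   4         /* each core owns a block of N/CORES columns  */
#define ROW_SZ  128       /* bytes covered by one DRAM row buffer       */
#define ELEM_SZ 8

static long activations;
static long active_row = -1;

static void touch(int r, int c)
{
    long dram_row = ((long)r * N + c) * ELEM_SZ / ROW_SZ;
    if (dram_row != active_row) {       /* row-buffer miss: activate */
        active_row = dram_row;
        activations++;
    }
}

int main(void)
{
    int cols = N / CORES;

    /* Baseline: one DMA stream per core; requests from the streams
     * interleave at the memory controller, jumping between the cores'
     * column partitions and thrashing the row buffer. */
    activations = 0; active_row = -1;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < cols; c++)
            for (int core = 0; core < CORES; core++)
                touch(r, core * cols + c);
    printf("interleaved per-core DMA: %ld row activations\n", activations);

    /* D-DMA style: one transfer walks the matrix in row-major order and
     * scatters each element to its owning core (c / cols), so DRAM sees
     * a single sequential stream. */
    activations = 0; active_row = -1;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            touch(r, c);
    printf("single row-major transfer: %ld row activations\n", activations);
    return 0;
}

With these toy parameters the interleaved streams activate a row on essentially every access, while the sequential order activates each DRAM row exactly once.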

18:00  8.5.3  QBLK: TOWARDS FULLY EXPLOITING THE PARALLELISM OF OPEN-CHANNEL SSDS
Speaker:
Hongwei Qin, Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System and Engineering Research Center of Data Storage Systems and Technology (Ministry of Education of China), School of Computer Science and Technology, Huazhong University of Science and Technology, CN
Authors:
Hongwei Qin, Dan Feng, Wei Tong, Jingning Liu and Yutong Zhao, Wuhan National Laboratory for Optoelectronics, CN
Abstract
By exposing physical channels to host software, Open-Channel SSDs show great potential for future high-performance storage systems. However, the existing scheme fails to achieve acceptable performance under heavy workloads. The main reasons lie not only in its single-buffer architecture but, more importantly, in its line-based physical address management. In addition, the lock protecting the address mapping table becomes a performance burden under heavy workloads. We propose QBLK, an open-source driver that better exploits the parallelism of Open-Channel SSDs. QBLK adopts four key techniques: (1) multi-queue-based buffering, (2) per-channel address management, (3) lock-free address mapping, and (4) fine-grained draining. Experimental results show that QBLK achieves up to 97.4% higher bandwidth than the state-of-the-art PBLK scheme.
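
Two of the listed techniques, per-channel address management and lock-free address mapping, can be sketched compactly with C11 atomics: each channel allocates physical pages from its own counter (so channels never contend), and the logical-to-physical map is updated with compare-and-swap rather than a global lock. Everything below (structure layout, address packing, names) is an assumed illustration, not QBLK's actual code.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NCHAN 8                 /* flash channels exposed by the device */
#define NLBA  1024              /* logical block addresses              */

static _Atomic uint64_t l2p[NLBA];        /* packed PPA per LBA         */
static _Atomic uint32_t next_page[NCHAN]; /* per-channel page allocator */

/* Per-channel address management: each channel hands out its own next
 * physical page, so allocations on different channels never contend. */
static uint64_t alloc_ppa(unsigned chan)
{
    uint32_t page = atomic_fetch_add(&next_page[chan], 1);
    return ((uint64_t)chan << 32) | page;      /* pack channel + page */
}

/* Lock-free mapping update: compare-and-swap instead of a table lock.
 * A real driver would inspect 'old' (e.g. to drop a stale GC update
 * racing with a fresh host write); here we simply retry until we win. */
static void l2p_update(uint32_t lba, uint64_t new_ppa)
{
    uint64_t old = atomic_load(&l2p[lba]);
    while (!atomic_compare_exchange_weak(&l2p[lba], &old, new_ppa))
        ;                       /* failed CAS refreshed 'old'; retry */
}

int main(void)
{
    /* Stripe consecutive LBAs across channels, as multi-queue buffers
     * flushing to all channels in parallel would. */
    for (uint32_t lba = 0; lba < 16; lba++)
        l2p_update(lba, alloc_ppa(lba % NCHAN));

    for (uint32_t lba = 0; lba < 16; lba++) {
        uint64_t ppa = atomic_load(&l2p[lba]);
        printf("lba %2u -> channel %u, page %u\n", (unsigned)lba,
               (unsigned)(ppa >> 32), (unsigned)(ppa & 0xffffffffu));
    }
    return 0;
}

Readers of the map only perform an atomic load, so lookups never block behind writers, which is the property a single table lock destroys under heavy workloads.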

18:30  IP4-2, 1013  A WRITE-EFFICIENT CACHE ALGORITHM BASED ON MACROSCOPIC TREND FOR NVM-BASED READ CACHE
Speaker:
Ning Bao, Renmin University of China, CN
Authors:
Ning Bao1, Yunpeng Chai1 and Xiao Qin2
1Renmin University of China, CN; 2Auburn University, US
Abstract
Compared with traditional storage technologies, non-volatile memory (NVM) offers excellent I/O performance, but suffers from high cost and limited write endurance (e.g., NAND flash and PCM) or high write energy (e.g., STT-MRAM). Storage systems therefore prefer to utilize NVM devices as read caches for a performance boost. Unlike write caches, read caches have greater potential for write reduction because their writes are triggered only by cache updates. However, traditional cache algorithms like LRU and LFU update cached blocks frequently because it is difficult for them to predict long-term data popularity. Although newer algorithms like SieveStore reduce cache write pressure, they still rely on those traditional schemes for popularity prediction; because that long-term prediction is poor, they suffer a significant and unnecessary drop in cache hit ratio. In this paper, we propose a new Macroscopic Trend (MT) cache replacement algorithm that reduces cache updates effectively while maintaining a high cache hit ratio. The algorithm discovers long-term hot data by observing the macroscopic trend of data blocks. We have conducted extensive experiments driven by a series of real-world traces, and the results indicate that, compared with LRU, the MT algorithm achieves a 15.28 times longer lifetime (or correspondingly lower energy consumption) for NVM caches with a similar hit ratio.
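
The core idea, admitting data into the NVM cache only when its popularity shows a rising long-term trend, can be sketched as a simple epoch-based filter. The C sketch below compares per-epoch access counts and pays one NVM write only on admission; the epoch length, thresholds, and eviction rule are illustrative assumptions, not the paper's MT algorithm.

#include <stdio.h>

#define NBLOCKS 16
#define EPOCH   8               /* accesses per observation epoch */

static int prev_cnt[NBLOCKS], cur_cnt[NBLOCKS];
static int cached[NBLOCKS];     /* 1 if the block resides in the NVM cache */
static long nvm_writes;
static int tick;

static void end_epoch(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (!cached[b] && cur_cnt[b] > prev_cnt[b] && cur_cnt[b] > 1) {
            cached[b] = 1;      /* rising long-term trend: admit   */
            nvm_writes++;       /* one NVM write per admission     */
        } else if (cached[b] && cur_cnt[b] == 0) {
            cached[b] = 0;      /* went cold: evict (no NVM write) */
        }
        prev_cnt[b] = cur_cnt[b];
        cur_cnt[b] = 0;
    }
}

static int access_block(int b)  /* returns 1 on a cache hit */
{
    cur_cnt[b]++;
    if (++tick % EPOCH == 0)
        end_epoch();
    return cached[b];
}

int main(void)
{
    int hits = 0, total = 0;
    /* Block 3 is persistently hot; every other block is touched only
     * occasionally.  An update-on-miss policy such as LRU would write
     * the NVM cache on each of those one-off misses. */
    for (int round = 0; round < 64; round++) {
        hits += access_block(3); total++;
        hits += access_block(3); total++;
        hits += access_block(round % NBLOCKS); total++;
    }
    printf("hits %d/%d, NVM cache writes %ld\n", hits, total, nvm_writes);
    return 0;
}

Under this toy policy the persistently hot block is admitted once and then served from cache, while the one-off blocks never trigger an NVM write at all.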

18:31  IP4-3, 626  SRAM DESIGN EXPLORATION WITH INTEGRATED APPLICATION-AWARE AGING ANALYSIS
Speaker:
Alexandra Listl, Technical University of Munich, DE
Authors:
Alexandra Listl1, Daniel Mueller-Gritschneder1, Sani Nassif2 and Ulf Schlichtmann1
1Technical University of Munich, DE; 2Radyalis, US
Abstract
On-chip SRAMs are an integral part of safety-critical Systems-on-Chip. At the same time, however, they are among the components most susceptible to reliability threats such as Bias Temperature Instability (BTI), which is aggravated by the continuing trend of technology scaling. BTI leads to significant performance degradation, especially in the Sense Amplifiers (SAs) of SRAMs, where failures are fatal since the data of a whole column is destroyed. Because BTI strongly depends on the workload of an application, the aging rates of the SAs in a memory array differ significantly, and incorporating workload information into aging simulations is vital. Especially in safety-critical systems, precise estimation of application-specific reliability requirements to predict memory lifetime is a key concern. In this paper we present a workload-aware aging analysis for on-chip SRAMs that incorporates the workload of real applications executed on a processor. Based on this workload, we predict the performance degradation of the SAs in the memory. We integrate this aging analysis into an aging-aware SRAM design exploration framework that generates and characterizes memories of different array granularity to select the most reliable memory architecture for the intended application. We show that this technique can mitigate SA degradation significantly, depending on the environmental conditions and the application workload.
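
To make the workload dependence concrete, the short C sketch below turns an assumed per-column stress duty cycle (how often a column exposes the logic value that stresses one side of the SA's differential pair) into a threshold-voltage shift via a generic power-law BTI model. The constants, the 60 mV margin, and the stress model are illustrative assumptions, not the paper's calibrated SA model.

#include <stdio.h>
#include <math.h>

#define COLS  8
#define YEARS 5.0

int main(void)
{
    /* Assumed workload profile: fraction of operation time each column
     * spends under BTI stress, as extracted from an application trace. */
    double duty[COLS] = {0.9, 0.7, 0.5, 0.5, 0.3, 0.2, 0.1, 0.05};

    double t = YEARS * 365 * 24 * 3600;   /* seconds of operation    */
    double A = 3.0e-3, n = 0.17;          /* toy BTI fit parameters  */

    for (int c = 0; c < COLS; c++) {
        /* Long-term BTI threshold-voltage shift under partial stress:
         * dVth ~ A * (duty * t)^n  (volts) */
        double dvth = A * pow(duty[c] * t, n);
        printf("col %d: duty %.2f -> dVth %.1f mV%s\n",
               c, duty[c], dvth * 1e3,
               dvth > 0.06 ? "  (exceeds assumed 60 mV SA margin)" : "");
    }
    return 0;
}

Even in this toy model the most-stressed and least-stressed columns end up tens of millivolts apart, which is why a single worst-case aging estimate over- or under-designs most of the array.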

18:30  End of session