12.7 Software optimization for emerging memory architectures and technologies


Date: Thursday 30 March 2017
Time: 16:00 - 17:30
Location / Room: 3B

Chair:
Amit Singh, University of Southampton, GB

Co-Chair:
Semeen Rehman, Technische Universitaet Dresden, DE

The papers in this session propose optimization techniques to improve the lifetime and performance of emerging technologies such as persistent memory and scalable many-cores. Architectural optimizations to improve the energy efficiency and performance of applications executing on GPU-based platforms are also presented.

Time  Label  Presentation Title
Authors
16:00  12.7.1  EFFICIENT STORAGE MANAGEMENT FOR AGED FILE SYSTEMS ON PERSISTENT MEMORY
Speaker:
Kaisheng Zeng, Tsinghua University, CN
Authors:
Kaisheng Zeng1, Youyou Lu1, Hu Wan2 and Jiwu Shu1
1Tsinghua University, CN; 2Capital Normal University, CN
Abstract
Emerging persistent memories (PMs) provide both the byte addressability of DRAM and the persistency of conventional storage technologies. Recent persistent memory file systems, such as BPFS and PMFS, achieve better performance by leveraging these dual characteristics. However, we observe that persistent memory file systems suffer dramatic performance degradation over a long run, a phenomenon referred to as file system aging. We find that this degradation is attributable to inefficient storage management for both file space and dentry space. We also find that persistent memories wear out more quickly as the file system ages. To address these issues, we propose SanGuo, a novel scatter-gather storage management mechanism for aged file systems on persistent memory. SanGuo consists of two key techniques. First, Scatter-alloc maximizes the efficiency and performance of file allocation while providing wear-leveling. Second, Gather-free accelerates dentry operations, including allocation, lookup, and reclaim, especially for directory files containing large numbers of dentries. Experimental results show that SanGuo achieves better wear-leveling while providing significant performance speedups (up to 10.33× and 8.9× for the Webproxy and Varmail workloads, respectively).
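The abstract does not spell out Scatter-alloc's policy, but the idea of scattering allocations for wear-leveling can be illustrated with a minimal C sketch: pick the free block with the fewest lifetime writes, so allocations spread across the device instead of repeatedly reusing the same region. All names (pm_block_t, scatter_alloc) and the policy itself are hypothetical stand-ins, not SanGuo's actual mechanism.

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

typedef struct {
    bool     is_free;     /* block currently unallocated */
    uint64_t write_count; /* lifetime writes, tracked for wear-leveling */
} pm_block_t;

/* In a real PM file system this map would be rebuilt from persistent
 * metadata at mount time; here it is just an in-memory array. */
static pm_block_t blocks[NUM_BLOCKS];

/* Allocate the free block with the fewest lifetime writes, so that
 * allocations scatter across the persistent-memory device instead of
 * hammering the most recently freed region. Returns -1 if full. */
static int scatter_alloc(void)
{
    int best = -1;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blocks[i].is_free &&
            (best < 0 || blocks[i].write_count < blocks[best].write_count))
            best = i;
    }
    if (best >= 0) {
        blocks[best].is_free = false;
        blocks[best].write_count++; /* allocation implies an upcoming write */
    }
    return best;
}

The linear scan is O(n) per allocation; a production allocator would keep free blocks in a structure ordered by wear, but the least-worn-first policy is the essential point.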

16:30  12.7.2  LOOKNN: NEURAL NETWORK WITH NO MULTIPLICATION
Speaker:
Tajana Rosing, UCSD, US
Authors:
Mohammad Samragh Razlighi1, Mohsen Imani1, Farinaz Koushanfar2 and Tajana Rosing2
1University of California San Diego, US; 2UCSD, US
Abstract
Neural networks are machine learning models that have been successfully applied in many domains. Due to their high computational complexity, deploying such models on embedded devices with severe power and resource constraints is challenging. Neural networks are inherently approximate and can be simplified. We propose LookNN, a methodology that replaces floating-point multiplications with look-up table searches. First, we devise an algorithmic solution to adapt conventional neural networks to LookNN such that the model's accuracy is minimally affected. We provide experimental results and theoretical analysis demonstrating the applicability of the method. Next, we design enhanced general-purpose processors for searching look-up tables: each processing element of our GPU has access to a small associative memory, enabling it to bypass redundant computations. Our evaluations on the AMD Southern Islands GPU architecture show that LookNN yields 2.2× energy savings and a 2.5× speedup across four different neural network applications with zero additive error. For the same four applications, if an additive error of less than 0.2% is tolerated, LookNN achieves an average of 3× energy improvement and 2.6× speedup over the traditional GPU architecture.
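To make the multiplication-free idea concrete, here is a minimal C sketch under one assumption not detailed in the abstract: the weights of a layer are clustered into a small codebook of K values. Each input activation then needs only K real multiplies (one per codebook entry), after which every weight-activation product across all M output neurons becomes a table lookup. The clustering step, the error analysis, and the paper's associative-memory hardware are not modeled; all names and values below are illustrative.

#include <stdio.h>

#define K 4  /* codebook size: distinct weight values after clustering */
#define N 4  /* layer fan-in (number of inputs) */
#define M 3  /* number of output neurons */

/* Hypothetical cluster centers; a real model would derive these. */
static const float codebook[K] = { -0.5f, -0.1f, 0.2f, 0.7f };

/* Each weight is stored as an index into the codebook. */
static const int weight_id[M][N] = {
    { 0, 3, 1, 2 },
    { 2, 2, 0, 3 },
    { 1, 0, 3, 3 },
};

/* Fully connected layer: N*K multiplies total instead of N*M; every
 * weight-activation product is fetched from a small product table. */
static void looknn_layer(const float x[N], float out[M])
{
    for (int m = 0; m < M; m++)
        out[m] = 0.0f;
    for (int i = 0; i < N; i++) {
        float products[K];
        for (int k = 0; k < K; k++)
            products[k] = x[i] * codebook[k]; /* the only multiplies */
        for (int m = 0; m < M; m++)
            out[m] += products[weight_id[m][i]]; /* lookup, no multiply */
    }
}

int main(void)
{
    const float x[N] = { 1.0f, -2.0f, 0.5f, 3.0f };
    float out[M];
    looknn_layer(x, out);
    for (int m = 0; m < M; m++)
        printf("out[%d] = %f\n", m, out[m]);
    return 0;
}

With K much smaller than M, the savings grow with layer width; the paper's GPU variant instead caches computed products in an associative memory next to each processing element, which this software sketch does not attempt to reproduce.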

17:00  12.7.3  PEGASUS: EFFICIENT DATA TRANSFERS FOR PGAS LANGUAGES ON NON-CACHE-COHERENT MANY-CORES
Speaker:
Manuel Mohr, Karlsruhe Institute of Technology, DE
Authors:
Manuel Mohr and Carsten Tradowsky, Karlsruhe Institute of Technology, DE
Abstract
To improve scalability, some many-core architectures abandon global cache coherence but still provide a shared address space. Partitioning the shared memory and communicating via messages is a safe way of programming such machines. However, accessing pointered data structures from a foreign memory partition is expensive due to the required serialization. In this paper, we propose a novel data transfer technique that avoids serialization overhead for pointered data structures by managing cache coherence in software at object granularity. We show that for PGAS programming languages, the compiler and runtime system can handle the necessary cache management entirely, requiring no changes to application code. Moreover, we explain how cache operations working on address ranges complement our data transfer technique, and we propose a novel non-blocking implementation of range-based cache operations that offloads them to an enhanced cache controller. We evaluate our approach on a non-cache-coherent many-core architecture using a distributed-kernel benchmark suite and demonstrate a reduction in communication time of up to 39.8%.
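The sender/receiver discipline such a scheme implies can be illustrated with a minimal C sketch: write back the cache lines covering an object before passing a pointer to it, and invalidate stale copies on the receiver before dereferencing. The range primitives and messaging helper below are hypothetical stand-ins (the paper offloads range operations to an enhanced cache controller, not modeled here), and the compiler/runtime integration that applies this per reachable object is omitted.

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* No-op stand-ins so the sketch compiles; on a real machine these map
 * to platform-specific cache-maintenance operations or, as in the
 * paper, to a cache controller that handles whole address ranges. */
static void cache_writeback_range(const void *addr, size_t len)  { (void)addr; (void)len; }
static void cache_invalidate_range(const void *addr, size_t len) { (void)addr; (void)len; }
static void send_pointer(int dest_core, const void *obj)         { (void)dest_core; (void)obj; }

/* Sender: push the object's bytes out to shared memory, then transfer
 * only a pointer; the pointered structure is never serialized. */
void transfer_object(int dest_core, const void *obj, size_t size)
{
    /* Round the address range out to cache-line granularity. */
    uintptr_t start = (uintptr_t)obj & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end   = ((uintptr_t)obj + size + CACHE_LINE - 1)
                      & ~(uintptr_t)(CACHE_LINE - 1);
    cache_writeback_range((const void *)start, (size_t)(end - start));
    send_pointer(dest_core, obj);
}

/* Receiver: discard any stale cached copies before dereferencing
 * (line-alignment handling as on the sender side omitted for brevity). */
const void *receive_object(const void *obj, size_t size)
{
    cache_invalidate_range(obj, size);
    return obj;
}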

17:30  End of session