12.7 Software optimization for emerging memory architectures and technologies


Date: Thursday 30 March 2017
Time: 16:00 - 17:30
Location / Room: 3B

Chair:
Amit Singh, University of Southampton, GB

Co-Chair:
Semeen Rehman, Technische Universitaet Dresden, DE

The papers in this session propose optimization techniques to improve the lifetime and performance of emerging technologies such as persistent memory and scalable many-cores. Architectural optimizations to improve the energy efficiency and performance of applications executing on GPU-based platforms are also presented.

Time  Label  Presentation Title
Authors
16:00  12.7.1  EFFICIENT STORAGE MANAGEMENT FOR AGED FILE SYSTEMS ON PERSISTENT MEMORY
Speaker:
Kaisheng Zeng, Tsinghua University, CN
Authors:
Kaisheng Zeng1, Youyou Lu1, Hu Wan2 and Jiwu Shu1
1Tsinghua University, CN; 2Capital Normal University, CN
Abstract
Emerging persistent memories (PMs) provide both the byte addressability of DRAM and the persistency of conventional storage technologies. Recent persistent memory file systems, such as BPFS and PMFS, achieve better performance by leveraging these dual characteristics. However, we observe that persistent memory file systems suffer dramatic performance degradation over a long run, a phenomenon referred to as file system aging. We find that this degradation is attributable to inefficient storage management for both file space and dentry space. We also find that persistent memories wear out more quickly as the file system ages. To address these issues, we propose SanGuo, a novel scatter-gather storage management mechanism for aged file systems on persistent memory. SanGuo consists of two key techniques. First, Scatter-alloc maximizes the efficiency and performance of file allocation while providing wear-leveling. Second, Gather-free accelerates dentry operations, including allocation, lookup, and reclaim, especially for directory files containing large numbers of dentries. Experimental results show that SanGuo achieves better wear-leveling while providing significant performance speedups (up to 10.33× and 8.9× for the Webproxy and Varmail workloads, respectively).
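The abstract does not spell out Scatter-alloc's policy, but the idea of scattering allocations for wear-leveling can be illustrated with a minimal C sketch: pick the free block with the fewest lifetime writes, so allocations spread across the device instead of repeatedly reusing the same region. All names (pm_block_t, scatter_alloc) and the policy itself are hypothetical stand-ins, not SanGuo's actual mechanism.

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

typedef struct {
    bool     is_free;     /* block currently unallocated */
    uint64_t write_count; /* lifetime writes, tracked for wear-leveling */
} pm_block_t;

/* In a real PM file system this map would be rebuilt from persistent
 * metadata at mount time; here it is just an in-memory array. */
static pm_block_t blocks[NUM_BLOCKS];

/* Allocate the free block with the fewest lifetime writes, so that
 * allocations scatter across the persistent-memory device instead of
 * hammering the most recently freed region. Returns -1 if full. */
static int scatter_alloc(void)
{
    int best = -1;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blocks[i].is_free &&
            (best < 0 || blocks[i].write_count < blocks[best].write_count))
            best = i;
    }
    if (best >= 0) {
        blocks[best].is_free = false;
        blocks[best].write_count++; /* allocation implies an upcoming write */
    }
    return best;
}

The linear scan is O(n) per allocation; a production allocator would keep free blocks in a structure ordered by wear, but the least-worn-first policy is the essential point.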

16:30  12.7.2  LOOKNN: NEURAL NETWORK WITH NO MULTIPLICATION
Speaker:
Tajana Rosing, UCSD, US
Authors:
Mohammad Samragh Razlighi1, Mohsen Imani1, Farinaz Koushanfar2 and Tajana Rosing2
1University of California San Diego, US; 2UCSD, US
Abstract
Neural networks are machine learning models that have been successfully applied in many domains. Due to their high computational complexity, deploying such models on embedded devices with severe power and resource constraints is challenging. Neural networks are inherently approximate and can be simplified. We propose LookNN, a methodology that replaces floating-point multiplications with look-up table searches. First, we devise an algorithmic solution to adapt conventional neural networks to LookNN such that the model's accuracy is minimally affected. We provide experimental results and theoretical analysis demonstrating the applicability of the method. Next, we design enhanced general-purpose processors for searching look-up tables: each processing element of our GPU has access to a small associative memory, enabling it to bypass redundant computations. Our evaluations on the AMD Southern Islands GPU architecture show that LookNN yields 2.2× energy savings and a 2.5× speedup across four different neural network applications with zero additive error. For the same four applications, if an additive error of less than 0.2% is tolerated, LookNN achieves an average of 3× energy improvement and 2.6× speedup over the traditional GPU architecture.
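To make the multiplication-free idea concrete, here is a minimal C sketch under one assumption not detailed in the abstract: the weights of a layer are clustered into a small codebook of K values. Each input activation then needs only K real multiplies (one per codebook entry), after which every weight-activation product across all M output neurons becomes a table lookup. The clustering step, the error analysis, and the paper's associative-memory hardware are not modeled; all names and values below are illustrative.

#include <stdio.h>

#define K 4  /* codebook size: distinct weight values after clustering */
#define N 4  /* layer fan-in (number of inputs) */
#define M 3  /* number of output neurons */

/* Hypothetical cluster centers; a real model would derive these. */
static const float codebook[K] = { -0.5f, -0.1f, 0.2f, 0.7f };

/* Each weight is stored as an index into the codebook. */
static const int weight_id[M][N] = {
    { 0, 3, 1, 2 },
    { 2, 2, 0, 3 },
    { 1, 0, 3, 3 },
};

/* Fully connected layer: N*K multiplies total instead of N*M; every
 * weight-activation product is fetched from a small product table. */
static void looknn_layer(const float x[N], float out[M])
{
    for (int m = 0; m < M; m++)
        out[m] = 0.0f;
    for (int i = 0; i < N; i++) {
        float products[K];
        for (int k = 0; k < K; k++)
            products[k] = x[i] * codebook[k]; /* the only multiplies */
        for (int m = 0; m < M; m++)
            out[m] += products[weight_id[m][i]]; /* lookup, no multiply */
    }
}

int main(void)
{
    const float x[N] = { 1.0f, -2.0f, 0.5f, 3.0f };
    float out[M];
    looknn_layer(x, out);
    for (int m = 0; m < M; m++)
        printf("out[%d] = %f\n", m, out[m]);
    return 0;
}

With K much smaller than M, the savings grow with layer width; the paper's GPU variant instead caches computed products in an associative memory next to each processing element, which this software sketch does not attempt to reproduce.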

17:00  12.7.3  PEGASUS: EFFICIENT DATA TRANSFERS FOR PGAS LANGUAGES ON NON-CACHE-COHERENT MANY-CORES
Speaker:
Manuel Mohr, Karlsruhe Institute of Technology, DE
Authors:
Manuel Mohr and Carsten Tradowsky, Karlsruhe Institute of Technology, DE
Abstract
To improve scalability, some many-core architectures abandon global cache coherence but still provide a shared address space. Partitioning the shared memory and communicating via messages is a safe way of programming such machines. However, accessing pointered data structures from a foreign memory partition is expensive due to the required serialization. In this paper, we propose a novel data transfer technique that avoids serialization overhead for pointered data structures by managing cache coherence in software at object granularity. We show that for PGAS programming languages, the compiler and runtime system can handle the necessary cache management entirely, requiring no changes to application code. Moreover, we explain how cache operations working on address ranges complement our data transfer technique, and we propose a novel non-blocking implementation of range-based cache operations that offloads them to an enhanced cache controller. We evaluate our approach on a non-cache-coherent many-core architecture using a distributed-kernel benchmark suite and demonstrate a reduction in communication time of up to 39.8%.
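The sender/receiver discipline such a scheme implies can be illustrated with a minimal C sketch: write back the cache lines covering an object before passing a pointer to it, and invalidate stale copies on the receiver before dereferencing. The range primitives and messaging helper below are hypothetical stand-ins (the paper offloads range operations to an enhanced cache controller, not modeled here), and the compiler/runtime integration that applies this per reachable object is omitted.

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* No-op stand-ins so the sketch compiles; on a real machine these map
 * to platform-specific cache-maintenance operations or, as in the
 * paper, to a cache controller that handles whole address ranges. */
static void cache_writeback_range(const void *addr, size_t len)  { (void)addr; (void)len; }
static void cache_invalidate_range(const void *addr, size_t len) { (void)addr; (void)len; }
static void send_pointer(int dest_core, const void *obj)         { (void)dest_core; (void)obj; }

/* Sender: push the object's bytes out to shared memory, then transfer
 * only a pointer; the pointered structure is never serialized. */
void transfer_object(int dest_core, const void *obj, size_t size)
{
    /* Round the address range out to cache-line granularity. */
    uintptr_t start = (uintptr_t)obj & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end   = ((uintptr_t)obj + size + CACHE_LINE - 1)
                      & ~(uintptr_t)(CACHE_LINE - 1);
    cache_writeback_range((const void *)start, (size_t)(end - start));
    send_pointer(dest_core, obj);
}

/* Receiver: discard any stale cached copies before dereferencing
 * (line-alignment handling as on the sender side omitted for brevity). */
const void *receive_object(const void *obj, size_t size)
{
    cache_invalidate_range(obj, size);
    return obj;
}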

17:30  End of session