6.4 Microarchitecture to the rescue of memory


Date: Wednesday 11 March 2020
Time: 11:00 - 12:30
Location / Room: Stendhal

Chair:
Olivier Sentieys, INRIA, FR

Co-Chair:
Jeronimo Castrillon, TU Dresden, DE

This session discusses microarchitectural innovations across three memory technologies: caches, 3D-stacked DRAM, and non-volatile memory. The papers exploit several forms of redundancy to maximize cache utilization through compression, use multicast in 3D-stacked high-speed memories for graph analytics, and present a microarchitectural solution that unifies persistency and encryption in non-volatile memories.

Time | Label | Presentation Title / Authors

11:00 | 6.4.1 | EFFICIENT HARDWARE-ASSISTED CRASH CONSISTENCY IN ENCRYPTED PERSISTENT MEMORY
Speaker:
Zhan Zhang, Huazhong University of Science & Technology, CN
Authors:
Zhan Zhang1, Jianhui Yue2, Xiaofei Liao1 and Hai Jin1
1Huazhong University of Science & Technology, CN; 2Michigan Technological University, US
Abstract
Persistent memory (PM) requires maintaining crash consistency and encrypting data to ensure data recoverability and data confidentiality. Enforcing these two goals not only puts more burden on programmers but also degrades performance. To address this issue, we propose a hardware-assisted encrypted persistent memory system in which logging and data encryption are assisted by hardware. Furthermore, we apply counter-based encryption to data and cipher feedback (CFB) mode encryption to the log, reducing the encryption overhead. Our preliminary experimental results show that the transaction throughput of the proposed design outperforms the baseline design by up to 34.4%.

Download Paper (PDF; Only available from the DATE venue WiFi)
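The split the abstract describes — counter-mode encryption for in-place data, CFB mode for the sequential log — can be sketched in a toy model. This is an illustrative sketch only, not the paper's design: a SHA-256-based keystream stands in for the AES block cipher, and all names and parameters are hypothetical. The point it shows is structural: the counter-mode pad depends only on (address, counter) and can be precomputed off the critical path, while CFB chains each ciphertext into the next keystream, which fits append-only log writes.

```python
import hashlib

KEY = b"toy-key"  # placeholder secret; a real design would use an AES key

def prf(data: bytes) -> bytes:
    """Toy pseudo-random function standing in for an AES block cipher."""
    return hashlib.sha256(KEY + data).digest()[:16]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def ctr_encrypt(block: bytes, line_addr: int, counter: int) -> bytes:
    # Counter mode: the keystream pad depends only on (address, counter),
    # so it can be generated before the data arrives. Encrypt and decrypt
    # are the same XOR.
    pad = prf(line_addr.to_bytes(8, "little") + counter.to_bytes(8, "little"))
    return xor(block, pad)

def cfb_encrypt(log_entries, iv: bytes):
    # Cipher feedback: each ciphertext feeds the next keystream block,
    # matching the sequential, append-only nature of a write-ahead log.
    out, prev = [], iv
    for entry in log_entries:
        ct = xor(entry, prf(prev))
        out.append(ct)
        prev = ct
    return out
```

Decryption in CFB mode recomputes `prf(prev)` from the previous ciphertext and XORs it back, so the log can be replayed in order after a crash.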
11:30 | 6.4.2 | 2DCC: CACHE COMPRESSION IN TWO DIMENSIONS
Speaker:
Amin Ghasemazar, University of British Columbia, CA
Authors:
Amin Ghasemazar1, Mohammad Ewais2, Prashant Nair1 and Mieszko Lis1
1University of British Columbia, CA; 2University of Toronto, CA
Abstract
The importance of caches for performance, together with their high silicon area cost, has led to an interest in hardware solutions that transparently compress the cached data to increase effective capacity without sacrificing silicon area. Work to date has taken one of two tacks: either (a) deduplicating identical cache blocks across the cache to take advantage of inter-block redundancy or (b) identifying and compressing common patterns within each cache block to take advantage of intra-block redundancy. In this paper, we demonstrate that leveraging only one of these redundancy types leads to significant loss of compression opportunities in many applications: some workloads exhibit either inter-block or intra-block redundancy, while others exhibit both. We propose 2DCC, a simple technique that takes advantage of both types of redundancy. Across the SPEC and Parsec benchmark suites, 2DCC results in a 2.12× compression factor (geomean) compared to 1.44-1.49× for the best prior techniques on an iso-silicon basis. For the cache-sensitive subset of these benchmarks run in isolation, 2DCC also achieves an 11.7% speedup (geomean).

Download Paper (PDF; Only available from the DATE venue WiFi)
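The two redundancy dimensions the abstract contrasts can be illustrated with a toy model — not the 2DCC hardware, just the idea: first deduplicate identical blocks across the cache (inter-block), then pattern-compress each surviving unique block (intra-block). The pattern scheme here (collapsing a block made of one repeated 4-byte word) is a stand-in for real frequent-pattern encodings; all names are hypothetical.

```python
def intra_compress(block: bytes) -> bytes:
    """Toy intra-block scheme: a block that is one 4-byte pattern
    repeated end-to-end stores only that pattern."""
    pat = block[:4]
    if pat * (len(block) // 4) == block:
        return pat
    return block  # incompressible under this toy scheme

def compress_cache(blocks):
    """Apply both dimensions: inter-block deduplication, then
    intra-block compression of each unique block.
    Returns (raw bytes, stored bytes)."""
    unique = {}
    for b in blocks:
        if b not in unique:               # inter-block: store one copy
            unique[b] = intra_compress(b)  # intra-block: compress that copy
    raw = sum(len(b) for b in blocks)
    stored = sum(len(c) for c in unique.values())
    return raw, stored
```

With two identical all-zero 64-byte blocks and one incompressible block, deduplication alone would store 128 bytes and pattern compression alone would store around 72; applying both stores 68, which is the kind of combined opportunity the paper targets.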
12:00 | 6.4.3 | GRAPHVINE: EXPLOITING MULTICAST FOR SCALABLE GRAPH ANALYTICS
Speaker:
Leul Belayneh, University of Michigan, US
Authors:
Leul Belayneh and Valeria Bertacco, University of Michigan, US
Abstract
The proliferation of graphs as a key data structure for big-data analytics has heightened the demand for efficient graph processing. To meet this demand, prior works have proposed processing in memory (PIM) solutions in 3D-stacked DRAMs, such as Hybrid Memory Cubes (HMCs). However, PIM-based architectures, despite considerable improvement over conventional architectures, continue to be hampered by the presence of high inter-cube communication traffic. In turn, this trait has limited the underlying processing elements from fully capitalizing on the memory bandwidth an HMC has to offer. In this paper, we show that it is possible to combine multiple messages emitted from a source node into a single multicast message, thus reducing the inter-cube communication without affecting the correctness of the execution. Hence, we propose to add multicast support at source and in-network routers to reduce vertex-update traffic. Our experimental evaluation shows that, by combining multiple messages emitted at the source, it is possible to achieve an average speedup of 2.4x over a highly optimized PIM-based solution and reduce energy consumption by 3.4x, while incurring a modest power overhead of 6.8%.

Download Paper (PDF; Only available from the DATE venue WiFi)
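The traffic reduction described above — folding many unicast vertex updates from one source into a single multicast carrying a destination set — can be sketched in a few lines. This is a minimal illustrative model of the message-counting argument, not the GraphVine router design; the grouping granularity (one multicast per source) is a simplifying assumption.

```python
from collections import defaultdict

def unicast_messages(updates):
    """Baseline PIM traffic: one inter-cube message per vertex update.
    `updates` is a list of (source_vertex, destination_cube) pairs."""
    return len(updates)

def multicast_messages(updates):
    """Multicast traffic: updates emitted by the same source to distinct
    cubes are combined into one message carrying the destination set."""
    dests = defaultdict(set)
    for src, dst_cube in updates:
        dests[src].add(dst_cube)
    return len(dests)  # one multicast message per source
```

For a high-degree vertex whose neighbors span many cubes, the unicast count grows with the degree while the multicast count stays at one, which is where the inter-cube bandwidth savings come from.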
12:30 | IP3-2, 855 | ENHANCING MULTITHREADED PERFORMANCE OF ASYMMETRIC MULTICORES WITH SIMD OFFLOADING
Speaker:
Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul, BR
Authors:
Jeckson Dellagostin Souza1, Madhavan Manivannan2, Miquel Pericas2 and Antonio Carlos Schneider Beck1
1Universidade Federal do Rio Grande do Sul, BR; 2Chalmers, SE
Abstract
Asymmetric multicore architectures with single-ISA can accelerate multithreaded applications by running code that does not execute concurrently (i.e., the serial region) on a big core and the parallel region on a larger number of smaller cores. Nevertheless, in such architectures the big core still implements resource-expensive application-specific instruction extensions that are rarely used while running the serial region, such as Single Instruction Multiple Data (SIMD) and Floating-Point (FP) operations. In this work, we propose a design in which these extensions are not implemented in the big core, thereby freeing up area and resources to increase the number of small cores in the system, and potentially enhance thread-level parallelism (TLP). To address the case when missing instruction extensions are required while running on the big core we devise an approach to automatically offload these operations to the execution units of the small cores, where the extensions are implemented and can be executed. Our evaluation shows that, on average, the proposed architecture provides 1.76x speedup when compared to a traditional single-ISA asymmetric multicore processor with the same area, for a variety of parallel applications.

Download Paper (PDF; Only available from the DATE venue WiFi)
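The area-for-cores trade-off in this abstract can be framed as a simple Amdahl-style model: the serial region runs on the big core (possibly paying an offload penalty for vector operations whose units were removed), while the parallel region spreads over the small cores freed-up area can multiply. The numbers and the form of the model below are illustrative assumptions, not the paper's evaluation.

```python
def speedup(f_par, n_small, big_perf=2.0, serial_penalty=1.0):
    """Toy model of a single-ISA asymmetric multicore.

    f_par          fraction of work that is parallel
    n_small        number of small cores running the parallel region
    big_perf       big core's performance relative to a small core
    serial_penalty slowdown of the serial region when SIMD/FP ops must
                   be offloaded to a small core's execution units
    """
    serial = (1 - f_par) / big_perf * serial_penalty
    parallel = f_par / n_small
    return 1.0 / (serial + parallel)
```

Under these assumptions, trading the big core's SIMD/FP area for, say, two extra small cores (8 to 10) can outweigh a modest serial-region offload penalty whenever the parallel fraction is high, which is the regime the paper targets.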
12:30 | End of session