6.4 Microarchitecture to the rescue of memory


Date: Wednesday 11 March 2020
Time: 11:00 - 12:30
Location / Room: Stendhal

Chair:
Olivier Sentieys, INRIA, FR

Co-Chair:
Jeronimo Castrillon, TU Dresden, DE

This session discusses microarchitectural innovations across three memory technologies: caches, 3D-stacked DRAM, and non-volatile memory. The papers exploit several forms of redundancy to maximize cache utilization through compression, use multicast in 3D-stacked high-speed memories for graph analytics, and present a microarchitectural solution that unifies persistency and encryption in non-volatile memories.

Time | Label | Presentation Title / Authors

11:00 | 6.4.1 | EFFICIENT HARDWARE-ASSISTED CRASH CONSISTENCY IN ENCRYPTED PERSISTENT MEMORY
Speaker:
Zhan Zhang, Huazhong University of Science & Technology, CN
Authors:
Zhan Zhang1, Jianhui Yue2, Xiaofei Liao1 and Hai Jin1
1Huazhong University of Science & Technology, CN; 2Michigan Technological University, US
Abstract
Persistent memory (PM) requires maintaining crash consistency and encrypting data to ensure data recoverability and data confidentiality. Enforcing these two goals not only puts more burden on programmers but also degrades performance. To address this issue, we propose a hardware-assisted encrypted persistent memory system in which logging and data encryption are assisted by hardware. Furthermore, we apply counter-based encryption to data and cipher feedback (CFB) mode encryption to the log, reducing the encryption overhead. Our preliminary experimental results show that the transaction throughput of the proposed design outperforms the baseline design by up to 34.4%.

Download Paper (PDF; Only available from the DATE venue WiFi)
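The split the abstract describes — counter-mode encryption for in-place data, CFB mode for the sequential log — can be sketched in a toy model. This is an illustrative sketch only, not the paper's design: a SHA-256-based keystream stands in for the AES block cipher, and all names and parameters are hypothetical. The point it shows is structural: the counter-mode pad depends only on (address, counter) and can be precomputed off the critical path, while CFB chains each ciphertext into the next keystream, which fits append-only log writes.

```python
import hashlib

KEY = b"toy-key"  # placeholder secret; a real design would use an AES key

def prf(data: bytes) -> bytes:
    """Toy pseudo-random function standing in for an AES block cipher."""
    return hashlib.sha256(KEY + data).digest()[:16]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def ctr_encrypt(block: bytes, line_addr: int, counter: int) -> bytes:
    # Counter mode: the keystream pad depends only on (address, counter),
    # so it can be generated before the data arrives. Encrypt and decrypt
    # are the same XOR.
    pad = prf(line_addr.to_bytes(8, "little") + counter.to_bytes(8, "little"))
    return xor(block, pad)

def cfb_encrypt(log_entries, iv: bytes):
    # Cipher feedback: each ciphertext feeds the next keystream block,
    # matching the sequential, append-only nature of a write-ahead log.
    out, prev = [], iv
    for entry in log_entries:
        ct = xor(entry, prf(prev))
        out.append(ct)
        prev = ct
    return out
```

Decryption in CFB mode recomputes `prf(prev)` from the previous ciphertext and XORs it back, so the log can be replayed in order after a crash.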
11:30 | 6.4.2 | 2DCC: CACHE COMPRESSION IN TWO DIMENSIONS
Speaker:
Amin Ghasemazar, University of British Columbia, CA
Authors:
Amin Ghasemazar1, Mohammad Ewais2, Prashant Nair1 and Mieszko Lis1
1University of British Columbia, CA; 2University of Toronto, CA
Abstract
The importance of caches for performance, together with their high silicon area cost, has led to an interest in hardware solutions that transparently compress the cached data to increase effective capacity without sacrificing silicon area. Work to date has taken one of two tacks: either (a) deduplicating identical cache blocks across the cache to take advantage of inter-block redundancy or (b) identifying and compressing common patterns within each cache block to take advantage of intra-block redundancy. In this paper, we demonstrate that leveraging only one of these redundancy types leads to significant loss of compression opportunities in many applications: some workloads exhibit either inter-block or intra-block redundancy, while others exhibit both. We propose 2DCC, a simple technique that takes advantage of both types of redundancy. Across the SPEC and Parsec benchmark suites, 2DCC results in a 2.12× compression factor (geomean) compared to 1.44-1.49× for the best prior techniques on an iso-silicon basis. For the cache-sensitive subset of these benchmarks run in isolation, 2DCC also achieves an 11.7% speedup (geomean).

Download Paper (PDF; Only available from the DATE venue WiFi)
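The two redundancy dimensions the abstract contrasts can be illustrated with a toy model — not the 2DCC hardware, just the idea: first deduplicate identical blocks across the cache (inter-block), then pattern-compress each surviving unique block (intra-block). The pattern scheme here (collapsing a block made of one repeated 4-byte word) is a stand-in for real frequent-pattern encodings; all names are hypothetical.

```python
def intra_compress(block: bytes) -> bytes:
    """Toy intra-block scheme: a block that is one 4-byte pattern
    repeated end-to-end stores only that pattern."""
    pat = block[:4]
    if pat * (len(block) // 4) == block:
        return pat
    return block  # incompressible under this toy scheme

def compress_cache(blocks):
    """Apply both dimensions: inter-block deduplication, then
    intra-block compression of each unique block.
    Returns (raw bytes, stored bytes)."""
    unique = {}
    for b in blocks:
        if b not in unique:               # inter-block: store one copy
            unique[b] = intra_compress(b)  # intra-block: compress that copy
    raw = sum(len(b) for b in blocks)
    stored = sum(len(c) for c in unique.values())
    return raw, stored
```

With two identical all-zero 64-byte blocks and one incompressible block, deduplication alone would store 128 bytes and pattern compression alone would store around 72; applying both stores 68, which is the kind of combined opportunity the paper targets.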
12:00 | 6.4.3 | GRAPHVINE: EXPLOITING MULTICAST FOR SCALABLE GRAPH ANALYTICS
Speaker:
Leul Belayneh, University of Michigan, US
Authors:
Leul Belayneh and Valeria Bertacco, University of Michigan, US
Abstract
The proliferation of graphs as a key data structure for big-data analytics has heightened the demand for efficient graph processing. To meet this demand, prior works have proposed processing in memory (PIM) solutions in 3D-stacked DRAMs, such as Hybrid Memory Cubes (HMCs). However, PIM-based architectures, despite considerable improvement over conventional architectures, continue to be hampered by the presence of high inter-cube communication traffic. In turn, this trait has limited the underlying processing elements from fully capitalizing on the memory bandwidth an HMC has to offer. In this paper, we show that it is possible to combine multiple messages emitted from a source node into a single multicast message, thus reducing the inter-cube communication without affecting the correctness of the execution. Hence, we propose to add multicast support at source and in-network routers to reduce vertex-update traffic. Our experimental evaluation shows that, by combining multiple messages emitted at the source, it is possible to achieve an average speedup of 2.4x over a highly optimized PIM-based solution and reduce energy consumption by 3.4x, while incurring a modest power overhead of 6.8%.

Download Paper (PDF; Only available from the DATE venue WiFi)
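The traffic reduction described above — folding many unicast vertex updates from one source into a single multicast carrying a destination set — can be sketched in a few lines. This is a minimal illustrative model of the message-counting argument, not the GraphVine router design; the grouping granularity (one multicast per source) is a simplifying assumption.

```python
from collections import defaultdict

def unicast_messages(updates):
    """Baseline PIM traffic: one inter-cube message per vertex update.
    `updates` is a list of (source_vertex, destination_cube) pairs."""
    return len(updates)

def multicast_messages(updates):
    """Multicast traffic: updates emitted by the same source to distinct
    cubes are combined into one message carrying the destination set."""
    dests = defaultdict(set)
    for src, dst_cube in updates:
        dests[src].add(dst_cube)
    return len(dests)  # one multicast message per source
```

For a high-degree vertex whose neighbors span many cubes, the unicast count grows with the degree while the multicast count stays at one, which is where the inter-cube bandwidth savings come from.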
12:30 | IP3-2, 855 | ENHANCING MULTITHREADED PERFORMANCE OF ASYMMETRIC MULTICORES WITH SIMD OFFLOADING
Speaker:
Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul, BR
Authors:
Jeckson Dellagostin Souza1, Madhavan Manivannan2, Miquel Pericas2 and Antonio Carlos Schneider Beck1
1Universidade Federal do Rio Grande do Sul, BR; 2Chalmers, SE
Abstract
Asymmetric multicore architectures with single-ISA can accelerate multithreaded applications by running code that does not execute concurrently (i.e., the serial region) on a big core and the parallel region on a larger number of smaller cores. Nevertheless, in such architectures the big core still implements resource-expensive application-specific instruction extensions that are rarely used while running the serial region, such as Single Instruction Multiple Data (SIMD) and Floating-Point (FP) operations. In this work, we propose a design in which these extensions are not implemented in the big core, thereby freeing up area and resources to increase the number of small cores in the system, and potentially enhance thread-level parallelism (TLP). To address the case when missing instruction extensions are required while running on the big core we devise an approach to automatically offload these operations to the execution units of the small cores, where the extensions are implemented and can be executed. Our evaluation shows that, on average, the proposed architecture provides 1.76x speedup when compared to a traditional single-ISA asymmetric multicore processor with the same area, for a variety of parallel applications.

Download Paper (PDF; Only available from the DATE venue WiFi)
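The area-for-cores trade-off in this abstract can be framed as a simple Amdahl-style model: the serial region runs on the big core (possibly paying an offload penalty for vector operations whose units were removed), while the parallel region spreads over the small cores freed-up area can multiply. The numbers and the form of the model below are illustrative assumptions, not the paper's evaluation.

```python
def speedup(f_par, n_small, big_perf=2.0, serial_penalty=1.0):
    """Toy model of a single-ISA asymmetric multicore.

    f_par          fraction of work that is parallel
    n_small        number of small cores running the parallel region
    big_perf       big core's performance relative to a small core
    serial_penalty slowdown of the serial region when SIMD/FP ops must
                   be offloaded to a small core's execution units
    """
    serial = (1 - f_par) / big_perf * serial_penalty
    parallel = f_par / n_small
    return 1.0 / (serial + parallel)
```

Under these assumptions, trading the big core's SIMD/FP area for, say, two extra small cores (8 to 10) can outweigh a modest serial-region offload penalty whenever the parallel fraction is high, which is the regime the paper targets.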
12:30 | End of session