7.4 DRAM and NVMs


Time

Label

Presentation Title
Authors

14:30

7.4.1

ROW-BUFFER HIT HARVESTING IN ORCHESTRATED LAST-LEVEL CACHE AND DRAM SCHEDULING FOR HETEROGENEOUS MULTICORE SYSTEMS
Speaker:
Xun Jiao, University of California, San Diego, US
Authors:
Yang Song¹, Olivier Alavoine² and Bill Lin¹
¹University of California, San Diego, US; ²Qualcomm Inc., US
Abstract
In heterogeneous multicore systems, the memory subsystem, including the last-level cache and DRAM, is widely shared among the CPU, the GPU, and the real-time cores. Due to their distinct memory traffic patterns, heterogeneous cores result in more frequent cache misses at the last-level cache. As cache misses travel through the memory subsystem, two schedulers are involved, one for the last-level cache and one for DRAM. Prior studies treated the scheduling of the last-level cache and DRAM as independent stages. However, with no orchestration and limited visibility of memory traffic, neither scheduling stage is able to ensure optimal scheduling decisions for memory efficiency. Unnecessary precharges and row activations happen in DRAM when the memory scheduler is ignorant of incoming cache misses and DRAM row-buffer states are invisible to the last-level cache. In this paper, we propose a unified memory controller for the last-level cache and DRAM with orchestrated schedulers. The memory scheduler harvests row-buffer hit opportunities in cache request buffers during spare time without inducing significant implementation cost. Extensive evaluations show that the proposed controller improves the total memory bandwidth of DRAM by 16.8% on average and saves DRAM energy by up to 29.7% while achieving comparable CPU IPC. In addition, we explore the impact of last-level cache bypassing techniques on the proposed memory controller.
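To illustrate why visibility of pending requests matters for row-buffer hits, here is a minimal sketch of a toy open-page DRAM model. It is not the controller proposed in the paper: the function name `count_activations`, the `(bank, row)` request format, and the sample trace are all assumptions made for illustration. It only compares serving requests in arrival order against preferring requests that hit the currently open row.

```python
def count_activations(requests, prefer_row_hits):
    """Serve (bank, row) requests on a toy open-page DRAM and count row activations.

    Hypothetical model: one open row per bank, no timing, no refresh.
    """
    pending = list(requests)
    open_row = {}        # bank -> row currently held in that bank's row buffer
    activations = 0
    while pending:
        idx = 0          # default: oldest request first (arrival order)
        if prefer_row_hits:
            # Pick the oldest pending request that hits an already open row, if any.
            for i, (bank, row) in enumerate(pending):
                if open_row.get(bank) == row:
                    idx = i
                    break
        bank, row = pending.pop(idx)
        if open_row.get(bank) != row:
            # Row miss: the bank must precharge (if a row is open) and activate the new row.
            activations += 1
            open_row[bank] = row
    return activations


# Hypothetical trace: two rows of the same bank accessed alternately.
trace = [(0, 1), (0, 2), (0, 1), (0, 2), (0, 1)]
in_order = count_activations(trace, prefer_row_hits=False)
hit_first = count_activations(trace, prefer_row_hits=True)
print(in_order, hit_first)
```

On this trace, arrival order activates a row for every request, while the hit-first order groups the accesses to each row together. A scheduler can only make that choice for requests it can see, which is the motivation the abstract gives for sharing information between the two scheduling stages.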

15:00

7.4.2

ADAM: ADAPTIVE APPROXIMATION MANAGEMENT FOR THE NON-VOLATILE MEMORY HIERARCHIES
Speaker:
Muhammad Abdullah Hanif, TU Wien, AT
Authors:
Mohammad Taghi Teimoori¹, Muhammad Abdullah Hanif², Alireza Ejlali¹ and Muhammad Shafique²
¹Sharif University of Technology, IR; ²TU Wien, AT
Abstract
Existing memory approximation techniques focus on employing approximations at an individual level of the memory hierarchy (e.g., cache, scratchpad, or main memory). However, to exploit the full potential of approximations, there is a need to manage different approximation knobs across the complete memory hierarchy. Towards this, we model a system including STT-RAM scratchpad and PCM main memory with different approximation knobs (e.g., read/write pulse magnitude/duration) in order to synergistically trade data accuracy for both STT-RAM access delay and PCM lifetime by means of an integer linear programming (ILP) problem at design-time. Furthermore, a run-time algorithm is proposed to adaptively tune the approximation knobs of both STT-RAM and PCM to obtain high energy savings while keeping the quality within acceptable ranges across the complete memory hierarchy. We evaluated our proposed technique in a baseline system consisting of a 1 MB STT-RAM scratchpad and a 1 GB PCM main memory. The experimental results demonstrate that our proposed technique improves the execution time and the lifetime by up to 23% and 2.3X, respectively.
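The idea of choosing approximation knobs jointly under a quality constraint can be sketched as a tiny integer program. The example below is only an illustration under stated assumptions: the knob levels, their error and gain numbers, and the additive error model are hypothetical and do not come from the paper, and exhaustive enumeration stands in for an ILP solver.

```python
from itertools import product

# Hypothetical knob levels: (name, error contribution, benefit).
# The numbers are made up for illustration only.
STT_LEVELS = [("exact", 0, 0), ("mild", 1, 3), ("aggressive", 3, 5)]
PCM_LEVELS = [("exact", 0, 0), ("mild", 2, 4), ("aggressive", 4, 6)]


def best_setting(error_budget):
    """Pick one level per memory, maximizing total benefit within the error budget.

    Assumes errors and benefits simply add up across the two memories.
    """
    best = None
    for stt, pcm in product(STT_LEVELS, PCM_LEVELS):
        error = stt[1] + pcm[1]
        gain = stt[2] + pcm[2]
        if error <= error_budget and (best is None or gain > best[2]):
            best = (stt[0], pcm[0], gain)
    return best


print(best_setting(3))
print(best_setting(0))
```

With a budget of 3, the joint search settles on a mild setting for both memories rather than spending the whole budget on one of them, which is the kind of cross-level trade-off a per-level approach would miss.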

15:30

7.4.3

A CROSS-LAYER ADAPTIVE APPROACH FOR PERFORMANCE AND POWER OPTIMIZATION IN STT-MRAM
Speaker:
Nour Sayed, KIT - Karlsruhe Institute of Technology, DE
Authors:
Nour Sayed, Rajendra Bishnoi, Fabian Oboril and Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Abstract
Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising candidate as a universal on-chip memory technology due to non-volatility, high density and scalability. However, high write energy and latency are major challenges in this memory technology due to the asymmetry and stochastic nature of the write operation. Typically, the write current is set for the minimum energy point, which can further impact the write latency. To mitigate these issues, we propose an adaptive write current scaling technique that adjusts the write current, and hence the write latency and energy, based on the performance needs at run-time. Using this technique, optimal energy and performance points for write current are obtained using detailed device and system level analysis. Furthermore, we use run-time adaptation of write current by predicting the write access rate for the next execution phase. We evaluate the efficiency of the proposed approach on SPEC2000 applications for STT-MRAM-based L1 and L2-cache levels. The results show that the effective write latency of L1 and L2 is reduced by 52.4% and 55.7% with 7.6% and 1.4% area overheads, respectively, corresponding to an overall system performance improvement of 15.5% while the total memory energy consumption increases by only 3.2%.
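As a rough illustration of phase-based adaptation, the sketch below selects a write-current mode for each execution phase from the write rate observed in the previous phase. The abstract does not describe the predictor, so the last-value predictor, the threshold, and the mode names `low_current` and `high_current` are assumptions chosen for illustration, not the authors' design.

```python
def choose_write_mode(prev_write_rate, threshold):
    """Last-value prediction: assume the next phase writes as often as the last one."""
    predicted = prev_write_rate
    # Hypothetical policy: write-heavy phases get the faster, higher-current mode.
    return "high_current" if predicted >= threshold else "low_current"


def run_phases(write_rates, threshold, initial_mode="low_current"):
    """Return the mode used in each phase, given per-phase observed write rates."""
    modes = []
    mode = initial_mode
    for rate in write_rates:
        modes.append(mode)                          # mode in effect during this phase
        mode = choose_write_mode(rate, threshold)   # decision for the next phase
    return modes


# Hypothetical write rates (fraction of accesses that are writes) per phase.
print(run_phases([0.05, 0.40, 0.35, 0.02], threshold=0.2))
```

Because the decision lags by one phase, the first write-heavy phase still runs in the low-current mode. That misprediction cost is the reason a real design would care about predictor accuracy.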

16:00

IP3-11, 331

PROCESSING IN 3D MEMORIES TO SPEEDUP OPERATIONS ON COMPLEX DATA STRUCTURES
Speaker:
Luigi Carro, UFRGS, BR
Authors:
Paulo Cesar Santos¹, Geraldo Francisco de Oliveira Junior¹, Joao Paulo Lima¹, Marco Antonio Zanata Alves², Luigi Carro¹ and Antonio Carlos Schneider Beck¹
¹UFRGS, BR; ²UFPR, BR
Abstract
Pointer chasing has been, for years, the kernel operation employed by diverse data structures, from graphs to hash tables and dictionaries. However, due to the bewildering growth in the volume of data that current applications have to deal with, performing pointer-chasing operations has become a major performance and energy bottleneck, due to their sparse memory access behavior. In this work, we aim to tackle this problem by taking advantage of the already available parallelism present in today's 3D-stacked memories. We present a simple mechanism that can accelerate pointer-chasing operations by making use of a state-of-the-art PIM design that executes in-memory vector operations. The key idea behind our design is to run speculative loads, in parallel, based on a given memory address in a reconfigurable window of addresses. Our design can perform pointer-chasing operations on B+-trees 4.9x faster when compared to modern baseline systems. Besides that, since our device avoids data movement and alleviates the memory hierarchy's inefficiency due to poor spatial data locality, we can also reduce energy consumption by 85% when compared to the baseline.
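The effect of loading a window of addresses around each pointer target can be sketched with a toy memory model. This is not the PIM mechanism from the paper: the dictionary-based memory, the function name `chase`, and the rule that one fetch brings in every node in `[addr, addr + window)` are assumptions for illustration, and the list is assumed to be acyclic.

```python
def chase(memory, head, window):
    """Follow next-pointers from head; return (visited keys, number of window fetches).

    memory maps an address to a (key, next_address) pair; next_address is None at the end.
    """
    fetched = {}     # nodes already brought in by an earlier window fetch
    fetches = 0
    keys = []
    addr = head
    while addr is not None:
        if addr not in fetched:
            # One fetch speculatively loads every node stored in [addr, addr + window).
            for a in range(addr, addr + window):
                if a in memory:
                    fetched[a] = memory[a]
            fetches += 1
        key, addr = fetched[addr]
        keys.append(key)
    return keys, fetches


# Hypothetical linked list laid out at addresses 0, 2, 3 and 9.
memory = {0: ("a", 2), 2: ("b", 3), 3: ("c", 9), 9: ("d", None)}
print(chase(memory, 0, window=1))
print(chase(memory, 0, window=4))
```

With a window of one address, every hop costs a separate dependent fetch. A wider window covers the nodes that happen to lie close together, so only the distant node at address 9 needs a second fetch.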

16:00

End of session
Coffee Break in Exhibition Area

Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area (Terrace Level of the ICCD).

Lunch Breaks (Großer Saal + Saal 1)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the rooms "Großer Saal" and "Saal 1" (Saal Level of the ICCD) to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 20, 2018