4.3 Efficient memory design


Date: Tuesday 28 March 2017
Time: 17:00 - 18:30
Location / Room: 2BC

Chair:
Francisco Cazorla, CSIC and BSC, ES

Co-Chair:
Cristina Silvano, Politecnico di Milano, IT

This session presents four papers on novel memory designs and on efficient mapping in flash storage. The first two papers improve energy efficiency: one with approximate caches built on emerging memory technologies, the other with a novel DRAM tag-cache architecture. The third paper achieves an energy-efficient memory hierarchy through software-managed memories. Finally, the session concludes with an adaptive page re-mapping architecture for flash-based storage that improves response time.

Time / Label / Presentation Title / Authors
17:00 4.3.1 (Best Paper Award Candidate)
STAXCACHE: AN APPROXIMATE, ENERGY EFFICIENT STT-MRAM CACHE
Speaker:
Ashish Ranjan, Purdue University, US
Authors:
Ashish Ranjan1, Swagath Venkataramani1, Zoha Pajouhi1, Rangharajan Venkatesan2, Kaushik Roy1 and Anand Raghunathan1
1Purdue University, US; 2NVIDIA, US
Abstract
STT-MRAM has attracted great interest for use as on-chip memory due to its high density, near-zero leakage and high endurance. However, its overall energy efficiency is limited by the energy requirements of spin-transfer torque switching during writes and reliable single-ended sensing during reads. Leveraging the ability of many applications to produce acceptable outputs under approximations to computations and data, we propose the use of approximate storage to improve the energy efficiency of STT-MRAM based caches. Towards this end, we explore a combination of different approximation techniques at the circuit and architecture levels that yield significant energy benefits for small probabilities of errors in reads, writes, and retention. A key challenge arises when introducing approximate storage into a cache - data that can tolerate different levels of approximation (or not at all) may be dynamically loaded into a cache line at different times. In addition, it is necessary to manage the approximations so as to obtain a desirable energy-quality tradeoff at the application level. We propose STAxCache (Spintronic Approximate Cache), an STT-MRAM based approximate L2 cache architecture that retains the full flexibility of a conventional cache, while allowing for different levels of approximation to different parts of a program's memory address space. We introduce a simple interface that allows the programmer to specify the quality requirements for different data structures, and instructions in the ISA to expose this information to STAxCache. We utilize a device-to-architecture simulation framework to evaluate STAxCache and achieve 1.44x improvement in L2 cache energy for negligible ( < 0.5%) loss in application-level quality across a suite of 8 benchmarks.
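The idea of tagging parts of the address space with quality levels, as in STAxCache's programmer interface, can be illustrated with a toy model. All names, the relative energy numbers and the bit-flip error model below are invented for illustration; they are not from the paper:

```python
import random

class ApproxCache:
    """Toy model of an approximate cache: address ranges are tagged with an
    approximation level; approximate writes cost less energy but may corrupt
    the low-order bit with a small probability. Illustrative only."""
    EXACT, APPROX = 0, 1
    WRITE_ENERGY = {EXACT: 1.0, APPROX: 0.6}  # assumed relative write energies

    def __init__(self, error_prob=0.01, seed=0):
        self.quality = []            # list of (start, end, level) ranges
        self.store = {}              # address -> value
        self.energy = 0.0            # accumulated write energy
        self.error_prob = error_prob
        self.rng = random.Random(seed)

    def declare_quality(self, start, end, level):
        """Programmer-specified quality for an address range."""
        self.quality.append((start, end, level))

    def _level(self, addr):
        for start, end, level in self.quality:
            if start <= addr < end:
                return level
        return self.EXACT            # untagged data is stored exactly

    def write(self, addr, value):
        level = self._level(addr)
        self.energy += self.WRITE_ENERGY[level]
        if level == self.APPROX and self.rng.random() < self.error_prob:
            value ^= 1               # cheap write occasionally flips the LSB
        self.store[addr] = value

    def read(self, addr):
        return self.store.get(addr, 0)
```

The point of the sketch is the separation of concerns: the programmer declares quality per data structure once, and the cache then picks a cheap or exact write per access.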

Download Paper (PDF; Only available from the DATE venue WiFi)
17:30 4.3.2 RETHINKING ON-CHIP DRAM CACHE FOR SIMULTANEOUS PERFORMANCE AND ENERGY OPTIMIZATION
Speaker:
Fazal Hameed, Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden, DE
Authors:
Fazal Hameed1 and Jeronimo Castrillon2
1Chair of Compiler Construction, TU-Dresden, DE; 2Technische Universität Dresden, DE
Abstract
State-of-the-art DRAM caches employ a small Tag-Cache, and their performance depends on two important parameters: bank-level parallelism and the Tag-Cache hit rate. Both depend on the row buffer organization. It has recently been shown that a small row buffer organization delivers better performance than the traditional large row buffer organization via improved bank-level parallelism, along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag-Cache hit rates. As a result, the DRAM cache needs to be re-designed for the small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70% compared to existing DRAM tag-store mechanisms employing a small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take the locality characteristics of cache accesses into account. We evaluate our tag-store mechanism and controller policies on an 8-core system running the SPEC2006 benchmarks and compare their performance and energy consumption against recent proposals. By improving both parameters simultaneously, our architecture improves average performance by 21.2% and 11.4% compared to the large and small row buffer organizations, respectively. Compared to a DRAM cache with a large row buffer organization, we report an energy improvement of 62%.
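The role of a small Tag-Cache in front of an on-chip DRAM cache can be sketched with a toy model: it caches the tags of recently touched DRAM-cache sets, so a hit avoids a slow in-DRAM tag probe. The capacity, the LRU policy and all names below are invented for illustration; the paper's row-buffer interaction is not modeled:

```python
from collections import OrderedDict

class TagCache:
    """Toy SRAM Tag-Cache: maps a DRAM-cache set index to that set's tags.
    A hit serves the tags from SRAM; a miss falls back to reading the tags
    out of the DRAM cache itself (the slow path)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # set index -> tags, in LRU order
        self.hits = 0
        self.probes = 0

    def lookup(self, set_index, dram_tags):
        """Return the tags for a set, tracking Tag-Cache hits vs. DRAM probes."""
        self.probes += 1
        if set_index in self.entries:
            self.hits += 1
            self.entries.move_to_end(set_index)   # refresh LRU position
            return self.entries[set_index]
        tags = dram_tags[set_index]               # slow path: probe DRAM cache
        self.entries[set_index] = tags
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict least recently used
        return tags
```

The tension the paper addresses is visible even here: the Tag-Cache only pays off when tag accesses have temporal locality, which the small-row-buffer organization weakens.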

18:00 4.3.3 AN ENERGY-EFFICIENT MEMORY HIERARCHY FOR MULTI-ISSUE PROCESSORS
Speaker:
Luigi Carro, Universidade Federal do Rio Grande do Sul, BR
Authors:
Tiago Jost, Gabriel Nazar and Luigi Carro, UFRGS, BR
Abstract
Embedded processors must rely on the efficient use of instruction-level parallelism to meet the performance and energy needs of modern applications. However, memory bandwidth limits how well the resources inside the processor can be used, and adding extra ports to allow more data accesses drastically increases cost and energy. In this paper, we present a novel memory architecture for embedded multi-issue processors that overcomes the limited memory bandwidth without adding extra ports to the system. We combine software-managed memories (SMMs) with the data cache to provide higher throughput without increasing the number of ports. Compiler-automated code transformations minimize the effort programmers need to benefit from the proposed architecture. Our experimental results show an average speedup of 1.17x, while consuming 69% less dynamic energy and achieving, on average, a 74.7% lower energy-delay product for data memory compared with a baseline processor.

18:15 4.3.4 MAPPING GRANULARITY ADAPTIVE FTL BASED ON FLASH PAGE RE-PROGRAMMING
Speaker:
Yazhi Feng, Wuhan National Lab for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, CN
Authors:
Yazhi Feng, Dan Feng, Chenye Yu, Wei Tong and Jingning Liu, Wuhan National Lab for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, CN
Abstract
The page size of NAND flash grows continuously as the manufacturing process advances. While a larger page reduces the cost per bit and improves the throughput of NAND flash, it may waste storage space and data transfer time, and it causes more frequent garbage collection when serving small write requests. To address these issues, we propose a Mapping Granularity Adaptive FTL (MGA-FTL) based on the flash page re-programming feature. MGA-FTL enables finer-granularity NAND flash space management and exploits multiple subpage writes on a single flash page without an erase. A 2-level mapping is introduced to serve requests of different sizes while controlling the DRAM requirement. Meanwhile, the allocation strategy determines whether different logical pages can be mapped to a single physical page, balancing space utilization and performance. Subpage merging limits the number of physical pages associated with a logical page, which reduces data fragmentation and improves the performance of read operations. We compare MGA-FTL with typical FTLs, including a page-level mapping FTL and a sector-log mapping FTL. Experimental results show that MGA-FTL reduces I/O response time, write amplification and the number of erasures by 53%, 30% and 40%, respectively. Despite the overhead of fine-grained management, MGA-FTL requires no more than 16.5% additional DRAM compared with a page-level mapping FTL, and only one third of the DRAM space that subpage-level mapping needs for storing mapping tables.
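The 2-level mapping idea can be sketched as a toy table in which large writes are mapped at page granularity and small writes at subpage granularity, with fine-grained entries overriding the coarse one because they are more recent. The granularity constant and all names below are illustrative, not the paper's data structures:

```python
SUBPAGES_PER_PAGE = 4  # assumed number of subpages per flash page (illustrative)

class TwoLevelMap:
    """Toy two-level FTL mapping: a page-level table for large writes and a
    subpage-level table for small writes, kept consistent on full rewrites."""
    def __init__(self):
        self.page_map = {}      # logical page number -> physical page number
        self.subpage_map = {}   # (lpn, subpage idx) -> (ppn, subpage slot)

    def write_page(self, lpn, ppn):
        """A large (full-page) write maps at page granularity."""
        self.page_map[lpn] = ppn
        # a full-page write supersedes any finer-grained entries for this page
        for i in range(SUBPAGES_PER_PAGE):
            self.subpage_map.pop((lpn, i), None)

    def write_subpage(self, lpn, idx, ppn, slot):
        """A small write maps one subpage to a slot on some physical page."""
        self.subpage_map[(lpn, idx)] = (ppn, slot)

    def lookup(self, lpn, idx):
        """Return (physical page, subpage slot), or None if unmapped."""
        if (lpn, idx) in self.subpage_map:
            return self.subpage_map[(lpn, idx)]   # fine-grained entry wins
        if lpn in self.page_map:
            return (self.page_map[lpn], idx)
        return None
```

The DRAM trade-off the abstract quantifies shows up directly: subpage entries exist only for pages that actually received small writes, rather than for every subpage in the device as a pure subpage-level mapping would require.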

18:30 IP2-5, 328 I-BEP: A NON-REDUNDANT AND HIGH-CONCURRENCY MEMORY PERSISTENCY MODEL
Speaker:
Yuanchao Xu, Capital Normal University, CN
Authors:
Yuanchao Xu, Zeyi Hou, Junfeng Yan, Lu Yang and Hu Wan, Capital Normal University, CN
Abstract
Byte-addressable, non-volatile memory (NVM) technologies enable fast persistent updates but risk data inconsistency upon a failure. Recent proposals present several persistency models to guarantee data consistency. However, they fail to express the minimal persist ordering because they induce unnecessary ordering constraints. In this paper, we propose i-BEP, a non-redundant, high-concurrency memory persistency model that expresses epoch dependencies via a persist directed acyclic graph instead of program order. Additionally, we propose two techniques, background persist and deferred eviction, to enhance the performance of i-BEP. We demonstrate that i-BEP improves performance for typical data structures by 15% on average over the buffered epoch persistency (BEP) model.
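The contrast with program-order epochs can be sketched with a toy persist DAG: epochs with no dependency between them may persist concurrently or in either order, and any topological order of the DAG is a valid persist schedule. This is a generic Kahn's-algorithm sketch under that reading of the abstract, not i-BEP's actual mechanism:

```python
from collections import defaultdict, deque

class PersistDAG:
    """Toy persist-ordering DAG: each epoch persists only after the epochs it
    explicitly depends on, rather than after every earlier epoch in program
    order. Independent epochs are free to persist concurrently."""
    def __init__(self):
        self.deps = defaultdict(set)   # epoch -> set of prerequisite epochs
        self.epochs = set()

    def add_epoch(self, name, deps=()):
        self.epochs.add(name)
        self.deps[name].update(deps)

    def persist_order(self):
        """Kahn's algorithm: return one valid persist schedule."""
        indeg = {e: 0 for e in self.epochs}
        children = defaultdict(list)
        for epoch, prereqs in self.deps.items():
            for p in prereqs:
                indeg[epoch] += 1
                children[p].append(epoch)
        ready = deque(e for e in self.epochs if indeg[e] == 0)
        order = []
        while ready:
            e = ready.popleft()
            order.append(e)
            for c in children[e]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
        return order
```

Under program order, three epochs A, B, C would persist strictly in sequence; under the DAG, only the true dependency (here, C after A and B) constrains the schedule.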

18:31 IP2-6, 880 SPMS: STRAND BASED PERSISTENT MEMORY SYSTEM
Speaker:
Shuo Li, National University of Defense Technology, CN
Authors:
Shuo Li1, Peng Wang2, Nong Xiao1, Guangyu Sun2 and Fang Liu1
1National University of Defense Technology, CN; 2Peking University, CN
Abstract
Emerging non-volatile memories enable persistent memory, which offers the opportunity to directly access persistent data structures residing in main memory. To keep persistent data consistent in case of system failures, most prior work relies on persist ordering constraints, which incur significant overheads. Strand persistency minimizes persist ordering constraints, yet no persistent memory design based on it has been proposed, owing to its implementation complexity. In this work, we propose SPMS, a novel persistent memory system based on strand persistency. SPMS consists of cacheline-based strand group tracking components, a volatile strand buffer and ultra-capacitors incorporated in the persistent memory modules. SPMS can track each strand and guarantee its atomicity. In case of system failures, committed strands buffered in the strand buffer can be flushed back to persistent memory within the residual energy window provided by the ultra-capacitors. Our evaluations show that SPMS outperforms the state-of-the-art persistent memory system by 6.6% and performs slightly better than a baseline without any consistency guarantee. Moreover, with the help of the strand buffer, SPMS reduces persistent memory write traffic by 30%.

18:32 IP2-7, 72 ARCHITECTING HIGH-SPEED COMMAND SCHEDULERS FOR OPEN-ROW REAL-TIME SDRAM CONTROLLERS
Speaker:
Leonardo Ecco, TU Braunschweig, DE
Authors:
Leonardo Ecco1 and Rolf Ernst2
1Institute of Computer and Network Engineering, TU Braunschweig, DE; 2TU Braunschweig, DE
Abstract
As SDRAM modules get faster and their data buses wider, researchers have proposed using the open-row policy in command schedulers for real-time SDRAM controllers. While the real-time properties of such schedulers have been thoroughly investigated, their hardware implementation has not. Hence, in this paper, we propose a highly-parallel, multi-stage architecture that implements a state-of-the-art open-row real-time command scheduler, and we evaluate this architecture in terms of hardware overhead and performance.

18:30 End of session
Exhibition Reception in Exhibition Area
The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks will be offered to all conference delegates and exhibition visitors. Exhibitors are also welcome to provide drinks and snacks for the attendees.