9.2 Emerging Schemes for Memory Management


Date: Thursday 30 March 2017
Time: 08:30 - 10:00
Location / Room: 4BC

Chair:
Arne Heittmann, RWTH, DE

Co-Chair:
Costin Anghel, ISEP, FR

This session covers emerging memory architectures and functional blocks with respect to performance and endurance enhancement. In particular, it addresses caches, FTLs, logic-in-memory, and error-correction schemes, including strategies such as error correction, wear leveling, and cache replacement. NVMs such as PCM, Flash, and RRAMs are considered in this track.

Time  Label  Presentation Title / Authors
08:30  9.2.1  A LOG-AWARE SYNERGIZED SCHEME FOR PAGE-LEVEL FTL DESIGN
Speaker:
Chu Li, Huazhong University of Science & Technology, CN
Authors:
Chu Li1, Dan Feng1, Yu Hua1, Fang Wang1, Chuntao Jiang2 and Wei Zhou1
1Huazhong University of Science and Technology, CN; 2Illinois Institute of Technology, US
Abstract
NAND flash-based Solid State Drives (SSDs) employ the Flash Translation Layer (FTL) to perform logical-to-physical address translation. Modern page-level FTLs selectively cache the address mappings in the limited SRAM while storing the mapping table in flash pages (called translation pages). However, many extra accesses to the translation pages are required for address translation, which degrades the performance and lifetime of an SSD. In this paper, we propose a Log-aware Synergized scheme for page-level FTL to reduce the extra overheads, called LSFTL. The contribution of LSFTL consists of two key elements: (i) By exploiting the partial programmability of SLC flash, "in-place logging" decreases garbage collection overhead by reserving a small portion of each translation page as a logging area to hold multiple updates to the entries of that translation page. (ii) "Log-aware flush back" reduces the number of translation page updates by evicting multiple dirty cache lines that share the same translation page in a single transaction. Extensive experimental results of trace-driven simulations show that LSFTL decreases the system response time by 39.40% on average, and up to 58.35%, and reduces the block erase count by 37.55% on average, and up to 39.99%, compared to the well-known DFTL.
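The "log-aware flush back" idea above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the entries-per-page constant and function names are assumptions. The point is that all dirty mapping entries belonging to the victim's translation page are evicted in one transaction, so that page is rewritten once rather than once per entry.

```python
ENTRIES_PER_TRANSLATION_PAGE = 512  # assumed mapping entries per translation page

def translation_page(lpn):
    """Index of the translation page that stores the mapping entry for this LPN."""
    return lpn // ENTRIES_PER_TRANSLATION_PAGE

def flush_back(dirty_lpns, victim_lpn):
    """Return the set of dirty cached LPNs evicted together with the victim,
    i.e. all dirty entries that share the victim's translation page."""
    victim_page = translation_page(victim_lpn)
    return {lpn for lpn in dirty_lpns if translation_page(lpn) == victim_page}
```

For example, evicting LPN 20 would also flush LPN 10 (both live in translation page 0), while LPNs 600 and 1030 stay cached.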

Download Paper (PDF; Only available from the DATE venue WiFi)
09:00  9.2.2  MALRU: MISS-PENALTY AWARE LRU-BASED CACHE REPLACEMENT FOR HYBRID MEMORY SYSTEMS
Speaker:
Chen Di, Huazhong University of Science and Technology, CN
Authors:
Di Chen, Hai Jin, Xiaofei Liao, Haikun Liu, Rentong Guo and Dong Liu, Huazhong University of Science and Technology, CN
Abstract
Current DRAM-based memory systems face scalability challenges in terms of storage density, power, and cost. Hybrid memory architecture composed of emerging Non-Volatile Memory (NVM) and DRAM is a promising approach to large-capacity and energy-efficient main memory. However, hybrid memory systems pose a new challenge to on-chip cache management due to the asymmetrical penalty of memory access to DRAM and NVM in the case of cache misses. Cache hit rate is no longer an effective metric for evaluating memory access performance in hybrid memory systems, and current cache replacement policies that aim only to improve cache hit rate are not efficient either. In this paper, we take into account the asymmetry of cache miss penalty on DRAM and NVM, and advocate a more general metric, Average Memory Access Time (AMAT), to evaluate the performance of hybrid memories. We propose a miss penalty-aware LRU-based (MALRU) cache replacement policy for hybrid memory systems. MALRU is aware of the source (DRAM or NVM) of missing blocks and prevents high-latency NVM blocks, as well as low-latency DRAM blocks with good temporal locality, from being evicted. Experimental results show that MALRU improves system performance against LRU and the state-of-the-art HAP policy by up to 20.4% and 11.7% (11.1% and 5.7% on average), respectively.
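The AMAT metric and the miss-penalty-aware eviction idea can be illustrated with a minimal sketch. The latencies, names, and the victim-selection rule below are assumptions for illustration, not the paper's actual policy: among the coldest candidates, a DRAM-backed block is preferred for eviction because its miss penalty is lower.

```python
HIT_TIME = 10             # assumed cache hit latency (cycles)
DRAM_MISS_PENALTY = 100   # assumed DRAM access latency (cycles)
NVM_MISS_PENALTY = 300    # assumed NVM read latency (cycles)

def amat(hit_rate, dram_miss_frac):
    """Average Memory Access Time for a hybrid DRAM/NVM memory:
    hit time plus miss rate times the source-weighted miss penalty."""
    miss_rate = 1.0 - hit_rate
    penalty = (dram_miss_frac * DRAM_MISS_PENALTY
               + (1.0 - dram_miss_frac) * NVM_MISS_PENALTY)
    return HIT_TIME + miss_rate * penalty

def pick_victim(lru_candidates):
    """lru_candidates: list of (block, source) ordered coldest -> hottest.
    Prefer evicting the coldest DRAM-backed block; fall back to plain LRU."""
    for block, source in lru_candidates:
        if source == "DRAM":
            return block
    return lru_candidates[0][0]
```

With a 90% hit rate and misses split evenly between DRAM and NVM, AMAT is 10 + 0.1 x 200 = 30 cycles; lowering the NVM share of misses lowers AMAT even at a fixed hit rate, which is exactly why hit rate alone is an insufficient metric.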

09:30  9.2.3  ENDURANCE MANAGEMENT FOR RESISTIVE LOGIC-IN-MEMORY COMPUTING ARCHITECTURES
Speaker:
Saeideh Shirinzadeh, University of Bremen, DE
Authors:
Saeideh Shirinzadeh1, Mathias Soeken2, Pierre-Emmanuel Gaillardon3, Giovanni De Micheli4 and Rolf Drechsler5
1Group of Computer Architecture, University of Bremen, Germany, DE; 2EPFL, CH; 3University of Utah, US; 4Integrated Systems Laboratory, EPFL, Lausanne, Switzerland, CH; 5Group of Computer Architecture, University of Bremen, Germany / Cyber-Physical Systems, DFKI GmbH, Bremen, Germany, DE
Abstract
Resistive Random Access Memory (RRAM) is a promising non-volatile memory technology which enables modern in-memory computing architectures. Although RRAMs are known to be superior to conventional memories in many aspects, they suffer from a low write endurance. In this paper, we focus on balancing memory write traffic as a solution to extend the lifetime of resistive crossbar architectures. As a case study, we monitor the write traffic in a Programmable Logic-in-Memory (PLiM) architecture, and propose an endurance management scheme for it. The proposed endurance-aware compilation is capable of handling different trade-offs between write balance, latency, and area of the resulting PLiM implementations. Experimental evaluations on a set of benchmarks including large arithmetic and control functions show that the standard deviation of writes can be reduced by 86.65% on average compared to a naive compiler, while the average number of instructions and RRAM devices also decreases by 36.45% and 13.67%, respectively.
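The metric the paper optimizes, the standard deviation of per-device write counts, is easy to sketch. The round-robin allocator below is only an assumed stand-in for the paper's endurance-aware compiler; it shows why spreading writes evenly drives the deviation, and hence the wear imbalance, toward zero.

```python
import statistics

def write_imbalance(write_counts):
    """Standard deviation of per-device write counts (lower = more even wear)."""
    return statistics.pstdev(write_counts)

def round_robin_writes(n_devices, n_writes):
    """Distribute n_writes over n_devices in round-robin order
    (an illustrative balancing policy, not the paper's compiler)."""
    counts = [0] * n_devices
    for i in range(n_writes):
        counts[i % n_devices] += 1
    return counts
```

A workload that hammers one device (e.g. counts [9, 0, 0]) has a large deviation, while the same nine writes spread round-robin give [3, 3, 3] and a deviation of zero.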

09:45  9.2.4  LIVE TOGETHER OR DIE ALONE: BLOCK COOPERATION TO EXTEND LIFETIME OF RESISTIVE MEMORIES
Speaker:
David Kaeli, Northeastern University, US
Authors:
Mohammad Khavari Tavana, Amir Kavyan Ziabari and David Kaeli, Northeastern University, US
Abstract
Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetimes. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated with that block is disabled and becomes unavailable to the physical address space. To reduce the page waste caused by early block failures, other blocks can support the failed block, working cooperatively to keep it alive and extend the page's lifetime. We combine the proposed technique with different error recovery schemes, such as Error Correction Pointers (ECP) and Aegis, to increase memory lifetimes. Block cooperation is realized through metadata sharing in ECP, where one data block shares its unused metadata with another data block. When combined with Aegis, block cooperation is realized through reorganizing the data layout, where blocks possessing few faults help failed blocks, bringing them back from the dead. Employing block cooperation at a single level (or multiple levels) on top of ECP and Aegis, we can increase memory lifetimes by 28% (37%), and 8% (14%) on average, respectively.
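The metadata-sharing idea on top of ECP can be captured in a tiny sketch. The entry count and pairing rule below are illustrative assumptions (ECP-6 is a commonly cited configuration), not the paper's exact mechanism: a block whose fault count exceeds its own correction capacity survives as long as its partner has enough unused entries to lend.

```python
ECP_ENTRIES = 6  # assumed correction pointers per block (ECP-6 style)

def can_survive(faults_a, faults_b):
    """Two paired blocks both stay alive if their combined fault count
    fits in their combined pool of ECP correction entries."""
    return faults_a + faults_b <= 2 * ECP_ENTRIES
```

Alone, a block with 8 faults exceeds its 6 entries and would be disabled; paired with a block holding only 2 faults, the shared pool of 12 entries covers all 10 faults and the page lives on.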

10:00  IP4-7, 272  DAC: DEDUP-ASSISTED COMPRESSION SCHEME FOR IMPROVING LIFETIME OF NAND STORAGE SYSTEMS
Speaker:
Jisung Park, Seoul National University, KR
Authors:
Jisung Park1, Sungjin Lee2 and Jihong Kim1
1Seoul National University, KR; 2Inha University, KR
Abstract
Thanks to an aggressive scaling of semiconductor devices, the capacity of NAND flash-based solid-state drives (SSDs) has increased greatly. However, this benefit comes at the expense of a serious degradation of the NAND devices' lifetime. In order to improve the lifetime of flash-based SSDs, various data reduction techniques, such as deduplication, lossless compression, and delta compression, have been rapidly adopted in SSDs. Although each technique has been extensively studied, how to efficiently combine these techniques to maximize their synergy has not been well investigated. In this paper, we propose a novel dedup-assisted compression (DAC) scheme that integrates existing data reduction techniques so that the potential benefits of the individual ones can be maximized while overcoming their inherent limitations. By doing so, DAC greatly reduces the amount of write traffic sent to SSDs. DAC also requires negligible hardware resources by utilizing existing hardware modules. Our evaluation results show that the proposed DAC decreases the amount of written data by up to 30% over a simple integration of deduplication and lossless compression.
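A minimal dedup-then-compress write path conveys the basic combination DAC builds on. This is a sketch under assumed names and layout, not the DAC design itself: a fingerprint catches duplicate chunks (which then cost only a mapping update), and only unique chunks are losslessly compressed before being written to flash.

```python
import hashlib
import zlib

def write_chunk(store, chunk: bytes):
    """Write one chunk through a dedup-then-compress path.
    Returns the number of bytes actually written to flash."""
    fp = hashlib.sha256(chunk).hexdigest()
    if fp in store:
        return 0                      # duplicate: no flash write needed
    compressed = zlib.compress(chunk)  # unique: compress before writing
    store[fp] = compressed
    return len(compressed)
```

Writing the same 4 KiB chunk twice costs flash writes only once; the second write is absorbed by deduplication, and the first is shrunk by compression.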

10:01  IP4-8, 390  LIFETIME ADAPTIVE ECC IN NAND FLASH PAGE MANAGEMENT
Speaker:
Shunzhuo Wang, Huazhong University of Science and Technology, CN
Authors:
Shunzhuo Wang1, Fei Wu1, Zhonghai Lu2, You Zhou1, Qin Xiong1, Meng Zhang1 and Changsheng Xie1
1Huazhong University of Science and Technology, CN; 2KTH Royal Institute of Technology, SE
Abstract
With increasing density, NAND flash memory has decreasing reliability. Furthermore, the raw bit error rate (RBER) of flash memory grows at an exponential rate as the program/erase (P/E) cycle count increases. Thus, error correction codes (ECCs), usually stored in the out-of-band area (OOB) of flash pages, are widely employed to ensure reliability. However, the worst-case oriented ECC is largely under-utilized in the early stage, i.e., when P/E cycles are small, and the required ECC redundancy may be too large to be stored in the OOB. In this paper, we propose LAE-FTL, which employs a lifetime-adaptive ECC scheme, to improve the performance and lifetime of NAND flash memory. In the early stage, weak ECCs can guarantee the reliability and the OOB is large enough to store the ECCs. Thus, LAE-FTL employs weak ECCs and adaptively uses small and incremental codewords as the P/E cycle count increases to improve data transfer and decoding parallelism. In the late stage with large P/E cycles, strong ECCs are needed and the ECC redundancies become too large to fit in the OOB. Thus, LAE-FTL stores the excess ECC redundancy in the data space of flash pages and stores user data in a cross-page fashion. Finally, our evaluation results from trace-driven simulations show that LAE-FTL improves the read performance by up to 63.42%, compared to the worst-case oriented ECC scheme in the early stage, and significantly improves the reliability of flash memory with low data access overhead in the late stage.
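The adaptive selection step can be sketched as a lookup from P/E cycle count to ECC strength. The thresholds, correction strengths, redundancy sizes, and OOB size below are illustrative assumptions, not LAE-FTL's actual parameters; the sketch just shows how a weak, OOB-resident code early in life gives way to a strong code whose redundancy no longer fits in the OOB.

```python
OOB_BYTES = 64  # assumed out-of-band area per page

# (max rated P/E cycles, correctable bits, ECC redundancy in bytes) -- illustrative
ECC_LEVELS = [
    (1_000, 8, 16),
    (3_000, 16, 32),
    (10_000, 40, 80),  # late stage: redundancy exceeds the OOB
]

def select_ecc(pe_cycles):
    """Return (correctable_bits, redundancy_bytes, fits_in_oob) for the
    weakest ECC level still rated for this P/E cycle count."""
    for max_pe, t_bits, redundancy in ECC_LEVELS:
        if pe_cycles <= max_pe:
            return t_bits, redundancy, redundancy <= OOB_BYTES
    raise ValueError("page beyond rated endurance")
```

Early in life (e.g. 500 P/E cycles) a weak code fits in the OOB; late in life (e.g. 5,000 cycles) the selected code's redundancy overflows the OOB, which is the case LAE-FTL handles by spilling redundancy into the data space.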

10:02  IP4-9, 386  3D-DPE: A 3D HIGH-BANDWIDTH DOT-PRODUCT ENGINE FOR HIGH-PERFORMANCE NEUROMORPHIC COMPUTING
Speaker:
Miguel Lastras-Montaño, University of California, Santa Barbara, US
Authors:
Miguel Angel Lastras-Montaño1, Bhaswar Chakrabarti1, Dmitri B. Strukov1 and Kwang-Ting Cheng2
1UC Santa Barbara, US; 2HKUST, HK
Abstract
We present and experimentally validate 3D-DPE, a general-purpose dot-product engine, which is ideal for accelerating artificial neural networks (ANNs). 3D-DPE is based on a monolithically integrated 3D CMOS-memristor hybrid circuit and performs a high-dimensional dot-product operation (a recurrent and computationally expensive operation in ANNs) within a single step, using analog current-based computing. 3D-DPE is made up of two subsystems, namely a CMOS subsystem serving as the memory controller and an analog memory subsystem consisting of multiple layers of high-density memristive crossbar arrays fabricated on top of the CMOS subsystem. Their integration is based on a high-density area-distributed interface, resulting in much higher connectivity between the two subsystems, compared to the traditional interface of a 2D system or a 3D system integrated using through-silicon vias. As a result, 3D-DPE's single-step dot-product operation is not limited by the memory bandwidth, and the input dimension of the operations scales well with the capacity of the 3D memristive arrays. To demonstrate the feasibility of 3D-DPE, we designed and fabricated a CMOS memory controller and monolithically integrated two layers of titanium-oxide memristive crossbars. Then we performed the analog dot-product operation under different input conditions in two scenarios: (1) with devices within the same crossbar layer and (2) with devices from different layers. In both cases, the devices exhibited low-voltage operation and analog switching behavior with high tuning accuracy.
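The underlying crossbar principle is worth making explicit: applying row voltages V to a matrix of conductances G produces column currents I_j = sum_i G[i][j] * V[i] by Ohm's and Kirchhoff's laws, i.e., a full matrix-vector product in one analog step. The digital emulation below only illustrates that arithmetic; values and names are illustrative, not measurements from the fabricated device.

```python
def crossbar_dot(G, V):
    """Column output currents of a crossbar with conductance matrix G
    (rows x columns) driven by row voltages V: one analog-step
    matrix-vector product, emulated digitally."""
    cols = len(G[0])
    return [sum(G[i][j] * V[i] for i in range(len(V))) for j in range(cols)]
```

For a 2x2 conductance array [[1, 2], [3, 4]] driven by unit voltages on both rows, the column currents are [4, 6], the column sums, computed in a single parallel step in hardware.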

10:00  End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served in the exhibition area during the coffee breaks at the times listed below.

Tuesday, March 28, 2017

  • Coffee Break 10:30 - 11:30
  • Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 16:00 - 17:00

Thursday, March 30, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 15:30 - 16:00