11.4 Reliable in-memory computing


Date: Thursday 12 March 2020
Time: 14:00 - 15:30
Location / Room: Stendhal

Chair:
Jean-Philippe Noel, CEA-Leti, FR

Co-Chair:
Shahar Kvatinsky, Technion, IL

This session deals with work on the reliability of computing in memories. It covers new design techniques to improve CNN computing in ReRAM, as well as device-algorithm co-optimization to improve the reliability of ReRAM-based graph processing. The session also addresses improving the reliability of the well-established STT-MRAM and PCM technologies. Finally, early works presenting stochastic computing and disruptive image-processing techniques based on memristors are also discussed.

Time  Label  Presentation Title / Authors
14:00  11.4.1  REBOC: ACCELERATING BLOCK-CIRCULANT NEURAL NETWORKS IN RERAM
Speaker:
Yitu Wang, Fudan University, CN
Authors:
Yitu Wang1, Fan Chen2, Linghao Song2, C.-J. Richard Shi3, Hai (Helen) Li4 and Yiran Chen2
1Fudan University, CN; 2Duke University, US; 3University of Washington, US; 4Duke University, US / TU Munich, DE
Abstract
Deep neural networks (DNNs) emerge as a key component in various applications. However, the ever-growing DNN size hinders efficient processing on hardware. To tackle this problem, on the algorithmic side, compressed DNN models are explored, of which block-circulant DNN models are memory-efficient and hardware-friendly; on the hardware side, resistive random-access memory (ReRAM) based accelerators are promising for in-situ processing of DNNs. In this work, we design an accelerator named ReBoc for accelerating block-circulant neural networks in ReRAM to reap the benefits of light-weight DNN models and efficient in-situ processing simultaneously. We propose a novel mapping scheme which utilizes Horizontal Weight Slicing and Intra-Crossbar Weight Duplication to map the block-circulant DNN model onto ReRAM crossbars with significantly improved crossbar utilization. Moreover, two techniques, namely Input Slice Reusing and Input Tile Sharing, are introduced to take advantage of the circulant calculation feature in block-circulant DNN models to reduce data access and buffer size. In ReBoc, a DNN model is executed within an intra-layer processing pipeline, achieving 96× and 8.86× power-efficiency improvements over state-of-the-art FPGA and ASIC accelerators for block-circulant neural networks, respectively. Compared to ReRAM-based DNN accelerators, ReBoc achieves an average 4.1× speedup and 2.6× energy reduction.
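The memory efficiency of block-circulant models rests on a simple identity: each circulant weight block is fully defined by a single vector, and multiplying by that block is a circular convolution. A minimal NumPy sketch of the identity (illustrative only; not ReBoc's crossbar mapping):

```python
import numpy as np

def circulant_mvm(c, x):
    # y = C x, where C is the circulant matrix whose first column is c,
    # computed via FFT-based circular convolution instead of materializing C.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 4
c = np.array([1.0, 2.0, 3.0, 4.0])   # the one stored vector per block
x = np.array([1.0, 0.0, -1.0, 2.0])

# explicit circulant matrix for comparison: C[i, j] = c[(i - j) mod n]
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
assert np.allclose(circulant_mvm(c, x), C @ x)
```

ReBoc realizes this product in analog ReRAM crossbars rather than with FFTs; the sketch only shows why storing one vector per block suffices.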

14:30  11.4.2  GRAPHRSIM: A JOINT DEVICE-ALGORITHM RELIABILITY ANALYSIS FOR RERAM-BASED GRAPH PROCESSING
Speaker:
Chin-Fu Nien, Academia Sinica, TW
Authors:
Chin-Fu Nien1, Yi-Jou Hsiao2, Hsiang-Yun Cheng1, Cheng-Yu Wen3, Ya-Cheng Ko3 and Che-Ching Lin3
1Academia Sinica, TW; 2National Chiao Tung University, TW; 3National Taiwan University, TW
Abstract
Graph processing has attracted a lot of interest in recent years as it plays a key role in analyzing huge datasets. ReRAM-based accelerators provide a promising solution to accelerate graph processing. However, the intrinsic stochastic behavior of ReRAM devices makes their computation results unreliable. In this paper, we build a simulation platform to analyze the impact of non-ideal ReRAM devices on the error rates of various graph algorithms. We show that the characteristics of the targeted graph algorithm and the type of ReRAM computations employed greatly affect the error rates. Using representative graph algorithms as case studies, we demonstrate that our simulation platform can guide chip designers to select better design options and develop new techniques to improve reliability.
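The flavor of this analysis can be approximated in a few lines: inject a device-variation model into each analog matrix-vector product and observe how far an iterative graph algorithm drifts from the exact result. The sketch below uses PageRank and a multiplicative Gaussian noise model as assumed, illustrative stand-ins (not GraphRSim's calibrated device model):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_mvm(M, x, sigma):
    # crude device model: per-cell multiplicative Gaussian conductance variation
    return (M * rng.normal(1.0, sigma, size=M.shape)) @ x

# 4-node toy graph; columns of M sum to 1 (column-stochastic transition matrix)
A = np.array([[0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
M = A / A.sum(axis=0)

def pagerank(mvm, n=4, iters=50, d=0.85):
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * mvm(r)
    return r

exact = pagerank(lambda x: M @ x)
noisy = pagerank(lambda x: noisy_mvm(M, x, sigma=0.05))
max_err = np.abs(exact - noisy).max()  # error depends on the algorithm's noise tolerance
```

Sweeping sigma and swapping in other algorithms (BFS, SSSP, etc.) is the kind of design-space comparison the paper's platform automates.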

15:00  11.4.3  STAIR: HIGH RELIABLE STT-MRAM AWARE MULTI-LEVEL I/O CACHE ARCHITECTURE BY ADAPTIVE ECC ALLOCATION
Speaker:
Hossein Asadi, Sharif University of Technology, IR
Authors:
Mostafa Hadizadeh, Elham Cheshmikhani and Hossein Asadi, Sharif University of Technology, IR
Abstract
Hybrid Multi-Level Cache Architectures (HCAs) are promising solutions for the growing need for high-performance and cost-efficient data storage systems. HCAs employ a highly endurable memory as the first-level cache and a Solid-State Drive (SSD) as the second-level cache. Spin-Transfer Torque Magnetic RAM (STT-MRAM) is one of the most promising candidates for the first-level cache of HCAs because of its high endurance and DRAM-comparable performance along with non-volatility. However, STT-MRAM faces three major reliability challenges: Read Disturbance, Write Failure, and Retention Failure. To provide a reliable HCA, these reliability challenges should be carefully addressed. To this end, this paper first makes a careful distinction between clean and dirty pages to classify and prioritize their different vulnerabilities. Then, we investigate the distribution of the more vulnerable pages in the first-level cache of HCAs over 17 storage workloads. Our observations show that the protection overhead can be significantly reduced by adjusting the protection level of data pages based on their vulnerability. Accordingly, we propose an STT-MRAM-Aware Multi-Level I/O Cache Architecture (STAIR) to improve HCA reliability by dynamically generating extra, stronger Error-Correction Codes (ECCs) for dirty data pages. STAIR adaptively allocates under-utilized parts of the first-level cache to store these extra ECCs. Our evaluations show that STAIR decreases the data loss probability by five orders of magnitude, on average, with negligible performance overhead (0.12% hit-ratio reduction in the worst case) and 1.56% memory overhead for the cache controller.
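The allocation policy can be sketched in a few lines of Python (all names and bit budgets below are hypothetical, chosen for illustration; the paper's mechanism lives in the cache controller hardware): dirty pages, whose only up-to-date copy sits in the STT-MRAM, get an extra, stronger ECC carved out of under-utilized cache space, while clean pages keep the baseline code because a lost clean page can simply be re-fetched.

```python
BASELINE_ECC_BITS = 8   # assumed per-page baseline code
EXTRA_ECC_BITS = 32     # assumed stronger code reserved for dirty pages

def ecc_budget(page):
    # Vulnerability-aware allocation: an error in a dirty page loses data,
    # so it is protected more strongly than a re-fetchable clean page.
    bits = BASELINE_ECC_BITS
    if page["dirty"]:
        bits += EXTRA_ECC_BITS
    return bits

cache = [{"id": 0, "dirty": False}, {"id": 1, "dirty": True}]
budgets = {p["id"]: ecc_budget(p) for p in cache}
assert budgets == {0: 8, 1: 40}
```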

15:15  11.4.4  EFFECTIVE WRITE DISTURBANCE MITIGATION ENCODING SCHEME FOR HIGH-DENSITY PCM
Speaker:
Muhammad Imran, Sungkyunkwan University, KR
Authors:
Muhammad Imran, Taehyun Kwon and Joon-Sung Yang, Sungkyunkwan University, KR
Abstract
Write Disturbance (WD) is a crucial reliability concern in high-density PCM scaled below 20 nm. WD occurs because of inter-cell heat transfer during a RESET operation. Since it depends on the type of programming pulse and the state of the vulnerable cell, WD is significantly impacted by data patterns. Existing encoding techniques to mitigate WD reduce the percentage of a single WD-vulnerable pattern in the data. However, reducing the frequency of a single bit pattern may not be effective for certain data patterns. This work proposes a significantly more effective encoding method which minimizes the number of vulnerable cells instead of a single bit pattern. The proposed method mitigates WD both within a word-line and across the bit-lines. In addition to WD mitigation, the proposed method encodes the data to minimize bit flips, thus improving memory lifetime compared to conventional WD-mitigation techniques. Our evaluation using SPEC CPU2006 benchmarks shows that the proposed method can reduce aggregate (word-line + bit-line) WD errors by 42% compared to the existing state-of-the-art (SD-PCM). It also improves average write time, instructions-per-cycle (IPC) and write energy by 12%, 12% and 9%, respectively, by reducing the frequency of the verify-and-correct operations needed to address WD errors. With the reduction in bit flips, memory lifetime is also improved by 18% to 37% compared to SD-PCM, given an asymmetric cost of bit flips. By integrating with the orthogonal techniques of SD-PCM, the proposed method can further enhance performance and energy efficiency.
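The shift from "reduce one bad bit pattern" to "minimize vulnerable cells" can be illustrated with a toy model (this is an assumed simplification, not the paper's scheme): say a cell is WD-vulnerable when it stores a 1 adjacent to a 0 within the word-line, then write either the word or its complement, whichever yields fewer vulnerable cells, recording the choice in a flag bit.

```python
def vulnerable_cells(bits):
    # toy vulnerability model: a stored 1 with a 0 neighbor may be disturbed
    # by the neighbor's RESET pulse
    return sum(1 for i, b in enumerate(bits)
               if b == 1 and ((i > 0 and bits[i-1] == 0) or
                              (i + 1 < len(bits) and bits[i+1] == 0)))

def encode(bits):
    # choose the representation (original or complement) with fewer
    # vulnerable cells; one flag bit records the choice for decoding
    inv = [1 - b for b in bits]
    if vulnerable_cells(inv) < vulnerable_cells(bits):
        return inv, 1
    return bits, 0

word = [1, 0, 1, 0, 1, 0, 1, 0]
coded, flag = encode(word)
assert vulnerable_cells(coded) <= vulnerable_cells(word)
```

The paper's method additionally accounts for bit-line neighbors and for the asymmetric cost of bit flips; this sketch only conveys the cell-counting objective.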

15:30  IP5-6, 478  COMPUTATIONAL RESTRUCTURING: RETHINKING IMAGE PROCESSING USING MEMRISTOR CROSSBAR ARRAYS
Speaker:
Rickard Ewetz, University of Central Florida, US
Authors:
Baogang Zhang, Necati Uysal and Rickard Ewetz, University of Central Florida, US
Abstract
Image processing is a core operation performed on billions of sensor devices in the Internet of Things (IoT). Emerging memristor crossbar arrays (MCAs) promise to perform matrix-vector multiplication (MVM), the dominating computation within the two-dimensional Discrete Cosine Transform (2D DCT), with an extremely small energy-delay product. Earlier studies have directly mapped the digital implementation to MCA-based hardware. The drawback is that the series computation is vulnerable to errors. Moreover, the implementation requires the use of large image block sizes, which is known to degrade image quality. In this paper, we propose to restructure the 2D DCT into an equivalent single linear transformation (or MVM operation). The restructuring eliminates the series computation and reduces the processed block sizes from N×N to √N×√N. Consequently, both the robustness to errors and the image quality are improved. Moreover, latency, power, and area are reduced by 2X while eliminating the storage of intermediate data, and the power and area can be further reduced by up to 62% and 74% using frequency spectrum optimization.
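The equivalence at the heart of the proposal is a standard linear-algebra fact: the two-pass transform D·X·Dᵀ equals a single MVM with the Kronecker product D ⊗ D applied to the flattened block. A small NumPy check of this identity (illustrative only; the paper additionally resizes blocks and applies frequency-spectrum optimization):

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II matrix
    D = np.array([[np.cos(np.pi * (2*j + 1) * k / (2*n)) for j in range(n)]
                  for k in range(n)])
    D[0] *= np.sqrt(1.0 / n)
    D[1:] *= np.sqrt(2.0 / n)
    return D

n = 4
D = dct_matrix(n)
X = np.arange(n * n, dtype=float).reshape(n, n)   # one image block

two_step = D @ X @ D.T                                 # series computation
one_step = (np.kron(D, D) @ X.ravel()).reshape(n, n)   # single linear transform
assert np.allclose(two_step, one_step)
```

Mapped to an MCA, the single kron(D, D) matrix becomes one crossbar programming, removing the intermediate result that the series form must store and re-inject.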

15:33  IP5-7, 312  SCRIMP: A GENERAL STOCHASTIC COMPUTING ACCELERATION ARCHITECTURE USING RERAM IN-MEMORY PROCESSING
Speaker:
Saransh Gupta, University of California, San Diego, US
Authors:
Saransh Gupta1, Mohsen Imani1, Joonseop Sim1, Andrew Huang1, Fan Wu1, M. Hassan Najafi2 and Tajana Rosing1
1University of California, San Diego, US; 2University of Louisiana, US
Abstract
Stochastic computing (SC) reduces the complexity of computation by representing numbers with long independent bit-streams. However, increasing performance in SC comes with an increase in area and loss in accuracy. Processing in memory (PIM) with non-volatile memories (NVMs) computes data in-place, while having high memory density and supporting bit-parallel operations with low energy. In this paper, we propose SCRIMP for stochastic computing acceleration with resistive RAM (ReRAM) in-memory processing, which enables SC in memory. SCRIMP can be used for a wide range of applications. It supports all SC encodings and operations in memory. It maximizes the performance and energy efficiency of implementing SC by introducing novel in-memory parallel stochastic number generation and efficient implication-based logic in memory. To show the efficiency of our stochastic architecture, we implement image processing on the proposed hardware.
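The basic SC primitive is easy to demonstrate in software: with unipolar encoding, a value p in [0, 1] becomes a bit-stream with P(bit = 1) = p, and multiplication reduces to a bitwise AND of two independent streams. A toy sketch (illustrative only; SCRIMP generates the streams and evaluates the logic inside ReRAM arrays):

```python
import random

random.seed(1)

def sng(p, n):
    # stochastic number generator: unipolar encoding, P(bit = 1) = p
    return [1 if random.random() < p else 0 for _ in range(n)]

def sc_mul(a_bits, b_bits):
    # unipolar SC multiply: bitwise AND of two independent streams
    return [a & b for a, b in zip(a_bits, b_bits)]

n = 10000
a, b = 0.5, 0.6
prod = sum(sc_mul(sng(a, n), sng(b, n))) / n
assert abs(prod - a * b) < 0.05  # accuracy improves with stream length
```

The trade-off the abstract mentions is visible here: longer streams shrink the error but cost proportionally more bit operations, which is what motivates accelerating them with dense, bit-parallel in-memory hardware.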

15:30  End of session