11.4 Reliable in-memory computing


Date: Thursday 12 March 2020
Time: 14:00 - 15:30
Location / Room: Stendhal

Chair:
Jean-Philippe Noel, CEA-Leti, FR

Co-Chair:
Shahar Kvatinsky, Technion, IL

This session deals with work on the reliability of computing in memories. It covers new design techniques to improve CNN computing in ReRAM, as well as device-algorithm co-optimization to improve the reliability of ReRAM-based graph processing. The session also addresses improving the reliability of the well-established STT-MRAM and PCM technologies. Finally, early works presenting stochastic computing and disruptive image-processing techniques based on memristors are also discussed.

Time  Label  Presentation Title / Authors
14:00  11.4.1  REBOC: ACCELERATING BLOCK-CIRCULANT NEURAL NETWORKS IN RERAM
Speaker:
Yitu Wang, Fudan University, CN
Authors:
Yitu Wang1, Fan Chen2, Linghao Song2, C.-J. Richard Shi3, Hai (Helen) Li4 and Yiran Chen2
1Fudan University, CN; 2Duke University, US; 3University of Washington, US; 4Duke University, US / TU Munich, DE
Abstract
Deep neural networks (DNNs) emerge as a key component in various applications. However, the ever-growing DNN size hinders efficient processing on hardware. To tackle this problem, on the algorithmic side, compressed DNN models are explored, of which block-circulant DNN models are memory-efficient and hardware-friendly; on the hardware side, resistive random-access memory (ReRAM) based accelerators are promising for in-situ processing of DNNs. In this work, we design an accelerator named ReBoc for accelerating block-circulant neural networks in ReRAM to reap the benefits of light-weight DNN models and efficient in-situ processing simultaneously. We propose a novel mapping scheme which utilizes Horizontal Weight Slicing and Intra-Crossbar Weight Duplication to map the block-circulant DNN model onto ReRAM crossbars with significantly improved crossbar utilization. Moreover, two techniques, namely Input Slice Reusing and Input Tile Sharing, are introduced to take advantage of the circulant calculation feature in block-circulant DNN models to reduce data access and buffer size. In ReBoc, a DNN model is executed within an intra-layer processing pipeline, achieving 96× and 8.86× power-efficiency improvements over state-of-the-art FPGA and ASIC accelerators for block-circulant neural networks, respectively. Compared to ReRAM-based DNN accelerators, ReBoc achieves an average 4.1× speedup and 2.6× energy reduction.
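The memory efficiency of block-circulant models rests on a simple identity: each circulant weight block is fully defined by a single vector, and multiplying by that block is a circular convolution. A minimal NumPy sketch of the identity (illustrative only; not ReBoc's crossbar mapping):

```python
import numpy as np

def circulant_mvm(c, x):
    # y = C x, where C is the circulant matrix whose first column is c,
    # computed via FFT-based circular convolution instead of materializing C.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 4
c = np.array([1.0, 2.0, 3.0, 4.0])   # the one stored vector per block
x = np.array([1.0, 0.0, -1.0, 2.0])

# explicit circulant matrix for comparison: C[i, j] = c[(i - j) mod n]
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
assert np.allclose(circulant_mvm(c, x), C @ x)
```

ReBoc realizes this product in analog ReRAM crossbars rather than with FFTs; the sketch only shows why storing one vector per block suffices.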

14:30  11.4.2  GRAPHRSIM: A JOINT DEVICE-ALGORITHM RELIABILITY ANALYSIS FOR RERAM-BASED GRAPH PROCESSING
Speaker:
Chin-Fu Nien, Academia Sinica, TW
Authors:
Chin-Fu Nien1, Yi-Jou Hsiao2, Hsiang-Yun Cheng1, Cheng-Yu Wen3, Ya-Cheng Ko3 and Che-Ching Lin3
1Academia Sinica, TW; 2National Chiao Tung University, TW; 3National Taiwan University, TW
Abstract
Graph processing has attracted a lot of interest in recent years as it plays a key role in analyzing huge datasets. ReRAM-based accelerators provide a promising solution to accelerate graph processing. However, the intrinsic stochastic behavior of ReRAM devices makes their computation results unreliable. In this paper, we build a simulation platform to analyze the impact of non-ideal ReRAM devices on the error rates of various graph algorithms. We show that the characteristics of the targeted graph algorithm and the type of ReRAM computations employed greatly affect the error rates. Using representative graph algorithms as case studies, we demonstrate that our simulation platform can guide chip designers to select better design options and develop new techniques to improve reliability.
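The flavor of this analysis can be approximated in a few lines: inject a device-variation model into each analog matrix-vector product and observe how far an iterative graph algorithm drifts from the exact result. The sketch below uses PageRank and a multiplicative Gaussian noise model as assumed, illustrative stand-ins (not GraphRSim's calibrated device model):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_mvm(M, x, sigma):
    # crude device model: per-cell multiplicative Gaussian conductance variation
    return (M * rng.normal(1.0, sigma, size=M.shape)) @ x

# 4-node toy graph; columns of M sum to 1 (column-stochastic transition matrix)
A = np.array([[0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
M = A / A.sum(axis=0)

def pagerank(mvm, n=4, iters=50, d=0.85):
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * mvm(r)
    return r

exact = pagerank(lambda x: M @ x)
noisy = pagerank(lambda x: noisy_mvm(M, x, sigma=0.05))
max_err = np.abs(exact - noisy).max()  # error depends on the algorithm's noise tolerance
```

Sweeping sigma and swapping in other algorithms (BFS, SSSP, etc.) is the kind of design-space comparison the paper's platform automates.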

15:00  11.4.3  STAIR: HIGH RELIABLE STT-MRAM AWARE MULTI-LEVEL I/O CACHE ARCHITECTURE BY ADAPTIVE ECC ALLOCATION
Speaker:
Hossein Asadi, Sharif University of Technology, IR
Authors:
Mostafa Hadizadeh, Elham Cheshmikhani and Hossein Asadi, Sharif University of Technology, IR
Abstract
Hybrid Multi-Level Cache Architectures (HCAs) are promising solutions for the growing need for high-performance and cost-efficient data storage systems. HCAs employ a highly endurable memory as the first-level cache and a Solid-State Drive (SSD) as the second-level cache. Spin-Transfer Torque Magnetic RAM (STT-MRAM) is one of the most promising candidates for the first-level cache of HCAs because of its high endurance and DRAM-comparable performance along with non-volatility. However, STT-MRAM faces three major reliability challenges: Read Disturbance, Write Failure, and Retention Failure. To provide a reliable HCA, these reliability challenges should be carefully addressed. To this end, this paper first makes a careful distinction between clean and dirty pages to classify and prioritize their different vulnerabilities. Then, we investigate the distribution of the more vulnerable pages in the first-level cache of HCAs over 17 storage workloads. Our observations show that the protection overhead can be significantly reduced by adjusting the protection level of data pages based on their vulnerability. Accordingly, we propose an STT-MRAM-Aware Multi-Level I/O Cache Architecture (STAIR) to improve HCA reliability by dynamically generating extra, stronger Error-Correction Codes (ECCs) for dirty data pages. STAIR adaptively allocates under-utilized parts of the first-level cache to store these extra ECCs. Our evaluations show that STAIR decreases the data loss probability by five orders of magnitude, on average, with negligible performance overhead (0.12% hit-ratio reduction in the worst case) and 1.56% memory overhead for the cache controller.
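The allocation policy can be sketched in a few lines of Python (all names and bit budgets below are hypothetical, chosen for illustration; the paper's mechanism lives in the cache controller hardware): dirty pages, whose only up-to-date copy sits in the STT-MRAM, get an extra, stronger ECC carved out of under-utilized cache space, while clean pages keep the baseline code because a lost clean page can simply be re-fetched.

```python
BASELINE_ECC_BITS = 8   # assumed per-page baseline code
EXTRA_ECC_BITS = 32     # assumed stronger code reserved for dirty pages

def ecc_budget(page):
    # Vulnerability-aware allocation: an error in a dirty page loses data,
    # so it is protected more strongly than a re-fetchable clean page.
    bits = BASELINE_ECC_BITS
    if page["dirty"]:
        bits += EXTRA_ECC_BITS
    return bits

cache = [{"id": 0, "dirty": False}, {"id": 1, "dirty": True}]
budgets = {p["id"]: ecc_budget(p) for p in cache}
assert budgets == {0: 8, 1: 40}
```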

15:15  11.4.4  EFFECTIVE WRITE DISTURBANCE MITIGATION ENCODING SCHEME FOR HIGH-DENSITY PCM
Speaker:
Muhammad Imran, Sungkyunkwan University, KR
Authors:
Muhammad Imran, Taehyun Kwon and Joon-Sung Yang, Sungkyunkwan University, KR
Abstract
Write Disturbance (WD) is a crucial reliability concern in high-density PCM scaled below 20 nm. WD occurs because of inter-cell heat transfer during a RESET operation. Since it depends on the type of programming pulse and the state of the vulnerable cell, WD is significantly impacted by data patterns. Existing encoding techniques to mitigate WD reduce the percentage of a single WD-vulnerable pattern in the data. However, reducing the frequency of a single bit pattern may not be effective for certain data patterns. This work proposes a significantly more effective encoding method which minimizes the number of vulnerable cells instead of a single bit pattern. The proposed method mitigates WD both within a word-line and across the bit-lines. In addition to WD mitigation, the proposed method encodes the data to minimize bit flips, thus improving memory lifetime compared to conventional WD-mitigation techniques. Our evaluation using SPEC CPU2006 benchmarks shows that the proposed method can reduce aggregate (word-line + bit-line) WD errors by 42% compared to the existing state-of-the-art (SD-PCM). It also improves average write time, instructions-per-cycle (IPC) and write energy by 12%, 12% and 9%, respectively, by reducing the frequency of the verify-and-correct operations needed to address WD errors. With the reduction in bit flips, memory lifetime is also improved by 18% to 37% compared to SD-PCM, given an asymmetric cost of bit flips. By integrating with the orthogonal techniques of SD-PCM, the proposed method can further enhance performance and energy efficiency.
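The shift from "reduce one bad bit pattern" to "minimize vulnerable cells" can be illustrated with a toy model (this is an assumed simplification, not the paper's scheme): say a cell is WD-vulnerable when it stores a 1 adjacent to a 0 within the word-line, then write either the word or its complement, whichever yields fewer vulnerable cells, recording the choice in a flag bit.

```python
def vulnerable_cells(bits):
    # toy vulnerability model: a stored 1 with a 0 neighbor may be disturbed
    # by the neighbor's RESET pulse
    return sum(1 for i, b in enumerate(bits)
               if b == 1 and ((i > 0 and bits[i-1] == 0) or
                              (i + 1 < len(bits) and bits[i+1] == 0)))

def encode(bits):
    # choose the representation (original or complement) with fewer
    # vulnerable cells; one flag bit records the choice for decoding
    inv = [1 - b for b in bits]
    if vulnerable_cells(inv) < vulnerable_cells(bits):
        return inv, 1
    return bits, 0

word = [1, 0, 1, 0, 1, 0, 1, 0]
coded, flag = encode(word)
assert vulnerable_cells(coded) <= vulnerable_cells(word)
```

The paper's method additionally accounts for bit-line neighbors and for the asymmetric cost of bit flips; this sketch only conveys the cell-counting objective.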

15:30  IP5-6, 478  COMPUTATIONAL RESTRUCTURING: RETHINKING IMAGE PROCESSING USING MEMRISTOR CROSSBAR ARRAYS
Speaker:
Rickard Ewetz, University of Central Florida, US
Authors:
Baogang Zhang, Necati Uysal and Rickard Ewetz, University of Central Florida, US
Abstract
Image processing is a core operation performed on billions of sensor devices in the Internet of Things (IoT). Emerging memristor crossbar arrays (MCAs) promise to perform matrix-vector multiplication (MVM), the dominating computation within the two-dimensional Discrete Cosine Transform (2D DCT), with an extremely small energy-delay product. Earlier studies have directly mapped the digital implementation to MCA-based hardware. The drawback is that the series computation is vulnerable to errors. Moreover, the implementation requires the use of large image block sizes, which is known to degrade image quality. In this paper, we propose to restructure the 2D DCT into an equivalent single linear transformation (or MVM operation). The restructuring eliminates the series computation and reduces the processed block sizes from N×N to √N×√N. Consequently, both the robustness to errors and the image quality are improved. Moreover, latency, power, and area are reduced by 2X while eliminating the storage of intermediate data, and the power and area can be further reduced by up to 62% and 74% using frequency spectrum optimization.
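The equivalence at the heart of the proposal is a standard linear-algebra fact: the two-pass transform D·X·Dᵀ equals a single MVM with the Kronecker product D ⊗ D applied to the flattened block. A small NumPy check of this identity (illustrative only; the paper additionally resizes blocks and applies frequency-spectrum optimization):

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II matrix
    D = np.array([[np.cos(np.pi * (2*j + 1) * k / (2*n)) for j in range(n)]
                  for k in range(n)])
    D[0] *= np.sqrt(1.0 / n)
    D[1:] *= np.sqrt(2.0 / n)
    return D

n = 4
D = dct_matrix(n)
X = np.arange(n * n, dtype=float).reshape(n, n)   # one image block

two_step = D @ X @ D.T                                 # series computation
one_step = (np.kron(D, D) @ X.ravel()).reshape(n, n)   # single linear transform
assert np.allclose(two_step, one_step)
```

Mapped to an MCA, the single kron(D, D) matrix becomes one crossbar programming, removing the intermediate result that the series form must store and re-inject.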

15:33  IP5-7, 312  SCRIMP: A GENERAL STOCHASTIC COMPUTING ACCELERATION ARCHITECTURE USING RERAM IN-MEMORY PROCESSING
Speaker:
Saransh Gupta, University of California, San Diego, US
Authors:
Saransh Gupta1, Mohsen Imani1, Joonseop Sim1, Andrew Huang1, Fan Wu1, M. Hassan Najafi2 and Tajana Rosing1
1University of California, San Diego, US; 2University of Louisiana, US
Abstract
Stochastic computing (SC) reduces the complexity of computation by representing numbers with long independent bit-streams. However, increasing performance in SC comes with an increase in area and loss in accuracy. Processing in memory (PIM) with non-volatile memories (NVMs) computes data in-place, while having high memory density and supporting bit-parallel operations with low energy. In this paper, we propose SCRIMP for stochastic computing acceleration with resistive RAM (ReRAM) in-memory processing, which enables SC in memory. SCRIMP can be used for a wide range of applications. It supports all SC encodings and operations in memory. It maximizes the performance and energy efficiency of implementing SC by introducing novel in-memory parallel stochastic number generation and efficient implication-based logic in memory. To show the efficiency of our stochastic architecture, we implement image processing on the proposed hardware.
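The basic SC primitive is easy to demonstrate in software: with unipolar encoding, a value p in [0, 1] becomes a bit-stream with P(bit = 1) = p, and multiplication reduces to a bitwise AND of two independent streams. A toy sketch (illustrative only; SCRIMP generates the streams and evaluates the logic inside ReRAM arrays):

```python
import random

random.seed(1)

def sng(p, n):
    # stochastic number generator: unipolar encoding, P(bit = 1) = p
    return [1 if random.random() < p else 0 for _ in range(n)]

def sc_mul(a_bits, b_bits):
    # unipolar SC multiply: bitwise AND of two independent streams
    return [a & b for a, b in zip(a_bits, b_bits)]

n = 10000
a, b = 0.5, 0.6
prod = sum(sc_mul(sng(a, n), sng(b, n))) / n
assert abs(prod - a * b) < 0.05  # accuracy improves with stream length
```

The trade-off the abstract mentions is visible here: longer streams shrink the error but cost proportionally more bit operations, which is what motivates accelerating them with dense, bit-parallel in-memory hardware.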

15:30  End of session