# Valid Window: A New Metric to Measure the Reliability of NAND Flash Memory

Min Ye<sup>\*</sup>, Qiao Li<sup>\*</sup>, Jianqiang Nie<sup>‡</sup>, Tei-Wei Kuo<sup>\*</sup>, Chun Jason Xue<sup>\*</sup> \*City University of Hong Kong, <sup>‡</sup>YEESTOR Microelectronics Co., Ltd

Abstract—NAND flash memory has been widely adopted in storage systems today. The most important issue in flash memory is its reliability, especially for 3D NAND, which suffers from several types of errors. The raw bit error rate (RBER) obtained with the default read reference voltages is usually adopted as the reliability metric for NAND flash memory. However, RBER depends closely on how data is read, and varies greatly if read retry operations are conducted with tuned read reference voltages. In this work, a new metric, valid window, is proposed to measure reliability in a stable and accurate way. A valid window expresses the size of the error region between two neighboring levels and determines whether the data can be correctly read with further read retry. Taking advantage of these features, we design a method to reduce the number of read retry operations. This is achieved by adjusting the program operations of 3D NAND flash memories. Experiments on a real 3D NAND flash chip verify the effectiveness of the proposed method.

## I. INTRODUCTION

NAND flash memories are now widely adopted in storage systems, such as mobile devices, personal computers, and servers. To continuously increase the capacity and decrease the cost per bit of NAND flash memory, flash vendors have been aggressively increasing the bit density and scaling flash cells to smaller process nodes. This trend leaves fewer charges in smaller cells, which increases the pressure on flash reliability. Because cell size cannot be scaled down much further, 3D NAND flash has been introduced to increase capacity [1] [2]. Along with these developments, reliability has become the prominent issue for NAND flash memory, as several types of errors are now amplified [3] [4] [2]. This paper proposes a new metric to better characterize the reliability of NAND flash memory.

Many works have been proposed to improve the reliability of NAND flash memory. Most of them adopt the raw bit error rate (RBER) as the flash reliability metric [5] [6]. RBER is calculated by dividing the number of error bits by the number of data bits. However, because flash errors are caused by the fluctuation and shifting of voltage levels, RBER heavily depends on the read reference voltages ( $V_{read}$ ) [7] [8]. With the support of read retry operations, the read reference voltages can be tuned, and RBER varies greatly as data are read with tuned $V_{read}$. If $V_{read}$ is tuned in the same direction as the voltage levels shift, RBER will be reduced. Many works have also proposed approaches to tune the $V_{read}$ of read retry operations to improve reliability. As a result, RBER is not a stable metric for accurately measuring the actual reliability of a specific page or block of flash memory.

In this work, we propose a new metric called valid window (VW) to measure the reliability of NAND flash memory. A valid window is defined as a window between two adjacent voltage levels such that any $V_{read}$ inside the window produces an RBER lower than a threshold. The threshold is defined by the error correction capability of the adopted error correction code (ECC). When a $V_{read}$ is inside the VW, the data read with that $V_{read}$ can be correctly decoded by ECC. Therefore, reliability is higher when the valid window is larger: more $V_{read}$ options are available for a successful read, and the voltage levels have a larger region in which to shift and fluctuate.

There are many applications of this new metric. As an example case, we propose a design that reduces the number of read retry operations by optimizing the program operation. The main idea is to make the default read reference voltage stay within the valid window for a longer period. The design is based on the observation that NAND flash memory mainly suffers from retention errors [9]: high-voltage levels mainly shift to the left as retention time increases, and the corresponding valid windows shift to the left as well. Therefore, during program operations, we set the valid windows of higher levels further to the right. This is achieved by changing the verify voltage of program operations. To verify the effectiveness of the proposed approach, we implement it on a real 3D NAND flash chip. The results show that the number of read retry operations is reduced from 50579 to 2 when the number of P/E cycles is 1000.

The major contributions of this work are as follows.

- Shows that the current reliability metric, RBER, cannot accurately and stably measure the reliability of flash memory;
- Proposes a new accurate and stable metric, named valid window, for flash reliability;
- Proposes to optimize program operations for better read performance, with a reduced number of read retries, by taking advantage of the new metric;
- Conducts experiments on real flash chips to verify the effectiveness of the proposed techniques.

In the remainder of the paper, Section II presents the basics of flash memory. Section III presents the motivation of this work. The new metric for flash reliability is discussed in Section IV, and Section V presents a use case of the new metric. Experiments are presented in Section VI. Section VII concludes this work.

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11219319).



## II. BACKGROUND

NAND flash memory stores data by trapping charges in the floating gate or charge trap. The data stored in each cell is represented by the threshold voltage $(V_t)$, which is divided into multiple voltage levels based on the number of bits stored in each cell. This work uses triple-level cell (TLC) as an example, which is now commonly used in 3D NAND flash memory [1] [2]. Figure 1 shows the threshold voltage $(V_t)$ distribution of TLC. There are eight voltage levels (from L0 to L7) to store three bits in each cell: LSB (Least Significant Bit), CSB (Center Significant Bit) and MSB (Most Significant Bit). To differentiate the eight voltage levels, seven read reference voltages $V_{read}$ (from R1 to R7) are required, one between each pair of adjacent voltage levels. Gray code is used so that only one bit differs between every two neighboring voltage levels. Figure 1 shows a typical 2-3-2 code, which means two $V_{read}$ (R1 and R5) are required to read the LSB page, three $V_{read}$ (R2, R4 and R6) to read the CSB page, and two $V_{read}$ (R3 and R7) to read the MSB page.

NAND flash memory suffers from multiple error sources, such as P/E cycling and retention time. The voltage levels shift and fluctuate as retention time and P/E cycles increase, as shown in Figure 2, and thus the bit error rate increases. Every $V_{read}$ introduces an RBER individually. Let RBER(Ri) denote the RBER produced by reference voltage Ri, and RBER(xSB) the RBER of the xSB page. Then:

$$
\begin{aligned}
RBER(LSB) &= RBER(R1) + RBER(R5) \\
RBER(CSB) &= RBER(R2) + RBER(R4) + RBER(R6) \\
RBER(MSB) &= RBER(R3) + RBER(R7)
\end{aligned}
\tag{1}
$$

When RBER exceeds a threshold, the ECC engine cannot recover the page data correctly. In this case, read retry is necessary [8] [10]. Read retry re-reads the data with tuned $V_{read}$ to reduce the RBER seen by ECC [7] [8]. How $V_{read}$ is tuned is critical for read performance, because it determines whether the tuned $V_{read}$ yields an RBER lower than the error correction capability of ECC. However, it is hard to find a proper set of $V_{read}$ because the errors on flash memory are complex and hard to predict, so a successful read may take multiple read retries. Vendors usually provide a read retry table for this purpose, which contains tens of $V_{read}$ sets. When read retry is required after ECC decoding fails, the $V_{read}$ sets are tried one by one until ECC correctly recovers the data.
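The table-driven retry procedure described above can be sketched as follows. This is a minimal illustration, not a vendor implementation: `read_page`, `ecc_decode`, and the shape of the retry table entries are hypothetical stand-ins, and passing `None` to request the default $V_{read}$ set is an assumed convention.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def read_with_retry(
    read_page: Callable[[Optional[Any]], Any],
    ecc_decode: Callable[[Any], Optional[bytes]],
    retry_table: Sequence[Any],
) -> Tuple[Optional[bytes], int]:
    """Read once with the default V_read set, then walk the retry table on ECC failure.

    Returns (decoded data or None, number of read retries issued).
    """
    # None stands for "use the default V_read set" (assumed convention).
    data = ecc_decode(read_page(None))
    if data is not None:
        return data, 0
    # Try each V_read set in table order until ECC succeeds.
    for retries, vread_set in enumerate(retry_table, start=1):
        data = ecc_decode(read_page(vread_set))
        if data is not None:
            return data, retries
    return None, len(retry_table)


# Toy stand-ins: the "raw page" is simply the V_read set that was used, and
# decoding succeeds only for the second entry of the table.
def toy_read_page(vread_set):
    return vread_set


def toy_ecc_decode(raw):
    return b"page data" if raw == {"R3": -4} else None


toy_table = [{"R3": -2}, {"R3": -4}, {"R3": -6}]
data, retries = read_with_retry(toy_read_page, toy_ecc_decode, toy_table)
print(data, retries)
```

Each loop iteration costs one extra page read and one extra ECC decode, which is why the position of the first working entry in the table matters for read latency.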

## III. MOTIVATION

In previous work, RBER is widely adopted as the reliability metric for NAND flash memory; it is calculated by dividing the number of error bits by the number of data bits. The number of error bits is usually obtained by reading data with the default reference voltages. A low RBER indicates high reliability of the stored data. However, with the adoption of read retry, the RBER of a page varies with the read reference voltages and may change significantly from one set to another. Figure 3 illustrates reading data with different $V_{read}$. Level(i-1) and Level(i) represent two adjacent voltage levels of the TLC threshold voltage distribution. The x-axis represents the threshold voltage of flash cells, and the y-axis represents the number of memory cells. RiO, RiD, RiL and RiT are four different $V_{read}$, and reading data with each of them yields a different RBER. RiO is the optimized $V_{read}$, and the page data read by RiO has the lowest RBER.

We measured the number of error bits with four sets of read voltages for the CSB page on the same wordline after 1500 P/E cycles and one-year retention on a real flash memory chip. By sweeping $V_{read}$ from the minimal to the maximal value of the threshold voltage, we can obtain the threshold voltage of each cell and thus the Vt distribution in Figure 4. R2, R4 and R6 are required to read the CSB page. As shown in Figure 4, $R_{i}D$ ( $i \in \{2, 4, 6\}$ ) is the default read reference voltage. $R_{i}O$ is the optimized $V_{read}$, which is supposed to introduce the smallest number of error bits. $R_{i}L$ and $R_{i}T$ are $V_{read}$ values selected to the left and to the right of the default $V_{read}$, respectively. Reading the CSB page with these four sets of $V_{read}$ gives the numbers of error bits shown in Table I. The difference is huge, up to two orders of magnitude.



Fig. 4. Different  $V_{read}$  produce very different RBER.

TABLE I. COMPARISON OF RBER FOR THE CSB PAGE.

| $V_{read}$ | Default $R_{i}D$      | Optimized $R_{i}O$    | Right $R_{i}T$        | Left $R_{i}L$         |
|------------|-----------------------|-----------------------|-----------------------|-----------------------|
| Error bits | 3016                  | 191                   | 24856                 | 4533                  |
| RBER       | $2.06 \times 10^{-2}$ | $1.30 \times 10^{-3}$ | $1.69 \times 10^{-1}$ | $3.09 \times 10^{-2}$ |
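As a quick check of the spread in Table I, the short sketch below divides each error-bit count by the count of the optimized read. The numbers are copied from the table; the dictionary keys are labels chosen here for illustration.

```python
# Error-bit counts copied from Table I (CSB page, 1500 P/E cycles, one-year retention).
error_bits = {"default": 3016, "optimized": 191, "right": 24856, "left": 4533}

# Express every count as a multiple of the best (optimized) read.
best = error_bits["optimized"]
ratio_to_best = {name: bits / best for name, bits in error_bits.items()}

for name, ratio in ratio_to_best.items():
    print(f"{name:>9}: {error_bits[name]:>6} error bits, {ratio:6.1f}x the optimized read")
```

The same page on the same wordline thus spans a factor of more than one hundred in error bits, depending only on which $V_{read}$ set is used.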



Fig. 5. ERx and the size of valid window (VW).

As a result, RBER alone is not suitable for measuring the reliability of flash memory, because it changes significantly with different sets of $V_{read}$. The common metric RBER fails to measure the reliability of NAND flash memory accurately and stably. In this work, we propose a new metric for the reliability of NAND flash memory.

## IV. A NEW METRIC: VALID WINDOW

To accurately measure flash reliability, this paper proposes a new metric, valid window (VW). A valid window is defined as the window between two adjacent voltage levels such that any $V_{read}$ inside the window produces an RBER lower than a pre-defined threshold. Figure 5 illustrates the valid window between level(i-1) and level(i). Suppose the threshold is ERx, the left border of the window is VrL and the right border is VrR. When VrL or VrR is used as the $V_{read}$ to read data, the RBER equals ERx. The RBER is lower than ERx if $V_{read}$ is between VrL and VrR.

Let r be the maximum error rate that can be corrected by ECC. Then, for 2-3-2 Gray code NAND flash, ERx = r/3 for every $V_{read}$ of the CSB page (R2, R4, R6), and ERx = r/2 for every $V_{read}$ of the LSB page (R1, R5) and the MSB page (R3, R7). As long as each $V_{read}$ of a page is in its valid window, the page can be recovered by the ECC engine.
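The per-reference threshold and the "every $V_{read}$ in its window" condition can be sketched as below. The function names, the integer tuning-unit representation of $V_{read}$, and the example windows are assumptions made for illustration; only the 2-3-2 mapping and the r/2, r/3 split come from the text.

```python
from typing import Dict, Optional, Tuple

# 2-3-2 Gray code: the read references used by each page type.
READ_REFS = {"LSB": ("R1", "R5"), "CSB": ("R2", "R4", "R6"), "MSB": ("R3", "R7")}

# (VrL, VrR) in V_read tuning units, or None when no valid window exists.
Window = Optional[Tuple[int, int]]


def erx_for_page(page: str, r: float) -> float:
    """Per-reference RBER threshold: the ECC budget r split evenly over the page's V_read."""
    return r / len(READ_REFS[page])


def needs_read_retry(page: str, vread: Dict[str, int], windows: Dict[str, Window]) -> bool:
    """True if any V_read used by the page lies outside its valid window."""
    for ref in READ_REFS[page]:
        window = windows.get(ref)
        if window is None:
            return True
        left, right = window
        if not left <= vread[ref] <= right:
            return True
    return False


# Hypothetical windows after retention, with the default V_read at offset 0.
windows = {"R2": (-6, 1), "R4": (-9, -2), "R6": (-12, -4)}
default = {"R2": 0, "R4": 0, "R6": 0}
tuned = {"R2": 0, "R4": -5, "R6": -8}

print(needs_read_retry("CSB", default, windows))
print(needs_read_retry("CSB", tuned, windows))
```

In this toy case the default set fails because R4 and R6 have drifted out of their windows, while a set tuned to the left satisfies all three windows at once.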

Valid windows do not change with $V_{read}$, so they are more stable than RBER. The size of the valid windows can be used to measure the reliability of NAND flash memory. A bigger valid window represents higher reliability, since it indicates that two adjacent voltage levels are further away from each other. With a larger window, it is easier to move $V_{read}$ into the valid window by read retry, and therefore easier for the ECC engine to recover the page data. As long as the size of each valid window is larger than one unit of $V_{read}$ tuning, page data can be recovered by changing $V_{read}$ with the help of the ECC engine. If a valid window does not exist, it is not possible to recover the data by read retry. If all the default $V_{read}$ of a page are in the corresponding valid windows, there is no need for read retry.

Compared to RBER, which can be obtained through one read operation, VW requires more read operations and calculations. In this paper, it is calculated through offline analysis. Multiple read reference voltages are used to read the data, and the RBER introduced by each $V_{read}$ is used to find the borders of the valid window. The time to compute the VW of a wordline (WL) is less than 3 seconds. This additional overhead does not prevent the adoption of VW as the reliability metric, because the reliability of blocks or pages does not change greatly over a short period of time. Therefore, VW can be calculated periodically to minimize the overhead.
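One possible way to locate the window borders from such a sweep is sketched below. It is not the authors' offline tool: it assumes the sweep is sampled at discrete, ascending $V_{read}$ steps, approximates VrL and VrR by the outermost sampled steps whose RBER is still below ERx, and uses made-up RBER values.

```python
from typing import Optional, Sequence, Tuple


def valid_window(
    vreads: Sequence[int], rbers: Sequence[float], erx: float
) -> Optional[Tuple[int, int]]:
    """Return the (left, right) V_read steps around the RBER minimum with RBER < erx.

    vreads must be sorted ascending; returns None when no step is below erx.
    """
    if len(vreads) != len(rbers) or len(vreads) == 0:
        raise ValueError("vreads and rbers must be non-empty and of equal length")
    # Start from the best sampled V_read and grow the window in both directions.
    best = min(range(len(rbers)), key=lambda i: rbers[i])
    if rbers[best] >= erx:
        return None
    left = right = best
    while left > 0 and rbers[left - 1] < erx:
        left -= 1
    while right < len(rbers) - 1 and rbers[right + 1] < erx:
        right += 1
    return vreads[left], vreads[right]


# Hypothetical sweep of one read reference, in tuning units around the default (0).
sweep_vreads = list(range(-5, 6))
sweep_rbers = [0.05, 0.02, 0.008, 0.003, 0.001, 0.0005, 0.001, 0.004, 0.009, 0.03, 0.08]

window = valid_window(sweep_vreads, sweep_rbers, erx=0.0026)
print(window)
if window is not None:
    print("window size in tuning units:", window[1] - window[0])
```

A stricter threshold shrinks the window and can make it vanish entirely, which corresponds to the case where read retry cannot recover the data.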

Many flash management algorithms are based on reliability, for example block swapping, endurance and retention measurement, and bad block identification. Accurate measurement of reliability will greatly benefit these algorithms. Since valid windows can accurately and stably represent the reliability of NAND flash memory, adopting valid window as the reliability metric can improve both endurance and performance. To present the advantage of valid window, we propose a method to reduce the number of read retries in the following section as a use case.

## V. A USE CASE: READ RETRY REDUCTION

From the above analysis, we conclude that the key method to avoid read retry is to keep the default read reference voltage inside the valid window. In this section, we propose to reduce the number of read retry operations by adjusting the position of initial voltage levels to keep the read reference voltage inside the valid window over a longer retention period.



Fig. 8. Vt distribution after one-year retention.

Read retry is an effective way to recover data when the error rate exceeds the ECC's error-correcting capability. However, read retry is expensive. Every read retry requires setting a new set of $V_{read}$, re-reading the page, and re-running the ECC engine to correct bit errors, which greatly increases read latency and system power consumption. As retention time and P/E cycles increase, more pages need read retry to recover.

To solve these problems, the error rate should be kept as low as possible for as long as possible. Through the analysis of the valid window metric, we infer that, as long as every default $V_{read}$ is in its valid window, the error rate stays within the capability of the ECC engine, and no read retry is needed.

To achieve this goal, we also need to understand the shifting pattern of every level. As Figure 6 shows, as retention time increases, the upper levels (L7, L6, L5, L4, L3) shift down, and the higher the level, the more it shifts. The lower levels (L0, L1, L2) shift up, and the lower the level, the more it shifts. For some 3D NAND memories, lower levels may also shift down. Retention errors are the main error source on flash memory; they cause a left shift of voltage levels as charges leak over retention time [11] [12]. As a result, the valid windows shift to the left over retention time as well.

From Figure 6, two directions could be explored. The first is to move the valid windows in the direction opposite to the retention shift, to keep the default $V_{read}$ in the valid windows as long as possible. This means upper levels should be programmed a little higher than the original levels, and lower levels a little lower. The second is to move the default $V_{read}$ in the direction of the retention shift. With this method, the default $V_{read}$ stays in the valid windows for a much shorter retention time than with the first method. With increasing P/E cycles and retention time, the valid windows of the second method could even disappear. The first method not only optimizes the position of the valid window, but also increases its size. Hence this paper focuses on the first method.

As P/E cycles and retention are the most important factors affecting reliability [5] [2], this paper considers only these two factors and ignores others, such as read disturb. To accelerate the retention experiment, high-temperature baking at 100 °C for 13 hours is used to simulate a one-year retention time at 40 °C. A 64-layer 3D NAND chip was used as the test subject. Figure 7 shows the Vt distribution of the original program result with default parameters after 1500 P/E cycles. Figure 8 shows the Vt distribution after one-year retention. We notice that the default $V_{read}$ of R3 to R7 are outside their valid windows, while the others are not. Therefore, only Level3 to Level7 need to be adjusted, while the other levels keep the original program parameters.

Fig. 9. Adjust VtCt to adjust VW.

The easiest way to adjust the position of the valid windows is to adjust the Vt Check threshold (VtCt). Increasing the VtCt of Level(i) shifts the level up. If the VtCt of Level(i-1) and Level(i) are increased at the same time, the position of VW(i) moves up, while the position of the default $V_{read}$ Ri is fixed. As shown in Figure 9, after the same retention time, the default $V_{read}$ Ri is then still in the valid window. After several rounds of trial and verification, we found a good set of parameters: add 3 units to the VtCt of Level 3, 6 units to Level 4, 7 units to Level 5, 10 units to Level 6, and 15 units to Level 7.
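The parameter set above can be written down as a small table of per-level offsets. The sketch below applies it to a set of default thresholds; the default VtCt values are hypothetical placeholders (the chip's real values are not given in the text), and leaving L0 out as the erased state is an assumption.

```python
from typing import Dict

# Offsets from the text, in VtCt units, keyed by voltage level.
VTCT_OFFSETS: Dict[int, int] = {3: 3, 4: 6, 5: 7, 6: 10, 7: 15}


def adjust_vtct(default_vtct: Dict[int, int]) -> Dict[int, int]:
    """Return per-level verify thresholds with the offsets applied; other levels unchanged."""
    return {level: value + VTCT_OFFSETS.get(level, 0) for level, value in default_vtct.items()}


# Hypothetical default thresholds for L1..L7 (L0 is assumed to be the erased state).
default_vtct = {1: 20, 2: 50, 3: 80, 4: 110, 5: 140, 6: 170, 7: 200}
adjusted_vtct = adjust_vtct(default_vtct)

for level in sorted(default_vtct):
    print(f"L{level}: {default_vtct[level]} -> {adjusted_vtct[level]}")
```

The offsets grow with the level, which matches the earlier observation that higher levels shift down more during retention.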

Figure 10 is the Vt distribution of program result after adjusting the parameters. Figure 11 is the distribution after one-year retention. In this case, all the default  $V_{read}$  are in the valid windows, which means no read retry is needed to recover data.

For some WLs, some of the default $V_{read}$ may still be outside the valid windows, but they are much closer to the valid windows than before. This means that the number of read retries will be reduced.

## VI. EXPERIMENT

A set of experiments was carried out on a 64-layer 3D NAND flash memory chip to verify the effectiveness of parameter adjustment for the program operation. Eight blocks were tested, divided into two groups. Group 1 has four blocks after 1000 P/E cycles, and group 2 has four blocks after 1500 P/E cycles. In every group, two blocks use the same adjusted program parameters, and the remaining two blocks keep the default parameters. The eight blocks were programmed with the same random data. To simulate a year of retention at $40^{\circ}$C, the chip was baked at $100^{\circ}$C for 13 hours. The ECC engine uses the BCH algorithm and has an error correcting capability of 72 bits/(1K byte + Parity). All the data in the eight blocks is read and recovered by the ECC engine and read retries. The numbers of read retry operations are shown in Table II. Recovering the blocks with adjusted parameters needs far fewer read retry operations than recovering the blocks with default parameters.
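To relate this ECC capability to the threshold r used in Section IV, the sketch below converts "72 bits per 1K byte plus parity" into an approximate correctable error rate. The parity size is not stated in the text; the code assumes a binary BCH code with at most m·t parity bits, where m is the smallest field degree whose code length 2^m − 1 can hold the codeword. The result is therefore an estimate, not the controller's actual figure.

```python
T_BITS = 72            # correctable bits per codeword (from the text)
DATA_BITS = 1024 * 8   # 1K byte of data per codeword (from the text)

# Assumption: binary BCH with at most m * t parity bits, where the codeword
# (data + parity) must fit in the natural code length 2**m - 1.
m = 1
while 2**m - 1 < DATA_BITS + m * T_BITS:
    m += 1
parity_bits = m * T_BITS

# Approximate maximum correctable error rate r and per-reference thresholds ERx.
r = T_BITS / (DATA_BITS + parity_bits)
erx_csb = r / 3          # three V_read for the CSB page
erx_lsb_msb = r / 2      # two V_read for the LSB and MSB pages

print(f"m = {m}, parity bits = {parity_bits}")
print(f"r ~ {r:.3%}, ERx(CSB) ~ {erx_csb:.3%}, ERx(LSB/MSB) ~ {erx_lsb_msb:.3%}")
```

Under these assumptions r is a little under 0.8%, so each CSB read reference has a budget of roughly 0.26%. Errors are not necessarily spread evenly across codewords, so a page-level RBER below r does not strictly guarantee that every codeword decodes.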

TABLE II. COMPARISON OF READ RETRY TIMES.

| P/E cycles       | 1000    | 1000     | 1500    | 1500     |
|------------------|---------|----------|---------|----------|
| Method           | Default | Adjusted | Default | Adjusted |
| Read retry times | 50579   | 2        | 77492   | 2703     |
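The relative reduction implied by Table II can be worked out directly from the counts; the short sketch below does only that arithmetic, with the numbers copied from the table.

```python
# Read retry counts copied from Table II.
retry_counts = {
    1000: {"default": 50579, "adjusted": 2},
    1500: {"default": 77492, "adjusted": 2703},
}

# Fraction of read retries removed by the adjusted program parameters.
reduction = {
    pe: 1 - counts["adjusted"] / counts["default"] for pe, counts in retry_counts.items()
}

for pe, fraction in reduction.items():
    print(f"PE = {pe}: {fraction:.3%} fewer read retries")
```

The benefit is smaller at 1500 P/E cycles, where more of the default $V_{read}$ remain outside their valid windows even after adjustment.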

Figure 12 compares the max page bit errors of the adjusted and default program results at different P/E cycles. "Adjusted program" and "Default program" are measured just after the block is programmed; "Adjusted program after one year retention" and "Default program after one year retention" are measured after one-year retention. For the blocks with adjusted parameters, the max page bit error increases only slightly after one-year retention. Even for 1000 P/E cycles, the max page bit error is still within the error correcting capability of the ECC engine. For the blocks with default parameters, in contrast, the max page bit error increases significantly after one-year retention.

Due to limited resources, only eight blocks of one 3D NAND device were tested. For different batches of samples, or samples from different manufacturers, the program parameters may need to be adjusted to different values, and the effects of the adjustments could be slightly different.

Increasing VtCt means more time to program a WL. Figure 13 shows the program time with default parameters and adjusted parameters. The program time of each logical WL in a block is measured. In 3D NAND, every layer of a block is a physical WL, and every physical WL consists of several logical WLs, which are also called strings. Logical WLs in 3D NAND are similar to the WLs in 2D NAND. On average, the programming time with adjusted parameters is 2.42% longer than that with default parameters. For most applications using a multi-channel and multi-CE (Chip Enable) scheme, this increase in program time has little effect on write performance.

Another impact of the program tuning is flash wear. Based on previous work [13] [14], the wear of a flash cell is proportional to its threshold voltage. The maximum threshold voltage of flash cells (denoted as $V_p$ ) is used to calculate the effective wearing $w_e$. The relationship is $w_e = \lambda \times V_p$, where $\lambda$ is a constant that depends on the physical characteristics of flash cells. The effective wearing increases because the maximum threshold voltage $V_p$ is increased to adjust the valid window. In the experiments, $V_p$ is increased by less than 3%, and thus the effective wearing is increased by less than 3%.
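The last step follows because $\lambda$ cancels in the ratio. Writing $V_p'$ and $w_e'$ for the values after adjustment (primed symbols are notation introduced here for illustration):

$$\frac{w_e'}{w_e} = \frac{\lambda \times V_p'}{\lambda \times V_p} = \frac{V_p'}{V_p} < 1.03$$

So, under the linear wear model, the relative increase in effective wearing equals the relative increase in $V_p$, independent of the value of $\lambda$.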

## VII. CONCLUSION

This paper shows that the conventional metric RBER cannot accurately and stably measure the reliability of NAND flash memory, especially for 3D NAND. A new metric, valid window, is proposed to represent the reliability of NAND flash memory. Valid window is stable and accurate in measuring the reliability of NAND flash and has several useful features. Taking advantage of these features, we design a method to improve the quality of program operations of a 3D NAND flash. The frequency of read retry operations is reduced significantly.

Fig. 11. Vt distribution after one-year retention with program adjusted.

Fig. 12. RBER of default and adjusted before and after one-year retention.

Fig. 13. Program time comparison between default and adjusted parameters.

## REFERENCES

- [1] C.-H. Hung, M.-F. Chang, Y.-S. Yang, Y.-J. Kuo, T.-N. Lai, S.-J. Shen, J.-Y. Hsu, S.-N. Hung, H.-T. Lue, Y.-H. Shih *et al.*, "Layer-aware program-and-read schemes for 3d stackable vertical-gate be-sonos nand flash against cross-layer process variations," *IEEE Journal of Solid-State Circuits*, vol. 50, no. 6, pp. 1491–1501, 2015.
- [2] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, "Improving 3d nand flash memory lifetime by tolerating early retention loss and process variation," in *Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems*, 2018, pp. 106–106.
- [3] M. Huang, Z. Liu, L. Qiao, Y. Wang, and Z. Shao, "An endurance-aware metadata allocation strategy for mlc nand flash memory storage systems," *IEEE Trans. on CAD of Integrated Circuits and Systems*, vol. 35, no. 4, pp. 691–694, 2016.
- [4] Q. Xiong, F. Wu, Z. Lu, Y. Zhu, Y. Zhou, Y. Chu, C. Xie, and P. Huang, "Characterizing 3d floating gate nand flash: Observations, analyses, and implications," ACM Transactions on Storage (TOS), vol. 14, no. 2, p. 16, 2018.
- [5] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, "Error patterns in mlc nand flash memory: Measurement, characterization, and analysis," in *Proceedings of the Conference on Design, Automation and Test in Europe (DATE)*, 2012, pp. 521–526.
- [6] M. C. Yang, Y. H. Chang, C. W. Tsao, and P. C. Huang, "New ERA: new efficient reliability-aware wear leveling for endurance enhancement of flash storage devices," in ACM/EDAC/IEEE Design Automation Conference (DAC), 2013, pp. 1–6.
- [7] B. Peleato, R. Agarwal, J. M. Cioffi, M. Qin, and P. H. Siegel, "Adaptive read thresholds for nand flash," *IEEE Transactions on Communications*, vol. 63, no. 9, pp. 3069–3081, 2015.
- [8] Q. Li, M. Ye, Y. Cui, L. Shi, X. Li, and C. J. Xue, "Sentinel cells enabled fast read for NAND flash," in 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19), 2019.
- [9] S. Tanakamaru, C. Hung, A. Esumi, M. Ito, K. Li, and K. Takeuchi, "95%-lower-ber 43%-lower-power intelligent solid-state drive (ssd) with asymmetric coding and stripe pattern elimination algorithm," in 2011 IEEE International Solid-State Circuits Conference, 2011, pp. 204–206.
- [10] B. S. Kim, J. Choi, and S. L. Min, "Design tradeoffs for SSD reliability," in 17th USENIX Conference on File and Storage Technologies (FAST 19), Boston, MA, 2019, pp. 281–294.
- [11] Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, and O. Mutlu, "Data retention in mlc nand flash memory: Characterization, optimization, and recovery," in *High Performance Computer Architecture (HPCA)*, 2015 IEEE 21st International Symposium on, 2015, pp. 551–563.
- [12] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, "Heatwatch: Improving 3d nand flash memory device reliability by exploiting self-recovery and temperature awareness," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 504–517.
- [13] J. Jeong, S. S. Hahn, S. Lee, and J. Kim, "Lifetime improvement of NAND flash-based storage systems using dynamic program and erase scaling," in *Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14)*, 2014, pp. 61–74.
- [14] Q. Li, L. Shi, C. Gao, Y. Di, and C. J. Xue, "Access characteristic guided read and write regulation on flash based storage systems," *IEEE Transactions on Computers*, vol. 67, no. 12, pp. 1663–1676, 2018.