# ORIENT: <u>Organized Interleaved ECCs for New</u> STT-MRAM Caches

Zahra Azad\*, Hamed Farbeh<sup>†</sup>, and Amir Mahdi Hosseini Monazzah<sup>‡</sup>

\*<sup>‡</sup>Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
<sup>†</sup>School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Email: {\*zazad, <sup>‡</sup>ahosseini}@ce.sharif.edu, <sup>†</sup>farbeh@ipm.ir

Abstract—Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising alternative to SRAM in cache memories. However, STT-MRAM suffers from a high probability of write errors due to its stochastic switching behavior. To correct these write errors, the Error-Correcting Codes (ECCs) used in SRAM caches are conventionally employed. A cache line consists of several codewords, and the data bits of each codeword are selected so that the maximum correction capability is provided for the error patterns in SRAMs. However, the different write error patterns in STT-MRAM caches make conventional ECC configurations inefficient. In this paper, we first investigate the efficiency of ECC configurations and demonstrate that the vulnerability of codewords in a cache line varies by up to 17x. This variation means that, while some words are overprotected, others are highly likely to experience uncorrectable errors. Then, we propose an ECC bit selection scheme, called ORIENT, that reduces the vulnerability variation of codewords to 1.4x. The simulation results show that the conventional ECC configuration increases the write error rate by up to about 64.4% compared with the optimum ECC bit selection, whereas this value for **ORIENT** is only 4.5%.

Keywords—STT-MRAM caches; error-correcting codes; write errors; interleaving.

#### I. INTRODUCTION

SRAM memories can be disturbed by high-energy particle strikes and radioactive chip packaging material, which leads to soft errors. These soft errors can cause a Single Bit Upset (SBU) or Multiple Bit Upsets (MBU). With technology scaling, the probability that a particle strike causes an MBU has significantly increased. Error-Correcting Codes (ECCs) are commonly used to protect SRAM memories against soft errors. In latency-sensitive on-chip cache memories, the Single Error Correction-Double Error Detection (SEC-DED) code is the most common ECC used in commercial processors [1], [2], [3], [4].

For a cache line consisting of N words, e.g., a 512-bit line that includes eight 64-bit words, N SEC-DED codes are employed. Each SEC-DED code is generated from either all bits of a single data word or the bits of all words interleaved with an interleaving distance of N. The former is conventionally used in L1 caches to save energy and time [4], [5], and the latter is conventionally used in L2/L3 caches to provide MBU correction capability [5], [6].

In addition to high susceptibility to soft errors, high leakage power is another challenge for scaling SRAMs in today's nanoscale technologies. Recent developments in memory technology have introduced several emerging memories for replacing SRAMs. Among them, *Spin-Transfer Torque Magnetic Random Access Memory* (STT-MRAM) is the most promising alternative to SRAMs in on-chip caches [2], [7], [8], [9], [10]. In STT-MRAM cells, data is stored as the resistance state of a *Magnetic Tunneling Junction* (MTJ) device. By applying a spin-polarized current, the resistance state of a cell can be set to high or low. Thermal fluctuations in the magnetization process of a write operation cause uncertainty in the MTJ switching time. If the actual MTJ switching time is longer than the applied write pulse width, a write error occurs [11].

To overcome write errors in STT-MRAM caches, ECCs are commonly used [2], [7], [8], [9], [12], [13], [14]. However, the conventional bit selection schemes used for SRAMs, such as per-word and interleaved [2], are designed to be efficient for soft error correction in SRAMs and are not customized for the characteristics of STT-MRAM write errors. During a write operation in an STT-MRAM cache line, the probability of a write error increases with the number of bits that need to be switched. By applying the SEC-DED code at word granularity in cache lines, each 8-byte subset of the cache line bits is separately grouped and protected by a SEC-DED code as a logical word. Because a more uniform distribution of bit switching between logical words leads to a lower write error rate, making this distribution more uniform is an effective way to reduce the cache error rate. Therefore, the ability of a bit selection scheme to distribute bit switching between logical words significantly affects the write error rate.

In this paper, through evaluations of SPEC CPU2006 benchmarks [15], we investigate the efficiency of the conventional bit selection schemes used for the SEC-DED code in distributing bit switching between logical words. The results show that applying the per-word and interleaved schemes to cache lines leads to a highly non-uniform bit switching distribution, which makes these schemes inefficient at reducing the STT-MRAM write error rate. Then, we propose a novel bit selection scheme, called <u>Organized</u> <u>Interleaved ECCs for New STT-MRAM Caches</u> (ORIENT), which provides a near-optimum bit switching distribution and effectively reduces the write error rate of cache blocks during write operations.

The rest of this paper is organized as follows. Motivations for this study are described in Section II. Section III explains the proposed ORIENT. Section IV provides the experimental setup and discusses the simulation results. Finally, conclusions are given in Section V.

#### II. MOTIVATION

The write error rate of a cache line protected by one SEC-DED code per logical word is calculated according to (1) [5].

$$BER = 1 - \prod_{i=1}^{\#part} R_i \tag{1}$$

Here, #part is the number of logical words in a cache line and  $R_i$  is the reliability of the  $i^{th}$  logical word. Note that BER on the left-hand side of (1) is the error rate of the whole cache block, whereas BER in (2) is the bit error rate. Considering a logical word that can correct up to t bit errors, the reliability of a write operation in logical word i, i.e.,  $R_i$ , is estimated according to (2) [12].

$$R(bit_{flips},t) = \sum_{i=0}^{t} C_{bit_{flips}}^{i}\, BER^{i} (1 - BER)^{bit_{flips}-i} \tag{2}$$

Here, BER (Bit Error Rate) is the probability of an unsuccessful bit switching, t is the correction capability of the ECC, which is one for the SEC-DED code,  $bit_{flips}$  denotes the number of bits that must switch in the write operation of that logical word, and  $C_{bit_{flips}}^i$  is the number of ways to choose i bits out of  $bit_{flips}$  bits.

Assuming that the total number of bit flips for a write operation in a cache line is fixed, the highest reliability for that write operation is achieved when all of the  $R_i$  values are the same. In other words, reliability is highest for a write operation in which the total bit switching is uniformly distributed between all logical words. Therefore, to equalize the  $R_i$  values, the number of bit flips in the logical words should be uniform; the more uniform the  $R_i$  values, the higher the reliability of the write operation. However, none of the existing bit selection schemes considers this requirement of STT-MRAM write operations.
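As a minimal sketch of this argument, (1) and (2) can be evaluated for two hypothetical writes that switch the same total number of bits. The function names, the bit error rate of 1e-3, and the two switching distributions are assumptions chosen only for illustration; they are not values from the evaluation.

```python
from math import comb, prod

def word_reliability(bit_flips, ber, t=1):
    """Eq. (2): probability that at most t of the switching bits fail."""
    return sum(
        comb(bit_flips, i) * ber**i * (1.0 - ber) ** (bit_flips - i)
        for i in range(t + 1)
    )

def block_error_rate(flips_per_word, ber, t=1):
    """Eq. (1): one minus the product of the logical-word reliabilities."""
    return 1.0 - prod(word_reliability(f, ber, t) for f in flips_per_word)

ASSUMED_BER = 1e-3             # assumed bit error rate, illustration only
uniform = [16] * 8             # 128 switching bits spread evenly
skewed = [2] * 7 + [114]       # the same 128 switching bits, concentrated

print("uniform:", block_error_rate(uniform, ASSUMED_BER))
print("skewed: ", block_error_rate(skewed, ASSUMED_BER))
```

Under these assumptions the skewed write yields the higher block error rate, because the single heavily switched logical word dominates the product in (1).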

We conduct a set of simulations to explore the effects of the bit switching distribution on the reliability of write operations. To this end, a 4-MByte L2 cache with an associativity of eight and a 512-bit line width is simulated in the gem5 simulator [16]. We use different combinations of SPEC CPU2006 benchmarks [15] as multi-programmed workloads.

Both per-word (without interleaving) and conventional 8-way interleaved bit selection schemes are used in cache lines protected by SEC-DED(72,64). We count the number of bit flips in the different logical words for each write operation. Then, we sort these numbers and normalize them to the lowest one. Finally, these sorted normalized numbers are accumulated over all write operations, and their averages are shown in Fig. 1. Greater non-uniformity among the bars of Fig. 1 indicates greater non-uniformity in the number of bit flips among the logical words of a cache line.
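The per-write measurement can be sketched as follows. This is an illustrative reconstruction, not the instrumentation used in gem5: the function names, the representation of a cache line as a Python integer, the bit numbering (bit p of the line belongs to data word p // 64), and the choice to skip writes whose lowest count is zero are all assumptions.

```python
LINE_BITS = 512
N_WORDS = 8
WORD_BITS = LINE_BITS // N_WORDS

def per_word_group(p):
    """Logical word of line bit p when each data word is its own codeword."""
    return p // WORD_BITS

def interleaved_group(p):
    """Logical word of line bit p with an interleaving distance of N_WORDS."""
    return p % N_WORDS

def flips_per_logical_word(old, new, group):
    """Count the bits that must switch in each logical word for one write."""
    diff = old ^ new
    counts = [0] * N_WORDS
    for p in range(LINE_BITS):
        if (diff >> p) & 1:
            counts[group(p)] += 1
    return counts

def sorted_normalized(counts):
    """Sort ascending and normalize to the lowest count."""
    ordered = sorted(counts)
    if ordered[0] == 0:
        return None  # assumption: writes with an idle logical word are skipped
    return [c / ordered[0] for c in ordered]

# Hypothetical write: all of word 0 switches, plus one bit in each other word.
old_line = 0
new_line = ((1 << 64) - 1) | sum(1 << (64 * w + w) for w in range(1, 8))

for name, group in (("per-word", per_word_group), ("interleaved", interleaved_group)):
    counts = flips_per_logical_word(old_line, new_line, group)
    print(name, counts, sorted_normalized(counts))
```

Averaging the sorted, normalized lists over all writes of a workload would give one group of bars in the style of Fig. 1.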

As can be seen, in some workloads, applying the per-word bit selection scheme to the SEC-DED code leads to high levels of non-uniformity in the bit switching distribution between the logical words. Considering Comb4-Comb6 in Fig. 1(a), the logical words with the highest number of bit flips experience about 14.3x more switching than the logical words with the lowest number. This value is 4.3x on average. As shown in Fig. 1(b), the diversity in the number of bit flips for 8-way interleaved bit selection is 4.2x on average. In the worst case, the number of bit flips in some words is 16.3x higher than in other words.

Given the same amount of protection overhead, this non-uniformity leads to a lower reliability level for write operations in STT-MRAM caches compared with the maximum achievable reliability level. Therefore, a new bit selection scheme is needed that exploits the protection resources more efficiently to achieve higher reliability. This is done by uniformly distributing the bit switching of each write operation between the logical words of a cache line. To propose the new scheme, first the reasons behind the inefficiency of the conventional bit selection schemes are investigated in the following subsection.



Fig. 1: Bit flip distribution among logical words in (a) per-word and (b) 8-way interleaved schemes.



Fig. 2: Bit selection schemes structures.

#### III. PROPOSED ORIENT SCHEME

According to the previous discussion, when the SEC-DED code is used at word granularity, the optimum cache line write reliability is achieved when the bit switching distribution is uniform between the protected words (logical words) of that cache line. However, neither the per-word nor the interleaved bit selection scheme is able to make the bit switching distribution uniform and efficiently decrease the write error rate. Fig. 2 depicts the structures of the per-word, interleaved, and proposed *ORIENT* (<u>Organized Interleaved ECCs for New STT-MRAM</u> Caches) bit selection schemes for a simple 16-bit cache line with 4-bit words.

As can be seen in Fig. 2 (a), in the per-word scheme, the adjacent bits of a logical word are exactly the adjacent bits of a data word, and each logical word is protected by SEC-DED. For example, bits 0-3 (word0) form the first logical word, bits 4-7 (word1) form the second logical word, and so on. In the 4-way interleaved SEC-DED depicted in Fig. 2 (b), the first bit of the first logical word is bit-0 of the first data word, the second bit is bit-0 of the second data word, the third bit is bit-0 of the third data word, and the fourth bit is bit-0 of the fourth data word. Next, the first bit of the second logical word is bit-1 of the first data word. This pattern is repeated throughout the cache line. In this way, the bit-0s of all data words are selected for the first logical word, the bit-1s of all data words are selected for the second logical word, and so forth. In this bit selection scheme, the same bit positions of all data words in a cache line (the same bit position in the bytes of all words for a 512-bit cache line using SEC-DED(72, 64)) are selected for a logical word.
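The two conventional selections for the 16-bit example can be written down as small mapping functions. This is only a sketch of the description above; the function names and the (word, bit) tuple representation are assumptions made for illustration.

```python
N_WORDS = 4    # data words in the 16-bit example line
WORD_BITS = 4  # bits per data word

def per_word_members(k):
    """(word, bit) pairs grouped into logical word k by the per-word scheme."""
    return [(k, b) for b in range(WORD_BITS)]

def interleaved_members(k):
    """(word, bit) pairs grouped into logical word k by 4-way interleaving."""
    return [(w, k) for w in range(N_WORDS)]

for k in range(N_WORDS):
    print("logical word", k)
    print("  per-word:   ", per_word_members(k))
    print("  interleaved:", interleaved_members(k))
```

In the per-word case every member of a logical word shares the same data word, while in the interleaved case every member shares the same bit position.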



Fig. 3: Bit selection schemes in a 64B cache line.

In ORIENT, an interleaving scheme is proposed to overcome the mentioned limitation of the per-word scheme. On the other hand, the conventional N-way interleaving scheme is not suitable for most workloads, especially the first category, because all bits in a codeword are selected from the same position of the words. In ORIENT, the selected bits come from all words and from all positions in the words. ORIENT thus eliminates the drawbacks of both the per-word and the interleaving schemes while retaining the advantages of both.

In conventional interleaved SEC-DED, the same bit positions of all data words in a cache line are grouped to form a logical word. In ORIENT, however, different bit positions of the data words are grouped as a logical word. In other words, the bit positions selected for the first logical word are the same within the first word (bit-0 of all bytes in the word), but they change in the second word (bit-1 of all bytes in the word), change again in the third word (bit-2 of all bytes in the word), and so forth. In this way, a logical word includes all eight different bit positions of the bytes in a cache line. As a result, if a specific bit position always has a high or a low amount of bit switching in all words, the bits at that position are spread over different logical words. This selection leads to a more uniform bit switching distribution among all logical words and better STT-MRAM write reliability.

In Fig. 2 (c), ORIENT scheme for a 4-way interleaved SEC-DED is shown. To form the first logical word, ORIENT selects bit-0 of word-0, bit-1 of word-1, bit-2 of word-2, and bit-3 of word-3 instead of just selecting bit-0s of all words. Then, bit-0 of word-3, bit-1 of word-0, bit-2 of word-1, and bit-3 of word-2 are selected for the second logical word instead of just grouping bit-1s of all words.
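The two logical words listed above are consistent with a simple rotation: logical word k takes bit b from data word (b - k) mod N. The sketch below encodes that rule for the 16-bit example. The rule is inferred from the two listed logical words, and its extension to the remaining logical words, like the function name, is an assumption.

```python
N = 4  # data words per line and bits per word in the 16-bit example

def orient_members(k):
    """(word, bit) pairs of logical word k: bit b comes from word (b - k) mod N."""
    return [((b - k) % N, b) for b in range(N)]

for k in range(N):
    print("logical word", k, orient_members(k))

# Every (word, bit) pair should be covered exactly once by the four logical words.
covered = sorted(pair for k in range(N) for pair in orient_members(k))
print(covered == [(w, b) for w in range(N) for b in range(N)])
```

Each logical word built this way contains one bit from every data word and one bit of every position, which is the property the text attributes to ORIENT.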

Using ORIENT, every logical word of a cache line includes all bit positions 0-3, each taken from a different data word. Accordingly, any repetitive pattern in the bit switching distribution, which leads to non-uniformity in the number of bit flips of logical words under the conventional interleaved scheme, can be avoided by ORIENT. The structures of the per-word, interleaved, and ORIENT schemes for a 64-byte cache line are depicted in Fig. 3 (a), (b), and (c), respectively. In these structures, each color represents the bits of a logical word, each plane block shows a 64-bit word (each row includes eight bits), and eight 64-bit words form a cache line. As Fig. 3 shows, interleaved SEC-DED selects the same bit positions of all words (blocks) to form a logical word, while in ORIENT the bit positions selected to form a logical word vary from word to word.
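For the 64-byte line, the difference between the two interleavings can be illustrated with a hypothetical write in which only bit-0 of every byte switches. The sketch assumes that bit p of the line lies in data word p // 64 at byte bit position p % 8, and that ORIENT rotates the selected byte bit position by one per data word; the names and the test pattern are illustrative assumptions.

```python
LINE_BITS, N_WORDS, BYTE_BITS = 512, 8, 8
WORD_BITS = LINE_BITS // N_WORDS

def interleaved_group(p):
    """Conventional interleaving: same bit position of every byte."""
    return p % BYTE_BITS

def orient_group(p):
    """ORIENT (assumed rotation): the byte bit position shifts with the data word."""
    word = p // WORD_BITS
    byte_bit = p % BYTE_BITS
    return (byte_bit - word) % N_WORDS

def flips_per_group(diff, group):
    """Count switching bits per logical word; diff marks the bits that switch."""
    counts = [0] * N_WORDS
    for p in range(LINE_BITS):
        if (diff >> p) & 1:
            counts[group(p)] += 1
    return counts

# Hypothetical write in which only bit-0 of every byte switches.
diff = sum(1 << p for p in range(0, LINE_BITS, BYTE_BITS))

print("interleaved:", flips_per_group(diff, interleaved_group))
print("ORIENT:     ", flips_per_group(diff, orient_group))
```

With this pattern, conventional interleaving places all 64 switching bits in one logical word, whereas the rotated selection spreads them evenly over the eight logical words.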

#### IV. SYSTEM SETUP AND RESULTS

To evaluate the efficiency of ORIENT, we use the gem5 cycle-accurate simulator [16]. A detailed model of an ARM processor operating at a frequency of 1 GHz is used in this study. We simulate a homogeneous quad-core CMP with out-of-order, four-issue superscalar cores. Each core has its own L1 instruction and data caches, and the L2 cache is shared among all cores. The details of the cache configuration are summarized in Table I.

TABLE I: Configuration of on-chip caches.

| Memory Unit | Configuration                                                   |
|-------------|-----------------------------------------------------------------|
| L1          | 32+32KB I/D, 64B line, 4-way, write/read: 2 cycles, SRAM        |
| L2          | 4MB, 8-way, 64B line, write: 20 cycles, read: 5 cycles, STT-RAM |

The SPEC CPU2006 benchmark suite is used as the workload [15]. Eighteen combinations of different benchmarks are used as multi-programmed workloads, referred to as *comb1-comb18*. To improve the accuracy of the experiments, all results are collected after skipping the cache warm-up phase. To protect the 64-byte L2 cache lines, we integrate ORIENT into the 8-way interleaved SEC-DED.

To compare how well ORIENT provides a uniform bit switching distribution between logical words, the conventional interleaved and per-word schemes are also considered in our evaluations. Using ORIENT, the numbers of bit flips of the logical words in each write operation are counted, sorted, and normalized to the lowest one. Then, these sorted normalized numbers are accumulated over all write operations, and their averages are shown in Fig. 4. The same evaluations for per-word and interleaved SEC-DED are shown in Fig. 1. In addition, for each workload, the words with the highest amount of bit switching in Fig. 4 and Fig. 1 are compared with each other in Fig. 5.

As can be seen in Fig. 4 and Fig. 5, ORIENT makes the bit switching distribution between logical words nearly uniform in all workloads. In contrast, the per-word and conventional interleaved schemes lead to high levels of non-uniformity in some workloads. For example, in some workloads, the per-word and interleaved schemes cause about 14x and 16x more bit switching, respectively, in the logical word with the highest amount of bit switching than in the one with the lowest. In the worst case, this number is 1.6x for ORIENT. This level of uniformity is due to the efficiency of ORIENT in distributing bit positions with the same amount of switching between different logical words.

To better show how much more uniformly ORIENT distributes bit switching among logical words than the per-word and conventional interleaving schemes, Fig. 6 shows, for each workload, the variance of the number of bit flips over all logical words (from the word with the highest number of bit flips to the word with the lowest). Since variance measures how far a set of numbers is spread out from its average value, it is a representative metric for the efficiency of a bit selection scheme. As can be seen in Fig. 6, the variance can be as high as 18 and 5.5 for the per-word and interleaved schemes, respectively, whereas it is less than 0.04 in all workloads using ORIENT. On average, the variance is 3.22, 1.22, and 0.02 for per-word, interleaved, and ORIENT, respectively. The results show that ORIENT is well suited to distributing bit switching uniformly between all logical words.
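The variance metric can be sketched in a few lines. The text does not state whether the population or the sample variance is used, so the population variance is an assumption here, and the two input lists are made-up normalized counts for illustration, not values read from the figures.

```python
from statistics import pvariance

def spread(normalized_counts):
    """Population variance of the per-logical-word normalized switching counts."""
    return pvariance(normalized_counts)

# Made-up examples of eight sorted, normalized counts (not measured data).
nearly_uniform = [1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4]
strongly_skewed = [1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 9.0, 14.0]

print("nearly uniform: ", spread(nearly_uniform))
print("strongly skewed:", spread(strongly_skewed))
```

A perfectly uniform distribution gives a variance of zero, so smaller values indicate a bit selection closer to the optimum.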

Fig. 4: Bit flip distribution among logical words in ORIENT.

Fig. 5: Words with the highest amount of bit flips in all schemes.

To better demonstrate the efficiency of ORIENT, we consider the optimum bit selection, in which the total number of bit flips is equally distributed between all logical words. Then, we calculate the increase in the write error rate of cache blocks for the three evaluated SEC-DED configurations relative to this optimum. Fig. 7 shows the increase in the error rate using per-word, interleaved, and ORIENT. As can be seen, using the conventional interleaving scheme in SEC-DED results in an up to 64.39% higher error rate compared to the optimum scheme. In addition, the per-word scheme causes an up to 167.82% higher block error rate than the optimum scheme. However, the maximum increase in the cache block write error rate using ORIENT is only 5.2%. The average increase in the block error rate for the per-word and interleaved schemes is 38.5% and 26.48%, respectively, whereas it is only 4.5% for ORIENT. This near-optimum reliability of ORIENT is the result of its highly uniform bit switching distribution, depicted in Fig. 4. Furthermore, according to Fig. 6, the lower variance in the number of bit flips in ORIENT in comparison with the per-word and conventional interleaved schemes confirms the results of Fig. 7.
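The comparison against the optimum scheme can be sketched for a single write as follows. The code reuses (1) and (2) and makes several assumptions that the text does not specify: the optimum split distributes the switching bits as evenly as whole numbers allow, the increase is expressed relative to the optimum error rate, and the bit error rate and the example distribution are arbitrary illustrative values.

```python
from math import comb, prod

def word_reliability(bit_flips, ber, t=1):
    """Eq. (2): probability that at most t of the switching bits fail."""
    return sum(
        comb(bit_flips, i) * ber**i * (1.0 - ber) ** (bit_flips - i)
        for i in range(t + 1)
    )

def block_error_rate(flips_per_word, ber, t=1):
    """Eq. (1): one minus the product of the logical-word reliabilities."""
    return 1.0 - prod(word_reliability(f, ber, t) for f in flips_per_word)

def optimum_split(total, n_words=8):
    """Spread the switching bits as evenly as whole numbers allow (assumption)."""
    base, extra = divmod(total, n_words)
    return [base + 1] * extra + [base] * (n_words - extra)

def increase_percent(flips_per_word, ber):
    """Relative increase of the block error rate over the optimum split."""
    total, n_words = sum(flips_per_word), len(flips_per_word)
    if total <= n_words:
        raise ValueError("optimum split has no uncorrectable error to compare with")
    best = block_error_rate(optimum_split(total, n_words), ber)
    return 100.0 * (block_error_rate(flips_per_word, ber) - best) / best

ASSUMED_BER = 1e-3                 # assumed bit error rate, illustration only
example = [64, 1, 1, 1, 1, 1, 1, 1]  # hypothetical per-word style write
print(increase_percent(example, ASSUMED_BER))
```

Averaging this quantity over all writes of a workload, once per bit selection scheme, would give bars in the style of Fig. 7.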

#### V. CONCLUSION

In this work, we proposed ORIENT, a new bit selection scheme for ECCs that efficiently reduces the write error rate in STT-MRAM cache lines. ORIENT is based on the fact that a more uniform distribution of the total bit switching between all logical words of a cache line leads to a lower write error rate. By considering the different bit switching distribution patterns in real-world workloads and investigating the weaknesses of the conventional bit selection schemes in evenly distributing bit switching between logical words, ORIENT provides a near-optimum bit selection. As a result, ORIENT decreases the write error rate by up to 160.5% and 59.9% compared to the per-word and conventional N-way interleaved bit selection schemes, respectively. Furthermore, its near-optimum block error rate (about 4.5% higher than the optimum scheme, on average) makes it the most suitable candidate for STT-MRAMs.

Fig. 6: Variance of the bit flips of logical words in all schemes.

Fig. 7: Percentage increase in the block write error rate compared to the optimum scheme.

#### REFERENCES

- [1] J. Hong, J. Kim, and S. Kim, "Exploiting same tag bits to improve the reliability of the cache memories," *IEEE Trans. Very Large Scale Integr. Syst.*, vol. 23, no. 2, pp. 254–265, 2015.
- [2] H. Farbeh, K. Hyeonggyu, S. G. Miremadi, and S. Kim, "Floating-ECC: Dynamic repositioning of error correcting code bits for extending the lifetime of STT-RAM caches," *IEEE Trans. Comput.*, vol. 65, no. 12, pp. 3661–3675, 2016.
- [3] A. Neale and M. Sachdev, "A new SEC-DED error correction code subclass for adjacent MBU tolerance in embedded memory," *IEEE Trans. Device Mater. Rel.*, vol. 13, no. 1, pp. 223–230, 2013.
- [4] M. Manoochehri, M. Annavaram, and M. Dubois, "Extremely low cost error protection with correctable parity protected cache," *IEEE Trans. Comput.*, vol. 63, no. 10, pp. 2431–2444, 2014.
- [5] H. Farbeh and S. G. Miremadi, "PSP-cache: A low-cost fault-tolerant cache memory architecture," in *Proc. Conf. Des., Autom. & Test in Eur.*, 2014, pp. 1–4.
- [6] J. Hong and S. Kim, "Smart ECC allocation cache utilizing cache data space," *IEEE Trans. Comput.*, vol. xx, no. 99, pp. 1–8, 2016.
- [7] Z. Azad, H. Farbeh, A. M. H. Monazzah, and S. G. Miremadi, "An efficient protection technique for last level STT-RAM caches in multicore processors," *IEEE Trans. Parallel Distrib. Syst.*, vol. 28, no. 6, pp. 1564–1577, 2017.
- [8] W. Wen, M. Mao, X. Zhu, S. H. Kang, D. Wang, and Y. Chen, "CD-ECC: Content-dependent error correction codes for combating asymmetric nonvolatile memory operation errors," in *Proc. Int. Conf. Comput.-Aided Des.*, 2013, pp. 1–8.
- [9] X. Wang, M. Mao, E. Eken, W. Wen, H. Li, and Y. Chen, "Sliding Basket: An adaptive ECC scheme for runtime write failure suppression of STT-RAM cache," in *Proc. Conf. Des., Autom. & Test in Eur.*, 2016, pp. 762–767.
- [10] A. M. H. Monazzah, H. Farbeh, and S. G. Miremadi, "LER: Least-error-rate replacement algorithm for emerging STT-RAM caches," *IEEE Trans. Device Mater. Rel.*, vol. 16, no. 2, pp. 220–226, 2016.
- [11] E. Cheshmikhani, A. M. Hosseini Monazzah, H. Farbeh, and S. G. Miremadi, "Investigating the effects of process variations and system workloads on reliability of STT-RAM caches," in *Proc. Eur. Depend. Comput. Conf.*, 2016, pp. 120–129.
- [12] J. Ahn, S. Yoo, and K. Choi, "Selectively protecting error-correcting code for area-efficient and reliable STT-RAM caches," in *Proc. Asia South Pacific Des. Autom. Conf.*, 2013, pp. 285–290.
- [13] Z. Azad, H. Farbeh, A. M. H. Monazzah, and S. G. Miremadi, "AWARE: Adaptive way allocation for reconfigurable ECCs to protect write errors in STT-RAM caches," *IEEE Trans. Emerging Topics Comput.*, vol. PP, no. 99, pp. 1–12, 2017.
- [14] H. Sun, C. Liu, N. Zheng, T. Min, and T. Zhang, "Design techniques to improve the device write margin for MRAM-based cache memory," in *Proc. Great Lakes Symp. VLSI*, 2011, pp. 97–102.
- [15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," *ACM SIGARCH Comput. Archit. News*, vol. 34, no. 4, pp. 1–17, 2006.
- [16] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," *ACM SIGARCH Comput. Archit. News*, vol. 39, no. 2, pp. 1–7, 2011.