# High-Speed and Energy-Efficient Single-Port Content Addressable Memory to Achieve Dual-Port Operation

Honglan Zhan, Chenxi Wang, Hongwei Cui, Xianhua Liu, Feng Liu and Xu Cheng School of Computer Science, Peking University, Beijing, China {laan.z@stu.pku.edu.cn, lxh@mprc.pku.edu.cn, chengxu@mprc.pku.edu.cn}

Abstract—High-speed and energy-efficient multi-port content addressable memory (CAM) is very important to modern superscalar processors. In order to overcome the disadvantages of multi-port CAM and improve the performance of searching stage, a high-speed and energy-efficient single-port (SP) CAM is introduced to achieve dual-port (DP) operation. For different bit cell topologies - the traditional 9T CAM cell and 6T SRAM cell, two novel peripheral schemes - CShare and VClamp are proposed. The proposed schemes are verified using all possible corners, a wide range of temperature and detailed Monte-Carlo variation analysis. With 65-nm process and 1.2 V supply, the search delay of CShare and VClamp is 0.55 ns and 0.6 ns, respectively, a reduction of approximately 87% compared to the state-of-the-art works. In addition, compared with the recently proposed 10T BCAM, CShare and VClamp can provide 84.9% and 85.1% energy reduction in the TT corner, respectively. Experimental results in an 8 Kb CAM at 1.2 V supply and across different corners show that the energy efficiency is improved by 45.56% (CShare) and 45.64% (VClamp) on average in comparison with DP CAM.

Keywords—Superscalar processors, content-addressable memory (CAM), dual-port

#### I. INTRODUCTION

In order to exploit instruction-level parallelism, many microarchitecture components greatly rely on the complex content addressable memory (CAM) with multiple read ports [1–5], such as register renaming circuits [3], [4], instruction scheduler [1], [2], load/store queue [2], [5], and Translation Look-aside Buffer (TLB) [6].

With the development of modern processors, the CAM-ports expand with the increase of processor width, which can be observed from the configurable number of the CAM-ports in a recent popular open-source processor [5]. However, it is a great challenge to realize high-speed and energy-efficient parallel CAM operations. As shown in Fig. 1, the conventional dual-port (DP) CAM is implemented by using two instances of a typical CAM-port. The increase of read ports further increases the area and power consumption [7], [8], which is especially costly for modern high-performance processors [9].

Conventional DP CAM requires a precharge (P) cycle to precharge all match lines (MLs) in port A and port B, which occupies the low voltage stage of CLK. After the precharge, ports A and B perform search (S) operations in the high voltage stage of CLK. Several previous studies support DP CAM or precharge-free CAM by adding more transistors to 6T SRAM cell or 9T CAM cell. However, it causes extra area and power consumption. Li [10] modified the 4+2T single-port (SP) BCAM cell and presented a DP 8T CAM-based network intrusion detection engine for internet of things (IoT). The work presented in [11] is a self-controlled precharge-free 10T BCAM based on static storage, while it has short circuit current paths



Fig. 1. Comparison between conventional dual-port CAM and the proposed single-port CAM.

due to the diode structure formed during ML evaluation [12], [13]. The work in [12] is also a precharge-free 10T BCAM, but its operating frequency is only 50 MHz in 45-nm CMOS technology. Mahendra et al. [13] presented a precharge-free 14T TCAM cell; however, its operating frequency is much lower owing to the multiple cells connected in series.

In this work, instead of adding more transistors or read ports to 9T CAM cell or 6T SRAM cell, two different peripheral schemes are proposed – the charge-share (CShare) scheme and the voltage-clamp (VClamp) scheme for the two different bitcell topologies. Both CShare and VClamp can act as high-speed and energy-efficient SP CAM. CShare / VClamp with a timing generator is further introduced to fully utilize the high and low voltage stages of CLK. In this way, two search operations can be performed in one CLK to achieve DP operation.

The remainder of this paper is organized as follows. Section II describes the CShare scheme with 9T CAM cell. The VClamp scheme with 6T SRAM cell is elaborated in Section III. The performance and experimental results with 65-nm CMOS technology and 16-nm FinFET technology are presented in Section IV. Finally, conclusions are drawn in Section V.

#### II. CHARGE-SHARE SCHEME WITH 9T CAM CELL

#### A. Charge-Share scheme

As shown in Fig. 2, the proposed CShare scheme consists of 9T CAM cells and a peripheral circuit with simple timing control. Consider a match, where SL = 1, SLB = 0, Q = 1 and QB = 0: node B is logic "0". When SL = 1, SLB = 0, Q = 0 and QB = 1, node B is logic "1", indicating a mismatch. Suppose that the first cell stores the least significant bit (LSB). In the LSB cell, a PMOS replaces the NMOS at M1; the benefits of this change are explained in Section IV-A. Unlike the serial connection of other 9T CAMs, all cells except the LSB are connected in parallel with the ML in the CShare scheme.

A timing generator, inspired by the double-pumping clock generator in [14], is designed to generate a suitable control signal "EN". Fig. 2 gives the schematic of the timing generator and its timing chart. The high voltage stage of CLK is taken as an example to explain the working principle of CShare. When "EN" is "0", the capacitor (C) is charged to $V_{\rm C}$ if a match occurs in the LSB. When "EN" then turns to "1", the "charge" path is cut off and the "share" path is turned on. If SL1 to SLn match the corresponding cells, the charge on C is shared with the ML and the ML voltage becomes $(C \times V_C)/(C_{ML}+C)$. Otherwise, C discharges to ground through the mismatched cells. An asymmetric buffer with a skewed transistor N1, which has higher driving strength, is connected to the ML as the sensing output. The output is logic "1" if the ML holds the shared voltage; otherwise, it is logic "0". The output of the skewed buffer thus reflects the search result. Since the timing generator double-pumps "EN", the operations in the low and high voltage stages of CLK are the same. Therefore, two search operations can be performed in one CLK cycle, which achieves DP operation.
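The charge-sharing expression above can be evaluated numerically. The following is a minimal sketch, assuming the ML is fully discharged before the "share" path turns on (which is what the expression implies); the capacitance and voltage values are hypothetical and are not taken from the paper.

```python
def shared_ml_voltage(c: float, c_ml: float, v_c: float) -> float:
    """ML voltage after the charge on C is shared with the match line.

    Implements V_ML = (C * V_C) / (C_ML + C), assuming the ML starts at 0 V.
    """
    if c <= 0 or c_ml < 0:
        raise ValueError("c must be positive and c_ml must be non-negative")
    return (c * v_c) / (c_ml + c)


# Hypothetical values, for illustration only.
C = 20e-15      # storage capacitor C, in farads (assumed)
C_ML = 20e-15   # match-line capacitance C_ML, in farads (assumed)
V_C = 1.2       # voltage on C after the charge phase, in volts (assumed)

v_ml = shared_ml_voltage(C, C_ML, V_C)
print(f"Shared ML voltage: {v_ml:.3f} V")
```

With equal capacitances the ML settles at half of $V_{\rm C}$, which illustrates why the sensing buffer is skewed: it has to resolve a shared voltage that is well below the full supply.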

#### B. Search example

For clarity, a search example on a simplified  $3 \times 4$  CAM array is presented in Fig. 3. If the target data are "0011", SL0 / SL1 / SL2 / SL3 = 0 / 0 / 1 / 1, and conversely, SLB0 / SLB1 / SLB2 / SLB3 = 1 / 1 / 0 / 0. Due to the mismatch of the LSB in the first row, both  $V_{\rm C}$  and the voltage of ML0 are 0, so the buffer outputs "0". The second row also mismatches, but the mismatch is not caused by the LSB. Therefore, C in the second row is charged to  $V_{\rm C}$  and then discharges to ground through the third cell, and the buffer outputs "0". The third row is in a match state: C is charged to  $V_{\rm C}$  and then shared with ML2. The buffer in the third row outputs "1", indicating a match.
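The two-phase behavior in this example can be mimicked with a purely logical model. The sketch below is a behavioral abstraction, not a circuit simulation; the stored words are hypothetical values chosen to reproduce the three cases described above (the actual contents of Fig. 3 are not given in the text), and index 0 is taken to be the LSB cell.

```python
def cshare_row_search(stored: str, key: str) -> int:
    """Behavioral model of one CShare row: 1 = match, 0 = mismatch.

    Index 0 is treated as the LSB cell; the remaining cells sit in
    parallel on the match line.
    """
    if len(stored) != len(key):
        raise ValueError("stored word and search key must have equal length")

    # Phase 1 (EN = 0): C is charged only if the LSB cell matches.
    c_charged = stored[0] == key[0]
    if not c_charged:
        return 0  # nothing to share, the buffer outputs "0"

    # Phase 2 (EN = 1): C is shared with the ML; any mismatched
    # parallel cell discharges it to ground.
    others_match = all(s == k for s, k in zip(stored[1:], key[1:]))
    return 1 if others_match else 0


# Hypothetical stored words: LSB mismatch, third-cell mismatch, full match.
rows = ["1011", "0001", "0011"]
key = "0011"  # SL0 / SL1 / SL2 / SL3 = 0 / 0 / 1 / 1

outputs = [cshare_row_search(row, key) for row in rows]
print(outputs)
```

Only the third row returns 1. The first row never charges C, which is the case that the LSB mismatch-rate analysis later relies on.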

#### III. VOLTAGE-CLAMP SCHEME WITH 6T CAM

Recently, a configurable CAM using 6T bit-cells with split word lines (WLs) was proposed in [15]. The combination of split WL (WLA and WLB) and cell data (Q and QB) provides an XOR function for pattern search, thus considerable reduction of CAM area could be achieved by removing the built-in XNOR circuit in 9T CAM cell. The proposed VClamp scheme with 6T SRAM cell is elaborated in this section.

Unlike the 9T cell, the 6T cell has a coupled read/write path, so the bit-line (BL) voltage must be kept at a high level to avoid data destruction. To avoid precharging the BL during the low voltage stage of CLK in each cycle, we propose VClamp to clamp the BL voltage. The proposed VClamp scheme with a skewed sense amplifier (SA) is shown on the left side of Fig. 4. P1 and P2 form a voltage clamper to clamp the BL voltage, and P3 and P4 mirror the current from VDD-P1-P2 to the load capacitance (Cp). The ratio of the current mirror (consisting of P1, P2, P3 and P4) is set to approximately 1/4 (left/right). Since the WL is split into WLA and WLB, and only one of them is turned on [15], BL and BLB are connected together in CAM mode. Hence, only one VClamp circuit is required for each column.

The timing diagram is illustrated in the upper right corner of Fig. 4, and the signal "EN" is generated by the timing generator in Fig. 2. Taking the high voltage stage of CLK as an example, when "EN" is "0", P1 and P2 are activated to clamp the BL voltage, and Cp is charged to Vboost. Next, "EN" becomes "1" and the search operation starts (WLA and WLB are underdriven to VDD/2). If a mismatch occurs in the column, Cp collects the



Fig. 2. CShare and timing generator with 9T CAM cell.



Fig. 3. Search examples of CShare with 9T CAM cell.



Fig. 4. VClamp with 6T SRAM cell and search examples.

mirrored current, and the voltage of Cp rises. On the contrary, the BL has no discharge current in a matching case, and Cp remains Vboost.

To sense the search results, we adopt an asymmetric SA (Fig. 4) in which transistor M2 is skewed. In this paper, M2 is stronger than M1 because it uses an LVT device and is larger. A replica column-based circuit is designed to generate a suitable reference voltage (Vref) even under process, supply voltage and temperature (PVT) variations. As shown in the gray box in Fig. 4, the redundant column with all matching bits generates the required Vref and also tracks the PVT variations in the memory array, thereby increasing the sensing margin with negligible area overhead. Fig. 4 shows that the voltage of Cp is equal to Vref in the matching state, and the output of the SA (SAout) remains "1". In the mismatching state, the voltage of Cp is larger than Vref, and SAout turns to "0". Once SAout becomes "0", P4 is turned off immediately. Similar to CShare, VClamp can also perform two search operations in one CLK cycle to achieve DP operation.

In [16], a cascade current mirror (CCM) was used to clamp the BL voltage with the purpose of improving linearity and consistency in analog multiplication. In this work, a VClamp circuit is proposed to clamp BL voltage in SP CAM to achieve DP operation. The redundant column generates the Vref and also provides good tracking characteristics of PVT variations. In [17], a voltage clamping scheme was proposed to clamp the ML voltage, where the ML discharge delay was reduced by increasing the WL voltage, and additional circuits (footer) were required to clamp the current. However, the read noise margin in [17] deteriorates due to the increased voltage at the drain node of the "footer". In this paper, the VClamp circuit is designed to reduce search delay without boosting the WL voltage, and the benefits will be elaborated in the following section.

#### IV. RESULTS AND ANALYSIS

The overall architecture of the proposed 8 Kb CShare scheme and VClamp scheme is illustrated in Fig. 5 (a) and (b), respectively. A comprehensive simulation with 65-nm CMOS technology is carried out with the Cadence IC618 design suite and the Spectre circuit simulator. To further investigate the influence of advanced technology, 16-nm FinFET technology is utilized to verify our design.

#### A. Mismatch rate of LSB in CShare

This section clarifies the benefits of using PMOS to replace NMOS at M1 in the CShare scheme and points out the potential performance improvements. It is noted that each tag in the TLB is considered separately. If a tag has x mismatches in total, where LSB mismatches y times, the percentage of mismatches caused by LSB is y / x. The average value of y / x of all tags is defined as LSB mismatch rate. This paper calculates the average LSB mismatch rate in the data TLB of Medium BOOM [5] for SPEC2006 benchmarks, and the corresponding experiments were conducted on Digilent Genesys-2 FPGA board. As indicated in Fig. 6(a), the LSB mismatch rate is about 53.9%. The energy consumption under different LSB mismatch rates was measured in an 8 Kb CAM with 65-nm CMOS technology using the Spectre circuit simulator. As shown in Fig. 6(b), the energy consumption decreases significantly with the increase of LSB mismatch rate. When the mismatch rate is 1/2, the decrease in energy consumption is 36.4%. Even when the mismatch rate is 1/4, the energy consumption can be reduced by 18.4%.
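The LSB mismatch rate defined above is an average of per-tag ratios rather than a ratio of totals. The following minimal sketch makes that definition concrete; the counts are made-up illustrative numbers, and skipping tags that never mismatch (x = 0) is an assumption, since the text does not say how such tags are handled.

```python
from typing import Iterable, Tuple


def lsb_mismatch_rate(per_tag_counts: Iterable[Tuple[int, int]]) -> float:
    """Average of y / x over all tags.

    Each item is (x, y): x = total mismatches of a tag,
    y = mismatches of that tag caused by the LSB.
    """
    ratios = []
    for total, lsb in per_tag_counts:
        if lsb > total or total < 0 or lsb < 0:
            raise ValueError("need 0 <= y <= x for every tag")
        if total == 0:
            continue  # assumption: tags with no mismatches are ignored
        ratios.append(lsb / total)
    if not ratios:
        raise ValueError("no tag with at least one mismatch")
    return sum(ratios) / len(ratios)


# Made-up (x, y) counts for three tags, for illustration only.
counts = [(10, 5), (8, 4), (4, 3)]
rate = lsb_mismatch_rate(counts)
print(f"LSB mismatch rate: {rate:.1%}")
```

Because each tag contributes its own ratio, a tag with few mismatches weighs as much as a tag with many, which differs from dividing the total LSB mismatches by the total mismatches.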

#### B. Search delay reduction in VClamp

In SRAM-array style CAM structures such as those in [15], [17], cell data can be corrupted when multiple WLs are enabled at the same time, because the lowered BL voltage can falsely write "0" to another cell that stores "1". The common solution is to underdrive the WL, which, however, increases the ML discharge delay. In this paper, instead of simply raising the WL voltage [17], Cp is charged to Vboost in VClamp before the search operation begins (Fig. 4). When "EN" is "1" and the



Fig. 5. The overall architecture of the proposed (a) CShare scheme and (b) VClamp scheme.



Fig. 6. (a) Mismatch rate caused by least significant bit (LSB) in Translation Look-aside Buffer (TLB) and (b) the energy consumption over a range of LSB mismatch rate.



Fig. 7. The voltage of Vboost for reducing search delay and energy in the proposed VClamp scheme.

search operation starts, the voltage of Cp increases from Vboost instead of "0". As a result, the search delay is reduced. The search delay and energy were measured across different Vboost of an 8 Kb CAM array with 65-nm technology. It can be seen from Fig. 7 that when Vboost is set to 1/4 VDD, the search delay is reduced by 25%, and the decrease of energy is 13.14%.

#### C. Process Corner Variation

The proposed CShare and VClamp schemes have been verified using various process corners, and the performance metrics are provided in Table I. At the TT corner, the delay in [11] is 1.25 ns, while that of CShare and VClamp is 0.55 ns and 0.6 ns, respectively, which corresponds to an improvement of 56% and 52%. Compared to the delay of 4.39 ns in [13], the improvement is 87.5% and 86.3%, respectively. At the FF corner, there is little difference between the delay of the two schemes herein and that in [11]. However, the delay in [13] is still 2.9 times and 3.2 times that of CShare and VClamp, respectively. At the SF corner, the delay of CShare and VClamp

is 0.69 ns and 0.7 ns, respectively, which is 87.1% and 86.9% lower than that of 5.36 ns in [11]. In addition, the delay is improved by 84.2% and 84%, respectively, in comparison with that of 4.37 ns in [13]. The symbol "/" in Table I refers to data that have not been mentioned in relevant works.

From the above, the designs in both [11] and [13] are sensitive to process corners. The standard deviation of the delay across the reported corners is 2.75 ns in [11] and 1.46 ns in [13]. In contrast, the standard deviation is 0.36 ns for CShare and 0.24 ns for VClamp. Therefore, the proposed CShare and VClamp schemes exhibit relatively stable delay across different process corners.
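These standard deviations can be reproduced from the per-corner delays. The sketch below uses the delays reported in Table I and in the text (corners with no reported value are omitted); the text does not state which estimator was used, so the use of the population standard deviation is an assumption, chosen because it is consistent with the quoted figures.

```python
from statistics import pstdev

# Search delays in ns at the corners for which a value is reported.
delays_ns = {
    "[11]":   [1.25, 6.95, 0.33, 0.44, 5.36],  # TT, SS, FF, FS, SF
    "[13]":   [4.39, 1.29, 4.37],              # TT, FF, SF
    "CShare": [0.55, 1.42, 0.44, 0.49, 0.69],  # TT, SS, FF, FS, SF
    "VClamp": [0.60, 1.10, 0.40, 0.55, 0.70],  # TT, SS, FF, FS, SF
}

# Population standard deviation (assumed estimator), rounded to 2 decimals.
spread = {name: round(pstdev(values), 2) for name, values in delays_ns.items()}

for name, sigma in spread.items():
    print(f"{name}: {sigma} ns")
```

Note that [13] contributes only three corners, so its spread is not directly comparable with the five-corner figures of the other designs.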

The normalized energy  $(EfS_N)$  defined in [18] was used for a fair comparison, and the energy metric was normalized to 65-nm/1.2 V according to (1). Table II summarizes the energy of the two proposed schemes and the recently reported works.

$$EfS_{N} = EfS \times \left(\frac{65\,\text{nm}}{\text{Technology}}\right) \times \left(\frac{1.2\,\text{V}}{VDD}\right)^{2} \quad (1)$$
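As a check on (1), the normalization can be applied to the raw energy figures quoted for the reference designs. This is a minimal sketch; the function name and argument names are illustrative, and the inputs are the EfS, technology and supply values listed in Tables II and III.

```python
def normalize_efs(efs: float, tech_nm: float, vdd: float) -> float:
    """Normalize energy per bit per search to 65-nm / 1.2 V, following (1)."""
    if tech_nm <= 0 or vdd <= 0:
        raise ValueError("technology node and supply voltage must be positive")
    return efs * (65.0 / tech_nm) * (1.2 / vdd) ** 2


# (EfS in fJ/bit/search, technology in nm, supply in V) at the TT corner.
reported = {
    "[11]": (2.1, 45, 1.0),
    "[12]": (0.66, 45, 1.0),
    "[13]": (0.181, 45, 1.0),
    "[17]": (1.62, 28, 1.0),
}

normalized = {
    ref: round(normalize_efs(efs, tech, vdd), 2)
    for ref, (efs, tech, vdd) in reported.items()
}
print(normalized)
```

For a 45-nm / 1 V design the combined scaling factor is about 2.08, so normalization roughly doubles the reported energy; a design already at 65-nm / 1.2 V is left unchanged.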

At the TT corner, the energy of CShare is 0.66 fJ/bit/search and that of VClamp is 0.65 fJ/bit/search, while the $EfS_N$ in [11] and [12] is 4.37 fJ/bit/search and 1.37 fJ/bit/search, respectively. Therefore, the energy reduction of CShare and VClamp is 84.9% and 85.1% compared to [11], and 51.8% and 52.6% compared to [12], respectively. Furthermore, relative to [12], CShare provides a reduction of 83.9% in FF and 71.3% in FS, and the corresponding reduction of VClamp is 84.4% and 74.9%, respectively. At the TT and SF corners, the energy consumption of the proposed designs is larger than that of the TCAM in [13], because [13] is implemented using NAND-ML at the expense of large cell area (14T) and long delay (about 4.4 ns in Table I). In addition, it can be concluded from Fig. 10 that when the delay reaches 4.4 ns (the same as [13]), the energy consumption of CShare and VClamp is lower than that in [13].
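The percentages in this paragraph follow from the normalized energies in Table II. A minimal sketch of the arithmetic is given below; the helper name is illustrative, and the reduction is taken as one minus the ratio of the proposed energy to the reference energy.

```python
def reduction_percent(proposed: float, reference: float) -> float:
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return round((1.0 - proposed / reference) * 100.0, 1)


# Normalized energies in fJ/bit/search from Table II.
tt = {"CShare": 0.66, "VClamp": 0.65, "[11]": 4.37, "[12]": 1.37}
ff = {"CShare": 0.71, "VClamp": 0.69, "[12]": 4.41}
fs = {"CShare": 0.64, "VClamp": 0.56, "[12]": 2.23}

results = {
    "TT vs [11]": (reduction_percent(tt["CShare"], tt["[11]"]),
                   reduction_percent(tt["VClamp"], tt["[11]"])),
    "TT vs [12]": (reduction_percent(tt["CShare"], tt["[12]"]),
                   reduction_percent(tt["VClamp"], tt["[12]"])),
    "FF vs [12]": (reduction_percent(ff["CShare"], ff["[12]"]),
                   reduction_percent(ff["VClamp"], ff["[12]"])),
    "FS vs [12]": (reduction_percent(fs["CShare"], fs["[12]"]),
                   reduction_percent(fs["VClamp"], fs["[12]"])),
}

for case, (cshare, vclamp) in results.items():
    print(f"{case}: CShare {cshare}%, VClamp {vclamp}%")
```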

#### D. Temperature Variation

To clarify the performance improvement of the proposed schemes over existing ones, temperature variation analysis was performed on the designs; the results are displayed in Fig. 8 and Fig. 9. Fig. 8(a) and Fig. 9(a) show that in both CShare and VClamp, the search delay increases only slightly over the temperature range of -20 to 100 °C. The delay variations are small: 0.09 ns in CShare and 0.14 ns in VClamp, compared to 0.3 ns in [11]. As shown in Fig. 8(b) and Fig. 9(b), within the temperature range of 20 to 100 °C, the energy variation ($\Delta$E) of CShare and VClamp is 0.007 fJ/bit/search and 0.028 fJ/bit/search, respectively, while that in [12] is about 0.3 fJ/bit/search. At lower temperatures (from -20 to 40 °C), $\Delta$E is 0.01 fJ/bit/search in CShare and 0.025 fJ/bit/search in VClamp.

#### E. Supply Voltage Scaling

Apart from the performance of the proposed schemes across process corners and temperatures, the search delay and energy under supply voltage scaling are another important concern. Fig. 10(a) indicates that CShare achieves 0.28 fJ/bit/search with 2.09 ns search delay at 0.8 V. The minimum operating voltage of the proposed schemes can be as low as 0.6 V. At this voltage, CShare achieves 0.15 fJ/bit/search with 7.7 ns search delay. It can be seen from Fig. 10(b) that VClamp achieves 0.25 fJ/bit/search with 4.6 ns search delay at 0.8 V, and 0.14 fJ/bit/search with 29 ns search delay at 0.6 V supply.

TABLE I

DELAY COMPARISON ACROSS DIFFERENT PROCESS CORNERS

| Reference | [11]    | [12] | [13]    | CShare  | VClamp  |
|-----------|---------|------|---------|---------|---------|
| TT        | 1.25 ns | /    | 4.39 ns | 0.55 ns | 0.6 ns  |
| SS        | 6.95 ns | /    | /       | 1.42 ns | 1.1 ns  |
| FF        | 0.33 ns | /    | 1.29 ns | 0.44 ns | 0.4 ns  |
| FS        | 0.44 ns | /    | /       | 0.49 ns | 0.55 ns |
| SF        | 5.36 ns | /    | 4.37 ns | 0.69 ns | 0.7 ns  |

The symbol "/" refers to data not mentioned in relevant works.

TABLE II

ENERGY COMPARISON ACROSS DIFFERENT PROCESS CORNERS

| Reference | [11] (45-nm / 1 V) EfS | [11] $EfS_N$ | [12] (45-nm / 1 V) EfS | [12] $EfS_N$ | [13] (45-nm / 1 V) EfS | [13] $EfS_N$ | CShare (65-nm / 1.2 V) EfS | VClamp (65-nm / 1.2 V) EfS |
|-----------|------------------------|--------------|------------------------|--------------|------------------------|--------------|----------------------------|----------------------------|
| TT        | 2.1                   | 4.37                        | 0.66                  | 1.37                                          | 0.181                 | 0.38                                          | 0.66     | 0.65        |
| SS        | /                     | /                           | /                     | /                                             | /                     | /                                             | 0.74     | 0.62        |
| FF        | /                     | /                           | 2.12                  | 4.41                                          | 0.37                  | 0.77                                          | 0.71     | 0.69        |
| FS        | /                     | /                           | 1.07                  | 2.23                                          | /                     | /                                             | 0.64     | 0.56        |
| SF        | /                     | /                           | /                     | /                                             | 0.24                  | 0.5                                           | 0.7      | 0.61        |

The symbol "/" refers to data not mentioned in relevant works.



Fig. 8. Temperature variation from -20 °C to 100 °C in proposed CShare scheme, (a) Search delay and (b) Energy.



Fig. 9. Temperature variation from -20 °C to 100 °C in proposed VClamp scheme, (a) Search delay and (b) Energy.



Fig. 10. Search delay and energy consumption versus supply voltage scaling in the proposed (a) CShare scheme and (b) VClamp scheme.

#### F. Monte-Carlo simulation

The Monte-Carlo (MC) method was used to analyze the output stability and accuracy of CShare and VClamp. The upper part of Fig. 11(a) and (b) depicts the variation of the output of CShare and VClamp, respectively. Whether all bits match or only one bit mismatches, the output accuracy of CShare and VClamp is 100% over 1000 MC runs, as shown in Fig. 11(a) and (b).

#### G. Bit-cell stability and area overhead

The 9T cell in CShare is free from read destruction because the read and write ports are separated. In contrast, the 6T cell in VClamp has a coupled read-write path, which may cause read destruction. Read destruction can be mitigated by reducing the WL voltage, using high-threshold transistors, employing a dual WL structure, or adopting a CCM to clamp the BL voltage [16]. The proposed VClamp circuit uses a CCM to clamp the BL voltage, and read destruction is therefore expected to be eliminated [16].

In this paper, all transistors in the 9T (except that PMOS in LSB is slightly larger) and 6T cells adopt the minimum size. Since no transistors are added to the storage cell, the overall area will not deteriorate seriously. In addition, the overhead can be further reduced as the array size increases, because only one CShare or VClamp circuit is added to each row or column. In [16], the CCM has an area overhead of 14.17% in a 4 Kb SRAM macro with 28-nm process, while in a 256 Kb SRAM, the area overhead is only 1.77%.

#### H. Performance of CShare / VClamp as DP CAM / TLB

This section describes the benefits of CShare / VClamp acting as DP CAM / TLB. The 8T DP CAM (6T CAM with one additional port), CShare and VClamp were tested in two cases, with half of the entries and with all entries mismatching, across different corners. Test conditions such as transistor size, temperature and voltage are kept the same. Fig. 12 shows the energy consumption of CShare / VClamp and DP CAM in both cases, where the CLK frequency of CShare is 909 MHz and that of VClamp is 833 MHz. Note that the results have been normalized to CShare in TT. Compared to DP CAM, the energy reduction of CShare is 31.6% and that of VClamp is 29.7% when half of the entries mismatch. The amount of charge discharged in the CAM array increases as the number of mismatched entries increases. When all entries mismatch, CShare and VClamp achieve 45.56% and 45.64% energy reduction, respectively, as shown in Fig. 12(b).

To further study the influence of advanced technology, 16-nm FinFET technology was used to verify the designs. As shown in Fig. 13(a–c), CShare and VClamp act as DP TLBs with different numbers of entries (32, 64, 128), and the operating frequency is 3.125 GHz and 1.67 GHz, respectively. CShare obtains 0.127 fJ/search/bit and VClamp achieves 0.083 fJ/search/bit over 100 searches with 32 entries. When the number of entries increases to 128, the energy consumption of CShare and VClamp over 100 searches is 0.128 fJ/search/bit and 0.085 fJ/search/bit, respectively. Fig. 13(d) shows that when CShare and VClamp are configured as SP TLBs, the delay of CShare is 0.16 ns and that of VClamp is 0.3 ns, indicating that the frequency is 6.25 GHz and 3.3 GHz, respectively.
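The DP and SP frequencies quoted here and in Table III are consistent with a simple relation to the search delay. The sketch below assumes that the SP frequency is the reciprocal of the search delay and that the DP clock is half of that, since one CLK cycle has to fit two searches; this relation is inferred from the reported numbers and is not stated explicitly in the text.

```python
def sp_frequency_mhz(search_delay_ns: float) -> float:
    """Single-port frequency in MHz: one search per clock cycle (assumed)."""
    if search_delay_ns <= 0:
        raise ValueError("search delay must be positive")
    return 1e3 / search_delay_ns


def dp_clock_mhz(search_delay_ns: float) -> float:
    """Dual-port CLK frequency in MHz: two searches per clock cycle (assumed)."""
    return sp_frequency_mhz(search_delay_ns) / 2.0


# Search delays in ns: 65-nm values from Table I (TT), 16-nm values from Fig. 13(d).
search_delay_ns = {
    "CShare 65 nm": 0.55,
    "VClamp 65 nm": 0.60,
    "CShare 16 nm": 0.16,
    "VClamp 16 nm": 0.30,
}

frequencies = {
    name: (round(dp_clock_mhz(t)), round(sp_frequency_mhz(t)))
    for name, t in search_delay_ns.items()
}

for name, (dp, sp) in frequencies.items():
    print(f"{name}: DP {dp} MHz, SP {sp} MHz")
```

Under this assumption the DP clock runs at half the SP rate while the number of searches per second stays the same, because each CLK cycle carries two searches.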

Fig. 14 gives the energy consumption and delay trade-offs of several fully-associative TLBs. Both CShare and VClamp have a better trade-off between the energy consumption and the delay.

Table III summarizes the proposed designs as well as some recent CAM works. CShare and VClamp proposed herein are based on traditional CAM cell with negligible redesign efforts. Both CShare and VClamp have the highest working frequency with comparable energy efficiency. Even if the CLK frequency is halved as a DP CAM, CShare and VClamp are still faster than



Fig. 11. 1000 runs Monte-Carlo simulation for output's variation and accuracy in the proposed (a) CShare scheme and (b) VClamp scheme.



Fig. 12. Energy consumption of CShare / VClamp and dual-port CAM with (a) half entries mismatch and (b) all entries mismatch across different process corners.



Fig. 13. Multiple TLB searches analysis. Energy consumption of CShare / VClamp with (a) 32 entries, (b) 64 entries, and (c) 128 entries. (d) Delay comparison between dual-port operation (DP) and single-port operation (SP) of CShare / VClamp.



Fig. 14. Delay and energy trade-offs in fully-associative TLBs.

TABLE III

| COMPARISON WITH PREVIOUS CAM WORKS |       |      |      |       |      |          |          |         |          |
|-----------------------------------|-------|------|------|-------|------|----------|----------|---------|----------|
| Reference                         | [10]  | [11] | [12] | [13]  | [17] | CShare   |          | VClamp  |          |
| Config.<br>(Kb)                   | /     | 4    | 0.5  | 0.5   | 4    | 8        |          | 8       |          |
| Cell                              | 8T    | 10T  | 10T  | 14T   | 6T   | 9T       |          | 6T      |          |
| Tech (nm)                         | 65    | 45   | 45   | 45    | 28   | 65       | 16       | 65      | 16       |
| Supply (V)                        | 1.2   | 1    | 1    | 1     | 1    | 1.2      | 0.8      | 1.2     | 0.8      |
| CLK freq.<br>(MHz)                | 144   | 500  | 50   | 228   | 10   | DP: 909<br>SP: 1818 | DP: 3125<br>SP: 6250 | DP: 833<br>SP: 1667 | DP: 1667<br>SP: 3333 |
| EfS (fJ/bit/<br>search)           | 0.61* | 2.1  | 0.66 | 0.181 | 1.62 | 0.66     | 0.128    | 0.65    | 0.085    |
| $\mathrm{EfS}_{\mathrm{N}}$       | 0.61* | 4.37 | 1.37 | 0.38  | 5.42 | 0.66     | /        | 0.65    | /        |
| Ports<br>number                   | 2     | 1    | 1    | 1     | 1    | 1        |          | 1       |          |
| Dual-port<br>operation            | Yes   | No   | No   | No    | No   | Yes      |          | Yes     |          |

The symbol "/" refers to data not mentioned in relevant works.

The symbol "\*" refers to CAM/total pattern bytes per search

DP: Dual-port operation; SP: Single-port operation.

those in other works. Moreover, among all the works in Table III, our design is unique in demonstrating DP CAM operation with a SP CAM structure.

#### V. CONCLUSIONS

This paper adds peripheral circuits to the SP CAM cell, with negligible redesign effort, to achieve DP operation, and provides two high-speed and energy-efficient CAM designs. To the best of our knowledge, this is the first attempt to demonstrate DP CAM operation with an SP CAM structure. The proposed CShare and VClamp schemes have been verified with 65-nm CMOS technology and the advanced 16-nm FinFET technology. Compared with the relevant CAM works in Table III, CShare and VClamp have the highest operating frequency with comparable energy efficiency. Future work includes post-layout simulation to assess the impact of parasitic effects on the output.

#### ACKNOWLEDGEMENTS

We sincerely thank the anonymous reviewers for their insightful comments. This work was supported by the National Key R&D Program of China (Grant no. 2022YFB4500500). Xu Cheng and Xianhua Liu are the corresponding authors of this paper.

#### REFERENCES

- [1] K. Aasaraai and A. Moshovos, "Design space exploration of instruction schedulers for out-of-order soft processors," 2010 International Conference on Field-Programmable Technology, 2010, pp. 385-388.

- [2] H. Wong, V. Betz and J. Rose, "Quantifying the Gap Between FPGA and Custom CMOS to Aid Microarchitectural Design," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 10, pp. 2067-2080, Oct. 2014.
- [3] G. Burda, Y. Kolla, J. Dieffenderfer and F. Hamdan, "A 45nm CMOS 13port 64-word 41b fully associative content-addressable register file," 2010 IEEE International Solid-State Circuits Conference - (ISSCC), 2010, pp. 286-287.
- [4] H. Nguyen, J. Jeong, F. Atallah, D. Yingling and K. Bowman, "A 7-nm 6R6W Register File with Double-Pumped Read and Write Operations for High-Bandwidth Memory in Machine Learning and CPU Processors," in IEEE Solid-State Circuits Letters, vol. 1, no. 12, pp. 225-228, Dec. 2018.
- [5] Zhao, Jerry. "SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine." (2020).
- [6] V. Karakostas et al., "Energy-efficient address translation," 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 631-643.
- [7] X. Zeng et al., "Design and Analysis of Highly Energy/Area-Efficient Multiported Register Files with Read Word-Line Sharing Strategy in 65nm CMOS Process," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 7, pp. 1365-1369, July 2015.
- [8] Sangireddy, R. "Register port complexity reduction in wide-issue processors with selective instruction execution." Microprocessors & Microsystems 31.1(2007):51-62.
- [9] D. She, Y. He, B. Mesman and H. Corporaal, "Scheduling for register file energy minimization in explicit datapath architectures," 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012, pp. 388-393.
- [10] D. Li and K. Yang, "A Dual-Port 8-T CAM-Based Network Intrusion Detection Engine for IoT," in IEEE Solid-State Circuits Letters, vol. 3, pp. 358-361, 2020.
- [11] T. Venkata Mahendra, S. Mishra and A. Dandapat, "Self-Controlled High-Performance Precharge-Free Content-Addressable Memory," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 8, pp. 2388-2392, Aug. 2017.
- [12] Mahendra, T. V., et al. "Low discharge precharge free matchline structure for energy-efficient search using CAM." Integration 69(2019):31-39.
- [13] T. Venkata Mahendra, S. Wasmir Hussain, S. Mishra and A. Dandapat, "Energy-Efficient Precharge-Free Ternary Content Addressable Memory (TCAM) for High Search Rate Applications," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 7, pp. 2345-2357, July 2020.
- [14] M. Yabuuchi et al., "A 6.05-Mb/mm2 16-nm FinFET double pumping 1W1R 2-port SRAM with 313 ps read access time," 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), 2016, pp. 1-2.
- [15] S. Jeloka, N. B. Akesh, D. Sylvester and D. Blaauw, "A 28 nm Configurable Memory (TCAM/BCAM/SRAM) Using Push-Rule 6T Bit Cell Enabling Logic-in-Memory," in IEEE Journal of Solid-State Circuits, vol. 51, no. 4, pp. 1009-1021, April 2016.
- [16] Z. Lin et al., "Cascade Current Mirror to Improve Linearity and Consistency in SRAM In-Memory Computing," in IEEE Journal of Solid-State Circuits, vol. 56, no. 8, pp. 2550-2562, Aug. 2021.
- [17] J. Koo, E. Kim, S. Yoo, T. Kim, S. Ryu and J. Kim, "Configurable BCAM/TCAM Based on 6T SRAM Bit Cell and Enhanced Match Line Clamping," 2019 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2019, pp. 223-226.
- [18] T. Huang and W. Hwang, "A 65 nm 0.165 fJ/bit/search 256 × 144 TCAM macro design for IPv6 lookup tables," IEEE J. Solid-State Circuits, vol. 46, no. 2, pp. 507–519, Feb. 2011.
- [19] M. -M. Papadopoulou, X. Tong, A. Seznec and A. Moshovos, "Prediction-based superpage-friendly TLB designs," 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 210-222.
- [20] J. Kim, J. Lee and S. Kim, "TLB Index-Based Tagging for Reducing Data Cache and TLB Energy Consumption," in IEEE Transactions on Computers, vol. 66, no. 7, pp. 1200-1211, 1 July 2017.
- [21] K. -L. Tsai, Y. -J. Chang and Y. -C. Cheng, "Automatic Charge Balancing Content Addressable Memory with Self-control Mechanism," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 10, pp. 2834-2841, Oct. 2014.