# **Recovery-aware Proactive TSV Repair for Electromigration in 3D ICs**

Shengcheng Wang\*, Hongyang Zhao<sup>†</sup>, Sheldon X.-D. Tan<sup>†</sup>, and Mehdi B. Tahoori<sup>\*</sup>

\*Chair of Dependable Nano Computing (CDNC), Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

{shengcheng.wang, mehdi.tahoori}@kit.edu

<sup>†</sup>Department of Electrical and Computer Engineering, University of California, Riverside, CA, USA

{hzhao, stan}@ece.ucr.edu

*Abstract*—Electromigration (EM) has become a major reliability concern in three-dimensional integrated-circuits (3D ICs). To mitigate this problem, a typical solution is to use TSV redundancy in a reactive manner, maintaining the operability of a 3D chip in the presence of EM failures by detecting and replacing faulty TSVs with spares. In this work, we explore a preferable alternative approach to enhance the EM-related lifetime reliability of the TSV grid, in which redundancy is used proactively to allow non-faulty TSVs to be temporarily deactivated. In this way, EM wear-out can be reversed by exploiting its recovery property. Applied to 3D benchmark designs, the recovery-aware proactive repair approach increases the EM-related lifetime reliability (measured in mean-time-to-failure) of the entire TSV grid by up to 12X relative to the conventional reactive method, with less area overhead.

## I. INTRODUCTION

Three-dimensional integrated-circuits (3D ICs) promise to overcome interconnect bottlenecks in CMOS scaling by leveraging fast, dense inter-die vias [1]. Typically, a complete 3D IC implementation is envisioned as a stack of active chips using through-silicon vias (TSVs) to connect through each chip down to a package substrate. By utilizing such vertical connection, 3D ICs can provide abundant interconnect bandwidth with improved performance and less communication-energy consumption. However, concerns related to TSV reliability are key obstacles in the commercial exploitation of TSV-based 3D IC technology [2].

As one of the critical challenges for TSV reliability, electromigration (EM) refers to the diffusion of metal atoms induced by electric current [2]. Due to increased current density, higher temperature and thermal mechanical stress, EM reliability of TSVs<sup>1</sup> in a 3D IC becomes further exacerbated compared to the conventional interconnects in its 2D counterpart. The gradual transport of metal atoms caused by EM leads to void nucleation and growth in TSVs during field-operation. This will significantly increase TSV resistance, and may eventually cause open/short defects, which can drastically reduce the mean-time-to-failure (MTTF) of a 3D IC [3].

In order to extend EM-related lifetimes of TSVs, a typical solution is to add spare TSVs (s-TSVs) in the design to repair defective functional TSVs (f-TSVs) at run-time. To this end, various TSV redundancy allocation techniques and their corresponding repair algorithms have been proposed in the literature [3], [4]. TSV defects induced by EM can be effectively tolerated by in-field reconfigurable repair solutions. However, the transient recovery effect in EM-induced stress evolution was ignored completely in all these existing *ad hoc* methodologies. Here, the "recovery effect" refers to the EM stress relaxation in the interconnect, which occurs when no current, a lower current, or a reverse current passes through it. Consequently, this effect can be considered a healing process that extends the lifetime of an interconnect, as it takes longer for the stress to reach the critical threshold for void nucleation [5]. Such phenomena have been observed in many previous experimental works [6], [7]. According to these experiments, this healing process exhibits a positive temperature dependence and a directional property: on the one hand, higher temperatures lead to faster and more complete recovery of EM stress; on the other hand, the recovery phenomenon is more visible when the interconnect is stressed by bi-directional current waveforms than by unidirectional ones. Therefore, since most of the f-TSVs in 3D ICs experience very high temperatures and carry bi-directional currents [8], they exhibit a significant recovery effect, which can be leveraged for EM-related lifetime enhancement.

<sup>1</sup>In this paper, we limit our scope only to the signal TSVs. Therefore, the term "TSV" in this paper refers to signal TSV unless otherwise specified.

In this paper, a recovery-aware *proactive* TSV repair solution is proposed to enhance the EM-related lifetime reliability of f-TSV grids. In this repair approach, TSV redundancy is used proactively to allow non-faulty f-TSVs to be temporarily deactivated and recover from certain EM wear-out well before failing. To this purpose, the implementation of the proposed methodology consists of two stages:

- *Design-time TSV grouping*: After identifying the f-TSVs vulnerable to EM failures, we partition them into groups and then assign s-TSV(s) to each group with appropriate location(s). This grouping is implemented based on lifetime as well as signal re-routing constraints.
- *Run-time TSV repair*: In each group, the logic signals carried by the f-TSVs take turns being transmitted through the assigned spare(s), which allows all the TSVs (including the redundant one(s)) to be deactivated on a rotating basis and recover from EM wear-out during field-operation.

Our simulation results demonstrate that:

- Applied to 3D benchmark designs, our proactive repair approach increases EM-related lifetime reliability (measured in MTTF) of the entire TSV grid by up to 12X compared to the conventional reactive method [3], but introduces less area overhead.
- The proposed greedy group-merging algorithm can further reduce area overhead introduced by the proposed repair solution, achieving a better trade-off between lifetime reliability and hardware cost.

The rest of this paper is organized as follows. Preliminaries and related prior work are presented in Section II. The motivation and basic idea of the proposed proactive repair approach are presented in Section III. Section IV and Section V describe the methodology in detail. In Section VI, we report simulation results. Finally, conclusions are drawn in Section VII.

## II. PRELIMINARIES & RELATED WORK

### A. EM recovery effect

A number of previous works have studied EM issues in 3D ICs and shown that TSVs are susceptible to EM wear-out [3], [4]. Once the EM-induced hydrostatic tensile stress exceeds a critical value, a void forms in the TSV, which can increase its resistance, causing a path delay fault and eventually an open/short defect [3]. However, this time-varying stress can be reduced when the density of the stressing current temporarily decreases (or even becomes negative); this is the EM recovery effect. The recovery effect can be quite significant when the interconnect is stressed by symmetric bi-directional (bipolar) pulse current waveforms. Temperature also affects it: higher temperatures lead to faster recovery. Due to the recovery property, it takes longer for the EM-induced stress to reach the critical value, which results in a longer interconnect lifetime [5].

In order to leverage the recovery effect for lifetime reliability improvement at the system level, an EM recovery model with a "two-step" equivalent DC current was proposed in [9], which accounts for the transient recovery effect in the EM stress evolution while still using existing simple EM models. The generation of the equivalent DC current is divided into two steps. First, an arbitrary waveform with time-varying current and temperature stress (as shown in Figure 1(a)) is converted to an equivalent square waveform (red dotted line in Figure 1(b)) by matching both the highest peak stress and the final stress in each period, instead of matching only the end point as in the simple "equivalent DC" method (yellow dashed line in Figure 1(b)). Afterwards, the generated current is further parameterized in terms of current density, duty cycle, temperature and time period to define the waveform. As shown in Figure 2, compared to the conventional equivalent DC method, the technique proposed in [9] has a smaller error in terms of time-to-failure estimation. By using this recovery-aware EM DC current model, the lifetime of an interconnect wire can be easily computed given the stressing current waveforms.



Figure 1: (a) Original input driving current density. (b) Calculated EM DC equivalent current density with two different methods.

Figure 2: Comparing the nucleation time of two different methods and original stress.

### B. Related prior work

A number of s-TSV allocation techniques and their corresponding repair algorithms have been proposed in the literature [10–12]. However, all of them target only manufacturing defects rather than run-time failures (e.g., EM wear-out). To tackle this problem, several in-field repair methodologies were proposed for post-manufacturing TSV faults [3], [4].

A typical in-field TSV repair scheme is as follows. First, s-TSVs are allocated in the design along with a reconfiguration infrastructure that enables signal re-routing. Afterwards, online testing is triggered periodically or by events. Once a particular f-TSV is detected to be faulty, it is replaced by a standby s-TSV through the reconfigurable routing network. Therefore, such *reactive* repair allows as many TSV defects to be tolerated as there are non-faulty s-TSVs. A TSV grid is regarded as irreparable once the s-TSV resource is exhausted, so its lifetime reliability improvement is highly dependent on the number of allocated s-TSVs.

Due to this "detect-and-replace" scheme, the conventional reactive approach has the following shortcomings:

- For conducting reactive in-field repair, it is imperative to implement an on-chip sensor network in order to test and diagnose faulty f-TSVs, which results in significant hardware cost.
- Ignoring the EM recovery effect, the reactive approach cannot fully utilize the s-TSV resource, which makes the generated repair solution inefficient.

In this paper, we propose a preferable alternative, a *proactive* repair approach, to address these drawbacks by exploiting the recovery property of EM wear-out.

## III. MOTIVATION & BASIC IDEA

As opposed to replacing f-TSVs after they become faulty, the proposed *proactive* approach allows f-TSVs to recover from EM wear-out before failing. By temporarily deactivating non-faulty f-TSVs, the onset of EM failure can be delayed due to the recovery effect, which significantly extends the effective TSV lifetime. Therefore, such a proactive repair approach has the following advantages over a reactive one:

- Since f-TSVs can recover from EM wear-out before they fail, it is unnecessary to implement the entire on-chip sensor network for TSV defect detection and monitoring, which saves the associated hardware cost.
- In the reactive repair approach, the number of tolerated f-TSV failures is limited by the amount of pre-allocated spares. By contrast, by exploiting the EM recovery effect, the proactive approach can extend the lifetimes of multiple f-TSVs even using one single spare, taking full advantage of the limited redundancy resources.

The proposed proactive repair approach is based on two consecutive stages. At design-time, the identified EM-vulnerable f-TSVs are partitioned into groups according to their lifetimes, and then s-TSV(s) is (are) subsequently assigned to each group under routing constraints. The corresponding algorithms will be discussed in Section IV. Afterwards, the assigned s-TSVs are used proactively, which allows partitioned EM-vulnerable f-TSVs in each group to be temporarily deactivated on a rotating basis and recover from EM wear-out well before failing. The detailed implementation will be presented in Section V.

## IV. DESIGN-TIME TSV GROUPING

Due to the clustering effects of TSV faults [3], we may run into the situation where some faulty f-TSVs lack TSV redundancy while others have more than they need. Although allocating more s-TSVs can tackle this problem, it also results in significant hardware cost. Therefore, here we propose to:

- Identify the f-TSVs which are vulnerable to EM wear-out at design-time, and limit the use of s-TSVs to them only.
- Adopt the "shared s-TSV" technique [13], which partitions the set of EM-vulnerable f-TSVs into groups and subsequently assigns s-TSV(s) to each of them.

Therefore, the "design-time TSV grouping" problem can be consequently divided into the following sub-problems: i) vulnerable f-TSV identification, ii) f-TSV partitioning, and iii) s-TSV assignment.

### A. EM-vulnerable f-TSV identification

Since a TSV grid is a series system, its EM-related lifetime is dominated by the f-TSVs that are most susceptible to EM failures. Therefore, in order to reduce hardware cost, it is more efficient to provide TSV redundancy only to the f-TSVs with lower EM-related lifetimes rather than to all of them, and here we use the MTTF of an f-TSV to evaluate its vulnerability. Given a set of representative workloads, we can generate the power/thermal characteristics of each f-TSV, and then estimate its MTTF [9]. Note that, since we look into large time scales for EM recovery periods in this work (details in Section V), a steady-state temperature analysis is sufficient here. Then, the f-TSVs whose MTTF falls below a user-defined threshold value are identified as EM-vulnerable.

For the f-TSVs with zero or very small timing slacks, a TSV fault is not necessarily a catastrophic open/short defect, but often a timing failure due to EM-induced resistance increase. However, as in [9], here we only use the void nucleation phase to compute the lifetime of a TSV. Therefore, the proposed repair approach can prevent EM-induced void formation and the accompanying TSV resistance increase during field-operation.

### B. f-TSV partitioning

After identifying the EM-vulnerable f-TSVs, the next step is to partition them into groups for spare sharing. However, in order to obtain an effective repair solution, the partitioning should avoid placing the f-TSVs with the lowest MTTF in the same group. Therefore, this problem can be formulated as follows:

- **Input**: i) A set of EM-vulnerable f-TSVs $\mathbf{F} = \{f_i\}$, in which each f-TSV $f_i$ has its MTTF value $\operatorname{LT}(f_i)$; ii) the number of f-TSVs in each partitioned group, $N_{gf}$.
- **Output**: A set of groups $\mathbf{G} = \{g_j\}$ that partitions $\mathbf{F}$.
- **Constraint**: $\mathbf{F}$ is partitioned into $\lceil |\mathbf{F}| / N_{gf} \rceil$ groups, each of size at most $N_{gf}$.
- **Objective**: $\operatorname{Minimize} : \max_{\forall g_j \in \mathbf{G}} S(g_j),$

where
$$S(g_j) = \sum_{\forall f_i \in g_j} \operatorname{LT}(f_i).$$

Here, the objective ensures that the difference in total MTTF between the largest and smallest f-TSV groups is minimized, which leads to a uniform partition of vulnerable f-TSVs according to their lifetimes. This f-TSV partitioning problem can then be reduced to the *Balanced Multiway Number Partitioning* problem [14]. Using the heuristic proposed in [14], we can solve this problem in $O(n\log n)$ time, where $n = |\mathbf{F}|$.
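To make the formulation concrete, the following is a minimal sketch of a balanced partitioning step. It is not the heuristic of [14]; it swaps in a simple greedy rule (largest lifetime first, always into the group with the smallest running sum that still has room) purely for illustration. The function name `partition_ftsvs` and the f-TSV names in the example are hypothetical.

```python
import heapq
import math


def partition_ftsvs(lifetimes, n_gf):
    """Greedy balanced partition of f-TSVs by MTTF (illustrative stand-in for [14]).

    lifetimes: dict mapping a (hypothetical) f-TSV name to its MTTF LT(f_i).
    n_gf: maximum number of f-TSVs per group.
    Returns a list of groups, each a list of f-TSV names.
    """
    n_groups = math.ceil(len(lifetimes) / n_gf)
    groups = [[] for _ in range(n_groups)]
    # Heap of (running sum S(g_j), group index) for groups that still have room.
    heap = [(0.0, j) for j in range(n_groups)]
    heapq.heapify(heap)
    # Place the largest lifetimes first, always into the lightest open group.
    for name, lt in sorted(lifetimes.items(), key=lambda kv: -kv[1]):
        total, j = heapq.heappop(heap)
        groups[j].append(name)
        if len(groups[j]) < n_gf:
            heapq.heappush(heap, (total + lt, j))
    return groups


# Toy example with made-up MTTF values (in years).
example = {"f0": 9.0, "f1": 8.0, "f2": 3.0, "f3": 2.0, "f4": 1.5, "f5": 1.0}
example_groups = partition_ftsvs(example, n_gf=3)
```

In this toy run the two shortest-lived f-TSVs (`f4` and `f5`) end up in different groups, which is the behaviour the objective is meant to encourage. The sorting step dominates the cost, so this stand-in also runs in $O(n\log n)$ time.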

### C. s-TSV assignment

For each partitioned f-TSV group, we then need to assign s-TSV(s) to provide proactive redundancy, which allows f-TSVs to be temporarily deactivated and recover from EM wear-out. However, in order to maintain the normal operation of the circuit, the logic signals carried by the f-TSVs must be re-routable during field-operation. To this end, it is necessary to implement a reconfigurable network for signal re-routing, which inevitably introduces delay overhead. Therefore, the s-TSV(s) assigned to each group should be chosen from the given set so as to minimize the delay overhead introduced by the in-field repair solution.

The formal problem statement is as follows:

- **Input**: i) A set of f-TSV groups $\mathbf{G} = \{g_j\}$ that partitions $\mathbf{F} = \{f_i\}$; ii) a set of placed s-TSVs $\mathbf{S} = \{s_k\}$; iii) the number of s-TSVs assigned to each group, $N_{gs}$.
- **Output**: The mapping between $\mathbf{G}$ and the set of assigned s-TSVs $\mathbf{S}^* \subseteq \mathbf{S}$.
- **Constraint**: The number of s-TSVs assigned to each group is equal to $N_{gs}$.
- **Objective**: During in-field repair, the total delay overhead of all groups is minimized. Here the delay overhead of a group is the maximum overhead of all its f-TSVs.

Generally, the delay overhead during repair comes from: i) the re-routing logic circuitry and ii) the re-routing wire. The first component is determined by the given grouping ratio $GR = N_{gf} : N_{gs}$ (details in Section V). Therefore, the objective of this step is to minimize the delay overhead introduced by the re-routing wire, and the additional wire length during re-routing is used as the metric to evaluate it, as in [15].

Here we can formulate the s-TSV assignment as a min-cost flow problem. As illustrated in Figure 3, a network $G = (V, E)$ is constructed, whose node set includes all the partitioned f-TSV groups $\{g_j\}$, all the placed s-TSVs $\{s_k\}$, a pseudo source node $S$, and a pseudo sink node $T$. There are three kinds of edges in the edge set $E$, where each edge is assigned a (*capacity*, *cost*) pair:



Figure 3: Min-cost flow problem for s-TSV assignment.

- The source node $S$ has a supply of $N_{gs} \times |\mathbf{G}|$ and connects to the $|\mathbf{S}|$ s-TSVs $\{s_k\}$. Each edge $(S, s_k)$ has capacity 1 and cost 0.
- There are  $|\mathbf{S}| \times |\mathbf{G}|$  edges from the placed s-TSVs  $\{s_k\}$  to the partitioned groups  $\{g_j\}$ . The capacity of edge  $(s_k, g_j)$  is infinity. Its cost cost(j, k) is the additional wire length during re-routing when assigning  $s_k$  to  $g_j$ . In other words,  $cost(j, k) = \max_{\forall f_i \in g_j} L(f_i, s_k)$ , where  $L(f_i, s_k)$  is the Euclidean distance between  $f_i$  and  $s_k$ .
- Every group  $g_j$  connects to the sink node T, where each edge  $(g_j, T)$  has a capacity of  $N_{gs}$  and a cost of 0.

The solution of this min-cost flow problem gives the assignment of placed s-TSVs to the partitioned groups that is optimal in terms of additional re-routing wire length, and it can be obtained in polynomial time [16].
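The network described above can be sketched in a few dozen lines. The code below is an illustrative, standard-library-only implementation that uses a textbook successive-shortest-path solver with Bellman-Ford; the source does not say which solver is used, so this choice, the class name `MinCostFlow`, the function name `assign_spares`, and the coordinates in the example are all assumptions.

```python
import math


class MinCostFlow:
    """Successive-shortest-path min-cost flow using Bellman-Ford."""

    def __init__(self, n):
        self.n = n
        # Each edge is [to, residual capacity, cost, index of reverse edge].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t, demand):
        total = 0.0
        while demand > 0:
            dist = [math.inf] * self.n
            prev = [None] * self.n  # (previous node, edge index) on the path
            dist[s] = 0.0
            for _ in range(self.n - 1):
                changed = False
                for u in range(self.n):
                    if dist[u] == math.inf:
                        continue
                    for i, (v, cap, cost, _rev) in enumerate(self.graph[u]):
                        if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                            dist[v], prev[v], changed = dist[u] + cost, (u, i), True
                if not changed:
                    break
            if dist[t] == math.inf:
                raise ValueError("not enough placed s-TSVs for the demand")
            # Bottleneck capacity along the shortest path.
            push, v = demand, t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            # Augment along the path and update residual capacities.
            v = t
            while v != s:
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            demand -= push
            total += push * dist[t]
        return total


def assign_spares(groups, spares, n_gs):
    """groups: list of lists of (x, y) f-TSV positions; spares: list of (x, y)."""
    n_s, n_g = len(spares), len(groups)
    src, sink = 0, n_s + n_g + 1
    net = MinCostFlow(n_s + n_g + 2)
    for k, s_k in enumerate(spares):
        net.add_edge(src, 1 + k, 1, 0.0)  # (S, s_k): capacity 1, cost 0
        for j, g_j in enumerate(groups):
            cost = max(math.dist(f_i, s_k) for f_i in g_j)  # cost(j, k)
            net.add_edge(1 + k, 1 + n_s + j, math.inf, cost)  # (s_k, g_j)
    for j in range(n_g):
        net.add_edge(1 + n_s + j, sink, n_gs, 0.0)  # (g_j, T): capacity N_gs
    total = net.solve(src, sink, n_gs * n_g)
    assignment = {j: [] for j in range(n_g)}
    for k in range(n_s):
        for v, _cap, _cost, rev in net.graph[1 + k]:
            if v > n_s and net.graph[v][rev][1] > 0:  # flow on (s_k, g_j)
                assignment[v - 1 - n_s].append(k)
    return assignment, total


# Made-up placement: two groups of two f-TSVs and three candidate spares.
groups = [[(0, 0), (0, 10)], [(100, 0), (100, 10)]]
spares = [(5, 5), (95, 5), (50, 50)]
assignment, total = assign_spares(groups, spares, n_gs=1)
```

For this toy placement each group receives its nearest spare (spare 0 for group 0, spare 1 for group 1) and the far-away third spare is left unused. A production flow would more likely call an existing min-cost flow solver; the hand-written one is used here only to keep the sketch self-contained.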

### D. Discussion

So far, we have assumed that the f-TSV number $N_{gf}$ is the same for each TSV group. However, to target a given MTTF, a non-uniform partitioning with a different $N_{gf}$ for each group can be more efficient, and here we obtain it iteratively using a greedy group-merging algorithm. After conducting the TSV grouping with GR = 1 : $N_{gs}$ initially, we merge the two groups with the highest MTTF in each iteration, and delete the $N_{gs}$ s-TSV(s) that result in the higher delay overhead during s-TSV assignment. The iteration is repeated until the achieved MTTF of the TSV grid decreases to the target one. In this way, we can achieve the same targeted MTTF with fewer s-TSVs, compared to a repair solution with uniform TSV grouping.
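The merging loop can be sketched as follows. This is only an outline under stated assumptions: `group_mttf` is a hypothetical, caller-supplied estimator of a group's MTTF under proactive repair (for instance built on the model of [9]), the grid MTTF is approximated by the minimum group MTTF, and the choice of which $N_{gs}$ s-TSV(s) to delete after each merge is left out.

```python
def greedy_merge(groups, group_mttf, target_mttf):
    """Merge the two longest-lived groups while the grid MTTF stays >= target.

    groups: list of lists of f-TSV names, starting from GR = 1 : N_gs.
    group_mttf: hypothetical estimator mapping a group to its MTTF.
    target_mttf: the MTTF the TSV grid must still reach after merging.
    """
    groups = [list(g) for g in groups]
    while len(groups) > 1:
        # Pick the two groups with the highest estimated MTTF.
        ranked = sorted(groups, key=group_mttf, reverse=True)
        a, b = ranked[0], ranked[1]
        trial = [g for g in groups if g is not a and g is not b] + [a + b]
        # Assumption: the grid MTTF is approximated by its weakest group.
        if min(group_mttf(g) for g in trial) < target_mttf:
            break
        groups = trial  # each accepted merge frees N_gs s-TSV(s)
    return groups


# Toy estimator (made up): MTTF shrinks as more f-TSVs share the spare(s).
toy_mttf = lambda g: 20.0 / (1 + len(g))
merged = greedy_merge([["a"], ["b"], ["c"], ["d"]], toy_mttf, target_mttf=5.0)
```

With the toy estimator, the four singleton groups are merged into two groups of two f-TSVs each; a further merge would push the estimated MTTF below the target, so the loop stops there.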

## V. RUN-TIME TSV REPAIR

After obtaining the TSV groups, the next step is to extend the EM-related lifetime of each EM-vulnerable f-TSV during field-operation. Through the reconfigurable routing network, all non-faulty TSVs in each group (including the assigned s-TSV(s)) can be temporarily deactivated and later reactivated on a rotating basis. Therefore, the signals in each group are routed through a subset of the TSVs, while the rest recover from EM wear-out by exploiting the recovery property.

In order to leverage the recovery effect for EM reliability improvement, each TSV needs to be provided with dedicated shut-off time in the field. Therefore, the signal carried by a deactivated f-TSV must be re-routed to its final destination through another active TSV in the same group to maintain normal operation. In this work, the s-TSV(s) assigned to each group serve(s) as the alternative signal path(s) for the deactivated f-TSV(s), and thus reconfigurable logic for signal path re-routing has to be included within each group.

In order to realize s-TSV sharing and routing reconfiguration, the redundancy scheme proposed in [17] is implemented in each group. Here, a (4 : 2) group is illustrated in Figure 4 as an example, in which two dedicated s-TSVs are assigned to a partitioned group consisting of four f-TSVs. As a symmetric scheme, each group needs to be configured at both the receiver and the transmitter. To this end, reconfiguration circuitry (i.e., MUXes) is added to the two ends of each TSV, and every single input of the group can be selected and transmitted over the dedicated lines provided by the assigned s-TSVs when its original f-TSV is deactivated. In this way, all TSVs (including the assigned s-TSVs) can operate either in active mode or in recovery mode, and transition between them according to a recovery schedule.

In this work, a periodic recovery schedule is used, in which EM recovery can occur at regular time intervals. Consequently, according to the grouping ratio, each repair cycle can be split into multiple sub-cycles with the same duration  $T_{\text{unit}}$ , which is a user-defined parameter. Generally, for a  $(N_{gf} : N_{gs})$  group, the repair cycle of each TSV is divided into  $(N_{gf} + N_{gs})$  sub-cycles, including active time  $T_{\text{active}} = N_{gf}T_{\text{unit}}$  and recovery time  $T_{\text{recovery}} = N_{gs}T_{\text{unit}}$ . In each sub-cycle,  $N_{gs}$  TSV(s) is (are) deactivated for recovery, while the carried signal(s) (if any) will be re-routed through the non-deactivated s-TSV(s).
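As an illustration, one possible periodic schedule is a sliding window that rotates by one TSV per sub-cycle. The source only states that TSVs are deactivated "on a rotating basis", so the particular rotation order, the indexing convention, and the function name `recovery_schedule` below are assumptions; routing constraints of the actual MUX network are ignored.

```python
def recovery_schedule(n_gf, n_gs):
    """Return, for each sub-cycle, the set of TSV indices in recovery mode.

    TSVs are indexed 0 .. n_gf + n_gs - 1 (f-TSVs first, then s-TSVs).
    One repair cycle consists of n_gf + n_gs sub-cycles of length T_unit.
    """
    n = n_gf + n_gs
    # Sliding window of width n_gs that advances by one TSV per sub-cycle.
    return [{(t + d) % n for d in range(n_gs)} for t in range(n)]


# Example: the (4 : 2) group of Figure 4.
schedule = recovery_schedule(4, 2)
recovery_count = [sum(tsv in s for s in schedule) for tsv in range(6)]
```

In this sketch every sub-cycle deactivates exactly $N_{gs}$ TSVs, and every TSV spends $N_{gs}$ of the $N_{gf} + N_{gs}$ sub-cycles in recovery mode, which matches $T_{\text{recovery}} = N_{gs}T_{\text{unit}}$ and $T_{\text{active}} = N_{gf}T_{\text{unit}}$.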

The overhead introduced by the proposed repair solution is analyzed as follows.

• Delay overhead: Since TSV latency is very small [18], the delay overhead is mainly determined by the re-routing wire and the reconfiguration circuitry (i.e., MUXes). Although the former can be minimized by optimal s-TSV assignment based on the given placement of s-TSVs, the inserted MUXes can introduce more significant delay overhead. For an f-TSV $f_i$ in a $(N_{gf} : N_{gs})$ group, its re-routing logic-induced delay overhead is:

$$D(f_i) = \left(1 + \log_2 N_{gf}\right) D_{\text{MUX}_{2\text{-to-1}}} \quad (1)$$



Figure 4: Illustration of the proposed repair architecture for a (4:2) TSV group consisting of 4 f-TSVs and 2 s-TSVs.

Here, $D_{\text{MUX}_{2\text{-to-1}}}$ is the propagation delay of a 2-to-1 MUX. Therefore, it is desirable to partition f-TSVs into smaller groups (i.e., smaller $N_{gf}$) in order to reduce the overhead. Note that, for those EM-vulnerable f-TSVs on the critical paths, their timing slacks can be impacted slightly by the added re-routing logic circuitry and its introduced delay overhead. However, this penalty is unavoidable since the EM-induced timing failures can be more severe without the proposed repair solution.

• Area overhead: The area overhead is dominated by the assigned s-TSVs and added MUXes [10]. After implementing TSV grouping with  $GR = N_{gf} : N_{gs}$ , the total area overhead of all groups can be represented as: <sup>2</sup>

$$A = N_{gs} \left| \mathbf{F} \right| \left[ A_s / N_{gf} + (2 - 1/N_{gf}) A_{\text{MUX}_{2\text{-to-1}}} \right] \quad (2)$$

where $A_s$ is the area of an s-TSV and $A_{\text{MUX}_{2\text{-to-1}}}$ is the area of a 2-to-1 MUX. Therefore, for a fixed $N_{gs}$, it is preferable to partition f-TSVs into larger groups (i.e., larger $N_{gf}$) to reduce the area overhead. Note that, since the recovery schedule can be fixed at design-time, it is unnecessary to control the MUXes from outside. Instead, a small finite-state machine can generate the control signal for each MUX internally, incurring negligible overhead.
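The two overhead expressions pull in opposite directions with respect to $N_{gf}$, which a short numerical sketch makes easy to see. The helper names below are hypothetical, and the MUX delay and area values passed in are placeholders rather than library data.

```python
import math


def logic_delay(n_gf, d_mux):
    """Equation (1): re-routing logic delay of an f-TSV in an (n_gf : n_gs) group."""
    return (1 + math.log2(n_gf)) * d_mux


def area_overhead(n_f, n_gf, n_gs, a_s, a_mux):
    """Equation (2): total area of the assigned s-TSVs and added 2-to-1 MUXes."""
    return n_gs * n_f * (a_s / n_gf + (2 - 1 / n_gf) * a_mux)


def resource_counts(n_f, n_gf, n_gs):
    """Coefficients of A_s and A_MUX in Equation (2): (# s-TSVs, # MUXes)."""
    return n_gs * n_f / n_gf, n_gs * n_f * (2 - 1 / n_gf)


# Delay grows with N_gf (in units of one 2-to-1 MUX delay) ...
delays = {n_gf: logic_delay(n_gf, d_mux=1.0) for n_gf in (2, 4, 8)}
# ... while the number of s-TSVs shrinks (here for |F| = 55 and N_gs = 1).
counts = {n_gf: resource_counts(55, n_gf, 1) for n_gf in (2, 3, 4)}
```

For example, `resource_counts(55, 2, 1)` evaluates to 27.5 s-TSVs and 82.5 MUXes. Table I reports 28 s-TSVs and 82 MUXes for the 2:1 case of des_perf-2, where $|\mathbf{F}| = 55$ is not divisible by $N_{gf}$, so the divisibility assumption behind Equation (2) holds only approximately there.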

## VI. SIMULATION RESULTS

### A. Simulation setup and implementation flow

For our simulations, six 3D benchmark designs selected from OpenCore benchmark suite [19] were used, including  $des\_perf-i$ ,  $cf\_rca\_16-i$ , and  $cf\_fft\_256\_8-i$  (i = 2, 4). Here, *i* is the number of stacked dies in each design. Given the netlist of each design, Cadence SoC Encounter was used to generate layout file using the Nangate 45 nm library [20]. Here, f-TSVs were placed regularly across each die with a 10  $\mu$ m pitch to form a grid [21], and s-TSVs were placed at the edges of the f-TSV grid with the same pitch [11]. For both f-TSVs and s-TSVs, the total TSV cell size including the keep-out zone is 8.4  $\mu$ m, which corresponds to six standard cell rows [22].

Given a grouping ratio  $GR = N_{gf} : N_{gs}$ , the proposed TSV grouping technique was conducted on the generated layout files of each design to obtain TSV groups. Afterwards, based on a periodic recovery schedule with a user-defined  $T_{unit}$ , the EM model proposed in [9] can be used to estimate the MTTF of each group considering transient recovery effect. To this end, the power/thermal characteristics of each TSV in the group need to be generated. After creating a top-level Verilog netlist for the design, post-synthesis simulation was performed in Modelsim with a testbench containing  $10^5$  random input vectors. In this way, the switching activity of each f-TSV can be extracted. Moreover, the generated switching activity interchange format (SAIF) file was forwarded to Power Compiler in order to obtain the power consumption of each cell. Based on this information and layout files, the experienced temperature of each TSV can be estimated using the 3D Hotspot [23].

### B. Impact of $T_{unit}$ and GR on repair solution

There are two user-defined parameters in the proposed approach, namely $T_{unit}$ and GR. In this section, we investigate their impact on the generated repair solution.





1) Impact of  $T_{unit}$ :  $T_{unit}$  is the duration of each sub-cycle in the repair cycle during in-field repair. A larger  $T_{unit}$  implies longer recovery time of deactivated TSVs in each repair cycle, but also indicates more EM degradation of the TSVs operating in active mode. Here we present the impact of  $T_{unit}$  on the generated repair solution in terms of achieved MTTF.

The experiments were conducted on both $des\_perf-2$ and $cf\_fft\_256\_8-2$ with GR = 3 : 1, and Figure 5 illustrates the relationship between the achieved MTTF and $T_{unit}$. As shown, for both benchmarks, a repair solution with a short $T_{unit}$ (e.g., $10^{-3}$ s) is incapable of fully exploiting the EM recovery effect, which results in an extremely short lifetime. As $T_{unit}$ increases, the generated repair solution strikes a balance between recovery and degradation in each repair cycle, and this balance is reached at a different $T_{unit}$ for each benchmark. However, as $T_{unit}$ increases further, the balance is disturbed because the EM degradation in each repair cycle can no longer be compensated by the recovery effect. As a result, the achieved MTTF decreases and finally saturates.

2) Impact of GR: The grouping ratio $GR = N_{gf} : N_{gs}$ denotes the ratio between the number of f-TSVs and s-TSVs in each group. On the one hand, for a fixed $N_{gs}$, partitioning with a smaller $N_{gf}$ leads to better EM recovery in each repair cycle and a smaller delay overhead introduced by the re-routing logic circuitry (as discussed in Section V), but also results in higher area overhead according to Equation (2). On the other hand, for the same GR, different values of $N_{gs}$ can also impact the generated repair solution in terms of MTTF and overhead.

In order to evaluate the impact of GR, the experiment was performed on des\_perf-2 with $T_{unit} = 0.1$ s. First, for a fixed $N_{gs} = 1$, $N_{gf}$ was varied from 2 to 4, and the achieved MTTF and the corresponding overhead can be obtained using the proposed approach with different GR. Afterwards, three different cases with GR = 2 : 1, 4 : 2, 6 : 3 were considered, in which $N_{gs}$ was varied from 1 to 3 but GR always equals 2. For all the cases, we report the achieved MTTF and the overhead in both area and delay. Here the area overhead is presented in terms of the number of assigned s-TSVs and added MUXes. Since the re-routing wire-induced delay overhead is highly dependent on the given placement of s-TSVs, we only focus on the logic-induced delay overhead here, and report the average value of all the groups.

<sup>2</sup>Here we assume that $|\mathbf{F}|$ is divisible by $N_{gf}$.

Table I: Trade-off analysis between the achieved MTTF and overhead for different grouping ratio. Here delay overhead is the average value of all groups in the design.

| Grouping ratio | MTTF (yrs) | Delay overhead (ps) | Area overhead (# TSV) | Area overhead (# MUX) |
|----------------|------------|---------------------|-----------------------|-----------------------|
| 2:1            | 11.81      | 141.61              | 28                    | 82                    |
| 3:1            | 10.92      | 180.34              | 19                    | 91                    |
| 4:1            | 9.69       | 214.13              | 14                    | 96                    |
| 4:2            | 12.74      | 214.13              | 28                    | 192                   |
| 6:3            | 13.67      | 240.06              | 30                    | 300                   |
| non-uniform    | 10         | 198.67              | 16                    | 94                    |

The results are listed in Table I. As shown, for a fixed $N_{gs}$, partitioning fewer f-TSVs into each group (i.e., a smaller $N_{gf}$) achieves a higher MTTF with a smaller re-routing logic-induced delay overhead, but also results in a larger area overhead. Note that reducing the number of s-TSVs saves far more area than the additional MUXes cost. Moreover, for the same GR, assigning more s-TSVs to each group provides a longer recovery time in each repair cycle, which improves lifetime reliability more significantly. However, the penalty is increased area and delay overhead. In addition, as discussed in Section IV-D, when targeting a given MTTF, a repair solution using non-uniform TSV grouping can achieve a better trade-off between reliability improvement and area overhead. According to our simulation results, a non-uniform solution with $N_{gs} = 1$ can achieve a 10-year MTTF with lower area overhead compared to the uniform ones with GR = 2 : 1 and 3 : 1.

### C. Comparison with prior work

We compare our proposed proactive repair approach with the conventional reactive one [3], and the results in terms of achieved MTTF are listed in Table II. As shown, the proposed approach can increase the MTTF of the TSV grid by up to 12X relative to the reactive method. Moreover, the number of EM-vulnerable f-TSVs $|\mathbf{F}|$ and the area overhead rate $\Delta A$ (the percentage of area introduced by s-TSVs and MUXes with respect to total chip area) are also listed in the table. According to our results, the area overhead introduced by the proposed repair solution is small and can become negligible for a large design. Note that, since we assume that the same reconfiguration network is used in both the proactive and reactive approaches, the proposed technique does not increase delay and area overheads compared to the baseline, but can achieve a higher MTTF.
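The improvement factor can be recomputed directly from the MTTF columns of Table II. The short sketch below does so for a handful of rows copied from the table; the dictionary layout and the names `table_ii_rows` and `improvement` are illustrative choices only.

```python
# (proactive MTTF, reactive MTTF) in years, copied from selected rows of Table II.
table_ii_rows = {
    ("des_perf-2", "2:1"): (11.81, 1.62),
    ("des_perf-2", "3:1"): (10.92, 0.98),
    ("des_perf-4", "4:1"): (8.65, 0.71),
    ("cf_rca_16-2", "2:1"): (9.71, 3.51),
}


def improvement(proactive, reactive):
    """MTTF improvement factor of the proactive over the reactive approach."""
    return proactive / reactive


ratios = {key: improvement(*mttf) for key, mttf in table_ii_rows.items()}
best_case = max(ratios, key=ratios.get)
```

Among these rows, the largest factor is obtained for des_perf-4 with GR = 4 : 1 (8.65 / 0.71, slightly above 12), in line with the "up to 12X" figure, whereas a design such as cf_rca_16-2 with GR = 2 : 1 gains a factor of less than 3.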

## VII. CONCLUSION

In this paper, we propose a proactive repair approach to combat electromigration (EM) in TSVs by making use of the EM recovery effect. Applied to 3D benchmark designs, our proactive approach improves the lifetime reliability of TSVs susceptible to EM failure by up to 12X over the conventional reactive one, with less area overhead. While our methodology with even simple recovery scheduling significantly improves TSV lifetime reliability, more sophisticated

Table II: Comparison between the proposed proactive approach and the conventional reactive approach [3]. Here  $|\mathbf{F}|$  is the number of EM-vulnerable f-TSVs in each design, and  $\Delta A$  is the percentage of area introduced by repair solution with respect to total chip area.

| Benchmark      | $ \mathbf{F} $ | Grouping ratio | $\Delta A \ (\%)$ | MTTF (yrs) |              |
|----------------|----------------|----------------|-------------------|------------|--------------|
|                |                |                |                   | Proactive  | Reactive [3] |
| des_perf-2     | 55             | 2:1            | 10.71             | 11.81      | 1.62         |
|                |                | 3:1            | 7.71              | 10.92      | 0.98         |
|                |                | 4:1            | 6.11              | 9.69       | 0.87         |
| cf_rca_16-2    | 87             | 2:1            | 4.83              | 9.71       | 3.51         |
|                |                | 3:1            | 3.40              | 9.14       | 3.32         |
|                |                | 4:1            | 2.67              | 8.93       | 2.94         |
| cf_fft_256_8-2 | 235            | 2:1            | 6.00              | 8.51       | 2.21         |
|                |                | 3:1            | 4.25              | 8.13       | 2.04         |
|                |                | 4:1            | 3.35              | 7.78       | 1.78         |
| des_perf-4     | 183            | 2:1            | 11.01             | 9.55       | 1.23         |
|                |                | 3:1            | 7.90              | 9.01       | 0.81         |
|                |                | 4:1            | 6.28              | 8.65       | 0.71         |
| cf_rca_16-4    | 218            | 2:1            | 5.65              | 8.93       | 3.03         |
|                |                | 3:1            | 3.99              | 8.21       | 2.89         |
|                |                | 4:1            | 3.13              | 7.99       | 2.72         |
| cf_fft_256_8-4 | 314            | 2:1            | 4.55              | 8.01       | 2.11         |
|                |                | 3:1            | 3.21              | 7.75       | 1.57         |
|                |                | 4:1            | 2.52              | 7.03       | 1.42         |

recovery scheduling can be studied to further enhance EM reliability, which could be considered as part of future work.

### References

- [1] W. R. Davis, et al. Demystifying 3D ICs: the pros and cons of going vertical. *IEEE Design & Test of Computers*, 22(6):498–510, 2005.
- [2] T. Frank, et al. Reliability of TSV interconnects: Electromigration, thermal cycling, and impact on above metal level dielectric. *Microelectronics Reliability*, 53(1):17–29, 2013.
- [3] L. Jiang, et al. On effective and efficient in-field TSV repair for stacked 3D ICs. In *DAC*, 2013.
- [4] C. Serafy and A. Srivastava. Online TSV health monitoring and built-in self-repair to overcome aging. In *DFTS*, 2013.
- [5] X. Huang, et al. Electromigration recovery modeling and analysis under time-dependent current and temperature stressing. In *ASP-DAC*, 2016.
- [6] K.-D. Lee. Electromigration recovery and short lead effect under bipolar- and unipolar-pulse current. In *IRPS*, 2012.
- [7] M. Lin and A. Oates. AC and pulsed-DC stress electromigration failure mechanisms in Cu interconnects. In *IITC*, 2013.
- [8] J. Pak, et al. Electromigration-aware routing for 3D ICs with stress-aware EM modeling. In *ICCAD*, 2012.
- [9] T. Kim, et al. Dynamic reliability management for near-threshold dark silicon processors. In *ICCAD*, 2016.
- [10] L. Jiang, et al. On effective through-silicon via repair for 3D-stacked ICs. *TCAD*, 32(4):559–571, 2013.
- [11] J. Xie, et al. Yield-aware time-efficient testing and self-fixing design for TSV-based 3D ICs. In *ASP-DAC*, 2012.
- [12] M. Nicolaidis, et al. Through-silicon-via built-in self-repair for aggressive 3D integration. In *IOLTS*, 2012.
- [13] A.-C. Hsieh and T. Hwang. TSV redundancy: architecture and design issues in 3D IC. *TVLSI*, 20(4):711–722, 2012.
- [14] W. Michiels, et al. Performance ratios for the differencing method applied to the balanced number partitioning problem. In *STACS*, 2003.
- [15] F. Ye and K. Chakrabarty. TSV open defects in 3D integrated circuits: Characterization, test, and optimal spare allocation. In *DAC*, 2012.
- [16] J. Cong and Y. Zhang. Thermal-driven multilevel routing for 3D ICs. In *ASP-DAC*, 2005.
- [17] M. Laisne, et al. Systems and methods utilizing redundancy in semiconductor chip interconnects. US Patent 8384417, 2013.
- [18] Y. Xie, et al. *Three-dimensional integrated circuit design*. Springer, 2010.
- [19] Open Cores Standard. [Online]. http://opencores.org/.
- [20] Nangate. http://www.nangate.com/.
- [21] G. Van der Plas, et al. Design issues and considerations for low-cost 3D TSV IC technology. *JSSC*, 46(1):293–307, 2011.
- [22] B. Noia and K. Chakrabarty. Pre-bond probing of TSVs in 3D stacked ICs. In *ITC*, 2011.
- [23] J. Meng, et al. Optimizing energy efficiency of 3D multicore systems with stacked DRAM under power and thermal constraints. In *DAC*, 2012.