# Thermal-aware TSV Repair for Electromigration in 3D ICs

Shengcheng Wang\*, Mehdi B. Tahoori\* and Krishnendu Chakrabarty<sup>†</sup>

\*Chair of Dependable Nano Computing (CDNC), Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

{shengcheng.wang, mehdi.tahoori}@kit.edu

<sup>†</sup>Department of ECE, Duke University, Durham, NC, USA

{krish}@ee.duke.edu

Abstract—Electromigration (EM) occurrence on through-silicon-vias (TSVs) is a major reliability concern for Three-Dimensional Integrated-Circuits (3D ICs), and EM can severely reduce the mean-time-to-failure (MTTF). In this work, a novel fault tolerant technique is proposed to increase the MTTF of the functional TSV network through the assignment of spare TSVs to EM-vulnerable functional TSVs. The objective is to meet the target MTTF with minimum spare TSVs and minimal impact on the circuit timing. By considering the impact of temperature variation, the proposed technique provides a more robust repair solution for EM-induced TSV defects with minimum delay overhead, compared to previous thermal-unaware methods.

#### I. INTRODUCTION

Three-dimensional integrated circuits (3D ICs) promise to overcome interconnect bottlenecks in CMOS scaling [1]. By utilizing vertical connections between stacked dies (i.e., through-silicon-vias (TSVs) [2], [3]), 3D ICs can provide abundant interconnect bandwidth with improved performance and lower communication-energy consumption [4]. However, concerns about reliability constitute one of the key obstacles to the widespread adoption of TSV-based 3D IC technology in industry [5], [6]. One of these concerns is the occurrence of electromigration (EM) on TSVs (referred to as TSV EM in this paper) [7], [8].

EM refers to the transport of material due to the movement of electrons. This phenomenon can be attributed to various factors, such as geometrical shapes, current density, temperature distribution, mechanical stress and material properties [9]. EM on a wire leads to the accumulation of more atoms at the end of the wire towards the source pin creating "hillocks", while "voids" appear at the other end of the wire. Eventually, these defects lead to open- and/or short-circuit failures.

In 3D ICs, TSV EM becomes even more critical for the following reasons. First, compared to their 2D counterparts, current densities of 3D ICs are much higher as integration densities increase. Hence TSVs carry a large amount of current. Second, temperature variations, especially inter-die variation, can be significant due to the inefficient heat dissipation of 3D stacked dies. Third, the large difference in coefficients of thermal expansion (CTE) between the TSV filling material (e.g., copper) and the surrounding silicon results in thermomechanical stress in the TSV. For these reasons, TSV EM-related reliability issues in 3D ICs are more critical, and it is imperative to mitigate and repair EM-induced defects in TSVs.

A number of EM mitigation techniques for 2D interconnects have been proposed in the literature [10], [11], [12], [13]. However, due to the different characteristics and fabrication mechanisms of 2D interconnects compared to TSVs, these techniques cannot be applied directly to TSV EM [14]. On the other hand, most of the existing works on TSV EM focus on EM modeling, except [14] and [15]. In [14], a technique was proposed to mitigate TSV EM by balancing the direction of current flow within the TSV. In [15], a reconfigurable in-field repair solution was proposed to tolerate latent EM-induced TSV defects. However, both of these solutions ignored the impact of temperature on EM reliability, which results in inaccurate MTTF evaluation and ineffective repair. According to [16], as the temperature increases from 300 K to 400 K, the MTTF of a single TSV reduces by a factor of $10^3$. Therefore, the impact of temperature must be taken into account during TSV EM mitigation and repair.

In this paper, an efficient thermal-aware TSV EM repair technique is proposed to overcome the above limitation. The basic idea is to assign spare TSVs (referred to as spares in this paper) to functional TSVs (referred to as f-TSVs in this paper) in order to meet the target MTTF of the f-TSV network. The proposed technique consists of two stages:

- *EM-vulnerable f-TSV identification*: EM-vulnerable f-TSVs, which dominate the MTTF of the f-TSV network, are identified in this stage. The objective is to minimize the number of EM-vulnerable f-TSVs, while satisfying the target MTTF of the f-TSV network.
- *Spare assignment*: In this stage, one spare is assigned to each EM-vulnerable f-TSV. The mapping between the EM-vulnerable f-TSVs and the assigned spares is determined in a way that the signal re-routing delay overhead is minimized.

Our simulation results demonstrate the following:

- The proposed technique can increase the MTTF of the f-TSV network by up to 32% with the same number of spares, compared to a random spare assignment.
- To target the same MTTF of the f-TSV network, the previous thermal-unaware techniques underestimate the number of assigned spares and the re-routing delay overhead by up to 17% and 12%, respectively, which results in infeasible repair solutions under a realistic temperature distribution.

To the best of our knowledge, this is the first work on thermal-aware TSV repair for EM-induced defects.

The rest of this paper is organized as follows. Related prior works and preliminaries are presented in Section II and Section III, respectively. Section IV describes the two-stage thermal-aware TSV repair methodology in detail. In Section V, we report simulation results. Finally, conclusions are drawn in Section VI.

#### II. RELATED PRIOR WORK

A number of techniques have been proposed to address EM robustness in 3D ICs, which can be divided into two categories. One category focused on EM modeling and its impact on electrical and mechanical properties of TSVs. In [8] and [17], finite element analysis (FEA) was used to study EM on a single TSV. However, since FEA is extremely memory- and time-consuming, it cannot be used for chip-level analysis. In [18], a look-up table that exhaustively investigates the impacts of stress, current, and temperature on the EM of metal wire was set up. Based on this information, a chip-scale EM analysis was performed. Nevertheless, only the EM occurrence on planar metal wire was considered in this work.

The other category studied EM robustness from the perspective of 3D IC design. In [19], an EM-aware clock tree synthesis design flow was proposed for 3D ICs. However, only the f-TSVs located in the clock tree were considered. In [14], TSV EM was mitigated by some on-line circuit modules. Through balancing current flow within TSVs, EM can be compensated by current in opposite directions [20]. In [15], a reconfigurable in-field repair solution was proposed. By using spares, the path delay faults introduced by EM-induced TSV resistance change were repaired. However, both of these TSV EM mitigation and repair techniques ignored the impact of temperature during the evaluation of TSV MTTF. In this case, infeasible or ineffective solutions are likely to be generated under realistic temperature distribution, and hence the target MTTF of the f-TSV network cannot be guaranteed. Therefore, a thermal-aware TSV EM repair technique is imperative.

## III. PRELIMINARIES

## A. TSV MTTF calculation

MTTF provides a measure of the expected lifetime of a component in a circuit. For TSV, the MTTF can be calculated as [21]:

$$MTTF_{\rm TSV} = A \cdot J^{-n} \cdot \exp\left(\frac{Q}{k \cdot T}\right) \tag{1}$$

where A is a constant that depends on the TSV fabrication technology, J is the current density of the TSV, n is the current-density exponent, Q is the EM activation energy, k is Boltzmann's constant and T is the temperature. For the copper electroplating process, n is equal to 1.1 [22]. Moreover, J can be expressed as follows:

$$J = \frac{C \cdot V_{dd}}{S} \cdot f \cdot p \tag{2}$$

where C is the capacitance of the TSV,  $V_{dd}$  is the supply voltage, S is the cross-sectional area of TSV, f is clock frequency, and p is the switching activity of the signal carried by the TSV. From Equation (1) and Equation (2), we can observe that  $MTTF_{TSV}$  is strongly dependent on temperature and switching activity.
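As a rough illustration of Equations (1) and (2), the following minimal sketch evaluates a relative TSV MTTF for two temperatures. All numeric values (TSV capacitance, supply voltage, TSV radius, clock frequency, switching activity, and in particular the activation energy `q_ev = 0.9` and the constant `a = 1.0`) are placeholder assumptions chosen for illustration; they are not taken from this paper or its references.

```python
import math

K_BOLTZMANN_EV = 8.617333e-5  # Boltzmann's constant in eV/K

def current_density(c_tsv, vdd, area, freq, activity):
    """Equation (2): J = (C * Vdd / S) * f * p."""
    return c_tsv * vdd / area * freq * activity

def tsv_mttf(j, temp_k, a=1.0, n=1.1, q_ev=0.9):
    """Equation (1): MTTF = A * J^(-n) * exp(Q / (k * T)).

    a and q_ev are assumed placeholder values, so only ratios of the
    returned numbers are meaningful.
    """
    return a * j ** (-n) * math.exp(q_ev / (K_BOLTZMANN_EV * temp_k))

# Hypothetical TSV: 50 fF, 1.1 V, 2.5 um radius, 1 GHz clock, activity 0.2.
j = current_density(c_tsv=50e-15, vdd=1.1, area=math.pi * (2.5e-6) ** 2,
                    freq=1e9, activity=0.2)

# Same signal, two temperatures 10 K apart.
temperature_ratio = tsv_mttf(j, 320.0) / tsv_mttf(j, 330.0)

# Same temperature, doubled switching activity (J doubles).
activity_ratio = tsv_mttf(2 * j, 320.0) / tsv_mttf(j, 320.0)

print(f"MTTF(320 K) / MTTF(330 K) = {temperature_ratio:.2f}")
print(f"MTTF(2p) / MTTF(p)        = {activity_ratio:.3f}")
```

Under these assumed parameters, a 10 K difference changes the MTTF by a larger factor than doubling the switching activity, which is the kind of sensitivity that motivates a thermal-aware evaluation.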

For 3D ICs, the MTTF of the f-TSV network,  $MTTF_{network}$ , can be calculated as a series system using the classical analytical technique described in [23]:

$$MTTF_{\text{network}} = \frac{1}{\sum_{i} (1/MTTF_{f_i})} \tag{3}$$

where  $MTTF_{f_i}$  is the MTTF value of the  $i^{th}$  f-TSV in the network.

At time zero, all the f-TSVs are EM-fault-free. However, during field operation, f-TSVs with lower MTTF may fail due to EM vulnerability. Therefore, such *EM-vulnerable* f-TSVs will limit the MTTF of the f-TSV network. To solve this problem, a promising method is to assign spares for them, as shown in Figure 1. By inserting MUX and DEMUX, an EM-vulnerable f-TSV can be replaced by the spare(s) once it becomes faulty due to EM-induced defects. Assuming that we assign a spare  $s_j$  to the f-TSV  $f_i$ , then the MTTF of  $f_i$  becomes:

$$MTTF_{f_i}^{\text{Enhnc}} = MTTF_{f_i} + MTTF_{s_j} \tag{4}$$

where $MTTF_{f_i}^{\text{Enhnc}}$ is the enhanced MTTF of $f_i$ after being assigned $s_j$. Note that $s_j$ will have the same current density as $f_i$ after TSV repair since they carry the same signal. However, they experience different temperatures due to their different locations in the layout.
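A small worked example may help to see how Equations (3) and (4) interact. The sketch below uses three f-TSVs with made-up normalized MTTF values and one spare; the function names are illustrative only.

```python
def network_mttf(mttfs):
    """Equation (3): MTTF of the f-TSV network as a series system."""
    return 1.0 / sum(1.0 / m for m in mttfs)

def enhanced_mttf(mttf_f, mttf_s):
    """Equation (4): MTTF of an f-TSV after a spare is assigned to it."""
    return mttf_f + mttf_s

# Hypothetical normalized MTTF values of three f-TSVs.
f_mttfs = [10.0, 20.0, 40.0]
before = network_mttf(f_mttfs)

# Assign a spare with (assumed) MTTF 10 to the weakest f-TSV.
repaired = [enhanced_mttf(f_mttfs[0], 10.0)] + f_mttfs[1:]
after = network_mttf(repaired)

print(f"network MTTF before repair: {before:.3f}")
print(f"network MTTF after repair:  {after:.3f}")
```

The network MTTF is always lower than that of its weakest f-TSV, so repairing the weakest element gives a visible gain even though the other two f-TSVs are untouched.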

## B. TSV placement styles

In 3D IC design, there are two different f-TSV placement styles: regular placement and irregular placement [24]. In regular placement, f-TSVs are placed at regular grid-like sites over the die area, and are considered as placement obstacles when logic cells (IP blocks) are placed in the 3D placement stage. In contrast, in irregular placement the f-TSVs are added to the 3D netlist as TSV cells, and then placed simultaneously with the logic cells (IP blocks) during 3D placement. In this work, we assume that the f-TSVs are placed in a regular way. However, the proposed technique can be easily extended to designs with irregular TSV placement.



(a) Operation when all f-TSVs are EM-robust.



(b) Operation when the third f-TSV is EM-vulnerable.

Figure 1: Illustration of MTTF enhancement by assigning a spare to an EM-vulnerable f-TSV.

For the above two placement styles, the planning of spares is realized at different stages of the design flow. For the irregular case, since the locations of f-TSVs and logic cells need to be determined simultaneously, spares have to be inserted after the placement stage. For the regular case, since the locations of f-TSVs are determined before logic cells, spares and the supporting infrastructure (i.e., MUXes and DEMUXes) can be inserted right after f-TSV planning but prior to the placement of the logic cells and detailed routing [25]. In this case, there are two possible spare placement methods [26], as shown in Figure 2. One is to place them at the edges of the f-TSV grid with multiple rings [27]<sup>1</sup>, and the other one is to place them in a regularly spaced array among the f-TSVs.



(a) Spares placed at the edges of the TSV block. (b) Spares placed in a regularly spaced array.

Figure 2: Illustration of two different approaches to place spares for regular f-TSV placement.

## IV. THERMAL-AWARE TSV EM REPAIR METHODOLOGY

#### A. Problem statement

In this work, we solve the thermal-aware TSV EM repair by assigning spares to EM-vulnerable f-TSVs. The formal problem statement is as follows:

- **Input**: i) A 3D IC design where all f-TSVs and spares are already placed; ii) the target MTTF of the f-TSV network.
- **Output**: An optimal repair solution, i.e., the mapping between EM-vulnerable f-TSVs and the assigned spares.
- **Objective**: Minimize the number of assigned spares and the rerouting delay overhead induced by the generated repair solution.
- **Constraints**: i) The MTTF of the f-TSV network should meet the target MTTF with the generated repair solution; ii) After a spare is assigned, the additional delay due to re-routing of each EM-vulnerable f-TSV should be below an upper limit.

We propose a two-step optimization methodology to solve this problem, as detailed below.

<sup>1</sup>Note that we only show the first spare ring in Figure 2(a).

#### B. Overview of the optimization method

The proposed methodology can be divided into two parts: 1) EM-vulnerable f-TSV identification; 2) spare assignment. However, since MTTF is the key metric of TSV EM, the MTTF of each f-TSV and spare must be calculated first. Thus, the basic flow of the proposed methodology is as follows. Given a 3D IC design, the MTTF of each f-TSV and spare is calculated during the TSV MTTF calculation step. According to the calculated MTTF, EM-vulnerable f-TSVs are identified next, under constraint i) in the problem statement. The objective is to minimize the number of assigned spares. We use an iterative method to solve this problem. Afterwards, the assignment of spares to EM-vulnerable f-TSVs is performed under constraint ii). The objective of this step is to minimize the average re-routing delay overhead introduced by the TSV repair solution. Next, we discuss the three steps in more detail.

#### C. TSV MTTF Calculation

The objective of this step is to calculate the MTTF of each f-TSV and spare. To this end, the application profile (i.e., the switching activity of the signal transferred by each f-TSV) along with the thermal profile should be taken into account. Using Equation (1) and Equation (2), the MTTF of each f-TSV and spare can be calculated. The overall flow is illustrated in Figure 3.



Figure 3: Overall flow of TSV MTTF calculation.

The inputs are the gate-level netlist and layout information of each die in the 3D design, which can be obtained using commercial tools. Next, post-synthesis simulation is performed. With the input vectors based on the expected circuit functionality, the application profile can be extracted, which includes the switching activity and signal probability of each circuit node and net. Then the application profile is given to a power analysis tool to estimate leakage and dynamic power consumption of each cell. The power information is then forwarded to a temperature estimation tool along with the layout information to estimate the thermal profile of each die. Finally, the MTTF of each f-TSV and spare can be calculated based on the thermal and application profiles.

## D. EM-vulnerable f-TSV Identification

In this step, EM-vulnerable f-TSVs, which dominate the MTTF of the f-TSV network, are identified. The objective is to minimize the number of assigned spares while satisfying the target MTTF of the f-TSV network. Since we assign the same number of spare(s) to each EM-vulnerable f-TSV, this is equivalent to minimizing the number of EM-vulnerable f-TSVs. This problem can be formulated as follows:

- **Input**: i) A set of f-TSVs $\mathbf{F}$; ii) A set of spares $\mathbf{S}$; iii) The target MTTF of the f-TSV network $MTTF_{target}$.
- **Output**: The subset of EM-vulnerable f-TSVs $\mathbf{F}_{vulnerable} \subseteq \mathbf{F}$.
- **Objective**: Minimize $|\mathbf{F}_{vulnerable}|$.
- **Constraint**: The MTTF of the f-TSV network must satisfy $MTTF_{network} \ge MTTF_{target}$.

Here, we solve this problem iteratively. The basic idea is to assign spare(s) to the f-TSV with the lowest MTTF value in each iteration. The reason is as follows. Assume that a spare $s_j$ with MTTF $MTTF_{s_j}$ is assigned to an f-TSV $f_i \in \mathbf{F}$ with MTTF $MTTF_{f_i}$; then the MTTF of $f_i$ becomes $MTTF_{f_i} + MTTF_{s_j}$ according to Equation (4). In this case, $MTTF_{network}$ is increased by $\Delta MTTF_{\text{network}}$, which is calculated using Equation (3):

$$\Delta MTTF_{\text{network}} = \frac{MTTF_{\text{network}}^2}{MTTF_{f_i}^2/MTTF_{s_j} + MTTF_{f_i} - MTTF_{\text{network}}} \tag{5}$$

Note that within a given iteration, in which $MTTF_{network}$ is fixed and $MTTF_{s_j}$ is set to its worst-case value, the only variable in Equation (5) is $MTTF_{f_i}$. Therefore, to increase $MTTF_{network}$ efficiently (i.e., maximize $\Delta MTTF_{network}$ in each iteration), we should select the $f_i$ with the lowest MTTF in each iteration, and assign spare(s) to it. These selected f-TSVs are the so-called "EM-vulnerable" f-TSVs.
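For completeness, Equation (5) can be recovered from Equations (3) and (4) in a few steps. The shorthand symbols $N = MTTF_{network}$, $N'$ (the network MTTF after repair), $M_f = MTTF_{f_i}$, $M_s = MTTF_{s_j}$ and $d$ are introduced only for this derivation.

$$
\begin{aligned}
\frac{1}{N'} &= \frac{1}{N} - \frac{1}{M_f} + \frac{1}{M_f + M_s} = \frac{1}{N} - d, \qquad d = \frac{M_s}{M_f\,(M_f + M_s)}, \\
\Delta MTTF_{\text{network}} &= N' - N = \frac{N}{1 - N d} - N = \frac{N^2 d}{1 - N d} = \frac{N^2}{1/d - N} = \frac{N^2}{M_f^2/M_s + M_f - N}.
\end{aligned}
$$

For fixed $N$ and $M_s$, the denominator grows monotonically with $M_f$, so the gain is largest for the f-TSV with the lowest MTTF.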

To identify them from all the f-TSVs, an iterative method is used, as shown in Algorithm 1. The identification is performed until $MTTF_{network} \geq MTTF_{target}$. In each iteration, the f-TSV with the lowest MTTF that has not yet been identified as EM-vulnerable is selected and assigned k spare(s). In this case, one (k + 1)-to-1 MUX and one 1-to-(k + 1) DEMUX are also required for TSV repair. Here k is equal to 1. However, we can assign multiple spares to each EM-vulnerable f-TSV by increasing k. Note that since the location of each assigned spare will be determined in the next step, its exact MTTF cannot be obtained during this step due to the temperature dependence of MTTF. Therefore, all the assigned spares are assumed to experience the highest temperature among all the spares. In this way, we can guarantee that $MTTF_{network} \ge MTTF_{target}$ is still satisfied during the subsequent spare-assignment step. Finally, the selected f-TSVs constitute $\mathbf{F}_{vulnerable}$. Since we maximize $\Delta MTTF_{network}$ in each iteration, the number of iterations, which is equal to $|\mathbf{F}_{vulnerable}|$, is minimized for a specific $MTTF_{target}$.

Algorithm 1 Iterative method for EM-vulnerable f-TSV identification

Input: $\mathbf{F}$, $\mathbf{S}$, $MTTF_{target}$

Output: $\mathbf{F}_{vulnerable}$

1: $\mathbf{F}_{vulnerable} = \emptyset$;

2: repeat

3: select $f_i \in (\mathbf{F} - \mathbf{F}_{vulnerable})$ with the lowest MTTF;

4: assign one spare to $f_i$;

5: calculate $MTTF_{network}$ according to Equations (3) and (4);

6: $\mathbf{F}_{vulnerable} = \mathbf{F}_{vulnerable} \cup \{f_i\}$;

7: until $MTTF_{network} \ge MTTF_{target}$

8: output $\mathbf{F}_{vulnerable}$;
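Algorithm 1 can be sketched in Python as follows. This is a minimal illustration for k = 1, not the authors' implementation: the function and parameter names are hypothetical, the MTTF values are made up, and, unlike the repeat-until loop of Algorithm 1, the sketch checks the target before assigning the first spare and also reports when the available spares are insufficient.

```python
def network_mttf(mttfs):
    """Equation (3): series-system MTTF of the f-TSV network."""
    return 1.0 / sum(1.0 / m for m in mttfs)

def identify_vulnerable(f_mttf, spare_mttf_worst, mttf_target, num_spares):
    """Greedy identification of EM-vulnerable f-TSVs (k = 1 spare each).

    spare_mttf_worst is the MTTF assumed for every assigned spare,
    i.e. the value at the highest spare temperature.
    Returns (indices of EM-vulnerable f-TSVs, achieved network MTTF).
    """
    effective = list(f_mttf)
    # Visit f-TSVs from lowest to highest MTTF.
    order = sorted(range(len(f_mttf)), key=lambda i: f_mttf[i])
    vulnerable = []
    for i in order:
        if network_mttf(effective) >= mttf_target:
            break
        if len(vulnerable) == num_spares:
            raise ValueError("not enough spares to reach the target MTTF")
        effective[i] = f_mttf[i] + spare_mttf_worst  # Equation (4)
        vulnerable.append(i)
    achieved = network_mttf(effective)
    if achieved < mttf_target:
        raise ValueError("target MTTF unreachable with one spare per f-TSV")
    return vulnerable, achieved

# Hypothetical normalized MTTF values; target is 1.4x the unrepaired network.
f_mttf = [40.0, 10.0, 20.0, 80.0]
target = 1.4 * network_mttf(f_mttf)
vulnerable, achieved = identify_vulnerable(f_mttf, spare_mttf_worst=10.0,
                                           mttf_target=target, num_spares=3)
print(vulnerable, achieved >= target)
```

Sorting once up front is sufficient here because, with one spare per f-TSV, an f-TSV that has already been repaired is never selected again.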

In this work, we assume that all the spares are already placed before TSV repair. Therefore, it is necessary to guarantee that there are sufficient spares to satisfy $MTTF_{target}$. This can be realized by estimating an upper bound on the required number of spares at earlier stages of the design cycle, such as the register-transfer and behaviour levels. To this end, we need to consider the worst-case scenario: all f-TSVs have the lowest possible MTTF, which can be calculated using the highest temperature $T_{max}$ across all the dies and the highest switching activity $p_{\max}$ among the outputs of all the macro blocks.

In [28], an efficient method was proposed to estimate switching activity and power consumption at the register-transfer level (RTL). With this technique, the power consumption and output switching activity of each RTL block can be extracted. After partitioning a 2D design into multiple dies at the behaviour level, this technique is applied to each die to extract its power and application profiles. Using a method similar to the one described in Section IV-C, we can obtain $T_{\rm max}$ and $p_{\rm max}$. Afterwards, the lowest possible MTTF of an f-TSV can be estimated. Note that since the detailed gate-level netlist is unavailable, the planning of f-TSVs and spares is not performed at this stage. The subsequent insertion of TSVs can alleviate the thermal problem by providing additional heat dissipation paths to the heat sink, as the TSV-fill material has a much larger thermal conductivity than silicon. Therefore, this TSV insertion can only reduce $T_{\rm max}$ instead of increasing it. In this case, the number of required spares will decrease, and hence the estimated upper bound still satisfies the requirement.
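One possible way to turn this worst-case reasoning into a number is sketched below. It is an assumption-laden illustration rather than the estimation procedure used in this work: it assumes that every f-TSV and every spare has the same worst-case MTTF `mttf_min` (evaluated at $T_{max}$ and $p_{max}$), that one spare is assigned per f-TSV, and it solves Equations (3) and (4) for the smallest number of repaired f-TSVs that meets the target. The function name and the example values are hypothetical.

```python
import math

def spare_upper_bound(num_ftsv, mttf_min, mttf_target):
    """Smallest number of spares k meeting the target in the worst case.

    Assumes all f-TSVs and spares share the worst-case MTTF mttf_min,
    so each repair doubles the MTTF of one f-TSV (Equation (4)).
    """
    # Sum of 1/MTTF over the unrepaired network (Equation (3)).
    total_rate = num_ftsv / mttf_min
    required_cut = total_rate - 1.0 / mttf_target
    if required_cut <= 0:
        return 0  # target already met without any spare
    # Each repair replaces 1/m by 1/(2m) in the sum.
    cut_per_spare = 1.0 / mttf_min - 1.0 / (2.0 * mttf_min)
    # Small tolerance guards against floating-point round-up.
    k = math.ceil(required_cut / cut_per_spare - 1e-9)
    if k > num_ftsv:
        raise ValueError("target unreachable with one spare per f-TSV")
    return k

# Hypothetical case: 100 f-TSVs, normalized worst-case MTTF 1.0,
# target 1.5x the unrepaired worst-case network MTTF (1.0 / 100).
bound = spare_upper_bound(num_ftsv=100, mttf_min=1.0, mttf_target=1.5 / 100)
print(bound)
```

Under these assumptions, one spare per f-TSV can at most double the network MTTF, which is why the sketch raises an error for more aggressive targets.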

## E. Spare assignment

The objective of this stage is to minimize the re-routing delay overhead incurred after TSV repair by assigning appropriate spares to EM-vulnerable f-TSVs. In this work, we use the additional wire length when a faulty f-TSV is replaced by a spare as a metric to evaluate the re-routing delay overhead, as it is also used in [29]. This problem can be formally stated as follows:

- **Input**: i) A set of EM-vulnerable f-TSVs $\mathbf{F}_{vulnerable}$; ii) A set of spares $\mathbf{S}$.
- **Output**: The mapping between the EM-vulnerable f-TSVs and the assigned spares.
- **Objective**: The average additional wire length of all the EM-vulnerable f-TSVs is minimized after spares are assigned.
- **Constraint**: The additional wire length due to re-routing of each EM-vulnerable f-TSV should be less than the maximum allowable additional wire length $\Delta L_{\text{max}}$, which is a user-defined parameter.

This problem can be treated as a *Minimum Weight Bipartite Matching* problem [30].



Figure 4: Assignment of spares to EM-vulnerable f-TSVs to minimize rerouting additional wire length.

Figure 4 shows the formulation of this spare assignment problem. In this figure, a bipartite graph G = (V, E) consists of a set V of *vertices* and a set E of *edges*, each of which is a pair of vertices. The vertex set V can be partitioned into two disjoint sets F and S. Here, $f_i \in F$ is the vertex representing the $i^{th}$ EM-vulnerable f-TSV to be assigned a spare, and $s_j \in S$ is the vertex representing the $j^{th}$ spare. Note that $\Delta L_{ij}$ is the weight of edge $(i, j) \in E$, which represents the additional wire length when EM-vulnerable f-TSV $f_i$ is replaced by spare $s_j$.

Then this spare assignment problem can be simplified to the problem of finding the minimum weight bipartite matching  $M \subseteq E$  where the weight is given by  $w(M) = \sum_{(i,j) \in M} \Delta L_{ij}$ . However, before starting to solve this problem, the bipartite graph needs to be pre-processed in order to satisfy the following assumptions:

- The graph is *balanced*: |F| = |S|.
- The graph is complete:  $E = F \times S$ .

The reasons for these assumptions are as follows: First, |S| is determined by the given layout information of 3D design, hence it could be greater or less than |F|; Second, due to the constraint of this spare assignment problem, some matchings M are infeasible in which at least one of the edges  $e \in M$  has a weight  $\Delta L_e > \Delta L_{\text{max}}$ .

However, it is always possible to reformulate the problem on a complete balanced bipartite graph in an equivalent way.

In order to balance the graph, dummy vertices are inserted in the partition F, and no matching is affected by this operation. Moreover, if the given graph is not complete, we can insert dummy edges with a very large weight to make it complete. In this case, previously infeasible matchings become feasible, but carry a very large weight. Therefore, a maximum matching in the original graph corresponds to the matching with the smallest number of dummy edges in the new graph, and its optimality only depends on the weights of the original edges.

After the graph pre-processing, we can reformulate the problem as follows: Find a minimum weight complete matching between the two vertex subsets of a given weighted bipartite graph. This problem can be solved by the so-called "Hungarian algorithm", which is *strongly polynomial* [31]. In this algorithm, the number of operations is upper bounded by  $O(n^3)$  where n = |V|.
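The pre-processing and matching steps can be sketched as follows. This is an illustrative implementation of a standard potential-based variant of the Hungarian algorithm, not the code used for the reported results. It accepts a rectangular cost matrix with at least as many spares as EM-vulnerable f-TSVs, which has the same effect as padding F with dummy vertices whose edges all have zero weight. The names `BIG`, `min_weight_matching`, `assign_spares`, and the wire-length values are assumptions made for the example.

```python
BIG = 10**9  # weight of a dummy edge; must exceed any real total wire length

def min_weight_matching(cost):
    """Hungarian algorithm for an n x m cost matrix with n <= m.

    Returns match, where match[i] is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "need at least as many spares as EM-vulnerable f-TSVs"
    inf = float("inf")
    u = [0] * (n + 1)    # row potentials (1-indexed)
    v = [0] * (m + 1)    # column potentials (1-indexed)
    p = [0] * (m + 1)    # p[j]: row matched to column j, 0 if none
    way = [0] * (m + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match

def assign_spares(delta_l, delta_l_max):
    """Map each EM-vulnerable f-TSV (row) to a spare (column)."""
    # Edges that violate the wire-length constraint become dummy edges.
    cost = [[d if d <= delta_l_max else BIG for d in row] for row in delta_l]
    match = min_weight_matching(cost)
    if any(cost[i][j] >= BIG for i, j in enumerate(match)):
        raise ValueError("no feasible assignment under delta_l_max")
    return match

# Hypothetical additional wire lengths (um): 3 f-TSVs x 4 spares.
delta_l = [[120, 310, 90, 200],
           [80, 150, 95, 400],
           [250, 60, 330, 110]]
match = assign_spares(delta_l, delta_l_max=300)
total = sum(delta_l[i][j] for i, j in enumerate(match))
print(match, total)
```

Because every real edge weight is far below `BIG`, a minimum-weight matching uses a dummy edge only when no assignment satisfying the $\Delta L_{max}$ constraint exists, which is what the final check detects.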

## V. SIMULATION RESULTS

## A. Simulation Setup

For our simulations, four benchmark circuits were used. In addition to *des\_perf*, *cf\_rca\_16*, and *cf\_fft\_256\_8*, selected from the OpenCore benchmark suite [32], an artificial benchmark circuit *des\_cf\_fft* was created by combining *des\_perf* and *cf\_fft\_256\_8*. Each design was partitioned in two ways: a 2-die implementation and a 4-die implementation. The simulations were performed on a server with four AMD Opteron 6174 processors and 256 GB RAM.

First, the netlist of each die was extracted using Synopsys Design Compiler. Then Cadence SoC Encounter was used to perform placement and routing for all the dies in each design separately using the Nangate 45 nm standard cell library [33]. In the floorplan, f-TSVs were placed regularly across each die with a 10 $\mu m$ pitch to form a grid [34], in which spares were placed at the edges with the same pitch. The TSVs have a keep-out zone with a width of four times that of the minimum-sized inverter and the height of a standard cell [35]. The maximum allowable additional wire length $\Delta L_{max}$ is set to 300 $\mu m$ for all the test cases [29].

After creating a top-level Verilog netlist that instantiates the design for each die, post-synthesis simulation was performed in Modelsim with a testbench containing  $10^5$  random input vectors. After the switching activity interchange format (SAIF) file was extracted, it was forwarded to Power Compiler to obtain the power consumption of each cell. Finally, 3D Hotspot [36] was used for temperature estimation based on this information, in which the configuration setting was the same as [37].

#### B. Temperature Variation and MTTF Distribution

Significant temperature variation (both intra-die and inter-die) is a major issue in 3D ICs. According to Equation (1), the EM-related MTTF of TSV is strongly dependent on its temperature. Therefore, it is imperative to consider temperature variation during MTTF evaluation.

The maximum and minimum steady-state temperatures for each die are reported in Table I for the two implementations (2-die and 4-die) of benchmark *des\_perf*. Moreover, the distributions of TSV MTTF for the two cases are illustrated in the histogram of Figure 5. In this figure, the x-axis represents the MTTF value of each f-TSV, $MTTF_{f_i}$, normalized to the lowest MTTF value among all the f-TSVs of each design; the y-axis represents the percentage of f-TSVs in each category.

For the 2-die case, all the f-TSVs are placed in the same die. Hence only intra-die temperature variation affects MTTF, which is not significant, as shown in Table I. Therefore, the MTTF of each f-TSV is dominated by switching activity, and the distribution is nonuniform to some degree. However, for the 4-die case, since the f-TSVs are located in different dies, inter-die temperature variation, which is very significant, plays an important role in TSV MTTF. Combined with the impact of switching activity, this yields a more uniform and wider-ranging MTTF distribution, as shown in Figure 5.

Table I: Steady-state temperature comparison between 2-die and 4-die cases of *des\_perf*

| Implementation | Die  | $T_{max}(^{\circ}C)$ | $T_{\min}(^{\circ}C)$ | $T_{avg}(^{\circ}C)$ |
|----------------|------|----------------------|-----------------------|----------------------|
| 2-die          | Die0 | 42.12                | 41.91                 | 42.01                |
|                | Die1 | 39.11                | 38.99                 | 39.04                |
| 4-die          | Die0 | 57.88                | 57.76                 | 57.82                |
|                | Die1 | 55.16                | 55.02                 | 55.09                |
|                | Die2 | 50.36                | 50.23                 | 50.29                |
|                | Die3 | 43.90                | 43.97                 | 43.84                |



Figure 5: The distribution of TSV MTTF for two implementations of benchmark *des\_perf*: 2-die and 4-die.

## C. The Necessity of EM-vulnerable f-TSV Identification

Here we show the necessity of EM-vulnerable f-TSV identification with an experiment performed on the benchmark *des\_perf* with the two-die implementation. Note that $MTTF_{target} = \alpha \times MTTF_{worst}$, where $MTTF_{worst}$ is the worst MTTF of the TSV network without any spare for TSV repair and $\alpha$ is the target enhancement factor. To satisfy different values of $MTTF_{target}$, spares were assigned to f-TSVs by using two different methods:

- *EM-vulnerable assignment*: spares were assigned to EM-vulnerable f-TSVs, which were identified using the proposed algorithm in Section IV-D.
- *Random assignment*: spares were assigned to all the f-TSVs randomly.

First, for a range of  $\alpha$  from 1 to 1.8, EM-vulnerable assignment was performed and the number of assigned spares was obtained for each  $MTTF_{\text{target}}$ . Next, with the same number of spares, random assignment was performed, and the increased MTTF of the f-TSV network was calculated using Equations (3)-(4).

The comparison is illustrated in Figure 6. As shown, the EM-vulnerable assignment achieves a higher $MTTF_{network}$ with the same number of assigned spares, compared to the random assignment. According to the simulation results, this improvement can be up to 32%. Therefore, it is necessary to perform EM-vulnerable f-TSV identification before spare assignment.
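The structure of this experiment can be reproduced on synthetic data, as in the sketch below. The MTTF values are drawn at random and every spare is given the same assumed MTTF, so the numbers it prints are not comparable to Figure 6; the sketch only illustrates why spending the same number of spares on the lowest-MTTF f-TSVs cannot do worse than a random choice. All names and values are hypothetical.

```python
import random

def network_mttf(mttfs):
    """Equation (3): series-system MTTF of the f-TSV network."""
    return 1.0 / sum(1.0 / m for m in mttfs)

def repaired_network_mttf(f_mttf, spare_mttf, repaired):
    """Network MTTF when the f-TSVs in `repaired` each receive one spare."""
    return network_mttf([m + spare_mttf if i in repaired else m
                         for i, m in enumerate(f_mttf)])

rng = random.Random(0)  # fixed seed so the run is repeatable
f_mttf = [rng.uniform(1.0, 5.0) for _ in range(200)]  # synthetic f-TSVs
spare_mttf = 1.0   # assumed identical for every spare
num_spares = 20

# EM-vulnerable assignment: repair the f-TSVs with the lowest MTTF.
by_mttf = sorted(range(len(f_mttf)), key=lambda i: f_mttf[i])
lowest = set(by_mttf[:num_spares])

# Random assignment: repair the same number of randomly chosen f-TSVs.
randomly = set(rng.sample(range(len(f_mttf)), num_spares))

base = network_mttf(f_mttf)
greedy_mttf = repaired_network_mttf(f_mttf, spare_mttf, lowest)
random_mttf = repaired_network_mttf(f_mttf, spare_mttf, randomly)
print(f"EM-vulnerable: {greedy_mttf / base:.3f}x, "
      f"random: {random_mttf / base:.3f}x")
```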

## D. Comparison with prior work

In prior work, the impact of temperature variations on TSV EM was ignored during TSV mitigation and repair [14], [15]. Here we investigate this significant impact on the number of assigned spares, additional wire length due to re-routing and  $MTTF_{target}$ .

Figure 6: The comparison of satisfied target MTTF with the same assigned spares between EM-vulnerable assignment and random assignment.

For both the thermal-aware and thermal-unaware scenarios, the proposed techniques were applied to satisfy the same $MTTF_{target}$. The only difference is that we consider an average temperature across all the dies for the thermal-unaware scenario. Admittedly, this is an unrealistic assumption, but it is consistent with the prior work. The number of assigned spares (#spare) and the average additional wire length ($\Delta L_{avg}$) can be calculated using the techniques proposed in Section IV-D and Section IV-E, respectively.

The results are shown in Table II. Due to its unrealistic assumption, the thermal-unaware scenario underestimates #spare and $\Delta L_{avg}$. For example, to satisfy the same $MTTF_{target}$, only 213 spares were required in the thermal-unaware scenario for *cf\_rca\_16* with the 4-die implementation. However, when considering a realistic temperature distribution in the thermal-aware scenario, 240 spares were required; hence the error of the baseline thermal-unaware method is (240 − 213)/240 = 11.25%. Moreover, using the over-optimistic repair solutions generated in the thermal-unaware scenario, $MTTF_{target}$ cannot be satisfied under a realistic temperature distribution. As shown in the last column of Table II, the errors $Err_{MTTF}$ between the target MTTF and the MTTF achieved when the thermal-unaware solution is evaluated in the thermal-aware scenario are significant (up to 14%).

The reason for this result is as follows. First, when we consider an average temperature across all the dies in the thermal-unaware scenario, the die that is closest to the heat sink will experience the fastest average temperature decrease. For example, as shown in Table I, for *des\_perf* with the 4-die implementation, Die3 has a much lower temperature than the other three dies, and the average temperature across all the dies is 51.76°C. In this case, all the f-TSVs placed in Die0 and Die1 are assigned much lower temperatures, while the temperatures of the f-TSVs placed in Die2 are increased slightly in the thermal-unaware scenario. Therefore, the MTTF of most of the f-TSVs is increased under this unrealistic assumption according to Equation (1), which results in an over-optimistic repair solution for a realistic scenario. Second, in the thermal-unaware scenario, the only factor that affects the MTTF value of an f-TSV is the switching activity of the signal carried by it. Therefore, the strong temperature dependence of TSV MTTF is ignored in this scenario. Besides the significant inter-die temperature variation shown in Table I, the intra-die temperature variation can also be noticeable for a larger benchmark. For example, for *des\_cf\_fft* with the 2-die implementation, the intra-die temperature variation is more than 4°C, according to our simulation results. Both of these variations play a significant role in the TSV MTTF calculation and EM-vulnerable f-TSV identification. Therefore, the baseline thermal-unaware method, which does not consider the impact of temperature variation, is likely to misidentify the EM-vulnerable f-TSVs, which results in an infeasible solution under a realistic temperature distribution.
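The effect of replacing per-die temperatures by their average can be checked numerically from the $T_{avg}$ column of Table I. In the sketch below, the per-die temperatures are the 4-die values from Table I, while the activation energy `Q_EV = 0.9` used to convert the temperature shift into an MTTF factor through Equation (1) is an assumed placeholder, so the printed factors are indicative only.

```python
import math

K_BOLTZMANN_EV = 8.617333e-5  # Boltzmann's constant in eV/K
Q_EV = 0.9                    # assumed activation energy (placeholder)

# Average steady-state temperature per die (deg C), 4-die des_perf, Table I.
die_avg_c = {"Die0": 57.82, "Die1": 55.09, "Die2": 50.29, "Die3": 43.84}

mean_c = sum(die_avg_c.values()) / len(die_avg_c)

def mttf_factor(real_c, assumed_c):
    """MTTF(assumed temperature) / MTTF(real temperature), Equation (1)."""
    real_k, assumed_k = real_c + 273.15, assumed_c + 273.15
    return math.exp(Q_EV / K_BOLTZMANN_EV * (1.0 / assumed_k - 1.0 / real_k))

# Temperature shift and MTTF over- or underestimation per die.
shift_c = {die: mean_c - t for die, t in die_avg_c.items()}
factor = {die: mttf_factor(t, mean_c) for die, t in die_avg_c.items()}

print(f"average temperature: {mean_c:.2f} C")
for die in die_avg_c:
    print(f"{die}: shift {shift_c[die]:+.2f} C, MTTF factor {factor[die]:.2f}")
```

A factor above 1 means the thermal-unaware evaluation overestimates the MTTF of TSVs in that die, which is the case for Die0 and Die1; for Die2 the factor is slightly below 1.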

Table II: Comparison between the thermal-aware and thermal-unaware scenarios

| Benchmark    | Implementation | #f-TSV | #Spare (thermal-aware) | #Spare (thermal-unaware [15]) | $\Delta L_{avg}$ ($\mu m$), thermal-aware | $\Delta L_{avg}$ ($\mu m$), thermal-unaware [15] | $Err_{MTTF}$ |
|--------------|----------------|--------|------------------------|-------------------------------|-------------------------------------------|--------------------------------------------------|--------------|
| des_perf     | 2-die          | 369    | 42            | 40                   | 149.31                                    | 145.66               | 4.76%     |
|              | 4-die          | 1220   | 216           | 201                  | 113.57                                    | 101.12               | 6.94%     |
| cf_rca_16    | 2-die          | 582    | 52            | 48                   | 179.03                                    | 171.44               | 7.69%     |
|              | 4-die          | 1451   | 240           | 213                  | 158.93                                    | 146.75               | 10.78%    |
| cf_fft_256_8 | 2-die          | 1569   | 164           | 156                  | 198.10                                    | 187.29               | 4.88%     |
|              | 4-die          | 2100   | 282           | 257                  | 169.38                                    | 152.39               | 7.89%     |
| des_cf_fft   | 2-die          | 1938   | 197           | 171                  | 217.41                                    | 191.80               | 9.17%     |
|              | 4-die          | 3320   | 451           | 374                  | 183.55                                    | 165.92               | 14.04%    |
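
The spare-count underestimation quoted in the text for $cf\_rca\_16$ (11.25%) can be recomputed for every row of Table II. The following sketch uses only the #Spare values from the table; the helper name `spare_underestimate` is our own and is not part of the proposed methodology.

```python
# (benchmark, implementation, #spare thermal-aware, #spare thermal-unaware [15])
TABLE_II_SPARES = [
    ("des_perf", "2-die", 42, 40),
    ("des_perf", "4-die", 216, 201),
    ("cf_rca_16", "2-die", 52, 48),
    ("cf_rca_16", "4-die", 240, 213),
    ("cf_fft_256_8", "2-die", 164, 156),
    ("cf_fft_256_8", "4-die", 282, 257),
    ("des_cf_fft", "2-die", 197, 171),
    ("des_cf_fft", "4-die", 451, 374),
]


def spare_underestimate(aware, unaware):
    """Shortfall of the thermal-unaware spare count, in percent of the
    thermal-aware spare count."""
    return 100.0 * (aware - unaware) / aware


for name, impl, aware, unaware in TABLE_II_SPARES:
    shortfall = spare_underestimate(aware, unaware)
    print(f"{name:<13} {impl:<6} {shortfall:6.2f}%")
```

Note that this quantity is the relative shortfall in the number of spares; it is a different measure from $Err_{MTTF}$ in the last column of Table II.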

#### E. Runtime analysis

To evaluate the CPU runtime of the proposed technique, experiments were performed on all four benchmarks with a 2-die implementation. The runtime consists of two parts: the runtime for EM-vulnerable f-TSV identification (identification for short) and the runtime for spare assignment (assignment for short). As shown in Table III, each of these steps finishes within tens of seconds, even for the largest design.

Table III: The runtime of the proposed methodology for all benchmarks.

| Benchmark    | Identification (s) | Assignment (s) | Total (s) |
|--------------|--------------------|----------------|-----------|
| des_perf     | 1.01               | 18.93          | 19.94     |
| cf_rca_16    | 3.79               | 24.77          | 28.56     |
| cf_fft_256_8 | 8.31               | 37.85          | 46.16     |
| des_cf_fft   | 10.07              | 57.61          | 67.68     |

#### VI. CONCLUSION

Three-dimensional chip stacking with TSVs has gained traction in recent years as a promising means to continue Moore's law. However, electromigration (EM)-induced reliability degradation is one of the key obstacles to industry adoption of TSV-based 3D ICs. To address this challenge, we have presented a fault-tolerance technique that increases the MTTF of the functional TSV network by assigning spare TSVs to EM-vulnerable functional TSVs. By considering the impact of temperature variation, we have analyzed the trade-off between the target MTTF, the number of assigned spare TSVs, and the re-routing delay overhead. The simulation results demonstrate that, in contrast to previous thermal-unaware methods, the proposed technique generates robust repair solutions under a more realistic scenario.

#### REFERENCES

- [1] International Technology Roadmap for Semiconductors (ITRS'13) [Online]. Available: http://www.itrs.net/.
- [2] R. S. Patti. Three-dimensional integrated circuits and the future of system-on-chip designs. Proc. of the IEEE, 94(6):1214-1224, 2006.
- [3] M. Motoyoshi. Through-silicon via (TSV). Proc. of the IEEE, 97(1):43-48, 2009.
- [4] S. Sukegawa et al. A 1/4-inch 8mpixel back-illuminated stacked CMOS image sensor. in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2013, pp. 484-485.
- [5] I. Loi et al. A low-overhead fault tolerance scheme for TSV-based 3D network on chip links. in Proc. Int. Conf. Comput.-Aided Des., Nov. 2008, pp. 598-602.
- [6] B. Swinnen et al. 3D integration by Cu-Cu thermo-compression bonding of extremely thinned bulk-Si die containing 10  $\mu$ m pitch through-Si vias. in Proc. IEEE IEDM, Dec. 2006, pp. 1-4.
- [7] Y. Tan et al. Electromigration performance of through silicon via (TSV) - a modeling approach. Microelectron. Reliab., 50(9):1336-1340, 2010.
- [8] J. Pak et al. Modeling of electromigration in through-silicon-via based 3D IC. In Proc. IEEE ECTC, May. 2011, pp. 1420-1427.
- [9] Y. Liu et al. 3D modeling of electromigration combined with thermal-mechanical effect for IC device and package. Microelectron. Reliab., 48(6):811-824, 2008.
- [10] J. Lienig et al. Electromigration avoidance in analog circuits: two methodologies for current-driven routing. In Proc. IEEE Asia South Pacific Des. Autom. Conf., Jan. 2002, pp. 372-378.
- [11] J.-T. Yan et al. Electromigration-aware rectilinear steiner tree construction for analog circuits. In Proc. IEEE Asia South Pacific Circuits Syst., Nov. 2008, pp. 1692-1695.

- [12] T. Adler et al. A current driven routing and verification methodology for analog applications. in Proc. Des. Automat. Conf., Jun. 2000, pp. 385-389.
- [13] I. H.-R. Jiang et al. Optimal wiring topology for electromigration avoidance considering multiple layers and obstacles. In Proc. Int. Symp. Phys. Des., Mar. 2010, pp. 177-184.
- [14] Y. Cheng et al. A novel method to mitigate TSV electromigration for 3D ICs. In Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Aug. 2013, pp. 121-126.
- [15] L. Jiang et al. On effective through-silicon via repair for 3D-stacked ICs. IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., 32(4):559-571, 2013.
- [16] X. Zhao et al. Transient modeling of TSV-wire electromigration and lifetime analysis of power distribution network for 3D ICs. in Proc. Int. Conf. Comput.-Aided Des., Nov. 2013, pp. 363-370.
- [17] Z. Chen et al. Modeling of electromigration of the through silicon via interconnects. In Proc. IEEE Int. Electron. Packag. Technol. & High Density Packag., Aug. 2010, pp. 1221-1225.
- [18] M. Pathak et al. Electromigration modeling and full-chip reliability analysis for BEOL interconnect in TSV-based 3D ICs. in Proc. Int. Conf. Comput.-Aided Des., Nov. 2011, pp. 555-562.
- [19] T. Lu et al. Electromigration-aware clock tree synthesis for TSV-based 3D-ICs. In Proc. Great Lakes Symp. VLSI, May. 2015, pp. 27-32.
- [20] I. Blech et al. Direct transmission electron microscope observation of electrotransport in aluminum thin films. Appl. Phys. Lett., 11(8):263-266, 1967.
- [21] J. R. Black. Electromigration failure modes in aluminum metallization for semiconductor devices. Proc. of the IEEE, 57(9):1587-1594, 1969.
- [22] J. Abella et al. Refueling: Preventing wire degradation due to electromigration. IEEE Micro, (6):37-46, 2008.
- [23] J. Romeu. Understanding series and parallel systems reliability. Internet: https://src.alionscience.com/pdf/S&PSYSREL.pdf.
- [24] D. H. Kim et al. Study of through-silicon-via impact on the 3D stacked IC layout. IEEE Trans. Very Large Scale Integr. Syst., 21(5):862-874, 2013.
- [25] L. Jiang et al. On effective TSV repair for 3D-stacked ICs. in Proc. Des. Autom. Test Eur. Conf. Exhibit., Mar. 2012, pp. 793-798.
- [26] J. Xie et al. Yield-aware time-efficient testing and self-fixing design for TSV-based 3D ICs. In Proc. IEEE Asia South Pacific Circuits Syst., Jan. 2012, pp. 738-743.
- [27] W.-H. Lo et al. Architecture of ring-based redundant TSV for clustered faults. In Proc. Des. Autom. Test Eur. Conf. Exhibit., Mar. 2015, pp. 848-853.
- [28] A. Raghunathan et al. High-level macro-modeling and estimation techniques for switching activity and power consumption. IEEE Trans. Very Large Scale Integr. Syst., 11(4):538-557, 2003.
- [29] F. Ye et al. TSV open defects in 3D integrated circuits: characterization, test, and optimal spare allocation. in Proc. Des. Automat. Conf., Jun. 2012, pp. 1024-1030.
- [30] P. Sankowski. Maximum weight bipartite matching in matrix multiplication time. Theor. Comput. Sci., 410(44):4480-4488, 2009.
- [31] J. Munkres. Algorithms for the assignment and transportation problems. SIAM J. Appl. Math., 5(1):32-38, 1957.
- [32] Open Cores Standard [Online]. Available: http://opencores.org/.
- [33] Nangate [Online]. Available: http://www.nangate.com/.
- [34] V. Plas et al. Design issues and considerations for low-cost 3-D TSV IC technology. IEEE J. Solid-State Circuits, 46(1):293-307, 2011.
- [35] M. Agrawal et al. Reuse-Based Optimization for Prebond and Post-Bond Testing of 3-D-Stacked ICs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 34(1):136-349, 2015.
- [36] J. Meng et al. Optimizing energy efficiency of 3D multicore systems with stacked DRAM under power and thermal constraints. In Proc. Des. Automat. Conf., Jun. 2012, pp. 648-655.
- [37] H. Qian et al. Thermal simulator of 3D-IC with modeling of anisotropic TSV conductance and microchannel entrance effects. In Proc. IEEE Asia South Pacific Circuits Syst., Jan. 2013, pp. 485-490.