# Integrated Placement and Skew Optimization for Rotary Clocking

Ganesh Venkataraman<sup>1</sup>, Jiang Hu<sup>1</sup>, Frank Liu<sup>2</sup> and C-N. Sze<sup>2</sup> <sup>1</sup>Dept. of Electrical Engineering, Texas A&M University, College Station, TX 77843 {ganesh, jianghu}@ee.tamu.edu <sup>2</sup>IBM Austin Research Lab {frankliu,cnsze}@us.ibm.com

# ABSTRACT

The clock distribution network is a key component on any synchronous VLSI design. As technology moves into the nanometer era, innovative clocking techniques are required to solve the power dissipation and variability issues. Rotary clocking is a novel technique which employs unterminated rings formed by differential transmission lines to save power and reduce skew variability. Despite its appealing advantages, rotary clocking requires latch locations to match pre-designed clock skew on rotary clock rings. This requirement is a difficult chicken-and-egg problem which prevents its wide application. In this work, we proposed an integrated placement and skew scheduling methodology to break this hurdle, making rotary clocking compatible with practical design flows. A network flow based latch assignment algorithm and a cost-driven skew optimization algorithm are developed. Experiments show that our method can generate chip placements which satisfy the unique requirements of rotary clocks, without sacrificing design quality. By enabling concurrent clock network and placement design, our method can also be applied in other clocking methodologies as well.

# 1. INTRODUCTION

Power and variation are two of the important factors that limit the performance deep submicron VLSI circuits. For any synchronous VLSI design, the clock distribution network is a unique sub-circuit which is critical for both the power and the variation issue. It is well-known that the Clock Distribution Network dissipates a large amount of power, up to 40% of the entire chip power budget [1]. At the same time, clock distribution network is a vulnerable victim of many variational effects such as process variations, power supply noise and temperature fluctuations. Careful design of clock distribution network is required to address both the power and variability problems.

The power dissipation of a clock network is mainly due to frequent charging/discharging of the huge capacitive load and the leakage power through clock buffers. Many efforts have been made to reduce the clock network power by minimizing clock network size [2–5]. Others tried to reduce unnecessary clock signal switchings by clock gating [6,7]. Even though these approaches can alleviate the clock network power to some degree, the improvements are intrinsically limited by the fundamental charging/discharging power dissipation characteristics in the conventional clock distribution networks.

Minimizing clock skew variation is another major objective in many recent works on clock distribution network design [8, 9]. However, many of these methods achieve reduction of skew variation at the expense of increased clock network size and power. In particular, the very effective approach of clock mesh [8] may result in excessive wirelength and power overhead. Conventional clocking structure consists of a clock signal driven by a single clock source (and possibly regenerated by buffers). Accurate control of this clock signal becomes a hard problem since it is propagated over a long distance.

In order to solve the power and the variation problem more effectively, several novel clocking technologies have been developed [10–12]. Among them, rotary traveling wave clock [10] is a promising approach. The basic component of a rotary clock is a pair of cross-connected differential transmission line circles, namely a rotary clock ring. A clock signal propagates along the ring without termination so that the energy can be recirculated and the charging/discharging power dissipation is greatly reduced. A recent study [13] shows that rotary clocks can reduce power dissipation by 70% compared to conventional clock networks. The measurement results from a test chip also showed that a low skew variation of 5.5ps at 950MHz can be achieved [10]. Compared to other novel clocking methods, rotary clock can provide uniform clock signal amplitude in contrast to the non-uniform amplitude in standing wave clock [11].

There is one technical hurdle that prevents wide applications of the rotary clock: the clock signal has different phases at different locations on the rotary clock ring. If zero skew design is insisted, the usage of rotary clocks would be very restrictive. Hence, non-zero intentional skew design is a better approach to fully utilize the rotary clock. Unlike the intentional skew design in the conventional clocking technology where no restrictions are imposed on the latch locations, the skew at each latch has to be matched with a specific location at the rotary clock ring. This requirement forms a difficult chicken-and-egg problem: the latch placement depends on skew optimization while it is well known that skew optimization depends on latch locations. This is quite different from traditional intentional skew designs [5] where the placement does not depend on skew optimization.

The goal of this work is to break this technical hurdle so that rotary clocks can be easily deployed in practice to

<sup>\*</sup>This work was supported in part by Semiconductor Research Corporation under contract number 2004-TJ-1205 and in part by DAC Graduate Scholarship.

alleviate the power and variation problem. We make the following contributions in this work.

- A relaxation technique based on flexible tapping is suggested to break the loop in the chicken-and-egg problem.
- An integrated placement and skew optimization methodology is proposed to facilitate the application of rotary clocking. This methodology has the advantage that traditional placement methods can be employed directly without any change.
- A min-cost network flow algorithm is found to assign latches to the rotary clock rings so that the movement of latches has the least disturbance to traditional placement.
- A pseudo net technique is introduced to guide latches toward their preferred locations without intrusive disturbance to traditional placement.
- A cost driven skew optimization formulation is developed to reduce the connection cost between latches and their corresponding rotary clock rings.

In addition to the rotary clock, potentially our work can be also beneficial to the design of conventional clock network. The reason is that in the traditional clock design flow, the clock network synthesis has to be performed after placement, at which stage the design space of the clock network is very constrained. While by applying our method, it is possible to perform clock network synthesis and placement concurrently, thus a better solution can be achieved. We believe our method can be applied to reduce power consumption of conventional clock network such as the one in [14]. Experimental results on benchmark circuits show that our method can reduce the tapping cost for rotary clocking by 33% - 53%.

#### 2. ROTARY TRAVELING WAVE CLOCK

The basic component of a rotary clock is a pair of crossconnected transmission line circles as shown in Figure 1(a). In the rotary clock ring, an oscillation can start spontaneously upon any noise event [10]. When the oscillation is established, the square wave signal can travel along the ring without termination. An arbitrary point on the ring can be designated as the reference point with clock signal delay t = 0 and clock phase  $\phi = 0$ . Starting from this reference point, the clock signal travels along the ring and reaches back to the reference point with delay equal to clock period T and phase  $\phi = 360$ . The numbers in Figure 1(a) indicate clock signal phases. Clock signal delay t and clock signal phase  $\phi$  can be converted to each other by  $\frac{\phi}{360} = \frac{t}{T} - \lfloor \frac{t}{T} \rfloor$ .

The energy loss due to the wire resistance is compensated by the anti-parallel inverters as shown in Figure 1(a). In addition, these inverter pairs help to achieve phase locking as the phases of the two circles at the same location are always opposite. In order to maintain uniform capacitance distribution along the ring, dummy capacitive load needs to inserted at places where no latch exists. In chip level designs, multiple rotary clock rings can be connected together to form an array as shown in Figure 1(b), where the dashed arrows indicate the signal propagation directions and the small triangles indicate equal phase locations for all rings.

A rotary clock has the advantage of both low power dissipation and low skew variation. It consumes less power as the energy is recirculated along the ring as opposed to energy loss during the charging/discharging through transistors in



Figure 1: (a) A rotary clock ring. The numbers indicate relative clock signal phase. (b) An array of 13 rotary clock rings. The small triangles points to the equal-phase points for the 13 rings.

conventional clocking. A recent study [13] shows that rotary clock can reduce power dissipation by 70% compared to conventional clock networks. In the rotary clock ring array (Figure 1(b)), the phase averaging at the junction points can reduce skew variation remarkably. The measurement from the test chip shows that 5.5ps skew variation can be obtained on an operation frequency of 950MHz [10].

# 3. ROTARY CLOCK AND TRADITIONAL DESIGN FLOW

In this section, we discuss whether a rotary clock can fit into a traditional design flow and if not, what are the difficulties. Since at each spot on a rotary clock ring, the clock signal has a distinct phase, a zero clock skew design implies that only one spot on each ring can be utilized. In Figure 1(b), there are 13 rings and there are only 13 useful spots for a zero clock skew design. Obviously, such usage of rotary clock is very inefficient. In order to fully utilize rotary clock, intentional skew design is a much better choice.

However, rotary clock is not compatible with the existing intentional skew design flows. A typical intentional skew design flow, which is employed in IBM high performance ASIC designs [5] proceeds in the order of the following stages.

- 1. Placement. In placement [15, 16], cells are placed in a non-overlapping manner so that an objective function, such as the total signal net wirelength, congestion, critical path timing or a combination of them, is minimized.
- 2. Clock skew optimization. For intentional skew designs, clock signal delay target to each latch is found so as to minimize clock period or maximize timing slack, subject to long path and short path constraints. Since the long path and short path constraints depend heavily on cell and latch locations, placement information is essential in order to perform skew optimization. We will discuss skew optimization in Section 7.
- 3. Clock distribution network synthesis. In this stage, a clock distribution network layout is generated to approximately deliver intentional skews [5], which correspond to the clock delay targets obtained in skew optimization. Of course, the clock distribution network layout depends on latch locations.

The feasibility of this flow is based on its one-way dependency that each stage relies on the result of the previous stages, but not vice versa. More specifically, the placement (with the inclusion of latches), does not depend on skew optimization or clock distribution network synthesis. Likewise, skew optimization does not depend on clock distribution network synthesis.

Rotary clock is usually designed independently. In placement, each latch needs to be placed at a rotary clock ring and the clock phase at its location has to match the clock signal delay target for the latch. This requirement causes a cyclic dependency. The latch placement depends on its clock signal delay target, which is generated by skew optimization. But skew optimization is always dependent on placement. This chicken-and-egg problem has to be solved to enable the application of rotary clocking technique.

# 4. RELAXATION VIA FLEXIBLE TAPPING

For a complex constrained optimization problem, relaxation is an effective technique to handle troublesome constraints. The difficulty of applying rotary clock can be alleviated if we relax the constraint which requires each latch to be attached exactly on a rotary clock ring. For a latch at an arbitrary location  $(x_f, y_f)$ , we can always find a tapping point p on a rotary clock ring and deploy a buffer at p to drive the latch through a wire such that clock signal delay  $t_f$ at  $(x_f, y_f)$  satisfies a pre-specified clock delay target  $\hat{t}_f$  for the latch. Of course, we do not want the latch to be too far away from the ring. Otherwise, the long wire between the tapping point and the latch may cause significant power and variability degradation that it becomes meaningless to use rotary clock. On the other hand, if a latch is very close to its tapping point on the ring, the buffer can be omitted. The wirelength between the tapping point p and the latch can be counted as a cost to be minimized in placement and skew optimization. The approach of transforming troublesome constraints to cost is very similar to Lagrangian relaxation.



# Figure 2: The tapping point p for a latch can be found by solving the delay satisfying clock signal delay target.

Now we will show how to find the location of the tapping point so that the delay target  $\hat{t}_f$  for the latch at  $(x_f, y_f)$ can be satisfied. We illustrate this procedure through an example in Figure 2. A rotary clock ring is implemented in square shape in layout and is composed by four inside segments and four outside segments, as shown in Figure 2. Without loss of generality, we only consider the case when the tapping point is on the top segment. Other cases can be derived in an almost identical manner. We assume that the left end of the segment is set at coordinate (0,0), the location for the right end, the latch and the tapping point are represented as (b,0),  $(x_f, y_f)$  and (x,0), respectively. If the delay of the clock signal at (0,0) is  $t_0$ , then the delay at p can be expressed as  $t_0 + \rho x$  [10] with  $\rho$  being a positive constant. We denote the distance between the tapping point and the latch as l, assuming wire resistance and capacitance per unit length are r and c, respectively. Let  $C_{latch}$  denote the input capacitance of the latch. In order to satisfy the clock signal delay target at the latch, the clock signal delay  $t_f$  has to satisfy:

$$t_f(x) = t_0 + \rho x + \frac{1}{2}rcl^2 + rlC_{latch} = \hat{t}_f$$
 (1)

Since  $l = |x - x_f| + y_f$ , *l* is a function of *x*. The location of the tapping point can be obtained by solving the above equation.

Figure 2 shows an example of curve  $t_f(x)$  which is composed by two parabolas joining at  $x = x_f$ . The function  $t_f(x)$  can be decomposed into a quadratic function plus the term of  $|x - x_f|$ . The quadratic function leads to a single parabola. However, the term of  $|x - x_f|$  consists of two pieces of linear parts with a non-differentiable joint point at  $x = x_f$ . Therefore, the overall shape of the  $t_f(x)$  curve becomes two pieces of parabolas joining at  $x = x_f$ . Depending on the value of  $\hat{t}_f$ , there are four cases for solving Equation (1).

- Case 1:  $\hat{t}_f$  is very small like  $t_{f1}$  in Figure 2. There is no direct solution for this case. This case can be circumvented by reducing  $t_0$  by integer number of clock period time T. Note that such reduction does not affect clock phase. This is equivalent to lowering the curve  $t_f(x)$  by multiple T and eventually results in one of the following three cases. Obviously, the number of T for the reduction needs to be minimized.
- Case 2:  $\hat{t}_f$  is moderately small like  $t_{f2}$  in Figure 2. There are two solutions in the case and the solution with smaller wirelength is selected.
- Case 3:  $\hat{t}_f$  is at middle level like  $t_{f3}$  in Figure 2. There is a unique solution.
- Case 4:  $\hat{t}_f$  is large like  $t_{f4}$  in Figure 2. There is no direct solution for this case. However, we can choose (b, 0) as the tapping point and intentionally introduce wire detour between p and the latch so that delay target  $\hat{t}_f$  is satisfied. This is almost the same as the wire snaking in clock tree routing [3].

After the tapping points on each of the eight segments are calculated, the one leading to the minimum wirelength is selected as the tapping point for the ring. The actual wirelength for achieving the clock signal delay target is defined as the tapping cost. Note that we could also use a buffer to drive the signal from point p. If needed, equation (1) can be easily modified to take care of the buffer delay.

#### 5. PROPOSED METHODOLOGY

The flexible tapping technique breaks the cyclic dependency between the placement and skew optimization. It also facilitates a new methodology flow as shown in Figure 3, in which skew awareness for the placement is achieved indirectly through a pseudo net technique. By doing so, the traditional placement methods can be used without change. This is an appealing feature because placement is a much more complicated problem than skew optimization.



Figure 3: Proposed methodology flow.

The first two stages are almost the same as the traditional flow. The initial placement can be implemented with any existing placement method [15, 16] without special considerations of latch locations or their skews. In stage 2, skew optimization [17] is performed based on the placement of stage 1 to maximize the timing slack. The details of this part will be described in Section 7.

In stage 3, each latch is assigned to a ring of the rotary clock ring array. This assignment only establishes an association between a particular latch to a ring and does not change the latch location. When a latch is assigned to a specific ring, the corresponding tapping cost can be computed according to Section 4. The objective of this assignment is to minimize the total tapping cost for all latches. Since each ring can only accommodate a limited number of latches, this capacity constraint has to be followed as well. We introduce an algorithm based on min-cost network flow to solve this problem in Section 6.

After each latch is assigned to a ring, another skew optimization can be performed to reduce the tapping cost. This is somewhat different from traditional skew optimization and will be discussed in details in Section 7. After stage 4, the overall cost is evaluated as a weighted sum of total tapping cost and traditional placement cost, which is usually total signal wirelength. If the overall cost is sufficiently small, this flow is completed. Otherwise, the flow proceeds to stage 5.

Since the initial placement is based on the traditional objectives such as signal wirelength and ignores tapping cost, it is quite likely that the tapping cost is very high when we enter stage 5 for the first time. In order to reduce tapping cost, we insert a pseudo net between each latch and its ring. The latches can be pulled toward their associated rings in the placement of stage 6. Since stage 6 is an incremental placement, it normally runs considerably faster than the initial placement. Of course, the incremental placement is preferred to be a stable one [15], i.e., small changes on the netlist should not cause dramatic change on the placement result.

#### 6. LATCH ASSIGNMENT

In stage 3 of the above flow, each latch needs to be assigned to a ring in the rotary clock ring array. It is required that the assignment minimizes the tapping cost defined in Section 4. We denote the tapping cost as  $c_{i,j}$ , when a latch *i* is assigned to ring *j*. Each ring *j* has limited space and can accommodate no more than  $U_j$  latches. We introduce a decision variable  $x_{i,j}$ : if latch *i* is assigned to ring *j*,  $x_{i,j} = 1$ , otherwise  $x_{i,j} = 0$ . The latch assignment problem can then be formulated as the following 0-1 programming problem.





#### Figure 4: Min-cost network flow model for latch assignment. Each arc is associated with cost/capacity.

This assignment problem can be solved using min-cost network flow model [18] as shown in Figure 4. The vertices in the network include a column of latch vertices, a column of ring vertices, a source vertex and a target vertex. There is an arc from a latch vertex to a particular ring vertex only if the corresponding latch is considered to be a potential candidate of the ring. If a latch and a ring are too far away from each other, it is not necessary to insert an arc between them. Each arc between a latch vertex i and a ring vertex j has a cost of  $c_{i,j}$ . The rest arcs have zero cost. Each arc from a ring vertex j to the target vertex has a capacity of  $U_j$ . The other arcs have a capacity of 1. It is well known that this min-cost network flow problem can be solved optimally in polynomial time [18].

# 7. SKEW OPTIMIZATION

Skew optimization is an important part of the proposed flow. The skew optimization in stage 2 is the same as the max-slack version of the traditional method [17] and we summarize it here for the completeness of the description. Assuming that signals depart from latch i, propagate through combinational logic and arrive at latch j, then latch i and latch j are called sequentially adjacent and the sequential adjacency is denoted as  $i \mapsto j$ . Let the maximal(minimal) combinational logic path delay be  $D_{max}(D_{min})$ . If the clock signal delay target at i and j are  $\hat{t}_i$  and  $\hat{t}_j$ , respectively, setup time is  $t_{setup}$ , hold time is  $t_{hold}$  and the clock period is T, then the max-slack version of skew optimization can be formulated as [17]:

Maximize 
$$M$$
 (2)

Subject to 
$$\hat{t}_i - \hat{t}_j + M \le T - D_{max} - t_{setup}$$
  $i \longmapsto j(3)$   
 $\hat{t}_i - \hat{t}_i \ge M + t_{hold} - D_{min}$   $i \longmapsto j(4)$ 

where M is the slack. Inequality (3) and (4) are the long path constraint and short path constraint, respectively. It

has been shown that this problem can be solved using linear programming [17] or graph based algorithms [19, 20].

For the skew optimization in stage 4, instead of using the traditional method, we propose a cost driven method so that skew optimization can be leveraged to assist placement on reducing the total tapping cost. The computation of the tapping cost (Section 4) is based on a known clock signal delay target while the delay target is a decision variable in skew optimization. Therefore, we try to minimize the tapping cost indirectly by finding clock signal delay targets such that the tapping point can be moved to the latch as close as possible. For example, in Figure 2, if the delay target of the latch can result in tapping point at c, then the tapping cost is minimized. In other words, if the actual delay to a latch i is  $t_i$  through tapping point at c, we try to make the delay target  $\hat{t}_i$  to be as close to  $t_i$  as possible.

For latch i, we first find the closest point c on its ring and the distance between i and c, which is the shortest distance between i and the ring, is denoted as  $l_i$ . The delay  $t_i$  at *i* through tapping point at *c* depends on the reference clock signal delay on the rotary clock rings. We can arbitrarily choose a set of equal phase points for all the rotary clock rings as reference points like the small triangular spots in Figure 1(b). Let the clock signal delay at the reference points be  $t_{ref}$ . If the delay from the reference point to c is  $t_{ref,c}$ , then the clock signal delay at c is  $t_c = t_{ref} + t_{ref,c}$ . The delay from c to i is  $t_{c,i} = \frac{1}{2}rcl_i^2 + rl_iC_{latch}$  which is very similar to Equation (1). Hence, the clock signal delay at i is  $t_i = t_c + t_{c,i}$ . If the difference  $|t_i - \hat{t}_i|$  is minimized, there is a good chance that the tapping point is the closest to c and the latch. When a latch i is far from its ring, i.e.,  $t_{c,i}$  is large, it is especially desired that the tapping point is closest to c. Therefore,  $|t_i - \hat{t}_i| + t_{c,i}$  is minimized in the cost driven skew optimization.

where  $\Delta$  is the maximum difference and M is a pre-specified slack. The constraint from the two inequalities in this formulation is equivalent to  $|t_i - \hat{t}_i| + t_{c,i} \leq \Delta$ . Obviously, this problem can be solved through linear programming [17].

Alternatively, the skew optimization problem can be formulated to minimize a weighted sum of the differences as follows.

where  $\delta_i$  is the difference for latch *i* and  $w_i$  is its weighting factor. A natural choice of the weighting factors is to let  $w_i = l_i$ , as we wish to focus our effort on those latches far away from their rings. Again, this problem can be solved directly through linear programming [17].

# 8. EXPERIMENTAL RESULTS

| C | Circuit | #Cells | #Latches | #Nets | $PL(\mu m)$ |
|---|---------|--------|----------|-------|-------------|
|   | s9234   | 1510   | 135      | 1471  | 2471        |
|   | s5378   | 1112   | 164      | 1063  | 2718        |
| S | s15850  | 3549   | 566      | 3462  | 5175        |
| S | 38417   | 11651  | 1463     | 11545 | 8261        |
| s | 35932   | 17005  | 1728     | 16685 | 8290        |

Table 1: Testcases. PL is the average source-sink path length in conventional clock trees [2,4].

The proposed methodology and algorithms are tested on the ISCAS89 benchmark suite. The benchmark characteristics are summarized in Table 1. At the rightmost column of Table 1, we also list the average source-sink path length in conventional clock trees [2, 4] for reference. The rotary clock ring arrays are generated as in [10]. The circuits are synthesized by using SIS [21]. The main algorithms are implemented in C++. The initial placement as well as the incremental placement are obtained from an academic placement tool mPL [16,22]. All experiments are performed on a Sun Ultra Sparc machine running Solaris operating system with 2 GB RAM.

To the best of our knowledge, there is no published work on placement and skew optimization for rotary clocking. Our method is performed on the benchmark circuits and the results are displayed in Table 2. There are two issues which we care about: (1) each latch needs to be close to the ring it associated with so that the off-ring variation effect is negligible; (2) moving latches toward their associated rings should not degrade signal wirelength remarkably and the total wirelength including tapping wirelength needs to be minimized.

As indicated by the third column of Table 2, after running the latch assignment in stage 3, the average latch-ring distance is significantly smaller (compared to the initial distance shown in the second column). Especially, they are much smaller than the average source-sink path length in conventional clock trees [2,4] as shown in the rightmost column of Table 1. As a matter of fact, after the iterations of stage 4-6, the average distance has reduced to the range of  $100 - 200 \mu m$ , which is significantly smaller than the stub length limit indicated in [10]. The wirelength results and comparisons are listed in column 4-12 of Table 2. The data show that the iterations of stage 4-6 can reduce tapping wirelength by 33%-53% with only 0.5%-4.8% penalty on signal wirelength increase. In fact, the total wirelength is reduced by 2.7%-8.3%. The CPU time is at the rightmost column of Table 2. As one can see that most of runtime is dominated by the placer. Our method converges within five iterations for all these circuits.

# 9. CONCLUSION

In this paper, we propose an integrated placement and skew optimization method for a novel clocking technique rotary traveling wave clock, which is superior to the conventional clocking on both power dissipation and tolerance to variations. By utilizing intentional skew management and latch clustering, both the phase and physical location constraints of the rotary clocking can be satisfied. To the best of our knowledge, this is the first placement and skew optimization work for rotary clocking. Experimental results on benchmark circuits validated the effectiveness of our approach. Since our method enables a concurrent clock net-

|         | Ave latch-ring dist |       | Tapping wirelength |        | Signal wirelength |         | Total wirelength |       |         | CPU(s)  |       |         |      |
|---------|---------------------|-------|--------------------|--------|-------------------|---------|------------------|-------|---------|---------|-------|---------|------|
| Circuit | Stage 3             | Final | Stage 3            | Final  | Imprv             | Initial | Final            | Imprv | Stage 3 | Final   | Imprv | Stg 2-5 | mPL  |
| s9234   | 282.6               | 132.2 | 38150              | 17852  | 53.2%             | 244485  | 245692           | -0.5% | 282635  | 263544  | 6.8%  | 52      | 271  |
| s5378   | 194.4               | 127.1 | 31878              | 20842  | 34.6%             | 260931  | 264128           | -1.3% | 292809  | 284970  | 2.7%  | 21      | 449  |
| s15850  | 240.9               | 160.9 | 136351             | 91050  | 33.2%             | 643336  | 653033           | -1.5% | 779687  | 744083  | 4.6%  | 191     | 991  |
| s38417  | 369.6               | 220.1 | 525990             | 313152 | 40.5%             | 1634920 | 1712825          | -4.8% | 2160910 | 2025977 | 6.2%  | 191     | 712  |
| s35932  | 342.7               | 219.8 | 592179             | 379751 | 35.9%             | 1735820 | 1755206          | -1.1% | 2327999 | 2134957 | 8.3%  | 20      | 1089 |

Table 2: Experimental results. The distance and wirelength data are in  $\mu m$ .

work and placement design, it is potentially useful for other clocking methodologies like [14] as well.

### **10. REFERENCES**

- D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin. A clock power model to evaluate impact of architectural and technology optimizations. *IEEE Transactions on VLSI Systems*, 10(6):844–855, December 2002.
- [2] T.-H. Chao, Y.-C. Hsu, J.-M. Ho, K. D. Boese, and A. B. Kahng. Zero skew clock routing with minimum wirelength. *IEEE Transactions on Circuits and Systems - Analog and Digital Signal Processing*, 39(11):799–814, November 1992.
- [3] R.-S. Tsay. An exact zero-skew clock routing algorithm. *IEEE Transactions on Computer-Aided Design*, 12(2):242–249, February 1993.
- [4] M. Edahiro. A clustering-based optimization algorithm in zero-skew routings. In *Proceedings of the* ACM/IEEE Design Automation Conference, pages 612–616, 1993.
- [5] S. Held, B. Korte, J. Maβberg, M. Ringe, and J. Vygen. Clock scheduling and clock tree construction for high performance ASICs. In *Proceedings of the IEEE/ACM International Conference on Computer-Aided Design*, pages 232–239, 2003.
- [6] A. H. Farrahi, C. Chen, A. Srivastava, G. Tellez, and M. Sarrafzadeh. Activity-driven clock design. *IEEE Transactions on Computer-Aided Design*, 20(6):705–714, June 2001.
- [7] J. Oh and M. Pedram. Gated clock routing for low-power microprocessor design. *IEEE Transactions* on Computer-Aided Design, 20(6):715–722, June 2001.
- [8] P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petrovick, B. L. Krauter, and B. D. McCredie. A clock distribution network for microprocessors. *IEEE Journal of Solid-State Circuits*, 36(5):792–799, May 2001.
- [9] N. Bindal, T. Kelly, N. Velastegui, and K. L. Wong. Scalable sub-10ps skew global clock distribution for a 90nm multi-GHz IA microprocessor. In *Proceedings of the IEEE International Solid-State Circuits Conference*, pages 346–355, 2003.
- [10] J. Wood, T. C. Edwards, and S. Lipa. Rotary traveling-wave oscillator arrays: a new clock technology. *IEEE Journal of Solid-State Circuits*, 36(11):1654–1665, November 2001.
- [11] R. O'Mahony, C. P. Yue, M. A. Horowitz, and S. S. Wong. A 10-GHz global clock distribution using coupled standing-wave oscillators. *IEEE Journal of Solid-State Circuits*, 38(11):1813–1820, November

2003.

- [12] S. C. Chan, K. L. Shepard, and P. J. Restle. Uniform-phase uniform-amplitude resonant-load global clock distributions. *IEEE Journal of Solid-State Circuits*, 40(1):102–109, January 2005.
- [13] Z. Yu and X. Liu. Power analysis of rotary clock. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pages 150–155, 2005.
- [14] Y. Lu, C. N. Sze, X. Hong, Q. Zhou, Y. Cai, L. Huang, and J. Hu. Navigating registers in placement for clock network minimization. In *Proceedings of the ACM/IEEE Design Automation Conference*, pages 176–181, 2005.
- [15] H. Eisenmann and F. M. Johannes. Generic global placement and floorplanning. In *Proceedings of the ACM/IEEE Design Automation Conference*, pages 269–274, 1998.
- [16] T. F. Chan, J. Cong, J. Shinnerl, and K. Sze. An enhanced multilevel algorithm for circuit placement. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 299–306, 2003.
- [17] J. P. Fishburn. Clock skew optimization. *IEEE Transactions on Computers*, C-39:945–951, July 1990.
- [18] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network flows: theory, algorithms, and applications. Prentice Hall, Upper Saddle River, NJ, 1993.
- [19] R. B. Deokar and S. S. Sapatnekar. A graph-theoretic approach to clock skew optimization. In *Proceedings of* the *IEEE International Symposium on Circuits and* Systems, pages 1.407–1.410, 1994.
- [20] C. Albrecht, B. Korte, J. Schietke, and J. Vygen. Cycle time and slack optimization for VLSI-chips. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 232–238, 1999.
- [21] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. SIS: a system for sequential circuit synthesis. Memorandum no. M92/41, ERL, University of California, Berkeley, May 1992.
- [22] CPMO-constrained placement by multilevel optimization. http://ballade.cs.ucla.edu/cpmo/. Computer Science Department, UCLA.