Adaptive Delay Monitoring for Wide Voltage-Range Operation

Jongho Kim*† Gunhee Lee* Kiyoungh Choi*

Yonghwan Kim† Wook Kim† Kyungtae Do† Youngyun Choi†

*Department of Electrical and Computer Engineering Seoul National University {jongho1119.kim, ppghl88, kchoi}@snu.ac.kr jongho1119.kim@samsung.com

†Design Technology, System LSI Samsung Electronics {yahuzz.kim, wook81.kim, kyungtae.do, youngyun74 choi}@samsung.com

Abstract— As process technology scales down, circuit delay variations become more and more serious due to manufacturing and environmental variations. The delay variations are hardly predictable and thus require additional design margin and impede the chance to reduce area and power consumption of a chip. One way to alleviate the problem is to measure the circuit delay at run-time and control the supply voltage accordingly through a closed-loop dynamic voltage and frequency scaling (closed-loop DVFS) scheme. The circuit delay is typically measured by a monitoring circuit. However, the key issue of this scheme is the delay mismatch between the monitoring circuit and the target circuit block such as a CPU or a GPU. A large delay mismatch might lose the advantage of closed-loop DVFS. And it becomes worse as the circuit block operates in a wider voltage-range. This paper proposes a novel adaptive delay monitoring scheme for a wide voltage-range operation, which provides a better delay correlation between the monitor and the target compared to conventional monitoring approaches. The proposed approach reduces the average error in the measured delay by up to 45% and the maximum error by up to 68%. The reduction of the error brings the decrease of design margin, resulting in a lower-power and lower-cost design.

Keywords— Monitoring circuit, Delay monitor, Adaptive voltage scaling, Closed-loop dynamic voltage and frequency scaling, Design margin, Wide voltage-range operation

I. INTRODUCTION

Semiconductor devices have shrunk steadily over the past decades, enabling higher performance at lower fabrication cost per transistor. However, process scaling has also brought serious circuit delay variations [1][2][3], mainly due to manufacturing and environmental variations including inter/intra-die variability, temperature shift, supply voltage droop noise, and circuit aging. As CPUs/GPUs operate over a wide voltage range (from near-threshold voltage to super-overdrive voltage), circuit delay variations become much worse than in earlier chips [4-8]. Typically, circuit delay variations have been covered by design margins to ensure ‘no-error’ operation under the variations. This is the most pessimistic approach, considering all worst cases, and thus incurs additional costs that would be unnecessary in better-than-worst cases. Moreover, the problem becomes more serious as the variation grows. It is even difficult to determine the optimal design margin for all operating conditions at manufacturing test. The problem is especially serious in CPU/GPU designs because such blocks require a wide range of operating voltages. Most commercial CPUs/GPUs have been operated by an open-loop dynamic voltage and frequency scaling (open-loop DVFS) scheme using look-up tables (LUTs) in memory. Recently, however, a closed-loop dynamic voltage and frequency scaling (closed-loop DVFS) scheme, which monitors the circuit delay at run-time, has become necessary. This scheme can keep the design margin from growing with circuit delay variation and achieve a low-cost, low-area, and low-power design.

In the closed-loop DVFS scheme, the key point is how to implement the delay monitoring circuit so that it estimates the circuit delay accurately. That is, the delay mismatch between the target block (such as a CPU or a GPU) and the monitoring circuit should be minimized, since it directly determines the effectiveness of the closed-loop DVFS scheme. Various circuits have been proposed to implement a delay monitor [9-19]. They can be classified into two groups according to their dependency on the block or design that the monitor targets: generic monitoring circuits and design-dependent monitoring circuits. A generic monitoring circuit [9][10][11] is mainly implemented as a simple inverter-based ring oscillator (RO). It does not depend on the target block. It is practical and suitable for short time-to-market designs because it can be easily implemented and reused on any chip design platform. It might be difficult, however, to use it for a design implemented with a mix of device types (e.g., a multi-threshold-voltage design), which can cause a large gap between the measured delay and the actual delay of the target block. It also still requires a large design margin because it typically has a relatively large delay mismatch with the actual critical path of the target block under various operating conditions. A design-dependent monitoring circuit [12-19] is designed to be highly correlated with the target block in terms of delay. However, it typically incurs high area overhead and design complexity. It might also be difficult to use in commercial chip designs because it has a longer design turnaround time than the generic one and cannot be reused unless the same target block is reused. The two types of monitoring circuits are presented in more detail in the following section.

In this paper, we propose a novel monitoring circuit and scheme which give both design simplification and accurate

Dynamic voltage and frequency scaling (DVFS) is widely used to reduce power consumption during off-peak processing time and to prevent thermal overheating. The goal of this technique is to run the chip at the lowest possible voltage while achieving the desired operating frequency. Depending on whether feedback information from delay monitoring circuits is used, DVFS approaches can be classified into two schemes: open-loop DVFS and closed-loop DVFS. The following subsections briefly introduce the two DVFS schemes as well as various monitoring circuits used for the closed-loop DVFS scheme.

A. Open-loop DVFS scheme

Open-loop DVFS is the most commonly used DVFS scheme. The operating voltage for each desired operating frequency is determined at the manufacturing test step, taking operating temperature conditions into account. The frequency-to-voltage mapping is then generally stored in a look-up table (LUT) and used by the power management controller to scale the supply voltage up and down as requested by the application. Typically, a large design margin is added to the operating voltage stored in the LUT for each target frequency for safety. This limits how aggressively power consumption can be reduced, since the operating voltage is pre-determined and cannot be adjusted dynamically to the run-time conditions.
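As a rough illustration of the open-loop scheme described above, the following Python sketch models the frequency-to-voltage LUT. The table entries, the margin value, and the names `TESTED_MIN_VOLTAGE_MV`, `DESIGN_MARGIN_MV`, and `open_loop_voltage_mv` are all hypothetical and are not taken from any real product.

```python
# Hypothetical result of manufacturing test: the lowest passing supply
# voltage (mV) for each target frequency (MHz). All numbers are made up.
TESTED_MIN_VOLTAGE_MV = {400: 650, 800: 750, 1200: 900, 1600: 1050}

# Assumed fixed safety margin added to every LUT entry.
DESIGN_MARGIN_MV = 50


def open_loop_voltage_mv(freq_mhz):
    """Return the pre-determined supply voltage for a requested frequency.

    The value never changes at run-time, which is why the margin has to
    cover the worst-case conditions.
    """
    if freq_mhz not in TESTED_MIN_VOLTAGE_MV:
        raise ValueError(f"no LUT entry for {freq_mhz} MHz")
    return TESTED_MIN_VOLTAGE_MV[freq_mhz] + DESIGN_MARGIN_MV
```

In this sketch the margin is paid on every request regardless of the actual run-time condition, which is the limitation the closed-loop scheme addresses.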

B. Closed-loop DVFS scheme

Closed-loop DVFS has recently emerged as a way to achieve more aggressive power reduction in competitive ultra-low-power markets, where the design margin in the supply voltage should be reduced as much as possible. Since the actual operating speed changes with PVT (Process/Voltage/Temperature) conditions, the optimal voltage needs to be calculated and assigned dynamically at run-time. For example, circuit aging, supply voltage droop noise, and temperature shift always occur in a real environment and affect the actual operating speed of the chip. This problem can be resolved by a feedback loop based on a delay monitoring circuit, which provides information on how fast or slow the chip is actually running. As shown in Fig. 1, the monitoring circuit closes the feedback loop and enables closed-loop DVFS, where the operating voltage is adaptively scaled to an optimal point.

Compared to open-loop DVFS, closed-loop DVFS does not require a large design margin because it tracks the operating speed at run-time. However, the design margin cannot be removed entirely because there can be a delay mismatch between the target block and the monitoring circuit. The advantage of closed-loop DVFS therefore depends on the delay correlation between the target block and the monitoring circuit, and reducing the delay mismatch between the two is the most important factor in such a system. In the next section, we introduce previous work on monitoring circuits.
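The feedback loop can be sketched as a simple bang-bang controller. This is only an illustration under stated assumptions: the function name `closed_loop_step`, the step size, and the voltage limits are hypothetical defaults, and `target_count` is assumed to already include the residual margin that covers the monitor-to-target delay mismatch.

```python
def closed_loop_step(voltage_mv, monitor_count, target_count,
                     step_mv=12.5, v_min_mv=600.0, v_max_mv=1200.0):
    """Run one iteration of a simple closed-loop voltage controller.

    monitor_count: reading from the delay monitor (larger means faster).
    target_count:  reading required for the target frequency, assumed to
                   include the margin for monitor/target delay mismatch.
    """
    if monitor_count < target_count:
        voltage_mv += step_mv   # circuit is too slow: raise the voltage
    elif monitor_count > target_count:
        voltage_mv -= step_mv   # circuit has slack: lower the voltage
    # Clamp to the supported supply range.
    return min(max(voltage_mv, v_min_mv), v_max_mv)
```

The smaller the mismatch between monitor and target, the closer `target_count` can be set to the true requirement, which is where the margin saving comes from.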

C. Previous work on monitoring circuits

The accuracy of the delay monitoring circuit is very important to maximize the effect of closed-loop DVFS. Delay monitoring has been studied extensively and various circuits have been proposed. These circuits can be classified into two categories: generic monitoring circuits and design-dependent monitoring circuits. A generic monitoring circuit is typically designed as a simple inverter-based ring oscillator (RO). Process-specific ROs (PSROs) have been proposed to measure process parameters or variations of a chip [9][10]. A phase-locked loop (PLL) has been used as an alternative monitoring circuit [11]. A generic monitoring circuit is simple and easy to design. It also has little area overhead and can be easily reused for other chip designs. However, it is in general less accurate than a design-dependent monitoring circuit and thus incurs a large design margin due to the large delay mismatch. A design-dependent monitoring circuit is tuned so that its delay characteristics are better correlated with those of the target block. It is thus more accurate than generic PSROs, but it requires many calibrations and storage for parameters. Such a design-dependent delay monitor can be implemented based on a design-specific delay model [12][13]. A design-dependent RO can be synthesized according to the target design and process information [14]. It is relatively simple and has lower area overhead than other kinds of design-dependent monitors, but it requires additional design turnaround time and characterization of the target block. It also cannot be reused for other chip designs. Critical path replicas and in-situ monitors, which can analyze the critical path of the target block, have been presented [15][16][17]. Although they correlate well with the target block, they incur large area overhead and long design turnaround time. The reconfigurable monitors presented in [18] can be flexibly tuned according to the circuit delay of the target block and hence provide a more accurate circuit delay.

A generic monitoring circuit has the limitation that it uses only one device type among the types in a multi-threshold-voltage library, for example low-threshold-voltage (LVT), regular-threshold-voltage (RVT), or high-threshold-voltage (HVT) transistors. Alternatively, it is possible to mix these types with a specific ratio. However, it is very difficult to determine the optimal ratio of different threshold-voltage transistors from the many critical paths of the target block. Thus, generic monitors mainly choose LVT under the assumption that most critical paths consist of LVT transistors, or choose HVT when high delay sensitivity is needed. Any such implementation decision is still limited in estimating the circuit delay of a target block implemented with multi-threshold-voltage transistors.

B. Proposed monitoring circuit

To overcome the limitation of a single generic monitoring circuit, we develop a combination of three generic monitoring circuits implemented with LVT, RVT, and HVT transistors, respectively, which are the transistor types used to implement the target block. Fig. 2 illustrates a simple view of the proposed monitoring circuit structure. Each chain has different delay characteristics according to PVT (especially voltage) conditions. An additional chain with different delay characteristics can be added, for example one with a mix of multi-threshold-voltage transistors or different gate types. In our experiments, we implement the monitoring circuits with 2-input NAND and 2-input NOR cells because INV cells are not well suited to control (they have only one input). We also use NAND and NOR cells in a one-to-one ratio so that PMOS and NMOS transistors contribute equally to the delay estimation. A delay chain of NAND cells is dominated by NMOS transistor delay, while a delay chain of NOR cells is dominated by PMOS transistor delay, and either case alone might make it difficult to estimate circuit delay accurately at process skew corners (e.g., SF/FS process corners). The monitoring circuit outputs the number of cell stages in the selected chain through which the input signal propagates within one clock cycle. The number becomes larger if the supply voltage is scaled up and smaller if it is scaled down. Thus the cell delay is easily derived from the output number.
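The relation between the monitor output and the cell delay can be sketched as follows. This is a simplified behavioral model, not the circuit itself: it assumes a uniform per-stage delay (averaging the NAND and NOR stages), and the function names, the chain length, and the clock period in the usage note are hypothetical.

```python
def stages_traversed(stage_delay_ps, clock_period_ps, chain_length):
    """Number of cell stages the input edge passes within one clock cycle.

    Assumes every stage has the same delay; the count saturates at the
    physical length of the chain.
    """
    if stage_delay_ps <= 0:
        raise ValueError("stage delay must be positive")
    return min(int(clock_period_ps // stage_delay_ps), chain_length)


def estimated_stage_delay_ps(count, clock_period_ps):
    """Recover the average stage delay from the monitor output."""
    if count <= 0:
        raise ValueError("count must be positive")
    return clock_period_ps / count
```

For example, with an assumed 1000 ps clock period and a 128-stage chain, a 20 ps stage delay gives a count of 50, and a slower 25 ps stage delay (lower supply voltage) gives a count of 40.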

To make the effect of PVT variation on the monitoring circuit as close as possible to that on the target block, we propose placing the monitoring circuit close to the target block. It is also desirable for the two to share the same supply voltage rail so that the monitor sees the supply voltage droop noise of the target block.

C. Proposed monitoring scheme

Critical paths of the target block and the delay chains have different delay-voltage characteristics throughout the whole operating voltage range (see Fig. 4) because they have different ratios of LVT/RVT/HVT cells and different gate types in their paths. Typically, the delay-voltage characteristics of HVT cells are the most sensitive, while those of LVT cells are the least sensitive.

We consider using different delay chains as the voltage and/or temperature change. As the supply voltage or temperature changes, the delay also changes; the rate of the delay change of the target block may match well with that of a delay chain. At a different voltage and temperature, however, a different chain may match better, and that is why we use multiple delay chains.

\[ \text{Input: voltage range index set } V = \{v_1, v_2, \ldots, v_n\} \]

\[ \text{chain index set } C = \{c_1, c_2, \ldots, c_m\} \]

\[ \text{Output: chain mapping set } P = \{p_1, p_2, \ldots, p_n\}, \quad p_i = [x_1, x_2, \ldots, x_m]^T, \quad x_j = 0 \text{ or } 1 \]

\[ \text{For all } v_i \]

\[ \text{Calculate the delay-voltage slope of critical path and all chains } (\text{slope}_{\text{critical}}, \text{slope}_1, \ldots, \text{slope}_m) \]

\[ \text{Find: } \min \{ |\text{slope}_{\text{critical}} - \text{slope}_1|, \ldots, |\text{slope}_{\text{critical}} - \text{slope}_m| \} \]

\[ \text{If } |\text{slope}_{\text{critical}} - \text{slope}_j| \text{ is minimum, set } x_j = 1 \text{ and all other entries of } p_i \text{ to } 0 \]

\[ \text{End} \]

Algorithm 1. Chain mapping algorithm for each voltage range.

Among the multiple chains, the proposed monitoring scheme adaptively selects the chain that best matches the target block at the current voltage and temperature during run-time. The voltage steps are determined by the resolution of the PMIC (Power Management IC) or by the designer. The finer the voltage steps, the higher the accuracy. For this, the critical paths of the target block and the delay chains should be characterized to obtain the delay-voltage slope throughout the entire operating voltage range. Regarding temperature, since the delay is much less sensitive to temperature variations\(^1\), the entire range is divided into only three sub-ranges. Then, for each voltage step and temperature sub-range, the chain whose delay-voltage characteristic is most similar to that of the critical path is mapped. That is, the proposed scheme selects the chain whose delay-voltage slope is closest to that of the critical path and uses it for circuit delay estimation at that voltage and temperature. The mapping information is stored in an LUT, and the power management controller uses the LUT to select the proper chain for the current voltage and temperature.
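The run-time lookup can be sketched in Python as below. The temperature boundaries, the voltage range and step size used as defaults, the placeholder LUT contents, and all names (`TEMP_BOUNDS_C`, `select_chain`, and so on) are assumptions made for illustration; the text only fixes that there are three temperature sub-ranges and one entry per voltage step.

```python
import bisect

# Assumed boundaries (deg C) splitting the range into three sub-ranges.
TEMP_BOUNDS_C = [0.0, 85.0]
# Assumed voltage grid: 600 mV to 1200 mV in 12.5 mV steps (49 steps).
V_MIN_MV, V_MAX_MV, V_STEP_MV = 600.0, 1200.0, 12.5


def temp_subrange(temp_c):
    """Map a temperature to sub-range index 0, 1, or 2."""
    return bisect.bisect_right(TEMP_BOUNDS_C, temp_c)


def voltage_step(voltage_mv):
    """Map a supply voltage to the index of the nearest voltage step."""
    if not V_MIN_MV <= voltage_mv <= V_MAX_MV:
        raise ValueError("voltage outside the characterized range")
    return int(round((voltage_mv - V_MIN_MV) / V_STEP_MV))


def select_chain(lut, voltage_mv, temp_c):
    """Return the chain mapped to the current voltage and temperature."""
    return lut[temp_subrange(temp_c)][voltage_step(voltage_mv)]


# Placeholder LUT contents, NOT measured data: one row per temperature
# sub-range, one chain name per voltage step.
EXAMPLE_ROW = ["VTH_TYPE3"] * 16 + ["VTH_TYPE2"] * 16 + ["VTH_TYPE1"] * 17
EXAMPLE_LUT = [list(EXAMPLE_ROW) for _ in range(3)]
```

With this placeholder table, `select_chain(EXAMPLE_LUT, 1000.0, 25.0)` reads row 1, step 32. In a real design each row would be filled by the chain mapping procedure for that temperature sub-range.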

To explain how we determine the contents of the LUT, suppose there are \(m\) chains and \(n\) operating voltage steps. We define a vector \(V = [v_1, v_2, v_3, \ldots, v_n]^T\) of \(n\) voltage steps (for the \(i\)-th voltage step, \(v_i=1\) and all other entries are 0) and a vector \(C = [c_1, c_2, c_3, \ldots, c_m]^T\) of \(m\) chains (for the \(j\)-th chain, \(c_j=1\) and all other entries are 0). Then, for each temperature range \(t\), we determine an \(m \times n\) matrix \(P_t = [p_{1t}, p_{2t}, p_{3t}, \ldots, p_{nt}]\) that maps a voltage step to a chain as follows.

\[
C = P_t \cdot V
\]

where each column \(p_{it}\) of \(P_t\) has all 0’s except for one entry, which is 1. Thus, once the current voltage step is determined, the proper chain is determined by \(P_t\). Algorithm 1 shows how we determine the entries of matrix \(P_t\). Once the matrix is determined, it becomes the contents of the LUT. This approach gives much better delay correlation between the monitoring circuit and the target block than existing generic monitoring circuits without incurring large area overhead and design complexity. In Section V, we analyze the experimental results in detail.
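A minimal Python sketch of Algorithm 1 for one temperature range follows. It is an illustration under assumptions rather than the authors' implementation: the slope is approximated by a finite difference between neighboring voltage steps, the closest chain is chosen by absolute slope difference, and the function names and toy delay values are hypothetical.

```python
def slope(delays, voltages, i):
    """Finite-difference delay-voltage slope at voltage step i (assumed method)."""
    j = i + 1 if i + 1 < len(voltages) else i - 1
    return (delays[j] - delays[i]) / (voltages[j] - voltages[i])


def chain_mapping(voltages, critical_delays, chain_delays):
    """Build P_t as a list of n one-hot columns, each of length m."""
    m = len(chain_delays)
    columns = []
    for i in range(len(voltages)):
        s_crit = slope(critical_delays, voltages, i)
        diffs = [abs(s_crit - slope(d, voltages, i)) for d in chain_delays]
        best = diffs.index(min(diffs))          # chain with the closest slope
        columns.append([1 if j == best else 0 for j in range(m)])
    return columns


def select(columns, step):
    """C = P_t * V: for a one-hot V, the product is column `step` of P_t."""
    return columns[step]


# Toy data (made up): four voltage steps, two chains.
VOLTAGES = [0.6, 0.8, 1.0, 1.2]
CRITICAL = [4.0, 2.0, 1.0, 0.8]
CHAINS = [[2.0, 1.5, 1.0, 0.85],   # chain 1: less voltage-sensitive
          [5.0, 2.2, 1.0, 0.7]]    # chain 2: more voltage-sensitive
P = chain_mapping(VOLTAGES, CRITICAL, CHAINS)
```

On this toy data, the more sensitive chain is selected for the two lowest voltage steps and the less sensitive chain for the two highest, so `P` is `[[0, 1], [0, 1], [1, 0], [1, 0]]` when written column by column.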

IV. DESIGN METHODOLOGY OF PROPOSED APPROACH

The design methodology of the proposed approach is shown in Fig. 3. It plugs easily into an existing design flow. First, we implement the monitoring circuit with multiple chains using a multi-threshold-voltage library, following design specifications such as the number of chains. Then, we extract the critical paths of the target block and analyze the delay-voltage characteristics of both the critical paths and the chains under various PVT conditions. Based on the analysis, the proposed algorithm extracts the chain mapping information for each voltage step. After this analysis, an additional chain can be added to the monitoring circuit for better delay correlation. The mapping information might be refined at the post-silicon step based on physical chip tests. This does not incur additional test cost because the chips must be tested anyway at each frequency level to find the optimal voltage. During this test, the output value of each chain must be read for each voltage step. The final mapping information is stored in the LUTs, and the power management controller uses it for circuit delay estimation at run-time.

V. EXPERIMENTAL RESULT

To verify the proposed monitoring circuit and scheme, we use an ARM Cortex-A53 implemented in Samsung 14nm FinFET technology with three multi-threshold-voltage libraries. Following the proposed design methodology in Fig. 3, we analyze the experimental results below.

**Fig. 3.** Design methodology of the proposed approach.

**Table I.** Experimental Environments

<table>
<thead>
<tr>
<th>Target Design</th>
<th>ARMv8-A Cortex-A53</th>
</tr>
</thead>
<tbody>
<tr>
<td>Process Technology</td>
<td>Samsung 14nm FinFET</td>
</tr>
<tr>
<td>Library</td>
<td>Multi-Vth library (3-types)</td>
</tr>
<tr>
<td>Process Corner</td>
<td>NN</td>
</tr>
<tr>
<td>Operating Voltage [V]</td>
<td>0.6 - 1.2</td>
</tr>
<tr>
<td>Temperature [°C]</td>
<td>-40 / 25 / 125</td>
</tr>
</tbody>
</table>

**Fig. 4.** Delay-voltage characteristic curve of critical path #2 for each chain and the proposed approach at 25°C.

We extract 20 critical paths and implement three generic chains with the corresponding multi-threshold-voltage libraries (VTH_TYPE1/VTH_TYPE2/VTH_TYPE3). To analyze the delay-voltage characteristics of the critical paths and the chains, we run HSPICE under various PVT conditions. The experimental environment is summarized in Table I.

We choose five representative paths among the 20 extracted critical paths and analyze the delay-voltage characteristics of the five paths and the chains while sweeping the voltage from 0.6V to 1.2V with 12.5mV granularity. Fig. 4 shows the delay-voltage characteristic curves of a critical path, the chains, and the proposed approach under a specific condition (e.g., 25°C), with the delay normalized to that at 1.0V. The VTH_TYPE3 chain is the most sensitive to voltage changes, followed by VTH_TYPE2 and VTH_TYPE1 in that order. That is, the delay-voltage slope of VTH_TYPE3 is the largest and that of VTH_TYPE1 is the smallest. We also confirm that the delay increases considerably faster in the low-voltage region. As can be seen from the figure, the delay-voltage characteristic obtained by the proposed approach is the closest to that of the critical path.
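The sweep grid and the normalization described above can be sketched in Python. The function names are hypothetical and the delay values in the usage note are a toy model, not simulation data; only the 0.6 V to 1.2 V range, the 12.5 mV granularity, and the 1.0 V reference come from the text.

```python
def voltage_sweep_mv(v_min_mv=600.0, v_max_mv=1200.0, step_mv=12.5):
    """Return the list of swept supply voltages in millivolts."""
    count = int(round((v_max_mv - v_min_mv) / step_mv)) + 1
    return [v_min_mv + k * step_mv for k in range(count)]


def normalize_to_reference(voltages_mv, delays, ref_mv=1000.0):
    """Normalize a delay curve to its value at the reference voltage."""
    ref_delay = delays[voltages_mv.index(ref_mv)]
    return [d / ref_delay for d in delays]


SWEEP = voltage_sweep_mv()
# Toy delay curve (made up): delay inversely proportional to voltage.
TOY_DELAYS = [1.0e6 / v for v in SWEEP]
NORMALIZED = normalize_to_reference(SWEEP, TOY_DELAYS)
```

The grid contains 49 voltage points, and the normalized curve equals 1.0 at the 1.0 V point by construction, which makes curves of paths and chains with different absolute delays directly comparable.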

**Fig. 5.** Average errors in the delay estimation by monitoring circuits.

**Fig. 6.** Maximum errors in the delay estimation by monitoring circuits.

Fig. 5 and Fig. 6 compare the proposed approach with the single generic monitoring circuits under various experimental conditions. Fig. 5 shows the average errors in the delay estimation by the monitoring circuits. For critical paths #1, #2 and #3, the error rate of the proposed approach is much smaller than that of the single generic monitoring circuits. Compared to the best result of a single generic monitoring circuit, the error rate decreases by about 10% to 80%. For critical paths #4 and #5, the error rate of the proposed approach is comparable to the best result of the single generic monitoring circuits: some results are the same and others are slightly better or worse, but the error rate is almost the same overall. Fig. 6 shows the maximum errors in the delay estimation. From the viewpoint of design margin reduction, the maximum error rate is much more important than the average error rate because the design margin is determined to ensure ‘no-error’ operation in the worst case. As with the average error rates, the proposed approach shows better error rates for critical paths #1, #2 and #3 than the single generic monitoring circuits. For critical paths #4 and #5, it shows error rates similar to the best result of the single generic monitoring circuits.

As described above, if we consider each critical path separately, the proposed approach gives similar or about 10% to 80% better delay correlation. However, the target block should be designed considering all critical paths. In other words, the design margin must account for all critical paths and is therefore determined by the critical path with the worst delay estimation error. In our experiment, the thick-lined boxes in Fig. 5 and Fig. 6 mark the worst results and the corresponding critical paths. For the average error rate, the worst case of the VTH_TYPE1 chain is 4.67% on critical path #3, that of the VTH_TYPE2 chain is 6.63% on critical path #5, and that of the VTH_TYPE3 chain is 15.47% on critical path #5. Compared to the best single monitoring circuit (the VTH_TYPE1 chain), the proposed approach reduces the error rate by up to 45% (the difference is much larger than that obtained when considering only one critical path such as #5). Moreover, since the design should actually account for the maximum error rate, we should use the maximum error rate data, which show an error rate reduction of up to 68% (from 16.10% down to 5.13%). This improvement in the delay estimation error rate reduces the design margin and lowers design cost and power consumption.
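The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted above; the function and variable names are hypothetical.

```python
def relative_reduction(baseline_error_pct, proposed_error_pct):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline_error_pct - proposed_error_pct) / baseline_error_pct


# Worst-case average error of each single chain, as quoted in the text.
WORST_AVG_ERROR_PCT = {"VTH_TYPE1": 4.67, "VTH_TYPE2": 6.63, "VTH_TYPE3": 15.47}

# The best single chain is the one whose worst case is smallest.
BEST_SINGLE_CHAIN = min(WORST_AVG_ERROR_PCT, key=WORST_AVG_ERROR_PCT.get)

# Maximum-error reduction quoted in the text: 16.10% down to 5.13%.
MAX_ERROR_REDUCTION = relative_reduction(16.10, 5.13)
```

Here `BEST_SINGLE_CHAIN` is `"VTH_TYPE1"`, and `MAX_ERROR_REDUCTION` is about 0.68, matching the stated reduction of up to 68%.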

VI. CONCLUSION

In this paper, we propose a monitoring circuit composed of multiple generic chains and a method to adaptively select a proper chain for wide voltage-range operation. The proposed design methodology is easy to plug into existing design flows. It reduces the average error due to delay mismatch by up to 45% and the maximum error by up to 68% without large area overhead or cost. It can significantly reduce the design margin needed to compensate for the delay mismatch between the monitoring circuit and the target block in the closed-loop DVFS scheme. In advanced process technologies, the effect of variations on circuit delay can be much more serious, and thus improving the accuracy of the measured circuit delay becomes very important.

ACKNOWLEDGMENT

This work was supported by System LSI and Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.

REFERENCES