# Verifying Dividers Using Symbolic Computer Algebra and Don't Care Optimization 

Christoph Scholl ${ }^{1} \quad$ Alexander Konrad ${ }^{1} \quad$ Alireza Mahzoon ${ }^{2} \quad$ Daniel Große $^{3} \quad$ Rolf Drechsler ${ }^{2}$<br>${ }^{1}$ University of Freiburg, Germany $\quad{ }^{2}$ University of Bremen, Germany $\quad{ }^{3}$ Johannes Kepler University Linz, Austria<br>$\{$ scholl, konrada $\}$ @informatik.uni-freiburg.de, \{mahzoon, drechsle\} @informatik.uni-bremen.de, daniel.grosse @jku.at


#### Abstract

In this paper we build on methods based on Symbolic Computer Algebra that have been applied successfully to multiplier verification and more recently to divider verification as well. We show that existing methods are not sufficient to verify optimized non-restoring dividers and we enhance those methods by a novel optimization method for polynomials w. r. t. satisfiability don't cares. The optimization is reduced to Integer Linear Programming (ILP). Our experimental results show that this method is the key for enabling the verification of large and optimized non-restoring dividers (with bit widths up to 512).


## I. INTRODUCTION

Arithmetic circuits are crucial components in processor designs as well as in special-purpose hardware for computationally intensive applications like signal processing and cryptography. At the latest since the famous Pentium bug [1] in 1994, where a subtle design error in the divider had not been detected by Intel's design validation (leading to erroneous Pentium chips brought to the market), it has been widely recognized that incomplete simulationbased approaches are not sufficient for verification and formal methods should be used to verify the correctness of arithmetic circuits. Since the design of circuits containing arithmetic is nowadays not only confined to the major processor vendors, but is also done by many different suppliers of special-purpose embedded hardware who cannot afford to employ large teams of specialized verification engineers being able to provide humanassisted theorem proofs, the interest in fully automatic formal verification of arithmetic circuits is growing more and more.

While the automatic formal verification of adder circuits (and hence also subtractor circuits) turned out to be easier, multipliers and dividers formed a major problem for a long time. Early attempts to represent both multiplier specifications and implementations by canonical forms like BDDs [2] showed exponential space complexity [3]. SAT-based methods did not prove to be scalable either [4]. Nevertheless, for the automatic formal verification of gate-level multipliers there has been great progress during the last few years. Apart from rewriting based approaches like [5], methods based on Symbolic Computer Algebra (SCA) were able to verify large, structurally complex, and highly optimized multipliers. Both finite field multipliers [6] and integer multipliers [4], [7]-[17] have been considered in SCA-based approaches. Here the verification task has been reduced to an ideal membership test for the specification polynomial based on so-called backward rewriting, proceeding from the outputs of the circuit in direction of the inputs.

Although the Pentium bug affected the divider unit, research efforts for divider verification were not as successful as for multipliers. Attempts to use Decision Diagrams for proving the correctness of an SRT divider [18] were confined to a single stage of the divider (at the gate level) [19]. Methods based on word-level model checking [20] looked into SRT division as well, but considered only a special abstract and clean sequential (i.e.,

[^0]non-combinatorial) divider without gate-level optimizations. Other approaches like [21], [22], or [23] looked into fixed division algorithms and used semi-automatic theorem proving with ACL2, Analytica, or Forte to prove their correctness. Nevertheless, all those efforts did not lead to a fully automated verification method suitable for gate-level dividers. Hamaguchi et al. [24] mentioned in a side remark a simple and fully automated method to verify integer dividers using *BMDs [25] which is closely related to the SCA-based method to be presented in this paper. They start with a *BMD representing $Q \times D+R$ (where $Q$ is the quotient, $D$ the divisor, and $R$ the remainder of the division) and use a backward construction to replace the bits of $Q$ and $R$ step by step by *BMDs representing the gates of the divider. The goal is to finally obtain a *BMD representation for the dividend $R^{(0)}$ which proves the correctness of the divider circuit. Unfortunately, in their experiments they observed exponential blow-ups of *BMDs in the backward construction and thus the approach did not provide an effective way for verifying large integer dividers.

We are aware of three recent approaches which try to take up the success of SCA to provide a fully automatic divider verification. The work in [26] is mainly confined to division by constants and cannot handle general dividers due to a memory explosion problem. [27] works at the gate level, but assumes that hierarchy information in a restoring divider is present. Using this hierarchy information it decomposes the proof obligation $R^{(0)}=Q \times D+R$ into separate proof obligations for each level of the restoring divider. Nevertheless, the approach scales only to medium-sized bit widths (up to 21 as shown in the experimental results of [27]). The approach of [28] works on the gate level as well, but it does not need any hierarchy information. It proves the correctness of non-restoring dividers by "backward rewriting" starting with the "specification polynomial" $Q \times D+R-R^{(0)}$. Here backward rewriting performs "backward substitutions" of gate output variables with the gates' specification polynomials. By finally obtaining the 0-polynomial they prove the dividers to be correct. To avoid an exponential blow-up during backward rewriting, their method uses information on equivalent and antivalent signals which is derived by forward propagating the allowed range of the divider inputs, i. e., the constraint $0 \leq R^{(0)}<D \cdot 2^{n-1}$, using SAT. The method scales very well, but it is restricted to a very clean implementation of non-restoring division. Using optimizations to the implementation (which are usually even contained in textbook designs) leads to a memory explosion problem for this approach, too.

Here we present a novel SCA-based verification method which is basically along the lines of [28], but completely avoids the problems mentioned above. Our method works on the gate-level netlist without using any hierarchy information which may have been lost during logic optimization. As [28] it does not use a "golden specification circuit" for division, but only uses the natural abstract specification from the definition of division. We consider optimized non-restoring dividers and show that the method from
[28] leads to memory explosion in this case. Whereas we agree with the authors of [28] in the analysis that backward rewriting cannot be efficient without forward propagation of information, we do not limit ourselves to the detection and exploitation of equivalences and antivalences as in [28]. Rather we compute satisfiability don't cares of atomic blocks in the divider circuit which are present due to the input constraint $0 \leq R^{(0)}<D \cdot 2^{n-1}$. We use those satisfiability don't cares to optimize polynomials modulo don't cares, i.e., from a polynomial $P$ we compute a smaller polynomial $P^{\prime}$ which evaluates to the same values as $P$ for all input combinations which do not satisfy the don't care conditions. The computation of optimized polynomials is reduced to suitable Integer Linear Programming (ILP) problems and makes use of progress in ILP solving made during the last decades. Since we change polynomials only w.r.t. satisfiability don't cares, i. e., w.r.t. value combinations which do not occur in the circuit as long as we stay in the allowed input range, it is clear that the final polynomial after backward rewriting is - within the allowed input range - identical to the polynomial which would have been computed without using don't care optimization.

Our experiments show that this approach to optimize polynomials in early stages (before an exponential blow-up occurs) is very efficient in restricting the sizes of polynomials. E.g. for the 512bit divider the whole verification including backward rewriting and don't care based optimizations of polynomials could be performed in less than 162 CPU minutes.

Moreover, we make the observation that for a correct divider the final polynomial only has to evaluate to 0 for the inputs in the allowed input range $\left(0 \leq R^{(0)}<D \cdot 2^{n-1}\right)$. Whereas for the clean divider from [28] the final polynomial evaluates to 0 for all input combinations ("by chance"), this is not the case for the optimized divider designs and thus we had to provide an efficient evaluation proving that the final polynomial is 0 for all inputs in the allowed input range $0 \leq R^{(0)}<D \cdot 2^{n-1}$.

The paper is structured as follows: In Sect. II we provide background on SCA and divider circuits. In Sect. III we motivate the need for our novel optimization method which is presented in Sect. IV. We evaluate the approach in Sect. V and conclude with final remarks in Sect. VI.

## II. Preliminaries

## A. SCA for Verification

For the presentation of SCA we basically follow [28]. SCA based approaches work with polynomials and reduce the verification task to an ideal membership test using a Gröbner base representation of the ideal. The ideal membership test is performed using polynomial division. While Gröbner base theory is very general and, e.g., can be applied to finite field multipliers [6] and truncated multipliers [16] as well, for integer arithmetic it boils down to substitutions of variables for gate outputs by polynomials over the gate inputs (in reverse topological order), if we choose an appropriate "term order" (see [4] or [13], e.g.). Here we restrict ourselves to exactly this view.

For integer arithmetic we consider polynomials over binary variables (from a set $X=\left\{x_{1}, \ldots, x_{n}\right\}$ ) with integer coefficients, i. e., a polynomial is a sum of terms, a term is a product of a monomial with an integer, and a monomial is a product of variables from $X$. Polynomials represent pseudo-Boolean functions $f:\{0,1\}^{n} \mapsto \mathbb{Z}$.

As a simple example consider the full adder from Fig. 1. The full adder defines a pseudo-Boolean function $f_{F A}:\{0,1\}^{3} \mapsto \mathbb{Z}$ with $f_{F A}\left(a_{0}, b_{0}, c\right)=a_{0}+b_{0}+c$. We can compute a polynomial representation for $f_{F A}$ by starting with a weighted sum $2 c_{0}+s_{0}$

$2 c_{0}+s_{0}$
$\xrightarrow{c_{0}} \quad 2 h_{2}+2 h_{3}-2 h_{2} h_{3}+s_{0}$
$\xrightarrow{3} 2 h_{2}+2 c h_{1}-2 c h_{1} h_{2}+s_{0}$
$2 h_{2}-2 c h_{1} h_{2}+c+h_{1}$
$2 a_{0} b_{0}-2 a_{0} b_{0} c h_{1}+c+h_{1}$

Fig. 1. Circuit with series of substitutions.
(called the "output signature" in [10]) of the output variables. Step by step, we replace the variables in polynomials by the so-called "gate polynomials". This replacement is performed in reverse topological order of the circuit, see Fig. 1. We start by replacing $c_{0}$ in $2 c_{0}+s_{0}$ by its gate polynomial $h_{2}+h_{3}-h_{2} h_{3}$ (which is derived from the Boolean function $c_{0}=h_{2} \vee h_{3}$ ). Finally, we arrive at the polynomial $a_{0}+b_{0}+c$ (called the "input signature" in [10]) representing the pseudo-Boolean function defined by the circuit. During this procedure (which is called backward rewriting) the polynomials are simplified by reducing powers $v^{k}$ of variables $v$ with $k>1$ to $v$ (since the variables are binary), by combining terms with identical monomials into one term, and by omitting terms with leading factor 0 . We can also consider $a_{0}+b_{0}+c=2 c_{0}+s_{0}$ as the "specification" of the full adder. The circuit implements a full adder iff backward substitution, now starting with $2 c_{0}+s_{0}-a_{0}-b_{0}-c$ instead of $2 c_{0}+s_{0}$, reduces the "specification polynomial" to 0 in the end. (This is the notion usually preferred in SCA-based verification.)

The correctness of this statement already follows from the following lemma:
Lemma 1. Polynomials (with the above mentioned simplifications resp. normalizations) are canonical representations of pseudoBoolean functions (up to reordering of the terms).

The proof of Lemma 1 as well as a discussion of the relation between polynomials and $*$ BMDs can be found in [28], e.g..

## B. Divider Circuits

In the following we briefly review textbook knowledge on dividers. We use $\left\langle a_{n}, \ldots, a_{0}\right\rangle:=\sum_{i=0}^{n} a_{i} 2^{i}$ and $\left[a_{n}, \ldots, a_{0}\right]_{2}:=\left(\sum_{i=0}^{n-1} a_{i} 2^{i}\right)-a_{n} 2^{n}$ for interpretations of bit vectors $\left(a_{n}, \ldots, a_{0}\right) \in\{0,1\}^{n+1}$ as unsigned binary numbers and two's complement numbers respectively. The leading bit $a_{n}$ is called the sign bit. An unsigned integer divider is a circuit with the following property:
Definition 1. Let $\left(r_{2 n-2}^{(0)} \ldots r_{0}^{(0)}\right)$ be the dividend with sign bit $r_{2 n-2}^{(0)}=0$ and value $R^{(0)}:=\left\langle r_{2 n-2}^{(0)} \ldots r_{0}^{(0)}\right\rangle=\left[r_{2 n-2}^{(0)} \ldots r_{0}^{(0)}\right]_{2}$, $\left(d_{n-1} \ldots d_{0}\right)$ be the divisor with sign bit $d_{n-1}=0$ and value $D:=\left\langle d_{n-1} \ldots d_{0}\right\rangle=\left[d_{n-1} \ldots d_{0}\right]_{2}$, and let $0 \leq R^{(0)}<$ $D \cdot 2^{n-1}$.* Then $\left(q_{n-1} \ldots q_{0}\right)$ with value $Q=\left\langle q_{n-1} \ldots q_{0}\right\rangle$ is the quotient of the division and $\left(r_{2 n-2} \ldots r_{0}\right)$ with value $R=$ $\left[r_{2 n-2} \ldots r_{0}\right]_{2}$ is the remainder of the division, if $R^{(0)}=Q \cdot D+R$ (verification condition $1=" v c 1 "$ ) and $0 \leq R<D$ ("vc2").

The simplest algorithm to compute quotient and remainder is restoring division which is the "school method" to compute quotient bits and "partial remainders" $R^{(j)}$. In each step it subtracts a shifted version of $D$. If the result is less than 0 , the corresponding quotient bit is 0 and the shifted version of $D$ is "added back",

[^1]```
Algorithm 1 Non-restoring division.
    \(R^{(1)}:=R^{(0)}-D \cdot 2^{n-1} ;\) if \(R^{(1)}<0\) then \(q_{n-1}:=0\) else \(q_{n-1}:=1 ;\)
    for \(i=2\) to \(n\) do
        if \(R^{(j-1)} \geq 0\) then \(R^{(j)}:=R^{(j-1)}-D \cdot 2^{n-j}\)
        else \(R^{(j)}:=R^{(j-1)}+D \cdot 2^{n-j} ;\)
        if \(R^{(j)}<0\) then \(q_{n-j}:=0\) else \(q_{n-j}:=1\);
    \(R:=R^{(n)}+\left(1-q_{0}\right) \cdot D ;\)
```

i. e., "restored". Otherwise the quotient bit is 1 and the algorithm proceeds with the next smaller shifted version of $D$. Non-restoring division optimizes restoring division by combining in case of a negative partial remainder two steps of restoring division: adding the shifted $D$ back and (tentatively) subtracting the next $D$ shifted by one position less. These two steps are replaced by just adding $D$ shifted by one position less (which obviously leads to the same result). More precisely, non-restoring division works according to Alg. 1.

Fig. 2 shows a combinational circuit for $n=3$ computing $Q$ and $R$ by non-restoring division as it was used in [28]. It works with two's complement numbers, the first row implements a subtractor, row $j$ with $2 \leq j \leq n$ represents a combined adder/subtractor (CAS), depending on the quotient bit $q_{n-j+1}$, row $n+1$ represents the backaddition in case $q_{0}=0$. The divider works with partial remainders $R^{(j)}$ of fixed lengths $2 n-1$, i.e., the width of a row is $2 n-1$. However, well-known textbook implementations make it with rows of width $n$ only, leading to a non-restoring divider as shown in Fig. 3 for $n=3$. This can be seen as follows: Due to shifting of $D$ by $n-j$ bit positions to the left in rows 1 to $n$, the right-most $n-j$ full adders can be omitted in row $j$. The fact that the $j-1$ left-most full adders may be omitted as well needs some deeper insight into non-restoring division and is due to range restrictions derived from the condition on the input range. We obtain the following result for the size of the partial remainders:

Lemma 2. For all partial remainders in non-restoring division with $0 \leq R^{(0)}<D \cdot 2^{n-1}$ we obtain $-D \cdot 2^{n-j} \leq R^{(j)}<D \cdot 2^{n-j}$. Proof. By induction using $0 \leq R^{(0)}<D \cdot 2^{n-1}$.

Corollary 1. For $1 \leq j \leq n R^{(j)}$ may be represented by a two's complement representation with $2 n-j$ bits.
Proof. With $-D \cdot 2^{n-j} \leq R^{(j)}<D \cdot 2^{n-j}$ and $D \leq 2^{n-1}-1 \leq$ $2^{n-1}$ we obtain $-2^{2 n-j-1} \leq R^{(j)}<2^{2 n-j-1}$.

From Corollary 1 we can conclude that $R^{(j)}$ can be represented by $\left(r_{2 n-j-1}^{(j)} \ldots r_{0}^{(j)}\right)$ and a circuit for non-restoring division does not need to compute any bits with higher indices than $r_{2 n-j-1}^{(j)}$. However, in the adder/subtractor stage computing $R^{(j+1)}$, the leading bit $r_{2 n-j-1}^{(j)}$ of $R^{(j)}$ does not need to be used as an input, since we know that the result $R^{(j+1)}$ can be represented by one bit less, i.e., $r_{2 n-j-1}^{(j+1)}$ resulting from addition at bit position $2 n-j-1$ would just be a sign extension of $\left(r_{2 n-j-2}^{(j+1)} \ldots r_{0}^{(j+1)}\right)$. A further analysis shows that for the allowed input range the carry output of the full adder computing $r_{2 n-j-1}^{(j)}$ as a sum output is exactly equal to the negated sign bit $r_{2 n-j-1}^{(j)}$ (which is in turn equal to the quotient bit $q_{n-j}$ ). Thus, $r_{2 n-j-1}^{(j)}$ does not need to be computed at all. Altogether, the considerations sketched above lead to an optimized non-restoring divider implementation shown in Fig. 3.

## III. ANALYZING SCA FOR DIVIDER VERIFICATION

With only a high level view on the algorithm for non-restoring division one could assume that backward rewriting starting with the specification polynomial $Q \times D+R-R^{(0)}$ would lead to small polynomials - at least if the backward rewriting would always hit the boundaries between the stages of the divider and there would not be any significant peaks in polynomial sizes in between: With $R=R^{(n)}+\left(1-q_{0}\right) \cdot D$ the polynomial obtained after processing stage $n+1$ would be $\left(\sum_{i=1}^{n-1} q_{i} 2^{i}+2^{0}\right) \cdot D+R^{(n)}-R^{(0)}$. For $j=2$ to $n$ the algorithm implies $R^{(j)}=R^{(j-1)}-q_{n-j+1}\left(D \cdot 2^{n-j}\right)+$ $\left(1-q_{n-j+1}\right)\left(D \cdot 2^{n-j}\right)=R^{(j-1)}+\left(1-2 q_{n-j+1}\right)\left(D \cdot 2^{n-j}\right)$ and thus the polynomial after processing stage $j$ with $j=n$ to 2 would be (by induction) equal to ( $\sum_{i=n-j+2}^{n-1} q_{i} 2^{i}+2^{n-j+1}$ ). $D+R^{(j-1)}-R^{(0)}$. Finally, after processing stage 1 using the equation $R^{(1)}=R^{(0)}-D \cdot 2^{n-1}$, the polynomial would reduce to 0 .

It has been observed already in [28] that even for the unoptimized implementation from Fig. 2 the polynomials represented at the cuts between stages are different from this high-level derivation. The reason lies in the fact that the stages do not really implement signed addition / subtraction. In general, signed addition / subtraction of two $(2 n-1)$-bit numbers leads to a $2 n$ bit number. The leading bit of the result can only be omitted, if "no overflow occurs". Since this fact cannot be seen by backward reasoning, backward rewriting leads to exponential intermediate polynomials. As a main effect of SAT-based information forwarding (SBIF), [28] could derive from the input constraint $0 \leq R^{(0)}<D \cdot 2^{n-1}$ that in each stage from 1 to $n$ the most significant bits are antivalent, and thus the stages really implement addition / subtraction under this constraint. Using this information as early as possible in backward rewriting avoids blow-ups in the sizes of polynomials.

In the optimized implementation from Fig. 3 the situation is completely different. The most significant bits in stages 1 to $n$ are neither equivalent nor antivalent. Thus, we cannot expect that SBIF with exploiting equivalent / antivalent signals helps much for the design from Fig. 3. In fact, we did not only experimentally observe memory blow-ups during verification of non-restoring dividers as in Fig. 3 with the method from [28], but we were able to prove that the canonical polynomials for pseudo-boolean functions occurring

```
Algorithm 2 Optimized backward rewriting.
Input: Specification polynomial \(S P^{\text {init }}\), Input constraint \(I C\), Circuit \(C U V\) with atomic
    blocks \(a_{1} \prec_{\text {top }} \ldots \prec_{\text {top }} a_{m}\) in topological order \(\prec_{\text {top }}\) and signals \(s_{1}, \ldots, s_{n}\)
    tput: 1 iff specification holds for all inputs satisfying \(I C\)
    \(: S P_{m}:=S P^{\text {init }} ;\) oldsize \(:=\operatorname{size}\left(S P_{m}\right) ; i:=m ; S T:=\emptyset\);
    : \(\left(d c\left(a_{1}\right), \ldots, d c\left(a_{m}\right)\right):=\) Compute_DC( \(\left.C U V, I C\right)\);
    \(:\left(r p\left(s_{1}\right), \ldots, r p\left(s_{n}\right)\right):=\operatorname{SBIF}\left(C \bar{U} V, I C,\left(d c\left(a_{1}\right), \ldots, d c\left(a_{m}\right)\right)\right)\)
    while \(i>0\) do
        \(S P_{i-1}:=\operatorname{Rewrite}\left(S P_{i}, a_{i}\right)\);
        if \(\operatorname{size}\left(S P_{i-1}\right)>\) threshold \(\cdot\) oldsize and \(S T \neq \emptyset\) then
            \((S P, j\), type \()=\operatorname{pop}(S T)\)
            \(i:=j ; S P_{i-1}:=S P\).
            if type \(=\) dc then \(S P_{i-1}:=\) Opt_DC \(\left(S P_{i-1}, d c\left(a_{i}\right)\right)\);
            if type \(=\) eq then \(\forall s \in \operatorname{in}\left(a_{i}\right): S P_{i-1}:=\operatorname{Replace}\left(S P_{i-1}, s, r p(s)\right)\);
        else
            if \(d c\left(a_{i}\right) \neq \emptyset\) then \(\operatorname{push}\left(S T,\left(S P_{i-1}, i, \mathrm{dc}\right)\right)\); oldsize \(:=\operatorname{size}\left(S P_{i-1}\right)\);
            if \(\exists s \in \operatorname{in}\left(a_{i}\right)\) with \(r p(s) \neq s\) then
                \(\operatorname{push}\left(S T,\left(S P_{i-1}, i\right.\right.\), eq \(\left.)\right)\); oldsize \(:=\operatorname{size}\left(S P_{i-1}\right)\);
        \(i:=i-1\)
    return evaluate \(\left(S P_{0}\right)\);
```

| Algorithm 3 evaluate $\left(S P_{0}\right)$. |  |
| :--- | :--- |
| 1: for $i=n-2$ to 0 do |  |
| 2: if $\left.S P_{0}\right\|_{d_{i}=1, r_{i+n-1}^{(0)}=0} ^{(0)} \neq 0$ then return $0 ;$ | $\triangleright$ divider incorrect |
| 3: $\quad S P_{0}:=\left.S P_{0}\right\|_{r_{i+n-1}^{(0)}=d_{i}} ^{(0)} ;$ |  |
| 4: return $1 ;$ | $\triangleright$ divider correct |

at cuts between stages in Fig. 3 have exponential sizes. (The proof is omitted due to lack of space.) Thus, the memory explosion we observed in our implementation of [28] is not due to unfavorable heuristic decisions, but cannot be avoided - if the backward rewriting from [28] is not enhanced by a stronger propagation and exploitation of constraints in the opposite direction from inputs to outputs.

## IV. SCA WITH DON' T CARE OPTIMIZATION

In this section we present our algorithm which uses don't care optimization of polynomials as an essential ingredient. The algorithm starts with a gate level netlist and detects atomic blocks [15], resulting in a circuit with non-trivial atomic blocks (like full adders, half adders etc.) and trivial atomic blocks (original gates not included in non-trivial atomic blocks). We compute a topological order $\prec_{t o p}$ on the atomic blocks with heuristics from [14], [15]. Assume $m$ atomic blocks with $a_{1} \prec_{\text {top }} \ldots \prec_{\text {top }} a_{m}$. The remaining algorithm is given in Alg. 2.

For divider verification the algorithm is started with the specification polynomial $S P^{\text {init }}$ representing $Q \times D+R-R^{(0)}$ and the input constraint $I C=0 \leq R^{(0)}<D \cdot 2^{n-1}$.
a) Computation of satisfiability don't cares: In a first step we compute satisfiability don't cares at the inputs of atomic blocks (like full adders) that result from the input constraint $I C$ (Line 2). We start by using simulation (taking $I C$ into account) to detect candidates for satisfiability don't cares. Whereas in principle we could prove those candidates to be really satisfiability don't cares using SAT, preliminary experiments showed that for large dividers it was not possible to confirm a sufficient number of don't care candidates by SAT due to lack of resources. SAT-solving restricted to windows of moderate size was not successful either, since it seems that for the non-restoring divider designs existing don't cares cannot be confirmed by local reasoning. However, it turned out in our experiments (see Sect. V) that a series of BDD-based image computations [29] was able to derive all satisfiability don't cares at the inputs of atomic blocks. We start with a BDD for representing input constraint $I C$. Then we identify the first atomic block $a_{i}$ in the topological order which has a non-empty set of don't care candidates (computed by simulation). Atomic blocks
$a_{1}, \ldots, a_{i-1}$ form the first slice $s l_{1}$ in the circuit. The output signals of $s l_{1}$ are exactly the signals connecting $s l_{1}$ with atomic blocks $a_{i}, \ldots, a_{m}$, i.e., by construction, the inputs of $a_{i}$ are outputs of $s l_{1}$. We use BDD-based image computation to compute the BDD for the image $I M G_{1}$ of $I C$ under $s l_{1}$. Then we evaluate $I M G_{1}$ wrt. to the don't care candidates at the inputs of $a_{i}$. If the evaluation results in constant 0 for some candidate, then it is not included in the image and it is confirmed as a don't care. Then, we consider the next atomic block $a_{j}$ with a non-empty set of don't care candidates, choose $a_{i}, \ldots, a_{j-1}$ to form the next slice, compute the image $I M G_{2}$ of $I M G_{1}$ under this slice etc.. In the end we have classified all don't care candidates to be real don't cares or not. As an example see Fig. 3 showing (in red) the numbers of don't cares computed for the full adders in the optimized 3-bit divider before constant propagation.
During our BDD-based computations we just use a static variable order corresponding to the initial order in [28]. The variable order is chosen such that the bits $r_{i}$ and $d_{i}$ as well as $r_{i+n-1}^{(0)}$ and $d_{i}$ are arranged side by side and bits with higher indices are evaluated first. Other variables corresponding to internal signals are ordered according to [30], extended to the case that the relative order of the previously mentioned variables has already been fixed.
Note that, as a side effect, those image computations can also be used to check whether verification condition (vc2) from Def. 1 holds. As a last step we perform an image computation with the remaining atomic blocks and obtain the complete image $I M G_{f}$ of $I C$ under the whole circuit under verification. Now we just have to check whether $I M G_{f}$ implies $0 \leq R<D .^{\dagger}$
b) Optimization: During backward rewriting we are using two optimization methods: (1) Exploitation of equivalent / antivalent signals from [28] and (2) our novel don't care optimization for polynomials. For (1) we compute classes of equivalent / antivalent signals (Line 3). The signals $s$ in each class are represented by the unique signal $r p(s)$ that is minimal w.r.t. $\prec_{\text {top }}$. It is important to note that for the divider design from Fig. 3 the original SBIF version with simulation and restricted window-based SATsolving does not work, since it seems that existing antivalences / equivalences cannot be proven by local reasoning. For this reason, it is essential to enhance the SAT-problems with the information that signal combinations corresponding to satisfiability don't cares (as computed in Line 2) cannot occur.

Our experiments in Sect. V show that it is essential to use our novel don't care optimization method for polynomials $P\left(x_{1}, \ldots, x_{n}\right)$ which represent pseudo-boolean functions $f\left(x_{1}, \ldots, x_{n}\right):\{0,1\}^{n} \rightarrow \mathbb{Z}$. If certain valuations for the inputs of a polynomial $P$ cannot occur, since they are satisfiability don't cares, then we can modify the pseudo-boolean function for those valuations. Our goal is to find an assignment to the don't cares that minimizes the size of $P\left(x_{1}, \ldots, x_{n}\right)$. For a polynomial $P\left(x_{1}, \ldots, x_{n}\right)$ with don't care cubes $d c_{1}, \ldots, d c_{n}$ the method consists in the following steps:

- Introduce a new integer variable $v_{i}$ for each don't care cube $d c_{i}$.
- Add for all $1 \leq i \leq n$ " $v_{i} \cdot d c_{i}$ " to $P$.
- Multiply out, combine terms with the same monomial etc..
- Use Integer Linear Programming to minimize the size of $P$.

Example 1. As an example we choose $P\left(x_{1}, x_{2}, x_{3}\right)=1-x_{1}-$ $x_{2}-x_{3}+2 x_{1} x_{2}+2 x_{1} x_{3}+2 x_{2} x_{3}-4 x_{1} x_{2} x_{3}$ with don't care cubes $\neg x_{1} x_{2} x_{3}$ and $x_{1} \neg x_{2} \neg x_{3}$. For $\neg x_{1} x_{2} x_{3}$ we choose the integer

[^2] $D$ and performs backward substitutions.
variable $v_{1}$, for $x_{1} \neg x_{2} \neg x_{3}$ we choose $v_{2}$. Since $v_{1}\left(1-x_{1}\right) x_{2} x_{3}$ and $v_{2} x_{1}\left(1-x_{2}\right)\left(1-x_{3}\right)$ reduce to 0 for all care vectors, we can add them to $P\left(x_{1}, x_{2}, x_{3}\right)$ without changing the polynomial within the care set. By multiplying out and combining terms with the same monomial we arrive at the polynomial $1+\left(v_{2}-1\right) x_{1}-$ $x_{2}-x_{3}+\left(2-v_{2}\right) x_{1} x_{2}+\left(2-v_{2}\right) x_{1} x_{3}+\left(2+v_{1}\right) x_{2} x_{3}+\left(v_{2}-\right.$ $\left.v_{1}-4\right) x_{1} x_{2} x_{3}$. Minimizing the number of terms in the polynomial means choosing integer values for $v_{1}$ and $v_{2}$ such that the constant of a maximum number of terms is 0 (thus eliminating a maximum number of terms). Thus, in the equation system in Fig. 4 we try to satisfy a maximum number of equations. It is easy to see that the optimal solution is $v_{1}=-2, v_{2}=2$, leading to the polynomial $1+x_{1}-x_{2}-x_{3}$.

Satisfying a maximum number of linear integer equations can be reduced to integer linear programming by standard methods (replacing each equation $\ell_{i}\left(x_{1}, \ldots, x_{n}\right)=0$ by $\ell_{i}\left(x_{1}, \ldots, x_{n}\right) \leq$ $M d_{i}$ and $-\ell_{i}\left(x_{1}, \ldots, x_{n}\right) \leq M d_{i}$ with a

| $v_{2}-1$ | $=0$ |
| :---: | :---: |
| $2-v_{2}$ | $=0$ |
| $2-v_{2}$ | $=0$ |
| $2+v_{1}$ | $=0$ |
| $v_{2}-v_{1}-4$ | $=0$ |

Fig. 4. Minimization sufficiently large constant $M$ and a binary deactivation variable $d_{i}$ for each equation, then minimizing the number of deactivation variables assigned to 1 ).

In Lines 4 to 15 of Alg. 2 we perform backward rewriting processing the atomic blocks in reverse topological order. We apply our optimization methods only if needed, i. e., if we observe a significant growth in size of the polynomial $S P_{i-1}$. Whenever we arrive at an atomic block with a non-empty don't care set $d c\left(a_{i}\right)$ or at an atomic block whose set of inputs $\operatorname{in}\left(a_{i}\right)$ contains a signal with a topologically smaller equivalent / antivalent signal (indicated by $r p(s) \neq s$ in Line 13) we save this situation as a backtrack point on a stack $S T$. If the polynomial $S P_{i-1}$ grows too much (Line 6, we use a growth factor threshold $=2.0$ in our implementation), we backtrack to the last backtrack point and perform an optimization of the saved polynomial - either by applying don't care optimization of the polynomial (Line 9) or by replacing input variables of the replaced atomic block by their topologically minimal equivalent / antivalent representative signals (of course in the right polarity, Line 10).
c) Evaluating the final polynomials: We observed that simple and non-optimized SCA-based methods applied to optimized non-restoring dividers as in Fig. 3 finally result in a non-zero polynomial (for small bit widths where the verification finishes). Whereas this is in contrast to the typical assumption in SCA-based verification ("the implementation is correct iff backward rewriting reduces the specification polynomial to 0 "), it does not indicate an error. The divider has only to be correct for input combinations satisfying the input constraint $I C=0 \leq R^{(0)}<D \cdot 2^{n-1}$, i. e., the divider is correct iff the final polynomial evaluates to 0 for all inputs satisfying $I C=0 \leq R^{(0)}<D \cdot 2^{n-1}$. Of course, the number of inputs satisfying the input constraint is exponential and thus it is not advisable to evaluate the polynomial for all admissible input combinations. Fortunately, in the special case of a divider the input constraint $I C$ has a form that allows a decomposition into a linear number of cases to be evaluated, following the idea of bitwise comparing $R^{(0)}$ and $D \cdot 2^{n-1}$ starting with the most significant bit (see Alg. 3).

## V. Experimental Results

Our experimental evaluation has been performed on one core of an Intel Xeon CPU E5-2643 with 3.3 GHz and 62 GiB of main memory. The runtime of all experiments was limited to 24 CPU hours. To solve the ILP problems for don't care optimization
of polynomials we used the ILP solver Gurobi [31]. For image computations we used the BDD package CUDD 3.0.0 [32]. The run times in Table I are given in CPU seconds.

We consider the verification of optimized non-restoring dividers as in Fig. 3 with different bit widths (Col. 1 in Table I). During verification we did not use any hierarchy information. We just used the flat gate-level netlist (numbers of gates are shown in Col. 2) and employed heuristics for detecting atomic blocks as well as for finding a good substitution order [14], [15].

We start with three experiments for comparison. In those experiments we check the equivalence of the divider circuit with a "golden specification". For this we construct a miter circuit between the divider and its golden specification and conjoin it with a circuit for $0 \leq R^{(0)}<D \cdot 2^{n-1}$ to ensure that counterexamples are restricted to the allowed range of inputs.

In the first experiment we used a SAT-solver (MiniSat 2.2.0 [33]) to solve the corresponding satisfiability problems. In Table I the results are presented in Col. 3. The results show that SAT-solving for non-trivial arithmetic circuits is hard and the SAT-problems with bit widths larger than 8 could not be solved due to a timeout. In the second experiment we considered the combinational equivalence checking (CEC) approach of ABC [34], [35] which is based on And-Inverter-Graph (AIG) rewriting via structural hashing, simulation, and SAT and reduces the overall complexity of equivalence checking between two designs by finding equivalent internal AIG nodes. As for SAT-solving, ABC cannot verify the dividers with bit widths larger than 8 , see Col. 4 in Table I. In a third experiment we considered the commercial verification tool from OneSpin [36]. As Col. 5 in Table I shows, the OneSpin tool performs better on the non-restoring dividers, but it needs already more than 1 CPU hour for the verification of 24 -bit-dividers and runs into a timeout for larger dividers.

Col. 6 of Table I shows that the SCA-based method using SBIF from [28] is not suitable for verifying large optimized non-restoring dividers either. The method exceeds the available memory for all dividers with bit widths larger than 8 bits.

In contrast, from Col. 7 we can see that dividers up to 512 bits do not form any problem for our method. The verification of 512bit dividers needs less than 162 CPU minutes. Remember that (as in [28]) we do not need any golden specification circuits for the verification, since we prove that the divider circuits fulfill their abstract specification from Def. 1. We split the overall run time into different sub-tasks for a more detailed analysis. Col. 8 shows the run times for reading the circuit design, Col. 9 the run times for simulations that produce candidates for satisfiability don't cares. In Col. 10 the run times for BDD-based image computation confirming don't care candidates are given. Col. 11 shows the corresponding peak sizes for the number of BDD nodes. Col. 12 gives the run times for computing equivalent / antivalent signals (SBIF, [28]) with a window size of 2 for the SAT-problems, making use of the computed don't cares as well. Finally, Col. 13 gives the run times for SCA-based backward rewriting, Col. 14 the run times for don't care optimization of polynomials, Col. 15 the number of backtrackings our algorithm performs, Col. 16 the peak sizes of polynomials, and Col. 17 gives the size of the final polynomial after rewriting. In almost all cases our optimization was strong enough to minimize the final polynomial to 0 and thus the run times for evaluating the final polynomial were negligible. Verifying condition (vc2) from Def. 1 has been integrated into the BDDbased image computation and is thus included in Col. 10. The results clearly show that don't care optimization of polynomials is fast and effective. By using it before the polynomial sizes have

TABLE I
VERIFYING OPTIMIZED NON-RESTORING DIVIDERS, TIMES IN CPU SECONDS.

|  |  |  |  |  |  | Our method |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $n$ | \#Gates | $\begin{aligned} & \text { SAT } \\ & \text { time } \end{aligned}$ | ABC time | $\begin{aligned} & \text { Commercial } \\ & \text { time } \end{aligned}$ | [28] time | overall time | read <br> time | $\begin{aligned} & \operatorname{sim} \\ & \text { time } \end{aligned}$ | $\underset{\text { BDD }}{\text { time }}$ | $\begin{aligned} & \text { BDD } \\ & \text { nodes } \end{aligned}$ | SBIF <br> time | rewrite time | $\begin{gathered} \text { DCOP } \\ \text { time } \end{gathered}$ | \#bt | peak <br> poly. | final <br> poly. |
| 3 | 54 | 0.27 | $<0.01$ | 2.44 | 0.08 | 0.10 | 0.05 | 0.02 | $<0.01$ | 216 | $<0.01$ | $<0.01$ | 0.02 | 4 | 74 | 6 |
| 4 | 100 | 0.82 | 0.01 | 2.56 | 0.16 | 0.16 | 0.09 | 0.02 | $<0.01$ | 366 | <0.01 | $<0.01$ | 0.03 | 7 | 79 | 0 |
| 8 | 404 | 47.17 | 17.60 | 2.65 | 904.30 | 0.48 | 0.37 | 0.05 | 0.01 | 1,228 | $<0.01$ | 0.02 | 0.04 | 11 | 272 | 0 |
| 16 | 1,588 | TO | TO | 165.89 | MO | 1.91 | 1.59 | 0.13 | 0.03 | 3,617 | $<0.01$ | 0.08 | 0.07 | 19 | 859 | 0 |
| 24 | 3,540 | TO | TO | 4,319.20 | MO | 3.82 | 3.12 | 0.22 | 0.11 | 6,881 | $<0.01$ | 0.25 | 0.11 | 27 | 2,046 | 0 |
| 32 | 6,260 | TO | TO | TO | MO | 6.79 | 5.62 | 0.29 | 0.22 | 11,041 | 0.01 | 0.51 | 0.14 | 35 | 2,864 | 0 |
| 48 | 14,004 | TO | TO | TO | MO | 14.19 | 11.07 | 0.53 | 0.75 | 22,049 | 0.02 | 1.54 | 0.25 | 51 | 7,509 | 0 |
| 64 | 24,820 | TO | TO | TO | MO | 28.86 | 21.51 | 0.87 | 1.92 | 36,641 | 0.03 | 4.11 | 0.40 | 67 | 9,706 | 0 |
| 96 | 55,668 | TO | TO | TO | MO | 70.22 | 44.40 | 1.85 | 7.19 | 76,577 | 0.09 | 15.80 | 0.85 | 99 | 24,724 | 0 |
| 128 | 98,804 | TO | TO | TO | MO | 148.18 | 78.38 | 3.02 | 20.25 | 130,849 | 0.16 | 44.52 | 1.78 | 131 | 54,920 | 0 |
| 256 | 394,228 | TO | TO | TO | MO | 989.91 | 335.40 | 8.62 | 294.84 | 491,297 | 0.68 | 343.91 | 6.49 | 259 | 203,838 | 0 |
| 512 | 1,574,900 | TO | TO | TO | MO | 9,668.70 | 1,258.88 | 26.21 | 5,595.47 | 1,900,321 | 2.78 | 2,762.31 | 23.02 | 515 | 676,521 | 0 |
| 1024 | 6,295,540 | TO | TO | TO | MO | TO | , | - | - | - | - | - | - | - | - | - |

increased to a large extent we are able to successfully verify dividers up to 512 bits with small peak sizes of polynomials.

To check whether don't care optimization of polynomials could also be replaced by don't care optimization at the bit level, we considered the following variant of our algorithm: We forbid to use exactly the don't cares (whose number is two for $n \geq 4$ ) that occur at the inputs of one


Fig. 5. Growth behaviour of polynomials for 32-bit divider. atomic block, the block computing output $q_{1}$. Instead, we replace this block by all possible circuit functions that result from the 4 possible assignments of don't cares in its function table (the resulting variants are called $v 1, v 2, v 3$, and $v 4$ in Fig. 5, $v 1$ is the version which does not change the atomic block). Fig. 5 shows that for the 32-bit divider, e.g., all four variants lead to an exponential growth in polynomial sizes, no matter how the don't care assignment at the bit level is chosen. (The curves for $v 1$ to $v 4$ are not exactly equal, but unfortunately they overlap at the given resolution.) In contrast, for our original method the growth behaviour of the polynomials is moderate. (The observed small peaks always occur before backtracking to an optimization step.) This result clearly shows that optimization of polynomials at the word level is essential and cannot be replaced by bit-level don't care optimization.

## VI. Conclusions and Future Work

We presented a novel ILP-based method for optimizing polynomials w.r.t. satisfiability don't cares. Using this method has been essential to successfully verify large optimized non-restoring dividers. For the verification we had to take into account that usually divider designs are only correct under given input constraints. This observation typically holds for other designs as well and (to the best of our knowledge) has not been used in SCA-based verification approaches so far. We strongly believe that the combination of forward information propagation (leading to the computation of satisfiability don't cares) and backward computation of polynomials that will be optimized using this information is not only applicable to non-restoring dividers, but will be the key to move forward the verification of other divider architectures as well as other arithmetic circuits.

## REFERENCES

[1] T. Coe, "Inside the Pentium FDIV bug," Dr. Dobbs J., vol. 20, no. 4, pp. 129-135, 1995.
[2] R. E. Bryant, "Graph-based algorithms for Boolean function manipulation," TC, vol. 35 no. 8, pp. 677-691, 1986
[3] J. R. Burch, "Using BDDs to verify multipliers," in DAC, 1991, pp. 408-412
[4] A. Sayed-Ahmed, D. Große, U. Kühne, M. Soeken, and R. Drechsler, "Formal verification of integer multipliers by combining Gröbner basis with logic reduction," in DATE, 2016, pp. 1048-1053
[5] M. Temel, A. Slobodová, and W. A. Hunt, "Automated and scalable verification of integer multipliers," in CAV, 2020, pp. 485-507
[6] J. Lv, P. Kalla, and F. Enescu, "Efficient Gröbner basis reductions for formal verification of Galois field arithmetic circuits," TCAD, vol. 32, no. 9, pp. 1409-1420, 2013.
[7] O. Wienand, M. Wedler, D. Stoffel, W. Kunz, and G. Greuel, "An algebraic approach for proving data correctness in arithmetic data paths," in CAV, 2008, pp. 473-486.
[8] F. Farahmandi and B. Alizadeh, "Gröbner basis based formal verification of large arithmetic circuits using gaussian elimination and cone-based polynomial extraction, MICPRO, vol. 39, no. 2, pp. 83-96, 2015.
19] M. Ciesielski, C. Yu, D. Liu, and W. Brown, "Verification of gate-level arithmetic circuit by function extraction," in $D A C, 2015$, pp. 52:1-52:6.
10] C. Yu, W. Brown, D. Liu, A. Rossi, and M. Ciesielski, "Formal verification of arithmetic circuits by function extraction," $T C A D$, vol. 35, no. 12, pp. 2131-2142, 2016.
11] D. Ritirc, A. Biere, and M. Kauers, "Column-wise verification of multipliers using computer algebra," in FMCAD, 2017, pp. 23-30.
12] C. Yu, M. Ciesielski, and A. Mishchenko, "Fast algebraic rewriting based on And-Inverte graphs," TCAD, vol. 37, no. 9, pp. 1907-1911, 2017.
13] D. Ritirc, A. Biere, and M. Kauers, "Improving and extending the algebraic approach for verifying gate-level multipliers,' in DATE, 2018, pp. 1556-1561.
14] A. Mahzoon, D. Große, and R. Drechsler, "PolyCleaner: clean your polynomials before backward rewriting to verify million-gate multipliers," in ICCAD, 2018, pp. 129:1-129:8
15] _—, "RevSCA: Using reverse engineering to bring light into backward rewriting for big and dirty multipliers," in DAC, 2019, pp. 185:1-185:6.
[16] D. Kaufmann, A. Biere, and M. Kauers, "Verifying large multipliers by combining SAT and computer algebra," in FMCAD, 2019, pp. 28-36.
171 A. Mahzoon, D. Große, C Scholl, and R. Drechsler "Towards formal verification of optimized and industrial multipliers," in DATE, 2020, pp. 544-549.
18] J. E. Robertson, "A new class of digital division methods," IRE Trans. Electronic J. E. Robertson, "A new class of digital
Computers, vol. 7, no. 3, pp. 218-222, 1958.

19] R. E. Bryant, "Bit-level analysis of an SRT divider circuit", in DAC, 1996, pp. 661-665.
20] E. M. Clarke, M. Khaira, and X. Zhao, "Word level model checking - avoiding the Pentiun FDIV error," in DAC, 1996, pp. 645-648.
[21] D. M. Russinoff, "A mechanically checked proof of IEEE compliance of the floating point multiplication, division and square root algorithms of the AMD-K7 processor," $L M S$ Journal Comput. Math., vol. 1, pp. 148-200, 1998
22] E. M. Clarke, S. M. German, and X. Zhao, "Verifying the SRT division algorithm using theorem proving techniques," Form Methods Syst. Des., vol. 14, no. 1, pp. 7-44, 1999.
231 J. O'Leary, X. Zhao, R. Gerth, and C.-J. H. Seger, "Formally verifying IEEE compliance of floating point hardware," Intel Technology Journal, vol. Q1, pp. 1-10, 1999.
24] K. Hamaguchi, A. Morita, and S. Yajima, "Efficient construction of binary moment diagrams for verifying arithmetic circuits," in ICCAD, 1995, pp. 78-82
[25] R. E. Bryant and Y. A. Chen, "Verification of arithmetic circuits with binary moment diagrams," in DAC, 1995, pp. 535-541.
26] A. Yasin, T. Su, S. Pillement, and M. J. Ciesielski, "Formal verification of integer dividers Division by a constant," in ISVLSI, 2019, pp. 76-81.
[27] -, "Functional verification of hardware dividers using algebraic model," in VLSI-SOC 2019, pp. 257-262.
28] C. Scholl and A. Konrad, "Symbolic computer algebra and sat based information forwarding for fully automatic divider verification," in $D A C, 2020$
29] O. Coudert and J. C. Madre, "A unified framework for the formal verification of sequential circuits," in ICCAD, 1990, pp. 126-129.
30] S. Malik, A. R. Wang, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, "Logic verification using binary decision diagrams in a logic synthesis environment," in ICCAD, 1988 pp. 6-9.
31] Gurobi Optimization, LLC, "Gurobi optimizer reference manual," 2020. [Online] Available: http://www.gurobi.com
32] F. Somenzi, "Efficient manipulation of decision diagrams," STTT, vol. 3, no. 2, pp. 171181, 2001.
33] N. Eén and N. Sörensson, "An extensible SAT-solver," in SAT, 2003, pp. 502-518.
[34] R. K. Brayton and A. Mishchenko, "ABC: an academic industrial-strength verification tool,' in CAV, 2010, pp. 24-40
[35] "ABC: A system for sequential synthesis and verification," available at https://people.eecs. berkeley.edu/~ alanmi/abcl, 2019
[36] OneSpin Solutions GmbH, 2020. [Online]. Available: https://www.onespin.com


[^0]:    This work was supported by the German Research Foundation (DFG) within the project VerA (SCHO 894/5-1, GR 3104/6-1 and DR 297/37-1).

[^1]:    *As in the case of multipliers where the number of product bits is two times the number of bits of one factor, we consider here the general case that the dividend has twice as many bits as the divisor. If both the dividend and the divisor are supposed to have the same length, we just set $r_{2 n-2}^{(0)}=\ldots=r_{n-1}^{(0)}=0$ and require $D>0$ (which then implies $0 \leq R^{(0)}<D \cdot 2^{n-1}$ ).

[^2]:    ${ }^{\dagger}$ This is in contrast to [28] where the verification of (vc2) starts with $0 \leq R<$

