# A High-Speed Transceiver Architecture Implementable as Synthesizable IP Core

Andreas Wortmann, Sven Simon, Matthias Müller {wortmann,simon,mueller}@siom-tec.com ZIMT, Flughafenallee 10 28199 Bremen, Germany

#### Abstract

In this work, a synthesizable architecture for a serial high-speed transceiver is presented, which can be implemented at register-transfer level (RTL) with standard hardware description languages (HDL). The proposed implementation as a soft IP macro can be synthesized with a semi-custom design flow, which is widely used in industry whenever possible.

Generally, the implementation of high-speed transceivers is a typical domain of full-custom design, because the timing-critical parts are realized by dedicated transistor-level design of the PLL/DLL-based architectures. Compared to this method, design productivity can be enhanced significantly by using this soft IP macro. With the presented implementation, data rates of about 1 GBit/s can be achieved. This is certainly less than full-custom implementations achieve. Nevertheless, it is an appealing solution for short design time and low design cost if the achieved data rate is sufficient. In addition, current research shows that data rates above the mentioned result can be achieved.

#### 1. Introduction

Over the last decade, the computation power of VLSI implementations has increased significantly with technology scaling. Thus, many VLSI architectures have become I/O-limited, especially if synchronous data transfer is used. Synchronous parallel data transfer is usually limited to 100-300 Mbit/s and is prohibitively expensive for high data bandwidth because a huge number of I/Os and complex wiring are needed for the required amount of data. With the development of synchronizing receiver architectures, high-speed serial transmission techniques evolved into a solution for the I/O bottleneck.

The large number of publications on serial high-speed transceivers can be divided into two groups which differ in their architectural concept, see [4, 13]. A clock/data recovery architecture (CDR, see Figure 1 a)) is based on a control loop sensitive to signal changes of the incoming data signal. The clock signal used to sample the incoming data is recovered from the data signal itself. The loop is usually implemented as a phase-locked loop (PLL).

Figure 1. Established solutions for correct data reception. a) CDR b) DR.

A data recovery architecture (DR, see Figure 1 b)) uses a number of equidistant, well-adjusted phases of a single reference clock signal to oversample the data signal. Generating the different clock phases also requires a PLL or a delay-locked loop (DLL).

For very high data rates, architectures are implemented in bipolar or GaAs technology [7]. Nevertheless, for common applications high-speed transceivers are typically implemented in standard CMOS technology [2, 6, 11, 14] for cost reasons [3, 8]. To further reduce production cost, the transceiver circuits are implemented as full-custom macros and are integrated on the same die with other circuitry [4]. Often this full-custom block is the only one on the die and thus prohibits a 100 % semi-custom implementation of a system [1]. The transceiver solution proposed in this paper can be implemented within a standard semi-custom design flow as a soft IP macro. In such a flow, a register-transfer level description of the circuit in a hardware description language like VHDL or Verilog is synthesized to gates. The final layout is then generated automatically by place and route steps. The majority of industrial IC designs are based on a semi-custom design flow, since its application is fast and inexpensive and verification is better automated. Additionally, porting the design to a different technology ($0.18 \mu m$, $0.15 \mu m$, $0.11 \mu m$) and to a different foundry is easy. As shown in [9], very complex circuits can be implemented with a semi-custom approach and still reach execution speeds typically only achievable by time-consuming custom designs.

An important aspect of high-speed transceiver circuits is the sensitivity to jitter. The detailed discussion in Section 3 differentiates between long-term and short-term jitter. On-chip noise being the source of short-term jitter is investigated in [10].

The architecture proposed in this paper is capable of compensating for delay variations of both the data and clock signals due to jitter and process technology (best case, worst case). The advantages regarding the design flow come at the cost of a lower transmission rate. For example, Horowitz et al. estimate the achievable data rate of a $0.35 \mu m$ CMOS technology to be 5.5 Gbit/s when all developed techniques such as predistorting transmitters, receiver equalization, multilevel signaling and parallel architectures are used [4, 12]. In contrast, the highest nominal toggle rate of a standard cell flip-flop in the same technology, which is the relevant figure for the proposed architecture, is typically about 900 MHz. The maximal toggle rate is expected to scale with technology. For example, the highest nominal toggle rate in a $0.18 \mu m$ CMOS technology is around 1 GHz. Technology scaling and further development on the architectural level, which has not been exhausted yet, open the prospect of achieving data rates of 2.5 Gbit/s using the soft IP approach with an extension of the proposed architecture.

#### 2. Transceiver Architecture

The top level of the proposed architectural concept is shown in Figure 2. Basically, the architecture consists of two blocks. In the first block, the serial input data signal is sampled with respect to the reference clock signal and a set of so-called phase bit streams (PBS) is generated. The second block selects one of these phase bit streams carrying the correct, i.e. the originally sent, data. The first block is referred to as the phase bit stream generator (PBSG) and the second is referred to as the selection logic (SL). Furthermore, a serial to parallel converter might be used for the output data  $D_{out}$  before the received data is passed on to the relatively slow core logic.
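The role of the PBSG can be illustrated with a small behavioural model. The paper does not describe the internals of the PBSG, so the following Python sketch rests on an assumption: each phase bit stream is taken to be a copy of the data signal delayed by a fixed tap delay and sampled once per period of $clk_1$. The names `phase_bit_streams`, `num_taps`, `tap_delay` and `offset` are hypothetical, and the waveform is an ideal NRZ signal without jitter.

```python
import math
from typing import List


def phase_bit_streams(bits: List[int], num_taps: int, tap_delay: float,
                      offset: float = 0.0) -> List[List[int]]:
    """Return one sampled bit stream per delay tap (hypothetical PBSG model).

    bits      -- transmitted NRZ bits, one per bit period
    num_taps  -- number of phase bit streams (PBS) to generate
    tap_delay -- delay between adjacent taps, as a fraction of the bit period
    offset    -- unknown, non-negative phase offset between data and clk_1
                 (fraction of the bit period)
    """
    # First clk_1 edge at which even the most delayed tap sees transmitted data.
    first_edge = math.ceil((num_taps - 1) * tap_delay + offset)
    streams = []
    for p in range(num_taps):
        stream = []
        for k in range(first_edge, len(bits)):
            # Point of the ideal data waveform seen by tap p at clock edge k.
            t = k - p * tap_delay - offset
            stream.append(bits[math.floor(t)])
        streams.append(stream)
    return streams


if __name__ == "__main__":
    sent = [1, 0, 1, 1, 0, 0, 1, 0]
    pbs = phase_bit_streams(sent, num_taps=8, tap_delay=0.25, offset=0.1)
    for p, s in enumerate(pbs):
        print(f"PBS {p}: {s}")
```

With four taps per bit period, as in the demo call, adjacent taps fall into groups that carry identical data, and neighbouring groups carry the same data shifted by one bit. The selection logic then only has to pick one stream from such a group.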



Figure 2. Architectural concept of the high speed receiver architecture.

Besides the input data signal $D_{in}$, a reference clock $clk_1$ with a frequency matching the data rate is required. To obtain such a reference clock signal within the required frequency range from an external low-frequency clock, a PLL might be necessary if no other signal source can be used. Even then, the number of PLLs/DLLs in an SoC design is reduced if several high-speed I/Os are placed on one chip, because one clock signal with arbitrary skew is sufficient to serve all receivers. As calculated in Section 3, it is important that the frequency of the reference clock stays within a certain margin of the data rate. Generally, this is the case for plesiochronous systems.

In the PBSG circuit, metastability problems may occur. It is recommended to use two or more flip-flops in series to reduce the probability of metastable failure [5]. If the standard cell library used comprises a metastable-hardened flip-flop, it is recommended to use it. Apart from that, the sampled data of the PBS which are propagated to the output of the circuit do not violate the setup and hold times.

In general, the data values of the phase bit streams (PBS) are influenced by jitter and wander of the data and clock signal. Groups of adjacent PBSs show the same sampled data at each sampling instant, see Figure 3. Each of these groups is called a data bit stream (DBS). Figure 3 a) shows an eye diagram of the transmitted signal, and each eye opening represents a data bit stream. Jitter and noise effects can clearly be observed. The affiliation of the PBSs with these data bit streams varies over time due to the described long- and short-term jitter.

Noise is introduced by the transmission media and I/O buffers, leading to short-term jitter (also denoted as cycle-to-cycle jitter) added to the exact positions of the signal transitions. Long-term jitter, referred to as wander between the clock and data signal, is added to the data signal due to slow temperature and operating voltage variations [4]. Additionally, the signal on the chip is exposed to coupling due to simultaneous switching activity of the core. The on-chip noise spectrum as investigated in [10] contributes to the short-term or cycle-to-cycle jitter.

Figure 3. Generation of phase bit streams and data bit streams from the data signal shown as an eye diagram.

The PBSs close to the signal's transitions are faulty since they lie within the margin used up by the short-term jitter. Deviations between the reference clock and the transmission data rate cause the DBSs to wander across all possible PBSs and even to leave the coverage of the PBSs. Thus, jitter and wander introduce faults into some PBSs. The openings shown by the eye diagram represent the PBSs which are used to derive each DBS. As shown in Figure 3 b), adjacent PBSs form each DBS.

The selection logic selects the PBS whose data is most likely to be identical to the transmitted data. To achieve the highest possible serial data rate, the sampling frequency is set to the highest clock rate the standard cell library supports. The selection logic, in contrast, can be operated at a lower frequency. This relaxes the timing constraints and makes it possible to handle the design in a semi-custom design flow.

Figure 4. Implementation blocks of the selection logic.

The basic architecture of this DBS tracking finite state machine (FSM) is shown in Figure 4. The circuit is split into two clock domains, the fast sampling clock $clk_1$ with period $T_1$ and the slower domain of $clk_2$ with $T_2 = nT_1$. The factor $n$ is chosen such that a semi-custom design flow can conveniently be applied to the FSM design in clock domain $clk_2$. The 'MakeDiff' block compares every two adjacent PBS in the $clk_1$ domain and checks whether both signals are identical. If they differ, at least one of the two PBS does not contain the correctly sampled input data, and thus they do not belong to the same DBS. In order to analyze these difference signals in the clock domain $clk_2$, each difference signal is accumulated over a period of time. If a difference occurs within this period, the two PBS are treated as differing, as described above. The circuit responsible for this accumulation is the 'Zero Trap' block in Figure 4. A detailed view of the zero trap is shown in Figure 5 as an FSM and as a gate-level circuit. Once a '0' is detected at the zero trap's input (diff), it is buffered and exhibited at the output (trap) until the trap is reset. The generated trap signals thus accumulate the differences observed on the difference signals and hence change at a relatively low frequency.
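The interplay of 'MakeDiff' and 'Zero Trap' can be sketched as a behavioural model. This is a minimal Python illustration and not the RTL of the macro. It assumes that a diff value of 1 means "adjacent PBS identical" and 0 means "different", which matches the zero trap reacting to a '0', and that the trap output is 0 once sprung. The names `make_diff`, `ZeroTrap` and `accumulate_traps` are hypothetical.

```python
from typing import List


def make_diff(pbs_samples: List[int]) -> List[int]:
    """Compare adjacent PBS samples of one clk_1 cycle.

    Returns 1 where two neighbours agree and 0 where they differ
    (polarity assumed, see the lead-in).
    """
    return [1 if a == b else 0 for a, b in zip(pbs_samples, pbs_samples[1:])]


class ZeroTrap:
    """Latches a '0' on its input until it is explicitly reset."""

    def __init__(self) -> None:
        self.trap = 1          # 1 = no difference seen since the last reset

    def clock(self, diff: int) -> int:
        if diff == 0:          # a single '0' is enough to spring the trap
            self.trap = 0
        return self.trap

    def reset(self) -> None:
        self.trap = 1


def accumulate_traps(cycles: List[List[int]]) -> List[int]:
    """Run MakeDiff and one zero trap per PBS pair over several clk_1 cycles.

    cycles holds the PBS samples of n consecutive clk_1 cycles; the result is
    the trap vector the clk_2 domain would read before resetting the traps.
    """
    traps = [ZeroTrap() for _ in range(len(cycles[0]) - 1)]
    for samples in cycles:
        for trap, diff in zip(traps, make_diff(samples)):
            trap.clock(diff)
    return [t.trap for t in traps]


if __name__ == "__main__":
    # Five PBS observed over three clk_1 cycles.
    observed = [[1, 1, 1, 0, 0],
                [0, 0, 0, 0, 1],
                [1, 1, 1, 1, 1]]
    print(accumulate_traps(observed))
```

In the demo, the last two PBS pairs each differ in one cycle only, yet their traps stay sprung for the whole window. This is the accumulation that lets the slower $clk_2$ domain see events from the fast domain.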

Figure 5. Schematic and FSM of a zero trap.

State machines, referred to as Selector A and B in Figure 4, take the trap signals as input and follow a DBS, which might be time-varying. By detecting the adjacent areas of jitter, each selector chooses a PBS within a DBS which is sufficiently far away from the areas of jitter. The selection is updated as soon as the edge of the DBS is detected. The selection cannot be updated if there is no PBS left to select. In this case, synchronization is lost and the selector needs to synchronize with a different DBS. Since no data must be lost during this period, another selector instance works in parallel, synchronized with a different DBS, and takes over responsibility for selecting a valid PBS.
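The idea of choosing a PBS far away from the jitter areas can be illustrated with a deliberately simplified stand-in. The sketch below is not the authors' selector FSM, which is not described here. It is a stateless Python function that assumes a trap value of 1 means "no difference observed between PBS i and PBS i+1", and it picks the middle PBS of the longest such run. The names `select_pbs` and `min_valid` are hypothetical; the default of 5 mirrors the minimum number of valid PBS per DBS stated for the implementation in Section 3.

```python
from typing import List, Optional


def select_pbs(traps: List[int], min_valid: int = 5) -> Optional[int]:
    """Pick the PBS index in the middle of the widest stable region.

    traps[i] == 1 is assumed to mean that PBS i and PBS i+1 agreed during
    the whole observation window; 0 marks a jitter area between them.
    Returns None if no region contains at least min_valid PBS.
    """
    best_start, best_len = 0, 0
    start = None
    # A trailing 0 acts as a sentinel that closes a run reaching the end.
    for i, t in enumerate(list(traps) + [0]):
        if t == 1 and start is None:
            start = i
        elif t != 1 and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    # A run of k consecutive '1' traps spans k + 1 phase bit streams.
    num_pbs = best_len + 1 if best_len > 0 else 0
    if num_pbs < min_valid:
        return None
    return best_start + num_pbs // 2


if __name__ == "__main__":
    print(select_pbs([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1]))
```

A real selector would additionally keep state, update its choice incrementally when the edge of the DBS moves, and hand over to the second selector when it runs out of PBS, as described above.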

A detailed discussion of the operation of the selector FSMs is beyond the scope of this paper.

#### 3. Experimental Results

Simulations have confirmed the functionality of the architecture. Different jitter scenarios have been simulated and show that the architecture works properly. The tracking architecture described in the previous section results in no data loss even under a high short-term jitter of 70%, evenly distributed. Two simulations are shown in Figures 6 and 7. In the first figure, the PBS selection with high short-term jitter (70% of the clock cycle) is shown. The solid line plots the trajectory of the chosen bit stream. Only PBSs chosen by selector B are taken into account because the received data is taken from selector B in this case. In the picture, the bad signal quality can be observed in the trap signals. The random jitter is spread over 7 or 8 adjacent trap signals and causes randomly distributed signal changes. A sinusoidal long-term jitter is added between the clock and the data signal, showing both the necessity and the ability of the tracking architecture to adjust the selection.

In Figure 7, a scenario with less short-term jitter and a frequency deviation between the clock and data signal is depicted. Both selectors A and B are used because the resulting wander is larger. If selector B is out of range, the PBS of the selector A is chosen and vice versa.



Figure 6. PBS selection with a high amount of jitter.



Figure 7. Selection trajectories of both selectors A and B.

The transceiver circuit has been implemented on two low-cost FPGAs for emulation purposes, running at clock frequencies of 50 MHz, 150 MHz and 320 MHz, to verify correct functionality before chip fabrication. In all cases, no bit errors occurred during an experiment lasting several days. In the test environment, the delay of the data between the transmitting and the receiving FPGA was set above the clock period by using coax cables of different lengths. This ensured an asynchronous data input signal at the receiver side for the macro verification. Also, opening and closing the connection asynchronously showed that the device resynchronizes properly.

Our implementation requires at least 5 valid PBS to form each DBS. Table 1 shows the number of PBS forming a DBS for a given amount of evenly distributed short-term jitter. A high resolution, i.e. a large number of PBS per transmitted data bit, is required to achieve good jitter tolerance. If fewer than 5 PBS are valid for each DBS, the selection logic may not work correctly. An insufficient number of valid PBS is printed in italics in Table 1. The number of these cases increases with the clock frequency and the amount of jitter, and decreases with the number of PBS per bit. Thus, the robustness against jitter can be enhanced by using a sufficient number of PBS per bit.

| Jitter | 23.0 PBS/Bit | 16.1 PBS/Bit | 12.4 PBS/Bit | 10.1 PBS/Bit | 8.5 PBS/Bit |
|--------|--------------|--------------|--------------|--------------|-------------|
| 10%    | 20.7         | 14.5         | 11.1         | 9.0          | 7.6         |
| 30%    | 16.1         | 11.2         | 8.7          | 7.0          | 5.9         |
| 50%    | 11.5         | 8.0          | 6.2          | 5.0          | *4.2*       |
| 70%    | 6.9          | *4.8*        | *3.7*        | *3.0*        | *2.5*       |

Table 1. Number of valid PBS per DBS, i.e. PBS with valid data associated with one DBS, at 622 MBit/s for various short-term jitter values.

| $clk_2$ (MHz) | 23.0 PBS/Bit | 16.1 PBS/Bit | 12.4 PBS/Bit | 10.1 PBS/Bit | 8.5 PBS/Bit |
|---------------|--------------|--------------|--------------|--------------|-------------|
| 40            | 0.28         | 0.40         | 0.52         | 0.64         | 0.76        |
| 48            | 0.34         | 0.48         | 0.62         | 0.77         | 0.91        |
| 62.2          | 0.44         | 0.62         | 0.81         | 1.00         | 1.18        |
| 75            | 0.52         | 0.75         | 0.97         | 1.20         | 1.42        |
| 100           | 0.70         | 1.00         | 1.30         | 1.60         | 1.90        |

Table 2. Maximal tolerable deviation (in %) between the reference clock and the data rate, depending on the selectors' tracking abilities, at 622 MBit/s (long-term jitter).
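The entries of Table 1 appear to follow a simple relation: the share of a bit period not consumed by evenly distributed short-term jitter, multiplied by the number of PBS per bit. This reading is inferred from the numbers and is not stated explicitly in the text. It reproduces every table entry to within 0.1, with the small differences presumably due to rounding of the PBS per bit values. The following Python sketch recomputes the table under that assumption; `valid_pbs_per_dbs` and `MIN_VALID_PBS` are hypothetical names.

```python
PBS_PER_BIT = [23.0, 16.1, 12.4, 10.1, 8.5]   # columns of Table 1
JITTER = [0.10, 0.30, 0.50, 0.70]             # rows of Table 1
MIN_VALID_PBS = 5                             # requirement stated for the implementation


def valid_pbs_per_dbs(pbs_per_bit: float, jitter: float) -> float:
    """Assumed model: PBS lying outside the jitter region of one bit period."""
    return pbs_per_bit * (1.0 - jitter)


if __name__ == "__main__":
    for j in JITTER:
        cells = []
        for p in PBS_PER_BIT:
            v = valid_pbs_per_dbs(p, j)
            # Mark configurations that fall below the required minimum.
            mark = "" if v >= MIN_VALID_PBS else " (insufficient)"
            cells.append(f"{v:.2f}{mark}")
        print(f"{int(j * 100)}%: " + ", ".join(cells))
```

Under this model, the configurations flagged as insufficient are those with 8.5 PBS per bit at 50% jitter and all but the 23.0 PBS per bit column at 70% jitter.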

Table 2 summarizes the maximal tolerable deviation between the reference clock and the data rate as a function of the number of PBS per bit at 480 MBit/s, 622 MBit/s and 750 MBit/s, respectively. It thus gives the selectors' ability to track long-term jitter. Since the selection logic operates at the clock frequency of $clk_2$, which is *n* times slower than the sampling frequency of $clk_1$, an adjustment by one PBS can be made only every *n*th bit. Thus, a small number of PBS per bit enables the logic to track phase variations faster.
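The tracking limit described above can be turned into a small calculation. If the selection moves by at most one PBS, i.e. by 1/(PBS per bit) of a bit period, once every *n* bits, the tolerable frequency deviation is roughly 1/(n · PBS per bit). The following Python sketch evaluates this under two assumptions that are inferred rather than stated: that the $clk_2$ values of Table 2 are given in MHz, and that $n$ is the ratio of the data rate to $clk_2$. With these assumptions the sketch reproduces the 622 MBit/s entries of Table 2 to within about 0.01 percentage points. The function name `max_deviation_percent` is hypothetical.

```python
def max_deviation_percent(data_rate_mbit: float, clk2_mhz: float,
                          pbs_per_bit: float) -> float:
    """Tolerable clock/data-rate deviation in percent (assumed model).

    One PBS step (1 / pbs_per_bit of a bit period) can be corrected
    once per clk_2 cycle, i.e. once every n transmitted bits.
    """
    n = data_rate_mbit / clk2_mhz          # bits per clk_2 cycle
    return 100.0 / (n * pbs_per_bit)


if __name__ == "__main__":
    for clk2 in (40, 48, 62.2, 75, 100):
        row = [max_deviation_percent(622, clk2, p)
               for p in (23.0, 16.1, 12.4, 10.1, 8.5)]
        print(clk2, [f"{v:.2f}" for v in row])
```

The model also makes the trade-off explicit: a finer phase resolution improves the tolerance to short-term jitter but reduces the frequency deviation that can be tracked at a given $clk_2$.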

The highly pipelined design consists of flip-flops with just one or two levels of logic in between. Figure 8 shows the histogram of the path delays within the timing-critical clock domain of the sampling clock $clk_1$ of our implementation. As a side effect of this highly pipelined structure, a very good fault coverage of the design can be achieved if a scan-path-based production test is used. A fault coverage of more than 98% has been determined. Because of the very limited logic depth available within these fast pipeline stages, the automaton selecting the PBS has been implemented in the domain of a synchronously divided, slower clock.



Figure 8. Histogram of the path delays in the domain of clock  $clk_1$  for a  $0.35\mu m$  technology.

A test chip in $0.35 \mu m$ technology is currently under fabrication. The chip is expected to show that the design technique works properly at clock frequencies above those of the FPGA implementations. Furthermore, it demonstrates the portability of the design to an ASIC technology. The test chip is shown in Figure 9, where the layout typical of a semi-custom design can be seen. Efficient timing-driven placement optimizes the placement of the standard cells, so that cells forming critical paths are placed close to each other. Therefore, the different design modules can be identified in the figure. The total core cell area is approximately $1 mm^2$.

The maximum number of pad cells surrounding the cell area was chosen, although serial transmission with the test chip requires only one high-speed input and one high-speed output. The reason for the large number of I/Os is the number of control signals which help to analyze the impact of jitter effects on the circuit behavior. Additionally, the output data of a serial/parallel converter (S/P) are connected to output cells.

The prototype chip comprises the phase bit stream generator (PBSG) and the selection logic (SL) to implement the basic functionality. Additionally, a fast FIFO and a serial/parallel converter (S/P) have been implemented to realize the receiver functionality. Plesiochronous transmission systems require fast FIFOs to buffer the received data and compensate for frequency deviations. Also included in the design is a transmitter unit (Tx) comprising a parallel/serial converter and a data generator. An evaluation logic (EL) is implemented to verify the various modes of operation and to gain further experience with the device's jitter behavior. The evaluation modes cover the functional replacement of different parts of the design by external components which can be implemented on an FPGA. These external components are usually slower, but the basic functionality of the chip can still be guaranteed if a component does not work properly.



Figure 9. Test chip.

#### 4. Conclusion

In this paper, a novel synthesizable high-speed transceiver architecture is proposed to which an industrial semi-custom design flow can be applied. The architecture is implemented as a soft IP macro at register-transfer level with a standard hardware description language. Usually, a full-custom design style is used for high-speed transceiver architectures, which entails very time-consuming development and much more effort for technology migration. The major contribution of this work is to make a semi-custom design flow applicable to the class of high-speed transceiver architectures.

With the implementation presented here, data rates of about 1 GBit/s can be achieved for the transceiver macros in modern process technologies. This is certainly less than the figures achievable with full-custom macros. Nevertheless, this is no drawback for applications where a higher data rate is not necessary and the technology-independent description of the IP macro is an advantage. In addition, current research results show that higher data rates are possible if this architecture is extended. A test chip for this patent-pending technology in a standard $0.35 \mu m$ CMOS technology is currently under fabrication. Further research will include modifications of the architecture to accomplish even higher data rates while still applying a semi-custom design flow.

#### References

- [1] International Workshop on IP-Based SoC Design. Grenoble, November 13-14, 2003.
- [2] Cao, Momtaz, and Vakilian. OC-192 receiver in standard 0.18 μm CMOS. In *International Solid-State Circuits Conference*, 2002.
- [3] D. Chinnery and K. Keutzer. Closing the gap between ASIC and custom: An ASIC perspective. In *Design Automation Conference*, 2000.
- [4] M. Horowitz, C. K. K. Yang, and S. Sidiropoulos. High-speed electrical signaling: overview and limitations. *IEEE Micro*, 18(1):12–24, 1998.
- [5] Johnson and Graham. *High-Speed Digital Design, A Handbook of Black Magic*. Prentice Hall, 1993.
- [6] Kurisu, Fukaishi, Asazawa, Nishikawa, Nakamura, and Yotsuyanagi. Design innovations for multi-gigahertz-rate communication circuits with deep-submicron CMOS technology. *IEICE Trans. Electron., Special Issue on Ultra-High-Speed IC and LSI Technology*, E82-C(3), 1999.
- [7] Meghelli, Rylyakov, and Shan. 50 Gb/s SiGe BiCMOS 4:1 multiplexer and 1:4 demultiplexer for serial communication systems. In *International Solid-State Circuits Conference*, 2002.
- [8] S. Rich, M. Parker, and J. Schwartz. Reducing the frequency gap between ASIC and custom designs: A custom perspective. In *Design Automation Conference*, 2001.
- [9] N. Richardson, L. B. Huang, R. Hossain, T. Zounes, N. Soni, and J. Lewis. The iCORE 520 MHz synthesizable CPU core. In *Design Automation Conference*, 2002.
- [10] K. Shepard, V. Narayanan, and R. Rose. Harmony: static noise analysis of deep submicron digital integrated circuits. *IEEE Transactions on CAD of ICAS*, 18(8), 1999.
- [11] S.-J. Song, S. M. Park, and H.-J. Yoo. A 4-Gb/s CMOS clock and data recovery circuit using 1/8-rate clock technique. *IEEE Journal of Solid-State Circuits*, 38(7), July 2003.
- [12] V. Stojanovic, G. Ginis, and M. A. Horowitz. Transmit pre-emphasis for high-speed time-division-multiplexed serial-link transceiver. In *IEEE International Conference on Communications*, 2002.
- [13] Yang, Farjad-Rad, and Horowitz. A 0.5 μm CMOS 4.0-Gbit/s serial link transceiver with data recovery using oversampling. *IEEE Journal of Solid-State Circuits*, 33(5), May 1998.
- [14] A. Zolfaghari and B. Razavi. A low-power 2.4-GHz transmitter/receiver CMOS IC. *IEEE Journal of Solid-State Circuits*, 38(2), February 2003.