# Low Power Design on Algorithmic and Architectural Level: A Case Study of an HSDPA Baseband Digital Signal Processing System

M. Schämann, S. Hessel and U. Langmann, Lehrstuhl für Integrierte Systeme, Ruhr-Universität Bochum, D-44780 Bochum, Germany, marcus.schaemann@is.rub.de

M. Bücker, Nokia Research Center, Meesmannstr. 103, D-44807 Bochum, Germany, martin.bucker@nokia.com

#### Abstract

The optimization of power consumption plays a key role in the design of a cellular system: increasing data rates together with high mobility represent a constantly growing design challenge, because they require advanced algorithms with higher complexity, more chip area and increased power consumption, which conflicts with the limited power supply of a mobile device. In this contribution, digital baseband components for a High Speed Downlink Packet Access (HSDPA) system are optimized on the algorithmic and architectural level. Three promising algorithms for the equalization of the propagation channel are compared regarding performance, complexity and power consumption using fixed-point SystemC models. On the architectural level, an adaptive control unit is introduced together with an output interference analyzer. The presented strategy reduces the arithmetic operations under favorable propagation conditions by up to 70 %, which corresponds to an estimated power reduction of up to 40 %, while the overall performance is not affected.

## 1 Introduction

Advanced receiver structures and algorithms have been proposed for cellular systems which increase data rates and mobility of users [1, 2]. However, they are difficult to implement in a small mobile device due to their higher complexity compared with current solutions. Either a higher clock frequency or more parallel processing units are required to perform more arithmetic operations in the same time. The increased demand for chip area can be compensated by a higher level of integration, but the increasing power consumption not only leads to strong heat development but also restricts the standby and talk time of a mobile device due to the limited power supply. Therefore, the optimization of downlink receiver designs becomes important at all stages of the design process. The selection of the algorithm for the system as well as its partitioning into hardware and software tasks offers the highest potential for power optimization. On the behavioral and register-transfer level, remarkable power reductions can also be achieved by variations of the architecture. In addition, power consumption can be estimated faster at higher design levels than at transistor level [3].

In our work, we analyze and optimize mobile High Speed Downlink Packet Access (HSDPA) receiver components concerning the power consumption on algorithmic and architectural level using fixed-point models described with SystemC. The HSDPA protocol is an extension of the Wideband Code Division Multiple Access (WCDMA) system specified by the 3GPP, also known as Universal Mobile Telecommunication System (UMTS).

The paper is organized as follows: Section 2 gives a short overview of the propagation conditions, the parameters of the transceiver model and the specifications for the investigated application. Section 3 classifies algorithms regarding raw bit error rate performance (i.e. BER without channel coding), complexity and estimated power consumption. On the architectural level, Section 4 explains general and adaptive strategies to reduce the overall power consumption by means of the proposed control strategies. Results of simulations and power estimations on both levels lead to the conclusion of this paper.

## 2 Channel and transceiver model

The propagation channel modifies the transmitted signal and creates a challenge for the receiver design. The most serious effect is the multipath propagation due to reflections at surrounding buildings and structures, which creates a large variation of the power received at the mobile device and leads to intra-symbol and inter-symbol interference (ISI). Other effects are Doppler frequency shifts, and thus frequency and timing errors, as well as additive white Gaussian noise (AWGN). For cellular systems most of these properties also change with the movement of the receiver and its surroundings, which creates a time-variant slow and fast fading channel [4].

This work has been funded by the German Federal Ministry of Education and Research (BMBF) within the research project LEMOS (grant no. 01M3155). Further information is available on the project website: http://lemos.offis.de

Therefore, one of the most complex and power-consuming parts of the digital baseband signal processing is the multipath combination. The deterioration of the signal by the propagation channel has to be compensated at this stage to restore the chip-spaced signal. Several approaches and algorithms have been proposed for advanced WCDMA receivers to compensate the propagation channel, minimize interference and thus improve the BER performance [1]. The complexity of the optimal maximum likelihood detector scales non-polynomially with the number of users, which makes it difficult to implement. As a consequence, suboptimal receivers such as the decorrelating detector, the linear MMSE equalizer and several interference cancellation methods have been proposed and investigated to combine or equalize the information received from the propagation channel or to eliminate the interference which corrupts the signal.

#### 2.1 Rake combiner

The Rake combiner is most commonly used in CDMA receivers (described e.g. in [5, 6]). It superposes the incoming rays in the receiver at maximum ratio by reversing the phase rotation of the propagation channel, compensating the delay and weighting the rays according to their signal power. Usually, only the strongest paths of the channel impulse response (CIR) are used to reduce the complexity (typically 4 to 8 fingers). The fingers are separated by adjustable delays and apply the conjugate complex CIR to the incoming data. Due to its simple and scalable structure the Rake combiner is implemented in most current mobile devices. However, its inability to provide a sufficient raw BER performance under certain propagation conditions can be regarded as a drawback of the Rake combiner because it does not consider interference which results from the correlation of multiple paths among each other.
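The combining step described above can be sketched in a few lines of C++. This is a minimal floating-point illustration, not the fixed-point SystemC model used in this work; the names `Finger` and `rake_combine` and the sample indexing are assumptions made for the example.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// One Rake finger: a path delay (in samples) and the estimated complex
// channel coefficient of that path. Names are illustrative assumptions.
struct Finger {
    std::size_t delay;
    cplx        gain;
};

// Maximum ratio combining of the output sample at index n:
// each finger picks the sample aligned with its path delay and multiplies
// it with the complex conjugate of its channel coefficient. This reverses
// the phase rotation of the path and weights it by its amplitude.
cplx rake_combine(const std::vector<cplx>& rx,
                  const std::vector<Finger>& fingers,
                  std::size_t n) {
    cplx sum{0.0, 0.0};
    for (const Finger& f : fingers) {
        assert(n + f.delay < rx.size());  // caller must supply enough samples
        sum += std::conj(f.gain) * rx[n + f.delay];
    }
    return sum;
}
```

Each finger costs one complex multiplication per output sample, which is why the complexity scales only with the number of fingers and not with the length of the channel impulse response.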

#### 2.2 MMSE equalizer

The linear Minimum Mean Square Error (MMSE) equalizer produces a signal which is as similar to the transmitted signal as possible given the received noise [7]. The incoming paths are not only combined, but a whole Finite Impulse Response (FIR) filter is calculated to suppress interference. To avoid noise enhancement, the Signal-to-Noise Ratio (SNR) has to be estimated in addition to the impulse response of the propagation channel. The linear MMSE equalizer provides a very good raw BER performance. However, instead of combining only a few paths, a complete filter is required to equalize the propagation channel. The filter typically has a size similar to the maximum delay of the propagation channel, which may span over 20 chips for urban cells.
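As a hedged illustration of why both the channel impulse response and the SNR enter the filter calculation, the common textbook form of the chip-level linear MMSE solution is given below. The notation ($\mathbf{H}$, $\mathbf{e}_{\delta}$, $\sigma_d^2$, $\sigma_n^2$) is introduced here for the example only and is not taken from the receiver model of this work.

$$
\mathbf{r} = \mathbf{H}\,\mathbf{d} + \mathbf{n}, \qquad
\hat{d}_{\delta} = \mathbf{w}^{H}\mathbf{r}, \qquad
\mathbf{w} = \left(\mathbf{H}\mathbf{H}^{H} + \frac{\sigma_n^2}{\sigma_d^2}\,\mathbf{I}\right)^{-1}\mathbf{H}\,\mathbf{e}_{\delta}
$$

Here $\mathbf{H}$ is the channel convolution matrix built from the estimated impulse response, $\mathbf{d}$ the transmitted chip vector with chip power $\sigma_d^2$, $\mathbf{n}$ white noise with power $\sigma_n^2$, and $\mathbf{e}_{\delta}$ a unit vector selecting the chip to be estimated. The term $\sigma_n^2/\sigma_d^2$ is the inverse SNR; dropping it yields a zero-forcing type solution, which is the case in which noise enhancement occurs.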

#### 2.3 Prefilter Rake equalizer

The Prefilter Rake equalizer was proposed in [8] as a variation of the MMSE equalizer. The equalization process is separated into two tasks: first the elimination of the cross-correlation of the different paths, and then the combination of the signal by a standard Rake combiner. Thus, the Prefilter Rake equalizer is a trade-off between the improved performance of the MMSE equalizer and the low complexity of the Rake combiner. The FIR filter of the Prefilter can be smaller than the FIR filter of the MMSE equalizer. Another advantage of the Prefilter Rake equalizer can be exploited in systems with transmit diversity: due to the blind operation of the Prefilter the required complexity is reduced [9].

## 3 Optimizations on the algorithmic level

The obtainable BER performance is an important criterion for choosing an algorithm which calculates the coefficients for the MMSE or Prefilter Rake approach. The equations can be solved either directly by matrix operations or by adaptive algorithms which approximate the solution up to a possible remaining error. Examples of adaptive algorithms are the Griffith algorithm, which belongs to the class of Least Mean Square (LMS) algorithms [10], and the Levinson algorithm, a Recursive Least Squares (RLS) solver [11]. For adaptive algorithms convergence criteria have to be considered, otherwise the algorithm may not converge.

Fig. 1 shows the raw BER which was obtained for the three receiver approaches in case of propagation channel VA30 [12]. The MMSE equalizer using an LMS adaptation performs best, and the Prefilter Rake using the Levinson algorithm has a similar performance. The Griffith algorithm for the Prefilter performs worse, while the Rake combiner is not even able to reach a raw BER of 2 % in the vehicular channel, which is necessary to reach the throughput defined in the standard [12]. With respect to performance the MMSE equalizer is therefore the best choice.

Figure 1. Comparison of raw BER performance for different receivers in case of propagation channel VA30 and an oversampling ratio of 8 (fixed-point bit-true simulation).

For a low power design the complexity of the design has to be considered. Especially if multiple samples per chip are used in the receiver (oversampling), the complexity may rise dramatically. A first criterion for choosing an algorithm is therefore the order of the algorithm with respect to the filter size. For example, the Griffith algorithm is of the order $O(M)$ while the Levinson algorithm is $O(M^2)$. This often makes the Griffith or similar LMS algorithms the first choice for adaptive filters with a low complexity.

However, for a fixed filter size M the order alone may not be a sufficient criterion: an algorithm of higher order may still have a lower average complexity if the filter coefficients are computed block by block rather than continuously. It is therefore important to determine the required complexity of an algorithm under real operating conditions. A C++ class has been added to the receiver components which counts the performed additions and multiplications for a typical test case and reports them periodically to identify the complexity correctly. For the fixed-point implementation of the algorithms it has to be considered that divisions with variable denominators require a high amount of hardware in digital designs, as they are converted to iterative multiplications and additions (e.g. Goldschmidt's algorithm) [13].
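A counting class of the kind mentioned above could look roughly as follows. This is only a sketch under assumptions: the class name `OpCounter`, its interface and the instrumented kernel `fir_output` are hypothetical and do not reproduce the class actually used in the receiver models.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical operation counter: the model calls add()/mul() wherever it
// performs the corresponding arithmetic operation.
class OpCounter {
public:
    // report_every_n_slots == 0 disables the periodic report.
    explicit OpCounter(std::uint64_t report_every_n_slots)
        : interval_(report_every_n_slots) {}

    // Counted arithmetic: return the result and increment the counter.
    double add(double a, double b) { ++adds_; return a + b; }
    double mul(double a, double b) { ++muls_; return a * b; }

    // Call once per slot; prints the running averages periodically.
    void end_of_slot() {
        ++slots_;
        if (interval_ != 0 && slots_ % interval_ == 0) {
            std::printf("slots=%llu mul/slot=%.1f add/slot=%.1f\n",
                        static_cast<unsigned long long>(slots_),
                        muls_per_slot(), adds_per_slot());
        }
    }

    std::uint64_t adds() const { return adds_; }
    std::uint64_t muls() const { return muls_; }
    double adds_per_slot() const {
        return slots_ ? static_cast<double>(adds_) / static_cast<double>(slots_) : 0.0;
    }
    double muls_per_slot() const {
        return slots_ ? static_cast<double>(muls_) / static_cast<double>(slots_) : 0.0;
    }

private:
    std::uint64_t interval_;
    std::uint64_t adds_ = 0, muls_ = 0, slots_ = 0;
};

// Example of an instrumented kernel: one real-valued FIR output sample.
double fir_output(OpCounter& ops,
                  const std::vector<double>& w,
                  const std::vector<double>& x) {
    assert(w.size() <= x.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i)
        acc = ops.add(acc, ops.mul(w[i], x[i]));
    return acc;
}
```

Because the counters are averaged per slot, a block-based coefficient calculation that runs only once every few slots shows up directly as a lower average, independent of the formal order of the algorithm.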

Fig. 2 illustrates the multiplications performed per slot by four different receiver algorithms for different oversampling ratios [14]. As expected, the MMSE equalizer and the Rake combiner have the highest and lowest complexity, respectively. For the Prefilter Rake equalizer the Levinson algorithm has a higher order, but it requires fewer operations per slot than the Griffith algorithm because of its block-based calculation.

Figure 2. Number of performed multiplications per slot of different algorithms with respect to oversampling.

To estimate the power consumption of the designs, the SystemC models have been analyzed with the tool ORINOCO<sup>®</sup> of ChipVision [15]. The tool determines an abstract floorplan of the design, monitors the switching activity for a typical test case and calculates the energy consumed by the functional units, registers, controller, interconnects and clock tree. Tab. 1 shows the parameters in combination with the results of the estimation for a 130 nm standard cell library. On the one hand, the Levinson algorithm has a higher order and requires a larger area. On the other hand, its higher performance permits a reduction of the filter size compared to the Griffith algorithm. The smaller filter size and the block-based calculation of the filter coefficients lead to an almost equal energy consumption of both algorithms. Since the Levinson algorithm achieves the better raw BER (Fig. 1) at almost equal energy, it is the better choice of the two. Additionally, the block-based mode allows more flexibility through an adaptive update rate, which can be exploited in the architecture of the Prefilter Rake.

| algorithm        | Levinson               | Griffith               |
|------------------|------------------------|------------------------|
| classification   | RLS solver             | LMS                    |
| runtime behavior | block-based            | continuous             |
| filter size M    | 13 taps                | 17 taps                |
| area             | $0.61\,\mathrm{mm}^2$  | $0.24\,\mathrm{mm}^2$  |
| energy           | 2.83 mWs               | 2.38 mWs               |

Table 1. Results obtained with ORINOCO<sup>®</sup> for Prefilter algorithms mapped to a 130 nm standard cell library.

The choice of the best algorithm is difficult due to several optimization criteria. Three algorithms will now be investigated further on architectural level: MMSE equalizer using LMS algorithm (best performance with high complexity), Prefilter Rake equalizer using the Levinson algorithm (good performance with medium complexity) and the common Rake combiner (worst performance with lowest complexity).



Figure 3. Simulation environment combining three algorithms in connection with the proposed adaptive control unit.

## 4 Optimizations on the architectural level

#### 4.1 General optimizations

First, general optimizations can be applied when the algorithms are mapped to an architecture. A way to reduce the complexity in modern communication systems is to convert a complex multiplication (e.g. rotation of phase and weighting of amplitude in the Rake combiner) from four multiplications and two additions into three multiplications and five additions [16].
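The trade of one multiplication for three additions can be sketched as follows. This is one well-known three-multiplication formulation; whether it matches the exact variant of [16] has not been checked here, and the function name `cmul3` is an assumption for the example.

```cpp
#include <complex>

// Complex product (a + jb)(c + jd) with three real multiplications and
// five real additions/subtractions instead of four and two.
std::complex<double> cmul3(std::complex<double> x, std::complex<double> y) {
    const double a = x.real(), b = x.imag();
    const double c = y.real(), d = y.imag();

    const double k1 = c * (a + b);   // 1 addition, 1 multiplication
    const double k2 = a * (d - c);   // 1 addition, 1 multiplication
    const double k3 = b * (c + d);   // 1 addition, 1 multiplication

    // real part: k1 - k3 = ac - bd, imaginary part: k1 + k2 = ad + bc
    return {k1 - k3, k1 + k2};       // 2 additions
}
```

When one operand is a coefficient that changes rarely, the terms `d - c` and `c + d` can be precomputed, so that only three multiplications and three additions remain per sample.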

Further, the conjugate complex symmetry of the Prefilter's coefficients can be utilized in the design of the filter which reduces the number of operations from six multiplications and ten additions to only four multiplications and six additions to process two conjugate complex taps.
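The operation count for a conjugate-symmetric tap pair can be reproduced with a short sketch. It assumes that the two taps carry the coefficients $w$ and $w^*$ and process the input samples `x1` and `x2`; the function name `conj_tap_pair` is hypothetical.

```cpp
#include <complex>

using cplx = std::complex<double>;

// Contribution of a conjugate-symmetric tap pair to the filter output:
//   y = w * x1 + conj(w) * x2
// Sharing the real and imaginary part of w between both taps needs four
// real multiplications and six real additions/subtractions.
cplx conj_tap_pair(cplx w, cplx x1, cplx x2) {
    const double a = w.real(), b = w.imag();

    const double sr = x1.real() + x2.real();   // addition 1
    const double dr = x1.real() - x2.real();   // addition 2
    const double si = x1.imag() + x2.imag();   // addition 3
    const double di = x1.imag() - x2.imag();   // addition 4

    const double re = a * sr - b * di;         // 2 multiplications, addition 5
    const double im = a * si + b * dr;         // 2 multiplications, addition 6
    return {re, im};
}
```

Two independent three-multiplication complex products would need six multiplications and ten additions for the same pair, which is the baseline quoted above.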

#### 4.2 Adaptive change of receiver mode

A simulation environment combining all three algorithms for the receiver design has been developed as depicted in Fig. 3. The FIR filters can be adjusted to process a variable number of coefficients as well as to provide a mode which combines both filters for the MMSE equalizer.

The actual complexity and power consumption can now be tuned by an adaptive control unit which chooses the right mode for the current state of the propagation channel. For bad channel conditions the MMSE or the Prefilter Rake equalizer is active; in low-noise environments the receiver can be tuned into the Prefilter Rake mode, or the Prefilter can even be switched off to operate in a Rake-only mode. The Rake mode can also be chosen if high data rates are not required by the user. As can be seen in Fig. 2, the oversampling has a strong impact on complexity and power consumption. A further optimization can therefore be achieved by reducing the oversampling ratio in case of good propagation conditions, which allows a fine tuning of the receiver mode.

However, the adaptive control unit requires a criterion for its decisions. Block error rate (BLER), bit error rate (BER) and a pseudo-BER (changed bits during channel decoding) are not suitable because the latency of their feedback exceeds the time scale on which the propagation channel changes in vehicular environments. A suitable characteristic in the inner receiver can be obtained by an output interference analyzer, which determines the power of the signal used for the softbit decision and the summed power of the signal at adjacent chip positions, which creates interference for the decision. If the ratio of both values falls below a lower threshold, the oversampling ratio is increased or the receiver mode is switched from Rake-only to the Prefilter Rake or MMSE equalizer. In case of good propagation conditions the interference can be eliminated very well; the criterion then exceeds an upper threshold and the receiver can be switched to a lower oversampling ratio or to the Rake-only mode. Fig. 4 displays the operation of the control unit for a VA30 propagation channel at an SNR of 10 dB: the mode and oversampling ratio are adjusted adaptively, which yields a mean raw BER of 1.6 %, corresponding to a BER after turbo decoding of 0.25 % and an overall system performance with a BLER of 6.7 % for the high speed channels. Algorithms and parameters may be switched every other slot (1.33 ms) to allow a fast adaptation to the current propagation conditions.
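The two-threshold decision described above can be sketched as a small state machine. Everything concrete in this example is an assumption: the ordered list of configurations (`ladder`), the threshold values passed in by the caller, the window of samples given to the analyzer, and all identifiers. The text only fixes the principle, namely the power ratio as criterion and one lower and one upper threshold.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using cplx = std::complex<double>;

enum class Mode { RakeOnly, PrefilterRake, Mmse };

struct Config {
    Mode mode;
    int  oversampling;
};

// Output interference analyzer: power at the decision position divided by
// the summed power at the adjacent chip positions of the window.
double interference_ratio(const std::vector<cplx>& window,
                          std::size_t decision_idx) {
    assert(decision_idx < window.size());
    double signal = 0.0, interference = 0.0;
    for (std::size_t i = 0; i < window.size(); ++i) {
        const double p = std::norm(window[i]);   // squared magnitude
        if (i == decision_idx) signal = p;
        else                   interference += p;
    }
    return interference > 0.0 ? signal / interference
                              : std::numeric_limits<double>::infinity();
}

// Control unit with hysteresis: the ladder is ordered from the cheapest
// to the most capable configuration (an assumed policy).
class AdaptiveControl {
public:
    AdaptiveControl(std::vector<Config> ladder, double lower, double upper)
        : ladder_(std::move(ladder)), lower_(lower), upper_(upper) {
        assert(!ladder_.empty() && lower_ < upper_);
        level_ = ladder_.size() - 1;   // start with the most capable setting
    }

    // Called at each switching opportunity, e.g. every other slot.
    const Config& update(double ratio) {
        if (ratio < lower_ && level_ + 1 < ladder_.size()) ++level_;
        else if (ratio > upper_ && level_ > 0)             --level_;
        return ladder_[level_];
    }

private:
    std::vector<Config> ladder_;
    double lower_, upper_;
    std::size_t level_ = 0;
};
```

Between the two thresholds the configuration is left unchanged, which avoids toggling between modes when the ratio fluctuates around a single decision value.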

#### 4.3 Adaptive change of parameters

The MMSE equalizer with the best raw BER performance has been selected for further optimization. The activity can be minimized by the adaptive control unit by adjusting the filter size and defining a threshold which stops the adaptation in case the sum of error of the coefficients falls below it. An interesting option on architectural level is also the implementation of a convergence masking vector (CMV) [17]. An additional register with one bit for each coefficient stores the state of convergence of each value. The bit is set if the coefficient does not need to be improved further. From the next iteration on, no more arithmetic operations are performed and therefore no more dynamic power is consumed by the algorithm for this coefficient.
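The effect of a convergence masking vector on an LMS coefficient update can be sketched as follows. The sketch assumes a filter output of the form $y = \sum_i w_i x_i$ and uses the magnitude of the last update step as the convergence test; this test, the parameter `eps` and the function name are illustrative assumptions and are not taken from [17].

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// LMS coefficient update with a convergence masking vector (CMV).
// converged[i] == true freezes coefficient i, so no arithmetic is spent
// on it. Returns the number of coefficients that were actually updated.
std::size_t lms_update_cmv(std::vector<cplx>& w,
                           std::vector<bool>& converged,
                           const std::vector<cplx>& x,
                           cplx error, double mu, double eps) {
    assert(w.size() == x.size() && w.size() == converged.size());
    std::size_t updated = 0;
    for (std::size_t i = 0; i < w.size(); ++i) {
        if (converged[i]) continue;                // masked: skip all work
        const cplx delta = mu * error * std::conj(x[i]);
        w[i] += delta;
        ++updated;
        if (std::abs(delta) < eps) converged[i] = true;  // set the mask bit
    }
    return updated;
}
```

In hardware the mask bit would gate the multiplier and the coefficient register of the corresponding tap. A real design also needs a rule for clearing the mask when the channel changes, which is omitted here.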

Fig. 5 shows the impact of the strategies and their combination on the performed operations in case of the MMSE equalizer and an oversampling ratio of 8. Depending on the propagation conditions (here a VA30 channel) the control unit reduces the number of multiplications and additions adaptively by up to 71.9 % each. The settings of the control structures allow a trade-off between BER performance and computational complexity. In this case the settings were chosen to ensure that the raw BER performance is not affected and remains about 1.1 % [18].



Figure 4. Switching of mode and oversampling ratio by the adaptive control unit determined by output interference ratio and the effect on performance.



Figure 5. Reduction of computational complexity achieved by the adaptive control unit for the MMSE equalizer and its impact on BER performance.

However, the reduction in multiplications and additions does not translate into a reduction of power consumption of the same magnitude. This is caused by the complexity and power consumption of the control unit itself and of the corresponding structures which enable the control of the data path (i.e. separation into active and inactive parts). The reduction of power consumption for the MMSE equalizer with and without control unit has also been evaluated with the tool ORINOCO<sup>®</sup>. The power consumption of the equalizer designs has been estimated, and a reduction of up to 42 % through the additional control structures for the MMSE equalizer was observed.

## 5 Conclusions

The importance and impact of power optimization on the algorithmic and architectural level has been discussed in this paper for an HSDPA case study. A selection of algorithms for the application has been compared regarding raw BER performance, arithmetic complexity and power consumption. An architecture combining the benefits of the algorithms has been proposed together with an adaptive control unit, which uses an output interference analyzer to choose the best algorithm and its settings according to the actual propagation environment in order to minimize complexity and power consumption. The control unit is able to reduce the arithmetic operations by up to approx. 70 % and the power consumption by up to approx. 40 % for the MMSE equalizer while the performance required by the standard is maintained. This shows the potential of the strategy, which can similarly be applied to other algorithms such as the Prefilter Rake equalizer.

## References

- [1] S. Verdu. *Multiuser Detection*. Cambridge University Press, Cambridge, 1998.
- [2] K. Hooli et al. Chip-level channel equalization in W-CDMA downlink. *EURASIP Journal on Applied Signal Processing*, volume 2002, number 8, pages 757–770, August 2002.
- [3] A. Raghunathan et al. *High-Level Power Analysis and Optimization*. Kluwer Academic Publishers, 1998.
- [4] M. Pätzold. *Mobile Fading Channels*. John Wiley & Sons, Chichester, 2002.
- [5] J. Proakis. *Digital Communications*. McGraw-Hill, 2001.
- [6] S. Sheng and R. Brodersen. *Low-Power CMOS Wireless Communications: A Wideband CDMA System Design*. Kluwer Academic Publishers, Boston, 1998.
- [7] T. P. Krauss et al. Simple MMSE equalizers for CDMA downlink to restore chip sequence: Comparison to zeroforcing and rake. In *IEEE International Conference on Acoustics, Speech and Signal Processing*, volume 5, pages 2865–2868, June 2000.

- [8] M. J. Heikkilä. A novel blind adaptive algorithm for channel equalization in WCDMA downlink. In *IEEE International Symposium on Personal, Indoor and Mobile Radio Communications*, volume 2, pages 41–45, September 2001.
- [9] M. Schämann, W. Wilhelm, D. Bierbaum and U. Langmann. Efficient Hardware Architectures of MIMO Receivers for HSDPA Applications in Frequency Selective Channels: A Comparison. In *World Wireless Congress*, San Francisco, May 2004.
- [10] B. Widrow et al. *Adaptive Signal Processing*. Prentice-Hall, New Jersey, 1985.
- [11] T. K. Moon and W. C. Stirling. *Mathematical Methods and Algorithms for Signal Processing*. Prentice-Hall, Upper Saddle River, NJ, 2000.
- [12] 3GPP. TS 25.101 V6.0.0: UE Radio Transmission and Reception (FDD). March 2003. Release 6.
- [13] G. Even et al. A parametric error analysis of Goldschmidt's division algorithm. *Journal of Computer and System Sciences*, volume 70, number 1, February 2005.

- [14] M. Schämann, M. Bücker, S. Hessel and U. Langmann. Channel Equalization in HSDPA Receivers: Trade-Off between Performance and Complexity with a Variable Oversampling. In *IEEE Vehicular Technology Conference*, Melbourne, May 2006.
- [15] W. Nebel and D. Helms. High-Level Power Estimation and Analysis. In C. Piguet (Ed.), *Low-Power Electronics Design*, CRC Press, 2005.
- [16] A. T. Fam. Efficient complex matrix multiplications. *IEEE Transactions on Computers*, volume 37, number 7, pages 877–879, July 1988.
- [17] Y. Guo et al. Low power VLSI architecture for adaptive MAI suppression in CDMA using multi-stage convergence masking vector. In *IEEE Vehicular Technology Conference*, Dallas, TX, September 2005.
- [18] M. Schämann, M. Bücker, S. Hessel and U. Langmann. Power Optimization of Digital Baseband WCDMA Receiver Components on Algorithmic and Architectural Level. In *U.R.S.I. Kleinheubacher Tagung*, Miltenberg, September 2006.