
2D: Hot Topic -- From Working Design Flow to Working Chips: Dependencies and Impacts of Methodology Decisions

Organizers/Moderators: F. Muradali, Agilent Technologies, US; R. Aitken, Artisan Components, US [p. 2]

Successful product development is based on successful design flows and methodology. This session explores three key and often-overlooked aspects of these flows: integrating a working system-on-chip from a diverse set of IP, packaging and the silicon/PCB interface, and finally manufacturing test. Discovering the dependencies that exist among the decisions and objectives at each stage of the flow can drive global optimization. The speakers will discuss the tasks, decisions and objectives of their portion of the flow. These are compared for apparent and non-obvious optimization opportunities. The discussion continues in a panel format that invites the audience to further explore the topic.

Systems on Chips Design: System Manufacturer Point of View [p. 3]
V. Loukusa, H. Pohjonen, A. Ruha, T. Ruotsalainen, and O. Varkki

Package Design for High Performance ICs [p. 5]
S. Dandia

IP Testing -- The Future Differentiator? [p. 6]
B. Eklow

3D: Analogue and RF Design

Moderators: D. Appello, STM Test Solutions Group, IT; C. Das, IMEC, BE

Highly Digital, Low-Cost Design of Statistic Signal Acquisition in SoCs [p. 10]
A. Júnior and L. Carro

Presently, the gap between analog and digital processes is ever increasing. Although digital circuits are still obeying Moore's law, their analog counterparts follow far behind. Since signal acquisition through ADC circuits is an often-required feature, many embedded applications have not achieved the benefits of Moore's law. This paper presents our approach to take advantage of the increasing integration of technology for analog interfacing in SoCs, by converting the statistics of the signal. Digital self-tuning of the threshold levels, the use of less expensive and highly variable analog blocks, and stochastic convergence of resolution allow a robust acquisition process. We present the mathematics behind the approach, as well as a set of target applications and experimental results validating the concept.

RUNE: Platform for Automated Design of Integrated Multi-Domain Systems -- Application to High-Speed CMOS Photoreceiver Front-Ends [p. 16]
F. Tissafi-Drissi, I. O'Connor, and F. Gaffiot

In this paper, we present a framework for the automated design of integrated multi-domain systems. The platform allows the designer to set optimization problems according to a hierarchical decomposition strategy, define complex specification functions for each block at a given hierarchical level, follow the progress of optimization and finally view results. Encapsulation of design methodologies is simplified through access to a library of optimization algorithms. The framework is demonstrated through the co-synthesis of a high-speed CMOS photoreceiver front-end comprised of a PIN photodiode and a transimpedance amplifier.

Demonstration of a SiGe RF LNA Design Using IBM Design Kits in 0.18um SiGe BiCMOS Technology [p. 22]
Y. Chen, X. Yuan, D. Scagnelli, J. Mecke, J. Gross, and D. Harame

A 1.5GHz-2.0GHz Low Noise Amplifier (LNA) is designed in IBM 0.18um BiCMOS technology using IBM design kits in Cadence Design Flow. The fabricated LNA chip is packaged and tested. The measured results (gain, noise figure, and IIP3) correlate with the simulation very well. The results demonstrate that IBM SiGe technology, Modeling, Design Kits and the Cadence design flow are solid and accurate for RFIC design.

Low Power Analogue 90 Degree Phase Shifter [p. 28]
P. Saul

This paper describes a re-useable circuit module for a 90° phase shifter, sometimes called a "Hilbert Transformer", which has been demonstrated on a 0.35-micron CMOS process. The 10-pole circuit is entirely analogue in operation, and achieves measured amplitude and phase accuracy compatible with >50dB sideband suppression. Statistical design techniques assure good functional yields. Total current consumption is 236 microamps at 3.3V, and chip area is 1.42 square mm, excluding bond pads. Applications include low power, low cost SSB receivers, more advanced communications architectures in GSM, DCS and G3, and sonar.

A 16 Bit + Sign Monotonic Precise Current DAC for Sensor Applications [p. 34]
P. Horský

A 16 bit + sign monotonic precise current DAC for sensor applications working in a harsh environment is described. It operates over a wide temperature range with high output voltage swing and low current consumption. The converter is based on current division and segmentation techniques to guarantee monotonicity. Two active cascoding loops and one follower loop are used to improve the output impedance, the accuracy and the voltage compliance of the DAC. The resolution of the DAC is further increased by applying PWM to one fine LSB current. To achieve low power consumption, unused coarse current sources are switched off. Several second-order technological effects that influence final performance, and the circuits that deal with them, are discussed.

An Inductance Modeling Flow Seamlessly Integrated in the RF IC Design Chain [p. 39]
S. Bantas, Y. Koutsoyannopoulos, and A. Liapis

A novel design flow is introduced based on an efficient inductance modeler, supporting RLCk extraction for spiral inductors, transformers and RF interconnect lines. The modeler operates on a set of EM-derived algorithms that can model complex cross-coupled devices on any silicon substrate rapidly and reliably. A design flow is set up in Cadence SKILL, integrating the inductance modeler with the layout editor and RCX extraction tools. Spiral inductor parametric cells are provided that can be extracted with full connectivity in a single netlist along with other layout devices and parasitics. The resulting netlist includes mutual coupling (k) elements and is produced automatically without need for user intervention or back-annotation. Measured results on RF silicon circuitry showcase the accuracy and efficiency of the inductance modeling flow. The introduced flow can evolve into a platform for RF Intellectual Property (IP) evaluation and trade.

4D: Platform and IP Design

Moderators: K. Currie, Philips Semiconductors, NL; L. Torres, LIRMM, FR

A High-Speed Transceiver Architecture Implementable as Synthesizable IP Core [p. 46]
A. Wortmann, M. Müller, and S. Simon

In this work, a synthesizable architecture for a serial high-speed transceiver is presented, which can be implemented at register-transfer level (RTL) with standard hardware description languages (HDL). The proposed implementation as a soft IP macro can be synthesized applying a semi-custom design flow, widely used in industry whenever possible. Generally, the implementation of high-speed transceivers is a typical domain of a full custom design style because the timing-critical parts are realized by dedicated transistor-level design of the PLL/DLL-based architectures. Compared to this method, the design productivity can be enhanced significantly with the usage of this soft IP macro. With the presented implementation, data rates of about 1 GBit/s can be achieved. This is certainly less compared to full custom implementations. Nevertheless, this is an appealing solution for short design time and low cost design, if the achieved data rate is sufficient. In addition, current research shows that data rates above the mentioned result can be achieved.

Design of Very Deep Pipelined Multipliers for FPGAs [p. 52]
A. Panato, S. Silva, F. Wagner, M. Johann, R. Reis, and S. Bampi

This work investigates the use of very deep pipelines for implementing circuits in FPGAs, where each pipeline stage is limited to a single FPGA logic element (LE). The architecture and VHDL design of a parameterized integer array multiplier are presented, along with an IEEE 754 compliant 32-bit floating-point multiplier. We show how to write VHDL cells that implement this approach, and how the array multiplier architecture was adapted. Synthesis and simulation were performed for Altera Apex20KE devices, although the VHDL code should be portable to other devices. For this family, a 16 bit integer multiplier achieves a frequency of 266MHz, while the floating point unit reaches 235MHz, performing 235 MFLOPS in an FPGA. Additional cells are inserted to synchronize data, which imposes significant area penalties. This and other considerations to apply the technique in real designs are also addressed.

Application of a Multi-Processor SoC Platform to High-Speed Packet Forwarding [p. 58]
P. Paulin, C. Pilkington, E. Bensoudane, M. Langevin, and D. Lyonnard

In this paper, we explore the requirements of emerging complex SoCs and describe StepNP, an experimental flexible, multi-processor SoC platform targeted towards communications and networking applications. We present the results of mapping an internet protocol (IPv4) packet forwarding application, running at 2.5Gb/s and 10Gb/s. We demonstrate how the use of high-speed hardware-assisted messaging and dynamic task allocation in the StepNP platform allows us to achieve very high processor utilization rates (up to 97%) in spite of the presence of high network-on-chip and memory access latencies. The inter-processor communication overhead is kept very low, representing only 9% of instructions.

Islands of Synchronicity, A Design Methodology for SoC Design [p. 64]
A. Niranjan and P. Wiscombe

To meet the challenges of faster time to market and growing design complexity, a methodology and supporting infrastructure for advanced System-on-Chip design have been developed and applied to 0.13 micron technology designs. The Islands of Synchronicity methodology uses locally synchronous islands to produce a timing-closure friendly design style that is widely applicable across different architectures. This approach enables a modular, hierarchical physical design strategy which significantly eases top-level timing closure problems. The resultant design flow is supported by the Skeleton of Reuse, a collection of IP generators and tools that automate many of the steps in SoC implementation.

The Design of a High Speed ASIC Unit for the Hash Function SHA-256 (384, 512) [p. 70]
L. Dadda, M. Macchetti, and J. Owen

After recalling the basic algorithms published by NIST for implementing the hash functions SHA-256 (384, 512), a basic circuit characterized by a cascade of full adder arrays is given. Implementation options are discussed and two methods for improving speed are presented: delay balancing and pipelining. An application of the former is first given, obtaining a circuit that reduces the length of the critical path by a full adder array. A pipelined version is then given, obtaining a reduction of two full adder arrays in the critical path. The two methods are afterwards combined and the results obtained through hardware synthesis are reported, where a comparison between the new circuits is also given.

LZW-Based Code Compression for VLIW Embedded Systems [p. 76]
C. Lin, W. Wolf, and Y. Xie

We propose a new variable-sized-block method for VLIW code compression. Code compression traditionally works on fixed-sized blocks and its efficiency is limited by the small block size. Branch blocks -- instructions between two consecutive possible branch targets -- provide larger blocks for code compression. We propose LZW-based algorithms to compress branch blocks. Our approach is fully adaptive and generates the coding table on the fly during compression and decompression. When encountering a branch target, the coding table is cleared to ensure correctness. Decompression requires only a simple lookup and update when necessary. Our method provides a peak decompression bandwidth of 8 bytes and an average of 1.82 bytes. Compared to Huffman's 1 byte and V2F's 13-bit peak performance, our methods have higher decoding bandwidth and comparable compression ratio. Parallel decompression could also be applied to our methods, which is more suitable for VLIW architecture.
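
The table-reset idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes byte-oriented input, an unbounded table, and a program already split into branch blocks, and the function names are hypothetical. Each block starts from a fresh LZW table, so decompression can begin at any branch target.

```python
def lzw_compress_block(data: bytes) -> list[int]:
    """Compress one block, starting from a fresh table (codes 0-255 = single bytes)."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate           # keep extending the current phrase
        else:
            codes.append(table[current])  # emit the longest known phrase
            table[candidate] = next_code  # learn the new phrase on the fly
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress_block(codes: list[int]) -> bytes:
    """Rebuild the same table during decoding; no table is stored with the data."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Phrase that the encoder defined in the very step that emitted it.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


def compress_program(branch_blocks: list[bytes]) -> list[list[int]]:
    """Hypothetical driver: the table is cleared at every branch target."""
    return [lzw_compress_block(block) for block in branch_blocks]
```

Because every block is compressed independently, `lzw_decompress_block(compress_program(blocks)[k])` recovers block `k` without decoding the blocks before it, which is the property a branch into the middle of the program needs.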

A Generic RTOS Model for Real-Time Systems Simulation with SystemC [p. 82]
R. Le Moigne, O. Pasquier, and J. Calvez

The main difficulties in designing real-time systems are related to time constraints: if an action is performed too late, it is considered a fault (with different levels of criticality). Designers need a solution that fully supports timing constraints and enables them to simulate a real-time system early in the design process. One of the main difficulties in designing HW/SW systems resides in studying the effect of serializing tasks on processors running a Real-Time Operating System (RTOS). In this paper, we present a generic model of RTOS based on SystemC. It allows assessing real-time performance and the influence of scheduling according to RTOS properties such as scheduling policy, context-switch time and scheduling latency.

A Scalable Architecture for LDPC Decoding [p. 88]
M. Cocco, J. Huisken, J. Dielissen, M. Heijligers, and A. Hekstra

Low Density Parity Check (LDPC) codes offer excellent error correcting performance. However, current implementations are not capable of achieving the performance required by next generation storage and telecom applications. Extrapolation of many of those designs is not possible because of routing congestion. This article proposes a new architecture, based on a redefinition of a lesser-known LDPC decoding algorithm. As random LDPC codes are the most powerful, we abstain from making simplifying assumptions about the LDPC code which could ease the routing problem. We avoid the routing congestion problem by going for multiple independent sequential decoding machines, each decoding separate received codewords. In this serial approach the required amount of memory must be multiplied by the large number of machines. Our key contribution is a check node centric reformulation of the algorithm which gives huge memory reduction and which thus makes the serial approach possible.

5D: Design Verification and Test

Moderators: F. Fummi, Verona U, IT; A. Fedeli, STMicroelectronics, IT

Verification of a Microcontroller IP Core for System-on-a-Chip Designs Using Low-Cost Prototyping Environments [p. 96]
S. Schmitt and W. Rosenstiel

Rapid prototyping is a fast and efficient way for the functional verification of Systems-on-a-Chip in an early stage of the design process. Because of the growing share of software in those systems, the use and reuse of microcontroller IP cores is necessary to keep development cycles short. Today, prototyping of such IP cores is done with large and expensive hardware emulation machines consisting of many processor or FPGA-based prototyping boards. In this paper the authors describe an alternative prototyping method for microcontrollers using one low-cost FPGA-based prototyping board. The method is based on the efficient usage of all resources of the prototyping system to emulate special parts of the microcontroller.

Formal Refinement and Model Checking of an Echo Cancellation Unit [p. 102]
A. Krupp, W. Mueller, and I. Oliver

This article presents an approach that combines theorem proving-based refinement with model checking for state-based real-time systems. Our verification flow starts from UML state diagrams, which are translated to the formal B language and are model checked for real-time properties. By means of the B language and a B theorem prover, refined state diagrams are verified against their abstract representation. The approach is presented by means of the refinement of a digital echo cancellation unit.

Test Infrastructure Design for the Nexperia™ Home Platform PNX8550 System Chip [p. 108]
S. Goel, E. Marinissen, K. Chiu, T. Nguyen, and S. Oostdijk

Philips has adopted a modular manufacturing test strategy for its SOCs that are part of the Nexperia™ Home Platform. The on-chip infrastructure that enables modular testing consists of wrappers and Test Access Mechanisms (TAMs). Optimizing that infrastructure minimizes the test application time and helps to fit the test data into the ATE vector memory. This paper presents the test architecture design for the chiplet-based PNX8550, the most complex Nexperia™ SOC designed to date. Significant savings in test time and TAM wires could be obtained with the help of TR-ARCHITECT, an in-house tool for automated design of SOC test architectures.

Have I Really Met Timing? Validating PrimeTime Timing Reports with SPICE [p. 114]
T. Thiel

At sign-off everybody is wondering about how good the accuracy of the static timing analysis timing reports generated with PrimeTime® really is. Errors can be introduced by STA setup, interconnect modeling, library characterization etc. The claims that path timing calculated by PrimeTime usually is within a few percent of Spice don't help to ease your uncertainty. When the Signal Integrity features were introduced to PrimeTime there was also a feature added that was hardly announced: PrimeTime can write out timing paths for simulation with Spice that can be used to validate the timing numbers calculated by PrimeTime. By comparing the numbers calculated by PrimeTime to a simulation with Spice for selected paths the designers can verify the timing and build up confidence or identify errors. This paper will describe a validation flow for PrimeTime timing reports that is based on extraction of the Spice paths, starting the Spice simulation, parsing the simulation results, and creating a report comparing PrimeTime and Spice timing. All these steps are done inside the TCL environment of PrimeTime. It will describe this flow, what is needed for the Spice simulation, how it can be set up, what can go wrong, and what kind of problems in the STA can be identified.

At Speed Testing of SOC ICs [p. 120]
V. Vorisek, T. Koch, and H. Fischer

This paper discusses the aspects and associated requirements of design and implementation of at-speed scan testing. It also demonstrates some important vector generation and implementation procedures based on a real design. An innovative method of scan pattern timing creation based on the results from Static Timing Analysis is presented. The paper also describes the usage of a clock control module on a J750 tester, which creates a fast clock by combining two tester channels with high edge placement accuracy. These methods allow a short test pattern preparation time and the use of low-cost test equipment, while providing high-quality at-speed testing.

Utilizing Formal Assertions for System Design of Network Processors [p. 126]
X. Chen, Y. Luo, H. Hsieh, L. Bhuyan, and F. Balarin

System level modeling with executable languages such as C/C++ has been crucial in the development of large electronic systems from general processors to application specific designs. To make sure that the executable models behave as they should, the designers often have to 'eye-ball' the simulation traces and at best, apply simple 'assert' statements or write simple trace checkers in some scripting languages. The problem is the lack of a concise and formal method to specify and check desired properties, whether they be functional or performance in nature. In this paper, we apply assertion checking methodology to the system design of network processors. Functional and performance assertions, based on Linear Temporal Logic and Logic of Constraints, are written during the design process. Trace checkers and simulation monitors are automatically generated to validate particular simulation runs or to analyze their performance characteristics. Several categories of assertions are checked throughout the design process, such as equivalence, functionality, transaction, and performance. We demonstrate that the assertion-based methodology is very useful for both system level verification and design exploration.
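
To make the idea of a trace checker concrete, here is a minimal hand-written sketch for a single bounded-response property ("every trigger is followed by a response within a fixed number of steps"). It is only an illustration: the paper generates its checkers automatically from temporal-logic formulas, whereas the function name, event names, and trace format below are assumptions.

```python
def check_bounded_response(trace, trigger, response, max_delay):
    """Return the indices of trigger events that are NOT followed by a
    response within the next max_delay trace entries."""
    violations = []
    for index, event in enumerate(trace):
        if event != trigger:
            continue
        # Look only at the max_delay entries that follow the trigger.
        window = trace[index + 1 : index + 1 + max_delay]
        if response not in window:
            violations.append(index)
    return violations


# Hypothetical simulation trace: one event name per simulation step.
trace = ["req", "idle", "ack", "req", "idle", "idle", "idle"]
late = check_bounded_response(trace, "req", "ack", max_delay=2)
```

Here `late` lists the positions of requests that were never acknowledged in time, which is the kind of result a designer would otherwise have to find by reading the simulation trace by eye.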

6D: Design Methodology

Moderators: V. Gerousis, Infineon Technologies, DE; D. Bailey, Mentor Graphics, US

Clock Management in a Gigabit Ethernet Physical Layer Transceiver Circuit [p. 134]
J. Diaz and M. Saburit

This paper describes the clock management of a mixed signal, high-speed, multi-clock, fully synchronous circuit. The MA1111A13 circuit clock distribution is a complicated structure that seamlessly incorporates different well-known techniques for power reduction, asynchronous clock domains inter-operability, and compatibility with different IO timing standards and data rates. This complex clocking scheme has been successfully integrated into the standard semi-custom physical design flow. The physical implementation of the clock network with Synopsys Astro is also presented.

Expert System Perimeter Block Placement Floorplanning [p. 140]
R. Auletta

With the dramatic increase in the size and block count of systems on a chip (SOC) over their application specific integrated circuit (ASIC) counterparts, engineers now need assistance beyond the clerical optimization tasks of placement and routing; they need assistance in applying their own expert abilities to design planning. This paper presents an investigation in applying expert systems to the automated floorplanning of systems on a chip. The investigation presents some background on expert systems, and then the implementation and results of an expert system based edge placer for perimeter placement of floorplan hard blocks.

A CAD Methodology and Tool for the Characterization of Wide On-Chip Buses [p. 144]
I. Elfadel, A. Deutsch, G. Kopcsay, B. Rubin, and H. Smith

In this paper, we describe a CAD methodology for the full electrical characterization of high-performance, on-chip data buses. The goal of this methodology is to allow the accurate modeling and analysis of wide, on-chip data buses as early as possible in the design cycle. The modeling is based on a manufacturing (rather than design-manual) description of the back-end-of-the-line (BEOL) cross section of a given technology and on a full yet contained description of the power-ground mesh in which the data bus is embedded. One major aspect of the resulting electrical models is that they allow the designer to evaluate the wide bus from the three viewpoints of signal timing, crosstalk (both inductive and capacitive), and common-mode signal integrity. Another major aspect is that they take into account such important high-frequency phenomena as the dependence of the current return-path resistance on frequencies. The CAD methodology described in this paper has been extensively correlated with on-chip hardware measurements.

MATLAB/SIMULINK-Based High-Level Synthesis of Discrete-Time and Continuous-Time ΣΔ Modulators [p. 150]
J. Ruiz-Amaya, J. De La Rosa, F. Medeiro, F. Fernández, R. Del Río, B. Pérez-Verdú, and A. Rodríguez-Vázquez

This paper describes a tool that combines an accurate SIMULINK-based time-domain behavioural simulator with a statistical optimizer for the automated high-level synthesis of ΣΔ Modulators (ΣΔMs). The combination of high accuracy, short CPU time and interoperability of different circuit models together with the efficiency of the optimization engine makes the proposed tool an advantageous alternative for ΣΔM synthesis. The implementation on the well-known MATLAB/SIMULINK platform brings numerous advantages in terms of data manipulation, flexibility and simulation with other electronic subsystems. Moreover, this is the first tool dealing with the synthesis of ΣΔMs using both Discrete-Time (DT) and Continuous-Time (CT) circuit techniques.

RTL Processor Synthesis for Architecture Exploration and Implementation [p. 156]
O. Schliebusch, A. Chattopadhyay, R. Leupers, G. Ascheid, H. Meyr, M. Steinert, G. Braun, and A. Nohl

Architecture description languages are widely used to perform architecture exploration for application-driven designs, whereas the RT-level is the commonly accepted level for hardware implementation. For this reason, design parameters such as timing, area or power consumption cannot be taken into consideration accurately during design space exploration. Design automation tools currently used to bridge this gap are either limited in the flexibility provided or only generate fragments of the architecture. This paper presents a synthesis tool which preserves the full flexibility of the architecture description language LISA, while being able to generate the complete architecture on RT-level using SystemC. This paper also presents two real world architecture case studies to prove the feasibility of our approach.

Java-through-C Compilation: An Enabling Technology for Java in Embedded Systems [p. 161]
A. Varma and S. Bhattacharyya

The Java programming language is achieving greater acceptance in high-end embedded systems such as cellphones and PDAs. However, current embedded implementations of Java impose tight constraints on functionality, while requiring significant storage space. In addition, they require that a JVM be ported to each such platform. We demonstrate the first Java-to-C compilation strategy that is suitable for a wide range of embedded systems, thereby enabling broad use of Java on embedded platforms. This strategy removes many of the constraints on functionality and reduces code size without sacrificing performance. The compilation framework described is easily retargetable, and is also applicable to barebones embedded systems with no operating system or JVM. On average, we found the size of the generated executables to be over 25 times smaller than those generated by a cutting-edge Java-to-native-code compiler, while providing performance comparable to the best of various Java implementation strategies.

7D: Network Design

Moderators: M. Turolla, Telecom Italia, IT; K. Goossens, Philips Research, NL

Heterogeneous Co-Simulation of Networked Embedded Systems [p. 168]
F. Fummi, S. Martini, G. Perbellini, M. Poncino, F. Ricciato, and M. Turolla

Networked embedded systems pose several challenges in the modeling, simulation, and design domains. The presence of the network, in particular, makes an already critical task such as HW/SW co-simulation even more complex, since a three-way (HW/SW/network) co-simulation and co-design capability is required. Modeling of networks and their interaction with hardware and software is thus key for an effective design methodology at early stages of the design flow. In this work, we present a HW/SW/network co-simulation and co-design methodology, based on the integration of heterogeneous simulation environments such as SystemC and NS (Network Simulator). This methodology has been successfully applied to the design of a system-on-chip performing the fast path of IPv4 routing, allowing different HW/SW allocations to be explored for different network configurations.

OCCN: A Network-on-Chip Modeling and Simulation Framework [p. 174]
M. Coppola, S. Curaba, G. Maruccia, F. Papariello, and M. Grammatikakis

The open-source On-Chip Communication Network (OCCN) defines an efficient framework for network-on-chip modeling and simulation based on an object-oriented C++ library built on top of SystemC. OCCN increases the productivity of developing communication driver models through the definition of a universal communication API. This API provides a new design pattern that enables creation and reuse of executable transaction level models (TLMs). OCCN also addresses protocol refinement, design exploration, and high-level performance modeling.

A Design Methodology for the Exploitation of High Level Communication Synthesis [p. 180]
F. Bruschi and M. Bombana

In this paper we analyse some methodological concerns that have to be faced in a design flow which contains automatic synthesis phases from high-level system descriptions. In particular, the issues related to the synthesis of the communication between the system elements are considered. The context in which the analysis is performed is the design flow proposed in the ODETTE project: in this setting, SystemC is exploited in order to provide efficient system-level models; after that, the SystemC+ SystemC subset and extensions can be used to get a refined description that, despite the use of object oriented features such as polymorphism and inheritance, can be automatically synthesised by means of the ODETTE tools. Still, the problem of interfacing the synthesised hardware with the other elements of the design (memories, peripherals) remains an important issue. To address this problem, we propose a pattern that can be used to design bus interfaces that allow both a high level of abstraction in the communication on the "user" side, and automatic synthesis by the ODETTE tools. To do this, OSSS global objects are exploited to implement the communication between the application and the interface. After presenting the general methodology, a specific library interface is presented that could connect the device under design to a PCI bus. To prove the viability of the approach, an example is synthesised from the system level down to the RT level.

Software Processing Performance in Network Processors [p. 186]
I. Papaefstathiou, G. Kornaros, and N. Zervos

To meet the demand for higher performance, flexibility, and economy in today's state-of-the-art networks, an alternative to the ASICs that traditionally were used to implement packet-processing functions in hardware, called network processors (NPs), has emerged. In this paper, we briefly outline the architecture of such an innovative network processor aiming at the acceleration of protocol processing in high-speed network interfaces, and we use this architecture as a case study for our measurements. We focus on the performance of the general purpose processors used for executing high level protocol processing, since this part proves to be the bottleneck of the design. The performance is analyzed by executing a set of widely used, real applications and by applying network traffic according to certain stochastic criteria. The performance of the RISC used is compared with that of other well-known CPU architectures so as to verify that our results are applicable to the general network processors era. As our results demonstrate, the bottleneck of the majority of the network processors is the general-purpose processing units used, since today's network protocols need a great amount of high-level processing. On the other hand, the specific purpose processors or co-processors involved in such systems, optimized for certain parts of the network packet processing, can provide the power needed, even at today's ultra high network speeds.

Channel Decoder Architecture for 3G Mobile Wireless Terminals [p. 192]
F. Berens, G. Kreiselmaier, and N. Wehn

Channel coding is a key element of any digital wireless communication system since it minimizes the effects of noise and interference on the transmitted signal. In third generation (3G) wireless systems channel coding techniques must serve both voice and data users whose requirements considerably vary. Thus the Third Generation Partnership Project (3GPP) standard offers two coding techniques, convolutional-coding for voice and Turbo-coding for data services. In this paper we present a combined channel decoding architecture for 3G terminal applications. It outperforms a solution based on two separate decoders due to an efficient reuse of computational hardware and memory resources for both decoders. Moreover it supports blind transport format detection. Special emphasis is put on low energy consumption.

RASoC: A Router Soft-Core for Networks-on-Chip [p. 198]
C. Zeferino, E. Kreutz, and A. Susin

The building block of a Network-on-Chip (NoC) is its router. It is responsible for switching the channels that forward the messages exchanged by the cores attached to the NoC, and the cost and performance of the NoC strongly depend on the router architecture. In this paper, we present RASoC, a router architecture intended for building low-area-overhead NoCs for embedded systems. RASoC differs from current routers in that it is implemented as a parameterized VHDL model, which improves the reuse of RASoC in the synthesis of NoCs of different sizes and allows the NoC parameters to be tuned to meet the requirements of the target application. The paper presents details of the RASoC architecture, the structure of the VHDL model, and some experimental results which show the scalability of the soft-core and its costs.
Keywords: Systems-on-Chip. On-Chip Networks. FPGA.

8D: Reconfigurable Architecture

Moderators: M. Lindwer, Philips Silicon Hive, NL; P. Pezzati, Cadence, FR
Carry-Save Montgomery Modular Exponentiation on Reconfigurable Hardware [p. 206]
A. Cilardo, A. Mazzeo, L. Romano, and G. Saggese

In this paper we present a hardware implementation of the RSA algorithm for public-key cryptography. The RSA algorithm entails a modular exponentiation operation on large integers, which is considerably time-consuming to implement. To address this, we adopted a novel algorithm combining Montgomery's technique and the carry-save representation of numbers. A highly modular, bit-slice based architecture has been designed for executing the algorithm in hardware. We also propose an FPGA-based implementation of the architecture developed. The characteristics of the algorithm, the regularity of the architecture, and the data-flow aware placement of the FPGA resources resulted in a considerable performance improvement compared to other implementations presented in the literature.
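
For readers unfamiliar with Montgomery's technique, the following is a minimal software sketch of textbook Montgomery multiplication and square-and-multiply exponentiation. It is not the authors' algorithm: the carry-save representation and the bit-slice hardware architecture are not modeled, and the function names (`montgomery_setup`, `mont_mul`, `mont_pow`) are hypothetical, chosen only for this illustration.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n, with R = 2**k > n."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R); needs Python 3.8+
    r2 = (r * r) % n                 # R^2 mod n, used to enter Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^-1 mod n using only shifts and masks for the reduction."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k             # exact division by R
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n for odd n > 1 by left-to-right square-and-multiply."""
    if n % 2 == 0 or n < 3:
        raise ValueError("Montgomery reduction needs an odd modulus greater than 1")
    k = n.bit_length()
    n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, k, n_prime)   # base in Montgomery form
    acc = mont_mul(1, r2, n, k, n_prime)        # 1 in Montgomery form (R mod n)
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)      # leave Montgomery form
```

The point of the technique is that the reduction step divides by a power of two rather than by the modulus, which is what makes it attractive for hardware.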

Design and Implementation of a Secret Key Steganographic Micro-Architecture Employing FPGA [p. 212]
M. Saeb and H. Farouk

In the well-known 'prisoners' problem', a representative example of steganography, two persons attempt to communicate covertly without alerting the warden. One approach to achieve this task is to embed the message in an innocent-looking cover medium. In our model, the message contents are scattered in the cover in a way that is based on a secret key known only to the sender and receiver. Therefore, even if the warden discovers the existence of the message, he will not be able to recover it. In other words, a covert or subliminal communication channel is opened between two persons who possess a secret key to reassemble its contents. In this article, we propose a video or audio steganographic model in which the hidden message can be composed and inserted in the cover in real time. This is realized by designing and implementing a secret key steganographic microarchitecture employing Field Programmable Gate Arrays (FPGAs).
Keywords: Steganography, data hiding, FPGA, architecture, covert communications, subliminal channel.
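
The idea of scattering message bits through a cover according to a secret key can be illustrated with a small software sketch. This is only an assumption-laden illustration, not the authors' micro-architecture: it hides bits in the least significant bit of cover bytes, picks positions with Python's `random.Random` seeded by the key (which is not cryptographically secure), and assumes the receiver knows the message length. The names `embed` and `extract` are hypothetical.

```python
import random


def _positions(key, cover_len, n_bits):
    """Derive n_bits distinct cover positions from the shared secret key."""
    rng = random.Random(key)          # illustration only; not a secure generator
    return rng.sample(range(cover_len), n_bits)


def embed(cover: bytes, message: bytes, key: int) -> bytes:
    """Scatter the message bits into the LSBs of key-selected cover bytes."""
    bits = [(byte >> i) & 1 for byte in message for i in range(8)]
    if len(bits) > len(cover):
        raise ValueError("cover too small for message")
    stego = bytearray(cover)
    for pos, bit in zip(_positions(key, len(cover), len(bits)), bits):
        stego[pos] = (stego[pos] & 0xFE) | bit
    return bytes(stego)


def extract(stego: bytes, n_bytes: int, key: int) -> bytes:
    """Reassemble n_bytes of message from the key-selected positions."""
    pos = _positions(key, len(stego), n_bytes * 8)
    out = bytearray()
    for i in range(n_bytes):
        byte = 0
        for j in range(8):
            byte |= (stego[pos[8 * i + j]] & 1) << j
        out.append(byte)
    return bytes(out)
```

With the same key, `extract(embed(cover, msg, key), len(msg), key)` returns `msg`; without the key, an observer does not know which positions to read or in what order.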

NeuroFPGA -- Implementing Artificial Neural Networks on Programmable Logic Devices [p. 218]
D. Ferrer, R. González, R. Fleitas, R. Canetti, and J. Pérez

An FPGA implementation of a multilayer perceptron neural network is presented. The system is parameterized both in network-related aspects (e.g., number of layers and number of neurons in each layer) and implementation parameters (e.g., word width, pre-scaling factors and number of available multipliers). This allows the design to be used for different network realizations, or different area-speed trade-offs to be tried, simply by recompiling the design. Fixed-point arithmetic with pre-scaling configurable on a per-layer basis was used. The system was tested on an ARC-PCI board from Altera™. Several examples from different application domains were implemented, showing the flexibility and ease of use of the resulting circuit. Even with the rather old board used, an appreciable speed-up was obtained compared with a software-only implementation based on the Matlab neural network toolbox.

Project Space Exploration on the 2-D DCT Architecture of a JPEG Compressor Directed to FPGA Implementation [p. 224]
R. Porto and L. Agostini

This paper presents a project space exploration of the baseline JPEG compressor proposed and implemented in previous works. The exploration was based on substituting the operators used in the 2-D DCT calculation architecture of the compressor and evaluating the resulting impact on performance and resource utilization. The substitution focused mainly on the carry lookahead, hierarchical carry lookahead and carry select architectures, with the objective of increasing the JPEG compressor performance. As the compressor architecture was designed hierarchically, substituting the operators was quite simple, because it did not involve the other hierarchy levels. The operators were described in VHDL, synthesized and validated. They were then inserted in the 2-D DCT architecture for synthesis of the whole module. The 2-D DCT was synthesized for an Altera FPGA. With this project space exploration, the highest performance obtained for the 2-D DCT was 23% higher than the original, using 11% more logic cells.
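
As background on one of the adder architectures named above, the following is a small behavioral sketch of a carry-select adder. It is a software model for illustration only, not the authors' VHDL; the function names and the default width and block size are assumptions.

```python
def ripple_add(a, b, cin, width):
    """Add two width-bit values plus a carry-in; return (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, width=16, block=4):
    """Behavioral model of a carry-select adder.

    Each block computes its sum twice, once assuming carry-in 0 and once
    assuming carry-in 1. When the real carry from the previous block is
    known, it only selects one of the two precomputed results.
    """
    result, carry = 0, 0
    for lo in range(0, width, block):
        w = min(block, width - lo)            # last block may be narrower
        mask = (1 << w) - 1
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        sum0, c0 = ripple_add(a_blk, b_blk, 0, w)
        sum1, c1 = ripple_add(a_blk, b_blk, 1, w)
        s, carry = (sum1, c1) if carry else (sum0, c0)   # the multiplexer
        result |= s << lo
    return result, carry
```

In hardware the two candidate sums of every block are computed in parallel, so the critical path is one block's ripple delay plus a chain of multiplexers; the price is roughly duplicated adder logic, which is consistent with the kind of speed-versus-area trade-off the abstract reports.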

A Scalable Implementation of a Reconfigurable WCDMA Rake Receiver [p. 230]
M. Quax, J. Huisken, and J. van Meerbergen

The demands in terms of processing performance, communication bandwidth and real-time throughput of new generation mobile communication applications (mobile and base-stations) are much higher than today's programmable processing architectures can deliver. On the other hand, standards and market uncertainties, non-recurring engineering costs, and lack of access to (or knowledge of) application IP will require the next generation of embedded computing platforms to be fully programmable. In terms of silicon cost and power, practical yet fully programmable embedded computing platforms are enabled by reconfigurable processors that replace fixed ASICs in current standard platforms [8]. This paper explains the concepts behind a novel reconfigurable WCDMA Rake receiver and gives benchmark results. The proposed Rake receiver enables a high performance, yet flexible computing platform for WCDMA.

Customisable EPIC Processor: Architecture and Tools [p. 236]
W. Chu, R. Dimond, S. Perrott, S. Seng, and W. Luk

This paper describes a customisable architecture and the associated tools for a prototype EPIC (Explicitly Parallel Instruction Computing) processor. Possible customisations include varying the number of registers and functional units, which are specified at compile-time. This facilitates the exploration of performance/area trade-off for different EPIC implementations. We describe the tools for this EPIC processor, which include a compiler and an assembler based on the Trimaran framework. Various pipelined EPIC designs have been implemented using Field Programmable Gate Arrays (FPGAs); the one with 4 ALUs at 41.8 MHz can run a DCT application 5 times faster than the StrongARM SA-110 processor at 100 MHz.

A Run-Time Reconfigurable Datapath Architecture for Image Processing Applications [p. 242]
M. Boschetti, I. Silva, and S. Bampi

This paper describes a run-time reconfigurable architecture targeted to flexible low-level image processing functions. The purpose is to present the evolution of the DRIP (Dynamically Reconfigurable Image Processor) architecture from a statically configurable datapath design to a dynamically reconfigurable approach. The methodology used to redefine the datapath basic building blocks and the hardware units developed to provide an efficient and flexible image processing system are also discussed. An important issue is the granularity of the basic processing elements of the datapath, in view of the combination of programmable function by hardware control -- the classical datapath paradigm -- and the dynamic reconfiguration. DRIP can perform a large set of digital image processing algorithms with real-time performance to fulfill the requirements of contemporary complex applications.

Synthesis of Embedded SystemC Design: A Case Study of Digital Neural Networks [p. 248]
D. Lettnin, A. Braun, M. Bogdan, J. Gerlach, and W. Rosenstiel

This work presents the whole System-on-Silicon design flow using the SystemC system specification language. In this study, SystemC is used to design a multilayer perceptron neural network, which is applied to an electrocardiogram pattern recognition system. The objective of this work is to exemplify the synthesis of RTL and behavioral integrated systems. To achieve this, a preprocessing methodology was used to optimize the three main constraints of hardware neural network (HNN) design: accuracy, space and processing speed. This allows a complex HNN to be implemented on a single Field Programmable Gate Array (FPGA). High-level SystemC synthesis allows the straightforward translation from system level to hardware level, avoiding the error-prone and time-consuming translation into another hardware description language.
Keywords: SystemC Synthesis; Rapid Prototyping; Embedded Systems; Digital System Design; Hardware Neural Network (HNN); Electrocardiogram (ECG).

9D: Constrained and Domain Specific Architectures

Moderators: J. Gerlach, Robert Bosch GmbH, DE; M. Lindwer, Philips Silicon Hive, NL
Experiences during the Experimental Validation of the Time-Triggered Architecture [p. 256]
S. Blanc, J. Gracia, and P. Gil

In recent years, the Time-Triggered Architecture (TTA) has been gaining acceptance as a generic architecture for highly dependable real-time systems. It is now being used to implement the 'x-by-wire' concept. A problem for this kind of system is its validation. Fault injection has achieved wide acceptance among designers for the experimental validation of systems. This work describes the results and experiences obtained during the validation of the TTP/C controller, a communication controller based on the TTA. Two different fault-injection techniques have been used: VHDL-based fault injection and physical fault injection at pin level. Because each technique has access to different parts of the system, they complement each other; moreover, some experiments can be reproduced using both techniques, which is very helpful for the analysis of the results.

Evaluation of a Refinement-Driven SystemC™-Based Design Flow [p. 262]
T. Schubert, J. Appell, W. Nebel, J. Hanisch, and J. Gerlach

This paper describes the experiences and results obtained with a SystemC-based design flow for the implementation of an automotive digital hardware design. We present the refinement process starting from an initial high-level executable specification in C++ via SystemC down to a gate-level description. We compare the synthesis results of the SystemC-based system-level design flow with those from a traditional VHDL-based register-transfer level design flow in terms of efficiency and simulation performance.

Evaluation of an Object-Oriented Hardware Design Methodology for Automotive Applications [p. 268]
N. Bannow and K. Haug

In this paper we present results from using the new object-oriented design approach OSSS (ODETTE System Synthesis Subset). The methodology and tools of the ODETTE (Object-oriented co-DEsign and functional Test TEchniques) project have been developed within the context of the IST programme of the European Commission. The main focus of OSSS lies in the field of hardware design and in synthesis capability. The strategy is based on an extension of the synthesizable subset of standard SystemC. The approach supports real object-oriented and synthesizable design features like classes, inheritance, templates, polymorphism and global object access. Therefore OSSS promises high efficiency in the sense of the capability to handle complex designs, faster development time, improved code quality and faster time to market. In contrast, standard SystemC is also based on C++ constructs, but no object-oriented constructs are available yet for a synthesizable system description. We have evaluated OSSS on an automotive design example. It was chosen for the implementation of a component that is part of all video projects: a camera's exposure control unit (ExpoCU). The first main goal that was achieved is a synthesizable design through the automatic generation of an FPGA netlist from an OSSS description.
Furthermore, we have also shown that the methodology fulfills industrial requirements such as usability for complex system development, integration of existing IP, improved code quality and decreased development effort. A comparison will be made against an existing VHDL-based design flow. We especially focus on implementation and testability by comparing the new object-oriented synthesis approach with a standard VHDL flow, with emphasis on synthesizability. OSSS and equivalent kinds of methodology show a large potential to handle new generations of complex HW-SW systems. Moreover, the gap between increasing design complexity and available methodologies is already growing and thus needs to be closed by new solutions such as OSSS.

The Design and Test of a Smartcard Chip Using a CHAIN Self-Timed Network-on-Chip [p. 274]
W. Bainbridge, L. Plana, and S. Furber

The CHAIN self-timed Network-on-Chip (NoC) architecture provides a flexible, clock-independent solution to the problems of system-on-chip (SoC) interconnect. In this paper we look at the use of CHAIN in a low-performance, smartcard chip to connect two self-timed processors and a range of memories and peripherals. Key design-time advantages provided by the use of CHAIN in this design included the ability to operate a very-narrow, high-frequency network fabric using serial communication without the need for high frequency clocking, rapid assembly in the final stages of the design and the avoidance of the need to perform timing analysis or validation on the SoC interconnect. Additionally we describe a bare port that provided direct access to the CHAIN fabric which was instrumental in testing and debugging the smartcard chip.

A Domain-Specific Cell Based ASIC Design Methodology for Digital Signal Processing Applications [p. 280]
B. Ren, A. Wang, J. Bakshi, K. Liu, W. Li, and W. Dai

This paper proposes an innovative domain-specific cell based ASIC design flow to narrow the performance gap between the full custom ASIC design method and the conventional standard-cell based ASIC design method. The flow can improve the design performance and still preserve the efficiency of the standard ASIC design flow. Targeting digital signal processing applications, a domain-specific cell library is provided to augment standard cell libraries. Experimental results of designing macros such as FFT, FIR etc. are shown in the paper. Based on this methodology a 64-tap FFT can achieve up to 24X performance improvement, under the Power x Delay x Area (PDA) criterion, over conventionally designed ASICs.

Qualification and Integration of Complex I/O in SoC Design Flows [p. 286]
J. Abraham and G. Rao

Low power, high speed, and reduced cost requirements force integration of specialized Intellectual Property (IP) like complex I/O blocks on a System on Chip (SoC). Today designers have access to a variety of specialized IP blocks and cells for use in SoC design flows. Complex I/O appear in a myriad of standards such as USB 1.0/1.1/2.0, IEEE 1394 a/b (FireWire), SSTL, HSTL, PCI-X, LVDS, and more. These new standards are driven by consumers' demand for bandwidth and capability, and the industry's desire to reuse proven design blocks in vastly different applications and domains [1]. Integration of these specialized IP blocks introduces increased complexity to design flows. For example, digital designs must now consider the analog-like properties of some complex I/O. This paper discusses the uniqueness of embedding complex I/O in an SoC. The features and properties that differentiate complex I/O from standard design practices will be described. Finally, methodologies for characterizing and building accurate digital abstractions of I/O will be presented.

10D: Low Power Design

Moderators: W. Luk, Imperial College, UK; V. Gerousis, Infineon Technologies, DE
A Power Optimized Display Memory Organization for Handheld User Terminals [p. 294]
L. Hollevoet, A. Dewilde, K. Denolf, F. Catthoor, and F. Louagie

Today's handheld devices are becoming more and more multimedia capable. One subsystem of a multimedia terminal that accounts for a considerable amount of the total power consumption is the display unit. The backlight is the major culprit there. As new display units without backlights emerge, the data transfers required to put data on the screen start using up an increasingly important part of the platform's power. We have examined a novel system view that allows for power savings by decreasing the number of memory accesses required to put a frame on the screen. A two-step optimization method for existing platforms is presented. Measurements on a multimedia application show that, on average, power savings of 72% can be obtained on the display-related memory accesses. For the proposed optimization methods to work, it is important that both hardware and software designers become aware of the impact their design-time decisions have on the final power consumption of a system.

Energy Estimation Based on Hierarchical Bus Models for Power-Aware Smart Cards [p. 300]
U. Neffe, K. Rothbart, C. Steger, R. Weiss, E. Rieger, and A. Mühlberger

Smart cards are one of the smallest computing platforms in use today. Due to their limited resources, applications are often simple. High-performance 32-bit smart cards, which have been introduced by several vendors in recent years, allow the implementation of complex applications on smart cards. In addition to the high-performance processor cores, these smart cards contain coprocessors to reach the performance and power consumption goals. The interface between the processor and the coprocessor influences the performance and power consumption and should be evaluated early in the design process. We propose a hierarchical bus model for system-level smart card design which supports accurate energy dissipation estimation. The bus models have been implemented in SystemC 2.0 at transaction level layer one (cycle accurate) and layer two (timing estimation). We evaluate accuracy and simulation performance of the models and show their usage as bus functional models for a smart card application.

Analysis and Modeling of Energy Reducing Source Code Transformations [p. 306]
C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto

This paper presents a methodology and a set of models supporting energy-driven source-to-source transformations. The most promising code transformation techniques have been isolated and studied, leading to accurate analytical and/or statistical models. Experimental results, obtained for some common embedded-system processors over a set of typical benchmarks, are presented, showing the viability of the proposed approach as a support tool for embedded software design.

A Simulation-Based Power-Aware Architecture Exploration of a Multiprocessor System-on-Chip Design [p. 312]
L. Benini, L. Bisdounis, G. Donno, F. Menichelli, and M. Olivieri

We present the design exploration of a System-on-Chip architecture dedicated to the implementation of the HIPERLAN/2 communication protocol. The task was accomplished by means of an ad-hoc C++ simulation environment, integrating power models for CPUs, memories and buses used in the design and incorporating software profiling capabilities. The architecture is based on two ARM microprocessors, an AMBA bus and a local bus, DMA unit and other peripherals. Software mapping on the processor has been based on the power/performance profiling results.

System Level Power Modeling and Simulation of High-End Industrial Network-on-Chip [p. 318]
A. Bona, V. Zaccaria, and R. Zafalon

Today's System on Chip (SoC) technology can achieve unprecedented computing speed that is shifting the IC design bottleneck from computation capacity to communication bandwidth and flexibility. This paper presents an innovative methodology for automatically generating the energy models of a versatile and parametric on-chip communication IP (STBus). Eventually, those models are linked to a standard SystemC simulator, running at BCA and TLM abstraction levels. To make the system power simulation fast and effective, we enhanced the STBus class library with a new set of power profiling features ("Power API"), allowing power analysis to be performed either statically (i.e., total average power) or at simulation runtime (i.e., dynamic profiling). In addition to random patterns, our methodology has been extensively benchmarked with the high-level SystemC simulation of a real-world multi-processor platform (MPARM). It consists of four ARM7TDMI processors accessing a number of peripheral targets (including several banks of SRAMs, interrupt slaves and ROMs) through the STBus communication infrastructure. A remarkable number of SW layers are executed on top of the MPARM platform, including a distributed real-time operating system (RTEMS) and a set of multi-tasking DSP applications. The power analysis of the benchmark platform proves to be effective and highly correlated, with an average error of 9% and an RMS of 0.015 mW vs. the reference (i.e., gate-level) power figures.
Keywords: Network-on-Chip power analysis, communication based low power design, system-level energy optimization.

IEM926: An Energy Efficient SoC with Dynamic Voltage Scaling [p. 324]
K. Flautner, D. Flynn, D. Roberts, and D. Patel

One of today's most successful embedded devices, the mobile phone, embodies a set of challenging design requirements: long battery life, small size, high performance and low cost. The single parameter that complicates the simultaneous fulfilment of all of these design goals is energy efficiency of the system, since batteries only hold a finite amount of charge. To operate within the allotted energy budget, systems must be optimized for energy consumption during design and also at run-time. Increasingly it is not sufficient to statically optimize for worst-case conditions but designers must enable systems to adapt to conditions at runtime. The Intelligent Energy Manager™ (IEM) technology provides an integrated solution for addressing energy management of SoC devices. In this paper we present data about the energy consumption characteristics of a multiple power-domain based SoC which includes PDA functionality built around an ARM926EJ-S core.

Interactive Presentations

Can IP Quality be Objectively Measured? [p. 330]
K. Werner

Virtual Components (VC), also known as Intellectual Property (IP), have long been a part of the engineering reality. Business drivers, such as improved time to market and better resource utilization, are factoring ever more into the make-versus-buy decision process. Maximizing in-house design resources and purchasing commodity or standard IP has become the de facto business model. Unfortunately, with the increasing number of IP vendors competing in the marketplace, the decision-making process is not clear. Simplistically, functionality needs to be the first criterion, but when two or more similar IPs are available, the selection quickly becomes more difficult. This paper addresses the process of measuring IP quality, presents a summary of the VSIA Quality IP (QIP) Metric, and reports on the ongoing work.

Improving Design and Verification Productivity with VHDL-200x [p. 332]
S. Bailey, E. Marschner, J. Lewis, J. Bhasker, and P. Ashenden

VHDL is a critical language for RTL design and is a major component of the $200+ million RTL simulation market. Many users prefer to use VHDL for RTL design as the language continues to provide desired characteristics in design safety, flexibility and maintainability. While VHDL has provided significant value for digital designers since 1987, it has had only one significant language revision, in 1993. It has taken many years for design state-of-practice to catch up to and, in some cases, surpass the capabilities that have been available in VHDL for over 15 years. Last year, the VHDL Analysis and Standardization Group (VASG), which is responsible for the VHDL standard, received clear indication from the VHDL community that it was now time to look at enhancing VHDL. In response to the user community, VASG initiated the VHDL-200x project. VHDL-200x will result in at least two revisions of the VHDL standard. The first revision is planned to be completed next year (2004) and will include a C language interface (VHPI), a collection of high user value enhancements to improve designer productivity and modeling capability, and potential inclusion of assertion-based verification and testbench modeling enhancements. A second revision is planned to follow about 2 years later. This paper focuses on the first revision enhancements.

Building the Hierarchy from a Flat Netlist for a Fast and Accurate Post-Layout Simulation with Parasitic Components [p. 336]
P. Daglio, D. Iezzi, D. Rimondi, C. Roma, and S. Santapa

The main concerns about post-layout simulation today relate to the format of the netlist produced by the parasitic extractor. In fact, such a netlist is usually flat, so its readability, compared to the pre-layout hierarchical one, is very poor because device and net names often change and because it is difficult to compare pre-layout and post-layout output signals. Furthermore, simulating such large flat netlists is frequently time-consuming because it is not possible to exploit algorithms like Hierarchical Array Reduction (HAR) and Isomorphic Matching (IM), the strong points of state-of-the-art full-chip simulators. In this paper, we present a new approach that, starting from a flat netlist with parasitic components and a pre-layout hierarchical one, creates a fully hierarchical post-layout netlist containing device parameters and parasitic components directly extracted from the layout. In this way, a fast and accurate post-layout simulation is made possible by the use of look-up table simulators, taking advantage of the HAR and IM algorithms mentioned above. This methodology has been integrated in a complete design flow to guarantee first-silicon success, cut down time-to-design, improve time-to-market and streamline design quality.

VHDL-AMS Library Development for Pacemaker Applications [p. 338]
B. Hecker, M. Chavassieux, M. Laflutte, E. Beguin, L. Lagasse, and J. Oudinot

This paper describes the development by ELA Medical of an analog library dedicated to implantable pacemakers and defibrillators using the VHDL-AMS language for mixed-signal ASICs. ELA Medical has been a leading company since 1977 for medical devices used in the diagnosis and treatment of heart rhythm disorders. The objective is to provide designers with a ready-to-use customized library for mixed-signal top-down and bottom-up methodologies. The dramatic gain in simulation speed from using behavioral models allows more exhaustive functional validation within an acceptable simulation time. The ADMS mixed-signal simulator from Mentor Graphics has been used with design kit environments provided by major silicon vendors.

Modeling and Analysis of Heterogeneous Industrial Networks Architectures [p. 342]
F. Fummi, M. Poncino, S. Martini, G. Perbellini, and M. Monguzzi