DATE 2003 ABSTRACTS

Sessions: [Plenary] [1A] [1B] [1C] [1F] [2A] [2B] [2C] [2E] [2F] [3A] [3B] [3C] [3F] [4A] [4B] [4C] [4E] [4F] [5A] [5B] [5C] [5E] [5F] [6A] [6B] [6C] [6E] [6F] [7A] [7B] [7C] [7E] [7F] [8A] [8B] [8C] [8D] [8E] [8F] [9A] [9B] [9C] [9D] [9E] [9F1] [9F2] [10A] [10B] [10C] [10D] [10E] [10F] [Poster]


Plenary: Keynote Session

Moderator: D. Verkest, IMEC, BE

IC Design Challenges for Ambient Intelligence [p. 2]
E. Aarts and R. Roovers

The vision of Ambient Intelligence opens a world of unprecedented experiences: the interaction of people with electronic devices is changed as contextual awareness, natural interfaces and ubiquitous availability of information are realized. We analyze the consequences of the ambient intelligence vision for electronic devices by mapping the involved technologies on a power-information graph. Based on the differences in power consumption, three types of devices are introduced: the autonomous or microWatt-node, the personal or milliWatt-node and the static or Watt-node. Ambient intelligent functions are realized by a network of these devices with the computing, communication and interface electronics realized in Silicon IC technologies. Three case studies highlight the IC design challenges involved, and show the variety of problems that have to be solved.

Semiconductor Challenges [p. 8]
A. Cuomo

Semiconductors will play a key role in driving technological evolution over the next 20 years. We already possess many of the technologies that will deeply change our scenario, among them nanotechnologies, bioelectronics and photonics. The central role of Integrated Circuits in the economy will grow ever stronger in the future, starting from the convergence of storage, security, video, audio, mobility and connectivity. Systems are converging, and ICs are converging more and more with systems. The fundamental issue is how to translate knowledge and competence coming from different fields into single architectures. The key factor in winning this challenge is to build the right culture. This means being able to build an organisation for innovation, with the right mix of creativity, personal initiative and execution skills.


1A: Hot Topic: Ambient Intelligence Visions and Achievements: Linking Abstract Ideas to Real-World Concepts

Organizer/Moderator: M. Lindwer, Philips, NL
Speakers:
R. Zimmermann, European Union, Unit for Microelectronics, BE
D. Marculescu, CMU, US
R. Marculescu, CMU, US
S. Jung, Infineon, DE
E. Cantatore, Philips, NL
T. Basten, Eindhoven U, NL

Panel Statement [p. 10]

The Ambient Intelligence vision is abstract and as such not useful for funding decisions, research project definition, and business plan development. This is in particular the case for the electronic design community. The European Commission intends for the EU to achieve world leadership in Information Society Technologies within ten years. To that end, it has incorporated the Ambient Intelligence vision in its Sixth Framework Programme. Microelectronics and nano- and optical devices are seen as key technologies. Interesting chip-level challenges are found in, amongst others, explicit modeling of mobility and self-management, and novel computing substrates based on electronic textiles or organic electronics.


1B: Energy-Efficient Memory Systems

Moderators: V. Narayanan, Penn State U, US; P. Fugger, Infineon, DE

Improving the Efficiency of Memory Partitioning by Address Clustering [p. 18]
A. Macii, E. Macii, and M. Poncino

Memory partitioning is an effective approach to memory energy optimization in embedded systems. Spatial locality of the memory address profile is the key property that partitioning exploits to determine an efficient multi-bank memory architecture. This paper presents an approach, called address clustering, for increasing the locality of a given memory access profile, and thus improving the efficiency of partitioning. Results obtained on several embedded applications running on an ARM7 core show average energy reductions of 25% (maximum 57%) w.r.t. a partitioned memory architecture synthesized without resorting to address clustering.
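
As a rough illustration of the underlying idea (not the authors' algorithm), the following Python sketch counts how often addresses are accessed back-to-back in a trace and greedily packs frequently adjacent addresses into the same cluster, so that a later partitioning step can map each cluster onto a small, cheap memory bank. The trace, cluster size and scoring heuristic are all invented for this example.

    # Illustrative sketch only: greedily cluster addresses that are
    # frequently accessed back-to-back in the profile.
    from collections import Counter

    def adjacency_counts(trace):
        """Count how often each unordered address pair appears consecutively."""
        pairs = Counter()
        for a, b in zip(trace, trace[1:]):
            if a != b:
                pairs[frozenset((a, b))] += 1
        return pairs

    def cluster(trace, cluster_size):
        pairs = adjacency_counts(trace)
        clusters, placed = [], set()

        def place(addr):
            if addr in placed:
                return
            if not clusters or len(clusters[-1]) == cluster_size:
                clusters.append([])
            clusters[-1].append(addr)
            placed.add(addr)

        for pair, _ in pairs.most_common():   # most frequent neighbors first
            for addr in sorted(pair):
                place(addr)
        for addr in trace:                    # addresses that never formed a pair
            place(addr)
        return clusters

    trace = [0, 1, 7, 0, 1, 7, 3, 4, 3, 4, 9]
    print(cluster(trace, cluster_size=2))     # -> [[3, 4], [0, 1], [7, 9]]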

A New Algorithm for Energy-Driven Data Compression in VLIW Embedded Processors [p. 24]
A. Macii, E. Macii, F. Crudo, and R. Zafalon

This paper presents a new algorithm for on-the-fly data compression in high-performance VLIW processors. The algorithm aggressively targets energy minimization of some of the dominant factors in the SoC energy budget (i.e., main memory access and the high-throughput global bus). Based on a differential technique, both the new algorithm and the HW compression unit have been developed to efficiently manage data compression and decompression in a high-performance industrial processor architecture, under strict real-time constraints (Lx-ST200: a 4-issue, 6-stage pipelined VLIW processor with on-chip D- and I-caches). The original data cache line is compressed before write-back to main memory and then decompressed whenever a cache refill takes place. An extensive experimental strategy has been developed for the specific validation of the target Lx processor. In order to allow public comparison, we also report the results obtained on a MIPS pipelined RISC processor simulated with SimpleScalar. The two platforms have been benchmarked over Ptolemy and MediaBench programs. Energy savings provided by the application of the proposed technique range from 10% to 22% on the Lx-ST200 platform and from 11% to 14% on the MIPS platform.
Keywords: Data compression algorithms, system-level energy optimization, VLIW embedded processors.
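
For intuition only, a differential scheme of the general kind the abstract above alludes to can be sketched in a few lines of Python: store a base word verbatim and encode subsequent words as small deltas when they fit. The word width, delta threshold and escape handling below are invented and do not reproduce the paper's algorithm.

    # Illustrative differential compression of a cache line: the first
    # word is stored verbatim, later words become 1-byte signed deltas
    # when possible, with a full-word escape otherwise.
    def compress(line):
        out = [("full", line[0])]
        for prev, cur in zip(line, line[1:]):
            delta = cur - prev
            if -128 <= delta <= 127:
                out.append(("delta", delta))   # 1 byte instead of 4
            else:
                out.append(("full", cur))      # escape: raw 32-bit word
        return out

    def decompress(stream):
        line = []
        for kind, value in stream:
            line.append(value if kind == "full" else line[-1] + value)
        return line

    line = [0x1000, 0x1004, 0x1008, 0x9000_0000, 0x9000_0010]
    assert decompress(compress(line)) == line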

Power Efficiency through Application-Specific Instruction Memory Transformations [p. 30]
P. Petrov and A. Orailoglu

The instruction memory communication path accounts for a significant amount of the power consumption in embedded processors. We propose an encoding technique that exploits application information to reduce the associated power consumption. The microarchitectural support enables reprogrammability of the encoding transformations so as to track code particularities effectively. The restriction to functional transformations enables effective coding while delivering major power savings, furthermore obviating the need to rely on dictionary lookup, one of the major shortcomings of prior approaches. The frugal functional transformation, reliant on a single-bit logic gate, has no impact on the critical fetch stage of the processor pipeline while delivering fully all the theoretically achievable power savings. The reprogrammable hardware implementation enables flexible and inexpensive switches between the transformations. Extensive experimental results on numerical and DSP codes confirm the theoretically expected magnitude of power savings, evincing reductions that range up to half of the original transitions.
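
The paper's own transformation is not spelled out in the abstract; as a point of comparison, the classic bus-invert code (one extra control bit per word) shows how a one-bit encoding decision can cut bus transitions. The Python sketch below is that textbook technique, not the proposed one.

    # Classic bus-invert coding, shown only for comparison: one extra
    # bus line signals whether the word is sent inverted, and the
    # encoder picks whichever form toggles fewer wires.
    def popcount(x):
        return bin(x).count("1")

    def transitions_plain(words):
        prev, total = 0, 0
        for w in words:
            total += popcount(prev ^ w)
            prev = w
        return total

    def transitions_bus_invert(words, width=32):
        mask = (1 << width) - 1
        prev_word, prev_inv, total = 0, 0, 0
        for w in words:
            keep = popcount(prev_word ^ w) + (prev_inv ^ 0)
            flip = popcount(prev_word ^ (w ^ mask)) + (prev_inv ^ 1)
            if flip < keep:
                prev_word, prev_inv = w ^ mask, 1
                total += flip
            else:
                prev_word, prev_inv = w, 0
                total += keep
        return total

    fetch = [0x0000_0000, 0xFFFF_FF00, 0x0000_00FF]
    print(transitions_plain(fetch), "->", transitions_bus_invert(fetch))  # 56 -> 10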

Low Energy Data Management for Different On-Chip Memory Levels in Multi-Context Reconfigurable Architectures [p. 36]
M. Sánchez-Élez, M. Fernández, M. Anido, H. Du, R. Hermida, and N. Bagherzadeh

This paper presents a new technique to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal is to improve application energy consumption. Two levels of on-chip data storage are assumed in the reconfigurable architecture. The Data Scheduler attempts to optimally exploit this storage, by deciding in which on-chip memory the data have to be stored in order to reduce energy consumption. We also show that a suitable data scheduling could decrease the energy required to implement the dynamic reconfiguration of the system.


1C: Embedded Tutorial: Circuit, Platform Design and Test Challenges in Technologies Beyond 90nm

Organizer/Moderator: S. Kundu, Intel, US
Speakers:
B. Grundmann, Intel Corporation, US
R. Galivanche, Principal Engineer, Intel Corporation, US

Panel Statement [p. 44]

Silicon designers already face a huge number of problems, and the situation is likely to get worse. Many of these problems are technical, associated with shrinking geometries and increasing architectural complexity, but a significant number seem to be caused by procedural mistakes and issues. Many of the technical problems are solved and re-solved on a piecemeal basis, focusing on local optimizations of small design-space problems. Unfortunately, many of these local solutions create a less apparent but larger inefficiency in the whole design flow. The reason for this is that few ever look at the whole design methodology, especially as it applies to large design teams. As a consequence, this lack of oversight of the whole methodology is causing project procedural problems and inefficiencies.


1F: Uncertainty

Moderators: F. Johannes, TU Munich, DE; C. Sechen, Washington U, US

Global Wire Bus Configuration with Minimum Delay Uncertainty [p. 50]
L. Huang, H. Chen, and D. Wong

The gap between the advances and the utilization of deep submicron (DSM) technology is increasing as each new generation of technology is introduced faster than ever. Signal integrity is one of the most important issues in overcoming this gap. With the increasing coupling capacitance between high-aspect-ratio wires, the delay uncertainty is unpredictable in the current design flow. We present an algorithm to generate a global wire bus configuration with minimum delay uncertainty under timing constraints. The timing window information from the timing budget (or specified in IPs) is integrated with modern, accurate crosstalk noise models in the proposed algorithm. HSPICE simulations show that the algorithm is very effective and efficient when compared to a buffer insertion scheme with minimum delay. The standard deviation of the delay obtained from Monte-Carlo simulation is improved by up to 73%. This global wire bus configuration can be adopted in early wire planning to improve timing closure and increase the accuracy of the timing budget.
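
A first-order picture of where this delay uncertainty comes from: an aggressor switching with or against the victim effectively scales the coupling capacitance by a Miller factor between 0 and 2, widening the victim's delay window. The Python sketch below uses an Elmore-style estimate with invented R/C values.

    # Crosstalk-induced delay uncertainty, first-order model: the
    # aggressor's switching direction scales the coupling capacitance
    # by a Miller factor k in [0, 2]. All R/C values are invented.
    R_DRV = 200.0       # victim driver resistance (ohm)
    C_GND = 150e-15     # victim capacitance to ground (F)
    C_CPL = 100e-15     # victim-aggressor coupling capacitance (F)

    for k, case in [(0, "aggressor switches with victim"),
                    (1, "aggressor quiet"),
                    (2, "aggressor switches against victim")]:
        delay = 0.69 * R_DRV * (C_GND + k * C_CPL)   # Elmore-style 50% delay
        print(f"{case:34s}: {delay * 1e12:5.1f} ps")  # 20.7 / 34.5 / 48.3 ps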

Timing Verification with Crosstalk for Transparently Latched Circuits [p. 56]
H. Zhou

Delay variation due to crosstalk has made timing analysis a hard problem. In sequential circuits with transparent latches, crosstalk makes timing verification (also known as clock schedule verification) even harder. In this paper, we point out a false negative problem in current timing verification techniques and propose a new approach based on switching windows. In this approach, coupling delay calculations are combined naturally with latch timing iterations. A novel algorithm is given for timing verification with crosstalk in transparently latched circuits, and preliminary experiments show promising results.
Keywords: Timing, Clock, Verification, Coupling, Delay

Statistical Timing Analysis Using Bounds [p. 62]
A. Agarwal, D. Blaauw, V. Zolotov, and S. Vrudhula

The growing impact of within-die process variation has created the need for statistical timing analysis, where gate delays are modeled as random variables. Statistical timing analysis has traditionally suffered from exponential run time complexity with circuit size, due to the dependencies created by reconverging paths in the circuit. In this paper, we propose a new approach to statistical timing analysis which uses statistical bounds. First, we provide a formal definition of the statistical delay of a circuit and derive a statistical timing analysis method from this definition. Since this method for finding the exact statistical delay has exponential run time complexity with circuit size, we also propose a new method for computing statistical bounds which has linear run time complexity. We prove the correctness of the proposed bounds. Since we provide both a lower and upper bound on the true statistical delay, we can determine the quality of the bounds. The proposed methods were implemented and tested on benchmark circuits. The results demonstrate that the proposed bounds have only a small error.
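
The direction of such bounds can be checked with a toy Monte Carlo experiment: when two paths share a gate, their delays are positively correlated, and pretending the shared delay is sampled independently per path yields a pessimistic estimate of the circuit delay. The Python sketch below uses invented delay distributions and is not the authors' construction.

    # Monte Carlo intuition for the bounds (invented numbers): paths A
    # and B share one gate. Resampling the shared gate delay
    # independently per path makes max(A, B) stochastically larger,
    # i.e. a safe upper bound for this additive delay structure.
    import random
    random.seed(1)
    N = 200_000

    exact, indep = [], []
    for _ in range(N):
        shared = random.gauss(10, 2)            # gate common to both paths
        exact.append(max(shared + random.gauss(5, 1),
                         shared + random.gauss(7, 1)))
        indep.append(max(random.gauss(10, 2) + random.gauss(5, 1),
                         random.gauss(10, 2) + random.gauss(7, 1)))

    def mean(xs):
        return sum(xs) / len(xs)

    print(f"exact mean delay  : {mean(exact):.2f}")   # ~17.0
    print(f"independence bound: {mean(indep):.2f}")   # ~17.5 (pessimistic)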

Reduced Delay Uncertainty in High Performance Clock Distribution Networks [p. 68]
D. Velenis, E. Friedman, and M. Papaefthymiou

The design of clock distribution networks in synchronous digital systems presents enormous challenges. Controlling the clock signal delay in the presence of various noise sources, process parameter variations, and environmental effects represents a fundamental problem in the design of high speed synchronous circuits. A polynomial time algorithm that improves the tolerance of a clock distribution network to process and environmental variations is presented in this paper. The algorithm generates a clock tree topology that minimizes the uncertainty of the clock signal delay to the most critical data paths. Strategies for enhancing the physical layout of the clock tree to decrease delay uncertainty are also presented. Application of the methodology on benchmark circuits demonstrates clock tree topologies with decreased delay uncertainties of up to 90%. Techniques to enhance a clock tree layout have been applied on a set of benchmark circuits, yielding a reduction in delay uncertainty of up to 48%.


2A: Hot Topic: Scaling into Ambient Intelligence

Organizer/Moderator: T. Basten, TU Eindhoven, NL
Speakers:
A. Chandrakasan, MIT, US
M. Lindwer, Philips, NL
J. Liu, PARC, US
L. Benini, DEIS, Bologna U, IT
R. Min, MIT, US
F. Zhao, Palo Alto Research Center (PARC), US

Panel Statement [p. 76]

Envision a situation in which high-quality information and entertainment are easily accessible to anyone, anywhere, at any time, and on any device. How realistic is this vision? And what does it require from the underlying technology? Ambient Intelligence (AmI) integrates concepts ranging from ubiquitous computing to autonomous and intelligent systems. An AmI environment will be highly dynamic in many aspects. The underlying technology must be very flexible to cope with this dynamism. Scalability of technology is only one crucial aspect. This paper explores scalability from the processing, communication, and software perspectives.
Keywords: multi-processor systems on chip, high-density wireless networks, scalable software infrastructure, scalable algorithms, low power and energy


2B: Power-Aware Design and Synthesis

Moderators: R. Zafalon, STMicroelectronics, IT; D. Marculescu, CMU, US

Masking the Energy Behavior of DES Encryption [p. 84]
H. Saputra, S. Vijaykrishnan, M. Kandemir, M. Irwin, R. Brooks, S. Kim, and W. Zhang

Smart cards are vulnerable to both invasive and non-invasive attacks. In particular, non-invasive attacks that use power and timing measurements to extract the cryptographic key have drawn a lot of negative publicity for smart card usage. Power measurement techniques rely on the data-dependent energy behavior of the underlying system. Further, power analysis can be used to identify the specific portions of the program being executed, in order to induce timing glitches that may in turn help to bypass key checking. Thus, it is important to mask the energy consumption when executing encryption algorithms. In this work, we augment the instruction set architecture of a simple five-stage pipelined smart card processor with secure instructions to mask the energy differences due to key-related data-dependent computations in DES encryption. The secure versions operate on the normal and complementary versions of the operands simultaneously to mask the energy variations due to value-dependent operations. However, this incurs the penalty of increased overall energy consumption in the datapath components. Consequently, we employ secure versions of instructions only for critical operations; that is, we use secure instructions selectively, as directed by an optimizing compiler. Using a cycle-accurate energy simulator, we demonstrate the effectiveness of this enhancement. Our approach achieves the energy masking of critical operations while consuming 83% less energy than existing approaches employing dual-rail circuits.
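
The masking intuition can be illustrated with a toy model: if switching energy tracks the Hamming weight of a bus value, then processing x together with its complement makes the total weight equal to the bus width for every x, i.e. data independent. The Python sketch below uses this invented weight-based energy model purely for illustration.

    # Toy model: "energy" is the Hamming weight of the value on a bus.
    # weight(x) + weight(~x) equals the bus width for every x, so the
    # dual-rail quantity leaks nothing about x.
    WIDTH = 8
    MASK = (1 << WIDTH) - 1

    def weight(x):
        return bin(x & MASK).count("1")

    for x in (0x00, 0x5A, 0xFF):
        print(f"x={x:#04x}  plain={weight(x)}  dual-rail={weight(x) + weight(~x)}")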

Scheduling and Mapping of Conditional Task Graphs for the Synthesis of Low Power Embedded Systems [p. 90]
D. Wu, B. Al-Hashimi, and P. Eles

This paper describes a new Dynamic Voltage Scaling (DVS) technique for embedded systems expressed as Conditional Task Graphs (CTGs). The idea is to identify and exploit the available worst-case slack time, taking into account the conditional behaviour of CTGs. We also examine the effect of combining a genetic-algorithm-based mapping with the DVS technique for CTGs and show that further energy reduction can be obtained. The techniques have been tested on a number of CTGs including a real-life example. The results show that the DVS technique can be applied to CTGs with energy savings of up to 24%. Furthermore, it is shown that savings of up to 51% are achieved by considering DVS during the mapping.

Synthesis of Application-Specific Highly Efficient Multi-Mode Systems for Low-Power Applications [p. 96]
L. Chiou, S. Bhunia, and K. Roy

We present a novel design methodology for synthesizing multiple configurations (or modes) into a single programmable system. Many DSP and multimedia applications require reconfigurability of a system along with efficiency in terms of power, performance and area. FPGAs provide a reconfigurable platform; however, they are slower and consume significantly more power than a customized ASIC. In this work, we have developed techniques to realize an efficient reconfigurable system for a set of user-specified configurations. A data flow graph transformation method coupled with efficient scheduling and allocation is used to automatically synthesize the system from its behavioral-level specification. Experimental results on several applications demonstrate that we can achieve about 60X power reduction on average with about 4X improvement in performance over corresponding FPGA implementations.


2C: Test Data Compression

Moderators: M. Flottes, LIRMM, FR; P. Prinetto, Politecnico di Torino, IT

Virtual Compression through Test Vector Stitching for Scan Based Design [p. 104]
W. Rao and A. Orailoglu

We propose a technique for compressing test vectors. The technique reduces test application time and tester memory requirements by utilizing part of the predecessor response in constructing the subsequent test vector. An algorithm is provided for stitching test vectors that retains full fault coverage while appreciably reducing time and tester requirements. The analysis provided enables significant compression ratios, while necessitating no hardware outlay whatsoever, making the technique we propose particularly suitable for SOC testing. The test time benefits are obtained without MISR utilization, thus incurring no consequent aliasing loss. We examine a number of implementation considerations for the new compression technique and we provide experimental data that can be used to guide an eventual commercial implementation. The experimental data confirm the significant test application time and tester memory reductions.

Test Pattern Compression Using Prelude Vectors in Fan-Out Scan Chain with Feedback Architecture [p. 110]
N. Oh, R. Kapur, T. Williams, and J. Sproch

This paper proposes a new test compression technique that employs a Fan-out SCAN chain with Feedback (FSCANF) architecture. It allows us to use prelude vectors to resolve dependencies created by fanning out multiple scan chains from a single scan-in pin. This paper describes the newly proposed architecture as well as the algorithm that generates compressed test vectors using a vertex coloring algorithm. The distribution of specified bits in each test pattern determines the compression ratio of the individual test pattern. Therefore, our technique optimizes the overall compression ratio and achieves a higher reduction in test data and application time than previous techniques, which use the extreme case of serializing all the scan chains in the presence of conflicts across the fan-out scan chains. The FSCANF architecture has small hardware overhead and is independent of scan cell orders in the scan chains. Experimental results show that our technique significantly reduces both test data volume and test application time in six of the largest ISCAS 89 sequential benchmark circuits compared to previous techniques.
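
To see where vertex coloring enters, note that scan chains driven from one pin receive identical bits, so two chains whose specified (care) bits disagree somewhere in a pattern conflict and must be served separately. The Python sketch below builds such a conflict graph and colors it greedily; the chain slices and the coloring heuristic are invented for illustration.

    # Conflict graph over scan chains: chains i and j get an edge when
    # their care bits disagree, and a greedy coloring groups chains
    # that can share one scan-in pin for this pattern.
    def conflicts(slices):
        """slices[c] = care-bit string of chain c ('0', '1' or 'x')."""
        n = len(slices)
        edges = set()
        for i in range(n):
            for j in range(i + 1, n):
                if any(a != 'x' and b != 'x' and a != b
                       for a, b in zip(slices[i], slices[j])):
                    edges.add((i, j))
        return edges

    def greedy_color(n, edges):
        color = {}
        for v in range(n):
            taken = {color[u] for u in color
                     if (min(u, v), max(u, v)) in edges}
            color[v] = min(c for c in range(n) if c not in taken)
        return color

    chains = ["1x0x", "1x1x", "x00x", "0xxx"]
    e = conflicts(chains)
    print(e, greedy_color(len(chains), e))   # 3 colors -> 3 scan-in groups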

A Technique for High Ratio LZW Compression [p. 116]
M. Knieser, D. Weyer, R. McIntyre, F. Wolff, and C. Papachristou

Reduction of both the test suite size and the download time of test vectors is important in today's System-On-a-Chip designs. In this paper, a method for compressing the scan test patterns using the LZW algorithm is presented. This method leverages the large number of "Don't-Cares" in test vectors in order to improve the compression ratio significantly. The hardware decompression architecture presented here uses existing on-chip embedded memories. Tests using the ISCAS89 and the ITC99 benchmarks show that this method achieves high compression ratios.
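
The key point, that don't-care bits are free for the compressor to choose because any fill still detects the targeted faults, can be illustrated with a plain LZW coder plus a simple fill policy. The policy below (copy the previous vector's bit) is an invented stand-in for the paper's mapping.

    # Fill don't-cares, LZW-compress, and verify the round trip plus
    # the preservation of every specified bit.
    def fill_dont_cares(vectors):
        prev = '0' * len(vectors[0])
        filled = []
        for v in vectors:
            v = ''.join(p if c == 'x' else c for p, c in zip(prev, v))
            filled.append(v)
            prev = v
        return filled

    def lzw_compress(data):
        table = {'0': 0, '1': 1}
        codes, w = [], data[0]
        for c in data[1:]:
            if w + c in table:
                w += c
            else:
                codes.append(table[w])
                table[w + c] = len(table)
                w = c
        codes.append(table[w])
        return codes

    def lzw_decompress(codes):
        table = {0: '0', 1: '1'}
        w = table[codes[0]]
        out = [w]
        for k in codes[1:]:
            entry = table[k] if k in table else w + w[0]
            out.append(entry)
            table[len(table)] = w + entry[0]
            w = entry
        return ''.join(out)

    vectors = ["10xx", "1x01", "x0x1", "1xx1"]
    stream = ''.join(fill_dont_cares(vectors))
    codes = lzw_compress(stream)
    assert lzw_decompress(codes) == stream
    # every specified bit survives the fill
    assert all(c in ('x', f) for v, fv in zip(vectors, fill_dont_cares(vectors))
               for c, f in zip(v, fv))
    print(f"{len(stream)} bits -> {len(codes)} codes")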

Fast Computation of Data Correlation Using BDDs [p. 122]
Z. Zeng, Q. Zhang, I. Harris, and M. Ciesielski

Data correlation is a well-known problem that causes difficulty in VLSI testing. Based on a correlation metric, an efficient heuristic to select BIST registers was proposed in previous work. However, the computation of data correlation itself was a computationally intensive process and became a bottleneck in that work. This paper presents an efficient technique to compute data correlation using Binary Decision Diagrams (BDDs). Once a BDD is built, our algorithms take linear time to compute the corresponding data correlation. The experimental results show that this technique is much faster than the previous simulation-based technique. It enables testing approaches based on data correlation to handle more practical designs. As one successful application, partial scan is demonstrated by integrating our computation results.


2E: Operating System Abstraction and Targeting (Embedded Software Forum)

Moderators: S. Yoo, TIMA Laboratory, FR; M. Coppola, STMicroelectronics, IT

RTOS Modeling for System Level Design [p. 130]
A. Gerstlauer, H. Yu, and D. Gajski

System-level synthesis is widely seen as the solution for closing the productivity gap in system design. High-level system models are used in system-level design for early design exploration. While real-time operating systems (RTOS) are an increasingly important component in system design, specific RTOS implementations cannot be used directly in high-level models. On the other hand, existing system-level design languages (SLDL) lack support for RTOS modeling. In this paper we propose an RTOS model built on top of existing SLDLs which, by providing the key features typically available in any RTOS, allows the designer to model the dynamic behavior of multi-tasking systems at higher abstraction levels, and which can be incorporated into existing design flows. Experimental results show that our RTOS model is easy to use and efficient while providing accurate results.
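
The flavor of such an RTOS model can be conveyed with a few lines of Python, using generators in place of SLDL processes: tasks yield at points that model blocking OS calls, and a tiny scheduler object stands in for the concrete RTOS. All API names here are invented for the illustration.

    # Sketch of an RTOS model layered on a host language: tasks are
    # coroutines, and a minimal round-robin scheduler replaces the
    # concrete RTOS during early design exploration.
    import collections

    class TinyRTOS:
        def __init__(self):
            self.ready = collections.deque()

        def task_create(self, gen):
            self.ready.append(gen)

        def run(self):
            while self.ready:
                task = self.ready.popleft()
                try:
                    next(task)               # run until the task yields
                    self.ready.append(task)  # round-robin re-queue
                except StopIteration:
                    pass                     # task finished

    def worker(name, steps):
        for i in range(steps):
            print(f"{name} step {i}")
            yield                            # models a blocking OS call

    rtos = TinyRTOS()
    rtos.task_create(worker("A", 2))
    rtos.task_create(worker("B", 3))
    rtos.run()   # interleaves: A0 B0 A1 B1 B2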

Modeling and Integration of Peripheral Devices in Embedded Systems [p. 136]
S. Wang, S. Malik, and R. Bergamaschi

This paper describes automation methods for device driver development in IP-based embedded systems in order to achieve high reliability, productivity, reusability and fast time to market. We formally specify device behaviors using event driven finite state machines, communication channels, declaratively described rules, constraints and synthesis patterns. A driver is synthesized from this specification for a virtual environment that is platform (processor, operating system and other hardware) independent. The virtual environment is mapped to a specific platform to complete the driver implementation. The illustrative application of our approach for a USB device driver in Linux demonstrates improved productivity and reusability.

Systematic Embedded Software Generation from SystemC [p. 142]
F. Herrera, H. Posadas, P. Sánchez, and E. Villar

The embedded software design cost represents an important percentage of embedded-system development costs [1]. This paper presents a method for systematic embedded software generation that reduces the software generation cost in a platform-based HW/SW codesign methodology for embedded systems based on SystemC. The goal is that the same SystemC code allows system-level specification and verification and, after SW/HW partitioning, SW/HW co-simulation and embedded software generation. The C++ code for the SW partition (processes and process communication, including HW/SW interfaces) is systematically generated, including the user-selected embedded OS (e.g., the eCos open-source OS).


2F: Analysis of Jitter and Noise for Analogue Systems and SD Modelling and Simulation

Moderators: G. Vandersteen, IMEC, BE; H. Graeb, TU Munich, DE

Noise Macromodel for Radio Frequency Integrated Circuits [p. 150]
Y. Xu, X. Li, P. Li, and L. Pileggi

Noise performance is a critical analog and RF circuit design constraint, and can impact the selection of the IC system-level architecture. It is therefore imperative that some model of the noise is represented at the highest levels of abstraction during the design process. In this paper we propose a noise macromodel for analog circuits and demonstrate it by way of implementation in a system level simulator based on MATLAB. We also explain our process of macromodel extraction via reformulation of frequency-domain noise analysis results, and the corresponding steps of model order reduction. The results demonstrate the efficacy of this macromodel for frequency domain system level simulation.

Approximation Approach for Timing Jitter Characterization in Circuit Simulators [p. 156]
M. Gourary, S. Rusakov, S. Ulyanov, M. Zharov, K. Gullapalli, and B. Mulvaney

A new computational concept of timing jitter is proposed that is suitable for use in circuit simulators. It is based on the approximation of the computed noise characteristic. It is explained how different models of the building blocks can easily be included in the model. To show the validity of the model, simulation results from an implementation in the MATLAB programming language are presented.

Behavioural Modelling and Simulation of Delta-Sigma Modulators Using Hardware Description Languages [p. 168]
R. Castro-López, F. Fernández, F. Medeiro, and A. Rodríguez-Vázquez

Behavioural simulation is the common alternative to the costly electrical simulation of Delta-Sigma modulators (ΔΣMs). This paper explores the behavioural modelling and simulation of ΔΣMs using hardware description languages (HDLs) and commercial behavioural simulators, as an alternative to the usual special-purpose behavioural simulators. A library of building blocks, in which an HDL has been used to model a complete set of circuit non-idealities influencing the performance of ΔΣMs, is introduced. Three alternatives for describing ΔΣM topologies have been implemented. Experimental results of the simulation of a fourth-order 2-1-1 cascade multi-bit ΔΣM are given.


3A: Hot Topic: Securing Mobile Appliance: New Challenges for the System Designer

Organizer/Moderator: A. Raghunathan, NEC, US
Speakers:
S. Ravi, NEC, US
S. Hattangady, Texas Instruments, US
J.-J. Quisquater, UC de Louvain, BE

Panel Statement [p. 176]

As intelligent electronic systems pervade all aspects of our lives, capturing, storing, and communicating a wide range of sensitive and personal data, security is emerging as a critical concern that must be addressed in order to enable several current and future applications. Mobile appliances, which will play a critical role in enabling the visions of ubiquitous computing and communications, and ambient intelligence, are perhaps the most challenging to secure -- they often rely on a public medium for (wireless) communications, are easily lost or stolen due to their small form factors and mobility, and are highly constrained in cost and size, as well as computing and battery resources. This paper presents an introduction to security concerns in mobile appliances, and translates them into challenges that confront system architects, HW engineers, and SW developers, including how to bridge the processing and battery gaps, efficient tamper-proofing of devices, content protection, etc. Recent innovations and emerging commercial technologies that address these issues are also highlighted. We envision that, for a large class of embedded systems, security considerations will pervade all aspects of system design, driving innovations in system architecture, software, circuits, and design methodologies.


3B: Scheduling and Analysis of Embedded Systems

Moderators: V. Mooney, Georgia IT, US; A. Jantsch, Royal IT, SE

Schedulability Analysis and Optimization for the Synthesis of Multi-Cluster Distributed Embedded Systems [p. 184]
P. Pop, P. Eles, and Z. Peng

We present an approach to schedulability analysis for the synthesis of multi-cluster distributed embedded systems consisting of time-triggered and event-triggered clusters, interconnected via gateways. We have also proposed a buffer size and worst case queuing delay analysis for the gateways, responsible for routing inter-cluster traffic. Optimization heuristics for the priority assignment and synthesis of bus access parameters aimed at producing a schedulable system with minimal buffer needs have been proposed. Extensive experiments and a real-life example show the efficiency of our approaches.

A General Framework for Analysing System Properties in Platform-Based Embedded System Designs [p. 190]
S. Chakraborty, S. Kuenzli, and L. Thiele

We present a framework (Real-Time Calculus) for analysing various system properties pertaining to timing analysis, loads on various components and on-chip buffer memory requirements of heterogeneous platform-based architectures, in a single coherent way. Many previous analysis techniques from the real-time systems domain, which are based on standard event models, turn out to be special cases of our framework. We illustrate this using various realistic examples.
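
A minimal numerical example of this style of analysis, with invented curves: an arrival curve alpha(Δ) bounds the events that can arrive in any window of length Δ, a service curve beta(Δ) bounds what the resource can process, and the maximum vertical and horizontal deviations between them give buffer and delay bounds.

    # Toy Real-Time Calculus computation over sampled curves
    # (all numbers invented for illustration).
    H = 20                                              # analysis horizon
    alpha = [min(2 * d, d + 4) for d in range(H + 1)]   # bursty arrivals
    beta = [max(0, 3 * (d - 2)) for d in range(H + 1)]  # rate 3 after latency 2

    # vertical deviation: worst-case backlog (buffer requirement)
    buffer_bound = max(a - b for a, b in zip(alpha, beta))

    # horizontal deviation: worst-case delay
    delay_bound = max(
        next(t for t in range(d, H + 1) if beta[t] >= alpha[d]) - d
        for d in range(H + 1))

    print("buffer bound:", buffer_bound, "events")      # 4
    print("delay bound :", delay_bound, "time units")   # 2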

Exact High Level WCET Analysis of Synchronous Programs by Symbolic State Space Exploration [p. 196]
G. Logothetis and K. Schneider

In this paper, a novel approach to high-level (i.e. architecture independent) worst case execution time (WCET) analysis is presented that automatically computes exact bounds for all inputs. To this end, we make use of the distinction between micro and macro steps as usually done by synchronous languages. As macro steps must not contain loops, a later low-level WCET analysis (architecture dependent) is simplified to a large extent. Checking exact execution times for all inputs is a complex task that can nevertheless be efficiently done when implicit state space representations are used. With our tools, it is not only possible to compute path information by exploring all computations, but also to verify given path information.

Rapid Prototyping of Flexible Embedded Systems on Multi-DSP Architectures [p. 204]
B. Rinner, M. Schmid, and R. Weiss

The functionality of a typical embedded system is specified once at design time and cannot be altered later during the whole mission period. There are, however, a number of important application domains that ask for both flexibility and availability. In such a flexible embedded system the functionality can be modified while the application is running. This paper presents a rapid prototyping environment for flexible embedded systems on multi-DSP architectures. This prototyping environment automatically maps and schedules an application onto a multi-DSP architecture and introduces a special, lightweight reconfiguration environment onto the target platform. A running multi-DSP application can, therefore, be modified by reconfiguring software tasks. By using our prototyping environment the modified application can be tested, simulated and emulated prior to the implementation on the target.
Keywords: embedded system; multi-DSP architectures; task reconfiguration; testability; rapid prototyping


3C: Recent Advances in DFT and BIST

Moderators: A. Benso, Politecnico di Torino, IT; I.G. Harris, Massachusetts U, Amherst, US

DFT for Testing High-Performance Pipelined Circuits with Slow-Speed Testers [p. 212]
M. Nummer and M. Sachdev

This paper presents a methodology for testing high-performance pipelined circuits with slow-speed testers. The technique uses a clock timing circuit to control data transfer in the pipeline in test mode. A clock timing circuit capable of achieving a timing resolution of 50 ps in 0.18 µm CMOS technology is presented. The design provides the ability to test the clock timing circuit itself.
Keywords: Delay-fault testing, high-performance testing, design for testability, design for delay testability.

Extending JTAG for Testing Signal Integrity in SoCs [p. 218]
N. Ahmed, M. Tehranipour, and M. Nourani

As technology shrinks and working frequencies move into the multi-gigahertz range, issues related to interconnect testing are becoming more dominant. Specifically, signal integrity loss is becoming more important, and the detection and diagnosis of such losses is a great challenge. In this paper, an enhanced boundary scan architecture with a slight modification of the boundary scan cells is proposed to test signal integrity in SoC interconnects. Our extended JTAG architecture: 1) minimizes the scan-in operation by using modified boundary scan cells for pattern generation; and 2) incorporates the integrity loss information within the modified observation cells. To fully comply with the JTAG standard, we propose two new instructions, one for pattern generation and the other for scanning out the captured signal integrity information.

EBIST: A Novel Test Generator with Built-In Fault Detection Capability [p. 224]
D. Pradhan, C. Liu, and K. Chakraborty

A novel design methodology for test pattern generation in BIST is presented, in which faults and errors in the generator itself are detected. Two different design methodologies are presented: the first guarantees detection of all single faults/errors, while the second is capable of detecting multiple faults and errors. Furthermore, the proposed LFSRs incur no additional hardware overhead. Importantly, the test patterns generated have the potential to achieve superior fault coverage.

A Partition-Based Approach for Identifying Failing Scan Cells in Scan-BIST with Applications to System-On-Chip Fault Diagnosis [p. 230]
C. Liu and K. Chakrabarty

We present a new partition-based fault diagnosis technique for identifying failing scan cells in a scan-BIST environment. This approach relies on a two-step scan chain partitioning scheme. In the first step, an interval-based partitioning scheme is used to generate a small number of partitions, where each element of a partition consists of a set of scan cells. In the second step, additional partitions are created using an earlier-proposed random-selection partitioning method. Two-step partitioning leads to higher diagnostic resolution than a scheme that relies only on random-selection partitioning, with only a small amount of additional hardware. The proposed scheme is especially suitable for a system-on-chip (SOC) composed of multiple embedded cores, where test access is provided by means of a TestRail that is threaded through the internal scan chains of the embedded cores. We present experimental results for the six largest ISCAS-89 benchmark circuits and for two SOCs crafted from some of the ISCAS-89 circuits.


3F: Analogue and RF Modelling, Simulation and Optimisation

Moderators: F.V. Fernandez, IMSE CNM, ES; R. Schwencker, Infineon, DE

Time-Varying, Frequency-Domain Modeling and Analysis of Phase-Locked Loops with Sampling Phase-Frequency Detectors [p. 238]
P. Vanassche, G. Gielen, and W. Sansen

This paper presents a new, frequency-domain based method for modeling and analysis of phase-locked loop (PLL) small-signal behavior, including time-varying aspects. Focus is given to PLLs with sampling phase-frequency detectors (PFDs) which compute the phase error only once per period of the reference signal. Using the harmonic transfer matrix (HTM) formalism, the well-known continuous-time, linear time-invariant (LTI) approximations are extended to take the impact of time-varying behavior, arising from the sampling nature of the PFD, into account. Especially for PLLs with a fast feedback loop, this time-varying behavior has severe impact on, for example, loop stability and cannot be neglected. Contrary to LTI analysis, our method is able to predict and quantify these difficulties. The method is verified for a typical loop design.
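
For reference, the LTI starting point that the paper extends is the standard continuous-time small-signal phase model (generic charge-pump PLL notation, not taken from the paper):

    G(s) = \frac{K_{pd}\, Z(s)\, K_{vco}}{s}, \qquad
    H(s) = \frac{\phi_{out}(s)}{\phi_{ref}(s)} = \frac{G(s)}{1 + G(s)/N}

where K_{pd} is the detector gain, Z(s) the loop filter impedance, K_{vco} the oscillator gain and N the divider ratio. The HTM analysis replaces this single transfer function with a matrix of transfer functions between frequency-shifted signal components, capturing the once-per-reference-period sampling of the PFD that the LTI model ignores.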

A New Simulation Technique for Periodic Small-Signal Analysis [p. 244]
M. Gourary, S. Rusakov, S. Ulyanov, M. Zharov, and B. Mulvaney

A new numerical technique for periodic small signal analysis based on harmonic balance method is proposed. Special-purpose numerical procedures based on Krylov subspace methods are developed that reduce the computational efforts of solving linear problems under frequency sweeping. Examples are given to show the efficiency of the new algorithm for computing small signal characteristics for typical RF circuits.

Generalized Posynomial Performance Modeling [p. 250]
T. Eeckelaert, W. Daems, G. Gielen, and W. Sansen

This paper presents a new method to automatically generate posynomial symbolic expressions for the performance characteristics of analog integrated circuits. Both the coefficient set and the exponent set of the posynomial expression are determined based on SPICE simulation data with device-level accuracy. We prove that this problem corresponds to solving a non-convex optimization problem without local minima. The presented method is capable of generating posynomial performance expressions for both linear and nonlinear circuits and circuit characteristics. This approach makes it possible to automatically generate an accurate sizing model that constitutes a geometric program fully describing the analog circuit sizing problem. The automatic generation avoids the time-consuming nature of hand-crafted analytic model generation. Experimental results illustrate the capabilities and effectiveness of the presented modeling technique.
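
For readers unfamiliar with the form, a posynomial in positive variables x_1, ..., x_n is

    f(\mathbf{x}) = \sum_{k=1}^{K} c_k \prod_{i=1}^{n} x_i^{a_{ik}},
    \qquad c_k > 0, \; a_{ik} \in \mathbb{R},

and the substitution x_i = e^{y_i} makes \log f convex in y (a log-sum-exp of affine functions). This is why performance constraints of the form f(x) ≤ spec assemble into a geometric program that can be solved to global optimality.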

Holmes: Capturing the Yield-Optimized Design Space Boundaries of Analog and RF Integrated Circuits [p. 256]
B. De Smedt and G. Gielen

A novel methodology for structured yield-aware synthesis is presented. The trade-off between yield and the unspecified performances is explored along the design space boundaries, while respecting specifications on the other performances. Through the unique combination of multi-objective evolutionary optimization techniques, multivariate regression modeling and sensitivity-based yield estimation, the designer is given access to this trade-off, all within transistor-level accuracy. Moreover, a large reduction in required computer resources is obtained compared to alternative approaches.


4A: Architectural Level Synthesis

Moderators: P.Y.K. Cheung, Imperial College London, UK; J. Teich, Erlangen-Nuremberg U, DE

High-Level Allocation to Minimize Internal Hardware Wastage [p. 264]
M. Molina, J. Mendías, and R. Hermida

Conventional synthesis algorithms perform the allocation of heterogeneous specifications, those formed by operations of different types and widths, by binding operations to functional units of the same type and width. Thus, some hardware waste appears in most of the implementations obtained. This paper proposes an allocation algorithm able to minimize this hardware waste by fragmenting operations into their common operative kernel, which may then be executed over the same functional units. Hence, fragmented operations are executed over sets of several linked hardware resources. The implementations proposed by our algorithm require considerably less area than the ones proposed by conventional allocation algorithms. Furthermore, due to operation fragmentation, the type, number, and width of the hardware resources in the resulting datapaths are independent of the type, number, and width of the specification operations and variables.

Dynamic Conditional Branch Balancing during the High-Level Synthesis of Control-Intensive Designs [p. 270]
S. Gupta, N. Dutt, R. Gupta, and A. Nicolau

We present two novel strategies to increase the scope for application of speculative code motions: (1) Adding scheduling steps dynamically during scheduling to conditional branches with fewer scheduling steps. This increases the opportunities to apply code motions such as conditional speculation that duplicate operations into the branches of a conditional block. (2) Determining if an operation can be conditionally speculated into multiple basic blocks either by using existing idle resources or by creating new scheduling steps. These strategies lead to balancing of the number of steps in the conditional branches without increasing the longest path through the conditional block. Algorithms for these strategies have been implemented within the Spark high-level synthesis framework that accepts a behavioral description in ANSI-C as input and produces synthesizable register-transfer level VHDL. Experiments on two moderately complex industrial-strength applications, namely, MPEG-1 and the GIMP image processing tool, demonstrate that conditional speculation is ineffective without using these strategies.

Distributed Synchronous Control Units for Dataflow Graphs under Allocation of Telescopic Arithmetic Units [p. 276]
E. Kim, H. Saito, H. Nakamura, T. Nanya, J. Lee, and D. Lee

In order to obtain system-level performance benefits from variable-computation-time arithmetic units, we propose a new synchronous control unit design methodology for dataflow graphs under allocation of telescopic arithmetic units, a well-known class of synchronous variable-computation-time arithmetic units. The proposed method generates an independent synchronous controller for each component arithmetic unit and builds a distributed synchronous control unit by integrating the derived controllers. The distributed structure of the final synchronous control unit maximizes the performance benefit of telescopic arithmetic units through complete preservation of the original concurrency among operations.

Automated Bus Generation for Multiprocessor SoC Design [p. 282]
K. Ryu and V. Mooney III

The performance of a system, especially a multiprocessor system, heavily depends upon the efficiency of its bus architecture. This paper presents a methodology to generate a custom bus system for a multiprocessor System-on-a-Chip (SoC). Our bus synthesis tool (BusSyn) uses this methodology to generate five different bus systems as examples: Bi-FIFO Bus Architecture (BFBA), Global Bus Architecture Version I (GBAVI), Global Bus Architecture Version III (GBAVIII), Hybrid bus architecture (Hybrid) and Split Bus Architecture (SplitBA). We verify and evaluate the performance of each bus system in the context of two applications: an Orthogonal Frequency Division Multiplexing (OFDM) wireless transmitter and an MPEG2 decoder. This methodology gives the designer a great benefit in fast design space exploration of bus architectures across a variety of performance impacting factors such as bus types, processor types and software programming style. In this paper, we show that BusSyn can generate buses that achieve superior performance when compared to a simple General Global Bus Architecture (GGBA) (e.g., 16.44% performance improvement in the case of OFDM transmitter) or when compared to the CoreConnect Bus Architecture (CCBA) (e.g., 15.54% performance improvement in the case of MPEG2 decoder). In addition, the bus architecture generated by BusSyn is designed in a matter of seconds instead of weeks for the hand design of a custom bus system.


4B: Scheduling in Reconfigurable Computing

Moderators: A. Koch, TU Braunschweig, DE; G. Koch, Bridges2Silicon, DE

Online Scheduling for Block-Partitioned Reconfigurable Devices [p. 290]
H. Walder and M. Platzner

This paper presents our work toward an operating system that manages the resources of a reconfigurable device in a multitasking manner. We propose an online scheduling system that allocates tasks to a block-partitioned reconfigurable device. The blocks are statically fixed but can have different widths, which makes it possible to match the computational resources to the task requirements. We implement several non-preemptive and preemptive schedulers as well as different placement strategies. Finally, we present a simulation environment that allows the effects of specific partitioning, placement, and scheduling methods to be investigated experimentally.

Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling [p. 296]
B. Mei, H. De Man, R. Lauwereins, S. Vernalde, and D. Verkest

Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compilation tools are essential to their success. In this paper, we present a modulo scheduling algorithm to exploit loop-level parallelism for coarse-grained reconfigurable architectures. This algorithm is a key part of our Dynamically Reconfigurable Embedded Systems Compiler (DRESC). It is capable of solving placement, scheduling and routing of operations simultaneously in a modulo-constrained 3D space and uses an abstract architecture representation to model a wide class of coarse-grained architectures. The experimental results show high performance and efficient resource utilization on tested kernels.
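
A bare-bones flavor of modulo scheduling (not DRESC itself, which also solves placement and routing simultaneously): pick an initiation interval II, then place each operation so that no functional unit is oversubscribed in any cycle slot modulo II. The dependence graph, latencies and single-FU-type resource model below are invented.

    # Greedy modulo scheduling sketch: ops must be in topological
    # order; deps is a list of (producer, consumer, latency) edges.
    import math

    def modulo_schedule(ops, deps, n_fus):
        ii = math.ceil(len(ops) / n_fus)        # resource-constrained minimum II
        preds = {o: [(p, l) for p, c, l in deps if c == o] for o in ops}
        slot_use = {s: 0 for s in range(ii)}
        sched = {}
        for o in ops:
            t = max((sched[p] + l for p, l in preds[o]), default=0)
            while slot_use[t % ii] >= n_fus:    # modulo resource conflict
                t += 1
            sched[o] = t
            slot_use[t % ii] += 1
        return ii, sched

    ops = ["load", "mul", "add", "store"]
    deps = [("load", "mul", 2), ("mul", "add", 1), ("add", "store", 1)]
    print(modulo_schedule(ops, deps, n_fus=2))
    # -> (2, {'load': 0, 'mul': 2, 'add': 3, 'store': 5}):
    #    a new loop iteration starts every II = 2 cycles.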

Virtual Hardware Byte Code as a Design Platform for Reconfigurable Embedded Systems [p. 302]
S. Lange and U. Kebschull

Reconfigurable hardware will be used in many future embedded applications. Since most of these embedded systems will be temporarily or permanently connected to a network, the possibility arises of reloading parts of the application at run time. In the 1990s it was recognized that the huge variety of processors would lead to a tremendous number of binaries for the same piece of software. For the hardware parts of an embedded system, the situation today is even worse. The Java approach, based on the Java virtual machine (JVM), was invented to solve this problem for software. In this paper, we show how the hardware parts of an embedded system can be implemented in a hardware byte code, which can be interpreted by a virtual hardware machine running on an arbitrary FPGA. Our results show that this approach is feasible and that it leads to fast, portable and reconfigurable designs, which run on any programmable target architecture.


4C: Delay Testing and Diagnosis

Moderators: H. Obermeir, Infineon, DE; N. Nicolici, McMaster U, CA

A Method of Test Generation for Path Delay Faults Using Stuck-At Fault Test Generation Algorithms [p. 310]
S. Ohtake, K. Ohtani, and H. Fujiwara

In this paper, we propose a test generation method for non-robust path delay faults using stuck-at fault test generation algorithms. In our method, we first transform an original combinational circuit into a circuit called a partial leaf-dag using path-leaf transformation. Then we generate test patterns using a stuck-at fault test generation algorithm for stuck-at faults in the partial leaf-dag. Finally we transform the test patterns into two-pattern tests for path delay faults in the original circuit. We prove the correctness of the approach, and experimental results on several benchmark circuits show its effectiveness.

A Novel, Low-Cost Algorithm for Sequentially Untestable Fault Identification [p. 316]
M. Syal and M. Hsiao

This paper presents a new, low-cost approach for identifying sequentially untestable faults. Unlike the single fault theorem, where the stuck-at fault is injected only in the right-most time frame of the k-frame unrolled circuit, our approach can handle fault injection in any time frame within the unrolled sequential circuit. To efficiently apply our concept to untestable fault identification, powerful sequential implications are used to extend the unobservability propagation of gates across multiple time frames. Application of the proposed theorem to ISCAS '89 sequential benchmark circuits showed that more untestable faults could be identified using our approach, with practically no overhead in either memory or execution time.

Non-Enumerative Path Delay Fault Diagnosis [p. 322]
S. Padmanaban and S. Tragoudas

The first non-enumerative framework for diagnosing path delay faults using zero-suppressed binary decision diagrams is introduced. We show that fault-free path delay faults with a validated non-robust test may, together with fault-free robustly tested faults, be used to eliminate faults from the set of suspected faults. All operations are implemented by an implicit diagnosis tool based on the zero-suppressed binary decision diagram. The proposed method is non-enumerative in both space and time, as opposed to existing methods, which are enumerative in both. Experimental results on the ISCAS'85 benchmarks show that the proposed technique is on average at least three times more efficient at improving the diagnostic resolution than existing techniques.

Delay Defect Diagnosis Based Upon Statistical Timing Models -- The First Step [p. 328]
A. Krstic, L. Wang, K. Cheng, J. Liou, and M. Abadir

This paper defines a new diagnosis problem for diagnosing delay defects based upon statistical timing models. We illustrate the differences between the delay defect diagnosis and traditional logic defect diagnosis. We propose different diagnosis algorithms, and evaluate their performance via statistical defect injection and statistical delay fault simulation. With a statistical timing analysis framework developed in the past, we demonstrate the new concepts in delay defect diagnosis, and discuss experimental results based upon benchmark circuits.


4E: Embedded Tutorial: Embedded Operating Systems for SoC (Embedded Software Forum)

Organizer: A. Jerraya, TIMA Laboratory, FR
Moderator: J. Madsen, TU Denmark, DK

Introduction to Hardware Abstraction Layers for SoC [p. 336]
S. Yoo and A. Jerraya

In this paper, the hardware abstraction layer (HAL) is explained in the context of SoC design. First, a HAL definition is given and the differences between the HAL and other similar concepts are described. Existing HALs are examined, and the role of the HAL in SoC design is explained. Finally, a proposal for a standard HAL is presented.

Hardware/Software Partitioning of Operating Systems [p. 338]
V. Mooney III

Traditionally, an Operating System (OS) implements in software basic system functions such as task/process management and I/O. Furthermore, a Real-Time Operating System (RTOS) has also been implemented in software to manage tasks in a predictable, real-time manner. However, with System-on-a-Chip (SoC) architectures becoming more and more common, OS and RTOS functionality need not be implemented solely in software. Thus, partitioning the interface between hardware and software for an OS is a new idea that can have a significant impact.

Embedded Software in Digital AM-FM Chipset [p. 340]
M. Sarlotte, B. Candaele, J. Quevremont, and D. Merel

The new DRM standard for digital radio broadcast in the AM band requires integrated devices for radio receivers at low cost and very low power consumption. A chipset based on an ARM9 multi-core architecture is currently being designed. This paper introduces the application itself, the HW architecture of the SoC, and the SW architecture, which includes the physical layer, receiver management, the application layer and a global scheduler based on a real-time OS. The paper then presents the HW/SW partitioning and the SW breakdown between the various processing cores. The methodology used in the project to develop, validate and integrate the SW, covering methods such as simulation, emulation and co-validation, is described. Key points and critical issues are also addressed. One of the challenges is to integrate the whole receiver on a single chip while respecting the real-time constraints linked to the audio services.


4F: Networks-on-Chip

Moderators: P. Ienne, EPF Lausanne, CH; E. Verhulst, Eonic Solutions GmbH, DE

Packetized On-Chip Interconnect Communication Analysis for MPSoC [p. 344]
T. Ye, G. De Micheli, and L. Benini

Interconnect networks play a critical role in shared-memory multiprocessor system-on-chip (MPSoC) designs. MPSoC performance and power consumption are greatly affected by the packet dataflows transported on the network. In this paper, by introducing a packetized on-chip communication power model, we discuss the impact of packetization on MPSoC performance and power consumption. In particular, we propose a quantitative analysis method to evaluate the relationship between different design options (cache, memory, packetization scheme, etc.) at the architectural level. From the benchmark experiments, we show that an optimal performance and power tradeoff can be achieved by selecting appropriate packet sizes.
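
A back-of-envelope version of this tradeoff (all constants invented): header energy is paid per packet, which favors large payloads, while buffer cost grows with payload size, which favors small ones, leaving an interior optimum.

    # Toy packetization energy model; every constant is invented.
    DATA = 4096      # bits to transfer
    E_HDR = 20.0     # energy per packet header (pJ)
    E_BIT = 0.1      # link energy per payload bit (pJ)
    E_BUF = 0.02     # buffer energy coefficient (pJ per squared payload bit)

    def energy(payload):
        packets = -(-DATA // payload)                  # ceil(DATA / payload)
        return packets * (E_HDR + E_BUF * payload ** 2) + DATA * E_BIT

    best = min(range(8, 257, 8), key=energy)
    print(f"best payload: {best} bits, energy: {energy(best):.0f} pJ")  # 32 bits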

Trade Offs in the Design of a Router with both Guaranteed and Best-Effort Services for Networks On Chip [p. 350]
E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander

Managing the complexity of designing chips containing billions of transistors requires decoupling computation from communication. For the communication, scalable and compositional interconnects, such as networks on chip (NoC), must be used. In this paper we present the trade-offs in the design of a router that combines guaranteed and best-effort services, and a prototype implementation with an aggregate bandwidth of 80 Gbit/s that occupies 0.26 mm² in CMOS12. This shows that our router provides high performance at reasonable cost, bringing NoCs one step closer.

Communication Centric Architectures for Turbo-Decoding on Embedded Multiprocessors [p. 356]
F. Gilbert, M. Thul, and N. Wehn

Software implementations of channel decoding algorithms are attractive for communication systems, with their large variety of existing and emerging standards, due to their flexibility and extensibility. For high throughput, however, a single processor cannot provide the necessary compute power. Using several processors in parallel without exploiting the internal parallelism of the algorithm leads to intolerable overhead in area, power consumption, and latency. We propose a multiprocessor-based Turbo-decoder implementation where inherently parallel decoding tasks are mapped onto individual processing nodes. The implied, challenging inter-processor communication is efficiently handled by our framework such that throughput is not degraded. In this paper we present communication-centric architectures, from buses to heterogeneous networks, that allow numerous processors to be interconnected to perform high-throughput Turbo-decoding.


5A: System Level Modelling

Moderators: C. Delgado Kloos, Carlos III de Madrid U, ES; E. Villar, Cantabria, U, ES

Development and Application of Design Transformations in ForSyDe [p. 364]
I. Sander, A. Jantsch, and Z. Lu

The ForSyDe methodology has been developed for system-level design. Starting with a formal specification model that captures the functionality of the system at a high abstraction level, it provides formal design transformation methods for a transparent refinement process of the system model into an implementation model that is optimized for synthesis. The main contribution of this paper is the formal treatment of transformational design refinement. Using the formal semantics of ForSyDe processes, we introduce the term characteristic function in order to define and classify transformations as either semantics-preserving or design decisions. We also illustrate how we can incorporate classical synthesis techniques that have traditionally been used with control/data-flow graphs as ForSyDe transformations. Thus, our approach avoids discontinuities, since it moves design refinement into the domain of the specification model.

System Level Specification in Lava [p. 370]
S. Singh

The Lava system provides novel techniques for representing system level specifications which are supported by a design flow that maps Lava descriptions onto System-on-Chip platforms implemented on very large FPGAs. The key contribution of this paper is a type class based approach for specifying bus-based system configurations. This provides a very flexible and parameterised flow for combining predesigned IP blocks into a complete FPGA-based system.

Formal Semantics of Synchronous SystemC [p. 376]
A. Salem

In this article, a denotational definition of the synchronous subset of SystemC is proposed. The subset treated includes modules, processes, threads, the wait statement, ports and signals. We propose a formal model for the SystemC delta delay. We also give a complete semantic definition of the language's two-phase scheduler. The proposed semantics can constitute a basis for validating the equivalence of synchronous HDL subsets.
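
As a rough illustration of the delta-delay semantics treated above, the following Python sketch (ours, not the paper's denotational model) shows the essential point: writes made during the evaluate phase become visible only after the update phase of the two-phase scheduler.

    class Signal:
        def __init__(self, value):
            self.current = value      # value visible to processes
            self.pending = None       # value written during the evaluate phase

        def read(self):
            return self.current

        def write(self, value):
            self.pending = value      # deferred until the update phase

        def update(self):
            changed = self.pending is not None and self.pending != self.current
            if self.pending is not None:
                self.current, self.pending = self.pending, None
            return changed

    def delta_cycle(processes, signals):
        """One delta cycle: evaluate all processes, then commit all signals."""
        for proc in processes:
            proc()                            # evaluate: reads see old values
        updates = [s.update() for s in signals]
        return any(updates)                   # True if another delta is needed

    # Two processes swap signal values; the delta delay makes the swap
    # race-free regardless of the order in which the processes run.
    a, b = Signal(0), Signal(1)
    delta_cycle([lambda: a.write(b.read()), lambda: b.write(a.read())], [a, b])
    assert (a.read(), b.read()) == (1, 0)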

Introspection in System-Level Language Frameworks: Meta-Level vs. Integrated [p. 382]
F. Doucet, R. Gupta, and S. Shukla

Reflection and automated introspection of a design in system-level design frameworks are seen as necessities for CAD tools to manipulate designs. These features are also useful for debuggers, class and object browsers, design analyzers, composition validation, type checking, compatibility checking, etc. However, the central question is whether such features should be integrated into the language, or whether we should build frameworks which provide these capabilities in a meta-layer, leaving the system-level language intact. In our recent interactions with designers, we have found differing opinions. Especially in the context of SystemC, the temptation to integrate reflective APIs into the language is great, because C++ is expressive and already has type-introspective packages available. In this paper, we analyze this issue and show that (i) it is a better EDA system architecture to implement reflection/introspection at a meta-layer in a design framework, and (ii) there are relatively unexplored territories of design automation, such as behavioral typing of component interfaces, the corresponding type theory, and their implications for automating component composition, interface synthesis, and validation, which can be better incorporated if introspection is implemented at a meta-layer.

SystemC-AMS Requirements, Design Objectives and Rationale [p. 388]
A. Vachoux, C. Grimm, and K. Einwich

This paper presents and discusses the foundations on which the analog and mixed-signal extensions of SystemC, named SystemC-AMS, will be developed. First, requirements from targeted application domains are identified. These are then used to derive design objectives and related rationales. Finally, some preliminary seed work is presented and the outline of the analog and mixed-signal extensions development work is given.


5B: Hot Topic: Runtime Reconfigurable Systems on Chip -- An Industry Perspective

Organizer/Moderator: I. Bolsens, Xilinx, US

Parallel Processing Architectures for Reconfigurable Systems [p. 396]
K. Vissers

Novel reconfigurable computing architectures exploit the inherent parallelism available in many signal processing problems. These architectures often consist of networks of compute elements that have an ALU-like structure with corresponding instructions. This opens opportunities for rapid dynamic reconfiguration and instruction multiplexing. The field of computer architectures has significantly contributed to the systematic and quantified exploration of architectures. Novel reconfigurable architecture exploration should learn from this approach. Future System-on-a-Chip platforms will consist of a combination of processor architectures, on-chip memories, and reconfigurable architectures. The real challenge is to design those architectures that can be programmed efficiently. This requires that first a programming environment and benchmarks be created and then that the reconfigurable architectures be systematically explored.

Different Approaches to Add Reconfigurability in a SoC Architecture [p. 398]
B. Gupta and M. Borgatti

Dynamically reprogrammable hardware has been advocated in the academic research community as the next hot area in system design for some time now. The lack of integrated systems in the marketplace that incorporate dynamic reprogramming stands in contrast to the enthusiasm of the research community for the topic. We would like to offer as a middle ground several examples of dynamic reprogramming in working silicon that might help to illuminate the path towards the future of SoCs. In our research at STMicroelectronics, we have built two independent SoCs that utilize embedded FPGAs to provide the dynamic reprogramming capability. The benefit of the embedded FPGA has been demonstrated to range from application acceleration to augmenting functionality and providing silicon area reuse. The first system to be described is intended for image processing and biometric recognition. The second system is aimed at wireless LAN baseband processing.

A Lightweight Approach for Embedded Reconfiguration of FPGAs [p. 399]
B. Blodget, S. McMillan, and P. Lysaght

This paper presents a lightweight approach for embedded reconfiguration of Xilinx Virtex-II series FPGAs. A hardware and software infrastructure is reported that enables an FPGA to dynamically reconfigure itself under the control of a soft microprocessor core that is instantiated on the same array. The system provides a highly integrated, lightweight approach to dynamic reconfiguration for embedded systems. It combines the benefits of intelligent control, fast reconfiguration and small overhead.


5C: Hot Topic: Creating Value Through Test

Organizer/Moderator: E.J. Marinissen, Philips, NL
Speakers:
B. Vermeulen, Philips, NL
R. Madge, LSI Logic, US
M. Kessler and M. Müller, IBM Boeblingen, DE

Panel Statement [p. 402]

Test is often seen as a necessary evil; it is a fact of life that ICs have manufacturing defects, and those need to be filtered out by testing before the ICs are shipped to the customer. In this paper, we show that techniques and tools used in the testing field can also be (re-)used to create value for (1) designers, (2) manufacturers, and (3) customers alike. First, we show how the test infrastructure can be used to detect, diagnose, and correct design errors in prototype silicon. Secondly, we discuss how test results are used to improve the manufacturing process and hence production yield. Finally, we present test technologies that enable systems of high reliability for safety-critical applications.


5E: Software Optimisation for Embedded Systems (Embedded Software Forum)

Moderators: R. Leupers, RWTH Aachen, DE; J.C. Lopez, Castilla-La Mancha, ES

Control Flow Driven Splitting of Loop Nests at the Source Code Level [p. 410]
H. Falk and P. Marwedel

This paper presents a novel source code transformation for control flow optimization called loop nest splitting, which minimizes the number of executed if-statements in loop nests of embedded multimedia applications. The goal of the optimization is to reduce runtimes and energy consumption. The analysis techniques are based on precise mathematical models combined with genetic algorithms. Due to the inherent portability of source code transformations, a very detailed benchmarking using 10 different processors can be performed. The application of our implemented algorithms to three real-life multimedia benchmarks leads to average speed-ups of 23.6% -- 62.1% and energy savings of 19.6% -- 57.7%. Furthermore, our optimization also leads to advantageous pipeline and cache performance.
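
As a minimal sketch of the transformation itself (a toy of ours; the paper's analysis handles general conditions in deep loop nests), loop nest splitting replaces a per-iteration if-statement by a split of the iteration space:

    def original(x, threshold):
        out = 0
        for i in range(len(x)):
            if i < threshold:              # if-statement executed every iteration
                out += 2 * x[i]
            else:
                out += x[i]
        return out

    def after_splitting(x, threshold):
        out = 0
        k = min(max(threshold, 0), len(x))   # split point, computed once
        for i in range(k):                   # condition known true here
            out += 2 * x[i]
        for i in range(k, len(x)):           # condition known false here
            out += x[i]
        return out

    data = list(range(100))
    assert original(data, 40) == after_splitting(data, 40)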

Data Space Oriented Scheduling in Embedded Systems [p. 416]
M. Kandemir, G. Chen, W. Zhang, and I. Kolcu

With the widespread use of embedded devices such as PDAs, printers, game machines, and cellular telephones, achieving high performance demands an optimized operating system (OS) that can take full advantage of the underlying hardware components. This paper presents a locality-conscious process scheduling strategy for embedded environments. The objective of our scheduling strategy is to maximize reuse in the data cache. It achieves this by restructuring the process codes based on data sharing patterns between processes.

Compiler-Directed ILP Extraction for Clustered VLIW/EPIC Machines: Predication, Speculation and Modulo Scheduling [p. 422]
S. Pillai and M. Jacome

Compiler-directed ILP extraction techniques are critical to effectively exploiting the significant processing capacity of contemporaneous VLIW/EPIC machines. In this paper we propose a novel algorithm for ILP extraction targeting clustered EPIC machines that integrates three powerful techniques: predication, speculation and modulo scheduling. In addition, our framework schedules and binds operations, generating actual VLIW code. To the best of our knowledge, there is no other algorithm in the literature on predicated code optimizations that jointly considers speculation and modulo scheduling in the context of clustered EPIC machines. Our experimental results show that by jointly considering different extraction techniques in a resource aware context, the proposed algorithm can take maximum advantage of the resources available on the clustered machine, aggressively improving performance.

An Efficient Hash Table Based Approach to Avoid State Space Explosion in History Driven Quasi-Static Scheduling [p. 428]
A. Lomeña, M. López-Vallejo, Y. Watanabe, and A. Kondratyev

This paper presents an efficient hash-table-based method to optimally overcome a new variant of the state space explosion which appears during the quasi-static task scheduling of embedded, reactive systems. Our application domain is targeted at one-processor software synthesis, and the scheduling process is based on Petri net reachability analysis to ensure cyclic, bounded and deadlock-free programs. To achieve greater flexibility, we employ a dynamic, history-based criterion to prune the search space. This makes our synthesis approach different from most existing code generation techniques. Our experimental results reveal a significant reduction in algorithmic complexity (both in memory storage and CPU time) obtained for medium and large size problems.
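
The role of the hash table can be pictured with a generic reachability skeleton in Python (ours; the paper additionally applies its history-driven pruning criterion on top of this):

    from collections import deque

    def reachable(initial, successors):
        """Explore a state space; hashed states cut off revisits."""
        seen = {initial}                   # hash table of visited markings
        queue = deque([initial])
        while queue:
            state = queue.popleft()
            for nxt in successors(state):
                if nxt not in seen:        # prune already-explored states
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    # Toy bounded net: three tokens shuttle between two places.
    def successors(marking):
        p, q = marking
        moves = []
        if p > 0: moves.append((p - 1, q + 1))
        if q > 0: moves.append((p + 1, q - 1))
        return moves

    print(sorted(reachable((3, 0), successors)))   # only 4 reachable markings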


5F: Global Approaches to Layout Synthesis

Moderators: I. Markov, Michigan U, US; J. Lienig, TU Dresden, DE

Time Budgeting in a Wireplanning Context [p. 436]
J. Westra, R. Otten, D. Jongeneel, and C. Visweswariah

Wireplanning is an approach in which the timing of input-output paths is planned before modules are specified, synthesized or sized. If these global wires are optimally segmented and buffered, their delay is linear in the path length and independent of the position of the modules along these paths. From timing requirements, the total budget left to modules after allocating the appropriate delay to the wires can be determined. This paper describes how this budget can be optimally divided amongst the modules. A novel, static timing-like, mathematical programming formulation is introduced such that the total module area is minimized. Instead of only the worst delay, all pin-to-pin delays are implicitly taken into account. If area-delay tradeoffs are convex, a reasonable approximation in practice, the program can be solved efficiently. Further, efficiency of different formulations is discussed, and a low-cost method of making the budget relatively immune to downstream uncertainties and surprises is presented. The efficiency of the formulation is clear from benchmarks with over 2000 nodes and 5e19 paths.

Interconnect Planning with Local Area Constrained Retiming [p. 442]
R. Lu and C. Koh

We present a framework that considers global routing, repeater insertion, and flip-flop relocation for early interconnect planning. We formulate the interconnect retiming and flip-flop placement problem as a local area constrained retiming problem and solve it as a series of weighted minimum area retiming problems. Our method for early interconnect planning can reduce and even avoid design iterations between physical planning and high level designs. Experimental results show that our method can reduce the number of area violations by an average of 84% in a single interconnect planning step.

A Novel Metric for Interconnect Architecture Performance [p. 448]
P. Dasgupta, A. Kahng, and S. Muddu

We propose a new metric for evaluation of interconnect architectures. This metric is computed by optimal assignment of wires from a given wire length distribution (WLD) to a given interconnect architecture (IA). This new metric, the rank of an IA, is a single number that gives the number of connections in the WLD that meet a specific target delay when embedded in the IA. A dynamic programming algorithm is presented to exactly compute the rank of an IA with respect to a given WLD within practical runtimes. We use our new IA metric to quantitatively compare impacts of geometric parameters as well as process and material technology advances. For example, we observe that 42% reduction in Miller coupling factor achieves the same rank improvement as a 38% reduction in inter-layer dielectric permittivity for a 1M gate design in the 130nm technology.


6A: Platform Design and IP Reuse Methods

Moderators: R. Seepold, Carlos III de Madrid U, ES; T. Riesgo, UP Madrid, ES

Specification of Non-Functional Intellectual Property Components [p. 456]
J. Zhu and W. Mong

In its most general sense, intellectual property components (IPs) refer to any design artifacts that are reusable. While the specification of functional IPs, such as behavioral and RTL specifications, has been widely investigated, the specification of others, such as timing, constraints, layouts and architectures, is largely ad hoc. This leads to different standard or proprietary file/database formats with interoperability problems, which eventually hinder the distribution and integration of IPs. In this paper, we address the difficult problem of integrating semantically diverse non-functional IPs by the use of a new, extensible language called Babel. Despite its simple one-page grammar, Babel is the front-end for a powerful IP-based design infrastructure. We demonstrate the effectiveness of our approach by two case studies, one for the creation of parameterized memory IPs and one for the creation of processor IPs.

Profile-Driven Selective Code Compression [p. 462]
Y. Xie, W. Wolf, and H. Lekatsas

In embedded system design, memory is one of the most restricted resources. Code compression has been proposed as a solution to reduce the code size of applications for embedded systems: data compression techniques are used to compress programs to reduce memory size. Most previous work compresses all instructions found in an executable, without taking the program execution profile into account. In this paper, a profile-driven code compression design methodology is proposed. Program profiling information can be used to help code compression selectively compress non-critical instructions, such that the system performance degradation due to the decompression penalty is reduced.
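
The selective policy can be sketched in a few lines of Python (threshold and encoding are our illustrative choices; in a real system decompression happens in hardware on the fetch path):

    import zlib

    def selective_compress(instructions, exec_counts, hot_threshold):
        """Keep hot instructions raw; compress the rarely executed rest."""
        image = []
        for insn, count in zip(instructions, exec_counts):
            if count >= hot_threshold:
                image.append(("raw", insn))                 # no decode penalty
            else:
                image.append(("z", zlib.compress(insn)))    # smaller footprint
        return image

    def fetch(image, index):
        kind, payload = image[index]
        return payload if kind == "raw" else zlib.decompress(payload)

    code = [b"add r1,r2,r3", b"ld r4,0(r1)", b"cold error-handler sequence"]
    image = selective_compress(code, exec_counts=[9000, 8500, 3], hot_threshold=100)
    assert fetch(image, 2) == code[2]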

Design and Analysis of a Programmable Single-Chip Architecture for DVB-T Base-Band Receiver [p. 468]
C. Pan, N. Bagherzadeh, A. Kamalizad, and A. Koohi

This work treats the design and analysis of a programmable (or reconfigurable) DSP-domain-specific architecture called MorphoSys, upon which the world's first single-chip software solution for a DVB-T base-band receiver can be implemented. Based on the first version of MorphoSys, many modifications have been made that greatly improve both computation power and data movement efficiency. Sequential code and SIMD code can be parallelized; temporal granularity adjustment boosts performance by up to 4 times; numerous different types of data movement can be accelerated 8 to 64 times over sequential movement. As a complicated (21 GOPS) and typical communication system, the DVB-T base-band receiver is designed with low performance loss and mapped onto the MorphoSys architecture (>28 GOPS). This contributes solidly to the development of software-defined radio.


6B: Panel: Reconfigurable Computing -- Different Perspectives

Organizer: W. Rosenstiel, Tuebingen U/FZI, DE
Moderator: R. Lauwereins, IMEC, BE
Panellists:
I. Bolsens, Xilinx, US
C. Rowen, Tensilica, US
Y. Tanurhan, Actel, US
K. Vissers, Chameleon Systems, US
S. Wang, Axis Systems, US

Panel Statement [p. 476]

With vanishing time-to-market windows and rapidly changing design specifications, flexibility becomes the number one constraint in the development of the SoCs of the future. Adding efficient power and resource management to the requirements list has made the cry for reconfigurable computing even louder. In this panel we will listen to different industry and research representatives debating their approaches to reconfigurable computing.


6C: Analogue and Defect-Oriented Testing

Moderators: R. Galivanche, Intel, US; A. Richardson, Lancaster U, UK

RF-BIST: Loopback Spectral Signature Analysis [p. 478]
D. Lupea, U. Pursche, and H. Jentschel

Built-In Self-Test (BIST) is becoming important also for more complex structures such as complete front-ends. In order to bring down the costs of the test overhead, Spectral Signature Analysis at the system level seems to be a promising concept. The investigations that have been carried out target the most challenging problems: generation of the test signature, evaluation of the signature response, implementation of the concept, and verification by simulation. From these investigations it can be concluded that the concept is especially suitable for transceiver-type DUTs.

Optimizing Stresses for Testing DRAM Cell Defects Using Electrical Simulation [p. 484]
Z. Al-Ars, A. van de Goor, J. Braun, and D. Richter

Stresses are considered an integral part of any modern industrial DRAM test. This paper describes a novel method to optimize stresses for memory testing, using defect injection and electrical simulation. The new method shows how each stress should be applied to achieve a higher fault coverage of a given test, based on an understanding of the internal behavior of the memory. In addition, results of a fault analysis study, performed to verify the new optimization method, show its effectiveness.
Key words: stresses, memory testing, test optimization, defect simulation.

On Modeling Cross-Talk Faults [p. 490]
S. Zachariah, Y. Chang, S. Kundu, and C. Tirumurti

Circuit marginality failures in high performance VLSI circuits are projected to increase due to shrinking process geometries and high frequency design techniques. Capacitive cross coupling between interconnects is known to be a prime contributor to such failures. In this paper, we present novel techniques to model and prioritize capacitive cross-talk faults. Experimental results are provided to show effectiveness of the proposed modeling technique on industrial circuits.

Techniques for Automatic on Chip Closed Loop Transfer Function Monitoring for Embedded Charge Pump Phase Locked Loops [p. 496]
M. Burbidge, A. Richardson, and J. Tijou

Charge-pump phase-locked loops are used in a variety of applications, including on-chip clock synthesis, symbol timing recovery for serial data streams, and generation of frequency-agile high-frequency carrier signals. In many applications PLLs are embedded in larger digital systems; in consequence, analogue test access is often limited. Test motivation is thus towards methods that can either aid digital-only test of the PLL or facilitate complete self-testing of the PLL. One useful characterisation technique used by PLL designers is closed-loop phase transfer function measurement. This test allows an estimation of the PLL's natural frequency, damping, and 3dB bandwidth to be made from the magnitude and phase response plots. These parameters relate directly to the time-domain response of the PLL and will indicate errors in the PLL circuitry. This paper provides suggestions towards test methods that use a novel maximum frequency detection technique to aid automatic measurement of the closed-loop phase transfer function. In addition, the techniques presented have potential for full BIST applications.
Keywords: PLL, CP-PLL, BIST, TEST, DfT.


6E: Energy Aware Software Techniques (Embedded Software Forum)

Moderators: J. Henkel, NEC, US; P. Marwedel, Dortmund U, DE

Pre-Characterization Free, Efficient Power/Performance Analysis of Embedded and General Purpose Software Applications [p. 504]
V. Rapaka and D. Marculescu

This paper presents a novel, efficient, yet accurate estimation technique for the power consumption and performance of embedded and general purpose applications. Our approach is adaptive in nature and is based on detecting sections of code characterized by high temporal locality (also called hotspots) in the execution profile of the benchmark being executed on a target processor. The technique itself is architecture- and input-independent and can be used for embedded as well as general purpose processors. We have implemented a hybrid simulation engine which can significantly shorten the simulation time by using on-the-fly profiling for critical sections of the code and by reusing this information during power/performance estimation for the rest of the code. By using this strategy, we were able to achieve up to 20X better accuracy compared to a flat, non-adaptive sampling scheme, and a simulation speed-up of up to 11.84X with a maximum error of 1.03% for performance and 1.92% for total energy on a wide variety of media and general purpose applications.

Runtime Code Parallelization for On-Chip Multiprocessors [p. 510]
M. Kandemir, W. Zhang, and M. Karakoy

Chip multiprocessing (or multiprocessor system-on-a-chip) is a technique that combines two or more processor cores on a single piece of silicon to enhance computing performance. An important problem to be addressed in executing applications on an on-chip multiprocessor environment is to select the most suitable number of processors to use for a given objective function (e.g., minimizing execution time or energy-delay product) under multiple constraints. Previous research proposed an ILP-based solution to this problem that is based on exhaustive evaluation of each nest under all possible processor sizes. In this paper, we take a different approach and propose a pure runtime strategy for determining the best number of processors to use at runtime. This approach is more general than static techniques and can be applicable in situations where the latter cannot be.

SDRAM-Energy-Aware Memory Allocation for Dynamic Multi-Media Applications on Multi-Processor Platforms [p. 516]
P. Marchal, F. Catthoor, D. Bruni, L. Benini, J. Gomez, L. Piñuel, and H. Corporaal

Heterogeneous multi-processor platforms are an interesting option for satisfying the computational performance of dynamic multi-media applications at a reasonable energy cost. Today, almost no support exists to energy-efficiently manage the data of a multi-threaded application on these platforms. In this paper we show that the assignment of the data of dynamically created/deleted tasks to the shared memory has a large impact on the energy consumption. We present two dynamic memory allocators which solve the bank assignment problem for shared multi-banked SDRAM memories. Both allocators assign the tasks' data to the available SDRAM banks such that the number of page-misses is reduced. We have measured large energy savings with these allocators compared to existing dynamic memory allocators for several task-sets based on MediaBench [5].
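
A greedy Python sketch conveys the flavor of the bank assignment problem (assumptions ours, not the paper's two allocators): place each block in the bank where it competes with the fewest already-assigned accesses, since co-resident blocks evict each other's open pages.

    def assign_banks(blocks, n_banks):
        """blocks: (name, access_count) pairs; returns per-bank name lists."""
        banks = [[] for _ in range(n_banks)]
        load = [0] * n_banks                     # accesses already in each bank
        for name, accesses in sorted(blocks, key=lambda b: -b[1]):
            best = min(range(n_banks), key=load.__getitem__)
            banks[best].append(name)             # least-loaded bank suggests
            load[best] += accesses               # the fewest page misses
        return banks

    tasks = [("video_buf", 800), ("audio_buf", 300), ("osd", 120), ("rx", 90)]
    print(assign_banks(tasks, n_banks=2))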


6F: Interconnect Modelling and Signal Integrity

Moderators: L.M. Silveira, INESC ID/IST, PT; J.R. Phillips, Cadence, US

Modeling and Evaluation of Substrate Noise Induced by Interconnects [p. 524]
F. Martorell, D. Mateo, and X. Aragonès

Interconnects have received attention as a source of crosstalk to other interconnects, but have been ignored as a source of substrate noise. In this paper, we evaluate the importance of interconnect-induced substrate noise. A known interconnect and substrate model is validated by comparing simulation results to experimental measurements. Based on the validated modeling approach, a complete study considering frequency, geometrical, load and shielding effects is presented. The importance of interconnect-induced substrate noise is demonstrated after observing that, for typically sized interconnects and state-of-the-art speeds, the amount of coupled noise is already comparable to that injected by hundreds of transistors.

Model-Order Reduction Based on PRONY's Method [p. 530]
M. Mansour and A. Mehrotra

A new model-order reduction technique for linear dynamic systems is presented. The idea behind this technique is to transform the dynamic system function from the s-domain into the z-domain via the bilinear transformation, then use Prony's [1] least-squares approximation method instead of the commonly employed Padé approximation method, and finally transform the reduced system back into the s-domain using the inverse bilinear transformation.

[...] strategy allows a very accurate and efficient full-wave solution of interconnection structures with possibly complex geometry, including the nonlinear and dynamic effects of real-world digital devices, without the need for detailed transistor-level models. Examples of signal integrity and field coupling analysis are shown.
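
Referring to the Prony-based model-order reduction described above, the least-squares step can be sketched numerically in Python (a toy of ours; the bilinear transformation into and out of the z-domain is omitted for brevity):

    import numpy as np

    def prony_denominator(h, order):
        """Fit AR coefficients a with h[n] = -(a1*h[n-1] + ... + ak*h[n-k])."""
        A = np.array([h[n - 1 :: -1][:order] for n in range(order, len(h))])
        b = -h[order:]
        a, *_ = np.linalg.lstsq(A, b, rcond=None)
        return np.concatenate(([1.0], a))     # denominator 1 + a1*z^-1 + ...

    # Impulse response samples of a known 2-pole system (poles 0.9 and 0.5).
    n = np.arange(60)
    h = 0.9 ** n + 0.5 ** n
    a = prony_denominator(h, order=2)
    print(np.roots(a))                        # recovers poles near 0.9 and 0.5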

Enhancing Signal Integrity through a Low-Overhead Encoding Scheme on Address Buses [p. 542]
T. Lv, W. Wolf, J. Henkel, and H. Lekatsas

Signal integrity is and will continue to be a major concern in deep sub-micron VLSI designs, where the proximity of signal-carrying lines leads to crosstalk, unpredictable signal delays and other parasitic side effects. Our scheme uses bus encoding that guarantees that at any time any two signal-carrying lines will be separated by at least one grounded line, thus providing a high degree of signal integrity. This comes at a small overhead of only one additional bus line (the closest related work needs 14 additional lines for a 32-bit bus) and a small average performance decrease of 0.36%. By means of a large set of real-world applications, we compare our scheme to other state-of-the-art approaches and present comparisons in terms of degree of integrity, overhead (e.g. additional lines required) and possible performance decrease.
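
The effect being exploited can be quantified with a toy Python model (ours; note the paper's encoding achieves the separation guarantee with a single extra bus line, whereas the naive shielding below nearly doubles the wire count):

    def coupling_events(prev, curr):
        """Count adjacent line pairs switching in opposite directions."""
        return sum(1 for i in range(len(curr) - 1)
                   if (curr[i] - prev[i]) * (curr[i + 1] - prev[i + 1]) < 0)

    def naive_shield(word):
        """Insert a grounded (constant 0) line between all signal lines."""
        out = []
        for bit in word:
            out.extend([bit, 0])
        return out[:-1]

    prev, curr = [0, 1, 0, 1], [1, 0, 1, 0]       # worst-case toggle pattern
    print(coupling_events(prev, curr))                              # 3 events
    print(coupling_events(naive_shield(prev), naive_shield(curr)))  # 0 events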


7A: System Level Simulation

Moderators: B.M. Al-Hashimi, Southampton U, UK; M. Rencz, TU Budapest, HU

Building Fast and Accurate SW Simulation Models Based on Hardware Abstraction Layer and Simulation Environment Abstraction Layer [p. 550]
S. Yoo, I. Bacivarov, A. Bouchhima, Y. Paviot, and A. Jerraya

As a fast and accurate SW simulation model, we present a model called the fast timed SW model. The model enables fast simulation by native execution of application SW and OS, and it achieves simulation accuracy through timed SW and HW simulation. When building fast timed SW models, we need to solve two problems: (1) how to enable timing synchronization between the native execution and the HW simulation, and (2) how to obtain the portability of native execution (which needs multi-tasking support from simulation environments to emulate its multi-tasking operation) across different simulation environments (which provide different types of multi-tasking). In this paper, to enable the synchronization, we present a synchronization function. To enable the portability, we present an adaptation layer called the simulation environment abstraction layer. We present our case studies in building fast timed SW models.

Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation [p. 556]
W. Qin and S. Malik

Given the growth in application-specific processors, there is a strong need for a retargetable modeling framework that is capable of accurately capturing complex processor behaviors and generating efficient simulators. We propose the operation state machine (OSM) computation model to serve as the foundation of such a modeling framework. The OSM model separates the processor into two interacting layers: the operation layer where operation semantics and timing are modeled, and the hardware layer where disciplined hardware units interact. This declarative model allows for direct synthesis of micro-architecture simulators as it encapsulates precise concurrency semantics of microprocessors. We illustrate the practical benefits of this model through two case studies - the StrongARM core and the PowerPC-750 superscalar processor. The experimental results demonstrate that the OSM model has excellent modeling productivity and model efficiency. Additional applications of this modeling framework include derivation of information required by compilers and formal analysis for processor validation.

Instruction Set Emulation for Rapid Prototyping of SoCs [p. 562]
J. Schnerr, G. Haug, and W. Rosenstiel

In this paper the application of Instruction Set Emulation for rapid prototyping of SoCs is presented. The emulation works in such a way that both the software and the hardware behaviour of the emulated processor core are reproduced cycle-accurately. This requires the use of hardware and software components. The hardware component consists of a board containing a VLIW processor and FPGAs. The software component is an instruction set simulator of the core running on the VLIW processor. The FPGAs are used for emulating the SoC bus of this processor core. In this way, the simulation of the instruction set of a processor core has been extended to a real emulation of this core that can be used for rapid prototyping.


7B: Design Space Exploration for Reconfigurable Computing

Moderators: P. Lysaght, Xilinx, US; S. Vernalde, IMEC, BE

Hardware/Software Design Space Exploration for a Reconfigurable Processor [p. 570]
A. La Rosa, L. Lavagno, and C. Passerone

This paper describes an approach to hardware/software design space exploration for reconfigurable processors. Because of the user-definable instructions, the existing compiler tool-chain needs to be extended in order to offer developers an easy way to explore the design space. Such extensions are often not easy to use for developers who have only a software background and are thus unfamiliar with reconfigurable architecture details or hardware logic synthesis on FPGAs. Our approach differs from others because it is based on a simple extension of the standard programming model well known to software developers.

From C Programs to the Configure-Execute Model [p. 576]
J. Cardoso and M. Weinhardt

The emergence of run-time reconfigurable architectures makes the configure-execute paradigm feasible. Compilation of behavioral descriptions (in, e.g., C or Java), apart from mapping the computational structures onto the available resources on the device, must split the program into temporal sections if it needs more resources than are physically available. In addition, since the execution of the computational structures in a configuration needs at least two stages (i.e., configuring and computing), it is important to split the program such that the reconfiguration overheads are minimized, taking advantage of the overlapping of the execution stages of different configurations. This paper presents mapping techniques to cope with those features. The techniques are being researched in the context of a C compiler for the eXtreme Processing Platform (XPP). Temporal partitioning is applied to furnish a set of configurations that reduces the reconfiguration overhead and thus may lead to performance gains. We also show that when applications include a sequence of loops, the use of several configurations may be more beneficial than mapping the entire application onto a single configuration. Preliminary results for a number of benchmarks strongly confirm the approach.

FPGA-Based Implementation of a Serial RSA Processor [p. 582]
A. Mazzeo, L. Romano, G. Saggese, and N. Mazzocca

In this paper we present a hardware implementation of the RSA algorithm for public-key cryptography. The RSA algorithm consists of the computation of modular exponentials on large integers, which can be reduced to repeated modular multiplications. We present a serial implementation of RSA, which is based upon an optimized version of the RSA algorithm originally proposed by P.L. Montgomery. The proposed architecture is innovative, and it widely exploits specific capabilities of Xilinx programmable devices. Compared to other solutions in the literature, the proposed implementation of the RSA processor has smaller area occupation and comparable performance. The final performance level is a function of the serialization factor. We provide a thorough discussion of design tradeoffs, in terms of area requirements vs. performance, for different values of the key length and of the serialization factor.


7C: On-Line Testing and Self-Repair

Moderators: L. Bouzaida, STMicroelectronics, FR; A. Paschalis, Athens U, GR

Optimal Reconfiguration Functions for Column or Data-Bit Built-In Self-Repair [p. 590]
M. Nicolaidis, N. Achouri, and S. Boutobza

In modern SoCs, embedded memories occupy the largest part of the chip area and include an even larger share of the active devices. As memories are designed very tightly to the limits of the technology, they are more prone to failures than logic. Thus, memories concentrate the large majority of defects and affect circuit yield dramatically. As a matter of fact, Built-In Self-Repair is gaining importance. This work presents optimal reconfiguration functions for memory built-in self-repair at the data-bit level. We also present a dynamic repair scheme that allows reducing the size of the repairable units. The combination of these schemes allows repairing multiple faults affecting both regular and spare units, at low hardware cost. The scheme uses a single test pass, resulting in low test and repair time.

Versatile High-Level Synthesis of Self-Checking Datapaths Using an On-Line Testability Metric [p. 596]
P. Oikonomakos, M. Zwolinski, and B. Al-Hashimi

There have been several recent attempts to include duplication-based on-line testability in behaviourally synthesized designs. In this paper, on-line testability is considered within the optimisation process of iterative, cost function-driven high-level synthesis, such that on-line testing resources are inserted automatically without any modification of the source HDL code. This involves the introduction of a metric for on-line testability. A variation of duplication testing (namely inversion testing) is also used, providing the system with an additional degree of freedom towards minimising hardware overheads associated with test resource insertion. Considering online testability within the synthesis process facilitates fast and efficient design space exploration, resulting in a versatile high-level synthesis process, capable of producing alternative realisations according to the designer's directions.

An Accurate Analysis of the Effects of Soft Errors in the Instruction and Data Caches of a Pipelined Microprocessor [p. 602]
M. Rebaudengo, M. Sonza Reorda, and M. Violante

Instruction and data caches are well known architectural solutions that allow significantly improving the performance of high-end processors. Due to their sensitivity to soft errors they are often disabled in safety critical applications, thus sacrificing performance for improved dependability. In this paper we report an accurate analysis of the effects of soft errors in the instruction and data caches of a soft core implementing the SPARC architecture. Thanks to an efficient simulation-based fault injection environment we developed, we are able to present in this paper an extensive analysis of the effects of soft errors on a processor running several applications under different memory configurations. The procedure we followed allows the precise computation of the processor failure rate when the cache is enabled even without resorting to expensive radiation experiments.

High Speed and Highly Testable Parallel Two-Rail Code Checker [p. 608]
M. Omaña, D. Rossi, and C. Metra

In this article we propose a high speed and highly testable parallel two-rail code checker, which features a compact structure and is Totally-Self-Checking or Strongly Code-Disjoint with respect to a wide set of realistic faults. The proposed checker is also particularly suitable to implement embedded two-rail code checkers, as it requires only two input codewords for fault detection. Our checker can be employed to check the correct operation of a connected functional block using the two-rail code, to implement the output two-rail code checker of "normal" checkers for unordered codes, or to join together the error messages produced by various checkers (possibly using different codes) present within the same self-checking system. The behavior of our checker has been verified by means of electrical level simulations (performed using HSPICE), considering both nominal values and statistical variations of electrical parameters.


7E: Hot Topic: Safe Automotive Software Development (Embedded Software Forum)

Organizer/Moderator: R. Ernst, TU Braunschweig, DE

Safe Automotive Software Development [p. 616]
K. Tindell, H. Kopetz, F. Wolf, and R. Ernst

Automotive systems engineering has made significant progress in using formal methods to design safe hardware-software systems. The architectures and design methods could become a model for safe and cost-efficient embedded software development as a whole. This paper gives several examples from the leading edge of industrial automotive applications.


7F: Mixed-Signal Design Techniques

Moderators: A. Kaiser, IEMN-ISEN, FR; T. Ifstroem, Robert Bosch GmbH, DE

Analysis and White-Box Modeling of Weakly Nonlinear Time-Varying Circuits [p. 624]
P. Dobrovolný, G. Vandersteen, P. Wambacq, and S. Donnay

The architectural study of wireless communication systems typically requires simulations with high-level models for different analog and RF blocks. Among these blocks, frequency-translating devices such as mixers pose problems in RF circuit simulation since their response typically covers a mix of long and short time scales. This paper proposes a technique to analyze and model nonlinear frequency-translating RF circuits such as up- and down-conversion mixers. The proposed method is based on a generalized Volterra series approach for periodically time-varying systems. It enables a multi-tone distortion analysis starting from a circuit description and derives simplified high-level models based on the most important nonlinear contributions. These models both give insight into the nonlinear behavior and enable an efficient high-level simulation during architectural design of front-ends of RF transceivers.

Linear Model-Based Error Identification and Calibration for Data Converters [p. 630]
C. Wegener and M. Kennedy

For the example of a 12-bit Nyquist-rate ADC, a model for nonlinearity-causing mechanisms is developed based on circuit simulations. The model is used to estimate circuit element values from measured device characteristics. Post-manufacture reconfiguration of the digital control part of the device-type that is used as a test vehicle in this work can improve the linearity performance of a device. An algorithm is proposed that searches for a locally-optimal reconfiguration based on the determined circuit element values. Applying calibration to the circuit simulation model allows one to estimate the performance improvement obtainable with the proposed calibration scheme for a given manufacturing process prior to a physical implementation.

Improved Design Methodology for High-Speed High-Accuracy Current Steering D/A Converters [p. 636]
M. Albiol, J. González, and E. Alarcón

This paper describes a sizing and design methodology for high-speed high-accuracy current-steering D/A converters that takes into account mismatch in all the transistors of the current source cell. The presented method allows a more accurate selection of the optimal design point without introducing arbitrary safety margins, as was done in the previous literature. This methodology has been applied to the design of a CMOS 12-bit 400 MHz current-steering segmented D/A converter. Commercial CAD tools are used to automatically lay out regular structures of the DAC, especially the current source array, following an optimal two-dimensional switching scheme to compensate for systematic mismatch errors.

Behavioral Modeling and Simulation of a Mixed Analog/Digital Automatic Gain Control Loop in a 5 GHz WLAN Receiver [p. 642]
W. Eberle, H. De Man, G. Vandersteen, P. Wambacq, G. Gielen, and S. Donnay

Wireless LANs (WLANs) operating in the 5-6 GHz range become commercially viable only if they can be produced at low cost. Consequently, tight integration of the physical layer, consisting of the radio front-end and the digital signal processing part, is a must. Especially with respect to mixed-signal feedback loops, with automatic gain control as a recurring example, existing tools have major difficulties in offering efficient ways of modeling and simulation. We present a modeling approach where the complexity of the analog behavioral model has been reduced to the minimum required by the digital receiver, namely its steady-state responses and a `worst-case' time delay. Moreover, we show how this mixed-signal receiver model can be used in an end-to-end communication link simulation to provide the designer insight into statistical information such as transient delays and gain tolerances. For this model, we set up a co-simulation of two existing in-house tools, one for the analog part, the other for the digital system part.


8A: Design Space Exploration

Moderators: H. Hsieh, UC Riverside, US; W. Kruijtzer, Philips, NL

Analytical Design Space Exploration of Caches for Embedded Systems [p. 650]
A. Ghosh and T. Givargis

The increasing use of microprocessor cores in embedded systems, as well as in mobile and portable devices, creates an opportunity for customizing the cache subsystem for improved performance. Traditionally, a design-simulate-analyze methodology is used to achieve the desired cache performance. Here, to bootstrap the process, arbitrary cache parameters are selected, the cache sub-system is simulated using a cache simulator, cache parameters are tuned based on the performance results, and the process is repeated until an acceptable design is obtained. Since the cache design space is typically very large, the traditional approach often requires a very long time to converge. In the proposed approach, we outline an efficient algorithm that directly computes cache parameters satisfying the desired performance. We demonstrate the feasibility of our algorithm by applying it to a large number of embedded system benchmarks.
Keywords: Cache Optimization, Core-Based Design, Design Space Exploration, System-on-a-Chip.
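
The contrast with design-simulate-analyze above can be conveyed by a deliberately simple Python toy (the power-law miss-rate model is our assumption, not the paper's algorithm): invert the performance model once instead of iterating simulations.

    import math

    def miss_rate(cache_kb, m0=0.25, alpha=0.4):
        return m0 * cache_kb ** -alpha           # toy power-law miss model

    def smallest_cache(target, m0=0.25, alpha=0.4):
        size = (m0 / target) ** (1.0 / alpha)    # invert the model directly
        return 2 ** math.ceil(math.log2(size))   # round up to a power of two

    size = smallest_cache(0.02)
    print(size, miss_rate(size))                 # meets the 2% miss target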

Fast and Accurate Multiprocessor Architecture Exploration with Symbolic Programs [p. 656]
V. Zivkovic, E. Deprettere, E. de Kock, and P. van der Wolf

In system-level platform-based embedded systems design, the mapping model is a crucial link between the application model and the architecture model. All three models must match when design-space exploration has to be fast and accurate, and when exploration methods and design methods have to be closely related. For the media processing application domain we present an architecture model and corresponding mapping model that meet these requirements better than previously proposed models. A case study illustrates this improvement.

Design Space Exploration for a Wireless Protocol on a Reconfigurable Platform [p. 662]
L. Vanzago, J. Cambonie, B. Bhattacharya, and L. Lavagno

This paper describes a design space exploration experiment for a real application from the embedded networking domain -- the physical layer of a wireless protocol. The application contains both control-oriented and data-processing functions, and hence requires composing tasks from different models of computation. We show how the cost and performance of communication and computation can be quickly evaluated, with a reasonable modeling cost. While the example uses a specific tool, the methodology and results can be used in a more general context.

A First Step Towards HW/SW Partitioning of UML Specifications [p. 668]
W. Fornaciari, F. Salice, P. Micheli, and L. Zampella

This paper proposes a novel methodology tailored to the design of embedded systems, taking into account emerging market needs such as hw/sw partitioning, object-oriented specifications, overall design costs and early analysis of design alternatives. The proposal tackles the problem by considering UML as the starting point for system-level description and uses a customization of Function Point analysis and COCOMO to provide cost metrics for both hardware and software. Finally, a genetic algorithm is used to select the best candidate architecture. The paper also reports some results, obtained from a case study, showing the viability of the proposed approach.

Multi-Granularity Metrics for the Era of Strongly Personalized SOCs [p. 674]
Y. Le Moullec, J. Diguet, J. Philippe, N. Ben Amor, and M. Abid

This paper details the first step of the Design Trotter framework for design space exploration applied to dedicated SOCs. The aim of this step is to provide metrics in order to guide the designer and the synthesis tool towards an efficient application-architecture matching. This work presents the computation of metrics at all levels of the application's graph-based hierarchy. These metrics are computed through data and control dependency analysis. They quantify the memory, control and processing orientations as well as the average parallelism at different granularities.


8B: Low Power Architectures

Moderators: W. Nebel, OFFIS/Oldenburg U, DE; J. Henkel, NEC, US

Energy Estimation for Extensible Processors [p. 682]
Y. Fei, N. Jha, S. Ravi, and A. Raghunathan

This paper presents an efficient methodology for estimating the energy consumption of application programs running on extensible processors. Extensible processors, which are increasingly popular in embedded system design, allow a designer to customize a base processor core through instruction set extensions. Existing processor energy macro-modeling techniques are not applicable to extensible processors, since they assume that the instruction set architecture as well as the underlying structural description of the microarchitecture remain fixed. Our solution to this problem is an energy macro-model suitably parameterized to estimate the energy consumption of a processor instance that incorporates any custom instruction extensions. Such a characterization is facilitated by careful selection of macro-model parameters/variables that can capture both the functional and structural aspects of the execution of a program on an extensible processor. Another feature of the proposed characterization flow is the use of regression analysis to build the macro-model. Regression analysis allows for in-situ characterization, thus allowing arbitrary test programs to be used during macro-model construction. We validate the proposed methodology by characterizing the energy consumption of a state-of-the-art extensible processor (Tensilica's Xtensa). We use the macro-model to analyze the energy consumption of several benchmark applications with custom instructions. The mean absolute error in the macro-model estimates is only 3.3%, when compared to the energy values obtained by a commercial tool operating on the synthesized RTL description of the custom processor. Our approach achieves an average speedup of three orders of magnitude over the commercial RTL energy estimator. Our experiments show that the proposed methodology also achieves good relative accuracy, which is essential in energy optimization studies.
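
The regression step can be pictured with a small Python sketch (event categories and all numbers are invented for illustration; the paper selects macro-model parameters far more carefully):

    import numpy as np

    # Each row is one characterization run; columns count energy-relevant
    # events (base instructions, custom-instruction issues, memory ops).
    runs = np.array([[1.0e6, 0.0e0, 2.0e5],
                     [8.0e5, 5.0e4, 1.0e5],
                     [1.2e6, 2.0e5, 3.0e5],
                     [5.0e5, 1.0e5, 5.0e4]])
    measured = np.array([1.30, 1.17, 2.01, 0.78])    # energies (made up)

    coeff, *_ = np.linalg.lstsq(runs, measured, rcond=None)  # energy per event
    new_program = np.array([9.0e5, 8.0e4, 1.5e5])
    print(new_program @ coeff)                       # macro-model estimate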

Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures [p. 688]
J. Hu and R. Marculescu

In this paper, we present an algorithm which automatically maps the IPs onto a generic regular Network on Chip (NoC) architecture and constructs a deadlock-free deterministic routing function such that the total communication energy is minimized. At the same time, the performance of the resulting communication system is guaranteed to satisfy the specified constraints through bandwidth reservation. As the main contribution, we first formulate the problem of energy/performance aware mapping, in a topological sense, and show how the routing flexibility can be exploited to expand the solution space and improve the solution quality. An efficient branch-and-bound algorithm is then described to solve this problem. Experimental results show that the proposed algorithm is very fast, and significant energy savings can be achieved. For instance, for a complex video/audio application, 51.7% energy savings have been observed, on average, compared to an ad-hoc implementation.
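
The objective can be sketched in Python (ours; the exhaustive search below stands in for the paper's branch-and-bound, which prunes the same space with lower bounds, and bandwidth constraints are omitted):

    from itertools import permutations

    tiles = [(x, y) for y in range(2) for x in range(2)]         # 2x2 mesh
    traffic = {(0, 1): 50, (1, 2): 30, (2, 3): 20, (0, 3): 10}   # IP pairs

    def cost(place):        # energy proxy: traffic volume times hop distance
        return sum(v * (abs(place[s][0] - place[d][0]) +
                        abs(place[s][1] - place[d][1]))
                   for (s, d), v in traffic.items())

    best = min(permutations(tiles), key=lambda p: cost(dict(enumerate(p))))
    print(best, cost(dict(enumerate(best))))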

Chromatic Encoding: A Low Power Encoding Technique for Digital Visual Interface [p. 694]
W. Cheng and M. Pedram

This paper presents a low-power encoding technique, called chromatic encoding, for the Digital Visual Interface standard (DVI), a digital serial video interface. Chromatic encoding reduces power consumption by minimizing the transition counts on the DVI. This technique relies on the notion of tonal locality, i.e., the observation that the signal differences between adjacent pixels in images follow a Gaussian distribution. Based on this observation, an optimal code assignment is performed to minimize the transition counts. Furthermore, the three color channels of the DVI may be reciprocally encoded to achieve even more power saving. The idea is that given the signal values from the three color channels, one or two of these channels are encoded by reciprocal differences with a number of redundant bits used to indicate the selection. The proposed technique requires only three redundant bits for each 24-bit pixel. Experimental results show up to a 75% transition reduction.
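
The code-assignment idea can be illustrated in Python (a toy of ours, not the actual DVI codebook): because differences cluster near zero, the most frequent differences receive the codewords with the fewest 1-bits, a simple proxy for fewer serial transitions.

    def build_code(bits=8):
        # Codewords ordered by number of 1-bits; differences ordered by
        # expected frequency under a zero-centered Gaussian: 0, 1, -1, ...
        codewords = sorted(range(2 ** bits), key=lambda c: bin(c).count("1"))
        diffs = [0]
        for d in range(1, 2 ** (bits - 1)):
            diffs += [d, -d]
        diffs.append(-(2 ** (bits - 1)))
        return dict(zip(diffs, codewords))

    code = build_code()
    pixels = [118, 120, 121, 119, 119]
    diffs = [(b - a + 128) % 256 - 128 for a, b in zip(pixels, pixels[1:])]
    stream = [code[d] for d in diffs]
    print([bin(c).count("1") for c in stream])   # mostly low-weight codewords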

MRPF: An Architectural Transformation for Synthesis of High-Performance and Low-Power Digital Filters [p. 700]
H. Choo, K. Roy, and K. Muhammad

We present a graph-theoretical methodology that reduces the implementation complexity of a vector multiplied by a scalar. The proposed approach is called MRP (minimally redundant parallel) optimization and is presented in an FIR filtering framework to obtain a low-complexity multiplier-less implementation. The key idea is to expand the design space using shift-inclusive differential coefficients together with computation reordering using a graph-theoretic approach to obtain maximal computation sharing. The transformed architecture of a filter is obtained by solving a set cover problem on the graph. A simple algorithm based on a greedy approach is presented. The proposed approach is merged with common sub-expression elimination. The simulation results show 70% and 16% improvements in computational complexity over a simple implementation (transposed direct form) and common sub-expression elimination, respectively, when using a carry-lookahead adder synthesized from the Synopsys DesignWare library in 0.25 um technology.

Transport Protocol Optimization for Energy Efficient Wireless Embedded Systems [p. 706]
D. Bertozzi, L. Benini, A. Raghunathan, and S. Ravi

For wireless embedded systems, the power consumption of the network interface (radio) plays a dominant role in determining battery life. In this paper, we explore transport protocol optimizations for reducing the energy consumption of wireless LAN interfaces. Our work is based on the observation that the transport protocol, which implements flow control to regulate the network traffic, plays a significant role in determining the workload of the network interface. Hence, by monitoring run-time parameters in the transport protocol, coarse-granularity idle periods, which present the best opportunities for network interface power reduction, can be accurately identified. We further show that, by tuning parameters in the protocol software implementation, we can shape the activity profile of the network interface, making it more energy efficient while remaining compliant with the TCP standard. We have performed extensive current measurements using an experimental testbed that consists of a Compaq iPAQ PDA with a Cisco Aironet wireless network adapter, to validate the proposed techniques. Our measurements indicate energy savings ranging from 28% to 69% compared to the use of state-of-the-art MAC layer power reduction techniques, with little or no impact on performance.


8C: System-on-Chip Testing

Moderators: P. Harrod, ARM, UK; S. Shoukourian, VirageLogic International, AR

Low-Cost Software-Based Self-Testing of RISC Processor Cores [p. 714]
N. Kranitis, A. Paschalis, G. Xenoulis, D. Gizopoulos, and Y. Zorian

Software self-testing of embedded processor cores, which effectively partitions the testing effort between low-speed external equipment and internal processor resources, has recently been proposed as an alternative to classical hardware built-in self-test techniques, over which it provides significant advantages.

A P1500-Compatible Programmable BIST Approach for the Test of Embedded Flash Memories [p. 720]
P. Bernardi, M. Rebaudengo, M. Sonza Reorda, and M. Violante

In this paper we present a microprocessor-based approach suitable for embedded flash memory testing in a System-on-a-Chip (SOC) environment. The main novelty of the approach is its high flexibility, which guarantees easy adaptation of the same architecture to different memory cores. The proposed approach is compatible with the P1500 standard. A case study has been developed and demonstrates the advantages of the proposed core test strategy in terms of area overhead and test application time.

Test Data Compression: The System Integrator's Perspective [p. 726]
P. Gonciari, B. Al-Hashimi, and N. Nicolici

Test data compression (TDC) is a promising low-cost methodology for System-on-a-Chip (SOC) test. This is due to the fact that it can reduce not only the volume of test data but also the bandwidth requirements. In this paper we provide a quantitative analysis of two distinctive TDC methods from the system integrator's standpoint considering a core based SOC environment. The proposed analysis addresses four parameters: compression ratio, test application time, area overhead and power dissipation. Based on our analysis, some future research directions are given which can lead to an easier integration of TDC in the SOC design flow and to further improve the four parameters.

Time Domain Multiplexed TAM: Implementation and Comparison [p. 732]
Z. Ebadi and A. Ivanov

One of the difficult problems which core-based system-on-chip (SoC) designs face is test access. For testing the cores in a SoC, a special mechanism is required, since they are not directly accessible via chip inputs and outputs. In this paper we introduce a novel Test Access Mechanism (TAM) based on time domain multiplexing (TDM-TAM). This TAM is P1500 compatible and uses a P1500 wrapper. The TAM characteristics are its flexibility, scalability, and reconfigurability. The proposed TAM is compared with two other approaches: a serial threading approach analogous to the IEEE 1149.1 standard (Serial TAM) [7] and a packet-switching test network (NIMA) [9]. A network-processing engine SoC is used as a platform to compare the different TAMs [6]. Results show that in most cases, TDM is the most effective TAM in both test time and overhead area.
Keywords: SoC testing, Embedded core testing, Test Access Mechanism (TAM), Time domain multiplexed TAM, Optimal test time.

Layout-Driven SOC Test Architecture Design for Test Time and Wire Length Minimization [p. 738]
K. Goel and E. Marinissen

This paper extends existing SOC test architecture design approaches that minimize required tester vector memory depth and test application time, with the capability to minimize the wire length required by the test architecture. We present a simple, yet effective wire length cost model for test architectures together with a new test architecture design algorithm that minimizes both test time and wire length. The user specifies the relative weight of the costs of test time versus wire length. In an integrated fashion, the algorithm partitions the total available TAM width over individual TAMs, assigns the modules to these TAMs, and orders the modules within one TAM such that the total cost is minimized. Experimental results on five benchmark SOCs show that we can obtain savings of up to 86% in wiring costs at the expense of <4% in test time.

Delay Fault Testing of Core-Based Systems-On-A-Chip [p. 744]
Q. Xu and N. Nicolici

Existing approaches for modular manufacturing testing of core-based systems-on-a-chip (SOCs) do not provide any explicit mechanism for high quality two-pattern tests required for performance validation through delay fault testing. This paper proposes a new approach for broadside delay fault testing of core-based SOCs, by adapting the existing solutions for automatic test pattern generation and design for test support, test access mechanism division and test scheduling.


8D: Synthesis and Analysis of Digital Circuits

Moderators: W. Kunz, Kaiserslautern U, DE; T. Villa, Udine U, IT

Reducing Multi-Valued Algebraic Operations to Binary [p. 752]
J. Jiang, R. Brayton, and A. Mishchenko

Algebraic operations were developed for binary logic synthesis and later extended to multi-valued (MV) logic. Operations in the MV domain were considered more complex and slower. This paper shows that MV algebraic operations are essentially as easy as binary ones, with only a slight overhead (linear in the size of the expression) for the transformation into and out of the binary domain. By introducing co-singleton sets as a new basis, any MV sum-of-products expression can be rewritten and passed to a binary logic synthesizer for fast execution; the optimized results can be directly interpreted in the MV domain. This process, called EBD, reduces MV algebraic operations to binary. A pure MV operation differs from its corresponding EBD one mainly in that the former possesses "semi-algebraic" generality, which has not been implemented for binary logic. Experiments show that the proposed methods are significantly faster, with little or no loss in quality, when run in comparable logic synthesis scripts.

Combination of Lower Bounds in Exact BDD Minimization [p. 758]
R. Ebendt, W. Günther, and R. Drechsler

Ordered Binary Decision Diagrams (BDDs) are a data structure for efficient representation and manipulation of Boolean functions. They are frequently used in logic synthesis and formal verification. The size of BDDs depends on the chosen variable ordering, i.e. the size may vary from linear to exponential, and the problem of improving the variable ordering is known to be NP-complete. In this paper we present a new exact branch & bound technique for determining an optimal variable order. In contrast to all previous approaches, which considered only one lower bound, our method uses a combination of three bounds and thereby avoids unnecessary computations. The lower bounds are derived by generalizing a lower bound known from VLSI design. They allow the BDD to be built either top down or bottom up. Experimental results are given to show the efficiency of our approach.
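
The pruning idea generalises to any branch & bound over variable-order prefixes: at each partial order, take the tightest of several lower bounds. A minimal sketch (the functions bdd_size and the entries of lower_bounds are placeholders for the paper's actual cost and bound computations):

```python
def exact_min_size(variables, bdd_size, lower_bounds):
    """Branch & bound over variable-order prefixes.

    bdd_size(order) returns the BDD size for a complete order;
    lower_bounds is a list of functions mapping a partial order to a
    lower bound on the size of any completion. Prunes a subtree as
    soon as the best (largest) bound reaches the incumbent.
    """
    best = [float("inf"), None]

    def branch(prefix, remaining):
        if not remaining:
            size = bdd_size(prefix)
            if size < best[0]:
                best[0], best[1] = size, list(prefix)
            return
        # Combine bounds: prune with the tightest one available.
        if max(lb(prefix) for lb in lower_bounds) >= best[0]:
            return
        for v in list(remaining):
            remaining.remove(v)
            branch(prefix + [v], remaining)
            remaining.add(v)

    branch([], set(variables))
    return best[1], best[0]
```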

Implicit Resolution of the Chapman-Kolmogorov Equations for Sequential Circuits: An Application in Power Estimation [p. 764]
A. Freitas and A. Oliveira

In this work we describe an approach that implicitly formulates and solves the Chapman-Kolmogorov equations that describe the state probabilities associated with the stationary behavior of sequential circuits. Unlike previous approaches that assumed uncorrelated input signals, we model the more general case where the sequential circuit is driven by a sequence of inputs described by a discrete time Markov chain. This Markov chain is described implicitly using a formalism that allows for a compact description of chains with an exponentially high number of states. Using this approach, we present an application in power estimation of sequential circuits that takes into account all the temporal and spatial correlations between the primary inputs and the internal signals. We present results showing that, in some cases, it is possible to solve exactly the Chapman-Kolmogorov equations for systems with more than 10^7 equations.
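
For intuition, the stationary state probabilities pi of an explicitly given chain with transition matrix P satisfy pi P = pi with the entries of pi summing to one. A minimal dense-matrix sketch (the 3-state chain is a made-up example; the paper's contribution is solving such systems implicitly, at scales where a dense matrix is impossible):

```python
import numpy as np

# Toy transition matrix of a 3-state Markov chain.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# Stationary distribution: (P^T - I) pi = 0, augmented with the
# normalisation constraint sum(pi) = 1.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # stationary state probabilities
```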

Performance-Directed Retiming for FPGAs Using Post-Placement Delay Information [p. 770]
U. Seidl, F. Johannes, and K. Eckl

In today's deep-submicron designs, the interconnect delays contribute an increasing part to the overall performance of an implementation. Particularly when targeting field programmable gate arrays (FPGAs), interconnect delays are crucial, since they can easily vary by orders of magnitude. Many existing performance-directed retiming methods use simple delay models which either neglect routing delays or use inaccurate delay estimations. In this paper, we propose a retiming approach which overcomes the problem of inaccurate delay models. Our retiming technique uses delay information extracted from a fully placed and routed design and takes account of register timing requirements. By applying physical constraints, we ensure that the delay information remains valid during retiming. In our experiments, we achieved up to 27% performance improvement.


8E: Embedded System Architectures (Embedded Software Forum)

Moderators: W. Fornaciari, Politecnico di Milano, IT; T. Thurner, DaimlerChrysler Research, DE

Exploring High Bandwidth Pipelined Cache Architecture for Scaled Technology [p. 778]
A. Agarwal, T. Vijaykumar, and K. Roy

In this paper we propose a design technique to pipeline cache memories for high bandwidth applications. With the scaling of technology, cache access latencies stretch over multiple clock cycles. The proposed pipelined cache architecture can be accessed every clock cycle and thereby enhances bandwidth and overall processor performance. The proposed architecture utilizes the idea of banking to reduce bit-line and word-line delay, making the word-line to sense amplifier delay fit into a single clock cycle. Experimental results show that optimal banking allows the cache to be split into multiple stages whose delays are equal to the clock cycle time. The proposed design is fully scalable and can be applied to future technology generations. Power, delay and area estimates show that, on average, the proposed pipelined cache improves MOPS (millions of operations per unit time per unit area per unit energy) by 40-50% compared to current cache architectures.

Enhancing Speedup in Network Processing Applications by Exploiting Instruction Reuse with Flow Aggregation [p. 784]
G. Surendra, S. Banerjee, and S. Nandy

Instruction reuse is a microarchitectural technique that improves the execution time of a program by removing redundant computations at run-time. Although this is the job of an optimizing compiler, compilers often do not succeed due to their limited knowledge of run-time data. In this paper we examine instruction reuse of integer ALU and load instructions in network processing applications. Specifically, this paper attempts to answer the following questions: (1) How much instruction reuse is inherent in network processing applications? (2) Can reuse be improved by reducing interference in the reuse buffer? (3) What characteristics of network applications can be exploited to improve reuse? (4) What is the effect of reuse on resource contention and memory accesses? We propose an aggregation scheme that combines the high-level concept of network traffic, i.e. "flows", with a low-level microarchitectural feature of programs, i.e. repetition of instructions and data, along with an architecture that exploits temporal locality in incoming packet data to improve reuse. We find that for the benchmarks considered, 1% to 50% of instructions are reused, while the speedup achieved varies between 1% and 24%. As a side effect, instruction reuse reduces memory traffic and can therefore also be considered a scheme for low power.
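
A toy model of the mechanism (the table structure, keying and flow extraction below are illustrative assumptions, not the paper's hardware organisation): results are cached per (pc, operands) and tagged with a flow id so that packets of the same flow share reusable computations.

```python
class ReuseBuffer:
    """Toy reuse buffer keyed by flow id, program counter and operands."""

    def __init__(self):
        self.table = {}  # (flow_id, pc, operands) -> result

    def lookup(self, flow_id, pc, operands):
        return self.table.get((flow_id, pc, tuple(operands)))

    def insert(self, flow_id, pc, operands, result):
        self.table[(flow_id, pc, tuple(operands))] = result


def execute(rb, flow_id, pc, operands, alu_op):
    cached = rb.lookup(flow_id, pc, operands)
    if cached is not None:
        return cached           # redundant computation removed
    result = alu_op(*operands)  # normal execution
    rb.insert(flow_id, pc, operands, result)
    return result
```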

On-Chip Stochastic Communication [p. 790]
T. Dumitras and R. Marculescu

As CMOS technology scales down into the deep-submicron (DSM) domain, the costs of design and verification for Systems-on-Chip (SoCs) are rapidly increasing due to the inefficiency of traditional CAD tools. Relaxing the requirement of 100% correctness for devices and interconnects drastically reduces the costs of design but, at the same time, requires that SoCs be designed with some system-level fault-tolerance. In this paper, we introduce a new communication paradigm for SoCs, namely stochastic communication. The newly proposed scheme not only separates communication from computation, but also provides the required built-in fault-tolerance to DSM failures, is scalable, and is cheap to implement. For a generic tile-based architecture, we show how a ubiquitous multimedia application (an MP3 encoder) can be implemented using stochastic communication in an efficient and robust manner. More precisely, up to 70% data upsets, 80% packet drops caused by buffer overflow, and severe levels of synchronization failures can be tolerated while maintaining low latency.
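
Stochastic communication is in the spirit of probabilistic flooding (gossip). A minimal sketch of one forwarding round on a tile-based network (the topology dictionary and the probabilities p_forward and p_drop are illustrative parameters, not the paper's values):

```python
import random

def gossip_step(tiles, src, p_forward=0.7, p_drop=0.05):
    """One round of probabilistic forwarding from tile `src`.

    tiles maps a tile id to the list of its neighbours. Each
    neighbour may lose the packet (modelling upsets / buffer
    overflow) or receive it with probability p_forward.
    Returns the set of tiles that newly received the packet.
    """
    received = set()
    for neigh in tiles[src]:
        if random.random() < p_drop:      # upset or buffer overflow
            continue
        if random.random() < p_forward:   # stochastic forwarding
            received.add(neigh)
    return received

# 2x2 mesh of tiles as an example topology
mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(gossip_step(mesh, src=0))
```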

An Integrated Approach for Improving Cache Behavior [p. 796]
G. Memik, A. Choudhary, M. Kandemir, and I. Kadayif

The widening gap between processor and memory speeds renders data locality optimization a very important issue in data-intensive embedded applications. Throughout the years hardware designers and compiler writers focused on optimizing data cache locality using intelligent cache management mechanisms and program-level transformations, respectively. Until now, there has not been significant research investigating the interaction between these optimizations. In this work, we investigate this interaction and propose a selective hardware/compiler strategy to optimize cache locality for integer, numerical (array-intensive), and mixed codes. In our framework, the role of the compiler is to identify program regions that can be optimized at compile time using loop and data transformations and to mark (at compile-time) the unoptimizable regions with special instructions that activate/deactivate a hardware optimization mechanism selectively at run-time. Our results show that our technique can improve program performance by as much as 60% with respect to the base configuration and 17% with respect to a non-selective hardware/compiler approach.

Rapid Configuration and Instruction Selection for an ASIP: A Case Study [p. 802]
N. Cheung, S. Parameswaran, and J. Henkel

We present a methodology that maximizes the performance of Tensilica-based Application Specific Instruction-set Processors (ASIPs) through instruction selection when an area constraint is given. Our approach rapidly selects from a set of pre-fabricated coprocessors/functional units from our library of pre-designed specific instructions (to evaluate our technology we use the Tensilica platform). As a result, we significantly increase application performance while area constraints are satisfied. Our methodology uses a combination of simulation, estimation and a pre-characterised library of instructions to select the appropriate coprocessors and instructions. We report that by selecting the appropriate coprocessors/functional units and specific TIE instructions, the total execution time of a complex application (we study a voice encoder/decoder) can be reduced by up to 85% compared to the base implementation. The estimator used in our system typically takes less than a second, with an average error rate of 4% (as compared to full simulation, which takes 45 minutes). The total selection process using our methodology takes 3-4 hours, while a full design space exploration using simulation would take several days.
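
Selecting instruction extensions under an area budget can be framed as a 0/1 knapsack problem. A sketch under that assumption (the candidate tuples and the dynamic program below are our illustration, not the paper's selection algorithm):

```python
def select_instructions(candidates, area_budget):
    """0/1 knapsack over candidate coprocessors / TIE instructions.

    candidates: list of (name, area, estimated_cycles_saved) as it
    might come from a pre-characterised library (figures made up).
    Returns (saved_cycles, chosen_names) maximising saved cycles
    within the area budget.
    """
    best = {0: (0, [])}  # used_area -> (saved, chosen)
    for name, area, saved in candidates:
        # iterate over a snapshot so each candidate is used at most once
        for used, (s, chosen) in sorted(best.items(), reverse=True):
            new_used = used + area
            if new_used <= area_budget and \
               (new_used not in best or best[new_used][0] < s + saved):
                best[new_used] = (s + saved, chosen + [name])
    return max(best.values())

print(select_instructions(
    [("mac", 5, 1200), ("viterbi", 9, 2100), ("bitrev", 2, 300)],
    area_budget=11))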


8F: Specification and Verification in Action

Moderators: W. Ecker, Infineon, DE; T. Kropf, Robert Bosch GmbH, DE

Local Search for Boolean Relations on the Basis of Unit Propagation [p. 810]
Y. Novikov

We propose a method for local search of Boolean relations relating variables of a CNF formula. The method is to branch on small subsets of the set of CNF variables and to analyze results of unit propagation. By taking into account variable value assignments deduced during the unit propagation procedure the method is able to justify any relation represented by a Boolean expression. The proposed technique is based on bitwise logical operations over ternary vectors. We implement a restricted version of the method used for unit clause derivation and equivalent-literal identification in a preprocessor engine for a SAT-solver. The experiments show that the proposed technique is useful for solving real-world instances of the formal verification domain.
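
The core building block, Boolean Constraint Propagation over unit clauses, is simple to state. A minimal sketch (DIMACS-style integer literals; the branching and relation-justification layers of the paper are not shown):

```python
def unit_propagate(clauses, assignment):
    """Repeatedly assign the literal of any unit clause.

    clauses: list of lists of non-zero ints (positive = true literal);
    assignment: dict var -> bool. Returns the extended assignment,
    or None on conflict (an empty clause under the assignment).
    """
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var in assignment:
                    satisfied |= assignment[var] == want
                else:
                    unassigned.append(lit)
            if satisfied:
                continue
            if not unassigned:
                return None                # conflict
            if len(unassigned) == 1:       # unit clause: value is forced
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment

# (x1 or x2) and (not x1) forces x2 = True
print(unit_propagate([[1, 2], [-1]], {}))
```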

Set Manipulation with Boolean Functional Vectors for Symbolic Reachability Analysis [p. 816]
A. Goel and R. Bryant

Symbolic techniques usually use characteristic functions for representing sets of states. Boolean functional vectors provide an alternate set representation which is suitable for symbolic simulation. Their use in symbolic reachability analysis and model checking is limited, however, by the lack of algorithms for performing set operations. We present algorithms for set union, intersection and quantification that work with a canonical Boolean functional vector representation and show how this enables efficient symbolic simulation based reachability analysis. Our experimental results for reachability analysis indicate that the Boolean functional vector representation is often more compact than the corresponding characteristic function, thus giving significant performance improvements on some benchmarks.

Efficient Preimage Computation Using a Novel Success-Driven ATPG [p. 822]
S. Sheng and M. Hsiao

Preimage computation is a key step in formal verification. Pure OBDD-based symbolic methods are vulnerable to the space-explosion problem. On the other hand, conventional ATPG/SAT-based methods can handle large designs but can suffer from time explosion. Unlike methods that combine ATPG/SAT and OBDD, we present a novel success-driven learning algorithm which significantly accelerates an ATPG engine for enumerating all solutions (preimages). The algorithm effectively prunes redundant search space due to overlapped solutions and constructs a free BDD on the fly so that it becomes the representation of the preimage set at the end. Experimental results have demonstrated the effectiveness of the approach, in which we are able to compute preimages for large sequential circuits where OBDD-based methods fail.

Using Formal Techniques to Debug the AMBA System-On-Chip Bus Protocol [p. 828]
A. Roychoudhury, T. Mitra, and S. Karri

System-on-chip (SoC) designs use bus protocols for high performance data transfer among the Intellectual Property (IP) cores. These protocols incorporate advanced features such as pipelining, burst and split transfers. In this paper, we describe a case study in formally verifying a widely used SoC bus protocol: the Advanced Micro-controller Bus Architecture (AMBA) protocol from ARM. In particular, we develop a formal specification of the AMBA protocol. We then employ model checking, a state space exploration based formal verification technique, to verify crucial design invariants. The presence of pipelining and split transfer in the AMBA protocol gives rise to interesting corner cases, which are hard to detect via informal reasoning. Using the SMV model checker, we have detected a potential bus starvation scenario in the AMBA protocol. Such scenarios demonstrate the inherent intricacies in designing pipelined bus protocols.

Cross-Product Functional Coverage Measurement with Temporal Properties-Based Assertions [p. 834]
A. Ziv

Temporal specification languages provide an efficient way to express events comprised of complex temporal scenarios. Assertions based on these languages are used to detect violations of the specification and monitor coverage events. In this paper, we propose to extend temporal specification languages, and assertions based on them, with auxiliary variables. We attach these variables to sub-expressions and assign them values when the sub-expressions are evaluated. The use of auxiliary variables enables the implementation of large cross-product coverage models using a small number of assertions. This simplifies the definition and implementation of coverage models and helps reduce the simulation overhead caused by assertions, thus increasing the efficiency of simulation resources.


9A: Hot Topic: RF Design Technology for Highly Integrated Communication Systems

Organizer/Moderator: R. Wittmann, Nokia, DE
Speakers:
J. Hartung, Cadence, DE
H.-J. Wassener, Atmel, DE
G. Tränkle, Infineon, DE
M. Schröter, CEDIC, DE

Panel Statement [p. 842]

The characteristics of highly integrated high-speed data-transmission systems for near-future communication standards are high mobility, low power, reliability, high performance, low cost and short development time. Currently available RF development tools do not allow the efficient design of tailored, well-partitioned, highly integrated, high data-rate communication systems targeting up to 6 GHz carrier frequency. Reasons for this are the lack of flexibility and accuracy in the hierarchical design process from system down to physical level. In this session representatives of system houses, a leading EDA provider and a silicon foundry will introduce and discuss the challenges in their specific areas. Considering the market relevance of high data-rate wireless communication systems, the missing overall design technology is a real hot topic today. Suggestions on how to improve the existing situation, addressing the design technology as a whole, will be discussed. Depending on the representatives' perspectives, converging requirements will be worked out and non-converging requirements will be debated in a controversial manner.


9B: Zoning Chip Estate

Moderators: C. Sechen, Washington U, US; I. Markov, Michigan U, US

Power/Ground Mesh Area Optimization Using Multigrid-Based Technique [p. 850]
K. Wang and M. Marek-Sadowska

In this paper, we present a novel multigrid-based technique for power/ground mesh area optimization subject to reliability constraints. The multigrid-based technique is applied to reduce a large-scale mesh to a much coarser one. The reduced mesh can be efficiently optimized. The solution for the original mesh is then computed using a back-mapping process. Experimental results are very encouraging. Large-scale power/ground meshes with millions of nodes can be solved in a few minutes. The proposed technique not only speeds up the optimization process significantly without compromising the quality of solutions, but also brings up a possibility of incorporating the power/ground mesh optimization into other physical design stages such as signal routing.
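
A toy sketch of the coarsen-solve-back-map cycle in one dimension (averaging restriction, linear-interpolation prolongation; the paper's actual multigrid operators, reliability constraints and optimisation step are not shown, and the "optimisation" below is a placeholder):

```python
import numpy as np

def coarsen(values, factor=2):
    """Restrict a 1-D chain of mesh node values to a coarser grid
    by averaging groups of `factor` nodes."""
    n = len(values) - len(values) % factor
    return values[:n].reshape(-1, factor).mean(axis=1)

def refine(coarse, factor=2):
    """Back-map a coarse solution to the fine grid by interpolation."""
    x_c = np.arange(len(coarse)) * factor + (factor - 1) / 2.0
    x_f = np.arange(len(coarse) * factor)
    return np.interp(x_f, x_c, coarse)

fine = np.random.rand(1024)        # e.g. wire widths of a P/G mesh
coarse = coarsen(fine, factor=4)   # far fewer unknowns to optimise
optimised_coarse = coarse * 0.9    # placeholder for the real optimiser
back = refine(optimised_coarse, factor=4)
```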

A New and Efficient Congestion Evaluation Model in Floorplanning: Wire Density Control with Twin Binary Trees [p. 856]
S. Wai, E. Young, and C. Chu

As technology moves into the deep-submicron era, the complexity of VLSI circuit design grows rapidly, especially in the interconnections between modules. Therefore, interconnect optimization has become an important concern in floorplanning today. Most routability-driven floorplanners [2][6][8] use a grid-based approach that divides a floorplan into grids, as in global routing. Congestion is estimated as the expected number of nets passing through each grid. Although this approach is direct and accurate, it is not efficient when dealing with complex circuits containing thousands of nets. In this paper, an efficient and innovative routability-driven floorplanner using the twin binary trees (TBT) [9][10] representation is proposed. The congestion model we use is the wire density on the half-perimeter boundary of different regions in a floorplan. These regions are defined naturally by the TBT representation. In order to increase the efficiency of our floorplanner, a fast algorithm for the least common ancestor (LCA) problem [1] is used to compute the wire density. The experimental results show that the number of unroutable wires can be reduced in a short time.

Crosstalk Reduction in Area Routing [p. 862]
R. Smey, P. Madden, and B. Swartz

Interconnect delay dominates system delay in modern circuits, and with reduced feature sizes, coupling capacitance and signal crosstalk have become significant issues. By spacing interconnect wires effectively, and avoiding long parallel runs, coupling can be reduced; with current routing methods, however, this is difficult to achieve. In this paper, we present a new approach to area routing. Rather than inserting routes sequentially (using a performance driven maze router), multiple candidate routes for each connection are generated; excess routes are then eliminated iteratively. Experiments show that we obtain substantial reductions in coupling capacitance, without sacrificing routing completion rates.

Area Fill Generation with Inherent Data Volume Reduction [p. 868]
Y. Chen, A. Kahng, Y. Zheng, G. Robins, and A. Zelikovsky

Control of variability and performance in the back end of the VLSI manufacturing line has become extremely difficult with the introduction of new materials such as copper and low-k dielectrics. Uniformity of chemical-mechanical planarization (CMP) requires the insertion of area fill features into the layout, in order to smoothen the variation of feature densities across the die and thus improve manufacturability. Because the size of area fill features is very small compared with the large empty layout areas that must be filled, the filling process can increase the size of a GDSII file by an order of magnitude or more. Data compression is therefore a significant issue in successful fill synthesis. In this paper, we introduce compressed fill strategies which exploit the GDSII array reference record (AREF) construct. We apply greedy and linear programming based optimization techniques, and obtain practical compressed filling solutions.


9C: Panel: Transaction Based Design: Another Buzzword or the Solution to a Design Problem?

Organizer: H.-J. Schlebusch, Synopsys, DE
Moderator: G. Smith, Gartner Dataquest, US
Panellists:
D. Sciuto, Politecnico di Milano, IT
D. Gajski, UC Irvine, US
C. Mielenz, Infineon Technologies, DE
C.K. Lennard, ARM, UK
F. Ghenassia, STMicroelectronics, FR
S. Swan, Cadence, US
J. Kunkel, Synopsys, US

Panel Statement [p. 876]

Complex systems on chip (SoCs) present challenges in the design and verification process that cannot be adequately addressed by traditional methodologies based on register transfer descriptions. Some of these aspects are efficient design exploration based on component reuse, achieving closure on the architecture, as well as early development, integration and verification of embedded software. In the search for responses to these challenges, transaction level modeling (TLM) has received considerable attention in the area of SoC design. This panel attempts a reality check on TLM from an engineering point of view. Questions to discuss are: Is the Transaction Level (TL) really useful for the design and/or the verification of SoCs? How can TL speed up the design process and lower the risk of design failures? What are the implications for tools, languages, and Intellectual Property (IP) used in the design/verification process? The panelists will share their thoughts on transaction based design and verification, and will discuss benefits and issues based on their experiences of applying transaction level methodologies.


9D: Trust in SAT-Based Verification?

Moderators: R. Drechsler, Bremen U, DE; C. Pixley, Synopsys, US

Validating SAT Solvers Using an Independent Resolution-Based Checker: Practical Implementations and other Applications [p. 880]
L. Zhang and S. Malik

As the use of SAT solvers as core engines in EDA applications grows, it becomes increasingly important to validate their correctness. In this paper, we describe the implementation of an independent resolution-based checking procedure that can check the validity of unsatisfiable claims produced by the SAT solver zchaff. We examine the practical implementation issues of such a checker and describe two implementations with different pros and cons. Experimental results show low overhead for the checking process. Our checker can work with many other modern SAT solvers with minor modifications, and it can provide information for debugging when checking fails. Finally we describe additional results that can be obtained by the validation process and briefly discuss their applications.
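
A resolution-based checker replays the proof: every derived clause must be the resolvent of two clauses already known, and the last step must produce the empty clause. A minimal sketch (the (i, j, pivot) step format is our illustration, not zchaff's actual trace format):

```python
def check_resolution_proof(clauses, proof):
    """Check a resolution refutation of a CNF formula.

    clauses: iterable of literal collections (non-zero ints);
    proof: list of (i, j, pivot) steps, each resolving the i-th and
    j-th known clauses on variable `pivot` (pivot positive in clause
    i, negative in clause j). Returns True iff the final resolvent
    is the empty clause.
    """
    known = [frozenset(c) for c in clauses]
    resolvent = None
    for i, j, pivot in proof:
        a, b = known[i], known[j]
        if pivot not in a or -pivot not in b:
            return False                       # malformed step
        resolvent = (a - {pivot}) | (b - {-pivot})
        known.append(resolvent)
    return resolvent == frozenset()

# (x1) and (not x1) resolve on x1 to the empty clause
print(check_resolution_proof([[1], [-1]], [(0, 1, 1)]))
```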

Verification of Proofs of Unsatisfiability for CNF Formulas [p. 886]
E. Goldberg and Y. Novikov

As SAT-algorithms become more and more complex, there is little chance of writing a SAT-solver that is free of bugs. So it is of great importance to be able to verify the information returned by a SAT-solver. If the CNF formula to be tested is satisfiable, solution verification is trivial and can be easily done by the user. However, in the case of unsatisfiability, the user has to rely on the reputation of the SAT-solver. We describe an efficient procedure for checking the correctness of proofs of unsatisfiability produced by a SAT-solver.

A Circuit SAT Solver with Signal Correlation Guided Learning [p. 892]
F. Lu, L.-C. Wang, K.-T. Cheng, and R. Huang

Much SAT research aims to develop better heuristics to further improve solver efficiency, by either speeding up the Boolean Constraint Propagation (BCP) procedure or finding a better decision ordering (or both). In this paper, we propose an entirely different SAT solver design concept that is circuit-based. Our solver is able to utilize circuit topological information and signal correlations to enforce a decision ordering that is more efficient for solving circuit-based SAT problem instances. In particular, for unsatisfiable circuit examples, our solver is able to achieve from 2x up to more than 75x speedup over a state-of-the-art SAT solver.

Improving SAT-Based Bounded Model Checking by Means of BDD-Based Approximate Traversals [p. 898]
G. Cabodi, S. Nocco, and S. Quer

Binary Decision Diagrams (BDDs) have been widely used for hardware verification since the beginning of the '90s, whereas Boolean Satisfiability (SAT) has been gaining ground more recently, with the introduction of Bounded Model Checking (BMC). In this paper we dovetail BDD- and SAT-based methods to improve the efficiency of BMC. More specifically, we first exploit inexpensive symbolic approximate reachability analysis to gather information on the state space. We then use this information to restrict and focus the overall search space of SAT-based BMC. In the experimental results section we show how the information coming from a BDD tool can improve the efficiency of a SAT engine by drastically reducing the number of "variable assignments" and "variable conflicts". This results in a significant overall performance gain associated with a general, robust, and easy-to-apply methodology.


9E: Transformations for Real-Time Software (Embedded Software Forum)

Moderators: L. Thiele, ETH Zurich, CH; Z. Peng, Linköping U, SE

Generalized Data Transformations for Enhancing Cache Behavior [p. 906]
V. De La Luz, M. Kandemir, I. Kadayif, and U. Sezer

The performance gap between processors and off-chip memories has widened in recent years and is expected to widen even more. Today, it is widely accepted that caches significantly improve the performance of programs, since most programs exhibit temporal and/or spatial locality in their memory reference patterns. However, conflict misses can be a major obstacle preventing an application from utilizing the data cache effectively. While array padding can reduce conflict misses, it can also increase the data space requirements significantly. In this paper, we present a compiler-based data transformation strategy, called "generalized data transformations," for reducing inter-array conflict misses in embedded applications. We present the theory behind the generalized data transformations and discuss how they can be integrated with compiler-based loop transformations. Our experimental results demonstrate that the generalized data transformations are very effective in improving the data cache behavior of embedded applications.

Software Streaming via Block Streaming [p. 912]
P. Kuacharoen, V. Mooney, and V. Madisetti

Software streaming allows the execution of stream-enabled software on a device even while the transmission/streaming may still be in progress. Thus, the software can be executed while it is being streamed instead of making the user wait for the completion of download, decompression, installation and reconfiguration. Our streaming method can reduce the application load time seen by the user, since the application can start running as soon as the first executable unit is loaded into memory. Furthermore, unneeded parts of the application might not be downloaded to the device. As a result, resource utilization such as memory and bandwidth usage may also be more efficient. Using our streaming method, an embedded device can support a wide range of real-time applications, which can be run on demand. In this paper, a streaming method we call block streaming is proposed. Block streaming is determined at the assembly code level. We implemented a tool to partition real-time software into parts which can be transmitted (streamed) to the embedded device. Our streaming method was implemented and simulated on a hardware/software co-simulation platform based on the PowerPC architecture. We show a robotics application that, without our streaming method, is unable to meet its real-time deadline, whereas with our software streaming method it meets its deadline. The application load time also improves by a factor of more than 10X compared to downloading the entire application before running it.

Energy-Aware Adaptive Checkpointing in Embedded Real-Time Systems [p. 918]
Y. Zhang and K. Chakrabarty

We present an integrated approach that provides fault tolerance and dynamic power management for a real-time task executing in an embedded system. Fault tolerance is achieved through an adaptive checkpointing scheme that dynamically adjusts the checkpointing interval during task execution. Adaptive checkpointing is then combined with a dynamic voltage scaling scheme to achieve power reduction. The resulting energy-aware adaptive checkpointing scheme uses a dynamic voltage scaling criterion that is based not only on the slack in task execution but also on the occurrences of faults during task execution. Simulation results show that compared to previous methods, the proposed approach significantly reduces power consumption and increases the likelihood of timely task completion in the presence of faults.
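
A toy policy illustrating the coupling of the two mechanisms (the shrink-on-fault interval rule and the voltage thresholds below are our made-up stand-ins for the paper's criterion):

```python
def next_checkpoint_interval(base, faults_seen, slack, deadline):
    """Shrink the checkpoint interval as faults accumulate, and never
    let it exceed the remaining slack (illustrative policy)."""
    interval = base / (1 + faults_seen)
    return max(min(interval, slack), 0.01 * deadline)

def voltage_level(slack, deadline, faults_seen, levels=(1.0, 0.8, 0.6)):
    """Scale the supply down only with ample slack and few observed
    faults, so timely completion stays likely (made-up thresholds)."""
    if faults_seen == 0 and slack > 0.5 * deadline:
        return levels[2]
    if slack > 0.25 * deadline:
        return levels[1]
    return levels[0]

print(next_checkpoint_interval(base=10.0, faults_seen=2,
                               slack=4.0, deadline=100.0))
print(voltage_level(slack=60.0, deadline=100.0, faults_seen=0))
```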


9F1: Synthesis Tools for Asynchronous Circuits

Moderators: M. Renaudin, TIMA Laboratory, FR; A. Koelmans, Newcastle U, UK

Visualization and Resolution of Coding Conflicts in Asynchronous Circuit Design [p. 926]
A. Madalinski, A. Bystrov, V. Khomenko, and A. Yakovlev

Synthesis of asynchronous circuits from Signal Transition Graphs (STGs) involves resolving state coding conflicts. The refinement process is generally done automatically using heuristics and often produces sub-optimal solutions, which have to be corrected manually. This paper presents a framework for an interactive refinement process aimed to help the designer. It is based on the visualization of conflict cores, i.e., sets of transitions causing coding conflicts, which are represented at the level of finite and complete prefixes of STG unfoldings.

STG Optimisation in the Direct Mapping of Asynchronous Circuits [p. 932]
D. Sokolov, A. Bystrov, and A. Yakovlev

Direct mapping from Petri nets (PN) and Signal Transition Graphs (STG) avoids the algorithmic complexity inherent in logic synthesis methods based on optimal state encoding. However, it may lead to inefficient implementations, in both size and performance, due to excessive use of state-holding elements. This paper presents a set of tools that optimise the logic produced by the direct mapping technique by means of exposure of outputs and detection and elimination of redundant places. Output exposure is an approach that explicitly models output signals as STG places, which can be directly mapped into output flip-flops. The STG can be simplified after output exposure. The detection of redundant places is a computationally hard problem with multiple solutions. The tool solves this problem by using several heuristics aimed at speed and size. All operations preserve behavioural equivalence. The efficiency of the overall algorithm and of the individual heuristics is analysed using a number of benchmarks.


9F2: Collaborative Design and WWW-Based Tools

Moderators: T.J. Kazmierski, Southampton U, UK; W. Mueller, C-LAB Paderborn, DE

Ubiquitous Access to Reconfigurable Hardware: Application Scenarios and Implementation Issues [p. 940]
L. Indrusiak, F. Lubitz, R. Reis, and M. Glesner

This paper presents an approach for the integration of reconfigurable hardware and computer applications based on the concept of ubiquitous computing. The goal is to allow a network of reconfigurable hardware modules to be transparently accessible by client applications. The communication between them is done at the API level, and a Jini-based infrastructure is used to provide an interface for the client applications to find available reconfigurable hardware modules over the network. A DES-based cryptography system was implemented as a case study.

Dynamic Tool Integration in Heterogeneous Computer Networks [p. 946]
H. Eikerling, W. Mueller, T. Schattkowsky, and J. Wegner

Tool installation and the automation of administrative tasks are of increasing importance as heterogeneous computer networks grow more complex. This article introduces a new approach for dynamic network tool management, TRMS. A variant of TRMS based on SNMP -- a well-established standard for network administration -- is outlined and illustrated by the integration and management of design tools for Printed Circuit Boards (PCBs).


10A: Performance Optimisation in Hardware/Software Codesign

Moderators: J. Madsen, TU Denmark, DK; F. Rousseau, TIMA Laboratory, FR

Layered, Multi-Threaded, High-Level Performance Design [p. 954]
A. Cassidy, J. Paul, and D. Thomas

A primary goal of high-level modeling is to efficiently explore a broad design space, converging on an optimal or near-optimal system architecture before moving to a more detailed design. This paper evaluates a high-level, layered, software-on-hardware performance modeling environment called MESH that captures coarse-grained, interacting system elements. The validity of the high-level model is established by comparing its outcome with that of a corresponding low-level, cycle-accurate instruction set simulator. We model a network processor and show that the high- and low-level models converge on the same architecture and agree on which design modifications have good or bad performance impacts.

A Co-Design Methodology for Energy-Efficient Multi-Mode Embedded Systems with Consideration of Mode Execution Probabilities [p. 960]
M. Schmitz, B. Al-Hashimi, and P. Eles

Multi-mode systems are characterised by a set of interacting operational modes to support different functionalities and standards. In this paper, we present a co-design methodology for multi-mode embedded systems that produces energy-efficient implementations. Based on the key observation that operational modes are executed with different probabilities, i.e., the system spends uneven amounts of time in the different modes, we develop a novel co-design technique that exploits this property to significantly reduce energy dissipation. We conduct several experiments, including a smart phone real-life example, that demonstrate the effectiveness of our approach. Reductions in power consumption of up to 64% are reported.

Processor/Memory Co-Exploration on Multiple Abstraction Levels [p. 966]
G. Braun, A. Wieferink, O. Schliebusch, R. Leupers, H. Meyr, and A. Nohl

Recently, the evolution of embedded systems has shown a strong trend towards application-specific, single-chip solutions. As a result, application-specific instruction set processors (ASIPs) are more and more replacing off-the-shelf processors in such systems-on-chip (SoC). Along with the processor cores, heterogeneous memory architectures play an important role as part of the system. According to last year's ITRS [5], in 2004 about 70 percent of the chip area will be made up of memories. As such architectures are highly optimized for a particular application domain, processor core and memory subsystem design cannot be kept apart, but have to merge into an efficient design process. In this paper, we present a unified approach for processor/memory co-exploration using an architecture description language. We show an efficient way of considering the instruction set and the memory architecture during the entire exploration process. Finally, we illustrate the feasibility of our approach with a real-world case study.


10B: Dynamic Resource Management for Reconfigurable Systems

Moderators: Y. Tanurhan, Actel, US; C. Passerone, Politecnico di Torino, IT

Run-Time Management of Logic Resources on Reconfigurable Systems [p. 974]
M. Gericota, G. Alves, M. Silva, and J. Ferreira

Dynamically reconfigurable systems based on partial and dynamically reconfigurable FPGAs may have their functionality partially modified at run-time without stopping the operation of the whole system. The efficient management of the logic space available is one of the biggest problems faced by these systems. When the sequence of reconfigurations to be performed is not predictable, resource allocation decisions have to be made on-line. A rearrangement may be necessary to get enough contiguous space to implement incoming functions, avoiding the spreading of their components and the resulting degradation of system performance. A new software tool that helps to handle the problems posed by the consecutive reconfiguration of the same logic space is presented in this paper. This tool uses a novel on-line rearrangement procedure to solve fragmentation problems and to rearrange the logic space in a way completely transparent to the applications currently running.

Managing a Reconfigurable Processor in a General Purpose Workstation Environment [p. 980]
M. Dales

Reconfigurable processor hybrids are becoming an accepted solution in the embedded systems domain, but have yet to gain acceptance in the general purpose workstation domain. One problem with current solutions is their lack of support for the dynamic workloads and resource demands of a general purpose workstation. In this paper we describe and demonstrate a reconfigurable processor architecture that lets the operating system dynamically share the FPL resource between a set of applications without the management overheads negating the benefit of having the extra resource.

Infrastructure for Design and Management of Relocatable Tasks in a Heterogeneous Reconfigurable System-On-Chip [p. 986]
J. Mignolet, V. Nollet, P. Coene, S. Vernalde, D. Verkest, and R. Lauwereins

The ability to (re)schedule a task either in hardware or software will be an important asset in reconfigurable systems-on-chip. To support this feature we have developed an infrastructure that, combined with a suitable design environment, permits the implementation and management of hardware/software relocatable tasks. This paper presents the general scope of our research and details the communication scheme, the design environment and the hardware/software context switching issues. The infrastructure proved its feasibility by allowing us to design a relocatable video decoder. When implemented on an embedded platform, the decoder performs at 23 frames/s (320x240 pixels, 16 bits per pixel) in reconfigurable hardware and 6 frames/s in software.


10C: Advances in Test Pattern Generation

Moderators: S. Kundu, Intel, US; B. Straube, FhG IIS/EAS Dresden, DE

RTL Test Pattern Generation for High Quality Loosely Deterministic BIST [p. 994]
M. Santos, J. Fernandes, I. Teixeira, and J. Teixeira

High quality Built-In Self Test (BIST) needs to efficiently tackle the coverage of random-pattern-resistant (r.p.r.) defects. Several techniques have been proposed to cover r.p.r. faults at logic level, namely weighted pseudo-random and mixed-mode. In mixed-mode test pattern generation (TPG) techniques, deterministic tests are added to pseudo-random vectors to detect r.p.r. faults. Recently, an RTL mixed-mode TPG technique, the mask-based BIST technique, has been proposed to cover r.p.r. defects. The purpose of this paper is to present mask-based BIST TPG improvements in two areas: RTL estimation of the test length to be used for each mask, in order to reach high Defects Coverage (DC), and the identification of an optimum mask for each set of nested RTL conditions. Results are used to predict the number of customized vectors for each mask of an ITC'99 benchmark module.

A New Approach to Test Generation and Test Compaction for Scan Circuits [p. 1000]
I. Pomeranz and S. Reddy

We propose a new approach to test generation and test compaction for scan circuits that eliminates the distinction between scan operations and application of primary input vectors. Under this approach, the scan-in, scan-select and scan-out lines are treated as conventional primary inputs or primary outputs of the circuit. As a result, limited scan operations, where scan chains are shifted a number of times smaller than their lengths, are incorporated naturally into the test sequences generated by this approach. This leads to very aggressive compaction, resulting in test sequences with the lowest known test application times for benchmark circuits.

Fully Automatic Test Program Generation for Microprocessor Cores [p. 1006]
F. Corno, G. Cumani, M. Sonza Reorda, and G. Squillero

Microprocessor cores are a major challenge in the test arena: not only is their complexity always increasing, but their specific characteristics also intensify all difficulties. A microprocessor embedded inside a SOC is even harder to test, since its inputs might be harder to control and its behavior harder to observe. Functional testing is an effective solution which consists in forcing the microprocessor to execute a suitable test program. This paper presents a new approach to automatic test program generation exploiting an evolutionary paradigm. It overcomes the main limitations of previous methodologies and provides significantly better results. Human intervention is limited to the enumeration of all assembly instructions, and the internal parameters of the optimizer are auto-adapted. Experimental results show the effectiveness of the approach.

On the Characterization of Hard-To-Detect Bridging Faults [p. 1012]
I. Pomeranz, S. Reddy, and S. Kundu

We investigate a characterization of hard-to-detect bridging faults. For circuits with large numbers of lines (or nodes), this characterization can be used to select target faults for test generation when it is impractical to target all the bridging faults (or all the realistic bridging faults). We demonstrate that the faults selected based on the proposed characterization are indeed hard-to-detect by showing that the fault coverage of a given test set with respect to this subset is lower and more sensitive to the test set than the fault coverage obtained with respect to a random subset of the same size, with respect to the complete set of faults, and when possible, with respect to a subset of realistic bridging faults of the same size. We also demonstrate that a test set for the selected subset of faults detects other faults more effectively than when a test set is derived for a randomly selected subset of faults of the same size.


10D: Analogue and Digital Simulation

Moderators: M. Zwolinski, Southampton U, UK; P. Schwarz, FhG IIS/EAS Dresden, DE

The Power Grid Transient Simulation in Linear Time Based on 3D Alternating-Direction-Implicit Method [p. 1020]
Y. Lee and C. Chen

The rising power consumption and clock frequency of VLSI technology demand robust and stable power delivery. Extensive transient simulations of large-scale power delivery structures are required to analyze power delivery fluctuations caused by dynamic IR and L di/dt drop as well as package and on-chip resonance. This paper develops a novel and efficient transient simulation algorithm for power distribution networks. The 3D TLM-ADI (Transmission-Line-Modeling Alternating-Direction-Implicit) method first models the power delivery structure as a three-dimensional transmission-line shunt-node structure and transforms the resulting equations into the Telegraph equation, which is then solved by the alternating-direction-implicit method. The 3D TLM-ADI method, with linear runtime and memory requirements, is also unconditionally stable, which ensures that the time-steps are not limited by any stability requirement. Numerical experiments show that the 3D TLM-ADI method is not only over 300,000 times faster than SPICE but also extremely memory-saving and accurate.

Transistor-Level Static Timing Analysis by Piecewise Quadratic Waveform Matching [p. 1026]
Z. Wang and J. Zhu

While fast timing analysis methods, such as asymptotic waveform evaluation (AWE), are well established for linear circuits, the timing analysis of non-linear circuits, which are dominant in digital CMOS designs, is usually performed by a SPICE-like, time domain integration based approach, involving expensive Newton-Raphson iterations at numerous time steps. In this paper, we propose a new technique that computes the transient solution of charge/discharge paths with a complexity equivalent to only K DC operating point calculations, where K is the number of transistors along the path. This is accomplished by approximating each nodal voltage as a piecewise quadratic waveform, whose characteristics are determined by matching the charge/discharge currents. Experiments on a wide range of circuits show that a 31.6 times speed-up over SPICE transient simulation with a 10ps step size can be achieved, while maintaining an average accuracy of 99%.

A Fast Algorithm for the Layout Based Electro-Thermal Simulation [p. 1032]
M. Rencz, V. Székely, and A. Poppe

A new algorithm has been developed for the layout-based direct electro-thermal simulation of integrated circuits. The advantage of direct electro-thermal simulation over simulator coupling is that very fast changes can also be considered; the drawback is that the thermal nodes add to the number of nodes of the network to be simulated. The novelties of our method lie in the modeling and the solution of the thermal structure. This paper presents the algorithm of the time-constant-spectrum based FOSTER chain matrix thermal modeling, and the new algorithm of the coupled electro-thermal solution, in which the parts of the network that represent the thermal behavior are not computed in every step of the iteration. This speeded-up algorithm works both in the time and in the frequency domain. A simulation example demonstrates a typical application: predicting how the layout arrangement and the packaging of an analogue integrated circuit influence the electrical parameters.

Platform Based Testbench Generation [p. 1038]
R. Henftling, M. Bauer, M. Zambaldi, A. Zinn, and W. Ecker

This paper presents a new technology that accelerates system verification. In a real life example, we achieved a speed-up of a factor of about 5000. The key for this speed-up is a configurable, synthesizable testbench architecture, which can be completely mapped to emulators or FPGAs. Exploiting generic controllers and re-using protocol-specific stimuli generators combined with topology and microprogram generation is responsible for almost zero overhead compared to behavioral testbenches.


10E: Low Power Software (Embedded Software Forum)

Moderators: M. Miranda, IMEC, BE; F. Fallah, Fujitsu, US

Software Architectural Transformations: A New Approach to Low Energy Embedded Software [p. 1046]
T. Tan, A. Raghunathan, and N. Jha

Previous work on software optimization for low energy has focused on instruction-level optimizations and compiler techniques. We argue, and demonstrate, that significant energy savings could be "left on the table" if energy is not considered during the design of the software architecture. As a first step towards addressing this gap, we propose a systematic framework for software architectural transformations to reduce energy consumption. We consider software architectural transformations in the context of the multi-process software style driven by an operating system (OS), which is very commonly employed in energy-sensitive embedded systems. Our methodology for applying software architectural transformations consists of (i) constructing a software architecture graph representation, (ii) deriving initial energy and performance statistics using a detailed energy simulation framework, (iii) constructing sequences of atomic software architectural transformations, guided by energy change estimates derived from high-level energy macro-models, that result in maximal energy reduction, and (iv) generating program source code to reflect the optimized software architecture. We employ a wide suite of software architectural transformations whose effects span the application-OS boundary, including how the program functionality is structured into architectural components (e.g., application processes, signal handlers, and device drivers) and the connectors between them (inter-component synchronization and communication mechanisms). We present experimental results on several multi-process embedded software programs, in the context of an embedded system that features the Intel StrongARM processor and the embedded Linux OS. The presented results clearly underscore the potential of the proposed methodology (up to 66.1% reduction in energy is obtained). In a broader sense, our work demonstrates the impact of considering energy during the earlier stages of the software design process.

Dynamic Functional Unit Assignment for Low Power [p. 1052]
S. Haga, N. Reeves, R. Barua, and D. Marculescu

A hardware method for functional unit assignment is presented, based on the principle that a functional unit's power consumption is approximated by the switching activity of its inputs. Since computing the Hamming distance of the inputs in hardware is expensive, only a portion of the inputs are examined. Integers often have many identical top bits, due to sign extension, and floating points often have many zeros in the least significant digits, due to the casting of integer values into floating point. The accuracy of these approximations is studied and the results are used to develop a simple, but effective, hardware scheme.
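
A software model of the assignment rule (the 32-bit width, 8-bit window and list-of-units representation are illustrative assumptions): route each operation to the unit whose previous operands differ least from the new ones, comparing only the top bits.

```python
def pick_unit(last_inputs, a, b, top_bits=8, width=32):
    """Choose the functional unit minimising approximate input
    switching. Only the top bits are compared, mirroring the
    observation that integers often share identical top bits due
    to sign extension."""
    mask = ((1 << top_bits) - 1) << (width - top_bits)

    def switching(prev):
        pa, pb = prev
        return (bin((pa ^ a) & mask).count("1")
                + bin((pb ^ b) & mask).count("1"))

    unit = min(range(len(last_inputs)),
               key=lambda i: switching(last_inputs[i]))
    last_inputs[unit] = (a, b)   # remember this unit's new inputs
    return unit

units = [(0, 0), (0xFFFFFFFF, 0xFFFFFFFF)]  # last inputs of two ALUs
print(pick_unit(units, 3, 7))               # small values -> unit 0
```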

Implementation and Evaluation of an On-Demand Parameter-Passing Strategy for Reducing Energy [p. 1058]
M. Kandemir, W. Zhang, and I. Kolcu

In this paper, we present an energy-aware parameter-passing strategy called on-demand parameter-passing. The objective of this strategy is to eliminate redundant actual parameter evaluations if the corresponding formal parameter in a subroutine is not used during execution. On-demand parameter-passing is expected to be very successful in reducing the energy consumption of large, multi-routine embedded applications at the expense of a slight increase in implementation complexity.

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors [p. 1064]
D. Nicolaescu, A. Veidenbaum, and A. Nicolau

Modern embedded processors use data caches with higher and higher degrees of associativity in order to increase performance. A set-associative data cache consumes a significant fraction of the total power budget in such embedded processors. This paper describes a technique for reducing the D-cache power consumption and shows its impact on the power and performance of an embedded processor. The technique utilizes cache line address locality to determine (rather than predict) the cache way prior to the cache access. It thus allows only the desired way to be accessed, for both tags and data. The proposed mechanism is shown to reduce the average L1 data cache power consumption when running the MiBench embedded benchmark suite for 8-, 16- and 32-way set-associative caches by, respectively, an average of 66%, 72% and 76%. The absolute power savings from this technique increase significantly with associativity. The design has no impact on performance and, given that it has no mis-prediction penalties, it does not introduce any new non-deterministic behavior in program execution.
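
A behavioural sketch of such a way-determination structure (entry count and FIFO eviction are illustrative assumptions): the table records, for recently seen line addresses, which way actually holds the line, so a hit in the table activates a single way, while a miss falls back to accessing all ways.

```python
class WayTable:
    """Way determination via line-address locality (toy model)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}  # line address -> way that holds the line

    def determined_way(self, line_addr):
        """Return the recorded way, or None (access all ways)."""
        return self.table.get(line_addr)

    def record(self, line_addr, way):
        """Record the actual way after a normal all-way access."""
        if line_addr not in self.table and len(self.table) >= self.entries:
            self.table.pop(next(iter(self.table)))  # crude FIFO evict
        self.table[line_addr] = way

wt = WayTable()
wt.record(line_addr=0x1F40, way=3)
print(wt.determined_way(0x1F40))  # 3: only way 3 needs to be read
```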


10F: Application Specific Memory Synthesis

Moderators: R. Hermida, Madrid Complutense U, ES; P. Eles, Linköping U, SE

Layer Assignment Techniques for Low Energy in Multi-Layered Memory Organisations [p. 1070]
E. Brockmeyer, M. Miranda, F. Catthoor, and H. Corporaal

Nearly all platforms use a multi-layer memory hierarchy to bridge the enormous latency gap between large off-chip memories and local register files. However, most previous work on HW- or SW-controlled techniques for layer assignment has focused mainly on performance. As a result, the intermediate layers have been assigned overly large sizes, leading to energy inefficiency. In this paper we present a technique that takes advantage of both the temporal locality and the limited lifetime of the arrays of the application to minimize energy consumption under layer size constraints. A prototype tool has been developed and tested using two real-life applications of industrial relevance. Following this approach we have been able to halve the energy consumed by the memory hierarchy for each of our drivers.

Mesh Partitioning Approach to Energy Efficient Data Layout [p. 1076]
S. Hettiaratchi and P. Cheung

Memory access consumes a significant amount of energy in data transfer intensive applications. The selection of a memory location from a CMOS memory cell array involves driving row and column select lines. A switching event on a row select line often consumes significantly more energy in comparison to a switching event on a column select line. In order to exploit this difference in energy consumption of row and column select lines, we propose a novel data layout method that aims to minimize row switching activity by assigning spatially and temporally local data items to the same row. The problem of minimum row switching data layout has been formulated as a multi-way mesh partitioning problem. The constraints imposed on the problem formulation ensure that the complexity of the address generator required to implement the optimized data layout is bounded and that the data layout optimization can be applied to all address generator synthesis methods. Our experiments demonstrate that our method can significantly reduce row transition counts over row major data layout.
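
The objective is easy to state concretely: for a fixed access trace, a layout that clusters spatially and temporally local items into the same row incurs fewer row-select switches. A small sketch (the item names and row mappings are made up):

```python
def row_transitions(access_sequence, row_of):
    """Count row-select switching events for a given data layout.

    access_sequence lists data items in access order; row_of maps an
    item to its assigned memory row. The partitioner seeks a row_of
    minimising this count.
    """
    rows = [row_of[item] for item in access_sequence]
    return sum(1 for a, b in zip(rows, rows[1:]) if a != b)

seq = ["a", "b", "a", "b", "c", "d", "c", "d"]
scattered = {"a": 0, "b": 1, "c": 0, "d": 1}   # 7 row switches
clustered = {"a": 0, "b": 0, "c": 1, "d": 1}   # 1 row switch
assert row_transitions(seq, clustered) < row_transitions(seq, scattered)
```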

On-Chip Stack Based Memory Organization for Low Power Embedded Architectures [p. 1082]
M. Mamidipaka and N. Dutt

This paper presents an on-chip stack based memory organization that effectively reduces the energy dissipation in programmable embedded system architectures. Most embedded systems use the notion of a stack for the implementation of function calls. However, such stack data is stored in the processor address space, typically in the main memory, and accessed through caches. Our analysis of several benchmarks shows that the callee-saved registers and return addresses for function calls constitute a significant portion of the total memory accesses. We propose a separate stack-based memory organization to store these registers and return addresses. Our experimental results show that effective use of such stack-based memories yields significant reductions in system power/energy, while simultaneously improving system performance. Application of our approach to the SPECint95 and MediaBench benchmark suites shows up to a 32.5% reduction in energy in the L1 data cache, with marginal improvements in system performance.
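
A behavioural model of the idea (the depth and the overflow handling are illustrative assumptions): call-linkage state lives in a small dedicated memory instead of travelling through the L1 data cache.

```python
class OnChipStack:
    """Small dedicated memory for return addresses and callee-saved
    registers, kept out of the data cache (toy model)."""

    def __init__(self, depth=64):
        self.frames, self.depth = [], depth

    def push_frame(self, return_addr, saved_regs):
        """On a function call, store linkage state on-chip."""
        if len(self.frames) == self.depth:
            raise OverflowError("deep recursion: spill to main memory")
        self.frames.append((return_addr, dict(saved_regs)))

    def pop_frame(self):
        """On return, restore linkage state without a cache access."""
        return self.frames.pop()

s = OnChipStack()
s.push_frame(return_addr=0x8000_0124, saved_regs={"r4": 7, "r5": 9})
print(s.pop_frame())
```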


Poster Sessions

4P: CAD for Analogue Design, Design Methodologies and Physical Design

Figure of Merit Based Selection of A/D Converters [p. 1090]
M. Vogels and G. Gielen

A new method for selecting analog to digital (A/D) converters based on a generic figure of merit is described. First, a figure of merit is introduced that includes both specifications and technology data and that has five generic parameters. The values of these generic parameters can be found by means of a fitting procedure using data from published designs. It is shown that the generic parameters have different values for different types of converters. Therefore the trade-off between speed, resolution, power dissipation and technology parameters depends on the type of converter. This trade-off can then be used to select a particular type of converter for a given application area.
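
One classical figure of merit divides power by the effective conversion rate; a generic five-parameter variant in the spirit of the paper could look as follows (the functional form, parameter names and fitting are our assumptions, not the authors' exact expression):

```python
def fom(power, f_sample, enob, l_min, params):
    """Generic A/D figure of merit with fitted exponents.

    Hypothetical form: power / (2**(a*enob) * f_sample**b * l_min**c)
    scaled by d with offset e, where params = (a, b, c, d, e) would be
    fitted per converter type from published designs. Lower is better.
    """
    a, b, c, d, e = params
    return power / (2 ** (a * enob) * f_sample ** b * l_min ** c) * d + e

# illustrative call: 10 mW, 50 MS/s, 10 effective bits, 0.18 um process
print(fom(10e-3, 50e6, 10, 0.18e-6, params=(1.0, 1.0, 0.0, 1.0, 0.0)))
```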

XBM2PLA: A Flexible Synthesis Tool for Extended Burst Mode Machines [p. 1092]
O. Kraus and M. Padeffke

This paper describes the results of a new synthesis tool (XBM2PLA) for asynchronous state machines [2]. XBM2PLA generates the Boolean functions for an asynchronous circuit. It is compatible with several other tools but often generates smaller results. Moreover, XBM2PLA allows a more flexible description of asynchronous circuits than other existing tools.

Multithreaded Synchronous Data Flow Simulation [p. 1094]
J. Kin and J. Pino

This paper introduces an efficient multithreaded synchronous dataflow (SDF) scheduler that can significantly reduce the running time of multi-rate SDF simulations on multiprocessor machines, with only a slight increase in memory usage over the standard cluster loop scheduler [1]. Experiments run on a dual-processor machine achieve on average approximately a 146% increase in performance with less than a 2.4% increase in memory usage. A quad-processor machine yields an average speedup of 2x.

PLFire: A Visualization Tool for Asynchronous Phased Logic Designs [p. 1096]
K. Fazel, M. Thornton, and R. Reese

We present a visualization tool called PLFire, which allows a user to observe the behavior of a Phased Logic (PL) circuit. Phased logic is a technique for realizing self-timed circuitry that is delay-insensitive and requires no global clock. One advantage of self-timed circuits is that throughput is based on average propagation delays and not worst-case delay. By being able to visualize the operation of a PL circuit, including the token flow, a designer gets a better understanding of what features of a design have the greatest impact on performance.

Extraction of Piecewise-Linear Analog Circuit Models from Trained Neural Networks Using Hidden Neuron Clustering [p. 1098]
S. Doboli, G. Gothoskar, and A. Doboli

This paper presents a new technique for automatically creating analog circuit models. The method extracts -- from trained neural networks -- piecewise linear models expressing the linear dependencies between circuit performances and design parameters. The paper illustrates the technique for an OTA circuit for which models for gain and bandwidth were automatically generated. The extracted models have a simple form that accurately fits the sampled points and the behavior of the trained neural networks. These models are useful for fast simulation of systems with non-linear behavior and performances.

Simulation and Analysis of Embedded DSP Systems Using MASIC Methodology [p. 1100]
A. Deb, J. Öberg, and A. Jantsch

MASIC, short for Maths to ASIC, models embedded DSP systems using a grammar-based technique. This paper presents a simulation and analysis technique for the MASIC model of a system. We build a Petri Net (PN) representation of the MASIC model to simulate the communication architecture. This helps to evaluate the effects of architectural decisions early in the design cycle. The correctness of the protocol description is analyzed using PN-based boundedness and conservation analysis.

A Custom-Cell Identification Method for High-Performance Mixed Standard/Custom-Cell Designs [p. 1102]
J. Lo, W. Kuo, T. Hwang, and A. Wu

Over the years, many design methodologies/tools and layout architectures have been developed for datapath-oriented designs. One commonly used approach for high-speed datapath designs is the full-custom design method targeted to bit-sliced or bit-alignment layout architectures. Using this approach, designers can fully exploit design properties, such as various circuit designs, structural regularities and layout structures, to develop high-speed, low-power and high-density datapath designs. However, the main drawbacks of this approach are threefold: (1) extremely high development cost, (2) very long development time, and (3) difficulty in migrating the custom design to a new technology. The other approach is the HDL-based synthesis method targeted to the standard-cell layout architecture. This approach is highly productive and easy to implement. However, due to its lack of ability to exploit the regularity properties of datapath designs, it is mostly suitable for producing low/medium-speed datapath designs. Another design method, targeted to mixed standard and custom cells, is carried out as follows. First, it generates a standard-cell design to obtain a maximal achievable speed. Second, it performs a cell customization process to determine which standard cells need to be customized in order to satisfy the given timing constraint. This cell customization process is usually performed incrementally, based on designers' intuition and experience. Once designers select a set of standard cells to be customized, they define the specification of the custom cells and develop the cells accordingly. Finally, the designers replace the standard cells with the custom ones and re-evaluate the speed of the design. The customization process continues until the timing constraint is satisfied. In this paper, we present a custom-cell identification method for high-speed datapath-oriented designs. The proposed method uses an HDL-based standard-cell design flow, integrating a custom-cell identification algorithm to determine a minimal-cost cell set and the design budgets for cell customization. Experimental results on a set of benchmark designs are reported to demonstrate the effectiveness of the proposed method. Note that our problem is different from the cell-sizing problem: while cell sizing selects and replaces (re-sizes) instances of library cells one by one, our problem is to select and re-design cells in the cell library and to replace all instances of each such cell in the design at one time.

Hierarchical Global Floorplacement Using Simulated Annealing and Network Flow Area Migration [p. 1104]
W. Choi and K. Bazargan

Floorplanning large designs with many hard macros and IP blocks of various sizes is becoming an increasingly important and challenging problem. This paper presents a global floorplacement method that combines a hierarchical simulated annealing floorplanning method with a partitioning-based global placement technique. A novel area migration method formulated as a min-cost, max-flow network flow problem is used to improve area utilization, and provide a communication mechanism between the partitioning engine and the placement method for better design quality. The network flow area migration method can be used in managing incremental changes in the design as well. Our global placement wire length is 12% better than the detailed placement wire length of a previous work, while our global placement is almost 8 times faster than their global placement.
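
A minimal sketch of the area-migration formulation, using networkx as an off-the-shelf min-cost max-flow solver; the bin layout, surpluses, capacities and edge costs are invented, and the paper's actual network construction may differ:

    import networkx as nx

    G = nx.DiGraph()
    surplus = {"b0": +4, "b1": 0, "b2": -4}   # b0 over-full, b2 has free area

    for b, s in surplus.items():
        if s > 0:
            G.add_edge("src", b, capacity=s, weight=0)    # excess area source
        elif s < 0:
            G.add_edge(b, "sink", capacity=-s, weight=0)  # free capacity sink

    # area may migrate between adjacent bins; cost models placement disturbance
    for u, v in [("b0", "b1"), ("b1", "b2")]:
        G.add_edge(u, v, capacity=10, weight=1)
        G.add_edge(v, u, capacity=10, weight=1)

    flow = nx.max_flow_min_cost(G, "src", "sink")
    print(flow)   # 4 area units migrate b0 -> b1 -> b2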

LIT -- An Automatic Layout Generation Tool for Trapezoidal Association of Transistors for Basic Analog Building Blocks [p. 1106]
A. Girardi and S. Bampi

This paper describes a methodology for analog layout synthesis based on the automatic generation of an equivalent composite transistor with the same DC current characteristics as the transistors in the electrical schematic. The tool serves a dual purpose: i) the layout synthesis of analog blocks over a digital sea-of-gates pre-diffused array, and ii) the generation of custom associations of transistors for matched common-source input pairs and current mirrors. The LIT tool generates the layout of several blocks (trapezoidal-like composite transistors and matched transistor pairs) in a line-matrix, sea-of-gates style with gate isolation. In addition, the tool provides an environment for manual placement and automatic routing of analog cells. These features drastically reduce design time and cost while preserving matching properties.

Symbolic Analysis of Nonlinear Analog Circuits [p. 1108]
A. Manthe, Z. Li, R. Shi, and K. Mayaram

A new method is presented to symbolically model strongly nonlinear circuits characterized by Piece-Wise Linear (PWL) functions. The method follows the idea of Bokhoven and Leenaerts, and formulates the problem as a linear complementarity problem (LCP). A new graph representation, called a complementarity decision diagram, is introduced to represent the explicit LCP solutions. Orders-of-magnitude improvements in computational efficiency and memory usage have been obtained.

Improved Time Domain Simulation of Optical Multimode Intrasystem Interconnects [p. 1110]
J. Gerling, O. Stübbe, J. Schrage, G. Mrozynski, and J. Teich

To increase the bandwidth of high-performance intrasystem interconnections, optical multimode waveguides can be used. Since the design procedure for optical interconnections has to be widely compatible with conventional design processes, adequate simulation methods are required. This paper presents an improved time domain method for simulating signal transmission along optical multimode interconnections. The improvements result mainly from a more efficient method for the piecewise approximation of the waveguides' step responses by a few exponential functions. The adapted semi-analytical recursive convolution method decreases the computation times.

A New Crosstalk Noise Model for DOMINO Logic Circuits [p. 1112]
S. Choi and K. Roy

A new crosstalk noise model is proposed for DOMINO logic gates. Our noise model takes the effect of the keeper into account and provides a more accurate noise measure.

Modeling Noise Transfer Characteristic of Dynamic Logic Gates [p. 1114]
L. Ding and P. Mazumder

Dynamic noise analysis has recently been gaining attention as a definitive method to overcome glaring deficiencies of static noise analysis. Exact dynamic noise analysis requires modeling of both injected noise and propagated noise. In this paper, we develop a strategy to study the noise propagation problem. An efficient analytical formula has been derived to accurately model the noise waveform transfer characteristic of dynamic CMOS logic gates. Experiments show that the maximum error of the proposed model in peak propagated noise is less than 10%.


5P: Reconfigurable Computing and Systems Design

Heterogeneous Programmable Logic Block Architectures [p. 1118]
A. Koorapaty, V. Chandra, K. Tong, C. Patel, L. Pileggi, and H. Schmit

In this poster, we propose four new heterogeneous programmable logic blocks (PLBs) consisting of combinations of look-up tables (LUTs) of various sizes, multiplexers (MUXes), and logic gates. We demonstrate that these PLBs offer significant performance and density benefits over more homogeneous PLBs.

An Industrial/Academic Configurable System-On-Chip Project (CSoC): Coarse-Grain XPP-/Leon-Based Architecture Integration [p. 1120]
J. Becker, A. Thomas, M. Vorbach, and V. Baumgarte

This paper describes the current status and results of a dynamically Configurable System-on-Chip (CSoC) integration, consisting of a SPARC-compatible LEON processor core, a commercial coarse-grain XPP array of suitable size from PACT Informationstechnologie AG, and an application-tailored global/local memory topology with efficient AMBA-based communication interfaces. The given adaptive architecture is synthesized within an industrial/academic SoC project onto 0.18 and 0.13 um UMC CMOS technologies at Universitaet Karlsruhe (TH). Due to exponentially increasing CMOS mask costs, adaptivity of SoCs is now an essential aspect for industry; it can be realized by integrating reconfigurable, re-usable hardware parts of different granularities into Configurable Systems-on-Chip (CSoCs) [1].

Development of a Tool-Set for Remote and Partial Reconfiguration of FPGAs [p. 1122]
F. Moraes, D. Mesquita, J. Palma, L. Möller, and N. Calazans

This work describes the implementation of digital reconfigurable systems (DRS) using commercial FPGA devices. The main goal is to present a set of tools for remote and partial reconfiguration developed for the Virtex FPGA family. Even though the tools are targeted to a specific device, their building principles may easily be adapted to other FPGA families, if they have an internal architecture enabling partial reconfiguration. The main contribution of the paper is the tool-set proposed to manipulate cores using partial reconfiguration in existing FPGAs.

Mapping Applications to an FPFA Tile [p. 1124]
M. Rosien, Y. Guo, G. Smit, and T. Krol

This paper introduces a transformational design method which can be used to map code written in a high level source language, like C, to a coarse grain reconfigurable architecture. The source code is first translated into a Control Dataflow graph (CDFG), which is minimized using a set of behaviour preserving transformations such as dependency analysis, common subexpression elimination, etc. After applying graph clustering, scheduling and allocation transformations on this minimized graph, it can be mapped onto the target architecture.

Load Distribution with the Proximity Congestion Awareness in a Network On Chip [p. 1126]
E. Nilsson, M. Millberg, J. Öberg, and A. Jantsch

In Networks on Chip, NoC, very low cost and high performance switches will be of critical importance. For a regular two-dimensional NoC we propose a very simple, memoryless switch. In case of congestion, packets are emitted in a non-ideal direction, also called deflective routing. To increase the maximum tolerable load of the network, we propose a Proximity Congestion Awareness, PCA, technique, where switches use load information of neighbouring switches, called stress values, for their own switching decisions, thus avoiding congested areas. We present simulation results with random traffic which show that the PCA technique can increase the maximum traffic load by a factor of over 20.
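
A minimal sketch of such a switching decision; the distance/stress weighting below is our assumption, not the paper's exact policy:

    def pick_port(pos, dst, stress):
        # Choose among N/E/S/W; lower score wins. `stress` maps each port to
        # the load value advertised by the neighbouring switch on that port.
        x, y = pos
        moves = {"N": (x, y + 1), "E": (x + 1, y),
                 "S": (x, y - 1), "W": (x - 1, y)}
        def score(port):
            px, py = moves[port]
            dist = abs(dst[0] - px) + abs(dst[1] - py)  # Manhattan distance
            return dist + 0.5 * stress[port]            # congestion penalty
        return min(moves, key=score)

    # the eastern neighbour is congested, so the packet deflects north
    print(pick_port((1, 1), (3, 1), {"N": 0, "E": 6, "S": 0, "W": 0}))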

Micro-Network for SoC: Implementation of a 32-Port SPIN Network [p. 1128]
A. Andriahantenaina and A. Greiner

We present a physical implementation of a 32-port SPIN micro-network. For a 0.13 micron CMOS process, the total area is 4.6 mm2, with a cumulated bandwidth of about 100 Gbits/s. In a 6-metal process, all the routing wires can be routed on top of the switching components. The SPIN32 macro-cell will be fabricated by ST Microelectronics, but this macro-cell uses symbolic layout and can be manufactured with any CMOS process providing 6 metal layers.

A Fully Self-Timed Bit-Serial Pipeline-Architecture for Embedded Systems [p. 1130]
A. Rettberg, M. Zanella, C. Bobda, and T. Lehmann

Area minimization, low power and high performance are objectives to be reached in chip design. Bit-serial architectures offer a great advantage over bit-parallel architectures as regards area minimization. One field of application of such an architecture is signal processing, e.g. in terms of digital filters or digital controllers. These algorithms may be realized in hardware by means of the proposed architecture; this requires only a small chip area and an equally small number of input and output pins, thus reducing the size and the complexity of the printed circuit. The speed of bit-serial processing is high enough for the application domain in question; e.g., in electrical motor-current control there are the delays of input/output converters (e.g., A/Ds and D/As) and those resulting from the inertia of the motor. In synchronous designs, the performance of these architectures is affected by the long lines which are used to control the operators and by the gated clocks. The architecture described in this contribution avoids long control lines by a local distribution of the control circuitry at the operator level. To our knowledge, this is the second paper detailing the implementation of a fully interlocked asynchronous architecture, after that described in [1].

Library Functions Timing Characterization for Source-Level Analysis [p. 1132]
C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto

Execution time estimation of software at source-level is nowadays a crucial phase of the system design flow, especially for portable devices and real-time systems. From a source-level perspective, a call to an external library function is a black-box: only the binary code of such functions is, in fact, available. This paper proposes a methodology for library functions analysis within a source-level estimation framework.

G-MAC: An Application-Specific MAC/Co-Processor Synthesizer [p. 1134]
A. Chang, W. Kuo, T. Hwang, and A. Wu

A modern special-purpose processor (e.g., for image and graphics applications) usually contains a set of instructions supporting complex multiply operations. These instructions perform a variety of multiply operations with various data bitwidths and concurrent-execution requirements. For instance, such an instruction set may include instructions to perform signed/unsigned 32X32, signed/unsigned dual 16X16, and signed/unsigned 8X8 MAC operations. Typically, a co-processor or a complex MAC (Multiplier-ACcumulator) unit is required to execute those instructions. Developing such a complex MAC/co-processor involves a series of design tasks including micro-architecture design, component allocation/binding, interconnect binding, pipeline insertion and control generation. This design process is non-trivial, time-consuming and error-prone, and is usually performed by experienced design engineers. In this paper, we present a synthesis method for application-specific MAC/co-processor generation. The MAC/co-processor synthesis problem is defined as follows: given a set of instructions and the number of execution cycles for each instruction, generate a MAC/co-processor design (including a data-path and a control unit) such that the total area cost is minimized subject to the given execution cycle constraints.
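
For illustration, the instruction semantics such a unit must cover might look as follows (bit widths, packing and accumulator size are assumed for the sketch, not taken from the paper):

    MASK16, MASK64 = 0xFFFF, (1 << 64) - 1

    def sext16(v):
        # Sign-extend a 16-bit field.
        return v - 0x10000 if v & 0x8000 else v

    def mac32(acc, a, b):
        # acc += a * b for unsigned 32-bit operands, 64-bit accumulator.
        return (acc + a * b) & MASK64

    def dual_mac16(acc, a, b, signed=True):
        # acc += a.hi * b.hi + a.lo * b.lo on packed 16-bit halves.
        ah, al = (a >> 16) & MASK16, a & MASK16
        bh, bl = (b >> 16) & MASK16, b & MASK16
        if signed:
            ah, al, bh, bl = map(sext16, (ah, al, bh, bl))
        return (acc + ah * bh + al * bl) & MASK64

    a = (3 << 16) | 0xFFFE            # packs +3 and -2
    b = (2 << 16) | 5
    print(dual_mac16(0, a, b))        # 3*2 + (-2)*5 = -4, modulo 2^64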

Power Constrained High-Level Synthesis of Battery Powered Systems [p. 1136]
S. Nielsen and J. Madsen

We present a high-level synthesis algorithm solving the combined scheduling, allocation and binding problem, minimizing area under both latency and maximum-power-per-clock-cycle constraints. Our approach eliminates large power spikes, resulting in an increased battery lifetime, a property of utmost importance for battery-powered embedded systems. Our approach extends the partial-clique partitioning algorithm of [3] by introducing power awareness through a heuristic algorithm which bounds the design space to that of power-feasible schedules. We have applied our algorithm to a set of dataflow graphs and investigated the impact on circuit area when applying different power constraints.
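
A minimal sketch of the central idea: a scheduler that defers ready operations whenever issuing them would exceed the per-cycle power budget, trading latency slack against power spikes. The list-scheduling form, the toy dataflow graph and the power numbers are our own illustration; the paper extends the partial-clique partitioning algorithm of [3] instead:

    def schedule(ops, deps, power, budget):
        # ops: op names; deps: op -> set of predecessors; power: op -> power
        # per cycle; returns op -> start cycle, assuming one cycle per op.
        start, cycle = {}, 0
        while len(start) < len(ops):
            used = 0.0
            ready = [o for o in ops if o not in start and
                     all(start.get(p, cycle) < cycle for p in deps.get(o, ()))]
            for o in ready:
                if used + power[o] <= budget:   # defer op to avoid a spike
                    start[o] = cycle
                    used += power[o]
            cycle += 1
        return start

    dfg = {"mul1": set(), "mul2": set(), "add": {"mul1", "mul2"}}
    pw = {"mul1": 3.0, "mul2": 3.0, "add": 1.0}
    # budget 4.0 forbids firing both multipliers in the same cycle
    print(schedule(dfg.keys(), dfg, pw, budget=4.0))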

PARLAK: Parametrized Lock Cache Generator [p. 1138]
B. Akgul and V. Mooney

A system-on-chip lock cache (SoCLC) is an intellectual property (IP) core that provides effective lock synchronization in a heterogeneous multiprocessor shared-memory system-on-a-chip (SoC). We present PARLAK, a parameterized lock cache generator tool. PARLAK generates a synthesizable SoCLC architecture with a user-specified number of lock variables and a user-specified number and type(s) of processor(s). PARLAK can generate a full range of customized SoCLCs, from a version for two processors with 32 lock variables occupying 1,790 gates to a version for 14 processors with 256 lock variables occupying 37,380 gates (in TSMC 0.25 um technology). PARLAK is an important contribution to IP-generator tools for both custom and reconfigurable SoC designs.

A Secure Web-Based Framework for Electronic System Level Design [p. 1140]
T. Kazmierski and X. Yang

This contribution presents the concept of a secure implementation of a distributed, web-based electronic design framework. Our three-tier client-webserver-toolserver architecture has been extended to support permanent databases for collaborative, distributed design development. In the sample application of the framework, developed in Java, any of the servers can be a Java-servlet-enabled server based on Linux, MS Windows or Sun SPARC. The feasibility of a secure, web-based design framework has been demonstrated. The technological approach involves novel yet tried-and-tested methods, borrowing where appropriate from the rapidly advancing field of e-commerce solutions.


6P: Low Power Design and Estimation, Verification and Testing

Background Data Organisation for the Low-Power Implementation in Real-Time of a Digital Audio Broadcast Receiver on a SIMD Processor [p. 1144]
P. Op de Beeck, C. Ghez, E. Brockmeyer, M. Miranda, F. Catthoor, and G. Deconinck

In this work we illustrate the strong interaction between the data organisation in background memory and the data format required for sub-word-level acceleration. The impact of this interaction is demonstrated on the implementation of a Digital Audio Broadcast channel decoder on a TriMedia processor, where data format transformations applied to the background memory data enable a substantially better exploitation of the available Single Instruction Multiple Data instructions. As a result, a factor-of-two reduction in both execution time and data memory energy is achieved.

Compiler Support for Reducing Leakage Energy Consumption [p. 1146]
W. Zhang, M. Kandemir, N. Vijaykrishnan, M. Irwin, and V. De

Current trends indicate that leakage energy consumption will be an important concern in upcoming process technologies. In this paper, we propose a compiler-based leakage energy optimization strategy. Our strategy is built upon a data-flow analysis that identifies basic blocks that do not use a given functional unit. Based on this information, the compiler then inserts activate/deactivate instructions in the code to set/reset a sleep signal which controls leakage current for functional units. Our experimental results show that the proposed compiler-based strategy is very effective in reducing leakage energy of functional units.
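
A minimal sketch of the flavour of this transformation: walk the basic blocks and, around any maximal run of blocks that never uses a given functional unit, emit deactivate/activate instructions. The IR, the per-block usage sets, and the linear (rather than full data-flow) analysis are simplifications for illustration:

    def insert_sleep(blocks, fu):
        # blocks: list of (name, set of functional units used in the block).
        # Returns an instruction stream with sleep control for `fu` inserted.
        out, asleep = [], False
        for name, used in blocks:
            if fu not in used and not asleep:
                out.append(f"deactivate {fu}")
                asleep = True
            elif fu in used and asleep:
                out.append(f"activate {fu}")
                asleep = False
            out.append(f"block {name}")
        if asleep:
            out.append(f"activate {fu}")
        return out

    cfg = [("B1", {"alu"}), ("B2", {"alu"}),
           ("B3", {"alu", "mul"}), ("B4", {"alu"})]
    print("\n".join(insert_sleep(cfg, "mul")))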

An Analytical Model for Predicting the Remaining Battery Capacity of Lithium-Ion Batteries [p. 1148]
P. Rong and M. Pedram

Predicting the residual energy of the battery source that powers a portable electronic device is quite important in designing and employing an effective dynamic power management policy for the device. This paper presents a closed-form analytical model for predicting the remaining capacity of a lithium-ion battery. The proposed high-level model relies on online current and voltage measurements and, at the same time, correctly accounts for temperature and cycle-aging effects. The accuracy of the high-level model is validated by comparing our analytical model with Dualfoil simulation results (under some simplifying assumptions), demonstrating a 5% error between simulated and predicted data.
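
A generic sketch of what such a high-level model looks like; the functional form and every coefficient below are placeholders for illustration, not the closed form derived and validated in the paper:

    import math

    def remaining_capacity(c_full, drawn_ah, i_load, temp_c, cycles,
                           alpha=0.15, beta=0.008, gamma=2e-4, t_ref=25.0):
        # Usable capacity left, in Ah.
        #   alpha: rate dependence (higher current -> less usable capacity)
        #   beta:  sensitivity per degree C below the reference temperature
        #   gamma: per-cycle aging loss
        rate_eff = 1.0 / (1.0 + alpha * math.log1p(i_load))
        temp_eff = 1.0 - beta * max(0.0, t_ref - temp_c)
        aging = math.exp(-gamma * cycles)
        return max(0.0, c_full * rate_eff * temp_eff * aging - drawn_ah)

    print(f"{remaining_capacity(1.2, 0.4, 0.6, 10.0, 300):.3f} Ah")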

Simultaneous Dynamic Voltage Scaling of Processors and Communication Links in Real-Time Distributed Embedded Systems [p. 1150]
J. Luo, N. Jha, and L. Peh

Dynamic voltage scaling has been widely acknowledged as a powerful technique for trading off power consumption and delay for processors. Recently, variable-frequency (and variable-voltage) parallel and serial links have also been proposed, which can save link power consumption by exploiting variations in bandwidth requirement. In this paper, we address joint dynamic voltage scaling for variable-voltage processors and communication links in such systems. We propose a scheduling algorithm for real-time applications, with both data flow and control flow information captured. It performs efficient routing of communication events through multi-hops, as well as efficient slack allocation among heterogeneous processors and communication links to maximize energy savings, while meeting all real-time constraints.

Decomposition of Extended Finite State Machine for Low Power Design [p. 1152]
M. Lee, T. Hwang, and S. Huang

Power reduction can be achieved by turning off portions of circuits that are idle. Unlike previous work, which focused only on either the controller or the data-path, we propose a decomposition technique that takes both controller and data-path into consideration: in the former, the state probability of the FSM provides the execution frequency of operations in each state, while in the latter, the operations performed in the states provide the resource requirements that can be used to determine resource sharing among states. The goal is to synthesize circuits such that the sub-machine with small area (data-path and controller) is turned on most of the time (high state probability) and all other parts are turned off. Our experimental results show that, on average, 10% area reduction and 24% power reduction can be achieved compared with designs without decomposition.

Equisolvability of Series vs. Controller's Topology in Synchronous Language Equations [p. 1154]
N. Yevtushenko, T. Villa, R. Brayton, A. Sangiovanni-Vincentelli, and A. Petrenko

Given a plant MA and a specification MC, the largest solution of the FSM equation MX · MA <= MC contains all possible discrete controllers MX. Often we are interested in computing the complete solutions whose composition with the plant is exactly equivalent to the specification. Not every solution contained in the largest one satisfies this property, which holds instead for the complete solutions of the series topology. We study the relation between the solvability of an equation for the series topology and of the corresponding equation for the controller's topology. We establish that, if MA is a deterministic FSM, then the FSM equation MX · MA <= MC is solvable for the series topology with an unknown head component iff it is solvable for the controller's topology. Our proof is constructive, i.e., for a given solution MB of the series topology it shows how to build a solution MD of the controller's topology and vice versa.

Using RTL Statespace Information and State Encoding for Induction Based Property Checking [p. 1156]
M. Wedler, D. Stoffel, and W. Kunz

This paper focuses on checking safety properties for sequential circuits specified on the RT-level. We study how different state encodings can be used to create a gate-level representation of the circuit that facilitates the computation of effective invariants for induction-based property checking. Our experiments show the strong impact of state encoding on the efficiency of the induction process.

Combining Simulation and Guided Traversal for the Verification of Concurrent Systems [p. 1158]
E. Pastor and M. Peña

We present a hybrid methodology that combines simulation and symbolic traversal in order to improve invariant checking. The methodology concentrates on concurrent systems, whose peculiarities are not fully exploited by other existing techniques for hybrid verification. Our approach exploits the information obtained from simulations to improve the knowledge of the state space, effectively guiding symbolic traversal.

Selectively Clocked CMOS Logic Style for Low-Power Noise-Immune Operations in Scaled Technologies [p. 1160]
N. Sirisantana and K. Roy

This paper proposes Selectively Clocked Logic (SCL) style based on skewed logic for noise-tolerant low-power high-performance applications. Variations of the logic style with multiple threshold voltage (MVth-SCL) and multiple oxide thickness (Mtox-SCL) techniques are also studied. Simulation results indicate that SCL, MVth-SCL, and Mtox-SCL circuits reduce the total power consumption (leakage plus switching power) of the ISCAS benchmark circuits by 51.5%, 53.1%, and 69.6%, respectively, with over 25% improvement in noise immunity compared to Domino circuits with comparable performance.

Self-Testing Embedded Checkers for Bose-Lin, Bose, and a Class of Borden Codes [p. 1162]
S. Tarnick

A new approach for designing t-UED and BUED code checkers is presented. In particular, we consider Borden codes for t = 2^k - 1, Bose codes and Bose-Lin codes. The design technique for all three checker types follows the same principle, which is mainly based on averaging weights and check symbol values of the code words. The checkers are very well suited for use as embedded checkers since they are self-testing with respect to single stuck-at faults under very weak assumptions. All three checker types can be tested by 2 or 3 code words.

Non-Intrusive Concurrent Error Detection in FSMs through State/Output Compaction and Monitoring via Parity Trees [p. 1164]
P. Drineas and Y. Makris

We discuss a non-intrusive methodology for concurrent error detection in FSMs. The proposed method is based on compaction and monitoring of the state/output bits of an FSM via parity trees. While errors may affect more than one state/output bit, not all combinations of state/output bits constitute potential erroneous cases for a given fault model. Therefore, it is possible to compact them without loss of error information. Thus, concurrent error detection is performed through hardware that predicts the values of the compacted state/output bits and compares them to the actual values of the FSM. In order to minimize the incurred hardware overhead, a randomized algorithm is proposed for selecting the minimum number of required parity functions.
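
A minimal sketch of the compaction-and-monitoring idea; the parity groups below are fixed by hand, whereas the paper selects them with a randomized algorithm against the fault model:

    from functools import reduce

    GROUPS = [(0, 2), (1, 3, 4)]   # state/output bit positions per parity tree

    def parities(bits):
        # The compacted signature the predictor hardware must reproduce.
        return tuple(reduce(lambda acc, i: acc ^ bits[i], g, 0) for g in GROUPS)

    def detects(error_bits):
        # An error (set of flipped positions) is caught iff it changes the
        # parity of at least one group; even flips inside a group alias.
        return any(sum(1 for i in g if i in error_bits) % 2 for g in GROUPS)

    golden = [1, 0, 1, 1, 0]
    print(parities(golden))          # predicted parity values: (0, 1)
    print(detects({0}))              # True: a single flip is caught
    print(detects({0, 2}))           # False: a double flip in one group aliases

The second check illustrates why group selection must respect which error combinations the fault model can actually produce.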


7P: System Level Design and Specification and Testing Techniques

SAT-Based Techniques in System Synthesis [p. 1168]
C. Haubelt, J. Teich, R. Feldmann, and B. Monien

In this paper, we show how to integrate SAT-based techniques into the task of system synthesis by addressing two problems: (i) feasibility checking and (ii) evaluation of quality.

Refinement of Mixed-Signal Systems with SystemC [p. 1170]
C. Grimm, C. Meise, W. Heupke, and K. Waldschmidt

This paper gives an overview of a design-methodology-specific extension library for SystemC. The supported design methodology successively refines an executable specification to a concrete mixed-signal architecture. A first prototype, the ASC library, has been implemented and evaluated.

Polychrony for Refinement-Based Design [p. 1172]
J. Talpin, P. Le Guernic, S. Shukla, R. Gupta, and F. Doucet

Rising complexities and performances of integrated circuits and systems, shortening time-to-market demands for electronic equipment, growing installed bases of intellectual property, and requirements for adapting existing IPs to new services all stress high-level design as a prominent research topic and call for the development of appropriate methodological solutions. To this aim, system design based on the so-called "synchronous hypothesis" consists of abstracting away the non-functional implementation details of a system, letting one benefit from focused reasoning on the logic behind the instants at which the system functionalities should be secured. From this point of view, synchronous design models and languages provide intuitive models for integrated circuits. This affinity explains the ease of generating synchronous circuits and verifying their functionalities using compilers and related tools that implement this approach. In the relational model of the SIGNAL/POLYCHRONY design language/platform [3,5], this affinity goes beyond the domain of purely synchronous circuits to embrace the context of architectures consisting of synchronous circuits and desynchronization protocols: GALS architectures. The unique features of this model are to provide the notion of polychrony, the capability to describe multi-clocked (or partially clocked) circuits and systems, and to support formal design refinements, from the early stages of requirements specification to the later stages of synthesis and deployment, using formal verification techniques.

Automatic Generation of Simulation Monitors from Quantitative Constraint Formula [p. 1174]
X. Chen, H. Hsieh, F. Balarin, and Y. Watanabe

System design methodology is poised to become the next big enabler for highly sophisticated electronic products. Design verification continues to be a major challenge, and simulation will remain an important tool for making sure that implementations perform as they should. In this paper we present algorithms to automatically generate C++ checkers from any formula written in the formal quantitative constraint language Logic Of Constraints (LOC). The generated executable can then be used to analyze simulation traces for constraint violations and output debugging information. Different checkers can be generated for fast analysis under different memory limitations. LOC is particularly suitable for the specification of system-level quantitative constraints where relative coordination of instances of events, not lower-level interaction, is of paramount concern. We illustrate the usefulness and efficiency of our automatic trace analysis methodology with case studies on large simulation traces from various system-level designs.
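
As a flavour of the generated checkers, here in Python rather than the emitted C++, for an invented LOC-style latency formula "t(out[i]) - t(in[i]) <= 10":

    def check_latency(trace, bound=10):
        # trace: list of (event_name, timestamp); instances of "in"/"out"
        # are matched by order of occurrence (the index variable i in LOC).
        t_in = [t for e, t in trace if e == "in"]
        t_out = [t for e, t in trace if e == "out"]
        for i, (ti, to) in enumerate(zip(t_in, t_out)):
            if to - ti > bound:
                print(f"violation at instance {i}: t(in)={ti}, t(out)={to}")
                return False
        return True

    trace = [("in", 0), ("out", 4), ("in", 5), ("out", 21)]
    print(check_latency(trace))      # instance 1 violates the 10-cycle bound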

Consequences of RAM Bitline Twisting for Test Coverage [p. 1176]
I. Schanstra and J. van de Goor

In order to reduce coupling effects between bitlines in static or dynamic RAMs, bitline twisting can be used in the design. For testing, however, this has consequences for the data backgrounds to be used. A generic twisting scheme is introduced and the involved fault models are identified.

An Approach to the Classification of Mixed-Signal Circuits in a Pseudorandom Testing Scheme [p. 1178]
F. Corsi, C. Marzocca, and G. Matarrese

BIST methods based on the application of a pseudorandom pulse sequence as input stimulus have recently been suggested for testing linear time-invariant (LTI) analog parts [1-3]. In such methods, the input-output cross-correlation function Rxy(t) gives a good estimate of the impulse response of the DUT, h(t), provided the autocorrelation of the pseudorandom input sequence approaches a single Dirac pulse delta(t) [4]. Much work has been devoted to the investigation of several aspects of pseudorandom testing techniques, such as the impact of the finite length of the input pulse sequence, the duration of the single pulse, the ADC resolution, etc. [1-3]. In particular, ADCs with low resolution can be employed to sample the DUT response, as the averaging effect intrinsically performed by the input-output cross-correlation operation greatly reduces the effects of the quantization errors [2]. As a consequence, from a BIST point of view, the test of very fast DUTs is possible, since ADCs with low accuracy and fast sampling rate are easily available on chip. In this paper we focus on the issues related to the choice of a suitable set of samples of the cross-correlation function Rxy(mi), i = 1,...,n, as DUT signature, and, at the same time, to the definition of an effective classification procedure [5-6]. In particular, we make the choice of the DUT signature on the basis of the sensitivities of the cross-correlation samples to the circuit specifications sj, expressed in terms of the partial derivatives ∂Rxy(mi)/∂sj. This sensitivity study is carried out by considering a large set RTS of good instances of the circuit, generated in simulation by independently varying the performance parameters of a high-level description of the DUT within their acceptance ranges (±5%). The same set of good circuits is used to extract the initial acceptance ranges [Rxymin(mi), Rxymax(mi)] of the samples Rxy(mi) selected as signature components, and thus it constitutes the starting point of the subsequent classification procedure.
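
A minimal sketch of the underlying estimation step: with a pseudorandom ±1 stimulus, the input-output cross-correlation recovers the DUT impulse response. The DUT, the sequence length and the normalization are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.choice([-1.0, 1.0], size=4096)    # pseudorandom +/-1 stimulus
    h = np.array([0.5, 0.3, 0.15, 0.05])      # "unknown" DUT impulse response
    y = np.convolve(x, h)[: len(x)]           # sampled DUT response

    # Rxy(m) ~ h(m) because Rxx approximates a Dirac pulse for such stimuli
    rxy = np.array([np.dot(x[: len(x) - m], y[m:]) for m in range(len(h))])
    print(np.round(rxy / len(x), 3))          # close to h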

Test Generation for Acyclic Sequential Circuits with Single Stuck-At Fault Combinational ATPG [p. 1180]
H. Ichihara and T. Inoue

A test generation method based on the time-expansion model can achieve high fault efficiency for acyclic sequential circuits [1, 2]. While the model is a combinational circuit, a single stuck-at fault in the original circuit is represented by a multiple stuck-at fault in this model. This paper proposes a test generation method for acyclic sequential circuits based on a circuit model, called the MS-model, which can express multiple stuck-at faults in the time-expansion model as single stuck-at faults. Our procedure can generate test sequences for acyclic sequential circuits with just a combinational test pattern generation algorithm for single stuck-at faults.

Comparison of Test Pattern Decompression Techniques [p. 1182]
O. Novák

Test pattern decompression techniques are bound up with the algorithms for test pattern ordering and test data flow control. Some of the methods have a more sophisticated sorting algorithm, some may be supported by fault simulation, and some can work without any information about the quality of previously generated test patterns. These differences between the decompression techniques mean that the efficiencies of the automata are not directly comparable, and the published results do not give any indication of the best choice of decompressing automaton. We have performed experiments in which we studied the degradation of random input stimuli by the decompression automata (DA) used by different authors. We have found that the average percentage of the (n,r)-exhaustive test set created on the DA outputs, after stimulating the DA inputs with a given number of random patterns, is correlated with the DA efficiency. This statement has been verified by experiments with the ISCAS circuits.

Evolutionary Optimization of Markov Sources for Pseudo Random Scan BIST [p. 1184]
I. Polian, B. Becker, and S. Reddy

Recent work [1] showed that Markov sources lead to scan BIST designs of lower cost compared to earlier proposed scan BIST methods. However, the method presented in [1] utilizes tests generated by a deterministic test generator for the target faults when synthesizing the Markov source. The requirement of a deterministic test generator may hinder the use of this procedure in industrial settings, since the BIST tool must also include a deterministic ATPG tool, which may add to the cost of the BIST tool. In this paper we investigate a procedure to synthesize BIST controllers with Markov sources for test generation using Evolutionary Algorithms (EAs). This allows us to avoid using the deterministic ATPG needed in [1]. Additionally, we do not employ the inversion logic used in [1], thereby potentially reducing the hardware in the BIST controller. Nevertheless, the proposed method achieves close to 100% fault efficiency using far fewer tests than required by pseudorandom tests. In this work, similarly to [1], we employ Markov sources based on finite state machines (FSMs) with transitions controlled by additional inputs. Hence, the transition probabilities (TPs) of the FSM are determined by the 1-probability on a corresponding input.

Test Data Compression Based on Output Dependence [p. 1186]
I. Pomeranz and S. Reddy

In a large circuit it is common to find that an output of the circuit depends structurally on a proper subset of the circuit inputs. We use this observation to provide test data compression. The proposed approach can be used in addition to test data compression techniques based on encoding. Structural output dependence is utilized in this work to provide data compression as follows. Consider a test set T that needs to be applied to the circuit. Suppose that the circuit inputs can be divided into subsets A0, A1, ..., Am-1 such that for every circuit output bj there exists an input subset Ak satisfying the condition that bj depends only on inputs in Ak. For simplicity, we also assume that all the subsets Ak contain the same number of inputs, denoted by NA (later we will ensure that this condition is satisfied). The value of NA is determined by the output that depends on the maximum number of inputs. Let the number of circuit inputs be NC. If NA << NC, it is possible to use the input subsets in order to reduce the amount of input test data by loading into the circuit patterns of length NA instead of loading patterns of length NC.
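
A minimal sketch of the loading scheme described above; the input subsets, their size NA, and the patterns are invented for illustration:

    NC = 12
    SUBSETS = [range(0, 4), range(4, 8), range(8, 12)]   # A0, A1, A2; NA = 4

    def compress(full_pattern, subset_id):
        # Keep only the NA care bits of the addressed input subset.
        return subset_id, [full_pattern[i] for i in SUBSETS[subset_id]]

    def expand(subset_id, short_pattern, fill=0):
        # Rebuild an NC-bit pattern; the other inputs get a don't-care fill.
        pattern = [fill] * NC
        for bit, i in zip(short_pattern, SUBSETS[subset_id]):
            pattern[i] = bit
        return pattern

    sid, short = compress([1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], subset_id=0)
    print(short, "->", expand(sid, short))
    print(f"loaded {len(short)} bits instead of {NC}")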

A Unified Approach for SOC Testing Using Test Data Compression and TAM Optimization [p. 1188]
V. Iyengar, A. Chandra, S. Schweizer, and K. Chakrabarty

We integrate for the first time test access mechanism (TAM) optimization and test data compression into a single test methodology. We show how an integrated test architecture based on TAMs and test data decoders can be designed. The proposed approach offers considerable savings in test data volume and testing time. Two case studies using the integrated test architecture are presented.