SIGDA, DATE 2004, Abstracts

DATE 2004 ABSTRACTS

Volume I

Plenary : Keynote Session

Moderator: J. Figueras, UP Catalunya, ES

Opportunities and Challenges in Building Silicon Products in 65nm and Beyond [p. 2]: G. Spirakis

1A: Architectural-Level Power Management

Moderators: J. Henkel, NEC, US; A. Macii, Politecnico di Torino, IT

Fine-Grained Dynamic Voltage and Frequency Scaling for Precise Energy and Performance Trade-Off Based on the Ratio of Off-Chip Access to On-Chip Computation Times [p. 4]

K. Choi, R. Soma, and M. Pedram

This paper presents an intra-process dynamic voltage and frequency scaling (DVFS) technique targeted toward non-real-time applications running on an embedded system platform. The key idea is to make use of runtime information about the external memory access statistics in order to perform CPU voltage and frequency scaling with the goal of minimizing the energy consumption while translucently controlling the performance penalty. The proposed DVFS technique relies on dynamically-constructed regression models that allow the CPU to calculate the expected workload and slack time for the next time slot, and thus, adjust its voltage and frequency in order to save energy while meeting soft timing constraints. This is in turn achieved by estimating and exploiting the ratio of the total off-chip access time to the total on-chip computation time. The proposed technique has been implemented on an XScale-based embedded system platform and actual energy savings have been calculated by current measurements in hardware. For memory-bound programs, a CPU energy saving of more than 70% with a performance degradation of 12% was achieved. For CPU-bound programs, a CPU energy saving of 15-60% was achieved at the cost of a 5-20% performance penalty.
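The role of the off-chip to on-chip time ratio can be illustrated with a minimal sketch. It assumes a simple execution-time model in which on-chip computation time stretches in proportion to the clock slowdown while off-chip access time is unaffected by CPU frequency. The function names, the frequency list and the model itself are illustrative assumptions, not the authors' regression-based implementation.

```python
def predicted_time(t_on_chip, t_off_chip, f, f_max):
    """Assumed model: on-chip time scales by f_max / f, off-chip time is fixed."""
    return t_on_chip * (f_max / f) + t_off_chip


def pick_frequency(t_on_chip, t_off_chip, freqs, max_slowdown):
    """Return the lowest frequency whose predicted slowdown stays within budget.

    t_on_chip and t_off_chip are the times measured at the maximum frequency;
    max_slowdown is the tolerated fractional performance loss (0.25 = 25%).
    """
    f_max = max(freqs)
    baseline = predicted_time(t_on_chip, t_off_chip, f_max, f_max)
    budget = baseline * (1.0 + max_slowdown)
    for f in sorted(freqs):  # try the slowest (lowest-energy) setting first
        if predicted_time(t_on_chip, t_off_chip, f, f_max) <= budget:
            return f
    return f_max


freqs_mhz = [200, 400, 600, 800]  # hypothetical frequency settings
memory_bound = pick_frequency(1.0, 9.0, freqs_mhz, 0.25)
cpu_bound = pick_frequency(9.0, 1.0, freqs_mhz, 0.25)
```

Under this model a memory-bound workload tolerates a much lower clock than a CPU-bound one for the same performance budget, which is consistent with the larger savings reported for memory-bound programs.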

Hybrid Architectural Dynamic Thermal Management [p. 10]

K. Skadron

When an application or external environmental conditions cause a chip's cooling capacity to be exceeded, dynamic thermal management (DTM) dynamically reduces the power density on the chip to maintain safe operating temperatures. The challenge is that even though this reduction in power density reduces heat dissipation and can be used to regulate temperature and reduce the need for expensive thermal packages, reducing power density may come at a cost in execution speed. This paper shows the importance of processor-architecture techniques for DTM, and proposes a new, "hybrid," low-overhead implementation based on combining fetch gating and dynamic voltage scaling (DVS). When thermal stress is low, fetch gating is superior because it exploits instruction-level parallelism (ILP). Once thermal stress becomes severe enough that fetch gating degrades ILP, DVS is engaged instead to take advantage of its greater ability to reduce power density. We show that under a variety of assumptions about DVS implementation, a hybrid policy reduces DTM performance overhead by 25% on average compared to DVS, and is easy to design.

Value-Conscious Cache: Simple Technique for Reducing Cache Access Power [p. 16]

Y. Chang, C. Yang, and F. Lai

Most microprocessors employ on-chip caches to bridge the performance gap between the processor and main memory. However, the cache accesses usually contribute significantly to the total power consumption of the chip. Based on the observation that an overwhelming majority of the cache access bits are '0', in this paper we propose a value-conscious (VC) cache to reduce the average cache power consumption during an access. Unlike the conventional cache with differential-bitline implementation, the VC cache is a single-bitline design. Depending on the access bit value, the VC cache can dynamically prevent the bitline from being discharged such that the power dissipated in accessing '0' is much less than the power dissipated in accessing '1'. The implementation of the VC cache is a circuit-level technique, which is software independent and orthogonal to other low power techniques at architecture-level. The experimental results based on the SPEC2000 and MediaBench traces show that, without compromising either performance or stability, by exploiting the prevalence of '0' bits in access data the VC cache can reduce the average cache read and write power by about 18-22% and 36-40%, respectively.

State-Preserving vs. Non-State-Preserving Leakage Control in Caches [p. 22]

D. Parikh, K. Sankaranarayanan, Y. Li, K. Skadron, Y. Zhang, and M. Stan

This paper compares the effectiveness of state-preserving and non-state-preserving techniques for leakage control in caches by comparing drowsy cache and gated-V_ss for data caches using 70nm technology parameters. To perform the comparison, we introduce 'HotLeakage', a new architectural model for subthreshold and gate leakage that explicitly models the effects of temperature, voltage, and parameter variations, and has the ability to recalculate leakage currents dynamically as temperature and voltage change at runtime due to operating conditions, DVS techniques, etc. By comparing drowsy-cache and gated-V_ss at different L2 latencies and different gate oxide thickness values, we are able to identify a range of operating parameters at which gated-V_ss is more energy efficient than drowsy-cache, even though gated-V_ss does not preserve data in cache lines that have been deactivated. We are also able to show potential further benefits of gated-V_ss if an effective dynamic adaptation technique can be found. These results debunk a fairly widespread belief that state-preserving techniques are inherently superior to non-state-preserving techniques.

1B: Formal Verification Using Functional and Structural Information

Moderators: A. Veneris, Toronto U, CA; K. Winkelmann, Infineon Technologies, DE

Arithmetic Reasoning in DPLL-Based SAT Solving [p. 30]

M. Wedler, D. Stoffel, and W. Kunz

We propose a new arithmetic reasoning calculus to speed up a SAT solver based on the Davis-Putnam-Logemann-Loveland (DPLL) procedure. It is based on an arithmetic bit level description of the arithmetic circuit parts and the property. This description can easily be provided by the front-end of an RTL property checker. The calculus yields significant speedup and more robustness on hard SAT instances derived from the formal verification of arithmetic circuits.

Enhanced Diameter Bounding via Structural Transformation [p. 36]

J. Baumgartner and A. Kuehlmann

Bounded model checking (BMC) has gained widespread industrial use due to its relative scalability. Its exhaustiveness over all valid input vectors allows it to expose arbitrarily complex design flaws. However, BMC is limited to analyzing only a specific time window, hence will only expose those flaws which manifest within that window and thus cannot readily prove correctness. The diameter of a design has thus become an important concept -- a bounded check of depth equal to the diameter constitutes a complete proof. While the diameter of a design may be exponential in the number of its state elements, in practice it often ranges from tens to a few hundred regardless of design size. Therefore, a powerful diameter over-approximation technique may enable automatic proofs that otherwise would be infeasible. Unfortunately, exact diameter calculation requires exponential resources, and over-approximation techniques may yield exponentially loose bounds. In this paper, we provide a general approach for enabling the use of structural transformations, such as redundancy removal, retiming, and target enlargement, to tighten the bounds obtained by arbitrary diameter approximation techniques. Numerous experiments demonstrate that this approach may significantly increase the set of designs for which practically useful diameter bounds may be obtained.
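To make the notion of a completeness depth concrete, here is a minimal explicit-state sketch. It assumes one common reading of the diameter, namely the largest breadth-first depth at which a state is first reached from the initial states, and it enumerates states explicitly, which is exactly the exponential cost the abstract warns about. The function name and the toy counter are illustrative assumptions, not part of the paper.

```python
from collections import deque


def reachability_depth(initial_states, successors):
    """Largest BFS depth at which a new state is first reached.

    initial_states: non-empty iterable of hashable states.
    successors: function mapping a state to an iterable of next states.
    """
    depth = {s: 0 for s in initial_states}
    queue = deque(depth)
    while queue:
        state = queue.popleft()
        for nxt in successors(state):
            if nxt not in depth:  # first visit gives the shortest distance
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return max(depth.values())


# Toy design: a modulo-8 counter with an enable input (hold or increment).
counter_depth = reachability_depth([0], lambda s: [s, (s + 1) % 8])
```

For the toy counter every state is reached within seven steps, so a bounded check that covers that depth has examined every reachable state.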

Improved Symbolic Simulation by Dynamic Functional Space Partitioning [p. 42]

T. Feng, L. Wang, K. Cheng, and A. Lin

In this paper, we provide a flexible and automatic method to partition the functional space for efficient symbolic simulation. We utilize a 2-tuple list representation as the basis for partitioning the functional space. The partitioning is carried out dynamically during the symbolic simulation based on the sizes of OBDDs. We develop heuristics for choosing the optimal partitioning points. These heuristics intend to balance the tradeoff between the time and space complexity. We demonstrate the effectiveness of our new symbolic simulation approach through experiments based on a floating point adder and a memory management unit.

1C: Power, Timing and Diagnosis Constrained Testing

Moderators: S. Kundu, Intel, US; B. Straube, FhG IIS/EAS Dresden, DE

Using BDDs and ZBDDs for Efficient Identification of Testable Path Delay Faults [p. 50]

S. Padmanaban and S. Tragoudas

We present a novel framework to identify all the robustly testable and untestable path delay faults in a circuit. The method uses a combination of decision diagrams for manipulating path delay faults and boolean functions. The approach benefits from processing partial paths or fanout free segments in the circuit rather than the entire path. The effectiveness of the proposed framework is demonstrated experimentally. It is observed that the methodology identifies 350% more testable faults in the ISCAS'85 benchmark C6288 than any existing technique by utilizing only a fraction of the time compared to earlier work.

Level of Similarity: A Metric for Fault Collapsing [p. 56]

I. Pomeranz and S. Reddy

We describe a new approach to fault collapsing that extends fault collapsing based on fault equivalence and fault dominance. The new approach is based on a metric called level of similarity between faults. Informally, a fault f_j is said to be similar to a fault f_i with a level of similarity SL_i,j ≤ 1 if a fraction SL_i,j of the tests for f_i also detect f_j. If SL_i,j is high enough, one may exclude f_j from the set of target faults and rely on the test for f_i (and tests for other faults) to detect f_j. We describe a procedure for fault collapsing based on the level of similarity, and study its effectiveness experimentally.
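The informal definition of the level of similarity can be written down directly. The sketch below assumes the sets of tests detecting each fault are available explicitly (for example from fault simulation of a sample of tests); the function name and the example sets are illustrative assumptions, and the authors' collapsing procedure itself is not reproduced.

```python
def similarity_level(tests_i, tests_j):
    """SL_i,j: fraction of the tests detecting f_i that also detect f_j."""
    tests_i, tests_j = set(tests_i), set(tests_j)
    if not tests_i:
        raise ValueError("f_i must be detected by at least one test")
    return len(tests_i & tests_j) / len(tests_i)


# Hypothetical detecting-test sets, identified by test index.
tests_fi = {1, 2, 3, 4}
tests_fj = {2, 3, 4, 5, 6}

sl_ij = similarity_level(tests_fi, tests_fj)  # 3 of the 4 tests for f_i
sl_ji = similarity_level(tests_fj, tests_fi)  # 3 of the 5 tests for f_j
```

The metric is not symmetric. A value of 1 means every test for f_i also detects f_j, so similarity generalizes this containment to fractions below 1.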

Design of Routing-Constrained Low Power Scan Chains [p. 62]

Y. Bonhomme, P. Girard, L. Guiller, C. Landrault, S. Pravossoudovitch, and A. Virazel

Scan-based architectures, though widely used in modern designs, are expensive in power consumption. Recently, we proposed a technique based on clustering and reordering of scan cells that allows the design of low power scan chains [1]. The main feature of this technique is that power consumption during scan testing is minimized while constraints on scan routing are satisfied. In this paper, we propose a new version of this technique. The clustering process has been modified to allow a better distribution of scan cells in each cluster and hence lead to larger power reductions. Results are provided at the end of the paper to highlight this point and show that scan design constraints (length of scan connections, congestion problems) are still satisfied.

Z-Sets and Z-Detections: Circuit Characteristics that Simplify Fault Diagnosis [p. 68]

I. Pomeranz, S. Venkataraman, S. Reddy, and B. Seshadri

We define the concepts of z-sets and z-detections for combinational circuits (or the combinational logic of scan circuits). Based on these concepts we define structural characteristics and characteristics based on fault simulation. We show that these characteristics determine the numbers of fault pairs that are guaranteed to be distinguished by a given fault detection test set. These fault pairs do not need to be considered during diagnostic fault simulation or test generation. We demonstrate that benchmark circuits as well as industrial circuits have these characteristics to a larger extent than may be expected. As a result, only small percentages of fault pairs need to be considered during diagnostic fault simulation or test generation once a fault detection test set is available. In addition, these fault pairs can be identified efficiently.

1D: Mixed-Signal Circuits and Systems

Moderators: A. Rodriguez-Vazquez, IMSE-CNM, ES; P. Wambacq, IMEC, BE

A 2.7V 350µW 11-b Algorithmic Analog-to-Digital Converter with Single-Ended Multiplexed Inputs [p. 76]

A. Nagari and G. Nicollini

A low-power low-area CMOS algorithmic A/D converter that requires neither trimming nor digital calibration is presented. The topology is based on a classical cyclic A/D conversion using a capacitor ratio-independent computation circuitry. All the non-idealities have been carefully analyzed and reduced by proper choices of design and layout solutions. As a result the errors coming from opamp offset and finite open-loop dc gain, switch charge injection and clock feedthrough, parasitic capacitors, and intrinsic noise sources are reduced under the LSB level. To process a multiplexed (8 channels) single-ended analog input, an efficient single-ended to fully differential circuit has been presented. The converter achieves 11 bit accuracy in the Nyquist band at a sampling rate of 8kSps. The total power dissipation is only 350µW at 2.7V supply voltage. The active area is 0.3 mm² in a 0.35µm five-metal-level CMOS technology with double-poly linear capacitors.

Digital Background Gain Error Correction in Pipeline ADCs [p. 82]

A. Ginés, E. Peralías, and A. Rueda

This paper presents a new digital technique for background calibration of gain errors in Pipeline ADCs. The proposed algorithm estimates and corrects both the MDAC gain error of the stage under calibration and the global gain error associated to the uncalibrated stages without interruption of the conversion and without reduction of the dynamic rate. It is based on the use of a stage with two input-output characteristics, depending on the value of a digital noise signal.
Key Words: Analog-to-Digital Converter, Pipeline ADC, Background Calibration, On-line Calibration.

Digital Ground Bounce Reduction by Phase Modulation of the Clock [p. 88]

M. Badaroglu, G. Gielen, H. De Man, P. Wambacq, G. Van Der Plas, and S. Donnay

The digital switching noise that propagates through the chip substrate to the analog circuitry on the same chip is a major limitation for mixed-signal SoC integration. In synchronous digital systems, digital circuits switch simultaneously on the clock edge, thereby generating a large ground bounce. In order to reduce the spectral peaks in the ground bounce spectrum, we combine two techniques: (1) phase modulation of the clock and (2) introducing intended clock skews to spread the switching activities. Experimental results show around 16 dB reduction in the spectral peaks of the noise spectrum when these two techniques are combined. These two techniques are believed to be good candidates for the development of methodologies for digital low-noise design techniques in future CMOS technologies.

Pseudo-Random Sequence Based Tuning System for Continuous-Time Filters [p. 94]

A. Baschirotto, S. D'Amico, F. Corsi, C. Marzocca, and G. Matarrese

Continuous-Time filters are widely used in signal processing but require a tuning system to align their frequency response. Several tuning techniques have been proposed in the literature, which can be grouped in two basic schemes: master-slave and self-calibration arrangements. Here we propose a novel tuning approach which can be applied to both tuning schemes. The tuning algorithm is based on the application of a pseudo-random input Test Pattern Signal and on the evaluation of a few samples of the input-output cross-correlation function. The key advantages of the proposed technique are basically the use of a pseudo-random pattern signal which can be generated by a very simple circuit in a small die area and the simple circuitry required to sample the filter output and to perform the cross-correlation operation. Some experimental results of the application of the proposed tuning technique to a benchmark filter are given in order to assess its effectiveness.

1E: Communication-Centric and Source-Level Optimisations for High-Level Synthesis

Moderators: J. Teich, Erlangen-Nuremberg U, DE; P. Cheung, Imperial College London, UK

A Crosstalk Aware Interconnect with Variable Cycle Transmission [p. 102]

L. Li, N. Vijaykrishnan, M. Kandemir, and M. Irwin

Crosstalk between wires, caused by increased capacitive coupling, is considered one of the major factors that affect the performance of interconnects such as buses. The data-dependent nature of crosstalk-induced delays necessitates bus cycle time to be designed for the worst case crosstalk. However, this pessimism incurs a significant performance penalty. Consequently, we propose a crosstalk aware interconnect that uses a faster clock and dynamically controls the number of cycles required for transmission based on the estimated delay of the data pattern to be transmitted. In order to accomplish this, we designed a crosstalk analyzer circuit that is incorporated into the sender side of the bus and supports a variable cycle transmission mechanism. We evaluate the effectiveness of the proposed scheme focusing on the on-chip buses of a microprocessor and by using the SPEC2000 benchmarks. The experimental results show that the proposed approach improves performance by 31.5% as compared to the original pessimistic approach. Furthermore, we employ a coding optimization to enhance the effectiveness of the proposed approach. We also show that the proposed scheme is an area-efficient approach to improving performance as compared to other crosstalk reduction schemes.

Layout Conscious Bus Architecture Synthesis for Deep Submicron Systems on Chip [p. 108]

N. Thepayasuwan and A. Doboli

System-level design has a disadvantage in not knowing important aspects about the final layout. This is critical for SoC, where uncertainties in communication delay by very deep submicron effects cannot be neglected. This paper presents a layout-aware bus architecture (BA) synthesis algorithm for designing the communication sub-system of an SoC. BA synthesis includes finding bus topology and routing individual buses, so that constraints like area, bus speed and length, are tackled at the physical level. The paper presents the BA automatically synthesized for a network processor and a JPEG SoC.

Loop Shifting and Compaction for the High-Level Synthesis of Designs with Complex Control Flow [p. 114]

S. Gupta, N. Dutt, A. Nicolau, and R. Gupta

Emerging embedded system applications in multimedia and image processing are characterized by complex control flow consisting of deeply nested conditionals and loops. We present a technique called loop shifting that incrementally exploits loop level parallelism across iterations by shifting and compacting operations across loop iterations. Our experimental results show that loop shifting is particularly effective for the synthesis of designs with complex control, especially when resource utilization is already high and/or under tight resource constraints. In situations when further loop unrolling (or initiating another iteration of the loop body) leads to a sharp increase in the longest combinational path in the circuit and the circuit area, loop shifting is able to achieve up to a 20% reduction in the input-to-output delay in the synthesized circuit. We implemented loop shifting within the SPARK parallelizing high-level synthesis framework and present results for experiments on designs derived from multimedia and image processing applications.

1F: Panel Session: SystemC and System Verilog: Where do They Fit? Where are They Going?

Organiser/Moderator: G. Martin, Cadence Berkeley Labs, US; D. Sciuto, Politecnico di Milano, IT
Panellists:
S. Swan, Cadence, US
F. Ghenassia, STMicroelectronics, FR
P. Flake, Synopsys, US
J. Srouji, Intel, IL
W. Rosenstiel, Tübingen U, DE

SystemC and System Verilog: Where do They Fit? Where are They Going? [p. 122]: There is tremendous interest in design languages these days - and more particularly, SystemC and SystemVerilog. Sometimes the truth about design languages can be obscured by marketing and the press. This panel is meant to deepen the technical understanding of the DATE audience on the issue of design languages. It contains five technical experts -- an academic expert in design languages and SystemC and SystemVerilog in particular; a language expert for each of SystemC and SystemVerilog; and a user expert for these two languages. The language experts have been heavily involved in the specification and evolution of their respective languages. The user experts have been heavily involved in developing use methodologies for these languages within their own design communities, and in applying them to real design problems. The panelists will consider the questions:
- what are the key capabilities of these languages and what do they offer to users?
- which design problems are they best used for? what is their scope?
- how has application of these languages to real design problems improved the productivity of designers and the quality of the design results?
- where should the languages develop further capabilities?

2A: Low Power Systems and Architectures

Moderators: E. Schmidt, Chip Vision Design Systems, DE; C. Guardiani, PDF Solutions, IT

Re-Configurable Bus Encoding Scheme for Reducing Power Consumption of the Cross Coupling Capacitance for Deep Sub-Micron Instruction Bus [p. 130]

S. Wong and C. Tsui

In very deep sub-micron designs, cross coupling capacitances become the dominant factor of the total bus loading and have a significant impact on the power consumption. In this paper, we propose two reconfigurable bus encoding schemes, which are based on the correlation among the bit lines, to reduce the power consumption at the cross coupling capacitances of the instruction buses. The instruction is encoded by flipping and reordering the bit lines during compilation time to reduce the total switching capacitances. A crossbar is used to map back the data to the original instruction code before sending to the instruction decoder. The reordering can be re-configured during run-time by using different configurations in the crossbar. We propose two types of re-configuration, static and dynamic. Static coding uses a fixed flipping and re-configuring pattern after the corresponding program is compiled. Dynamic coding allows different re-configuring patterns during program execution. Experimental results show that by using the proposed schemes, a significant energy reduction of 17-23% can be achieved. Comparisons with an existing bit-line reordering encoding scheme have also been made and on average more than 15% reduction can be obtained using our method.

Hierarchical Adaptive Dynamic Power Management [p. 136]

Z. Ren, B. Krogh, and R. Marculescu

The main contribution of this paper is a novel hierarchical scheme for adaptive dynamic power management (DPM) under nonstationary service requests. We model the non-stationary arrival process of service requests as a Markov-modulated stochastic process in which the stochastic process for each modulation state models a particular stationary mode of the arrival process. The bottom layer of our hierarchical architecture is a set of stationary optimal DPM policies, pre-calculated off-line for selected modes from policy optimization in Markov decision processes. The supervisory power manager at the top layer adaptively and optimally switches among these stationary policies on-line to accommodate the actual mode-switching arrival dynamics. Simulation results show that our approach, under highly nonstationary requests, can lead to significant power savings compared to previously proposed heuristic approaches.
Keywords: low-power design, hierarchical adaptive dynamic power management, nonstationary service requests.

A Self-Tuning Cache Architecture for Embedded Systems [p. 142]

C. Zhang, F. Vahid, and R. Lysecky

Memory accesses can account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in pre-fabricated microprocessor platforms. Tuning those caches to a program is still however a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is power and performance costly. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.
Keywords: Cache, configurable, architecture tuning, low power, low energy, embedded systems, on-chip CAD, dynamic optimization.

Scheduling Reusable Instructions for Power Reduction [p. 148]

J. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. Irwin

In this paper, we propose a new issue queue design that is capable of scheduling reusable instructions. Once the issue queue is reusing instructions, no instruction cache access is needed since the instructions are supplied by the issue queue itself. Furthermore, dynamic branch prediction and instruction decoding can also be avoided, permitting the gating of the front-end stages of the pipeline (the stages before register renaming). Results using array-intensive codes show that for up to 82% of the total execution cycles the pipeline front-end can be gated, providing a power reduction of 72% in the instruction cache, 33% in the branch predictor, and 21% in the issue queue, respectively, at a small performance cost. Our analysis of compiler optimizations indicates that the power savings can be further improved by using optimized code.

2B: Advanced Formal Verification Techniques

Moderators: R. Drechsler, Bremen U, DE; H. Eveking, TU Darmstadt, DE

Using Counter Example Guided Abstraction Refinement to Find Complex Bugs [p. 156]

P. Bjesse and J. Kukula

In this paper, we present a method for finding failure traces for safety properties that are out of reach for traditional approaches to counter example generation. We do this by guiding Bounded Model Checking (BMC) with information gathered from counter example guided abstraction refinement. Unlike previously described approaches based on reconstructing abstract counter examples on the concrete machines, we do not limit ourselves to search for failures of the same length as the current abstract counterexample. We also describe a combination of previously known methods for choosing registers to include in the abstraction that we have found works very well together with our technique for finding failures. Our experimental results show that the resulting method can find counter examples that are out of range for both standard BMC and two previously published approaches to abstraction-guided BMC.

Cost-Efficient Block Verification for a UMTS Up-Link Chip-Rate Coprocessor [p. 162]

G. Fey, D. Stoffel, H. Trylus, and K. Winkelmann

ASIC designs for future communication applications cannot be simulated exhaustively. Formal Property Checking is a powerful technology to overcome the limitations of current functional verification approaches. The paper reports on a large-scale experiment employing the CVE property checker for verifying the block-level functional correctness of a large ASIC. This new verification methodology achieves substantial quality and productivity gains. The two biggest advantages are:
• Coding and Verification can be done in parallel.
• The whole state space of a test case will be verified in a single run.
Formal Property Checking simplifies and shortens the functional verification of large-scale ASICs at least in the same order of magnitude as Static Timing Analysis did for timing verification.

Automatic Verification of Safety and Liveness for XScale-Like Processor Models Using WEB Refinements [p. 168]

P. Manolios and S. Srinivasan

We show how to automatically verify that complex XScale-like pipelined machine models satisfy the same safety and liveness properties as their corresponding instruction set architecture models, by using the notion of Well-founded Equivalence Bisimulation (WEB) refinement. Automation is achieved by reducing the WEB-refinement proof obligation to a formula in the logic of Counter arithmetic with Lambda expressions and Uninterpreted functions (CLU). We use the tool UCLID to transform the resulting CLU formula into a Boolean formula, which is then checked with a SAT solver. The models we verify include features such as out of order completion, precise exceptions, branch prediction, and interrupts. We use two types of refinement maps. In one, flushing is used to map pipelined machine states to instruction set architecture states; in the other, we use the commitment approach, which is the dual of flushing, since partially completed instructions are invalidated. We present experimental results for all the machines modeled, including verification times. For our application, we found that the time spent proving liveness accounts for about 5% of the overall verification time.

2C: New Algorithms for TPG

Moderators: H. Obermeir, Infineon Technologies, DE; M. Hsiao, Virginia Tech., US

A Probabilistic Method for the Computation of Testability of RTL Constructs [p. 176]

J. Fernandes, M. Santos, A. Oliveira, and J. Teixeira

Validation of RTL descriptions remains one of the principal bottlenecks in the circuit design process. Random simulation based methods for functional validation suffer from fundamental limitations and may be inappropriate or too expensive. In fact, for some circuits, a large number of vectors is required in order to make the circuit reach hard to test constructs and obtain accurate values for their testability. In this work, we present a static, non-simulation based, method for the determination of the controllability of RTL constructs that is efficient and gives accurate feedback to the designers in what regards the presence of hard to control constructs in their RTL code. The method takes as input a Verilog RTL description, solves the Chapman-Kolmogorov equations that describe the steady-state of the circuit and outputs the computed values for the controllability of the RTL constructs. To avoid the exponential blow-up that results from writing one equation for each circuit state and solving the resulting system of equations, an approximation method is used. We present results showing that the approximation is effective and describe how the method can be used to bias a random test generator in order to achieve higher coverage using a smaller number of vectors.
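As a rough illustration of what a steady-state computation yields, the sketch below finds the stationary distribution of a tiny state machine under random inputs by repeated multiplication with its transition matrix. It enumerates states explicitly, which is precisely what the abstract says the method avoids through approximation; the function name and the example machine are assumptions made only for illustration.

```python
def steady_state(P, iterations=200):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    P[i][j] is the probability of moving from state i to state j in one cycle.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical 3-state saturating counter: each cycle a random reset input
# returns it to state 0 with probability 0.5, otherwise it counts up to 2.
P = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5],
]
pi = steady_state(P)
```

The resulting probabilities show how rarely random stimuli drive the machine into its deeper states, which is the kind of controllability figure a random test generator could be biased with.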

Graph-Based Functional Test Program Generation for Pipelined Processors [p. 182]

P. Mishra and N. Dutt

Functional verification is widely acknowledged as a major bottleneck in microprocessor design. While early work on specification driven functional test program generation has proposed several promising ideas, many challenges remain in applying them to realistic embedded processors. We present a graph coverage based functional test program generation approach for pipelined processors. The proposed methodology makes three important contributions. First, it automatically generates the graph model of the pipelined processor from the specification using functional abstraction. Second, it generates functional test programs based on the coverage of the pipeline behavior. Finally, the test generation time is drastically reduced due to the use of module level property checking. We applied this methodology on the DLX processor to demonstrate the usefulness of our approach.

Automatic Generation of Validation Stimuli for Application-Specific Processors [p. 188]

O. Goloubeva, M. Sonza Reorda, and M. Violante

Microprocessor soft cores today offer an effective solution to the problem of rapidly developing new systems-on-a-chip. However, all the features they offer are rarely used in embedded applications, and thus designers are often involved in the challenging task of soft-core customization to obtain application-specific processors. This paper proposes a novel approach to help designers in the simulation-based validation of application-specific processors. Suitable input stimuli are automatically generated while reasoning only on the software application the processor is intended to execute, while all the details concerning the processor hardware are neglected. Experimental results on an 8051 soft core show the effectiveness of the proposed approach.

Efficient Static Compaction of Test Sequence Sets through the Application of Set Covering Techniques [p. 194]

M. Dimopoulos and P. Linardis

The test sequence compaction problem is modeled here, first, as a set covering problem. This formulation enables the straightforward application of set covering methods for compaction. Because of the complexity inherent in the first model, a second, more efficient formulation is proposed where the test sequences are modeled as matrix columns with variable costs (number of vectors). Further, matrix reduction rules appropriate to the new formulation, which do not affect the optimality of the solution, are introduced. Finally, the reduced problem is minimized with a Branch & Bound algorithm. Experiments on a large number of test sets show significant reductions to the original problem by simply using the presented reduction rules. Experimental results comparing our method with others from the literature and also with the absolute minima of the examples, computed separately with the MINCOV algorithm, support the potential of the proposed approach.
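
The set covering view of compaction can be illustrated with a minimal sketch. It assumes each test sequence is described by a cost (its number of vectors) and the set of faults it detects, and it uses a plain greedy heuristic rather than the reduction rules and Branch & Bound search of the paper. The function name, sequence names, and fault names are hypothetical.

```python
def greedy_compaction(sequences):
    """Pick test sequences until every fault is covered.

    sequences maps a name to (cost, faults), where cost is the number
    of vectors and faults is the set of faults the sequence detects.
    Greedy rule (an assumption, not the paper's method): take the
    sequence with the lowest cost per newly covered fault.
    """
    uncovered = set().union(*(faults for _, faults in sequences.values()))
    chosen = []
    while uncovered:
        candidates = [n for n, (_, f) in sequences.items() if f & uncovered]
        best = min(
            candidates,
            key=lambda n: (sequences[n][0] / len(sequences[n][1] & uncovered), n),
        )
        chosen.append(best)
        uncovered -= sequences[best][1]
    return chosen


# Hypothetical test set: four sequences covering four faults.
SEQS = {
    "s1": (4, {"f1", "f2", "f3"}),
    "s2": (2, {"f3", "f4"}),
    "s3": (3, {"f1", "f2"}),
    "s4": (5, {"f1", "f2", "f3", "f4"}),
}
selected = greedy_compaction(SEQS)
total_cost = sum(SEQS[name][0] for name in selected)
print(selected, total_cost)
```

A greedy pass gives no optimality guarantee in general, which is why an exact search such as Branch & Bound is needed when the true minimum is required.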

2E: Optimisation of Memory Hierarchies

Moderators: R. Bergamaschi, IBM TJ Watson Res. Center, US; R. Hermida, Madrid Complutense U, ES

Data Reuse Analysis Technique for Software-Controlled Memory Hierarchies [p. 202]

I. Issenin, N. Dutt, E. Brockmeyer, and M. Miranda

In multimedia and other streaming applications a significant portion of energy is spent on data transfers. Exploiting data reuse opportunities in the application, we can reduce this energy by making copies of frequently used data in a small local memory and replacing speed and power inefficient transfers from main off-chip memory by more efficient local data transfers. In this paper we present an automated approach for analyzing these opportunities in a program that allows modification of the program to use custom scratch pad memory configurations comprising a hierarchical set of buffers for local storage of frequently reused data. Using our approach we are able to reduce energy consumption of the memory subsystem when using a scratch pad memory by a factor of two on average compared to a cache of the same size.

Automatic Tuning of Two-Level Caches to Embedded Applications [p. 208]

A. Gordon-Ross, F. Vahid, and N. Dutt

The power consumed by the memory hierarchy of a microprocessor can contribute to as much as 50% of the total microprocessor system power, and is thus a good candidate for optimizations. We present an automated method for tuning two-level caches to embedded applications for reduced energy consumption. The method is applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We introduce the two-level cache tuner, or TCaT -- a heuristic for searching the huge solution space of possible configurations. The heuristic interlaces the exploration of the two cache levels and searches the various cache parameters in a specific order based on their impact on energy. We show the integrity of our heuristic across multiple memory configurations and even in the presence of hardware/software partitioning -- a common optimization capable of achieving significant speedups and/or reduced energy consumption. We apply our exploration heuristic to a large set of embedded applications. Our experiments demonstrate the efficacy of our heuristic: on average the heuristic examines only 7% of the possible cache configurations, but results in cache sub-system energy savings of 53%, only 1% more than the optimal cache configuration. In addition, the configured cache achieves an average speedup of 30% over the base cache configuration due to tuning of cache line size to the application's needs.

Low Static-Power Frequent-Value Data Caches [p. 214]

C. Zhang, J. Yang, and F. Vahid

Static energy dissipation in cache memories will constitute an increasingly larger portion of total microprocessor energy dissipation due to nanoscale technology characteristics and the large size of on-chip caches. We propose to reduce the static energy dissipation of an on-chip data cache by taking advantage of the frequent values (FV) that widely exist in a data cache memory. The original FV-based low-power cache design aimed at only reducing dynamic power, at the cost of a 5% slowdown. We propose a better design that reduces both static and dynamic cache power, and that uses a circuit design that eliminates performance overhead. A designer can utilize our architecture by simulating an application and then synthesizing the FVs into an application-specific FV cache design when values will not change, or by simulating and then writing to an FV-cache with configuration registers when values could change. Furthermore, we describe hardware that can dynamically determine FVs and write to the configuration registers completely transparently. Experiments on 11 Spec 2000 benchmarks show that, in addition to the dynamic power savings, 33% static energy savings for data caches can be achieved.

Using a Victim Buffer in an Application-Specific Memory Hierarchy [p. 220]

C. Zhang and F. Vahid

Customizing a memory hierarchy to a particular application or applications is becoming increasingly common in embedded system design, with one benefit being reduced energy. Adding a victim buffer to the memory hierarchy is known to reduce energy and improve performance on average, yet victim buffers are not typically found in commercial embedded processors. One problem with such buffers is, while they work well on average, they tend to hurt performance for many applications. We show that a victim buffer can be very effective if it is considered as a parameter in designing a memory hierarchy, like the traditional cache parameters of total size, associativity, and line size. We describe experiments on PowerStone and MediaBench benchmarks, showing that having the option of adding a victim buffer to a direct-mapped cache can reduce memory-access energy by a factor of 3 in some cases. Furthermore, even when other cache parameters are configurable, we show that a victim buffer can still reduce energy by 43%. By treating the victim buffer as a parameter, meaning the buffer can be included or excluded, we can avoid performance overhead of up to 4% on some examples. We discuss the victim buffer in the context of both core-based and pre-fabricated platform based design approaches.

2F: Hot Topic -- High Security Smartcards

Organiser/Moderator: M. Renaudin, TIMA Laboratory, FR; F. Bouesse, TIMA Laboratory, FR
Speakers:
P. Proust, Gemplus Corporate R&D Security Technologies, FR
J. Tual, Axalto - Schlumberger, FR
L. Sourgen, STMicroelectronics, FR
F. Germain, DCSSI - French Government Service on the Security of Information Systems, FR

High Security Smartcards [p. 228]: New consumer appliances such as PDAs, set-top boxes, and GSM/UMTS terminals enable easy access to the internet and strongly contribute to the development of e-commerce and m-commerce services. Tens of billions of payments are made using cards today, and this is expected to grow in the near future. Smartcard platforms will enable operators and service providers to design and deploy new e- and m-commerce services. This development can only be achieved if a high level of security is guaranteed for the transactions and the customer's information. In this context, smartcard design is very challenging: it must provide the flexibility and the power required by the applications and services, while at the same time guaranteeing the security of the transactions and the customer's privacy. The goal of the session is to introduce this context and highlight the main challenges that smartcard designers and manufacturers have to face.

3A: New Directions in Low-Power Design

Moderators: M. Miranda, IMEC, BE; W. Nebel, OFFIS, DE

Energy-Aware Communication and Task Scheduling for Network-on-Chip Architectures Under Real-Time Constraints [p. 234]

J. Hu and R. Marculescu

In this paper, we present a novel Energy-Aware Scheduling (EAS) algorithm which statically schedules both communication transactions and computation tasks onto heterogeneous Network-on-Chip (NoC) architectures under real-time constraints. Our algorithm automatically assigns tasks onto different processing elements and then schedules their execution. At the same time, the algorithm also takes into consideration the exact communication delay by scheduling communication transactions in parallel. As the main contribution, we first formulate the problem of concurrent communication and task scheduling for heterogeneous NoC architectures and then propose an efficient heuristic to solve it. Experimental results show that significant energy savings can be achieved by using our energy-aware scheduler while meeting the specified performance constraints. For instance, for a complex multimedia application, 44% energy savings have been observed, on average, compared to the schedules generated by a standard earliest-deadline-first scheduler.

A Low Cost Individual-Well Adaptive Body Bias (IWABB) Scheme for Leakage Power Reduction and Performance Enhancement in the Presence of Intra-Die Variations [p. 240]

T. Chen and J. Gregg

This paper presents a new method of adapting body biasing on a chip during post-fabrication testing in order to mitigate the effects of process variations. Individual well biasing voltages can be changed to be connected either to a chip wide well bias or to a different bias voltage through a self-regulating mechanism, allowing biasing voltage adjustments on a per well basis. The scheme requires only one bias voltage distribution network, but allows for back biasing adjustments to more effectively mitigate die-to-die and within-die process variations. The biasing setting for each well is determined using a modified genetic algorithm. Our experimental results show that binning yields as low as 17% can be improved to greater than 90% after using the proposed IWABB method.

A Logic Level Design Methodology for a Secure DPA Resistant ASIC or FPGA Implementation [p. 246]

K. Tiri and I. Verbauwhede

This paper describes a novel design methodology to implement a secure DPA resistant crypto processor. The methodology is suitable for integration in a common automated standard cell ASIC or FPGA design flow. The technique combines standard building blocks to make "new" compound standard cells, which have a close to constant power consumption. Experimental results indicate a 50 times reduction in the power consumption fluctuations.

Power Minimization in a Backlit TFT-LCD Display by Concurrent Brightness and Contrast Scaling [p. 252]

W. Cheng, Y. Hou, and M. Pedram

This paper presents a Concurrent Brightness and Contrast Scaling (CBCS) technique for a cold cathode fluorescent lamp (CCFL) backlit TFT-LCD display. The proposed technique aims at conserving power by reducing the backlight illumination while retaining the image fidelity through preservation of the image contrast. First, we explain how CCFL works and show how to model the non-linearity between its backlight illumination and power consumption. Next, we propose the contrast distortion metric to quantify the image quality loss after backlight scaling. Finally, we formulate and optimally solve the CBCS optimization problem with the objective of minimizing the fidelity and power metrics. Experimental results show that an average of 3.7X power saving can be achieved with only 10% of contrast distortion.

3B: Advances in SAT

Moderators: P. Bjesse, Synopsys, US; G. Cabodi, Politecnico di Torino, IT

Managing Don't Cares in Boolean Satisfiability [p. 260]

S. Safarpour, A. Veneris, R. Drechsler, and J. Lee

Advances in Boolean satisfiability solvers have popularized their use in many of today's CAD VLSI challenges. Existing satisfiability solvers operate on a circuit representation that does not capture all of the structural circuit characteristics and properties. This work proposes algorithms that take into account the circuit don't care conditions thus enhancing the performance of these tools. Don't care sets are addressed in this work both statically and dynamically to reduce the search space and guide the decision making process. Experiments demonstrate performance gains.

Exploiting Signal Unobservability for Efficient Translation to CNF in Formal Verification of Microprocessors [p. 266]

M. Velev

The paper presents a method for translating Boolean circuits to CNF by identifying trees of ITE operators, where each ITE has fanout count of 1, and representing every such tree with a single set of equivalent CNF clauses without intermediate variables for ITE outputs, except for the tree output. This not only eliminates intermediate variables, but also reduces the number of clauses, compared to conventional translation to CNF, where each ITE is assigned an output variable and is represented with a separate set of clauses. Other gates with fanout count of 1 are similarly merged with their fanout gate to generate a single set of equivalent clauses. This translation to CNF was implemented in a decision procedure for the logic of Equality with Uninterpreted Functions and Memories (EUFM), and was applied to formulas from formal verification of microprocessors. To increase the number of ITE-trees in the Boolean formulas, the decision procedure was optimized to preserve the ITE-tree structure of arguments to equality comparisons. In conventional translation to CNF with the unoptimized decision procedure, the benchmark formulas require up to hundreds of thousands of CNF variables and millions of clauses. The best translation strategy reduced the CNF variables by up to 8x; the clauses by up to 17x; the SAT-solver decisions by up to 79x; the SAT-solver conflicts by up to 96x; and accelerated the SAT solving by up to 420x.
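
The difference between the two translations can be sketched on a small ITE tree. The code below is an illustration under stated assumptions, not the paper's implementation: trees are nested tuples `("ite", cond, then, else)`, leaves and conditions are positive integer variables, the conventional encoding uses four clauses per ITE, and the merged encoding emits one clause pair per root-to-leaf path. All function names are hypothetical.

```python
from itertools import product


def conventional_cnf(tree, next_var):
    """One fresh output variable and four clauses per ITE.
    Returns (output_variable, clauses, next_free_variable)."""
    if isinstance(tree, int):
        return tree, [], next_var
    _, c, t, e = tree
    t_out, t_cl, next_var = conventional_cnf(t, next_var)
    e_out, e_cl, next_var = conventional_cnf(e, next_var)
    out = next_var
    clauses = t_cl + e_cl + [
        [-c, -t_out, out], [-c, t_out, -out],  # c true  -> out == then
        [c, -e_out, out], [c, e_out, -out],    # c false -> out == else
    ]
    return out, clauses, next_var + 1


def merged_cnf(tree, out):
    """Only the tree output gets a variable: for every root-to-leaf
    path, (path conditions) imply out == leaf."""
    clauses = []

    def walk(node, negated_path):
        if isinstance(node, int):
            clauses.append(negated_path + [-node, out])
            clauses.append(negated_path + [node, -out])
            return
        _, c, t, e = node
        walk(t, negated_path + [-c])  # then-branch is taken when c is true
        walk(e, negated_path + [c])   # else-branch is taken when c is false

    walk(tree, [])
    return clauses


def evaluate(tree, assign):
    if isinstance(tree, int):
        return assign[tree]
    _, c, t, e = tree
    return evaluate(t, assign) if assign[c] else evaluate(e, assign)


def satisfied(clauses, assign):
    return all(any(assign[abs(l)] == (l > 0) for l in cl) for cl in clauses)


# out = ITE(v1, ITE(v2, v3, v4), v5); variables 1..5 are inputs.
TREE = ("ite", 1, ("ite", 2, 3, 4), 5)
conv_out, conv_clauses, conv_next = conventional_cnf(TREE, 6)
merged_clauses = merged_cnf(TREE, 6)

# Exhaustive check: the merged clauses hold exactly when out (v6)
# equals the value of the tree.
merged_is_exact = all(
    satisfied(merged_clauses, a) == (a[6] == evaluate(TREE, a))
    for a in (dict(zip(range(1, 7), bits))
              for bits in product([False, True], repeat=6))
)
print(len(conv_clauses), len(merged_clauses), merged_is_exact)
```

For this two-ITE tree the conventional encoding needs two output variables and eight clauses, while the merged encoding needs one output variable and six clauses.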

A Novel SAT All-Solutions Solver for Efficient Preimage Computation [p. 272]

B. Li, M. Hsiao, and S. Sheng

In this paper, we present a novel all-solutions preimage SAT solver, SOLALL, with the following features: (1) a new success-driven learning algorithm employing smaller cut sets; (2) a marked CNF database non-trivially combining success/conflict-driven learning; (3) quantified-jump-back dynamically quantifying primary input variables from the preimage; (4) improved free BDD built on the fly, saving memory and avoiding inclusion of PI variables; finally, (5) a practical method of storing all solutions into a canonical OBDD format. Experimental results demonstrated the efficiency of the proposed approach for very large sequential circuits.

3C: Analogue and High-Frequency Test

Moderators: A. Richardson, Lancaster U, UK; F. Azais, LIRMM, FR

Efficient Test Strategy for TDMA Power Amplifiers Using Transient Current Measurements: Uses and Benefits [p. 280]

G. Srinivasan, S. Bhattacharya, A. Chatterjee, and S. Cherubal

A novel algorithm for fast and accurate testing of TDMA power amplifiers in a transmitter system is presented. First, the steep cost of high frequency testers can be largely complemented by the proposed method due to its ease of implementation on low-cost testers. Secondly, TDMA power amplifiers usually have a control voltage to operate the device in various modes of operation. At each of the control voltage values, all the specifications of the power amplifier are measured to ensure the performance of each tested device. A new method is proposed to test all the specifications of these devices using the transient current response of their bias circuits to a time-varying control voltage stimulus. This results in shorter test times compared to conventional test methods. The test specification values are measured with an error of less than 5% for all the specifications measured. The proposed test approach can specifically benefit production test of quad-band amplifiers (GSM850, GSM900, PCS/DCS), as a single transient current measurement can be used to compute all the specifications of the device in different modes of operation, over different operating frequencies.

Random Jitter Extraction Technique in a Multi-Gigahertz Signal [p. 286]

C. Ong, D. Hong, K. Cheng, and L. Wang

In this paper, we propose a simple technique for estimating the standard deviation of a Gaussian random jitter component in a multi-gigahertz signal. This method may utilize existing on-chip single-shot period measurement techniques to measure the multi-gigahertz signal periods for the estimation. This method does not require an external sampling clock, nor any additional measurement beyond existing techniques. Experimental results show that this extraction method can accurately estimate the random jitter variance in a multi-gigahertz signal even with the presence of a few hundred-hertz sinusoidal jitter components.

Low Cost Analog Testing of RF Signal Paths [p. 292]

M. Negreiros, L. Carro, and A. Susin

A low cost method for testing analog RF signal paths suitable for BIST implementation in a SoC environment is described. The method is based on the use of a simple and low-cost one-bit digitizer that enables the reuse of processor and memory resources available in the SoC, while incurring little analog area overhead. The proposed method also allows a constant load to be observed by the circuit, since no switches or muxes are needed for digitizing specific test points. Mathematical background and experimental results are presented in order to validate the test approach.

A Method for Parameter Extraction of Analog Sine-Wave Signals for Mixed-Signal Built-In-Self-Test Applications [p. 298]

D. Vázquez, G. Huertas, G. Leger, A. Rueda, and J. Huertas

This paper presents a method for extracting, in the digital domain, the main characteristic parameters of an analog sine-wave signal. The required circuitry for on-chip implementation is very simple and robust, which makes the present approach very suitable for BIST applications. Solutions in this sense are addressed together with simulation results that validate the feasibility of the proposed approach.

3E: Energy Efficient Memory Usage

Moderators: E. de Kock, Philips Research, NL; G. Constantinides, Imperial College, UK

A Novel Implementation of Tile-Based Address Mapping [p. 306]

S. Hettiaratchi and P. Cheung

Tile-based data layout has been applied to achieve various objectives such as minimizing cache conflicts and memory row switching activity. In some applications of tile-based mapping, the size of the tile can be assumed to be a power of two. In this paper, this 'power of two' assumption has been used to drastically simplify the tile-based address mapping functions. Once optimized, the implementation of the non-linear tile-based mapping consumes 60% less power than the implementation of the linear row-major mapping. This result is very interesting because one would normally expect a power penalty in the address generation stage of the more sophisticated tile-based mapping. Moreover, on average tile-based mapping implementation takes 10% less area and incurs virtually no additional delay over row-major mapping implementation.
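
The effect of the 'power of two' assumption can be sketched as follows. This is an illustrative mapping, not the circuit from the paper: it additionally assumes that the array width is a power of two, so the whole address is a rearrangement of the bits of `x` and `y`. The function and parameter names are hypothetical.

```python
def tiled_address(x, y, log2_width, log2_tw, log2_th):
    """Tile-based address using only shifts, masks, and ORs.

    Assumes the array width, tile width, and tile height are
    2**log2_width, 2**log2_tw, and 2**log2_th respectively.
    """
    tile_x = x >> log2_tw
    tile_y = y >> log2_th
    off_x = x & ((1 << log2_tw) - 1)
    off_y = y & ((1 << log2_th) - 1)
    log2_tiles_per_row = log2_width - log2_tw
    tile_index = (tile_y << log2_tiles_per_row) | tile_x
    return (tile_index << (log2_tw + log2_th)) | (off_y << log2_tw) | off_x


def tiled_address_reference(x, y, width, tw, th):
    """Same mapping written with general division and multiplication."""
    tiles_per_row = width // tw
    tile_index = (y // th) * tiles_per_row + (x // tw)
    return tile_index * (tw * th) + (y % th) * tw + (x % tw)


# A 16 x 8 array split into 4 x 2 tiles.
WIDTH, HEIGHT = 16, 8
addresses = [
    tiled_address(x, y, log2_width=4, log2_tw=2, log2_th=1)
    for y in range(HEIGHT)
    for x in range(WIDTH)
]
print(tiled_address(5, 3, 4, 2, 1))  # 45
```

Because every multiplication and division collapses to a fixed bit selection, the hardware for such a mapping reduces to wiring, which is consistent with the low area and delay overhead the abstract reports.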

Power Aware Variable Partitioning and Instruction Scheduling for Multiple Memory Banks [p. 312]

Z. Wang and X. Hu

Many high-end DSP processors employ both multiple memory banks and heterogeneous register files to improve performance and power consumption. The complexity of such architectures presents a great challenge to compiler design. In this paper, we present an approach for variable partitioning and instruction scheduling to maximally exploit the benefits provided by such architectures. Our approach is built on a novel graph model which strives to capture both performance and power demands. We propose an algorithm to iteratively find the variable partition such that the maximum energy saving is achieved while satisfying the given performance constraint. Experimental results demonstrate the effectiveness of our approach.

Time-Energy Design Space Exploration for Multi-Layer Memory Architectures [p. 318]

R. Szymanek, K. Kuchcinski, and F. Catthoor

This paper presents an exploration algorithm which examines execution time and energy consumption of a given application, while considering a parameterized memory architecture. The input to our algorithm is an application given as an annotated task graph and a specification of a multi-layer memory architecture. The algorithm produces Pareto trade-off points representing different multi-objective execution options for the whole application. Different metrics are used to estimate parameters for application-level Pareto points obtained by merging all Pareto diagrams of the tasks composing the application. We estimate application execution time although the final scheduling is not yet known. The algorithm makes it possible to trade off the quality of the results and its runtime depending on the used metrics and the number of levels in the hierarchical composition of the tasks' Pareto points. We have evaluated our algorithm on a medical image processing application and randomly generated task graphs. We have shown that our algorithm can explore huge design space and obtain (near) optimal results in terms of Pareto diagram quality.
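
Merging per-task Pareto diagrams into application-level Pareto points can be sketched in a few lines. The example assumes two tasks that run one after the other, so their execution times and energies simply add; the paper's metrics for unknown schedules are more elaborate. The function names and the numeric values are hypothetical.

```python
def pareto_front(points):
    """Keep the (time, energy) points that no other point dominates.
    Both objectives are minimized."""
    front = []
    for time, energy in sorted(set(points)):
        # Sorted by time, so a point survives only if it lowers energy.
        if not front or energy < front[-1][1]:
            front.append((time, energy))
    return front


def merge_sequential(front_a, front_b):
    """Combine two task-level fronts, assuming sequential execution."""
    combined = [
        (ta + tb, ea + eb)
        for ta, ea in front_a
        for tb, eb in front_b
    ]
    return pareto_front(combined)


# Hypothetical (time, energy) options for two tasks.
TASK_A = [(2, 10), (4, 6), (7, 5)]
TASK_B = [(1, 8), (3, 3)]
app_front = merge_sequential(TASK_A, TASK_B)
print(app_front)  # [(3, 18), (5, 13), (7, 9), (10, 8)]
```

Pruning dominated points after each merge keeps the number of candidates small, which is one way to trade result quality against runtime in a hierarchical composition.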

Breaking Instance-Independent Symmetries in Exact Graph Coloring [p. 324]

A. Ramani, F. Aloul, I. Markov, and K. Sakallah

Code optimization and high level synthesis can be posed as constraint satisfaction and optimization problems, such as graph coloring used in register allocation. Naturally-occurring instances of such problems are often small and can be solved optimally. A recent wave of improvements in algorithms for Boolean satisfiability (SAT) and 0-1 ILP suggests generic problem-reduction methods, rather than problem-specific heuristics, because (1) heuristics are easily upset by new constraints, (2) heuristics tend to ignore structure, and (3) many relevant problems are provably inapproximable. The NP-spec project offers a language to specify NP-problems and automatic reductions to SAT. Problem reductions often lead to highly symmetric SAT instances, and symmetries are known to slow down SAT solvers. In this work, we compare several avenues for symmetry-breaking, in particular when certain kinds of symmetry are present in all generated instances. Our surprising conclusion is that instance-independent symmetries should often be processed together with instance-specific symmetries rather than earlier, at the specification level.
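
A reduction from graph coloring to SAT, and the effect of an instance-independent symmetry, can be sketched as below. The encoding is a textbook one and the symmetry-breaking clause (pin vertex 0 to color 0) is the simplest possible choice, used here as an assumption; it is not one of the constructions evaluated in the paper. Models are counted by brute force, so the sketch only suits tiny instances. All names are hypothetical.

```python
from itertools import product


def coloring_cnf(num_vertices, edges, k, break_symmetry=False):
    """CNF for k-coloring; variable var(v, c) means vertex v has color c."""

    def var(v, c):
        return v * k + c + 1

    clauses = []
    for v in range(num_vertices):
        clauses.append([var(v, c) for c in range(k)])  # at least one color
        for c1 in range(k):
            for c2 in range(c1 + 1, k):
                clauses.append([-var(v, c1), -var(v, c2)])  # at most one
    for u, v in edges:
        for c in range(k):
            clauses.append([-var(u, c), -var(v, c)])  # endpoints differ
    if break_symmetry:
        # Colors are interchangeable in every instance, so fixing the
        # color of one vertex never removes all solutions.
        clauses.append([var(0, 0)])
    return clauses


def count_models(clauses, num_vars):
    """Brute-force model count (exponential; illustration only)."""
    count = 0
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses):
            count += 1
    return count


TRIANGLE = [(0, 1), (1, 2), (0, 2)]
plain = count_models(coloring_cnf(3, TRIANGLE, 3), 9)
broken = count_models(coloring_cnf(3, TRIANGLE, 3, break_symmetry=True), 9)
print(plain, broken)  # 6 2
```

The instance stays satisfiable while the number of equivalent solutions a solver may wander through shrinks from six to two.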

3F: Hot Topic -- How Can System Level Design Solve the Interconnect Technology Scaling Problem?

Organiser: R. Lauwereins, IMEC, BE
Moderator: R. Wilson, CMP Media, US
Panellists:
K. Maex, IMEC, BE
P. Groeneveld, Magma Design Automation, US
G. Martin, Cadence, US
A. Cuomo, STMicroelectronics, IT
F. Catthoor, IMEC, BE
P. van de Steeg, Philips Semiconductors, NL

How Can System Level Design Solve the Interconnect Technology Scaling Problem? [p. 332]: The scaling of interconnect technology hits a red brick wall: interconnect delay and power do not follow Moore's law anymore. The use of new materials like Cu and low-k alleviated the problem temporarily, but physical limits are being hit. What does this mean for system level design? The session starts with an embedded tutorial, given by an interconnect semiconductor technology expert, explaining the physics behind the interconnect problem and the degrees of freedom semiconductor technology offers system designers. Panelists will then express their thoughts and discuss with you how the interconnect problem can be solved by taking these degrees of freedom into account at the system design level. Views from industrial designers, CAD vendors, IC manufacturers and researchers will be presented.

4A: System Level Design Methodology

Moderators: Y. Mathys, Motorola, FR; H. Hsieh, UC Riverside, US

System Design Using Kahn Process Networks: The Compaan/Laura Approach [p. 340]

E. Deprettere, B. Kienhuis, T. Stefanov, A. Turjan, and C. Zissulescu

New emerging embedded system platforms in the realm of high-throughput multimedia, imaging, and signal processing will consist of multiple microprocessors and reconfigurable components. One of the major problems is how to program these platforms in a systematic and automated way so as to satisfy the performance need of applications executed on these platforms. In this paper, we present our system design approach as an efficient solution to this programming problem. We show how for an application written in Matlab, a Kahn Process Network specification can automatically be derived and systematically mapped onto a target platform composed of a microprocessor and an FPGA. Furthermore, we illustrate how the mapping approach is applied on a real-life example, namely an M-JPEG encoder.

Microarchitecture Development via Metropolis Successive Platform Refinement [p. 346]

D. Densmore, A. Sangiovanni-Vincentelli, and S. Rekhi

Productivity data for IC designs indicates an exponential increase in design time and cost with the number of elements that are to be included in a device. Present applications require the development of complex systems to support novel functionality. To cope with these difficulties, we need to change radically the present design methodology to allow for extensive re-use, early verification in the design cycle, pervasive use of software, and architecture-level optimization. Platform-based design, as defined in [1], has these characteristics. We present the application of this methodology to a complex industrial application provided by Cypress Semiconductor. In this case study, we focus on a particular aspect of this methodology that eases considerably the verification process: successive refinement. We compare this approach with that of a parallel team of designers who developed the IC using standard design approaches.

Fast Exploration of Parameterized Bus Architecture for Communication-Centric SoC Design [p. 352]

C. Shin, Y. Kim, E. Chung, K. Choi, J. Kong, and S. Eo

For successful SoC design, efficient and scalable communication architecture is crucial. Some bus interconnects now provide configurable structures to meet this requirement of an SoC design. Furthermore, bus IP vendors provide software tools that automatically generate the RTL code of a bus once its designer configures it. Configurability, however, imposes more challenges upon designers because the complexity involved in optimization increases exponentially as the number of parameters grows. In this paper, we present a novel approach with which the required effort can be dramatically reduced. An automated optimization tool we developed exploits a genetic algorithm for fast design exploration. This paper shows that the time for the optimizing task can be reduced by more than 90% when the tool is used and, more significantly, that the task can be done without an expert's hand while ending up with a better solution.
Index Terms: Platform-based design, Bus Configuration, Optimization, SoC design, genetic algorithm.

SoftContract: An Assertion-Based Software Development Process that Enables Design-by-Contract [p. 358]

J. Brunel, P. Giusto, M. di Natale, A. Ferrari, and L. Lavagno

This paper discusses a model-based design flow for requirements in distributed embedded software development. Such requirements are specified using a language similar to Linear Temporal Logic which allows one to reason about time and sequencing. They consist of assertions which must hold for a design, given some assumptions on its environment. They can be checked both during simulation and, at least for a subset, even on the target. The key contribution of the paper is the extension to the embedded software domain of assertion-based verification, and the automated generation of property-checking code in multiple target languages, from simulation, to prototyping, to final production.

A System Level Exploration Platform and Methodology for Network Applications Based on Configurable Processors [p. 364]

D. Quinn, B. Lavigueur, G. Bois, and M. Aboulhamid

A recent practice in the development of programmable SoC is the integration of configurable processors, since they offer an interesting compromise between purely software and hardware solutions. This paper proposes an adjustment to the current codesign approach to integrate this opportunity at the partitioning level. Since configurable processors seem to be an interesting option for NPU designs, we integrated into a system level exploration platform the support of an Xtensa processor for more investigation. As case studies, this paper illustrates the methodology for two realistic network-processing applications, for which interesting performances are obtained.

4B: System Level Modelling and Analysis

Moderators: S. Singh, Xilinx, US; A. Jantsch, Royal Inst. of Tech., SE

Refinement of Mixed-Signal Systems with Affine Arithmetic [p. 372]

C. Grimm, W. Heupke, and K. Waldschmidt

This paper describes a framework for the refinement of control and signal processing functions. The design starts with an executable specification, and allowed deviations thereof. Refinement steps introduce models of analog or digital implementations, and augment the 'ideal' behavior with different sources of uncertainty. The framework verifies and analyzes the influence of these uncertainties on system properties using affine arithmetic.
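
Affine arithmetic itself can be sketched with a small class. The sketch below is a generic illustration, not the framework from the paper: a quantity is stored as a center value plus a sum of deviations, each tied to a named noise symbol that ranges over [-1, 1], and only linear operations are supported. The class name, the symbol names, and the numeric values are hypothetical.

```python
class Affine:
    """Affine form: center + sum(coeff * eps), each eps in [-1, 1]."""

    def __init__(self, center, terms=None):
        self.center = center
        self.terms = dict(terms or {})  # noise symbol -> coefficient

    def __add__(self, other):
        terms = dict(self.terms)
        for symbol, coeff in other.terms.items():
            terms[symbol] = terms.get(symbol, 0.0) + coeff
        return Affine(self.center + other.center, terms)

    def scale(self, factor):
        scaled = {s: factor * c for s, c in self.terms.items()}
        return Affine(factor * self.center, scaled)

    def __sub__(self, other):
        return self + other.scale(-1.0)

    def interval(self):
        radius = sum(abs(c) for c in self.terms.values())
        return (self.center - radius, self.center + radius)


# Ideal signal 1.0 with +/-0.1 uncertainty, plus +/-0.01 quantization noise.
x = Affine(1.0, {"x": 0.1})
q = Affine(0.0, {"q": 0.01})
y = x.scale(2.0) + q            # refined model: gain of 2 plus noise
residual = y - x.scale(2.0)     # deviation from the ideal behavior
print(y.interval(), residual.interval())
```

Because both occurrences of `x` share one noise symbol, the residual keeps only the quantization term. Plain interval arithmetic would lose that correlation and report a much wider deviation.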

System-Level Performance Analysis in SystemC [p. 378]

H. Posadas, F. Herrera, P. Sánchez, E. Villar, and F. Blasco

As both the ITRS and the Medea+ DA Roadmaps have highlighted, early performance estimation is an essential step in any SoC design methodology [1-2]. This paper presents a C++ library for timing estimation at system level. The library is based on a general and systematic methodology that takes as input the original SystemC source code without any modification and provides the estimation parameters by simply including the library within a usual simulation. As a consequence, the same models of computation used during system design are preserved and all simulation conditions are maintained. The method exploits the advantages of dynamic analysis, that is, easy management of unpredictable data-dependent conditions and computational efficiency compared with other alternatives (ISS or RT simulation), without the need for SW generation and compilation and HW synthesis. Results obtained on several examples show the accuracy of the method. In addition to the fundamental parameters needed for system-level design exploration, the proposed methodology allows the designer to include capture points at any place in the code. The user can process the corresponding captured events for unrestricted timing constraint verification.

Modeling and Validating Globally Asynchronous Design in Synchronous Frameworks [p. 384]

M. Mousavi, P. Le Guernic, J. Talpin, S. Shukla, and T. Basten

We lay a foundation for modeling and validation of asynchronous designs in a multi-clock synchronous programming model. This allows us to study properties of globally asynchronous systems using synchronous simulation and model-checking toolkits. Our approach can be summarized as automatic transformation of a design consisting of two asynchronously composed synchronous components into a fully synchronous multi-clock model preserving behavioral equivalence. The ultimate goal of this research is to provide the ability to model and build GALS systems in a fully synchronous design framework and deploy it on an asynchronous network preserving all properties of the system proven in the synchronous framework.

Synchronous Protocol Automata: A Framework for Modelling and Verification of SoC Communication Architectures [p. 390]

V. D'Silva, S. Ramesh, and A. Sowmya

Plug-n-Play style Intellectual Property (IP) reuse in System on Chip (SoC) design is facilitated by the use of an on-chip bus architecture. We present a synchronous, Finite State Machine based framework for modelling communication aspects of such architectures. This formalism has been developed via interaction with designers and the industry and is intuitive and lightweight. We have developed cycle accurate methods to formally specify protocol compatibility and component composition and show how our model can be used for compatibility verification, interface synthesis and model checking with automated specification. We demonstrate the utility of our framework by modelling the AMBA bus architecture including details such as pipelined operation, burst and split transfers, the AHB-APB bridge and arbitration features.

Aspects of Formal and Graphical Design of a Bus System [p. 396]

T. Seceleanu and T. Westerlund

This study shows the derivation of a local segmented bus arbiter from an original single segment bus arbiter. The operations are performed in the formal framework of action systems and illustrated in a graphical manner using the corresponding action systems -- UML profile notations. The derivation is useful both to demonstrate the capability of preserving correctness when considering an important hardware design decision and also to identify means through which this kind of decision can be performed in a graphical environment.

4C: Advances in SoC Testing

Moderators: R. Dorsch, IBM Deutschland Entwicklung, DE; E. Larsson, Linköping U, SE

Scan Power Minimization through Stimulus and Response Transformations [p. 404]

O. Sinanoglu and A. Orailoglu

Scan-based cores impose considerable test power challenges due to excessive switching activity during shift cycles. The consequent test power constraints force SOC designers to sacrifice parallelism among core tests, as exceeding power thresholds may damage the chip being tested. Reduction of test power for SOC cores can thus increase the number of cores that can be tested in parallel, significantly improving SOC test application time. In this paper, we propose a scan chain modification technique that inserts logic gates on the scan path. The consequent beneficial test data transformations are utilized to reduce the scan chain transitions during shift cycles and hence test power. We introduce a matrix band algebra that models the impact of logic gate insertion between scan cells on the test stimulus and response transformations realized. As we have successfully modeled the response transformations as well, the methodology we propose is capable of truly minimizing the overall test power. The test vectors and responses are analyzed in an intertwined manner, identifying the best possible scan chain modification, which is realized at minimal area cost. Experimental results justify the efficacy of the proposed methodology as well.

Synchro-Tokens: Eliminating Nondeterminism to Enable Chip-Level Test of Globally Asynchronous Locally-Synchronous SoC's [p. 410]

M. Heath, W. Burleson, and I. Harris

Globally asynchronous locally synchronous (GALS) clocking applied to a system-on-a-chip (SoC) results in a design in which each core is a synchronous block (SB) of logic with a locally generated clock. Inter-core communication is asynchronous and controlled by wrapper logic around the cores. The nondeterministic synchronization used by most GALS architectures makes chip-level silicon debug and functional test difficult and costly. Deterministic GALS methodologies make dataflow assumptions which are only valid for a very limited set of applications. This paper describes a novel deterministic GALS methodology called 'synchro-tokens' whose parameterized wrappers are flexible enough to be useful for a wide range of applications while supporting synchronous debug and test methodologies such as 1149.1 and P1500. The validation of determinism, estimation of area overhead, and analysis of performance impact are detailed.

Wrapper Design for Testing IP Cores with Multiple Clock Domains [p. 416]

Q. Xu and N. Nicolici

This paper addresses the testability problems raised by embedded cores with multiple clock domains. The proposed solution, based on a novel core wrapper architecture, shows how multi-frequency at-speed test response capture can be achieved using low-speed testers synchronized with high-speed on-chip generated clocks. Using experimental data, the trade-offs between the number of tester channels, testing time, area overhead and power dissipation are discussed.

Efficient Modular Testing of SOCs Using Dual-Speed TAM Architectures [p. 422]

A. Sehgal and K. Chakrabarty

The increasing complexity of system-on-chip (SOC) integrated circuits has spurred the development of versatile automatic test equipment (ATE) that can simultaneously drive different channels at different data rates. Examples of such ATEs include the Agilent 93000 series tester based on port scalability and the test processor-per-pin architecture, and the Tiger system from Teradyne. In practice, however, the number of tester channels with high data rates may be constrained by ATE resource limitations, the power rating of the SOC, and scan frequency limits for the embedded cores. Therefore, we formulate the following optimization problem: given two available data rates for the tester channels, an SOC-level test access mechanism (TAM) width W, and V (V < W) channels that can transport test data at the higher data rate, determine an SOC TAM architecture that minimizes the testing time. We present an efficient heuristic algorithm for TAM optimization that exploits port scalability of ATEs to reduce SOC testing time and test cost. We present experimental results on dual-speed TAM optimization for the ITC'2002 SOC test benchmarks.

An Arithmetic Structure for Test Data Horizontal Compression [p. 428]

M. Flottes, R. Poirier, and B. Rouzeyre

We propose a method for reducing test data volume of integrated circuits or cores in a System-on-Chip. This method is intended to reduce the required number of Automatic Test Equipment (ATE) output channels compared to the number of scan-in input pins in a classical multi-chain implementation (horizontal compression). Compression and decompression are based on arithmetic operations and structures which present a very low area overhead. The proposed compression scheme does not impact the fault coverage achieved by the original test sequence before compression.

4E: New Issues in Analogue System- and Circuit-Level Performance Modelling

Moderators: F. Fernandez, IMSE-CNM, ES; R. Schwencker, Infineon Technologies, DE

A Phase-Frequency Transfer Description of Analog and Mixed-Signal Front-End Architectures for System-Level Design [p. 436]

E. Martens and G. Gielen

A novel approach for the modeling of front-end architectures is presented. Architectures are described as a system transforming polyphase harmonic signals through building blocks modeled by polyphase harmonic transfer matrices and distortion tensors. The major goal of the method is to provide a model that is suited for systematic architectural exploration during front-end system design. An example of a downconversion architecture describes the system nonidealities as the result of parasitic transfers between phases and frequencies.

Hierarchical Automatic Behavioral Model Generation of Nonlinear Analog Circuits Based on Nonlinear Symbolic Techniques [p. 442]

L. Näthke, V. Burkhay, L. Hedrich, and E. Barke

We present an extended method of automatic behavioral model generation for nonlinear analog circuits. The focus is on a decrease of simulation time. A procedural model formulation approach is introduced, together with a new simplification method based on the recognition of physical transistor properties of the element models. The simplification process is performed with respect to simulation time, and a hierarchical modeling approach is proposed. These extensions result in models with an obvious speed-up in simulation time compared to the simulation of the original netlists.

Performance Modeling of Analog Integrated Circuits Using Least-Squares Support Vector Machines [p. 448]

T. Kiely and G. Gielen

This paper describes the application of Least-Squares Support Vector Machine (LS-SVM) training to analog circuit performance modeling as needed for accelerated or hierarchical analog circuit synthesis. The training is a type of regression, where a function of a special form is fit to experimental performance data derived from analog circuit simulations. The method is contrasted with a feasibility model approach based on the more traditional use of SVMs, namely classification. A Design of Experiments (DOE) strategy is reviewed which forms the basis of an efficient simulation sampling scheme. The results of our functional regression are then compared to two other DOE-based fitting schemes: a simple linear least-squares regression and a regression using posynomial models. The LS-SVM fitting has advantages over these approaches in terms of accuracy of fit to measured data, prediction of intermediate data points and reduction of free model tuning parameters.

Extended Subspace Identification of Improper Linear Systems [p. 454]

G. Vandersteen, D. Linten, R. Pintelon, and S. Donnay

The modeling of linear transfer functions is often required prior to the simulation of electronic systems. An example is the modeling of on-chip inductors starting from 2-port measurements. The modeling is often done using state-space models that can only represent proper systems. This leads to modeling problems in the case of improper systems such as in the case of 2-port modeling of the admittance matrix of an on-chip inductor. This paper first describes an extended state-space model to represent improper systems. Afterwards, the paper introduces an extension to classical frequency-domain subspace identification methods. The usefulness of both the extended state-space model and the extended subspace modeling technique is illustrated by comparing them with commercially available solutions. This includes a comparison on measurements of an on-chip inductor and on simulations of a coplanar waveguide.

Identification and Modeling of Nonlinear Dynamic Behavior in Analog Circuits [p. 460]

X. Huang and H. Mantooth

This paper presents a new approach for identifying nonlinear dynamic behavior in analog circuits. The approach facilitates the creation of models that more accurately reflect the dynamic behavior of a circuit. It has been used in a fully automated, behavioral modeling tool, Ascend, that starts from the netlist description of the circuit and generates differential algebraic equation (DAE) based behavioral models. The underlying modeling approach is overviewed to provide a context for this research. Some demonstrative test results illustrate the effectiveness of the new method.

4F: Fabrics and Scheduling for Reconfigurable Computing

Moderators: G. Koch, Micronas GmbH, DE; C. Passerone, Politecnico di Torino, IT

Exploring Logic Block Granularity for Regular Fabrics [p. 468]

A. Koorapaty, V. Kheterpal, P. Gopalakrishnan, M. Fu, and L. Pileggi

Driven by the economics of design and manufacturing nanoscale integrated circuits, an emphasis is being placed on developing new, regular logic fabrics that leverage the regularity and programmability of FPGAs, yet deliver a level of performance and density close to ASICs. One example of such a fabric is a Via-Patterned Gate Array (VPGA) [9], which employs ASIC style global routing on top of an array of patternable logic blocks (PLBs). Previous work [8], [6], [10] showed that by employing even limited heterogeneity for the VPGA logic blocks, namely combining a 3-LUT with two 3-input NAND gates, one can achieve performance comparable to that provided by standard cells. Since the area cost for such heterogeneity is far less for a VPGA than for SRAM programmed fabrics such as FPGAs, we can explore new configurations of via-configurable logic blocks that offer greater heterogeneity and granularity to achieve even higher performance. In this paper, we present a new, more granular, via-patterned heterogeneous logic block architecture and compare it to a less granular LUT-based heterogeneous PLB. Our results show higher performance and more effective packing of the logic functions due to increased granularity.

Network Topology Exploration of Mesh-Based Coarse-Grain Reconfigurable Architectures [p. 474]

N. Bansal, S. Gupta, N. Dutt, A. Nicolau, and R. Gupta

Several coarse-grain reconfigurable architectures proposed recently consist of a large number of processing elements (PEs) connected in a mesh-like network topology. We study the effects of three aspects of network topology exploration on the performance of applications on these architectures: (a) changing the interconnection between PEs, (b) changing the way the network topology is traversed while mapping operations to the PEs, and (c) changing the communication delays on the interconnects between PEs. We propose network topology traversal strategies that first schedule PEs that are spatially close and that have more interconnections among them. We use an interconnect aware list scheduling heuristic as a vehicle to perform the network topology exploration experiments on a set of designs derived from DSP applications. Our experimental results show that a spiral traversal strategy, coupled with a two neighbor interconnect topology leads to good performance for the DSP benchmarks considered. Our prototype framework thus provides an exploration environment for system architects to explore and tune coarse-grain reconfigurable architectures for particular application domains.
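
As a rough illustration of why a spiral traversal keeps consecutively visited PEs spatially close, the following sketch enumerates the PEs of a mesh in spiral order and compares it with a plain row-by-row order. The abstract does not say where the spiral starts or in which direction it winds, so the corner-inward convention used here, and the helper names, are assumptions made only for this example.

```python
def spiral_order(rows, cols):
    """PE coordinates (row, col) of a rows x cols mesh in spiral order.

    Assumed convention: start at the top-left corner and wind inward clockwise.
    """
    top, bottom, left, right = 0, rows - 1, 0, cols - 1
    order = []
    while top <= bottom and left <= right:
        order.extend((top, c) for c in range(left, right + 1))
        order.extend((r, right) for r in range(top + 1, bottom + 1))
        if top < bottom:
            order.extend((bottom, c) for c in range(right - 1, left - 1, -1))
        if left < right:
            order.extend((r, left) for r in range(bottom - 1, top, -1))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def row_major_order(rows, cols):
    """PE coordinates visited row by row, left to right."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def max_hop(order):
    """Largest Manhattan distance between two consecutively visited PEs."""
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


if __name__ == "__main__":
    print(max_hop(spiral_order(4, 4)))     # every step moves to a neighbouring PE
    print(max_hop(row_major_order(4, 4)))  # row wrap-around jumps across the mesh
```

In this toy model the spiral never steps further than one mesh hop, whereas the row-by-row order jumps across the whole row at each wrap-around.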

A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning [p. 480]

R. Lysecky and F. Vahid

In previous work, we showed the benefits and feasibility of having a processor dynamically partition its executing software such that critical software kernels are transparently partitioned to execute as a hardware coprocessor on configurable logic -- an approach we call warp processing. The configurable logic place and route step is the most computationally intensive part of such hardware/software partitioning, normally running for many minutes or hours on powerful desktop processors. In contrast, dynamic partitioning requires place and route to execute in just seconds and on a lean embedded processor. We have therefore designed a configurable logic architecture specifically for dynamic hardware/software partitioning. Through experiments with popular benchmarks, we show that by specifically focusing on the goal of software kernel speedup when designing the FPGA architecture, rather than on the more general goal of ASIC prototyping, we can perform place and route for our architecture 50 times faster, using 10,000 times less data memory, and 1,000 times less code memory, than popular commercial tools mapping to commercial configurable logic. Yet, we show that we obtain speedups (2x on average, and as much as 4x) and energy savings (33% on average, and up to 74%) when partitioning even just one loop, which are comparable to commercial tools and fabrics. Thus, our configurable logic architecture represents a good candidate for platforms that will support dynamic hardware/software partitioning, and enables ultra-fast desktop tools for hardware/software partitioning, and even for fast configurable logic design in general.
Keywords: Hardware/software partitioning, FPGA fabric, configurable logic, synthesis, place and route, platforms, system-on-a-chip, dynamic optimization, codesign, self-improving chips, just-in-time compilation, warp processors, reconfigurable computing.

Configuration-Sensitive Process Scheduling for FPGA-Based Computing Platforms [p. 486]

G. Chen, M. Kandemir, and U. Sezer

Reconfigurable computing has become an important part of research in software systems and computer architecture. While prior research on reconfigurable computing has addressed architectural and compilation/programming aspects to some extent, there is still not much consensus on what kind of operating system (OS) support should be provided. In this paper, we focus on the OS process scheduler, and demonstrate how it can be customized considering the needs of reconfigurable hardware. Our process scheduler is configuration sensitive, that is, it reuses the current FPGA configuration as much as possible. Our extensive experimental results show that the proposed scheduler is superior to classical scheduling algorithms such as First-Come-First-Serve (FCFS) and Shortest Job First (SJF).
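
The idea of reusing the loaded FPGA configuration can be sketched with a toy ready queue. The greedy policy below (run the oldest process whose configuration is already loaded, otherwise fall back to the oldest process) is an assumption made for illustration and not the scheduler evaluated in the paper; `Process`, `config_sensitive_schedule` and `count_reconfigurations` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical process record: a name and the FPGA configuration it needs.
Process = namedtuple("Process", ["name", "config"])


def fcfs_schedule(ready):
    """First-Come-First-Serve: run processes in arrival order."""
    return list(ready)


def config_sensitive_schedule(ready, current_config=None):
    """Greedy sketch: prefer the oldest process that matches the loaded
    configuration; if none matches, run the oldest process and reconfigure."""
    pending = list(ready)
    order = []
    while pending:
        chosen = next((p for p in pending if p.config == current_config), pending[0])
        pending.remove(chosen)
        order.append(chosen)
        current_config = chosen.config
    return order


def count_reconfigurations(order, current_config=None):
    """Number of times the FPGA must be reconfigured to run `order`."""
    count = 0
    for p in order:
        if p.config != current_config:
            count += 1
            current_config = p.config
    return count


if __name__ == "__main__":
    ready = [Process("p1", "A"), Process("p2", "B"), Process("p3", "A"),
             Process("p4", "B"), Process("p5", "A")]
    print(count_reconfigurations(fcfs_schedule(ready)))
    print(count_reconfigurations(config_sensitive_schedule(ready)))
```

A policy this simple can starve processes that need a rarely loaded configuration, so a practical scheduler would need some fairness bound on top of it.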

4G: Power Aware Design and Synthesis

Moderators: R. Zafalon, STMicroelectronics, IT; K. Roy, Purdue U, US

Simultaneous State, Vt and Tox Assignment for Total Standby Power Minimization [p. 494]

D. Lee, H. Deogun, D. Blaauw, and D. Sylvester

Standby leakage current minimization is a pressing concern for mobile applications that rely on standby modes to extend battery life. Also, gate oxide leakage current (Igate) has become comparable to subthreshold leakage (Isub) in 90nm technologies. In this paper, we propose a new method that uses a combined approach of sleepstate, threshold voltage (Vt) and gate oxide thickness (Tox) assignments in a dual-Vt and dual-Tox process to minimize both Isub and Igate. Using this method, total leakage current can be dramatically reduced since in a known state in standby mode, only certain transistors are responsible for leakage current and need to be considered for high-Vt or thick-Tox assignment. We formulate the optimization problem for simultaneous state, Vt and Tox assignments under delay constraints and propose two practical heuristics. We implemented and tested the proposed methods on a set of synthesized benchmark circuits. Results show an average leakage current reduction of 5-6X and 2-3X compared to previous approaches that only use state or state+Vt assignment, respectively, with small delay penalties.

A Scalable ODC-Based Algorithm for RTL Insertion of Gated Clocks [p. 500]

P. Babighian, E. Macii, and L. Benini

This paper describes a new automatic clock-gating extraction algorithm working at the RT-level. The key features of our approach are: (i) Seamless merging with existing industrial design flows and commercial tools; (ii) High scalability to deal with large circuits; (iii) Improved quality of results with respect to available commercial tools; (iv) Smaller and well-controlled overhead in speed and area. Experimental results, on a set of industrial RTL designs, demonstrate the viability and practical impact of our approach.

Impact of Data Transformations on Memory Bank Locality [p. 506]

M. Kandemir

High energy consumption presents a problem for sustainable clock frequency and deliverable performance. In particular, memory energy consumption of array-intensive applications can be overwhelming due to poor cache locality. One option for reducing memory energy is to adopt a banked memory architecture, where memory space is divided into banks and each bank can be powered down if it is not in active use. An important issue here is the bank access pattern, which determines opportunities for saving energy. In this paper, we present a compiler-based data layout transformation strategy for increasing the effectiveness of a banked memory architecture. The idea is to transform the array layouts in memory in such a way that two loop iterations executed one after another access the data in the same bank as much as possible; the remaining banks can be placed into a low-power mode. Our simulation-based experiments with nine array-intensive applications show significant savings in memory energy consumption.
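
A small model makes the bank-locality argument concrete. The sketch below counts how often consecutive accesses fall in different banks when a loop walks an n x n array column by column, once with a row-major layout and once with a column-major layout. The array size, bank size and function names are assumptions chosen for illustration; the paper's actual transformation is not described in the abstract.

```python
def bank_switches(addresses, bank_size):
    """Count consecutive access pairs that land in different memory banks."""
    banks = [addr // bank_size for addr in addresses]
    return sum(1 for prev, cur in zip(banks, banks[1:]) if prev != cur)


def column_walk_addresses(n, layout):
    """Element addresses touched by `for j: for i: use A[i][j]` on an n x n array."""
    if layout == "row_major":
        return [i * n + j for j in range(n) for i in range(n)]
    if layout == "column_major":
        return [j * n + i for j in range(n) for i in range(n)]
    raise ValueError(f"unknown layout: {layout}")


if __name__ == "__main__":
    n, bank_size = 8, 16  # 64 elements spread over 4 banks (illustrative values)
    for layout in ("row_major", "column_major"):
        switches = bank_switches(column_walk_addresses(n, layout), bank_size)
        print(layout, switches)
```

With the layout matched to the loop order, successive iterations stay in one bank for long stretches, which is what lets the other banks remain in a low-power mode.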

Why Transition Coding for Power Minimization of On-Chip Buses Does Not Work [p. 512]

C. Kretzschmar, D. Müller, and A. Nieuwland

Encoding techniques which minimize the self- or coupling activity of buses are often proposed to reduce power dissipation on system buses. In this paper, we investigate the efficiency of several coding schemes for on-chip buses with respect to overall power dissipation. The power of the codec systems was estimated by power simulations with the lay-outs and related to the savings on the bus. We derived an expression for the energy efficiency of the codecs as a function of bus length (capacitive load). Despite the fact that adaptive schemes could obtain up to 40% savings, the bus lengths required to reduce the overall power consumption are not realistic for on-chip buses.
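
The trade-off between codec overhead and bus savings can be sketched with a simple linear model, assuming bus energy per transfer grows in proportion to bus length while the codec adds a fixed energy cost. This model, the function names and the numbers in the usage example are assumptions for illustration only; they are not the expression or the measurements from the paper.

```python
def break_even_length_mm(codec_energy_pj, bus_energy_pj_per_mm, saving_fraction):
    """Bus length at which the energy saved on the bus equals the codec overhead.

    Assumed linear model: saved energy = saving_fraction * bus_energy_pj_per_mm * length.
    """
    if not 0 < saving_fraction <= 1:
        raise ValueError("saving_fraction must be in (0, 1]")
    if bus_energy_pj_per_mm <= 0:
        raise ValueError("bus_energy_pj_per_mm must be positive")
    return codec_energy_pj / (saving_fraction * bus_energy_pj_per_mm)


def net_saving_pj(length_mm, codec_energy_pj, bus_energy_pj_per_mm, saving_fraction):
    """Net energy saved per transfer; negative means the codec costs more than it saves."""
    return saving_fraction * bus_energy_pj_per_mm * length_mm - codec_energy_pj


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the formulas.
    print(break_even_length_mm(8.0, 0.5, 0.25))
    print(net_saving_pj(10.0, 8.0, 0.5, 0.25))
```

Under such a model, a coding scheme pays off only on buses longer than the break-even length, which is the kind of comparison the abstract draws against realistic on-chip bus lengths.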

Overhead-Conscious Voltage Selection for Dynamic and Leakage Energy Reduction of Time-Constrained Systems [p. 518]

A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. Al-Hashimi

Dynamic voltage scaling and adaptive body biasing have been shown to reduce dynamic and leakage power consumption effectively. In this paper, we optimally solve the combined supply voltage and body bias selection problem for multi-processor systems with imposed time constraints, explicitly taking into account the transition overheads implied by changing voltage levels. Both energy and time overheads are considered. We investigate the continuous voltage scaling as well as its discrete counterpart, and we prove NP-hardness in the discrete case. Furthermore, the continuous voltage scaling problem is formulated and solved using nonlinear programming with polynomial time complexity, while for the discrete problem we use mixed integer linear programming. Extensive experiments, conducted on several benchmarks and a real-life example, are used to validate the approaches.

5A: System Level Design: Case Studies, Exploration and Optimisation

Moderators: B. Kienhuis, LIACS, Leiden U, NL; F. Petrot, Pierre et Marie Curie U, FR

Dynamic Power Management Using Data Buffers [p. 526]

Y. Lu and L. Cai

This paper presents a method to reduce energy consumption by inserting data buffers. The method determines whether power can be reduced by inserting a buffer between two components and periodically turning off one of them. This method calculates the length of the period and the required buffer size to achieve the optimal energy savings. Our approach can be applied to any application whose data arrival and departure rates are different and known in advance.

Dynamic Memory Management Design Methodology for Reduced Memory Footprint in Multimedia and Wireless Network Applications [p. 532]

D. Atienza, J. Mendias, S. Mamagkakis, D. Soudris, and F. Catthoor

New portable consumer embedded devices must execute multimedia and wireless network applications that demand a large memory footprint. Moreover, they must heavily rely on Dynamic Memory (DM) due to the unpredictability of the input data (e.g. 3D streams features) and system behaviour (e.g. number of applications running concurrently defined by the user). Within this context, consistent design methodologies that can efficiently tackle the complex DM behaviour of these multimedia and network applications are in great need. In this paper, we present a new methodology that allows the design of custom DM management mechanisms with a reduced memory footprint for this kind of dynamic application. The experimental results in real case studies show that our methodology improves memory footprint by 60% on average over current state-of-the-art DM managers.

High-Level System Modeling and Architecture Exploration with SystemC on a Network SoC: S3C2510 Case Study [p. 538]

H. Jang, M. Kang, K. Shim, M. Lee, K. Chae, and K. Lee

This paper presents a high-level design methodology applied to a Network SoC using SystemC. The emphasis is on a high-level design approach for intensive architecture exploration and on verifying cycle-accurate SystemC models against real Verilog RTL models. Unlike many high-level designs, we started the project with working Verilog RTL models in hand, to which we later compared our SystemC models. Moreover, we were able to use the on-chip test board performance simulation data to verify our SystemC-based platform. This paper illustrates that high-level design can achieve the same accuracy as RTL models while simulating over one hundred times faster than RTL. The main topic of the paper is architecture exploration in search of performance degradation in source.

A SystemC-Based Verification Methodology for Complex Wireless Software IP [p. 544]

G. Post, P. Venkataraghavan, T. Ray, and D. Seetharaman

The implementation of a complex hardware Intellectual Property (IP) together with complex lower-level software and the integration into a system platform poses tough challenges to the design and verification engineers. Traditionally, embedded software is developed and tested towards the end of the development cycle because of late availability of lab prototype equipment and hardware IP. In this paper, a 'software-centric' hardware/software implementation and verification methodology for a 3G WCDMA modem is presented, with emphasis on physical layer software design and early verification. The subsystem architecture of 3G hardware and software is presented along with design and verification steps carried out. A versatile SystemC-based test environment is described, which links test case modules producing the stimuli from protocol stack and hardware components to the L1 SW code, executed on an instruction set simulator.

5B: Recent Advances in Digital Systems Simulation

Moderators: M. Zwolinski, Southampton U, UK; M. Lajolo, NEC Laboratories, US

A New Optimized Implementation of the SystemC Engine Using Acyclic Scheduling [p. 552]

D. Pérez, O. Temam, and G. Mouchard

SystemC is rapidly gaining wide acceptance as a simulation framework for SoC and embedded processors. While its main assets are modularity and the very fact it is becoming a de facto standard, the evolution of the SystemC framework (from version 0.9 to version 2.0.1) suggests the environment is particularly geared toward increasing the framework functionalities rather than improving simulation speed. For cycle-level simulation, speed is a critical factor as simulation can be extremely slow, affecting the extent of design space exploration. In this article, we present a fast SystemC engine that, in our experience, can speed up simulations by a factor of 1.93 to 3.56 over SystemC 2.0.1. This SystemC engine is designed for cycle-level simulators and for the moment, it only supports the subset of the SystemC syntax (signals, methods) that is most often used for such simulators. We achieved greater speed (1) by completely rewriting the SystemC engine and improving the implementation software engineering, and (2) by proposing a new scheduling technique, intermediate between SystemC dynamic scheduling technique and existing static scheduling schemes. Unlike SystemC dynamic scheduling, our technique removes many if not all useless process wake-ups, while using a simpler scheduling algorithm than in existing static scheduling techniques.
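
To illustrate the general idea behind scheduling along an acyclic dependency graph, the sketch below orders processes so that each one runs after the processes that write its input signals, allowing every process to be evaluated once per cycle. The abstract does not describe the authors' algorithm, so this is a plain topological sort on an assumed process description (a mapping from process name to its input and output signal sets), using Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def acyclic_schedule(processes):
    """Order processes so that writers of a signal run before its readers.

    `processes` maps a process name to a pair (input signals, output signals).
    Signals with no writer are treated as primary inputs. A cyclic dependency
    makes graphlib raise CycleError.
    """
    writer = {
        sig: name
        for name, (_, outputs) in processes.items()
        for sig in outputs
    }
    depends_on = {
        name: {writer[sig] for sig in inputs if sig in writer}
        for name, (inputs, _) in processes.items()
    }
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    # Hypothetical three-stage datapath used only as an example.
    processes = {
        "alu": ({"a", "b"}, {"sum"}),
        "decode": ({"instr"}, {"a", "b"}),
        "writeback": ({"sum"}, {"reg"}),
    }
    print(acyclic_schedule(processes))
```

Evaluating processes in such an order avoids waking a process before its inputs have settled, which is one way the useless wake-ups mentioned above can be removed when the dependency graph has no cycles.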

Stimuli Generation with Late Binding of Values [p. 558]

A. Ziv

Generating test-cases that reach corner cases in the design is one of the main challenges in the functional verification of complex designs. In this paper, we describe a new technique that increases the ability of test generators by delaying assignment of values in the generated stimuli, until these values are used in the design. This late-binding allows the generator to have a more accurate view of the state of the design, and thus it can better choose the correct values. Experimental results show that late-binding can significantly improve coverage, with a reasonable penalty in simulation time.

Native ISS-SystemC Integration for the Co-Simulation of Multi-Processor SoC [p. 564]

F. Fummi, S. Martini, G. Perbellini, and M. Poncino

In a system-level design flow, the transition from a high-level description entry implies the refinement from an untimed, unpartitioned description to a real architecture where applications are executed on a programmable device and interact with ad-hoc hardware components. Simulation of such architectures requires the capability of efficient co-simulation of a model of hardware with a model of the processor. This paper presents two co-simulation methodologies, based on SystemC as hardware modeling language and on an Instruction Set Simulator (ISS) as a model of the processor. The first one works at the SystemC kernel level and exploits potentialities of the GNU suite, whereas the second uses features offered by the operating system running on the ISS. The two methodologies improve co-simulation performance with respect to state-of-the-art methods, and provide different trade-offs between the simplicity of the programming model, the modeling power, and co-simulation performance.

Extraction of Schematic Array Models for Memory Circuits [p. 570]

S. Bose and A. Nandi

The modeling and simulation of memory circuits remains an outstanding problem when accuracy with respect to the actual schematic implementation is desired. Functionally equivalent RTL models often cannot be used for designs with embedded memory blocks, because schematic models for the surrounding logic may be required for fault modeling accuracy. Existing methods derive a latch model that essentially represents each memory location as a latch primitive, and have a large number of gates. We present new algorithms that model such circuits as decoded arrays that access entire rows of cells for individual read and write operations. Decoded array models allow fault modeling accuracy for the surrounding logic, including the memory address decoder. Experimental data show improvements of an order of magnitude for both logic and fault simulations, when compared to the equivalent latch model.

5C: On-Line Testing and Reliability for Nanometric Technology

Moderators: T. Mak, Intel Corp., US; Y. Tsiatouhas, Ioannina U, GR

Effective Software-Based Self-Test Strategies for On-Line Periodic Testing of Embedded Processors [p. 578]

A. Paschalis and D. Gizopoulos

Software-based self-test (SBST) strategies are particularly useful for periodic testing of deeply embedded processors in low-cost embedded systems that do not require immediate detection of errors and cannot afford the well-known hardware, software, or time redundancy mechanisms. In this paper, first, we identify the stringent characteristics of an SBST test program to be suitable for on-line periodic testing. Then, we introduce a new SBST methodology with a new classification scheme for processor components. After that, we analyze the self-test routine code styles for the three most effective test pattern generation (TPG) strategies in order to select the most effective self-test routine for on-line periodic testing of a component under test. Finally, we demonstrate the effectiveness of the proposed SBST methodology for on-line periodic testing by presenting experimental results for a RISC pipeline processor.

Evaluating the Effects of SEUs Affecting the Configuration Memory of an SRAM-Based FPGA [p. 584]

M. Bellato, P. Bernardi, A. Candelori, M. Rebaudengo, M. Sonza Reorda, M. Violante, M. Ceschia, D. Bortolato, A. Paccagnella, and P. Zambolin

This paper analyses the effects of Single Event Upsets in an SRAM-based FPGA, with special emphasis on the transient faults affecting the configuration memory. Two approaches are combined: on one side, by exploiting the available information and tools dealing with the device configuration memory, we were able to form hypotheses on the meaning of every bit in the configuration memory. On the other side, radiation testing was exploited to validate these hypotheses and to gather experimental evidence about the correctness of the obtained results. As a major result, we can provide detailed information about the effects of SEUs affecting the configuration memory of a commercial FPGA device. As a second contribution, we describe a method for obtaining the same result with similar devices. Finally, the obtained results are crucial to allow the possible usage of SRAM-based FPGAs in safety-critical environments, e.g., by working on the place and route strategies of the supporting tools.

Early SEU Fault Injection in Digital, Analog and Mixed Signal Circuits: A Global Flow [p. 590]

R. Leveugle and A. Ammari

Fault injection techniques have been proposed for years to early analyze the dependability characteristics of digital circuits. Very few attempts have however been reported to perform the same task in analog parts. Furthermore, these attempts are all based on parametric variations. With the increasing number of mixed signal circuits, a unified approach becomes mandatory to globally validate the digital and analog parts, while taking into account real faults occurring in the field, e.g. SEUs. In this paper, a global analysis flow is proposed, based on a high-level model of the circuit. The possibility to inject transient faults in the different parts is discussed. The results obtained on a case study are reported to show the feasibility of the injection in analog blocks.

On Concurrent Error Detection with Bounded Latency in FSMs [p. 596]

S. Almukhaizim, Y. Makris, and P. Drineas

We discuss the problem of concurrent error detection (CED) with bounded latency in finite state machines (FSMs). The objective of this approach is to reduce the overhead of CED, albeit at the cost of introducing a small latency in the detection of errors. In order to ensure no loss of error detection capabilities as compared to CED without latency, an upper bound is imposed on the introduced latency. We examine the necessary conditions for performing CED with bounded latency, based on which we extend a parity-based method to permit bounded latency. We formulate the problem of minimizing the number of required parity bits as an Integer Program and we propose an algorithm based on Linear Program relaxation and Randomized Rounding to solve it. Experimental results indicate that allowing a small bounded latency reduces the hardware cost of the CED circuitry.

5E: Parasitic-Aware Analogue Design

Moderators: H. Graeb, TU Munich, DE; G. Vandersteen, IMEC, BE

Fast, Layout-Inclusive Analog Circuit Synthesis Using Pre-Compiled Parasitic-Aware Symbolic Performance Models [p. 604]

M. Ranjan, A. Agarwal, H. Sampath, R. Vemuri, W. Verhaegen, and G. Gielen

We present a new methodology for fast analog circuit synthesis, based on the use of parameterized layout generators and symbolic performance models (SPMs) in the synthesis loop. Fast layout generation is achieved by using efficient parameterized procedural layout generators. Fast performance estimation is achieved by using pre-compiled SPMs, stored as efficient DDD-like structures called Element Coefficient Diagrams. Techniques have been developed to include layout geometry effects in the SPMs. The accuracy and efficiency of the parasitic inclusion technique as well as the proposed methodology have been demonstrated by comparisons to traditional synthesis methods. The proposed methodology is used for the synthesis of opamps and filters and is demonstrated to achieve effective performance closure.

Sensitivity-Based Modeling and Methodology for Full-Chip Substrate Noise Analysis [p. 610]

R. Murgai, S. Reddy, T. Miyoshi, T. Horie, and M. Tahoori

Substrate noise (SN) is an important problem in mixed-signal designs. With increasing design complexity, it is not possible to simulate for SN with a detailed SPICE model that uses an accurate model for each transistor. In this paper, we propose a sensitivity analysis- and static timing analysis-based methodology to derive a reduced model that computes the worst case substrate noise in the design. The reduced model contains only passive components, which are very few, and is very quick to simulate. The main feature of our methodology is that, unlike previous approaches, it is independent of input patterns and does not need to simulate for millions of clock cycles. This lets us apply it to a full-chip design in reasonable CPU time. We validate our reduced model on several benchmark circuits against a detailed and highly accurate reference model. On average, the reduced model is within 16.4% of the reference model and is up to 38 times faster. Finally, we apply our methodology to a mixed-signal switch chip design consisting of 8 million gates and show that it finishes in 17 minutes.

SubCALM: A Program for Hierarchical Substrate Coupling Simulation on Floorplan Level [p. 616]

T. Brandtner and R. Weigel

The hierarchical substrate coupling simulation tool SubCALM offers the opportunity to estimate substrate coupling on floorplan level. A novel approach for modeling well and SOI structures in a boundary element description is introduced. Several acceleration techniques like precalculated macromodels and sophisticated preconditioning algorithms are presented which are applied to an O(n)-conjugate-gradient Poisson solver in order to be able to process large full-chip layouts during floorplanning.
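
As a rough illustration of the kind of conjugate-gradient Poisson solve mentioned above, the following Python sketch runs plain, unpreconditioned CG on a 1-D finite-difference Poisson operator. It is not the SubCALM solver: the boundary element formulation, macromodels and preconditioning are omitted, and the names apply_laplacian and conjugate_gradient are hypothetical.

```python
def apply_laplacian(v):
    """Matrix-free 1-D Poisson operator (tridiagonal -1, 2, -1; zero boundaries)."""
    n = len(v)
    return [
        2 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(b, tol=1e-10, max_iter=1000):
    """Solve A x = b for the operator above, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:          # converged (also covers b == 0)
            break
        Ap = apply_laplacian(p)
        alpha = rs / dot(p, Ap)      # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


b = [1.0] * 8
x = conjugate_gradient(b)
residual = max(abs(ai - bi) for ai, bi in zip(apply_laplacian(x), b))
print(residual)
```

The operator is applied matrix-free, so each iteration costs O(n) for this sparse stencil; a preconditioner would be added to reduce the iteration count on large problems.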

Optimization of Integrated Spiral Inductors Using Sequential Quadratic Programming [p. 622]

Y. Zhan and S. Sapatnekar

The optimization of integrated spiral inductors has great practical importance. Previous optimization methods used in this field are either too slow or depend on very simplified assumptions in the device modeling, which make the algorithm applicable only to low-frequency cases. In this paper, we propose using the sequential quadratic programming (SQP) approach to optimize the on-chip spiral inductors. A physical model based on first principles is used in the back-end device-parameter extraction engine, which makes the algorithm suitable for optimization at any frequency range. In addition, compared with enumeration, which is used in many inductance optimization packages, our experiments show that the SQP algorithm can achieve at least an order of magnitude speedup while maintaining the same quality of the optimized design.

5F: Hardware/Software System Design and Architecture Exploration

Moderators: B. Juurlink, TU Delft, NL; R. Leupers, RWTH Aachen, DE

System Design for DSP Applications Using the MASIC Methodology [p. 630]

A. K. Deb, A. Jantsch, and J. Öberg

Expensive top-down iterations are often required in the design cycle of complex DSP systems. In this paper, we introduce two levels of abstraction in the design flow by systematically categorizing the architectural decisions. As a result, the top-down iteration loop is broken. We also present a technique to capture and inject the architectural decisions such that the system models can be created and simulated efficiently. The concepts are illustrated by a realistic speech processing example, which is implemented using the AMBA on-chip architecture. Our methodology offers a smooth path from the functional modeling phase to the implementation level, facilitates the reuse of HW and SW components, and enjoys existing tool support at the backend.

Flexible Software Protection Using Hardware/Software Codesign Techniques [p. 636]

J. Zambreno, A. Choudhary, R. Simha, and B. Narahari

A strong level of trust in the software running on an embedded processor is a prerequisite for its widespread deployment in any high-risk system. The expanding field of software protection attempts to address the key steps used by hackers in attacking a software system. In this paper we present an efficient and tunable approach to some problems in embedded software protection that utilizes a hardware/software codesign methodology. By coupling our protective compiler techniques with reconfigurable hardware support, we allow for a greater flexibility of placement on the security-performance spectrum than previously proposed mainly-hardware or software approaches. Results show that for most of our benchmarks, the average performance penalty of our approach is less than 20%, and that this number can be greatly improved upon with the proper utilization of compiler and architectural optimizations.

Interactive Cosimulation with Partial Evaluation [p. 642]

P. Schaumont and I. Verbauwhede

We present a technique to improve the efficiency of hardware-software cosimulation, using design information known at simulator compile-time. The generic term for such optimization is partial evaluation. Our contribution is that we apply the optimization transparently to the user, and at multiple abstraction levels in the simulation. We use the technique to create an interactive codesign environment, and evaluate it on several designs including an AES encryption coprocessor and a Viterbi decoder, and for several instruction-set simulators. Compared to SystemC-based cosimulation, we achieve comparable cosimulation performance at only a fraction of the model-build time.

Communication Analysis for System on Chip Design [p. 648]

A. Siebenborn, O. Bringmann, and W. Rosenstiel

In this paper we present an approach for analysis of systems of parallel, communicating processes for SoC design. We present a method to detect communications that synchronize the program flow of two or more processes. These synchronization points set the processes into relation and allow the determination of the global timing behavior of such a system. Using the results of our method for communication analysis, we present a new method to detect communications that might produce conflicts on shared communication resources. This information can be used for the assignment of communication resources.

5G: Hot Topic -- Extremely Low-Power Logic

Organizer/Moderator: C. Piguet, CSEM, CH
Presenters:
J. Gautier, CEA-LETI, FR
C. Heer, Infineon Technologies, DE
I. O'Connor, Ecole centrale de Lyon, FR
U. Schlichtmann, Technical University of Munich, DE

Extremely Low-Power Logic [p. 656]: For extremely low-power logic, three new and promising techniques will be described. The first are methods on circuit and system level for reduced supply voltages. In large logic blocks, interconnect becomes a main issue that could be solved by on-chip optical interconnect. Nano-devices will also be presented as a possibility to compute with nearly zero power, and compared to future 10-nanometer transistors.

IP1: Interactive Presentations

Decomposition of Instruction Decoder for Low Power Design [p. 664]

W. Kuo, T. Hwang, and A. Wu

Microprocessors have been used in a wide range of applications. During the execution of instructions, instruction decoding is a major task for identifying instructions and generating control signals for data-paths. By exploiting program behaviors, we propose a novel instruction-decoding approach for power minimization. Using the proposed instruction-decoding structure, we present a partitioning method that decomposes the instruction-decoding circuit into two sub-circuits according to the execution frequencies of instructions. Using our proposed decoding structure, only one sub-circuit will be activated when executing an instruction. Experimental results have demonstrated that our proposed approach achieves average power reductions of 26.71% and 15.69% for the instruction decoder and the control unit, respectively.

Functional Level Power Analysis: An Efficient Approach for Modeling the Power Consumption of Complex Processors [p. 666]

J. Laurent, N. Julien, E. Senn, and E. Martin

A high-level consumption estimation methodology and its associated tool, SoftExplorer, are presented. The estimation methodology uses a functional modeling of the processor combined with a parametric model to allow the designer to estimate the power consumption when the embedded software is executed on the target. SoftExplorer uses as input the assembly code generated by the compiler; its efficiency is compared to SimplePower's approach. Results for different processors (TI C62, C67, C55 and ARM7) and for several DSP applications provide an average error less than 5%.

Formal Verification Coverage: Are the RTL-Properties Covering the Design's Architectural Intent? [p. 668]

P. Basu, S. Das, P. Dasgupta, P. Chakrabarti, C. Mohan, and L. Fix

It is essential to formally ascertain whether the RTL validation effort effectively guarantees the correctness with respect to the design's architectural intent. The design's architectural intent can be expressed in formal properties. However, due to the capacity limitation of formal verification, these architectural-properties cannot be directly verified on the RTL. As a result, a set of lower level RTL-properties are developed and verified against the RTL. In this paper we present: (1) a method for checking whether the RTL-properties are covering the architectural-properties, that is, whether verifying the RTL-properties guarantees the correctness of the design's architectural intent, and (2) a method to identify the coverage holes in terms of the architectural-properties (or their sub-properties) that are not covered.

Functional Coverage Metric Generation from Temporal Event Relation Graph [p. 670]

Y. Kwon and C. Kyung

Functional coverage is a technique which can be used for checking the completeness of test vectors. In this paper, automatic generation of temporal events for functional coverage is proposed. The TERG (Temporal Event Relation Graph) is a graph where the nodes represent basic temporal properties and the edges represent the time-shift value between two properties. Hierarchical temporal events are generated by traversing the TERG such that invalid or irrelevant properties are eliminated. Concurrent edge groups in the TERG make it possible to generate more comprehensive temporal properties.

Automatic Scan Insertion and Pattern Generation for Asynchronous Circuits [p. 672]

A. Efthymiou, D. Edwards, and C. Sotiriou

This paper presents 3phisLSSD, a novel, easily automatable approach for scan insertion and ATPG of asynchronous circuits. 3phisLSSD inserts scan latches only into global circuit feedback paths, leaving the local feedback paths of asynchronous state-storing gates intact. By employing a three-phase LSSD clocking scheme and complemented by a novel ATPG method, our approach achieves industrial quality testability with significantly less area overhead testing the same number of faults compared to full-scan LSSD. The effectiveness of our approach is demonstrated on an asynchronous SOC interconnection fabric, where our 3phisLSSD ATPG tool achieved over 99% test coverage.

Automatic Synthesis and Simulation of Continuous-Time ΣΔ Modulators [p. 674]

H. Aboushady, L. de Lamarre, N. Beilleau, and M. Louërat

This paper presents a mixed equation-based and simulation-based design methodology for continuous-time Sigma-Delta modulators from high level specifications down to Layout. The calculation and scaling of the Sigma-Delta coefficients as well as circuit sizing and layout generation are implemented in the same analog design environment CAIRO+. The design of a complete third order current-mode continuous-time Sigma-Delta modulator is taken as an example to show the effectiveness of the proposed design methodology.

A Methodology for System-Level Analog Design Space Exploration [p. 676]

F. De Bernardinis and A. Sangiovanni-Vincentelli

This paper describes a novel approach to system level analog design. A new abstraction level -- the platform -- is introduced to separate circuit design from design space exploration. An Analog Platform encapsulates analog components concurrently modeling their behavior and their achievable performances. Performance models are obtained through statistical sampling of circuit configurations. The design configurations space is specified with Analog Constraint Graphs so that the sampling space is significantly reduced. System level exploration can be achieved through optimization on behavioral models constrained by performance models. Finally, an example is provided showing the effectiveness of the approach on a WCDMA amplifier.

Systematic Design for Optimization of High-Resolution Pipelined ADCs [p. 678]

R. Lotfi, M. Taherzadeh-Sani, and O. Shoaei

Pipelining is the promising approach to implement high-speed medium-to-high resolution analog-to-digital converters with minimum power consumption. In this paper, the most important specifications of a pipelined ADC including the signal-to-noise-and-distortion ratio and spurious-free dynamic range as well as the total current consumption of the converter are presented in closed-form equations and an optimization methodology for design of pipelined ADCs is suggested. Simulation results confirming the effectiveness of the methodology are presented.

A Direct Bootstrapped CMOS Large Capacitive-Load Driver Circuit [p. 680]

J. García, J. Montiel-Nelson, J. Sosa, and H. Navarro

A new 2.5V CMOS large capacitive-load driver circuit, using a direct bootstrap technique, for low-voltage CMOS VLSI digital design is presented. The proposed driver circuit exhibits a high speed and low power consumption to drive large capacitive loads. We compare our driver structure with the direct bootstrap circuit [1] in terms of the product of three metrics, active area, propagation time delay and power consumption. Results demonstrate the superior performance of the proposed driver circuit.

Co-Processor Synthesis: A New Methodology for Embedded Software Acceleration [p. 682]

B. Hounsell and R. Taylor

This paper introduces co-processor synthesis -- a methodology that provides design benefits by implementing hardware co-processors directly from embedded software. The paper examines the design benefits in this new approach vs behavioral synthesis and configurable processor methodologies.

Behavioural Bitwise Scheduling Based on Computational Effort Balancing [p. 684]

M. Molina, R. Ruiz-Sautua, J. Mendías, and R. Hermida

Conventional synthesis algorithms schedule multiple precision specifications by balancing the number of operations of every different type and width executed per cycle. However, totally balanced schedules are not always possible and therefore some hardware waste appears. In this paper a heuristic scheduling algorithm to minimize this hardware waste is presented. It successively transforms specification operations into sets of smaller ones until the most uniform distribution of the computational effort of operations among cycles is reached. In the schedules proposed some operations are executed during a set of non-consecutive cycles.

A Tool for Automatic Generation of RTL-Level VHDL Description of RNS FIR Filters [p. 686]

A. Nannarelli, A. Del Re, and M. Re

Although digital filters based on the Residue Number System (RNS) show high performance and low power dissipation, RNS filters are not widely used in DSP systems, because of the complexity of the algorithms involved. We present a tool to design RNS FIR filters which hides the RNS algorithms from the designer, and generates a synthesizable VHDL description of the filter taking into account several design constraints such as delay, area and energy.
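
For readers unfamiliar with RNS arithmetic, the Python sketch below shows the basic idea behind an RNS FIR tap sum: operands are converted to residues, each modulus channel multiplies and accumulates independently, and the result is reconstructed with the Chinese Remainder Theorem. The moduli set and function names are illustrative assumptions and are not taken from the tool described above; only non-negative values smaller than the dynamic range are handled.

```python
from math import prod

MODULI = (7, 11, 13)      # pairwise coprime; a hypothetical choice
M = prod(MODULI)          # dynamic range of the representation: 1001


def to_rns(x):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in MODULI)


def from_rns(residues):
    """Reverse conversion via the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m): modular inverse
    return total % M


def rns_fir(samples, taps):
    """One FIR output sum(h[k] * x[k]), accumulated separately per modulus."""
    acc = [0] * len(MODULI)
    for x, h in zip(samples, taps):
        xr, hr = to_rns(x), to_rns(h)
        acc = [(a + xi * hi) % m for a, xi, hi, m in zip(acc, xr, hr, MODULI)]
    return from_rns(acc)


print(rns_fir([3, 5, 2], [4, 1, 6]))   # 3*4 + 5*1 + 2*6 = 29
```

Because no carries propagate between the modulus channels, each channel can be built as a small independent datapath, which is the usual argument for the speed and power behavior of RNS filters.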

On Transfer Function and Power Consumption Transient Response [p. 688]

L. Cao

This paper proposes to use time series analysis techniques to model both average and cycle-by-cycle moving average power consumption behavior of electronic systems. The power model is in the form of first and/or second order transfer functions that represent the mapping from primary input/output activities to power consumption profile over time. Such an approach has power estimation applications in both software simulation and hardware implementation of power monitor circuit.
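
To make the notion of a first-order transfer-function power model concrete, here is a minimal Python sketch of a discrete-time model p[n] = a * p[n-1] + b * u[n], where u is an activity measure and p the modeled power. The coefficients, the function name and the choice of a step input are assumptions for illustration only; the abstract does not state the model form beyond its order.

```python
def first_order_response(activity, a, b, p0=0.0):
    """Simulate p[n] = a * p[n-1] + b * u[n], i.e. H(z) = b / (1 - a z^-1)."""
    power = []
    prev = p0
    for u in activity:
        prev = a * prev + b * u
        power.append(prev)
    return power


# Transient response to a step in input/output activity (hypothetical values).
step = [1.0] * 50
trace = first_order_response(step, a=0.8, b=0.5)

# For a constant input the model settles at the DC gain b / (1 - a).
steady_state = 0.5 / (1 - 0.8)
print(trace[0], trace[-1], steady_state)
```

In such a model the coefficient a sets how quickly the power estimate follows a change in activity, while b / (1 - a) sets the long-run average.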

Polynomial Abstraction for Verification of Sequentially Implemented Combinational Circuits [p. 690]

T. Raudvere, A. Singh, I. Sander, and A. Jantsch

Today's integrated circuits with increasing complexity cause the well known state space explosion problem in verification tools. In order to handle this problem a much simpler abstract model of the design has to be created for verification. We introduce the polynomial abstraction technique, which efficiently simplifies the verification task of sequential design blocks whose functionality can be expressed as a polynomial. Through our technique, the domains of possible values of data input signals can be reduced. This is done in such a way that the abstract model is still valid for model checking of the design functionality in terms of the system's control and data properties. We incorporate polynomial abstraction into the ForSyDe methodology, for the verification of clock domain design refinements.

Regression Simulation: Applying Path-Based Learning in Delay Test and Post-Silicon Validation [p. 692]

L. Wang

This paper presents a novel path-based learning methodology to achieve timing Regression Simulation. The methodology can be applied for two purposes: (1) In pre-silicon phase, regression simulation can be used to produce a fast and approximate timing simulator to avoid the high cost associated with statistical timing simulation. (2) In post-silicon phase, regression simulation can be used as a vehicle to deduce critical paths from the pass/fail behavior observed on the test chips. Our path-based learning methodology consists of four major components: a delay test pattern set, a logic simulator, a set of selected paths as the basis for learning, and a machine learner. We summarize the key concepts in our regression simulation approach and present experimental results.

IP2: Interactive Presentations

A Game Theoretic Approach to Low Energy Wireless Video Streaming [p. 696]

A. Iranli, K. Choi, and M. Pedram

This paper presents a dynamic energy management policy for a wireless video streaming system, consisting of battery-powered client and server. The paper starts from the observation that the video quality in wireless streaming is a function of three factors: encoding aptitude of the server, decoding aptitude of the client, and the wireless channel. Based on this observation, the energy consumption of a wireless video streaming system is modeled and analyzed. Using the proposed model, the optimal energy assignment to each video frame is done such that the maximum system lifetime is achieved while satisfying a given minimum video quality requirement. Experimental results show that the proposed policy increases the system lifetime by 20%.

Block-Enabled Memory Macros: Design Space Exploration and Application-Specific Tuning [p. 698]

A. Ivaldi, A. Macii, E. Macii, and L. Benini

In this paper, we propose a combined solution that allows us to customize the architecture of internally partitioned SRAM macros according to the given application to be executed. Energy savings with respect to monolithic memory configurations are above 40%, without access time violation.

Synthesis of Partitioned Shared Memory Architectures for Energy-Efficient Multi-Processor SoC [p. 700]

E. Macii, K. Patel, and M. Poncino

Accesses to the shared memory in multi-processor systems-on-chip represent a significant performance bottleneck. Multi-port memories are a common solution to this problem, because they allow accesses to be parallelized. However, they are not an energy-efficient solution. We propose an energy-efficient shared-memory architecture that can be used as a substitute for multi-port memories, which is based on an application-driven partitioning of the shared address space into a multi-bank architecture. Experiments on a set of parallel benchmarks show energy savings of about 56% with respect to a dual-port memory architecture, at a very limited performance penalty.

A Low Power Strategy for Future Mobile Terminals [p. 702]

M. Nikitovic and M. Brorsson

In this paper, we have investigated the efficiency of two power-saving strategies that reduce both static and dynamic power consumption when applied to a chip-multiprocessor (CMP). They are evaluated under two workload scenarios and compared against a conventional uni-processor architecture and a CMP without any power-aware scheduling. The results show that energy due to static and dynamic power consumption can be reduced by up to 78% and that a further 8% energy can be saved at the expense of response-time of non-critical applications. Furthermore, a small study on the potential impact of system-level events showed that system calls can contribute significantly to the total energy consumed.

A 0.18 µm CMOS Implementation of On-Chip Analogue Test Signal Generation from Digital Test Patterns [p. 704]

L. Rolíndez, S. Mir, G. Prenat, and A. Bounceur

The test of Analogue and Mixed-Signal (AMS) cores requires the use of expensive AMS testers and accessibility to internal analogue nodes. The test cost can be considerably reduced by the use of Built-In-Self-Test (BIST) techniques. One of these techniques consists in generating analogue test signals from digital test patterns (obtained via ΣΔ modulation) and converting the responses of the analogue modules into digital signatures that are compared with the expected ones. This paper presents an implementation of the analogue test signal generation part that includes programmability of the circuit blocks, leading to an improvement of performance and a reduction of circuit size with respect to previous approaches. A 0.18 µm CMOS circuit has been designed and fabricated, allowing the generation of test signals ranging from 10 Hz to 1 MHz.

A Digital Test for First-Order ΣΔ Modulators [p. 706]

G. Leger and A. Rueda

This paper presents a digital structural test for first order Sigma-Delta modulators. A periodic digital sequence is used as a stimulus to obtain a signature of the integrator leakage. This parameter is known to be related to the modulator precision and its estimation is of great importance to assess if the modulator works as expected. As the proposed technique is fully digital, it is specially suitable to test modulators embedded in complex Mixed-Signal circuits.
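
As background for the integrator leakage parameter discussed above, the following Python sketch simulates a textbook discrete-time first-order Sigma-Delta modulator in which a factor leak < 1 models a lossy integrator. It is a behavioral illustration under assumed names and values, not the test procedure or stimulus sequence proposed in the paper.

```python
def first_order_sigma_delta(samples, leak=1.0):
    """Return the +1/-1 bitstream of a first-order modulator.

    leak = 1.0 is an ideal integrator; leak < 1.0 models integrator leakage.
    Inputs are assumed to lie strictly between -1 and +1.
    """
    state = 0.0
    feedback = 0.0          # previous quantizer output fed back to the input
    bits = []
    for x in samples:
        state = leak * state + x - feedback
        bit = 1 if state >= 0 else -1
        bits.append(bit)
        feedback = float(bit)
    return bits


# With an ideal integrator the bitstream average tracks a DC input.
bits = first_order_sigma_delta([0.25] * 4000)
print(sum(bits) / len(bits))

# The same stimulus with a leaky integrator, for comparison.
leaky_bits = first_order_sigma_delta([0.25] * 4000, leak=0.95)
print(sum(leaky_bits) / len(leaky_bits))
```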

Trim Bit Setting of Analog Filters Using Wavelet-Based Supply Current Analysis [p. 708]

S. Bhunia, A. Raychowdhury, and K. Roy

Unlike the Fourier transform, the wavelet transform resolves a signal in both time and frequency. In this work, we show that time-domain information obtained from wavelet analysis of supply current can be used to efficiently trim analog filters. The pole/zero locations in the frequency response of analog filters shift due to change in component values with process variations. Wavelet analysis of supply current can be a promising alternative to test frequency specification of analog filters, since it needs only one test stimulus and is virtually unaffected by transistor threshold variation. Simulation results on two test circuits demonstrate that we can estimate pole/zero shift with less than 3% error.
Index Terms: Wavelet Transform, Analog Filter, Trim Bit, Dynamic Supply Current (IDD).

SoC Test Scheduling with Power-Time Tradeoff and Hot Spot Avoidance [p. 710]

J. Chin and M. Nourani

We present a test scheduling methodology for core-based system-on-chips that can avoid hot spots and allows tradeoff between physical power dissipation and overall test time. A mixed integer linear programming formulation is presented to globally perform the power-time tradeoff, satisfy constraints, and produce the SoC test schedule.

STEPS: Experimenting a New Software-Based Strategy for Testing SoCs Containing P1500-Compliant IP Cores [p. 712]

M. Benabdenbi, F. Pêcheux, A. Greiner, M. Tuna and E. Viaud

This paper presents STEPS, an innovative software-based approach for testing P1500-compliant SoCs. STEPS is based on the concept that the ATE is not considered as an initiator applying vectors to the SoC test pins but rather as a target, a huge repository of 32-bit test data and control commands. The ATE is connected to the functional SoC external RAM controller interface. The only additional test component in the SoC is a P1500 test processor that converts test data into serial P1500 streams. This paper applies the STEPS methodology to SoCs containing a VCI-compliant interconnect, a microprocessor, P1500-compliant IP cores and an external RAM controller interface. Using the ITC02 SoC benchmarks, a comparison is done between the STEPS architecture and a classical bus-based strategy.

Are Our Designs for Testability Features Fault Secure? [p. 714]

C. Metra, M. Omaña, and T. Mak

We analyze the risks associated with faults affecting some common Design For Testability (DFT) features employed within digital products. We will show that some DFT structures may become useless, with consequent dramatic impact on test effectiveness and product quality. We borrow the Fault Secure property and we will show that it guarantees that no escapes or false acceptance of faulty products may occur because of faults within the DFT structures.

Test Compression and Hardware Decompression for Scan-Based SoCs [p. 716]

F. Wolff, C. Papachristou, and D. McIntyre

We present a new decompression architecture suitable for embedded cores in SoCs which focuses on improving the download time by avoiding higher internal-to-ATE clock ratios and by exploiting hardware parallelism. The Bounded Huffman compression facilitates decompression hardware tradeoffs. Our technique is scalable in that the downloadable RAM-based decode table accommodates different SoC cores with different characteristics such as the number of scan chains and test set data distributions.

Concurrent Sizing, Vdd and Vth Assignment for Low-Power Design [p. 718]

A. Srivastava, D. Sylvester, and D. Blaauw

We present a sensitivity based algorithm for total power including dynamic and subthreshold leakage power minimization using simultaneous sizing, Vdd and Vth assignment. The proposed algorithm is implemented and tested on a set of combinational benchmark circuits. A comparison with traditional CVS based algorithms demonstrates the advantage of the algorithm including an average power reduction of 37% at primary input activities of 0.1. We also investigate the impact of various low Vdd values on total power savings.

Sizing and Characterization of Leakage-Control Cells for Layout-Aware Distributed Power-Gating [p. 720]

P. Babighian, E. Macii, and L. Benini

This paper proposes a methodology for sleep transistor sizing for usage in a novel, single-threshold leakage cut-off approach, where power gating cells are distributed row-by-row in a fully placed circuit. Sizing equations are obtained by performing SPICE simulations for a 130nm technology. Furthermore, the layout of a test case is considered and power and delay values are extracted in order to demonstrate the practical impact of our solution.

IP3: Interactive Presentations

An Asynchronous Synthesis Toolset Using Verilog [p. 724]

F. Burns, D. Shang, A. Koelmans, and A. Yakovlev

We present a new CAD tool set for generating asynchronous circuits from high-level Verilog level-sensitive specifications. Initially high-level Verilog descriptions are compiled and converted into a novel intermediate Petri-net format. The intermediate format is subsequently passed to optimization tools and mapping tools where it is directly mapped into asynchronous datapath and control circuits using David Cells (DCs). Finally logic optimization tools are applied to generate speed independent (SI) circuits. The speed independent circuits generated perform well compared to circuits generated by existing asynchronous tools.

Organizing Libraries of DFG Patterns [p. 726]

G. Dittmann

We propose to arrange a library of tree patterns into a hierarchy by means of identity operations. Compared with current unstructured approaches, our new method reduces the computational complexity of searching a pattern from O(n·p) to only O(d), d ≤ p. Furthermore, the organization reveals synergies between patterns for ASIP instruction-set synthesis, data-path sharing, and code generation.

Compositional Memory Systems for Data Intensive Applications [p. 728]

A. Molnos, M. Heijligers, J. Van Eijndhoven, and S. Cotofana

To alleviate the system performance unpredictability of multitasking applications running on multiprocessor platforms with shared memory hierarchies, we propose a task-level set-based cache partitioning. We evaluate our approach on a CAKE platform with three Trimedias, one MIPS and a shared level 2 cache using a picture-in-picture benchmark. We compare the performance implications of two types of cache partitioning, namely associativity based and set based. Our experiments indicate that associativity-based cache partitioning induces at least 30% performance degradation, whereas set-based partitioning provides 27% performance improvement when compared to a non-partitioned cache scenario.
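
The general idea of set-based partitioning can be sketched in a few lines of Python: each task is assigned its own contiguous range of cache sets, and the set index of an access is computed inside that range, so tasks cannot evict each other's lines. The line size, task names and partition sizes below are hypothetical and are not taken from the CAKE experiments.

```python
LINE_BYTES = 64   # assumed cache line size


def partitioned_set_index(address, task, partitions):
    """Map an address to a cache set inside the range owned by `task`.

    partitions maps a task name to (first_set, number_of_sets).
    """
    base, count = partitions[task]
    line = address // LINE_BYTES       # drop the byte offset within the line
    return base + line % count         # index only within the task's own sets


# A 128-set shared cache split between two tasks (hypothetical sizes).
partitions = {"video": (0, 96), "audio": (96, 32)}

print(partitioned_set_index(0x12345, "video", partitions))   # 13
print(partitioned_set_index(0x12345, "audio", partitions))   # 109
```

Associativity-based partitioning would instead restrict each task to a subset of the ways within every set, leaving the set index computation unchanged.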

Scalar Metric for Temporal Locality and Estimation of Cache Performance [p. 730]

J. Alakarhu and J. Niittylahti

A scalar metric for temporal locality is proposed. The metric is based on LRU stack distance. This paper shows that the cache hit rate can be estimated based on the proposed metric (an error of a few percent can be expected). The metric eases high-level memory system outlining and enables using stack processing in run-time locality analysis.
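
As a rough illustration of the quantity the metric builds on, the following Python sketch computes LRU stack distances for an address trace and derives the hit rate of a fully associative LRU cache from them. It does not reproduce the scalar metric proposed in the paper; the function names, the trace, and the fully associative cache model are assumptions made for this example.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = []  # most recently used address is kept at the end
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Depth of the address in the LRU stack, counted from the top (1-based).
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # cold miss: address never seen before
        stack.append(addr)
    return distances


def lru_hit_rate(trace, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` blocks (assumed model)."""
    dists = stack_distances(trace)
    # An access hits exactly when its stack distance does not exceed the capacity.
    hits = sum(1 for d in dists if d is not None and d <= capacity)
    return hits / len(dists)


if __name__ == "__main__":
    trace = list("abcabda")
    print(stack_distances(trace))
    print(lru_hit_rate(trace, 3))
```

The same list of distances yields the hit rate for every cache size in one pass over the trace, which is what makes stack processing attractive for locality analysis.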

.NET Framework -- A Solution for the Next Generation Tools for System-Level Modeling and Simulation [p. 732]

J. Lapalme, E. Aboulhamid, G. Nicolescu, L. Charest, J. David, F. Boyer, and G. Bois

Modeling and Simulating Memory Hierarchies in a Platform-Based Design Methodology [p. 734]

P. Viana, E. Barros, S. Rigo, R. Azevedo, and G. Araújo

This paper presents an environment based on SystemC for architecture specification of programmable systems. Making use of the new architecture description language ArchC, which is able to capture the processor description as well as the memory subsystem configuration, this environment offers support for system-level specification intended for platform-based design. As a case study, the memory architecture exploration for a simple image processing application is presented, and a more robust evaluation of the environment is performed through the execution of some real-world benchmarks.

Integrating the Synchronous Dataflow Model with UML [p. 736]

P. Green and S. Essa

UML has attracted significant interest as a system description language. However, some aspects of embedded system behavior are difficult to model in UML. In particular, applications with significant dataflow components are not well represented. This paper considers how the synchronous dataflow model can be integrated with UML to provide behavioral descriptions, in an object oriented context, for system elements that perform stream processing. The integration of the SDF model with the UML state machine model is also discussed.

Design and Behavioral Modeling Tools for Optical Network-On-Chip [p. 738]

M. Brière, L. Carrel, T. Michalke, F. Mieyeville, I. O'Connor, and F. Gaffiot

In this paper, we present a tool to analyse photonic devices that can be used to realize basic building blocks of an optical network-on-chip (ONoC). Co-design between electrical tools and optical tools is possible. The VHDL-AMS language has been used to implement behavioral models of photonic devices. For low-level simulation, a gateway between an optical simulator, based on the finite elements method, and a typical EDA layout editor has been realized.

Hierarchical Modeling and Simulation of Large Analog Circuits [p. 740]

S. Tan, Z. Qi, and H. Li

This paper proposes a new hierarchical circuit modeling and simulation technique in s-domain for linear analog circuits. The new algorithm can perform circuit complexity reduction by deriving the exact or approximate admittances in rational form in the reduced circuit matrix and deriving the circuit characteristics for very large linear analog and interconnect circuits. We characterize some theoretical results regarding the conditions on the generations of canceling terms during the general hierarchical circuit analysis and propose an explicit de-cancellation scheme to remove canceling terms based on a new hierarchical symbolic analysis framework. The resulting algorithm can be used for modeling and simulation of linear analog and interconnect circuits in both frequency and time domain.

Efficient Mixed-Domain Behavioural Modeling of Ferromagnetic Hysteresis Implemented in VHDL-AMS [p. 742]

P. Wilson, J. Ross, A. Brown, T. Kazmierski, and J. Baranowski

In this paper a modified model of ferromagnetic hysteresis suitable for mixed-signal simulations in VHDL-AMS is presented. The aim of this paper is to demonstrate how a numerically stable and accurate implementation of the Jiles-Atherton model can be achieved using a 4th order Runge-Kutta integration of the derivative of magnetization with respect to the field strength (H). While most SPICE-like implementations require inconvenient integration in time to obtain the magnetization derivative, our approach is more general as it does not rely on the underlying differential equation solver for this purpose. The model addresses the non-physical situation of negative BH slopes and proposes an alternative implementation of the anhysteretic function using a polynomial approximation of the Langevin function for low signal levels and a new function with no discontinuities. Model efficiency is improved by monitoring the change in H and only activating the integration function when H changes by a specified amount.
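
The two numerical ingredients named in this abstract can be sketched in Python (the paper itself uses VHDL-AMS). The sketch shows the Langevin function L(x) = coth(x) - 1/x with its Taylor polynomial x/3 - x^3/45 + 2x^5/945 for small arguments, and one classical 4th order Runge-Kutta step taken in H rather than in time. The switching threshold `small`, the function names, and the toy slope function are assumptions; the paper's actual polynomial, its new anhysteretic function, and the Jiles-Atherton slope equation are not reproduced here.

```python
import math


def langevin(x, small=1e-2):
    """Langevin function coth(x) - 1/x, using a Taylor polynomial near zero.

    The closed form suffers from cancellation (and divides by zero) as x -> 0,
    so a polynomial is used below the assumed threshold `small`.
    """
    if abs(x) < small:
        return x / 3.0 - x**3 / 45.0 + 2.0 * x**5 / 945.0
    return 1.0 / math.tanh(x) - 1.0 / x


def rk4_step(slope, h, m, dh):
    """Advance M by one 4th order Runge-Kutta step of size dh in H.

    `slope(h, m)` stands in for dM/dH; the model's own slope is not shown.
    """
    k1 = slope(h, m)
    k2 = slope(h + dh / 2.0, m + dh * k1 / 2.0)
    k3 = slope(h + dh / 2.0, m + dh * k2 / 2.0)
    k4 = slope(h + dh, m + dh * k3)
    return m + dh * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(slope, h0, m0, h1, steps):
    """Integrate dM/dH from h0 to h1 in `steps` equal RK4 steps."""
    dh = (h1 - h0) / steps
    h, m = h0, m0
    for _ in range(steps):
        m = rk4_step(slope, h, m, dh)
        h += dh
    return m


if __name__ == "__main__":
    print(langevin(1.0))
    # Toy slope dM/dH = M, used only to exercise the integrator.
    print(integrate(lambda h, m: m, 0.0, 1.0, 1.0, 100))
```

Because the integration variable is H, the step can be triggered only when H has changed by a chosen amount, which matches the efficiency idea described in the abstract.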

A Fast Algorithm for Finding Maximal Empty Rectangles for Dynamic FPGA Placement [p. 744]

M. Handa and R. Vemuri

In this paper, we present a fast algorithm for finding empty area on the FPGA surface with some rectangular tasks placed on it. We use a staircase data structure to report the empty area in the form of a list of maximal empty rectangles. We model the FPGA surface using an innovative encoding scheme that improves the runtime and reduces the memory requirement of our algorithm. The worst-case time complexity of our algorithm is O(xy), where x is the number of columns, y is the number of rows and x.y is the total number of cells on the FPGA.
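
To make the notion of a maximal empty rectangle concrete, here is a deliberately naive Python sketch that enumerates every empty rectangle on a small occupancy grid and keeps those that cannot be grown by one row or column in any direction. This is a brute-force reference, not the staircase-based O(xy) algorithm of the paper; the grid representation and the function name are assumptions for illustration.

```python
def maximal_empty_rectangles(grid):
    """Return maximal empty rectangles as (top, left, bottom, right), inclusive.

    grid[r][c] is True when the cell is occupied by a placed task.
    Brute force: suitable only for small grids.
    """
    rows, cols = len(grid), len(grid[0])

    def empty(t, l, b, r):
        # A rectangle that leaves the grid is treated as not empty.
        if t < 0 or l < 0 or b >= rows or r >= cols:
            return False
        return all(not grid[i][j]
                   for i in range(t, b + 1)
                   for j in range(l, r + 1))

    result = []
    for t in range(rows):
        for l in range(cols):
            for b in range(t, rows):
                for r in range(l, cols):
                    if not empty(t, l, b, r):
                        continue
                    # Maximal means no one-step extension stays empty.
                    can_grow = (empty(t - 1, l, b, r) or empty(t, l - 1, b, r)
                                or empty(t, l, b + 1, r) or empty(t, l, b, r + 1))
                    if not can_grow:
                        result.append((t, l, b, r))
    return result


if __name__ == "__main__":
    F, T = False, True
    # 3x3 surface with a single task occupying the centre cell.
    print(maximal_empty_rectangles([[F, F, F], [F, T, F], [F, F, F]]))
```

A reference like this is mainly useful for checking a faster implementation on small surfaces.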

Enhancing Reliability of Operational Interconnections in FPGAs [p. 746]

A. Fit-Florea, M. Halas, and F. Kocan

SRAM-based Field-Programmable Gate Arrays (FPGAs) have fixed numbers of wires, switches and look-up tables. An application does not fully utilize all available components in an FPGA, e.g. wires. In this paper, we propose methods to improve reliability of less reliable operational interconnections by efficiently utilizing unused wires to mask errors dynamically. With these methods, we are able to improve the reliability of more than two-thirds of all interconnections in the studied MCNC benchmarks. As a result, the overall unreliability of operational interconnections decreases by more than 20%.

Operating System Support for Interface Virtualization of Reconfigurable Coprocessors [p. 748]

M. Vuletic, L. Righetti, L. Pozzi, and P. Ienne

Reconfigurable Systems-on-Chip (SoC) consist of large Field-Programmable Gate-Arrays (FPGAs) and standard processors. The reconfigurable logic can be used for application-specific coprocessors to speed up execution of applications. Their widespread use is limited by the complexity of interfacing software applications with coprocessors. We present a virtualisation layer that lowers the interfacing complexity and improves the portability. The layer shifts the burden of moving data between processor and coprocessor from the programmer to the Operating System (OS). A reconfigurable SoC running Linux is used to prove the concept.

Volume II

6A: Performances Analysis for MPSoC

Moderators: R. Ernst, TU Braunschweig, DE; A. Jantsch, Royal Inst. of Tech., SE

Analyzing On-Chip Communication in a MPSoC Environment [p. 752]

F. Angiolini, D. Bertozzi, L. Benini, M. Loghi, and R. Zafalon

This work focuses on communication architecture analysis for multi-processor Systems-on-Chips (MPSoCs), and it leverages a SystemC-based platform to simulate a complete multi-processor system at the cycle-accurate and signal-accurate level. These features make it possible to stimulate the communication sub-system with functional traffic generated by real applications running on top of a configurable number of ARM processors. This opens up the possibility for communication infrastructure exploration and for the investigation of its impact on system performance at the highest level of accuracy. Our simulation environment proved capable of a detailed comparative analysis between two industry-standard communication architectures, under realistic workloads and different system configurations, pointing out the impact of fine-grained architectural mismatches on macroscopic performance differences.

A Mapping Strategy for Resource-Efficient Network Processing on Multiprocessor SoCs [p. 758]

M. Grünewald, J. Niemann, M. Porrmann, and U. Rückert

Hardware architectures based on a field of hardware-extended processors can provide flexible computing power for applications where parallelism can be exploited. For multiprocessors, the assignment of functionality to execution units can have a great impact on the performance. Additionally, finding the optimal mapping can be a time-consuming task. We present a multiprocessor architecture along with a suitable design method that includes an automated solution to the mapping problem. Our hardware architecture employs a network-on-chip (NoC) to achieve a high degree of scalability for the application and for the system with respect to future integration technologies. We also show how to reduce the packet buffer requirements with a proper scheduling strategy and present first estimates for the resource consumption of an application targeted for mobile networking.

Cost-Performance Trade-Offs in Networks on Chip: A Simulation-Based Approach [p. 764]

S. Pestana, E. Rijpkema, A. Radulescu, K. Goossens, and O. Gangwal

A challenge facing designers of systems on chip (SoC) containing networks on chip (NoC) is to find NoC instances that balance the cost (e.g. area) and performance (e.g. latency and throughput). In this paper we present a simulation-based approach to address this problem. We use XML to instantiate network components (routers, network interfaces) and their composition. NoCs are evaluated in terms of cost and performance by sweeping over different parameters (e.g. network topology, network interface queue depth). We then show how we can obtain trade-off plots by using the results obtained with our simulation environment. Finally, by means of two examples we illustrate how trade-off plots can help NoC designers in selecting the right network based on a set of different constraints.

A Case Study in Networks-on-Chip Design for Embedded Video [p. 770]

J. Xu, W. Wolf, T. Lv, J. Henkel, and S. Chakradhar

In this paper we study bus-based and switch-based on-chip networks for an embedded video application, the Smart Camera SoC (system on chip). We analyze network performance and overall system performance in detail. We explore system performance using crossbars with different sizes, fixed size but different numbers of ports, and different numbers of shared memories. We find that the network is a performance bottleneck in our design, and the system using an optimized NoC can outperform one using a bus by 132%. Our simulations are based upon recorded real communication traces, which give more accurate system performance. Our study finds that for the Smart Camera system, a 16-bit/port 3x3 crossbar with two shared memories shows 85.7% performance improvement over the bus-based model and also has less maximum network throughput than the bus-based model. This design example illustrates a methodology to quickly and accurately estimate the performance of NoCs at the architecture level.

6B: Synthesis for Noise and Manufacturability

Moderators: T. Villa, Udine U, IT; T. Shiple, Synopsys, FR

Exploiting Crosstalk to Speed up On-Chip Buses [p. 778]

C. Duan and S. Khatri

In modern VLSI processes, the cross-coupling capacitance between adjacent wires on the same metal layer is a very large fraction of the total wire capacitance. This leads to problems of delay variation due to crosstalk and reduced noise immunity, arguably one of the biggest obstacles in the design of ICs in recent times. This problem is particularly severe in long on-chip buses, since bus signals are routed at minimum pitch for long distances. In this work, we propose to solve this problem by the use of crosstalk canceling CODECs. We only utilize memoryless CODECs, to reduce the logical complexity and enhance the robustness of our techniques. Bus data patterns can be classified (as 4.C, 3.C, 2.C, 1.C or 0.C patterns) based on the maximum amount of crosstalk that they can exhibit. Crosstalk avoidance CODECs which eliminate 4.C and 3.C patterns have been reported. In this paper, we describe crosstalk avoidance techniques which eliminate 2.C and 1.C patterns. We describe an analytical methodology to accurately characterize the bus area overhead of 2.C pattern CODECs. Using these results, we characterize the area overhead versus crosstalk immunity achieved. A similar exercise is performed for 1.C patterns. Our experimental results show that by using 2.C crosstalk canceling techniques, buses can be sped up by up to a factor of 6 with an area overhead of about 200%, and that 1.C techniques are not very robust.
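
The pattern classification mentioned in this abstract can be illustrated with a short Python sketch. It assigns a bus transition the class k (for k.C) given by the worst switching wire, where each neighbour contributes 0 when it switches in the same direction, 1 when it is quiet, and 2 when it switches in the opposite direction. The function name is hypothetical, and treating the bus as flanked by quiet shield wires is an assumption of this sketch; the CODECs themselves are not shown.

```python
def crosstalk_class(prev, curr):
    """Worst-case crosstalk class (0..4, in units of C) of one bus transition.

    prev and curr are equal-length sequences of 0/1 wire values.
    Assumption: the outermost wires sit next to quiet shield wires.
    """
    if len(prev) != len(curr):
        raise ValueError("bus words must have the same width")
    # Per-wire transition: +1 rising, -1 falling, 0 quiet.
    delta = [c - p for p, c in zip(prev, curr)]
    worst = 0
    for i, d in enumerate(delta):
        if d == 0:
            continue  # a quiet wire has no transition delay to classify
        left = delta[i - 1] if i > 0 else 0
        right = delta[i + 1] if i + 1 < len(delta) else 0
        # Same direction adds 0, quiet neighbour adds 1, opposite adds 2.
        worst = max(worst, abs(d - left) + abs(d - right))
    return worst


if __name__ == "__main__":
    print(crosstalk_class([0, 1, 0], [1, 0, 1]))  # middle wire fights both neighbours
    print(crosstalk_class([0, 0, 0], [0, 1, 0]))  # lone switching wire
```

A crosstalk avoidance code that eliminates k.C patterns restricts the set of legal codewords so that no transition between two of them reaches class k.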

False-Noise Analysis for Domino Circuits [p. 784]

A. Glebov, S. Gavrilov, V. Zolotov, M. Becer, C. Oh, and R. Panda

High-performance digital circuits are facing increasingly severe noise problems due to cross-coupled noise injection. Traditionally, noise analysis tools use the conservative assumption that all neighbors of a net can switch simultaneously, producing the worst-case noise. However, due to logic correlations in the circuit, this worst-case noise may not be realizable, resulting in a so-called false noise failure. Some techniques for computing logic correlations have been designed targeting static CMOS circuits. However, high-performance microprocessors commonly use domino logic for their ALU. The domino circuits have lower noise margins than static CMOS circuits and are more sensitive to coupled noise. Any unnecessary pessimism of the noise analysis tool results in a large number of false noise violations and either requires additional extensive SPICE simulations or circuit over-design. Unfortunately, false noise analysis developed for static CMOS circuits [11] fails to compute many logic correlations in domino circuits. In this paper we propose a novel technique for computing logic correlations in domino circuits. It takes into account the fact that both the pull-up and pull-down networks of a domino gate can be in a non-conducting state. The proposed technique generates additional logic correlations for such states of domino gates. In order to improve the capability of the logic correlation derivation technique, we combine the resolution method with a recursive learning algorithm [12]. The proposed technique is implemented in an industrial noise analysis tool and tested on high-performance ALU blocks.

Crosstalk Minimization in Logic Synthesis for PLA [p. 790]

Y. Liu, T. Hwang, and K. Wang

We propose a maximum crosstalk minimization algorithm that takes logic synthesis into consideration for PLA structures. To minimize the crosstalk, a wire permutation technique is used, which includes the following steps. First, product lines are partitioned into a long set and a short set, and then product lines in the long set and the short set are interleaved. With the interleaving algorithm, an upper bound on the maximum coupling capacitance of the product lines can be derived. Then, we take advantage of the crosstalk immunity of product lines in the long set to further reduce the maximum crosstalk effect of the PLA. Finally, synthesis techniques such as local transformation and global transformation are taken into consideration to search for a better result. The experiments demonstrate that our algorithm can effectively reduce the maximum crosstalk effect of a circuit by 48% as compared with the original area-minimized PLA without crosstalk minimization.

Synthesis for Manufacturability: A Sanity Check [p. 796]

A. Nardi and A. Sangiovanni-Vincentelli

As we move towards nanometer technology, manufacturing problems become overwhelmingly difficult to solve. Presently, optimization for manufacturability is performed at a post-synthesis stage and has been shown capable of reducing manufacturing cost up to 10%. As in other cases, raising the abstraction layer where optimization is applied is expected to yield substantial gains. This paper focuses on a new approach to design for manufacturability: logic synthesis for manufacturability. This methodology consists of replacing the traditional area-driven technology mapping with a new manufacturability-driven one. We leverage existing logic synthesis tools to test our method. The results obtained by using STMicroelectronics 0.13µm library confirm that this approach is a promising solution for designing circuits with lower manufacturing cost, while retaining performance. Finally, we show that our synthesis for manufacturability can achieve even larger cost reduction when yield-optimized cells are added to the library, thus enabling a wider area-yield tradeoff exploration.

6C: Support for BIST

Moderators: G. Carlsson, Ericsson Telecom, SE; K. Chakrabarty, Duke U, US

Design of Sub-10-Picoseconds On-Chip Time Measurement Circuit [p. 804]

M. Abas, D. Kinniment, and G. Russell

The rapid pace of change in IC technology, specifically in speed of operation, demands sophisticated design solutions for IC testing methodologies. Moreover, the current technology of System-on-chip (SOC) makes great demands for testing internal speed accurately as accessing internal nodes through I/O pins becomes more difficult. This paper presents two high-resolution time measurement schemes for digital BIST applications, namely: Two-Delay Interpolation Method (TDIM) and Time Amplifier. The two schemes are combined to produce a completely new design for BIST time measurement which offers two main advantages: a low range of timing measurement which has never been achieved before, and a small layout occupying 0.2 mm², equivalent to 3020 transistors. These two features are undoubtedly compatible with present high-speed SOC design architectures.

Impact of Test Point Insertion on Silicon Area and Timing during Layout [p. 810]

H. Vranken, H. Wunderlich, and F. Sapei

This paper presents an experimental investigation on the impact of test point insertion on circuit size and performance. Often test points are inserted into a circuit in order to improve the circuit's testability, which results in smaller test data volume, shorter test time, and higher fault coverage. Inserting test points however requires additional silicon area and influences the timing of a circuit. The paper shows how placement and routing is affected by test point insertion during layout generation. Experimental data for industrial circuits show that inserting 1% test points in general increases the silicon area after layout by less than 0.5% while the performance of the circuit may be reduced by 5% or more.

Designing Self Test Programs for Embedded DSP Cores [p. 816]

H. Rizk, C. Papachristou, and F. Wolff

This paper describes a self test program design technique for embedded DSP cores. The method requires minimal knowledge of the core's internals and minimal insertion of external LFSR hardware, without scan insertions. The test program consists of a small set of instructions which operate iteratively on pseudorandom data generated by the LFSRs to fully test the DSP core components. The method uses instruction-based test metrics and a program template as a blueprint to generate the test program. The self test scheme has been successfully applied on an industrial-strength DSP core and the results compare favorably to other methods using ATPG and pseudorandom BIST.

6E: Modelling, Simulation and Optimisation in Power/Ground/Substrate

Moderators: P. Feldmann, IBM T.J. Watson Res. Center, US; L. Silveira, INESC ID/IST - TU Lisbon, PT

Automated, Accurate Macromodelling of Digital Aggressors for Power/Ground/Substrate Noise Prediction [p. 824]

Z. Wang, J. Roychowdhury, and R. Murgai

Noise analysis and power distribution network reliability are extremely important in deep sub-micron digital and mixed-signal circuit design. Both relate closely to the nonlinear loading impact of digital circuits. Consequently, accurate estimation of the latter is critical. In this paper, we present extraction techniques that automatically generate a family of small, time-varying macromodels for digital cell libraries, at the time of their library characterization. Our approach is based on importing and adapting the Time-Varying Padé (TVP) method, for linear time-varying (LTV) model reduction, from the mixed-signal macromodelling domain. Our approach features naturally higher accuracy than previous ones, and in addition, offers the user a tradeoff between accuracy and macromodel complexity. A key attraction of our approach is that it can be merged into cell library extraction methodologies to produce accurate-by-construction noise models for digital blocks. Simulations and comparisons confirming the efficacy of our approach are provided.

Thermal and Power Integrity Based Power/Ground Networks Optimization [p. 830]

T. Wang, J. Tsai, and C. Chen

With the increasing power density and heat-dissipation cost of modern VLSI designs, thermal and power integrity have become serious concerns. Although the impacts of thermal effects on transistor and interconnect performance are well-studied, the interactions between power-delivery and thermal effects are not clear. As a result, power-delivery design without thermal consideration may cause soft errors, reliability degradation, and even premature chip failures. In this paper, we propose a thermal-aware power-delivery optimization algorithm. By simultaneously considering thermal and power integrity, we are able to achieve high power supply quality and thermal reliability. For a 58 x 72 mesh as shown in the experimental results, our algorithm shows that the lifetime of the optimized ground network is 9.5 years, whereas the lifetime of the ground network generated by a traditional method without thermal consideration is only 2 years.

Synthesized Compact Models (SCM) of Substrate Noise Coupling Analysis and Synthesis in Mixed-Signal ICs [p. 836]

H. Lan and R. Dutton

An approach for synthesized compact models (SCM) of substrate noise coupling is presented. The model is formulated using a parameterized and scalable Z matrix. The improvement in modeling near-field effects results in better substrate noise modeling for analog circuits. The geometrical scalability of the model provides a bi-directional link between noise analysis in the post-layout phase for verification and the noise-aware layout synthesis using convex optimization techniques. The model is validated by rigorous EM and device simulations. Several application examples are used to demonstrate the bi-directional usage of the model.

6F: Panel Session -- Chips of the Future: Soft, Crunchy or Hard?

Organizer/Moderator: P. Paulin, STMicroelectronics, FR
Panellists:
R. Bramley, STMicroelectronics
A. Silburt, Cisco, CAN
J. Balzano, Alcatel, FR
K. van Berkel, Philips Research, NL
N. Wehn, Kaiserslautern U, DE

Chips of the Future: Soft, Crunchy or Hard? [p. 844]: Today's electronic products are composed of an increasingly diverse set of IC's, ranging from dedicated ASIC's, domain-specific ASSP's, platform FPGA's, to general-purpose FPGA's. With increasing integration, a mix of different fabrics on a single SoC becomes possible, combining ASIC-style standard cells, embedded FPGA's, mask-programmable sea-of-gates, and programmable processors. The panelists will present their vision of the fabric which will dominate SoC's in 90nm technologies and beyond, based on industrial trends and case studies. They will also outline the key CAD tool challenges for the chosen fabric.

6G: Power-Aware Networks and Interfaces (Low Power Special Day)

Moderators: M. Pedram, Southern California U, US; A. Amara, ISEP, FR

Tuning In-Sensor Data Filtering to Reduce Energy Consumption in Wireless Sensor Networks [p. 852]

I. Kadayif and M. Kandemir

In recent years, research on wireless sensor networks has been undergoing a revolution, promising to have significant impact on a broad range of applications from military to health care to food safety. An important problem in many sensor network applications is to decide the amount of computation (or filtering) that needs to be done in the sensor nodes before the data are shifted to a central base station. The right amount of data filtering in the sensor nodes can lead to large savings in network-wide energy consumption. The main goal of this paper is to develop an automated strategy for data filtering in wireless sensor nodes. Assuming that one needs to reduce the overall energy consumption (as opposed to reducing just computation energy or communication energy), the proposed strategy attempts to strike a balance between computation energy consumption and communication energy consumption. Our experimental results clearly indicate that the proposed data filtering strategy generates substantial energy savings in practice.

Power-Aware Network Swapping for Wireless Palmtop PCs [p. 858]

A. Acquaviva, E. Lattanzi, and A. Bogliolo

Virtual memory is considered to be an unlimited resource in desktop or notebook computers with high storage memory capabilities. However, in wireless mobile devices like palmtops and personal digital assistants (PDA), storage memory is limited or absent due to weight, size and power constraints. As a consequence, swapping over remote memory devices can be considered as a viable alternative. Nevertheless, power hungry wireless network interface cards (WNIC) may limit the battery lifetime and application performance if not efficiently exploited. In this work we explore performance and energy of network swapping in comparison with swapping on local micro-drives and flash memories. Our study points out that remote swapping over power-manageable WNICs can be more efficient than local swapping and that both energy and performance can be optimized through power-aware reshaping of data requests. Experimental results show that our optimization technique can save up to 60% of communication energy while improving performance.

Power Aware Interface Synthesis for Bus-Based SoC Design [p. 864]

N. Liveris and P. Banerjee

In this paper we discuss the problem of interface synthesis for a system on a chip (SoC) such that the power consumption is minimized under some given latency constraints. Since the AMBA protocol has become one of the standard interfaces for SoC cores, we develop our interface synthesis methods around the AMBA protocol. We first provide an analysis of the parameters of the AMBA bus and the communication protocols and a bus power model that will be used by various transformations. Several latency-improving and power-minimizing transformations are presented at the bus level. Finally, a heuristic is presented which applies the above transformations in a certain order to provide minimum power under a given latency constraint. Experimental results are reported on two example benchmarks that show that the heuristic is able to reduce power consumption on the wires by about 28% on average from an initial design having a single-layer bus architecture.

Asynchronous Design by Conversion: Converting Synchronous Circuits into Asynchronous Ones [p. 870]

A. Branover, R. Kol, and R. Ginosar

A novel methodology and algorithm for the design of large low-power asynchronous systems are described. The system is synthesized by a commercial tool as a synchronous circuit, and subsequently converted into an asynchronous one. The conversion algorithm consists of extracting input and output sets, replacing the storage elements, identifying fork and join sets, and constructing request and acknowledge networks. A DLAP (Doubly Latched Asynchronous Pipeline) architecture is employed. The resulting asynchronous circuit can adapt its effective operating frequency to the supply voltage, facilitating flexible and efficient power management. The algorithm has been validated on several circuits.

7A: Networks on Chip Design

Moderators: G. Nicolescu, Ecole Polytechnique de Montreal, CA; M. Coppola, STMicroelectronics, FR

An Efficient On-Chip Network Interface Offering Guaranteed Services, Shared-Memory Abstraction, and Flexible Network Configuration [p. 878]

A. Radulescu, J. Dielissen, K. Goossens, E. Rijpkema, and P. Wielage

In this paper we present a network interface for an on-chip network. Our network interface decouples computation from communication by offering a shared-memory abstraction, which is independent of the network implementation. We use a transaction-based protocol to achieve backward compatibility with existing bus protocols such as AXI, OCP and DTL. Our network interface has a modular architecture, which allows flexible instantiation. It provides both guaranteed and best-effort services via connections. These are configured via network interface ports using the network itself, instead of a separate control interconnect. An example instance of this network interface with 4 ports has an area of 0.143mm² in a 0.13µm technology, and runs at 500 MHz.

×pipesCompiler: A Tool for Instantiating Application Specific Networks-on-Chip [p. 884]

S. Murali, G. De Micheli, A. Jalabert, and L. Benini

Future Systems on Chips (SoCs) will integrate a large number of processor and storage cores onto a single chip and require Networks on Chip (NoC) to support the heavy communication demands of the system. The individual components of the SoCs will be heterogeneous in nature with widely varying functionality and communication requirements. The communication infrastructure should optimally match communication patterns among these components accounting for the individual component needs. In this paper we present ×pipesCompiler, a tool for automatically instantiating an application-specific NoC for heterogeneous Multi-Processor SoCs. The ×pipesCompiler instantiates a network of building blocks from a library of composable soft macros (switches, network interfaces and links) described in SystemC at the cycle-accurate level. The network components are optimized for that particular network and support reliable, latency-insensitive operation. Example systems with application-specific NoCs built using the ×pipesCompiler show large savings in area (factor of 6.5), power (factor of 2.4) and latency (factor of 1.42) when compared to a general-purpose mesh-based NoC architecture.
Keywords: Systems on Chips, Networks on Chips, latency-insensitive design, application-specific, SystemC.

Guaranteed Bandwidth Using Looped Containers in Temporally Disjoint Networks within the Nostrum Network on Chip [p. 890]

M. Millberg, E. Nilsson, R. Thid, and A. Jantsch

In today's emerging Networks-on-Chip, there is a need for different traffic classes with different Quality-of-Service guarantees. Within our NoC architecture Nostrum, we have implemented a service of Guaranteed Bandwidth (GB), and latency, in addition to the already existing service of Best-Effort (BE) packet delivery. The guaranteed bandwidth is accessed via Virtual Circuits (VC). The VCs are implemented using a combination of two concepts that we call "Looped Containers" and "Temporally Disjoint Networks" (TDNs). The Looped Containers are used to guarantee access to the network -- independently of the current network load without dropping packets; and the TDNs are used in order to achieve several VCs, plus ordinary BE traffic, in the network. The TDNs are a consequence of the deflective routing policy used, and give rise to an explicit time-division multiplexing within the network. To prove our concept, an HDL implementation has been synthesised and simulated. The cost in terms of additional hardware needed, as well as additional bandwidth, is very low -- less than 2 percent in both cases. Simulations showed that ordinary BE traffic is practically unaffected by the VCs.

Bandwidth-Constrained Mapping of Cores onto NoC Architectures [p. 896]

S. Murali and G. De Micheli

We address the design of complex monolithic systems, where processing cores generate and consume a varying and large amount of data, thus bringing the communication links to the edge of congestion. Typical applications are in the area of multi-media processing. We consider a mesh-based Network on Chip (NoC) architecture, and we explore the assignment of cores to mesh cross-points so that the traffic on links satisfies bandwidth constraints. A single-path deterministic routing between the cores places high bandwidth demands on the links. The bandwidth requirements can be significantly reduced by splitting the traffic between the cores across multiple paths. In this paper, we present NMAP, a fast algorithm that maps the cores onto a mesh NoC architecture under bandwidth constraints, minimizing the average communication delay. The NMAP algorithm is presented for both single minimum-path routing and split-traffic routing. The algorithm is applied to a benchmark DSP design and the resulting NoC is built and simulated at the cycle-accurate level in SystemC using macros from the ×pipes library. Also, experiments with six video processing applications show significant savings in bandwidth and communication cost for the NMAP algorithm when compared to existing algorithms.
Keywords: Systems on Chips, Networks on Chips, cores, mapping, bandwidth, routing.
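
The bandwidth check that such a mapping must pass can be illustrated with a minimal sketch. It is not the NMAP algorithm: it only evaluates one candidate mapping, assuming dimension-ordered (X-then-Y) single-path routing. The core names, traffic figures, link capacity and all function names are hypothetical.

```python
from collections import defaultdict

def xy_route(src, dst):
    """Return the directed links on the X-then-Y path between two mesh nodes."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def link_loads(mapping, traffic):
    """Sum the bandwidth of every flow over each link it traverses."""
    loads = defaultdict(int)
    for (src_core, dst_core), bandwidth in traffic.items():
        for link in xy_route(mapping[src_core], mapping[dst_core]):
            loads[link] += bandwidth
    return dict(loads)

def satisfies_bandwidth(mapping, traffic, capacity):
    """True when no link carries more than the assumed link capacity."""
    return all(load <= capacity for load in link_loads(mapping, traffic).values())

def communication_cost(mapping, traffic):
    """Bandwidth weighted by hop count, a common proxy for communication cost."""
    total = 0
    for (src_core, dst_core), bandwidth in traffic.items():
        (x1, y1), (x2, y2) = mapping[src_core], mapping[dst_core]
        total += bandwidth * (abs(x1 - x2) + abs(y1 - y2))
    return total

# Hypothetical traffic (MB/s) between three cores on a 2x2 mesh.
TRAFFIC = {("cpu", "mem"): 400, ("dsp", "mem"): 300, ("cpu", "dsp"): 100}
MAPPING_A = {"cpu": (0, 0), "mem": (1, 0), "dsp": (1, 1)}
MAPPING_B = {"cpu": (0, 0), "mem": (1, 1), "dsp": (1, 0)}
```

With an assumed capacity of 500 MB/s per link, `MAPPING_A` is feasible while `MAPPING_B` overloads the link from (1, 0) to (1, 1); a mapping algorithm would search for placements of the first kind.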

7B: Advances in Technology Mapping and Circuit Sizing

Moderators: T. Kutzschebauch, Magma Design Automation, US; L. Stok, IBM, US

Synthesis and Optimization of Threshold Logic Networks with Application to Nanotechnologies [p. 904]

R. Zhang, P. Gupta, L. Zhong, and N. Jha

We propose an algorithm for efficient threshold network synthesis of arbitrary multi-output Boolean functions. The main purpose of this work is to bridge the wide gap that currently exists between research on the development of nanoscale devices and research on the development of synthesis methodologies to generate optimized networks utilizing these devices. Many nanotechnologies, such as resonant tunneling diodes (RTD) and quantum cellular automata (QCA), are capable of implementing threshold logic. While functionally correct threshold gates have been successfully demonstrated, there exists no methodology or design automation tool for general multi-level threshold network synthesis. We have built the first such tool, ThrEshold Logic Synthesizer (TELS), on top of an existing Boolean logic synthesis tool. Experiments with about 60 multi-output benchmarks were performed, though the results of only 10 of them are reported in this paper because of space restrictions. They indicate that up to a 77% reduction in gate count is possible when utilizing threshold logic, with the average reduction being 52%, compared to traditional logic synthesis. Furthermore, the synthesized networks are well-balanced, and hence delay-optimized.
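
As a brief illustration of why threshold logic can reduce gate count, the sketch below models a single threshold gate and shows that one such gate computes 3-input majority, which takes several AND/OR gates in conventional logic. This is a generic textbook example, not output of TELS; the function names are hypothetical.

```python
from itertools import product

def threshold_gate(weights, threshold, inputs):
    """Output 1 when the weighted sum of the binary inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def majority3_threshold(a, b, c):
    # One threshold gate: unit weights, threshold 2.
    return threshold_gate((1, 1, 1), 2, (a, b, c))

def majority3_boolean(a, b, c):
    # The same function written with three AND terms combined by OR.
    return (a & b) | (b & c) | (a & c)

def implementations_agree():
    """Exhaustively compare both implementations over all 8 input patterns."""
    return all(
        majority3_threshold(*bits) == majority3_boolean(*bits)
        for bits in product((0, 1), repeat=3)
    )
```

AND and OR are special cases of the same gate: with unit weights on two inputs, a threshold of 2 gives AND and a threshold of 1 gives OR.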

Fast Comparisons of Circuit Implementations [p. 910]

S. Karandikar and S. Sapatnekar

Digital designs can be mapped to different implementations using diverse approaches, with varying cost criteria. Post-processing transforms, such as transistor sizing, can drastically improve circuit performance by optimizing critical paths to meet timing specifications. However, most transistor sizing tools have high execution times, and the attainable circuit delay can be determined only after running the tool. In this paper, we present an approach for fast transistor sizing that can enable a designer to choose one among several functionally identical implementations. Our algorithm computes the minimum achievable delay of a circuit with a maximum average error of 5.5% in less than a second for even the largest benchmarks.

Saving Power by Mapping Finite-State Machines into Embedded Memory Blocks in FPGAs [p. 916]

A. Tiwari and K. Tomko

Modern FPGAs contain on-chip synchronous embedded memory blocks (SEMBs); when not used as on-chip memory, these blocks can be used to implement control units. In this paper, we explore the mapping of Finite State Machines (FSMs) into the SEMBs for power and area minimization. We present the SEMB-based implementation of the FSMs and compare it with the conventional Flip-Flop (FF) based implementation. The proposed implementation of the FSMs consumes less power and has lower area and routing overhead than the FF-based approach, and it can be clocked at the maximum clock frequency supported by the SEMBs. Experimental results show that the SEMB-based FSM consumes 4% to 26% less power than the conventional implementation. In addition, it is observed that the power consumption can be further reduced by stopping the clock to the SEMBs during the idle states.
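
The general idea of holding an FSM in a memory block can be sketched as a lookup table: the address is the current state concatenated with the input bits, and the stored word holds the next state and the output. The sketch below is a software model under that assumption, not the authors' mapping flow; the example machine and all names are hypothetical.

```python
def build_rom(transitions, num_states, input_bits):
    """Pack an FSM transition table into a memory image.

    States are assumed to be numbered 0..num_states-1. Each word is a
    (next_state, output) pair stored at address (state << input_bits) | input.
    """
    rom = [None] * (num_states << input_bits)
    for (state, inp), word in transitions.items():
        rom[(state << input_bits) | inp] = word
    return rom

def run_fsm(rom, input_bits, inputs, state=0):
    """Step the machine: one memory read per clock cycle."""
    outputs = []
    for inp in inputs:
        state, out = rom[(state << input_bits) | inp]
        outputs.append(out)
    return outputs

# Hypothetical Mealy machine: output 1 when the current and previous inputs are both 1.
TRANSITIONS = {
    (0, 0): (0, 0),
    (0, 1): (1, 0),
    (1, 0): (0, 0),
    (1, 1): (1, 1),
}
ROM = build_rom(TRANSITIONS, num_states=2, input_bits=1)
```

In hardware, the state bits of the memory output would be fed back to the address lines, so the memory's own output register plays the role of the state flip-flops.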

MemMap: Technology Mapping Algorithm for Area Reduction in FPGAs with Embedded Memory Arrays Using Reconvergence Analysis [p. 922]

M. Kumar, J. Bobba, and V. Kamakoti

Modern-day Field Programmable Gate Arrays (FPGAs) include, in addition to Look-Up Tables (LUTs), reasonably big configurable Embedded Memory Blocks (EMBs) to cater to the on-chip memory requirements of the systems and applications mapped onto them. While mapping applications onto such FPGAs, some of the EMBs may be left unused. This paper presents a methodology to utilize such unused EMBs as large look-up tables to map multi-output combinational sub-circuits of the application, which would otherwise be mapped onto a number of the small LUTs available on the FPGA. This in turn leads to a large reduction in the FPGA area utilized for mapping an application. Experimental results show that our proposed methodology, when employed on popular benchmark circuits, can lead to an additional 50% reduction in area utilized when compared with other methodologies reported in the literature.

7C: Panel Session -- Nanometer Design: What are the Requirements for Manufacturing Test?

Organizer: K. Thapar, Mentor Graphics Europe, UK
Moderator: J. Rajski, Mentor Graphics, US
Panellists:
M. Vergniault, STMicroelectronics, FR
P. Muhmenthaler, Infineon Technologies, DE
E. Haioun, Motorola, FR
E. Marinissen, Philips Research, NL
R. Illman, Cadence Design Foundry, UK
B. Bennetts, Bennetts Associates, UK
S. Dowd, Jennic, UK

Nanometer Design: What are the Requirements for Manufacturing Test? [p. 930]

7E: Issues in Interconnect Simulation and Model Order Reduction

Moderators: I. Elfadel, IBM T.J. Watson Res. Center, US; U. Feldmann, Infineon Technologies, DE

Poor Man's TBR: A Simple Model Reduction Scheme [p. 938]

J. Phillips and L. Silveira

This paper presents a model reduction algorithm motivated by a connection between frequency domain projection methods and approximation of truncated balanced realizations. The method produces guaranteed passive models, has near-optimal error properties, is computationally simple to implement, contains error estimators, and can incorporate frequency weighting information in a straightforward manner. Examples are shown to demonstrate that the method can outperform the standard order reduction techniques by providing similar accuracy with lower-order models or superior accuracy for the same size model.

Model Order Reduction Techniques for Linear Systems with Large Numbers of Terminals [p. 944]

P. Feldmann

This paper addresses the well known difficulty of applying model order reduction (MOR) to linear circuits with a large number of input-output terminals. Traditional MOR techniques substitute the original large but sparse matrices used in the mathematical modeling of linear circuits by models that approximate the behavior of the circuit at its terminals, and use significantly smaller matrices. Unfortunately these small MOR matrices become dense as the number of terminals increases, thus canceling the benefits of size reduction. The paper introduces a model reduction technique suitable for circuits with numerous terminals. The technique exploits the correlation that almost always exists between circuit responses at different terminals. The correlation is rendered explicit through an SVD-based algorithm and the result is a substantial sparsification of the MOR matrices. The proposed sparsification technique is applicable to a large class of problems encountered in the analysis and design of interconnect in VLSI circuits. Relevant examples are used to analyze and validate the method.

SCORE: Spice COmpatible Reluctance Extraction [p. 948]

R. Jiang and C. Chen

Presently, a necessary modification to mainstream analysis tools prevents the direct application of reluctance k. In this paper, we propose a reluctance realization algorithm (RRA) by directly converting reluctances to circuit elements compatible with general simulation engines, such as SPICE. Reluctance realization is applicable to arbitrary circuit topology and no accuracy penalty is involved in the realization process. Since the stability of the converted circuit largely depends on the stability of the reluctance matrix, we present an efficient Improved Recursive Bisection Cutting Algorithm (IRBCA) to obtain stability-guaranteed reluctance matrices, and integrate IRBCA and RRA into a SPICE compatible reluctance extraction tool, SCORE.

A Compact Propagation Delay Model for Deep-Submicron CMOS Technologies including Crosstalk [p. 954]

J. Rosselló and J. Segura

We present a compact, fully physical, analytical model for the propagation delay and the output transition time of deep-submicron CMOS gates. The model accounts for crosstalk effects, short-circuit currents, the input-output coupling capacitance and carrier velocity saturation effects. It is based on the nth-power law MOSFET model and computes the propagation delay from the charge delivered to the gate. Comparisons with HSPICE simulations and other previously published models for different submicron technologies show significant improvements in terms of accuracy.

7F: Emerging Technologies: From Sensors to Qubits

Moderators: T. Basten, TU Eindhoven, NL; L. Claesen, National Chiao Tung U, Taiwan, ROC

A Framework for Battery-Aware Sensor Management [p. 962]

S. Dasika, S. Vrudhula, S. Chopra, and R. Srinivasan

A distributed sensor network (DSN) designed to cover a given region R is said to be alive if there is at least one subset of sensors that can collectively cover (sense) the region R. When no such subset exists, the network is said to be dead. A key challenge in the design of a DSN is to maximize the operational life of the network. Since sensors are typically powered by batteries, this requires maximizing the battery lifetime. One way to achieve this is to determine the optimal schedule for transitioning sets of sensors between active and inactive states while satisfying user-specified performance constraints. This requires identification of feasible subsets (covers) of sensors and a scheme for switching between such subsets. We present an algorithmic solution to compute all the sensor covers in an implicit manner by formulating the problem as a unate covering problem (UCP). The representation of all possible sensor sets is extremely efficient and can accommodate a very large number of sensor covers. The representation and formulation make it possible to consider the residual battery charge when switching between covers. We develop algorithms for switching between sensor covers aimed at maximizing the lifetime of the network. The algorithms take into account the transmission/reception costs of sensors and a user-specified quality constraint, and also utilize a novel battery model that accounts for the rate-dependent capacity effect and charge recovery during idle periods. Our simulation results show that lifetime improvement can be achieved by exploiting the charge recovery process. The work presented here constitutes a framework for battery-aware sensor management in which various types of constraints can be incorporated and a range of other communication protocols can be examined.
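
The notion of a sensor cover can be made concrete with a small sketch. Unlike the implicit representation described in the abstract, it enumerates irredundant covers explicitly by brute force, which is only practical for a handful of sensors. The region is modelled as a set of sample points, and the selection policy (prefer the cover whose weakest sensor has the most charge) is an assumed stand-in for the paper's switching algorithms; all names and data are hypothetical.

```python
from itertools import combinations

def is_cover(sensor_ids, coverage, region):
    """True when the chosen sensors jointly sense every point of the region."""
    covered = set()
    for sensor in sensor_ids:
        covered |= coverage[sensor]
    return region <= covered

def minimal_covers(coverage, region):
    """Enumerate covers that contain no smaller cover, smallest subsets first."""
    sensors = sorted(coverage)
    found = []
    for size in range(1, len(sensors) + 1):
        for subset in combinations(sensors, size):
            chosen = set(subset)
            # Skip supersets of a cover that was already found.
            if any(prev <= chosen for prev in found):
                continue
            if is_cover(chosen, coverage, region):
                found.append(chosen)
    return found

def pick_cover(covers, charge):
    """Assumed policy: activate the cover whose weakest battery is strongest."""
    return max(covers, key=lambda cover: min(charge[s] for s in cover))

# Hypothetical network: four sensors observing a region of four sample points.
REGION = {1, 2, 3, 4}
COVERAGE = {"a": {1, 2}, "b": {3, 4}, "c": {1, 2, 3}, "d": {4}}
CHARGE = {"a": 5, "b": 2, "c": 4, "d": 3}
```

The network is alive as long as `minimal_covers` returns at least one cover whose sensors all retain charge.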

Local Decisions and Triggering Mechanisms for Dynamic Fault Tolerance [p. 968]

P. Stanley-Marbell and D. Marculescu

Dynamic fault-tolerance management (DFTM) was previously introduced as a means of providing environment and workload-driven adaptation for failure-prone battery-powered systems. This paper introduces and analyzes the role of local decision policies in a DFTM environment, and presents a precise formulation for when it is beneficial to activate a given DFTM algorithm with respect to metrics that combine performance, reliability, power consumption and battery life. In particular, local decision algorithms are described in the context of an imaging array application running on a network of resource-constrained processing elements. It is demonstrated that DFTM algorithms, in conjunction with appropriately chosen activation times, increase the mean computation before battery failure for a single battery by a factor of between 1.1 and 5.8 for the application investigated.

An Algorithm for Nano-Pipelining of Circuits and Architectures for a Nanotechnology [p. 974]

P. Gupta and N. Jha

In this paper, we describe an algorithm to post-process a register-transfer level (RTL) architecture to enable gate-level pipelining or nano-pipelining for the nanotechnology based on resonant tunneling diodes (RTDs). Nano-pipelining offers the opportunity to obtain massive throughput and, therefore, has applications in data-intensive algorithms such as digital signal processing (DSP). Since RTDs are a self-latching nanotechnology, nano-pipelining is an implicit property that should be exploited for this technology. The novelty of this work lies in exploring and demonstrating the benefits of nano-pipelining and presenting an algorithm for architectural nano-pipelining.

Smaller Two-Qubit Circuits for Quantum Communication and Computation [p. 980]

V. Shende, I. Markov, and S. Bullock

We show how to implement an arbitrary two-qubit unitary operation using any of several quantum gate libraries with small a priori upper bounds on gate counts. In analogy to library-less logic synthesis, we consider circuits and gates in terms of the underlying model of quantum computation, and do not assume any particular technology. As increasing the number of qubits can be prohibitively expensive, we assume throughout that no extra qubits are available for temporary storage. Using quantum circuit identities, we improve an earlier lower bound of 17 elementary gates by Bullock and Markov to 18, and their upper bound of 23 elementary gates to 18. We also improve upon the generic circuit with six CNOT gates by Zhang et al. (our circuit uses three), and that by Vidal and Dawson with 11 basic gates (we use 10). We study the performance of our synthesis procedures on two-qubit operators that are useful in quantum algorithms and communication protocols. With additional work, we find small circuits and improve upon previously known circuits in some cases.

7G: Embedded Tutorial -- Architectures and Design Techniques for Energy Efficient Embedded DSP and Multimedia Processing

Organizer: E. Macii, Politecnico di Torino, IT
Moderator: N. Chang, Seoul National U, KR
Speakers:
I. Verbauwhede, UCLA, US
C. Piguet, CSEM, CH
P. Schaumont, UCLA, US
B. Kienhuis, Leiden U, NL

Architectures and Design Techniques for Energy Efficient Embedded DSP and Multimedia Processing [p. 988]: Energy efficient embedded systems consist of a heterogeneous collection of very specific building blocks, connected together by a complex network of many dedicated busses and interconnect options. The trend to merge multiple functions into one device makes the design and integration of these 'systems-on-chip' (SOC's) even more problematic. Yet, specifications and applications are never fixed and require the embedded units to be programmable. The topic of this paper is to give the designer architectures and design techniques to find the right balance between energy efficiency and flexibility. The key is to include programmability (or reconfiguration) at the right level of abstraction and tuned to the application domain. The challenge is to provide an exploration and programming environment for this heterogeneous architecture platform.

8A: Platform-Based Design and VC Reuse Methods

Moderators: R. Seepold, Carlos III de Madrid U, ES; T. Riesgo, UP Madrid, ES

Measurement of IP Qualification Costs and Benefits [p. 996]

A. Vörg, W. Rosenstiel, and M. Radetzki

IP core reuse is necessary to overcome the design gap. Yet experience during IP integration has shown that risk is still considerably high when dealing with IPs. IP qualification provides IP providers and integrators with measurable quality characteristics that enable high-quality IP cores and put buy decisions on a quantifiable basis. This paper presents unprecedented results that facilitate the comparison of the effectiveness of reusing qualified, digital soft IP to previous, immature reuse methods. An impressive reduction in IP integration effort, which is profitable for the IP customer, is demonstrated. Moreover, we show that the IP business can be profitable for the IP provider despite the additional qualification effort.

Architecture-Level Performance Estimation for IP-Based Embedded Systems [p. 1002]

K. Ueda, K. Sakanushi, Y. Takeuchi, and M. Imai

In this paper, we propose an architecture-level performance estimation method for IP-based embedded systems using system-level profiling. Our method estimates performance through the following procedure: 1) system-level profiling; 2) automatic construction of the execution dependency graph (EDG) from the profile information; 3) estimation of the system performance based on the EDG analysis. Our method enables fast performance estimation because it can estimate the performance of various architectures from the same system-level profile information. Experimental results show that our estimation method is about 10,000 times faster than architecture-level simulation.

Generalized Latency-Insensitive Systems for Single-Clock and Multi-Clock Architectures [p. 1008]

M. Singh and M. Theobald

Latency-insensitive systems were recently proposed by Carloni et al. as a correct-by-construction methodology for single-clock system-on-a-chip (SoC) design using predesigned IP blocks. Their approach overcomes the problem of long latencies of global interconnects in deep-submicron technologies, while still maintaining much of the inherent simplicity of synchronous design. In particular, wires whose latency is greater than a clock cycle are segmented using "relay stations," and IP blocks are made robust to arbitrary communication delays. This paper shows, however, that significant extensions are needed to make latency-insensitive systems useful for the practical design of large-scale SoC's. In particular, this paper proposes three extensions. The first extension allows each synchronous module to treat its input and output channels in a much more flexible manner, i.e., with greater decoupling. The second extension generalizes inter-module communication from point-to-point channels to more complex networks of arbitrary topologies. Finally, the third extension is to target multi-clock SoC's. The net impact of our extensions is the potential for improved throughput, reduced power consumption, and greater flexibility in design.

Platform Based on Open-Source Cores for Industrial Applications [p. 1014]

M. Bolado, J. Castillo, P. Huerta, H. Posadas, P. Sánchez, C. Sánchez, F. Blasco, and H. Fouren

The latest version of the International Technology Roadmap for Semiconductors predicts that design reuse will be essential in the near future to face the constantly increasing design complexity. The concept comes from software engineering in which reuse is a fundamental technology. In order to provide libraries and applications to reuse in software development, some open-source initiatives (e.g. Linux, gcc, X, mysql) have appeared during the last decades. The basic idea is to distribute the library or application source code (normally free-of-charge) and allow any developer to use, modify, debug and improve it. Several initiatives have tried to port this idea to hardware development. The main goal of this paper is to develop a synthesizable platform described in SystemC from an open architecture. The platform includes a CPU (OpenRISC) and some basic peripherals. A set of software development tools (compiler, assembler, debugger) and RTOS (eCos) has also been developed. This work enables the evaluation of the advantages and disadvantages of the open-source model in electronic system design.

MINCE: Matching INstructions with Combinational Equivalence for Extensible Processor [p. 1020]

N. Cheung, S. Parameswaran, J. Henkel, and J. Chan

Designing custom-extensible instructions for Extensible Processors is a computationally complex task because of the large design space. The task of automatically matching candidate instructions in an application (e.g. written in a high-level language) to a pre-designed library of extensible instructions is especially challenging. Previous approaches have focused on identifying extensible instructions (e.g. through profiling), synthesizing extensible instructions, estimating expected performance gains etc. In this paper we introduce our approach of automatically matching extensible instructions, as this key step is missing in automating the entire design flow of an ASIP with extensible instruction capabilities. Since matching using simulation is practically infeasible (simulation time), and traditional pattern matching approaches would not yield reliable results (ambiguity related to functionally equivalent code that can be represented in many different ways), we adopt combinational equivalence checking. Our MINCE tool, as part of our ASIP design flow, consists of a translator, a filtering algorithm and a combinational equivalence checking tool. We report matching times of extensible instructions that are 7.3x faster on average (using Mediabench applications) compared to the best known approaches to the problem (partial simulations). In all our experiments MINCE matched correctly, and the outcome of the matching step yielded an average speedup of the application of 2.47x. In summary, our work represents a key step towards automating the whole design flow of an ASIP with extensible instruction capabilities.

8B: Real-Time Issues in Embedded Systems

Moderators: S. Hu, Notre Dame U, US; F. Wolf, Volkswagen, DE

Design Optimization of Multi-Cluster Embedded Systems for Real-Time Applications [p. 1028]

P. Pop, P. Eles, Z. Peng, V. Izosimov, M. Hellring, and O. Bridal

We present an approach to design optimization of multi-cluster embedded systems consisting of time-triggered and event-triggered clusters, interconnected via gateways. In this paper, we address design problems which are characteristic of multi-clusters: partitioning of the system functionality into time-triggered and event-triggered domains, process mapping, and the optimization of parameters corresponding to the communication protocol. We present several heuristics for solving these problems. Our heuristics are able to find schedulable implementations under limited resources, achieving an efficient utilization of the system. The developed algorithms are evaluated using extensive experiments and a real-life example.

Timing Analysis for Preemptive Multi-Tasking Real-Time Systems with Caches [p. 1034]

Y. Tan and V. Mooney

In this paper, we propose an approach to estimate the Worst Case Response Time (WCRT) of tasks in a preemptive multi-tasking single-processor real-time system with a set associative cache. The approach focuses on analyzing the cache reload overhead caused by preemptions. We combine inter-task cache eviction behavior analysis and path analysis of the preempted task to reduce, in our analysis, the estimate of the number of cache lines that can possibly be evicted by the preempting task (thus requiring a reload by the preempted task). A mobile robot application which contains three tasks is used to test our approach. The experimental results show that our approach can tighten the WCRT estimate by up to 73% over prior state-of-the-art.
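
To show where a cache reload estimate enters a response-time bound, the sketch below uses the classical fixed-point response-time iteration for fixed-priority preemptive scheduling, with an assumed constant reload cost charged per preemption. It is not the authors' analysis, which derives a tighter per-preemption estimate from cache eviction and path analysis; the task parameters and names are hypothetical.

```python
import math

def wcrt(task, higher_priority, reload_cost, deadline):
    """Iterate R = C + sum(ceil(R / T_j) * (C_j + reload_cost)) to a fixed point.

    All times are integers in the same unit. Returns the converged response
    time, or None if the estimate exceeds the deadline before converging.
    """
    response = task["wcet"]
    while response <= deadline:
        interference = sum(
            math.ceil(response / hp["period"]) * (hp["wcet"] + reload_cost)
            for hp in higher_priority
        )
        updated = task["wcet"] + interference
        if updated == response:
            return response
        response = updated
    return None

# Hypothetical task set: one low-priority task preempted by one periodic task.
LOW = {"wcet": 10}
HIGH = [{"period": 20, "wcet": 4}]
```

Lowering the per-preemption reload estimate directly lowers the bound, which is why a less pessimistic count of evicted cache lines tightens the WCRT.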

Workload Characterization Model for Tasks with Variable Execution Demand [p. 1040]

A. Maxiaguine, S. Künzli, and L. Thiele

The analysis of real-time properties of an embedded system usually relies on the worst-case execution times (WCET) of the tasks to be executed. In contrast to that, in real-world applications the running time of tasks may vary from execution to execution, e.g. in multimedia applications. The traditional worst-case analysis of the system then returns overly pessimistic estimates of the system performance. In this paper we propose a new effective method to characterize tasks with variable execution requirements, which leads to tighter worst-case bounds on system performance and better use of available resources. We show the applicability of our approach by a detailed study of a multimedia application.

Context-Aware Performance Analysis for Efficient Embedded System Design [p. 1046]

M. Jersak, R. Henia, and R. Ernst

Performance analysis has many advantages in theory compared to simulation for the validation of complex embedded systems, but is rarely used in practice. To make analysis more attractive, it is critical to calculate tight analysis bounds. This paper shows that advanced performance analysis techniques that take correlations between successive computation or communication requests, as well as a correlated load distribution, into account can yield much tighter analysis bounds. Cases where such correlations have a large impact on system timing are especially difficult to simulate and, hence, are an ideal target for formal performance analysis.

Compact Binaries with Code Compression in a Software Dynamic Translator [p. 1052]

S. Shogan and B. Childers

Embedded software is becoming more flexible and adaptable, which presents new challenges for management of highly constrained system resources. Software dynamic translation (SDT) has been used to enable software malleability at the instruction level for dynamic code optimizers, security checkers, and binary translators. This paper studies the feasibility of using SDT to manage program code storage in embedded systems. We explore to what extent code compression can be incorporated in a software infrastructure to reduce program storage requirements, while minimally impacting run-time performance and memory resources. We describe two approaches for code compression, called full and partial image compression, and evaluate their compression ratios and performance in a software dynamic translation system. We demonstrate that code decompression is indeed feasible in an SDT system.

8C: Real-Life Defect Modelling and Detection

Moderators: R. Aitken, Artisan, US; H. Manhaeve, Q-star Test, BE

Pattern Selection for Testing of Deep Sub-Micron Timing Defects [p. 1060]

M. C. Chao, L. Wang, and K. Cheng

Due to process variations in deep sub-micron (DSM) technologies, the effects of timing defects are difficult to capture. This paper presents a novel coverage metric for estimating the test quality with respect to timing defects under process variations. Based on the proposed metric and a dynamic timing analyzer, we develop a pattern-selection algorithm for selecting the minimal number of patterns that can achieve the maximal test quality. To shorten the run time in dynamic timing analysis, we propose an algorithm to speed up the Monte-Carlo-based simulation. Our experimental results show that selecting a small percentage of patterns from a multiple-detection transition fault pattern set is sufficient to maintain the test quality given by the entire pattern set. We present run-time and accuracy comparisons to demonstrate the efficiency and effectiveness of our pattern selection framework.

Balanced Excitation and its Effect on the Fortuitous Detection of Dynamic Defects [p. 1066]

J. Dworak, B. Cobb, J. Wingfield, and M. Mercer

Dynamic defects are less likely to be fortuitously detected than static defects because they have more stringent detection requirements. We show that (in addition to more site observations) balanced excitation is essential for detection of these defects, and we present a metric for estimating this degree of balance. We also show that excitation balance correlates with the parameter in the MPG-D defective part level model.

Intermittent Scan Chain Fault Diagnosis Based on Signal Probability Analysis [p. 1072]

Y. Huang, W. Cheng, C. Hsieh, H. Tseng, A. Huang, and Y. Hung

A new algorithm to diagnose intermittent scan chain faults in scan-based designs is proposed in this paper. An intermittent scan chain fault is sometimes triggered and sometimes not triggered during scan chain shifting, which makes it very difficult to locate the fault sites. In this paper, we provide answers to three questions:
(1) Why do intermittent scan chain faults happen?
(2) Why is diagnosis of this type of fault necessary?
(3) How can this type of fault be diagnosed?
The experimental results presented demonstrate that the proposed diagnosis algorithm is effective for large industrial designs with multiple intermittent scan chain faults.

A Modeling Approach for Addressing Power Supply Switching Noise Related Failures of Integrated Circuits [p. 1078]

C. Tirumurti, S. Kundu, S. Sur-Kolay, and Y. Chang

Power density of high-end microprocessors has been increasing by approximately 80% per technology generation, while the voltage is scaling by a factor of 0.8. This raises the current per unit area by a factor of 2.25 (1.8/0.8) with each successive technology generation. The cost of maintaining the same IR drop becomes too high. This leads to compromises in power delivery, and the power grid becomes a performance limiter. Traditional performance-related test techniques with transition and path delay fault models focus on testing the logic but not the power delivery. In this paper we view the power grid as a performance limiter and develop a fault model to address the problem of vector generation for delay faults arising out of power delivery problems. A fault extraction methodology applied to a microprocessor design block is explained.

Soft Faults and the Importance of Stresses in Memory Testing [p. 1084]

Z. Al-Ars and A. van De Goor

Memory testing in general, and DRAM testing in particular, has become greatly dependent on the modification of stresses (timing, temperature and voltages) in a way that is difficult to justify using the current understanding of memory faults. This paper introduces a new class of fault models (soft faults) based on a special classification of memory faults, that shows why it is fundamentally necessary to apply stresses. The paper calculates the relative probability of soft faults for a specific failure mechanism and compares this probability in DRAMs with that in SRAMs. In addition, the concept of soft faults is validated using defect injection and electrical simulation of a Spice DRAM model.
Keywords: Fault modeling, soft faults, memory testing, stress application, defect simulation.

8E: Optimisation in Physical Design

Moderators: J. Lienig, TU Dresden, DE; R. Otten, TU Eindhoven, NL

Wire Retiming for System-On-Chip by Fixpoint Computation [p. 1092]

C. Lin and H. Zhou

In current and future Systems-on-Chip, a non-negligible part of operation time is spent on multiple-clock-period wires. Retiming -- that is, moving flip-flops in a circuit without changing its functionality -- can be exploited to pipeline long interconnect wires in SOC designs. The problem of retiming over a netlist of macro-blocks, where the internal structures may not be changed and flip-flops may not be inserted on some wire segments, is called the wire retiming problem. In this paper, we formulate the constraints of the wire retiming problem as a fixpoint computation and use an iterative algorithm to solve it. Experimental results show that this approach is multiple orders of magnitude more efficient than the previous one.

Boosting: Min-Cut Placement with Improved Signal Delay [p. 1098]

A. Kahng, S. Reda, and I. Markov

In this work we improve top-down min-cut placers in the context of timing closure. Using the concept of boosting factors, we adjust net weights according to net spans, so as to reduce the quadratic wirelength. Our method is generic and does not involve any timing analysis during or prior to placement. In essence, we skew the netlength distribution produced by a min-cut placer so as to decrease the number of long nets, with minimal impact on the overall wirelength. Empirically this approach does not significantly affect runtime, but reduces the worst negative slack and total negative slack of industrial benchmarks by up to 70% compared to Capo [5] and a leading industrial placer.

Optimal Algorithm for Minimizing the Number of Twists in an On-Chip Bus [p. 1104]

L. Deng and M. Wong

Complementary bus architecture is used to achieve higher speed and lower power in VLSI chips. However, in deep submicron circuit design, the effects of crosstalk become more and more serious, especially in the bus structure, where wires are placed close to each other. Complementary bus architecture with twisted wires can reduce the coupling noise. But in the current chip design flow, engineering change orders (ECOs) are commonly issued to meet improvement requirements. Layout changes due to ECOs introduce obstacles to the twists, which could reduce the number of twists and increase the coupling noise. In this paper, an ECO algorithm for generating twisted complementary architecture is proposed based on the shortest path algorithm. Our algorithm is guaranteed to give the minimum number of twists along the bus wires under noise constraints. Experimental results show that the twist patterns generated by our algorithm can effectively reduce the capacitive coupling noise.

A Fast Word-Level Statistical Estimator of Intra-Bus Crosstalk [p. 1110]

S. Gupta and S. Katkoori

Given word-level statistics, namely mean, standard deviation, and lag-one temporal correlation of input data, we estimate the bit-level crosstalk probability on a system bus using a non-enumerative statistical approach. We introduce a sampling technique for fast evaluation of integrals during the estimation process. We previously proposed two techniques -- (a) a stream-based estimator that counts crosstalk events on a bus; and (b) a statistical enumeration technique that enumerates crosstalk-producing values on a bus and computes their occurrence probability. Both these techniques suffer from exponential time complexity with respect to the bus-width. In this work, we propose a statistical non-enumerative technique that has linear time complexity with respect to the bus-width. We achieve the linear complexity by resorting to: (1) manipulating the data stream to make the crosstalk-producing values contiguous and (2) sampling the distribution function and storing it as a lookup table. Experimental results for data streams from different data environments are presented and compared against the stream-based approach. Average errors of less than 12% are obtained for bus-widths ranging from 8b to 32b.
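
The lookup-table idea in step (2) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the word-level data are Gaussian; the standard normal distribution function is sampled once into a table, and each later probability query becomes a table read with linear interpolation instead of an integral evaluation. The function names, table range and step size are assumptions and are not taken from the paper.

```python
import math

def build_cdf_table(lo=-6.0, hi=6.0, step=0.01):
    """Sample the standard normal CDF on a uniform grid (assumed range and step)."""
    n = int(round((hi - lo) / step)) + 1
    values = [0.5 * (1.0 + math.erf((lo + i * step) / math.sqrt(2.0)))
              for i in range(n)]
    return lo, step, values

def cdf_lookup(table, z):
    """Read the CDF from the table with linear interpolation, clamping outside the grid."""
    lo, step, values = table
    pos = (z - lo) / step
    if pos <= 0:
        return values[0]
    if pos >= len(values) - 1:
        return values[-1]
    i = int(pos)
    frac = pos - i
    return values[i] + frac * (values[i + 1] - values[i])

def interval_probability(table, mean, std, a, b):
    """P(a <= X < b) for X ~ N(mean, std^2), using only table reads."""
    return cdf_lookup(table, (b - mean) / std) - cdf_lookup(table, (a - mean) / std)

# Built once, then reused for every query.
TABLE = build_cdf_table()
```

A call such as interval_probability(TABLE, mean, std, a, b) then gives the probability that a word falls in a contiguous range of values, which is why step (1), making the crosstalk-producing values contiguous, pairs naturally with the table.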

Full-Chip Multilevel Routing for Power and Signal Integrity [p. 1116]

J. Xiong and L. He

Conventional physical design flow separates the design of power network and signal network. Such a separated approach results in slow design convergence for wire-limited deep sub-micron designs. We present a novel design methodology that simultaneously considers global signal routing and power network design under integrity constraints. The key part of this approach is a simple yet accurate power net estimation formula that decides the minimum number of power nets needed to satisfy both power and signal integrity constraints prior to detailed layout. The proposed design methodology is a one-pass solution to the co-design of power and signal networks in the sense that no iteration between them is required in order to meet design closure. Experimental results using large industrial benchmarks show that, compared to the state-of-the-art alternative design approach, the proposed method can reduce the power network area by 19.4% on average under the same signal and power integrity constraints, with better routing quality and less runtime.

8G: Hot Topic -- Platforms and Tools for Energy-Efficient Design of Multimedia Systems

Organizer/Moderator: E. Pol, Philips Research, NL
Speakers:
H. Van Antwerpen, Philips Research, NL
R. von Vignau, Philips Research, NL
R. Gupta, UC San Diego, US
N. Dutt, UC Irvine, US
N. Venkatasubramanian, UC Irvine, US
S. Mohapatra, UC Irvine, US
C. Pereira, UC San Diego, US

Energy-Aware System Design for Wireless Multimedia [p. 1124]: In this paper, we present various challenges that arise in the delivery and exchange of multimedia information to mobile devices. Specifically, we focus on techniques for maintaining QoS to end-user multimedia applications (e.g. video streaming, multimedia conferencing) while maximizing device lifetimes. In order to cope with the resource intensive nature of multimedia applications (in terms of computation, bandwidth and consequently power) and dynamic congestion levels in wireless networks, an end-to-end approach to QoS-aware power optimization is required. We discuss the trend towards such an integrated approach that couples the architectural, OS, middleware and application layers to achieve both user experience and device energy gains. We conclude with a discussion of tools for integrated system design and testing that will aid in rapid deployment of wireless multimedia.

9A: Communication Design for MPSoC

Moderators: K. Goossens, Philips Research, NL; L. Benini, Bologna U, IT

Unified Component Integration Flow for Multi-Processor SoC Design and Validation [p. 1132]

M. Dziri, W. Cesário, A. Jerraya, and F. Wagner

Most System-on-Chip (SoC) design methodologies promote the reuse of pre-designed (hardware, software, and functional) components. However, as these components are heterogeneous, their integration requires complex interface sub-systems. These sub-systems can also be constructed by assembling pre-designed basic interface components. Hence, SoC design and validation involves component composition techniques to create hardware, software, and functional interface sub-systems by assembling basic interface components. We propose a unified methodology for automatic component integration that allows designers to reuse pre-designed components effectively. We also present ROSES, a design flow that uses this methodology to generate hardware, software, and functional interface sub-systems automatically starting from a system-level architectural model.

An Interconnect Channel Design Methodology for High Performance Integrated Circuits [p. 1138]

V. Chandra, A. Xu, H. Schmit, and L. Pileggi

On-chip communication is becoming a bottleneck for high performance designs. Conventional interconnect design methodology does not account for architectures and/or communication schemes that require storage buffers (First-In-First-Out queues or FIFOs) in the interconnect channel. For example, FIFOs and flow-control are needed for Network-on-Chip, high performance ASICs and multiple clock domain designs. These IC implementation architectures require an efficient methodology to determine the size of the FIFOs in the channel since the FIFO sizes affect system performance. In this work we devised a methodology to size the FIFOs in an interconnect channel containing one or more FIFOs connected in series. We show that the sizing of the FIFOs in the channel is a function of system parameters such as data production rate and consumption rate, data burstiness, number of channel stages etc. and we also quantify their effect on performance. For a single clock design, we have developed an efficient algorithm which reduces the search space for the optimal sizing of the FIFOs in the channel.
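
The dependence of FIFO size on production rate, consumption rate and burstiness can be illustrated with a small cycle-by-cycle sketch. It is not the sizing algorithm of the paper: it assumes a single FIFO, one clock, and hypothetical per-cycle patterns in which 1 means the producer writes (or the consumer is ready to read) in that cycle, and it reports the peak occupancy, that is, the smallest depth for which the producer never stalls.

```python
def required_fifo_depth(produce, consume):
    """Peak FIFO occupancy for per-cycle produce/consume patterns (lists of 0/1).

    In each cycle the producer writes first, then the consumer may read one item.
    """
    occupancy = 0
    peak = 0
    for p, c in zip(produce, consume):
        occupancy += p
        peak = max(peak, occupancy)
        if c and occupancy > 0:
            occupancy -= 1
    return peak

# Two hypothetical producers with the same average rate (4 items in 8 cycles).
bursty = [1, 1, 1, 1, 0, 0, 0, 0]
smooth = [1, 0, 1, 0, 1, 0, 1, 0]
consumer = [0, 1, 0, 1, 0, 1, 0, 1]  # reads every other cycle

depth_bursty = required_fifo_depth(bursty, consumer)
depth_smooth = required_fifo_depth(smooth, consumer)
```

With these patterns the bursty producer needs a deeper FIFO than the smooth one even though both deliver the same number of items, which is the burstiness effect mentioned in the abstract.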

Modeling Shared Resource Contention Using a Hybrid Simulation/Analytical Approach [p. 1144]

A. Bobrek, J. Pieper, J. Nelson, J. Paul, and D. Thomas

Future Systems-on-Chips will include multiple heterogeneous processing units, with complex data-dependent shared resource access patterns dictating the performance of a design. Currently, the most accurate methods of simulating the interactions between these components operate at the cycle-accurate level, which can be very slow to execute for large systems. Analytical models sacrifice accuracy for speed, and cannot cope with dynamic data-dependent behavior well. We propose a hybrid approach combining simulation with piecewise evaluation of analytical models that apply time penalties to simulated regions. Our experimental results show that for representative heterogeneous multiprocessor applications, simulation time can be decreased by 100 times over cycle-accurate models, while the error can be reduced by 60% to 80% over traditional analytical models to within 18% of an ISS simulation.

Supporting Cache Coherence in Heterogeneous Multiprocessor Systems [p. 1150]

T. Suh, D. Blough, and H. Lee

In embedded system-on-a-chip (SoC) applications, the demand for integrating heterogeneous processors onto a single chip is increasing. An important issue in integrating multiple heterogeneous processors on the same chip is to maintain the coherence of their data caches. In this paper, we propose a hardware/software methodology to make caches coherent in heterogeneous multiprocessor platforms with shared memory. Our approach works with any combination of processors that support invalidation-based protocols. As shown in our experiments, up to 58% performance improvement can be achieved with low miss penalty at the expense of adding simple hardware, compared to a pure software solution. Speedup can be improved even further as the miss penalty increases. In addition, our approach provides embedded system programmers a transparent view of shared data, removing the burden of software synchronization.

9B: Combining Static and Dynamic Software Optimisation

Moderators: R. Ernst, TU Braunschweig, DE; P. Kajfasz, Thales Communications, FR

Exploiting Processor Workload Heterogeneity for Reducing Energy Consumption in Chip Multiprocessors [p. 1158]

I. Kadayif, M. Kandemir, and I. Kolcu

Advances in semiconductor technology are enabling designs with several hundred million transistors. Since building sophisticated single processor based systems is a complex process from design, verification, and software development perspectives, the use of chip multiprocessing is inevitable in future microprocessors. In fact, the abundance of explicit loop-level parallelism in many embedded applications helps us identify chip multiprocessing as one of the most promising directions in designing systems for embedded applications. Another architectural trend that we observe in embedded systems, namely, multi-voltage processors, is driven by the need of reducing energy consumption during program execution. Practical implementations such as Transmeta's Crusoe and Intel's XScale tune processor voltage/frequency depending on current execution load. Considering these two trends, chip multiprocessing and voltage/frequency scaling, this paper presents an optimization strategy for an architecture that makes use of both chip parallelism and voltage scaling. In our proposal, the compiler takes advantage of heterogeneity in parallel execution between the loads of different processors and assigns different voltages/frequencies to different processors if doing so reduces energy consumption without increasing overall execution cycles significantly. Our experiments with a set of applications show that this optimization can bring large energy benefits without much performance loss.
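
The per-processor assignment can be sketched as follows. This is a simplified illustration rather than the compiler algorithm of the paper: the voltage/frequency levels are hypothetical, each processor's share of a parallel loop is given as a cycle count, the finish time of the most loaded processor at the highest frequency is taken as the deadline, and energy is approximated as cycles times voltage squared.

```python
# Hypothetical (voltage in V, frequency in Hz) operating points.
LEVELS = [(0.8, 200e6), (1.0, 400e6), (1.2, 600e6), (1.5, 800e6)]

def assign_levels(cycles_per_cpu, levels=LEVELS, slack=0.0):
    """Pick, per processor, the slowest level that still meets the common deadline.

    The deadline is the time the most loaded processor needs at the highest
    frequency, optionally stretched by a tolerated slowdown `slack`.
    """
    levels = sorted(levels, key=lambda lv: lv[1])
    f_max = levels[-1][1]
    deadline = max(cycles_per_cpu) / f_max * (1.0 + slack)
    chosen = []
    for cycles in cycles_per_cpu:
        for v, f in levels:
            if cycles / f <= deadline:
                chosen.append((v, f))
                break
    return chosen

def relative_energy(cycles_per_cpu, chosen):
    """Dynamic energy estimate, proportional to cycles * V^2 (assumed model)."""
    return sum(c * v * v for c, (v, _) in zip(cycles_per_cpu, chosen))

loads = [8e6, 3e6, 1e6]  # unbalanced loop iterations, in cycles
scaled = assign_levels(loads)
all_fast = [LEVELS[-1]] * len(loads)
```

Here the lightly loaded processors drop to lower levels while the most loaded one stays at the highest level, so the loop finishes at the same time with a lower energy estimate.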

Fault-Tolerant Deployment of Embedded Software for Cost-Sensitive Real-Time Feedback-Control Applications [p. 1164]

C. Pinello, L. Carloni, and A. Sangiovanni-Vincentelli

Designing cost-sensitive real-time control systems for safety-critical applications requires a careful analysis of the cost/coverage trade-offs of fault-tolerant solutions. This further complicates the difficult task of deploying the embedded software that implements the control algorithms on the execution platform that is often distributed around the plant (as it is typical, for instance, in automotive applications). We propose a synthesis-based design methodology that relieves the designers from the burden of specifying detailed mechanisms for addressing platform faults, while involving them in the definition of the overall fault-tolerance strategy. Thus, they can focus on addressing plant faults within their control algorithms, selecting the best components for the execution platform, and defining an accurate fault model. Our approach is centered on a new model of computation, Fault Tolerant Data Flows (FTDF), that enables the integration of formal validation techniques.

Task Feasibility Analysis and Dynamic Voltage Scaling in Fault-Tolerant Real-Time Embedded Systems [p. 1170]

Y. Zhang and K. Chakrabarty

We investigate dynamic voltage scaling (DVS) in real-time embedded systems that use checkpointing for fault tolerance. We present feasibility-of-scheduling tests for checkpointing schemes for a constant processor speed as well as for variable processor speeds. DVS is then carried out on the basis of the feasibility analysis. We incorporate practical issues such as faults during checkpointing and state restoration, rollback recovery time, memory access time and energy, and DVS overhead. Simulation results are presented for real-life checkpointing data and embedded processors.

Quasi-Static Scheduling for Real-Time Systems with Hard and Soft Tasks [p. 1176]

L. Cortés, P. Eles, and Z. Peng

This paper addresses the problem of scheduling for real-time systems that include both hard and soft tasks. The relative importance of soft tasks and how the quality of results is affected when missing a soft deadline are captured by utility functions associated with soft tasks. Thus the aim is to find the execution order of tasks that maximizes the total utility and guarantees hard deadlines. We consider time intervals rather than fixed execution times for tasks. Since a purely off-line solution is too pessimistic and a purely online approach incurs an unacceptable overhead due to the high complexity of the problem, we propose a quasi-static approach where a number of schedules are prepared at design-time and the decision of which of them to follow is taken at run-time based on the actual execution times. We propose an exact algorithm as well as different heuristics for the problem addressed in this paper.
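
The run-time side of such a quasi-static approach can be sketched as a table lookup. The table below is a hypothetical design-time result, not one produced by the algorithms of the paper: task names, the switching point and the orderings are invented for illustration. After a first task A completes, the observed finish time selects which precomputed ordering of the remaining hard task H1 and soft tasks S1, S2 to follow.

```python
# Hypothetical design-time output: (latest finish time of task A, order to follow).
# Entries are checked top to bottom; the last one is unconditional.
SCHEDULE_TABLE = [
    (4, ["S1", "H1", "S2"]),             # A finished early: a soft task can run first
    (float("inf"), ["H1", "S1", "S2"]),  # A finished late: protect the hard deadline
]

def select_schedule(finish_time_of_a, table=SCHEDULE_TABLE):
    """Return the precomputed task order whose switching condition matches."""
    for limit, order in table:
        if finish_time_of_a <= limit:
            return order
    raise ValueError("table must end with an unconditional entry")
```

The run-time cost is a handful of comparisons, while the expensive search for good orderings and switching points stays at design-time.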

9C: Hot Topic -- The Status of the New IEEE Test Standards

Organizer/Moderator: B. Bennetts, Bennetts Associates, UK

Status of IEEE Testability Standards 1149.4, 1532 and 1149.6 [p. 1184]

S. Sunter, A. Osseiran, A. Cron, N. Jacobson, D. Bonnett, B. Eklow, C. Barnhart, and B. Bennetts

Single board, and now multi-board testability is highly conditioned by the availability of various forms of boundary scan technology. This paper surveys the three more recent IEEE Standards relating to boundary scan. The paper is based on three backgrounders prepared by members of the individual Working Groups for the IEEE Standards booth at ITC 2003.

9E: Modelling and Estimation in Circuit Layout

Moderators: I. Markov, Michigan U, US; J. Lienig, TU Dresden, DE

Eliminating False Positives in Crosstalk Noise Analysis [p. 1192]

Y. Ran, M. Marek-Sadowska, A. Kondratyev, and Y. Watanabe

Noise affects circuit operation by increasing gate delays and causing latches to capture incorrect values. Noise analysis techniques can detect some such noise faults, but accurate analysis requires a careful examination of timing and functional properties of the circuit. This paper proposes a method to check the 'true' noise impact on path delay. It uses four-variable Boolean logic to characterize signal transitions in a time interval, and formulates Boolean satisfiability between aggressors and a victim under the min-max delay model for gates. The proposed technique is scalable as it keeps the size of the Boolean formulation linear in the size of the modeled circuit. Applied to a set of large circuits, it eliminated up to 50% of the noise delay faults reported by a conventional noise analysis method.

A New Approach to Timing Analysis Using Event Propagation and Temporal Logic [p. 1198]

A. Mondal, P. Chakrabarti, and C. Mandal

Present day designers require deep reasoning methods to analyze circuit timing. This includes analysis of effects of dynamic behavior (like glitches) on critical paths, simultaneous switching and identification of specific patterns and their timings. This paper proposes a novel approach that uses a combination of symbolic event propagation and temporal reasoning to extract timing properties of gate-level circuits. The formulation captures complex situations like triggering of traditional false paths and simultaneous switching in a unified symbolic representation in addition to identifying false paths, critical paths as well as conditions for such situations. This information is then represented as an event-time graph. A simple temporal logic on events is proposed that can be used to formulate a wide class of useful queries for various input scenarios. These include maximum/minimum delays, transition times, duration of patterns, etc. An algorithm is developed that retrieves answers to such queries from the event-time graph. A complete BDD based implementation of this system has been made. Results on the ISCAS85 benchmarks indicate very interesting properties of these circuits.

A New Effective Congestion Model in Floorplan Design [p. 1204]

Y. Hsieh and T. Hsieh

In this paper, we provide a new efficient and accurate congestion model embedded into a floorplanner to estimate the congestion of floorplans. It is based on probabilistic analysis and a new concept of Irregular-Grid which uses the routing information to determine the evaluating regions instead of fixed-size grids. Three complete experiments are performed and the experimental results show the correctness, accuracy and efficiency of our new congestion model.

ULSI Interconnect Length Distribution Model Considering Core Utilization [p. 1210]

H. Nakashima, J. Inoue, K. Okada, and K. Masu

Interconnect Length Distribution (ILD) represents a correlation between the number of interconnects and length. The ILD can predict power consumption, clock frequency, chip size, etc. It has been said that high core utilization and small circuit area improve chip performance. We propose an ILD model to predict a correlation between core utilization and chip performance. The proposed model predicts influences of interconnect length and interconnect density on circuit performance. As core utilization increases, small and simple circuits improve in performance. In large complex circuits, decrease of load capacitance is more important than that of total interconnect length for improvement of chip performance. The proposed ILD model expresses actual ILD more accurately than conventional models.

9G: Applications of Reconfigurability

Moderators: Y. Tanurhan, Actel, US; W. Rosenstiel, Tuebingen U and FZI Karlsruhe, DE

Implementation of a UMTS Turbo-decoder on a Dynamically Reconfigurable Platform [p. 1218]

A. La Rosa, C. Passerone, F. Gregoretti, and L. Lavagno

Modern embedded systems must execute a variety of high performance real-time tasks, such as audio and image compression and decompression, channel coding and encoding, etc. Reconfigurable platforms can effectively be used in these cases, because they allow the architecture to be re-used for as many applications as possible. This paper describes the implementation of a UMTS turbo-decoder on one such platform, the XiRisc reconfigurable processor. Our goal is to test the development framework and design flow that we already developed on a real industrial example. Our results show that, with some manual effort from the designer, very good performance improvements can be achieved, using a flow close to embedded software development.

Design Methodology for a Tightly Coupled VLIW/Reconfigurable Matrix Architecture: A Case Study [p. 1224]

B. Mei, R. Lauwereins, S. Vernalde, and D. Verkest

Coarse-grained reconfigurable architectures have seen growing importance recently. Design tools and methodology are essential to their success. Based on our previous work on modulo scheduling algorithms and a novel architecture with tightly coupled VLIW/reconfigurable matrix, we present a C-based design flow using an MPEG-2 decoder as a design example. The application is mapped to the architecture in less than one person-week starting from a software implementation. The kernel and overall speedup over the reference VLIW are 4.84 and 3.05 respectively. The case study shows that our methodology and architecture can deliver a competitive package in terms of design efforts and performance over other programmable architectures.

Efficient Implementations of Mobile Video Computations on Domain-Specific Reconfigurable Arrays [p. 1230]

I. Ahmed, S. Baloch, A. Pai, T. Arslan, N. Aydin, S. Khawam, and F. Westall

Mobile video processing as defined in standards like MPEG-4 and H.263 contains a number of time-consuming computations that cannot be efficiently executed on current hardware architectures. The authors recently introduced a reconfigurable SoC platform that permits a low-power, high-throughput and flexible implementation of the motion estimation and DCT algorithms. The computations are done using domain-specific reconfigurable arrays that have demonstrated up to 75% reduction in power consumption when compared to generic FPGA architecture, which makes them suitable for portable devices. This paper presents and compares different configurations of the arrays to efficiently implement DCT and motion estimation algorithms. A number of algorithms are mapped into the various reconfigurable fabrics demonstrating the flexibility of the new reconfigurable SoC architecture and its ability to support a number of implementations having different performance characteristics.

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience [p. 1236]

H. Krupnova

Today, having a fast hardware platform for SoC software development prior to silicon is an important challenge for reducing time-to-market. FPGAs have offered an excellent prototyping basis for building hardware platforms for more than ten years ([1]). However, as the circuit complexity increases and project timeframes shrink, building a multi-FPGA prototype represents a real challenge from the complexity viewpoint. The paper describes the state-of-the-art mapping methodology, prototyping tools and flows, and shows the most difficult mapping problems and the ways to overcome them. The paper draws on the experience of mapping four of the latest highly complex ST Microelectronics SoCs, ranging from 1.5 to 4 million real ASIC gates, onto an FPGA platform of up to 9 highest-capacity FPGAs.

10A: Interconnect Modelling for MPSoC

Moderators: B. Candaele, Thales, FR; A. Jerraya, TIMA Laboratory, FR

Using a Communication Architecture Specification in an Application-Driven Retargetable Prototyping Platform for Multiprocessing [p. 1244]

X. Zhu and S. Malik

In multiprocessor based SoCs, optimizing the communication architecture is often as important as, if not more important than, optimizing the computation architecture. While there are mature platforms and techniques for the modeling and evaluation of architectures of processing elements, the same is not true for the communication architectures. This paper presents an application-driven retargetable prototyping platform which fills this gap. This environment aims to facilitate the design exploration of the communication sub-system through application-level execution-driven simulations and quantitative analysis. First, we introduce an expressive communication architecture specification which gives the designers the freedom to choose and configure their custom interconnection schemes over a wide range of communication architectures, covering the spectrum from buses to packet switching networks. This, combined with a distributed application model, drives a modular modeling and simulation environment that permits detailed simulation of the communication (and computation) architectures at the application level. Through case studies motivated by an embedded system application, we show that through simulations, critical system information such as timings and communication patterns can be obtained and evaluated. Consequently, system-level design choices regarding the communication architecture can be made with high confidence in the early stages of design. In addition to improving design quality, this methodology also significantly shortens design time.

A Power and Performance Model for Network-on-Chip Architectures [p. 1250]

N. Banerjee, P. Vellanki, and K. Chatha

Networks-on-Chip (NoC) has been proposed as a solution for addressing the design challenges of future high-performance nanoscale architectures. Innovative system-level performance models are required for designing NoC based architectures. This paper presents a VHDL based cycle accurate register transfer level model for evaluating the latency, throughput, dynamic, and leakage power consumption of NoC based interconnection architectures. We implemented a parameterized register transfer level design of the NoC architecture elements. The design is parameterized on (i) size of packets, (ii) length and width of physical links, (iii) number, and depth of virtual channels, and (iv) switching technique. The paper discusses in detail the architecture and characterization of the various NoC components. The paper presents results obtained by application of the model towards design space exploration, and power versus performance trade-off analysis of 4x4 mesh based NoC architecture.

A System Level Processor/Communication Co-Exploration Methodology for Multi-Processor System-on-Chip Platforms [p. 1256]

A. Wieferink, T. Kogel, R. Leupers, G. Ascheid, H. Meyr, G. Braun, and A. Nohl

Current and future SoC designs will contain an increasing number of heterogeneous programmable units combined with a complex communication architecture to meet flexibility, performance and cost constraints. Designing such a heterogeneous MP-SoC architecture bears enormous potential for optimization, but requires a system-level design environment and methodology to evaluate architectural alternatives. This paper proposes a methodology to jointly design and optimize the processor architecture together with the on-chip communication based on the LISA Processor Design Platform in combination with SystemC Transaction Level Models. The proposed methodology advocates a successive refinement flow of the system-level models of both the processor cores and the communication architecture. This allows design decisions based on the best modeling efficiency, accuracy and simulation performance possible on the respective abstraction level. The effectiveness of our approach is demonstrated by the exemplary design of a dual-processor JPEG decoding system.

10B: Embedded Software Generation and Optimisation

Moderators: F. Rousseau, TIMA Laboratory, FR; J. Madsen, TU Denmark, DK

Cache-Aware Scratchpad Allocation Algorithm [p. 1264]

M. Verma, L. Wehmeyer, and P. Marwedel

In the context of portable embedded systems, reducing energy is one of the prime objectives. Most high-end embedded microprocessors include on-chip instruction and data caches, along with a small energy efficient scratchpad. Previous approaches for utilizing scratchpad did not consider caches and hence fail for the au courant architecture. In the presented work, we use the scratchpad for storing instructions and propose a generic Cache Aware Scratchpad Allocation (CASA) algorithm. We report an average reduction of 8-29% in instruction memory energy consumption compared to a previously published technique for benchmarks from the Mediabench suite. The scratchpad in the presented architecture is similar to a preloaded loop cache. Comparing the energy consumption of our approach against preloaded loop caches, we report average energy savings of 20-44%.

Phase Coupled Code Generation for DSPs Using a Genetic Algorithm [p. 1270]

M. Lorenz and P. Marwedel

The growing use of digital signal processors (DSPs) in embedded systems necessitates the use of optimizing compilers supporting special hardware features. Due to the irregular architectures present in today's DSPs there is a need for compilers which are capable of performing a phase coupling of the highly interdependent code generation subtasks and a graph based code selection. In this paper we present a code generator which performs a graph based code selection and a complete phase coupling of code selection, instruction scheduling (including compaction) and register allocation. In addition, our code generator takes into account effects of the subsequent address code generation phase. In order to solve the phase coupling problem and to handle the problem complexity, our code generator is based on a genetic algorithm. Experimental results for several benchmarks and an MP3 application for two DSPs show the effectiveness and the retargetability of our approach. Using the presented techniques, the number of execution cycles is reduced by 51% on average for the M3-DSP and by 38% on average for the ADSP2100 compared to standard techniques.

A Methodology and Tool Suite for C Compiler Generation from ADL Processor Models [p. 1276]

M. Hohenauer, H. Scharwaechter, K. Karuri, O. Wahlen, T. Kogel, R. Leupers, G. Ascheid, H. Meyr, G. Braun, and H. van Someren

Retargetable C compilers are key tools for efficient architecture exploration for embedded processors. In this paper we describe a novel approach to retargetable compilation based on LISA, an industrial processor modeling language for efficient ASIP design. In order to circumvent the well-known trade-off between flexibility and code quality in retargetable compilation, we propose a user-guided, semiautomatic methodology that in turn builds on a powerful existing C compiler design platform. Our approach allows generated C compilers to be included in the ASIP architecture exploration loop at an early stage, thereby allowing for a more efficient design process and avoiding application/architecture mismatches. We present the corresponding methodology and tool suite and provide experimental data for two real-life embedded processors that prove the feasibility of the approach.

10C: Scan-Based Testing

Moderators: H. Vranken, Philips Research, NL; C. Papachristou, Case Western Reserve U, US

Nine-Coded Compression Technique with Application to Reduced Pin-Count Testing and Flexible On-Chip Decompression [p. 1284]

M. Tehranipour, M. Nourani, and K. Chakrabarty

This paper presents a new test data compression technique based on a compression code that uses exactly nine codewords. In spite of its simplicity, it provides significant reduction in test data volume and test application time. In addition, the decompression logic is very small and independent of the precomputed test data set. Our technique leaves many don't-care bits unchanged in the compressed test set, and these bits can be filled randomly to detect non-modeled faults. The proposed technique can be efficiently adopted for single- or multiple-scan chain designs to reduce test application time and pin requirement. Experimental results for ISCAS'89 benchmarks illustrate the flexibility and efficiency of the proposed technique.

CircularScan: A Scan Architecture for Test Cost Reduction [p. 1290]

B. Arslan and A. Orailoglu

Scan-based designs are widely used to decrease the complexity of the test generation process; nonetheless, they increase test time and volume. A new scan architecture is proposed to reduce test time and volume while retaining the original scan input count. The proposed architecture allows the use of the captured response as a template for the next pattern with only the necessary bits of the captured response being updated while observing the full captured response. The theoretical and experimental analysis promises a substantial reduction in test cost for large circuits.

Hybrid Delay Scan: A Low Hardware Overhead Scan-Based Delay Test Technique for High Fault Coverage and Compact Test Sets [p. 1296]

S. Wang, S. Chakradhar, and X. Liu

A novel scan-based delay test approach, referred to as the hybrid delay scan, is proposed in this paper. The proposed scan-based delay testing method combines advantages of the skewed-load and broad-side approaches. Unlike the skewed-load approach whose design requirement is often too costly to meet due to the fast switching scan enable signal, the hybrid delay scan does not require a strong buffer or buffer tree to drive the fast switching scan enable signal. Hardware overhead added to standard scan designs to implement the hybrid approach is negligible. Since the fast scan enable signal is internally generated, no external pin is required. Transition delay fault coverage achieved by the hybrid approach is equal to or higher than that achieved by the broad-side load for all ISCAS 89 benchmark circuits. On average, about 4.5% improvement in fault coverage is obtained by the hybrid approach over the broad-side approach.

Diagnosis of Scan-Chains by Use of a Configurable Signature Register and Error-Correcting Codes [p. 1302]

A. Leininger, P. Muhmenthaler, and M. Goessel

In this paper, a new diagnosis method for scan designs with many scan-paths is proposed, based on error-correcting linear block codes with N information bits and K control bits, where N is the number of scan-paths. The new approach can be implemented on a modified STUMPS architecture. In diagnosis mode the test has to be repeated K times. In the K repetitions of the test, the outputs of the scan-paths are connected to a configurable signature register (with disconnected feedback logic) according to the coefficients of the K syndrome equations of the code. By monitoring the one-dimensional output sequence of the configurable signature register, the failing scan-cells in the different scan-paths can be identified with the resolution of the selected error-correcting code. Since for the relevant codes, e.g. (shortened) Hamming codes and T-error-correcting BCH codes, the ratio K/N decreases very fast with increasing N, the method is useful for a large number of scan-paths.
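
The syndrome idea can be illustrated with a minimal sketch. It assumes a Hamming-style code in which scan-path j is assigned the K-bit column j + 1, at most one failing scan-path per scan-cell position, and a plain XOR of the selected outputs in place of the signature register. The function names and the column assignment are hypothetical and are not taken from the paper.

```python
def syndrome_columns(n_paths):
    """Give scan-path j the K-bit column j + 1 (distinct and nonzero)."""
    k = n_paths.bit_length()  # smallest K with 2**K - 1 >= n_paths
    return k, [j + 1 for j in range(n_paths)]


def diagnose(good, observed):
    """Locate failing scan-cells from K compacted test repetitions.

    good / observed: one list of bits per scan-path, all of equal length.
    Returns {cell_position: failing_path}, assuming at most one failing
    scan-path per cell position (single-error-correcting code).
    """
    n_paths = len(good)
    k, cols = syndrome_columns(n_paths)
    failing = {}
    for pos in range(len(good[0])):
        syndrome = 0
        for rep in range(k):  # one repetition per syndrome equation
            expected = 0
            seen = 0
            for j in range(n_paths):
                if (cols[j] >> rep) & 1:  # path j takes part in equation rep
                    expected ^= good[j][pos]
                    seen ^= observed[j][pos]
            if expected != seen:
                syndrome |= 1 << rep
        if syndrome:
            failing[pos] = syndrome - 1  # column value j + 1 identifies path j
    return failing
```

With five scan-paths this sketch needs only three repetitions, which reflects the slowly growing K/N ratio mentioned in the abstract.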

10E: Novel Approaches to Analogue Simulation

Moderators: T. Kazmierski, Southampton U, UK; S. Yoo, TIMA Laboratory, FR

Hierarchical Multi-Dimensional Table Lookup for Model Compiler Based Circuit Simulation [p. 1310]

B. Wan and C. Shi

In this paper, a systematic method for automatically generating hierarchical multi-dimensional table lookup models for compact device and behavioral models with any number of terminals is presented. The method is based on an Abstract Syntax Tree representation of analytic equations. Expensive parts of the computations represented by abstract syntax trees are identified and replaced by two-dimensional table lookup models. An error-control based optimization algorithm is developed to generate table lookup models with the minimal amount of table data for a given accuracy requirement. The proposed method has been implemented in the model compiler MCAST and the circuit simulator SPICE3. Experimental results show that, compared to non-optimized compilation-based simulation, the simulation using the proposed table lookup optimization method is about 40 times faster and achieves sufficiently accurate results with error less than 1-2%.
Index Terms -- Model Compiler, Syntax-Tree, Hierarchical Multi-dimensional Table Lookup, Optimization, Circuit Simulation.
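
A rough sketch of error-controlled table lookup follows. It assumes a single two-dimensional table with bilinear interpolation on a uniform grid, and it doubles the grid resolution until the relative error at the cell centres meets the tolerance. The function names and the refinement rule are illustrative assumptions and do not describe the MCAST implementation.

```python
import math


def build_table(f, x0, x1, y0, y1, n):
    """Sample f on an (n + 1) x (n + 1) uniform grid."""
    xs = [x0 + (x1 - x0) * i / n for i in range(n + 1)]
    ys = [y0 + (y1 - y0) * j / n for j in range(n + 1)]
    return xs, ys, [[f(x, y) for y in ys] for x in xs]


def lookup(table, x, y):
    """Evaluate the table at (x, y) by bilinear interpolation."""
    xs, ys, vals = table
    n = len(xs) - 1
    hx = (xs[-1] - xs[0]) / n
    hy = (ys[-1] - ys[0]) / n
    i = min(max(int((x - xs[0]) / hx), 0), n - 1)
    j = min(max(int((y - ys[0]) / hy), 0), n - 1)
    tx = (x - xs[i]) / hx
    ty = (y - ys[j]) / hy
    return ((1 - tx) * (1 - ty) * vals[i][j]
            + tx * (1 - ty) * vals[i + 1][j]
            + (1 - tx) * ty * vals[i][j + 1]
            + tx * ty * vals[i + 1][j + 1])


def smallest_table(f, x0, x1, y0, y1, rel_tol, n_max=1024):
    """Double the grid until the cell-centre relative error is within rel_tol."""
    n = 2
    while n <= n_max:
        table = build_table(f, x0, x1, y0, y1, n)
        xs, ys, _ = table
        worst = 0.0
        for i in range(n):
            for j in range(n):
                xm = 0.5 * (xs[i] + xs[i + 1])
                ym = 0.5 * (ys[j] + ys[j + 1])
                exact = f(xm, ym)
                err = abs(lookup(table, xm, ym) - exact) / max(abs(exact), 1e-12)
                worst = max(worst, err)
        if worst <= rel_tol:
            return n, table
        n *= 2
    raise ValueError("tolerance not reached within n_max")


def expensive(x, y):
    """Stand-in for a costly analytic sub-expression (hypothetical)."""
    return math.exp(x) * (1.0 + y * y)
```

Calling `smallest_table(expensive, 0.0, 1.0, 0.0, 1.0, 0.01)` returns the grid size and a table that `lookup` can evaluate in place of the original expression.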

Direct Nonlinear Order Reduction with Variational Analysis [p. 1316]

L. Feng, X. Zeng, C. Chiang, D. Zhou, and Q. Fang

Variational analysis [11] has been employed in [7] for order reduction of weakly nonlinear systems. For a relatively strongly nonlinear system, this method mostly loses efficiency because the number of inputs in the higher order variational equations increases exponentially, a consequence of reducing each variational system individually. Moreover, the inexact inputs into the higher order variational equations inevitably introduce extra errors in the order reduction process. Inspired by the variational analysis, we propose a direct model order reduction method. The order of the approximate polynomial system of the original nonlinear system is directly reduced by one projection space. The proposed direct reduction technique avoids both the errors brought by inexact inputs and the exponentially increased number of inputs. We show theoretically and experimentally that the proposed method can achieve a much more accurate reduced system of smaller order than the conventional variational equation order reduction method.

Steady-State Analysis of Nonlinear Circuits Using Discrete Singular Convolution Method [p. 1322]

X. Zhou, D. Zhou, J. Liu, R. Li, X. Zeng, and C. Chiang

In this paper, we propose a novel time-domain method, the Discrete Singular Convolution algorithm, for computing the steady-state response of nonlinear circuits. Properties and advantages of the Discrete Singular Convolution method are discussed and compared with those of some other approaches. The accuracy and efficiency of the method are tested by numerical experiments.

Hybrid Reduction Technique for Efficient Simulation of Linear/Nonlinear Mixed Circuits [p. 1327]

T. Mine, H. Kubota, A. Kamo, T. Watanabe, and H. Asai

In this paper, we propose a new method which makes transient simulation faster for circuits including both nonlinear and linear elements. First, the method for generating the projection matrix with the Krylov-subspace technique is described. The order of the circuit equation is reduced by congruence transformation with the projection matrix. Next, we suggest a method which can calculate the reduced Jacobian matrix directly in each Newton-Raphson iteration. Since this technique does not need to calculate the Jacobian matrix at its original size, the calculation cost is reduced drastically. Therefore, efficient circuit simulation can be achieved. Finally, our method is applied to some example circuits and the validity of the nonlinear circuit reduction technique is verified.

10F: Embedded Tutorial -- System Verilog for VHDL Users

Organizer/Moderator: H. Schlebusch, Synopsys, DE
Speaker: T. Fitzpatrick, Synopsys, US

System Verilog for VHDL Users [p. 1334]: SystemVerilog was developed to provide an evolutionary path from existing hardware description languages (HDLs) to next-generation design and verification methodologies necessary to support the development of the increasingly complex SoC designs of today and tomorrow. Although its roots are firmly planted in Verilog, many of the features of SystemVerilog were targeted to address capabilities that VHDL users have had for years. This tutorial will provide an overview of SystemVerilog, focusing on those language features that enable the adoption of SystemVerilog by VHDL designers, such as complex and user-defined data types, multi-dimensional arrays, and the concept of strong data type checking. In addition, we will show how VHDL and Verilog users can take advantage of distinct SystemVerilog features to improve their productivity with advanced coding capability and built-in verification.

10G: Hot Topic -- Quo Vadis Multimedia? From Desktop Multimedia to Distributed Multimedia Systems

Organizer/Moderator: P. Eles, Linkoping U, SE
Speakers:
R. Marculescu, Carnegie Mellon U, US
J. Henkel, NEC, US
M. Pedram, Southern California U, US

Distributed Multimedia System Design: A Holistic Perspective [p. 1342]: Multimedia systems play a central part in many human activities. Due to the significant advances in the VLSI technology, there is an increasing demand for portable multimedia appliances capable of handling advanced algorithms required in all forms of communication. Over the years, we have witnessed a steady move from standalone (or desktop) multimedia to deeply distributed multimedia systems. Whereas desktop-based systems are mainly optimized based on the performance constraints, power consumption is the key design constraint for multimedia devices that draw their energy from batteries. The overall goal of successful design is then to find the best mapping of the target multimedia application onto the architectural resources, while satisfying an imposed set of design constraints (e.g. minimum power dissipation, maximum performance) and specified QoS metrics (e.g. end-to-end latency, jitter, loss rate) which directly impact the media quality. This paper addresses a few fundamental issues that make the design process particularly challenging and offers a holistic perspective towards a coherent design methodology.

IP4: Interactive Presentations

Adaptive Prefetching for Multimedia Applications in Embedded Systems [p. 1350]

H. Sbeyti, S. Niar, and L. Eeckhout

This paper presents a new and simple prefetching mechanism to improve the memory performance of multimedia applications. This method adapts the memory access mechanism to the access patterns as observed in the application. By doing so, performance is increased, the available resources are better utilized and energy consumption is reduced. Using our prefetch method, we are able to get up to 5.5% IPC improvement, more than 50% cache miss reduction, and up to 4.5% energy reduction. Our mechanism results in better performance for a 2KB data cache than is achievable with an 8KB data cache (without prefetching) for StrongARM SA1110 and XScale-like processor configurations. This mechanism requires limited hardware resources and generates few additional external bus transfers. This makes adaptive prefetching well suited for embedded microprocessor systems.

Data Windows: A Data-Centric Approach for Query Execution in Memory-Resident Databases [p. 1352]

J. Pisharath, A. Choudhary, and M. Kandemir

Structured embedded databases are becoming an integral part of embedded systems, thus enabling higher standards in system automation. These embedded databases are typically memory resident. In this paper, we present a data-centric approach called data windowing that optimizes multiple queries issued to an embedded database. Traditional approaches improve the performance by optimizing the control flow of operations, whereas we target performance improvements based on the data that is brought into the system.

High-Performance QuIDD-Based Simulation of Quantum Circuits [p. 1354]

G. Viamontes, I. Markov, and J. Hayes

Simulating quantum computation on a classical computer is a difficult problem. The matrices representing quantum gates and the vectors modeling qubit states grow exponentially with the number of qubits. It has been shown experimentally that the QuIDD (Quantum Information Decision Diagram) data structure greatly facilitates simulations using memory and runtime that are polynomial in the number of qubits. In this paper, we present a complexity analysis which formally describes this class of matrices and vectors. We also present an improved implementation of QuIDDs which can simulate Grover's algorithm for quantum search with the asymptotic runtime complexity of an ideal quantum computer up to negligible overhead.

An Application of Parallel Discrete Event Simulation Algorithms to Mixed Domain System Simulation [p. 1356]

D. Reed, S. Levitan, J. Boles, J. Martinez, and D. Chiarulli

We present our system-level co-simulation environment for mixed domain microsystems. The environment provides synchronization and co-simulation between the Chatoyant MOEMS (Micro-Opto-Electro-Mechanical Systems) simulator and ModelTech ModelSim. By using shared memory IPC (Inter-Process Communication) and PDES (Parallel Discrete Event Simulation) techniques, we achieve two orders of magnitude speedup over standard pipe/socket communication.

Fault Tolerance of Programmable Switch Blocks [p. 1358]

J. Huang, M. Tahoori, and F. Lombardi

This paper presents a new approach for the evaluation of FPGA routing resources in the presence of faulty switches. This is considered under the worst case scenario of open faults. Signal routing in the presence of faulty switches is analyzed at the switch block level; probabilistic routing (routability) is used as the figure of merit for evaluating the interconnect resources of FPGAs. The presented approach utilizes a path-based technique to find the probability of establishing a path between pairs of input and output endpoints in a switch block. The results are reported for various commercial and academic FPGAs.
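
One way to picture a path-based routability figure is the sketch below. It assumes that each switch fails open independently with the same probability and that every candidate path between an input and an output endpoint is given as a set of switch identifiers. The probability that at least one path is fault-free then follows from inclusion-exclusion over the paths. The function name and the fault model are assumptions for illustration, not the authors' formulation.

```python
from itertools import combinations


def routability(paths, p_open):
    """Probability that at least one candidate path has no open switch.

    paths: list of sets of switch identifiers, one set per candidate path.
    p_open: probability that any single switch is stuck open (independent).
    """
    q = 1.0 - p_open  # probability that a single switch works
    total = 0.0
    for r in range(1, len(paths) + 1):
        for subset in combinations(paths, r):
            # All paths in the subset work iff every switch they use works.
            switches = frozenset().union(*subset)
            total += (-1) ** (r + 1) * q ** len(switches)
    return total
```

For two paths that share a switch, for example `routability([{"a", "b"}, {"a", "c"}], 0.1)`, the shared switch is counted once, so the result is lower than for two fully disjoint paths of the same length.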

A New Self-Checking Sum-Bit Duplicated Carry-Select Adder [p. 1360]

E. Sogomonyan, D. Marienfeld, V. Ocheretnij, and M. Gössel

In this paper the first code-disjoint totally self-checking carry-select adder is proposed. The adder blocks are fast ripple adders with a single NAND-gate delay for carry-propagation per cell. In every adder block both the sum-bits and the corresponding inverted sum-bits are simultaneously implemented. The parity of the input operands is checked against the XOR-sum of the propagate signals. For 64 bits, area and maximal delay are determined by the SYNOPSYS CAD tool of the EUROCHIP project. Compared to a 64-bit carry-select adder without error detection, the delay of the most significant sum-bit does not increase. The area is 170% of that of a 64-bit carry-select adder (without error detection and not code-disjoint).

A Macromodelling Methodology for Efficient High-Level Simulation of Substrate Noise Generation [p. 1362]

L. Elvira, F. Martorell, X. Aragonés, and J. González

Efficient prediction of the substrate noise generated by digital sections is currently a major challenge in System-on-a-Chip design. In this paper a macromodel to accurately and efficiently predict the substrate noise generated by digital standard cells is presented. The macromodel accuracy is demonstrated for some simple circuits.

Accurate Estimation of Parasitic Capacitances in Analog Circuits [p. 1364]

A. Agarwal, H. Sampath, V. Yelamanchili, and R. Vemuri

This paper presents efficient and accurate techniques for modeling parasitic capacitances in analog CMOS circuits. A layout-aware synthesis flow using these parasitic models is proposed. The fast parasitic estimation process replaces the time-consuming steps of layout generation and extraction during synthesis. Results indicate that these models are extremely fast and accurate.

GRAAL -- A Development Framework for Embedded Graphics Accelerators [p. 1366]

D. Crisu, S. Cotofana, S. Vassiliadis, and P. Liuha

This paper presents a versatile hardware/software cosimulation and co-design environment for embedded 3D graphics accelerators. The GRAphics AcceLerator design exploration framework (GRAAL) is an open system which offers a coherent development methodology based on an extensive library of SystemC RTL models of graphics pipeline components. GRAAL incorporates tools to assist in the visual debugging of the graphics algorithms implemented in hardware, and to estimate the performance in terms of throughput, power consumption, and area.

From Synchronous to Asynchronous: An Automatic Approach [p. 1368]

J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, and C. Sotiriou

This paper presents a methodology to derive asynchronous circuits from optimized synchronous circuits by replacing the clock distribution tree by a handshaking network. A case study shows the applicability of the method and the potential benefits of de-synchronizing synchronous circuits.

Enhancing Testability of System on Chips Using Network Management Protocols [p. 1370]

O. Laouamri and C. Aktouf

This paper shows how to adapt the P1500 Design-For-Test standard through network management protocols to make the testing of System-on-Chips (SoCs) easier and more cost-effective. For this purpose, a SoC is analyzed as a distributed system whose basic components, or IP (Intellectual Property) cores, are considered as network agents according to SNMP (Simple Network Management Protocol). An experimental study was carried out to show the effectiveness of such an approach.

Minimization of Crosstalk Noise, Delay and Power Using a Modified Bus Invert Technique [p. 1372]

M. Lampropoulos, B. Al-Hashimi, and P. Rosinger

Previously reported bus encoding approaches reduce crosstalk delay but they ignore the effects of inductive coupling between the bus lines, i.e. crosstalk noise. Aiming to solve this issue, this paper presents a modified bus-invert technique which minimizes crosstalk noise, as well as delay and power, at the expense of a small area overhead.
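
For background, the classic (unmodified) bus-invert code that the paper builds on can be sketched as follows: a word is sent inverted, with an extra invert line set, whenever more than half of the bus lines would otherwise toggle. The sketch does not include the paper's modification for crosstalk noise; the function names and the all-zero initial bus state are assumptions.

```python
def bus_invert_encode(words, width):
    """Classic bus-invert coding: returns a list of (bus_value, invert_bit)."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial bus state
    encoded = []
    for word in words:
        toggles = bin((word ^ prev) & mask).count("1")
        if 2 * toggles > width:  # more than half of the lines would switch
            sent, invert = (~word) & mask, 1
        else:
            sent, invert = word & mask, 0
        encoded.append((sent, invert))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(~sent) & mask if invert else sent for sent, invert in encoded]
```

On an 8-bit bus, sending 0xFF right after 0x00 would toggle all eight lines; the encoder instead keeps the data lines at 0x00 and raises the invert line.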

Energy-Efficient Design for Highly Associative Instruction Caches in Next-Generation Embedded Processors [p. 1374]

J. Aragon, D. Nicolaescu, A. Veidenbaum, and A. Badulescu

This paper proposes a low-energy solution for CAM-based highly associative I-caches using a segmented wordline and a predictor-based instruction fetch mechanism. Not all instructions in a given I-cache fetch are used due to branches. The proposed predictor determines which instructions in a cache access will be used and does not fetch any other instructions. Results show an average I-cache energy savings of 44% over the baseline case and 6% over the segmented case with no negative impact on performance.

Dynamic Voltage and Cache Reconfiguration for Low Power [p. 1376]

A. Nacul and A. Givargis

In this work, we propose a combined Dynamic Voltage Scaling (DVS) and Dynamic Cache Reconfiguration (DCR) online algorithm that dynamically adapts the processor speed (i.e., voltage) and the cache subsystem to the workload requirements for the purpose of saving energy. The workload is considered to be a set of tasks with real-time deadlines. Our online algorithm is invoked as part of the OS scheduler, which first performs standard earliest deadline first (EDF) task scheduling. Then, our online algorithm determines an ideal voltage/cache configuration for the currently executing task.

IP5: Interactive Presentations

Overhead-free Polymorphism in Network-on-Chip Implementation of Object-Oriented Models [p. 1380]

M. Goudarzi, S. Hessabi, and A. Mycroft

We unify virtual-method dispatch (polymorphism implementation) and network packet-routing operations; virtual-method calls correspond to network packets, and network addresses are allocated such that routing the packet corresponds to dispatching the call. As the run-time routing structure is inherent in Network-on-Chip platforms, this unification implements polymorphism for free.

Multi-Processor SoC Design Methodology Using a Concept of Two-Layer Hardware-Dependent Software [p. 1382]

S. Yoo, M. Youssef, A. Bouchhima, A. Jerraya, and M. Diaz-Nava

In conventional multiprocessor SoC (MPSoC) design methods, we find two problems: lack of SW code portability and lack of early SW validation. These problems cause a long design cycle. To resolve them, we present a concept of two-layer hardware-dependent software (HdS). The presented HdS consists of a hardware abstraction layer to abstract the subsystem architecture and a SoC abstraction layer to abstract the global MPSoC architecture. During the exploration of global and sub-system architectures, the application programming interfaces of the presented two-layer HdS keep the SW independent of architectural changes. The simulation models of the two-layer HdS enable validation of the entire system, including the SW and HW design, early in the design steps. We show the effectiveness of the presented methodology in the MPSoC architecture exploration of an OpenDiVX encoder system design.

Synthesis of Reversible Logic [p. 1384]

A. Agrawal and N. Jha

A function is reversible if each input vector produces a unique output vector. Reversible functions find applications in low power design, quantum computing, and nanotechnology. Logic synthesis for reversible circuits differs substantially from traditional logic synthesis. In this paper, we present the first practical synthesis algorithm and tool for reversible functions with a large number of inputs. It uses positive-polarity Reed-Muller decomposition at each stage to synthesize the function as a network of Toffoli gates. The heuristic uses a priority queue based search tree and explores candidate factors at each stage in order of attractiveness. The algorithm produces near-optimal results for the examples discussed in the literature. The key contribution of the work is that the heuristic finds very good solutions for reversible functions with a large number of inputs.
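
The definition of reversibility and the behaviour of a Toffoli network can be checked with a short sketch. It represents a function as a truth table over integers and a network as a list of (control bits, target bit) pairs; this representation and the function names are assumptions for illustration and are unrelated to the synthesis heuristic itself.

```python
def is_reversible(truth_table):
    """A function on n bits is reversible iff its truth table is a permutation."""
    return sorted(truth_table) == list(range(len(truth_table)))


def toffoli(state, controls, target):
    """Flip bit `target` of `state` when all bits listed in `controls` are 1."""
    if all((state >> c) & 1 for c in controls):
        state ^= 1 << target
    return state


def network_truth_table(gates, n_bits):
    """Truth table of a cascade of generalized Toffoli gates."""
    table = []
    for x in range(1 << n_bits):
        state = x
        for controls, target in gates:
            state = toffoli(state, controls, target)
        table.append(state)
    return table
```

An empty control list gives a NOT gate and a single control gives a CNOT gate, so any cascade built this way is reversible by construction.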

A Unified Design Space for Regular Parallel Prefix Adders [p. 1386]

M. Ziegler and M. Stan

We consider sparsity, fanout, and radix as three dimensions in the design space of regular parallel prefix adders and present a unified formalism to describe such structures.
Keywords: parallel prefix adder, Kogge-Stone adder, Han-Carlson adder, Brent-Kung adder.

MODD: A New Decision Diagram and Representation for Multiple Output Binary Functions [p. 1388]

A. Jabir and D. Pradhan

This paper presents a new decision diagram (DD), called MODD, for multiple output binary and multiple-valued functions. This DD is canonic and can be made minimal with respect to a given variable order. Unlike other reported DDs, our approach can represent arbitrary combinations of bits at the word level. The preliminary results show that our representation can result in considerable memory savings.

Issues in Implementing Latency Insensitive Protocols [p. 1390]

M. Casu and L. Macchiarulo

Model-Based Specification and Execution of Embedded Real-Time Systems [p. 1392]

T. Schattkowsky and W. Mueller

A Demonstration of Co-Design and Co-Verification in a Synchronous Language [p. 1394]

S. Singh

Profile Guided Management of Code Partitions for Embedded Systems [p. 1396]

S. Zhou, B. Childers, and N. Kumar

Researchers have proposed to divide embedded applications into code partitions and to download partitions on demand from a wireless code server to enable a diverse set of applications for very tightly constrained embedded systems. This paper describes a new approach for managing the request and storage of code partitions and explores the benefits of the scheme.

IP6: Interactive Presentations

Realizable Reduction for Electromagnetically Coupled RLMC Interconnects [p. 1400]

R. Jiang and C. Chen

This paper presents a realizable RLMC reduction algorithm for extracted interconnect circuits based on two effective approaches: RL branch reduction and RC/LC node reduction. Our algorithm takes advantage of some structures existing extensively in interconnect circuits and hence has extremely fast execution time. It takes about 8 seconds to reduce a circuit of over 300,000 elements while maintaining 3% error and a 75% element reduction ratio.

Statistically Aware Buffer Planning [p. 1402]

G. Garcea, N. van der Meijs, and R. Otten

In this paper, we develop an analytic approach to estimate the statistical properties (mean and variance) of the performance of a uniformly buffered global IC interconnect, based on the mean and (co)variance of the appropriate design and technology parameters. Compared to other approaches, such as Monte Carlo based approaches, our analytic approach allows a much tighter design optimization loop and provides better insight into the factors involved. The model that we use is generic, but in this paper we assume a set of synthetic (not based on actual process data) but realistically large values for the variability of the input parameters. Under these assumptions, it follows that solutions for the area/power/performance tradeoff that are optimal in a deterministic setting might suffer from excessive variability, potentially leading to a yield problem.

A Tunneling Model for Gate Oxide Failure in Deep Sub-Micron Technology [p. 1404]

S. Bernadini, J. Portal, and P. Masson

Parametric failures in CMOS nanoelectronic ICs lead to a serious detection problem. In order to develop new defect-oriented test methods, it is of prime importance to study the behavior of transistors affected by this kind of failure. In this paper, we present a new electrical transistor model which makes it possible to study the impact of a drop in gate oxide thickness. It is shown that the electrical behavior of the proposed model matches the defective transistor behavior in a satisfactory way.

Power Supply Noise Monitor for Signal Integrity Faults [p. 1406]

J. Vázquez and J. de Gyvez

We propose a monitor able to detect on-line excessive Power Supply Noise (PSN) at the power/ground lines. It has high resolution (100 ps), enough to collect the important features of PSN and its output is isolated from the local PSN. It is useful for any scheme that takes corrective actions to prevent signal integrity faults after detection of excessive PSN.

Testing of Quantum Dot Cellular Automata Based Designs [p. 1408]

M. Tahoori and F. Lombardi

There has been considerable research on quantum-dot cellular automata as a new computing scheme in the nano-scale regime. The basic logic element of this technology is a majority voter. In this paper, testing of these devices is investigated and compared with conventional CMOS-based designs. A testing technique is presented; it requires only a constant number of test vectors to achieve 100% fault coverage with respect to the fault list of the original design. A design-for-test scheme is also presented which results in the generation of a reduced test set.
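
The majority voter mentioned above can be modelled in a few lines. The sketch shows the well-known property that fixing one input of a three-input majority gate to 0 or 1 yields an AND or an OR gate; the function names are chosen for illustration only.

```python
def majority(a, b, c):
    """Three-input majority voter: output is 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def and_gate(a, b):
    """AND realized by tying one majority input to 0."""
    return majority(a, b, 0)


def or_gate(a, b):
    """OR realized by tying one majority input to 1."""
    return majority(a, b, 1)
```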

Net and Pin Distribution for 3D Package Global Routing [p. 1410]

J. Minz, M. Pathak, and S. Lim

In this paper, we study the net and pin distribution problem for global routing targeting three-dimensional packaging layout via System-on-Package (SOP). The routing environment for the emerging mixed-signal SOP technology is more advanced than that of the conventional PCB or MCM technology -- pins are located at all layers of the SOP packaging substrate rather than the top-most layer only. This is the first work to formulate and solve the multi-layer net and pin distribution for layer, wirelength, and crosstalk minimization.

Placement Using a Localization Probability Model (LPM) [p. 1412]

M. Olbrich and E. Barke

We propose a new placement model for global placement. This model uses probabilities to localize the cells. It enables arbitrary levels of placement abstraction. Wirelength estimations at any level can be derived from the model. We present a new placer that uses a special variant of the proposed model. Examples show that the model properties improve placement quality.

CMOS Structures Suitable for Secured Hardware [p. 1414]

S. Guilley, P. Hoogvorst, Y. Mathieu, R. Pacalet, and J. Provost

Unsecured electronic circuits leak physical syndromes correlated to the data they handle. Side-channel attacks, like SPA or DPA, exploit this information leakage. We provide balanced and memoryless CMOS structures for a 2-input secured NAND gate.

Timing Correction and Optimization with Adaptive Delay Sequential Elements [p. 1416]

K. Rahimi, S. Bridges, and C. Diorio

This paper introduces Adaptive Delay Sequential Elements (ADSEs). ADSEs are registers that use nonvolatile, floating-gate transistors to tune their internal clock delays. We propose ADSEs for correcting timing violations and optimizing circuit performance. We present an ADSE circuit example, system architecture, and tuning methodology. We present experimental results that demonstrate the correct operation of our example circuit and discuss the die-area impact of using ADSEs. Our experiments also show that the voltage and temperature sensitivity of ADSEs is comparable to that of non-adaptive flip-flops.