DATE 2007 ABSTRACTS


Cover Page
DATE Executive Committee
DATE Sponsor Committee
Technical Program Chairs
Technical Program Committee
Reviewers
Foreword
Best Paper Awards
Tutorials
Ph.D. Forum
Call for Papers: DATE 2008


Keynote Addresses

Challenges of Digital Consumer and Mobile SOC's: More Moore Possible? [p. 1]
T. Furuyama

Digital consumer and mobile products have continuously accommodated more features and functions. For example, recent high-end cellular phones can operate as terrestrial digital TV viewers, MP3 music players, digital cameras, substitutes for credit cards and much more, in addition to serving as multi-modal wireless communication terminals that handle various formats: GSM, 3G, BT, WiFi and so on. These products require the best combination of highly integrated SoCs and sophisticated software stacks, delivered in a timely manner. It is essential to establish a hardware/software co-development and co-verification environment with ESL design methodologies and an IP reuse platform, where various functions are realised on an SoC by legacy sub-systems with a low-power multi-processor architecture. This challenge becomes more complicated in deep sub-100 nm technology nodes. Approaches to these complex problems from different aspects will be presented.

Was Darwin Wrong? Has Design Evolution Stopped at the RTL Level...or Will Software and Custom Processors (or System-Level Design) Extend Moore's Law? [p. 2]
A. Naumann

The challenges of electronic design are escalating as software and embedded processors are fast becoming a more dominant component of electronic products. Software is now acknowledged as the most effective way for electronics companies to differentiate their products. But what if the processors running the software aren't up to the task? Electronics companies are increasingly adopting a new system-level design methodology to stay competitive, one that enables design that is centred on custom processors and software. The ripple effects of system-level design are even affecting the way that semiconductor companies take products to market and how their customers choose and use silicon.


1.2: Design Records

Moderators: G. De Micheli, EPF Lausanne, CH, P. van der Wolf, NXP Semiconductors Research, NL
ATLAS: A Chip-Multiprocessor with Transactional Memory Support [p. 3]
N. Njoroge, J. Casper, S. Wee, Y. Teslyar, D. Ge, C. Kozyrakis and K. Olukotun

Chip-multiprocessors are quickly becoming popular in embedded systems. However, the practical success of CMPs strongly depends on addressing the difficulty of multithreaded application development for such systems. Transactional Memory (TM) promises to simplify concurrency management in multithreaded applications by allowing programmers to specify coarse-grain parallel tasks, while achieving performance comparable to fine-grain lock-based applications. This paper presents ATLAS, the first prototype of a CMP with hardware support for transactional memory. ATLAS includes 8 embedded PowerPC cores that access coherent shared memory in a transactional manner. The data cache for each core is modified to support the speculative buffering and conflict detection necessary for transactional execution. We have mapped ATLAS to the BEE2 multi-FPGA board to create a full-system prototype that operates at 100MHz, boots Linux, and provides significant performance and ease-of-use benefits for a range of parallel applications. Overall, the ATLAS prototype provides an excellent framework for further research on the software and hardware techniques necessary to deliver on the potential of transactional memory.
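
To make the programming model concrete, the sketch below is a minimal, purely software model of optimistic transactional execution: buffered writes, read-set validation at commit, retry on conflict. It only illustrates the semantics that ATLAS provides in hardware via its modified data caches; the API, the global commit lock and all names are hypothetical simplifications, with Python used solely for illustration.

    import threading

    _commit_lock = threading.Lock()   # simplification: serialize commits

    class TVar:
        """A transactional variable with a version counter."""
        def __init__(self, value):
            self.value, self.version = value, 0

    class Transaction:
        def __init__(self):
            self.reads, self.writes = {}, {}   # tvar -> version / new value

        def read(self, tv):
            if tv in self.writes:              # read-your-own-writes
                return self.writes[tv]
            self.reads.setdefault(tv, tv.version)   # record version seen
            return tv.value

        def write(self, tv, value):            # speculative buffering
            self.writes[tv] = value

        def commit(self):
            with _commit_lock:
                # conflict detection: abort if anything we read has changed
                if any(tv.version != v for tv, v in self.reads.items()):
                    return False
                for tv, value in self.writes.items():   # write-back
                    tv.value, tv.version = value, tv.version + 1
                return True

    def atomically(fn):
        while True:                            # retry on conflict
            tx = Transaction()
            result = fn(tx)
            if tx.commit():
                return result

    # usage: a coarse-grain atomic transfer, with no explicit locks
    a, b = TVar(100), TVar(0)
    atomically(lambda tx: (tx.write(a, tx.read(a) - 10),
                           tx.write(b, tx.read(b) + 10)))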

A Dynamically Adaptive DSP for Heterogeneous Reconfigurable Platforms [p. 9]
F. Campi, A. Deledda, M. Pizzotti, L. Ciccarelli, P. Rolandi, C. Mucci, A. Lodi, A. Vitkovski and L. Vanzolini

This paper describes a digital signal processor based on a multi-context, dynamically reconfigurable datapath, suitable for inclusion as an IP block in complex SoC design projects. The IP was realized in 90 nm CMOS technology. The most relevant features offered by the proposed architecture with respect to the state of the art are zero overhead for switching between successive configurations, high area and energy computational density on computational kernels (an average of 2 GOPS/mm2 and 0.2 GOPS/mW) and relatively small area occupation (18 mm2), making it suitable for the acceleration or upgrade of multi-core heterogeneous embedded platforms. The processor is delivered with a software tool chain that provides the application developer with algorithmic analysis and design space exploration based on ANSI C, with no use of hardware-related constructs or description languages.

An 0.9 X 1.2", Low Power, Energy-Harvesting System with Custom Multi-Channel Communication Interface [p. 15]
P. Stanley-Marbell and D. Marculescu

Presented is a self-powered computing system, Sunflower, that uses a novel combination of a PIN photodiode array, switching regulators, and a supercapacitor to provide a small-footprint renewable energy source. The design provides software-controlled power-adaptation facilities for both the main processor and its peripherals. The system's power consumption is characterized, and its energy-scavenging efficiency is quantified with field measurements under a variety of weather conditions.


Interactive Presentation

An FPGA Based All-Digital Transmitter with Radio Frequency Output for Software Defined Radio [p. 21]
Z. Ye, J. Grospietsch and G. Memik

In this paper, we present the architecture and implementation of an all-digital transmitter with radio frequency output targeting an FPGA device. FPGA devices have been widely adopted in digital signal processing (DSP) and digital communication applications. They are typically well suited for the evolving technology of software defined radios (SDR) due to their reconfigurability and programmability. However, FPGA devices are mostly used to implement digital baseband and intermediate frequency (IF) functionalities; significant analog and RF components are still needed to fulfill the radio communication requirements. The all-digital transmitter presented in this paper directly synthesizes the RF signal in the digital domain, thereby eliminating the need for most of the analog and RF components. The all-digital transmitter consists of one QAM modulator and one RF pulse width modulator (RFPWM). The binary output waveform from the RFPWM is centered at 800 MHz with a 64QAM signaling format. The entire transmitter is implemented on a Xilinx Virtex-II Pro device with on-chip multi-gigabit transceivers (MGT). The adjacent channel leakage ratio (ACLR) measured in the 20 MHz passband is 45 dB, and the measured error vector magnitude (EVM) is less than 1%. Our work extends the digital implementation of communication applications on an FPGA platform to radio frequency, a significant step towards an ideal SDR.


1.3: Design for Testability for SoCs

Moderators: S. Kundu, Massachusetts U, US, H.-J. Wunderlich, Stuttgart U, DE
A Non-Intrusive Isolation Approach for Soft Cores [p. 27]
O. Sinanoglu and T. Petrov

Cost-effective SOC test strongly hinges on parallel, independent test of SOC cores, which can only be ensured through proper core isolation techniques. While a core isolation mechanism can provide controllability and observability at the core I/O interface, its implementation may have various implications for area, functional timing, test time and data volume, and at-speed coverage on the core interface. In this paper, we propose a non-intrusive core isolation technique based on the utilization of existing core registers for isolating the core. We provide a core register partitioning algorithm that is capable of identifying the core interface registers and of robustly isolating a core, resulting in a computationally efficient core isolation implementation that is area- and performance-efficient at the same time. The proposed isolation technique also ensures minimal test time increase and no at-speed coverage loss on the core interface, offering an elegant solution for soft cores and thus enabling significant SOC test cost reductions.

Unknown Blocking Scheme for Low Control Data Volume and High Observability [p. 33]
S. Wang, W. Wei and S.T. Chakradhar

This paper presents a new blocking logic to block unknowns for temporal compactors. The proposed blocking logic can reduce the data volume required to control the blocking logic and also increase the number of scan cells that are observed by the temporal compactors. Control patterns, which describe the values required at the control signals of the blocking logic, are compressed by LFSR reseeding. In this paper, the blocking logic gates for groups of scan chains that do not capture unknowns are bypassed. Since all the scan cells in these scan chain groups are observed without specifying the corresponding bits in control patterns, fewer specified bits are required and more scan cells are observed. The seed size is further reduced by reducing the number of specified bits in the densely specified control patterns. The proposed method always achieves the same fault coverage as direct observation of scan chains. Experiments with large industrial designs clearly demonstrate that the proposed method is scalable to large circuits. Hardware overhead for the proposed blocking logic is very low.

Test Cost Reduction for SoC Using a Combined Approach to Test Data Compression and Test Scheduling [p. 39]
Q. Zhou and K.J. Balakrishnan

A combined approach implementing system-level test compression and core test scheduling to reduce SoC test costs is proposed in this paper. A broadcast-scan-based test compression algorithm for parallel testing of cores with multiple scan chains is used to reduce the test data of the SoC. Unlike other test compression schemes, the proposed algorithm does not require specialized test generation or fault simulation and is applicable to intellectual property (IP) cores. The core testing schedule with compression enabled is decided using a generalized strip packing algorithm. The hardware architecture to implement the proposed scheme is very simple. By using the combined approach, the total test data volume and test application time of the SoC are reduced to a level comparable with those of the largest core in the SoC.

High-Level Test Synthesis for Delay Fault Testability [p. 45]
S.-J. Wang and T.-H. Yeh

A high-level test synthesis (HLTS) method targeted at delay fault testability is presented. The proposed method, when combined with hierarchical test pattern generation for embedded modules, guarantees 100% delay test coverage for detectable faults in modules. A study of the delay testability problem at the behavior level shows that low delay fault coverage is usually attributable to the fact that the two-pattern tests needed for delay testing cannot be delivered to modules under test in consecutive cycles. To solve this problem, we propose an HLTS method that ensures valid test pairs can be sent to each module through the synthesized circuit hierarchy. Experimental results show that this method achieves 100% fault coverage for transition faults in functional units, whereas the fault coverage in circuits synthesized by an LEA-based allocation algorithm is rather poor. The area overhead due to this method ranges from 2% to 10% for 16-bit datapaths.


1.4: Communication Synthesis under Timing Constraints

Moderators: J. Teich, Erlangen-Nuremberg U, DE, M. Heijligers, NXP IC-Lab, NL
Bus Access Optimisation for FlexRay-based Distributed Embedded Systems [p. 51]
T. Pop, P. Pop, P. Eles and Z. Peng

FlexRay will very likely become the de-facto standard for in-vehicle communications. Its main advantage is the combination of high-speed static and dynamic transmission of messages. In our previous work we have shown that not only the static but also the dynamic segment can be used for hard real-time communication in a deterministic manner. In this paper, we propose techniques for optimising the FlexRay bus access mechanism of a distributed system, so that the hard real-time deadlines are met for all the tasks and messages in the system. We have evaluated the proposed techniques using extensive experiments.

A Decomposition-based Constraint Optimization Approach for Statically Scheduling Task Graphs with Communication Delays to Multiprocessors [p. 57]
N. Satish, K. Ravindran and K. Keutzer

We present a decomposition strategy to speed up constraint optimization for a representative multiprocessor scheduling problem. In the manner of Benders decomposition, our technique solves relaxed versions of the problem and iteratively learns constraints to prune the solution space. Typical formulations suffer prohibitive run times even on medium-sized problems with less than 30 tasks. Our decomposition strategy enhances constraint optimization to robustly handle instances with over 100 tasks. Moreover, the extensibility of constraint formulations permits realistic application and resource constraints, which is a limitation of common heuristic methods for scheduling. The inherent extensibility, coupled with improved run times from a decomposition strategy, posit constraint optimization as a powerful tool for resource constrained scheduling and multiprocessor design space exploration.

Design Closure Driven Delay Relaxation Based on Convex Cost Network Flow [p. 63]
C. Lin, A. Xie and H. Zhou

Design closure becomes hard to achieve at the physical layout stage due to the emergence of long global interconnects; consequently, interconnect planning needs to be integrated into high-level synthesis. Delay relaxation, which assigns extra clock latencies to functional resources at the RTL (Register Transfer Level), can be leveraged for this purpose. In this paper we propose a general formulation of the design closure driven delay relaxation problem. We show that the general formulation can be transformed into a convex cost integer dual network flow problem and solved in polynomial time using the convex cost-scaling algorithm in [1]. Experimental results validate the efficiency of the approach.


1.5: Performance Modelling and Synthesis of Analogue/Mixed-Signal Circuits

Moderators: F. V. Fernandez, IMSE, CSIC and Seville U, ES, L. Hedrich, Frankfurt/M U, DE
Simulation-based Reusable Posynomial Models for MOS Transistor Parameters [p. 69]
V. Aggarwal and U.-M. O'Reilly

We present an algorithm to automatically design posynomial models for the parameters of MOS transistors using simulation data. These models improve the accuracy of the Geometric Programming flow for automatic circuit sizing. The models are reusable across multiple circuits in a given silicon technology and hence do not adversely affect the scalability of the Geometric Programming approach. The proposed method is a combination of genetic algorithms and Quadratic Programming. It is the only approach to posynomial modeling with real-valued exponents that is easily extensible to different error metrics. We compare the proposed technique with state-of-the-art posynomial/monomial modeling techniques and show its superiority.
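
For intuition, the special case of fitting a single monomial term y = c * x1^a1 * x2^a2 * ... is a linear least-squares problem in the log domain, as the minimal sketch below shows. The data and names are invented; the paper's method fits full posynomials (sums of such terms, with real-valued exponents) using genetic algorithms and quadratic programming, not plain least squares.

    import numpy as np

    def fit_monomial(X, y):
        """X: (n_samples, n_params) positive data; y: positive targets.
        Model: y = c * prod_j X[:, j]**a[j]; linear in log space."""
        A = np.hstack([np.ones((len(y), 1)), np.log(X)])
        coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
        c, exponents = np.exp(coef[0]), coef[1:]
        return c, exponents

    # toy data: a "gm"-like quantity ~ 2e-4 * (W/L)^0.5 with small noise
    rng = np.random.default_rng(0)
    WL = rng.uniform(1, 50, size=(200, 1))
    gm = 2e-4 * WL[:, 0] ** 0.5 * rng.lognormal(0, 0.01, 200)
    c, a = fit_monomial(WL, gm)
    print(c, a)   # ~2e-4, ~[0.5]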

Trade-Off Design of Analog Circuits Using Goal Attainment and "Wave Front" Sequential Quadratic Programming [p. 75]
D. Mueller, H. Graeb and U. Schlichtmann

One of the main tasks in analog design is the sizing of the circuit parameters, such as transistor lengths and widths, in order to obtain optimal circuit performances, such as high gain or low power consumption. In most cases one performance can only be optimized at the cost of others; a sizing must therefore aim at an optimal trade-off between the important circuit performances. In this paper we present a new deterministic method to calculate the complete range of performance trade-offs, the so-called Pareto-optimal front, of a given circuit topology. Known deterministic methods solve a set of constrained multi-objective optimization problems independently of each other. The presented method minimizes a set of Goal Attainment (GA) optimization problems simultaneously. In a parallel algorithm, the individual GA optimization processes compare and exchange their iterative solutions. This leads to a significant improvement in the efficiency and quality of analog trade-off design.

An Efficient Methodology for Hierarchical Synthesis of Mixed-Signal Systems with Fully Integrated Building Block Topology Selection [p. 81]
T. Eeckelaert, R. Schoofs, G. Gielen, M. Steyaert and W. Sansen

A hierarchical synthesis methodology for analog and mixed-signal systems is presented that, in a novel way, fully integrates topology selection at all levels. A hierarchical system optimizer takes multiple topologies for all the building blocks at each hierarchical abstraction level and generates optimal topology combinations using multi-objective evolutionary optimization techniques. With the presented methodology, system-level performance trade-offs can be generated where each design point contains valuable information on how the system's performances are influenced by different combinations of lower-level building block topologies. The generated system designs can contain all kinds of topology combinations as long as critical inter-block constraints are met. Different topologies can be assigned to building blocks with the same functional behavior, leading to hybrid designs that are more optimal than those typically obtained manually. In the experimental results, three different integrator topologies are used to generate an optimal system-level exploration trade-off for a complex high-speed ΔΣ A/D modulator.


Interactive Presentation

A Coefficient Optimization and Architecture Selection Tool for ΣΔ Modulators in MATLAB [p. 87]
O. Yetik, O. Saglamdemir, S. Talay and G. Dündar

A tool created in the MATLAB environment for automatic transfer function generation and topology synthesis of sigma-delta (ΣΔ) modulators with a desired frequency response is proposed in this work. The tool carries out two basic tasks: (1) transfer function generation, which works in a SPICE-like fashion, taking the netlist of an arbitrary ΣΔ modulator architecture at the block level as input, determining the input-output relation of each block in the z-domain, and generating the signal and noise transfer functions (STF and NTF) of the system automatically; (2) topology synthesis, which uses the STF and NTF as inputs and finds all the possible ΣΔ modulator topologies (according to criteria such as minimization of the number of signal paths) that can be obtained from the architecture and realize the desired frequency response. The application of the tool is illustrated with examples.
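
As a concrete instance of the kind of result such a tool derives, the textbook first-order ΣΔ modulator (a standard example, not taken from the paper), with the quantizer modeled as an additive noise source E(z), has

    Y(z) = z^{-1} X(z) + (1 - z^{-1}) E(z),

i.e. STF(z) = z^{-1} and NTF(z) = 1 - z^{-1}: the signal is merely delayed while the quantization noise is first-order high-pass shaped. The tool automates this kind of derivation for arbitrary block-level netlists and then searches for topologies realising a prescribed STF/NTF pair.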


1.6: System Level Mapping and Simulation

Moderators: P. van der Wolf, NXP Semiconductors Research, NL, L. Thiele, ETH Zurich, CH
Synthesis of Task and Message Activation Models in Real-Time Distributed Automotive Systems [p. 93]
W. Zheng, M. Di Natale, C. Pinello, P. Giusto and A. Sangiovanni-Vincentelli

Modern automotive architectures support the execution of distributed safety- and time-critical functions on a complex networked system with several buses and tens of ECUs. Schedulability theory allows the analysis of worst-case end-to-end latencies and the evaluation of possible architecture configuration options with respect to timing constraints. We present an optimization framework, based on an ILP formulation of the problem, to select the communication and synchronization model that leverages the trade-offs between the purely periodic and the precedence-constrained data-driven activation models to meet the latency and jitter requirements of the application. We demonstrate its effectiveness by optimizing a complex automotive architecture.

An ILP Formulation for System-Level Application Mapping on Network Processor Architectures [p. 99]
C. Ostler and K.S. Chatha

Present-day network processors incorporate several architectural features, including symmetric multi-processing (SMP), block multi-threading, and multiple memory elements, to support the high performance requirements of networking applications. We present an automated system-level design technique for application development on such architectures. The technique incorporates process transformations and block-multi-threading-aware data mapping to maximize the worst-case throughput of the application. We propose integer linear programming formulations for process allocation and data mapping on SMP and block multi-threading based network processors. The paper presents experimental results that evaluate the technique by implementing representative network processing applications on the Intel IXP 2400 architecture. The results demonstrate that our technique is able to generate high-quality mappings of realistic applications on the target architecture within a short time.

A Smooth Refinement Flow for Co-Designing HW and SW Threads [p. 105]
P. Destro, F. Fummi and G. Pravadelli

The separation of HW and SW design flows represents a critical aspect in the development of embedded systems: co-verification becomes necessary, implying the development of complex co-simulation strategies. This paper presents a refinement flow that delays as much as possible the separation between HW and SW concurrent entities (threads), allowing their differentiation while preserving a homogeneous simulation environment. The approach relies on SystemC as the sole reference language. However, the SystemC threads corresponding to the SW application are simulated outside the control of the SystemC simulation kernel to exploit the typical features of the multi-threading real-time operating systems running on embedded systems, whereas HW threads maintain the original simulation semantics of SystemC. This allows designers to effectively tune the SW application before HW/SW partitioning, leaving the SW generation to an automatic procedure and thus avoiding error-prone and time-consuming manual conversions.

Speeding Up SystemC Simulation through Process Splitting [p. 111]
Y. N. Naguib and R. S. Guindi

This paper presents a new approach that can be used to speed up SystemC simulations by automatically optimizing the model for simulation. The work addresses the inefficiency of the standard SystemC scheduler, which may in some situations lead to unnecessary wake-up calls as well as unnecessary code execution. The method presented analyzes the SystemC code to automatically extract signal dependencies based on a set of rules. This information is then used to split large processes into smaller ones. Process splitting is performed by a tool, SplitPro, which generates optimized code that can be run on any standard SystemC engine. SplitPro was used to analyze the description of an Alpha superscalar processor and optimize some of its modules. A speed gain of up to 23% in simulation time was achieved over a number of split processes.


Interactive Presentation

An FPGA Design Flow for Reconfigurable Network-Based Multi-Processor Systems on Chip [p. 117]
A. Kumar, A. Hansson, J. Huisken and H. Corporaal

Multi-Processor System on Chip (MPSoC) platforms are becoming increasingly heterogeneous and are shifting towards a more communication-centric methodology. Networks on Chip (NoC) have emerged as the design paradigm for scalable on-chip communication architectures. As system complexity grows, the problem emerges of how to design and instantiate such a NoC-based MPSoC platform in a systematic and automated way. In this paper we present an integrated flow to automatically generate a highly configurable NoC-based MPSoC for FPGA instantiation. The system specification is done at a high level of abstraction, relieving the designer of error-prone and time-consuming work. The flow uses the state-of-the-art Æthereal NoC and Silicon Hive processing cores, both configurable at design- and run-time. We use this flow to generate a range of sample designs whose functionality has been verified on a Celoxica RC300E development board. The board, equipped with a Xilinx Virtex II 6000, also offers a large number of peripherals, and we show how their insertion into the design is automated for easy debugging and prototyping.


1.7: Algorithms and Applications of Run-Time Reconfiguration

Moderators: W. Najjar, UC Riverside, US, F. Kurdahi, UC Irvine, US
Hard Real-Time Reconfiguration Port Scheduling [p. 123]
F. Dittmann and S. Frank

When modern partially and dynamically reconfigurable FPGAs are to be used as resources in hard real-time systems, the two dimensions, area and time, have to be considered with a focus on availability and deadlines. In particular, area requirements must be guaranteed for the tasks' duration. While execution environments that abstract the space demand of tasks exist, and methods for the occupancy of resources over time are discussed in the literature, few works focus on another fundamental bottleneck: the reconfiguration port. As all resource requests are served by this mutually exclusive device, profound concepts for scheduling port access are vital for FPGA real-time scheduling. Since the port must be accessed sequentially, we can inherit and apply well-researched monoprocessor scheduling concepts. In this paper, we introduce monoprocessor scheduling algorithms for the reconfiguration port of FPGAs.

An Efficient Algorithm for Online Management of 2D Area of Partially Reconfigurable FPGAs [p. 129]
J. Cui, Q. Deng, X. He and Z. Gu

Partially Runtime-Reconfigurable (PRTR) FPGAs allow hardware tasks to be placed and removed dynamically at runtime. We present an efficient algorithm for finding the complete set of maximal empty rectangles on a 2D PRTR FPGA, which is useful for online placement and scheduling of HW tasks. The algorithm is incremental and only updates the local region affected by each task addition or removal event. We use simulation experiments to evaluate its performance and compare it to related work.
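
For reference, here is what "the complete set of maximal empty rectangles" means, as a brute-force sketch on a small occupancy grid (True = occupied by a task). The paper's contribution is an incremental algorithm that avoids exactly this kind of full recomputation; the grid encoding and names below are our own.

    from itertools import product

    def is_empty(grid, r1, c1, r2, c2):
        return all(not grid[r][c]
                   for r in range(r1, r2 + 1) for c in range(c1, c2 + 1))

    def maximal_empty_rectangles(grid):
        R, C = len(grid), len(grid[0])
        mers = []
        for r1, c1 in product(range(R), range(C)):
            for r2, c2 in product(range(r1, R), range(c1, C)):
                if not is_empty(grid, r1, c1, r2, c2):
                    continue
                # maximal iff it cannot grow by one row/column in any direction
                if r1 > 0 and is_empty(grid, r1 - 1, c1, r2, c2): continue
                if r2 < R - 1 and is_empty(grid, r1, c1, r2 + 1, c2): continue
                if c1 > 0 and is_empty(grid, r1, c1 - 1, r2, c2): continue
                if c2 < C - 1 and is_empty(grid, r1, c1, r2, c2 + 1): continue
                mers.append((r1, c1, r2, c2))
        return mers

    grid = [[False, True,  False],
            [False, False, False]]
    print(maximal_empty_rectangles(grid))   # three MERs around the task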

Improving Utilization of Reconfigurable Resources Using Two-Dimensional Compaction [p. 135]
A.A. El Farag, H.M. El-Boghdadi and S.I. Shaheen

Partial reconfiguration allows parts of the reconfigurable chip area to be configured without affecting the rest of the chip. This allows placement of tasks on the reconfigurable chip at run time. Area management is a very important issue which highly affects the utilization of the chip and hence the performance. This paper focuses on a major aspect of area management: moving running tasks to free space for new incoming tasks (compaction). We study the effect on system performance of compacting running tasks to free more contiguous space. First, we introduce a straightforward compaction strategy called Blind compaction, whose performance we use as a reference to measure the performance of other compaction algorithms. Then we propose a two-dimensional compaction algorithm called one-corner compaction, which runs with respect to one chip corner. We further extend this algorithm to the four corners of the chip and introduce the 4-corner compaction algorithm. Finally, we compare the performance of these algorithms with some existing compaction strategies [3]. The simulation results show that the 4-corner compaction algorithm improves average task allocation time by 15% and chip utilization by 16% over Blind compaction. These results outperform the existing strategies.

Low-Power Warp Processor for Power Efficient High-Performance Embedded Systems [p. 141]
R. Lysecky

Researchers previously proposed warp processors, a novel architecture capable of transparently optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. However, the original warp processor design was primarily performance-driven and did not focus on power consumption, which is becoming an increasingly important design constraint. Focusing on power consumption, we present an alternative low-power warp processor design and methodology that can dynamically and transparently reduce power consumption of an executing application with no degradation in system performance, achieving an average reduction in power consumption of 74%. We further demonstrate the flexibility of this approach to provide dynamic control between high-performance and low-power consumption.
Keywords: Warp processing, low-power, hardware/software partitioning, dynamically adaptable systems, embedded systems.


Interactive Presentations

Using Dynamic Voltage Scaling to Reduce the Configuration Energy of Run Time Reconfigurable Devices [p. 147]
Y. Qu, J.-P. Soininen and J. Nurmi

In this paper, an approach that uses dynamic voltage scaling (DVS) to reduce the configuration energy of run-time reconfigurable devices is proposed. The basic idea is to use configuration prefetching and parallelism to create excess system idle time and to apply DVS to the configuration process when such idle time can be utilized. A genetic algorithm is developed to solve the task scheduling and voltage assignment problem. For real applications, the results show that configuration energy can be reduced by up to 19.3%. They also show that, with respect to configuration energy, using more computation resources is more favorable when the configuration latency is relatively small, while using more configuration controllers is more favorable for relatively large latencies.

A Shift Register Based Clause Evaluator for Reconfigurable SAT Solver [p. 153]
M. Safar, M. Shalan, M. W. El-Kharashi and A. Salem

Several approaches have been proposed to accelerate the NP-complete Boolean satisfiability problem (SAT) using reconfigurable computing. We present an FPGA-based clause evaluator, where each clause is modeled as a shift register that is right shifted, left shifted, or stands still according to whether the currently assigned variable value satisfies, unsatisfies, or does not affect the clause, respectively. For a given problem instance, the effect of the value of each of its variables on the SAT formula is loaded into the FPGA on-chip memory. This results in less configuration effort and fewer hardware resources than other available SAT solvers. We also present a new approach to implementing conflict analysis, based on a conflicting-variables accumulator and a priority encoder that determines the backtrack level. Using these two new ideas, we implement an FPGA-based SAT solver performing depth-first search with non-chronological conflict-directed backtracking. We compare our SAT solver with other solvers on instances from the DIMACS benchmark suite.


2.2: IP Designs for Media Processing and Other Computational Intensive Kernels

Moderators: J. Dielissen, NXP Research, NL, N. Dutt, UC Irvine, US
Efficient High-Performance ASIC Implementation of JPEG-LS Encoder [p. 159]
M. Papadonikolakis, V. Pantazis and A. P. Kakarountas

This paper introduces an innovative design which implements a high-performance JPEG-LS encoder. The encoding process follows the principles of the JPEG-LS lossless mode. The proposed implementation consists of an efficient pipelined JPEG-LS encoder, which operates at a significantly higher encoding rate than any other JPEG-LS hardware or software implementation while keeping the area small.
Index Terms - Image processing, lossless compression, JPEG-LS, LOCO-I, VLSI implementation.

Improve CAM Power Efficiency Using Decoupled Match Line Scheme [p. 165]
Y.-J. Chang, Y.-H. Liao and S.-J. Ruan

Content addressable memory (CAM) is widely used in many applications that require fast table lookup. Due to the parallel comparison feature and high frequency of lookup, however, the power consumption of CAM is usually significant. In this paper we propose a decoupled match line scheme which combines the performance advantage of the traditional NOR-type CAM and the power efficiency of the traditional NAND-type CAM. In our design, a CAM word is divided into two segments, and then all the CAM cells are decoupled from the match line. By minimizing both the match line capacitances and switching activities, our design can largely reduce the CAM power dissipated in search operations. The results measured from the fabricated chip show that without any performance penalty our design can reduce the search energy consumption of the CAM by 89% compared to the traditional NOR-type CAM design.

Cyclostationary Feature Detection on a Tiled-SoC [p. 171]
A. B. J. Kokkeler, G. J. M. Smit, T. Krol and J. Kuper

In this paper, a two-step methodology is introduced to analyse the mapping of Cyclostationary Feature Detection (CFD) onto a multi-core processing platform. In the first step, the tasks to be executed by each core are determined in a structured way using techniques known from the design of array processors. In the second step, the implementation of tasks on a processing core is analysed. Using this methodology, it is shown that calculating a 127 x 127 Discrete Spectral Correlation Function requires approximately 140 μs on a tiled System on Chip (SoC) with 4 Montium cores.

Mapping Control-Intensive Video Kernels onto a Coarse-Grain Reconfigurable Architecture: The H.264/AVC Deblocking Filter [p. 177]
C. Arbelo, A. Kanstein, S. López, J. F. López, M. Berekovic, R. Sarmiento and J.-Y. Mignolet

Deblocking filtering represents one of the most compute-intensive tasks in an H.264/AVC standard video decoder due to its demanding memory accesses and irregular data flow. For these reasons, an efficient implementation poses big challenges, especially for programmable platforms. This paper presents the mapping of this decoder functionality onto a C-programmable coarse-grained reconfigurable architecture named ADRES (Architecture for Dynamically Reconfigurable Embedded Systems), including results from the evaluation of different topologies. The results show a considerable reduction in the number of cycles and memory accesses needed to perform the filtering, as well as an increase in the degree of instruction-level parallelism (ILP), when compared with an implementation on a dedicated Very Long Instruction Word (VLIW) processor. This demonstrates that high ILP is achievable on ADRES even for irregular, data-dependent kernels.


Interactive Presentations

An Efficient Hardware Architecture for H.264 Intra Prediction Algorithm [p. 183]
E. Sahin and I. Hamzaoglu

In this paper, we present an efficient hardware architecture for real-time implementation of the intra prediction algorithm used in the H.264 / MPEG-4 Part 10 video coding standard. The hardware design is based on a novel organization of the intra prediction equations. This hardware is designed to be used as part of a complete H.264 video coding system for portable applications. The proposed architecture is implemented in Verilog HDL. The Verilog RTL code is verified to work at 90 MHz on a Xilinx Virtex-II FPGA. The FPGA implementation can process 27 VGA frames (640x480) per second.
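
For context, two of the nine 4x4 luma intra prediction modes defined by the H.264 standard are sketched below in plain Python: vertical prediction copies the reconstructed pixels above, and DC prediction averages the available neighbours. The paper's contribution is a hardware-friendly reorganization of such equations, not the equations themselves.

    def intra4x4_vertical(top):
        """top: the 4 reconstructed pixels above the block."""
        return [list(top) for _ in range(4)]       # every row equals 'top'

    def intra4x4_dc(top, left):
        """DC mode when both top and left neighbours are available."""
        dc = (sum(top) + sum(left) + 4) >> 3       # rounded average of 8 pixels
        return [[dc] * 4 for _ in range(4)]

    print(intra4x4_dc([100, 102, 104, 106], [98, 99, 101, 103]))  # all 102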

An FPGA Implementation of Decision Tree Classification [p. 189]
R. Narayanan, D. Honbo, G. Memik, A. Choudhary and J. Zambreno

Data mining techniques are a rapidly emerging class of applications that have widespread use in several fields. One important problem in data mining is classification, the task of assigning objects to one of several predefined categories. Among the several solutions developed, Decision Tree Classification (DTC) is a popular method that yields high accuracy while handling large datasets. However, DTC is a computationally intensive algorithm, and as data sizes increase, its running time can stretch to several hours. In this paper, we propose a hardware implementation of Decision Tree Classification. We identify the compute-intensive kernel (Gini score computation) in the algorithm and develop a highly efficient architecture, which is further optimized by reordering the computations and by using a bitmapped data structure. Our implementation on a Xilinx Virtex-II Pro FPGA platform (with 16 Gini units) provides up to 5.58x performance improvement over an equivalent software implementation.
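
The accelerated kernel itself is compact; a plain software version of the Gini score of a candidate split (the standard definition, with invented toy class counts) looks like this:

    def gini(counts):
        """Gini impurity of a node from its per-class record counts."""
        n = sum(counts)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def split_gini(left_counts, right_counts):
        """Record-weighted Gini of the two children of a split (lower is better)."""
        nl, nr = sum(left_counts), sum(right_counts)
        return (nl * gini(left_counts) + nr * gini(right_counts)) / (nl + nr)

    # e.g. 10 records split 6/4, with (class-yes, class-no) counts per side:
    print(split_gini([5, 1], [1, 3]))   # ~0.317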

Radix 4 SRT Division with Quotient Prediction and Operand Scaling [p. 195]
N.R. Srivastava

SRT division is an efficient method for implementing high-radix division circuits. However, as the radix increases, the size of the quotient digit selection table increases exponentially. To overcome the limitations of quotient prediction, a method in which a quotient digit is speculated has been proposed: the speculated quotient digit is used to update the possible partial remainders while the speculated quotient is corrected. In this paper, instead of using a huge quotient selection table, an estimation-and-correction scheme is used for prediction of the quotient digit. The prediction is done in parallel with the calculation of the partial remainder for the quotient digit predicted earlier, thus improving the latency. In addition, since this method tends to consume less area as the radix increases compared to previous methods, it can improve higher-radix implementations of SRT division.
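
A behavioural sketch of the underlying radix-4 recurrence may help: each iteration selects one quotient digit from the redundant set {-2,...,2} and updates the partial remainder. Here the digit is chosen by exact comparison; the whole point of SRT selection tables, and of the prediction scheme proposed in the paper, is to replace that selection with a cheap estimate that can run ahead of the remainder update. Parameter choices below are illustrative only.

    def radix4_divide(x, d, digits=16):
        """Fractional division x/d, assuming 0 < x <= d/2 and d > 0."""
        q, r = 0.0, x
        for i in range(1, digits + 1):
            qd = max(-2, min(2, round(4 * r / d)))  # quotient digit selection
            r = 4 * r - qd * d                      # partial remainder update
            q += qd * 4.0 ** -i                     # accumulate quotient digit
        return q

    print(radix4_divide(0.30, 0.75))   # ~0.4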


2.3: Test Infrastructure of SoCs and its Verification

Moderators: F. Novak, Jozef Stefan Institute, SL, R. Dorsch, IBM, Boeblingen, DE
SoC Testing Using LFSR Reseeding, and Scan-Slice-Based TAM Optimization and Test Scheduling [p. 201]
Z. Wang, K. Chakrabarty and S. Wang

We present an SoC testing approach that integrates test data compression, TAM/test wrapper design, and test scheduling. An improved LFSR reseeding technique is used as the compression engine. All cores on the SoC share a single on-chip LFSR. At any clock cycle, one or more cores can simultaneously receive data from the LFSR. Seeds for the LFSR are computed from the care bits of the test cubes for multiple cores. We also propose a scan-slice-based scheduling algorithm that tries to maximize the number of care bits the LFSR can produce at each clock cycle, such that the overall test application time is minimized. Experimental results for both ISCAS circuits and industrial circuits show that the optimal test application time, which is determined by the largest core, can be achieved. The proposed approach has small hardware overhead and is easy to deploy: only one LFSR, one phase shifter, and a few counters need to be added to the SoC. The scheduling algorithm also scales to large industrial circuits; the CPU time for a large industrial design ranges from 1 to 30 minutes.
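
The principle behind LFSR reseeding in miniature: the bits an LFSR shifts out are linear functions (over GF(2)) of the seed bits, so a seed reproducing the care bits of a test cube can be found by solving a small linear system. The toy sketch below simply searches an 8-bit seed space instead of solving equations, and its feedback polynomial and sizes are arbitrary choices of ours.

    def lfsr_stream(seed, taps=(7, 5, 4, 3), nbits=8, length=24):
        """Shift out 'length' bits from an nbits-wide Fibonacci LFSR."""
        state, out = seed, []
        for _ in range(length):
            out.append(state & 1)                 # output the low bit
            fb = 0
            for t in taps:                        # XOR the feedback taps
                fb ^= (state >> t) & 1
            state = (state >> 1) | (fb << (nbits - 1))
        return out

    def find_seed(care_bits):
        """care_bits: {stream position: required value} from a test cube."""
        for seed in range(1, 1 << 8):
            s = lfsr_stream(seed)
            if all(s[p] == v for p, v in care_bits.items()):
                return seed
        return None

    # test cube x1xx0xxx1... -> care bits at positions 1, 4 and 8
    print(find_seed({1: 1, 4: 0, 8: 1}))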

Optimized Integration of Test Compression and Sharing for SoC Testing [p. 207]
A. Larsson, E. Larsson, P. Eles and Z. Peng

The increasing test data volume needed to test core-based Systems-on-Chip contributes to long test application times (TAT) and huge automatic test equipment (ATE) memory requirements. TAT and ATE memory requirements can be reduced by test architecture design, test scheduling, sharing the same tests among several cores, and test data compression. In contrast to previous work that addresses only one or a few of these problems, we propose an integrated framework with heuristics for sharing and compression and a Constraint Logic Programming technique for architecture design and test scheduling that minimizes the TAT without violating a given ATE memory constraint. The significance of our approach is demonstrated by experiments with ITC'02 benchmark designs.

A Sophisticated Memory Test Engine for LCD Display Drivers [p. 213]
O. Spang, H.-M. Von Staudt and M.G. Wahl

Economic testing of small devices like LCD drivers is a real challenge. In this paper we describe an approach in which a production tester is extended by a memory test engine (MTE). The MTE, which consists of hardware and software components, allows testing the LCD driver memory at speed while concurrently executing other tests. It is fully integrated into the tester. The MTE leads to a significant increase in memory test quality and, at the same time, to a significant reduction in test time. The test time reduction achieved by executing the memory test in parallel with other analog tests led to the test cost reduction that was the impetus for developing the MTE.

Formal Verification of a Pervasive Interconnect Bus System in a High-Performance Microprocessor [p. 219]
T. Le, T. Glökler and J. Baumgartner

In our high-performance PowerPC processor, the correctness of the so-called pervasive interconnect bus system, which provides, among others, Test and Debug access via external interfaces like JTAG, is of utmost importance. In this paper, we describe our approach in formally verifying the correctness of this bus system to combat the coverage problem of simulation-based techniques. The bus system and the associated arbitration logic support several functionalities such as deadlock detection and resolution. In order to efficiently complete all of the required formal analysis for verification, we needed to leverage a variety of proof and semi-formal algorithms, as well as reduction and abstraction algorithms. Experimental results are provided to show the efficiency of this approach.


Interactive Presentations

Low Cost Debug Architecture Using Lossy Compression for Silicon Debug [p. 225]
E. Anis and N. Nicolici

The size of the on-chip trace buffers used for at-speed silicon debug limits the observation window in any debug session. For the case where the debug experiment can be repeated, we propose a novel architecture for at-speed silicon debug that enables a methodology in which the designer can iteratively zoom in on only the intervals containing erroneous samples. When compared to increasing the size of the trace buffer, the proposed architecture has a small impact on silicon area while significantly reducing the number of debug sessions.

An SoC Test Scheduling Algorithm Using Reconfigurable Union Wrappers [p. 231]
T. Yoneda, M. Imanishi and H. Fujiwara

This paper presents a reconfigurable union wrapper that can wrap multiple cores into a single wrapper design. Moreover, we present a test scheduling algorithm to minimize the test application time using the proposed reconfigurable union wrapper. The proposed heuristic algorithm can achieve short test application times at low computational cost compared to conventional approaches where every core has its own wrapper. Experimental results for the ITC'02 SOC benchmarks show the effectiveness of our approach.
Keywords: system-on-a-chip, test scheduling, reconfigurable union wrapper, test access mechanism


2.4: HOT TOPIC - Microprocessors in the Era of Terascale Integration

Moderator: A. González, Intel and UPC, ES
Microprocessors in the Era of Terascale Integration [p. 237]
S. Borkar, N.P. Jouppi and P. Stenstrom

Moore's Law will soon deliver tera-scale transistor integration capacity. Power, variability, reliability, aging, and testing will pose barriers and challenges to harnessing this integration capacity. The advances in microarchitecture and programming systems discussed in this paper are potential solutions.


2.5: Statistical / Nonlinear Analysis and Verification for Analogue Circuits

Moderators: G. Vandersteen, IMEC, BE, J. Roychowdhury, Minnesota U, US
CMCal: An Accurate Analytical Approach for the Analysis of Process Variations with Non-Gaussian Parameters and Nonlinear Functions [p. 243]
M. Zhang, M. Olbrich, D. Seider, M. Frerichs, H. Kinzelbach and E. Barke

As technology rapidly scales, performance variations (delay, power, etc.) arising from process variation are becoming a significant problem. The use of linear models has proven to be very critical in many of today's applications: even for well-behaved performance functions, linearising approaches as well as quadratic models produce serious errors in the calculated expected value, variance and higher central moments. In this paper, we present a novel approach to analyse the impact of process variations with low effort and minimal assumptions. We formulate circuit performance as a function of the random parameters and approximate it by a Taylor expansion up to 4th order. Taking advantage of the knowledge about higher moments, we convert the Taylor series into characteristics of the performance distribution. Our experiments show that this approach provides highly accurate results even for strongly non-linear problems with large process variations. Its simplicity, efficiency and accuracy make this approach a promising alternative to the Monte Carlo method in most practical applications.
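
The identity at the core of such moment-based approaches, written here in its one-dimensional form for illustration (the paper treats the multivariate case): taking the expectation of the 4th-order Taylor expansion of a performance f around the parameter mean \mu turns the central moments m_k = E[(x - \mu)^k] of the parameter directly into the mean of the performance,

    E[f(x)] \approx f(\mu) + \tfrac{1}{2} f''(\mu)\, m_2 + \tfrac{1}{6} f'''(\mu)\, m_3 + \tfrac{1}{24} f''''(\mu)\, m_4,

where the m_3 term captures the skew of non-Gaussian parameters that linear and quadratic models discard; analogous expressions yield the variance and higher central moments of f.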

A Symbolic Methodology for the Verification of Analog and Mixed Signal Designs [p. 249]
G. Al-Sammane, M. H. Zaki and S. Tahar

We propose a new symbolic verification methodology for proving properties of analog and mixed signal (AMS) designs. Starting with an AMS description and a set of properties, and using symbolic computation, we extract a normal mathematical representation of the system in terms of recurrence equations. These normalized equations are used along with an induction verification strategy defined inside the computer algebra system Mathematica to prove the correctness of the properties. We apply our methodology to a third-order ΔΣ modulator.

Efficient Nonlinear Distortion Analysis of RF Circuits [p. 255]
D. Tannir and R. Khazaka

Nonlinear distortion, typically characterized by the third-order intercept point (IP3), is one of the key figures of merit in the design of RF communication circuits. The calculation of IP3 is typically based either on analytical approaches such as Volterra series, which are very complex and difficult to apply to circuits of arbitrary complexity, or on simulation-based methods, which require multi-tone inputs and thus incur a very high CPU cost. In this paper a new method based on the computation of circuit moments is proposed. The new approach uses the circuit moments to numerically compute the Volterra kernels. This automates the process of obtaining such kernels for any circuit and results in an efficient approach for the computation of IP3.
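
For reference, the textbook relation any IP3 computation ultimately evaluates (a standard result, not specific to this paper): for a memoryless nonlinearity y = a_1 x + a_2 x^2 + a_3 x^3 driven by two tones of amplitude A, the third-order intermodulation products at 2\omega_1 - \omega_2 have amplitude \tfrac{3}{4} |a_3| A^3, and equating this with the linear response |a_1| A gives the input-referred intercept amplitude

    A_{IIP3} = \sqrt{ \tfrac{4}{3} \left| a_1 / a_3 \right| }.

The moment-based method numerically computes the quantities playing the role of a_1 and a_3 (the Volterra kernels) instead of deriving them by hand.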

Nonlinearity Analysis of Analog/RF Circuits Using Combined Multisine and Volterra Analysis [p. 261]
J. Borremans, L. De Locht, P. Wambacq and Y. Rolain

Modern integrated radio systems require highly linear analog/RF circuits. Two-tone simulations are commonly used to study a circuit's nonlinear behavior; very often, however, this approach offers limited insight. To gain insight into nonlinear behavior, we use a multisine analysis methodology to locate the main nonlinear components (e.g. transistors) for both weakly and strongly nonlinear behavior. Under weakly nonlinear conditions, selective Volterra analysis is used to further determine the most important nonlinearities of the main nonlinear components. As shown with an example of a 90 nm CMOS wideband low-noise amplifier, the insights obtained with this approach can be used to reduce nonlinear circuit behavior, in this case by 10 dB. The approach is valid for wideband, and thus practical, excitation signals, and is easily applicable to both simple and complex circuits.


Interactive Presentation

Optimizing Analog Filter Designs for Minimum Nonlinear Distortions Using Multisine Excitations [p. 267]
J. Lataire, G. Vandersteen and R. Pintelon

Nonlinear distortions in submicron analog circuits are gaining importance, especially when power constraints are imposed and when operating in moderate inversion. This paper proposes a method to optimize the design of analog filters for minimum noise and nonlinear distortion. For this purpose, a technique is presented for quantifying these nonlinearities such that their influence can be compared with that of the system noise. Having quantified the non-idealities, an optimization can be carried out which involves the tuning of design parameters.
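
A random-phase odd multisine of the kind typically used in such measurements can be generated in a few lines; exciting only odd harmonics makes even-order distortion fall on even frequency bins and odd-order distortion on the unexcited odd bins, separating both from the linear response. The sketch below is generic, and its parameter values are illustrative rather than taken from the paper.

    import numpy as np

    def odd_multisine(f0=1e3, n_harm=31, fs=1e6, periods=4, rng=None):
        """Sum of cosines on odd harmonics of f0, each with a random phase."""
        rng = rng or np.random.default_rng(0)
        t = np.arange(int(periods * fs / f0)) / fs
        x = np.zeros_like(t)
        for k in range(1, n_harm + 1, 2):          # odd harmonics only
            phase = rng.uniform(0, 2 * np.pi)
            x += np.cos(2 * np.pi * k * f0 * t + phase)
        return t, x / np.max(np.abs(x))            # normalise peak amplitude

    t, x = odd_multisine()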


2.6: System Modeling and Specification

Moderators: T. Schattkowsky, Paderborn U, DE, W. Klingauf, TU Braunschweig, DE
Performance Analysis of Complex Systems by Integration of Dataflow Graphs and Compositional Performance Analysis [p. 273]
S. Schliecker, S. Stein and R. Ernst

In this paper we integrate two established approaches to formal multiprocessor performance analysis, namely Synchronous Dataflow Graphs and Compositional Performance Analysis. The two make different trade-offs between precision and applicability. We show how the strengths of both can be combined to achieve a very precise and adaptive model. We couple these models of completely different paradigms by relying on load descriptions of event streams. The results show superior performance analysis quality.
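
The "load descriptions of event streams" used for such coupling are the standard event models of compositional performance analysis; for instance, a periodic stream with period P and jitter J admits at most

    \eta^+(\Delta t) = \left\lceil \frac{\Delta t + J}{P} \right\rceil

events in any time window of length \Delta t. (This is the usual periodic-with-jitter bound; the abstract does not spell out which event models are exchanged.)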

Tackling an Abstraction Gap: Co-Simulating with SystemC DE and Bluespec ESL [p. 279]
H.D. Patel and S.K Shukla

The growing SystemC community for system-level design exploration is a result of SystemC's capability of modeling at the RTL and above-RTL abstraction levels. However, managing shared-state concurrency using multi-threading in large SystemC models is error prone. A recent extension of SystemC called Bluespec-SystemC (BS-ESL) counters this difficulty with its model of computation employing atomic rule-based specifications. However, simulating a model that is partly designed in SystemC and partly in BS-ESL requires an interoperability semantics and an implementation of that semantics. This paper views the interoperability problem as an abstraction gap closure problem. To illustrate the problem, we formalize the simulation semantics of BS-ESL and the discrete-event simulation of RTL SystemC, and we provide a solution based on this formalization.

A Calculator for Pareto Points [p. 285]
M. Geilen and T. Basten

This paper presents the Pareto Calculator, a tool for compositional computation of Pareto points based on the algebra of Pareto points. The tool is a useful instrument for multidimensional optimisation problems, design-space exploration, and the development of quality management and control strategies. Implementations of the operations of the algebra and their complexity are discussed. In particular, we discuss a generalisation of the well-known divide-and-conquer algorithm for computing the Pareto points (optimal solutions) of a set of possible configurations, also known as the maximal vector or skyline problem. The generalisation lies in the fact that we allow partially ordered domains instead of only totally ordered ones. The calculator is available at the following URL: http://www.es.ele.tue.nl/pareto.
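
The basic operation of the algebra, Pareto minimisation, is easy to state in code. The O(n^2) sketch below filters dominated configurations under componentwise order with lower-is-better dimensions; the calculator's divide-and-conquer algorithm computes the same set more efficiently and over more general partially ordered domains. The data are invented.

    def dominates(a, b):
        """a dominates b if a is at least as good everywhere and differs."""
        return all(x <= y for x, y in zip(a, b)) and a != b

    def pareto_min(configs):
        return [c for c in configs
                if not any(dominates(d, c) for d in configs)]

    # (latency, energy, area) of candidate configurations
    configs = [(10, 5, 3), (8, 7, 3), (10, 5, 4), (9, 9, 9)]
    print(pareto_min(configs))   # (10,5,4) and (9,9,9) are dominated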

Modeling and Simulation to the Design of ΔΣ Fractional-N Frequency Synthesizer [p. 291]
S. Huang, H. Ma and Z. Wang

A set of behavioral voltage-domain Verilog-A/Verilog models allowing a systematic design of ΔΣ fractional-N frequency synthesizers is discussed in this paper. The approach allows the designer to accurately predict the dynamic or steady-state characteristics of the closed loop by including nonlinear effects of the building blocks in the models. The proposed models are implemented in a third-order ΔΣ fractional-N PLL-based frequency synthesizer with a 60 MHz frequency tuning range. Cadence SpectreVerilog simulation results show that behavioral modeling can provide a great speed-up over circuit-level simulation. At the same time, the phase noise, spurs and settling time can be accurately predicted, which helps in grasping the fundamentals at an early stage of the design and in optimizing the design at the system level. The key simulation results have been compared against measured results obtained from an actual prototype, validating the effectiveness of the proposed models.


Interactive Presentations

System Level Power Optimization of Sigma-Delta Modulator [p. 297]
F. Gong and X. Wu

A new approach to power optimization of sigma-delta modulators is presented, based on modeling the noise performance while deciding the system functions and the sub-circuit specifications. A system model of a 2nd-order modulator, together with a Matlab algorithm to optimize its power specifications, was developed. The system simulation results showed that all specifications agreed well with expectations. Using the proposed architecture, a resolution of 16 bits was achieved.

Executable System-Level Specification Models Containing UML-Based Behavioral Patterns [p. 301]
L.S. Indrusiak, A. Thuy and M. Glesner

Behavioral patterns are useful abstractions to simplify the design of communication-centric systems. Such patterns are traditionally described using UML diagrams, but the lack of execution semantics in UML prevents the co-validation of the patterns together with the simulation models and executable specifications that are the mainstream in today's system-level design flows. This paper proposes a method to validate UML-based behavioral patterns within executable system models. The method is based on actor orientation and was implemented as an extension of the Ptolemy II framework. A case study is presented, and potential applications and extensions of the proposed method are discussed.


2.7: Design Space Exploration and Nano-Technologies for Reconfigurable Computing

Moderators: W. Luk, Imperial College, London, UK, R. Lysecky, Arizona U, US
Assessing Carbon Nanotube Bundle Interconnect for Future FPGA Architectures [p. 307]
S. Eachempati, A. Nieuwoudt, A. Gayasen, N. Vijaykrishnan and Y. Massoud

Field Programmable Gate Arrays (FPGAs) are important hardware platforms in various applications due to increasing design complexity and mask costs. However, as CMOS process technology continues to scale, standard copper interconnect will become a major bottleneck for FPGA performance. In this paper, we propose utilizing bundles of single-walled carbon nanotubes (SWCNT) as wires in the FPGA interconnect fabric and compare their performance to standard copper interconnect in future process technologies. To leverage the performance advantages of nanotube-based interconnect, we explore several important aspects of the FPGA routing architecture, including the segmentation distribution and the internal population of the wires. The results demonstrate that FPGAs utilizing SWCNT bundle interconnect can achieve a 19% improvement in average area-delay product over the best performing architecture for standard copper interconnect in 22 nm process technology.

Two-Level Microprocessor-Accelerator Partitioning [p. 313]
S. Sirowy, Y. Wu, S. Lonardi and F. Vahid

The integration of microprocessors and field-programmable gate array (FPGA) fabric on a single chip increases both the utility and necessity of tools that automatically move software functions from the microprocessor to accelerators on the FPGA to improve performance or energy. Such hardware/software partitioning for modern FPGAs involves the problem of partitioning functions among two levels of accelerator groups - tightly-coupled accelerators that have fast single-clock-cycle memory access to the microprocessor's memory, and loosely-coupled accelerators that access memory through a bridge to avoid slowing the main clock period with their longer critical paths. We introduce this new two-level accelerator-partitioning problem, and we describe a novel optimal dynamic programming algorithm to solve it. By making use of the size constraint imposed by FPGAs, the algorithm has effectively quadratic runtime complexity, running in just a few seconds for examples with up to 25 accelerators; the resulting partitions obtain an average performance improvement of 35% compared to a traditional single-level bus architecture.
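
The flavour of such a dynamic program can be conveyed in a few lines: each kernel either stays in software or occupies area in the tightly- or loosely-coupled region, and the DP state is the pair of used areas, which the FPGA size constraint keeps small. This sketch is ours, with an invented cost model far simpler than the paper's.

    def partition(kernels, cap_t, cap_l):
        """kernels: list of (area, gain_tight, gain_loose); maximize total gain."""
        best = {(0, 0): 0.0}                      # (used_tight, used_loose) -> gain
        for area, g_t, g_l in kernels:
            nxt = dict(best)                      # option: stay in software
            for (ut, ul), g in best.items():
                if ut + area <= cap_t:            # tightly-coupled region
                    k = (ut + area, ul)
                    nxt[k] = max(nxt.get(k, 0.0), g + g_t)
                if ul + area <= cap_l:            # loosely-coupled region
                    k = (ut, ul + area)
                    nxt[k] = max(nxt.get(k, 0.0), g + g_l)
            best = nxt
        return max(best.values())

    kernels = [(3, 5.0, 4.2), (2, 3.1, 2.9), (4, 6.0, 5.1)]
    print(partition(kernels, cap_t=5, cap_l=4))   # 13.2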

PDF icon Design Space Exploration of Partially Re-Configurable Embedded Processors [p. 319]
A. Chattopadhyay, W. Ahmed, K. Karuri, D. Kammler, R. Leupers, G. Ascheid and H. Meyr

In today's embedded processors, performance and flexibility have become the two key attributes, and they are often conflicting. The best performance is obtained from custom-designed integrated circuits, whereas the maximum flexibility is delivered by a general-purpose processor. Among the architecture types that have emerged over the past years to strike an optimum balance between these two attributes, two are prominent: Field Programmable Gate Array (FPGA)-based architectures and Application-specific Instruction-set Processors (ASIPs). Depending on the type of application (i.e. stream-like or control-dominated), either of these architecture types is able to deliver high performance or flexibility or both. Consequently, a new design approach with partial re-configurability on the application-specific processor is attracting strong research interest. We call this architecture a reconfigurable ASIP (rASIP). Currently, the lack of a high-level abstraction of the rASIP keeps the designer from trying out various design alternatives because of long and tedious exploration cycles. To address this issue, this paper proposes a high-level specification for reconfigurable processors, together with a seamless design space exploration methodology using this specification.


Interactive Presentation

PDF icon Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor [p. 325]
H. Noori, F. Mehdipour, K. Murakami, K. Inoue and M. Goudarzi

To improve the performance of embedded processors, an effective technique is collapsing critical computation subgraphs into application-specific instruction set extensions and executing them on custom functional units. The problems with this approach are the immense cost and long time of designing. To address these issues, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units with the capability of conditional execution. Unlike previously proposed CIs, ours can include multiple exits. Experimental results show that multi-exit CIs enhance performance by 46% on average compared to CIs limited to one basic block. A maximum speedup of 2.89 compared to a 4-issue in-order RISC processor, and an average speedup of 1.66, was achieved on the MiBench benchmark suite.


3.2: Implementation of LDPC Codecs for Various Communication Standards

Moderators: M. Heijligers, NXP IC Lab, NL, N. Wehn, Kaiserslautern U, DE
PDF icon Low Complexity LDPC Code Decoders for Next Generation Standards [p. 331]
T. Brack, M. Alles, T. Lehnigk-Emden, F. Kienle, N. Wehn, N.E. L'Insalata, F. Rossi, M. Rovini and L. Fanucci

This paper presents the design of low-complexity LDPC code decoders for the upcoming WiFi (IEEE 802.11n), WiMax (IEEE 802.16e) and DVB-S2 standards. A complete exploration of the design space, spanning from the decoding schedules and the node processing approximations up to the top-level decoder architecture, is detailed. Based on this exploration, state-of-the-art techniques for low-complexity design have been adopted to arrive at feasible high-throughput decoder implementations. An analysis of the standardized codes from the decoder-aware point of view is also given, presenting, for each standard, the implementation challenges (multiple code rates and lengths) and the bottlenecks related to complete coverage of the standard. Synthesis results for a generic decoder architecture on a current 65nm CMOS technology are provided.

PDF icon Non-Fractional Parallelism in LDPC Decoder Implementations [p. 337]
J. Dielissen and A. Hekstra

Because of its excellent bit-error-rate performance, the Low-Density Parity-Check (LDPC) decoding algorithm is gaining increased attention in communication standards and literature. The new Chinese Digital Video Broadcast standard (CDVB-T) also uses LDPC codes. This standard uses a large prime number as the parallelism factor, leading to high area costs. In this paper we present a new method that allows fractional dividers to be used. The method depends on the property that consecutive sub-circulants have one memory row in common. Several techniques are shown for assuring this property or for solving memory conflicts, making the method more generally applicable. In fact, the proposed technique is a first step towards a general-purpose LDPC processor. For the CDVB-T decoder implementation, the method leads to a factor-3 improvement in area.

PDF icon Minimum-Energy LDPC Decoder for Real-Time Mobile Application [p. 343]
W. Wang and G. Choi

This paper presents a low-power real-time decoder that provides constant-time processing of each frame using dynamic voltage and frequency scaling. The design uses a known capacity-approaching low-density parity-check (LDPC) code to protect data over fading channels. Real-time applications require guaranteed data rates, yet conventional schemes with a fixed number of decoding iterations are not energy efficient for mobile devices. The proposed heuristic scheme pre-analyzes each received data frame to estimate the maximum number of iterations necessary for frame convergence. The result is then used to dynamically adjust the decoder frequency, and energy use is reduced by lowering the power supply voltage to the minimum necessary for the given frequency. The resulting design provides a judicious trade-off between power consumption and error level.
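
The frequency/voltage selection step might look like the following minimal sketch, where the voltage-frequency table, the cycles-per-iteration figure, and the deadline are all illustrative assumptions rather than values from the paper:

FREQ_VOLT = [(100e6, 0.9), (200e6, 1.0), (400e6, 1.2)]  # (Hz, V), ascending
CYCLES_PER_ITER = 50_000   # decoder cycles per LDPC iteration (assumed)

def pick_operating_point(est_iterations, frame_deadline_s):
    cycles = est_iterations * CYCLES_PER_ITER
    f_min = cycles / frame_deadline_s      # slowest clock that meets the deadline
    for f, v in FREQ_VOLT:                 # lowest feasible (frequency, voltage) pair
        if f >= f_min:
            return f, v
    raise ValueError("deadline cannot be met even at the highest frequency")

# energy scales roughly with C * V^2 * cycles, so a lower voltage saves energy
f, v = pick_operating_point(est_iterations=12, frame_deadline_s=5e-3)
print(f / 1e6, "MHz at", v, "V")           # -> 200.0 MHz at 1.0 V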

PDF icon Pipelined Implementation of a Real Time Programmable Encoder for Low Density Parity Check Code on a Reconfigurable Instruction Cell Architecture [p. 349]
Z. Khan and T. Arslan

This paper presents a pipelined implementation of a real-time programmable irregular Low Density Parity Check (LDPC) encoder as specified in the IEEE P802.16E/D7 standard. The encoder is programmable for frame sizes from 576 to 2304 and for five different code rates. The H matrix is efficiently generated and stored for a particular frame size and code rate. The encoder is implemented on the Reconfigurable Instruction Cell Architecture, which has recently emerged as an ultra-low-power, high-performance, ANSI-C programmable embedded core. Different general and architecture-specific optimization techniques are applied to enhance the throughput. With this architecture, a throughput of 10 to 19 Mbps has been achieved. The maximum throughput achieved with pipelining/multi-core is 78 Mbps.


Interactive Presentation

PDF icon Implementation of AES/Rijndael on a Dynamically Reconfigurable Architecture [p. 355]
C. Mucci, L. Vanzolini, A. Lodi, A. Deledda, R. Guerrieri, F. Campi and M. Toma

Reconfigurable architectures provide the user the capability to couple the performance typical of hardware design with the flexibility of software. In this paper, we present the design of AES/Rijndael on a dynamically reconfigurable architecture. We show a performance improvement of three orders of magnitude compared to the reference code, and up to a 24x speed-up with respect to fast C implementations on a RISC processor. A maximum throughput of 546 Mbit/sec is achieved. Compared to prior art, we show better energy efficiency than other programmable solutions, obtaining up to 3 Mbit/sec/mW.


3.3: Testing NoCs

Moderators: Z. Peng, Linkoping U, SE; J. Raik, TU Tallinn, EE
PDF icon Using the Inter- and Intra-Switch Regularity in NoC Switch Testing [p. 361]
M. Hosseinabady, A. Dalirsani and Z. Navabi

This paper proposes an efficient methodology to test switches in a Network-on-Chip (NoC) architecture. A switch in an NoC consists of a number of ports and a router. Using the intra-switch regularity among the ports of a switch and the inter-switch regularity among the routers of different switches, the proposed method decreases the test application time and test data volume of NoC testing. Using a test source to generate test vectors and scan-based testing, the methodology broadcasts test vectors through the minimum spanning tree of the NoC and concurrently tests its switches. In addition, a possible fault is detected by comparing test results using inter- or intra-switch comparisons. The logic and memory parts of a switch are tested by appropriate memory and logic testing methods. Experimental results show reduced test application time and test power consumption compared with other methods in the literature.
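
For illustration, a minimal sketch of the spanning-tree part (Prim's algorithm over an assumed 2x2 mesh with unit link weights); the paper's test scheduling and comparison logic are not modeled:

import heapq

def minimum_spanning_tree(n, edges):
    # edges: list of (weight, u, v); returns the tree links (u, v)
    adj = [[] for _ in range(n)]
    for w, u, v in edges:
        adj[u].append((w, u, v))
        adj[v].append((w, v, u))
    visited, tree = {0}, []
    frontier = list(adj[0])
    heapq.heapify(frontier)
    while frontier and len(visited) < n:
        w, u, v = heapq.heappop(frontier)
        if v in visited:
            continue
        visited.add(v)
        tree.append((u, v))            # test vectors are broadcast along this link
        for e in adj[v]:
            heapq.heappush(frontier, e)
    return tree

# 2x2 mesh: switches 0-3 with unit-weight links (hypothetical topology)
mesh = [(1, 0, 1), (1, 0, 2), (1, 1, 3), (1, 2, 3)]
print(minimum_spanning_tree(4, mesh))  # -> [(0, 1), (0, 2), (1, 3)]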

PDF icon Toward a Scalable Test Methodology for 2D-mesh Network-on-Chips [p. 367]
K. Petersén and J. Öberg

This paper presents a BIST strategy for testing the NoC interconnect network and investigates whether the strategy is a suitable approach for the task. All switches and links in the NoC are tested with BIST, running at full clock speed and in a functional-like mode. The BIST is carried out as a go/no-go operation at start-up or on command. It is shown that the proposed methodology can be applied to different implementations of deflecting switches, and that the test time is limited to a few thousand clock cycles with fault coverage close to 100%.

PDF icon Remote Testing and Diagnosis of System-on-Chips Using Network Management Frameworks [p. 373]
O. Laouamri and C. Aktouf

This paper presents a new approach that allows remote testing and diagnosis of complex Systems-on-Chip (SoCs) and embedded IP cores. The approach extends both on-chip design-for-test (DFT) architectures and network management protocols to take full benefit of existing networking infrastructures. The efficiency of the proposed testing and diagnosis methodology is analyzed through intensive experimentation on the ITC'99 and ITC'02 design benchmarks.


3.4: Synthesis at System and Architectural Levels

Moderators: P. Pop, DTU, DK; S. Chakraborty, National U of Singapore, SG
PDF icon Fast Memory Footprint Estimation Based on Maximal Dependency Vector Calculation [p. 379]
Q. Hu, A. Vandecappelle, P.G. Kjeldsberg, F. Catthoor and M. Palkovic

In data-dominated applications, loop transformations have a huge impact on the lifetime of array data and therefore on memory footprint. Since a locally optimal loop transformation may have a detrimental effect somewhere else, many alternative loop transformations need to be explored. Estimation of the memory footprint is therefore essential, and this estimation has to be fast. This paper presents a fast array-based memory footprint estimation technique based on counting the iteration nodes in an iteration domain constrained by a maximal lifetime. The maximal lifetime is defined by the Maximal Dependency Vector (MDV) of the array for a given execution ordering. We further present, for the first time, two approaches for calculating the MDV: a general approach based on an ILP formulation, and a novel vertexes approach for iteration domains approximated by bounding boxes. Experiments on practical test vehicles demonstrate that estimation based on our vertexes approach is extremely fast, on average two orders of magnitude faster than the compared approaches, while still keeping the accuracy high. This enables system-level data memory footprint exploration of many alternative transformed program codes, within interactive time limits, on realistic complex applications.
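
A toy rendering of the counting idea, assuming a rectangular iteration domain and given dependency vectors (the paper's ILP and vertexes calculations are far more general):

from itertools import product

def footprint_estimate(bounds, dep_vectors):
    # bounds: loop trip counts; dep_vectors: (sink - source) iteration distances
    mdv = max(dep_vectors)             # tuples compare lexicographically in Python
    # count the iteration nodes executed within one MDV-long lifetime: each can
    # produce at most one simultaneously live element, bounding the footprint
    return sum(1 for p in product(*(range(b) for b in bounds)) if p <= mdv)

# two nested loops of 8 iterations each, two dependencies (all values assumed)
print(footprint_estimate((8, 8), [(0, 3), (1, 2)]))   # -> 11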

PDF icon Mapping Multi-Dimensional Signals into Hierarchical Memory Organizations [p. 385]
H. Zhu, I.I. Lucian and F. Balasa

The storage requirements of the array-dominated and loop-organized algorithmic specifications running on embedded systems can be significant. Employing a data memory space much larger than needed has negative consequences on energy consumption, latency, and chip area. Finding an optimized storage scheme for the usually large arrays in these algorithmic specifications is therefore an important step during memory allocation. This paper proposes an efficient algorithm for mapping multi-dimensional arrays to the data memory. Similarly to [13], it computes bounding windows for the live elements in the index space of arrays, but our algorithm is several times faster. Moreover, since the algorithm works not only for entire arrays but also for parts of arrays - for instance, array references or, more generally, sets of array elements represented by lattices [11] - this signal-to-memory mapping technique can also be applied in multi-layer memory hierarchies.

PDF icon The Impact of Loop Unrolling on Controller Delay in High Level Synthesis [p. 391]
S. Kurra, N.K. Singh and P.R. Panda

Loop unrolling is a well-known compiler optimization that can lead to significant performance improvements. When used in High Level Synthesis (HLS), unrolling can affect the controller complexity and delay. We study the effect of the loop unrolling factor on the delay of controllers generated during HLS. We propose a technique to predict controller delay as a function of the loop unrolling factor, and use this prediction, together with other search space pruning methods, to automatically determine the optimal loop unrolling factor whose resulting controller delay fits into a specified time budget, without an exhaustive exploration. Experimental results indicate that the delay predictions are close to measured delays, while being obtained significantly faster than through exhaustive synthesis.
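
The search structure can be sketched as follows; the delay predictor here is a made-up linear stand-in for the paper's model, but the loop shows how a predictor avoids synthesizing every unrolling:

def predict_controller_delay(unroll_factor, base_delay=2.0, slope=0.15):
    # hypothetical model: delay grows with controller state/branch complexity
    return base_delay + slope * unroll_factor

def best_unroll_factor(max_factor, budget):
    best = 1
    for u in range(1, max_factor + 1):
        if predict_controller_delay(u) <= budget:
            best = u                   # more unrolling, still within the budget
    return best

print(best_unroll_factor(max_factor=32, budget=3.5))   # -> 10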

PDF icon Clock-Frequency Assignment for Multiple Clock Domain Systems-on-a-Chip [p. 397]
S. Sirowy, Y. Wu, S. Lonardi and F. Vahid

Modern systems-on-a-chip platforms support multiple clock domains, in which different sub-circuits are driven by different clock signals. Although the frequency of each domain can be customized, the number of unique clock frequencies on a platform is typically limited. We define the clock-frequency assignment problem to be the assignment of frequencies to processing modules, each with an ideal maximum frequency, such that the sum of module processing times is minimized, subject to a limit on the number of unique frequencies. We develop a novel polynomial-time optimal algorithm to solve the problem, based on dynamic programming. We apply the algorithm to the particular context of post-improvement of accelerator-based hardware/software partitioning, and demonstrate 1.5x-4x additional speedups using just three clock domains.
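
A minimal sketch of one standard way to set up such a dynamic program (not necessarily the authors' exact formulation): sort modules by their ideal maximum frequency, split them into at most k contiguous groups, and run each group at the smallest maximum frequency it contains; cycle counts and frequencies below are hypothetical:

def assign_frequencies(modules, k):
    # modules: list of (cycles, f_max); a module may run at any f <= its f_max
    mods = sorted(modules, key=lambda m: m[1])
    n = len(mods)
    INF = float("inf")
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for b in range(1, n + 1):                  # first b modules assigned
        for j in range(1, k + 1):              # using j distinct frequencies
            for a in range(b):                 # last group is mods[a:b]
                if dp[a][j - 1] == INF:
                    continue
                f_group = mods[a][1]           # smallest f_max in the group
                t_group = sum(c for c, _ in mods[a:b]) / f_group
                dp[b][j] = min(dp[b][j], dp[a][j - 1] + t_group)
    return min(dp[n][1:k + 1])                 # best total processing time

mods = [(1e6, 100e6), (2e6, 150e6), (1e6, 400e6), (3e6, 450e6)]
print(assign_frequencies(mods, k=2))           # -> 0.04 (seconds)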


Interactive Presentations

PDF icon System-Level Process Variation Driven Throughput Analysis for Single and Multiple Voltage-Frequency Island Designs [p. 403]
S. Garg and D. Marculescu

Manufacturing process variations are the primary cause of timing yield loss in aggressively scaled technologies. In this paper, we analyze the impact of process variations on the throughput (rate) characteristics of embedded systems comprised of multiple voltage-frequency islands (VFIs), represented as component graphs. We provide an efficient yet accurate method to compute the throughput of an application in a probabilistic scenario, and show that systems implemented with multiple VFIs are more likely to meet throughput constraints than their fully synchronous counterparts. The proposed framework allows designers to investigate the impact of architectural decisions, such as the granularity of VFI partitioning, while determining the likelihood of a system meeting specified throughput constraints. An implementation of the proposed framework is accurate to within 1.2% of Monte Carlo simulation while yielding speedups ranging from 78X to 260X on a set of synthetic benchmarks. Results on a real benchmark (an MPEG-2 encoder) show that a nine-clock-domain implementation gives 100% yield for a throughput constraint for which a fully synchronous design yields only 25%; for the same constraint, a three-clock-domain architecture yields 78%.

PDF icon Reliability-Aware System Synthesis [p. 409]
M. Glass, M. Lukasiewycz, T. Streichert, C. Haubelt and J. Teich

Increasing reliability is one of the most important design goals for current and future embedded systems. In this paper, we focus on the design phase, in which reliability constitutes one of several competing design objectives. Existing approaches considered the simultaneous optimization of reliability with other objectives to be too costly. Hence, they first design a system, then analyze it for reliability, and finally exchange critical parts or introduce redundancy in order to satisfy given reliability constraints or to optimize reliability. Unfortunately, this may lead to designs that are suboptimal with respect to the other objectives. Here, we present a) a novel approach that considers reliability simultaneously with all other design objectives, b) an evaluation technique that is able to perform a quantitative analysis in reasonable time even for real-world applications, and c) experimental results showing the effectiveness of our approach.


3.5: Analogue and Mixed-Signal Design and Characterization

Moderators: A. Rodriguez-Vazquez, AnaFocus, ES; M. Glesner, TU Darmstadt, DE
PDF icon Flexibility-oriented Design Methodology for Reconfigurable Delta Sigma Modulators [p. 415]
P. Sun, Y. Wei and A. Doboli

This paper presents a systematic methodology for producing reconfigurable ΣΔ modulator topologies with optimized flexibility in meeting variable performance specifications. To increase their flexibility, topologies are optimized for performance attributes spanning ranges of values rather than single values. The topologies are implemented on switched-capacitor reconfigurable mixed-signal architectures. As the number of configurable blocks is very small, it is extremely important that the topologies use as few blocks as possible. A case study illustrates the methodology for specifications from the telecommunications area.

PDF icon Experimental Validation of a Tuning Algorithm for High-Speed Filters [p. 421]
G. Matarrese, C. Marzocca, F. Corsi, S. D'Amico and A. Baschirotto

We report the results of laboratory experiments performed to validate the effectiveness of a technique for the self-tuning of integrated continuous-time, high-speed active filters. The tuning algorithm is based on applying a pseudo-random input sequence of rectangular pulses to the device to be tuned and on evaluating a few samples of the input-output cross-correlation function, which constitute the filter signature. The key advantages of this technique are the ease of generating the input test pattern and the simplicity of the output circuitry, which consists of a digital cross-correlator. The technique achieves a tuning error dominated mainly by the value of the elementary capacitors employed in the tuning circuitry, and the time required to perform the tuning is kept within a few microseconds. This is particularly interesting for multi-standard telecommunication terminals. The experiments, which apply the proposed tuning algorithm to a baseband multi-standard filter, confirm most of the simulation results and show the robustness of the technique against practical operating conditions and noise.

PDF icon Design of High-Resolution MOSFET-Only Pipelined ADCs with Digital Calibration [p. 427]
H. Aminzadeh, M. Danaie and R. Lotfi

The design of low-voltage high-resolution MOSFET-only pipelined analog-to-digital converters (ADCs) is investigated in this work. The nonlinearity caused by replacing linear MIM capacitors with compensated depletion-mode MOS transistors in all 1.5-bit residue stages of the ADC is properly modeled so that it can be calibrated in the digital domain. The proposed calibration technique makes it possible to digitally compensate the nonlinearity of a 1.8V 12-bit 65MS/s MOSFET-only ADC in a 0.18μm standard digital CMOS technology, improving the signal-to-noise-plus-distortion ratio (SNDR) and spurious-free dynamic range (SFDR) by approximately 27dB and 35dB, respectively.

PDF icon A New Technique for Characterization of Digital-to-Analog Converters in High-Speed Systems [p. 433]
J. Savoj, A.-A. Abbasfar, A. Amirkhany, B. W. Garlepp and M. A. Horowitz

In this paper, a new technique for the characterization of digital-to-analog converters (DACs) used in wideband applications is described. Unlike the standard narrowband approach, this technique employs least-squares estimation to characterize the DAC from dc to any target frequency. Characterization is performed using a random sequence with temporal and probabilistic characteristics suited to the intended operating conditions. The technique provides a linear estimation of the system and decomposes the nonlinearity into higher-order harmonics and deterministic periodic noise. The technique can also be used to derive the impulse response of the converter, predict its operating bandwidth, and provide far more insight into its sources of distortion.
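
The least-squares core of such a characterization can be sketched with synthetic data as follows (the test-sequence design and distortion decomposition in the paper are more elaborate):

import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 2000).astype(float)        # random input sequence
h_true = np.array([0.6, 0.3, 0.1])                # "unknown" impulse response
y = np.convolve(x, h_true)[: len(x)]
y += 0.01 * y**2                                  # weak nonlinearity
y += 0.001 * rng.standard_normal(len(x))          # measurement noise

taps = 5
# build the convolution (Toeplitz-like) regression matrix, one column per tap
X = np.column_stack(
    [np.concatenate([np.zeros(k), x[: len(x) - k]]) for k in range(taps)]
)
h_est, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear least-squares fit
print(np.round(h_est, 3))                         # close to [0.6, 0.3, 0.1, 0, 0]
residual = y - X @ h_est                          # distortion plus noise terms
print(float(np.std(residual)))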


3.6: PANEL SESSION - Should You Trust the Surgeon or the Family Doctor?

Organizer: M. Casale-Rossi, Synopsys, Italy
Moderator: A. Strojwas, Carnegie Mellon U, US
PDF icon DFM/DFY: Should You Trust the Surgeon or the Family Doctor? [p. 439]
Panelists: R. Aitken, A. Domic, C. Guardiani, P. Magarshack, D. Pattullo, J. Sawicki

Everybody agrees that curing DFM/DFY issues is of paramount importance at 65 nanometers and beyond. Unfortunately, there is disagreement about how and when to cure them. "Surgeons" suggest a GDSII-centered approach, potentially invasive, while "family doctors" recommend a more pervasive approach, starting from RTL. As in real life, "surgery" and "medicine" represent two different schools of thought in the DFM/DFY arena, and both involve risks. This panel will examine the two approaches from high-level design all the way to manufacturing. We have assembled a set of panelists that represents a broad cross-section of the semiconductor industry. Although there is general agreement among the panelists that both approaches are necessary and that prevention is the best way to proceed, they also acknowledge that surgery may be unavoidable in such "hazardous" conditions as state-of-the-art technologies. However, as always, "the devil is in the details," and the diverse approaches to DFM presented below should make this panel quite interesting. We are also counting on feedback from the IC design community to assess whether these approaches are sufficient and practical enough to deal with the "health hazards". We are looking forward to an exciting discussion that will challenge our esteemed panelists.


3.7: Automatic Synthesis of Computation Intensive Application Specific Circuits

Moderators: F. Ferrandi, Politecnico di Milano, IT; T. Henriksson, NXP Semiconductors Research, NL
PDF icon Automatic Synthesis of Compressor Trees: Reevaluating Large Counters [p. 443]
A.K. Verma and P. Ienne

Despite the progress of the last decades in electronic design automation, arithmetic circuits have always received far less attention than other classes of digital circuits. Logic synthesisers, which play a fundamental role in design today, have only a minor role on most arithmetic circuits, performing some local optimisations but hardly improving the overall structure of arithmetic components. Architectural optimisations have often been studied manually, and ad-hoc techniques have been developed only for very common building blocks such as fast adders and multi-input adders. A notable case is multi-input addition, which is at the core of many circuits such as multipliers. The most common technique to implement multi-input addition is compressor trees, which are often composed of carry-save adders (based on (3:2) counters, i.e., full adders). A large body of literature exists on implementing compressor trees using large counters; however, all such large counters have been built from full and half adders recursively. In this paper we give some definite answers to issues related to the use of large counters. We present a general technique to implement large counters whose performance is much better than those composed of full and half adders. We also show that it is not always useful to use larger optimised counters; sometimes a combination of counters of various sizes gives the best performance. Our results show a 15% improvement in critical path delay, and in some cases hardware area is also reduced by using our counters.
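
As background for what the paper improves on, here is a minimal sketch of multi-input addition with a compressor tree built purely from (3:2) counters, i.e., full adders:

def full_adder(a, b, c):
    s = a ^ b ^ c                          # (3:2) counter: three bits in,
    carry = (a & b) | (a & c) | (b & c)    # sum and carry out
    return s, carry

def compressor_tree_sum(values, width=16):
    # columns[i] holds the bits of weight 2**i still to be summed
    columns = [[] for _ in range(width)]
    for v in values:
        for i in range(width):
            columns[i].append((v >> i) & 1)
    for i in range(width):
        while len(columns[i]) > 2:         # compress until <= 2 rows remain
            a, b, c = (columns[i].pop() for _ in range(3))
            s, carry = full_adder(a, b, c)
            columns[i].append(s)
            if i + 1 < width:
                columns[i + 1].append(carry)
    # final carry-propagate addition of the two remaining rows
    total = 0
    for i, col in enumerate(columns):
        total += sum(col) << i
    return total

print(compressor_tree_sum([13, 7, 22, 5, 9]))   # -> 56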

PDF icon Area Optimization of Multi-Cycle Operators in High-Level Synthesis [p. 449]
M.C. Molina, R. Ruiz-Sautua, J.M. Mendias and R. Hermida

Conventional high-level synthesis algorithms usually employ multi-cycle operators to reduce the cycle length in order to improve circuit performance. These operators need several cycles to execute one operation, but the entire functional unit is not in use during any single cycle. Additionally, the execution of operations on wider multi-cycle operators is unfeasible if their results must be available in fewer cycles than the functional unit delay. This forces new functional resources to be added to the datapath even if multi-cycle operators are idle when the execution of the operation begins. In this paper, a new design technique to overcome the restricted reusability of multi-cycle operators is presented. It reduces the area of these functional units by allowing their internal reuse while executing one operation. It also expands the possibilities of common hardware sharing, as it allows the partial use of multi-cycle operators to calculate narrower operations faster than the functional unit delay. This technique is applied as an optimization phase at the end of the high-level synthesis process and can optimize circuits synthesized by any high-level synthesis tool.

PDF icon Data-Flow Transformations Using Taylor Expansion Diagrams [p. 455]
M. Ciesielski, S. Askar, D. Gomez-Prado, J. Guillot and E. Boutillon

An original technique to transform a functional representation of a design into a structural representation in the form of a data flow graph (DFG) is described. A canonical, word-level data structure, the Taylor Expansion Diagram (TED), is used as a vehicle to effect this transformation. The problem is formulated as that of applying a sequence of decomposition cuts to a TED, transforming it into a DFG optimized for a particular objective. A systematic approach to arrive at such a decomposition is described. Experimental results show that the DFGs constructed in this way provide a better starting point for architectural synthesis than those extracted directly from HDL specifications.

PDF icon Automatic Application Specific Floating-point Unit Generation [p. 461]
Y.J. Chong and S. Parameswaran

This paper describes the creation of custom floating-point units (FPUs) for Application Specific Instruction Set Processors (ASIPs). ASIPs allow the customization of processors for use in embedded systems by extending the instruction set, which enhances the performance of an application or a class of applications. These extended instructions are manifested as separate hardware blocks, making the creation of any necessary floating-point instructions quite unwieldy. On the other hand, using a predefined FPU means including a large monolithic hardware block with a considerable number of unused instructions. A customized FPU overcomes these drawbacks, yet the manual creation of one is a time-consuming, error-prone process. This paper presents a methodology for automatically generating FPUs that are customized for specific applications at the instruction level. The generated FPUs comply with the IEEE 754 standard, which is an advantage over FP format customization. Custom FPUs were generated for several MediaBench applications. Area savings over a fully-featured FPU of 26%-80% without resource sharing and 33%-87% with resource sharing were obtained. The clock period increased in some cases by up to 9.5% due to resource sharing.


Interactive Presentation

PDF icon Time-Constrained Clustering for DSE of Clustered VLIW-ASP [p. 467]
M. Schölzel

In this paper we describe a new time-constrained clustering algorithm. It is coupled with a time-constrained scheduling algorithm and used for Design Space Exploration (DSE) of clustered VLIW processors with heterogeneous clusters and heterogeneous functional units. The algorithm enables us to reduce the complexity of the DSE, because the parameters of the VLIW are derived from the clustered schedule of the considered application, which is produced during a single compilation step; several compilations of the same application with different VLIW parameter settings are not necessary. The proposed algorithm is integrated into a DSE tool in order to explore the best parameters of a clustered VLIW processor for several basic blocks of signal processing applications. The obtained results are compared to the results of Lapinskii's work and show that, for most benchmarks, we are able to save ports in the register file of each cluster.


4.1: EMBEDDED TUTORIAL - Design, Verification and Test (Ubiquitous Communication and Computation Special Day)

Organizer/Moderator: P. Liuha, Nokia, FI
PDF icon Applications for Ubiquitous Computing and Communications [p. 473]

The session reviews some potential use cases and applications that use ubiquitous computing and communications. The new aspects of these applications and the basic design challenges and solutions will be addressed.


4.2: Automotive

Moderators: L. Fanucci, Pisa U, IT; J. Gerlach, Robert Bosch GmbH, DE
PDF icon Timing Simulation of Interconnected AUTOSAR Software-Components [p. 474]
M. Krause, O. Bringmann, A. Hergenhan, G. Tabanoglu and W. Rosenstiel

AUTOSAR is a recent specification initiative that applies a methodology in the style of model-driven architecture to automotive applications. However, the necessary engineering steps, i.e. how to get from a logical architecture to a technical architecture and its implementation, are not yet well supported by tools. In contrast, SystemC offers a comprehensive way to simulate, analyze, and verify software, and it is even able to take the timing behavior of the underlying hardware and communication paths into account. Even at first glance, there are many similarities in modeling structure between the two concepts. This paper therefore discusses approaches for using SystemC during the design process of AUTOSAR-conformant systems.

PDF icon FPGA-based Networking Systems for High Data-rate and Reliable In-vehicle Communications [p. 480]
S. Saponara, E. Petri, M. Tonarelli, I. Del Corona and L. Fanucci

The amount of electronics introduced in vehicles is continuously increasing: X-by-wire, complex electronic control systems and, above all, future applications such as automotive vision and safety warnings require reliable in-car communication backbones capable of handling large amounts of data at high speed. To cope with this issue, and driven by experience with aerospace systems, the SpaceWire standard, recently proposed by the European Space Agency (ESA), can be introduced into the automotive field. SpaceWire is a serial data-link standard that provides safety and redundancy and handles data rates of up to hundreds of Mbps. This paper presents the design of configurable SpaceWire router and interface hardware macrocells, the first in the state of the art compliant with the newest standard extensions, Protocol Identification (PID) and Remote Memory Access Protocol (RMAP). The macrocells have been integrated and tested on antifuse technology in the framework of an ESA project. The achieved performance of a router with 8 links, a 130 Mbps data rate, and a 1.5 W power cost meets the requirements of future automotive electronic systems. The proposed networking solution simplifies connectivity, reduces the associated volume and mass budgets, provides network safety and redundancy, and handles very high-bandwidth data flows not covered by current standards such as CAN or FlexRay.

PDF icon Low-g Accelerometer Fast Prototyping for Automotive Applications [p. 486]
F. D'Ascoli, F. Iozzi, C. Marino, M. Melani, M. Tonarelli, L. Fanucci, A. Giambastiani, A. Rocchi and M. De Marinis

This paper presents an application of the ISIF chip (Intelligent Sensor InterFace) to the conditioning of a dual-axis low-g accelerometer in MEMS technology. MEMS devices are nowadays the standard in automotive applications (and beyond), as they offer a drastic reduction in cost, area and power, while requiring a more complex electronic interface than traditional discrete devices. ISIF is a Platform-on-Chip implementation aimed at the fast prototyping of a wide range of automotive sensors thanks to its rich configuration resources, provided both by trimming options for the analog/digital IPs and by flexible routing structures. This accelerometer implementation exploits a relevant part of the ISIF hardware resources, but also requires signal processing add-ins (software emulation of digital DSP blocks) for the closed-loop conditioning architecture and for performance improvement (for example, temperature drift compensation). In spite of the short prototyping time, the resulting system achieves good performance with respect to commercial devices, featuring a 0.9 mg/√Hz noise density with 1024 LSB/g sensitivity on the digital output over a +/- 2g full scale, an offset drift within 30 mg over a 100°C range, and a sensitivity drift of 2% of full scale. Miniboards have been developed as product prototypes, consisting of a small PCB with the ISIF and accelerometer dies bonded together, firmware embedded in EEPROM, and communication transceivers.

PDF icon Using an Innovative SOC-level FMEA Methodology to Design in Compliance with IEC61508 [p. 492]
R. Mariani, G. Boschi and F. Colucci

This paper proposes an innovative methodology to perform and validate a Failure Mode and Effects Analysis (FMEA) at System-on-Chip (SoC) level. This is done in compliance with IEC 61508, an international standard for the functional safety of electronic safety-related systems, of which an overview is given in the paper. The methodology is based on a theory for decomposing a digital circuit into "sensible zones" and a tool that automatically extracts these zones from the RTL description. It also includes a spreadsheet to compute the metrics required by the IEC standard, such as Diagnostic Coverage and Safe Failure Fraction. The FMEA results are validated using another tool suite that includes a fault injection environment. The paper explains how to benefit from the information provided by this approach and, as an example, describes how the methodology has been applied to design memory sub-systems for fault-robust microcontrollers in automotive applications. This methodology has been approved by TÜV-SÜD as the flow to assess and validate the Safe Failure Fraction of a given SoC in adherence to IEC 61508.

PDF icon Using Partial-Run-Time Reconfigurable Hardware to Accelerate Video Processing in Driver Assistance Systems [p. 498]
C. Claus, J. Zeppenfeld, F. Müller and W. Stechele

In this paper we show a reconfigurable hardware architecture for the acceleration of video-based driver assistance applications in future automotive systems. The concept is based on a separation of pixel-level operations and high level application code. Pixel-level operations are accelerated by coprocessors, whereas high level application code is implemented fully programmable on standard PowerPC CPU cores to allow flexibility for new algorithms. In addition, the application code is able to dynamically reconfigure the coprocessors available on the system, allowing for a much larger set of hardware accelerated functionality than would normally fit onto a device. This process makes use of the partial dynamic reconfiguration capabilities of Xilinx Virtex FPGAs.


Interactive Presentation

PDF icon Towards a Methodology for the Quantitative Evaluation of Automotive Architectures [p. 504]
P. Popp, M. Di Natale, P. Giusto, S. Kanajan and C. Pinello

Architecture design is a critical stage of the Electronics/Controls/Software (ECS)-based vehicle design flow. Traditional approaches relying on component-level design and analysis are no longer effective, as they do not always allow for the quantitative evaluation of properties arising from the composition of subsystems. This paper presents a system-level architecture design methodology that is supported by tools and methods for the quantitative evaluation of key metrics of interest related to timing, dependability and cost. An example of its application to a by-wire system case study is presented, and the challenges faced in applying it in the context of an actual development process are discussed.


4.3: Test Generation for Diagnosis, Scan Testing and Advanced Memory Fault Models

Moderators: H. Obermeir, Infineon Technologies AG, DE; B. Straube, FhG IIS/EAS Dresden, DE
PDF icon Dynamic Learning Based Scan Chain Diagnosis [p. 510]
Y. Huang

Scan chain defect diagnosis is important for silicon debug and yield enhancement. Traditional simulation-based chain diagnosis algorithms may take a long run time if a large number of simulations is required. In this paper, a novel dynamic-learning-based scan chain diagnosis method is proposed to speed up the diagnosis. Experimental results illustrate that, by using the proposed dynamic learning techniques, the diagnosis run time can be reduced by about 10X on average.

PDF icon Diagnosis, Modeling and Tolerance of Scan Chain Hold-Time Violations [p. 516]
O. Sinanoglu and P. Schremmer

Errors in the timing closure process during the physical design stage may result in systematic silicon failures, such as scan chain hold-time violations, which prohibit the testing of manufactured chips. In this paper, we propose a set of techniques that enable the accurate pinpointing of hold-time-violating scan cells, as well as their modeling and tolerance, paving the way for the generation of valid test data that can be used to test chips with such systematic failures. The process yield is thus restored, as chips that are functional in mission mode can still be identified and shipped, despite the existence of scan chain hold-time failures. The proposed techniques are non-intrusive, as they utilize only basic scan capabilities and thus impose no design changes. Scan cells with hold-time violations can be identified with the maximum possible resolution, enabling the incorporation of the associated impact during the ATPG process and thus the generation of valid test data for chips with such systematic failures.

PDF icon On Test Generation by Input Cube Avoidance [p. 522]
I. Pomeranz and S.M. Reddy

Test generation procedures attempt to assign values to the inputs of a circuit so as to detect target faults. We study a complementary view whereby the goal is to identify values that should not be assigned to inputs in order not to prevent faults from being detected. We describe a procedure for computing input cubes (or incompletely specified input vectors) that should be avoided during test generation for target faults. We demonstrate that avoiding such input cubes leads to the detection of target faults after the application of limited numbers of random input vectors. This indicates that explicit test generation is not necessary once certain input values are precluded. Potential uses of the computed input cubes are in a test generation procedure to reduce the search space, and during built-in test generation to preclude input vectors that will not lead to the detection of target faults.
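
A minimal sketch of the avoidance idea with hypothetical cubes: vectors covered by any "forbidden" input cube are simply rejected and regenerated:

import random

# a cube maps input index -> required value; a vector is covered by the cube
# if it matches the cube on every specified position
forbidden = [{0: 1, 3: 0}, {1: 0, 2: 0}]     # assumed fault-blocking cubes

def covered(vec, cube):
    return all(vec[i] == v for i, v in cube.items())

def random_vector_avoiding(n_inputs, cubes, rng=random.Random(1)):
    # rejection sampling: keep drawing until no forbidden cube covers the vector
    while True:
        vec = [rng.randint(0, 1) for _ in range(n_inputs)]
        if not any(covered(vec, c) for c in cubes):
            return vec

for _ in range(3):
    print(random_vector_avoiding(5, forbidden))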

PDF icon Slow Write Driver Faults in 65nm SRAM Technology: Analysis and March Test Solution [p. 528]
A. Ney, P. Girard, C. Landrault, S. Pravossoudovitch, A. Virazel and M. Bastian

This paper presents an analysis of the electrical origins of Slow Write Driver Faults (SWDFs) [1] that may affect SRAM write drivers in 65nm technology. This type of fault is the consequence of resistive-open defects in the control part of the write driver, and it involves an erroneous write operation when the same write driver performs two successive write operations with opposite data values. In the first part of the paper, we present the SWDF electrical phenomena and their consequences on SRAM operation. Next, we show how SWDFs can be sensitized and observed, and how a standard March test is able to detect this type of fault.


Interactive Presentations

PDF icon On Power-profiling and Pattern Generation for Power-safe Scan Tests [p. 534]
V.R. Devanathan, C.P. Ravikumar and V. Kamakoti

With the increasing use of low-cost wire-bond packages for mobile devices, excessive dynamic IR-drop may cause tests to fail on the tester. Identifying and debugging such scan test failures is a complex and effort-intensive process; a better solution is to generate correct-by-construction "power-safe" patterns. Moreover, with glitch power contributing a significant component of dynamic power, pattern generation needs to be timing-aware to minimize glitching. In this paper, we propose a timing-based, power- and layout-aware pattern generation technique that minimizes both global and localized switching activity. Techniques are also proposed for power-profiling and for optimizing an initial pattern set into a power-safe pattern set with the addition of minimal patterns. The proposed technique also accommodates irregular power-grid topologies when constraining localized switching activity. Experiments on ISCAS benchmark circuits reveal the effectiveness of the proposed scheme.

PDF icon Automatic Test Pattern Generation for Maximal Circuit Noise in Multiple Aggressor Crosstalk Faults [p. 540]
K.P. Ganeshpure and S. Kundu

Decreasing process geometries and increasing operating frequencies have made VLSI circuits more susceptible to signal-integrity-related failures, and capacitive crosstalk is one cause of such failures. A crosstalk fault results from the switching of capacitively coupled neighboring lines. Long nets are more susceptible to crosstalk faults because they tend to have a higher ratio of coupling capacitance to overall capacitance, and a typical long net has multiple aggressors. In generating patterns to create maximal crosstalk noise, it may not be possible to activate all aggressors at the same time; pattern generation must therefore focus on activating a maximal subset of aggressors weighted by the actual coupling capacitance values. This is a variant of the max-satisfiability problem, except that here we must also deal with signal propagation to an observable output. In this paper, we present a novel solution that combines a 0-1 Integer Linear Program (ILP) with traditional stuck-at fault ATPG: the maximal aggressor activation is formulated as a linear programming problem, while the fault effect propagation is treated as an ATPG problem. The two problems are separated by a min-cut circuit partitioning technique based on the Kernighan-Lin-Fiduccia-Mattheyses (KLFM) method. The proposed technique was applied to the ISCAS 85 benchmark circuits. Results indicate that 75-100% of the aggressors could be switched to generate crosstalk noise while satisfying the requirement of sensitizing a path to the output.


4.4: Future Design Challenges

Moderators: V. Narayanan, Penn State U, US; C. Guiducci, Bologna U, IT
PDF icon Temperature-aware NBTI Modeling and the Impact of Input Vector Control on Performance Degradation [p. 546]
Y. Wang, H. Luo, K. He, R. Luo, H. Yang and Y. Xie

As technology scales, Negative Bias Temperature Instability (NBTI), which causes temporal performance degradation in digital circuits by affecting the PMOS threshold voltage, is emerging as one of the major circuit reliability concerns. In this paper, we first investigate the impact of NBTI on PMOS devices and propose a novel temporal performance degradation model for digital circuits that considers the temperature difference between active and standby mode. For the first time, the impact of input vector control (used to minimize standby leakage) on NBTI is investigated. Minimum leakage vectors that lead to minimum circuit performance degradation while retaining the maximum leakage reduction rate are selected and used during standby mode. Furthermore, the potential to reduce circuit performance degradation through internal node control techniques during standby mode is discussed. Our simulation results show that: 1) the active-to-standby time ratio and the standby-mode temperature have a considerable impact on circuit performance degradation; and 2) the NBTI-aware IVC technique leads to an average 3% reduction in total circuit degradation, while internal node control can potentially yield a 10% reduction.
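
For reference, a commonly used power-law form of NBTI degradation (the paper's model differs, notably in handling the active/standby temperature difference; every constant below is purely illustrative):

import math

def delta_vth(t_stress_s, temp_k, a=3.0e-3, ea_ev=0.1, n=0.25):
    # threshold-voltage shift grows with stress time and temperature;
    # a, ea_ev and n are illustrative fitting constants, not measured values
    k_ev = 8.617e-5                       # Boltzmann constant in eV/K
    return a * math.exp(-ea_ev / (k_ev * temp_k)) * t_stress_s ** n

# a hotter active mode degrades the device faster than a cooler standby mode
print(delta_vth(3.15e7, 378))             # ~one year of stress at 105 C
print(delta_vth(3.15e7, 318))             # ~one year of stress at 45 C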

PDF icon A Cross-Referencing-Based Droplet Manipulation Method for High-Throughput and Pin-Constrained Digital Microfluidic Arrays [p. 552]
T. Xu and K. Chakrabarty

Digital microfluidic biochips are revolutionizing high-throughput DNA sequencing, immunoassays, and clinical diagnostics. As high-throughput bioassays are mapped to digital microfluidic platforms, the need for design automation techniques for pin-constrained biochips is increasingly felt. However, most prior work on biochip CAD has assumed independent control of the underlying electrodes using a large number of (electrical) input pins. We propose a droplet manipulation method based on a "cross-referencing" addressing scheme that uses "rows" and "columns" to access electrodes. By mapping the droplet movement problem to the clique partitioning problem from graph theory, the proposed method allows simultaneous movement of a large number of droplets on a microfluidic array. This in turn facilitates high-throughput applications on pin-constrained biochips. We use random synthetic benchmarks and a set of multiplexed bioassays to evaluate the proposed method.
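
A minimal sketch of the graph side of this mapping, with a toy compatibility relation standing in for the paper's electrode-addressing constraints; a simple greedy clique partition groups droplets that may move in the same time step:

def greedy_clique_partition(nodes, compatible):
    # compatible: set of frozensets {u, v} that may share a move step
    cliques = []
    for v in nodes:
        for clique in cliques:
            if all(frozenset((v, u)) in compatible for u in clique):
                clique.append(v)       # v fits every member: same time step
                break
        else:
            cliques.append([v])        # start a new time step for v
    return cliques

droplets = ["d1", "d2", "d3", "d4"]
compatible = {frozenset(p) for p in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]}
print(greedy_clique_partition(droplets, compatible))
# -> [['d1', 'd2', 'd3'], ['d4']]: three droplets move together, d4 waits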

PDF icon Reversible Circuit Technology Mapping from Non-reversible Specifications [p. 558]
Z. Zilic, K. Radecka and A. Khazamiphur

This paper considers the synthesis of reversible circuits directly from an irreversible specification, with no need to produce a reversible embedding first. We present a feasible methodology for realizing networks of reversible gates in a manner that builds on classical technology mapping. We do not confine ourselves to the restricted notion of realizing permutation functions, and we construct reversible implementations where extraneous signals are efficiently reused to overcome the inherent fanout limitation.

PDF icon Distributed Power-Management Techniques for Wireless Network Video Systems [p. 564]
N. H. Zamora, J.-C. Kao and R. Marculescu

Wireless sensor networks operating on limited energy resources need to be power efficient to extend the system lifetime. This is especially challenging for video sensor networks, which must process large volumes of data in short periods of time. To this end, this paper proposes two coordinated power management policies for video sensor networks. These policies are scalable as the system grows and flexible with respect to video parameters and network characteristics. In addition to simulation results, our prototype demonstrates the feasibility of implementing these policies. Finally, the analytical framework we provide gives an upper bound on the achievable sleep fraction and insight into how adjusting selected parameters will affect the performance of the power management policies.


Interactive Presentations

PDF icon Improving the Fault Tolerance of Nanometric PLA Designs [p. 570]
F. Angiolini, M.H. Ben Jamaa, D. Atienza, L. Benini, and G. De Micheli

Several alternative building blocks have been proposed to replace planar transistors, among which a prominent spot belongs to nanometric filaments such as Silicon NanoWires (SiNWs) and Carbon NanoTubes (CNTs). However, chips leveraging these nanoscale structures are expected to be affected by a large number of manufacturing faults, far beyond what chip architects have learned to counter. In this paper, we present a design flow, based on software mapping algorithms, that improves the yield of nanometric Programmable Logic Arrays (PLAs). While further improvements to the manufacturing technology will be needed to make these devices fully usable, our flow can significantly shrink the gap between current and desired yield levels. Moreover, our approach does not need post-fabrication functional analysis and mapping, thereby dramatically cutting verification costs. We check PLA yields by means of an accurate analyzer after Monte Carlo fault injection. We show that, compared to a baseline policy of wire replication, we achieve equal or better yields (by 8% over a set of designs), depending on the underlying defect assumptions.

PDF icon Techniques for Designing Noise-Tolerant Multi-Level Combinational Circuits [p. 576]
K. Nepal, R.I. Bahar, J. Mundy, W.R. Patterson and A. Zaslavsky

As CMOS technology downscales, higher noise levels, wider threshold variation, and low supply voltages will force designers to contend with high rates of soft logic errors and many defective devices. A probabilistic design framework based on Markov random fields (MRF) has previously been proposed to address the dynamic fault and noise vulnerability of ultimate digital CMOS circuitry. The idea is to use additional transistors and feedback loops to achieve significant noise immunity and ensure correct logic operation at low VDD. However, the extra reliability achieved in previously published work came at the cost of high transistor counts. In this paper, we present techniques to reduce the transistor count of larger multi-level combinational circuits built within the MRF framework by using variable sharing, implied dependence and supergates. Using these techniques, we show an average reduction of approximately 28% in transistor count over a range of combinational benchmark circuits built within the MRF framework, compared to the best previously published results.


4.5: Application-Specific Architectures

Moderators: T. Austin, U of Michigan, US; B. Calder, Microsoft, US
PDF icon An Efficient Code Compression Technique Using Application-Aware Bitmask and Dictionary Selection Methods [p. 582]
S.-W. Seong and P. Mishra

Memory plays a crucial role in designing embedded systems. A larger memory can accommodate more and larger applications, but it increases cost, area, and energy requirements. Code compression techniques address this problem by reducing the size of applications. While early work on bitmask-based compression proposed several promising ideas, many challenges remain in applying them to embedded system design. This paper makes two important contributions to address these challenges by developing application-specific bitmask selection and bitmask-aware dictionary selection techniques. We applied these techniques to code compression of TI and MediaBench applications to demonstrate the usefulness of our approach.
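
A much-simplified sketch of the bitmask-plus-dictionary idea (the dictionary, field widths, and matching rule are assumptions, not the paper's selection algorithms):

DICT = [0xDEADBEEF, 0x00000000, 0x12345678]      # hypothetical dictionary

def compress_word(word, mask_bits=8):
    for idx, entry in enumerate(DICT):
        diff = word ^ entry
        if diff == 0:
            return ("dict", idx)                 # exact dictionary hit
        # bitmask hit: all differing bits fall in one mask_bits-wide window
        low = (diff & -diff).bit_length() - 1    # position of lowest set bit
        if diff >> low < (1 << mask_bits):
            return ("mask", idx, low, diff >> low)
    return ("raw", word)                         # fall back to uncompressed

def decompress(token):
    kind = token[0]
    if kind == "dict":
        return DICT[token[1]]
    if kind == "mask":
        _, idx, pos, mask = token
        return DICT[idx] ^ (mask << pos)
    return token[1]

for w in (0xDEADBEEF, 0xDEADBE0F, 0xCAFEBABE):
    t = compress_word(w)
    assert decompress(t) == w                    # compression is lossless
    print(hex(w), "->", t)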

PDF icon Optimizing Instruction-set Extensible Processors under Data Bandwidth Constraints [p. 588]
K. Atasu, R.G. Dimond, O. Mencer, W. Luk, C. Özturan and G. Dündar

We present a methodology for generating optimized architectures for data-bandwidth-constrained extensible processors. We describe a scalable Integer Linear Programming (ILP) formulation that extracts the most profitable set of instruction-set extensions given the available data bandwidth and transfer latency. Unlike previous approaches, we differentiate between the number of inputs and outputs of instruction-set extensions and the number of register file ports. This differentiation makes our approach applicable to architectures that include architecturally visible state registers and dedicated data transfer channels. We support a comprehensive design space exploration to characterize the area/performance trade-offs for various applications. We evaluate our approach using actual ASIC implementations to demonstrate that our automatically customized processors meet timing within the target silicon area. For an embedded processor with only two register read ports and one register write port, we obtain up to a 4.3x speed-up with extensions incurring only a 35% area overhead.

PDF icon Resource Prediction for Media Stream Decoding [p. 594]
J. Hamers and L. Eeckhout

Resource prediction refers to predicting the compute power and energy resources required for consuming a service on a device. It is extremely useful in a client-server setup where the client requests a media service from a server or content provider: the content provider (in cooperation with the client) can then determine what service quality to deliver given the client's available resources. This paper proposes a practical approach to predicting resources for decoding media streams. The idea is to group frames with similar decode complexity from the various media streams in the content provider's database into so-called scenarios. Client profiling using scenario representatives characterizes the client's computational power. This enables the content provider to predict decode time, decode energy and quality of service for a media stream of interest once it is deployed on the client.

PDF icon Register Pointer Architecture for Efficient Embedded Processors [p. 600]
J.S. Park, S.-B. Park, J.D. Balfour, D. Black-Schaffer, C. Kozyrakis and W.J. Dally

Conventional register file architectures cannot optimally exploit temporal locality in data references due to their limited capacity and static encoding of register addresses in instructions. In conventional embedded architectures, the register file capacity cannot be increased without resorting to longer instruction words. Similarly, loop unrolling is often required to exploit locality in the register file accesses across iterations because naming registers statically is inflexible. Both optimizations lead to significant code size increases, which is undesirable in embedded systems. In this paper, we introduce the Register Pointer Architecture (RPA), which allows registers to be accessed indirectly through register pointers. Indirection allows a larger register file to be used without increasing the length of instruction words. Additional register file capacity allows many loads and stores, such as those introduced by spill code, to be eliminated, which improves performance and reduces energy consumption. Moreover, indirection affords additional flexibility in naming registers, which reduces the need to apply loop unrolling in order to maximize reuse of register allocated variables.
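
The indirection idea can be illustrated with a toy machine (an invented ISA, not the paper's): one static instruction sweeps a register-allocated array because the register pointers, not the instruction bits, select the registers:

class RPAMachine:
    def __init__(self, n_regs=64):
        self.regs = [0] * n_regs        # large register file
        self.ptr = [0, 0]               # two register pointers, p0 and p1

    def set_ptr(self, p, value):
        self.ptr[p] = value

    def add_indirect(self, pd, ps, step=1):
        # regs[ptr[pd]] += regs[ptr[ps]], then post-increment both pointers,
        # so a loop can sweep register-allocated arrays without unrolling
        self.regs[self.ptr[pd]] += self.regs[self.ptr[ps]]
        self.ptr[pd] += step
        self.ptr[ps] += step

m = RPAMachine()
m.regs[8:12] = [1, 2, 3, 4]             # array kept entirely in registers
m.set_ptr(0, 16)                        # destination window
m.set_ptr(1, 8)                         # source window
for _ in range(4):
    m.add_indirect(0, 1)                # the same instruction each iteration
print(m.regs[16:20])                    # -> [1, 2, 3, 4]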


Interactive Presentations

PDF icon Feasibility of Combined Area and Performance Optimization for Superscalar Processors Using Random Search [p. 606]
S. Van Haastregt and P.M.W. Knijnenburg

When designing embedded systems, one needs to make decisions concerning the different components that will be included in a microprocessor. An important issue is the trade-off between chip area and performance. In this paper we investigate the relationship between chip area and performance for superscalar microprocessors, and the feasibility of obtaining a suitable configuration by search. We show that a simple random search algorithm finds a good configuration after 100 to 150 iterations. This demonstrates the feasibility of our approach, particularly once more sophisticated search algorithms are employed, as we plan in future work.
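
The search itself is straightforward; a minimal sketch with a made-up cost model and parameter space (the paper scores configurations by simulation):

import random

SPACE = {"issue_width": [1, 2, 4], "l1_kb": [8, 16, 32, 64], "rob": [32, 64, 128]}

def cost(cfg):
    # toy stand-in for simulation: smaller is better (area-delay style metric)
    delay = 10.0 / cfg["issue_width"] + 64.0 / cfg["l1_kb"] + 100.0 / cfg["rob"]
    area = cfg["issue_width"] * 2.0 + cfg["l1_kb"] * 0.1 + cfg["rob"] * 0.05
    return delay * area

rng = random.Random(42)
best = min(
    ({k: rng.choice(v) for k, v in SPACE.items()} for _ in range(150)),
    key=cost,
)
print(best, round(cost(best), 2))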

PDF icon A Decoupled Architecture of Processors with Scratch-Pad Memory Hierarchy [p. 612]
A. Milidonis, N. Alachiotis, V. Porpodas, H. Michail, A.P. Kakarountas and C.E. Goutis

We present a decoupled processor architecture with a memory hierarchy consisting only of scratch-pad memories and a main memory. The decoupled architecture also exploits the parallelism between address computation and the processing of application data. The application code is split into two programs: the first computes the addresses of the data in the memory hierarchy, and the second processes the application data. The first program is executed by one of the decoupled processors, called Access, which uses compiler methods for placing data in the memory hierarchy; in parallel, the second program is executed by the other processor, called Execute. The synchronization of the memory hierarchy and the Execute processor is achieved through a simple handshake protocol. The Access processor requires strong communication with the memory hierarchy, which strongly differentiates it from traditional uniprocessors. The architecture is compared in performance with the MIPS IV architecture of SimpleScalar and with existing decoupled architectures, showing higher normalized performance. Experimental results show that performance is increased by up to 3.7 times, and that, compared with MIPS IV, the proposed architecture achieves this with insignificant overheads in terms of area.


4.6: Technology and Process Aware Low Power Circuit Design

Moderators: A.J. Acosta, Seville U/IMSE, ES; B.C. Paul, Toshiba, US
PDF icon An Algorithm to Minimize Leakage through Simultaneous Input Vector Control and Circuit Modification [p. 618]
N. Jayakumar and S.P. Khatri

Leakage power currently comprises a large fraction of the total power consumption of an IC. Techniques to minimize leakage have been researched widely. In this paper, we present an approach which minimizes leakage by modifying the circuit while simultaneously deriving the input vector that minimizes leakage. In our approach, we selectively modify a gate so that its output (in sleep mode) is in a state which helps minimize the leakage of other gates in its transitive fanout. Gate replacement is performed in a slack-aware manner to minimize the resulting delay penalty.
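
The input-vector-control half of the idea can be sketched as follows (the three-gate netlist and the state-dependent leakage numbers are made up; the paper additionally modifies gates and works with real library leakage data):

    from itertools import product

    # Tiny netlist: net -> (function, input nets); a, b, c are primary inputs.
    NETLIST = {
        "n1":  (lambda a, b: not (a and b), ("a", "b")),    # NAND
        "n2":  (lambda b, c: not (b or c),  ("b", "c")),    # NOR
        "out": (lambda x, y: not (x and y), ("n1", "n2")),  # NAND
    }

    def leakage(gate_inputs):
        """Hypothetical state-dependent leakage: more 1-inputs leak more."""
        return 1.0 + sum(gate_inputs)      # stand-in for library leakage data

    def total_leakage(vec):
        vals = dict(zip("abc", vec))
        total = 0.0
        for net, (fn, ins) in NETLIST.items():
            inputs = [vals[i] for i in ins]
            vals[net] = fn(*inputs)        # propagate the sleep-vector values
            total += leakage(inputs)
        return total

    best = min(product([0, 1], repeat=3), key=total_leakage)
    print("min-leakage sleep vector:", best, "leakage:", total_leakage(best))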

PDF icon Understanding Voltage Variations in Chip Multiprocessors Using a Distributed Power-Delivery Network [p. 624]
M.S. Gupta, J.L. Oatley, R. Joseph, G.-Y. Wei and D.M. Brooks

Recent efforts to address microprocessor power dissipation through aggressive supply voltage scaling and power management require that designers be increasingly cognizant of power supply variations. These variations, primarily due to fast changes in supply current, can be attributed to architectural gating events that reduce power dissipation. In order to study this problem, we propose a fine-grain, parameterizable model for power-delivery networks that allows system designers to study localized, on-chip supply fluctuations in high-performance microprocessors. Using this model, we analyze voltage variations in the context of next-generation chip-multiprocessor (CMP) architectures using both real applications and synthetic current traces. We find that the activity of distinct cores in CMPs presents several new design challenges when considering power supply noise, and we describe potentially problematic activity sequences that are unique to CMP architectures.

PDF icon Process Variation Tolerant Low Power DCT Architecture [p. 630]
N. Banerjee, G. Karakonstantis and K. Roy

2-D Discrete Cosine Transform (DCT) is widely used as the core of digital image and video compression. In this paper, we present a novel DCT architecture that allows aggressive voltage scaling by exploiting the fact that not all intermediate computations are equally important in a DCT system for obtaining "good" image quality with Peak Signal to Noise Ratio (PSNR) > 30 dB. This observation has led us to propose a DCT architecture where the signal paths that contribute less to PSNR improvement are designed to be longer than the more contributive paths. It should also be noted that robustness with respect to parameter variations and low power operation typically impose contradictory requirements on architecture design. However, the proposed architecture lends itself to aggressive voltage scaling for low-power dissipation even under process parameter variations. Under a scaled supply voltage and/or variations in process parameters, any possible delay errors appear only in the long paths that contribute less to PSNR improvement, providing a large improvement in power dissipation with only a small PSNR degradation. Results show that even under large process variation and supply voltage scaling (0.8 V), there is a gradual degradation of image quality with considerable power savings (62.8%) for the proposed architecture when compared to existing implementations in a 70 nm process technology.


Interactive Presentation

PDF icon Statistical Dual-Vdd Assignment for FPGA Interconnect Power Reduction [p. 636]
Y. Lin and L. He

Field-programmable dual-Vdd interconnects are effective in reducing FPGA power. However, deterministic Vdd assignment leverages timing slack exhaustively and significantly increases the number of near-critical paths, which results in a degraded timing yield under process variation. In this paper, we present two statistical Vdd assignment algorithms. The first, a greedy algorithm, is based on sensitivity, while the second is based on timing slack budgeting. Both minimize chip-level interconnect power without degrading timing yield. Evaluated with MCNC circuits, the statistical algorithms reduce interconnect power by 40% compared to the single-Vdd FPGA with power gating. In contrast, the deterministic algorithm reduces interconnect power by 51% but degrades timing yield from 97.7% to 87.5%.
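
The slack-budgeting flavor of the assignment can be sketched as a greedy loop; the segment data, the single lumped slack budget, and the deterministic timing test below are simplifications, whereas the paper performs statistical, variation-aware timing on real FPGA routing graphs:

    segments = [   # (name, power saved at VddL, delay penalty at VddL)
        ("s1", 5.0, 0.2), ("s2", 4.0, 0.6), ("s3", 3.0, 0.1), ("s4", 2.5, 0.5),
    ]
    slack_budget = 0.8     # total extra delay the near-critical paths can absorb

    # Sensitivity-style greedy: best power saving per unit of consumed slack first.
    segments.sort(key=lambda s: s[1] / s[2], reverse=True)

    assigned, saved = [], 0.0
    for name, dpower, ddelay in segments:
        if ddelay <= slack_budget:         # preserve timing yield: stay in budget
            assigned.append(name)
            saved += dpower
            slack_budget -= ddelay
    print("low-Vdd segments:", assigned, "power saved:", saved)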


4.7: Hardware Implementation of MPSoCs and NoCs Architectures

Moderators: K. Goossens, NXP Semiconductors Research, NL; B. Candaele, Thales Communications, FR
PDF icon Hardware Scheduling Support in SMP Architectures [p. 642]
A.C. Nácul, F. Regazzoni and M. Lajolo

In this paper we propose a hardware real-time operating system (HW-RTOS) that implements the OS layer in a dual-processor SMP architecture. Intertask communication is specified by means of dedicated APIs, and the HW-RTOS takes care of the communication requirements of the application and also implements the task scheduling algorithm. The HW-RTOS allows for smaller footprints, since it avoids the need to link traditional software RTOS libraries into the final executables. Moreover, the HW-RTOS is able to exploit the easy task migration feature provided by an SMP architecture much more efficiently than a traditional software RTOS, due to its faster execution; we show how this significantly exceeds the performance achievable with optimal static task partitioning between the two processors. Preliminary results show that the hardware overhead in a dual-processor architecture is less than 20K gates.

PDF icon A Scalable, Timing-Safe, Network-on-Chip Architecture with an Integrated Clock Distribution Method [p. 648]
T. Bjerregaard, M.B. Stensgaard and J. Sparsø

Growing system sizes together with increasing performance variability are making globally synchronous operation hard to realize. Mesochronous clocking constitutes a possible solution. The most fundamental problem faced when communicating between mesochronously clocked regions is the possibility of data corruption caused by metastability. This paper presents an integrated communication and mesochronous clocking strategy which avoids timing-related errors while maintaining a globally synchronous system perspective. The architecture is scalable, as timing integrity is based purely on local observations. It is demonstrated with a 90 nm CMOS standard cell network-on-chip design which implements completely timing-safe, global communication in a modular system.

PDF icon Butterfly and Benes-Based On-Chip Communication Networks for Multiprocessor Turbo Decoding [p. 654]
H. Moussa, O. Muller, A. Baghdadi and M. Jezequel

Several research activities have recently emerged aiming to propose multiprocessor implementations in order to achieve flexible and high throughput parallel iterative decoding. Besides application algorithm optimizations and application-specific instruction-set processor design, the on-chip communication network constitutes a major issue in this application domain. In this paper, we propose to use multistage interconnection networks as on-chip communication networks for parallel turbo decoding. Adapted Benes and Butterfly networks are proposed with detailed hardware implementation of network interfaces, routers, and topologies. In addition, appropriate packet format and routing for interleaved/deinterleaved extrinsic information exchanges are proposed. The flexibility of these on-chip communication networks enables their use for all turbo code standards and constitutes a promising feature for their reuse for any similar interleaved/deinterleaved iterative communication profile.
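
The appeal of such multistage networks for this application is that routing can be table-free: in a butterfly, the router at stage k selects its output port purely from bit k of the destination address (destination-tag routing). A minimal sketch, with the port count and bit ordering assumed for illustration:

    import math

    N = 8                                  # network ports
    STAGES = int(math.log2(N))             # a butterfly has log2(N) switch stages

    def destination_tag(dst):
        """Per-stage output-port choice, read straight off the address bits."""
        return [(dst >> (STAGES - 1 - k)) & 1 for k in range(STAGES)]

    print(destination_tag(5))              # [1, 0, 1]: stage k uses bit k of 0b101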


Interactive Presentation

PDF icon Capturing the Interaction of the Communication, Memory and I/O Subsystems in Memory-Centric Industrial MPSoC Platforms [p. 660]
S. Medardoni, M. Ruggiero, D. Bertozzi, L. Benini, G. Strano and C. Pistritto

Industrial MPSoC platforms exhibit increasing communication needs while not yet resorting to revolutionary solutions such as networks-on-chip. On one hand, the limited scalability of shared busses is being overcome by means of multi-layer communication architectures, which are stressing the role of bridges as key contributors to system performance. On the other hand, technology limitations, data footprint and cost constraints lead to platform instantiations with only a few on-chip memory devices and with a global performance bottleneck: the memory controller for access to the off-chip SDRAM memory. The complex interaction among system components and the dependency of macroscopic performance metrics on fine-grain architectural features stress the importance of highly accurate modelling and analysis tools. This paper builds on an extensive modelling effort of a complete industrial MPSoC platform for consumer electronics, including the off-chip memory sub-system. Based on this, relevant design issues concerning the communication, memory and I/O architecture and their interaction are addressed, resulting in guidelines for designers of industry-relevant MPSoCs.


5.1.1: Security and Trust in Ubiquitous Communication (Ubiquitous Communication and Computation Special Day)

Organizer/Moderator: P. Liuha, Nokia, FI
PDF icon Cost-Aware Capacity Optimization in Dynamic Multi-Hop WSNs [p. 666]
J. Suhonen, M. Kohvakka, M. Kuorilehto, M. Hännikäinen, and T.D. Hämäläinen

Low energy consumption and load balancing are required for enhancing lifetime in Wireless Sensor Networks (WSNs). In addition, network dynamics and differing delay, throughput, and reliability requirements demand cost-aware traffic adaptation. This paper presents a novel capacity optimization algorithm targeted at locally synchronized, low duty-cycle WSN MACs. The algorithm balances the traffic load between contention and contention-free channel access. The energy-inefficient contention access is avoided, whereas the more reliable contention-free access is preferred. The algorithm allows making a cost-aware trade-off between delay, energy efficiency, and throughput, guided by the routing layer. Analysis results show that the algorithm has 10% to 100% better energy efficiency than IEEE 802.15.4 LR-WPAN in a typical sensing application, while providing comparable goodput and delay.

PDF icon Design Methods for Security and Trust [p. 672]
I. Verbauwhede and P. Schaumont

The design of ubiquitous and embedded computers focuses on cost factors such as area, power consumption, and performance. Security and trust properties, on the other hand, are often an afterthought. Yet the purpose of ubiquitous electronics is to act and negotiate on their owner's behalf, and this makes trust a first-order concern. We outline a methodology for the design of secure and trusted electronic embedded systems, which builds on identifying the security-sensitive part of a system (the root-of-trust) and iteratively partitioning and protecting that root-of-trust over all levels of design abstraction. This includes protocols, software, hardware, and circuits. We review active research in the area of secure design methodologies.


5.1.2: Lunch-Time Keynote and Awards

PDF icon Emerging Solutions Technology and Business Views for the Ubiquitous Communication [p. 678]
H. Huomo

The presentation will cover a short historical overview of the ubiquitous communication research Dr Huomo was leading while at Nokia. This research program led to the development of the short-range radio technology now known as Wibree and the touch-based service discovery technology now known as NFC. The current key use cases of NFC and its future development directions will be covered.


5.2: Best Industrial System Designs in Aerospace, Avionics and Automotive

Moderators: L. Fanucci, Pisa U, IT; A. Reutter, Robert Bosch GmbH, DE
PDF icon Development of on Board, Highly Flexible, Galileo Signal Generator ASIC [p. 679]
L. Baguena, E. Liégeon, A. Bepoix, J. M. Dusserre, C. Oustric, P. Bellocq and V. Heiries

Alcatel Alenia Space is deeply involved in the Galileo program at many stages. In particular, Alcatel Alenia Space has successfully designed and delivered the very first navigation signal generator, based on a 0.35 μm Atmel ASIC technology, which was launched in the satellite demonstrator GIOVE-A in December 2005. The Galileo project is now in a second phase including the development of four of the thirty satellites of the final constellation. The new navigation signal generator requires both high performance and high flexibility (various waveforms to cope with the different Galileo services: open, commercial, governmental ...) for a very long-lifetime system. Besides, the challenge is increased by specific space constraints such as mass, volume and power consumption. These requirements will be achieved through the implementation of a 3-million-gate ASIC in a 0.18 μm European Radiation Tolerant Atmel technology. This paper will, after a brief description of the Galileo system, present the constraints of the space environment and the associated technology challenges. It will then present the ASIC and the development flow of this project, emphasizing the up-to-date tools that have been used (architectural synthesis, physical synthesis). A conclusion will then be drawn on the requirements on technology and tools for the space domain.

PDF icon New Safety Critical Radio Altimeter for Airbus and Related Design Flow [p. 684]
D. Hairion, S. Emeriau, E. Combot and M. Sarlotte

The latest generation of the ERT560 Digital Radio Altimeter (DRA), developed for the Airbus A380, is the result of Thales' 40 years of experience. Over 40,000 radio altimeters have been produced over that period based on dual technology, meeting the stringent requirements of civil aircraft. This new version takes advantage of FPGA technology to implement the main processing functions of the equipment. The present article introduces the main capabilities of the ERT560 product and focuses on the FPGA, which is the key element of the safety-critical analysis of the radio altimeter. The paper then presents the application of the new "Design Assurance Guidance for Airborne Electronic Hardware" (DO-254), issued in 2000 (this guide is the equivalent for hardware of the DO-178B for software). DO-254-related activities are described, such as a dedicated workflow, validation (giving evidence of the completeness and correctness of all design life cycle outputs), verification (evaluation of an implementation of requirements to determine that they have been met), and verification tool qualification.

PDF icon Introducing New Verification Methods into a Company's Design Flow: An Industrial User's Point of View [p. 689]
R. Lissel and J. Gerlach

Today the task of design verification has become one of the key bottlenecks in hardware and system design. To address this topic, several verification languages, methods and tools, addressing different aspects of the verification process, have been developed by EDA vendors over recent years. This paper takes an industrial user's point of view and explores the difficulties of introducing new verification methods into a company's "naturally grown" and well-established design flow, taking into account application-domain-specific requirements, constraints given by the existing design environment, and economic aspects. The presented approach extends the capabilities of an existing verification strategy with powerful new features while keeping in mind integration, reuse and applicability aspects. Based on an industrial design example, the effectiveness and potential of the developed approach is shown.


5.3: Mixed-Signal and RF Test

Moderators: A. Chatterjee, Georgia Institute of Technology, US; B. Kaminska, Simon Fraser U, CA
PDF icon Testable Design for Advanced Serial-Link Transceivers [p. 695]
M. Lin and K.-T. Cheng

This paper describes a DfT solution for modern serial-link transceivers. We first summarize the architectures of the crosstalk canceller and the equalizer used in advanced transceivers to which the proposed solution can be applied. The solution addresses the testability and observability issues of the transceiver for both characterization and production testing. Without requiring sophisticated test instrument settings, the proposed solution can test the clock and data recovery circuit and characterize the decision-feedback equalizer in the receiver. Our experiments demonstrate that the proposed method has significantly higher fault coverage and lower hardware requirements than the conventional approach of probing the eye opening of the signals inside the transceiver.

PDF icon Method for Reducing Jitter in Multi-Gigahertz ATE [p. 701]
D.C. Keezer, D. Minier and P. Ducharme

Controlling jitter on a picosecond (or smaller) time scale has become one of the most difficult challenges for testing multi-gigahertz systems. In this paper we present a novel method for reducing jitter in timing-critical ATE signals. This method uses a real-time averaging approach to combine multiple ATE signals and produces timing references with significantly lower random jitter. For example, we demonstrate a 3x reduction in jitter by combining eight ATE signals (each with σ = 4 ps) to produce a low-jitter signal (σ = 1.3 ps). The measured jitter reduction is shown to closely match that predicted by theory. This counter-intuitive (but welcome) result is of general interest for the design of any low-jitter system, and is particularly helpful for multi-GHz ATE where precise timing is so critical.
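
The theoretical prediction is the familiar sqrt(N) averaging rule: assuming the per-channel jitter is random, uncorrelated and Gaussian (an assumption of this sketch, not a claim about the authors' hardware), combining eight σ = 4 ps signals should give σ/√8 ≈ 1.4 ps, in line with the reported 1.3 ps. A quick Monte Carlo check:

    import random, statistics

    N_CHANNELS, SIGMA_PS, TRIALS = 8, 4.0, 20000

    combined = []
    for _ in range(TRIALS):
        edges = [random.gauss(0.0, SIGMA_PS) for _ in range(N_CHANNELS)]
        combined.append(sum(edges) / N_CHANNELS)    # averaging of edge times
    print(round(statistics.stdev(combined), 2), "ps")   # about 1.41 ps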

PDF icon Re-Configuration of Sub-blocks for Effective Application of Time Domain Tests [p. 707]
J. Anders, S. Krishnan and G. Gronthoud

Most Analogue Automatic Test Pattern Generators (AATPGs) are guided by AC sensitivities when determining the optimal frequencies of a sinusoidal test stimulus. The optimal frequencies thus determined normally lie in the close vicinity of the operating frequency of the circuit. Although justifiable from circuit principles, these test frequencies bring no added value to the ultimate goal of cheap alternatives (low-frequency test signals and cheaper measurement equipment) for analogue and RF tests. In this paper, we propose to re-configure the circuit blocks in such a way that the operating frequencies of the respective sub-blocks are shifted to lower testable frequencies. We have validated our proposal on a sub-block of a satellite receiver circuit, lowering the test frequencies of the corresponding sub-blocks from 12 GHz to 4 MHz while attaining the same level of defect coverage.

PDF icon An ADC-BiST Scheme Using Sequential Code Analysis [p. 713]
E.S. Erdogan and S. Ozev

This paper presents a built-in self-test (BiST) scheme for analog-to-digital converters (ADCs) based on a linear ramp generator and efficient output analysis. The proposed analysis method is an alternative to histogram-based analysis techniques and provides test time improvements, especially when resources are scarce. In addition to the measurement of DNL and INL, non-monotonic behavior can also be detected with the proposed technique. We present two implementation options based on how many on-chip resources are available. The ramp generator has a high linearity over a full-scale range of 1 V, and the generated ramp signal is capable of testing 13-bit ADCs. The circuit implementation of the ramp generator utilizes a feedback configuration to improve linearity, occupying an area of 0.017 mm² in a 0.5 μm process.
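
The streaming flavor of such an analysis can be sketched as follows (an idealized quantizer stands in for the ADC, and the per-code counting below is only an illustration of sequential, histogram-free processing, not the paper's exact algorithm):

    def adc(v, bits, vfs=1.0):
        """Ideal quantizer standing in for the ADC under test."""
        return min(int(v / vfs * 2 ** bits), 2 ** bits - 1)

    BITS, STEPS = 4, 4096
    ideal = STEPS / 2 ** BITS          # expected samples per code for a slow ramp

    prev_code, count, dnl, monotonic = None, 0, [], True
    for i in range(STEPS):             # linear ramp stimulus
        code = adc(i / STEPS, BITS)
        if prev_code is None or code == prev_code:
            count += 1
        else:
            if code < prev_code:
                monotonic = False      # code went backwards: non-monotonic
            dnl.append(count / ideal - 1.0)
            count = 1
        prev_code = code
    dnl.append(count / ideal - 1.0)

    inl = [sum(dnl[:k + 1]) for k in range(len(dnl))]   # INL = running DNL sum
    print("monotonic:", monotonic,
          "max |DNL|:", max(abs(d) for d in dnl),
          "max |INL|:", max(abs(x) for x in inl))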


Interactive Presentation

PDF icon Boosting SER Test for RF Transceivers by Simple DSP Technique [p. 719]
J. Dabrowski and R. Ramzan

The paper presents a new symbol error rate (SER) test technique for RF transceivers. A simple DSP algorithm implemented at the receiver baseband is introduced in terms of constellation correction, which is usually used to compensate for IQ imbalance. The test targets the detection of gain and noise figure impairments in a transceiver front-end. The proposed approach is shown to enhance the sensitivity of a traditional SER test to the limits of its counterpart, the error vector magnitude (EVM) test. Its advantage over EVM lies in simple implementation, lower DSP overhead, and the ability to achieve a larger dynamic range of the test response. Test time is also reduced compared to a traditional SER test. The technique is validated by a simulation model of a Wi-Fi transceiver implemented in Matlab.

PDF icon Novel Test Infrastructure and Methodology Used for Accelerated Bring-Up and In-System Characterization of the Multi-Gigahertz Interfaces on the Cell Processor [p. 725]
P. Yeung, A. Torres and P. Batra

Design-for-test (DFT) techniques are continuously used in designs to help identify defects during silicon manufacturing. However, prior to production, a significant amount of time and effort is needed to bring up and validate various aspects of the silicon design in the system. In particular, the use of multi-gigabit I/O signaling for a high-I/O-count, high-volume product introduces unique test challenges during these two phases of the product life cycle. In this paper, we shall discuss the test infrastructure and methodologies used to accelerate bring-up and in-system silicon characterization for high-speed mixed-signal I/O. These ideas will lead to a shortened time to market (TTM) at a lower cost. As a case study, we shall illustrate these techniques used in the development of the Rambus FlexIO™ processor bus and XIO™ memory interface used on the first-generation Cell processor (aka Cell Broadband Engine™ or Cell BE). Cell was co-developed by Sony Corporation, Sony Computer Entertainment Inc, Toshiba Corporation, and IBM and is used in the Sony PlayStation®3 (PS3) game console and other computationally intense applications. The Cell processor uses 5 Gbps links for the processor's FlexIO system interface and 3.2 Gbps links for the processor's XDR™ memory interface. This per-pin bandwidth translates into a system interface with a bandwidth of 60 GB/s and a memory interface with a bandwidth of 25.6 GB/s, respectively.

PDF icon Evaluation of Test Measures for LNA Production Testing Using a Multinormal Statistical Model [p. 731]
J. Tongbong, S. Mir and J.L. Carbonero

For Design-For-Test (DFT) purposes, analogue and mixed-signal testing has to cope with the difficulty of test evaluation before production. This paper aims at evaluating test measures for RF components in order to optimize production test sets and thus reduce test cost. For this, we have first developed a statistical model of the performances and possible test measures of the Circuit Under Test (a Low Noise Amplifier). The statistical multi-normal model is derived from data obtained using Monte-Carlo circuit simulation (five hundred iterations). This statistical model is then used to generate a larger circuit population (one million instances) from which test metrics can be estimated with ppm precision at the design stage, considering just process deviations. With the use of this model, a trade-off between defect level and yield loss resulting from process deviations is used to set test limits. After fixing test limits, we have carried out a fault simulation campaign to verify the suitability of the different test measurements, targeting both catastrophic and single parametric faults. Catastrophic faults are modelled by shorts and opens. A parametric fault is defined as the minimum value of a physical parameter that causes a specification to be violated. Test metrics are then evaluated for the LNA case-study. As a result, test metrics for functional measurements such as S-parameters and Noise Figure are compared with low cost test measurements such as RMS and peak-to-peak current consumption and output voltage, input/output impedance, and the correlation between current consumption and output voltage.


5.4: EMBEDDED TUTORIAL AND PANEL - Heterogeneous Systems on Chip and Systems in Package

Organizers/Moderators: B. Courtois, TIMA Laboratory, FR; I. O'Connor, Ecole Centrale de Lyon, FR
PDF icon Heterogeneous Systems on Chip and Systems in Package [p. 737]
I. O'Connor, B. Courtois, K. Chakrabarty, N. Delorme, M. Hampton, J. Hartung

This paper discusses several forms of heterogeneity in systems on chip and systems in package. A means to distinguish the various forms of heterogeneity is given, with an estimation of the maturity of design and modeling techniques with respect to various physical domains. Industry-level MEMS integration, and more prospective microfluidic biochip systems are considered at both technological and EDA levels. Finally, specific flows for signal abstraction heterogeneity in RF SiP and for functional co-verification are discussed.


5.5: Novel Directions in Architectural Simulation and Validation

Moderators: E.M. Aboulhamid, Montreal U, CA; T. Austin, U of Michigan, US
PDF icon Engineering Trust with Semantic Guardians [p. 743]
I. Wagner and V. Bertacco

The ability to guarantee the functional correctness of digital integrated circuits and, in particular, complex microprocessors, is a key task in the production of secure and trusted systems. Unfortunately, this goal remains an unfulfilled challenge today, as even the most straightforward practical designs are released with latent bugs. Patching techniques can repair some of these escaped bugs; however, they often incur a performance overhead, and most importantly, they can only be deployed after an escaped bug has been exposed at the customer site. In this paper we present a novel approach to guaranteeing correct system operation by deploying a semantic guardian component. The semantic guardian is an additional control logic block which is included in the design, and can switch the microprocessor's mode of operation from its normal, high-performance but error-prone mode, to a secure, formally verified safe mode, guaranteeing that the execution will be functionally correct. We explore several frameworks where a selective use of the safe mode can enhance the overall functional correctness of a processor. Additionally, we observe through experimentation that semantic guardians facilitate the trade-off between the design validation effort and the performance and area cost of the final secure product. The experimental results show that the area cost and performance overheads of a semantic guardian can be as small as 3.5% and 5%, respectively.
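
The guardian mechanism itself reduces to a small checker; the sketch below (with an invented set of verified control states) illustrates the mode switch, leaving out how the real hardware block is derived from the verification effort and how it later returns to fast mode:

    VERIFIED = {"fetch", "decode", "execute", "writeback"}   # invented state set

    class SemanticGuardian:
        """Forces safe mode as soon as control leaves the verified subset."""
        def __init__(self):
            self.mode = "fast"

        def observe(self, control_state):
            if control_state not in VERIFIED:
                self.mode = "safe"         # fall back to the verified safe mode
            return self.mode

    g = SemanticGuardian()
    for s in ["fetch", "decode", "replay_trap", "execute"]:
        print(s, "->", g.observe(s))       # switches to safe at replay_trap
                                           # (and stays there, in this sketch)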

PDF icon CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators [p. 749]
D. Kim, S. Ha and R. Gupta

This paper focuses on enhancing the performance of cycle-accurate simulation with multiple processor simulators. Simulation performance is determined by how often simulators exchange events with one another and how accurately simulators model their behavior. Previous techniques have either limited applicability or sacrifice accuracy for performance. In this paper, we observe that inaccuracy comes from events which arrive between event exchange boundaries. To solve the problem, we propose cycle-accurate transaction-driven simulation, which maintains event exchange boundaries at bus transactions but compensates for the lost accuracy. The proposed technique is implemented in a publicly available CATS framework, and our experiment with 64 processors achieves 1.2M processor cycles/s (200K instructions/s), which is faster than other cycle-accurate frameworks by an order of magnitude.

PDF icon A One-Shot Configurable-Cache Tuner for Improved Energy and Performance [p. 755]
A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar and E. Barros

We introduce a new non-intrusive on-chip cache-tuning hardware module capable of accurately predicting the best configuration of a configurable cache for an executing application. Previous dynamic cache tuning approaches change the cache configuration several times as part of the tuning search process, executing the application using inferior configurations and temporarily causing energy and performance overhead. The introduced tuner uses a different approach, which non-intrusively collects data on addresses issued by the microprocessor, analyzes that data to predict the best cache configuration, and then updates the cache to the new best configuration in "one shot," without ever having to examine inferior configurations. The result is less energy and less performance overhead, meaning that cache tuning can be applied more frequently. We show through experiments that the one-shot cache tuner can reduce memory-access-related energy for instructions by 35%, coming within 4% of a previous intrusive approach, while incurring 4.6 times less energy overhead and tuning 7.7 times faster than that approach, at the main expense of a 12% larger size.
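
A software analogue of one-shot tuning is to record the address stream once and score every candidate configuration offline, switching directly to the winner; the configuration space and the energy model below are invented for illustration (the actual module analyzes the stream in hardware rather than replaying it):

    def misses(trace, size, line, ways):
        sets = max(size // (line * ways), 1)
        cache = [[] for _ in range(sets)]      # per-set LRU stacks of tags
        m = 0
        for addr in trace:
            block = addr // line
            s, tag = block % sets, block // sets
            if tag in cache[s]:
                cache[s].remove(tag)           # hit: refresh LRU position
            else:
                m += 1                         # miss: evict LRU entry if full
                if len(cache[s]) >= ways:
                    cache[s].pop(0)
            cache[s].append(tag)               # most recently used at the end
        return m

    trace = [i % 8192 for i in range(0, 65536, 4)]        # toy address stream
    configs = [(size, line, ways) for size in (1024, 2048, 4096, 8192)
                                  for line in (16, 32)
                                  for ways in (1, 2, 4)]

    def energy(cfg, trace):
        size, line, ways = cfg                 # invented energy weights below
        per_access = ways * 0.1 + size / 8192 + line / 64
        return misses(trace, *cfg) * 100 + len(trace) * per_access

    best = min(configs, key=lambda c: energy(c, trace))
    print("predicted best (size, line, ways):", best)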

PDF icon Design Fault Directed Test Generation for Microprocessor Validation [p. 761]
D.A. Mathaikutty, S.K. Shukla, S.V. Kodakara, D. Lilja and A. Dingankar

Functional validation of modern microprocessors is an important and complex problem. One of the problems in functional validation is the generation of test cases that have a high potential to find faults in the design. We propose a model-based test generation framework that generates tests for design fault classes inspired by software validation. There are two main contributions in this paper. Firstly, we propose a microprocessor modeling and test generation framework that generates test suites satisfying Modified Condition Decision Coverage (MCDC), a structural coverage metric that detects most of the classified design faults, and additionally targets the remaining faults not covered by MCDC. Secondly, we show that there exists good correlation between the types of design faults proposed by software validation and the errors/bugs reported in case studies on microprocessor validation. We demonstrate the framework by modeling and generating tests for the microarchitecture of VESPA, a 32-bit microprocessor. In the results section, we show that the tests generated using our framework's coverage-directed approach detect the fault classes with 100% coverage, in contrast to model-random test generation.


Interactive Presentation

PDF icon Impact of Description Language, Abstraction Layer, and Value Representation on Simulation Performance [p. 767]
W. Ecker, V. Esen, L. Schönberg, T. Steininger, M. Velten and M. Hull

In recent years, verification features other than simulation performance, such as robustness and debugging, have gained increasing influence on simulation language and tool selection. However, the fastest model execution speed is still priority number one for many design and verification engineers. This can be seen in the continuously growing interest in virtual prototypes and transaction-level modeling (TLM). As part of the ongoing rework of modeling language strategies and the worldwide introduction of TLM, a detailed analysis of the impact of description languages, abstraction layers and data types on simulation performance is of high importance. For the presented analysis, we considered five designs that have been modeled in VHDL, Verilog, SystemVerilog, and SystemC, using different value representations and coding styles, covering the abstraction levels from functional to behavioral to RTL. This paper presents our evaluation environment and several interesting findings of our analysis. The most important results are as follows: we found that HDL tool/language/abstraction selection for RTL models affects execution speed by a factor of 4.4; that Verilog is on average 2x faster than VHDL for RTL models; that SystemC results in 10x slower RTL models than HDLs and, surprisingly, in 2.6x slower TLM1 PV models than SystemVerilog; and finally that, on average over all analyzed aspects, SystemVerilog models execute fastest.


5.6: Power Management

Moderators: D. Soudris, Thrace Democritus U, GR; M. Poncino, Politecnico di Torino, IT
PDF icon Adaptive Power Management in Energy Harvesting Systems [p. 773]
C. Moser, L. Thiele, D. Brunelli and L. Benini

Recently, there has been substantial interest in the design of systems that receive their energy from regenerative sources such as solar cells. In contrast to approaches that attempt to minimize power consumption, we are concerned with adapting parameters of the application such that maximal utility is obtained while respecting the limited and time-varying amount of available energy. Instead of solving the optimization problem on-line, which may be prohibitively complex in terms of running time and energy consumption, we propose a parameterized specification and the computation of a corresponding optimal on-line controller. The efficiency of the new approach is demonstrated by experimental results and measurements on a sensor node.

PDF icon Stochastic Modeling and Optimization for Robust Power Management in a Partially Observable System [p. 779]
Q. Qiu, Y. Tan and Q. Wu

As hardware and software complexity grows, it is unlikely for the power management hardware/software to have full observation of the entire system status. In this paper, we propose a new modeling and optimization technique based on partially observable Markov decision processes (POMDPs) for robust power management, which can achieve near-optimal power savings even when only partial system information is available. Three scenarios of partial observation that may occur in an embedded system are discussed and their modeling techniques are presented. The experimental results show that, compared with a power management policy derived from a traditional Markov decision process model that assumes the system is fully observable, the new power management technique achieves a significantly better performance and energy tradeoff.
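
The core ingredient such a technique relies on is the belief update: since the true state is hidden, the manager maintains a probability distribution over states and refines it after each action/observation pair. A textbook sketch with made-up two-state matrices (not the paper's models):

    STATES = ["idle", "busy"]
    T = {"sleep": [[0.9, 0.1], [0.3, 0.7]],    # T[a][s][s2]: transition probs
         "run":   [[0.8, 0.2], [0.1, 0.9]]}
    O = [[0.85, 0.15],                         # O[s2][o]: P(observation | state)
         [0.20, 0.80]]

    def update_belief(belief, action, obs):
        # b'(s2) proportional to O[s2][obs] * sum_s T[action][s][s2] * b(s)
        new = [O[s2][obs] * sum(T[action][s][s2] * belief[s]
                                for s in range(len(STATES)))
               for s2 in range(len(STATES))]
        norm = sum(new)
        return [x / norm for x in new]

    b = [0.5, 0.5]                             # initially: state unknown
    for action, obs in [("run", 1), ("run", 1), ("sleep", 0)]:
        b = update_belief(b, action, obs)
        print([round(x, 3) for x in b])        # belief sharpens, then shifts back

A policy for the partially observable system is then computed over these beliefs rather than over the (unobservable) states themselves.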

PDF icon Efficient and Scalable Compiler-Directed Energy Optimization for Realtime Applications [p. 785]
P.-K. Huang and S. Ghiasi

We present a compilation technique that targets real-time applications running on embedded processors with combined dynamic voltage scaling (DVS) and adaptive body biasing (ABB) capabilities. Considering the delay and energy penalty of switching between operating modes of the processor, our compiler judiciously inserts mode switch instructions at selected locations in the code and generates an executable binary that is guaranteed to meet the deadline constraint. More importantly, our algorithm runs very fast and comes reasonably close to the theoretical limit of energy optimization using DVS+ABB. At the 65 nm technology node, we reduce the energy dissipation of the generated code by an average of 11.4% under deadline constraints. While our technique's improvement in energy dissipation over conventional DVS is marginal (3%) at 130 nm, the average improvement grows to 4.7%, 8.8% and 15.4% for the 90 nm, 65 nm and 45 nm technology nodes, respectively. Compared to a recent ILP-based competitor, we improve the runtime by more than three orders of magnitude, while producing improved results.


Interactive Presentations

PDF icon Peripheral-Conscious Scheduling on Energy Minimization for Weakly Hard Real-time Systems [p. 791]
L. Niu and G. Quan

In this paper, we present a dynamic scheduling algorithm to minimize the energy consumed by both the DVS processor and the peripheral devices in a weakly hard real-time system. In our approach, we first use a new static approach to partition real-time jobs into mandatory and optional parts to meet the weakly hard real-time constraints. We then adopt an on-line approach that can effectively exploit run-time variations and reduce preemption impacts to improve energy savings. Extensive simulation studies demonstrate that our approach can effectively reduce the system-wide energy consumption while guaranteeing the weakly hard constraints.

PDF icon Task Scheduling under Performance Constraints for Reducing the Energy Consumption of GALS Multi-Processor SoC [p. 797]
R. Watanabe, M. Kondo, M. Imai, H. Nakamura and T. Nanya

The present paper focuses on applications that are periodic and have both latency and throughput constraints. For these applications, pipelined scheduling is effective for reducing energy consumption. Thus, the present paper proposes a pipelined task scheduling method for minimizing the energy consumption of GALS MP-SoCs under latency and throughput constraints. First, we model the target GALS MP-SoC architecture and application tasks. We then show that the energy optimization problem under this model belongs to the class of Mixed-Integer Linear Programming. Next, we propose a new scheduling method based on simulated annealing for the purpose of solving this problem quickly. Finally, experimental results demonstrate that the proposed method achieves a significant energy reduction on a real application under a practical architecture.


5.7: Advanced Techniques for Embedded Processors Design

Moderators: W. Kruijtzer, NXP, NL; G. Martin, Tensilica, US
PDF icon Instruction Trace Compression for Rapid Instruction Cache Simulation [p. 803]
A. Janapsatya, A. Ignjatovic, S. Parameswaran and J. Henkel

Modern Application Specific Instruction Set Processors (ASIPs) have customizable caches, where the size, associativity and line size can all be customized to suit a particular application. To find the cache configuration best suited to a particular embedded system, the application(s) are executed, traces are obtained, and caches are simulated. Typically, program trace files can range from a few megabytes to several gigabytes, and simulating cache performance with large program trace files is a time-consuming process. In this paper, a novel instruction cache simulation methodology that can operate directly on a compressed program trace file, without the need for decompression, is presented. This feature allowed our simulation methodology to achieve an average speedup of 9.67 times compared to the existing state-of-the-art tool (the Dinero IV cache simulator), for a range of applications from the Mediabench suite.
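
A miniature version of the idea: if the trace is stored as (start, stride, count) runs, typical of sequential instruction fetch, a cache simulator can consume each run block by block without ever expanding it back to one record per fetch. The run format and cache below are invented for illustration; the paper defines its own compression scheme:

    LINE, SETS = 16, 256                   # 4 KB direct-mapped I-cache
    tags = [None] * SETS
    hits = misses = 0

    def simulate_run(start, stride, count):
        """Consume one compressed run without expanding it to single fetches."""
        global hits, misses
        addr, i = start, 0
        while i < count:
            block = addr // LINE
            s, tag = block % SETS, block // SETS
            # how many consecutive fetches stay inside this cache block
            in_block = max((LINE - addr % LINE + stride - 1) // stride, 1)
            n = min(in_block, count - i)
            if tags[s] == tag:
                hits += n                  # the whole burst hits in one step
            else:
                tags[s] = tag              # one miss fills the block ...
                misses += 1
                hits += n - 1              # ... the rest of the burst hits
            addr += n * stride
            i += n

    simulate_run(0x1000, 4, 1000)          # straight-line code, 1000 fetches
    simulate_run(0x1000, 4, 1000)          # re-execute the same code: all hits
    print("hits:", hits, "misses:", misses)   # 1750 hits, 250 misses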

PDF icon Efficient Code Density through Look-up Table Compression [p. 809]
T. Bonny and J. Henkel

Code density is a major requirement in embedded system design since it not only reduces the need for the scarce resource memory but also implicitly improves further important design parameters like power consumption and performance. Within this paper we introduce a novel and efficient hardware-supported approach that belongs to the group of statistical compression schemes, as it is based on Canonical Huffman Coding. In particular, our scheme is the first to also compress the necessary Look-up Tables, which can become significant in size if the application is large and/or high compression is desired. Our scheme optimizes the number of generated Look-up Tables to improve the compression ratio. On average, we achieve compression ratios as low as 49% (already including the overhead of the Look-up Tables). Thereby, our scheme is entirely orthogonal to approaches that take particularities of a certain instruction set architecture into account. We have conducted evaluations using a representative set of applications and have applied our scheme to three major embedded processor architectures, namely ARM, MIPS and PowerPC.
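
The reason canonical codes shrink the decode tables is that all code words of a given length are consecutive integers, so a decoder needs only the first code value per length plus an index into a symbol array. A textbook construction (the paper's table optimizations go further than this sketch):

    def canonical_codes(lengths):
        """lengths: {symbol: code length}. Returns {symbol: (code, length)}."""
        order = sorted(lengths, key=lambda s: (lengths[s], s))
        codes, code, prev_len = {}, 0, 0
        for sym in order:
            code <<= (lengths[sym] - prev_len)   # lengthen the code when needed
            codes[sym] = (code, lengths[sym])
            code += 1
            prev_len = lengths[sym]
        return codes

    lengths = {"a": 1, "b": 2, "c": 3, "d": 3}   # from Huffman tree depths
    for sym, (code, n) in sorted(canonical_codes(lengths).items()):
        print(sym, format(code, f"0{n}b"))
    # a 0, b 10, c 110, d 111 -- the decode table reduces to the first code
    # per length plus a symbol list: len 1 -> ["a"], len 2 -> ["b"],
    # len 3 -> ["c", "d"]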

PDF icon Microarchitectural Support for Program Code Integrity Monitoring in Application-specific Instruction Set Processors [p. 815]
Y. Fei and Z.J. Shi

Program code in a computer system can be altered either by malicious security attacks or by various faults in microprocessors. At the instruction level, all code modifications are manifested as bit flips. In this work, we present a generalized methodology for monitoring code integrity at run-time in application-specific instruction set processors (ASIPs), where both the instruction set architecture (ISA) and the underlying microarchitecture can be customized for a particular application domain. We embed monitoring microoperations in machine instructions, thus the processor is augmented with a hardware monitor automatically. The monitor observes the processor's execution trace of basic blocks at run-time, checks whether the execution trace aligns with the expected program behavior, and signals any mismatches. Since microoperations are at a lower software architecture level than processor instructions, the microarchitectural support for program code integrity monitoring is transparent to upper software levels and no recompilation or modification is needed for the program. Experimental results show that our microarchitectural support can detect program code integrity compromises with small area overhead and little performance degradation.


Interactive Presentation

PDF icon Soft-core Processor Customization Using the Design of Experiments Paradigm [p. 821]
D. Sheldon, F. Vahid and S. Lonardi

Parameterized components are becoming more commonplace in system design. The process of customizing parameter values for a particular application, called tuning, can be a challenging task for a designer. Here we focus on the problem of tuning a parameterized soft-core microprocessor to achieve the best performance on a particular application, subject to size constraints. We map the tuning problem to a well-established statistical paradigm called Design of Experiments (DoE), which involves the design of a carefully selected set of experiments and a sophisticated analysis that aims to extract the maximum amount of information about the effects of the input parameters on the experiment. We apply the DoE method to analyze the relation between input parameters and the performance of a soft-core microprocessor for a particular application, using only a small number of synthesis/execution runs. The information gained by the analysis in turn drives a soft-core tuning heuristic. We show that using DoE to sort the parameters in order of impact results in application speedups of 6x-17x versus an un-tuned base soft-core. When compared to a previous single-factor tuning method, the DoE-based method achieves 3x-6x application speedups, while requiring about the same tuning runtime. We also show that tuning runtime can be reduced by 40-45% by using predictive tuning methods already built into a DoE tool.
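
The screening step at the heart of this approach can be sketched as estimating each parameter's main effect from a small two-level design and ranking parameters by impact; the three knobs and the cycle-count model below are invented, and synthesize_and_run() stands in for real synthesis/execution runs:

    from itertools import product

    PARAMS = ["cache", "multiplier", "barrel_shifter"]   # hypothetical knobs

    def synthesize_and_run(cfg):
        """Toy cycle-count model (illustration only; lower is better)."""
        cycles = 1000
        if cfg["cache"]:          cycles -= 400
        if cfg["multiplier"]:     cycles -= 150
        if cfg["barrel_shifter"]: cycles -= 40
        return cycles

    # Full two-level factorial here (8 runs); a real DoE tool would use a
    # fractional design to cut the number of synthesis runs further.
    runs = [dict(zip(PARAMS, bits))
            for bits in product([0, 1], repeat=len(PARAMS))]
    results = [(cfg, synthesize_and_run(cfg)) for cfg in runs]

    def main_effect(p):
        hi = [r for cfg, r in results if cfg[p]]
        lo = [r for cfg, r in results if not cfg[p]]
        return sum(hi) / len(hi) - sum(lo) / len(lo)

    ranked = sorted(PARAMS, key=lambda p: abs(main_effect(p)), reverse=True)
    print("tune in this order:", ranked)   # cache first: largest effect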


6.1: Advances in Potential Power Supply (Ubiquitous Communication and Computation Special Day)

PDF icon Power Supply and Power Management in Ubicom [p. 827]

This session reviews the challenges of power supply and power management for devices and systems of an ad hoc communication nature. The session highlights some of the design aspects relevant to ubicom and their impact on overall system design and communication solutions.


6.2: Best Industrial Systems Designs in Communication and Multimedia

Moderators: O. Deprez, Texas Instruments, FR; M. Heijligers, NXP IC-Lab, NL
PDF icon From Algorithm to First 3.5G Call in Record Time: A Novel System Design Approach Based on Virtual Prototyping and Its Consequences for Interdisciplinary System Design Teams [p. 828]
M. Brandenburg, A. Schöllhom, S. Heinen, J. Eckmüller and T. Eckart

Increasing system complexity, not only in wireless communications, forces design teams to avoid errors during the process of system refinement, thereby keeping ambiguities during system implementation at a minimum. On the other hand, the chosen system design approach has to ensure that a system design project rapidly advances through all stages of refinement, from an algorithmic model to a real "System on Chip" (SoC), while maintaining backwards equivalence of the produced HW and FW/SW code with the original algorithmic model. This system design challenge also demands a new interdisciplinary team approach encompassing all design skills, ranging from concept to HW and FW/SW engineering as well as system verification, to increase the overlap between the system concept, implementation and verification phases. But how do these interdisciplinary teams cooperate efficiently when they are used to metaphorically "speaking different design languages"? We present the employment of a novel system design approach, resulting in an industry-record development time for a 3.5G UMTS modem, which serves as a common system design language and avoids the Babylonian language confusion of isolated engineering worlds. The motivation for an increasing overlap of the system concept, implementation and verification phases is obvious: it can save time (to market) on the order of several months or even more, and thus drastically shorten design cycles through parallel development of HW and FW/SW. The proposed approach also helps to avoid costly redesign cycles due to conceptual errors and optimizes the quality of the developed system HW and FW/SW, thereby also substantially reducing system development R&D costs.

PDF icon Portable Multimedia SoC Design: A Global Challenge [p. 831]
M. Paganini, G. Kimmich, S. Ducrey, G. Caubit and V. Coeffe

The intrinsic capability brought by each new technology node opens the way to a broad range of system integration options and continuously enables new applications to be integrated in a single device, to the point that almost everything seems possible. In reality, the difference between a successful design and a failure resides today, more than ever, in the ability of the design team to properly master all the critical design factors at once. In essence, today's System-on-Chip design represents a multidisciplinary challenge that spans from architecture through design to test and, finally, mass production. SoC design for portable applications has to cope with very unique constraints that greatly challenge the industrialization capabilities of an organization, and often of an entire company, and pushes concurrent design to new limits. In the end, only a well-thought-out architecture followed by best-practice design techniques, a high level of understanding of the manufacturing constraints, and excellent logistics can result in a device that can be produced in the volumes required by the cell phone industry today. This paper captures how these challenges have been addressed to design the family of Application Processing Engines named Nomadik™. The paper specifically focuses on the third-generation device, labeled STn8815S22, where the integration capabilities of silicon technology have been paired with those of System-in-Package design to provide an extremely compact and effective System on Chip for portable multimedia applications. An overview of the main success factors and challenges will be presented, driving the reader from the architecture conception through the chip industrialization. Both silicon design and packaging design will be illustrated, highlighting the techniques that made this incredible product a reality.

PDF icon What If You Could Design Tomorrow's System Today? [p. 835]
N. Wingen

This paper highlights a series of proven concepts aimed at facilitating the design of next generation systems. Practical system design examples are examined and provide insight on how to cope with today's complex design challenges.


6.3: Nano and FIFO

Moderators: E. Larsson, Linkoping U, SE; D. Gizopoulos, Piraeus U, GR
PDF icon Circuit-Level Modeling and Detection of Metallic Carbon Nanotube Defects in Carbon Nanotube FETs [p. 841]
H. Hashempour and F. Lombardi

Carbon Nanotube Field Effect Transistors (CNTFET) are promising nano-scaled devices for implementing high performance, very dense and low power circuits. The core of a CNTFET is a carbon nanotube. Its conductance property is determined by the so-called chirality of the tube; chirality is difficult to control during manufacturing. This results in conducting (metallic) nanotubes and defective CNTFETs similar to stuck-on (SON or source-drain short) faults, as encountered in classical MOS devices. This paper studies this phenomenon by using layout information and presents modeling and detection methodologies for nano-scaled defects arising from the presence of metallic carbon nanotubes. For CNTFET-based circuits (e.g. intramolecular), these defects are analyzed using a traditional stuck-at fault model. This analysis is applicable to primitive and complex gates. Simulation results are presented for detecting modeled metallic nanotube faults in CNTFETs using a single stuck-at fault test set. A high coverage is achieved (~98%).
Keywords: Carbon Nanotube, CNT, CNTFET, Defect Modeling, Fault Detection, Nanotechnology

PDF icon Error Rate Reduction in DNA Self-Assembly by Non-Constant Monomer Concentrations and Profiling [p. 847]
B. Jang, Y.-B. Kim and F. Lombardi

This paper proposes a novel technique based on profiling the monomers for reducing the error rate in DNA self-assembly. The technique utilizes the average concentration of the monomers (tiles) for a specific pattern, as found by profiling its growth. The validity of profiling and the large difference in the concentrations of the monomers are shown to be applicable to different tile sets. To evaluate the error rate, new Markov-based models are proposed that account for the different types of bonding (i.e. single, double and triple) in the monomers, as a modification to the commonly assumed kinetic trap model. A significant error rate reduction is accomplished compared to a scheme with constant concentrations, as commonly utilized under the kinetic trap model. Simulation results are provided.

PDF icon Design and DFT of a High-Speed Area-Efficient Embedded Asynchronous FIFO [p. 853]
P. Wielage, E.J. Marinissen, M. Altheimer and C. Wouters

Embedded First-In First-Out (FIFO) memories are increasingly used in many IC designs. We have created a new full-custom embedded ripple-through FIFO module with asynchronous read and write clocks. The implementation is based on a micropipeline architecture and is at least a factor two smaller than SRAM-based and standard-cell-based counterparts. This paper gives an overview of the most important design features of the new FIFO module and describes its test and design-for-test approach.

PDF icon Test Quality Analysis and Improvement for an Embedded Asynchronous FIFO [p. 859]
T. Dubois, M. Azimane, E. Larsson, E.J. Marinissen, P. Wielage and C. Wouters

Embedded First-In First-Out (FIFO) memories are increasingly used in many IC designs. We have created a new full-custom embedded FIFO module with asynchronous read and write clocks, which is at least a factor of two smaller and also faster than SRAM-based and standard-cell-based counterparts. The detection quality of the FIFO test for both hard and weak resistive shorts and opens has been analyzed by an IFA-like method based on analog simulation. The defect coverage of the initial FIFO test for shorts in the bit-cell matrix has been improved by the inclusion of an additional data background and low-voltage testing; for low-resistance shorts, 100% defect coverage is obtained. The defect coverage for opens has been improved by a new test procedure which includes waiting periods.


Interactive Presentation

PDF icon Logic Level Fault Tolerance Approaches Targeting Nanoelectronics PLAs [p. 865]
W. Rao, A. Orailoglu and R. Karri

A regular structure and the capability to implement arbitrary logic functions in a two-level logic form make crossbar-based Programmable Logic Arrays (PLAs) promising implementation architectures in the emerging nanoelectronics environment. Yet reliability constitutes an important concern in this environment, necessitating thorough investigation and effective reliability enhancement for crossbar-based PLAs. In this paper, we investigate fault masking for crossbar-based nanoelectronics PLAs. Missing nanoelectronic devices at the crosspoints have been observed as a major source of faults in nanoelectronics crossbars. Based on this observation, we present a class of fault masking approaches exploiting logic tautology in two-level PLAs. The proposed approaches enhance the reliability of nanoelectronics PLAs significantly at low hardware cost.


6.4: System Level Validation

Moderators: F. Fummi, Verona U, IT; M. Lajolo, NEC Laboratories, US
PDF icon A Multi-Core Debug Platform for NoC-Based Systems [p. 870]
S. Tang and Q. Xu

Network-on-Chip (NoC) is generally regarded as the most promising solution for the future on-chip communication scheme in gigascale integrated circuits. As the traditional debug architecture for bus-based systems is not readily applicable to identifying bugs in NoC-based systems, in this paper we present a novel debug platform that supports concurrent debug access to the cores under debug (CUDs) and the NoC in a unified architecture. By introducing core-level debug probes between the CUDs and their network interfaces, and a system-level debug agent controlled by an off-chip multi-core debug controller, the proposed debug platform provides in-depth analysis features for NoC-based systems, such as NoC transaction analysis, multi-core cross-triggering and globally synchronized timestamping. The proposed solution is therefore expected to help designers identify bugs in NoC-based systems more effectively and efficiently. Experimental results show that the design-for-debug cost of the proposed technique, in terms of area and traffic requirements, is moderate.

PDF icon Seamless Hardware/Software Performance Co-Monitoring in a Codesign Simulation Environment with RTOS Support [p. 876]
L. Moss, M. De Nanclas, L. Filion, S. Fontaine, G. Bois and M. Aboulhamid

Simulation monitoring tools are needed in hardware/software codesign for performance debugging, model validation and hardware/software partitioning purposes. Existing tools are either hardware- or software-centric and lack integrated and seamless co-monitoring. This paper presents a system-level co-monitoring tool that can monitor the computation and communication activities of SystemC user modules, as well as bus, memory and processor usage, on a variety of hardware/software embedded configurations that may include an RTOS. We also describe how performance metrics are generated during or after simulation and made accessible to users or external applications. Finally, experimental results show that such co-monitoring does not disturb the simulation's internal timing and only moderately increases the simulation's wall-clock run time (by 11-22% for hardware/software partitioned architectures).

PDF icon Incremental ABV for Functional Validation of TL-to-RTL Design Refinement [p. 882]
N. Bombieri, F. Fummi and G. Pravadelli

Transaction-level modeling (TLM) has been proposed as the leading strategy to address the ever-increasing complexity of digital systems. However, its introduction raises a new challenge for designers and verification engineers, since there are no mature tools to automatically synthesize an RTL implementation from a transaction-level (TL) design; manual refinements are therefore mandatory. In this context, the paper presents an incremental assertion-based verification (ABV) methodology to check the correctness of the TL-to-RTL refinement. The methodology relies on reusing assertions and already-checked code, and it is guided by an assertion coverage metric.

PDF icon Efficient Testbench Code Synthesis for a Hardware Emulator System [p. 888]
I. Mavroidis and I. Papaefstathiou

The rising complexity of modern embedded systems is causing a significant increase in the verification effort required by hardware designers and software developers, leading to the "design verification crisis," as it is known among engineers. Today's verification challenges require powerful testbenches and high-performance simulation solutions such as hardware simulation accelerators and hardware emulators, which have been in use in hardware and electronic system design centers for approximately the last decade. In particular, in order to accelerate functional simulation, hardware emulation is used to offload calculation-intensive tasks from the software simulator. However, the communication overhead between the software simulator and the hardware emulator is becoming a new critical bottleneck. We tackle this problem by partitioning the code running on the software simulator into two sections: the testbench HDL (Hardware Description Language) code that communicates directly with the Design Under Test (DUT), and the remaining C-like testbench code. The former section is transformed into synthesizable code while the latter runs on a general-purpose CPU. Our experiments demonstrate that the proposed method reduces the communication overhead by a factor of about 5 compared to a conventional hardware-emulated simulation.


Interactive Presentations

PDF icon Implementation of a Transaction Level Assertion Framework in SystemC [p. 894]
W. Ecker, V. Esen, T. Steininger, M. Velten and M. Hull

Current hardware design and verification methodologies reflect a trend towards abstraction levels higher than RTL, referred to as transaction level (TL). Since transaction level models (TLMs) are used for early prototyping and as reference models for the verification of their RTL representation, the quality assurance of TLMs is vital. Assertion based verification (ABV) of RTL models has improved quality assurance of IP blocks and SoC systems to a great extent. Since mapping of an RTL ABV methodology to TL poses severe problems due to different design paradigms, current ABV approaches need extensions towards TL. In this paper we present a prototype implementation of a TL assertion framework using SystemC which is currently the de facto standard for system modeling.
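
To make the transaction-level assertion idea concrete, the following minimal sketch (plain C++ rather than the authors' SystemC framework; all names are illustrative) checks one TL property over a recorded transaction stream: no address may carry two outstanding requests, and every response must match a pending request.

    #include <cstdint>
    #include <iostream>
    #include <unordered_set>
    #include <vector>

    // Illustrative transaction-level event: a request or a response on an address.
    struct Transaction {
        enum Kind { REQUEST, RESPONSE } kind;
        uint32_t addr;
    };

    // TL assertion checker: operates on whole transactions, not on RTL signals.
    bool check_stream(const std::vector<Transaction>& trace) {
        std::unordered_set<uint32_t> pending;
        for (const Transaction& t : trace) {
            if (t.kind == Transaction::REQUEST) {
                if (!pending.insert(t.addr).second)
                    return false;            // second request while one is pending
            } else if (pending.erase(t.addr) == 0) {
                return false;                // response without a matching request
            }
        }
        return pending.empty();              // every request eventually answered
    }

    int main() {
        std::vector<Transaction> ok = {{Transaction::REQUEST, 0x10},
                                       {Transaction::RESPONSE, 0x10}};
        std::cout << (check_stream(ok) ? "pass" : "fail") << "\n";
    }

In a SystemC setting such a checker would be attached to the TLM communication channel and evaluated as transactions occur, rather than on a recorded trace.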

PDF icon Automatic Generation of Functional Coverage Models from Behavioral Verilog Descriptions [p. 900]
S. Verma, I.G. Harris and K. Ramineni

In industrial practice, functional coverage models are developed based on a high-level specification of the Design Under Verification (DUV). However, in the course of implementation a designer makes specific choices which may not be reflected well in a functional coverage model developed entirely from a high-level specification. We present a method to automatically generate implementation-aware coverage models based on static analysis of an HDL description of the DUV. Experimental results show that the functional coverage models generated using our technique correlate well with the detection of errors injected randomly into a design.


6.5: Model-Based Design for Embedded Systems

Moderators: P.J. Mosterman, The MathWorks, Inc, US; H. Giese, Paderborn U, DE
PDF icon Compositional Specification of Behavioral Semantics [p. 906]
K. Chen, J. Sztipanovits and S. Neema

An emerging common trend in model-based design of embedded software and systems is the adoption of Domain-Specific Modeling Languages (DSMLs). While abstract syntax metamodeling enables the rapid and inexpensive development of DSMLs, the specification of DSML semantics is still a hard problem. In previous work, we have developed methods and tools for the semantic anchoring of DSMLs. Semantic anchoring introduces a set of reusable "semantic units" that provide reference semantics for basic behavioral categories using the Abstract State Machine (ASM) framework. In this paper, we extend the semantic anchoring framework to heterogeneous behaviors by developing a method for the composition of semantic units. Semantic unit composition reduces the required effort from DSML designers and improves the quality of the specification. The proposed method is demonstrated through a case study.

PDF icon Performance Analysis of Multimedia Applications Using Correlated Streams [p. 912]
K. Huang, L. Thiele, T. Stefanov and E. Deprettere

In modern embedded systems, data streams are often partitioned into separate sub-streams which are processed on parallel hardware components. To analyze the performance of these systems with high accuracy, correlations between event streams must be taken into account. No methods are known so far that are able to model such a scenario with the desired accuracy. In this paper, we present a new approach to analyze correlations and we embed this analysis method into a well-established modular performance analysis framework. The presented approach enables system-level performance analysis of complete systems by taking into account stream correlations and blocking-read semantics. Experimental results on a hardware-software prototyping system are provided that show the accuracy of the analysis in a practical application.
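
For orientation, the modular performance analysis framework into which the authors embed their correlation analysis bounds the worst-case delay and backlog of a stream from an upper arrival curve and a lower service curve; in standard real-time calculus notation (generic textbook form, not this paper's extension):

\[
  d_{\max} \;\le\; \sup_{t \ge 0}\, \inf\{\, \tau \ge 0 \;:\; \alpha^u(t) \le \beta^l(t+\tau) \,\},
  \qquad
  b_{\max} \;\le\; \sup_{t \ge 0}\, \bigl( \alpha^u(t) - \beta^l(t) \bigr).
\]

The paper's contribution is to tighten such bounds when sub-streams are correlated, a case the individual curves alone cannot capture.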

PDF icon Simulation Platform for UHF RFID [p. 918]
V. Derbek, C. Steger, R. Weiß, D. Wischounig, J. Preishuber-Pfluegl and M. Pistauer

Developing modern integrated and embedded systems requires well-designed processes to ensure flexibility and independence. These features are related to the exchangeability of hardware targets and to the ability to choose the target at a very late stage in the implementation process. Especially in the field of ultra high frequency radio frequency identification (UHF RFID), the model-based design approach delivers the expected results. Besides a clear design process, which is applied in this work to build the required system architecture, the scope for UHF RFID simulations is defined and an extendable platform based on The MathWorks Matlab Simulink® is developed. This simulation platform, based on a multi-processor hardware target using a Texas Instruments TMS320C6416 digital signal processor, is able to run UHF RFID tag simulations of very high complexity. Particular effort is made to ensure the flexibility to handle future simulation models on the same hardware target, realized through the platform's continuous model-based design and implementation flow.

PDF icon Tool-Support for the Analysis of Hybrid Systems and Models [p. 924]
A. Bauer, M. Pister and M. Tautschnig

This paper introduces a method and tool-support for the automatic analysis and verification of hybrid and embedded control systems, whose continuous dynamics are often modelled using MATLAB/Simulink. The method is based upon converting system models into the uniform input language of our efficient multi-domain constraint solving library, ABSOLVER, which is then used for subsequent analysis. Basically, ABSOLVER is an extensible SMT-solver which addresses mixed Boolean and (nonlinear) arithmetic constraint problems as they appear in the design of hybrid control systems. It allows the integration and semantic connection of various domain specific solvers via a logical circuit, such that almost arbitrary multi-domain constraint problems can be formulated and solved. Its design has been tailored for extensibility, and thus facilitates the reuse of expert knowledge, in that the most appropriate solver for a given task can be integrated and used. As such the only constraint over the problem domain is the capability of the employed solvers. Our approach to systems verification has been validated in an industrial case study using the model of a car's steering control system. However, additional benchmarks show that other hard instances of problems could also be solved by ABSOLVER in respectable time, and that for some instances, ABSOLVER's approach was the only means of solving a problem at all.


Interactive Presentation

PDF icon Automatic Model Generation for Black Box Real-Time Systems [p. 930]
T.H. Feng, L. Wang, W. Zheng, S. Kanajan and S.A. Seshia

Embedded systems are often assembled from black box components. System-level analyses, including verification and timing analysis, typically assume the system description, such as RTL or source code, as an input. There is therefore a need to automatically generate formal models of black box components to facilitate analysis. We propose a new method to generate models of real-time embedded systems based on machine learning from execution traces, under a given hypothesis about the system's model of computation. Our technique is based on a novel formulation of the model generation problem as learning a dependency graph that indicates partial ordering between tasks. Tests based on an industry case study demonstrate that the learning algorithm can scale up and that the deduced system model accurately reflects dependencies between tasks in the original design. These dependencies help us formally prove properties of the system and also extract data dependencies that are not explicitly stated in the specifications of black box components.
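
The core learning step can be pictured with a small sketch (illustrative only; the paper's formulation is richer and works under an explicit model-of-computation hypothesis): a conservative dependency edge a -> b survives only if a precedes b in every observed trace.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    using Trace = std::vector<std::string>;  // one execution trace: task ids in order

    // Keep edge a -> b only if a's first occurrence precedes b's in every trace.
    std::set<std::pair<std::string, std::string>>
    learn_dependencies(const std::vector<Trace>& traces) {
        std::set<std::pair<std::string, std::string>> edges, killed;
        for (const Trace& tr : traces) {
            std::map<std::string, size_t> first_pos;
            for (size_t i = 0; i < tr.size(); ++i)
                first_pos.emplace(tr[i], i);          // remember first occurrence
            for (const auto& [a, pa] : first_pos)
                for (const auto& [b, pb] : first_pos)
                    if (a != b) {
                        if (pa < pb) edges.insert({a, b});
                        else killed.insert({a, b});   // counterexample observed
                    }
        }
        for (const auto& e : killed) edges.erase(e);  // drop contradicted edges
        return edges;
    }

    int main() {
        std::vector<Trace> traces = {{"sense", "filter", "send"},
                                     {"sense", "send", "filter"}};
        for (const auto& [a, b] : learn_dependencies(traces))
            std::cout << a << " -> " << b << "\n";    // only sense->filter, sense->send
    }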


6.6: PANEL SESSION - Life Begins at 65 - Unless You Are Mixed Signal

Organizers: N. Nandra, Synopsys, US; R. Wittmann, Nokia, DE
Moderator: G. Gielen, KU Leuven, BE
PDF icon Life Begins at 65 - Unless You Are Mixed Signal? [p. 936]
R. Wittmann, N. Nandra, J. Kunkel, M. Vanzi, J. Franca, H.-J. Wassener, C. Münker

The old school of analog designers, exemplified by pioneer Bob Pease, is becoming an extinct species. But the demand for analog/mixed-signal IP blocks has never been greater, especially at 65 nm and below. Can this demand be met by using externally designed 3rd party analog/mixed-signal IP? Or is the implementation of revolutionary changes to traditional work flows and analog design processes a suitable option? Which solutions that help increase design efficiency are currently on the table? In the future, on which side of the table will analog designers of Bob Pease's generation sit: the IP provider's or the chip company's? Or are their skills redundant for the 65 nm analog design challenges?


6.7: Resource Optimisation for Best Effort and Quality of Service

Moderators: M. Coppola, STMicroelectronics, IT; P. Ienne, EPFL Lausanne, CH
PDF icon Routing Table Minimization for Irregular Mesh NoCs [p. 942]
E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny

The majority of current Network on Chip (NoC) architectures employ mesh topology and use simple static routing to reduce power and area. However, regular mesh topology is unrealistic due to variations in module sizes and shapes, and is not suitable for application-specific NoCs. Consequently, simplistic routing techniques such as XY routing are inadequate, raising the need for low-cost alternatives that can work in irregular mesh networks. In this paper we present a novel technique for reducing the total hardware cost of routing tables for both source and distributed routing approaches. The proposed technique is based on applying a fixed routing function combined with minimal deviation tables that are used only when the routing decisions for a given destination deviate from the predefined routing function. We apply this methodology to compare three hardware-efficient routing methods for irregular mesh topology NoCs. For each method, we develop path selection algorithms that minimize the overall cost of routing tables. Finally, we demonstrate by simulations on random and specific real application network instances a significant cost saving compared to standard solutions, and examine the scaling of cost savings with growing NoC size.
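
The mechanism lends itself to a short sketch (hypothetical types and names; the paper's actual contribution is the path-selection algorithms that keep the deviation tables minimal): route with a fixed default function, and consult a small per-router table only for destinations whose chosen path deviates from it.

    #include <cstdint>
    #include <map>
    #include <utility>

    enum class Port : uint8_t { NORTH, SOUTH, EAST, WEST, LOCAL };

    struct XY { int x, y; };

    // Fixed default routing function: dimension-ordered XY routing
    // (y grows northward in this sketch).
    Port xy_route(XY here, XY dest) {
        if (dest.x > here.x) return Port::EAST;
        if (dest.x < here.x) return Port::WEST;
        if (dest.y > here.y) return Port::NORTH;
        if (dest.y < here.y) return Port::SOUTH;
        return Port::LOCAL;
    }

    // Deviation entries exist only where the selected path differs from the
    // default, e.g. to steer around an oversized module in an irregular mesh.
    struct Router {
        XY pos;
        std::map<std::pair<int, int>, Port> deviation;

        Port route(XY dest) const {
            auto it = deviation.find({dest.x, dest.y});
            return it != deviation.end() ? it->second : xy_route(pos, dest);
        }
    };

    int main() {
        Router r{{1, 1}, {{{2, 2}, Port::NORTH}}};  // deviate only for (2,2)
        return r.route({2, 2}) == Port::NORTH ? 0 : 1;
    }

The cost saving follows directly: a full routing table stores one entry per destination, while the deviation table stores entries only for the few destinations that cannot use the default function.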

PDF icon Congestion-Controlled Best-Effort Communication for Networks-on-Chip [p. 948]
J.W. van den Brand, C. Ciordas, K. Goossens and T. Basten

Congestion has negative effects on network performance. In this paper, a novel congestion control strategy is presented for Networks-on-Chip (NoC). For this purpose we introduce a new communication service, congestion-controlled best-effort (CCBE). The load offered to a CCBE connection is controlled based on congestion measurements in the NoC. Link utilization is monitored as a congestion measure and transported to a Model Predictive Controller (MPC). Guaranteed bandwidth and latency connections in the NoC are used for this, to assure progress of link utilization data in a congested NoC. We also present a simple but effective model of link utilization for the model-based predictions. Experimental results show that the presented strategy is effective and has reaction times of several microseconds, which is acceptable for real-time embedded systems.
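
The control loop can be caricatured as follows (a deliberately simplified sketch with a plain proportional controller standing in for the paper's model predictive controller, and a toy link model in place of the NoC):

    #include <algorithm>
    #include <cstdio>

    double offered_load = 0.2;        // load admitted to the CCBE connection
    const double setpoint = 0.8;      // target link utilization, below saturation
    const double gain = 0.5;          // proportional gain

    // Invoked whenever a fresh utilization measurement arrives from the NoC.
    void on_utilization_sample(double measured_util) {
        offered_load += gain * (setpoint - measured_util);
        offered_load = std::clamp(offered_load, 0.0, 1.0);
    }

    int main() {
        double util = 0.95;                            // start congested
        for (int i = 0; i < 8; ++i) {
            on_utilization_sample(util);
            util = 0.6 * util + 0.4 * offered_load;    // toy first-order link model
            std::printf("util = %.3f  load = %.3f\n", util, offered_load);
        }
    }

An MPC replaces this one-term correction with an optimization over a predicted utilization trajectory, based on the link-utilization model the paper presents.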

PDF icon Undisrupted Quality-of-Service during Reconfiguration of Multiple Applications in Networks on Chip [p. 954]
A. Hansson, M. Coenen and K. Goossens

Networks on Chip (NoC) have emerged as the design paradigm for scalable System on Chip (SoC) communication infrastructure. Due to convergence, a growing number of applications are integrated on the same chip. When combined, these applications result in use-cases with different communication requirements. The NoC is configured per use-case, and traditionally all running applications are disrupted during use-case transitions, even those continuing operation. In this paper we present a model that enables partial reconfiguration of NoCs and a mapping algorithm that uses the model to map multiple applications onto a NoC with undisrupted Quality-of-Service during reconfiguration. The performance of the methodology is verified by comparison with existing solutions for several SoC designs. We apply the algorithm to a mobile phone SoC with telecom, multimedia and gaming applications, reducing NoC area by more than 17% and power consumption by 50% compared to a state-of-the-art approach.


7.1: HOT TOPIC - Testing 35 Billion Transistors in 2020: Is It Possible?


Organizers: L. Anghel, TIMA Laboratory, FR; M.-L. Flottes, LIRMM, Montpellier, FR
Moderator: Y. Zorian, Virage Logic, US
PDF icon Testing in the Year 2020 [p. 960]
R. Galivanche, R. Kapur and A. Rubio

Testing a several-hundred-million-transistor System-on-Chip with analog and RF blocks, many processor cores and tens of memories is already a huge task today. What will test technology be like in the year 2020, with hundreds of billions of transistors on a single chip? Can we get there with tweaks to today's technology? While the exact nature of the circuit styles, architectural innovations and product innovations of the year 2020 is highly speculative at this point, we examine the impact of likely design and process technology trends on testing methods.


7.2: Designs in Avionics, Military and Space

Moderators: P. Manet, U Catholique de Louvain, BE; I. Söderquist, SAAB AB, Saab Avitronics, SE
PDF icon Transaction Level Modeling of SCA Compliant Software Defined Radio Waveforms and Platforms PIM/PSM [p. 966]
G. Gailliard, E. Nicollet, M. Sarlotte and F. Verdier

In the scope of the US Department of Defense (DoD) Joint Tactical Radio System (JTRS) program, the portability and reconfigurability needs of Software Defined Radios (SDR) imposed by the Software Communications Architecture (SCA) [1] can be addressed thanks to Model Driven Architecture (MDA) and the component/container paradigm, which accommodate a heterogeneous hardware and software architecture. In this paper, we propose SystemC Transaction Level Modelling (TLM) to simulate the Platform Independent Model (PIM) and Platform Specific Model (PSM) of SDRs, while keeping the component/container approach for application portability. We show that SystemC 2.1 natively enables simulation of the waveform PIM specified in UML, yielding an executable specification that can be reused to validate the SystemC TLM model of the PSM. The latter allows radio platform virtualisation and true reuse of IP models to validate SDR waveforms and platforms earlier.

PDF icon Event Driven Data Processing Architecture [p. 972]
I. Söderquist

This paper describes a data processing architecture in which events and time are the focus. This differs from traditional von Neumann and data flow architectures. New instruction codes are defined and special circuitry is introduced to express and execute event and time operations. This results in reconfigurable, software-controlled functionality together with real-time performance comparable to dedicated VLSI solutions. The architecture is demonstrated in a real-time radar jammer application. The architecture is also promising for applications such as routers and network processors. A prototype system on silicon (SoC), complete with signal memory, instruction memory, four processing units in parallel and interfaces for digitized signals and a host computer, is fabricated in 0.35 μm standard CMOS. Time events of signal data on two simultaneous 8-bit links can be programmed with a time resolution of one clock period. Measurements verified correct function and performance above 400 MHz clock frequency at a 3.3 V supply. Power consumption is 3.6 W at 320 MHz.

PDF icon Reconfigurable System-on-Chip Data Processing Units for Space Imaging Instruments [p. 977]
B. Fiethe, H. Michalik, C. Dierker, B. Osterloh and G. Zhou

Individual Data Processing Units (DPUs) are commonly used for operational control and specific data processing in scientific space instruments. To overcome the limitations of traditional rad-hard or fully commercial design approaches, a System-on-Chip (SoC) solution based on state-of-the-art FPGAs is introduced. This design has been successfully demonstrated in space on Venus Express. From this, a reconfigurable DPU design for future advanced imaging sensors is derived using embedded processing cores. In addition, a SoC design variant is presented based on recently available FPGA technology with an integrated hardwired processor, which is capable of supporting high-end payload applications as well.

PDF icon Enabling Certification for Dynamic Partial Reconfiguration Using a Minimal Flow [p. 983]
B. Rousseau, P. Manet, D. Galerin, D. Merkenbraeck, J.-D. Legat, F. Dedeken and Y. Gabriel

As the trend in reconfigurable electronics goes towards strong integration, FPGA devices are becoming more and more interesting. They are already used for safety-critical applications such as avionics [9]. The latest FPGAs also enable new techniques such as dynamic partial reconfiguration (DPR), opening new possibilities in terms of performance and flexibility. Their use in safety-critical systems is considered impossible nowadays, since such systems must be strictly validated and DPR brings many new issues. Indeed, the tools used for DPR must be certified, which is all but impossible for the current DPR tools provided by the vendors. We have developed a simple flow on top of the usual static one for Xilinx FPGAs that does not require any support from the vendor tools for DPR. This reduces the complexity of tool certification and takes a step towards enabling the certification of DPR for safety-critical applications. Moreover, under strong hypotheses, and by using safe design principles, we show how the complexity of certifying DPR can be reduced.

PDF icon Identification of Process/Design Issues during 0.18 μm Technology Qualification for Space Application [p. 989]
J. Ferrigno, P. Perdu, K. Sanchez and D. Lewis

Optical techniques (light emission and laser stimulation) are routinely used to evaluate defects on specific components for space applications. Just one anomaly in one component could have catastrophic consequences for a satellite. We must analyse any kind of device fault, whatever its origin: it can be design, design-process, process or end-user related. At the early stage of an analysis, choosing the right technique is an increasingly complex task. In some cases, one technique may bring value while the others do not. Using a 180 nm test-structure device, we present results showing the complementarity of Emission Microscopy (EMMI), Time-Resolved Emission (TRE) and Dynamic Laser Stimulation (DLS), in order to help debug engineers choose the right approach. This complementarity gives us the ability to strengthen hypotheses before any kind of physical analysis.


Interactive Presentations

PDF icon RECOPS: Reconfiguring Programmable Devices for Military Hardware Electronics [p. 994]
P. Manet, D. Maufroid, L. Tosi, M. Di Ciano, O. Mulertt, Y. Gabriel, J.-D. Legat, D. Aulagnier, C. Gamrat, R. Liberati and V. La Barba

This paper presents the RECOPS project, which studies the use of reconfiguration in military applications. The project explores the new potentials and possibilities offered by reconfigurable components such as FPGAs. It identifies specificities related to the use of this technology in military applications and proposes solutions to support them. Specific techniques such as dynamic reconfiguration and high-speed serial I/Os are also covered. The paper gives a description of the project and then presents preliminary results on the advantages and impacts of using reconfiguration in military applications. It also gives a synthetic view of the needs and challenges this technology must address to be integrated into professional and military electronics. These findings are based on a study of a broad range of seven demonstrators covering most fields of military applications.
Keywords: reconfiguration, FPGA, defense, military, dynamic reconfiguration, partial reconfiguration, reconfigurable computing, high speed I/O.


7.4: Timing Analysis and Validation

Moderators: F. Salice, Politecnico di Milano, IT; P. Sánchez, Cantabria U, ES
PDF icon WAVSTAN: Waveform Based Variational Static Timing Analysis [p. 1000]
S.K. Tiwary and J.R. Phillips

We present a waveform-based variational static timing analysis methodology. It is a timing paradigm that lies midway between conventional static delay approximations and full dynamic (SPICE-level) analysis. The core idea is to break the modulation of waveforms processed by a circuit into two parts: (a) non-linear circuit elements, e.g., transistors and diodes, and (b) linear elements, e.g., transmission lines and RLC networks. The non-linear and linear parts of the circuit are then solved using a combination of current-source modeling, model order reduction, perturbation analysis and learning-based Galerkin methods, which yields SPICE-like accuracy. The proposed method is potentially as robust as, and 10-20X faster than, current-source based gate modeling methodologies.

PDF icon Rapid and Accurate Latch Characterization via Direct Newton Solution of Setup/Hold Times [p. 1006]
S. Srivastava and J. Roychowdhury

Characterizing setup/hold times of latches and registers, a crucial component for achieving timing closure of large digital designs, typically occupies months of computation at companies such as Intel and IBM. We present a novel approach to speed up latch characterization by formulating the setup/hold time problem as a scalar nonlinear equation h(τ) = 0 derived using state-transition functions, and then solving this equation by Newton-Raphson (NR). The local quadratic convergence of NR results in rapid improvements in accuracy at every iteration, thereby significantly reducing the computation needed for accurate determination of setup/hold times. We validate the fast convergence and computational advantage of the new method on transmission gate and C2MOS latch/register structures, obtaining speedups of 4-10x over the current standard of binary search.
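
The numerical core fits in a few lines; in the sketch below, h is a cheap stand-in function (in the paper each evaluation of h(τ) involves the latch's state-transition functions, and derivatives need not come from finite differences):

    #include <cmath>
    #include <cstdio>

    // Stand-in for the characterization function h(tau), whose zero is the
    // setup (or hold) time of interest. In practice each evaluation of h is
    // a circuit-level analysis, so every saved iteration matters.
    double h(double tau) { return std::exp(-5.0 * tau) - 0.5; }

    // Newton-Raphson on the scalar equation h(tau) = 0 with a
    // finite-difference derivative; quadratic convergence near the root
    // replaces the many bisection steps of a plain binary search.
    double solve_characterization(double tau, double tol = 1e-9) {
        for (int i = 0; i < 50; ++i) {
            double f = h(tau);
            if (std::fabs(f) < tol) break;
            const double eps = 1e-7;
            double df = (h(tau + eps) - f) / eps;   // numerical derivative
            tau -= f / df;                          // Newton update
        }
        return tau;
    }

    int main() { std::printf("tau* = %.6f\n", solve_characterization(0.0)); }

For this stand-in, the root ln(2)/5 ≈ 0.138629 is reached to full tolerance in a handful of iterations, versus roughly 30 bisection steps for the same accuracy.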

PDF icon Temperature and Voltage Aware Timing Analysis: Application to Voltage Drops [p. 1012]
B. Lasbouygues, R. Wilson, N. Azemard and P. Maurine

In the nanometer era, the physical verification of CMOS digital circuits becomes a complex task. Designers must account for new factors that impose a significant change in validation methods. One of the major changes in timing verification to handle process variation lies in the progressive development of statistical static timing engines. However, the statistical approach cannot accurately capture the deterministic variations of voltage and temperature. Therefore, we define a novel method, based on non-linear derating coefficients, to account for these environmental variations. Based on the reports of temperature and voltage-drop CAD tools, this method allows the delay of logic paths to be computed under more realistic operating conditions for each cell. An application is given to the analysis of voltage-drop effects on timing.

PDF icon Accurate Timing Analysis Using SAT and Pattern-Dependent Delay Models [p. 1018]
D. Tadesse, D. Sheffield, E. Lenge, R.I. Bahar and J. Grodstein

Accurate delay modeling beyond static models is critical to achieving better correlation with post-silicon analysis. Furthermore, post-silicon timing validation requires a pattern-dependent timing model to generate patterns. To address these issues, we propose a timing analysis tool that integrates a data-dependent delay model into its analysis. Our approach solves for the delay by using the concept of circuit unrolling and by formulating timing questions as decision problems for input to a SAT solver. The effectiveness and validity of the proposed methodology is illustrated through experiments on benchmark circuits.


7.5: Model-Based Analysis and Middleware of Embedded Systems

Moderators: S. van Loo, Philips Research, NL; H. De Groot, European Microsoft Innovation Centre, DE
PDF icon CARAT: A Toolkit for Design and Performance Analysis of Component-Based Embedded Systems [p. 1024]
E. Bondarev, M. Chaudron and P.H.N. de With

Solid frameworks and toolkits for design and analysis of embedded systems are of high importance, since they enable early reasoning about critical properties of a system. This paper presents a software toolkit that supports the design and performance analysis of real-time component-based software architectures deployed on heterogeneous multiprocessor platforms. The tooling environment contains a set of integrated tools for (a) component storage and retrieval, (b) graphics-based design of software and hardware architectures, (c) performance analysis of the designed architectures and, (d) automated code generation. The cornerstone of the toolkit is a performance analysis framework that automates composition of the individual component models into a system executable model, allows simulation of the system model and gives design-time predictions of key performance properties like response time, data throughput, and usage of hardware resources. We illustrate the efficiency of this toolkit on a Car Radio Navigation benchmark system.

PDF icon Modeling and Simulation Alternatives for the Design of Networked Embedded Systems [p. 1030]
E. Alessio, F. Fummi, D. Quaglia and M. Turolla

This paper addresses the problem of modeling and simulating large sets of heterogeneous networked embedded systems which cooperate to build cost-efficient, reliable, secure and scalable applications. The goal is an application-driven, top-down design flow which starts from application requirements and then progressively decides the general architecture of the system and the type and structure of its HW, SW and network components. In the past, considerable research effort has gone into creating specific tools for each design domain (software, hardware and network) and into integrating them for data exchange between models and their joint simulation. However, the advantages and drawbacks of different combinations of tools in the various stages of the design flow have not been discussed. The paper describes and discusses how to combine different modeling tools to provide different modeling and simulation alternatives for the design of networked embedded systems devoted to complex distributed applications. The problem is approached both theoretically and practically, with a real application derived from a European project.

PDF icon Middleware Design Optimization of Wireless Protocols Based on the Exploitation of Dynamic Input Patterns [p. 1036]
S. Mamagkakis, D. Soudris and F. Catthoor

Today, wireless networks move large amounts of data between mobile devices, which have to work in a ubiquitous computing environment that changes perpetually at run-time (i.e., nodes log on and off, user activity varies, etc.). These changes introduce problems that cannot be fully analyzed at design-time and require dynamic (run-time) solutions. These solutions are implemented with run-time resource management at the middleware level for a wide variety of embedded systems. In this paper, we motivate and propose the characterization of the dynamic inputs of wireless protocols (e.g., input to the IEEE 802.11b protocol coming from IPv4 data fragmentation). Through statistical analysis we derive patterns that guide our optimization of the middleware design for run-time resource management. We assess the effectiveness of our approach with inputs from 18 real-life case studies of wireless networks. Finally, we show up to an 81.97% increase in the performance of the proposed design solution compared to state-of-the-art solutions, without compromising memory footprint or energy consumption.

PDF icon Lightweight Middleware for Seamless HW-SW Interoperability, with Application to Wireless Sensor Networks [p. 1042]
F.J. Villanueva, D. Villa, F. Moya, J. Barba, F. Rincón and J.C. López

HW-SW interoperability by means of standard distributed object middlewares has proved useful in the design of new and challenging applications for ubiquitous computing and ambient intelligence environments. Wireless sensor networks are considered essential for the proper deployment of these applications, but they impose new constraints on the design of the corresponding communication infrastructure: low-cost middleware implementations that can fit into tiny wireless devices are needed. In this paper, a novel approach for the development of pervasive environments based on an ultra low-cost implementation of standard distributed object middlewares (such as CORBA or ICE) is presented. A fully functional prototype supporting full interoperability with ZeroC ICE is described in detail. Available implementations range from the smallest microcontrollers on the market, to the tiniest embedded Java virtual machines, and even a low-end FPGA.


Interactive Presentation

PDF icon A Middleware-centric Design Flow for Networked Embedded Systems [p. 1048]
F. Fummi, G. Perbellini, R. Pietrangeli and D. Quaglia

The paper focuses on the design of networked embedded systems which cooperate to provide complex distributed applications. A milestone in the effort to simplify the implementation of such applications has been the introduction of a service layer, named middleware, which abstracts from the peculiarities of the operating system and HW components. However, the middleware has not yet been introduced into the design flow as an explicit dimension. This work presents an abstract model of middleware supporting different programming paradigms; it can be used as a component in the design flow and allows the application to be simulated and developed without premature assumptions about the actual HW/SW platform. At the end of the design flow the abstract middleware can be mapped to an actual middleware. The methodology has been analyzed both theoretically and practically with an actual application on a wireless sensor network.


7.6: Advanced Architectures for Low Power Optimization

Moderators: J. Henkel, Karlsruhe U, DE; A. Macii, Politecnico di Torino, IT
PDF icon Dynamic Reconfiguration in Sensor Networks with Regenerative Energy Sources [p. 1054]
A. Nahapetian, P. Lombardo, A. Acquaviva, L. Benini and M. Sarrafzadeh

In highly power-constrained sensor networks, harvesting energy from the environment makes prolonged or even perpetual execution feasible. In such energy harvesting systems, energy sources are characterized as regenerative. Regenerative energy sources fundamentally change the problem of power scheduling for embedded devices. Instead of maximizing the lifetime of the system given a total amount of energy, as in traditional battery-powered devices, the problem becomes one of preventing energy depletion at any given time. Coupling relatively computationally intensive applications, such as video processing, with the constrained FPGAs that are feasible on power-constrained embedded systems makes dynamic reconfiguration essential. It provides speed comparable to a hardware implementation while allowing reconfiguration to meet the multiple application needs of the system: different applications can be loaded on the FPGA as the system's needs change over time. The problem becomes how to schedule the dynamic reconfiguration to make appropriate use of the regenerative energy source and ensure the proper availability of energy for the system over time. In this paper, we present a methodology for carrying out dynamic reconfiguration with regenerative energy sources, based on statistical analysis of tasks and supply energy. The approach is evaluated through extensive simulations. Additionally, we have evaluated our implementation on our regenerative-energy, dynamically reconfigurable prototype, known as the MicrelEye. Our approach is shown to miss 57.7% fewer deadlines on average than the current approach for reconfiguration with regenerative energy sources.

PDF icon Dynamic Power Management under Uncertain Information [p. 1060]
H. Jung and M. Pedram

This paper tackles the problem of dynamic power management (DPM) in nanoscale CMOS design technologies that are typically affected by increasing levels of process, voltage, and temperature (PVT) variations and fluctuations. This uncertainty significantly undermines the accuracy and effectiveness of traditional DPM approaches. More specifically, we propose a stochastic framework to improve the accuracy of decision making in power management, while considering the manufacturing process and/or design induced uncertainties. A key characteristic of the framework is that uncertainties are effectively captured by a partially observable semi-Markov decision process. As a result, the proposed framework brings the underlying probabilistic PVT effects to the forefront of power management policy determination. Experimental results with a RISC processor demonstrate the effectiveness of the technique and show that our proposed variability-aware power management technique ensures robust system-wide energy savings under probabilistic variations.
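
For reference, decision making under partial observability rests on maintaining a belief b over the hidden states; the standard Bayesian belief update after taking action a and observing o (textbook POMDP form, not this paper's exact semi-Markov variant) is

\[
  b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
                    {\Pr(o \mid b, a)},
\]

where T is the transition model, O the observation model, and the denominator a normalizing constant. The power management policy then acts on the belief rather than on an exact state that PVT uncertainty makes unknowable.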

PDF icon Very Wide Register: An Asymmetric Register File Organization for Low Power Embedded Processors [p. 1066]
P. Raghavan, A. Lambrechts, M. Jayapala, F. Catthoor, D. Verkest and H. Corporaal

In current embedded processors, multi-ported register files are among the most power-hungry parts of the processor, even when they are clustered. This paper presents a novel register file architecture which has single-ported cells and asymmetric interfaces to the memory and to the datapath. Several realistic kernels from the TI DSP benchmark suite and from Software Defined Radio (SDR) are mapped onto the architecture. A complete physical design of the architecture is done in TSMC 90nm technology. The novel architecture is shown to obtain energy gains of up to 10X with respect to a conventional multi-ported register file across the different benchmarks.


Interactive Presentations

PDF icon Single-ended Coding Techniques for Off-chip Interconnects to Commodity Memory [p. 1072]
M. Choudhury, K. Ringgenberg, S. Rixner and K. Mohanram

This paper introduces a class of single-ended coding schemes to reduce off-chip interconnect energy consumption. State-of-the-art codes for processor-memory off-chip interfaces require the transmitter and receiver (memory controller and memory) to collaborate using current and previously transmitted values to encode and decode data. Modern embedded systems, however, cannot afford to use such double-ended codes that require specialized memories to participate in the code. In contrast, a single-ended code enables the memory controller to encode data stored in memory and subsequently decode that data when it is retrieved, allowing the use of commodity memories. In this paper, single-ended codes are presented that assign limited-weight codewords using trace-based mapping techniques. Simulation results show that such codes can reduce the energy consumption of an uncoded off-chip interconnect by up to 42.5%.
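
The flavor of a trace-based, single-ended limited-weight code can be sketched as follows (illustrative only; the paper's codes and mapping construction are more involved): rank 8-bit data values by their frequency in a representative trace and give the most frequent values the 9-bit codewords of lowest Hamming weight, so the common case drives few wires. Only the memory controller needs the table; the memory stores codewords as opaque data.

    #include <algorithm>
    #include <bitset>
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // Build the encode table: frequent 8-bit values -> low-weight 9-bit codewords.
    std::map<uint8_t, uint16_t> build_code(const std::vector<uint8_t>& trace) {
        // Count value frequencies (negated so an ascending sort puts frequent first).
        std::vector<std::pair<int, uint8_t>> freq(256);
        for (int v = 0; v < 256; ++v) freq[v] = {0, static_cast<uint8_t>(v)};
        for (uint8_t v : trace) --freq[v].first;
        std::sort(freq.begin(), freq.end());

        // Enumerate all 9-bit codewords in order of increasing Hamming weight.
        std::vector<uint16_t> codewords(512);
        for (uint16_t w = 0; w < 512; ++w) codewords[w] = w;
        std::stable_sort(codewords.begin(), codewords.end(),
                         [](uint16_t a, uint16_t b) {
                             return std::bitset<9>(a).count() < std::bitset<9>(b).count();
                         });

        std::map<uint8_t, uint16_t> code;
        for (int i = 0; i < 256; ++i) code[freq[i].second] = codewords[i];
        return code;  // controller encodes on write, inverts the map on read
    }

    int main() {
        std::vector<uint8_t> trace = {0, 0, 0, 7, 7, 255};
        auto code = build_code(trace);
        return code[0] == 0 ? 0 : 1;  // most frequent value gets the all-zero word
    }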

PDF icon PowerQuest: Trace Driven Data Mining for Power Optimization [p. 1078]
P. Babighian, G. Kamhi and M. Vardi

We introduce a general framework, called PowerQuest, with the primary goal of extracting "interesting" dynamic invariants from a given simulation-trace database, and we apply it to the power-reduction problem through the detection of gating conditions. PowerQuest adopts machine-learning techniques for data mining. The advantages of PowerQuest in comparison with other state-of-the-art Dynamic Power Management (DPM) techniques are: (1) the quality of the ODC conditions used for gating, and (2) the minimization of the extra logic added for gating. We demonstrate the validity of our approach in reducing power through experimental results using ITC99 benchmarks and real-life microprocessor test cases. We present up to 22.7% power reduction in comparison with other DPM techniques.


7.7: Performance Analysis for NoC Architectures

Moderators: S. Murali, Stanford U, US; L. Carloni, UCB, US
PDF icon (408) System Level Assessment of an Optical NoC in an MPSoC Platform [p. 1084]
M. Brière, B. Girodias, Y. Bouchebaba, G. Nicolescu, F. Mieyeville, F. Gaffiot and I. O'Connor

In the near future, Multi-Processor Systems-on-Chip (MPSoC) will become the main thrust driving the evolution of integrated circuits. MPSoCs introduce new challenges, mainly due to growing communication through their interconnect structure. Current electrical interconnects will be hard pressed to sustain such data flows. Integrated optical interconnect is a potential technological improvement to alleviate these problems. The main contributions of this paper are i) the integration of an optical network in a system-level MPSoC platform and ii) the quantitative evaluation of optical interconnect for MPSoC design using a multimedia application.

PDF icon (142) Systematic Comparison between the Asynchronous and the Multi-Synchronous Implementations of a Network on Chip Architecture [p. 1090]
A. Sheibanyrad, I. Miro Panades and A. Greiner

In this paper we present a systematic comparison between two different implementations of a distributed Network on Chip: fully asynchronous and multi-synchronous. The NoC architecture has been designed to be used in a Globally Asynchronous Locally Synchronous clusterized Multi-Processor System on Chip. The five relevant parameters are Silicon Area, Network Saturation Threshold, Communication Throughput, Packet Latency and Power Consumption. Both architectures have been physically implemented and simulated by SystemC/VHDL co-simulation. The electrical parameters have also been evaluated by post-layout SPICE simulation for a 90nm CMOS fabrication process, taking into account long-wire effects.

PDF icon (768) Analytical Router Modeling for Networks-on-Chip Performance Analysis [p. 1096]
U.Y. Ogras and R. Marculescu

Networks-on-Chip (NoCs) have recently emerged as a scalable alternative to classical bus and point-to-point architectures. To date, performance evaluation of NoC designs is largely based on simulation which, besides being extremely slow, provides little insight on how different design parameters affect the actual network performance. Therefore, it is practically impossible to use simulation for optimization purposes. In this paper, we first present a generalized router model and then utilize this novel model for doing NoC performance analysis. The proposed model can be used not only to obtain fast and accurate performance estimates, but also to guide the NoC design process within an optimization loop. The accuracy of our approach and its practical use is illustrated through extensive simulation results.


Interactive Presentation

PDF icon (374) Hard- and Software Modularity of the NOVA MPSoC Platform [p. 1102]
C. Sauer, M. Gries and S. Dirk

The Network-Optimized Versatile Architecture Platform (NOVA) encapsulates embedded cores, tightly and loosely coupled coprocessors, on-chip memories, and I/O interfaces by special sockets that provide a common packet-passing and communication infrastructure. To ease the programming of the heterogeneous multiprocessor target for the application developer, a component-based framework is used for describing packet processing applications in a natural and productive way. Leveraging identical application and hardware communication semantics, code generators and off-the-shelf tool chains can automate the software implementation process. Using a prototype with four processing cores we quantify the overhead of modularity and programmability for the platform.


8.1: TUTORIAL SESSION - State of the Art (Space and Aeronautics Special Day)

Organizers: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
Moderator: S. Prudhomme, Airbus, FR
PDF icon The Methodological and Technological Dimensions of Technology Transfer for Embedded Systems in Aeronautics and Space [p. 1108]
T. Pardessus, H. Daembkes, and R. Arning

This tutorial is in two parts, to elaborate the two pillars of technology transfer in the context of the aeronautics and space industry. The first part illustrates the methodological pillar, showing the state of the art in the industrial approaches to technology transfer. The second part illustrates the technological pillar, giving an overview of recent successes in technology transfer and emerging trends and opportunities, for both hardware and software. These two pillars are further mirrored in the two technical sessions.


8.2: Secure Systems

Moderators: R. Pacalet, ENST, FR; R. Locatelli, STMicroelectronics, FR
PDF icon Energy Evaluation of Software Implementations of Block Ciphers under Memory Constraints [p. 1110]
J. Großschädl, S. Tillich, C. Rechberger, M. Hofmann and M. Medwed

Software implementations of modern block ciphers often require large lookup tables along with code size increasing optimizations like loop unrolling to reach peak performance on general-purpose processors. Therefore, block ciphers are difficult to implement efficiently on embedded devices like cell phones or sensor nodes where run-time memory and program ROM are scarce resources. In this paper we analyze and compare the performance, energy consumption, runtime memory requirements, and code size of the five block ciphers RC6, Rijndael, Serpent, Twofish, and XTEA on the StrongARM SA-1100 processor. Most previous evaluations of block ciphers considered performance as the sole metric of interest and did not care about memory requirements or code size. In contrast to previous work, our study of the performance and energy characteristics of block ciphers has been conducted with "lightweight" implementations which restrict the size of lookup tables to 1 kB and also impose constraints on the code size. We found that Rijndael and RC6 can be well optimized for high performance and energy efficiency, while at the same time meeting the demand for low memory (RAM and ROM) footprint. In addition, we discuss the impact of key expansion and modes of operation on the overall performance and energy consumption of each block cipher. Our simulation results show that RC6 is the most energy-efficient block cipher under memory constraints and thus the best choice for resource-restricted devices.
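
For a sense of scale, XTEA, the most compact of the five ciphers studied, needs no lookup tables at all; its well-known public-domain encryption routine (shown here in C++ for illustration, not taken from the paper) is only a few lines, which is why it fits easily into scarce program ROM even though the study finds RC6 more energy-efficient overall under the same constraints.

    #include <cstdint>

    // Reference XTEA block encryption: 64-bit block v, 128-bit key,
    // typically 32 rounds. No tables, minimal code footprint.
    void xtea_encipher(unsigned num_rounds, uint32_t v[2], const uint32_t key[4]) {
        uint32_t v0 = v[0], v1 = v[1], sum = 0;
        const uint32_t delta = 0x9E3779B9;
        for (unsigned i = 0; i < num_rounds; ++i) {
            v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
            sum += delta;
            v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
        }
        v[0] = v0; v[1] = v1;
    }

Ciphers like Rijndael, by contrast, trade table space for speed, which is exactly the trade-off the study's 1 kB table restriction is designed to probe.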

PDF icon An Area Optimized Reconfigurable Encryptor for AES-Rijndael [p. 1116]
M. Alam, S. Ray, D. Mukhopadhyay, S. Ghosh, D. Roy Chowdhury and I. Sengupta

This paper presents a reconfigurable architecture for the Advanced Encryption Standard (AES-Rijndael) cryptosystem. The suggested reconfigurable architecture is capable of handling all possible combinations of the standard bit lengths (128, 192, 256) of data and key. The fully rolled, inner-pipelined architecture ensures lower hardware complexity. The work develops an FSMD-model-based controller which is ideal for such an iterative implementation of AES. The S-boxes have been implemented using combinational logic over composite field arithmetic, which completely eliminates the need for any internal memory. The design has been implemented on a Xilinx Virtex XCV1000 and in 0.18 μm CMOS technology. The performance of the architecture has been compared with existing results in the literature and found to be among the most compact implementations of the AES algorithm.

PDF icon Performance Aware Secure Code Partitioning [p. 1122]
S.H.K. Narayanan, M. Kandemir and R. Brooks

Many embedded applications exist where decisions are made using sensitive information. A critical issue in such applications is to ensure that data is accessed only by authorized computing entities. In many scenarios, these entities do not rely on each other, yet they need to work on a secure application in parallel to complete application execution under the specified deadline. Our focus in this paper is on compiler-guided secure code partitioning among a set of hosts. The scenario targeted involves a set of hosts that want to execute a secure embedded application in parallel. The various hosts have different levels of access to the data structures manipulated in the application. Our approach partitions the application among the hosts such that the load imbalance across hosts is minimized to reduce execution time while ensuring that no security leak occurs.

PDF icon Energy and Execution Time Analysis of a Software-based Trusted Platform Module [p. 1128]
N. Aaraj, A. Raghunathan, S. Ravi and N.K. Jha

Trusted platforms have been proposed as a promising approach to enhance the security of general-purpose computing systems. However, for many resource-constrained embedded systems, the size and cost overheads of a separate Trusted Platform Module (TPM) chip are not acceptable. One alternative is to use a software-based TPM (SW-TPM), which implements TPM functions using software that executes in a protected execution domain on the embedded processor itself. However, since many embedded systems have limited processing capabilities and are battery-powered, it is also important to ensure that the computational and energy requirements for SW-TPMs are acceptable. In this work, we perform an evaluation of the energy and execution time overheads for a SW-TPM implementation on a Sharp Zaurus PDA. We characterize the execution time and energy required by each TPM command through actual measurements on the target platform. In addition, we also evaluate the overheads of using SW-TPM in the context of various end applications, including trusted boot of the Linux operating system (OS), secure file storage, secure VoIP client, and secure web browser. Furthermore, we observe that for most TPM commands, the overheads are primarily due to the use of 2048-bit RSA operations that are performed within SW-TPM. In order to alleviate SW-TPM overheads, we evaluate the use of Elliptic Curve Cryptography (ECC) as a replacement for the RSA algorithm specified in the Trusted Computing Group (TCG) standards. Our experiments indicate that this optimization can significantly reduce SW-TPM overheads (an average of 6.51X execution time reduction and 6.75X energy consumption reduction for individual TPM commands, and an average of 10.25X execution time reduction and 10.75X energy consumption reduction for applications). Our work demonstrates that ECC-based SW-TPMs are a viable approach to realizing the benefits of trusted computing in resource-constrained embedded systems.


8.3: Reliable Microarchitectures

Moderators: S. Vassiliadis, TU Delft, NL; P. Ienne, EPFL Lausanne, CH
PDF icon Utilization of SECDED for Soft Error and Variation-Induced Defect Tolerance in Caches [p. 1134]
L.D. Hung, H. Irie, M. Goshima and S. Sakai

Combining SECDED with a redundancy technique can effectively tolerate the high variation-induced defect rates of future processes. However, while a defective cell in a block can be repaired by SECDED, the block becomes vulnerable to soft errors. This paper proposes a technique to deal with the degraded resilience against soft errors: only clean data can be stored in defective blocks of a cache. This constraint is enforced through a selective write-through mechanism. An error occurring in a defective block can thus be detected, and the correct data can be obtained from the lower-level caches.
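
The policy reduces to a small change in the cache's write path, sketched below (hypothetical structures; the real mechanism lives in the cache controller alongside the SECDED logic): a block whose SECDED budget is already spent on a hard defect is only allowed to hold clean data, so a later soft error in it remains recoverable from the level below.

    #include <cstdint>

    struct CacheBlock {
        bool defective = false;  // a cell here is already repaired by SECDED
        bool dirty = false;
        uint64_t data = 0;
    };

    struct LowerLevel { void write(uint32_t, uint64_t) {} };  // stand-in L2/memory

    // Selective write-through: a defective block never holds the only copy.
    void cache_write(CacheBlock& b, uint32_t addr, uint64_t value, LowerLevel& l2) {
        b.data = value;
        if (b.defective) {
            l2.write(addr, value);   // keep the block clean: write through
            b.dirty = false;
        } else {
            b.dirty = true;          // normal write-back behavior
        }
    }

    int main() {
        CacheBlock b{true, false, 0};
        LowerLevel l2;
        cache_write(b, 0x40, 0xdeadbeefULL, l2);
        return b.dirty ? 1 : 0;      // defective block stays clean
    }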

PDF icon Transient Fault Prediction Based on Anomalies in Processor Events [p. 1140]
S. Narayanasamy, A. Coskun and B. Calder

Future microprocessors will be highly susceptible to transient errors as transistor sizes decrease due to CMOS scaling. Prior techniques advocated full-scale structural or temporal redundancy to achieve fault tolerance. Though they can provide complete fault coverage, they incur significant hardware and/or performance costs. It is desirable to have mechanisms that provide partial but sufficiently high fault coverage at negligible cost. To meet this goal, we propose leveraging speculative structures that already exist in modern processors. The proposed mechanism is based on the insight that when a fault occurs, the incorrect execution is likely to produce an abnormally higher or lower number of mispredictions (branch mispredictions, L2 misses, store set mispredictions) than a correct execution. We design a simple transient fault predictor that detects this anomalous behavior in the outcomes of the speculative structures to predict transient faults.
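
The detection step can be pictured with a minimal sketch (illustrative statistics and thresholds; the paper designs an actual hardware predictor): flag an execution interval as suspicious when a misprediction counter strays several standard deviations from its running statistics.

    #include <cmath>
    #include <cstdio>

    // Running mean/variance of a per-interval misprediction count
    // (Welford's method); an interval far from the norm flags a possible fault.
    struct AnomalyDetector {
        double mean = 0, m2 = 0;
        long n = 0;
        double k;                        // deviations considered anomalous
        explicit AnomalyDetector(double k_) : k(k_) {}

        bool observe(double x) {
            ++n;
            double d = x - mean;
            mean += d / n;
            m2 += d * (x - mean);
            if (n < 8) return false;     // warm-up period
            double sd = std::sqrt(m2 / (n - 1));
            return std::fabs(x - mean) > k * sd;   // anomalous: trigger checks
        }
    };

    int main() {
        AnomalyDetector det(4.0);
        for (int i = 0; i < 100; ++i) det.observe(100 + (i % 5));  // normal runs
        std::printf("anomaly: %d\n", det.observe(400) ? 1 : 0);    // spike flagged
    }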

PDF icon Low-cost Protection for SER Upsets and Silicon Defects [p. 1146]
M. Mehrara, M. Attariyan, S. Shyam, K. Constantinides, V. Bertacco and T. Austin

Extreme transistor scaling trends in silicon technology will soon reach a point where manufactured systems suffer from limited device reliability and severely reduced lifetime, due to early transistor failures, gate oxide wear-out, manufacturing defects, and radiation-induced soft errors (SER). In this paper we present a low-cost technique to harden a microprocessor pipeline and caches against these reliability threats. Our approach utilizes online built-in self-test (BIST) and microarchitectural checkpointing to detect, diagnose and recover the computation impaired by silicon defects or SER events. The approach works by periodically testing the processor to determine whether the system is broken. If so, we reconfigure the processor to avoid using the broken component. A similar mechanism is used to detect SER faults, with the difference that recovery is implemented by re-execution. By utilizing low-cost techniques to address defects and SER, we keep protection costs significantly lower than traditional fault-tolerance approaches while providing high levels of coverage for a wide range of faults. Using detailed gate-level simulation, we find that our approach provides 95% and 99% coverage for silicon defects and SER events, respectively, with only a 14% area overhead.

PDF icon Working with Process Variation Aware Caches [p. 1152]
M. Mutyam and V. Narayanan

Deep-submicron designs have to take care of process variation effects, as variations in critical process parameters result in large variations in the access latencies of hardware components. This is severe in the case of memory components, as minimum-sized transistors are used in their design. In this work, taking on-chip data caches as the example, we study the effect of access latency variations on performance. We discuss the performance losses of the worst-case design, wherein the entire cache operates with the worst-case process variation delay, followed by process variation aware cache designs which work at set-level granularity. We then propose a technique called block rearrangement to minimize the performance loss incurred by a process variation aware cache which works at set-level granularity. Using block rearrangement, we rearrange the physical locations of cache blocks such that a cache set can have its n blocks (assuming an n-way set-associative cache) in multiple rows instead of a single row, as with a conventional addressing scheme. By distributing the blocks of a cache set over multiple rows, we minimize the number of sets affected by process variation. We evaluate our technique using SPEC2000 CPU benchmarks and show that it achieves significant performance benefits over caches with a conventional addressing scheme.


Interactive Presentations

PDF icon (252) An Enhanced Technique for the Automatic Generation of Effective Diagnosis-oriented Test Programs for Processors [p. 1158]
E. Sánchez, M. Schillaci, G. Squillero and M. Sonza Reorda

The ever-increasing usage of microprocessor devices is sustained by a high-volume production that in turn requires a high production yield, backed by a controlled process. Fault diagnosis is an integral part of the industrial effort towards these goals. This paper presents a new methodology that significantly improves on a previous work. The goal is the construction of cost-effective program sets for software-based diagnosis of microprocessors. The methodology exploits existing post-production test sets designed for software-based self-test, and may use an already developed infrastructure IP to perform the diagnosis. Experimental results are reported, comparing the new results with existing ones and showing the effectiveness of the new approach for an Intel i8051 processor core.

PDF icon (161) Functional and Timing Validation of Partially Bypassed Processor Pipelines [p. 1164]
Q. Zhu, A. Shrivastava and N. Dutt

Customizing the bypasses in pipelined processors is an effective and popular means to trade off power, performance and complexity in embedded systems. However, existing techniques are unable to automatically generate test patterns to functionally validate a partially bypassed processor. Manually specifying directed test sequences to validate a partially bypassed processor is not only a complex and cumbersome task, but is also highly error-prone. In this paper we present an automatic directed test generation technique to verify a partially bypassed processor pipeline using a high-level processor description. We define a fault model and a coverage metric for a partially bypassed processor pipeline and demonstrate that our technique can cover all the faults using 107,074 tests for the Intel XScale processor within 40 minutes. In contrast, randomly generated tests achieve 100% coverage only after 2 million tests and half a day. Furthermore, we demonstrate that our technique is able to generate tests for all possible bypass configurations of the Intel XScale processor.


8.4: Formal Techniques to Enhance the Verification Flow

Moderators: V. Bertacco, U of Michigan, US; S. Quer, Politecnico di Torino, IT
PDF icon A Compositional Approach to the Combination of Combinational and Sequential Equivalence Checking of Circuits without Known Reset States [p. 1170]
I.-H. Moon, B. Bjesse and C. Pixley

As the pressure to produce smaller and faster designs increases, the need for formal verification of sequential transformations increases proportionally. In this paper we describe a framework that attempts to extend the set of designs that can be equivalence checked. Our focus lies in integrating sequential equivalence checking into a standard design flow that today relies on combinational equivalence checking. In order to do so, we cannot make use of reset state or reset sequence information (as this is not given in combinational equivalence checking), and we need to mitigate the complexity inherent in traditional sequential equivalence checking algorithms. Our solution integrates combinational and sequential equivalence checking in such a way that the individual analyses benefit from each other. The experimental results show that our framework can verify designs which are out of range for pure sequential equivalence checking methods aimed at designs with unknown reset states.

PDF icon Estimating Functional Coverage in Bounded Model Checking [p. 1176]
D. Große, U. Kühne and R. Drechsler

Formal verification is an important issue in circuit and system design. In this context, Bounded Model Checking (BMC) is one of the most successful techniques. But even if all specified properties can be verified, it is difficult to determine whether they cover the complete functional behavior of a design. We propose a pragmatic approach to estimate coverage in BMC. The approach can easily be integrated in a BMC tool with only minor changes. In our approach, a coverage property is generated for each important signal. If the considered properties do not describe the signal's entire behavior, the coverage property fails and a counter-example is generated. From the counter-example an uncovered scenario can be derived. In this way the approach also helps in design understanding. Our method is demonstrated on a RISC CPU. Based on the results we identified coverage gaps. We were able to close all of them and achieved 100% functional coverage.
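
For orientation, both the design properties and the generated coverage properties are checked on bounded unrollings of the standard form (generic BMC notation, not specific to this paper): the design is unrolled k steps from the initial states I under the transition relation T, and the negated property P is asserted,

\[
  \mathit{BMC}_k \;=\; I(s_0) \,\wedge\, \bigwedge_{i=0}^{k-1} T(s_i, s_{i+1}) \,\wedge\, \neg P(s_0, \ldots, s_k),
\]

so that a satisfying assignment is a counter-example; for a failing coverage property, that counter-example is exactly the uncovered scenario the authors derive.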

PDF icon Abstraction and Refinement Techniques in Automated Design Debugging [p. 1182]
S. Safarpour and A. Veneris

Verification is a major bottleneck in the VLSI design flow with the tasks of error detection, error localization, and error correction consuming up to 70% of the overall design effort. This work proposes a departure from conventional debugging techniques by introducing abstraction and refinement during error localization. Under this new framework, existing debugging techniques can handle large designs with long counter-examples yet remain run time and memory efficient. Experiments on benchmark and industrial designs confirm the effectiveness of the proposed framework and encourage further development of abstraction and refinement methodologies for existing debugging techniques.


Interactive Presentation

PDF icon Automatic Hardware Synthesis from Specifications: A Case Study [p. 1188]
R. Bloem, S. Galler, B. Jobstmann, N. Piterman, A. Pnueli and M. Weiglhofer

We propose to use a formal specification language as a high-level hardware description language. Formal languages allow for compact, unambiguous representations and yield designs that are correct by construction. The idea of automatic synthesis from specifications is old, but used to be completely impractical. Recently, great strides towards efficient synthesis from specifications have been made. In this paper we extend these recent methods to generate compact circuits and we show their practicality by synthesizing an arbiter for ARM's AMBA AHB bus and a generalized buffer from specifications given in PSL. These are the first industrial examples that have been synthesized automatically from their specifications.


8.5: Interconnect Extraction and Synthesis

Moderators: R. Suaya, Mentor Graphics, FR; P. Feldmann, IBM T J Watson Research Center, US
PDF icon pFFT in FastMaxwell: A Fast Impedance Extraction Solver for 3D Conductor Structures over Substrate [p. 1194]
T. Moselhy, X. Hu and L. Daniel

In this paper we describe the acceleration algorithm implemented in FastMaxwell, a program for wideband electromagnetic extraction of complicated 3D conductor structures over substrate. FastMaxwell is based on the integral domain mixed potential integral equation (MPIE) formulation, with a 3-D full-wave substrate dyadic Green's function kernel. Two dyadic Green's functions are implemented. The pre-corrected Fast Fourier Transform (pFFT) algorithm is generalized and used to accelerate the translation-invariant complex domain dyadic kernel. Computational results are given for a variety of structures to validate the accuracy and efficiency of FastMaxwell. O(N log N) computational complexity in both time and memory is demonstrated by our results.

PDF icon Optimization-based Wideband Basis Functions for Efficient Interconnect Extraction [p. 1200]
X. Hu, T. Moselhy, J. White and L. Daniel

This paper introduces a technique for the numerical generation of basis functions that are capable of parameterizing the frequency-variant nature of cross-sectional conductor current distributions. Hence skin and proximity effects can be captured using far fewer basis functions than the prevalently used piecewise-constant basis functions. One important characteristic of these basis functions is that they only need to be pre-computed once for a frequency range of interest per unique conductor cross-sectional geometry, and they can be stored off-line at minimal cost. In addition, the robustness of these frequency-independent basis functions is enforced using an optimization routine. It has been demonstrated that the cost of solving a complex interconnect system can be reduced by a factor of 170 when compared to the use of piecewise-constant basis functions over a wide range of operating frequencies.

PDF icon Thermally Robust Clocking Schemes for 3D Integrated Circuits [p. 1206]
M. Mondal, A.J. Ricketts, S. Kirolos, T. Ragheb, G. Link, N. Vijaykrishnan and Y. Massoud

3D integration of multiple active layers into a single chip is a viable technique that greatly reduces the length of global wires by providing vertical connections between layers. However, dissipating the heat generated in 3D chips poses a major challenge to the success of the technology and is the subject of active current research. Since the generated heat degrades the performance of the chip, thermally insensitive/adaptive circuit design techniques are required for better overall system performance. In this paper, we propose a thermally adaptive 3D clocking scheme that dynamically adjusts the driving strengths of the clock buffers to reduce the clock skew between terminals. We investigate the relative merits and demerits of two alternative clock tree topologies in this work. Simulation results demonstrate that our adaptive technique is capable of reducing the skew by 61.65% on average, leading to much improved clock synchronization and design performance in the 3D realm.

PDF icon Double-Via-Driven Standard Cell Library Design [p. 1212]
T.-Y. Lin, T.-H. Lin, H.-H. Tung and R.-B. Lin

Double-via placement is important for increasing chip manufacturing yield. Commercial tools and recent work handle it well in general. However, their capability of placing double vias between metal 1 and metal 2 (the via1 layer) is limited. This limitation is caused by the way standard cells are designed and cannot be resolved by developing better tools. This paper presents a double-via-driven standard cell library design approach to solving this problem. Compared to the results obtained using a commercial cell library, our library on average achieves a 78% reduction in dead vias and a 95% reduction in dead via1s at the expense of an 11% increase in total via count. We achieve these results (almost) at no extra cost in total cell area and wire length.


Interactive Presentation

PDF icon Analysis of Power Consumption and BER of Flip-flop Based Interconnect Pipelining [p. 1218]
J. Xu, A. Roy and M.H. Chowdhury

This paper addresses the problem of interconnect pipelining from both the power consumption and the bit error rate (BER) points of view, and tries to find the optimal solution for a given wire pipelining scheme in nanometer-scale very large scale integration technologies. A detailed analysis of the dependency of power consumption and BER on the number of inserted flip-flops and the repeater size is performed. For the best trade-off between wire delay, BER and power consumption, a methodology is developed to optimize the repeater size and the number of inserted flip-flops so as to maximize a user-specified figure of merit. This methodology is then applied to calculate the optimal solutions for several International Technology Roadmap for Semiconductors technology nodes.


8.6: EMBEDDED TUTORIAL/PANEL - A Future of Customizable Processors: Are We There Yet?

Organizers: L. Pozzi, Lugano U, CH; P. Paulin, STMicroelectronics, CA
Moderator: P. Paulin, STMicroelectronics, CA
PDF icon A Future of Customizable Processors: Are We There Yet? [p. 1224]
L. Pozzi and P. G. Paulin

Customizable processors are being used increasingly often in SoC designs. During the past few years, they have proven to be a good way to resolve the conflicting flexibility and performance requirements of embedded systems design. While their usefulness has been demonstrated in a wide range of products, a few challenges remain to be addressed: 1) Is extending a standard core template the right approach to customization, or is it preferable to design a fully customized core from scratch? 2) Is the automation offered by current toolchains, in particular the generation of complex instructions and their reuse, enough for what users would like to see? 3) And when we look at the future with the increasing use of multi-processor SoCs, do we see a sea of identical customized processors, or a heterogeneous mix? We comment and elaborate here on these challenges and open questions.


8.7: Placement and Floorplanning

Moderators: J. Dielissen, NXP Research, NL; T. Shiple, Synopsys, FR
PDF icon Fast and Accurate Routing Demand Estimation for Efficient Routability-driven Placement [p. 1226]
P. Spindler and F.M. Johannes

This paper presents a fast and accurate routing demand estimation called RUDY and its efficient integration in a force-directed quadratic placer to optimize placements for routability. RUDY is based on a Rectangular Uniform wire DensitY per net and accurately models the routing demand of a circuit as determined by the wire distribution after final routing. Unlike previously published routing demand estimations, RUDY depends neither on a bin structure nor on a certain routing model to estimate the behavior of a router, and is therefore independent of the router. Our fast and robust force-directed quadratic placer is based on a generic demand-and-supply model and is guided by RUDY to optimize placements for routability. This yields a placer which simultaneously reduces the routing demand in congested regions and increases the routing supply there, thereby fully utilizing the potential to optimize routability. The result is the best routed wirelength published for the IBMv2 benchmark suite to date. In detail, our approach outperforms mPL, ROOSTER, and APlace by 9%, 8%, and 5%, respectively. In terms of the CPU time ROOSTER needs to place this benchmark, our routability-optimizing placer is eight times faster.
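
As an illustration of the bounding-box idea behind RUDY, the following minimal Python sketch spreads each net's half-perimeter wirelength uniformly over its bounding box on an estimation grid. The grid size, wire width and example nets are assumptions for demonstration, not values from the paper.

import numpy as np

GRID = 64              # assumed estimation grid (bins per side)
WIRE_WIDTH = 1.0       # assumed wire width, in layout units

def rudy_map(nets, chip_w, chip_h):
    """Routing demand map: each net adds hpwl * wire_width / box_area
    uniformly over its bounding box (the RUDY idea)."""
    demand = np.zeros((GRID, GRID))
    cell_w, cell_h = chip_w / GRID, chip_h / GRID
    for pins in nets:                       # pins: list of (x, y) tuples
        xs, ys = zip(*pins)
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        w = max(x1 - x0, cell_w)            # avoid zero-area boxes
        h = max(y1 - y0, cell_h)
        density = (w + h) * WIRE_WIDTH / (w * h)
        i0 = min(int(x0 / cell_w), GRID - 1)
        j0 = min(int(y0 / cell_h), GRID - 1)
        i1 = min(int(x1 / cell_w) + 1, GRID)
        j1 = min(int(y1 / cell_h) + 1, GRID)
        demand[i0:i1, j0:j1] += density
    return demand

nets = [[(1.0, 2.0), (30.0, 40.0)], [(10.0, 10.0), (12.0, 35.0), (20.0, 11.0)]]
print(rudy_map(nets, 64.0, 64.0).max())    # peak estimated demand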

PDF icon Yield-aware Placement Optimization [p. 1232]
P. Azzoni, M. Bertoletti, N. Dragone, F. Fummi, C. Guardiani and W. Vendraminetto

In this paper we describe a methodology for avoiding yield-hazardous cell abutments during placement. This is made possible by accurate characterization of the yield penalty associated with particular cell-to-cell interactions. Of course, characterizing all possible cell abutments in a library of 600+ cells is impractical. We describe some simple heuristics that attempt to resolve the complexity of cell abutment pre-characterization. Finally, we show a possible implementation of the proposed yield-aware placement optimization methodology and demonstrate the potential of cell interaction penalty characterization on a 90nm design test case.

PDF icon Microarchitecture Floorplanning for Sub-threshold Leakage Reduction [p. 1238]
H. Mogal and K. Bazargan

Lateral heat conduction between modules affects the temperature profile of a floorplan, and thereby the leakage power of individual blocks, which is becoming an increasingly large fraction of the overall power consumption as fabrication technologies scale. By modeling temperature-dependent leakage power within a microarchitecture-aware floorplanning process, we propose a method that reduces sub-threshold leakage power. To that end, two leakage models are used: a transient formulation independent of any leakage power model, and a simpler formulation derived from an empirical leakage power model, both showing good fidelity to detailed transient simulations. Our algorithm can reduce sub-threshold leakage by up to 15% with a minor degradation in performance, compared to a floorplanning process that does not model leakage. We also show the importance of modeling whitespace during floorplanning and its impact on leakage savings.


9.1.1: HOT TOPIC I - Industrial Applications (Space and Aeronautics Special Day)

Organizers: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
Moderator: E. Lansard, Alcatel Alenia Space, FR
PDF icon Industrial Applications [p. 1244]
X. Olive, J.-M. Pasquet and D. Flament

This first technical session further develops the technological dimensions of technology transfer, with three illustrations of successful and representative industrial applications. It covers the cases of embedded autonomy in spacecraft applications, advanced avionics solutions for satellite communication, and component hybridization for navigation.


9.1.2: LUNCH TIME KEYNOTE - Setting the Industrial Scene (Space and Aeronautics Special Day)

PDF icon Flying Embedded: The Industrial Scene and Challenges for Embedded Systems in Aeronautics and Space [p. 1246]
J. Botti

This keynote address, given by an executive representative of the European aeronautics and space industry, introduces the strategic stakes and the international competitive landscape, setting the scene for the dimensions of technology transfer developed throughout the special day.


9.2: Crypto Blocks and Security

Moderators: R. Locatelli, STMicroelectronics, IT; R. Pacalet, ENST, FR
PDF icon Compact Hardware Design of Whirlpool Hashing Core [p. 1247]
T. Alho, P. Hämäläinen, M. Hännikäinen and T.D. Hämäläinen

Weaknesses have recently been found in the widely used cryptographic hash functions SHA-1 and MD5. A potential alternative to these algorithms is the Whirlpool hash function, which has been standardized by ISO/IEC and evaluated in the European research project NESSIE. In this paper we present a Whirlpool hashing hardware core suited for devices in which low cost is desired. The core consists of a novel 8-bit architecture that allows compact realizations of the algorithm. In the Xilinx Virtex-II Pro XC2VP40 FPGA, our implementation consumes 376 slices and achieves a throughput of 81.5 Mbit/s. The resource utilization of our design is one fourth that of the smallest Whirlpool implementation presented to date.

PDF icon An Efficient Polynomial Multiplier in GF(2^m) and Its Application to ECC Designs [p. 1253]
S. Peter and P. Langendörfer

In this paper we discuss approaches for constructing efficient polynomial multiplication units. Such multipliers are the most important components of ECC hardware accelerators. The proposed hRAIK multiplication improves energy consumption, the longest path, and required silicon area compared to state-of-the-art approaches. We use such a core multiplier to construct an efficient sequential polynomial multiplier based on the known iterative Karatsuba method. Finally, we exploit the beneficial properties of the design to build an ECC accelerator. The design for GF(2^233) requires about 1.4 mm2 of cell area in a 0.25 μm technology and needs 80 μs for an EC point multiplication.
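
For readers unfamiliar with the Karatsuba method the abstract builds on, here is a hedged Python sketch of carry-less (GF(2)[x]) multiplication with a recursive Karatsuba split, polynomials being encoded as integers (bit i is the coefficient of x^i). It shows only the three-half-size-multiplications idea; the paper's hRAIK hardware multiplier is not reproduced.

def clmul_base(a, b):
    """Schoolbook carry-less multiply: XOR-accumulate shifted copies."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n=256):
    """Split into n/2-bit halves: 3 half-size multiplies instead of 4,
    since (a1*x^h + a0)(b1*x^h + b0) needs only lo, hi and one mid term."""
    if n <= 32:
        return clmul_base(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba_gf2(a0, b0, h)
    hi = karatsuba_gf2(a1, b1, h)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, h) ^ lo ^ hi   # cross terms
    return lo ^ (mid << h) ^ (hi << (2 * h))

a = (1 << 233) ^ (1 << 74) ^ 1
b = (1 << 200) ^ (1 << 3) ^ 1
assert karatsuba_gf2(a, b) == clmul_base(a, b)   # sanity check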

PDF icon Flexible Hardware Reduction for Elliptic Curve Cryptography in GF(2^m) [p. 1259]
S. Peter, P. Langendörfer and K. Piotrowski

In this paper we discuss two ways to provide flexible hardware support for the reduction step in Elliptic Curve Cryptography in binary fields (GF(2^m)). In our first approach we use several dedicated reduction units within a single multiplier. Our measurement results show that this simple approach leads to an additional area consumption of less than 10% compared to a dedicated design, without performance penalties. In our second approach, any elliptic curve up to a predefined maximal length can be supported. Here we take advantage of the features of commonly used reduction polynomials. Our results show a significant area penalty compared to dedicated designs. However, we achieve flexibility, and the performance is still significantly better than that of known ECC hardware accelerator approaches with similar flexibility, or even software implementations.
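
The "features of commonly used reduction polynomials" the abstract refers to are their sparsity: with a trinomial or pentanomial, reduction is a few shift-and-XOR folds. A minimal Python sketch, assuming the NIST B-233 trinomial x^233 + x^74 + 1 as an example field (the paper's units are hardware, so this only mirrors the arithmetic):

M, TAPS = 233, (74, 0)        # x^233 = x^74 + x^0 (mod the trinomial)

def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials held in ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_mod(c):
    """Fold coefficients at x^M and above back down via the sparse taps."""
    mask = (1 << M) - 1
    while c >> M:
        high, c = c >> M, c & mask
        for t in TAPS:
            c ^= high << t
    return c

a = (1 << 232) | (1 << 74) | 1
b = (1 << 150) | (1 << 3) | 1
print(hex(reduce_mod(clmul(a, b))))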

PDF icon Overcoming Glitches and Dissipation Timing Skews in Design of DPA-Resistant Cryptographic Hardware [p. 1265]
K.J. Lin, S.C. Fang, S.-H. Yang, and C.C. Lo

Cryptographic embedded systems are vulnerable to Differential Power Analysis (DPA) attacks. In this paper, we propose a logic design style, called Precharge Masked Reed-Muller Logic (PMRML), to overcome the glitch and Dissipation Timing Skew (DTS) problems in the design of DPA-resistant cryptographic hardware. Both problems can significantly reduce DPA resistance. To our knowledge, the DTS problem and its countermeasure have not been reported. The PMRML design can be fully realized using common CMOS standard cell libraries. Furthermore, it can be used to implement universal functions, since any Boolean function can be represented in the Reed-Muller form. An AES encryption module was implemented with multi-stage PMRML. The results show the efficiency and effectiveness of the PMRML design methodology.


9.3: Variation Tolerant Mixed Signal Test

Moderators: A. Rubio, UP Catalunya, ES; S. Mir, TIMA Laboratory, FR
PDF icon Dynamic Critical Resistance: A Timing-Based Critical Resistance Model for Statistical Delay Testing of Nanometer ICs [p. 1271]
J.L. Rosselló, C. de Benito, S.A. Bota, J. Segura

As CMOS IC feature sizes shrink down to the nanometer regime, the need for more efficient test methods capable of dealing with new failure mechanisms increases. Advances in this domain require a detailed knowledge of the physical properties of these failures and the development of appropriate test methods. Several works have shown the relative increase of resistive defects (both opens and shorts), and that they mainly affect circuit timing rather than impacting its static DC behavior. Defect evolution, together with the increase of parameter variations, represents a serious challenge for traditional delay test methods based on fixed time delay limit setting. One alternative for dealing with variation relies on adopting correlation, where test limits for one parameter are set based on its correspondence to other circuit variables. In particular, correlating circuit delay to reduced VDD has been proposed as a useful test method. In this work we investigate the merits of this technique for future technologies, where variation is predicted to increase, analyzing the possibilities of detecting resistive shorts and opens.

PDF icon Sensitivity Analysis for Fault-analysis and Tolerance in RF Front-end Circuitry [p. 1277]
T. Das and P.R. Mukund

RFIC reliability is fast becoming a major bottleneck in the yield and performance of modern IC systems, as process complexity and levels of integration continually increase. Due to the high frequencies involved, testing these chips is both complicated and expensive. While the areas of automated testing and self-test have received significant attention over the past few years, no formal framework of fault models or sensitivity models exists in the RF domain. This paper describes a sensitivity analysis methodology as a first step towards such a framework. It is applied to a Low Noise Amplifier, and a case-study application is discussed using design and experimental results of an adaptive LNA designed in the IBM6RF 0.25 μm CMOS process.

PDF icon A Two-Tone Test Method for Continuous-Time Adaptive Equalizers [p. 1283]
D. Hong, S. Sabri, K.-T. Cheng and C.P. Yue

This paper describes a novel test method for continuous-time adaptive equalizers. This technique applies a two-sinusoidal-tone signal as stimulus and includes an RMS detector for testing, which incurs no performance degradation and a very small area overhead. To validate the technique, we used a recently published adaptive equalizer as our test case and conducted both behavioral and transistor-level simulations. Simulation results demonstrate that the technique is effective in detecting defects in the equalizer, which might not be easily detected by the conventional eye-diagram method.

PDF icon Worst-Case Design and Margin for Embedded SRAM [p. 1289]
R. Aitken and S. Idgunji

An important aspect of Design for Yield for embedded SRAM is identifying the expected worst-case behavior in order to guarantee that sufficient design margin is present. Previously, this has involved multiple simulation corners and extreme test conditions. It is shown that statistical concerns and device variability now require a different approach, based on work in Extreme Value Theory. This method is used to develop a lower bound for variability-related yield in memories.
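
As a flavor of how Extreme Value Theory enters such margin analysis, the toy Python sketch below fits a Gumbel distribution to block maxima of synthetic per-cell margin samples and extrapolates the expected worst case over a full memory. All numbers are invented, and the paper's actual yield lower-bound derivation is not reproduced.

import math, random

random.seed(1)
N_CELLS = 1_000_000     # assumed cells per memory
BLOCK = 1000            # block size for block-maxima fitting

# Synthetic per-cell margin degradation samples, ~N(0, 30 mV)
samples = [random.gauss(0.0, 30.0) for _ in range(100_000)]
maxima = [max(samples[i:i + BLOCK]) for i in range(0, len(samples), BLOCK)]

# Method-of-moments Gumbel fit: sigma = s*sqrt(6)/pi, mu = mean - gamma*sigma
mean = sum(maxima) / len(maxima)
var = sum((x - mean) ** 2 for x in maxima) / (len(maxima) - 1)
sigma = math.sqrt(6.0 * var) / math.pi
mu = mean - 0.5772 * sigma

# Max over N_CELLS cells = max over N_CELLS/BLOCK Gumbel block maxima:
# the location parameter shifts by sigma * ln(number of blocks)
worst = mu + sigma * math.log(N_CELLS / BLOCK)
print(f"expected worst-case degradation ~ {worst:.1f} mV")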


Interactive Presentations

PDF icon Pulse Propagation for the Detection of Small Delay Defects [p. 1295]
M. Favalli and C. Metra

This paper addresses the problems related to resistive opens and bridging faults which cannot be detected using delay fault testing because they lie off the most critical paths. Even if the induced defect is not large enough to result in timing violations, these faults may give rise to reliability problems. To detect them, we propose a testing method that is based on the propagation of pulses within the faulty circuit and that exploits the degraded capability of faulty paths to propagate pulses. The effectiveness of the proposed method is analyzed at the electrical level and compared with the use of a reduced clock period, which can detect the same class of faults. Results show similar performance in the case of resistive opens and better performance in the case of bridging faults. Moreover, the proposed approach is not affected by problems on the clock distribution network.

PDF icon BIST Method for Die-Level Process Parameter Variation Monitoring in Analog/Mixed-Signal Integrated Circuits [p. 1303]
A. Zjajo, M.J. Barragan Asian and J. Pineda de Gyvez

This paper reports a new built-in self-test scheme for analog and mixed-signal devices based on die-level process monitoring. The objective of this test is not to replace traditional specification-based tests, but to provide a reliable method for early identification of excessive process parameter variations in production tests that allows faulty circuits to be discarded quickly. Additionally, the possibility of on-chip process deviation monitoring provides valuable information, which is used to guide the test and to allow the estimation of selected performance figures. The information obtained through guiding and monitoring process variations is re-used to supplement circuit calibration.


9.4: SAT Techniques for Verification

Moderators: R. Bloem, TU Graz, AT; R. Drechsler, Bremen U, DE
PDF icon A New Hybrid Solution to Boost SAT Solver Performance [p. 1307]
L. Fang and M.S. Hsiao

Due to the widespread demand for efficient SAT solvers in Electronic Design Automation applications, methods to boost the performance of SAT solvers are highly desired. In this paper we propose a hybrid solution to boost SAT solver performance via an integration of local search and DPLL-based search approaches. A local search is used to identify a subset of clauses to be passed to a DPLL SAT solver through an incremental interface. In addition, the solution obtained by the DPLL solver on the subset of clauses is fed back to the local search solver to escape locally optimal points. The proposed solution is highly portable to existing SAT solvers. For satisfiable instances, up to an order of magnitude speedup can be obtained via the proposed hybrid solver.
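
A toy Python rendering of the interaction the abstract describes: a WalkSAT-style local search collects the clauses it keeps failing on, a small DPLL solver is run on that subset, and the DPLL model reseeds the local search. This is a didactic sketch under simplifying assumptions, not the authors' incremental-interface implementation.

import random

def dpll(clauses, assign):
    """Tiny DPLL: clauses are lists of signed ints (vars are 1..n)."""
    if not clauses:
        return assign
    if any(len(c) == 0 for c in clauses):
        return None
    lit = clauses[0][0]
    for choice in (lit, -lit):
        simplified = [[l for l in c if l != -choice]
                      for c in clauses if choice not in c]
        res = dpll(simplified, {**assign, abs(choice): choice > 0})
        if res is not None:
            return res
    return None

def hybrid_sat(clauses, n_vars, rounds=20, flips=200):
    model = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
    for _ in range(rounds):
        hard = set()
        for _ in range(flips):
            unsat = [c for c in clauses
                     if not any((l > 0) == model[abs(l)] for l in c)]
            if not unsat:
                return model                 # satisfying assignment found
            c = random.choice(unsat)
            hard.add(tuple(c))               # remember troublesome clauses
            model[abs(random.choice(c))] ^= True   # flip a variable
        sub = dpll([list(c) for c in hard], {})
        if sub is None:
            return None      # hard subset UNSAT, hence the formula is UNSAT
        model.update(sub)    # feed DPLL's model back into local search
    return None              # gave up (satisfiability undecided)

cnf = [[1, 2], [-1, 3], [-2, -3], [2, 3]]
print(hybrid_sat(cnf, 3))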

PDF icon QuteSAT: A Robust Circuit-based SAT Solver for Complex Circuit Structure [p. 1313]
C.-A. Wu, T.-H. Lin, C.-C. Lee and C.-Y. Huang

We propose a robust circuit-based Boolean Satisfiability (SAT) solver, QuteSAT, that can be applied to complex circuit netlist structures. Several novel techniques are proposed in this paper, including: (1) a generic watching scheme on general gate types for efficient Boolean Constraint Propagation (BCP), (2) an implicit implication graph representation for efficient learning, and (3) careful engineering of the most advanced SAT algorithms for the circuit-based data structure. Our experimental results show that our baseline solver, without taking advantage of the circuit information, can achieve the same performance as the fastest Conjunctive Normal Form (CNF)-based solvers. We also demonstrate that by applying a simple circuit-oriented decision ordering technique (J-frontier), our solver can consistently outperform the CNF-based solvers by more than 15 times. With its great flexibility on the circuit-based data structure, our solver can serve as a solid foundation for general SAT research in the future.

PDF icon Boosting the Role of Inductive Invariants in Model Checking [p. 1319]
G. Cabodi, S. Nocco and S. Quer

This paper focuses on inductive invariants in unbounded model checking to improve efficiency and scalability. First of all, it introduces optimized techniques to speed up the computation of inductive invariants, considering both equivalences and implications between pairs of nodes in the logic network. Secondly, it presents a very efficient dynamic procedure, based on an incremental SAT approach, to reduce the set of checked invariants. Finally, it shows how to effectively integrate inductive invariant computations with state-of-the-art model checking procedures. Experiments address different property verification aspects, and specifically consider cases where inductive invariants alone are not sufficient for the final proof.


Interactive Presentation

PDF icon Image Computation and Predicate Refinement for RTL Verilog Using Word Level Proofs [p. 1325]
D. Kroening and N. Sharygina

Automated abstraction is the enabling technique for model checking large circuits. Predicate abstraction is one of the most promising abstraction techniques. It relies on the efficient computation of predicate images and the right choice of predicates. Existing algorithms use a netlist-level circuit model for computing predicate images. This paper makes two contributions. First, it describes a proof-based algorithm that computes an over-approximation of the predicate image at the word level, and thus scales to larger circuits. Second, whereas previous work relies on the computation of weakest preconditions to refine the set of predicates, we propose to extract predicates from a word-level proof instead.


9.5: Compiler Techniques for Customizable Architectures

Moderators: A. Darte, ENS Lyon, FR; H. van Someren, ACE Associated Compiler Experts, NL
PDF icon Polynomial-Time Subgraph Enumeration for Automated Instruction Set Extension [p. 1331]
P. Bonzini and L. Pozzi

This paper proposes a novel algorithm that, given a data-flow graph and an input/output constraint, enumerates all convex subgraphs under the given constraint in polynomial time with respect to the size of the graph. Such subgraphs have been shown to represent efficient Instruction Set Extensions for customizable processors. The search space for this problem is inherently polynomial but, to our knowledge, this is the first paper to prove this and to present a practical algorithm for this problem with polynomial complexity. Our algorithm is based on properties of convex subgraphs that link them to the concept of multiple-vertex dominators. We discuss several pruning techniques that, without sacrificing the optimality of the algorithm, make it practical for data-flow graphs of a thousand nodes or more.
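
The convexity property at the heart of this enumeration is easy to state operationally: a subgraph S is convex iff no path leaves S and re-enters it. The hedged Python helper below checks that property on a DAG; it illustrates the definition only, not the paper's polynomial-time enumeration algorithm.

def is_convex(succ, nodes):
    """succ: DAG adjacency dict; nodes: candidate subgraph node set."""
    S = set(nodes)
    # Explore everything reachable from S while staying outside S
    stack = [v for u in S for v in succ.get(u, []) if v not in S]
    seen = set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        for w in succ.get(v, []):
            if w in S:
                return False       # a path left S and re-entered it
            stack.append(w)
    return True

dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
print(is_convex(dag, {'a', 'b', 'd'}))   # False: a -> c -> d re-enters
print(is_convex(dag, {'a', 'b'}))        # True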

PDF icon Interrupt and Low-level Programming Support for Expanding the Application Domain of Statically-Scheduled Horizontally-Microcoded Architectures in Embedded Systems [p. 1337]
M. Reshadi and D. Gajski

The increasing role of software in embedded systems has made the processor an important component of these systems. However, to meet the tight constraints of embedded applications, the processor must often be customized for the application. Customizing instruction-based processors is difficult and very challenging. Design approaches based on statically-scheduled horizontally-microcoded architectures have been proposed to simplify architecture customization. In these approaches, the datapath is first specified by the designer, and then the operations of the datapath are extracted automatically. Since the operations are statically scheduled in these architectures, (i) low-level programming using assembly is impossible or very tedious, and (ii) execution of programs cannot be interrupted arbitrarily. In this paper, we address the above problems. We show how to efficiently handle interrupts in such architectures and also propose an elegant way of controlling low-level hardware resources in a general way in the C language. We also show that after adding interrupt and low-level programming support, we could use the above architectural style in a multi-core system to implement a complete MP3 decoder that processes 122 frames per second, while the standard requires 38 frames per second.

PDF icon DRIM: A Low Power Dynamically Reconfigurable Instruction Memory Hierarchy for Embedded Systems [p. 1343]
Z. Ge, W.-F. Wong and H.-B. Lim

Power consumption is of crucial importance to embedded systems. In such systems, the instruction memory hierarchy consumes a large portion of the total energy. A well-designed instruction memory hierarchy can greatly decrease energy consumption and increase performance. The performance of the instruction memory hierarchy is largely determined by the specific application. Different applications achieve better energy-performance with different configurations of the instruction memory hierarchy. Moreover, applications often exhibit different phases during execution, each exacting different demands on the processor and in particular the instruction memory hierarchy. For a given hardware resource budget, an even better energy-performance may be achievable if the memory hierarchy can be reconfigured before each of these phases. In this paper, we propose a new dynamically reconfigurable instruction memory hierarchy to take advantage of these two characteristics and achieve significant energy-performance improvement. Our proposed instruction memory hierarchy, which we call DRIM, consists of four banks of on-chip instruction buffers. Each of these can be configured to function as a cache or as a scratchpad memory (SPM) according to the needs of an application and its execution phases. Our experimental results using six benchmarks from the MediaBench and MiBench suites show that DRIM can achieve significant energy reduction.


Interactive Presentations

PDF icon SoftSIMD - Exploiting Subword Parallelism Using Source Code Transformations [p. 1349]
S. Kraemer, R. Leupers, G. Ascheid and H. Meyr

SIMD instructions are used to speed up multimedia applications in high performance embedded computing. Vendors often use proprietary platforms which are incompatible with others. Therefore, porting software is a very complex and time consuming task. Moreover, lots of existing embedded processors do not have SIMD extensions at all. But they do provide a wide data path which is 32-bit or wider. Usually, multimedia applications work on short data types of 8 or 16-bit. Thus, only the lower bits of the data path are used and therefore only a fraction of the available computing power is exploited for such algorithms. This paper discusses the possibility to make use of the upper bits of the data path by emulating true SIMD instructions. These instructions are implemented purely in software using a high level language such as C. Therefore, the application can be modified by making use of source code transformations which are inherently portable. The benefit of this approach is that the computing resources are used more efficiently without compromising the portability of the code. Experiments have shown that a significant speedup can be obtained by this approach.
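
The guard-bit trick underlying such software SIMD can be shown in a few lines. The Python sketch below adds two 16-bit lanes packed into one 32-bit word by masking off the inter-lane carry; it illustrates the general idea, not a transformation taken from the paper.

MASK_H = 0x8000_8000            # top bit of each 16-bit lane
MASK_L = 0x7FFF_7FFF

def packed_add16(x, y):
    """Lane-wise 16-bit add of two packed 32-bit words: add with the lane
    top bits masked out (so no carry crosses lanes), then restore them."""
    s = (x & MASK_L) + (y & MASK_L)
    return s ^ ((x ^ y) & MASK_H)

def pack(a, b):
    return ((a & 0xFFFF) << 16) | (b & 0xFFFF)

x, y = pack(1000, 70), pack(2000, 5)
s = packed_add16(x, y)
print(s >> 16, s & 0xFFFF)      # 3000 75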

PDF icon A Process Splitting Transformation for Kahn Process Networks [p. 1355]
S. Meijer, B. Kienhuis, A. Turjan and E. de Kock

In this paper we present a process splitting transformation for Kahn process networks. Running applications written in this parallel program specification on a multiprocessor architecture does not guarantee that the runtime requirements are met. Therefore, it may be necessary to further analyze and optimize Kahn process networks. In this paper, we present a four-step transformation that results in a functionally equivalent process network, but with a changed and optimized network structure. The class of networks that can be handled is not restricted to static networks. The novelty of this approach is that it can also handle processes with dynamic program statements. We illustrate the transformation, prototyped in GCC, for a JPEG decoder, showing a 21% performance improvement.


9.6: Interconnect Optimization and Metastability

Moderators: S. Sapatnekar, Minnesota U, US; T. Shiple, Synopsys, FR
PDF icon Computing Synchronizer Failure Probabilities [p. 1361]
S. Yang and M. Greenstreet

System-on-Chip designs often have a large number of timing domains. Communication between these domains requires synchronization, and the failure probabilities of these synchronizers must be characterized accurately to ensure the robustness of the complete system. We present a novel approach for determining the failure probabilities of synchronizer circuits. Our approach uses numerical integration to account for the nonlinear behaviour of real synchronizer circuits. We complement this with small-signal techniques to enable accurate estimation of extremely small failure probabilities. Our approach is fully automated, is suitable for integration into circuit simulation tools such as SPICE, and enables accurate characterization of extremely small failure probabilities.
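
For orientation, the classical small-signal metastability model that this paper complements with numerical integration gives MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). A quick Python evaluation with made-up device numbers:

import math

def synchronizer_mtbf(t_r, tau, t_w, f_clk, f_data):
    """Mean time between synchronizer failures, in seconds, under the
    classical exponential resolution model (all parameters assumed)."""
    return math.exp(t_r / tau) / (t_w * f_clk * f_data)

# Assumed numbers: tau = 20 ps, T_w = 30 ps, 1 GHz clock, 100 MHz data rate
for resolve_time in (0.2e-9, 0.5e-9, 1.0e-9):
    mtbf = synchronizer_mtbf(resolve_time, 20e-12, 30e-12, 1e9, 100e6)
    print(f"t_r = {resolve_time * 1e9:.1f} ns  ->  MTBF = {mtbf:.3g} s")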

PDF icon Layout-Aware Gate Duplication and Buffer Insertion [p. 1367]
D. Bañeres, J. Cortadella and M. Kishinevsky

An approach for layout-aware interconnect optimization is presented. It is based on the combination of three sub-problems into the same framework: gate duplication, buffer insertion and placement. Different techniques to control the combinatorial explosion are proposed. The experimental results show tangible benefits in delay that endorse the suitability of integrating the three sub-problems in the same framework. The results also corroborate the increasing relevance of interconnect optimization in future semiconductor technologies.

PDF icon Self-Heating-Aware Optimal Wire Sizing under Elmore Delay Model [p. 1373]
M. Ni and S.O. Memik

Global interconnect temperature keeps rising in current and future technologies due to self-heating and the adiabatic property of top metal layers. The thermal effects adversely impact both the reliability and the performance of an interconnect wire, shortening its lifetime and increasing its delay. Such effects must be considered during interconnect design. One important argument of this paper is that the traditional linear dependence between wire resistance and wire width is no longer adequate for higher-layer interconnects due to the adiabatic property of these wires. Using a curve-fitting technique, we propose a quadratic model for the resistance of the interconnect that is aware of the thermal effects. Based on this model and the Elmore delay model, we derive a linear optimal wire sizing formula of the form f(x) = ax + b. Compared to the non-thermal-aware exponential wire sizing formula of the form f(x) = ae^(-bx), we observed a 49.7% average delay gain over different choices of physical parameters.
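
To show how such a sizing function can be evaluated numerically, the Python sketch below computes the Elmore delay of a linearly tapered wire f(x) = ax + b under an assumed width-dependent unit-resistance model (the paper's thermally derived quadratic model is not reproduced) and grid-searches (a, b). All constants are illustrative.

import math

L, SEGS = 1000.0, 200                # wire length (um), segments
R_D, C_L = 100.0, 50e-15             # driver resistance, load capacitance
C_A = 0.06e-15                       # assumed area cap per um of length per um of width

def unit_res(w):
    # assumed width-dependent unit resistance (not the paper's fitted model)
    return 0.08 / w * (1.0 + 0.02 * w + 0.001 * w * w)

def elmore_delay(a, b):
    """Elmore delay of a discretized tapered wire driving C_L."""
    dx = L / SEGS
    widths = [a * (i + 0.5) * dx + b for i in range(SEGS)]
    if min(widths) <= 0:
        return math.inf                      # infeasible taper
    delay, upstream_r = 0.0, R_D
    for w in widths:
        upstream_r += unit_res(w) * dx       # resistance seen by this cap
        delay += upstream_r * (C_A * w * dx)
    return delay + upstream_r * C_L

best = min(((a, b) for a in [i * 1e-4 for i in range(-20, 21)]
                   for b in [0.2 * j for j in range(1, 26)]),
           key=lambda ab: elmore_delay(*ab))
print(best, elmore_delay(*best))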


9.7: Physical and Device Simulation

Moderators: M. Zwolinski, Southampton U, UK; F. Gaffiot, Ecole Centrale de Lyon, FR
PDF icon Statistical Blockade: A Novel Method for Very Fast Monte Carlo Simulation of Rare Circuit Events, and Its Application [p. 1379]
A. Singhee and R.A. Rutenbar

Circuit reliability under statistical process variation is an area of growing concern. For highly replicated circuits such as SRAMs and flip-flops, a rare statistical event for one circuit may induce a not-so-rare system failure. Existing techniques perform poorly when tasked to generate both efficient sampling and sound statistics for these rare events. Statistical Blockade is a novel Monte Carlo technique that allows us to efficiently filter, i.e. block, unwanted samples that are insufficiently rare in the tail distributions we seek. The method synthesizes ideas from data mining and Extreme Value Theory, and shows speedups of 10X to 100X over standard Monte Carlo.
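
A conceptual Python sketch of the blockade flow, with a stand-in for the expensive simulator and a deliberately simple surrogate classifier (the paper uses a proper learned classifier): simulate a small training set, derive a relaxed tail cut-off, then send only unblocked samples to the simulator.

import random

random.seed(0)

def spice(p):                      # stand-in for an expensive simulation
    return p[0] + 0.5 * p[1] + 0.05 * p[0] * p[1]

# 1. Small training run: simulate everything, mark the tail (top 3%)
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
metrics = sorted((spice(p), p) for p in train)
tail = [p for _, p in metrics[int(0.97 * len(metrics)):]]

# 2. "Classifier": a cheap surrogate with a relaxed (conservative) cut-off,
#    so that true tail points are rarely blocked
surrogate = lambda p: p[0] + 0.5 * p[1]          # assumed cheap model
cut = 0.8 * min(surrogate(p) for p in tail)      # relaxation factor 0.8

# 3. Production run: only unblocked samples reach the expensive simulator
candidates = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100000)]
unblocked = [p for p in candidates if surrogate(p) > cut]
worst = max(spice(p) for p in unblocked)
print(f"simulated {len(unblocked)} of {len(candidates)} samples; "
      f"worst = {worst:.2f}")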

PDF icon Clock Domain Crossing Fault Model and Coverage Metric for Validation of SoC Design [p. 1385]
Y. Feng, Z. Zhou, D. Tong and X. Cheng

Multiple asynchronous clock domains have been increasingly employed in System-on-Chip (SoC) designs for different I/O interfaces. Functional validation is one of the most expensive tasks in the SoC design process. Simulation at the register transfer level (RTL) is still the most widely used method. It is important to quantitatively measure validation confidence and progress for clock domain crossing (CDC) designs. In this paper, we propose an efficient method for defining CDC coverage, which can be used in RTL simulation for a multi-clock domain SoC design. First, we develop a CDC fault model to represent the actual effect of metastability. Second, we use a temporal data flow graph (TDFG) to propagate the CDC faults to observable variables. Finally, CDC coverage is defined based on the CDC faults and their observability. Our experiments on a commercial IP demonstrate that this method is useful for finding CDC errors early in the design cycle.

PDF icon Fast Statistical Circuit Analysis with Finite-Point Based Transistor Model [p. 1391]
M. Chen, W. Zhao, F. Liu and Y. Cao

A new approach to transistor modeling is developed for fast statistical circuit simulation in the presence of variations. For both the I-V and C-V characteristics of a transistor, finite data points are identified by their physical meaning; the impact of process and design variations is embedded into these points as closed-form expressions. The entire I-V and C-V characteristics are then extrapolated using polynomial formulas. This novel approach significantly enhances simulation speed with sufficient accuracy. The model is implemented in Verilog-A at the 65nm node. Compared to simulations with the BSIM model, the computation time can be reduced by 7x in transient analysis and 9x in Monte-Carlo simulations.


Interactive Presentation

PDF icon Statistical Simulation of High-Frequency Bipolar Circuits [p. 1397]
W. Schneider, M. Schroter, W. Kraus and H. Wittkopf

This paper describes a physics-based methodology for computationally efficient statistical modeling of high-frequency bipolar transistors, along with its practical implementation in a production process design kit. Applications to statistical modeling, circuit simulation, and yield optimization are demonstrated for an opamp circuit. Experimental results are shown that verify the methodology.


10.1: HOT TOPIC II - Development and Industrialization (Space and Aeronautics Special Day)

Organizers/Moderators: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
PDF icon Development and Industrialization [p. 1403]
M. Riffiod, P. Caspi, C. Piala and J.-L. Voirin

This second technical session illustrates the methodological dimensions of technology transfer. It elaborates on methodologies deployed in critical steps of the whole embedded systems development process, particularly to specify safety-critical embedded systems, to manage the obsolescence of components, and to certify the airworthiness of the final solutions.


10.2: Wireless Communication and Networking System Implementation

Moderators: C. Heer, Infineon Technologies, DE; O. Deprez, Texas Instruments, FR
PDF icon Low Power Design on Algorithmic and Architectural Level: A Case Study of an HSDPA Baseband Digital Signal Processing System [p. 1406]
M. Schämann, S. Hessel, U. Langmann and M. Bücker

The optimization of power consumption plays a key role in the design of a cellular system: increasing data rates together with high mobility represent a constantly growing design challenge, because the required advanced algorithms have higher complexity, more chip area and increased power consumption, which conflicts with the limited power supply. In this contribution, digital baseband components for a High Speed Downlink Packet Access (HSDPA) system are optimized at the algorithmic and architectural levels. Three promising algorithms for the equalization of the propagation channel are compared regarding performance, complexity and power consumption using fixed-point SystemC models. At the architectural level, an adaptive control unit is introduced together with an output interference analyzer. The presented strategy reduces the arithmetic operations under favorable propagation conditions by up to 70%, which relates to an estimated power reduction of up to 40%, while the overall performance is not affected.

PDF icon Mapping the Physical Layer of Radio Standards to Multiprocessor Architectures [p. 1412]
C. Grassmann, M. Richter and M. Sauermann

We are concerned with the software implementation of baseband processing for the physical layer of radio standards ("Software Defined Radio" - SDR). Given the constraints for mobile terminals with respect to power consumption, chip area and performance, non-standard architectures without compiler support are the targets an SDR implementation has to face. For this domain we present a way to move safely from a functional model to the assembly level in order to arrive at a tested, multithreaded, optimized implementation in manageable time. We carried out this program for the standards WLAN IEEE 802.11b and 3GPP WCDMA, exploiting various levels of parallelism: thread-level parallelism ("MIMD"), data-level parallelism ("SIMD") and instruction-level parallelism ("VLIW"). We arrived at a software implementation running in real time on Infineon's programmable Multiple SIMD Core (MuSIC) processor.

PDF icon Development of an ASIP Enabling Flows in Ethernet Access Using a Retargetable Compilation Flow [p. 1418]
K. Van Renterghem, P. Demuytere, D. Verhulst, J. Vandewege and X.-Z. Qiu

In this paper we present an FPGA-based Application Specific Instruction Set Processor (ASIP) tailored to the needs of a flow-aware Ethernet access node, developed using a retargetable compilation flow. The toolchain is used to develop an initial processor design, assess the performance and identify the potential bottlenecks. A second design iteration results in a fully optimized ASIP with a VLIW instruction set, which allows for a high degree of parallelism among the functional units inside the ASIP and has dedicated instructions to accelerate typical packet processing tasks. This way, a single processor is capable of handling the complete throughput of a gigabit Ethernet link. To reach the target of a 10 Gbit/s Ethernet access node, several processors operate in parallel in a multicore environment.

PDF icon An Effective AMS Top-Down Methodology Applied to the Design of a Mixed-Signal UWB System-on-Chip [p. 1424]
M. Crepaldi, M.R. Casu, M. Graziano and M. Zamboni

The design of Ultra Wideband (UWB) mixed-signal SoCs for localization applications in wireless personal area networks is currently investigated by several researchers. The complexity of the design calls for effective top-down methodologies. We propose a layered approach based on VHDL-AMS for the first design stages and on an intelligent use of a circuit-level simulator for the transistor-level phase. We apply the latter to just one block at a time and wrap it within the system-level VHDL-AMS description. This method allows us to capture the impact of circuit-level design choices and non-idealities on system performance. To demonstrate the effectiveness of the methodology, we show how the refinement of the design affects specific UWB system parameters such as bit-error rate and localization estimates.


Interactive Presentation

PDF icon Behavioral Modeling of Delay-Locked Loops and Its Application to Jitter Optimization in Ultra Wide-Band Impulse Radio Systems [p. 1430]
E. Barajas, R. Cosculluela, D. Coutinho, D. Mateo, J. L. González, I. Cairò, S. Banda, M. Ikeda

This paper presents a behavioral model of a delay-locked loop (DLL) used to generate the timing signals in an integrated ultra wide-band (UWB) impulse radio (IR) system. The requirements on these timing signals in the context of UWB-IR systems are reviewed. The behavioral model includes models of the various noise sources in the DLL that produce output jitter. The model is used to find the optimum loop filter capacitor value that minimizes output jitter. The accuracy of the behavioral model is validated by comparing system-level simulation results with transistor-level simulations of the whole DLL.


10.3: Soft Error Evaluation and Tolerance

Moderators: C. Metra, Bologna U, IT; B. Gottlieb, Intel, US
PDF icon Soft Error Rate Analysis for Sequential Circuits [p. 1436]
N. Miskov-Zivanov and D. Marculescu

Due to the reduction in device feature size and supply voltage, the sensitivity of digital systems to radiation-induced transient faults (soft errors) increases dramatically. Intensive research has been done so far in modeling and analyzing the susceptibility of combinational circuits to soft errors, while sequential circuits have received much less attention. In this paper, we present an approach for evaluating the susceptibility of sequential circuits to soft errors. The proposed approach uses symbolic modeling based on BDDs/ADDs and probabilistic sequential circuit analysis. The SER evaluation is demonstrated by a set of experimental results, which show that, for most of the benchmarks used, the SER decreases well below a given threshold (10^-7 FIT) within ten clock cycles after the hit. The results obtained with the proposed symbolic framework are within 4% average error and up to 11000X faster when compared to detailed HSPICE circuit simulation.
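
A toy numerical illustration of the decay effect reported here: if a latched error's influence on observable state persists each cycle with some probability (a made-up Markov-style abstraction, far simpler than the paper's BDD/ADD framework), the residual error probability falls below a FIT-like threshold within a handful of cycles.

P_KEEP = 0.25          # assumed per-cycle probability that the corrupted
                       # state still influences observable outputs
THRESH = 1e-7          # threshold in the spirit of the paper's 10^-7 FIT

p, cycles = 1.0, 0
while p > THRESH:
    p *= P_KEEP        # geometric decay of the error's influence
    cycles += 1
print(f"error probability drops below {THRESH} after {cycles} cycles")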

PDF icon Verification-Guided Soft Error Resilience [p. 1442]
S.A. Seshia, W. Li and S. Mitra

Algorithmic techniques for formal verification can be used not just for bug-finding, but also to estimate vulnerability to reliability problems and to reduce the overheads of circuit mechanisms for error resilience. We demonstrate this idea of verification-guided error resilience in the context of soft errors in latches. We show how model checking can be used to identify the latches in a circuit that must be protected so that the circuit satisfies a formal specification. Experimental results on a Verilog implementation of the ESA SpaceWire communication protocol indicate that the power overhead of soft error protection can be reduced by a factor of 4.35 by using our approach rather than protecting all latches.

PDF icon A Low-SER Efficient Core Processor Architecture for Future Technologies [p. 1448]
E.L. Rhod, C.A. Lisboa and L. Carro

Device scaling in new and future technologies brings along a severe increase in the soft error rate of circuits, for both combinational and sequential logic. Although potential solutions have started to be investigated by the community, the full use of future resources in circuits tolerant to SETs, without performance, area or power penalties, is still an open research issue. This paper introduces MemProc, an embedded core processor with extremely low SER sensitivity and no performance or area penalty compared to its RISC counterpart. Central to the SER reduction are the use of new magnetic memories (MRAM and FRAM) and the minimization of the combinational logic area in the core. This paper shows the results of fault injection in the MemProc core processor and in a RISC machine, and compares the performance and area of both approaches. Experimental results show a 29 times increase in fault tolerance, with performance gains of up to 3.75 times and 14 times less sensitive area.

PDF icon Accurate and Scalable Reliability Analysis of Logic Circuits [p. 1454]
M.R. Choudhury and K. Mohanram

Reliability of logic circuits is emerging as an important concern that may limit the benefits of continued scaling of process technology and the emergence of future technology alternatives. Reliability analysis of logic circuits is NP-hard because of the exponential number of inputs, combinations and correlations in gate failures, and their propagation and interaction at multiple primary outputs. By coupling probability theory with concepts from testing and logic synthesis, this paper presents accurate and scalable algorithms for reliability analysis of logic circuits. Simulation results for several benchmark circuits demonstrate the accuracy, performance, and potential applications of the proposed analysis technique.


Interactive Presentation

PDF icon A New Asymmetric SRAM Cell to Reduce Soft Errors and Leakage Power in FPGA [p. 1460]
B.S. Gill, C. Papachristou and F.G. Wolff

Soft errors in semiconductor memories occur due to charged particle strikes at the cell nodes. In this paper, we present a new asymmetric memory cell to increase the soft error tolerance of SRAM. At the same time, this cell can be used at a reduced supply voltage to decrease leakage power without significantly increasing the soft error rate of the SRAM. A major use of this cell is in the configuration memory of FPGAs. The cell is designed in a 70nm process technology and verified using Spice simulations. Soft error tolerance results are presented and compared with a standard SRAM cell and an existing increased-soft-error-tolerance cell. Simulation results show that our cell has the lowest soft error rate across the various supply voltages.


10.4: EMBEDDED TUTORIAL - EDA - A Pivotal Theme in the European Technology Platforms - ARTEMIS and ENIAC (System)

Organizers/Moderators: P. Magarshack, STMicroelectronics, FR; E. Schutz, STMicroelectronics, BE
PDF icon Design Challenges at 65nm and Beyond [p. 1466]
A.B. Kahng

Semiconductor manufacturing technology faces ever-greater challenges of pitch, mobility, variability, leakage, and reliability. To enable cost-effective continuation of the semiconductor roadmap, there is a greater need for design technology to provide "equivalent scaling", and for product-specific design innovation (multi-core architecture, software support, beyond-die integration, etc.) to provide "more than Moore" scaling. Design challenges along the road to 45nm include variability and power management, and leveraging design-manufacturing synergies. Potential solutions include "design for manufacturability" bridges between chip implementation and manufacturing know-how.

PDF icon The ARTEMIS Cross-Domain Architecture for Embedded Systems [p. 1468]
H. Kopetz

Today the embedded systems market is a highly fragmented one, in which custom-designed solutions dominate, resulting in a significant duplication of development effort for hardware, software and services. The ever-increasing complexity of embedded systems, the semiconductor industry's technology trend towards large production series of chips, and the increased competition in the world market entail the need for a Europe-wide coherent and integrated development strategy for embedded systems. The ARTEMIS technology platform has been created to fill this need by joining the forces of many of the European players in the embedded systems market, in order to create the critical mass necessary to tackle the formidable challenges of the field.

PDF icon HW/SW Implementation from Abstract Architecture Models [p. 1470]
A.A. Jerraya

The evolution of technologies is enabling the integration of complex platforms in a single chip; called System-on-Chip, SoC. Modern SoC may include one or several CPU subsystems to execute software and sophisticated interconnect in addition to specific hardware subsystems. This is no more an advanced research topic for academia. 90% of SoCs designed since the start of the 130nm process include at least one CPU. Multimedia platforms (e.g. Nomadik and Nexperia) are already multiprocessor systems-on-chip (MPSoCs) using different kinds of programmable processors (e.g. DSPs and microcontrollers). This trend of building heterogeneous multi-processor SoCs will even accelerate. It is easy to imagine that the design of a SoC with more than a hundred processors will become a current practice in a few years time, e.g. with 45nm technology in 2008. Compared with conventional ASIC design, such a multi-processor SoC is a fundamental change in chip design. These chips will include very sophisticated interconnect such as networks-on-chips (NoC). Moreover, to achieve the required communication performances, each processor may use different local architectures and communication schemes (fast links, non standard memory organization and access).


10.5: Memory and Instruction-Set Customization for Real-Time Systems

Moderators: T.-W. Kuo, National Taiwan U, ROC; H. van Someren, ACE Associated Compiler Experts, NL
PDF icon Instruction-Set Customization for Real-Time Embedded Systems [p. 1472]
H.P. Huynh and T. Mitra

Application-specific customization of the instruction set helps embedded processors achieve significant gains in performance and power efficiency. In this paper, we explore customization in the context of multi-tasking real-time embedded systems. We propose efficient algorithms to select the optimal set of custom instructions for a task set under two popular real-time scheduling policies. Our algorithms minimize processor utilization through customization while satisfying the task deadlines and a constraint on silicon area. Experimental evaluation with various task sets shows that appropriate customization can achieve significant reductions in processor utilization and energy consumption.
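
Stripped of the scheduling details, the selection problem resembles a knapsack: pick custom instructions under an area budget to minimize utilization U = sum(C_i / T_i). The exhaustive Python toy below uses invented tasks and candidates, and omits the paper's deadline tests and actual algorithms.

from itertools import combinations

# (name, area cost, {task: WCET reduction in cycles}) -- all values invented
CANDIDATES = [
    ("mac2",  120, {"t1": 400, "t2": 150}),
    ("sad8",  200, {"t1": 900}),
    ("crc32",  80, {"t2": 300, "t3": 100}),
]
TASKS = {"t1": (5000, 20000), "t2": (3000, 10000), "t3": (1000, 8000)}  # (C, T)
AREA_BUDGET = 250

def utilization(selected):
    """Processor utilization with the selected custom instructions applied."""
    u = 0.0
    for task, (c, t) in TASKS.items():
        c_eff = c - sum(g.get(task, 0) for _, _, g in selected)
        u += c_eff / t
    return u

best = min((s for r in range(len(CANDIDATES) + 1)
              for s in combinations(CANDIDATES, r)
              if sum(a for _, a, _ in s) <= AREA_BUDGET),
           key=utilization)
print([n for n, _, _ in best], f"U = {utilization(best):.3f}")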

PDF icon A Novel Technique to Use Scratch-pad Memory for Stack Management [p. 1478]
S. Park, H.-W. Park and S. Ha

Extensive work has been done on optimal management of scratch-pad memory (SPM), all assuming that the SPM is assigned a fixed address space. The main target objects to be placed in the SPM have been code and global memory, since their sizes and locations do not change dynamically. We propose a novel idea of dynamic address mapping for the SPM with the assistance of the memory management unit (MMU). It allows us to use the SPM for stack management without architecture modification or compiler assistance. The proposed technique is orthogonal to previous work and so can be used at the same time. Experimental results show that the proposed technique yields an average performance improvement of 13% and energy savings of 12% compared to using only external DRAM. It also gives noticeable speedup and energy savings compared to a typical cache solution for stack data.

PDF icon Scratchpad Memories vs Locked Caches in Hard Real-Time Systems: A Quantitative Comparison [p. 1484]
I. Puaut and C. Pais

We propose in this paper an algorithm for off-line selection of the contents of on-chip memories. The algorithm supports two types of on-chip memories, namely locked caches and scratchpad memories. The contents of the on-chip memory, although selected off-line, are changed at run-time for the sake of scalability with respect to task size. Experimental results show that the algorithm yields good ratios of on-chip memory accesses on the worst-case execution path, with a tolerable reload overhead, for both types of on-chip memories. Furthermore, we highlight the circumstances under which one type of on-chip memory is more appropriate than the other, depending on architectural parameters (cache block size) and application characteristics (basic block size).

PDF icon Task Scheduling for Reliable Cache Architectures of Multiprocessor Systems [p. 1490]
M. Sugihara, T. Ishihara and K. Murakami

This paper presents a task scheduling method for reliable cache architectures (RCAs) of multiprocessor systems. The RCAs dynamically switch their operation modes to reduce the usage of vulnerable SRAMs under real-time constraints. A mixed integer programming model is built to minimize vulnerability under real-time constraints. Experimental results show that our task scheduling method achieves 47.7-99.9% less vulnerability than a conventional approach.


10.6: Order Reduction and Variation-Aware Interconnect Modelling

Moderators: L. Daniel, Massachusetts Institute of Technology, US; L.M. Silveira, TU Lisbon, PT
PDF icon Fast Positive-Real Balanced Truncation of Symmetric Systems Using Cross Riccati Equations [p. 1496]
N. Wong

We present a computationally efficient implementation of positive-real balanced truncation (PRBT) for symmetric multiple-input multiple-output (MIMO) systems. The solution of a pair of algebraic Riccati equations (AREs) in conventional PRBT, whose complexity limits practical large-scale realization, is replaced with the solution of one cross Riccati equation (XRE). The cross-Riccatian solution then permits simple construction of projection matrices without actually balancing the system. The method encompasses passive linear networks, as commonly used in interconnect and package modeling, due to their inherent reciprocity and therefore symmetric transfer functions. The effectiveness of the proposed approach is verified by numerical examples.

PDF icon Random Sampling of Moment Graph: A Stochastic Krylov-Reduction Algorithm [p. 1502]
Z. Zhu and J. Phillips

In this paper we introduce a new algorithm for model order reduction in the presence of parameter or process variation. Our analysis is performed using a graph interpretation of the multi-parameter moment matching approach, leading to a computational technique based on Random Sampling of Moment Graph (RSMG). Using this technique, we have developed a new algorithm that combines the best aspects of recently proposed parameterized moment-matching and approximate TBR procedures. RSMG attempts to avoid both exponential growth of computational complexity and multiple matrix factorizations, the primary drawbacks of existing methods, and illustrates good ability to tailor algorithms to apply computational effort where needed. Industry examples are used to verify our new algorithms.

PDF icon Statistical Model Order Reduction for Interconnect Circuits Considering Spatial Correlations [p. 1508]
J. Fan, N. Mi, S.X.-D. Tan, Y. Cai and X. Hong

In this paper, we propose a novel statistical model order reduction technique, called the statistical spectrum model order reduction (SSMOR) method, which considers both intra-die and inter-die process variations with spatial correlations. The SSMOR generates order-reduced variational models from given variational circuits. The reduced model can be used for fast statistical performance analysis of interconnect circuits with variational input sources, such as power grids and clock networks. The SSMOR uses a statistical spectrum method to compute the variational moments, and a Monte Carlo sampling method with a modified Krylov subspace reduction method to generate the variational reduced models. To consider spatial correlations, we apply an orthogonal decomposition to map the correlated random variables into independent, uncorrelated variables. Experimental results show that the proposed method delivers about 100x speedup over the pure Monte Carlo projection-based reduction method with about 2% error for both means and variances in statistical transient analysis.
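
The orthogonal decomposition step is typically realized through an eigendecomposition of the covariance matrix (a standard construction, shown here for reference): with correlated Gaussian parameters x ~ N(μ, Σ) and Σ = V Λ V^T,

    \mathbf{x} = \boldsymbol{\mu} + V \Lambda^{1/2}\, \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(0, I),

so that the spatially correlated variations are rewritten in terms of independent standard normal variables z.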

PDF icon A Sparse Grid Based Spectral Stochastic Collocation Method for Variations-Aware Capacitance Extraction of Interconnects under Nanometer Process Technology [p. 1514]
H. Zhu, X. Zeng, W. Cai, J. Xue and D. Zhou

In this paper, a Spectral Stochastic Collocation Method (SSCM) is proposed for the capacitance extraction of interconnects with stochastic geometric variations in nanometer process technology. The proposed SSCM has several advantages over existing methods. Firstly, compared with PFA (Principal Factor Analysis) modeling of geometric variations, the K-L (Karhunen-Loeve) expansion involved in SSCM can be independent of the discretization of conductors, thus significantly reducing the computation cost. Secondly, compared with the perturbation method, the stochastic spectral method based on the Homogeneous Chaos expansion has an optimal (exponential) convergence rate, which makes SSCM applicable to most geometric variation cases. Furthermore, a Sparse Grid combined with an MST (Minimum Spanning Tree) representation is proposed to reduce the number of sampling points and the computation time for capacitance extraction at each sampling point. Numerical experiments have demonstrated that SSCM achieves higher accuracy and a faster convergence rate than the perturbation method.
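
For reference, a truncated K-L expansion of a random deviation field a(x, θ) with mean ā(x) and covariance eigenpairs (λ_i, φ_i) takes the standard form

    a(x, \theta) \approx \bar{a}(x) + \sum_{i=1}^{n} \sqrt{\lambda_i}\, \phi_i(x)\, \xi_i(\theta),

where the ξ_i are uncorrelated random variables; because the eigenfunctions φ_i are defined on the geometry rather than on the conductor mesh, the truncation order n can be chosen independently of the discretization.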


Interactive Presentation

PDF icon Simulation Methodology and Experimental Verification for the Analysis of Substrate Noise on LC-VCO's [p. 1520]
S. Bronckers, C. Soens, G. Van Der Plas, G. Vandersteen and Y. Rolain

This paper presents a methodology for the analysis and prediction of the impact of wideband substrate noise on an LC Voltage-Controlled Oscillator (LC-VCO) from DC up to the local oscillator (LO) frequency. The impact of substrate noise is modeled a priori in a high-ohmic 0.18μm 1P6M CMOS technology and then verified on silicon on a 900MHz LC-VCO. Below 10MHz, the impact is dominated by the on-chip resistance of the VCO ground, while above 10MHz the bond wires, the parasitics of the on-chip inductor and the PCB decoupling capacitors determine the behavior of the perturbation.


10.7: Temperature and Process Aware Low Power Techniques

Moderators: C. Silvano, Politecnico di Milano, IT; E. Schmidt, ChipVision Design Systems, DE
PDF icon Accurate Temperature-Dependent Integrated Circuit Leakage Power Estimation Is Easy [p. 1526]
Y. Liu, R.P. Dick, L. Shang and H. Yang

It has been the conventional assumption that, due to the superlinear dependence of leakage power consumption on temperature and widely varying on-chip temperature profiles, accurate leakage estimation requires detailed knowledge of the thermal profile. Leakage power depends on the integrated circuit (IC) thermal profile and circuit design style. We show that linear models permit highly accurate leakage estimation over the operating temperature ranges of real ICs. We then show that, for typical IC packages and cooling structures, a given amount of heat introduced at any position in the active layer has a similar impact on the average temperature of the layer. These two observations allow us to prove that, for wide ranges of design styles and operating temperatures, extremely fast, coarse-grained thermal models combined with linear leakage power consumption models permit highly accurate system-wide leakage power estimation. The results of our proofs are further confirmed via comparisons with leakage estimation based on detailed, time-consuming thermal analysis techniques. Experimental results indicate that the proposed technique yields a 59,259x-1,790,000x speedup in leakage power estimation while maintaining accuracy.
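
A minimal instance of such a linear model (our illustration; the coefficient k would be characterized per design style and temperature window) is

    P_{\mathrm{leak}}(T) \approx P_{\mathrm{leak}}(T_0)\,\bigl(1 + k\,(T - T_0)\bigr),

i.e., a first-order expansion of the superlinear leakage law around a reference temperature T_0; combined with the second observation above, system-wide leakage can then be estimated from the average temperature of the active layer alone.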

PDF icon Low-Overhead Circuit Synthesis for Temperature Adaptation Using Dynamic Voltage Scheduling [p. 1532]
S. Ghosh, S. Bhunia and K. Roy

Increasing power density causes die overheating due to the limited cooling capacity of the package. Conventional thermal management techniques, e.g. logic shutdown, clock gating, frequency scaling and simultaneous voltage-frequency tuning, increase the design complexity and/or degrade performance significantly. In this paper, we propose a novel design technique that makes a circuit amenable to temperature adaptation using dynamic voltage scheduling (DVS). It is accomplished by a synthesis technique that (a) isolates and predicts the set of paths that may become critical under variations, (b) ensures that they are activated rarely, and (c) tolerates possible delay failures (at reduced voltage) in these paths by adaptive clock stretching. This allows us to schedule a lower supply voltage at increased temperature without requiring frequency tuning. Simulation results on an example pipeline show that the proposed design yields a temperature reduction similar to a conventional design with only an 11% performance penalty and 14% area overhead. The conventional pipeline design, on the contrary, suffers a 50% performance degradation due to the reduced operating frequency.

PDF icon Maximum Circuit Activity Estimation Using Pseudo-Boolean Satisfiability [p. 1538]
H. Mangassarian, A. Veneris, S. Safarpour, F.N. Najm and M.S. Abadir

Disproportionate instantaneous power dissipation may result in unexpected power supply voltage fluctuations and permanent circuit damage. Therefore, estimation of maximum instantaneous power is crucial for the reliability assessment of VLSI chips. Circuit activity, and consequently power dissipation in CMOS circuits, is highly input-pattern dependent, making the problem of maximum power estimation computationally hard. This work proposes a novel pseudo-Boolean satisfiability-based method that finds the exact input sequence maximizing circuit activity in combinational and sequential circuits. The method is also extended to take multiple gate transitions into account by integrating delay information into the pseudo-Boolean optimization problem. An extensive suite of experiments on ISCAS85 and ISCAS89 circuits confirms the efficiency and robustness of the approach compared to simulation-based techniques and encourages further research on low-power solutions using Boolean satisfiability.
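
A common way to cast maximum activity as pseudo-Boolean optimization (sketched here in generic form; the paper's encoding additionally integrates delay information for multiple transitions) unrolls the circuit over two consecutive input vectors:

    \max \sum_g w_g\, t_g \quad \text{s.t.} \quad
    t_g = y_g^{(1)} \oplus y_g^{(2)}, \;\;
    y^{(1)}, y^{(2)} \ \text{satisfy the circuit CNF in frames 1 and 2},

where y_g^{(k)} is the value of gate g under the k-th input vector, t_g indicates a transition at g, and the weights w_g can reflect, e.g., load capacitance.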


Interactive Presentations

PDF icon Efficient Computation of Discharge Current Upper Bounds for Clustered Sleep Transistor Sizing [p. 1544]
A. Sathanur, A. Calimera, L. Benini, A. Macii, E. Macii and M. Poncino

Sleep transistor insertion is a key step in low-power design methodologies for nanometer CMOS. In the clustered sleep transistor approach, a single sleep transistor is shared among a number of gates, and it must be sized according to the maximum current that can be injected into the virtual ground by the gates in the cluster. A conservative (upper-bound) estimate of the maximum injected current is required in order to avoid excessive speed degradation and possible violations of timing constraints. In this paper we propose a scalable algorithm for computing tight upper bounds, with a controlled and tunable computational cost. The algorithm leverages the capabilities of state-of-the-art commercial timing analysis engines, and it is tightly integrated into a standard industrial flow for leakage optimization. Benchmark results demonstrate the effectiveness and efficiency of our approach.

PDF icon Processor Tolerant Beta-Ratio Modulation for Ultra-Dynamic Voltage Scaling [p. 1550]
M.-E. Hwang, T. Cakici and K. Roy

Most wireless and hand-held gadgets work in burst mode, and the performance demand varies with time. When the performance requirement is low, the supply voltage can be dithered and the circuit can move from the superthreshold region into the subthreshold region (Vdd < VT). Such ultra-dynamic voltage scaling (UDVS), where the supply voltage switches from, say, 1.2V to 200mV, enables a remarkable decrease in power consumption with "acceptable" performance penalty in the non-burst mode of operation. However, subthreshold operation is very sensitive to process variation (PV) due to the reduced noise margin, and may not work properly unless corrective measures are taken. In this paper, we model the trip voltage in both the subthreshold and superthreshold regions, and analyze the impact of PV on UDVS. We also propose a circuit design technique that lets the same logic gate operate efficiently in both regions under PV. We do so by modulating the β-ratio (P-to-N ratio) of the logic gates. We show that proper β-ratio modulation can lower energy dissipation per cycle by more than an order of magnitude (42X) in non-burst mode with reduced sensitivity to PV.


11.1: PANEL SESSION - Towards Total Open Source in Aeronautics and Space?
(Space and Aeronautics Special Day)

Organizers: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
Moderator: P. Aycinena, Editor, EDA Confidential, US
PDF icon Towards Total Open Source in Aeronautics and Space? [p. 1556]
Panelists: E. Bantegnie, G. Ladier, R. Mueller, F. Gasperoni and A. Wilson

Aeronautics and space are extraordinarily technical fields of engineering and science that reside within a niche characterized by unique end-product requirements. The severe operating conditions in flight or in space, in combination with the need for mission-critical reliability, create a demanding level of expectation for those who develop the hardware and software that go into systems for aeronautics and space.


11.2: Wireless Communication and Networking Algorithms

Moderators: C. Grassmann, Infineon Technologies, DE; O. Deprez, Texas Instruments, FR
PDF icon A Tiny and Efficient Wireless Ad-hoc Protocol for Low-cost Sensor Networks [p. 1557]
P. Gburzynski, B. Kaminska and W. Olesinski

We introduce a simple ad-hoc routing scheme that operates in the true spirit of ad-hoc networking, i.e., in a modeless fashion, without neighborhood discovery or explicit point-to-point forwarding, while offering a high (and tunable) degree of reliability, fault tolerance and robustness. Being aimed at truly tiny devices (e.g., with 1KB of RAM), our scheme can automatically take advantage of extra memory resources to improve the quality of routes for critical nodes. In contrast to some popular low-cost solutions like ZigBee™, our approach involves a single node type and exhibits lower resource requirements. The presented scheme has been verified in an industrial deployment with stringent quality-of-service requirements.

PDF icon Scalable Reconfigurable Channel Decoder Architecture for Future Wireless Handsets [p. 1563]
G. Krishnaiah, N. Engin and S. Sawitzki

The current trend in the consumer device and communication service provider market is the integration of different communication standards within a single device (e.g. a GSM phone with Bluetooth, WLAN and infrared interfaces), requiring tight integration of mobile broadcast, networking and cellular technologies within one product. The channel decoder is traditionally one of the most computationally intensive building blocks within digital receivers. The aim of this paper is to investigate the feasibility of a programmable channel decoder that can be dynamically reconfigured for decoding turbo- and convolutionally-encoded streams from various wireless standards. The architecture options are presented, and their area costs and flexibility are compared. The resulting decoder architecture supports hardware resource sharing and reconfiguration between different standards and decoders, and is more efficient in terms of silicon area than an independent implementation of every decoder on the same IC.

PDF icon A New Pipelined Implementation for Minimum Norm Sorting Used in Square Root Algorithm for MIMO-VBLAST Systems [p. 1569]
Z. Khan, T. Arslan, J.S. Thompson and A.T. Erdogan

Multiple-Input Multiple-Output (MIMO) wireless technology involves highly complex vector and matrix computations, which translate directly into increased power and area consumption. This paper proposes an area- and power-efficient VLSI architecture that can serve the dual purpose of minimum-norm sorting of rows as well as upper/lower block triangularization of matrices. The resources inside the architecture are shared between both operations and only primitive computations are used. Results indicate savings in silicon real estate as well as power consumption compared to a previous architecture, without degrading performance.

PDF icon Optimization of the "FOCUS" Inband-FEC Architecture for 10-Gbps SDH/SONET Optical Communication Channels [p. 1575]
A. Tychopoulos and O. Koufopavlou

Forward Error Correction (FEC) is of key importance to the robustness of optical communication networks. In particular, inband FEC is an attractive option because it improves channel performance without requiring an increase in transmission bandwidth. We have devised and implemented a novel inband FEC method, dubbed FOCUS, for the electronic mitigation of physical impairments in SDH/SONET optical networks. It is an inherently low-cost approach for both the metro and backbone network regions, scalable to any SDH/SONET rate and capable of significantly increasing optical channel performance. This paper analyzes the most sophisticated of the many optimizations employed to minimize the architectural complexity of FOCUS, falling into: a) arithmetic operator design, b) resource sharing, and c) redundant logic elimination. These optimizations were necessary to obtain a prototype, which eventually permitted the first fully successful laboratory evaluation of the FOCUS inband FEC method.


11.3: System Reliability and Security Issues

Moderators: C. Bolchini, Politecnico di Milano, IT; S. Bocchio, STMicroelectronics, IT
PDF icon A Framework for System Reliability Analysis Considering Both System Error Tolerance and Component Test Quality [p. 1581]
S.-J. Pan and K.-T. Cheng

The failure rates, the sources of failures and the test costs for nanometer devices are all increasing. Creating a reliable system-on-a-chip device therefore requires designers to implement fault tolerance. However, while system-level fault tolerance can significantly relax the quality requirements of the system's building blocks, every fault-tolerant scheme works only under certain failure mechanisms and within a certain range of error probabilities. Also, designing a system with a high-failure-rate component can be very expensive, because the design complexity and the system overhead for fault tolerance can grow significantly faster than the component failure rate. Therefore, it is desirable to understand the trade-offs between component test quality and system fault-tolerance capability for achieving the desired reliability under cost constraints. In this paper, we propose an analysis framework for system reliability considering (a) the test quality achieved by manufacturing testing, on-line self-checking, and off-line built-in self-test; (b) the fault-tolerance and spare schemes; and (c) the component defect and error probabilities. We demonstrate that, through proper redundancy configurations and low-cost testing to ensure a certain degree of component test quality, a low-redundancy system can achieve equal or higher reliability than a high-redundancy system.

PDF icon Experimental Evaluation of Protections against Laser-induced Faults and Consequences on Fault Modeling [p. 1587]
R. Leveugle, A. Ammari, V. Maingot, E. Teyssou, P. Moitrel, C. Mourtel, N. Feyt, J.-B. Rigaud and A. Tria

Lasers can be used by hackers to inject faults in circuits and induce security flaws. On-line detection mechanisms are classically proposed to counter such attacks, and are often based on error-detecting codes. However, the efficiency of such schemes has not been precisely validated against real attack conditions. This paper presents results showing that, with a given type of laser, a classical protection technique can leave doors open to an attacker. The results also give insights into the fault models to be taken into account when designing a secure circuit.

PDF icon Evaluation of Design for Reliability Techniques in Embedded Flash Memories [p. 1593]
B. Godard, J.-M. Daga, L. Torres and G. Sassatelli

Non-volatile Flash memories are becoming more and more popular in Systems-on-Chip (SoC). Embedded Flash (eFlash) memories are based on the well-known floating-gate transistor concept. The reliability of this type of technology is a growing issue for embedded systems; endurance and retention are the main features to analyze. To enhance memory reliability, current eFlash memory designs use techniques such as Error Correction Codes (ECC), redundancy and threshold voltage (VT) analysis. In this paper, a memory model to evaluate the reliability of eFlash memory arrays under distinct enhancement schemes is developed.

PDF icon Reduction of Detected Acceptable Faults for Yield Improvement via Error-Tolerance [p. 1599]
T.-Y. Hsieh, K.-J. Lee and M.A. Breuer

Error-tolerance is an innovative way to enhance the effective yield of IC products. Previously, a test methodology based on error-rate estimation to support error-tolerance was proposed. Without violating the system error-rate constraint specified by the user, this methodology identifies a set of faults that can be ignored during testing, thereby leading to a significant improvement in yield. However, the patterns detecting all of the unacceptable faults usually also detect a large number of acceptable faults, degrading the achievable yield improvement. In this paper, we first provide a probabilistic analysis of this problem and show that a conventional ATPG procedure cannot adequately address it. We then present a novel test pattern selection procedure and an output masking technique to deal with the problem. The selection process generates a test set aimed at detecting all unacceptable faults but as few acceptable faults as possible. The masking technique then examines the generated test patterns and identifies a list of output lines that can be masked (not observed) during testing so as to further avoid the detection of acceptable faults. Experimental results show that by employing the proposed techniques, only a small number of acceptable faults are still detected. In many cases the actual yield improvement approaches the optimal value that can be achieved.


11.4: Statistical Timing and Worst-Delay Corner Analysis

Moderators: M. Berkelaar, Magma Design Automation, NL; J. Cortadella, UP Catalunya, ES
PDF icon Use of Statistical Timing Analysis on Real Designs [p. 1605]
A. Nardi, E. Tuncer, S. Naidu, A. Antonau, S. Gradinaru, T. Lin and J. Song

A vast literature has been published on Statistical Static Timing Analysis (SSTA): its motivations, its different implementations and their runtime/accuracy trade-offs. However, very limited literature exists ([1]) on the applicability and usage models of this new technology on real designs. This work focuses on the use of SSTA in real designs and its practical benefits and limitations compared with the traditional design flow. We introduce two new metrics to drive the optimization: skew criticality and aggregate sensitivity. Practical benefits of SSTA are demonstrated for clock tree analysis and for correct modeling of on-chip variations. The use of SSTA to cover traditional corner analysis and to drive optimization is also discussed. Results are reported on three designs implemented in a 90nm technology.

PDF icon A Novel Criticality Computation Method in Statistical Timing Analysis [p. 1611]
F. Wang, Y. Xie and H. Ju

The impact of process variations increases as technology scales into the nanometer regime. Under large process variations, path and arc/node criticality [18] provide effective metrics for guiding circuit optimization. To facilitate criticality computation in the presence of correlation, we define the critical region for each path and arc/node in a timing graph, and propose an efficient method to compute the criticality of paths and arcs/nodes simultaneously with a single breadth-first graph traversal during the backward propagation. Instead of prematurely choosing a set of paths for analysis, we develop a new property of path criticality to prune paths with low criticality at a very early stage, so that our path criticality computation has linear complexity in the number of timing edges in the timing graph. To improve the computation accuracy, cutset and path criticality properties are exploited to calibrate the results. Experimental results on ISCAS benchmark circuits show that our criticality computation method achieves high accuracy at high speed.
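
For concreteness, the criticality of a path p is conventionally defined as the probability that it is the longest path under the process variations:

    \mathrm{crit}(p) = \Pr\bigl(D_p \ge D_q \;\; \forall q\bigr),

and an arc/node's criticality is the probability that the critical path passes through it; the pruning property mentioned above allows paths whose criticality is provably negligible to be discarded before full evaluation.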

PDF icon Efficient Computation of the Worst-Delay Corner [p. 1617]
L. Guerra e Silva, L.M. Silveira and J.R. Phillips

Timing analysis and verification is a critical stage in digital integrated circuit design. As feature sizes decrease to the nanometer scale, the impact of process parameter variations on circuit performance becomes extremely relevant. Even though several statistical timing analysis techniques have recently been proposed as a way of incorporating variability effects into traditional static timing analysis, corner analysis is still the current timing signoff methodology for industrial designs. Since it is impossible to analyze a design at all process corners, due to the exponential size of the corner space, the design is usually analyzed at a set of carefully chosen corners that are expected to cover all the worst-case scenarios. However, there is no established systematic methodology for picking the right worst-case corners, and this task usually relies on the experience of design and process engineers, often leading to overdesign. This paper proposes an efficient automated methodology for computing the worst-delay process corners of a digital integrated circuit, given a linear parametric characterization of the gate and interconnect delays.
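
Concretely, under a linear parametric model each delay is

    d_i(\boldsymbol{\lambda}) = d_i^0 + \sum_{j=1}^{m} a_{ij}\, \lambda_j, \qquad \boldsymbol{\lambda} \in [-1, 1]^m,

and for a single path P the worst corner is simply the vertex \lambda_j = \mathrm{sign}\bigl(\sum_{i \in P} a_{ij}\bigr); the difficulty the paper addresses is that the circuit delay is the maximum over many paths, so the worst-delay corner of the whole design cannot be read off one path's sensitivities.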


11.5: Real-Time Methodologies

Moderators: I. Puaut, Rennes U/IRISA, FR; S. Baruah, North Carolina U, US
PDF icon Accounting for Cache-Related Preemption Delay in Dynamic Priority Schedulability Analysis [p. 1623]
L. Ju, S. Chakraborty and A. Roychoudhury

Recently there has been considerable interest in incorporating the timing effects of microarchitectural features of processors (e.g. caches and pipelines) into the schedulability analysis of tasks running on them. Following this line of work, in this paper we show how to account for the effects of cache-related preemption delay (CRPD) in the standard schedulability tests for dynamic-priority schedulers such as EDF. Even if the memory spaces of tasks are disjoint, their memory blocks usually map into a shared cache. As a result, task preemption may introduce additional cache misses, which are encountered when the preempted task resumes execution; the delay due to these additional misses is called CRPD. Previous work on accounting for CRPD was restricted to static-priority schedulers and periodic task models. Our work extends these results to dynamic-priority schedulers and more general task models (e.g. sporadic, generalized multiframe and recurring real-time). We show that our schedulability tests are useful through extensive experiments using synthetic task sets, as well as through a detailed case study.
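
As a simplified illustration of where CRPD enters (our sketch; the paper's analysis and task models are more general), the classical EDF processor-demand test for periodic tasks,

    \sum_i \max\Bigl(0, \Bigl\lfloor \frac{t - D_i}{T_i} \Bigr\rfloor + 1\Bigr) C_i \;\le\; t \quad \forall t > 0,

can be made CRPD-aware by inflating each execution time C_i with a bound \gamma_i on the cache-reload delay a job may suffer due to preemptions.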

PDF icon Energy-Efficient Real-Time Task Scheduling with Task Rejection [p. 1629]
J.-J. Chen, T.-W. Kuo, C.-L. Yang and K.-J. King

In the past decade, energy efficiency has been an important system design issue in both hardware and software management. For mobile applications with critical missions, system engineers must provide both reduced energy consumption and timing guarantees to extend operation duration and maintain system stability. This research explores real-time systems composed of homogeneous multiple processors with dynamic voltage scaling (DVS) capability, in which a given task can be rejected at a specified rejection penalty. The objective is to minimize the sum of the total rejection penalty for the tasks that are not completed in time and the energy consumption of the system. This study shows that no polynomial-time approximation algorithm exists for the studied problem unless P = NP. Moreover, we propose algorithms for systems with ideal and non-ideal DVS processors. The capability of the proposed algorithms is assessed through extensive evaluations. The evaluation results reveal that our algorithms derive effective solutions to the energy-efficient scheduling problem with task rejection.
Keywords: Energy-Efficient Scheduling, Task Rejection, Real-Time Task Scheduling.

PDF icon Feasibility Intervals for Multiprocessor Fixed-Priority Scheduling of Arbitrary Deadline Periodic Systems [p. 1635]
L. Cucu and J. Goossens

In this paper we study the global scheduling of periodic task systems with arbitrary deadlines upon identical multiprocessor platforms. We first show that two very general properties, well known for uniprocessor platforms, remain valid for multiprocessor platforms: (i) under a few not very restrictive assumptions, any feasible schedule of an arbitrary-deadline periodic task system is periodic from some point; and (ii) for the specific case of synchronous periodic task systems, the schedule repeats from the origin. We then present our main result: any feasible schedule of an asynchronous periodic task set using a fixed-priority scheduler is periodic from a specific point. Moreover, we characterize that point and provide a feasibility interval for such systems.

PDF icon Energy Minimization with Soft Real-time and DVS for Uniprocessor and Multiprocessor Embedded Systems [p. 1641]
M. Qiu, C. Xue, Z. Shao and E.H.-M. Sha

Energy saving is extremely important in real-time embedded systems. Dynamic Voltage Scaling (DVS) is one of the prime techniques used to achieve it. Due to the uncertainty in the execution times of some tasks, this paper models each varying execution time as a random variable. Using a probabilistic approach, we propose two optimal algorithms, one for uniprocessor and one for multiprocessor systems, to explore soft real-time embedded systems and avoid over-designing them. Our goal is to minimize the expected total energy consumption while satisfying the timing constraint with a guaranteed confidence probability. The solutions can be applied to both hard and soft real-time systems. The experimental results show that our approach achieves significant energy savings over previous work.


11.6: Impact of Nanometer Technologies in MPSoCs and SoC Design

Moderators: R. Marculescu, Carnegie Mellon U, US; D. Atienza, DACYA, Madrid Complutense U, ES
PDF icon Joint Consideration of Fault-Tolerance, Energy-Efficiency and Performance in On-Chip Networks [p. 1647]
A. Ejlali, B.M. Al-Hashimi, P. Rosinger and S.G. Miremadi

High reliability against noise, low energy consumption and high performance are key objectives in the design of on-chip networks. Recently some researchers have considered the various trade-offs between pairs of these objectives. However, as we argue, the three design objectives should be considered jointly and simultaneously. The first aim of this paper is to analyze the impact of various error-control schemes on the simultaneous trade-off between reliability, performance and energy as the voltage swing varies. We provide a detailed comparative analysis of the error-control schemes using analytical models and SPICE simulations. The second aim of this paper is to analyze the impact of noise power and time constraints on the effectiveness of error-control schemes, which has not been addressed in previous studies.

PDF icon Impact of Process Variations on Multicore Performance Symmetry [p. 1653]
E.B. Humenay, D. Tarjan and K. Skadron

Multi-core architectures introduce a new granularity at which process variations may occur, yielding asymmetry among cores that were designed, and that software expects, to be symmetric in performance. The chief source of this phenomenon is highly correlated, "systematic" within-die variation, such as optical imperfections yielding variations across the exposure field. Per-core voltages can be used to bring all cores to the same performance level, but this compensation strategy also affects power, chiefly due to leakage. Boosting a core's frequency may therefore boost its leakage sufficiently to engage thermal throttling. This sets up a trade-off between static performance asymmetry due to frequency variation and dynamic performance asymmetry due to thermal throttling. This paper explores the potential magnitude of these effects.

PDF icon Temperature Aware Task Scheduling in MPSoCs [p. 1659]
A. Kivilcim Coskun, T. Simunic Rosing and K. Whisnant

In deep-submicron circuits, elevated temperatures have brought new challenges in reliability, timing, performance, cooling costs and leakage power. Conventional thermal management techniques sacrifice performance to control thermal behavior by slowing down or turning off the processors when a critical temperature threshold is exceeded. Moreover, studies have shown that in addition to high temperatures, temporal and spatial variations in temperature impact system reliability. In this work, we explore the benefits of thermally aware task scheduling for multiprocessor systems-on-chip (MPSoCs). We design and evaluate OS-level dynamic scheduling policies with negligible performance overhead. We show that simple-to-implement policies that make decisions based on temperature measurements achieve better temporal and spatial thermal profiles than state-of-the-art schedulers. We also enhance reactive strategies such as dynamic thread migration with our scheduling policies. This way, hot spots and temperature variations are decreased, and the performance cost is significantly reduced.
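
To give a flavor of such an OS-level policy (a minimal sketch, not the paper's algorithm; the temperature readings, task list and heat-up increment are hypothetical), a dispatcher can simply send each ready task to the currently coolest core:

    def coolest_first(ready_tasks, core_temps, heatup=2.0):
        """Assign each ready task to the coolest core, using only
        temperature measurements; heatup crudely models the warming
        caused by a newly assigned task."""
        temps = dict(core_temps)              # core id -> temperature (deg C)
        assignment = {}
        for task in ready_tasks:
            core = min(temps, key=temps.get)  # pick the coolest core
            assignment[task] = core
            temps[core] += heatup
        return assignment

    print(coolest_first(["t1", "t2", "t3"], {0: 62.0, 1: 55.5, 2: 58.0}))
    # -> {'t1': 1, 't2': 1, 't3': 2}

Spreading load toward cooler cores in this measurement-driven way smooths both spatial gradients (hot spots) and temporal temperature cycling, which is what such policies target.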


11.7: High-Level Memory and Clock Power Optimization

Moderators: R. Zafalon, STMicroelectronics, IT; J. Haid, Infineon Technologies, DE
PDF icon Architectural Leakage-Aware Management of Partitioned Scratchpad Memories [p. 1665]
O. Golubeva, M. Loghi, M. Poncino and E. Macii

Partitioning a memory into multiple blocks that can be independently accessed is a widely used technique to reduce its dynamic power. For embedded systems, its benefits can be pushed even further by properly matching the partition to the memory access patterns. When leakage energy comes into play, however, idle memory blocks must be put into a proper low-leakage sleep state to actually save energy when not accessed. In this case, the matching becomes an instance of a power management problem, because moving to and from this sleep state requires additional energy. In this work, we propose an explorative solution to the problem of leakage-aware partitioning of a memory into disjoint sub-blocks. In particular, we target scratchpad memories, which are commonly used in embedded systems as a replacement for caches. We show that the total (dynamic plus static) energy cost function yields a non-convex partitioning space, making smart exploration the only viable option; we propose an effective randomized search of the solution space whose results match those of exhaustive exploration very well when the latter is feasible. Experiments on different sets of embedded applications show that total energy savings larger than 60% on average can be obtained, with a marginal overhead in execution time, thanks to an effective implementation of the low-leakage sleep state.
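
The randomized exploration can be pictured with the following toy sketch (our illustration; the paper's energy model, move set and sleep-state costs are far more detailed), which samples random partitions of an address range and keeps the one with the lowest total dynamic-plus-static energy under a made-up cost model:

    import random

    def energy(partition, access_count, e_dyn=1.0, e_leak=0.1, e_wake=5.0):
        """Toy total-energy model: accesses to a block cost more the larger
        the block is; touched blocks pay a wake-up penalty, idle blocks
        only leak."""
        total = 0.0
        for block in partition:
            hits = sum(access_count.get(a, 0) for a in block)
            total += hits * e_dyn * len(block) ** 0.5
            total += e_wake if hits else e_leak
        return total

    def random_partition(addresses, rng):
        """Split the address list at k-1 random cut points."""
        k = rng.randint(1, len(addresses))
        cuts = sorted(rng.sample(range(1, len(addresses)), k - 1))
        pieces, prev = [], 0
        for c in cuts + [len(addresses)]:
            pieces.append(addresses[prev:c])
            prev = c
        return pieces

    def randomized_search(addresses, access_count, iters=2000, seed=0):
        rng = random.Random(seed)
        best, best_e = None, float("inf")
        for _ in range(iters):
            p = random_partition(addresses, rng)
            e = energy(p, access_count)
            if e < best_e:
                best, best_e = p, e
        return best, best_e

    addrs = list(range(16))
    profile = {a: 10 for a in range(4)}   # only the first quarter is hot
    print(randomized_search(addrs, profile))

A good partition under this toy model isolates the hot addresses in a small block and lumps the cold ones together, so the idle blocks can sleep; the non-convexity of the real cost function is exactly why such random sampling is attractive.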

PDF icon Memory Bank Aware Dynamic Loop Scheduling [p. 1671]
M. Kandemir, T. Yemliha, S.W. Son and O. Özturk

In a parallel system with multiple CPUs, one of the key problems is to assign loop iterations to processors. This problem, known as the loop scheduling problem, has been studied in the past, and several schemes, both static and dynamic, have been proposed. One of the attractive features of dynamic schemes, as compared to their static counterparts, is their ability to exploit the latency variations across the execution times of the different loop iterations. In all the dynamic loop scheduling techniques proposed in the literature so far, performance has been the primary metric of interest. In a battery-operated embedded execution environment, however, power consumption is another metric to consider during iteration-to-processor assignment. In particular, in a banked memory system, this assignment can have an important impact on memory power consumption, which can be a significant portion of the overall energy consumption, especially for data-intensive embedded applications such as those from the domain of image data processing. This paper presents a bank-aware dynamic loop scheduling scheme for array-intensive embedded media applications. The goal of this new scheduling scheme is to minimize the number of memory banks that need to be used for executing the current working set (group of loop iterations) when all processors are considered together. That is, during the loop iteration-to-processor assignment, our approach considers the bank access patterns of loop iterations and carefully selects the set of iterations to assign to an idle processor so that, if possible, the number of memory banks in use is not increased. Our experimental results show that the proposed scheduling scheme leads to much better energy results than prior loop scheduling techniques, and it is also competitive with the scheduler that generates the best performance. To our knowledge, this is the first dynamic loop scheduling scheme that is memory-bank aware.
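
The core assignment rule can be condensed into a few lines (a sketch of the idea, not the authors' scheduler; the chunk descriptors and bank sets are hypothetical): when a processor becomes idle, among the pending groups of iterations pick the one whose data footprint adds the fewest banks to the set already powered on:

    def pick_chunk(pending, active_banks):
        """pending: list of (chunk_id, banks_touched) pairs.
        Return the chunk that turns on the fewest new banks."""
        return min(pending, key=lambda chunk: len(chunk[1] - active_banks))

    pending = [("c0", {0, 1}), ("c1", {2, 3}), ("c2", {1})]
    print(pick_chunk(pending, active_banks={0, 1}))   # -> ('c2', {1})

Keeping the active-bank set small lets the remaining banks stay in a low-power state, which is where the reported energy savings come from.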

PDF icon System Level Clock Tree Synthesis for Power Optimization [p. 1677]
S.A. Butt, S. Schmermbeck, J. Rosenthal, A. Pratsch and E. Schmidt

The clock tree is the interconnect net with the heaviest load on Systems-on-Chip (SoCs) and consumes up to 40% of the overall power budget. Substantial savings in overall power dissipation are possible by optimizing the clock tree. Although these savings are already relevant at system level, little effort has been made to consider the clock tree at higher levels of abstraction. This paper shows how the clock tree can be integrated into system-level power estimation and optimization. A clock tree routing algorithm is chosen, adapted to the system level and then integrated into an algorithmic-level power optimization tool. Experimental results demonstrate the importance of the clock tree for system-level power optimization.