Whereas Moore's law continues to push the industry to ever more complex technologies (More Moore) supporting sophisticated digital applications, the so-called More than Moore technologies are finding more and more heterogeneous application domains. EDA challenges in More Moore relate to power optimisation, DfM and verification needs, whereas More than Moore technologies require EDA tools that relate various electrical, logical and physical domains in one environment. System-level design is badly needed in both More Moore and More than Moore.
Successful design of complex electronic systems increasingly requires the bi-directional flow of information among groups of design specialists who are becoming more dispersed geographically and organisationally. This affects the type of design flows we develop, the nature of the design tools, how design software is supported and the organisational structure of the EDA and electronics industries. Dr Rhines will provide examples and discuss issues that suggest how the interaction among designers and design organisations will affect the future evolution of design methodology and EDA.
This paper proposes a complete allocation and scheduling framework, where an MPSoC virtual platform is used to accurately derive input parameters, validate abstract models of system components and assess constraint satisfaction and objective function optimization. The optimizer implements an efficient and exact approach to allocation and scheduling based on problem decomposition: the allocation subproblem is solved through Integer Programming and the scheduling subproblem through Constraint Programming. The two solvers interact by means of no-good generation, building an iterative procedure which has been proven to converge to the optimal solution. Experimental results show significant speedups w.r.t. pure IP and CP exact solution strategies, as well as high accuracy with respect to cycle-accurate functional simulation. A case study further demonstrates the practical viability of our framework for real-life systems and applications.
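The decomposition loop described above can be sketched in miniature. Here brute-force enumeration stands in for the Integer Programming master and a serial list schedule for the Constraint Programming subproblem; all names and the toy instance are illustrative, not the paper's formulation:

```python
from itertools import product

def list_schedule(alloc, durations, preds, n_procs):
    # "Slave": serial list schedule in task-index order; assumes each
    # task's predecessors have smaller indices (topological numbering).
    finish = [0.0] * len(durations)
    free = [0.0] * n_procs
    for t in range(len(durations)):
        start = max([free[alloc[t]]] + [finish[p] for p in preds.get(t, [])])
        finish[t] = start + durations[t]
        free[alloc[t]] = finish[t]
    return max(finish)

def best_allocation(durations, n_procs, nogoods):
    # "Master": load-balancing allocation not yet cut off (brute force
    # stands in for the Integer Programming model).
    best, best_load = None, float("inf")
    for alloc in product(range(n_procs), repeat=len(durations)):
        if alloc in nogoods:
            continue
        loads = [0.0] * n_procs
        for t, p in enumerate(alloc):
            loads[p] += durations[t]
        if max(loads) < best_load:
            best, best_load = alloc, max(loads)
    return best

def solve(durations, n_procs, deadline, preds):
    nogoods = set()
    while True:
        alloc = best_allocation(durations, n_procs, nogoods)
        if alloc is None:              # every allocation cut off: infeasible
            return None
        makespan = list_schedule(alloc, durations, preds, n_procs)
        if makespan <= deadline:       # slave accepts the master's proposal
            return alloc, makespan
        nogoods.add(alloc)             # no-good cut; re-solve the master
```

Each rejected allocation becomes a no-good, so the loop mirrors the convergence argument: the master never revisits a cut allocation and the search space is finite.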
This paper addresses the allocation of link capacities in the automated design process of a network-on-chip based system. Communication resource costs are minimized under Quality-of-Service timing constraints. First, we introduce a novel analytical delay model for virtual channeled wormhole networks with non-uniform link capacities that eliminates costly simulations at the inner-loop of the optimization process. Second, we present an efficient capacity allocation algorithm that assigns link capacities such that packet delays requirements for each flow are satisfied. We demonstrate the benefit of capacity allocation for a typical system on chip, where the traffic is heterogeneous and delay requirements may largely vary, in comparison with the standard approach which assumes uniform-capacity links.
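The inner capacity-allocation loop can be illustrated with a toy model: a simple M/M/1-style per-link delay stands in for the paper's analytical wormhole delay model (an assumption), and capacities of bottleneck links are widened until every flow meets its requirement:

```python
def allocate_capacities(flows, load, margin=0.1, step=0.1, max_iter=100000):
    """flows: list of (path, delay_requirement); load: dict link -> offered
    bandwidth. Start just above the load and repeatedly widen the
    bottleneck link of any flow missing its delay requirement."""
    cap = {l: load[l] + margin for l in load}
    link_delay = lambda l: 1.0 / (cap[l] - load[l])   # toy M/M/1-style delay
    flow_delay = lambda path: sum(link_delay(l) for l in path)
    for _ in range(max_iter):
        violated = [p for p, req in flows if flow_delay(p) > req]
        if not violated:
            return cap                                # all QoS constraints met
        worst = max(violated[0], key=link_delay)      # bottleneck on the path
        cap[worst] += step
    return None                                       # did not converge
```

Because capacities are assigned per link rather than uniformly, lightly loaded links with loose requirements keep small capacities, which is the cost saving the abstract describes.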
With the advent of multi-processor systems-on-chip, interest in process migration is again on the rise, both in research and in product development. New challenges in this scenario include increased sensitivity to implementation complexity, tight power budgets, requirements on execution predictability and the lack of virtual memory support in many low-end MPSoCs. As a consequence, the effectiveness and applicability of traditional transparent migration mechanisms are called into question in this context. Our paper proposes a task management software infrastructure that is well suited to the constraints of single-chip multiprocessors with distributed operating systems. Load balancing in the system is maintained by means of intelligent initial placement and task migration. We propose a user-managed migration scheme based on code checkpointing and user-level middleware support as an effective solution for many MPSoC application domains. To prove the practical viability of this scheme, we also propose a characterization methodology for task migration overhead. We derive the minimum execution time following a task migration event during which the system configuration should be frozen to make up for the migration cost.
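The break-even reasoning behind that minimum frozen time can be made concrete with an assumed cost model (the linear-progress assumption and all parameters are illustrative, not the paper's exact characterization):

```python
def migration_overhead(state_bytes, bw_bytes_per_s, ckpt_s, restore_s):
    # Cost of one checkpoint-based migration: write the checkpoint,
    # ship the task state over the on-chip interconnect, restore it
    # on the destination core.
    return ckpt_s + state_bytes / bw_bytes_per_s + restore_s

def min_frozen_time(overhead_s, old_rate, new_rate):
    # Progress lost while migrating is roughly overhead_s * old_rate work
    # units; the new placement recovers (new_rate - old_rate) units per
    # second, so the configuration must stay frozen at least this long
    # for the migration to pay for itself.
    if new_rate <= old_rate:
        raise ValueError("destination is no faster: migration never pays off")
    return overhead_s * old_rate / (new_rate - old_rate)
```

For example, migrating 1 MB of state over a 10 MB/s channel with 0.2 s of checkpoint and restore work costs 0.5 s; doubling the task's execution rate then amortizes that cost after 0.5 s in the new placement.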
In this paper, a wavelet-based approach is proposed for the model order reduction of linear circuits in the time domain. Compared with the Chebyshev reduction method, the wavelet approach achieves smaller reduced-order circuits with very high accuracy, especially for circuits with strong singularities. Furthermore, to compute the basis function coefficient vectors, a fast Sylvester equation solver is proposed, which runs one to two orders of magnitude faster than the vector equation solver employed by the Chebyshev reduction method. The proposed wavelet method is also compared with a frequency-domain model reduction method, which may lose accuracy in the time domain. Both theoretical analysis and experimental results demonstrate the high speed and high accuracy of the proposed method.
This paper presents a domain decomposition (DD) technique for efficient simulation of large-scale linear circuits such as power distribution networks. Simulation results show that by integrating the proposed DD framework, existing linear circuit simulators can be extended to handle otherwise intractable systems.
Power distribution and signal transmission are becoming key limiters of chip performance in the nanometer era. These issues can be simultaneously addressed by designing transmission lines into power grids. Transmission lines are well suited for high-quality intra-chip signal transmission at multi-gigabit data rates. By placing signal lines between the power grids, the VDD and GND lines in the grid can be exploited as return paths besides being used for regular power distribution. This approach also improves wiring density. In this paper, we rigorously analyze and discuss the design considerations for laying transmission lines in power grids. We also present design-oriented modeling methods in 2D and 3D geometry and show how the grid modeling complexity is simplified. We experimentally validate our results with fabricated test structures and show that the VDD lines in the grid act as good return paths without external decoupling capacitors in our design. Further, we discuss substrate effects and deduce guidelines for designing power grid transmission lines on a low-resistivity silicon substrate.
This paper derives the multi-layer heat conduction Green's function, by integrating the eigen-expansion technique and the classic transmission line theories, and presents a logarithmic full-chip thermal analysis algorithm, which is verified by comparisons with a computational fluid dynamics tool (FLUENT). The paper considers Dirichlet's and general heat convection boundary conditions at chip surfaces. Experimental results show that the algorithm offers superior computing speed, compared to FLUENT and traditional Green's function based methods. The paper also studies the limitations of the traditional single-layer thermal model.
A fast method for timing analysis of large-scale RLC networks using the RLCG-MNA formulation, which provides good properties for fast matrix solvers, is presented. The proposed method is faster than INDUCTWISE and more general than the RLP algorithm, both of which are known as state-of-the-art simulation methods. Numerical examples illustrate the good performance of the proposed method compared with previous works.
In this paper, we present an analysis methodology to compute circuit node sensitivity to charged-particle-induced delay (timing) errors, or Soft Delay Errors (SDE). We define a node sensitivity metric and describe a step-by-step procedure to compute node sensitivity. We use mixed-mode simulations to extract accurate current pulses for the characterization of SDE, and we describe a technique for logic cell library characterization for SDE. Our approach is orders of magnitude faster than Spice-based analysis, with accuracy close to Spice. Using our approach, we provide node sensitivity distributions for various ISCAS85 circuits and two adders. Such analysis is important for applying node hardening techniques to selected nodes to increase the reliability of CMOS circuits. We use two test circuits to apply a node hardening technique to the highly sensitive nodes determined by our approach, and we report the resulting reduction in circuit sensitivity.
The built-in self-repair (BISR) technique is gaining popularity for repairing embedded memory cores in systems-on-chip (SOCs). To increase the utilization of memory redundancy, the BISR technique usually needs to perform a built-in redundancy-analysis (BIRA) algorithm for redundancy allocation. This paper presents an efficient BIRA scheme for embedded memory repair. The scheme executes 2D redundancy allocation based on a 1D local bitmap, which enables the BIRA circuitry to be implemented at low area cost. The BIRA algorithm also provides a good repair rate (i.e., the ratio of the number of repaired memories to the number of defective memories). Experimental results show that the repair rate of the proposed BIRA scheme approaches that of the optimal scheme for memories with different fault distributions, and that the ratio of analysis time to test time is small.
In this paper we analyze the impact of error detecting codes, implemented on an on-chip bus, on the on-chip simultaneous switching noise (SSN). First, we analyze in detail how SSN is impacted by different bus transitions, pointing out its dependency on the number and placement of switching wires. Afterwards, we present an analytical model that we have developed in order to estimate the SSN, and that we prove to be very accurate in SSN prediction. Finally, by employing the developed model, we estimate the SSN due to different EDCs implemented on an on-chip bus. In particular, we highlight how their differences in the number of switching wires, bus parallelism and codewords influence the on-chip SSN.
Today's nanometer technology trends have a very negative impact on the reliability of semiconductor products. Intermittent faults constitute the largest part of reliability failures manifested in the field during semiconductor product operation. Since Software-Based Self-Test (SBST) has been proposed as an effective strategy for on-line testing of processors integrated in non-safety-critical low-cost embedded system applications, optimal test period specification is becoming increasingly challenging. In this paper, we first introduce a reliability analysis for optimal periodic testing of intermittent faults that minimizes the incurred test cost, based on a two-state Markov model for the probabilistic modeling of intermittent faults. We then present, for the first time, an enhanced SBST strategy for on-line testing of complex pipelined embedded processors. Finally, we demonstrate the effectiveness of the proposed optimal periodic SBST strategy by applying it to a fully pipelined RISC embedded processor and providing experimental results.
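The period-selection trade-off can be illustrated with a deliberately simplified cost model, a classic periodic-inspection formula rather than the paper's Markov-based derivation: testing more often costs c_test/T per unit time, while undetected intermittent faults accumulate cost at roughly c_error*rate*T/2.

```python
import math

def cost_per_unit_time(T, c_test, c_error, rate):
    # Amortized test cost plus expected exposure to undetected faults
    # (simplified stand-in for the paper's two-state Markov cost model).
    return c_test / T + c_error * rate * T / 2.0

def optimal_test_period(c_test, c_error, rate):
    # Setting d/dT (c_test/T + c_error*rate*T/2) = 0 yields the closed form.
    return math.sqrt(2.0 * c_test / (c_error * rate))
```

The square-root form captures the qualitative behaviour described above: cheaper tests or higher fault rates both shorten the optimal period.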
We discuss the use of the Berger code for Concurrent Error Detection (CED) in Asynchronous Burst-Mode Machines (ABMMs). We present a state encoding method which guarantees the existence of the two key components for Berger-encoding an ABMM, namely an inverter-free ABMM implementation of the circuit and an ABMM implementation of the corresponding Berger code generator. We also propose improved solutions to two inherent problems of CED in ABMMs, namely checking synchronization and detection of error-induced hazards. Experimental results demonstrate that Berger-code-based CED significantly reduces the cost of previous CED methods for ABMMs.
Resonant clocking holds the promise of trading speed for energy in CMOS circuits that can afford to operate at low frequency, such as hearing aids. An experimental chip with 110k transistors and more than 2500 latches has been designed, fabricated and tested. The measured energy consumption of the design at 0.8 V is 62 μW/MHz, about 7.5% less than the conventional single-edge-triggered benchmark. Closer analysis reveals that much of the energy savings brought about by resonant clocking at low supply voltages is lost when a CMOS circuit is operated at higher voltages, because of the crossover currents that persist for much of a clock period when a circuit is driven by a sine-type clock waveform.
An on-chip interconnect is implemented with 3Gbps/wire bandwidth using an 8:1 serialization scheme. Such high-speed serialization is achieved with a novel scheme, the Wave-Front-Train. To apply this high-speed link technique to Network-on-Chip channels, three adaptive control schemes are used: supply-voltage-dependent reference voltage control, a phase compensation scheme with a self-calibrating function, and adaptive bandwidth control. The chip is fabricated in 0.18μm CMOS technology.
We report the first fully integrated single photon avalanche diode array fabricated in 0.35 μm CMOS technology. At 25 μm, the pixel pitch achieved by this design is the smallest ever reported. Thanks to the level of miniaturization enabled by this design, we were able to build the largest single photon streak camera ever built in any technology, thus proving the scalability of the technology. Applications requiring low noise, high dynamic range, and/or picosecond timing accuracies are the prime candidates of this technology. Examples include bio-imaging at cellular and molecular level, fast optical imaging, single photon telecommunications, 3D cameras, optical rangefinders, LIDAR, and low light level imagers.
Automotive systems are becoming increasingly difficult and expensive to design successfully as the market demands increasing complexity. Body electronics are particularly affected by this trend, a good example being power window design. This seemingly mundane area involves meeting market and legislative requirements, which means creating a control system that combines the input from several sensors and follows complex behavioral rules [1]. Traditional design methodologies involve writing a text specification and implementing algorithms in C. However, algorithms cannot be verified without hardware. This approach leaves engineers in the unenviable position of waiting for the last piece of hardware to arrive before they can test their system. To avoid these problems, engineers need to decouple algorithm development and verification from the availability of hardware. To address this need, OEMs and suppliers around the world are switching to Model-Based Design.
Mathematical modeling, long established in many engineering domains, is now also gaining strongly in importance in the development of embedded software. In the automotive sector [9], modeling is used on the one hand for the conceptual anticipation of the functionality to be realized (open/closed-loop control, monitoring) and, on the other, for the simulation of the behavior of real physical systems (plant, environment).
The increasing importance of electronics in the automotive industry is illustrated by the growing proportion of manufacturing costs taken up by electrical and electronic systems, which has now reached approx. 30%. At the same time, electrical and electronic systems are the main cause of vehicle failures in the field, accounting for approx. 30% of these. Manufacturers and suppliers alike are well aware of the problems caused by the increasing number of electronic control units (ECUs). Thus, quality assurance is becoming increasingly important, as problems in quality are a liability risk, with the danger of image problems and the cost of recall campaigns and rectification. The realization is that "good quality is expensive, bad quality even more so". Quality must not be left behind by the immense speed at which new technologies and functions are being developed. Quality is becoming a decisive factor in competition, and quality assurance is becoming a key task and a core competence; testing is a key component of quality assurance. To allow testing throughout the entire development process, powerful and efficient means of developing and describing tests are necessary. These also have to take into account the various requirements of the test tasks and the different development phases. This contribution gives an overview of modern test development in various phases of development and of test management throughout the overall process, using a model-based test process.
DSP system designers have no shortage of great ideas and are forever finding new, powerful and creative algorithms; this is what they are good at, and it is generally this skill that secures their paycheck. On the face of it there is no problem: ideas people are paid to come up with ideas, and everyone is happy. Most of the companies who pay system designers, however, have a product to complete, and getting this product to market and selling it is the challenge. This is not just a business problem but also a technical issue, as designers are driven to consider more efficient ways of verifying and implementing their designs. For a long time, software tools have helped designers develop and simulate their designs, improving productivity. However, as designs become more complex, such as Wireless Broadband, Software Defined Radio and Video/Imaging systems, the task of converting algorithms into code for hardware implementation becomes exponentially more difficult.
Several recent EDA surveys [1-2] confirm that The Mathworks Matlab/Simulink and the Unified Modelling Language (UML) are both gaining increased attention as Electronic System Level (ESL) languages. While Matlab is commonly used to model signal-processing-intensive systems, UML has the potential to support innovative ESL methodologies which tie the architecture, design and verification aspects together in a unified perspective. Integrated design flows which exploit the complementarity between UML and Matlab provide an interesting answer to the issues of mono-disciplinary modeling and the necessity of moving beyond point-tool solutions [3]. This paper summarizes how UML and Matlab/Simulink can be associated and the impact of SysML, a new UML-based modeling language for describing complex heterogeneous systems.
The paper presents an innovative simulation scheme to speed up simulations of multi-cluster multi-processor SoCs at the TLM/T (Transaction Level Model with Time) abstraction level. The hardware components of the SoC architecture are written in standard SystemC. The goal is to describe the dynamic behavior of a given software application running on a given hardware architecture (including dynamic contention in the interconnect and cache effects), in order to provide the system designer with the same reliable timing information as a cycle-accurate simulation, at a simulation speed similar to a TLM simulation. The key idea is to apply Parallel Discrete Event Simulation (PDES) techniques to a collection of communicating SystemC SC_THREADs. Experimental results show a simulation speedup of up to a factor of 50 versus a Bus Cycle Accurate (BCA) simulation, for a timing error lower than 10^-3.
The introduction of Transaction Level Modeling (TLM) allows a system designer to model a complete application, composed of hardware and software parts, at several levels of abstraction. The simulation speed of TLM is orders of magnitude faster than traditional RTL simulation; nevertheless, it can become a limiting factor when considering a Multi-Processor System-On-Chip (MP-SoC), as the analysis of these systems can be very complex. The main goal of this paper is to introduce a novel way of exploiting TLM features to increase simulation efficiency of complex systems by switching TLM models at runtime. Results show that simulation performance can be increased significantly without sacrificing the accuracy of critical application kernels.
Recent advances in hardware design have urged the use of transaction-based models as a new intermediate design level. Supporters of the Transaction Level Modeling (TLM) trend claim its efficiency in terms of rapid prototyping and fast simulation in comparison to the classical RTL-based approach. Intuitively, from a verification point of view, faster simulation induces better coverage results. This is driven by two factors: coverage measurement and simulation guidance. In this paper, we propose to use an abstract model of the design, written in the Abstract State Machine Language (AsmL), to provide an adequate way of measuring functional coverage. We then use this metric to define the fitness function of a genetic algorithm proposed to improve simulation efficiency. Finally, we compare our coverage and simulation results to (1) random simulation at TLM and (2) Verisity's Specman tool at RTL.
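A minimal sketch of such a coverage-driven genetic algorithm follows; a plain fitness callback stands in for coverage measured on the AsmL abstract model, and the bit-vector encoding, population size and rates are all illustrative:

```python
import random

def evolve_tests(coverage_of, n_bits=8, pop=20, gens=30, seed=0):
    # Individuals are stimulus bit-vectors; fitness is the functional
    # coverage score returned by `coverage_of` (a stand-in for coverage
    # measured on the abstract model).
    rng = random.Random(seed)
    P = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=coverage_of, reverse=True)
        elite = P[:pop // 2]                  # keep the best half
        children = []
        while len(elite) + len(children) < pop:
            a, b = rng.sample(elite, 2)       # one-point crossover
            cut = rng.randrange(1, n_bits)
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:            # occasional bit-flip mutation
                child[rng.randrange(n_bits)] ^= 1
            children.append(child)
        P = elite + children
    return max(P, key=coverage_of)
```

Elitist selection ensures the best stimulus found so far is never lost, which is the property that lets the coverage metric steadily guide simulation.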
Instruction set simulators are common tools used for the development of new architectures and embedded software, among countless other functions. This paper presents a framework that quickly generates fast and flexible instruction-set simulators from a specification based on a C-like architecture-description language. The framework provides a consistent platform for constructing and evaluating different classes of simulators, including interpreters, static-compiled simulators and dynamic-compiled simulators. The framework also features a new construction method for dynamic-compiled simulators that involves no low-level programming. It profiles and translates frequently executed regions of the simulated binary into C++ code and invokes GCC to compile this code into dynamically loaded libraries, which are then loaded into the simulator at run time to accelerate simulation. Our experimental results based on the MIPS architecture and the SPEC CPU2000 benchmarks show that our dynamic-compiled simulator achieves a speedup of up to 11 times compared to our fast interpreter. Compared to other dynamic-compiled simulators requiring significant system programming expertise to construct, the proposed approach is simpler to implement and more portable.
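The profile-then-translate dispatch loop can be sketched as follows; a cached Python closure stands in for the GCC-compiled, dynamically loaded C++ region, and the accumulator machine and all names are illustrative:

```python
def make_simulator(program, threshold=2):
    """program: dict region_id -> list of ('add' | 'mul', operand) ops."""
    counts, compiled = {}, {}

    def interpret(region, acc):
        # Slow path: decode and execute one op at a time.
        for op, n in program[region]:
            acc = acc + n if op == 'add' else acc * n
        return acc

    def translate(region):
        # Stand-in for emitting C++ and loading the compiled .so:
        # specialize the region's op list into a single closure.
        ops = list(program[region])
        def fast(acc):
            for op, n in ops:
                acc = acc + n if op == 'add' else acc * n
            return acc
        return fast

    def run(region, acc):
        if region in compiled:
            return compiled[region](acc)       # fast path
        counts[region] = counts.get(region, 0) + 1
        if counts[region] >= threshold:        # region is hot: translate once
            compiled[region] = translate(region)
        return interpret(region, acc)

    run.compiled = compiled                    # exposed for inspection
    return run
```

Only regions that cross the execution-count threshold pay the one-time translation cost, which is the amortization argument behind dynamic-compiled simulation.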
A communication-centric design approach, Networks on Chips (NoCs), has emerged as the design paradigm for designing a scalable communication infrastructure for future Systems on Chips (SoCs). As technology advances, the number of applications or use-cases integrated on a single chip increases rapidly. The different use-cases of the SoC have different communication requirements (such as different bandwidth and latency constraints) and traffic patterns. The underlying NoC architecture has to satisfy the constraints of all the use-cases. In this work, we present a methodology to map multiple use-cases onto the NoC architecture, satisfying the constraints of each use-case. We present dynamic re-configuration mechanisms that match the NoC configuration to the communication characteristics of each use-case, also accounting for use-cases that can run in parallel. The methodology is applied to several real and synthetic SoC benchmarks, resulting in a large reduction in NoC area (an average of 80%) and power consumption (an average of 54%) compared to traditional design approaches.
Keywords: Networks on Chips, Systems on Chips, Use-Cases, Modes, Dynamic Re-Configuration.
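The core constraint-combination idea, that use-cases which may run in parallel must be supported simultaneously while mutually exclusive ones only need the worst case, can be sketched as follows (a toy reduction, not the paper's mapping algorithm; all names are illustrative):

```python
def worst_case_demands(use_cases, groups):
    """use_cases: dict name -> {link: bandwidth demand};
    groups: list of sets of use-case names that may run in parallel.
    A link must support the heaviest group that uses it: demands are
    summed within a group and maximized across groups."""
    links = set()
    for demands in use_cases.values():
        links |= set(demands)
    return {l: max(sum(use_cases[u].get(l, 0) for u in g) for g in groups)
            for l in links}
```

Sizing each link for this combined demand, rather than for the sum over all use-cases, is what allows the kind of area and power reduction reported above.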
Increasing miniaturization is posing multiple challenges to electronic designers. In the context of Multi-Processor Systems-on-Chip (MPSoCs), we focus on the problem of implementing efficient interconnect systems for devices that are ever more densely packed with parallel computing cores. Since traditional buses clearly cannot provide enough bandwidth, a revolutionary path to scalability is provided by packet-switched Networks-on-Chip (NoCs), while a more conservative approach dictates the addition of bandwidth-rich components (e.g. crossbars) within the preexisting fabrics. While both alternatives have already been explored, a thorough contrastive analysis is still missing. In this paper, we bring crossbar and NoC designs to the chip layout level in order to highlight their respective strengths and weaknesses in terms of performance, area and power, keeping an eye on future scalability.
Network-on-Chip (NoC) has been proposed to replace traditional bus-based architectures to address the global communication challenges in nanoscale technologies. In future SoC architectures, minimizing power consumption will continue to be an important design goal. In this paper, we present a novel heuristic technique, consisting of system-level physical design and interconnection network generation, that generates custom low-power NoC architectures for application-specific SoCs. We demonstrate the quality of the solutions produced by our technique through experimentation with many benchmarks. Our technique has low computational complexity, and its solutions consume only 1.25 times the power and use 0.85 times the number of router resources compared to an optimal MILP-based technique [1] whose computational complexity is not bounded.
This paper presents the design of an adaptable NoC for FPGA-based dynamically reconfigurable SoCs. At runtime, switches can be added to or removed from the network, allowing the NoC to adapt to the number, size and location of the currently configured hardware modules. By using dynamic routing tables, reconfiguration can be done without stopping or stalling the NoC. The proposed architecture avoids the limitations of bus-based interconnection schemes, which are often applied in partially dynamically reconfigurable FPGA designs.
A novel noise transfer function (NTF) for high-order reduced-sample-rate sigma-delta-pipeline (SDP) ADCs is presented. The proposed NTF determines the location of the non-zero poles, improving loop stability while implementing the reduced-sample-rate structure. A design methodology based on a simulated annealing algorithm is developed to design the optimum NTF. To verify the usefulness of the proposed NTF and design procedure, two different modulators are presented. Simulation results show that with a 4th-order modulator designed using the proposed approach, maximum SNDRs of 115dB and 124.1dB can be achieved with OSRs of only 8 and 16, respectively.
This paper presents a systematic and optimal design of the hybrid cascode compensation method used in fully differential two-stage CMOS operational transconductance amplifiers (OTAs). Closed-loop analysis results are given to obtain a design procedure. A simple design procedure for minimum settling time with the hybrid cascode compensation technique in a two-stage class A/AB amplifier is proposed. Optimal design of power dissipation is considered to achieve the lowest power consumption for the required settling time. Finally, a design example is presented to show both the usefulness of hybrid cascode compensation and the proposed design procedure. The proposed technique can assist circuit designers and can also be used in computer-aided circuit design tools.
Analyzing the stability of an analog circuit is an important part of circuit design. Several commercial simulators are equipped with special stability analysis techniques, but problems arise when the design kit does not support such a simulator. Another issue is that the designer may want insight into the sources of instability in order to propose a stabilization, which can be done by analyzing the open-loop or closed-loop transfer function of the circuit. The aim of this paper is to propose an automated analysis method which identifies the nodes to be considered for stabilization. The method does not need to break feedback loops or manipulate netlists; it only uses AC simulations and does not require the full modified nodal equations. The method is illustrated on three design examples: a Voltage Controlled Oscillator (VCO), a reference bias circuit and the common-mode feedback network in a gm-C filter.
This paper describes an original methodology for accurately modeling MOSFET process parameter variations. As compared to other process parameter variation modeling methods, the proposed methodology is capable of correctly modeling not only differences of process/model parameters, but also the process parameter variations for individual devices. This capability is very important for popular analog circuits like current biasing circuits, voltage reference circuits, and single-ended output amplifiers.
A baseband filter synthesizer that takes a behavioural description of the design and produces an efficient transistor level implementation is presented. The tool optimizes the filter at the cascade level, providing the best trade-off between power consumption and dynamic range, and at the cell level, selecting minimum power solutions, through accurate analytical models and an efficient bi-quad topology. Differently from past cascade design techniques based on dynamic range optimization through linear programming [2], we focus on power minimization while guaranteeing minimum performance levels, given the increasing importance of power savings in hand-held devices. A synthesized filter has been realized in silicon demonstrating the effectiveness of our approach.
Soft errors have emerged as an important reliability challenge for nanoscale VLSI designs. In this paper, we present a fast and efficient soft error rate (SER) computation algorithm for combinational circuits. We first present a novel parametric waveform model based on the Weibull function to represent particle strikes at individual nodes in the circuit. We then describe the construction of the SET descriptor, which efficiently captures the correlation between the transient waveforms and their associated rate distribution functions. The proposed algorithm consists of operations to inject, propagate and merge SET descriptors while traversing forward along the gates in a circuit. The parameterized waveforms enable an efficient static approach to calculating the SER of a circuit. We exercise the proposed approach on a wide variety of combinational circuits and observe that our algorithm has runtime linear in the size of the circuit. The runtimes for soft error estimation were observed to be on the order of one second, compared to several minutes or even hours for previously proposed methods.
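A Weibull-shaped transient can be written down directly; the parameterization below is an assumed illustration in the spirit of the paper's waveform model, not its exact form:

```python
import math

def weibull_pulse(t, scale, lam, k):
    # Weibull-shaped transient current (lam: time scale, k: shape,
    # scale: charge-dependent amplitude factor) -- an assumed
    # parameterization for illustration.
    if t <= 0.0:
        return 0.0
    x = t / lam
    return scale * (k / lam) * x ** (k - 1) * math.exp(-(x ** k))

def pulse_width(threshold, scale, lam, k, dt=1e-3, t_max=10.0):
    # Time the pulse spends above a gate's switching threshold; in SER
    # analysis this width governs whether a downstream latch can
    # capture the propagated glitch.
    above, t = 0, 0.0
    while t < t_max:
        if weibull_pulse(t, scale, lam, k) > threshold:
            above += 1
        t += dt
    return above * dt
```

Because the pulse is described by a few parameters rather than a sampled waveform, propagating and merging such descriptors stays cheap, which is what makes a static, simulation-free SER pass feasible.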
In this paper we present a novel circuit for the on-line detection of transient and crosstalk faults affecting the interconnects of systems implemented using Field Programmable Gate-Arrays (FPGAs). The proposed detector features self-checking ability with respect to faults possibly affecting itself, thus being suitable for systems with high reliability requirements, like those for space applications. Compared to alternate solutions, the proposed circuit requires a significantly lower area overhead, while implying a comparable, or lower, impact on system performance. We have verified our circuit operation and self-checking ability by means of post-layout simulations.
In this paper we describe a methodology to exactly measure the quality of fault-tolerant designs by combining fault injection in high-level design (HLD) descriptions with a formal verification approach. We utilize BDD-based symbolic simulation to determine the coverage of online error-detection and -correction logic. We describe an easily portable approach which can be applied to a wide variety of multi-GHz industrial designs. Index Terms - Formal Verification, Soft Error Injection, Error Detection and Correction, Fault/Error Coverage
This paper investigates the sensitivity of real-time systems running applications under operating systems that are subject to soft-errors. We consider applications using different real-time operating system services: scheduling, time and memory management, intertask communication and synchronization. We report results of a detailed analysis regarding the impact of soft-errors on real-time operating systems cores, taking into account the application timing constraints. Our results show the extent to which soft-errors occurring in a real-time operating system's kernel impact its reliability.
We present a de-layered protocol engine for termination of 40Gbps TCP connections using a reconfigurable FPGA silicon platform. This protocol engine is designed for a planned attempt at the Internet Speed Record. In laboratory demonstrations at 40Gbps, this core beat the previous record of 7.2Gbps by a factor of five. We present an aggressive cross-layer optimization methodology and the corresponding design flow and tools used to implement this record-breaking TCP Protocol Engine. The 40Gbps TCP Offload Engine has been implemented on a Xilinx FPGA platform, based on a VirtexII-pro 2VP7 device. Each FPGA device terminates a 10Gbps channel, and the aggregate capacity of the four FPGA devices is 40Gbps. The four 10Gbps channels are intended to be connected to four trunked 10GbE Ethernet ports on a router. The 40Gbps TCP implementation has been demonstrated in system-level as well as gate-level simulations, and live implementations have been tested in the lab with each 10Gbps channel FPGA board connected back-to-back in transmission tests at full wire speed. We believe this to be the fastest TCP protocol engine implemented so far.
This paper presents a multi-board, multi-FPGA hardware/software architecture for computation-intensive, high-resolution (2048x2048 pixels), real-time (24 frames per second) digital film processing. It is based on Xilinx Virtex-II Pro FPGAs, large SDRAM memories for multiple-frame storage and a PCI Express communication network. The architecture reaches record performance running a complex noise reduction algorithm including a 2.5-dimensional DWT and a full 16x16 motion estimation at 24 fps, requiring a total of 203 Gops/s net computing performance and a total of 28 Gbit/s DDR-SDRAM frame memory bandwidth. To increase design productivity and yet achieve high clock rates (125MHz), the architecture combines macro component configuration and macro-level floorplanning with weak programmability using distributed microcoding. As an example, the core of the bidirectional motion estimation, using 2772 CLBs, reaching 155 Gop/s (1538 op/pixel) and requiring 7 Gbit/s external memory bandwidth, was developed in two man-months.
Keywords: motion-estimation, weak-programming, stream-based architecture, digital film, reconfigurable, FPGA
The design of future communication systems with high throughput demands will become a critical task, especially when sophisticated channel coding schemes have to be applied. LDPC codes are one of the most promising candidates because of their outstanding communications performance. One major problem for a decoder hardware realization is the huge design space composed of many interrelated parameters, which enforces drastic design trade-offs. Another important issue is the need for flexibility of such systems. In this paper we illuminate this design space with special emphasis on the strong interrelations of these parameters. Three design studies are presented to highlight the effects on a generic architecture if some parameters are constrained by a given standard, a given technology, and given area budgets.
We propose a novel methodology to generate Application Specific Instruction Processors (ASIPs) including custom instructions. Our implementation balances performance and area requirements by making custom instructions reusable across similar pieces of code. In addition to arithmetic and logic operations, table look-ups within custom instructions reduce costly accesses to global memory. We present synthesis and cycle-accurate simulation results for six embedded benchmarks running on customised processors. Reusable custom instructions achieve an average 319% speedup with only 5% additional area. The maximum speedup of 501% for the Advanced Encryption Standard (AES) requires only 3.6% additional area.
Instruction Set Extensions (ISEs) can be used effectively to accelerate the performance of embedded processors. The critical, and difficult task of ISE selection is often performed manually by designers. A few automatic methods for ISE generation have shown good capabilities, but are still limited in the handling of memory accesses, and so they fail to directly address the memory wall problem. We present here the first ISE identification technique that can automatically identify state-holding Application-specific Functional Units (AFUs) comprehensively, thus being able to eliminate a large portion of memory traffic from cache and main memory. Our cycle-accurate results obtained by the SimpleScalar simulator show that the identified AFUs with architecturally visible storage gain significantly more than previous techniques, and achieve an average speedup of 2.8x over pure software execution. Moreover, the number of required memory-access instructions is reduced by two thirds on average, suggesting corresponding benefits on energy consumption.
In recent years, processor customization has matured to become a trusted way of achieving high performance with limited cost/energy in embedded applications. In particular, Instruction Set Extensions (ISEs) have been proven very effective in many cases. A large body of work exists today on creating tools that can select efficient ISEs given an application source code: ISE automation is crucial for increasing the productivity of design teams. In this paper we show that an additional motivation for automating the ISE process is to facilitate algorithm exploration: the availability of ISEs can have a dramatic impact on the performance of different algorithmic choices to implement identical or equivalent functionality. System designers need fast feedback on the ISE-ability of various algorithmic flavors. We use a case study in elliptic curve (EC) cryptography to exemplify the following contributions: (1) ISEs can reverse the relative performance of different algorithms for one and the same operation, and (2) automatic ISE, even without predicting speed-ups as precisely as detailed simulation can, is able to show exactly the trends that the designer should follow.
Code size and energy consumption are critical design concerns for embedded processors as they determine the cost of the overall system. Techniques such as reduced-length instruction sets lead to significant code size savings but also introduce performance and energy consumption impediments such as additional dynamic instructions or decompression latency. In this paper, we show that a block-aware instruction set (BLISS), which stores basic block descriptors in addition to and separately from the actual instructions in the program, allows embedded processors to achieve significant improvements in all three metrics: reduced code size, improved performance and lower energy consumption.
The increasing complexity of embedded systems pushes system designers to higher levels of abstraction. Transaction Level Modeling (TLM) has been proposed to model communication in systems in an abstract manner. Although being widely accepted, TLMs have not been analyzed for their loss in accuracy. This paper will analyze and quantify the speed-accuracy tradeoff of TLM using a case study on AMBA, an industry bus standard. It shows the results of modeling the Advanced High-performance Bus (AHB) of AMBA using a set of models at different abstraction levels. The analysis of the simulation speed shows improvements of two orders of magnitude for each TLM abstraction, while the timing in the model remains accurate for many applications. As a result, the paper will classify the different models towards their applicability in typical modeling situations, allowing the system designer to achieve fast and accurate simulation of communication.
Recent research on performance analysis for embedded systems shows a trend to formal compositional models and methods. These compositional methods can be used to determine the performance of embedded systems by composing formal analytical models of the individual components. In case there exist no formal component models with the required precision, simulation-based approaches are used for system-level performance analysis. The often high runtimes of simulation runs lead to the new approach described in this paper: Analytical methods are combined with simulation-based approaches to speed up simulation. We describe how the simulation models can be coupled with the formal analysis framework, specify the interfaces needed for such a combination and show the applicability of the approach using a case study.
UML2 and SysML try to adopt techniques known from software development to systems engineering. However, the focus has until now been on modeling aspects, and quantitative performance analysis is not adequately taken into account in early design stages of the system. In this paper, we present our approach for formal and simulation-based performance analysis of systems specified with UML2/SysML. The basis of our analysis approach is the detection of communications that synchronize the control flow of the corresponding instances of the system, making this relationship explicit. Using this knowledge, we are able to determine global timing behavior and to detect violations of preset timing constraints. Hence, it is also possible to detect potential conflicts on shared communication resources if a specification of the target architecture is given. With this information it is possible to evaluate system models at an early design stage.
The ever increasing complexity and heterogeneity of modern System-on-Chip (SoC) architectures make an early and systematic exploration of alternative solutions mandatory. Efficient performance evaluation methods are of highest importance for a broad search in the solution space. In this paper we present an approach that captures the SoC functionality for each architecture resource as sequences of trace primitives. These primitives are translated at simulation runtime into transactions and superposed on the system architecture. The method uses SystemC as modeling language, requires low modeling effort and yet provides accurate results within reasonable turnaround times. A concluding application example demonstrates the effectiveness of our approach.
As the complexity of today's systems continues to grow, we are moving away from creating individual components from scratch, toward methodologies that emphasize composition of re-usable components via the network paradigm. Complex component interactions can create a range of amazing behaviors, some useful, some unwanted, some even dangerous. To manage them, a "science" for network design is evolving, applicable in some surprising areas. In this paper, we consider a few application domains and discuss the design challenges involved from a methodology standpoint. From large-scale hardware/software systems, to dynamically adaptive sensor networks, and network-on-chip architectures, these ideas find wide application.
Properties of analog circuits can be verified formally by partitioning the continuous state space and applying hybrid system verification techniques to the resulting abstraction. To verify properties of oscillator circuits, cyclic invariants need to be computed. Methods based on forward reachability have proven to be inefficient and in some cases inadequate in constructing these invariant sets. In this paper we propose a novel approach combining forward- and backward-reachability while iteratively refining partitions at each step. The technique can yield dramatic memory and runtime reductions. We illustrate the effectiveness by verifying, for the first time, the limit cycle oscillation behavior of a third-order model of a differential VCO circuit.
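The combined forward/backward reachability idea above can be illustrated on a toy discrete abstraction: on a finite graph over state-space partitions, the intersection of the forward-reachable and backward-reachable sets of a seed region is a candidate invariant containing the cycle. The graph and seed below are hypothetical; the paper works on hybrid-system abstractions with iterative partition refinement, which this sketch omits.

```python
from collections import deque

def reach(graph, sources):
    """BFS reachable set over a finite abstraction of the state space."""
    seen, q = set(sources), deque(sources)
    while q:
        u = q.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                q.append(v)
    return seen

def cyclic_invariant(graph, seeds):
    """States both forward- and backward-reachable from the seed region:
    a candidate cyclic invariant (simplified, no refinement step)."""
    rev = {}
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    return reach(graph, seeds) & reach(rev, seeds)
```

On a graph with a 1-2-3 cycle plus entry and exit states, only the cycle states survive the intersection, which is why backward reachability prunes the transient states that pure forward analysis drags along.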
We present a generalization of standard AC analysis to oscillators by exploiting least-squares solution techniques. This provides an attractive alternative to the current practice of employing transient simulation for small signal analysis of oscillators. Unlike phase condition based oscillator analysis techniques, which suffer from numerical artifacts, the least-squares approach of this paper results in a robust and efficient oscillator AC technique. We validate our method on LC and ring oscillators, obtaining speedups of 1-3 orders of magnitude over transient simulation, and 4-6x over phase-condition-based techniques.
CAFFEINE, introduced previously, automatically generates nonlinear, template-free symbolic performance models of analog circuits from SPICE data. Its key was a directly-interpretable functional form, found via evolutionary search. In application to automated sizing of analog circuits, CAFFEINE was shown to have the best predictive ability among 10 regression techniques, but was too slow to be used practically in the optimization loop. In this paper, we describe Double-Strength CAFFEINE, which is designed to be fast enough for automated sizing, yet retain good predictive abilities. We design "smooth, uniform" search operators, which have been shown to greatly improve efficiency in other domains. Such operators are not straightforward to design; we achieve them in functions by simultaneously making the grammar-constrained functional form implicit, and embedding explicit "introns" (subfunctions appearing in the candidate that are not expressed). Experimental results on six test problems show that Double-Strength CAFFEINE achieves an average speedup of 5x on the most challenging problems and 3x overall, making the technique fast enough for automated sizing.
A new approach for automated synthesis of analog and mixed-signal systems is presented. The heterogeneous genetic optimization strategy starts from a functional description and, in a strict top-down design process, evolves a simple design solution into a complex one that fulfills multiple objectives. Transformations of both architecture and parameters are applied. The expected improvement of the violated objectives is used as the driver for transformation selection. The topology is actually created during synthesis, opening the opportunity to explore new architectures.
This paper describes a novel approach to the problem of model order reduction (MOR) of very large nonlinear systems. We consider the behavior of a dynamic nonlinear system as having two fundamental characteristics: a global behavioral "envelope" that describes major transformations to the state of the system under external stimuli, and a local behavior that describes small perturbation responses. The nonlinear low-order envelope function is generated by using the remainders from the coalescence of projection bases taken through a state-space sample. A behavioral model can then be expressed as the superposition of these two descriptions, operating according to the input stimuli and the current state value. Local effects are captured by regions through a set of linear projections to a reduced state space, while global effects are captured by examining the non-commonality among these projections. These "remainders" are used to build a modulation function that generates the required dynamic changes in the common linear projection. The advantage of the envelope representation for strongly nonlinear systems is that it simplifies the complexity of the model into a two-part problem. Depending on the complexity or cost of the behavioral separation procedure, it can be repeated recursively.
We present a new methodology for fast analog circuit synthesis, based on the use of temperature-dependent symbolic sensitivity analysis and symbolic performance evaluation in the synthesis loop. Fast sensitivity analysis and performance estimation are based on element-coefficient diagrams (ECDs). Sensitivity and performance-evaluation expressions are generated from the ECDs at the same time, which greatly reduces overall runtime. The experimental results demonstrate that the speed and convergence of analog synthesis are improved significantly.
Multiple levels of design hierarchy are common in current-generation system-on-chip (SOC) integrated circuits. However, most prior work on test access mechanism (TAM) optimization and test scheduling is based on a flattened design hierarchy. We investigate hierarchy-aware test infrastructure design, wherein wrapper/TAM optimization and test scheduling are carried out for hierarchical SOCs for two practical design scenarios. In the first scenario, the wrapper and TAM implementation for the embedded child cores in hierarchical (parent) cores are delivered in a hard form by the core provider. In the second scenario, the wrapper and TAM architecture of the child cores embedded in the parent cores are implemented by the system integrator. Experimental results are presented for the ITC'02 SOC test benchmarks.
This paper presents a test scheduling approach for system-on-chip production tests with peak-power constraints. An abort-on-first-fail test approach is assumed, whereby the test is terminated as soon as the first fault is detected. Defect probabilities of individual cores are used to guide the test scheduling, and the peak-power constraint is considered in order to limit the test concurrency. Test set partitioning is used to divide a test set into several test sequences so that they can be tightly packed into the two-dimensional space of power and time. The partitioning of test sets is integrated into the test scheduling process. A heuristic has been developed to find an efficient test schedule that leads to reduced expected test time. Experimental results have shown the efficiency of the proposed test scheduling approach.
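The role of defect probabilities under abort-on-first-fail can be made concrete with a deliberately simplified model (not the paper's algorithm): a purely sequential schedule in which a failing core is assumed to be detected only at the end of its own test.

```python
def expected_test_time(schedule):
    """Expected test time under abort-on-first-fail (simplified model).

    schedule: list of (test_time, defect_probability) per core, in test order.
    A core's test is run only if every earlier core passed; a failure is
    assumed to be detected at the end of the failing core's test.
    """
    expected = 0.0
    pass_prob = 1.0  # probability that all earlier cores passed
    for t, p in schedule:
        expected += pass_prob * t
        pass_prob *= (1.0 - p)
    return expected
```

Even this toy model shows the scheduling intuition: placing cores with high defect probability earlier shrinks the expected test time, which is what defect-probability-guided scheduling exploits.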
This paper presents a wrapper and test access mechanism design for multi-clock domain SoCs, which consist of cores operating at different clock frequencies during test. We also propose a test scheduling algorithm for multi-clock domain SoCs to minimize test time under a power constraint. In the proposed method, we use virtual TAMs to bridge the frequency gaps between the cores and the ATE, and also to reduce the power consumption of a core during test while maintaining the core's test time. Experimental results show the effectiveness of our method not only for multi-clock domain SoCs, but also for single-clock domain SoCs with power constraints.
Keywords: multi-clock domain SoC, test scheduling, test access mechanism, power consumption
In this paper, we propose a new method for test access and test scheduling in NoC-based systems. It relies on a progressive reuse of the network resources for transporting test data to routers. We present possible solutions for the implementation of this scheme. We also show how router testing can be scheduled concurrently with core testing to reduce test application time. Experimental results for the ITC'02 SoC benchmarks show that the proposed method can lead to a substantial reduction in test application time compared to previous work based on the use of serial boundary scan. The method can also help to reduce hardware overhead.
Fast failure analysis is a key enabler in shortening the time between design tape out and product introduction in the market. With faster detection of manufacturability issues, problems associated with parametric variations, model approximations or physical design rules can be fixed faster either at the process control level or at the mask level. Failure analysis can be accelerated with additional hardware support for design-for-testability (DFT) and design-for-failure-analysis (DFFA). In this paper, we will focus on one such DFFA technique deployed in the industry, identify its shortcomings and offer improvements to fix deficiencies.
In this paper, we present a test generation framework for testing of quantum-dot cellular automata (QCA) circuits. QCA is a nanotechnology that has attracted significant recent attention and shows immense promise as a viable future technology. This work is motivated by the fact that the stuck-at fault test set of a circuit is not guaranteed to detect all defects that can occur in its QCA implementation. We show how to generate additional test vectors to supplement the stuck-at fault test set to guarantee that all simulated defects in the QCA gates get detected. Since nanotechnologies will be dominated by interconnects, we also target bridging faults on QCA interconnects. The efficacy of our framework is established through its application to QCA implementations of MCNC benchmarks that use majority gates as primitives.
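Since QCA logic is built from majority gates, the baseline stuck-at test generation the paper starts from can be sketched by exhaustive enumeration on a single 3-input majority gate. This is an illustration of stuck-at testing only, not the paper's QCA-specific defect models.

```python
def maj(a, b, c):
    """3-input majority gate, the QCA logic primitive."""
    return (a & b) | (b & c) | (a & c)

def stuck_at_tests():
    """For each single stuck-at fault on an input of a majority gate,
    enumerate the input vectors that detect it (faulty output != good output)."""
    tests = {}
    for inp in range(3):          # faulted input position
        for sa in (0, 1):         # stuck-at value
            vecs = []
            for v in range(8):
                bits = [(v >> i) & 1 for i in range(3)]
                good = maj(*bits)
                faulty = list(bits)
                faulty[inp] = sa
                if maj(*faulty) != good:
                    vecs.append(tuple(bits))
            tests[(inp, sa)] = vecs
    return tests
```

Each of the six single stuck-at faults on a majority gate is detected by exactly two vectors (the faulted input at the opposite value and the other two inputs disagreeing), which is the kind of test set the paper then supplements with defect-oriented vectors.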
Quantum information processing technology is in its pioneering stage and no proficient method for synthesizing quantum circuits has been introduced so far. This paper introduces an effective analysis and synthesis framework for quantum logic circuits. The proposed synthesis algorithm and flow can generate a quantum circuit using the most basic quantum operators, i.e., the rotation and controlled-rotation primitives. The paper introduces the notion of quantum factored forms and presents a canonical and concise representation of quantum logic circuits in the form of quantum decision diagrams (QDDs), which are amenable to efficient manipulation and optimization including recursive unitary functional bi-decomposition. This paper concludes by presenting the QDD-based algorithm for automatic synthesis of quantum circuits.
Recent advances in microfluidics are expected to lead to sensor systems for high-throughput biochemical analysis. CAD tools are needed to handle increased design complexity for such systems. Analogous to classical VLSI synthesis, a top-down design automation approach can shorten the design cycle and reduce human effort. We focus here on the droplet routing problem, which is a key issue in biochip physical design automation. We develop the first systematic droplet routing method that can be integrated with biochip synthesis. The proposed approach minimizes the number of cells used for droplet routing, while satisfying constraints imposed by throughput considerations and fluidic properties. A real-life biochemical application is used to evaluate the proposed method.
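A minimal model of the droplet routing problem above is shortest-path search on the biochip cell array. The sketch below assumes a single droplet, static obstacles and 4-neighbor moves; the paper's method additionally handles multiple droplets, fluidic constraints and cell-count minimization across nets.

```python
from collections import deque

def route_droplet(grid_w, grid_h, src, dst, blocked):
    """BFS shortest droplet route on a grid_w x grid_h cell array.

    src, dst: (x, y) cells; blocked: set of cells the droplet may not enter.
    Returns the list of cells from src to dst, or None if no route exists.
    """
    q = deque([src])
    prev = {src: None}            # also serves as the visited set
    while q:
        cell = q.popleft()
        if cell == dst:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if 0 <= nx < grid_w and 0 <= ny < grid_h \
                    and nxt not in blocked and nxt not in prev:
                prev[nxt] = cell
                q.append(nxt)
    return None  # no feasible route
```

Minimizing the number of cells touched by all routes, subject to timing and fluidic constraints, is what distinguishes the biochip problem from plain maze routing.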
Discrete-droplet digital microfluidics-based biochips face problems similar to those in other VLSI CAD systems, but with new constraints and interrelations. We focus on one such problem: resource-constrained scheduling for digital microfluidic biochips. Since the problem is NP-complete, finding the optimal solution is very time-consuming. We propose a hybrid priority scheduling algorithm directly applicable to digital microfluidics, with the potential to yield near-optimal schedules in the general case in a very short time. Furthermore, we propose the use of configurable detectors that allow for further improved system performance.
It is anticipated that self assembled ultra-dense nanomemories will be more susceptible to manufacturing defects and transient faults than conventional CMOS-based memories, thus the need exists for fault-tolerant memory architectures. The development of such architectures will require intense analysis in terms of achievable performance measures - power dissipation, area, delay and reliability. In this paper, we propose and develop a hybrid automation framework, called HMAN, that aids the design and analysis of fault-tolerant architectures for nanomemories. Our framework can analyze memory architectures at two different levels of the design abstraction, namely the system and circuit levels. To the best of our knowledge, this is the first such attempt at analyzing memory systems at different levels of abstraction and then correlating the different performance measures. We also illustrate the application of our framework to self-assembled crossbar architectures by analyzing a hierarchical fault-tolerant crossbar-based memory architecture that we have developed.
Optical interconnects enable faster signal propagation with virtually no crosstalk. In addition, wavelength division multiplexing allows a single waveguide to be shared among multiple interconnects. This paper proposes efficient algorithms for the construction of timing- and congestion-driven waveguides considering the optical resource constraints. We develop the first optical router for System-on-Packages (SOPs), which reduces electrical wirelength by 11% and improves performance by 23% when a single optical layer is introduced for every placement layer.
Reduced energy consumption is one of the most important design goals for embedded application domains like wireless, multimedia and biomedical. Instruction memory hierarchy has been proven to be one of the most power hungry parts of the system. This paper introduces an architectural enhancement for the instruction memory to reduce energy and improve performance. The proposed distributed instruction memory organization requires minimal hardware overhead and allows execution of multiple loops in parallel in a uni-processor system. This architecture enhancement can reduce the energy consumed in the instruction and data memory hierarchy by 70.01% and improve the performance by 32.89% compared to enhanced SMT based architectures.
In current multimedia systems major parts of the functionality consist of software tasks executed on a set of concurrently operating processors. Those tasks interfere with each other when they share memory and other hardware components. For instance, when the tasks share caches and no precautions are taken, they potentially flush each other's data at random. In this case control over the system performance is lost. However, in media processing the performance must be under tight control. In particular, the performance of each individual task must be preserved if the tasks are executed concurrently in arbitrary combinations or if additional tasks are added. A system satisfying this property is referred to as compositional. This paper proposes a novel cache partitioning technique that enhances compositionality. We assume a cache to be a rectangular array of memory elements arranged in "sets" (rows) and "ways" (columns). We perform two partitioning types. First, each task and each piece of inter-task common data gets an exclusive part of the cache sets. Second, inside the cache sets of common data, each task accessing it gets a number of ways. We apply the proposed method on a homogeneous multiprocessor using two applications: H.264 decoding and picture-in-picture TV. Our experiments indicate that, for both applications, under our partitioning scheme the sum of misses of the individual tasks executed separately and the number of misses of all tasks executed concurrently differ by at most 4%. We conclude that compositionality is achieved within reasonable bounds. Additionally, our technique appears to improve the efficiency of the cache operation.
Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end, high-performance embedded applications, researchers and designers need to address how to automate the design process of customized out-of-order processors. Because of the parallel execution of independent instructions in out-of-order processors, in-order processor design methodologies that subdivide the search space into independent components are unlikely to be effective in terms of accuracy for designing out-of-order processors. In this paper we propose and evaluate various automated single- and multi-objective optimizations for exploring out-of-order processor designs. We conclude that the newly proposed genetic local search algorithm outperforms all other search algorithms in terms of accuracy. In addition, we propose two-phase simulation, in which the first phase explores the design space through statistical simulation; a region of interest is then simulated through detailed simulation in the second phase. We show that two-phase simulation yields simulation-time speedups of a factor 2.2X to 7.3X.
Embedded systems allow application-specific optimizations to improve the power/performance trade-off. In this paper, we show how application-specific hashing of the address can eliminate a large number of conflict misses in caches. We consider XOR-functions: each set index bit is computed as the XOR of a subset of the address bits. Previous work has considered simpler bit-selecting functions. Compared to such work, the contributions of this paper are two-fold. Firstly, we present a heuristic algorithm to construct application-specific XOR-functions. Secondly, in order to adapt the hashing to the application, we show that a reconfigurable XOR-function selector is inherently less complex than a reconfigurable selector for bit-selecting functions. This is possible by placing restrictions on the allowed XOR-functions. Our evaluation shows a reduction of cache misses for standard benchmarks averaging between 30% and 60%, depending on the cache size.
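The XOR set-index functions described above are simple to state: each index bit is the XOR of a chosen group of address bits, with plain bit selection as the special case of one bit per group. The bit groups below are hypothetical examples, not the paper's heuristic output.

```python
def xor_index(addr, bit_groups):
    """Cache set index where index bit i is the XOR of the address bits
    listed in bit_groups[i]. Bit selection = one address bit per group."""
    index = 0
    for i, group in enumerate(bit_groups):
        bit = 0
        for b in group:
            bit ^= (addr >> b) & 1
        index |= bit << i
    return index
```

A stride-4 address stream shows why hashing helps: with a 4-set cache indexed by bits [0] and [1], all such addresses collide in one set, while XOR groups that fold in higher bits spread them across all four sets, removing the conflict misses.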
In this work, we investigate the problem of automatically mapping applications onto a coarse-grained reconfigurable architecture and propose an efficient algorithm to solve the problem. We formalize the mapping problem and show that it is NP-complete. To solve the problem within a reasonable amount of time, we divide it into three subproblems: covering, partitioning and layout. Our empirical results demonstrate that our technique produces nearly as good performance as hand-optimized outputs for many kernels.
In this paper, we propose two FPGA-area allocation algorithms based on profiling results for reducing the impact on performance of dynamic reconfiguration overheads. The problem of FPGA-area allocation is presented as a 0-1 integer linear programming problem and efficient solvers are incorporated for finding the optimal solutions. Additionally, we discuss the FPGA-area allocation problem in two scenarios. In the first scenario, all hardware operations are allocated on the FPGA, while in the second scenario, any hardware operation can be switched to software execution in order to provide an overall performance improvement. We evaluate our proposed algorithms using the MPEG2 and MJPEG encoder multimedia benchmarks and the hardware implementations for SAD, DCT, IDCT, Quantization and VLC tasks. We show that a significant performance improvement (up to 61% for MPEG2 and 94% for MJPEG) is achieved when the proposed algorithms are used, while the reconfiguration overhead is reduced by at least 36% for MJPEG.
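The core of a 0-1 area-allocation formulation like the one above can be sketched with a brute-force stand-in for the ILP solver: pick the subset of candidate hardware operations that maximizes profiled cycle savings within the area budget. Exhaustive search is only viable for a handful of candidates; the op list and numbers are illustrative.

```python
from itertools import product

def allocate(ops, area_budget):
    """0-1 selection of hardware operations (brute-force ILP stand-in).

    ops: list of (area_cost, cycles_saved) per candidate operation,
    e.g. profiled SAD/DCT/IDCT kernels. Returns (best_saving, 0/1 vector).
    """
    best_saving, best_pick = 0, ()
    for pick in product((0, 1), repeat=len(ops)):
        area = sum(a for (a, _), x in zip(ops, pick) if x)
        if area <= area_budget:
            saving = sum(s for (_, s), x in zip(ops, pick) if x)
            if saving > best_saving:
                best_saving, best_pick = saving, pick
    return best_saving, best_pick
```

A real ILP solver handles the same objective and constraint at far larger scale, plus the paper's second scenario where deselected operations fall back to software execution.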
Temporal partitioning techniques make it possible to implement large and complex applications on FPGA devices by splitting them into partitions. To minimize resources, these partitions can be multiplexed onto a single FPGA area using reconfiguration techniques. Such multiplexing increases the effective area, allowing parallelism exploitation in small devices. However, multiplexing implies reconfiguration time, which can degrade application performance. Intensive parallelism exploitation in massively computational applications must therefore be explored to compensate for this overhead. In this work, a temporal partitioning technique is presented for a class of image processing (massive computation) applications. The proposed technique is based on the algorithmic complexity (area x time) of each task that composes the application. Experimental results demonstrate the efficiency of the approach when compared to the optimal solution obtained by exhaustive timing search.
This paper presents a new operation-chaining reconfigurable scheduling algorithm (CRS) based on list scheduling that maximizes the instruction-level parallelism available in distributed high-performance instruction-cell-based reconfigurable systems. Unlike other typical scheduling methods, it considers the placement and routing effect, register assignment and an advanced operation-chaining compilation technique to generate higher-performance scheduled code. The effectiveness of this approach is demonstrated here using a recently developed industrial distributed reconfigurable instruction-cell-based architecture [11]. The results show that schedules using this approach achieve throughput equivalent to VLIW architectures but at much lower power consumption.
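The list-scheduling base of the CRS algorithm above can be sketched in its plainest form: cycle by cycle, schedule ready operations onto a fixed number of units. This sketch assumes unit-latency operations and omits the paper's chaining, placement/routing and register-assignment considerations.

```python
def list_schedule(ops, deps, num_units):
    """Resource-constrained list scheduling (simplified: unit latency).

    ops: operation ids; deps: {op: set of predecessor ops} (a DAG);
    num_units: operations issued per cycle. Returns {op: cycle}.
    """
    scheduled = {}
    cycle = 0
    remaining = set(ops)
    while remaining:
        # An op is ready once all its predecessors finished in earlier cycles.
        ready = [o for o in remaining
                 if all(scheduled.get(p, cycle) < cycle for p in deps.get(o, ()))]
        for o in sorted(ready)[:num_units]:   # simple priority: lexicographic
            scheduled[o] = cycle
            remaining.discard(o)
        cycle += 1
    return scheduled
```

CRS replaces the naive priority and the unit-latency assumption with chaining-aware timing, which is where its throughput advantage over plain list scheduling comes from.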
The concepts of Design for Manufacturability and Design for Yield (DFM/DFY) are bringing together domains that until now co-existed mostly separately: circuit design, physical design and the manufacturing process. New requirements such as SoC, mixed analog/digital design and deep-submicron technologies force a mutual integration of all levels. A major challenge of new deep-submicron technologies is to design and verify integrated circuits for high yield. Random and systematic defects as well as parametric process variations have a large influence on the quality and yield of designed and manufactured circuits. With further shrinking of process technology, on-chip variation worsens with each technology node. For technologies with feature sizes above 180nm, variations are mostly below 10%; an acceptable yield range is achieved by regular but error-prone re-centering of the drifting process. Shrinking to 90nm, 65nm and below, however, causes on-chip variations of more than 50%, so tuning the technology process alone is no longer enough to guarantee sufficient yield and robustness. Redesigns, and therefore respins of the whole development and manufacturing chain, lead to the high costs of multiple manufacturing runs; altogether, the risk of missing the given market window is extremely high. Thus, it becomes inevitable to have a seamless DFM/DFY concept realized for the design phase of digital, analog, and mixed-signal circuits. New DFY methodologies for parametric yield analysis and optimization have recently become available for the industrial design of individual analog blocks at transistor level, up to 1500 transistors. Transferring yield analysis and yield optimization techniques to other abstraction levels, both digital and analog, is a big challenge.
Yield analysis and optimization is currently applied to individual circuit blocks rather than to the overall chip. On the one hand, this often yields overly pessimistic results for the digital parts (best/worst case and OCV (On-Chip Variation) factors); on the other hand, for analog parts, very high effort is often spent designing individual blocks with high robustness (>6σ). For abstraction to higher digital levels, first approaches such as statistical static timing analysis (SSTA) are under development. For the analog parts, a strategy is required to develop macro models and hierarchical or behavioral simulation methodologies that include low-level statistical effects caused by local and global process variation of the individual devices.
This paper proposes a methodology for designing reconfigurable continuous-time ΔΣ modulator topologies. The methodology is based on the concept of a generic topology that expresses all possible signal paths in a reconfigurable topology. Topologies are optimized to minimize complexity, maximize the sharing of circuitry between modes, maximize robustness with respect to circuit nonidealities, and minimize total power consumption. The paper presents a case study on designing topologies for a three-mode reconfigurable ΔΣ modulator and compares the resulting topologies with state-of-the-art designs.
This paper presents novel double-sampling, high-order, single-loop sigma-delta modulator structures for wideband applications. To alleviate quantization-noise folding into the in-band frequency region, two previously reported techniques are used: the DAC sampling paths are implemented with the single-capacitor approach, and an additional zero is placed at half the sampling frequency in the modulator's noise transfer function (NTF). The detrimental effect of this additional zero on both the NTF and the signal transfer function (STF) is also resolved through the proposed modulator architectures with little additional circuitry.
This paper presents a four-stage CMOS distributed amplifier (DA) implemented in standard 0.18 μm CMOS technology. The proposed design eliminates the need for transmission-line capacitors and, consequently, uses significantly smaller spiral inductors than previous designs. Using minimum-size inductors extends the bandwidth of the amplifier and improves the quality factor of the on-chip inductors. The proposed DA occupies the smallest die area (0.3μm × 0.8μm) among DAs reported with the same performance. A unity-gain bandwidth of 10 GHz and a gain of 15 dB are measured. DC power dissipation is 56 mW.
This paper reports a high-speed, low-power direct-indirect bootstrapped full-swing CMOS inverter driver circuit (bfi-driver). Simulation results, based on 0.13 μm triple-well CMOS technology, show that, when operated at 1 V, the bfi-driver is 94% faster and consumes 22% less power than a counterpart direct bootstrap circuit [1].
The ever increasing usage of microprocessor devices is sustained by a high volume production that in turn requires a high production yield, backed by a controlled process. Fault diagnosis is an integral part of the industrial effort towards these goals. This paper presents a novel cost-effective approach to the construction of diagnostic software-based test sets for microprocessors. The methodology exploits an existing post-production test set, designed for software-based self-test, and an already developed infrastructure IP to perform the diagnosis. An initial diagnostic test set is built, and then iteratively refined resorting to an evolutionary method. Experimental results are reported in the paper showing the feasibility and effectiveness of the approach for an Intel i8051 processor core.
In this paper, we propose a timing-reasoning algorithm to improve the resolution of delay fault diagnosis. In contrast to previous approaches which identify candidates by utilizing only logic conditions, we propose a timing-simulation-based method to perform the candidate reasoning. Based on the circuit timing information, we identify invalid candidates which cannot maintain the consistency of failure behaviors. By eliminating those invalid candidates, the diagnosis resolution can be improved. We then analyze the problem of circuit timing uncertainty caused by the delay variation and the simulation model. We calculate a metric, named invalid-probability, for each candidate. Then we propose a candidate-ranking heuristic which is robust with respect to such sources of timing uncertainty. By ranking the candidates based on their invalid-probability, we can improve the candidate first-hit-rate of the traditional critical path tracing (CPT) technique. To demonstrate the efficiency of the proposed method, we have developed a timing diagnosis framework which can simulate the real diagnosis process to evaluate and compare different algorithms.
In this paper, we propose a new circuit transformation technique in conjunction with the use of a special diagnostic test pattern, named SO-SLAT pattern, to achieve higher multiple-fault diagnosis resolutions. For a given list of candidate faults, which could be stuck-at, transition, bridging, or other faults, we generate a set of SO-SLAT patterns, each of which attempts to activate only one fault in the list and propagate its effects to only one observation point. Observing the responses to SO-SLAT patterns helps more precisely identify fault candidates. The method can also tolerate most of the timing hazards for more accurate diagnosis of failures caused by timing faults. The experimental results demonstrate the effectiveness of the proposed method for diagnosing multiple faults.
Software-based self-test (SBST) of processors offers many benefits, such as dispensing with expensive test equipment, test execution during maintenance and in the field, and initialization tests for the whole system. In this paper, for the first time, a structural SBST methodology is proposed that simultaneously optimizes energy, average power consumption, test length and fault coverage. Keywords: Test program generation, processor test, low power test
Scan is the most widely used DFT technique in today's VLSI industry. Mux-DFF and Level Sensitive Scan Design (LSSD) are the most popular scan architectures. For Mux-DFF, when scan enable is set to "1", the scan chain is in shift mode; when scan enable is set to "0", it is in capture mode. For LSSD, two clocks are used to control the shift. When the scan enable or scan clock has defects, it is desirable to locate them at the logic level using algorithmic techniques to guide failure analysis. As with defects on other signals, faulty scan enable/clock signals may be caused by numerous types of defects, e.g., a shorted net, an open net, or incorrect timing with respect to the clock or scan data stream. The following example illustrates how to apply fault models to such defects. If a scan enable signal is shorted to VCC, only capturing is affected: scan cells capture data from the previous scan cell instead of from the system logic. We may use a stuck-at-1 fault model for this scenario. Clearly, the chain integrity test will pass, since these patterns involve no capture operation; the scan patterns will fail, and scan logic diagnosis is used in this scenario. In the rest of this paper, we do not discuss this category of scan enable defects.
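The stuck-at-1 reasoning above can be reproduced with a toy scan-chain model (purely illustrative, not code from the paper): a faulty scan enable makes every capture cycle behave like a shift, so the chain-integrity test passes while scan patterns fail.

```python
def clock_chain(cells, scan_in, se, system_vals, stuck_at_1=False):
    """One clock of a 3-cell scan chain; SE stuck-at-1 turns capture into shift."""
    se_eff = 1 if stuck_at_1 else se
    if se_eff:                        # shift mode: data ripples down the chain
        return [scan_in] + cells[:-1]
    return list(system_vals)          # capture mode: load from system logic

def run(pattern, captures, system_vals, stuck):
    cells = [0, 0, 0]
    for bit in pattern:               # shift-in phase (SE=1)
        cells = clock_chain(cells, bit, 1, system_vals, stuck)
    for _ in range(captures):         # capture phase (SE=0)
        cells = clock_chain(cells, 0, 0, system_vals, stuck)
    return cells

system = [1, 1, 0]
# Chain-integrity test (shift only): passes identically with or without the fault.
assert run([1, 0, 1], 0, system, False) == run([1, 0, 1], 0, system, True)
# Scan test with one capture cycle: the faulty chain shifts instead of capturing.
print(run([1, 0, 1], 1, system, False))   # good chain captures [1, 1, 0]
print(run([1, 0, 1], 1, system, True))    # faulty chain shifts in [0, 1, 0]
```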
We consider lock-free synchronization for dynamic embedded real-time systems that are subject to resource overloads and arbitrary activity arrivals. We model activity arrival behaviors using the unimodal arbitrary arrival model (or UAM). UAM embodies a stronger "adversary" than most traditional arrival models. We derive the upper bound on lock-free retries under the UAM with utility accrual scheduling - the first such result. We establish the tradeoffs between lock-free and lock-based sharing under UAM. These include conditions under which activities' accrued timeliness utility is greater under lock-free than lock-based, and the consequent upper bound on the increase in accrued utility that is possible with lock-free. We confirm our analytical results with a POSIX RTOS implementation.
Traffic shaping is a well-known technique in networking and has been proven to reduce global buffer requirements and end-to-end delays in networked systems. Due to these properties, shapers also play an increasingly important role in the design of multi-processor embedded systems that exhibit a considerable amount of on-chip traffic. Despite their growing importance in this area, no methods exist to analyze shapers in distributed embedded systems and to incorporate them into system-level performance analysis. Hence it has until now not been possible to determine the effect of shapers on end-to-end delay guarantees or buffer requirements in these systems. In this work, we present a method to analyze greedy shapers, and we embed this analysis method into a well-established modular performance analysis framework. The presented approach enables system-level performance analysis of complete systems with greedy shapers, and we prove its applicability by analyzing two case-study systems.
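For readers unfamiliar with greedy shapers, a minimal discrete-time sketch shows the component being analysed (a token-bucket shaper with hypothetical rate/burst parameters, not the paper's analysis method): it delays traffic just enough that the output conforms to a (rate, burst) curve, buffering the excess.

```python
def greedy_shaper(arrivals, rate, burst):
    """Discrete-time token-bucket shaper: one arrivals entry per time slot."""
    tokens, backlog, out = burst, 0.0, []
    for a in arrivals:
        tokens = min(burst, tokens + rate)   # refill, capped at the burst size
        backlog += a                         # buffer incoming traffic
        sent = min(backlog, tokens)          # emit as much as conformance allows
        tokens -= sent
        backlog -= sent
        out.append(sent)
    return out

# A burst of 5 at slot 0 is smoothed: at most burst=2 up front, rate=1 after.
print(greedy_shaper([5, 0, 0, 0, 0], rate=1, burst=2))
```

Nothing is lost (the total output equals the total input); the shaper only trades buffering for a smoother output, which is exactly the property that reduces downstream buffer requirements.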
In this paper, we present an extension to existing approaches that capture and exploit timing-correlation between tasks for scheduling analysis in distributed systems. Previous approaches consider a unique timing-reference for each set of time-correlated tasks and thus, do not capture the complete timing-correlation between task activations. Our approach is to consider multiple timing-references which allows us to capture more information about the timing-correlation between tasks. We also present an algorithm that exploits the captured information to calculate tighter bounds for the worst-case response time analysis under a static priority preemptive scheduler.
This paper presents an efficient method to find the optimal intra-task voltage/frequency scheduling for single tasks in practical real-time systems using statistical workload information. Our method is analytic in nature and proved to be optimal. Simulation results verify our theoretical analysis and show significant energy savings over previous methods. In addition, in contrast to the previous techniques in which all available frequencies are used in a schedule, we find that, by carefully selecting a subset of a small number of frequencies, one can still design a reasonably good schedule while avoiding unnecessary transition overheads.
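The flavour of intra-task scheduling with statistical workload information can be conveyed by a brute-force toy (the paper's method is analytic and optimal; the frequencies, deadline and cycle distribution below are invented): run the first cycles slowly and speed up only if the task turns out to be long, so that the worst case still meets the deadline while the *expected* energy is minimized.

```python
FREQS = [0.5, 0.75, 1.0]        # normalized frequencies (assumed energy ~ f^2/cycle)
WCET, DEADLINE = 100, 160       # worst-case cycles, deadline in time units
DIST = {60: 0.7, 100: 0.3}      # P(task actually needs x cycles)

def expected_energy(s, f1, f2):
    """Expected energy if the first s cycles run at f1 and the rest at f2."""
    e = 0.0
    for cycles, p in DIST.items():
        lo = min(cycles, s)
        e += p * (lo * f1**2 + (cycles - lo) * f2**2)
    return e

best = None
for f1 in FREQS:
    for f2 in FREQS:
        for s in range(WCET + 1):
            # the worst-case execution must still meet the deadline
            if s / f1 + (WCET - s) / f2 <= DEADLINE:
                e = expected_energy(s, f1, f2)
                if best is None or e < best[0]:
                    best = (e, f1, f2, s)
print(best)
```

In this toy instance the optimum runs the first 60 cycles (the likely case) at the lowest frequency and reserves full speed for the unlikely tail, echoing the paper's observation that a small set of frequencies can suffice.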
With the increasing complexity and heterogeneity of embedded electronic systems, a unified design methodology at higher levels of abstraction becomes a necessity. Meanwhile, it is also important to incorporate the current design practice emphasizing IP reuse at various abstraction levels. However, the abstraction gap prohibits easy communication and synchronization in IP integration and co-simulation. In this paper, we present a communication infrastructure for an integrated design framework that enables co-design and cosimulation of heterogeneous design components specified at different abstraction levels and in different languages. The core of the approach is to abstract different communication interfaces or protocols to a common high level communication semantics. Designers only need to specify the interfaces of the design components using extended regular expressions; communication adapters can then be automatically generated for the co-simulation or other co-design and co-verification purposes.
The increasing demand for high performance in embedded applications, combined with shrinking time-to-market, has recently prompted system architects to opt for Multi-Processor Systems-on-Chip (MP-SoCs) employing several programmable devices. The programmable cores provide a high degree of flexibility and reusability, and can be optimized to the requirements of the application to deliver high performance as well. Since application software forms the basis of such designs, the need to tune the underlying SoC architecture to extract maximum performance from the software has become imperative. In this paper, we propose a framework that enables software development, verification and evaluation from the very beginning of the MP-SoC design cycle. Unlike traditional SoC design flows, where software design starts only after the initial SoC architecture is ready, our framework allows co-development of the hardware and software components in a tightly coupled loop, where the hardware can be refined stepwise by considering the requirements of the software. The key element of this framework is the integration of a fine-grained software instrumentation tool into a System-Level Design (SLD) environment to obtain accurate software performance and memory access statistics. The accuracy of these statistics is comparable to that obtained through Instruction Set Simulation (ISS), while the execution speed of the instrumented software is almost an order of magnitude faster than ISS. Such a combined design approach helps system architects optimize both the hardware and the software through fast exploration cycles, and can result in far shorter design cycles and higher productivity. We demonstrate the generality and efficiency of our methodology with two case studies selected from two of the most prominent and computationally intensive embedded application domains.
New-generation Electronic System-Level design tools are the key to overcoming the complexity and the increasing design productivity gap in the development of future Multiprocessor Systems-on-Chip. This paper presents a SystemC-based system-level simulation environment, called CASSE, which helps in the modeling and analysis of complex SoCs. CASSE combines application modeling, architecture modeling, mapping and analysis within a unified environment, with the aim of easing and speeding up these modeling steps. The main contribution of this tool is to enable fast modeling and analysis at the very beginning of the design process, helping in the design space exploration phase. CASSE's capabilities are demonstrated in this work by means of a case study in which an MPEG-4 decoder application is implemented on an Altera Excalibur platform.
We propose a novel framework, called Virtual Processing Components (VPC), that permits the modeling and simulation of multiple processors running arbitrary scheduling strategies in SystemC. Modeling is performed at task-level granularity, which guarantees a small simulation overhead.
This paper summarizes the characteristics of distributed object models used in large-scale distributed software systems. We examine the common subset of requirements for distributed software systems and systems-on-a-chip (SoC), namely: openness, heterogeneity and multiple forms of transparency. We describe the application of these concepts to the emerging class of complex, parallel SoCs, including multiple heterogeneous embedded processors interacting with hardware co-processors and I/O devices. An implementation of this approach is embodied in STMicroelectronics' DSOC (Distributed System Object Component) programming model. The use of this programming model for an architecture exploration of ST's Nomadik mobile multimedia platform is described.
Most of the challenges related to the development of multi-processor platforms for complex wireless and multimedia applications fall into the Electronic System Level (ESL) domain. That is to say, design tasks like embedded SW development, architecture definition, or system verification have to be addressed before the silicon or even the RTL implementation becomes available. We believe that one of the major obstacles preventing the urgently required adoption and proliferation of ESL-based design is the lack of an efficient and intuitive methodology for modeling complex platforms. This extended abstract gives a rough overview of a modeling methodology we have developed on the basis of SystemC-based Transaction Level Modeling (TLM) in order to remedy this lack.
Scalable Networks on Chips (NoCs) are needed to match the ever-increasing communication demands of large-scale Multi-Processor Systems-on-Chip (MPSoCs) for high-end wireless communications applications. The heterogeneous nature of on-chip cores, and the energy-efficiency requirements typical of wireless communications, call for application-specific NoCs which eliminate much of the overhead associated with general-purpose communication architectures. However, application-specific NoCs must be supported by adequate design flows to reduce design time and effort. In this paper we survey the main challenges in application-specific NoC design, and we outline a complete NoC design flow and methodology. A case study on a high-complexity SoC demonstrates that it is indeed possible to generate an application-specific NoC from a high-level specification in a few hours. Comparison with a hand-tuned solution shows that the automatically generated one is very competitive in area, performance and power, while design time is reduced from days to hours.
Keywords: Systems on chip, networks on chip, application-specific integrated systems, design methodologies
We propose instruction-driven slicing, a new technique for annotating microprocessor descriptions at the Register Transfer Level (RTL) in order to achieve lower power dissipation. Our technique automatically annotates existing RTL code to optimize the circuit for lowering power dissipated by switching activity. Our technique can be applied at the architectural level as well, achieving similar power gains. We demonstrate our technique on architectural and RTL models of a 32-bit OpenRISC processor (OR1200), showing power gains for the SPEC2000 benchmarks.
The world of 3D graphics, until recently restricted to high-end workstations and game consoles, is rapidly expanding into the domain of mobile platforms such as cellular phones and PDAs. Even as the mobile chip market is poised to exceed production of 500 million chips per year, incorporation of 3D graphics in handhelds poses several serious challenges to the hardware designer. Compared with other platforms, graphics on handhelds have to contend with limited energy supplies and lower computing horsepower. Nevertheless, images must still be rendered at high quality since handheld screens are typically held closer to the observer's eye, making imperfections and approximations very noticeable. In this paper, we provide an in-depth quantitative analysis of the power consumption of mobile 3D graphics pipelines. We analyze the effects of various 3D graphics factors such as resolution, frame rate, level of detail, lighting and texture maps on power consumption. We demonstrate that significant imbalance exists across the workloads of different graphics pipeline stages. In addition, we illustrate how this imbalance may vary dynamically, depending on the characteristics of the graphics application. Based on this observation, we identify and compare the benefits of candidate Dynamic Voltage and Frequency Scaling (DVFS) schemes for mobile 3D graphics pipelines. In our experiments we observe that DVFS for mobile 3D graphics reduces energy by as much as 50%.
A significant volume of research has concentrated on operating-system directed power management (OSPM). The primary focus of previous research has been the development of OSPM policies. Under different conditions, one policy may outperform another and vice versa. In this paper, we explain how to select the best policies at run-time without user or administrator intervention. We present a hardware-neutral architecture portable across different platforms running Linux. Our experiments reveal that changing policies at run-time can adapt to workloads more quickly than using any of the policies individually.
Reducing energy consumption is an important issue in modern computers. Dynamic power management (DPM) has been extensively studied in recent years. One approach to DPM is to adjust workloads, for example by clustering or eliminating requests, as a way to trade off energy consumption against quality of service. Previous studies focus on single processes. However, when multiple concurrently running processes are considered, workload adjustment must be determined based on the interleaving of the processes' requests. When multiple processes share the same hardware component, adjusting one process may not save energy. This paper presents an approach that assigns energy responsibility to individual processes based on how they affect power management. The assignment is used to estimate the potential energy reduction from adjusting each process, and we use this estimate to guide runtime adaptation of workload behavior. Experiments demonstrate that our approach can save more energy and improve energy efficiency.
We present a dynamic bit-width adaptation scheme for DCT applications that efficiently trades off image quality against computation energy. Based on the sensitivity differences of the 64 DCT coefficients, different operand bit-widths are used for different frequency components to reduce computation energy in the DCT operation. Numerical results show that our DCT architecture can achieve power savings ranging from 36% to 75% compared to normal operation.
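The underlying operand-truncation idea can be sketched as follows (illustrative only; the bit-widths and the coefficient are example values, not the paper's sensitivity-derived assignment): coefficients the image is less sensitive to are computed with fewer fractional bits, which shortens the arithmetic at the cost of a bounded error.

```python
def truncate(x, bits, frac=8):
    """Keep only `bits` of `frac` fractional bits (drop the least significant)."""
    step = 1 << (frac - bits)
    q = int(round(x * (1 << frac)))      # fixed-point with `frac` fractional bits
    return (q // step) * step / (1 << frac)

coeff = 0.49039264020161522              # a cosine term from an 8-point DCT
for bits in (8, 6, 4):                   # narrower widths => cheaper arithmetic
    approx = truncate(coeff, bits)
    print(bits, approx, abs(approx - coeff))
```

Each dropped bit roughly doubles the worst-case representation error, which is why the paper assigns wide operands only to the low-frequency coefficients the eye is most sensitive to.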
Simultaneous switching noise due to inductance in VLSI packaging is a significant limitation to system performance. The inductive parasitics within IC packaging causes bounce on the power supply pins in addition to glitches and rise-time degradation on the signal pins. These factors bound the maximum performance of off-chip busses, which limits overall system performance. Until recently, the parasitic inductance problem was addressed by aggressive package design which attempts to decrease the total inductance in the package interconnect. In this work we present an encoding technique for off-chip data transmission to limit bounce on the supplies and reduce inductive signal coupling. This is accomplished by inserting intermediate (henceforth called "stutter") states in the data transmission to bound the maximum number of signals that switch simultaneously, thereby limiting the overall inductive noise. Bus stuttering is cheaper than expensive package design since it increases the bus performance without changing the package. We demonstrate that bus stuttering can bound the maximum amount of inductive noise, which results in increased bus performance even after accounting for the encoding overhead. Our results show that the performance of an encoded bus can be increased up to 225% over using un-encoded data. In addition, synthesis results of the encoder in a TSMC 0.13μm process show that the encoder size and delay are negligible in a modern VLSI design.
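A toy encoder conveys the stutter idea (illustrative; the paper's actual encoding may differ): whenever too many bus lines would switch at once, an intermediate word is inserted that flips only part of the differing bits, bounding simultaneous switching at the cost of extra cycles.

```python
def stutter_encode(words, width, max_switch):
    """Insert intermediate states so no transition flips more than max_switch lines."""
    out, prev = [], 0
    for w in words:
        while bin(prev ^ w).count("1") > max_switch:
            diff = prev ^ w
            flip, taken = 0, 0
            for i in range(width):        # flip only the first max_switch diffs
                if diff >> i & 1:
                    flip |= 1 << i
                    taken += 1
                    if taken == max_switch:
                        break
            prev ^= flip
            out.append(prev)              # emit the stutter state
        out.append(w)
        prev = w
    return out

# 0x00 -> 0xFF flips 8 lines; with max_switch=4 one stutter state is inserted.
print([hex(x) for x in stutter_encode([0x00, 0xFF], 8, 4)])
```

The encoded stream is longer, but because the bounded inductive noise permits a faster bus clock, net throughput can still increase, which is the paper's central trade-off.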
State of the art statistical timing analysis (STA) tools often yield less accurate results when timing variables become correlated. Spatial correlation and correlation caused by path reconvergence are among those which are most difficult to deal with. Existing methods treating these correlations will either suffer from high computational complexity or significant errors. In this paper, we present a new sensitivity pruning method which will significantly reduce the computational cost to consider path reconvergence correlation. We also develop an accurate and efficient model to deal with the spatial correlation.
This paper focuses on statistical interconnect timing analysis in a parameterized block-based statistical static timing analysis tool. In particular, a new framework for performing timing analysis of RLC networks with step inputs, under both Gaussian and non-Gaussian sources of variation, is presented. In this framework, the resistance, inductance, and capacitance of the RLC line are modeled in a canonical first-order form and used to produce the corresponding propagation delay and slew in the same canonical first-order form. To accomplish this, the mean, variance, and skewness of the delay and slew distributions are obtained in an efficient yet accurate manner. The proposed framework can be extended to consider higher-order terms of the various sources of variation. Experimental results show average errors of less than 2% for the mean, variance and skewness of interconnect delay and slew, while achieving orders-of-magnitude speedup with respect to a Monte Carlo simulation with 10^4 samples.
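The canonical first-order form used in such parameterized SSTA frameworks can be sketched as follows (a generic illustration, not the paper's implementation; the parameter names and sensitivities are invented): a timing quantity is a nominal value plus a weighted sum of unit-normal parameter deviations, and addition simply merges sensitivities, so correlation through shared parameters is preserved by construction.

```python
class Canonical:
    """Timing quantity d0 + sum(d_i * dX_i), with dX_i ~ N(0, 1)."""

    def __init__(self, nominal, sens):   # sens: {param_name: sensitivity}
        self.nominal, self.sens = nominal, dict(sens)

    def __add__(self, other):
        s = dict(self.sens)
        for p, v in other.sens.items():
            s[p] = s.get(p, 0.0) + v     # shared params stay correlated, by name
        return Canonical(self.nominal + other.nominal, s)

    def mean(self):
        return self.nominal

    def variance(self):                  # independent unit-normal parameters
        return sum(v * v for v in self.sens.values())

# Two wire segments sharing the metal-width parameter W are correlated in W:
seg1 = Canonical(10.0, {"W": 2.0, "R1": 1.0})
seg2 = Canonical(12.0, {"W": 1.0, "R2": 0.5})
total = seg1 + seg2
print(total.mean(), total.variance())   # 22.0 and 3^2 + 1 + 0.25 = 10.25
```

Addition is exact in this form; the hard part in real SSTA engines (and the subject of much of the literature) is approximating the max() of two canonical forms, which is omitted here.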
A cell delay model based on rate-of-current change is presented, which accounts for the impact of the shape of the noisy waveform on the output voltage waveform. More precisely, a pre-characterized table of time derivatives of the output current as a function of input voltage and output load values is constructed. The data in this table, in combination with the Taylor series expansion of the output current, is utilized to progressively compute the output current waveform, which is then integrated to produce the output voltage waveform. Experimental results show the effectiveness and efficiency of this new delay model.
Variability in metal interconnect structures can affect circuit timing or even cause functional failure in VLSI designs. This paper proposes a method to estimate the difference between the nominal and perturbed circuit waveforms by calculating moments in the frequency domain via an efficient iterative method. The algorithm can be used to accurately reproduce the differential waveforms, or to provide efficient early estimates of the timing impact of variability in RC networks.
Absolutely fail-safe operation in any critical situation, highest reliability in day-to-day operation and best-in-class convenience at a reasonable price: all drive innovation in automotive electronics. These goals result in car systems with ever-increasing complexity, challenging every single component, IC and line of code. As electronics' failure rates are perceived to grow, we introduce root cause analysis, key technologies and new measures that enable carmakers to keep pace. The goal is to introduce test and reliability challenges and respective solutions for automotive systems. Representatives of car companies and suppliers will explain their views and practical experiences.
The large variety of architectural dimensions in automotive electronics design, for example, bus protocols, number of nodes, sensors and actuators interconnections and power distribution topologies, makes architecture design task a very complex but crucial design step especially for OEMs. This situation motivates the need for a design environment that accommodates the integration of a variety of models in a manner that enables the exploration of design alternatives in an efficient and seamless fashion. Exploring these design alternatives in a virtual environment and evaluating them with respect to metrics such as cost, latency, flexibility and reliability provide an important competitive advantage to OEMs and help minimize integration risks later in the design cycle. In particular, the choice of the degree of decentralization of the architecture has become a crucial issue in automotive electronics. In this paper, we demonstrate how a rigorous methodology (Platform-Based Design) and the Metropolis framework can be used to find the balance between centralized and decentralized architectures.
Automakers are still facing increasing complexity in vehicle requirements with regard to their EE systems. This complexity is not only caused by innovations being provided for tomorrow's drivers, but also by system requirements regarding EE architecture cost, managing software updates and diagnostics concepts. One way to conquer this challenge is the use of standards in the field of basic technology. Ongoing activities such as AUTOSAR, FlexRay, HIS (Herstellerinitiative Software, an OEM software initiative) and others underline the car industry's contribution to creating and establishing these standards. Another way, more linked to OEMs' internal processes, is a deeper analysis of architecture work. Here, a profound description is essential at first; this tool-based description is the basis for a more detailed analysis. Two options should be focused upon: expert reviews, and metrics calculated automatically in the tool, such as cost, weight, or even more sophisticated metrics for feasibility. With this technique, iteration by iteration, the EE architecture reaches greater stability and will meet the functional and non-functional requirements far better.
Future automotive safety systems will benefit from visual scene analysis based on a fusion of video, infrared, and radar images. Today we already have functions such as lane departure warning and automatic cruise control (ACC) for fairly well-defined driving environments, such as highways and primary roads. Recent research activities concentrate on more complex environments, such as city traffic, with a wide variety of traffic participants moving in an unpredictable manner (e.g. bikes, pedestrians, children, and even animals) and under changing weather and lighting conditions.
To enhance efficiency and reliability in the design of distributed electronic control units with hard real-time constraints, new methods and computer-aided tools are required, especially to support early system design phases. Domain-specific tools are required to support design space exploration in the concept phase of electric/electronic systems. Design and verification based on heterogeneous models (closed-loop control systems, reactive systems and UML-based software-intensive systems), using a CASE-tool integration platform, will allow for a seamless design flow.
Automotive suppliers like Siemens VDO cover the full spectrum of automotive electronics. Our customers expect from such suppliers the capability to deliver cross-domain solutions and products. The deliveries should not only be correctly interconnected and integrated, but also simultaneously defined and developed, in order to ensure that at the global vehicle level they achieve the optimum of cost, quality, flexibility and scalability. SV has combined the expertise of its divisions into a corporate Vehicle System Group, in charge of developing cross-divisional expertise and solutions. The final aim of the division is to successfully assist the OEM during the pre-development phase of the architecture, by allocating system engineers who help in understanding the requirements in order to select the optimum combination from a product portfolio. Such an activity requires strong expertise in intersystem technologies, such as FlexRay and AUTOSAR, where SV is a leading contributor today. In order to master the complexity of an architecture definition phase, where many actors are involved and many hypotheses analysed, Siemens VDO has invested considerable effort in the definition of the processes and in the development of an architecture toolchain, SEDAM, which supports and accelerates this process. The conference talk will explain the vision of architecture design and assessment within Siemens VDO Automotive.
Automotive electronics has been introduced in multiple waves over time: powertrain, safety & vehicle dynamics, body & convenience, telematics. The future is already knocking at the door, and revolutionary systems are currently being developed: X-by-wire, e-safety, hybrid vehicles. The increasing requirements for fuel economy, safety, emission reduction, and onboard diagnosis push the automotive industry towards more innovative solutions with a rapid increase in complexity. The presentation will highlight the motivation to introduce high-performance electronics in the car. In the early days of electronics, ECUs (electronic control units) were seen as being the system; with the birth of networking, the complete car became the system to be controlled; today, with modern communication & services, the car is just a node in the traffic, which is now the system to be considered. Innovation in individual transportation is 90% enabled by electronics. The development of such systems presents three main challenges: dependable communication, dependable computation and dependable power. Modern high-end cars run more than 80 ECUs; the required communication bandwidth and message determinism drive the development of new buses such as FlexRay. The increasing power demand is pushing for a different voltage class. Cost pressure and time to market are forcing the automotive industry to re-invent processes and development cycles and to introduce standards. This panel discussion will identify the key elements of a powerful, scalable and configurable control solution that offers a migration path to the evolution, and even revolution, of automotive electronics.
The topic of platform-based system modeling has received a great deal of attention recently. One of the important tasks that significantly affects the effectiveness and efficiency of system modeling is the modeling of IP components and the communication between IPs. It is generally accepted that, to be effective, system modeling should be performed in two steps: in the first step, a fast but somewhat inaccurate system model is built to facilitate the simultaneous development of software and hardware; the second step then refines the models of the software and hardware blocks (i.e., IPs) to increase the simulation accuracy for system performance analysis. Here, one critical factor for successful system modeling is the systematic modeling of the IP blocks and of the bus subsystem connecting the IPs. In this respect, this work addresses the problem of systematically modeling the IPs and bus subsystem at different levels of refinement. In our experiments, we found that by applying the proposed IP and bus modeling methods to an MPEG-4 application, we achieve a 4x performance improvement and, at the same time, reduce the software development time by 35% compared to conventional modeling methods.
Enhancing productivity in the design of complex embedded systems requires system level design methodology and language support for capturing complex designs in high level models. For an effective methodology, efficient simulation and a sound refinement-based implementation path are also necessary. Although some recent system level design languages provide system level abstractions, several essential ingredients are missing from them. We consider (i) explicit support for multiple models of computation (MoCs), or heterogeneity; (ii) the ability to build complex behaviors by hierarchically composing simpler behaviors; and (iii) hierarchical composition of behaviors that belong to distinct models of computation, as essential for successful SLDLs. These give an SLDL a modeling fidelity that exploits both heterogeneity and hierarchy and allows for simpler modeling and efficient simulation. One important requirement for such an SLDL is that the simulation semantics be compositional, so that no flattening of hierarchically composed behaviors is needed for simulation. In this paper we show how we designed SystemC extensions that provide facilities for heterogeneous behavioral hierarchy and compositional simulation semantics, and implemented a simulation kernel which we show experimentally to be up to 50% more efficient than standard SystemC simulation.
Most hardware description languages do not enforce determinacy, meaning that they may yield races. Race conditions pose a problem for the implementation, verification, and validation of hardware. Enforcing determinacy at the modeling level provides a solution to this problem. In this paper, we consider a common model of computation for hardware modeling - a network of cycle-true finite-state machines with datapaths (FSMDs) - and we identify the conditions under which such models are guaranteed to be race-free. We base our analysis on the Kahn Principle and a formal framework to represent FSMD semantics. We present our conclusions as four simple and easy-to-enforce modeling rules. A hardware designer who applies these four modeling rules will thus obtain race-free hardware.
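To make the determinacy argument concrete, here is a small Python sketch (our own toy model, not code from the paper) of a cycle-true FSMD network evaluated with a two-phase update: every machine reads only the registered, previous-cycle values, and each signal has a single writer, so the result is independent of evaluation order.

```python
def step_network(fsmds, state):
    """Cycle-true evaluation with a two-phase update: every FSMD reads the
    registered (previous-cycle) values, so the result is independent of
    evaluation order -- the essence of a race-free, determinate model."""
    snapshot = dict(state)            # all reads see the old values
    updates = {}
    for name, f in fsmds.items():
        updates[name] = f(snapshot)   # each signal has exactly one writer
    state.update(updates)
    return state

# Two toy FSMDs: a counter and a follower that registers the counter's value.
fsmds = {"cnt": lambda s: s["cnt"] + 1,
         "copy": lambda s: s["cnt"]}
state = {"cnt": 0, "copy": 0}
step_network(fsmds, state)
assert state == {"cnt": 1, "copy": 0}   # copy saw the pre-update cnt
```

Because both machines read from the same snapshot, swapping their order in the dictionary cannot change the outcome, which is exactly the race-freedom the modeling rules aim to guarantee.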
Modeling systems based on semi-formal graphical formalisms, such as Statecharts, has become standard practice in the design of reactive embedded devices. However, the modeling of realistic applications often results in very large and unmanageable graphics, severely compromising their readability and practical use. To overcome this, we present a methodology to support the easy development and understanding of complex Statecharts. Central to our approach is the definition of a Statechart Normal Form (SNF), which provides a standardized layout that is compact and makes systematic use of secondary notations to aid readability. This concept is extended to dynamic Statecharts.
Partitioning is a time-consuming and computationally complex optimization problem in the codesign of hardware/software systems. Stringent time-to-market requirements have resulted in truncating this step, so that sub-optimal solutions are offered to consumers. To obtain the true global minimum, i.e. to find the best solution available, a new methodology is needed that can achieve this goal in minimal time. An approach is presented that forms a basis for design space exploration from a partitioning perspective using UML 2.0.
Design tools for application specific instruction set processors (ASIPs) are an important discipline in system-level design for wireless communications and other embedded application areas. Some ASIPs are still designed completely from scratch to meet extreme efficiency demands. However, there is also a trend towards the use of partially predefined, configurable RISC-like embedded processor cores that can be quickly tuned to given applications by means of instruction set extension (ISE) techniques. While the problem of optimized ISE synthesis has been studied well from a theoretical perspective, there are still few approaches to an overall HW/SW design flow for configurable cores that take all real-life constraints into account. In this paper, we therefore present a novel procedure for automated ISE synthesis that accommodates both user-specified and processor-specific constraints in a flexible way and that produces valid, optimized ISE solutions in a short time. Driven by an advanced application-code analysis/profiling frontend, the ISE synthesis core algorithm is embedded into a complete design flow, where the backend is formed by a state-of-the-art industrial tool for processor configuration, ISE HW synthesis, and SW tool retargeting. The proposed design flow, including ISE synthesis, is demonstrated via several benchmarks for the MIPS CorExtend configurable RISC processor platform.
Performance achievements on programmable architectures due to process technology are reaching their limits, since designs are becoming wire- and power-limited rather than device-limited. Likewise, traditional exploitation of instruction level parallelism saturates, as the conventional approach to designing wider-issue machines leads to very expensive interconnections, a big instruction memory footprint and high register file pressure. New architectural concepts targeted to the application domain of media processing are needed in order to push current state-of-the-art limitations. To this end, we regard media applications as a collection of tasks which consume and produce chunks of data. The exploitation of task level parallelism, as well as more traditional forms of parallelism, is a key issue for achieving the required amount of MOPS/Watt and MOPS/mm2 for media applications. Tasks comprise data transfers and number-crunching algorithm kernels, which are very computing-intensive yet highly predictable. Moreover, most of the data manipulated by a task is of a local nature. The granularity and characteristics of these tasks will lead us in this paper to draw conclusions about memory hierarchy, task scheduling strategies and efficient low-overhead programmable architectures for highly predictable kernel computations.
We describe the VLSI implementation of MIMO detectors that exhibit close-to-optimum error-rate performance but still achieve high throughput at low silicon area. In particular, algorithms and VLSI architectures for sphere decoding (SD) and K-best detection are considered, and the corresponding trade-offs between uncoded error-rate performance, silicon area, and throughput are explored. We show that SD with a per-block run-time constraint is best suited for practical implementations.
We have been "talking" about 4G systems emerging in 2010 for many years. However, to deploy these systems in 2010, we should already know with high confidence the 4G signal processing and SoC architectures for 4G handsets. It realistically takes 2 years to develop a power-efficient, cost-competitive system-on-a-chip (SoC) for a volume market. There are standards to be completed, field trials, and wide-scale acceptance before a system solution becomes viable. The entire cycle is at least 5 years. But, rather than giving up on 2010 as the year for 4G, we need to continue developing the right signal processing, network protocols, and SoC architectures given our knowledge of Moore's Law, emerging tool sets, and advanced receiver technology, which together facilitate rapid time-to-market of energy-efficient solutions. The market winners will quickly adapt to the emerging 4G ecosystem and will develop solutions before others. This talk provides some historical perspectives on architecture and system evolution, with the goal of providing an optimistic view that 4G is very near.
Cutting-edge applications of future embedded systems demand the highest processor performance with low power consumption to achieve acceptable battery lifetimes. Therefore, low-power optimization techniques are heavily applied during the development of modern Application Specific Instruction Set Processors (ASIPs). Electronic System Level design tools based on Architecture Description Languages (ADLs) offer a significant reduction in design time and effort by automatically generating the software tool-suite as well as the Register Transfer Level (RTL) description of the processor. In this paper, the automation of power optimization in ADL-based RTL generation is addressed. Operand isolation is a well-known power optimization technique applicable at all stages of processor development. With increasing design complexity, several efforts have been undertaken to automate operand isolation. In pipelined datapaths, where isolating signals are often implicitly available, the traditional RTL-based approach introduces unnecessary overhead. We propose an approach which extracts high-level structural information from the ADL representation and systematically uses the available control signals. Our experiments with state-of-the-art embedded processors show a significant power reduction (improvement in power efficiency).
This paper explores optimization techniques for the synchronization mechanisms of MPSoCs based on complex interconnects (Networks-on-Chip), targeted at future power-efficient systems. The proposed solution is based on the idea of locally performing synchronization operations which require continuous polling of a shared variable and thus feature large contention (e.g. spin locks). We introduce a HW module, the Synchronization-operation Buffer (SB), which queues and manages the requests issued by the processors. Experimental validation has been carried out using GRAPES, a cycle-accurate performance/power simulation platform. For an 8-processor target architecture, we show that the proposed solution achieves up to 40% performance improvement and 30% energy saving with respect to synchronization based on a directory-based coherence protocol.
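As a rough illustration of the SB idea (a behavioural sketch with hypothetical names, not the actual HW module), lock requests can be queued and granted in FIFO order instead of letting each core spin on a shared variable over the interconnect:

```python
from collections import deque

class SynchronizationBuffer:
    """Toy behavioural model of an SB: lock requests are queued locally
    instead of having each core repeatedly poll a shared variable,
    so no polling traffic crosses the NoC while the lock is held."""

    def __init__(self):
        self.holder = None        # core currently owning the lock
        self.waiting = deque()    # queued requesters, served FIFO

    def request(self, core_id):
        """Return True if the lock is granted immediately; otherwise the
        core is queued and will be woken on release (no spinning)."""
        if self.holder is None:
            self.holder = core_id
            return True
        self.waiting.append(core_id)
        return False

    def release(self):
        """Hand the lock to the next queued core, if any."""
        granted = self.waiting.popleft() if self.waiting else None
        self.holder = granted
        return granted

sb = SynchronizationBuffer()
assert sb.request(0) is True    # core 0 gets the lock
assert sb.request(1) is False   # core 1 is queued, generating no traffic
assert sb.release() == 1        # release wakes core 1 directly
```

The point of the sketch is the traffic pattern: a queued core issues one request message and receives one grant message, instead of a stream of polling reads on a contended variable.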
In this paper we present a state dependent analytical leakage power model for FPGAs. The model accounts for subthreshold leakage and gate leakage in FPGAs, since these are the two dominant components of total leakage power. The power model takes into account the dependency of gate and subthreshold leakage on the probability of the state of circuit inputs. The leakage power model has two main components, one which computes the probability of a state for a particular FPGA circuit element, and the other which computes the leakage of the FPGA circuit element for a given input using analytical equations. This FPGA power model is particularly important for rapidly analyzing various FPGA architectures across different technology nodes.
The modern era of embedded system design is geared towards the design of low-power systems. One way to reduce power in an ASIC implementation is to reduce the bit-width precision of its computation units. This paper describes algorithms to optimize the bit-widths of fixed-point variables for low power in a SystemC design environment. We propose an algorithm for optimal bit-width precision for two variables and a greedy heuristic which works for any number of variables. The algorithms are used to automate the conversion of floating-point SystemC programs into ASIC-synthesizable SystemC programs. Expected inputs are profiled to estimate the errors introduced by the finite-precision conversions. Experimental results on the trade-offs between quantization error, power consumption and hardware resources are reported for a set of four SystemC benchmarks mapped onto a 0.18 micron ASIC cell library from Artisan Components. We demonstrate that it is possible to reduce power consumption by 50% on average by allowing round-off errors to increase from 0.5% to 1%.
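A minimal sketch of what such a greedy bit-width heuristic might look like, assuming a toy datapath (a simple sum of the variables), profiled input samples, and a maximum-absolute-error bound; the names and error metric are illustrative, not the paper's:

```python
def quantize(x, bits):
    """Round x to the given number of fractional bits."""
    scale = 1 << bits
    return round(x * scale) / scale

def max_error(samples, widths):
    """Worst-case absolute error of a toy sum datapath over profiled
    samples, where each variable is quantized to its assigned width."""
    err = 0.0
    for s in samples:
        exact = sum(s.values())
        approx = sum(quantize(v, widths[k]) for k, v in s.items())
        err = max(err, abs(exact - approx))
    return err

def greedy_widths(samples, start_bits, err_bound):
    """Greedily shave one fractional bit at a time from whichever
    variable keeps the error smallest, while staying under the bound."""
    widths = {k: start_bits for k in samples[0]}
    while True:
        best = None
        for k in widths:
            if widths[k] == 0:
                continue
            trial = dict(widths)
            trial[k] -= 1
            e = max_error(samples, trial)
            if e <= err_bound and (best is None or e < best[1]):
                best = (k, e)
        if best is None:
            return widths
        widths[best[0]] -= 1
```

Each accepted reduction is re-checked against the profiled inputs, mirroring the abstract's use of profiling to estimate finite-precision errors.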
In this paper, we present a technique that exploits the statistical behavior of data values transmitted on global signal buses to determine an energy-efficient ordering of bits that minimizes the inter-wire coupling energy and also reduces total bus energy. Statistics are collected for instruction and data bus traffic from eight SPEC CPU2K benchmarks, and an optimization problem is formulated and solved optimally using a publicly available tool. Results obtained using the optimal bit order on large non-overlapping test samples from the same set of benchmarks show that, on average, adjacent inter-wire coupling energies are reduced by about 35.4% for instruction buses and by about 21.6% for data buses using the proposed technique.
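For intuition, the optimization can be mirrored by a tiny brute-force version (the paper formulates it as an optimization problem solved with a dedicated tool; this sketch and its names are ours): given a matrix of pairwise coupling activity collected from traffic statistics, search for the wire permutation that minimizes the summed activity of physically adjacent pairs.

```python
from itertools import permutations

def coupling_cost(order, C):
    """Sum of pairwise coupling activity between physically adjacent
    wires, for a given left-to-right bit ordering."""
    return sum(C[a][b] for a, b in zip(order, order[1:]))

def best_order(C):
    """Exhaustive search over orderings; only viable for small buses,
    but equivalent in spirit to the exact optimization in the paper."""
    n = len(C)
    return min(permutations(range(n)), key=lambda p: coupling_cost(p, C))

# Toy 4-bit bus: bits 0/1 and 2/3 toggle together heavily (cost 10),
# all other pairs are cheap (cost 1), so the optimum separates them.
C = [[0, 10, 1, 1],
     [10, 0, 1, 1],
     [1, 1, 0, 10],
     [1, 1, 10, 0]]
order = best_order(C)
assert coupling_cost(order, C) == 3   # e.g. 0-2-1-3 uses only cheap pairs
```

A real bus would use the exact solver mentioned in the abstract; the brute force is only here to make the objective function concrete.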
Continued scaling of CMOS technology has allowed the aggressive pursuit of increased clock rates in DSM chips. Ever shorter clock periods have brought the switching times of different inputs of a logic gate ever closer to each other. The traditional method of static timing analysis, which assumes single input switching, is no longer adequate to capture gate-level delays accurately. Gate delay models considering multiple input switching are needed for DSM chips. We propose a new method of systematically modeling gate delays using the high dimensional model representation (HDMR) method. The proposed method models gate delays with respect to the relative signal arrival times (RSATs) of their inputs. The systematic nature of the proposed algorithm allows gate delay characterization with more inputs switching close to each other. This paper shows, for the first time, gate delay models of up to 5 inputs. In addition, the proposed model is extended to take input signal slope and process variations into account for statistical static timing analysis. Our results show that the proposed HDMR model gives an error between 2.2% and 12.9% for a variety of static and dynamic logic gates compared to SPICE results, depending on the number of inputs involved in switching.
In this paper we present a method for the statistical analysis of nanoelectronic Boolean networks with respect to timing uncertainty and noise. All signals are considered to be non-stationary random processes, which is the most general signal representation. As one cannot deal with random processes per se, we focus on certain statistical properties which are propagated through networks of Boolean gates, yielding the non-stationary probability density function (pdf) of each signal in the network. Finally, several values of interest, such as the error probability, the average path delay or the average signal trace over time, can be extracted from these pdfs.
State machine based simulation of Boolean functions is substantially faster if the function being simulated is symmetric. Unfortunately, function symmetries are comparatively rare. Conjugate symmetries can be used to reduce the state space for functions that have no detectable symmetries, allowing the benefits of symmetry to be applied to a much wider class of functions. Substantial improvements in simulation speed, of 30-40%, have been realized using these techniques.
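The speedup for symmetric functions comes from the fact that a totally symmetric Boolean function depends only on the number of true inputs, so a simulator can track a single counter instead of the full input state. A minimal Python sketch (illustrative only, not the paper's simulator):

```python
class SymmetricSim:
    """Simulates a totally symmetric function f of n inputs.
    Since f depends only on the popcount of the inputs, the state
    space collapses from 2^n input vectors to n+1 counter values."""

    def __init__(self, n, value_by_count):
        self.n = n
        self.count = 0                  # number of inputs currently 1
        self.table = value_by_count     # f as a function of the popcount

    def flip(self, old_bit, new_bit):
        """Apply one input change in O(1) and return the new output."""
        self.count += int(new_bit) - int(old_bit)
        return self.table[self.count]

# 3-input majority: true when at least two inputs are 1.
maj = SymmetricSim(3, [0, 0, 1, 1])
assert maj.flip(0, 1) == 0   # one input high
assert maj.flip(0, 1) == 1   # two inputs high -> majority
```

Conjugate symmetries, as the abstract notes, extend this collapse to functions with no directly detectable symmetry by transforming the input space first.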
A new methodology is presented to assure numerically reliable integration of the magnetisation slope in the Jiles-Atherton model of ferromagnetic core hysteresis. Two HDL implementations of the technique are presented, one in SystemC and the other in VHDL-AMS. The new model uses timeless discretisation of the magnetisation slope equation and provides superior accuracy and numerical stability especially at the discontinuity points that occur in hysteresis. Numerical integration of the magnetisation slope is carried out by the model itself rather than by the underlying analogue solver. The robustness of the model is demonstrated by practical simulations of examples involving both major and minor hysteresis loops.
In this work a method to improve the loopback test used in RF analog circuits is described. The approach is targeted to the SoC environment, being able to reuse system resources in order to minimize the test overhead. An RF sampler is used to observe spectral characteristics of the RF signal path during loopback operation. While able to improve the observability of the signal path, the method also allows faster diagnosis than conventional loopback tests, as the number of transmitted symbols can be greatly reduced. Practical results for a prototyped RF link at 860MHz are presented in order to demonstrate the relevance of the method.
Chip overheating has become a critical problem during test of today's complex core-based systems. In this paper, we address the overheating problem in Network-on-Chip (NoC) systems through thermal optimization using variable-rate on-chip clocking. We control the core temperatures during test scheduling by assigning different test clock frequencies to cores. We present two heuristics to achieve thermal optimization and reduced test time. Experimental results for example NoC systems show that the proposed method can guarantee thermal safety and yield better thermal balance, compared to previous methods using power constraints. Test application time is also reduced.
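A deliberately simplified sketch of frequency-based thermal control during test, under a hypothetical linear steady-state temperature model (our assumption, not the paper's heuristics): each core gets the fastest test clock that keeps its estimated temperature under the limit, trading test time for thermal safety.

```python
def assign_test_clocks(cores, freqs, t_ambient, t_max):
    """cores: dict core -> thermal coefficient (degrees per MHz), under a
    toy model where steady-state temperature rise is proportional to the
    test clock frequency. freqs: available test clock frequencies (MHz).
    Pick the fastest clock that keeps each core below t_max; fall back
    to the slowest clock if no frequency is provably safe."""
    assignment = {}
    for core, k in cores.items():
        safe = [f for f in freqs if t_ambient + k * f <= t_max]
        assignment[core] = max(safe) if safe else min(freqs)
    return assignment

# A cool core gets the fastest clock, a hot core is throttled.
plan = assign_test_clocks({"a": 0.1, "b": 0.5}, [100, 200, 400], 40, 90)
assert plan == {"a": 400, "b": 100}
```

The paper's heuristics additionally schedule tests over time to balance temperatures; this sketch only shows the per-core frequency/temperature trade-off that such scheduling exploits.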
Digital- and analog-centric load boards have well-established board check methodologies as part of their "release to production requirements", while for RF load boards this is still an open research issue. Potential faults on an RF load board can be caused by mechanical/electrical defects of the components and sockets used on the board. Hence, we propose a novel methodology to accurately check/diagnose the RF paths using only reflection measurements with suitable terminations of these paths. These reflection measurements and derived "checker equations" are used to accurately diagnose the RF paths on the load board during production test at no extra test cost. A pilot test vehicle is used to demonstrate the practical implementation and production worthiness of the proposed board check and diagnosis methodology.
Pseudorandom test techniques are widely used for measuring the impulse response (IR) of linear devices and the Volterra kernels of nonlinear devices, especially in the acoustics domain. This paper studies the application of pseudorandom functional test techniques to linear and nonlinear MEMS Built-In Self-Test (BIST). We first present the classical pseudorandom BIST technique for Linear Time-Invariant (LTI) systems, which is based on evaluating the IR of the Device Under Test (DUT) stimulated by a Maximal Length Sequence (MLS). Then we introduce a new type of pseudorandom stimulus, the Inverse-Repeat Sequence (IRS), which provides better immunity to noise and distortion than MLS. Finally, we illustrate the application of these techniques to weakly nonlinear, purely nonlinear and strongly nonlinear devices.
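The classical MLS-based IR measurement can be sketched in a few lines: an MLS of length L has a circular autocorrelation of L at lag 0 and -1 elsewhere, so cross-correlating the DUT response with the stimulus recovers the impulse response up to a simple correction. A self-contained Python illustration (our own toy example, not the paper's MEMS setup):

```python
def mls(degree, taps):
    """Maximal-length sequence from a Fibonacci LFSR, bits mapped to the
    +/-1 stimulus levels used in pseudorandom IR measurement."""
    state = [1] * degree
    seq = []
    for _ in range((1 << degree) - 1):
        seq.append(1 if state[-1] else -1)
        fb = 0
        for t in taps:            # XOR of the tapped register bits
            fb ^= state[t - 1]
        state = [fb] + state[:-1]
    return seq

def recover_ir(y, s):
    """Circular cross-correlation of the DUT response y with the MLS s.
    Since phi[0] = L and phi[m != 0] = -1 for an MLS, the IR follows as
    h[k] = (r[k] + sum(r)) / (L + 1), where sum(r) equals sum(h)."""
    L = len(s)
    r = [sum(y[n] * s[(n - k) % L] for n in range(L)) for k in range(L)]
    S = sum(r)
    return [(rk + S) / (L + 1) for rk in r]

s = mls(3, (3, 2))                    # length-7 MLS stimulus
h = [0.5, 0.25, 0, 0, 0, 0, 0]        # toy impulse response of the "DUT"
y = [sum(h[j] * s[(n - j) % 7] for j in range(7)) for n in range(7)]
assert all(abs(a - b) < 1e-9 for a, b in zip(recover_ir(y, s), h))
```

The IRS variant mentioned in the abstract concatenates the MLS with its inverted copy, which cancels even-order distortion terms in the same correlation procedure.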
In this paper we present an overview of an on-chip noise detection circuit. This work differs from previous work on on-chip noise measurement in one or more of the following ways. First, it does not assume specific noise properties such as periodicity. Second, the bandwidth of the output channel can be adjusted freely, so the user can avoid the effect of on-chip parasitics and the need for sophisticated off-chip monitoring tools. Third, the detector is equipped with an on-chip voltage divider, which enables accurate measurement of high- and low-swing fluctuations. The detector is therefore suitable for measuring non-periodic/single-event noise for the purposes of reliability evaluation and performance modeling. A slower version of the detector has been implemented in a test chip using Hitachi 0.18 μm technology.
The advent of multi-core embedded processors has brought new challenges for embedded system design. This paper presents an efficient, battery-aware code partitioning technique for a text-to-speech system executed on a multi-core embedded processor. The system achieves significant improvements in both execution time and battery lifetime. The technique provides a new paradigm for battery-aware embedded system design which can easily be extended to other applications.
Using additional store-checkpoints (SCPs) and compare-checkpoints (CCPs), we present an adaptive checkpointing scheme for double modular redundancy (DMR). The proposed approach can dynamically adjust the checkpoint intervals. We also design methods to calculate the optimal number of checkpoints, which minimizes the average execution time of tasks. Further, the adaptive checkpointing is combined with dynamic voltage scaling (DVS) to achieve energy reduction. Simulation results show that, compared with previous methods, the proposed approach significantly increases the likelihood of timely task completion and reduces energy consumption in the presence of faults.
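To illustrate how an optimal number of checkpoints might be computed, here is a sketch under a simple first-order fault model (per-checkpoint cost c, fault rate lam, and an expected re-work of half an interval per fault); this model and its names are our assumptions, not the paper's analysis:

```python
def expected_time(T, n, c, lam):
    """Expected execution time of a task of length T with n equally
    spaced checkpoints: base time, plus checkpoint overhead, plus the
    expected rollback work (half an interval per fault on average)."""
    interval = T / (n + 1)
    return T + n * c + lam * T * (interval / 2)

def optimal_checkpoints(T, c, lam, n_max=1000):
    """Pick the checkpoint count minimizing the expected execution time.
    The continuous optimum of this model is n + 1 = T * sqrt(lam / (2c))."""
    return min(range(n_max + 1), key=lambda n: expected_time(T, n, c, lam))

# Task of length 100, checkpoint cost 1, fault rate 0.02 per time unit:
# the analytic optimum gives n + 1 = 100 * sqrt(0.01) = 10, i.e. n = 9.
assert optimal_checkpoints(100, 1, 0.02) == 9
```

The adaptive scheme in the paper goes further, adjusting intervals at run time as faults occur; the static optimum above is just the baseline such adaptation improves on.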
Modern applications for mobile devices, such as multimedia video/audio, often exhibit a common behavior: they process streams of incoming data in a regular, predictable way. The runtime behavior of these applications can be accurately estimated most of the time by analyzing the data to be processed and annotating the stream with the information collected. We introduce a software annotation based approach to power optimization and demonstrate its application on a backlight adjustment technique for LCD displays during multimedia playback, for improved battery life and user experience. Results from analysis and simulation show that up to 65% of backlight power can be saved through our technique, with minimal or no visible quality degradation.
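Backlight adjustment of the kind described above typically dims the backlight to a frame's annotated peak luminance and scales pixel values up to compensate, preserving perceived brightness. A minimal sketch under that assumption (hypothetical names and discrete backlight levels of our choosing; it assumes peak values in [0, 1] and a full-brightness level in the level set):

```python
def backlight_plan(frames, levels):
    """frames: per-frame peak pixel luminance in [0, 1], as carried by
    the stream annotation; levels: discrete backlight levels the LCD
    supports (must include 1.0). For each frame, dim the backlight to
    the smallest level covering the peak and scale pixels to compensate."""
    plan = []
    for peak in frames:
        level = min(l for l in levels if l >= peak)
        plan.append((level, 1.0 / level))   # (backlight level, pixel gain)
    return plan

# A dark frame lets the backlight drop to half power with doubled pixel
# values; a bright frame keeps the backlight at full power.
plan = backlight_plan([0.4, 0.9, 0.25], [0.25, 0.5, 0.75, 1.0])
assert plan[0] == (0.5, 2.0)
```

Because backlight power dominates LCD power, dimming by the annotated amount saves energy with no loss of displayed content, which is consistent with the abstract's claim of minimal visible degradation.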
Current trends indicate that multiprocessor-system-on-chip (MPSoC) architectures are increasingly used in building complex embedded systems. While circuit/architectural support for MPSoC-based systems is making significant strides, programming these devices and providing suitable software support (e.g., compilers and operating systems) seem to be tougher problems. This is because either programmers or compilers have to make code explicitly parallel to run on these systems. An additional difficulty occurs when multiple applications use an MPSoC at the same time, because MPSoC resources must be partitioned across these applications carefully. This paper explores a proactive resource partitioning scheme for parallel applications simultaneously exercising the same MPSoC system. The proposed approach has two major components. The first component is an offline preprocessing of applications which gives us an estimated profile for each application; each application to be executed on our MPSoC is profiled and annotated with the profile information. The second component is an online resource partitioner, which partitions both the processing cores (i.e., computation resources) and the on-chip memory space (i.e., storage resources) among simultaneously executing applications. Our experimental evaluation shows that this partitioner generates much better results than conventional operating-system-based resource management. The results also reveal that both memory partitioning and processor partitioning are very important for obtaining the best results.
This paper proposes a compiler-based leakage optimization strategy for on-chip scratch-pad memories (SPMs). The idea is to keep only a small set of SPM regions active at a given time and to pre-activate SPM regions based on the compiler-extracted data access pattern. Our strategy, called activity clustering, increases the length of the idle periods of SPM regions by clustering accesses to a small set of regions at a time. It thus allows an SPM to take better advantage of the underlying leakage optimization mechanism.
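A sketch of the pre-activation idea (our own simplification, assuming the compiler has already extracted a per-step region access trace and the wake-up latency of the leakage-saving state is known):

```python
def activation_schedule(access_trace, wakeup_latency=1):
    """access_trace: compiler-extracted sequence of SPM region ids, one
    per step. Each region sleeps in a low-leakage state whenever it is
    not needed, and is pre-activated wakeup_latency steps before its
    next access so that no wake-up stall is exposed to the program."""
    n = len(access_trace)
    active = [set() for _ in range(n)]
    for t, region in enumerate(access_trace):
        # Keep the region awake during its wake-up window and the access.
        for u in range(max(0, t - wakeup_latency), t + 1):
            active[u].add(region)
    return active

# Clustered accesses to region 0, a brief use of region 1: region 1 is
# awake for only two steps instead of all four.
sched = activation_schedule([0, 0, 1, 0], wakeup_latency=1)
assert sched == [{0}, {0, 1}, {0, 1}, {0}]
```

Clustering accesses, as the abstract describes, makes these per-region active windows short and contiguous, which is exactly what lengthens the idle periods the leakage mechanism exploits.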
Embedded multiprocessors pose new challenges in the design and implementation of embedded software. This has led to the need for programming interfaces that expose the capabilities of the underlying hardware. In addition, for systems that implement applications consisting of multiple concurrent threads of computation, optimized management of inter-thread communication is crucial for realizing high performance. This paper presents the design of an application-adaptive thread library that conforms to the IEEE POSIX 1003.1c threading standard (Pthreads). The library adapts the placement of both explicitly marked application data objects and implicitly created data objects in a physically distributed on-chip memory architecture, based on the application's data access characteristics.
Memory and communication architectures have a significant impact on the cost, performance, and time-to-market of complex multi-processor system-on-chip (MPSoC) designs. The memory architecture dictates most of the data traffic flow in a design, which in turn influences the design of the communication architecture. Thus, there is a need to co-synthesize the memory and communication architectures to avoid making sub-optimal design decisions. This is in contrast to traditional platform-based design approaches, where memory and communication architectures are synthesized separately. In this paper, we propose an automated application-specific co-synthesis methodology for memory and communication architectures (COSMECA) in MPSoC designs. The primary objective is to design a communication architecture with the least number of busses that satisfies performance and memory-area constraints, while the secondary objective is to reduce the memory area cost. Results of applying COSMECA to several industrial-strength MPSoC applications from the networking domain indicate savings of as much as 40% in the number of busses and 29% in memory area compared to the traditional approach.
In this paper we present an approach to the scheduling of fault-tolerant embedded systems for safety-critical applications. Processes and messages are statically scheduled, and we use process re-execution for recovering from multiple transient faults. If process recovery is performed such that the operation of other processes is not affected, we call it transparent recovery. Although transparent recovery has the advantages of fault containment, improved debuggability and less memory needed to store the fault-tolerant schedules, it introduces delays that can violate the timing constraints of the application. We propose a novel algorithm for the synthesis of fault-tolerant schedules that can handle the transparency/performance trade-offs imposed by the designer, and that makes use of fault-occurrence information to reduce the overhead due to fault tolerance. We model the application as a conditional process graph, where the fault-occurrence information is represented as conditional edges and the transparent recovery is captured using synchronization nodes.
Network-on-Chip (NoC)-based communication represents a promising solution to complex on-chip communication problems. Due to their regular structure, mesh-like NoC architectures have become very popular recently. However, they have poor topological properties such as long inter-node distances. In this paper, we address this very issue and explore the potential of partial NoC customization to improve both static and dynamic properties of the network significantly, while minimally affecting its regularity. Precise energy measurements on an FPGA prototype show that the improvement in network properties is achieved without a significant penalty in area and communication energy consumption.
This paper addresses communication optimisation for applications implemented on networks-on-chip. The mapping of data packets to network links and the timing of the release of the packets are critical for avoiding destination contention. This reduces the demand for communication buffers with obvious advantages in chip area and energy savings. We propose a buffer need analysis approach and a strategy for communication synthesis and packet release timing with minimum communication buffer demand that guarantees worst-case response times.
The allocation of device variables on I/O registers affects the code size and performance of an I/O device driver. This work seeks the allocation with the minimal software or hardware cost in a hardware/software codesign environment. The problems of exact minimization under constraints are formulated as zero-one integer linear programming problems. Heuristic algorithms based on iterative refinement are also proposed. The proposed design methodology was implemented in C language. Compared with industrial designs, the system can obtain design alternatives that reduce both software and hardware costs.
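For very small instances, the 0-1 formulation can be mirrored by exhaustive search, which makes the cost/capacity structure of the problem easy to see (this sketch and its cost model are illustrative; the paper uses zero-one ILP and iterative-refinement heuristics):

```python
from itertools import product

def allocate(vars_bits, reg_bits, cost):
    """Exhaustive equivalent of the 0-1 allocation problem for tiny
    instances: vars_bits[v] = width of device variable v in bits,
    reg_bits[r] = capacity of I/O register r, cost[v][r] = software
    cost of accessing v when placed in r. Returns the minimum-cost
    assignment that respects every register's capacity, or None."""
    best = None
    for assign in product(range(len(reg_bits)), repeat=len(vars_bits)):
        used = [0] * len(reg_bits)
        for v, r in enumerate(assign):
            used[r] += vars_bits[v]
        if any(u > cap for u, cap in zip(used, reg_bits)):
            continue                      # capacity constraint violated
        total = sum(cost[v][r] for v, r in enumerate(assign))
        if best is None or total < best[0]:
            best = (total, assign)
    return best

# Two 4-bit variables both prefer the cheap register 0, but it only
# holds one of them, so the optimum splits them across registers.
best = allocate([4, 4], [4, 8], [[1, 5], [1, 5]])
assert best == (6, (0, 1))
```

An ILP solver explores the same 0-1 assignment space implicitly; the brute force above is only meant to make the objective and constraints concrete.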
This session addresses the inter-dependency of physical and system level design and the economic issues of 4G implementations.
Status: supply and demand in the mobile operator market seem almost decoupled. New technologies and ever-increasing bandwidths compete for the attention of CTOs. On higher network layers, IMS and VoIP are key technologies shaking up the mobile value chain. New contenders - WiMax and WiFi, jointly with independent VoIP-based operators - threaten the value proposition of mobile altogether. Mobile technology is still evolving and the rate of innovation is high. The marketing department, on the other hand, focuses on value creation, large bundles, simplified terminals and format competition, thereby slowly coming to terms with the likely reduction of ARPU. Marketing has to deal with all the features of a mature market. Some experiments with mobile broadband propositions are being launched, for example mobile TV, music download and browsing, but none of them has yet delivered significant ARPU contributions. Finally, mobile communications are mainly peer-to-peer communications, i.e. too much player differentiation implies a limited ability to communicate, an effect that has significantly hampered, for example, the success of MMS. The impact of all this on mobile terminals, specifically, is that the diversity of requirements in terms of supported radio and coding standards, applications, speed and power efficiency increases dramatically. Operators will carry a high cost burden against a background of not-yet-amortized investment in licenses and network equipment.
Dynamic variations in application functionality and performance requirements can lead to the imposition of widely disparate requirements on System-on-Chip (SoC) platform hardware over time. This has led to interest in the design and use of adaptive SoC platforms that are capable of providing high performance in the face of such variations. Recent advances in circuits and architectures are enabling platforms that contain various mechanisms for runtime adaptation. However, the problem of exploiting such configurability in a coordinated manner at the system level remains a challenging task. In this work, we focus on two configurable subsystems of SoC platforms that play a crucial role in determining overall system performance, namely, the on-chip communication architecture and the on-chip memory architecture. Using detailed case studies, we demonstrate the limitations of designs in which the architectural configuration of a bus-based communication architecture and the placement of data in memory are statically optimized, and those in which each is customized separately, without considering their interdependence. We propose an integrated methodology for dynamically relocating on-chip data and reconfiguring the communication architecture, and discuss the necessary hardware support. Experiments conducted on an SoC platform that integrates decoders for the UMTS (3G) and IEEE 802.11a (Wireless LAN) standards demonstrate that the proposed integrated adaptation technique helps boost the maximum achievable performance by up to 32% over the best statically optimized design.
We present a modular and scalable approach for automatically extracting actual performance information from a set of FPGA-based architecture topologies. This information is used dynamically during simulation to support performance analysis in a System Level Design environment. The topologies capture systems representing common designs using FPGA technologies of interest. Their characterization is done only once; the results are then used during simulation of actual systems being explored by the designer. Our approach allows a rich set of FPGA architectures to be explored accurately at various abstraction levels to seek optimized solutions with minimal effort by the designer. To offer an industrial example of our results, we describe the characterization process for Xilinx CoreConnect-based platforms and the integration of this data into the Metropolis modeling environment.
Network applications are becoming increasingly popular in the embedded systems domain, requiring high performance, which leads to high energy consumption. In networks, it is observed that, due to their inherent dynamic nature, the dynamic memory subsystem is a main contributor to overall energy consumption and performance. This paper presents a new systematic methodology for generating performance-energy trade-offs by implementing Dynamic Data Types (DDTs), targeting network applications. The proposed methodology consists of: (i) application-level DDT exploration, (ii) network-level DDT exploration and (iii) Pareto-level DDT exploration. The methodology, supported by an automated tool, offers the designer a set of optimal dynamic data type design solutions. The effectiveness of the proposed methodology is tested on four representative real-life case studies. By applying the second step, it is shown that energy savings of up to 80% and performance improvements of up to 22% (compared to the original implementations of the benchmarks) can be achieved. By applying the third step, additional energy and performance gains can be achieved and a wide range of trade-offs among our Pareto-optimal design choices is obtained: we achieved up to 93% reduction in energy consumption and up to 48% increase in performance.
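The Pareto-level step above filters candidate DDT implementations down to the non-dominated ones. A minimal sketch of such a filter over (energy, delay) points, where lower is better on both axes (the function name and point encoding are illustrative assumptions):

```python
def pareto_front(points):
    """Return the non-dominated (energy, delay) points. A point is
    dominated if some other point is no worse on both objectives and
    strictly better on at least one."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return front
```

The quadratic scan is fine for the handful of DDT configurations per data structure; larger spaces would use a sort-and-sweep instead.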
In this paper we propose Application Specific Instruction Set Processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application, and the parallel code associated with the processor. Each pipeline in such a processor is customized, and implements its own special instruction set so that the instructions can be executed in parallel with low hardware overhead. Our simulations and experiments with a group of benchmarks, largely from the MiBench suite, show that on average, 77% performance improvement can be achieved compared to a single pipeline ASIP, with overheads of 49% on area, 51% on leakage power, 17% on switching activity, and 69% on code size.
Bit-width specification will affect the total memory storage requirement of a video processing system. However, what is not so obvious is that the bit-width specification will also affect the design of the memory hierarchy. Experiments with a real-life surveillance system show how the optimal allocation of shift registers for the storage of intermediate results is sensitive to bit-widths. It is shown that the total on-chip memory storage requirement can be reduced by 61 percent compared to a non-optimal design.
This paper describes an efficient method to perform factorization of DSP transforms based on Taylor Expansion Diagram (TED). It is shown that TED can efficiently represent and manipulate mathematical expressions. We demonstrate that it enables efficient factorization of arithmetic expressions of DSP transforms, resulting in a simplification of the computation.
The clock distribution network is a key component of any synchronous VLSI design. As technology moves into the nanometer era, innovative clocking techniques are required to solve the power dissipation and variability issues. Rotary clocking is a novel technique which employs unterminated rings formed by differential transmission lines to save power and reduce skew variability. Despite its appealing advantages, rotary clocking requires latch locations to match pre-designed clock skew on rotary clock rings. This requirement is a difficult chicken-and-egg problem which prevents its wide application. In this work, we propose an integrated placement and skew scheduling methodology to break this hurdle, making rotary clocking compatible with practical design flows. A network flow based latch assignment algorithm and a cost-driven skew optimization algorithm are developed. Experiments show that our method can generate chip placements which satisfy the unique requirements of rotary clocks without sacrificing design quality. By enabling concurrent clock network and placement design, our method can be applied to other clocking methodologies as well.
In clock network synthesis, skew constraints are sometimes required only within certain groups of clock sinks and do not exist between different groups. This is the so-called associative skew clock routing problem. Although the number of constraints is reduced, the problem becomes more difficult to solve due to the enlarged solution space. Perhaps the only previous work used a very primitive delay model and could not handle difficult instances in which sink groups are intermingled. We reuse existing techniques to solve this problem, including the difficult instances, based on a more accurate and popular delay model. Experimental results show that our algorithm can reduce the total clock routing wirelength by 12% on average compared to greedy-DME, one of the best zero-skew routing algorithms.
In current very deep submicron (VDSM) circuits, incremental routing is crucial to incorporating engineering change orders (ECOs) late in the design cycle. In this paper, we address the important incremental routing objective of satisfying timing constraints in high-speed designs while minimizing wirelength, vias and routing layers. We develop an effective timing-driven (TD) incremental routing algorithm, TIDE, for ASIC circuits that addresses the dual goals of time-efficiency and slack satisfaction, coupled with effective optimizations. There are three main novelties in our approach: (i) a technique for locally determining slack satisfaction of the entire routing tree when either a new pin is added to the tree or an interconnect in it is re-routed; this technique is used in both the global and detailed routing phases; (ii) an interval-intersection and tree-truncation algorithm, used in global routing, for quickly determining a near-minimum-length slack-satisfying interconnection of a pin to a partial routing tree; (iii) a depth-first-search process, used in detailed routing, that allows new nets to bump and re-route existing nets in a controlled manner in order to obtain better optimized designs. Experimental results show that within the constraint of routing all nets in only two metal layers, TIDE succeeds in routing more than 94% of ECO-generated nets, and that its failure rate is 7 and 6.7 times less than that of the TD versions of the previous incremental routers Standard (Std) and Ripup&Reroute (R&R), respectively. It is also able to route nets with very few (3.4%) slack violations, while the other two methods have appreciable slack violations (16-19%). TIDE is about 2 times slower than the simple TD-Std method, but more than 3 times faster than TD-R&R.
Quantum dot Cellular Automata (QCA) is one of the promising technologies for nano-scale implementation. The operation of QCA systems is based on a new paradigm generally referred to as processing-by-wire (PBW). This paper analyzes the defect tolerance properties of PBW when tiles are employed using molecular QCA cells. Based on a 3x3 QCA block, with different input/output arrangements, different tiles are analyzed and simulated using a coherence vector engine. The functional characterization and polarization level of these tiles for undeposited cell defects are reported. It is shown that novel features of PBW are possible due to spatial redundancy, and that QCA tiles are robust and inherently defect tolerant.
Index words: QCA, defect tolerance, emerging technologies.
Negative Bias Temperature Instability (NBTI) has become one of the major causes of temporal reliability degradation in nanoscale circuits. In this paper, we analyze the temporal delay degradation of logic circuits due to NBTI. We show that, knowing the threshold voltage degradation of a single transistor due to NBTI, one can predict the performance degradation of a circuit with a reasonable degree of accuracy. We also propose a sizing algorithm that takes NBTI-induced performance degradation into account to ensure the reliability of nanoscale circuits for a given period of time. Experimental results on several benchmark circuits show that, with an average of 8.7% increase in area, one can ensure reliable performance of circuits for 10 years.
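To make the threshold-to-delay link concrete: a common first-order way to translate an NBTI-induced threshold shift into gate delay degradation is the alpha-power law, delay ∝ Vdd / (Vdd − Vth)^α. This is a generic textbook model, not necessarily the one used in the paper, and the parameter values below are illustrative assumptions:

```python
def relative_delay_increase(vdd, vth0, dvth, alpha=1.3):
    """Alpha-power-law estimate: gate delay scales as
    Vdd / (Vdd - Vth)**alpha, so an NBTI-induced threshold shift dvth
    raises delay by the returned fraction (0.1 means +10% delay).
    vdd, vth0, dvth in volts; alpha is the velocity-saturation index."""
    d0 = vdd / (vdd - vth0) ** alpha
    d1 = vdd / (vdd - vth0 - dvth) ** alpha
    return d1 / d0 - 1.0
```

For example, a 50 mV shift on a 0.3 V threshold at Vdd = 1.0 V gives roughly a 10% delay increase under these assumed parameters, which is the kind of per-transistor figure the circuit-level prediction would consume.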
In this paper, different circuit arrangements of Quantum-dot Cellular Automata (QCA) are proposed for the so-called coplanar crossing. These arrangements exploit the majority voting properties of QCA to allow a robust crossing of wires on the Cartesian plane. This is accomplished using enlarged lines and voting. Using a Bayesian Network (BN) based simulator, new results are provided to evaluate the robustness of these arrangements to the so-called kink phenomenon under thermal variations. The BN simulator provides fast and reliable computation of the signal polarization versus normalized temperature. It is shown that, by modifying the layout, a higher polarization level can be achieved in the routed signal by utilizing the proposed QCA arrangements.
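The majority voting primitive that these crossing arrangements exploit is the basic QCA logic function M(a,b,c) = ab + bc + ca; fixing one input yields AND and OR. A minimal truth-level sketch (the function names are illustrative):

```python
def majority(a, b, c):
    """Three-input majority vote, the native QCA logic primitive:
    M(a, b, c) = ab + bc + ca (inputs and output are 0/1)."""
    return (a & b) | (b & c) | (a & c)

def qca_and(a, b):
    # AND is majority with one input tied to logic 0
    return majority(a, b, 0)

def qca_or(a, b):
    # OR is majority with one input tied to logic 1
    return majority(a, b, 1)
```

Replicating a signal through several majority voters is also how voting-based redundancy restores a degraded polarization, which is the intuition behind the robust crossing layouts.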
As devices are scaled to the nanoscale regime, it is clear that future nanodevices will be plagued by higher soft error rates and reduced noise margins. Traditional implementations of error correcting codes (ECC) can add to the reliability of systems but can be ineffective in highly noisy operating conditions. This paper proposes an implementation of ECC based on the theory of Markov random fields (MRF). The MRF probabilistic model is mapped onto CMOS circuitry, using feedback between transistors to reinforce the correct joint probability of valid logical states. We show that our MRF approach provides superior noise immunity for memory systems that operate under highly noisy conditions.
This paper presents the design of a Time-Triggered Ethernet (TTE) Switch, which is one of the core units of the Time-Triggered Ethernet system. Time-Triggered Ethernet is a communication architecture intended to support event-triggered and time-triggered traffic in a single communication system. The TTE Switch distinguishes between two classes of traffic. The Event-Triggered (ET) traffic is handled in conformance with the existing Ethernet standard, while the Time-Triggered (TT) traffic is transmitted with temporal guarantees. A TTE Switch is used in the Time-Triggered Ethernet system for exchanging time-triggered messages in a time-predictable way while continuing to support standard Ethernet traffic, in order to use existing networking protocols such as IP, UDP or IPX without any modifications. In this paper we present the mechanisms the TTE Switch uses to guarantee a constant transmission delay for time-triggered traffic. An experimental validation of these mechanisms is also given.
This paper presents a Java processor, called JOP, designed for time-predictable execution of real-time tasks. JOP is the implementation of the Java virtual machine in hardware. We propose a processor architecture that favors low worst-case execution time (WCET) over average case performance. The resulting processor is an easy target for the low-level WCET analysis.
The object-oriented paradigm has become popular over recent years due to characteristics that help manage the complexity of computer systems design. This has also attracted the embedded systems community, as today's embedded systems must cope with several complex functionalities as well as timing, power, and area restrictions. This scenario has promoted the use of the Java language and its real-time extension (RTSJ) for embedded real-time systems design. Nevertheless, the RTSJ was not primarily designed for the embedded domain. This paper presents an approach to optimize the use of the RTSJ for the development of embedded real-time systems. Firstly, it describes how to design real-time embedded applications using an API based on the RTSJ. Secondly, it shows how the generated code is optimized to cope with the tight resources available, without interfering with the mandatory timing predictability of the generated system. Finally, it discusses an approach to synthesize the applications on top of affordable FPGAs. The approach used to synthesize the embedded real-time system ensures bounded timing behavior of the object-oriented aspects of the application, such as the polymorphism mechanism and read/write access to objects' data fields.
The best currently available solvers for Quantified Boolean Formulas (QBFs) process their input in prenex form, i.e., all the quantifiers have to appear in the prefix of the formula, separated from the purely propositional part representing the matrix. However, in many QBFs deriving from applications, the propositional part is intertwined with the quantifier structure. The standard approach is to first convert such formulas into prenex form, thereby losing structural information about the prefix. In this paper we show that conversion to prenex form is not necessary: it is relatively easy to extend current search-based solvers to exploit the original quantifier structure, i.e., to handle non-prenex QBFs. Further, we show that the conversion can lead to the exploration of search spaces bigger than the space explored by solvers handling non-prenex QBFs. To validate our claims, we implemented our ideas in the state-of-the-art search-based solver QUBE and conducted an extensive experimental analysis. The results show that very substantial speedups can be obtained.
We present a new approach to conflict analysis for propositional satisfiability solvers based on the DPLL procedure and clause recording. When conditions warrant it, we generate a supplemental clause from a conflict. This clause does not contain a unique implication point, and therefore cannot replace the standard conflict clause. However, it is very effective at reducing excessive depth in the implication graphs and at preventing repeated conflicts on the same clause. Experimental results show consistent improvements over state-of-the-art solvers and confirm our analysis of why the new technique works.
This paper addresses the problem of equivalence verification of RTL descriptions that implement arithmetic computations (add, mult, shift) over bit-vectors that have differing bit-widths. Such designs are found in many DSP applications where the widths of input and output bit-vectors are dictated by the desired precision. A bit-vector of size n can represent integer values from 0 to 2^n - 1, i.e. integers reduced modulo 2^n. Therefore, to verify bit-vector arithmetic over multiple word-length operands, we model the RTL datapath as a polynomial function from Z_{2^{n_1}} x Z_{2^{n_2}} x ... x Z_{2^{n_d}} to Z_{2^m}. Subsequently, RTL equivalence f ≡ g is solved by proving whether (f - g) ≡ 0 over such mappings. Exploiting concepts from number theory and commutative algebra, a systematic, complete algorithmic procedure is derived for this purpose. Experimentally, we demonstrate how this approach can be applied within a practical CAD setting. Using our approach, we verify a set of arithmetic datapaths at RTL where contemporary approaches prove to be infeasible.
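The modular-arithmetic view can be illustrated with a brute-force checker (feasible only for tiny bit-widths; the paper's contribution is an algebraic procedure that avoids this enumeration). Note the classic subtlety it exposes: a nonzero polynomial such as x² + x vanishes everywhere modulo 2, so syntactically different datapaths can be equivalent at a given output width:

```python
from itertools import product

def equivalent_mod(f, g, widths, m):
    """Check f ≡ g as functions Z_{2^n1} x ... x Z_{2^nd} -> Z_{2^m}
    by testing that (f - g) vanishes modulo 2^m on every input tuple.
    widths[i] is the bit-width of the i-th operand, m the output width."""
    ranges = [range(2 ** n) for n in widths]
    return all((f(*xs) - g(*xs)) % (2 ** m) == 0 for xs in product(*ranges))
```

Here x² + x = x(x + 1) is always even, hence equal to the zero function for a 1-bit output, but not for a 2-bit output (x = 1 gives 2 mod 4).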
The existence of non-uniform thermal gradients on the substrate in high-performance ICs can significantly impact the performance of global on-chip interconnects. This issue is further exacerbated by aggressive scaling and other factors such as dynamic power management schemes and non-uniform gate-level switching activity. In high-performance systems, one of the most important problems is clock skew minimization, since it has a direct impact on the maximum operating frequency of the system. Since clocks are routed across the entire chip, the presence of thermal gradients can significantly alter their characteristics because wire resistance increases linearly as the temperature increases. This often results in failure to meet original timing constraints, thereby rendering the original topology unusable. Therefore, it is necessary to perform a temperature-aware re-embedding of the original topology to meet timing under these temperature effects. This work primarily explores these issues by proposing two algorithms that re-structure an existing clock tree topology to compensate for such temperature effects and, as a result, also meet timing constraints.
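The underlying effect is easy to quantify with a first-order model: wire resistance grows linearly with temperature, and the Elmore delay of a clock branch grows with it. A minimal sketch (the linear temperature coefficient and the simple RC ladder are illustrative assumptions, not the paper's model):

```python
def elmore_delay(res, caps):
    """Elmore delay of an RC ladder: sum over segments of
    R_i times the total capacitance downstream of segment i."""
    delay, downstream = 0.0, sum(caps)
    for r, c in zip(res, caps):
        delay += r * downstream
        downstream -= c  # this segment's cap is no longer downstream
    return delay

def wire_resistance(r0, beta, t, t0=25.0):
    """Linear temperature dependence: R(T) = R0 * (1 + beta * (T - T0))."""
    return r0 * (1.0 + beta * (t - t0))
```

Two nominally skew-balanced branches routed through regions at different temperatures thus acquire different delays, which is exactly the skew the re-embedding algorithms compensate for.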
The power density inside high performance systems continues to rise with every process technology generation, thereby increasing the operating temperature and creating "hot spots" on the die. As a result, the performance, reliability and power consumption of the system degrade. To avoid these "hot spots", "temperature-aware" design has become a must. For low-power embedded systems, though, it is not clear whether similar thermal problems occur. These systems have very different characteristics from the high performance ones: they consume a hundred times less power, they are based on a multi-processor architecture with lots of embedded memory and they rely on cheap packaging solutions. In this paper, we investigate the need for temperature-aware design in low-power systems-on-a-chip and provide guidelines to delimit the conditions under which temperature-aware design is needed.
Ever-increasing integrated circuit (IC) power densities and peak temperatures threaten reliability, performance, and economical cooling. To address these challenges, thermal analysis must be embedded within IC synthesis. However, detailed thermal analysis requires accurate three-dimensional chip-package heat flow analysis. This has typically been based on numerical methods that are too computationally intensive for numerous repeated applications during synthesis or design. Thermal analysis techniques must be both accurate and fast for use in IC synthesis. This article presents a novel, accurate, incremental, self-adaptive, chip-package thermal analysis technique, called ISAC, for use in IC synthesis and design. It is common for IC temperature variation to strongly depend on position and time. ISAC dynamically adapts spatial and temporal modeling granularity to achieve high efficiency while maintaining accuracy. Both steady-state and dynamic thermal analysis are accelerated by the proposed heterogeneous spatial resolution adaptation and temporally decoupled element time marching techniques. Each technique enables orders of magnitude improvement in performance while preserving accuracy when compared with other state-of-the-art adaptive steady-state and dynamic IC thermal analysis techniques. Experimental results indicate that these improvements are sufficient to make accurate dynamic and static thermal analysis practical within the inner loops of IC synthesis algorithms. ISAC has been validated against reliable commercial thermal analysis tools using industrial and academic synthesis test cases and chip designs. It has been implemented as a software package suitable for integration in IC synthesis and design flows and has been publicly released.
As technology scales, increasing clock rates, decreasing interconnect pitch, and the introduction of low-k dielectrics have made self-heating of the global interconnects an important issue in VLSI design. In this paper, we study the self-heating of on-chip buses and show that the thermal impact due to self-heating of on-chip buses increases as technology scales, thus motivating the need to find solutions that mitigate this effect. Based on the theoretical analysis, we propose an irredundant bus encoding scheme for on-chip buses to tackle the thermal issue. Simulation results show that our encoding scheme is very efficient at reducing the on-chip bus temperature rise over the substrate temperature, with much less overhead compared to other low-power encoding schemes.
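Self-heating in a bus wire grows with its switching activity, so the quantity a thermal-aware encoding tries to reduce can be measured as the toggle count between successive bus words. A minimal sketch of that metric (illustrative only; the paper's actual encoding scheme is not reproduced here):

```python
def toggle_count(words, width):
    """Count bit transitions between successive words on a `width`-bit
    bus. Joule self-heating per wire grows with the number of times
    that wire toggles, so encodings are compared on this count."""
    mask = (1 << width) - 1
    toggles = 0
    for prev, cur in zip(words, words[1:]):
        # XOR exposes the bits that changed; count the set bits
        toggles += bin((prev ^ cur) & mask).count("1")
    return toggles
```

An encoder would map the word stream to a new stream with a lower toggle count (and, for thermal purposes, a lower per-wire maximum) while staying decodable.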
This paper presents a novel design methodology for ultra-low power design (in bulk and double-gate SOI technology) using subthreshold leakage as the operating current (suitable for medium frequencies of operation: tens to hundreds of MHz). It is shown that a complete co-design at all levels of the hierarchy (device, circuit and architecture) is necessary to reduce the overall power consumption. Simulation results of co-design on a five-tap FIR filter show ~2.5x (for bulk) and ~3.8x (for SOI) improvement in throughput at iso-power compared to a conventional design. It is further demonstrated that the double-gate SOI technology is better suited for subthreshold operation.
Clock power consumes a significant fraction of total power dissipation in high speed precharge/evaluate logic styles. In this paper, we present a novel low-cost design methodology for reducing clock power in the active mode for dynamic circuits with fine-grained clock gating. The proposed technique also improves switching power by preventing redundant computations. A logic synthesis approach for domino/skewed logic styles based on Shannon expansion is proposed that dynamically identifies idle parts of the logic and applies clock gating to them to reduce power in the active mode of operation. Results on a set of MCNC benchmark circuits in a predictive 70nm process exhibit improvements of 15% to 64% in total power with minimal overhead in terms of delay and area compared to conventionally synthesized domino/skewed logic.
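The Shannon expansion underlying this synthesis approach splits a function f on one variable x into two cofactors, f = x·f_x + x'·f_x'; at any moment only the cofactor selected by x does useful work, so the other one can be clock-gated. A minimal functional sketch (function names are illustrative):

```python
def shannon_cofactors(f, var_index):
    """Shannon expansion of a boolean function f: return (f1, f0),
    the cofactors obtained by fixing input `var_index` to 1 and 0.
    Then f(...) == f1(rest) when that input is 1, f0(rest) when it is 0,
    so only the selected cofactor block needs to be clocked."""
    f1 = lambda *xs: f(*xs[:var_index], 1, *xs[var_index:])
    f0 = lambda *xs: f(*xs[:var_index], 0, *xs[var_index:])
    return f1, f0
```

In the synthesized circuit the two cofactors become separate domino blocks, and the expansion variable drives the clock-gating condition of the idle block.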
Functional unit shutdown based on MTCMOS devices is effective for leakage reduction in aggressively scaled technologies. However, the applicability of MTCMOS-based shutdown in a synthesis-based design flow poses the challenge of interfacing logic blocks in shutdown mode with active units: The outputs of inactive gates can float at intermediate voltages, causing very large short-circuit currents in the active gates they drive. In this paper, we propose two novel low-overhead elementary cells that fully address this issue. These cells can be added to any synthesis library, and they can be inserted into a netlist at the boundary between shutdown and active regions. Our results show that: (i) Our cells solve the interfacing problem with minimum overhead; (ii) A nonintrusive design flow enhancement is sufficient to automatically insert interface cells in post-synthesis netlists.
New applications in embedded systems are becoming increasingly dynamic. In addition to increased dynamism, they have massive data storage needs. Therefore, they rely heavily on dynamic, run-time memory allocation. The design and configuration of a dynamic memory allocation subsystem requires a big design effort, without always achieving the desired results. In this paper, we propose a fully automated exploration of dynamic memory allocation configurations. These configurations are fine tuned to the specific needs of applications with the use of a number of parameters. We assess the effectiveness of the proposed approach in two representative real-life case studies of the multimedia and wireless network domains and show up to 76% decrease in memory accesses and 66% decrease in memory footprint within the Pareto-optimal trade-off space.
In this work we take a control-theoretic approach to feedback-based dynamic voltage scaling (DVS) in Multi Processor System on Chip (MPSoC) pipelined architectures. We present and discuss a novel feedback approach based on both linear and non-linear techniques aimed at controlling inter-processor queue occupancy. Theoretical analysis and experiments, carried out on a cycle-accurate multiprocessor simulation platform, show that feedback-based control reduces energy consumption with respect to standard local DVS policies and highlight that non-linear strategies allow a more flexible and robust implementation in the presence of variable workload conditions.
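The linear flavor of such queue-occupancy feedback can be sketched as a proportional controller: run the producer faster when its output queue drains below a target level and slower when it fills. This is an illustrative sketch, not the paper's actual control law, and all names and gains are assumptions:

```python
def next_frequency(f_cur, queue_level, target, k_p, f_min, f_max):
    """One step of proportional DVS feedback on an inter-processor
    queue: raise frequency when the consumer's input queue exceeds the
    target occupancy, lower it when the queue drains, clamped to the
    platform's frequency range."""
    f_new = f_cur + k_p * (queue_level - target)
    return max(f_min, min(f_max, f_new))
```

A non-linear variant would replace the proportional term with, for example, a saturating or gain-scheduled function of the occupancy error, which is where the robustness to workload variation comes from.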
3D circuits have the potential to improve performance over traditional 2D circuits by reducing wirelength and interconnect delay. One major problem with 3D circuits is that their higher device density due to reduced footprint area leads to greater temperatures. Thermal vias are a potential solution to this problem. This paper presents a thermal via insertion algorithm that can be used to plan thermal via locations during floorplanning. The thermal via insertion algorithm relies on a new thermal analyzer based on random walk techniques. Experimental results show that, in many cases, considering thermal vias during floorplanning stages can significantly reduce the temperature of a 3D circuit.
This paper proposes a yield optimization method for standard cells under timing constraints. Yield-aware logic synthesis and physical optimization require yield-enhanced standard cells, and the proposed method automatically creates yield-enhanced cell layouts by de-compacting the original cell layout. However, careless modification of the original layout may severely degrade its performance. Therefore, the proposed method de-compacts the original layout under given timing constraints using linear programming (LP). We develop a new, accurate linear delay model which approximates the difference from the original delay and use this model to formulate the timing constraints in the LP. Experimental results show that the proposed method can pick yield variants of a cell layout from the trade-off curve of cell delay versus critical area and can be used to create the yield-enhanced cell library which is essential to realize yield-aware VLSI design flows.
Process variations due to lens aberrations are to a large extent systematic, and can be modeled for purposes of analyses and optimizations in the design phase. Traditionally, variations induced by lens aberrations have been considered random due to their small extent. However, as process margins reduce, and as improvements in reticle enhancement techniques control variations due to other sources with increased efficacy, lens aberration-induced variations gain importance. For example, our experiments indicate that lens aberration can result in up to 8% variation in cell delay. In this paper, we propose an aberration-aware timing-driven analytical placement approach that accounts for aberration-induced variations during placement. Our approach minimizes the design's cycle time and prevents hold-time violations under systematic aberration-induced variations. On average, the proposed placement technique reduces cycle time by ~5% at the cost of ~2% increase in wirelength.
The impact of test conditions on the detectability of open defects is investigated. We performed an inductive fault analysis on representative standard gates. The simulation results show that open-like defects result in a wide range of different voltage-delay dependencies, ranging from a strongly increasing to a strongly decreasing delay as a function of voltage. The behaviour is not only determined by the defect location but also by the test pattern. Knowing the expected behaviour of a certain defect location helps failure localisation. The detectability of a defect is strongly determined by the behaviour of the affected path as well as that of the longest path. Our simulations and measurements show that in general elevated supply voltages give a better detectability of open-like defects.
In this work we present an analytical formulation to quickly and accurately estimate the impact of crosstalk-induced delay in submicron CMOS IC gates, taking time skew into account. Crosstalk delay is computed from the additional charge injected from the aggressor gate onto the victim gate during simultaneous switching. The model shows very good agreement with HSPICE simulations for a 0.18μm technology.
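The charge-injection idea can be illustrated with the simplest charge-sharing estimate for a quiet victim: the coupling capacitance divides the aggressor's swing against the victim's capacitance to ground. This first-order bound is a generic textbook approximation; the paper's model additionally handles simultaneous switching and the aggressor/victim time skew:

```python
def victim_glitch(vdd, c_coupling, c_ground):
    """Peak voltage disturbance on a quiet victim line when the
    aggressor swings by vdd: capacitive divider between the coupling
    capacitance Cc and the victim's grounded capacitance Cg,
    dV = vdd * Cc / (Cc + Cg)."""
    return vdd * c_coupling / (c_coupling + c_ground)
```

When the victim is itself switching, the same injected charge shows up as added or subtracted delay rather than a static glitch, which is the case the analytical formulation targets.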
Generation of n -detection test sets is typically done for a single fault model. In this work we investigate the generation of n -detection test sets by pairing each fault of a target fault model with n faults of a different fault model. Tests are generated such that they detect both faults of a pair. To facilitate test generation, we ensure that the faults included in a single pair have overlapping requirements for their detection. The advantage of this approach is that it ensures the detection of additional faults that would not be targeted during n -detection test generation for a single fault model. Experimental results with transition faults as the first fault model and four-way bridging faults as the second fault model are presented.
Defect density and defect size distributions (DDSDs) are key parameters used in IC yield loss predictions. Traditionally, memories and specialized test structures have been used to estimate these distributions. In this paper, we propose a strategy to accurately estimate DDSDs for shorts in metal layers using production IC test results.
Sophisticated C compiler support for network processors (NPUs) is required to improve their usability and consequently, their acceptance in system design. Nonetheless, high-level code compilation always introduces overhead, regarding code size and performance compared to handwritten assembly code. This overhead results partially from high-level function calls that usually introduce memory accesses in order to save and reload register contents. A key feature of many NPU architectures is hardware multi-threading support, in the form of separate register files, for fast context switching between different application tasks. In this paper, a new NPU code optimization technique to use such HW contexts is presented that minimizes the overhead for saving and reloading register contents for function calls via the runtime stack. The feasibility and the performance gain of this technique are demonstrated for the Infineon Technologies PP32 NPU architecture and typical network application kernels.
Scratch-Pad memory (SPM) allocators that exploit the presence of affine references to arrays are important for scientific benchmarks. On the other hand, such allocators have so far been limited in their general applicability. In this paper we propose an integrated scheme that for the first time combines the specialized solution for affine program allocation with a general framework for other code. We find that our integrated framework matches or outperforms other allocators for a variety of SPM sizes.
There exist many embedded applications such as those executing on set-top boxes, wireless base stations, HDTV, and mobile handsets that are structured as nested loops and benefit significantly from a software managed memory. Prior work on scratchpad memories (SPMs) focused primarily on applications with regular data access patterns. Unfortunately, some embedded applications do not fit in this category and consequently conventional SPM management schemes will fail to produce the best results for them. In this work, we propose a novel compilation strategy for data SPMs for embedded applications that exhibit irregular data access patterns. Our scheme divides the task of optimization between compiler and runtime. The compiler processes each loop nest and inserts code to collect information at runtime. Then, the code is modified in such a fashion that, depending on the collected information, it dynamically chooses to use or not to use the data SPM for a given set of accesses to irregular arrays. Our results indicate that this approach is very successful with applications that have irregular access patterns and improves their execution cycles by about 54% over a state-of-the-art SPM management technique and 23% over conventional cache memories. Also, the additional code size overhead incurred by our approach is less than 5% for all the applications tested.
In many computer systems with large data computations, the delay of memory access is one of the major performance bottlenecks. In this paper, we propose an enhanced field remapping scheme for dynamically allocated structures in order to provide better locality than conventional field layouts. Our proposed scheme reduces cache miss rates drastically by aggregating and grouping fields from multiple instances of the same structure, which implies performance improvement and power reduction. Our methodology will become more important in design space exploration, especially as embedded systems for data-oriented applications become prevalent. Experimental results show that average L1 and L2 data cache misses are reduced by 23% and 17%, respectively. Due to the enhanced locality, our remapping achieves 13% faster execution time on average than the original programs. It also reduces power consumption by 18% for the data cache.
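As a rough illustration of the kind of layout change such field remapping implies, consider the byte-offset computation below. This is a hypothetical sketch (the offset functions, field names and sizes are ours, not the paper's allocator): instead of laying each instance's fields out contiguously, all copies of one field across instances are aggregated, so a traversal touching a single field stays within a small number of cache lines.

```python
def aos_offset(i, field, field_sizes, field_order):
    """Byte offset of record i's `field` in a conventional
    array-of-structs layout: records are contiguous."""
    rec = sum(field_sizes[f] for f in field_order)
    before = field_order[:field_order.index(field)]
    return i * rec + sum(field_sizes[f] for f in before)

def remapped_offset(i, field, n, field_sizes, field_order):
    """Byte offset after field remapping over n instances:
    all copies of one field are grouped contiguously."""
    base = 0
    for f in field_order:
        if f == field:
            return base + i * field_sizes[f]
        base += n * field_sizes[f]
    raise KeyError(field)
```

With fields `a` (4 bytes) and `b` (8 bytes), consecutive accesses to `a` are 12 bytes apart in the conventional layout but only 4 bytes apart after remapping, which is the source of the locality gain.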
Traditionally, active power has been the primary source of power dissipation in CMOS designs. Although leakage power is becoming increasingly important as technology feature sizes continue to shrink, traditional power optimization techniques often neglect its contribution to total system power. In this paper, we present a power-aware compilation methodology that targets an embedded processor with both dynamic voltage scaling (DVS) and adaptive body biasing (ABB) capabilities. Our technique has the unique advantage of optimizing design power by jointly optimizing dynamic and leakage power dissipation. Considering the delay and energy penalty of switching between processor modes, our compiler generates code with minimum power consumption under deadline constraints. Compared to not performing any optimization, or using DVS alone, our technique improves the power consumption of a number of embedded application kernels by 26% and 14%, respectively.
In this paper we propose a dynamic code overlay technique for synchronous data-flow (SDF)-modeled programs on low-end embedded systems that lack MMU support. With this technique, the system can utilize expensive SRAM more efficiently by using flash memory as code storage. SRAM is divided into several regions called overlay slots. A data-flow block or a cluster of data-flow blocks is loaded into the corresponding overlay slot on demand at run-time. Which blocks are clustered together and which overlay slots are allocated to the clusters are statically decided by the clustering and placement algorithm. We also propose an automatic code generation framework that generates the C program code, dynamic loader and linker script files from the given SDF-modeled blocks and schematic, so we can run or simulate the program immediately without any additional coding effort. Experiments show that we can reduce the SRAM size significantly with a reasonable amount of time overhead for several real applications.
The Network-on-Chip (NoC) approach to system design has been introduced to overcome the communication and performance bottlenecks of bus-based system design. Area is at a premium in FPGAs. In this research, we propose to reduce network area overhead by reducing the number of routers, making each router handle multiple logic cores. We implement an improved multi-local port router design with a variable number of local ports. In addition to substantial area savings, we observe significant performance improvement. We discuss the issues involved in the use of multi-local port routers for NoC design in FPGAs. We observe an average of 36% area savings (maximum of 47.5%) on an XC2VP30 FPGA and significant performance gain (30% on average compared to the single-local port version) with a multi-local port router. Mapping of cores onto such a non-traditional NoC architecture is a complex task. We present an algorithm which optimally maps the cores based on the given set of objectives. For the given task graph and set of constraints, the algorithm finds the optimal number of routers, the configuration of each router, the optimal mesh topology and the final mapping. We test the algorithm on a wide variety of benchmarks and report the results.
Eigenvalue computation is essential in many fields of science and engineering. For high performance and real-time applications, this may need to be done in hardware. This paper focuses on the exploration of hardware architectures which compute eigenvalues of symmetric matrices. We propose to use the Approximate Jacobi Method for the general symmetric-matrix eigenvalue problem. The paper illustrates that the proposed architecture is more efficient than previous architectures reported in the literature. Moreover, for the special case of 3x3 symmetric matrices, we propose to use an Algebraic Method. It is shown that the pipelined architecture based on the Algebraic Method has a significant advantage in terms of area.
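For reference, the classical Jacobi iteration that such architectures parallelize can be sketched in a few lines. This version uses the exact rotation angle, whereas the paper's Approximate Jacobi Method replaces it with hardware-friendly approximate rotations; the sweep count and tolerance below are illustrative choices of ours.

```python
import math

def jacobi_eigenvalues(A, sweeps=10):
    """Classical cyclic Jacobi method for a symmetric matrix
    (list of lists). Each rotation zeroes one off-diagonal
    entry; the diagonal converges to the eigenvalues."""
    n = len(A)
    A = [row[:] for row in A]           # work on a copy
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-12:
                    continue
                # exact angle that annihilates A[p][q]
                theta = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # rotate rows p and q
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
                for k in range(n):      # rotate columns p and q
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
    return sorted(A[i][i] for i in range(n))
```

Because each step is an orthogonal similarity transform, the trace (sum of eigenvalues) is preserved throughout, which makes the method easy to verify.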
Concurrent programs are difficult to write, reason about, re-use, and maintain. In particular, for system-level descriptions that use a shared memory abstraction for thread or process synchronization, the current practice involves manual scheduling of processes, introduction of guard conditions, and clocking tricks, to enforce memory dependencies. This process is tedious, time consuming, and error-prone. At the same time, the need for a concurrent programming model is becoming ever more essential to bridge the productivity gap that is widening with every manufacturing process generation. In this paper, we present two novel techniques to automatically enforce memory dependencies in platform FPGAs using on-chip memories, starting from a system-level description. Both techniques utilize static analysis to generate circuits for enforcing these dependencies. We investigate these two techniques with respect to their generality, implementation overhead, and usefulness for different application requirements.
Multitasking on reconfigurable logic can achieve very high silicon reusability. However, configuration latency is a major limitation and can largely degrade system performance. One reason is that tasks can run in parallel but configurations of the tasks can be done only in sequence. This work presents a novel configuration model to enable configuration parallelism. It consists of multiple homogeneous tiles, and each tile has its own configuration SRAM that can be individually accessed. Thus multiple configuration controllers can load tasks in parallel and greater speedups can be achieved. We used a prefetch scheduling technique to evaluate the model with randomly generated tasks. The experimental results reveal that, on average, using multiple controllers reduces the configuration overhead by 21%. Compared to the best cases of using multiple tiles with a single controller, an additional 40% speedup can be achieved using multiple controllers.
Wireless sensor networks are a hot topic worldwide, and significant progress has been achieved in the past few years. However, we are only beginning to discover their real potential, and there are still major challenges that need to be solved. In this presentation, an overview of the biggest challenges in wireless sensor networks is given, and some of the solutions are highlighted. Then, some applications of sensor network technologies are presented which go beyond traditional sensor network applications.
The WiseNET system includes an ultra low-power system-on-chip (SoC) hardware platform and WiseMAC, a low power medium access control protocol (MAC) dedicated to duty-cycled radios. Both elements have been designed to meet the specific requirements of wireless sensor networks and are particularly well suited to ad-hoc and hybrid networks. The WiseNET radio offers dual-band operation (434 MHz and 868 MHz) and runs from a single 1.5-V battery. It consumes only 2.5 mW in receive mode, with a sensitivity better than -108 dBm at a BER of 10^-3 and a 100-kb/s data rate. In addition to this low-power radio, the WiseNET system-on-chip (SoC) also includes all the functions required for data acquisition, processing and storage of the information provided by the sensors. Ultra-low power consumption with the WiseNET system is achieved thanks to the combination of the low power consumption of the transceiver and the high energy efficiency of the WiseMAC protocol. The WiseNET solution consumes more than 250 times less power than comparable solutions based on the IEEE 802.15.4 standard.
The BTnode platform is a versatile and flexible platform for functional prototyping of ad hoc and sensor networks. Based on an Atmel microcontroller, a Bluetooth radio and a low-power ISM band radio, it offers ample resources to implement and test a broad range of algorithms and applications ranging from pure technology studies to complete application demonstrators. Accompanying the hardware is a suite of system software, application examples and tutorials as well as support for debugging, test, deployment and validation of wireless sensor network applications. We discuss aspects of system design, development and deployment based on our experience with real wireless sensor network experiments. We further discuss our approach of a deployment-support network that tries to close the gap between current proof-of-concept experiments and sustainable real-world sensor network solutions.
In this paper, we propose a general Circuit-aware Device Design methodology, which can improve the overall circuit design by taking advantage of individual circuit characteristics during the device design phase. The proposed methodology analytically derives the optimal device in terms of the pre-specified circuit quality factor. We applied the proposed methodology to SRAM design and achieved significant reductions in standby leakage and access time (11% and 7%, respectively, for a conventional 6T SRAM). Also, we observed that the optimal devices selected depend considerably on the applied circuit techniques. We believe that the proposed Circuit-aware Device Design methodology will be useful in sub-90nm technologies, where different leakage components (subthreshold, gate, and junction tunneling) are comparable in magnitude. Also, in this work, we have presented a design automation framework for SRAM, which is conventionally custom designed and optimized.
In this paper, an approximate closed-form equation for the total power consumption of circuits working at their optimal supply and threshold voltages is presented. Comparison of this formula with numerical calculation shows an error of less than 3% on a set of thirteen 16-bit multipliers. Starting from this equation, the influence of architecture transformations (including pipelining, parallelization and sequentialization) on the optimal total power is discussed. Finally, by a similar approach, the impact of the technology choice on achievable power savings is considered, showing how a moderate tradeoff between leakage and speed is the key characteristic of a good low-power technology.
Aggressive CMOS scaling results in low threshold voltage and thin oxide thickness for transistors manufactured in very deep submicron regime. As a result, reducing the subthreshold and gate-tunneling leakage currents has become one of the most important criteria in the design of VLSI circuits. This paper presents a method based on dual-Vt and dual-Tox assignment to reduce the total leakage power dissipation of SRAMs while maintaining their performance. The proposed method is based on the observation that the read and write delays of a memory cell in an SRAM block depend on the physical distance of the cell from the sense amplifier and the decoder. Thus, the idea is to deploy different types of six-transistor SRAM cells corresponding to different threshold voltage and oxide thickness assignments for the transistors. Unlike other techniques for low-leakage SRAM design, the proposed technique incurs neither area nor delay overhead. In addition, it results in a minor change in the SRAM design flow. Simulation results with a 65nm process demonstrate that this technique can reduce the total leakage power dissipation of a 64Kb SRAM by more than 50%.
Modern microprocessors feature wide datapaths to support large on-chip memory and to enable computation on large-magnitude operands. With device scaling and rising clock frequencies, energy consumption and power density have become critical concerns, especially in datapath circuits. Datapaths are typically designed to optimize delay for worst-case operands. However, such operands rarely occur; the most frequently occurring input operand words (comprising long strings or subwords of 0's and 1's) present two major opportunities for energy optimization: (1) avoiding unnecessary computation involving such "special" input operand subword values and (2) exploiting timing slack in circuits (designed to accommodate worst-case inputs) arising due to such values. Previous techniques have exploited only one or the other of these factors, but not both simultaneously. Our new technique, dynamic multi-VDD, which is capable of dynamically switching between supply voltages in hardware submodules, simultaneously exploits both factors. Using the computation bypass framework and multiple supply voltages, we estimate data-dependent slack based on submodules that will be bypassed and exploit this slack by operating active submodules at a lower supply voltage. Our analysis of SPEC CPU2K benchmarks shows energy savings of up to 55% (and 46.53% on average) in functional units with minimal performance overheads.
Transaction level modeling (TLM) is becoming a common practice for simplifying system-level design and architecture exploration. It allows designers to focus on the functionality of the design, while abstracting away implementation details that will be added at lower abstraction levels. However, moving from transaction level to RTL requires redefining TLM testbenches and assertions. Such a wasteful and error-prone conversion can be avoided by adopting transactor-based verification (TBV). Many recent works adopt this strategy to propose verification methodologies that allow (1) mixing TLM and RTL components, and (2) reusing TLM assertions and testbenches at RTL. Even if the practical advantages of such an approach are evident, there are no papers in the literature that evaluate the effectiveness of TBV compared to a more traditional RTL verification strategy. This paper is intended to fill this gap. It theoretically compares the quality of TBV against rewriting assertions and testbenches at RTL, with respect to both fault coverage and assertion coverage.
Transaction level models promise to be the basis of the verification environment for the whole design process. Realizing this promise requires connecting transaction level and RTL blocks through an object called a transactor, which translates back and forth between RTL signal-based communication, and transaction level function-call based communication. Each transactor is associated with a pair of interfaces, one at RTL and one at transaction level. Typically, however, a pair of interfaces is associated to more than one transactor, each assuming a different role in the verification process. In this paper we propose a methodology in which both the interfaces and their relation are captured by a single formal specification. By using the specification, we show how the code for all the transactors associated with a pair of interfaces can be automatically generated.
We present a coverage metric which evaluates the testing of a set of interacting concurrent processes. Existing behavioral coverage metrics focus almost exclusively on the testing of individual processes. However, the vast majority of practical hardware descriptions are composed of many processes which must correctly interact to implement the system. Coverage metrics which evaluate processes separately are unlikely to model the range of design errors which manifest themselves when components are integrated to build a system. A metric which models component interactions is essential to enable validation techniques to scale with growing design complexity. We describe the effectiveness of our metric and provide results to demonstrate that coverage computation using our metric is tractable.
An ever-increasing portion of design effort is spent on functional verification. The verification space, as the set of possible combinations of a design's attributes, is likely to be very large, making it infeasible to verify each point in this space. State-of-the-art verification tools tackle this problem by using directed random generation of combinations in conjunction with manually defined corner cases in order to obtain satisfactory coverage with the desired distribution. In this work, the underlying methodology for automatically generating complete sets of disjoint coverage models on the basis of formal attribute definitions is extended to take relational constraints into account. This allows the utilization of coverage models with non-orthogonal, non-planar boundaries, which can make hole analysis of coverage data obsolete. We demonstrate how the proposed methodology can be used to automatically determine corner cases more accurately than is possible with conventional approaches.
This article presents the classification tree method for functional verification to close the gap from the specification of a test plan to SystemVerilog [2] testbench generation. Our method supports the systematic development of test configurations and is based on the classification tree method for embedded systems (CTM/ES) [1], extending CTM/ES for random test generation as well as for functional coverage and property specification. We support the structured coding of assertions and constraints by a two-step method: (i) creation of the classification tree, and (ii) creation of (sample) abstract test sequences. For SystemVerilog testbench generation, we introduce a mapping to SystemVerilog random tests, assertions, and functional coverage specifications. As our method is derived from the CTM/ES, it is also compliant with the V-method and thus applies to IEC 61508-conformant development of electronic safety-related systems. The remainder of this paper gives an overview of the classification tree method (CTM) before presenting our extension for functional verification.
In this paper we introduce a new test-data compression method for IP cores with unknown structure. The proposed method encodes the test data provided by the core vendor using a new, very effective compression scheme based on multilevel Huffman coding. Specifically, three different kinds of information are compressed using the same Huffman code, and thus significant test data reductions are achieved. A simple architecture is proposed for decoding on-chip the compressed data. Its hardware overhead is very low and comparable to that of the most efficient methods in the literature. Additionally, the proposed technique offers increased probability of detection of unmodeled faults since the majority of the unknown values of the test set are replaced by pseudorandom data generated by an LFSR.
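The single-level building block of such a scheme is ordinary Huffman coding of fixed-size test-data blocks. The sketch below is ours and illustrates only that building block; the paper's multilevel scheme, which shares one Huffman code among three different kinds of information, is not reproduced here.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bit string) for a stream of
    fixed-size test-data blocks. Heap entries carry a unique tiebreak
    integer so the trees themselves are never compared."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate: one symbol
        return {next(iter(freq)): "0"}
    heap = [(n, i, s) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:                    # merge two rarest subtrees
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tick, (t1, t2)))
        tick += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):         # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                               # leaf: an actual data block
            code[tree] = prefix
    walk(heap[0][2], "")
    return code
```

For example, a stream of sixteen 4-bit blocks with frequencies 8/4/2/2 encodes in 28 bits instead of the 64 bits of the raw data, and the code is prefix-free, so on-chip decoding needs only a small tree walker.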
We present an approach to prevent overtesting in scan-based delay test. The test data is transformed with respect to functional constraints while simultaneously keeping as many positions as possible unspecified in order to facilitate test compression. The method is independent of the employed delay fault model, ATPG algorithm and test compression technique, and it is easy to integrate into an existing flow. Experimental results emphasize the severity of overtesting in scan-based delay test. The influence of different functional constraints on the amount of required test data and on the compression efficiency is investigated. To the best of our knowledge, this is the first systematic study of the relationship between overtesting prevention and test compression.
Keywords: Overtesting prevention, Functional constraints, Scan-based delay test, Test compression
A concurrent core test approach is proposed to reduce the test cost of SOCs. Multiple cores in an SOC can be tested simultaneously by using a shared test set and scan chain disabling. Prior to test, the test sets corresponding to the cores under test (CUTs) are merged by the proposed merging algorithm to obtain a shared test set of minimum size. During test, the on-chip scan chain disable signal (SCDS) generator is employed to retrieve the original test vectors from the shared test set. The approach is non-intrusive and automatic test pattern generator (ATPG) independent. Moreover, the approach can reduce test cost further when combined with general test compression/decompression techniques. Experimental results for ISCAS 89 benchmark circuits demonstrate the efficiency of the proposed approach.
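The paper's merging algorithm is not reproduced here, but its core compatibility notion can be sketched as follows (a hypothetical greedy variant of ours; the actual algorithm minimizes the shared set size more aggressively): two test cubes over {0, 1, X} can share one vector whenever no bit position specifies opposite values.

```python
def compatible(a, b):
    """Two test cubes over {'0','1','X'} conflict only where
    both specify opposite values."""
    return all(x == y or x == 'X' or y == 'X' for x, y in zip(a, b))

def merge(a, b):
    """Merged cube: take the specified bit wherever one exists."""
    return ''.join(y if x == 'X' else x for x, y in zip(a, b))

def shared_test_set(*test_sets):
    """Greedy sketch of the merging step: fold each cube of every
    core's test set into the first compatible cube of the shared set,
    appending a new cube only when no compatible one exists."""
    shared = []
    for ts in test_sets:
        for cube in ts:
            for i, s in enumerate(shared):
                if compatible(s, cube):
                    shared[i] = merge(s, cube)
                    break
            else:
                shared.append(cube)
    return shared
```

For two cores with test sets ["0X1", "1XX"] and ["001", "110"], the greedy merge yields a shared set of two vectors instead of four.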
This paper presents an efficient method to block unknown values from entering temporal compactors. The control signals for the blocking logic are generated by an LFSR. The proposed technique minimizes the size of the LFSR by propagating only one fault effect for each fault and balancing the number of specified bits in each control pattern. The linear solver to find seeds of the LFSR intelligently chooses a solution such that the impact on test quality is minimal. Experimental results show that sizes of control data for the proposed method are smaller than prior work and run time of the proposed method is several orders of magnitude smaller than that of prior work. Hardware overhead is very low.
The presence of unknown values in simulation is the greatest barrier to effective test response compaction. For space compactors, some response may not be observable due to the masking effect caused by unknown values. This paper reports on experiments conducted to evaluate the impact on the test quality of various percentages of observable responses for both modeled and un-modeled faults.
Much research has focused on power conservation for the processor, while power conservation for I/O devices has received little attention. In this paper, we analyze the problem of online energy-aware I/O scheduling for hard real-time systems based on the preemptive periodic task model. We propose an online energy-aware I/O device scheduling algorithm: Energy-efficient Device Scheduling (EEDS). The EEDS algorithm utilizes device slack to perform device power state transitions to save energy, without jeopardizing temporal correctness. An evaluation of the approach shows that it yields significant energy savings compared with using no dynamic power management (DPM) techniques.
Energy-aware design of electronic systems has been an important issue in hardware and/or software implementations, especially for embedded systems. This paper targets a synthesis problem for heterogeneous multiprocessor systems: scheduling a set of periodic real-time tasks under a given energy consumption constraint. Each task must execute on a processor without migration, and tasks might have different execution times on different processor types. Our objective is to minimize the processor cost of the entire system under the given timing and energy consumption constraints. The problem is first shown to be NP-hard and to admit no polynomial-time algorithm with a constant approximation ratio unless NP = P. We propose polynomial-time approximation algorithms with (m + 2)-approximation ratios for this challenging problem, where m is the number of available processor types. Experimental results show that the proposed algorithms consistently derive solutions with system costs close to those of optimal solutions.
Keywords: Energy-aware systems, Task scheduling, Real-time systems, Task partitioning, Multiprocessor synthesis.
Scheduling is an important step in high-level synthesis (HLS). In our tool, we perform scheduling in two steps: coarse-grain scheduling, in which we take into account the whole control structure of the program including imperfect loop nests, and fine-grain scheduling, where we refine each logical step using a detailed description of the available resources. This paper focuses on the second step. Tasks are modeled as reservation tables (or templates) and we express resource constraints using dis-equations (i.e., negations of equations). We give an exact algorithm based on a branch-and-bound method, coupled with variants of Dijkstra's algorithm, which we compare with a greedy heuristic. Both algorithms are tested on pieces of scientific applications to demonstrate their suitability for HLS tools.
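A minimal sketch of the greedy-heuristic side of this comparison, with reservation tables represented as lists of (resource, offset) pairs, can look as follows. The representation and example resources are our simplification: only the resource dis-equations are enforced, and data dependences are ignored.

```python
def greedy_schedule(tasks):
    """Greedy fine-grain scheduling sketch. Each task is a reservation
    table: (resource, offset) pairs it occupies relative to its start.
    The resource constraint 'two uses never collide' is the dis-equation
    start_a + off_a != start_b + off_b for uses of the same resource.
    Each task is placed at the earliest conflict-free start time."""
    busy = {}                       # resource -> set of occupied cycles
    starts = []
    for table in tasks:
        t = 0
        while any((t + off) in busy.get(res, set()) for res, off in table):
            t += 1                  # slide until no dis-equation is violated
        for res, off in table:
            busy.setdefault(res, set()).add(t + off)
        starts.append(t)
    return starts
```

With a single multiplier and adder, three tasks using the multiplier at offsets 0, 0 and 1 respectively are serialized just enough to avoid collisions; the exact branch-and-bound method of the paper explores such placements systematically instead of greedily.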
Conventional task scheduling on real-time systems with multiple processors is notorious for its computational intractability. This problem becomes even harder when designers also have to consider other constraints such as energy consumption. Such a multi-objective trade-off exploration is a crucial step in generating cost-efficient real-time embedded systems. Although previous task schedulers have attempted to provide fast heuristics for design space exploration, they cannot handle large systems efficiently. As today's embedded systems become increasingly larger, we need a scalable scheduler to handle this complexity. This paper presents a hierarchical scheduler that combines graph partitioning and task interleaving to tackle the trade-off exploration problem in a scalable way. Our scheduler can employ existing flattened schedulers and significantly accelerate design space exploration for large task sets. Speed-ups of up to two orders of magnitude have been obtained for large task models compared to a conventional flattened scheduler.
Boolean matching is a powerful technique that has been used in technology mapping to overcome the limitations of structural pattern matching. The current basis for performing Boolean matching is the computation of a canonical form to represent functions that are equivalent under negation and permutation of inputs and outputs. In this paper, we first present a detailed analysis of previous techniques for Boolean matching. We then describe a novel combination of existing methods and new ideas that results in a matcher which is dramatically faster than previous work. We point out that the presented algorithm is equally relevant for detecting generalized functional symmetries, which has broad applications in logic optimization and verification.
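The canonical form underlying Boolean matching can be computed by brute force for small input counts. The sketch below is ours and uses no pruning; it simply makes the definition concrete: minimize the truth table over all input permutations, input negations and output negation (NPN equivalence). Real matchers, including the one in the paper, prune this search with signatures.

```python
from itertools import permutations

def npn_canonical(tt, n):
    """Exhaustive NPN canonical form of an n-input function given as a
    truth-table integer (bit m = f(m)). Feasible only for small n."""
    size = 1 << n
    mask = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):
        for flips in range(size):           # input-negation mask
            t = 0
            for m in range(size):
                src = 0                     # input vector seen by f
                for i in range(n):
                    if ((m >> i) & 1) ^ ((flips >> i) & 1):
                        src |= 1 << perm[i]
                if (tt >> src) & 1:
                    t |= 1 << m
            for cand in (t, t ^ mask):      # output polarity
                if best is None or cand < best:
                    best = cand
    return best
```

Two-input AND (truth table 0b1000) and OR (0b1110) land in the same NPN class, while XOR (0b0110) forms its own class together with XNOR.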
Optimizing sequential cycles is essential for many types of high-performance circuits, such as pipelines for packet processing. Retiming is a powerful technique for speeding pipelines, but it is stymied by tight sequential cycles. Designers usually attack such cycles by manually combining Shannon decomposition with retiming - effectively a form of speculation - but such manual decomposition is error-prone. We propose an efficient algorithm that simultaneously applies Shannon decomposition and retiming to optimize circuits with tight sequential cycles. While the algorithm is only able to improve certain circuits (roughly half of the benchmarks we tried), the performance increase can be dramatic (7%-61%) with only a modest increase in area (3%-12%). The algorithm is also fast, making it a practical addition to a synthesis flow.
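The decomposition itself is easy to state on truth tables: f = x'·f|x=0 + x·f|x=1. This is what allows speculation: both cofactors are computed in parallel while the late-arriving x only drives the final multiplexer, which is where retiming regains slack. A small sketch of the decomposition (ours; the retiming step is not shown):

```python
def cofactors(tt, n, i):
    """Shannon cofactors of an n-input truth table (integer, bit m = f(m))
    with respect to input i: returns (f|x_i=0, f|x_i=1), each an
    (n-1)-input truth table."""
    f0, f1 = 0, 0
    for m in range(1 << n):
        # drop bit i from the minterm index
        reduced = (m & ((1 << i) - 1)) | ((m >> (i + 1)) << i)
        if (tt >> m) & 1:
            if (m >> i) & 1:
                f1 |= 1 << reduced
            else:
                f0 |= 1 << reduced
    return f0, f1

def recombine(f0, f1, n, i):
    """Inverse of cofactors: the Shannon expansion f = x_i'*f0 + x_i*f1,
    i.e. a multiplexer on x_i selecting between the cofactors."""
    tt = 0
    for m in range(1 << n):
        reduced = (m & ((1 << i) - 1)) | ((m >> (i + 1)) << i)
        cof = f1 if (m >> i) & 1 else f0
        if (cof >> reduced) & 1:
            tt |= 1 << m
    return tt
```

For a 3-input AND, the cofactor with respect to any input at value 1 is a 2-input AND and at value 0 is the constant 0, and recombining always reproduces the original function.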
The clock latency scheduling problem is usually solved on the sequential graph, also called the register-to-register graph. In practice, the extraction of the sequential graph for a given circuit is much more expensive than computing the clock latency schedule for the sequential graph. In this paper we present a new algorithm for clock latency scheduling which does not require the complete sequential graph as input. The new algorithm is based on the parametric shortest paths algorithm by Young, Tarjan and Orlin. It extracts the sequential timing graph only partially, that is, in the critical regions, through a callback. It is still guaranteed that the algorithm finds the critical cycle and the minimum clock period. As additional input, the algorithm only requires, for every register, the maximum delay of any outgoing combinational path. Computing these maximum delays for all registers is equivalent to the timing analysis problem, hence they can be computed very efficiently. Computational results on recently released public benchmarks and industrial designs show that on average only 20.0% of the edges in the sequential graph need to be extracted, which reduces the overall runtime to 5.8%.
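With unconstrained latencies and setup (max-delay) constraints only, the minimum clock period found this way equals the maximum mean delay over all cycles of the register-to-register graph. The sketch below (ours) computes that quantity naively on an explicitly given graph via binary search plus Bellman-Ford; the point of the paper's parametric shortest-path approach is precisely to avoid materializing the whole graph, which this sketch does not attempt.

```python
def min_clock_period(n, edges, lo=0.0, hi=None, eps=1e-6):
    """Minimum clock period = maximum mean cycle delay of the
    register-to-register graph. n registers; edges is a list of
    (u, v, delay) tuples. Hold constraints are ignored."""
    if hi is None:
        hi = max(d for _, _, d in edges)

    def has_positive_cycle(T):
        # A cycle with mean delay > T exists iff the graph has a
        # positive cycle under edge weights d - T (Bellman-Ford with
        # an implicit super-source: dist starts at 0 everywhere).
        dist = [0.0] * n
        for _ in range(n):
            changed = False
            for u, v, d in edges:
                if dist[u] + (d - T) > dist[v] + 1e-12:
                    dist[v] = dist[u] + (d - T)
                    changed = True
            if not changed:
                return False
        return True                 # still relaxing after n rounds

    while hi - lo > eps:            # binary search on the period T
        mid = (lo + hi) / 2
        if has_positive_cycle(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

For a two-register cycle with delays 3 and 5 (mean 4) plus a self-loop of delay 2, the minimum period is 4: latency scheduling balances the slack around the critical cycle but cannot beat its mean delay.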
Mesh architectures are used to distribute critical global signals on a chip, such as clock and power/ground. Redundancy created by mesh loops smooths out undesirable variations between signal nodes spatially distributed over the chip. However, one problem with the mesh architectures is the difficulty in accurately analyzing large instances. Furthermore, variations in process and temperature, supply noise and crosstalk noise cause uncertainty in the delay from clock source to flip-flops. In this paper, we study the problem of analyzing timing uncertainty in mesh-based clock architectures. We propose solutions for both pure mesh and (mesh + global-tree) architectures. The solutions can handle large design and mesh instances. The maximum error in uncertainty values reported by our solutions is 1-3ps with respect to the golden Monte Carlo simulations, which is at most 0.5% of the nominal clock latency of about 600ps.
We present a methodology, an environment and supporting tools to map an application on a wireless sensor network (WSN). While the method is quite general, we use extensively an example in the domain of industrial control, as it is one of the most promising applications of WSNs and yet is largely untouched by them. Our design flow starts from a high level description of the control algorithm and a set of candidate hardware platforms and automatically derives an implementation that satisfies system requirements while optimizing for power consumption. To manage the heterogeneity and complexity inherent in this rather complete design flow, we identify three abstraction layers and introduce the tools to transition between different layers and obtain the final solution. We present a case study of a control application for manufacturing plants that shows how the methodology covers all the aspects of the design process, from conceptual description to implementation.
Controlled experiments with larger sensor network configurations (100+ nodes), as well as the discovery of inefficiencies in their operation, are rather complex. In this paper we present a concept of a testbed for WSNs supporting easy change of configuration, multi-tier operation and precise observation of network behaviour using wired backbone connectivity. The design considerations are accompanied by early usage experience.
We aim at developing a next-generation system for sow monitoring. Today, farmers use RFID-based solutions with an ear tag on the sows and a reader located inside the feeding station. This does not allow the farmers to locate a sow in a large pen, or to monitor the life cycle of the sow (detect heat period, detect injury...). Our goal is to explore the design of a sensor network that supports such functionalities and meets the constraints of this industry in terms of price, energy consumption and availability.
Major impediments to technology scaling in the nanometer regime include power (or energy) dissipation and "erroneous" behavior induced by process variations and noise susceptibility. In this paper, we demonstrate that CMOS devices whose behavior is rendered probabilistic by noise (yielding probabilistic CMOS or PCMOS) can be harnessed for ultra low energy and high performance computation. PCMOS devices are inherently probabilistic in that they are guaranteed to compute correctly with a probability 1/2 < p < 1 and thus, by design, they are expected to compute incorrectly with a probability (1-p). In this paper, we show that PCMOS technology yields significant improvements, both in the energy consumed as well as in the performance, for probabilistic applications with broad utility. These benefits are derived using an application-architecture-technology (A2T) co-design methodology introduced here, yielding an entirely novel family of probabilistic system-on-a-chip (PSOC) architectures. All of our application and architectural savings are quantified using the product of energy and performance, denoted (energy x performance): the PCMOS-based gains are as high as a substantial multiplicative factor of over 560 when compared to a competing energy-efficient CMOS-based realization.
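To make the notion of gates that "compute correctly with a probability 1/2 < p < 1" concrete, here is a small illustrative model of our own, not the paper's A2T methodology: a chain of k probabilistic inverters produces the correct output exactly when an even number of stages flip, which gives a simple closed form that a seeded Monte Carlo simulation confirms.

```python
import random

def chain_correct_prob(p, k):
    """Probability that a chain of k probabilistic inverters is correct.

    Each stage flips its output independently with probability (1 - p);
    the chain is correct iff an even number of flips occur, giving the
    closed form (1 + (2p - 1)^k) / 2.
    """
    return (1 + (2 * p - 1) ** k) / 2

def simulate_chain(p, k, trials=20000, seed=1):
    """Seeded Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        flips = sum(rng.random() > p for _ in range(k))
        ok += (flips % 2 == 0)
    return ok / trials
```

For p = 0.9 and k = 2 the closed form gives 0.82, illustrating how quickly per-gate correctness degrades through composition and why application-level probabilistic tolerance matters.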
IR and di/dt events may cause ohmic losses and large supply voltage variations due to system parasitics. Today, parallelism in the power delivery path is used to reduce ohmic loss while decoupling capacitance is used to minimize the supply voltage variation. Future integrated circuits, however, will exhibit large enough currents and current transients to mandate additional safeguards. A novel, distributed power delivery and decoupling network is introduced, reducing the supply voltage variation magnitude by 67% and the future ohmic loss by 15.9W (compared to today's power delivery and decoupling networks) using conventional processing and packaging techniques in a 130nm technology node.
This paper presents an ultra low-power TLB design, which combines two techniques to minimize the power dissipated in TLB accesses. In our design, we first propose a real-time filter scheme to eliminate redundant TLB accesses. Without delay penalty, the proposed real-time filter can identify a redundant TLB access as soon as the virtual address is generated. The second technique is a banking-like structure, which aims to reduce the TLB power consumption in the case of necessary accesses. We present two adaptive variants of the banked TLB. Compared to the conventional banked TLB, these two variants achieve better power efficiency without increasing the TLB miss ratio. The experimental results show that by filtering out all the redundant TLB accesses and then minimizing the power consumption per TLB access, our design can effectively improve the Energy*Delay product of the TLBs, especially for data TLBs with poor spatial locality.
This paper presents a timeout-driven DPM technique which relies on the theory of Markovian processes. The objective is to determine the energy-optimal timeout values for a system with multiple power saving states while satisfying a set of user-defined performance constraints. More precisely, a controllable Markovian process is exploited to model the power management behavior of a system under the control of a timeout policy. Starting with this model, a perturbation analysis technique is applied to develop an offline gradient-based approach to determine the optimal timeout values. Online implementation of this technique for a system with dynamically-varying system parameters is also described. Experimental results demonstrate the effectiveness of the proposed approach. Dynamic power management (DPM), which refers to the selective shut-off or slow-down of components that are idle or underutilized, has proven to be a particularly effective technique for reducing power dissipation in electronic systems. In the literature, various DPM techniques have been proposed, from heuristic methods presented in early works [1][2] to stochastic optimization approaches [3][4]. Among the heuristic DPM methods, the timeout policy is the most widely used approach in industry and has been implemented in many operating systems. Examples include the power management scheme incorporated into the Windows system, the low-power saving mode of the IEEE 802.11a-g protocol for wireless LAN cards, and the enhanced adaptive battery life extender (EABLE) for the Hitachi disk drive. Most of these industrial DPM techniques provide mechanisms to adjust the timeout values at the user level.
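The timeout trade-off the abstract optimizes can be sketched with a toy single-sleep-state model under an exponential idle-time assumption, a deliberate simplification of the paper's multi-state Markovian setting: the expected energy of one idle period has a closed form, and a grid search stands in for the paper's gradient-based approach.

```python
import math

def expected_energy(t, p_idle, lam, e_tr, p_sleep):
    """Expected energy of one idle period under timeout t.

    Idle length T ~ Exp(lam). The device idles at power p_idle until the
    timeout fires, then pays transition energy e_tr and sleeps at p_sleep:
      E = p_idle * E[min(T, t)] + P(T > t) * (e_tr + p_sleep * E[T - t | T > t])
        = p_idle * (1 - exp(-lam*t)) / lam + exp(-lam*t) * (e_tr + p_sleep / lam)
    """
    return (p_idle * (1 - math.exp(-lam * t)) / lam
            + math.exp(-lam * t) * (e_tr + p_sleep / lam))

def best_timeout(p_idle, lam, e_tr, p_sleep, grid):
    """Pick the grid point minimizing expected energy (toy stand-in for
    the paper's perturbation-analysis gradient method)."""
    return min(grid, key=lambda t: expected_energy(t, p_idle, lam, e_tr, p_sleep))
```

Under the exponential assumption the derivative has a constant sign, so the optimum sits at an extreme of the timeout range; the paper's value lies in handling general distributions, multiple sleep states and performance constraints, where the optimum is interior.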
We utilize a formal model of division for determining a testbench of p-bit (dividend, divisor) pairs whose output 2p-bit quotients have properties characterizing these instances as the most challenging for verifying any division algorithm design and implementation. Specifically, our test suites yield 2p-bit quotients where the leading p bits traverse all or a pseudo-random sample of leading bit combinations, and the next p bits comprise a round bit followed by (p-1) identical bits. These values are proven to be closest to the p-bit quotient rounding boundaries and shown to possess other desirable coverage properties. We introduce an efficient method of generating these testbenches. We also describe applications of these testbenches at the design simulation stage and the product evaluation stage.
The problem of diagnosis - or locating the source of an error or fault - occurs in several areas of computer aided design, such as dynamic verification, property checking, equivalence checking and production test. Manually locating errors can be a time-consuming and resource-intensive process. Several automated approaches for diagnosis have been presented, among them simulation-based and SAT-based techniques. These two approaches are found to be robust even for large circuits as well as being applicable to a broad range of diagnosis problems. An in-depth comparison of both approaches, necessary to augment our knowledge of diagnosis procedures, has not been addressed by previous work. This paper provides a thorough analysis of the similarities and differences between simulation-based and SAT-based procedures for diagnosis. The relation between the basic approaches is theoretically analyzed. Issues regarding performance and diagnosis quality (resolution) are discussed. Experimental data strengthens the theoretical results. This detailed understanding of the relations between the techniques is necessary to provide further improvements to the field of diagnosis. The initial steps towards building a hybrid technique are also presented.
In recent years, increasing manufacturing density has allowed the development of Multi-Processor Systems-on-Chip (MPSoCs). Application-Specific Instruction Set Processors (ASIPs) stand out as one of the most efficient design paradigms and could be especially effective as SoC computing engines. However, multiple hurdles that hinder the productivity of SoC designers and researchers must be overcome first. Among them is the difficulty of thoroughly exploring the design space by simultaneously sweeping axes such as processing elements, memory hierarchies and chip interconnect fabrics. We tackle this challenge by proposing an integrated approach where state-of-the-art platform modeling infrastructures, at the IP core level and at the system level, meet to provide the designer with maximum openness and flexibility in terms of design space exploration.
In traditional parallel co-simulation approaches, the simulation speed is heavily limited by time synchronization overhead between simulators and idle time caused by data dependency. Recent work has shown that the time synchronization overhead can be reduced significantly by predicting the next synchronization points more effectively or by separating trace-driven architecture simulation from trace generation from component simulators. The latter is known as virtual synchronization technique. In this paper, we propose redundant host execution to minimize the simulation idle time caused by data dependency in simulation models. By combining virtual synchronization and redundant host execution techniques we could make parallel execution of multiple simulators a viable solution for fast but cycle-accurate co-simulation. Experiments show about 40% performance gain over a technique which uses virtual synchronization only.
We propose a new task scheduling algorithm for timed-functional simulation of concurrent software tasks. It attains efficiency by reducing the frequency of context-switching between concurrent tasks. It also provides a high degree of portability in the sense that it only requires the underlying system to support a very small number of primitives. We provide a concrete implementation built on top of the SystemC scheduler and show some results of preliminary evaluation.
In this paper we analyze the test power of SRAM memories and demonstrate that the full functional precharge activity is not necessary during test mode because of the predictable addressing sequence. We exploit this observation in order to minimize power dissipation during test by eliminating the unnecessary power consumption associated with the precharge activity. This is achieved through modified precharge control circuitry, exploiting the first degree of freedom of March tests, which allows choosing a specific addressing sequence. The efficiency of the proposed solution is validated through extensive Spice simulations.
We present a very effective on-line interconnect built-in-self-test (BIST) method, I-BIST, for FPGAs that uses a combination of the following novel techniques: a track-adjacent and a switch-adjacent (also called "mirror adjacent") pairwise net comparison mechanism that achieves high detectability, a carefully designed set of only five net-configurations that cover all types and locations of wire-segment and switch faults, a 2-phase global-detailed testing approach, and a divide-and-conquer technique used in detailed testing to quickly narrow down the set of potential suspect interconnects that are then detail-diagnosed. These techniques result in I-BIST having provable detectability in the presence of an unbounded number of multiple faults, very high diagnosability of 99-100% even for high fault densities of up to 10% that are expected in emerging nano-scale technologies, and much lower test times or fault latencies than the previous best interconnect BIST techniques. In particular, for application to on-line testing, our method requires 2n roving-tester (ROTE) configurations to test an entire n x n FPGA, while the previous best online interconnect BIST technique requires n^2 configurations. Thus, I-BIST is an order of magnitude more time- as well as power-efficient, and will scale well with the rapidly increasing FPGA device sizes expected in emerging technologies.
This paper proposes reuse of on-chip networks for testing switches in Networks on Chip (NoCs). The proposed algorithm broadcasts test vectors of switches through the on-chip networks and detects faults by comparing output responses of switches with each other. This algorithm alleviates the need for: (1) external comparison of the output response of the circuit-under-test with the response of a fault-free circuit stored on a tester; (2) on-chip signature analysis; (3) a dedicated test bus to deliver test vectors and collect their responses. Experimental results on a few test benches compare the proposed algorithm with traditional System on Chip (SoC) test methods.
It has been proven that the scan path is a potent hazard for secure chips. Scan-based attacks have recently been demonstrated against DES and AES implementations, and several solutions have been presented in the literature to secure the scan chain. Nevertheless, the proposed techniques are all ad hoc, and not always easy to integrate into a completely automated design flow or an IP reuse environment. In this paper, we propose a scan chain integrity detection mechanism that respects both an automated design flow and an IP reuse environment.
Entering the nanometer era, a major challenge to current design methodologies and tools is to effectively address the high defect densities projected for nanotechnologies. To this end, we proposed a reconfiguration-based defect-avoidance methodology for defect-prone nanofabrics. It judiciously architects the nanofabric, using probabilistic considerations, such that a very large number of alternative implementations can be mapped into it, enabling defects to be circumvented at configuration time in a scalable way. Building on this foundation, in this paper we propose a synthesis framework aimed at implementing this new design paradigm. A key novelty of our approach with respect to traditional high level synthesis is that, rather than carefully optimizing a single ("deterministic") solution, our goal is to simultaneously synthesize a large family of alternative solutions, so as to meet the required probability of successful configuration, or yield, while maximizing the family's average performance. Experimental results generated for a set of representative benchmark kernels, assuming different defect regimes and target yields, empirically show that our proposed algorithms can effectively explore the complex probabilistic design space associated with this new class of high level synthesis problems.
High level synthesis transformations play a major part in shaping the properties of the final circuit. However, most optimizations are performed without much knowledge of the final circuit layout. In this paper, we present a physically aware design flow for mapping high level application specifications to a synthesizable register transfer level hardware description. We study the problem of optimizing the data communication of the variables in the application specification. Our algorithm uses floorplan information that guides the optimization. We develop a simple, yet effective, incremental floorplanner to handle the perturbations caused by the data communication optimization. We show that the proposed techniques can reduce the wirelength of the final design, while maintaining a legal floorplan with the same area as the initial floorplan.
For CMOS technologies below 65nm, gate oxide direct tunneling current is a major component of the total power dissipation. This paper presents a simulated annealing based algorithm for the gate leakage current reduction by simultaneous scheduling, allocation and binding during behavioral synthesis. Gate leakage current reduction is based on the use of functional units of different oxide thickness while simultaneously accounting for process variations. We present a cost function that minimizes leakage and area overhead. The algorithm minimizes the cost function for a given delay trade-off factor. It uses a pre-characterized cell library for tunneling current, delay and area, expressed as analytical functions of the gate oxide thickness Tox. We tested our approach using a number of behavioral level benchmark circuits characterized for a 45nm library by integrating our algorithm into a high-level synthesis system. We obtained an average gate leakage reduction of 76.88% with an average area overhead of 17.38% for different delay trade-off factors ranging from 1.0 to 1.4.
Customizing the bypasses in an embedded processor uncovers valuable trade-offs between the power, performance and cost of the processor. Meaningful exploration of bypasses requires a bypass-sensitive compiler. Operation Tables (OTs) have been proposed to perform bypass-sensitive compilation. However, due to the lack of automated methods to generate OTs, OTs are currently specified manually by the designer. Manual specification of OTs is not only an extremely time consuming task, but is also highly error-prone. In this paper, we present AutoOT, an algorithm to automatically generate OTs from a high-level processor description. Our experiments on the Intel XScale processor model running MiBench benchmarks demonstrate that AutoOT greatly reduces the time and effort of specification. Automatic generation of OTs makes it feasible to perform full bypass exploration on the Intel XScale and thus discover interesting alternate bypass configurations in a reasonable time. To further reduce the compile-time overhead of OT generation, we propose another novel algorithm, AutoOTDB. AutoOTDB is able to cut the compile-time overhead of OT generation by half.
The sensitivity of the response of an analog system to circuit parameter variations is a vital performance metric for evaluation of its quality. This paper proposes a unified high level synthesis methodology for higher order continuous time state variable filters, considering the optimization of this metric. Minimization of the hardware count, which is another important issue, has also been taken into account at a much earlier stage of design. The entire methodology is illustrated with the case study of a state variable low pass filter and the benefits of the approach are clearly brought out.
Finite state models generated from software programs have unique characteristics that are not exploited by existing model checking algorithms. In this paper, we propose a novel disjunctive image computation algorithm and other simplifications based on these characteristics. Our algorithm divides an image computation into a disjunctive set of easier ones that can be performed in isolation. Hypergraph partitioning is used to minimize the number of live variables in each disjunctive component. We use the live variables to simplify transition relations and reachable state subsets. Our experiments on a set of real-world C programs show that the new algorithm achieves orders-of-magnitude performance improvement over the best known conjunctive image computation algorithm.
Constrained random simulation is a widespread technique used to perform functional verification on complex digital designs, because it can generate simulation vectors at a very high rate. However, the generation of high-coverage tests remains a major challenge even in light of this high performance. In this paper we present Guido, a hybrid verification software that uses formal verification techniques to guide the simulation towards a verification goal. Guido is novel in that 1) it guides the simulation by means of a distance function derived from the circuit structure, and 2) it has a trace sequence controller that monitors and controls the direction of the simulation by striking a balance between random chance and controlled hill-climbing. We present experimental results indicating that Guido can tackle complex designs, including a picoJava microprocessor, and reach a verification goal in far fewer simulation cycles than random simulation.
Practitioners of formal property verification often work around the capacity limitations of formal verification tools by breaking down properties into smaller properties that can be checked on the sub-modules of the parent module. To support this methodology, we have developed a formal methodology for verifying whether the decomposition is indeed sound and complete, that is, whether verifying the smaller properties on the submodules actually guarantees the original property on the parent module. In practice, however, designers do not write properties for all modules, and our previous methodology was therefore applicable only to selected cases. In this paper we present new formal methods that allow us to handle RTL blocks in the analysis. We believe that the new approach will significantly widen the scope of the methodology, thereby enabling the validation engineer to handle much larger designs than admitted by existing formal verification tools.
We present in this paper a technique for the formal verification of probabilistic systems described in PMAUDE, a probabilistic extension of the rewriting system Maude. Our methodology is based on numerical verification using the probabilistic symbolic model checking tool PRISM. In particular, we show how to construct an abstract system from the runs of a model that preserves all the probabilistic properties of the latter. We then deduce the probability matrix that will be used for verification in PRISM.
During Bounded Model Checking (BMC), blocks of a design are often considered separately due to complexity issues. Because the environment of a block is not available for the proof, invalid input sequences frequently lead to false negatives, i.e. counter-examples that cannot occur in the complete design. Finding and understanding such false negatives is currently a time-consuming manual task. Here, we propose a method to automatically avoid false negatives caused by invalid input sequences for blocks connected by standard communication protocols.
While transistors per square millimeter and on-chip clock frequency keep scaling smoothly according to Moore's Law, Vdd does not, nor does Vth. This leads to a dramatic increase in chip power density, and to a significant shift in the balance between dynamic and leakage power. In spite of the recent effort made by EDA vendors in delivering novel solutions that help mitigate the effects of technology scaling on power consumption, the question of whether the EDA industry is taking the low-power matter seriously still remains open. This session will provide an answer to this intriguing question, by first offering a short review of the state-of-the-art in design technologies for dynamic and leakage power minimisation. The session will then continue with a public "trial", in which OEMs, IDMs, IP and fabless semiconductor vendors will play the role of the public prosecutor against the defendant, the EDA industry. The court's ruling will tell us about the future targets the EDA vendors will pursue in low-power design technologies.
This paper presents an effective approach to formally verify SystemC designs. The approach translates SystemC models into a Petri-Net based representation. The Petri-net model is then used for model checking of properties expressed in a timed temporal logic. The approach is particularly suitable for, but not restricted to, models at a high level of abstraction, such as transaction-level. The efficiency of the approach is illustrated by experiments.
We introduce collapsed flushing, a new flushing-based refinement map for automatically verifying safety and liveness properties of term-level pipelined machine models. We also present a new method for handling liveness that is both simpler to define and easier to verify than previous approaches. To empirically validate collapsed flushing, we ran extensive experiments which show more than an order-of-magnitude improvement in verification times over standard flushing. Furthermore, by combining collapsed flushing with commitment refinement maps, we can monolithically verify complex pipelined machine models with deep pipelines - a salient feature of state-of-the-art microprocessor designs - that previous approaches cannot handle.
Functional validation is a major bottleneck in pipelined processor design. Simulation using functional test vectors is the most widely used form of processor validation. While existing model checking based approaches have proposed several promising ideas for efficient test generation, many challenges remain in applying them to realistic pipelined processors. The time and resources required for test generation using existing model checking based techniques can be extremely large. This paper presents an efficient test generation technique using decompositional model checking. The contribution of the paper is the development of both property and design decomposition procedures for efficient test generation of pipelined processors. Our experimental results using a multi-issue MIPS processor demonstrate several orders-of-magnitude reduction in memory requirement and test generation time.
We developed an original method to synthesize monitors from declarative specifications written in the PSL standard. Monitors observe sequences of values on their input signals, and check their conformance to a specified temporal expression. Our method implements both the weak and strong versions of PSL FL operators, and has been proven correct using the PVS theorem prover. This paper discusses the salient aspects of the proof of our prototype implementation for on-line design verification.
DRAMs play an important role in the semiconductor industry, due to their highly dense layout and their low price per bit. This paper presents the first framework of fault models specifically designed to describe the faulty behavior of DRAMs. The fault models in this paper are the outcome of a close collaboration with the industry, and are validated using a detailed Spice-based analysis of the faulty behavior of real DRAMs. The resulting fault space is then used to derive a number of new DRAM-specific tests, needed to detect some of these faults in practice.
Static Linked Faults are considered an interesting class of memory faults. Their capability of influencing the behavior of other faults can hide fault effects and makes test algorithm design a very complex task. A large number of March Tests with different fault coverage have been published, and some methodologies have been presented to automatically generate March Tests. In this paper we present an approach to automatically generate March Tests for Static Linked Faults. The proposed approach generates better test algorithms than previous ones, reducing the test length.
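For readers unfamiliar with March notation, the toy model below (our own illustration, unrelated to the paper's generation method for linked faults) applies the classic MATS+ test {up(w0); up(r0,w1); down(r1,w0)} to a one-bit-wide memory containing stuck-at faults:

```python
def run_march(size, stuck, elements):
    """Apply a March test to a 1-bit-wide memory with stuck-at faults.

    stuck: {address: stuck_value} - reads of a stuck cell always return
    the stuck value, regardless of what was written.
    elements: list of (direction, [(op, value), ...]) March elements.
    Returns True iff some read mismatches its expected value (fault detected).
    """
    mem = [0] * size
    for direction, ops in elements:
        addrs = range(size) if direction == "up" else range(size - 1, -1, -1)
        for a in addrs:
            for op, v in ops:
                if op == "w":
                    mem[a] = v
                elif stuck.get(a, mem[a]) != v:  # read: compare with expected
                    return True
    return False

# MATS+: {up(w0); up(r0,w1); down(r1,w0)}
MATS_PLUS = [("up",   [("w", 0)]),
             ("up",   [("r", 0), ("w", 1)]),
             ("down", [("r", 1), ("w", 0)])]
```

A stuck-at-0 cell escapes the first read (which expects 0) but is caught when the descending element expects to read back the 1 that was written; linked faults are harder precisely because one fault's effect can mask another's before the detecting read occurs.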
Transparent-scan was proposed as an approach to test generation and test compaction for scan circuits. Its effectiveness was demonstrated earlier in reducing the test application time for stuck-at faults. We show that similar advantages exist when considering transition faults. We first show that a test sequence under the transparent-scan approach can imitate the application of broadside tests for transition faults. Test compaction can proceed similar to stuck-at faults by omitting test vectors from the test sequence. A new approach for enhancing test compaction is also described, whereby additional broadside tests are embedded in the transparent-scan sequence without increasing its length or reducing its fault coverage.
We present a probabilistic fault model that allows any number of gates in an integrated circuit to fail probabilistically. Tests for this fault model, determined using the theory of output deviations, can be used to supplement tests for classical fault models, thereby increasing test quality and reducing the probability of test escape. Output deviations can also be used for test selection, whereby the most effective test patterns can be selected from large test sets during time-constrained and high-volume production testing. Experimental results are presented to evaluate the effectiveness of patterns with high output deviations for the single stuck-at and bridging fault models.
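The idea of output deviations can be illustrated on a fanout-free circuit in which every gate output flips with probability eps, a simplified stand-in for the paper's more general probabilistic fault model: propagating signal probabilities and comparing against the fault-free value gives a per-pattern deviation, and patterns with high deviation are the effective ones for test selection.

```python
def propagate(netlist, inputs, eps):
    """Compute P(wire = 1) for each wire when every gate output flips
    with probability eps. Independence holds only for fanout-free
    (tree) circuits - an assumption of this sketch.

    netlist: list of (out, gate, in1, in2) with gate in {AND, OR, NOT};
    in2 is None for NOT. inputs: {wire: 0 or 1}.
    """
    p = {w: float(v) for w, v in inputs.items()}
    for out, gate, a, b in netlist:
        if gate == "AND":
            q = p[a] * p[b]
        elif gate == "OR":
            q = p[a] + p[b] - p[a] * p[b]
        else:  # NOT
            q = 1 - p[a]
        p[out] = (1 - eps) * q + eps * (1 - q)  # gate may flip its output
    return p

def output_deviation(netlist, pattern, out, eps):
    """Deviation = |P(out = 1) under faults - fault-free value of out|."""
    ideal = propagate(netlist, pattern, 0.0)[out]
    noisy = propagate(netlist, pattern, eps)[out]
    return abs(noisy - ideal)
```

For an AND gate feeding a NOT with inputs (1,1) and eps = 0.1, the deviation is 0.18: errors accumulate through logic depth, which is why deviation ranks patterns differently than classical fault counts.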
Memory elements are the most vulnerable system component to soft errors. Since memory elements in cache arrays consume a large fraction of the die in modern microprocessors, the probability of particle strikes in these elements is high and can significantly impact overall processor reliability. Previous work [2] has developed effective metrics to accurately measure the vulnerability of cache memory elements. Based on these metrics, we have developed a reliability-performance evaluation framework, which has been built upon the Simplescalar simulator. In this work, we focus on the reliability aspects of L1 and L2 caches. Specifically, we present algorithms for tag vulnerability computation and investigate and report in detail on the vulnerability of data, tag, and status bits in the L2 array. Experiments on SPECint2K and SPECfp2K benchmarks show that one class of error, replacement error, makes up almost 85% of the total tag vulnerability of a 1MB write-back L2 cache. In addition, the vulnerability of L2 tag-addresses significantly increases as the size of the memory address space increases. Results show that the L2 tag array can be as susceptible as first-level instruction and data caches (IL1/DL1) to soft errors.
Due to increasing concern about various errors, current processors adopt error protection mechanisms. In particular, protecting L2/L3 caches incurs as much as 12.5% area overhead due to error correcting codes. Considering the large L2/L3 caches of current processors, this area overhead is very high. This paper proposes an area-efficient error protection scheme for L2/L3 caches. First, it selectively applies ECC (Error Correcting Code) only to dirty cache lines, while clean cache lines are protected using simple parity check codes. Second, the dirty cache lines are periodically cleaned by exploiting the generational behavior of cache lines. Experimental results show that the cleaning technique effectively reduces the number of dirty cache lines per cycle. The ECCs of this reduced number of dirty cache lines can be maintained in a small storage. Our proposed scheme is shown to reduce the area overhead of a 1MB L2 cache for error protection by 59% for SPEC2000 benchmarks running on a typical four-issue superscalar processor.
In this paper, we present the first multi-objective microarchitectural floorplanning algorithm for designing high-performance, high-reliability processors in the early design phase. Our floorplanner takes a microarchitectural netlist and determines the placement of the functional modules while simultaneously optimizing for performance and thermal reliability. The traditional design objectives such as area and wirelength are also considered. Our multi-objective hybrid floorplanning approach combining Linear Programming and Simulated Annealing is shown to be fast and effective in obtaining high-quality solutions. We evaluate the trade-off of performance, temperature, area, and wirelength and provide comprehensive experimental results.
Carry Save Adder (CSA) trees are commonly used for high speed implementation of multi-operand additions. We present a method to reduce the number of the adders in CSA trees by extracting common three-term subexpressions. Our method can optimize multiple CSA trees involving any number of variables. This optimization has a significant impact on the total area of the synthesized circuits, as we show in our experiments. To the best of our knowledge, this is the only known method for eliminating common subexpressions in CSA structures. Since extracting common subexpressions can potentially increase delay, we also present a delay aware extraction algorithm that takes into account the different arrival times of the signals.
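The 3:2 compression that underlies CSA trees is easy to state in code. The sketch below is a generic illustration, not the paper's extraction algorithm: it shows why a shared three-term subexpression saves adders, since the (sum, carry) pair of a common triple can be computed once and reused across trees.

```python
def csa(x, y, z):
    """One CSA level (3:2 compressor): reduce three operands to a sum
    word and a shifted carry word with x + y + z == s + c.
    Each bit position is an independent full adder."""
    s = x ^ y ^ z                           # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1  # per-bit majority, shifted = carry
    return s, c

def csa_tree_add(operands):
    """Reduce any number of operands with a CSA tree, then one final
    carry-propagate addition."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = csa(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return sum(ops)  # the single carry-propagate adder at the tree root
```

If two multi-operand sums both contain the triple (a, b, c), computing csa(a, b, c) once and feeding its (s, c) pair into both trees removes one compressor level from the second tree, which is the area saving the paper's three-term extraction targets.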
The paper presents a heuristic algorithm for the minimization of 2-SPP networks, i.e., three-level EXOR-AND-OR forms with EXOR gates restricted to fan-in 2. Previous works had presented exact algorithms for the minimization of unrestricted SPP networks and of 2-SPP networks. The exact minimization procedures were formulated as covering problems, as in the minimization of SOP forms, and had worst-case exponential complexity. Extending the expand-irredundant-reduce paradigm of the ESPRESSO heuristic, we propose a minimization algorithm for 2-SPP networks that iterates local minimization and reshaping of a solution until no further improvement is found. We also introduce the notion of EXOR-irredundance to prove that OR-AND-EXOR irredundant networks are fully testable and to guarantee that our algorithm yields OR-AND-EXOR irredundant solutions. We report a large set of experiments showing impressive high-quality results with affordable run times, also handling examples whose exact solutions could not be computed.
Conventional high-level synthesis uses the worst-case delay to relate all inputs to all outputs of an operation. This is a very conservative approximation of reality, especially in arithmetic operations, where some input bits are required later than others and some output bits are produced earlier than others. This paper proposes a pre-synthesis optimization algorithm that takes advantage of this feature for more efficient high-level synthesis of data-flow graphs formed by additions and multiplications. The presented pre-processor analyzes the critical path at bit granularity and splits the arithmetic operations into subword fragments. In particular, some of the multiplications in the specification are broken up into several smaller multiplications, additions, and operations of three new types specially defined to reduce the clock cycle duration. These fragments become the input to any regular high-level synthesis tool to speed up circuit execution times. Experimental results show that implementations obtained from the optimized specification are on average 70% faster, and in most cases substantial area reductions are also achieved.
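Splitting a wide multiplication into narrower ones rests on the textbook identity: with a = aH·2^k + aL and b = bH·2^k + bL, a·b = aH·bH·2^(2k) + (aH·bL + aL·bH)·2^k + aL·bL, so the low result bits depend only on the low partial product and can close earlier. A sketch of the decomposition (the identity only; the paper's transform additionally introduces three new operation types not modeled here):

```python
def split_mul(a, b, k):
    """Decompose a*b into four narrower partial products split at bit k.
    The low k bits of the result depend only on a_lo * b_lo, so that
    fragment can be scheduled to finish earlier than the high fragments."""
    mask = (1 << k) - 1
    a_hi, a_lo = a >> k, a & mask
    b_hi, b_lo = b >> k, b & mask
    return ((a_hi * b_hi) << (2 * k)) + \
           ((a_hi * b_lo + a_lo * b_hi) << k) + \
           (a_lo * b_lo)
```

For instance, `split_mul(200, 99, 4)` equals `200 * 99`, computed from four 4x4-ish partial products instead of one 8x8 multiplication.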
We propose a logic synthesis flow that exploits circuit functionality to synthesize a domino-cell network in which more wire pairs are crosstalk-immune to each other. For this purpose, techniques of output phase flipping and crosstalk-aware technology mapping are used. In addition, a metric to measure the crosstalk sensitivity of domino cells at the synthesis level is proposed. Experimental results demonstrate that the crosstalk sensitivity of the synthesized domino-cell network is reduced by 51% with our synthesis flow compared with the conventional methodology. Furthermore, after placement and routing are performed, the ratio of crosstalk-immune wire pairs to total wire pairs is about 25% with our methodology, compared to 9% with conventional techniques.
Our concept of a virtual transaction layer (VTL) architecture allows transaction-level communication channels to be mapped directly onto a synthesizable multiprocessor SoC implementation. The VTL sits above the physical MPSoC communication architecture, acting as a hardware abstraction layer for both HW and SW components. TLM channels are represented by virtual channels which efficiently route transactions between SW and HW entities through the on-chip communication network while respecting quality-of-service and real-time requirements. The goal is to methodically simplify MPSoC design by systematic HW/SW interface abstraction, thus enabling early SW verification, rapid prototyping and fast exploration of critical design issues. With TRAIN, we present our implementation of such a VTL architecture for Virtex-II Pro and PowerPC and illustrate its efficiency by experimentation.
This paper presents the design and full prototype implementation of a configurable multiprocessor platform that supports distributed execution of applications described in UML 2.0. The platform comprises multiple Altera Nios II softcore processors and custom hardware accelerators connected by the Heterogeneous IP Block Interconnection (HIBI) communication architecture. Each processor has a local copy of the eCos real-time operating system for the scheduling of multiple application threads. The mapping of a UML application onto the proposed platform is demonstrated by distributing a WLAN medium access control protocol over multiple CPUs. The experiments performed on FPGA show that our approach raises the abstraction level of system design. To our knowledge, this is the first real implementation combining a high-level design flow with a synthesizable platform.
This paper presents a new multiprocessor platform for high-throughput turbo decoding. The proposed platform is based on a new configurable ASIP combined with an efficient memory and communication interconnect scheme. This Application-Specific Instruction-set Processor has a SIMD architecture with a specialized and extensible instruction set and five-stage pipeline control. The attached memories and communication interfaces enable the design of efficient multiprocessor architectures. These multiprocessor architectures benefit from the shuffling technique recently introduced in the turbo-decoding field to reduce communication latency. The major characteristics of the proposed platform are its flexibility and scalability, which make it reusable for various standards and operating modes. Results obtained for double-binary DVB-RCS turbo codes demonstrate a 100 Mbit/s throughput using a 16-ASIP multiprocessor architecture.