Growth of the semiconductor industry has been driven by a series of electronic system applications, such as personal computers, home entertainment, and mobile handsets. The most recent growth is driven by the revolution of the information technology (IT) industry, and the key word of this next revolution is "ubiquitous". As semiconductor technology is scaled into the nanometer regime, where hundreds of millions of transistors can be placed on a chip, designers are incorporating their advanced system concepts into silicon. These systems include digital, analogue, and RF components. System-on-a-Chip (SoC) enables the IT industry to realise products that comply with rapidly changing market requirements as well as with an unprecedented ubiquitous lifestyle. However, SoC products in the ubiquitous era face the simultaneous demands of high performance, low power, small size, and low cost. These factors may jeopardise the success of SoC unless there are breakthroughs from system-level design through manufacturing technologies. Advanced EDA technology is indispensable to cope with the ever-increasing design complexity of gigascale integration and the complicated physical effects inherent in nanoscale technology. In this talk, the speaker will provide an overview of the key challenges facing SoC development in the days to come, namely issues in system-level design, low power, high performance, verification, and the relevant nanometer technology. Solutions, including some of Samsung's recent R&D activities in these areas, will be discussed, and the speaker will conclude by arguing that meeting these challenges promises endless possibilities for the SoC.
Today's semiconductor marketplace demands nanometer designs of unprecedented complexity and performance, with uncompromising time-to-market requirements. This drives a focus on predictable, high-quality design results despite the challenges associated with these next-generation technologies. The scenario is complicated even further by the need to address these challenges across a wide spectrum of products, ranging from high-frequency processor designs to extremely complex ASIC designs. In the nanometer era, the common factor for ensuring market leadership across this broad variety of products is achieving single-pass design success to avoid costly re-spins and the loss of market opportunities: design turnaround time must be minimized without compromising design efficiency and first-time-right requirements. Design automation tools must balance both requirements, while providing designers with information that enables them to "design around" potential trouble spots in both today's and tomorrow's environment to ensure an exceptional level of built-in quality. This discussion highlights some of the innovations IBM is developing, such as variation-aware and statistical timing, faster serial and parallel processing, more highly integrated data models and tools, and concurrent chip and package design, which balance the competing requirements of reducing design turnaround time and achieving single-pass design success, while effectively managing the technical challenges associated with nanometer designs.
The aggressive application of scalar replacement to array references substantially reduces the number of memory operations at the expense of a possibly very large number of registers. In this paper we describe a register allocation algorithm that assigns registers to scalar-replaced array references along the critical paths of a computation, in many cases exploiting the opportunity for concurrent memory accesses. Experimental results, for a set of image/signal processing code kernels, reveal that the proposed algorithm leads to a substantial reduction in the number of execution cycles for the corresponding hardware implementation on a contemporary Field-Programmable Gate Array (FPGA) when compared to other greedy allocation algorithms, in some cases while using even fewer registers.
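To make the transformation concrete, the following sketch (illustrative only; the paper targets FPGA hardware, where each scalar below would become a register, and the kernel and names are our own) shows scalar replacement applied to a 3-tap filter: repeated array references are promoted to scalars that rotate across iterations, trading memory operations for registers.

    # Illustrative sketch of scalar replacement on a 3-tap filter kernel.
    # The kernel and names are hypothetical examples, not from the paper.

    def fir3_naive(x, h):
        # three memory reads of x per iteration
        return [h[0]*x[i] + h[1]*x[i+1] + h[2]*x[i+2]
                for i in range(len(x) - 2)]

    def fir3_scalar_replaced(x, h):
        # after scalar replacement: one new read per iteration;
        # x0, x1, x2 act as a small rotating register file
        h0, h1, h2 = h          # coefficients also promoted to scalars
        x0, x1 = x[0], x[1]
        out = []
        for i in range(2, len(x)):
            x2 = x[i]           # the only array read in the loop body
            out.append(h0*x0 + h1*x1 + h2*x2)
            x0, x1 = x1, x2     # rotate registers
        return out

    assert fir3_naive([1, 2, 3, 4, 5], [1, 1, 1]) == \
           fir3_scalar_replaced([1, 2, 3, 4, 5], [1, 1, 1])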
Coarse-grained reconfigurable architectures aim to achieve both high performance and flexibility. However, existing reconfigurable array architectures require many resources without considering the specific application domain. Functional resources that have long latency and/or large area can be pipelined and/or shared among the processing elements, so the hardware cost and delay can be reduced effectively, without any performance degradation, for some application domains. We propose such a reconfigurable array architecture template and a design space exploration flow for domain-specific optimization. Experimental results show that our approach is much more efficient in both performance and area than existing reconfigurable architectures.
Field programmable gate arrays (FPGAs) provide designers with the ability to quickly create hardware circuits. Increases in FPGA configurable logic capacity and decreasing FPGA costs have enabled designers to more readily incorporate FPGAs in their designs. FPGA vendors have begun providing configurable soft processor cores that can be synthesized onto their FPGA products. While FPGAs with soft processor cores provide designers with increased flexibility, such processors typically have degraded performance and energy consumption compared to hard-core processors. Previously, we proposed warp processing, a technique capable of optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic. In this paper, we study the potential of a MicroBlaze soft-core based warp processing system to eliminate the performance and energy overhead of a soft-core processor compared to a hard-core processor. We demonstrate that the soft-core based warp processor achieves an average speedup of 5.8x and energy reductions of 57% compared to the soft core alone. Our data shows that a soft-core based warp processor yields performance and energy consumption competitive with existing hard-core processors, thus expanding the usefulness of soft processor cores on FPGAs to a broader range of applications.
Keywords: Hardware/software partitioning, warp processing, FPGA, dynamic optimization, soft cores, MicroBlaze.
This paper presents a System-on-a-Chip (SoC) architecture for Elliptic Curve Cryptosystems (ECC) which targets reconfigurable hardware. A four-level partitioning scheme is described for exploring the area and speed trade-offs. A design generator is used to generate parameterisable building blocks for the configurable SoC architecture. A secure web server, which runs on a reconfigurable soft processor and an embedded hard processor, shows a speedup of over 2000 times when the computationally intensive operations run on the customised building blocks. The embedded on-chip timer block gives accurate performance information. The design factors of configurable SoC architectures are also discussed and evaluated.
This paper presents an infrastructure to test the functionality of the specific architectures output by a high-level compiler targeting dynamically reconfigurable hardware. The result is a suitable scheme for verifying the architectures generated by the compiler whenever new optimization techniques are included or changes to the compiler are made. We believe this kind of infrastructure is important for verifying, by functional simulation, future research techniques in compilation to Field-Programmable Gate Array (FPGA) platforms.
This paper presents a novel FPGA architecture for implementing various styles of asynchronous logic. The main objective is to break the dependency between an FPGA architecture dedicated to asynchronous logic and the logic style itself. The innovative aspects of the architecture are described. Moreover, thanks to its genericity, the structure is well suited to being rebuilt and adapted to fit future evolutions of asynchronous logic. A full-adder was implemented in different styles of logic to demonstrate the architecture's flexibility.
This special session addresses the problems that designers face when implementing analog and digital circuits in nanometer technologies. An introductory embedded tutorial will give an overview of the design problems at hand: leakage power and process variability and their implications for digital circuits and memories, and the shrinking supply voltages, design productivity, and signal integrity problems for embedded analog blocks. Next, a panel of experts from industrial semiconductor houses, design companies, EDA vendors, and research institutes will present and discuss with the audience their opinions on whether or not the design road ends at the "65nm" marker.
Multi-site testing is a popular and effective way to increase test throughput and reduce test costs. We present a test throughput model, in which we focus on wafer testing and consider parameters such as test time, index time, abort-on-fail, and contact yield. Conventional multi-site testing requires sufficient ATE resources, such as ATE channels, to allow multiple SOCs to be tested in parallel. In this paper, we design and optimize on-chip DfT in order to maximize the test throughput for a given SOC and ATE. The on-chip DfT consists of an E-RPCT wrapper and, for modular SOCs, module wrappers and TAMs. We present experimental results for a Philips SOC and several ITC'02 SOC Test Benchmarks.
Many SOCs today contain both digital and analog embedded cores. Even though the test cost for such mixed-signal SOCs is significantly higher than that for digital SOCs, most prior research in this area has focused exclusively on digital cores. We propose a low-cost test development methodology for mixed-signal SOCs that allows the analog and digital cores to be tested in a unified manner, thereby minimizing the overall test cost. The analog cores in the SOC are wrapped such that they can be accessed using a digital test access mechanism (TAM). We evaluate the impact of the use of analog test wrappers on area overhead and test time. To reduce area overhead, we present an analog test wrapper optimization technique, which is then combined with TAM optimization in a cost-oriented heuristic approach for test scheduling. We also demonstrate the feasibility of using analog wrappers by presenting transistor-level simulations for an analog wrapper and a representative core. We present experimental results on test scheduling for an ITC'02 benchmark SOC that has been augmented with five analog cores.
This paper addresses delay test for SOC devices with high frequency clock domains. A logic design for on-chip high-speed clock generation, implemented to avoid expensive test equipment, is described in detail. Techniques for on-chip clock generation, meant to reduce test vector count and to increase test quality, are discussed. ATPG results for the proposed techniques are given.
The increasing complexity and short life cycles of embedded systems are pushing current system-on-chip designs toward a rapidly increasing number of programmable processing units and a decreasing gate count for custom logic. Considering this trend, this work proposes a test planning method capable of reusing available processors as test sources and sinks, and the on-chip network as the test access mechanism. Experimental results are based on the ITC'02 benchmarks and on two open core processors compliant with the MIPS and SPARC instruction sets. The results show that the cooperative use of both the on-chip network and the embedded processors can increase test parallelism and reduce test time without additional cost in area and pins.
Many techniques for synthesizing digital hardware from C-like languages have been proposed, but none has emerged as successful as Verilog or VHDL for register-transfer-level design. This paper looks at two of the fundamental challenges: concurrency and timing control. Familiarity is the main reason C-like languages have been proposed for hardware synthesis. Synthesize hardware from C, proponents claim, and we will be able to turn a C programmer into a hardware designer. Another common motivation is hardware/software codesign: today's systems usually contain a mix of hardware and software, and it is often unclear initially which portions to implement in hardware. Here, using a single language should simplify the migration task.
Software Thread Integration (STI) [1] and Asynchronous STI (ASTI) [2] are compiler techniques which interleave functions from separate program threads at the assembly language level, creating implicitly multithreaded functions which provide low-cost concurrency on generic hardware. This extends the reach of software and reduces the need to rely upon dedicated hardware. STI and ASTI are driven by two types of timing requirements: thread-level (e.g. the delay between an event occurring and a service thread running) and instruction-level (e.g. when a specific instruction or code region must begin executing relative to the start of the function or another such instruction or region). These coarse- and fine-grain approaches provide a precise method of defining timing requirements. STI provides synchronous thread progress: both functions proceed in lock-step. ASTI provides asynchronous (independent) thread progress through the use of lightweight context switches (coroutine calls) between primary and secondary threads. The primary thread has hard real-time constraints, while the secondary thread is not real-time, or has much longer deadlines. We assume that instructions take a predictable number of cycles to execute. This implies a straightforward instruction execution pipeline (if used) and a predictable memory system (e.g. the cache is locked, software managed, or not present). These requirements are met for the processors we target: 8- and 16-bit microcontrollers. We target applications with only one hard real-time thread (the primary thread, used for the communication protocol), although recent extensions to STI [3] support multiple hard real-time primary threads. We have built a thread-integrating compiler, Thrint, which implements many of these analyses and transformations for the AVR architecture, an 8-bit load/store architecture optimized for embedded C code.
Traditionally, system design has been approached from a black-box, functionality-only perspective which forces the developer to concentrate on how the functionality can be decomposed and recomposed into so-called components. While this technique is well established and well known, it does suffer from some drawbacks; namely, the systems produced can often be forced into certain incompatible architectures, be difficult to maintain or reuse, and the code itself can be difficult to debug. Now that ideas such as the OMG's Model Driven Architecture (MDA) or Model-Based Engineering (MBE) and the ubiquitous modelling language UML are being used (allegedly) and desired, we face a number of challenges to existing techniques. When working with the UML, one must take object orientation into consideration. The UML is a language for expressing systems (or whatever) in terms of object-oriented concepts, and its meta-model and semantics make this explicit. Object orientation, unlike function-based approaches, makes both functionality and data first-class modelling elements. Whenever anything is specified in UML, that modelling element is either based on the notion of a class or is directly related to a class. Some methods appear to adhere to this but fail to use classes in this way, assuming the existence of a "global" system and then just using classes as data elements: effectively the UML equivalent of programming Fortran in C++.
The problem of determining lower bounds for the energy cost of a given nanoscale design is addressed via a complexity-theory-based approach. This paper provides a theoretical framework that is able to assess the trade-offs existing in nanoscale designs between the amount of redundancy needed for a given level of resilience to errors and the associated energy cost. Circuit size, logic depth, and error resilience are analyzed and brought together in a theoretical framework that can be seamlessly integrated with automated synthesis tools and can guide the design process of nanoscale systems comprised of failure-prone devices. The impact of redundancy addition on the switching energy and its relationship with leakage energy is modeled in detail. Results show that 99% error resilience is possible for fault-tolerant designs, but at the expense of at least 40% more energy if individual gates fail independently with a probability of 1%.
On-chip buses are typically designed to meet performance constraints at worst-case conditions, including process corner, temperature, IR-drop, and neighboring net switching pattern. This can result in significant performance slack at more typical operating conditions. In this paper, we propose a dynamic voltage scaling (DVS) technique for buses, based on a double sampling latch which can detect and correct for delay errors without the need for retransmission. The proposed approach recovers the available slack at non-worst-case operating points through more aggressive voltage scaling and tracks changing conditions by monitoring the error recovery rate. Voltage margins needed in traditional designs to accommodate worst-case performance conditions are therefore eliminated, resulting in a significant improvement in energy efficiency. The approach was implemented for a 6mm memory read bus operating at 1.5GHz (0.13 μm technology node) and was simulated for a number of benchmark programs. Even at the worst-case process and environment conditions, energy gains of up to 17% are achieved, with error recovery rates under 2.3%. At more typical process and environment conditions, energy gains range from 35% to 45%, with a performance degradation under 2%. An analysis of optimum interconnect architectures for maximizing energy gains with this approach shows that the proposed approach performs well with technology scaling.
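A minimal control-loop sketch of the general idea (our illustration, not the paper's controller; all thresholds, step sizes, and the interface are assumed): scale the bus voltage down while the observed error recovery rate stays below a target, and back off when it rises.

    # Hedged sketch of error-rate-driven dynamic voltage scaling.
    # Thresholds, step sizes, and voltage bounds are hypothetical.

    V_MIN, V_MAX, V_STEP = 0.9, 1.5, 0.05   # volts
    TARGET_ERROR_RATE = 0.02                # tolerate ~2% recoveries

    def dvs_controller(voltage, errors, transfers):
        """One control interval: adjust bus voltage from the measured
        error recovery rate of the double-sampling latches."""
        rate = errors / max(transfers, 1)
        if rate > TARGET_ERROR_RATE:
            voltage = min(voltage + V_STEP, V_MAX)  # too aggressive: back off
        else:
            voltage = max(voltage - V_STEP, V_MIN)  # slack available: scale down
        return voltage

    v = 1.5
    for errors, transfers in [(0, 10000), (5, 10000), (400, 10000)]:
        v = dvs_controller(v, errors, transfers)
        print(f"error rate {errors/transfers:.3f} -> bus voltage {v:.2f} V")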
This paper presents a scheme that combines memory and power management to achieve better energy reduction. Our method periodically adjusts the size of physical memory and the timeout value for shutting down a hard disk in order to reduce the average power consumption. We use Pareto distributions to model the distribution of idle times. The parameters of the distribution are adjusted at run-time to calculate the corresponding timeout value for the disk power management. The memory size is changed based on the inclusion property to predict the number of disk accesses at different memory sizes. Experimental results show more than 50% energy savings compared to a 2-competitive fixed-timeout method.
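The flavor of the computation can be sketched as follows (a simplified illustration under assumed energy constants and a fixed candidate grid, not the paper's exact formulation): fit a Pareto shape parameter to observed idle periods by maximum likelihood, then pick the spin-down timeout that minimizes expected energy over the fitted distribution.

    # Sketch: run-time Pareto fit of disk idle times and timeout selection.
    # Energy constants, the horizon, and the candidate grid are assumed.
    import math

    P_IDLE, P_SLEEP = 1.0, 0.1     # watts (assumed)
    E_WAKE = 6.0                   # joules per spin-up (assumed)
    X_MIN = 1.0                    # Pareto scale: shortest modeled idle (s)

    def fit_pareto_shape(idle_times):
        """Maximum-likelihood shape parameter for Pareto(X_MIN, alpha)."""
        n = len(idle_times)
        return n / sum(math.log(t / X_MIN) for t in idle_times)

    def expected_energy(timeout, alpha, horizon=10000, steps=10000):
        """Average energy of 'spin down after timeout' over the fitted
        idle-time distribution, via inverse-CDF quadrature."""
        total = 0.0
        for k in range(steps):
            u = (k + 0.5) / steps
            t = min(X_MIN / (1.0 - u) ** (1.0 / alpha), horizon)
            if t <= timeout:
                total += P_IDLE * t                 # never spun down
            else:
                total += P_IDLE * timeout + P_SLEEP * (t - timeout) + E_WAKE
        return total / steps

    idle_log = [1.2, 2.0, 3.5, 8.0, 30.0, 120.0]   # observed idle periods (s)
    alpha = fit_pareto_shape(idle_log)
    best = min((expected_energy(t, alpha), t) for t in [1, 2, 5, 10, 20, 40])
    print(f"alpha = {alpha:.2f}, best timeout = {best[1]} s")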
With the scaling of technology and higher requirements on performance and functionality, power dissipation is becoming one of the major design considerations in the development of network processors. In this paper, we use an assertion-based methodology for system-level power/performance analysis to study two dynamic voltage scaling (DVS) techniques, traffic-based DVS and execution-based DVS, in a network processor model. Using automatically generated distribution analyzers, we analyze the power and performance distributions and study their trade-offs for the two DVS policies with different parameter settings such as threshold values and window sizes. We discuss the optimal configurations of the two DVS policies under different design requirements. Through a set of experiments, we show that the assertion-based trace analysis methodology is an efficient tool that can help a designer easily compare and study optimal architectural configurations in a large design space.
Although the huge reconfiguration latency of the available FPGA platforms is a well-known shortcoming of current FCCMs, little research in instruction scheduling has been undertaken to eliminate or diminish its negative influence on performance. In this paper, we introduce an instruction scheduling algorithm that minimizes the number of executed hardware reconfiguration instructions, taking into account the "FPGA area placement conflicts" between the available configurations. The algorithm is based on compiler analyses and feedback-directed techniques, and it can switch an operation from hardware execution to software execution when the reconfiguration latency cannot be reduced. The algorithm has been tested on the M-JPEG encoder application with real hardware implementations of the DCT, Quantization, and VLC operations. Based on simulation results, we determine that, while a simple scheduling produces a significant performance decrease, our proposed scheduling contributes up to a 16x M-JPEG encoder speedup.
Due to the emergence of highly dynamic multimedia applications, there is a need for flexible platforms and run-time scheduling support for embedded systems. Dynamic Reconfigurable Hardware (DRHW) is a promising candidate to provide this flexibility, but sufficient run-time scheduling support to deal with run-time reconfigurations does not currently exist. Moreover, executing a complex scheduling heuristic at run-time to provide this support may generate an excessive run-time penalty. Hence, we have developed a hybrid design-time/run-time prefetch heuristic that schedules the reconfigurations at run-time, but carries out the scheduling computations at design-time by carefully identifying a set of near-optimal schedules that can be selected at run-time. This approach provides run-time flexibility with a negligible penalty.
FPGAs, as computing devices, offer significant speedup over microprocessors. Furthermore, their configurability offers an advantage over traditional ASICs. However, they do not yet enjoy high-level language programmability, as microprocessors do. This has become the main obstacle to their wider acceptance by application designers. ROCCC is a compiler designed to generate circuits from C source code to execute on FPGAs, more specifically on CSoCs. It generates RTL-level HDLs from frequently executing kernels in an application. In this paper, we describe ROCCC's system overview and focus on its data path generation. We compare the performance of ROCCC-generated VHDL code with that of Xilinx IPs. The synthesis results show that the ROCCC-generated circuits take around 2x to 3x the area and run at a comparable clock rate.
This paper presents a novel method for the simulation of sampled systems with weakly nonlinear behavior. These systems can be characterized by adding weakly nonlinear terms to the linear state-space equations of the system, resulting in an extended state-space model. Perturbation theory is used to split these equations into an ideal linear behavior and a non-ideal small perturbation. The linear equations are solved analytically, which reduces simulation time compared to numerical evaluation. The solution of the perturbation equations is approximated by orthogonal polynomials. This methodology not only reduces simulation time compared to traditional numerical simulations, but also deals naturally with clock jitter and the discontinuous behavior of sampled systems. An implementation of the methodology has been used to analyze systems including switched filters and continuous-time ΔΣ modulators.
Process variations play an increasingly important role on the success of analog circuits. State-of-the-art analog circuits are based on complex architectures and contain many hierarchical layers and parameters. Knowledge of the parameter variances and their contribution patterns is crucial for a successful design process. This information is valuable to find solutions for many problems in design, design automation, testing, and fault tolerance. In this paper, we present a hierarchical variance analysis methodology for analog circuits. In the proposed method, we make use of previously computed values whenever possible so as to reduce computational time. Experimental results indicate that the proposed method provides both accuracy and computational efficiency when compared with prior approaches.
In this paper, we highlight a fast, effective, and practical statistical approach that deals with inter- and intra-die variations in VLSI chips. Our methodology is applied to a number of random variables while accounting for spatial correlations. It sorts the Probability Density Functions (PDFs) of the critical paths of a circuit based on a confidence point. We show the mathematical accuracy of our method and implement a prototype program to test it on various benchmarks. We find that worst-case analysis over-estimates path delays by more than 50% and that a path's probabilistic rank with respect to delay is very different from its deterministic rank.
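As an illustration of confidence-point ranking (a sketch under the common Gaussian path-delay assumption; the paths, numbers, and the 95% level are our own examples): each path's delay PDF is reduced to its delay value at a chosen confidence level, and paths are sorted by that scalar rather than by nominal delay.

    # Sketch: rank critical paths by a confidence point of their delay PDFs.
    # The Gaussian model and the 95% confidence level are assumptions.
    from statistics import NormalDist

    Z95 = NormalDist().inv_cdf(0.95)       # ~1.645

    paths = {                              # path -> (mean delay, std dev), ns
        "p1": (10.0, 0.2),
        "p2": (9.6, 0.9),
        "p3": (9.9, 0.4),
    }

    def confidence_point(mu, sigma, z=Z95):
        return mu + z * sigma

    ranked = sorted(paths, key=lambda p: confidence_point(*paths[p]),
                    reverse=True)
    print(ranked)   # ['p2', 'p3', 'p1']; ranking by mean alone gives p1 first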
This paper presents the novel idea of multi-placement structures for fast and optimized placement instantiation in analog circuit synthesis. These structures need to be generated only once for a specific circuit topology. When used in synthesis, the pre-generated structures instantiate various layout floorplans for various sizes and parameters of a circuit. Unlike procedural layout generators, they enable fast placement of circuits while keeping the quality of the placements at a high level during a synthesis process. The fast placement is a result of high-speed instantiation enabled by the efficiency of the multi-placement structure; the good quality of the placements derives from the extensive and intelligent search process used to build the structure. The target benchmarks are analog circuits in the vicinity of 25 modules. An algorithm for the generation of such multi-placement structures is presented. Experimental results show placement execution times averaging a few milliseconds, making the structures usable during layout-aware synthesis for optimized placements.
A multi-channel waveform monitoring technique enhances the built-in test and diagnostic capability of mixed-signal VLSI circuits. An 8-channel prototype system incorporates adaptive sample time generation with a 10-bit variable-step delay generator and algorithmic digitization with a 10-bit incremental reference voltage generator. The prototype, in a 0.18-μm CMOS technology, demonstrated on-chip waveform acquisition at 40-ps and 200-μV resolutions. The waveforms were as accurate as those obtained by an off-chip measurement technique, while more than a 95% reduction in the wasted time of waveform monitoring was achieved. A single waveform acquisition kernel occupied an area of 700 μm x 600 μm and was shared by 8 front-end modules of 60 μm x 200 μm each. The developed on-chip multi-channel waveform monitoring technique is waveform accurate, area efficient, and low cost, all requisite factors for a diagnosis methodology addressing mixed analog and digital signal integrity in the system-on-a-chip era.
This paper describes two research projects that develop new low-cost techniques for testing devices with multiple high-speed (2 to 5 Gbps) signals. Each project uses commercially available components to keep costs low, yet achieves performance characteristics comparable to (and in some ways exceeding) more expensive ATE. A common CMOS FPGA-based logic core provides flexibility, adaptability, and communication with controlling computers, while customized positive emitter-coupled logic (PECL) achieves multi-gigahertz data rates with about ±25 ps timing accuracy.
A technique for evaluating noise figure suitable for BIST implementation is described. It is based on a low cost single-bit digitizer, which allows the simultaneous evaluation of noise figure in several test points of the analog circuit. The method is also able to benefit from SoC resources, like memory and processing power. Theoretical background and experimental results are presented in order to demonstrate the feasibility of the approach.
Testing a non-digital integrated system against all of its specifications can be quite expensive due to the elaborate test application and measurement setup required. We propose to eliminate redundant tests by employing ε-SVM based statistical learning. Application of the proposed methodology to an operational amplifier and a MEMS accelerometer reveals that redundant tests can be statistically identified from a complete set of specification-based tests with negligible error. Specifically, after eliminating five of seven specification-based tests for the operational amplifier, the defect escape and yield loss are small, at 0.6% and 0.9%, respectively. For the accelerometer, a defect escape of 0.2% and a yield loss of 0.1% occur when the hot and cold tests are eliminated; this level of compaction would reduce test cost by more than half.
This paper studies defect-oriented test techniques for RF components in order to optimize production test sets. This study is mandatory for the definition of an efficient test flow strategy. We have carried out a fault simulation campaign for a Low-Noise Amplifier (LNA) to reduce the test set while maintaining high fault coverage. The set of production test measurements should include low-cost structural tests, such as a simple current consumption measurement, and only a few more sophisticated tests dedicated to functional specifications such as S parameters, Noise Figure (NF), or IP3.
The analogue test standard IEEE 1149.4 mainly targets low-frequency testing. The problem studied in this paper is extending the standard to radio-frequency testing. The IEEE 1149.4 compatible measurement structures (ABMs) developed in this study extract the information being measured from the radio-frequency signal and represent the result as a DC voltage level. The ABMs presented in this paper are targeted at power and frequency measurements operating at frequencies from 1 GHz to 2 GHz. The power measurement error caused by temperature, supply voltage, and process variations is roughly 2 dB, and the frequency measurement error is 0.1 GHz.
This paper discusses the suitability of the fault-trajectory approach for fault diagnosis in analog networks. Recent works have shown promising results for an ATPG method based on this concept for diagnosing faults in analog networks. The method relies on evolutionary techniques, in which a genetic algorithm (GA) is coded to generate a set of optimum frequencies capable of disclosing faults.
Security is emerging as an important concern in embedded system design. The security of embedded systems is often compromised due to vulnerabilities in "trusted" software that they execute. Security attacks exploit these vulnerabilities to trigger unintended program behavior, such as the leakage of sensitive data or the execution of malicious code. In this work, we present a hardware-assisted paradigm to enhance embedded system security by detecting and preventing unintended program behavior. Specifically, we extract properties of an embedded program through static program analysis, and use them as the bases for enforcing permissible program behavior in real-time as the program executes. We present an architecture for hardware-assisted run-time monitoring, wherein the embedded processor is augmented with a hardware monitor that observes the processor's dynamic execution trace, checks whether the execution trace falls within the allowed program behavior, and flags any deviations from the expected behavior to trigger appropriate response mechanisms. We present properties that can be used to capture permissible program behavior at different levels of granularity within a program, namely inter-procedural control flow, intra-procedural control flow, and instruction stream integrity. We also present a systematic methodology to design application-specific hardware monitors for any given embedded program. We have evaluated the hardware requirements and performance of the proposed architecture for several embedded software benchmarks. Hardware implementations using a commercial design flow, and architectural simulations using the SimpleScalar framework, indicate that the proposed technique can thwart several common software and physical attacks, facilitating secure program execution with minimal overheads.
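The control-flow checking idea can be sketched at the inter-procedural level (a minimal software model of what the hardware monitor enforces; the program, function names, and trace format are hypothetical): static analysis yields the set of permissible call edges, and the monitor flags any runtime transition outside that set.

    # Minimal sketch of inter-procedural control-flow monitoring: the
    # hardware would hold the allowed-edge table and compare it against
    # the dynamic execution trace. Program and names are hypothetical.

    ALLOWED_CALLS = {            # caller -> permissible callees (static CFG)
        "main":    {"read_key", "encrypt"},
        "encrypt": {"sbox_lookup"},
    }

    def monitor(trace):
        """Flag the first call edge not present in the static call graph."""
        for caller, callee in trace:
            if callee not in ALLOWED_CALLS.get(caller, set()):
                return f"violation: {caller} -> {callee}"
        return "trace OK"

    good = [("main", "read_key"), ("main", "encrypt"),
            ("encrypt", "sbox_lookup")]
    bad = good + [("encrypt", "system")]     # injected-code behavior
    print(monitor(good))   # trace OK
    print(monitor(bad))    # violation: encrypt -> system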
As the scale of electronic devices shrinks, "electronic textiles" (e-textiles) will make possible a wide variety of novel applications which are currently unfeasible. Due to wearability concerns, low-power techniques are critical for e-textile applications. In this paper, we address the issue of energy-aware routing for e-textile platforms and propose an efficient algorithm to solve it. The platform we consider consists of dedicated components for e-textiles, including computational modules, dedicated transmission lines, and thin-film batteries on fiber substrates. Furthermore, we derive an analytical upper bound for the achievable number of jobs completed over all possible routing strategies. From a practical standpoint, for the Advanced Encryption Standard (AES) cipher, the routing technique we propose achieves about fifty percent of this analytical upper bound. Moreover, compared to its non-energy-aware counterpart, our routing technique increases the number of encryption jobs completed by one order of magnitude.
The lifetime of wireless sensor networks can be increased by minimizing the number of active nodes that provide complete coverage, while switching off the rest. In this paper, we propose a distributed and scalable node-scheduling algorithm that conserves overall system energy by minimizing the number of active nodes, localizing the execution to the dying sensor(s), and minimizing the frequency of execution by reacting only to the occurrence of a sensing hole. This yields an increased system lifetime while maintaining coverage above an application-defined threshold value. We compare our algorithm to a network with a centralized node-scheduling algorithm. Our results show an equivalent coverage degree over a wide range of sensor networks.
Keywords: Wireless Sensor Network, Coverage, Set Cover
Wireless microsensor networks, which have been the topic of intensive research in recent years, are now emerging in industrial applications. An important milestone in this transition has been the release of the IEEE 802.15.4 standard that specifies interoperable wireless physical and medium access control layers targeted to sensor node radios. In this paper, we evaluate the potential of an 802.15.4 radio for use in an ultra low power sensor node operating in a dense network. Starting from measurements carried out on the off-the-shelf radio, effective radio activation and link adaptation policies are derived. It is shown that, in a typical sensor network scenario, the average power per node can be reduced down to 211mW. Next, the energy consumption breakdown between the different phases of a packet transmission is presented, indicating which part of the transceiver architecture can most effectively be optimized in order to further reduce the radio power, enabling self-powered wireless microsensor networks.
In this paper, we study a communication/sensing network that comprises a large number of radio-enabled sensors. These sensors are either randomly or deterministically placed within a certain region to monitor events that are spatially and temporally independent of each other. Possible applications include habitat and climate monitoring, diagnosing faults in industrial supply lines, measuring data such as traffic intensity, and detecting human/vehicular intrusion. The sensor nodes in these networks are powered by a battery with limited power, which is dissipated during data transmission/reception. A cheap and effective approach is to replace the sensor nodes in due course instead of replenishing their batteries. Thus, the objective is to find the replacement time Tr such that none of the sensor nodes runs out of battery (disconnects) before Tr. An alternative way of formulating this problem is to find the lifetime T of the network, defined as the time after which the first node in the network disconnects. Studies evaluating lifetime models of sensor networks have been done before in [1], [2], [4]. However, the primary difference between previous approaches and our work is that we specifically model a data generation process at each individual sensor node, where each node covers a certain area and the amount of data generated at a node is proportional to its coverage area.
As technology scales down, static power is expected to become a significant fraction of the total power. The exponential dependence of static power on the operating temperature makes the thermal profile estimation of high-performance ICs a key issue in computing the total power dissipated by next-generation designs. In this paper we present accurate and compact analytical models to estimate the static power dissipation and the operating temperature of CMOS gates. The models are the foundation of a performance estimation tool in which numerical procedures are avoided for all computations, enabling faster estimation and optimization. The models developed are compared to measurements and SPICE simulations for a 0.12 μm technology, showing excellent results.
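The essential coupling can be sketched as a fixed-point computation (an illustrative model with assumed constants, not the paper's calibrated equations): static power grows exponentially with temperature, while temperature rises with dissipated power through the package thermal resistance.

    # Sketch of the electrothermal coupling behind static-power estimation.
    # All constants are illustrative, not the paper's fitted values.
    import math

    P0 = 0.5       # static power at reference temperature (W), assumed
    T0 = 300.0     # reference temperature (K)
    EA = 0.25      # activation-like constant (eV), assumed
    KB = 8.617e-5  # Boltzmann constant (eV/K)
    P_DYN = 2.0    # dynamic power, treated as temperature-independent (W)
    THETA = 5.0    # junction-to-ambient thermal resistance (K/W), assumed
    T_AMB = 320.0  # ambient temperature (K)

    def p_static(T):
        # exponential growth of leakage with temperature
        return P0 * math.exp(EA / KB * (1.0 / T0 - 1.0 / T))

    T = T_AMB
    for _ in range(100):                   # fixed-point iteration
        T_next = T_AMB + THETA * (P_DYN + p_static(T))
        if abs(T_next - T) < 1e-6:
            break
        T = T_next
    print(f"T = {T:.1f} K, static power = {p_static(T):.2f} W")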
In this paper, two packing algorithms for the detection of activity profiles in MTCMOS-based FPGA structures are proposed for leakage power mitigation. The first algorithm is a connection-based packing technique by which the proximity of the logic blocks is accounted for, and the second algorithm is a logic-based packing approach by which the weighted Hamming distance between the blocks' activities is considered. After both algorithms are analyzed, they are applied to a number of FPGA benchmarks for verification. Once the activity profiles are realized, sleep transistors are carefully positioned to contain the clustered blocks that share similar activity profiles. Finally, the percentage of leakage power savings for each of the two algorithms is evaluated.
In this paper, we provide a methodology to perform bus partitioning and bus frequency assignment to each bus segment simultaneously, while optimizing both the power consumption and the performance of the system. We use a genetic algorithm and design an appropriate cost function which evaluates a solution on the basis of its power consumption and performance. The evaluation of our approach using a set of multiprocessor applications shows an average reduction in energy consumption of 60% over a single shared bus architecture. Our results also show that it is beneficial to assign bus frequencies and perform bus partitioning simultaneously instead of performing them sequentially.
In nanometer-scale CMOS devices, significant increases in the subthreshold, gate, and reverse-biased junction band-to-band tunneling (BTBT) leakage result in a large increase in the total leakage power of a logic circuit. The leakage components interact with each other at the device level (through device geometry and doping profile) and also at the circuit level (through node voltages). Due to the circuit-level interaction of the different leakage components, the leakage of a logic gate strongly depends on the circuit topology, i.e., the number and nature of the other logic gates connected to its input and output. In this paper, for the first time, we analyze the loading effect on leakage and propose a method to accurately estimate the total leakage in a logic circuit from its logic-level description, considering the impact of loading and transistor stacking.
On-chip networks have been proposed as the interconnect fabric for future systems-on-chip and multi-processors on chip. Power is one of the main constraints of these systems, and interconnect consumes a significant portion of the power budget. In this paper, we propose four leakage-aware interconnect schemes. Across the schemes, we achieve 10.13%~63.57% active leakage savings and 12.35%~95.96% standby leakage savings, while the delay penalty ranges from 0% to 4.69%.
Run-time management of both communication and computation resources in a heterogeneous Network-on-Chip (NoC) is a challenging task. First, platform resources need to be assigned in a fast and efficient way. Second, the resources might need to be reallocated when platform conditions or user requirements change. We developed a run-time resource management scheme that is able to efficiently manage a NoC containing fine-grain reconfigurable hardware tiles. This paper details our task assignment heuristic and two run-time task migration mechanisms that deal with the message consistency problem in a NoC. We show that specific support for reconfigurable hardware tiles improves the performance of the heuristic and that task migration mechanisms need to be tailored to on-chip networks.
Vendor-provided softcore processors often support advanced features such as caching that work well in uniprocessor or uncoupled multiprocessor architectures. However, it is a challenge to implement Symmetric Multiprocessor on a Programmable Chip (SMPoPC) systems using such processors. This paper presents an implementation of a tightly-coupled, cache-coherent symmetric multiprocessing architecture using a vendor-provided softcore processor. Experimental results show that this implementation can be achieved without invasive changes to the vendor-provided softcore processor and without degradation of the performance of the memory system.
Current Systems-on-Chip (SoC) execute applications that demand extensive parallel processing. Networks-on-Chip (NoC) provide a structured way of realizing interconnections on silicon and obviate the limitations of bus-based solutions. NoCs can have regular or ad hoc topologies, and functional validation is essential to assess their correctness and performance. In this paper, we present a flexible emulation environment implemented on an FPGA that is suitable for exploring, evaluating, and comparing a wide range of NoC solutions with very limited effort. Our experimental results show a speed-up of four orders of magnitude with respect to cycle-accurate HDL simulation, while retaining cycle accuracy. With our emulation framework, designers can explore and optimize a wide range of solutions, as well as quickly characterize performance figures.
Run-time task migration in a heterogeneous multiprocessor System-on-Chip (MP-SoC) is a challenge that requires cooperation between the task and the operating system. In task migration, minimizing the overhead during normal task execution (i.e., when not migrating) and minimizing the migration reaction time are both important. We introduce a novel technique that reuses the processor's debug registers in order to minimize the overhead during normal execution. This paper explains our task migration proof-of-concept setup and compares it to the state of the art. By reusing existing hardware and software functionality, our approach reduces the run-time overhead.
This extended abstract presents models to derive timing and resource usage numbers for an application when distant, shared memories are used in an important class of future embedded platforms, namely network-on-chip-based multiprocessors.
We present a complete top-down design of a low-power multi-channel clock recovery circuit based on gated current-controlled oscillators. The flow includes several tools and methods used to specify block constraints, to design and verify the topology down to the transistor level, as well as to achieve a power consumption as low as 5mW/Gbit/s. Statistical simulation is used to estimate the achievable bit error rate in presence of phase and frequency errors and to prove the feasibility of the concept. VHDL modeling provides extensive verification of the topology. Thermal noise modeling based on well-known concepts delivers design parameters for the device sizing and biasing. We present two practical examples of possible design improvements analyzed and implemented with this methodology.
This paper proposes a novel architecture synthesis algorithm for single-loop single-bit ΔΣ modulators. We defined a generic modulator architecture and derived its noise and signal transfer functions (NTF/STF) in symbolic form. We then used the transfer functions in a mixed-integer nonlinear program (MINLP) to generate optimal topologies for a variety of design requirements, such as modulator complexity, sensitivity, and power consumption, which appear as cost functions. Experiments show the superiority of the synthesized topologies compared to traditional solutions.
This paper reports a novel simulation methodology for the analysis and prediction of substrate noise impact on analog/RF circuits, taking into account the role of the parasitic resistance of the on-chip interconnect in the impact mechanism. This methodology allows investigation of the role of the separate devices (including parasitic devices) of the analog/RF circuit in the overall impact. In this way, it is revealed which devices have to be taken care of (shielding, topology change) to protect the circuit against substrate noise. The developed methodology is used to analyze the impact of substrate noise on a 3 GHz LC-tank Voltage Controlled Oscillator (VCO) designed in a high-ohmic 0.18 μm 1PM6 CMOS technology. For this VCO (in the investigated frequency range from DC to 15 MHz), the impact is mainly caused by resistive coupling of noise from the substrate to the non-ideal on-chip ground interconnect, resulting in analog ground bounce and frequency modulation. Hence, the presented test case reveals the important role of the on-chip interconnect in the phenomenon of substrate noise impact.
The emerging concept of SoC-AMS motivates research into new top-down methodologies to aid system designers in sizing analog and mixed-signal devices. This work applies this idea to the high-level optimization of pipeline ADCs. For a given technology, it consists of comparing different configurations according to their imperfections and their architectures, without FFT computation or time-consuming simulators. The final selection is based on a figure of merit.
This paper suggests a practical "hybrid" synthesis methodology which integrates designer-derived analytical models for system-level description with simulation-based models at the circuit level. We show how to optimize stage-resolution to minimize the power in a pipelined ADC. Exploration (via detailed synthesis) of several ADC configurations is used to show that a 4-3-2... resolution distribution uses the least power for a 13-bit 40 MSPS converter in a 0.25 μm CMOS process.
Soft errors are an increasingly serious problem for logic circuits. To estimate the effects of soft errors on such circuits, we develop a general computational framework based on probabilistic transfer matrices (PTMs). In particular, we apply them to evaluate circuit reliability in the presence of soft errors, which involves combining the PTMs of gates to form an overall circuit PTM. Information such as output probabilities, the overall probability of error, and signal observability can then be extracted from the circuit PTM. We employ algebraic decision diagrams (ADDs) to improve the efficiency of PTM operations. A particularly challenging technical problem, solved in our work, is to simultaneously extend tensor products and matrix multiplication in terms of ADDs to non-square matrices. Our PTM-based method enables accurate evaluation of reliability for moderately large circuits and can be extended by circuit partitioning. To demonstrate the power of the PTM approach, we apply it to several problems in fault-tolerant design and reliability improvement.
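The core PTM algebra is compact enough to sketch directly (a small numpy illustration of the framework's basic operations; the 5% gate error probability and the tiny circuit are our example, not the paper's benchmarks): serial composition of stages is matrix multiplication, and parallel composition is a tensor (Kronecker) product.

    # Sketch of probabilistic transfer matrix (PTM) composition with numpy.
    # Rows index input vectors, columns index output values; entries are
    # probabilities. The 5% gate error probability is an example value.
    import numpy as np

    eps = 0.05

    def noisy(ideal_column):
        """PTM of a 2-input/1-output gate whose ideal outputs for inputs
        00,01,10,11 are given; the gate errs with probability eps."""
        ptm = np.zeros((4, 2))
        for row, out in enumerate(ideal_column):
            ptm[row, out] = 1 - eps
            ptm[row, 1 - out] = eps
        return ptm

    AND = noisy([0, 0, 0, 1])
    OR = noisy([0, 1, 1, 1])
    WIRE = np.eye(2)                 # ideal wire (identity PTM)

    # Circuit y = OR(AND(a, b), c): the AND runs in parallel with the
    # wire carrying c, then the OR stage is composed serially.
    stage1 = np.kron(AND, WIRE)      # parallel composition: (a,b) with c
    circuit = stage1 @ OR            # serial composition

    # Output distribution for a=1, b=1, c=0 (input vector 110 -> row 6)
    print(circuit[0b110])            # [0.095, 0.905]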
Nanometer circuits are becoming increasingly susceptible to soft-errors due to alpha-particle and atmospheric neutron strikes as device scaling reduces node capacitances and supply/threshold voltage scaling reduces noise margins. It is becoming crucial to add soft-error tolerance estimation and optimization to the design flow to handle the increasing susceptibility. The first part of this paper presents a tool for accurate soft-error tolerance analysis of nanometer circuits (ASERTA) that can be used to estimate the soft-error tolerance of nanometer circuits consisting of millions of gates. The tolerance estimates generated by the tool match SPICE generated estimates closely while taking orders of magnitude less computation time. The second part of the paper presents a tool for soft-error tolerance optimization of nanometer circuits (SERTOPT) using the tolerance estimates generated by ASERTA. The tool finds optimal sizes, channel lengths, supply voltages and threshold voltages to be assigned to gates in a combinational circuit such that the soft-error tolerance is increased while meeting the timing constraint. Experiments on ISCAS'85 benchmark circuits showed that soft-error rate of the optimized circuit decreased by as much as 47% with marginal increase in circuit delay.
A new approach for enhancing the process-variation tolerance of digital circuits is described. We extend recent advances in statistical timing analysis into an optimization framework. Our objective is to reduce the performance variance of a technology-mapped circuit where delays across elements are represented by random variables which capture the manufacturing variations. We introduce the notion of statistical critical paths, which account for both means and variances of performance variation. An optimization engine is used to size gates with a goal of reducing the timing variance along the statistical critical paths. We apply a pair of nested statistical analysis methods, deploying a slower, more accurate approach for tracking statistical critical paths and a fast engine for evaluating gate size assignments. We derive a new approximation for the max operation on random variables which is deployed in the faster inner engine. Circuit optimization is carried out using a gain-based algorithm that terminates when constraints are satisfied or no further improvements can be made. We show optimization results that demonstrate an average 72% reduction in performance variation at the expense of an average 20% increase in design area.
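For context, the classical moment-matching approximation for the max of two correlated Gaussian delays (the standard form due to Clark; shown for reference only, as the paper derives a different, faster approximation) can be written in a few lines:

    # Clark's moment matching for max(X, Y) of two correlated Gaussian
    # delays: the standard first- and second-moment formulas, shown as
    # background (the paper's new, faster approximation differs).
    from statistics import NormalDist
    import math

    std_normal = NormalDist()

    def clark_max(mu1, s1, mu2, s2, rho=0.0):
        """Mean and std dev of max(X, Y) for Gaussians X, Y (Clark, 1961)."""
        theta = math.sqrt(s1*s1 + s2*s2 - 2.0*rho*s1*s2)
        a = (mu1 - mu2) / theta
        Phi, phi = std_normal.cdf, std_normal.pdf
        m1 = mu1*Phi(a) + mu2*Phi(-a) + theta*phi(a)
        m2 = (mu1*mu1 + s1*s1)*Phi(a) + (mu2*mu2 + s2*s2)*Phi(-a) \
             + (mu1 + mu2)*theta*phi(a)
        return m1, math.sqrt(max(m2 - m1*m1, 0.0))

    # two reconverging path delays (ns), modestly correlated
    print(clark_max(10.0, 0.5, 9.8, 0.8, rho=0.3))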
As device sizes shrink and current densities increase, the probability of device failures due to gate oxide breakdown (OBD) also increases. To provide designs that are tolerant to such failures, we must investigate and understand the manifestations of this physical phenomenon at the circuit and system level. In this paper, we develop a model for operational OBD defects, and we explore how to test for faults due to OBD. For a NAND gate, we derive the necessary input conditions that excite and detect errors due to OBD defects at the gate level. We show that traditional pattern generators fail to exercise all of these defects. Finally, we show that these test patterns can be propagated and justified for a combinational circuit in a manner similar to traditional ATPG.
In this paper, we present an accurate but very fast soft error rate (SER) estimation technique for digital circuits based on error propagation probability (EPP) computation. Experimental results and comparison with the random simulation technique show that our proposed method is on average within 6% of the random simulation method and four to five orders of magnitude faster.
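The propagation step at the heart of an EPP computation can be sketched for single gates (a simplified illustration of the general idea; statistically independent inputs are assumed, a common simplification that real implementations refine): an error on one input of an AND gate reaches the output only when the other input is logic 1, and dually for OR.

    # Sketch of per-gate error-propagation-probability (EPP) steps.
    # Independence of inputs is an assumed simplification.

    def epp_through_gate(gate, epp_in, sig_prob_other):
        """Probability that an error on one input propagates to the
        output, given the signal probability of the other input."""
        if gate == "AND":            # propagates only if other input is 1
            return epp_in * sig_prob_other
        if gate == "OR":             # propagates only if other input is 0
            return epp_in * (1.0 - sig_prob_other)
        if gate in ("NOT", "BUF", "XOR"):
            return epp_in            # a single-input error always flips out
        raise ValueError(gate)

    # A strike flips node n with probability 0.8; n feeds an AND whose
    # other input is 1 with probability 0.6, then an OR whose other
    # input is 1 with probability 0.25.
    p = 0.8
    p = epp_through_gate("AND", p, 0.6)    # 0.48
    p = epp_through_gate("OR", p, 0.25)    # 0.36
    print(f"probability the error reaches the output: {p:.2f}")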
Very deep submicron and nanometer technologies have notably increased the sensitivity of integrated circuits (ICs) to radiation. Soft errors are now appearing in ICs operating at the Earth's surface, and hardened circuits are required in many applications where Fault Tolerance (FT) was not a requirement until very recently. The use of platform FPGAs for the emulation of single-event upset (SEU) effects is gaining attention as a way to speed up FT evaluation. In this work, a new emulation system for FT evaluation with respect to SEU effects is proposed, providing shorter evaluation times by performing the whole evaluation process in the FPGA and avoiding emulator-host communication bottlenecks.
In this paper we present arithmetic real-coded variation operators tailored for time slot and turn optimization on TDMA-scheduled resources with evolutionary algorithms. Our operators implement a heuristic strategy to converge towards the solution space and are able to escape local minima. Furthermore, we explicitly separate the variation of the admitted loads and the turn-length in order to give the designer increased control over the optimization process. Experimental results show that our variation operators have advantages over string-coded binary variation operators which are frequently used to solve continuous optimization problems.
Context-aware applications pose new challenges, including the need for new computational models, uncertainty management, and efficient optimization under uncertainty. Uncertainty can arise at two levels: across multiple tasks and within single tasks. When a mobile user changes environments, the context changes, with the possibility of the user requesting tasks which are specific to the new environment. However, as the user moves, these requested tasks may no longer be context-relevant. Additionally, the runtime of each task is often highly dependent on the input data. We introduce a hierarchical multi-resolution statistical task model that captures relevant aspects at the task and inter-task levels and captures not only uncertainty but also the notion of utility for the user. We have developed a system of non-parametric statistical techniques for modeling the runtime of a specific task. This model is a framework in which we define problems of design and optimization of statistical soft real-time systems (SSRTS). The main algorithmic novelty is a cumulative potential-based task scheduling heuristic for maximizing utility. The heuristic conducts global optimization and induces low runtime overhead. We demonstrate the effectiveness of the scheduling heuristic using a Trimaran-based evaluation platform.
Increasing reuse opportunities is a well-known problem for software designers as well as for hardware designers. Nonetheless, current software and hardware engineering practices have embraced different approaches to this problem. Software designs are usually modelled after a set of proven solutions to recurrent problems called design patterns. This approach differs from the component-based reuse usually found in hardware designs: design patterns do not specify unnecessary implementation details. Several authors have already proposed translating structural design pattern concepts to hardware design. In this paper we extend the discussion to behavioural design patterns. Specifically, we describe how a hardware version of the Iterator pattern can be used to enhance model reuse.
Sharing IP blocks in today's competitive market poses significant security risks. Creators and owners of IP designs want assurances that their content will not be illegally redistributed by consumers, and consumers want assurances that the content they buy is legitimate. Recently, digital watermarking has emerged as a candidate solution for copyright protection of IP blocks. In this paper, we propose a new approach to watermarking IP designs based on embedding the ownership proof as part of the IP design's FSM. The approach utilizes coinciding as well as unused transitions in the state transition graph of the design. Our approach increases the robustness of the watermark and allows a secure implementation, hence enabling the development of the first public-key IP watermarking scheme at the FSM level. We also define evaluation criteria for our approach and use experimental measures to prove its robustness.
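A toy version of the embedding idea (our own minimal illustration, not the paper's scheme or its key handling): watermark bits are encoded in the behavior assigned to transitions that the original design never exercises, so normal operation is unaffected while the owner can replay a secret input sequence to reveal the mark.

    # Toy sketch of FSM watermarking via unused transitions.
    # The FSM, inputs, and watermark are all hypothetical examples.

    fsm = {  # (state, input) -> (next_state, output); input '1' unused in S1
        ("S0", "0"): ("S0", 0), ("S0", "1"): ("S1", 0),
        ("S1", "0"): ("S0", 1),
    }

    def embed(fsm, unused, bits):
        """Assign watermark bits as outputs of unused transitions."""
        marked = dict(fsm)
        for (st, inp), b in zip(unused, bits):
            marked[(st, inp)] = (st, b)   # self-loop carrying a mark bit
        return marked

    marked = embed(fsm, [("S1", "1")], [1])
    # Normal behavior is untouched; the owner drives the FSM into S1 and
    # applies input '1' to read back the watermark bit.
    print(marked[("S1", "1")])    # ('S1', 1)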
Research studies have demonstrated the feasibility and advantages of Network-on-Chip (NoC) over traditional bus-based architectures, but have not focused on compatibility with communication standards. This paper describes a number of issues faced when designing a VC-neutral NoC, i.e., one compatible with standards such as AHB 2.0, AXI, VCI, OCP, and various other proprietary protocols, and how a layered approach to communication helps solve these issues.
We present a novel, quality-driven, architectural-level approach that trades-off the output quality to enable power-aware processing of multimedia streams. The error tolerance of multimedia data is exploited to selectively eliminate computation while maintaining a specified output quality. We construct relaxed, synthesized power macro-models for power-hungry units to predict the cycle-accurate power consumption of the input stream on the fly. The macro-models, together with an effective quality model, are integrated into a programmable architecture that allows both power savings and quality to be dynamically tuned with the available battery-life. In a case study, power monitors are integrated with functional units of the IDCT module of a MPEG-2 decoder. Experiments indicate that, for a moderate power monitor energy overhead of 5%, power savings of 72% in the functional units can be achieved resulting in an increase in battery life by 1.95x.
In this paper, a method is proposed for finding a pixel transformation function that maximizes backlight dimming while maintaining a pre-specified image distortion level for a liquid crystal display. This is achieved by finding a pixel transformation function which maps the original image histogram to a new histogram with a lower dynamic range. Next, the contrast of the transformed image is enhanced to compensate for the brightness loss that would arise from backlight dimming. The proposed approach relies on an accurate definition of image distortion, which takes into account both the pixel value differences and a model of the human visual system, and is amenable to highly efficient hardware realization. Experimental results show that the histogram equalization for backlight scaling method yields about 45% power saving at an effective distortion rate of 5%, and 65% power saving at a 20% distortion rate. These savings are significantly higher than those of previously reported backlight dimming approaches.
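To make the mechanism concrete, the following Python sketch (our illustration under simplifying assumptions, not the paper's method) picks the strongest backlight dimming factor whose brightness compensation clips at most a given fraction of pixels; the clipped fraction stands in, crudely, for the paper's visual-system-aware distortion metric.

```python
# Illustrative sketch: dim the backlight by a factor b < 1 and compensate by
# scaling pixel values up, clipping the brightest pixels. The distortion
# budget bounds the fraction of clipped pixels.
import numpy as np

def backlight_scale(image, max_clipped_fraction=0.05, levels=256):
    """Pick the strongest dimming whose compensation stays within budget."""
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = np.cumsum(hist) / hist.sum()
    for backlight in np.arange(0.3, 1.01, 0.05):     # candidate dimming factors
        threshold = int(backlight * (levels - 1))    # pixels above this clip
        clipped = 1.0 - cdf[threshold]
        if clipped <= max_clipped_fraction:
            # Compensate: stretch pixel values so perceived brightness is kept.
            out = np.clip(image.astype(float) / backlight, 0, levels - 1)
            return out.astype(np.uint8), backlight
    return image, 1.0                                # no dimming possible

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
dimmed, b = backlight_scale(img)
print(f"backlight factor: {b:.2f}")
```

A real implementation would replace the clipped-pixel count with the paper's distortion definition and derive the pixel transformation from the histogram rather than from a uniform stretch.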
In this paper, we present a methodology for customized communication architecture synthesis that matches the communication requirements of the target application. This is an important problem, particularly for network-based implementations of complex applications. Our approach is based on using frequently encountered generic communication primitives as an alphabet capable of characterizing any given communication pattern. The proposed algorithm searches through the entire design space for a solution that minimizes the system's total energy consumption while satisfying the other design constraints. Compared to a standard mesh architecture, the customized architecture generated by the proposed approach shows about a 36% throughput increase and a 51% reduction in the energy required to encrypt 128 bits of data with a standard encryption algorithm.
This paper presents a technique for eliminating redundant cache-tag and cache-way accesses to reduce power consumption. The basic idea is to keep a small number of Most Recently Used (MRU) addresses in a Memory Address Buffer (MAB) and to omit redundant tag and way accesses when there is a MAB hit. Since the approach keeps only tag and set-index values in the MAB, the energy and area overheads are relatively small even for a MAB with a large number of entries. Furthermore, the approach does not sacrifice performance: neither the cycle time nor the number of executed cycles increases. The proposed technique has been applied to the Fujitsu FR-V VLIW processor and its power saving has been estimated using NanoSim. Experiments on 32kB 2-way set-associative caches show that the power consumption of the I-cache and D-cache can be reduced by 40% and 50%, respectively.
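The following behavioural sketch of the MAB idea may help; the entry count, names, and LRU policy are our assumptions, not details of the FR-V implementation.

```python
# On a MAB hit the cache can skip the tag-array lookup and reuse the
# remembered hit way; only (tag, set-index) pairs are stored, keeping the
# buffer small.
from collections import OrderedDict

class MemoryAddressBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.buf = OrderedDict()            # (tag, set_index) -> hit way

    def lookup(self, tag, set_index):
        key = (tag, set_index)
        if key in self.buf:                 # MAB hit: tag/way access elided
            self.buf.move_to_end(key)
            return self.buf[key]
        return None                         # MAB miss: full tag compare needed

    def update(self, tag, set_index, way):
        self.buf[(tag, set_index)] = way
        self.buf.move_to_end((tag, set_index))
        if len(self.buf) > self.entries:    # evict least recently used entry
            self.buf.popitem(last=False)

mab = MemoryAddressBuffer()
mab.update(tag=0x1A2, set_index=7, way=1)
assert mab.lookup(0x1A2, 7) == 1            # redundant tag access skipped
assert mab.lookup(0x1A2, 8) is None         # different set: normal access
```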
Incorporating reconfigurable hardware in embedded system architectures has made it easier to satisfy the performance constraints of demanding applications while lowering system cost. In order to evaluate the performance of a candidate architecture, the nodes (tasks) of the data flow graphs that describe an application must be assigned to the computing resources of the architecture: programmable processors and reconfigurable FPGAs, whose run-time reconfiguration capabilities must be exploited. In this paper we present a novel design exploration tool, based on a local search algorithm with global convergence properties, which simultaneously explores choices for computing resources, assignments of nodes to these resources, task schedules on the programmable processors, and context definitions for the reconfigurable circuits. The tool finds a solution that minimizes system cost while meeting the performance constraints; more precisely, it lets the designer select the quality of the optimization (and hence its computing time) and accordingly finds a solution with close-to-minimal cost.
The objective of this paper is to introduce dependability as an optimization criterion in the system-level design process of embedded systems. Given the pervasiveness of embedded systems, especially in the area of highly dependable and safety-critical systems, it is imperative to directly consider dependability in the system level design process. This naturally leads to a multi-objective optimization problem, as cost and time have to be considered too. This paper proposes a genetic algorithm to solve this multi-objective optimization problem and to determine a set of Pareto optimal design alternatives in a single optimization run. Based on these alternatives, the designer can choose his best solution, finding the desired tradeoff between cost, schedulability, and dependability.
Efficient evaluation of design choices, in terms of selecting algorithms to be implemented as hardware or software and finding an optimal hardware/software design mix, is an important requirement in the design flow of embedded systems. Time-to-market, faster upgradability, and flexibility are some of the driving factors for implementing increasing amounts of functionality as software executed on general-purpose processing elements. In this scenario, dividing a monolithic task into multiple interacting tasks and scheduling them on limited processing elements has become very important for a system designer. This paper presents an approach to model time-slice-based task schedulers in designs where the performance estimate of the hardware and software models is less than time-slice accurate. The approach aims to increase the simulation efficiency of designs modeled at the system level. We used Metropolis [1] as our codesign environment.
This paper presents a scheme for efficient channel usage between simulator and accelerator, where the accelerator models some RTL sub-blocks in accelerator-based hardware/software co-simulation while the simulator runs a transaction-level model of the remaining part of the chip being verified. With a conventional simulation accelerator, evaluations of the simulator and accelerator alternate at every valid simulation time, which results in poor simulation performance due to the startup overhead of each simulator-accelerator channel access. The startup overhead can be reduced by merging multiple transactions on the channel into a single burst. We propose a predictive packetizing scheme that reduces channel traffic by merging as many transactions as possible into a burst, based on "prediction and rollback". Under ideal conditions with 100% prediction accuracy, the proposed method shows a performance gain of 1500% compared to the conventional one.
Automated synthesis of monitors from high-level properties plays a significant role in assertion-based verification. We present here a methodology to synthesize assertion monitors from visual specifications given in CESC (Clocked Event Sequence Chart). CESC is a visual language designed for specifying system-level interactions involving single and multiple clock domains. It has well-defined graphical and textual syntax and formal semantics based on the synchronous language paradigm, enabling formal analysis of specifications. In this paper we provide an overview of the CESC language with a few illustrative examples. The algorithm for automated synthesis of assertion monitors from CESC specifications is described. A few examples from standard bus protocols (OCP-IP and AMBA) are presented to demonstrate the application of the monitor synthesis algorithm.
In this paper, we present a software compilation approach for microprocessor/FPGA platforms that partitions a software binary onto custom hardware implemented in the FPGA. Our approach imposes fewer restrictions on the software tool flow than previous compiler approaches, allowing software designers to use any software language and compiler. Our approach uses a back-end partitioning tool that utilizes decompilation techniques to recover important high-level information, resulting in performance comparable to high-level compiler-based approaches.
The increased dominance of intra-die process variations has motivated the field of Statistical Static Timing Analysis (SSTA) and has raised the need for SSTA-based circuit optimization. In this paper, we propose a new sensitivity-based statistical gate sizing method. Since brute-force computation of the change in the circuit delay distribution with respect to each gate size change is computationally expensive, we propose an efficient and exact pruning algorithm. The pruning algorithm is based on a novel theory of perturbation bounds, which are shown to decrease as they propagate through the circuit. This allows pruning of gate sensitivities without complete propagation of their perturbations. We apply the proposed optimization algorithm to ISCAS benchmark circuits and demonstrate its accuracy and efficiency. Our results show an improvement of up to 10.5% in the 99th-percentile circuit delay for the same circuit area, using the proposed statistical optimizer, and a run-time improvement of up to 56x compared to the brute-force approach.
Graph dominators provide a general mechanism for identifying re-converging paths in circuits. This is useful in a number of CAD applications, including computation of signal probabilities for test generation, switching activities for power and noise analysis, statistical timing analysis, cut point selection in equivalence checking, etc. Single-vertex dominators are too rare in circuit graphs to handle re-converging paths in a practical way. This paper addresses the problem of finding double-vertex dominators, which occur more frequently. First, we introduce a data structure, called a dominator chain, which allows representing all possible O(n²) double-vertex dominators of a given vertex in O(n) space, where n is the number of vertices of the circuit graph. Dominator chains can be efficiently manipulated; e.g., it takes constant time to look up whether a given pair of vertices is a double-vertex dominator. Second, we present an efficient algorithm for finding double-vertex dominators. The experimental results show that the presented algorithm is an order of magnitude faster than existing algorithms for finding double-vertex dominators. Thus, it is suitable for running in an incremental manner during logic synthesis.
This paper describes an improved approach to Boolean network optimization using internal don't-cares. The improvements concern the type of don't-cares computed, their scope, and the computation method. Instead of the traditionally used compatible observability don't-cares (CODCs), we introduce and justify the use of complete don't-cares (CDCs). To ensure the robustness of the don't-care computation for very large industrial networks, an optional windowing scheme is implemented that computes substantial subsets of the CDCs in reasonable time. Finally, we give a SAT-based don't-care computation algorithm that is more efficient than BDD-based algorithms. Experimental results confirm that these improvements work well in practice. Complete don't-cares allow for a reduction in the number of literals compared to the CODCs. Windowing guarantees robustness, even for very large benchmarks on which previous methods could not be applied. SAT reduces the runtime and enhances robustness, making don't-cares affordable for a variety of other Boolean methods applied to the network.
A class of discrete event synthesis problems can be reduced to solving language equations F · X ⊆ S, where F is the fixed component and S the specification. Sequential synthesis deals with FSMs when the automata for F and S are prefix closed, and are naturally represented by multi-level networks with latches. For this special case, we present an efficient computation, using partitioned representations, of the most general prefix-closed solution of the above class of language equations. The transition and the output relations of the FSMs for F and S in their partitioned form are represented by the sets of output and next state functions of the corresponding networks. Experimentally, we show that using partitioned representations is much faster than using monolithic representations, as well as applicable to larger problem instances.
The purpose of this paper is to formally specify a flow devoted to the design of Differential Power Analysis (DPA) resistant QDI asynchronous circuits. The paper first proposes a formal modeling of the electrical signature of QDI asynchronous circuits. The DPA is then applied to the formal model in order to identify the source of leakage of this type of circuit. Finally, a complete design flow is specified to minimize the information leakage. The relevance and efficiency of the approach are demonstrated using the design of an AES crypto-processor.
This paper addresses two problems related to disjoint-support decomposition of Boolean functions. First, we present a heuristic for finding a subset of variables, X, which results in the disjoint-support decomposition f(X,Y) = h(g(X),Y) with a good area/delay trade-off. Second, we present a technique for re-synthesis of the original circuit implementing f(X,Y) into a circuit implementing the decomposed representation h(g(X),Y). Preliminary experimental results indicate that the proposed approach has a significant potential.
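The property underlying such decompositions can be checked by brute force on small functions, as in the following sketch (our illustration of the definition, not the paper's heuristic): f(X,Y) = h(g(X),Y) with a single-output g exists exactly when the cofactors of f over all assignments to X take at most two distinct values as functions of Y.

```python
from itertools import product

def has_dsd(f, x_vars, y_vars):
    """f: callable on a dict of 0/1 variable assignments, returning 0/1."""
    cofactors = set()
    for xs in product([0, 1], repeat=len(x_vars)):
        env = dict(zip(x_vars, xs))
        # Tabulate the cofactor of f under this X assignment over all Y.
        table = tuple(f({**env, **dict(zip(y_vars, ys))})
                      for ys in product([0, 1], repeat=len(y_vars)))
        cofactors.add(table)
    # At most two distinct cofactors <=> a single-output g(X) suffices.
    return len(cofactors) <= 2

# Example: f = (a XOR b) OR c decomposes as h(g(a, b), c) with g = XOR.
f = lambda v: (v['a'] ^ v['b']) | v['c']
print(has_dsd(f, ['a', 'b'], ['c']))   # True
```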
Recent work on Differential Power Analysis shows that even mathematically-secure cryptographic protocols may be vulnerable at the physical implementation level. By measuring energy consumed by a working digital circuit, one can glean enough information to break encryption. Thwarting such attacks requires a new approach to logic and physical design. In this work, we seek to equalize switching activity of a circuit over all possible inputs and input transitions by adding redundant gates and increasing the overall number of signal transitions. We introduce uniformly-switching (U-S) logic, and present a doubling construction that equalizes power dissipation without requiring drastic changes in CAD tools.
We propose an approach to optimally synthesize quantum circuits from non-permutative quantum gates such as Controlled-Square-Root-of-Not (i.e. Controlled-V). Our approach reduces the synthesis problem to multiple-valued optimization and uses group theory. We devise a novel technique that transforms the quantum logic synthesis problem from a multi-valued constrained optimization problem into a group permutation problem. The transformation enables us to utilize group theory to exploit the properties of the synthesis problem. Assuming a cost of one for each two-qubit gate, we found all reversible circuits with quantum costs of 4, 5, 6, and so on, and give another algorithm to realize these reversible circuits with quantum gates.
This paper presents the effectiveness of various stress conditions (mainly voltage and frequency) in detecting resistive short and open defects in deep sub-micron embedded memories in an industrial environment. Simulation studies of very-low-voltage, high-voltage, and at-speed testing show the need for stress conditions to achieve the high product quality, i.e., low defect-per-million (DPM) levels, that drives the semiconductor market today. The above test conditions have been validated for screening out bad devices on real silicon (a test chip) built in CMOS 0.18μm technology. An IFA (inductive fault analysis) based simulation technique yields an efficient fault coverage and DPM estimator, which helps customers make upfront decisions on test algorithm implementations under different stress conditions in order to reduce the number of test escapes.
Test sets that detect each target fault n times (n-detection test sets) are typically generated for restricted values of n due to the increase in test set size with n. We perform both a worst-case analysis and an average-case analysis to check the effect of restricting n on the unmodeled fault coverage of an (arbitrary) n-detection test set. Our analysis is independent of any particular test set or test generation approach. It is based on a specific set of target faults and a specific set of untargeted faults. It shows that, depending on the circuit, very large values of n may be needed to guarantee the detection of all the untargeted faults. We discuss the implications of these results.
A method to generate test patterns, referred to as defect-aware test patterns, is proposed. Defect-aware test patterns have a greater ability to detect unmodeled defects. The proposed method can be used with any test generation procedure to improve the effectiveness of the tests in detecting unmodeled defects. Experimental results on several industrial designs show the effectiveness of defect-aware tests. We also propose a measure to estimate the effectiveness of given test sets in detecting unmodeled defects.
Characterization of semiconductor devices is used to gather as much data about the device as possible, in order to determine weaknesses in the design or trends in the manufacturing process. In this paper, we propose a novel multiple-trip-point characterization concept to overcome the constraints of the single-trip-point concept in the device characterization phase. In addition, we use computational intelligence techniques (e.g. neural networks, fuzzy logic, and genetic algorithms) to further manipulate these sets of multiple trip-point values and tests on semiconductor test equipment. Our experimental results demonstrate an excellent design parameter variation analysis in the device characterization phase, as well as detection of a set of worst-case tests that can provoke the worst-case variation, whereas the traditional approach was not capable of detecting them.
Embedded DRAM (eDRAM) is increasingly used in System-on-Chip (SOC) designs. Integrating the DRAM capacitor process into a logic process while obtaining satisfactory yields is challenging. The specific process of the DRAM capacitor and the low capacitance value (~30fF) of this device cause problems for process monitoring and failure analysis. We propose a new test structure to measure the capacitance value of each DRAM cell capacitor in a DRAM array. This concept has been validated by simulation on a 0.18μm eDRAM technology.
In this paper we present a simple and efficient built-in temperature sensor for thermal monitoring of standard-cell-based VLSI circuits. The proposed smart temperature sensor uses a ring oscillator composed of complex gates instead of inverters to optimize its linearity. Simulation results from a 0.18μm CMOS technology show that the non-linearity error of the sensor can be reduced when an adequate set of standard logic gates is selected.
Over the past decade, voltage scaling has become an attractive feature for many system component designs. In this paper, we consider energy-efficient real-time task scheduling on a chip multiprocessor architecture. The objective is to schedule a set of frame-based tasks with minimum energy consumption, where all tasks are ready at time 0 and share a common deadline. We show that this minimization problem is NP-hard and then propose a 2.371-approximation algorithm. The strength of the proposed algorithm is demonstrated by a series of simulations, in which near-optimal results were obtained.
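For intuition about the problem setting, here is a greedy baseline in Python (our sketch; it is not the paper's 2.371-approximation algorithm): partition the tasks longest-first over the processors, then run each processor at the slowest uniform speed that meets the common deadline, using the common convention that dynamic power grows cubically with speed, so the energy of `work` cycles at speed s is proportional to work·s².

```python
import heapq

def greedy_energy(works, m, deadline):
    """Assign task workloads (cycles) to m processors, return total energy."""
    loads = [(0.0, i, []) for i in range(m)]        # (load, proc id, tasks)
    heapq.heapify(loads)
    for w in sorted(works, reverse=True):           # longest task first
        load, i, tasks = heapq.heappop(loads)       # least-loaded processor
        heapq.heappush(loads, (load + w, i, tasks + [w]))
    energy = 0.0
    for load, i, tasks in loads:
        speed = load / deadline                     # slowest feasible speed
        energy += load * speed ** 2                 # energy ~ work * s^2
    return energy

print(greedy_energy(works=[4, 3, 3, 2, 2, 1], m=2, deadline=10.0))
```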
We present an energy-efficient real-time scheduling algorithm called EUA*, for the unimodal arbitrary arrival model (or UAM). UAM embodies a "stronger" adversary than most arrival models. The algorithm considers application activities that are subject to time/utility function time constraints, UAM, and the multi-criteria scheduling objective of probabilistically satisfying utility lower bounds, and maximizing system-level energy efficiency. Since the scheduling problem is intractable, EUA* allocates CPU cycles, scales clock frequency, and heuristically computes schedules using statistical estimates of cycle demands, in polynomial-time. We establish that EUA* achieves optimal timeliness during under-loads, and identify the conditions under which timeliness assurances hold. Our simulation experiments illustrate EUA*'s superiority.
In this paper we present a new technique which exploits timing correlation between tasks for scheduling analysis in multiprocessor and distributed systems with tree-shaped task dependencies. Previously developed techniques also allow capturing and exploiting timing correlation in distributed systems; however, they are only suitable for linear systems, where tasks cannot trigger more than one succeeding task. The new technique presented in this paper captures timing correlation between tasks in parallel paths more accurately, enabling its exploitation to calculate tighter bounds in worst-case response time analysis for tasks scheduled under a static-priority preemptive scheduler.
In this paper we introduce a new task model that is specifically targeted towards representing stream processing applications. Examples of such applications are those involved in network packet processing (such as a software-based router) and multimedia processing (such as an MPEG decoder application). Our task model is made up of two parts: (i) a new task structure to accurately model the software structures of stream processing applications such as conditional branches and different end-to-end deadlines for different types of input data items, and (ii) a new event model to represent the arrival pattern of the data items to be processed, which triggers the task structure. This event model is more expressive than classical models such as purely periodic, periodic with jitter or sporadic event models. We then present algorithms for the schedulability analysis of this task model. The basic scheme underlying our algorithms is a generalization of the techniques used for the schedulability analysis of the recently proposed generalized multiframe and the recurring real-time task models.
This paper presents new fast exact feasibility tests for uniprocessor real-time systems using preemptive EDF scheduling. Task sets that are accepted by previously described sufficient tests are evaluated by the new algorithms in nearly the same time as by the old tests. Many task sets are rejected by the earlier tests despite being feasible; the new algorithms evaluate these task sets much faster than known exact feasibility tests, making them usable in many applications for which previously only sufficient tests were practical. Additionally, this paper shows that the best previously known sufficient test, the best known feasibility bound, and the best known approximation algorithm can all be derived from these new tests. As a result, this leads to an integrated schedulability theory for EDF.
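For reference, the exact criterion such tests typically evaluate is the classic processor-demand condition; the sketch below implements it naively (the paper's contribution is evaluating this kind of condition much faster, not this form). A synchronous task set is EDF-feasible iff dbf(t) <= t at every absolute deadline t up to the hyperperiod, where dbf(t) = sum_i max(0, floor((t - D_i)/T_i) + 1) * C_i.

```python
from math import gcd
from functools import reduce

def edf_feasible(tasks):
    """tasks: list of (C, D, T); assumes total utilisation sum(C/T) <= 1."""
    hyperperiod = reduce(lambda a, b: a * b // gcd(a, b),
                         (t for _, _, t in tasks))
    # The demand bound function can only change at absolute deadlines.
    checkpoints = sorted({d + k * t
                          for _, d, t in tasks
                          for k in range((hyperperiod - d) // t + 1)})
    for t in checkpoints:
        demand = sum(max(0, (t - d) // p + 1) * c for c, d, p in tasks)
        if demand > t:
            return False
    return True

print(edf_feasible([(1, 3, 4), (2, 5, 6)]))   # True for this small set
```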
Complex real-time control systems are software- and algorithm-intensive and call for modern software-engineering techniques in their design. UML is an object-oriented, industrial-standard modeling language that is used more and more in the real-time domain. This paper first analyses the advantages and problems of using UML for real-time control system design. It then proposes an extension of UML-RT to support the modeling of time-continuous subsystems, so that the modeling of complex real-time control systems can be unified on the UML-RT platform, from requirements analysis through model design and simulation to code generation.
Complex applications implemented as Systems-on-Chip (SoCs) demand extensive use of system-level modeling and validation. Their implementation gathers a large number of complex IP cores and advanced interconnection schemes, such as hierarchical bus architectures or networks-on-chip (NoCs). Modeling applications involves capturing their computation and communication characteristics. Previously proposed communication-weighted models (CWM) consider only the application's communication aspects. This work proposes a communication dependence and computation model (CDCM) that simultaneously considers both aspects of an application. It presents a solution to the problem of mapping applications onto regular NoCs while considering execution time and energy consumption. The use of CDCM is shown to provide estimated average reductions of 40% in execution time and 20% in energy consumption for current technologies.
Shared memory is a common interprocessor communication paradigm for single-chip multi-processor platforms. Snoop-based cache coherence is a very successful technique that provides a clean shared-memory programming abstraction in general-purpose chip multi-processors, but there is no consensus on its usage in resource-constrained multiprocessor systems on chips (MPSoCs) for embedded applications. This work aims at providing a comparative energy and performance analysis of cache coherence support schemes in MPSoCs. Thanks to the use of a complete multiprocessor simulation platform, which relies on accurate technology-homogeneous power models, we were able to explore different cache-coherent shared-memory communication schemes for a number of cache configurations and workloads.
Supply voltage scaling and adaptive body-biasing are important techniques that help to reduce the energy dissipation of embedded systems. This is achieved by dynamically adjusting the voltage and performance settings according to the application needs. In order to take full advantage of the slack that arises from variations in execution time, it is important to recalculate the voltage (performance) settings during runtime, i.e., online. However, voltage scaling (VS) is computationally expensive and thus significantly diminishes the possible energy savings. To overcome this online complexity, we propose a quasi-static voltage scaling scheme with a constant online time complexity of O(1). This makes it possible to increase the exploitable slack while avoiding the energy dissipated by online recalculation of the voltage settings. We conduct several experiments that demonstrate the advantages of the proposed technique over previously published voltage scaling approaches.
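A minimal sketch of the quasi-static idea, under our own reading and with invented numbers: the expensive optimisation runs offline for a grid of possible task start times, and the online scheduler performs a single constant-time table lookup indexed by the actual slack observed at run time.

```python
def build_lookup_table(deadline, remaining_work, steps=16):
    """Offline: precompute a frequency setting per quantised start time."""
    table = []
    for i in range(steps):
        start = deadline * i / steps
        slack = deadline - start
        # Placeholder for the offline optimiser: slowest frequency that
        # still meets the deadline when started at this time.
        table.append(remaining_work / slack)
    return table

def online_frequency(table, now, deadline):
    """Online: O(1) quantised lookup, no re-optimisation."""
    steps = len(table)
    idx = min(steps - 1, int(now / deadline * steps))
    return table[idx]

table = build_lookup_table(deadline=10.0, remaining_work=5.0)
print(online_frequency(table, now=3.7, deadline=10.0))
```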
We propose a novel energy-efficient memory architecture which relies on the use of a cache with a reduced number of tag bits. The idea behind the proposed architecture is to move a large number of the tag bits from the cache into an external register (the Tag Overflow Buffer) that identifies the current locality of the memory references; additional hardware dynamically updates the reference-locality value contained in the buffer. Energy efficiency is achieved by using, for most of the memory accesses, a reduced-tag cache. This architecture is minimally intrusive for existing designs, since it assumes the use of a regular cache and does not require any special circuitry internal to the cache, such as row or column activation mechanisms. Average energy savings are 51% of tag energy, corresponding to about a 20% saving in total cache energy, measured on a set of typical embedded applications.
When applying Dynamic Power Management (DPM) to pervasively deployed embedded systems, the technique must be efficient enough to be implemented on low-end processors with tight memory budgets. Furthermore, it should be able to track time-varying behavior rapidly, because such variation is an inherent characteristic of real-world systems. Existing methods, which are usually model-based, may not satisfy these requirements. In this paper, we propose a model-free DPM technique based on Q-learning. Q-DPM is much more efficient because it removes the overhead of the parameter estimator and the mode-switch controller. Furthermore, its policy optimization is performed via consecutive online trials, which also leads to very rapid response to time-varying behavior.
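A toy sketch of Q-learning-based DPM follows; the state and action sets and the reward shaping are our assumptions, not the paper's. States quantise the observed idle-period length, actions pick a power mode, and the reward trades energy saved against wake-up penalty. No workload model is estimated, which matches the model-free claim.

```python
import random

STATES  = 4                     # quantised recent-idle-length bins
ACTIONS = ['active', 'standby', 'sleep']
Q = [[0.0] * len(ACTIONS) for _ in range(STATES)]
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

def choose(state):
    if random.random() < EPS:                      # explore
        return random.randrange(len(ACTIONS))
    row = Q[state]                                 # exploit best known action
    return row.index(max(row))

def update(state, action, reward, next_state):
    best_next = max(Q[next_state])
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

# One toy interaction step: a long idle period rewarded for sleeping.
s = 3                                              # "long idle" bin
a = choose(s)
reward = 1.0 if ACTIONS[a] == 'sleep' else -0.2    # invented reward shaping
update(s, a, reward, next_state=0)
print(Q[s])
```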
In this paper, we present power emulation, a novel design paradigm that utilizes hardware acceleration for the purpose of fast power estimation. Power emulation is based on the observation that the functions necessary for power estimation (power model evaluation, aggregation, etc.) can be implemented as hardware circuits. Therefore, we can enhance any given design with "power estimation hardware", map it to a prototyping platform, and exercise it with any given test stimuli to obtain power consumption estimates. Our empirical studies with industrial designs reveal that power emulation can achieve significant speedups (10X to 500X) over state-of-the-art commercial register-transfer level (RTL) power estimation tools.
Given the lack of an overall understanding of the interplay of sub-systems and the difficulties encountered in integrating very complex parts, system integration is increasingly becoming a nightmare. In fact, Jurgen Hubbert, in charge of the Mercedes-Benz passenger car division, publicly stated in 2003: "The industry is fighting to solve problems that are coming from electronics and companies that introduce new technologies face additional risks. We have experienced blackouts on our cockpit management and navigation command system and there have been problems with telephone connections and seat heating". I believe that this sorry state is the rule for the leading OEMs, not the exception, in today's environment. The source of these problems is clearly the increased complexity, but also the difficulty OEMs have in managing the integration and maintenance process with subsystems that come from different suppliers who use different design methods, different software architectures, different hardware platforms, and different (and often proprietary) real-time operating systems. Standards in the software and hardware domains that allow plug-and-play of subsystems and their implementations are therefore essential, while the competitive advantage of an OEM will increasingly reside in essential functionalities (e.g. stability control).
Carefully tested electric/electronic components are a requirement for effective hardware-in-the-loop tests and vehicle tests in the automotive industry. A new method for the definition and execution of component tests is described. The most important advantage of this method is independence from the test stand. It therefore offers the opportunity to build up knowledge over a long period of time and the ability to share this knowledge with different partners.
In the early 20th century, Henry Ford had the vision of an economical car priced for everyone. He realised this vision through a new organisation of well-defined management and engineering processes; as he put it, "It means a lot to me to prove clearly that our ideas are thoroughly accomplishable" — ideas that are not automotive-specific but rather part of a global code. One day he explained that in the future he would build only one kind of car, and that every car would have the same chassis; each customer could have his car painted any colour he wanted, as long as it was black. This consistent path to success through a very limited, standardised offer of a single car later became his problem, when GM started a model offensive with several configuration options. What can we learn from Henry Ford's vision and its implementation? He realised that individual construction steps and a large number of individually built parts are the main obstacle to reaching high quality and low cost. In modern cars we find the same problems today, with complex software systems built by individual "artists". A well-defined software construction process and standards for key components appear to be the successful way forward for automotive software engineering. One of the main topics in building complex and safety-critical software systems is establishing construction quality as key knowledge for automotive engineers.
Model based design enables the automatic generation of final-build software from models for high-volume automotive embedded systems. This paper presents a framework of processes, methods and tools for the design of automotive embedded systems. A steer-by-wire system serves as an example.
The increase in system-level modeling has given rise to a need for efficient functional validation of models above the cycle-accurate level. This paper presents a technique for comparing system-level models before and after the static scheduling of tasks on the processing elements of the architecture. We derive a graph representation from models written in system-level design languages (SLDLs) and define their execution semantics. A notion of functional equivalence of system-level models is established using these graphs. We then present well-defined rules for the reduction of such graphs to a normal form. Finally, we show how to check the functional equivalence of two system-level models via isomorphism of their normal graph representations. A checker built on this concept is used to automatically validate the functional correctness of the static scheduling step. As a result, the models generated for various scheduling decisions do not have to be re-verified using costly simulations.
In this paper we formally define an enhanced RTL semantics. It is intended to elevate the RTL design abstraction level and help bridge the HDL semantic gap among synthesis, simulation, and formal verification tools. We define the enhanced semantics based on a new RTL++ language that supports pipelined operations using a new pipelined register variable concept. The execution semantics of RTL++ is specified in a structural operational semantics style, intended to form the basis for related simulation and formal verification algorithm development. An RFSM model is defined to natively support the synthesis semantics of RTL++. We also present an example of extending SystemC to support the notion of pipelined register variables.
This paper presents the methodology and the modeling constructs we have developed to capture the real time aspects of RTOS simulation models in a System Level Design Language (SLDL) like SystemC. We describe these constructs and show how they are used to build a simulation model of an RTOS kernel targeting the μ-ITRON OS specification standard.
Transaction level modeling allows exploring several SoC design architectures leading to better performance and easier verification of the final product. In this paper, we present an approach to design and verify SystemC models at the transaction level. We integrate the verification as part of the design-flow. In the proposed approach, we first model both the design and the properties (written in PSL) in UML. Then, we translate them into an intermediate format modeled with Abstract State Machines (ASM). The ASM model is used to generate an FSM of the design including the properties. Checking the correctness of the properties is performed on-the-fly while generating the state machine. Finally, we translate the verified design to SystemC and map the properties to a set of assertions (as monitors in C#) that can be re-used to validate the design at lower levels through simulation. We illustrate our approach on two case studies including the PCI bus standard and a generic Master/Slave architecture from the SystemC library.
This paper gives an overview of a transaction level modeling (TLM) design flow for straightforward embedded system design with SystemC. The goal is to systematically develop both application-specific HW and SW components of an embedded system using the TLM approach, thus allowing for fast communication architecture exploration, rapid prototyping and early embedded SW development. To this end, we specify the lightweight transaction-based communication protocol SHIP and present a methodology for automatic mapping of the communication part of a system to a given architecture, including HW/SW interfaces.
The goal of this paper is to demonstrate a prevalent global deadlock situation resulting from a local deadlock in a GALS ring architecture. We present a novel design for building systems that are tolerant to such deadlocks arising in the local modules. This paper concentrates on the modeling of the proposed design methodology; its correctness is proved with the help of a public-domain verification tool.
Several years ago, the vertically integrated semiconductor companies started to disaggregate into separate sectors, such as fabless, EDA, IP, Design Services, DFT, foundry, and test & packaging houses. On the one hand, the disaggregated sectors are in the process of merging and optimizing their product lines, roles, and responsibilities. On the other hand, certain companies are trying to reverse the trend by reaggregating some of the disaggregated sectors. Will the reaggregation trend dominate? Which trend enables better semiconductor market growth? Which trend allows superior technological offerings? Who will be the shark?
Memory cores are usually the densest portion, with the smallest feature sizes, of system-on-chip (SOC) designs. The reliability of memory cores thus has a heavy impact on the reliability of SOCs. Transparent testing is a useful technique for improving the reliability of memories during their lifetime. This paper presents a systematic algorithm for transforming a bit-oriented march test into a transparent word-oriented march test. The transformed transparent march test has lower time complexity than those proposed in previous works [12, 13]. For example, if a memory with 32-bit words is tested with March C, the time complexity of the transparent word-oriented test transformed by the proposed scheme is only about 56% and 19% of the time complexity of the transparent word-oriented tests converted by the schemes reported in [12] and [13], respectively.
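As background, a transparent march element uses the memory's existing contents as the test background, so the memory is unchanged after the test; the bit-level toy below illustrates only this idea, not the paper's word-oriented transformation.

```python
def transparent_march(memory):
    """Per cell: read v, write ~v, read ~v, restore v; compact the reads."""
    signature = 0
    for addr in range(len(memory)):              # one ascending march element
        v = memory[addr]
        signature ^= v << (addr % 8)             # r v
        memory[addr] = v ^ 1                     # w ~v
        signature ^= memory[addr] << (addr % 8)  # r ~v
        memory[addr] = v                         # w v: contents restored
    return signature

mem = [1, 0, 0, 1, 1, 0]
ref = transparent_march(list(mem))           # predicted fault-free signature
assert transparent_march(list(mem)) == ref   # healthy memory matches
```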
Single Event Upsets (SEU) as well as permanent faults can significantly affect the correct on-line operation of digital systems, such as memories and microprocessors; a memory can be made resilient to permanent and transient faults by using modular redundancy and coding. In this paper, different memory systems are compared: these systems utilize simplex and duplex arrangements with a combination of Reed-Solomon coding and scrubbing. The memory systems and their operations are analyzed by novel Markov chains to characterize performance for dynamic reconfiguration as well as error detection and correction under the occurrence of permanent and transient faults. For a specific Reed-Solomon code, the duplex arrangement efficiently copes with the occurrence of permanent faults, while the use of scrubbing copes with transient faults. Index Terms: High-Reliability Systems, Reliability Evaluation, Reed-Solomon Codes, Scrubbing, Dynamic Redundancy.
Transient errors are one of the major reasons for system downtime in many systems. While prior research has mainly focused on the impact of transient errors on the datapath, caches, and main memories, the register file has largely been neglected. Since the register file is accessed very frequently, the probability of transient errors is high. In addition, errors in it can quickly spread to different parts of the system and cause application crashes or silent data corruption. This paper addresses the reliability of register files in superscalar processors. In particular, we propose to duplicate actively used physical registers in unused physical registers. The rationale behind this idea is that if the protection mechanism (parity or ECC) used for the primary copy indicates an error, the duplicate can provide the data as long as it is not corrupted. We implement two types of strategies based on this register duplication idea. In the "conservative strategy," we limit ourselves to the given register usage behavior and duplicate register contents only in otherwise unused registers. Consequently, there is no impact on the original performance when there is no error, apart from the protection mechanism used for the primary copy. Our experiments with two different versions of this strategy show that, with the more powerful conservative scheme, 78% of the accesses are to physical registers with duplicates. The "aggressive strategy" sacrifices some performance to increase the number of register accesses with duplicates. It does so by marking registers not used for a long time as "dead" and using them to duplicate actively used registers. The experiments with this strategy indicate that it raises the fraction of reliable register accesses to 84% and degrades the overall performance by only 0.21% on average.
In this paper we propose a new Built-In Current Sensor (BICS) to detect single event upsets in SRAM. The BICS is designed and validated for a 100nm process technology. A reliability analysis of the BICS with respect to process, voltage, temperature, and power-supply noise is provided. The BICS detects various shapes of current pulses generated by particle strikes. Its power consumption and area overhead are also reported. The BICS is found to be very reliable under process, voltage, and temperature variation and under stringent noise conditions.
Safety-critical embedded systems having to meet real-time constraints are expected to be highly predictable in order to guarantee at design time that certain timing deadlines will always be met. This requirement usually prevents designers from utilizing caches due to their highly dynamic, thus hardly predictable behavior. The integration of scratchpad memories represents an alternative approach which allows the system to benefit from a performance gain comparable to that of caches while at the same time maintaining predictability. In this work, we compare the impact of scratchpad memories and caches on worst case execution time (WCET) analysis results. We show that caches, despite requiring complex techniques, can have a negative impact on the predicted WCET, while the estimated WCET for scratchpad memories scales with the achieved performance gain at no extra analysis cost.
In this paper we present a new measurement-based worst-case execution time (WCET) analysis method. Exhaustive end-to-end measurements are computationally intractable in most cases; therefore, we propose to measure the execution times of subparts of the application. We use heuristic methods and model checking to generate test data that force the execution of selected paths, on which runtime measurements are performed. The measured times are used to calculate the WCET in a final computation step. As we operate at the source-code level, our approach is platform-independent except for the runtime measurements performed on the target host. We show the feasibility of the required steps and explain our approach by means of a case study.
The wider and wider use of high-performance processors as part of real-time systems makes it more and more difficult to guarantee that programs will respect their strict deadlines. While the computation of Worst-Case Execution Times relies on static analysis of the code, the challenge is to model with enough safety and accuracy the behaviour of intrinsically dynamic components. In this paper, we focus on the dynamic branch predictor. Several models to bound the number of branch mispredictions have been previously published. Some of them exhibit a high complexity while other ones have shown that taking into account semantic information from the source code makes things more tractable. We extend this work to more general nested loop structures. We also give some simulation results that show that the way branch mispredictions are usually taken into account cannot be both safe and accurate in the case of high-performance pipelines. We propose a more realistic approach to be used as part of WCET computation.
Static program analysis by abstract interpretation is an efficient method to determine properties of embedded software. One example is value analysis, which determines the values stored in the processor registers. Its results are used as input to more advanced analyses, which ultimately yield information about the stack usage and the timing behavior of embedded software.
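As a flavour of what such a value analysis computes, the toy below abstracts a register to an interval [lo, hi] and iterates a loop body to a fixpoint with widening; this is our minimal illustration, not a real analyser.

```python
def widen(old, new):
    """Widening: jump unstable bounds to infinity to force termination."""
    lo = old[0] if new[0] >= old[0] else float('-inf')
    hi = old[1] if new[1] <= old[1] else float('inf')
    return (lo, hi)

def analyse_loop():
    """Abstractly execute:  r0 = 0;  while r0 < 100: r0 += 4"""
    r0 = (0, 0)
    while True:
        guarded = (r0[0], min(r0[1], 99))        # filter by the loop guard
        body = (guarded[0] + 4, guarded[1] + 4)  # transfer function: r0 += 4
        joined = (min(r0[0], body[0]), max(r0[1], body[1]))
        new = widen(r0, joined)
        if new == r0:
            break                                # fixpoint reached
        r0 = new
    return (max(r0[0], 100), r0[1])              # at exit the guard is false

print(analyse_loop())   # (100, inf): sound, but imprecise due to widening
```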
In this work we consider battery powered portable systems which either have Field Programmable Gate Arrays (FPGA) or voltage and frequency scalable processors as their main processing element. An application is modeled in the form of a precedence task graph at a coarse level of granularity. We assume that for each task in the task graph several unique design-points are available which correspond to different hardware implementations for FPGAs and different voltage-frequency combinations for processors. It is assumed that performance and total power consumption estimates for each design-point are available for any given portable platform, including the peripheral components such as memory and display power usage. We present an iterative heuristic algorithm which finds a sequence of tasks along with an appropriate design-point for each task, such that a deadline is met and the amount of battery energy used is as small as possible. A detailed illustrative example along with a case study of a real-world application of a robotic arm controller which demonstrates the usefulness of our algorithm is also presented.
Side channel attacks are a major security concern for smart cards and other embedded devices. They analyze the variations on the power consumption to find the secret key of the encryption algorithm implemented within the security IC. To address this issue, logic gates that have a constant power dissipation independent of the input signals, are used in security ICs. This paper presents a design methodology to create fully connected differential pull down networks. Fully connected differential pull down networks are transistor networks that for any complementary input combination connect all the internal nodes of the network to one of the external nodes of the network. They are memoryless and for that reason have a constant load capacitance and power consumption. This type of networks is used in specialized logic gates to guarantee a constant contribution of the internal nodes into the total power consumption of the logic gate.
A novel energy reduction strategy that maximally exploits dynamic workload variation is proposed for the offline voltage scheduling of preemptive systems. The idea is to construct a fully-preemptive schedule that leads to minimum energy consumption when the tasks take on approximately their average execution cycles, yet still guarantees no deadline violation in the worst-case scenario. The end-time of each sub-instance of the tasks, obtained from the schedule, is used for the online dynamic voltage scaling (DVS) of the tasks. For tasks that normally require a small number of cycles but occasionally a large number of cycles to complete, such a schedule provides more opportunities for slack utilization and hence results in larger energy savings. The concept is realized by formulating the problem as a Non-Linear Programming (NLP) optimization problem. Experimental results show that, by using the proposed scheme, the total energy consumption at runtime is reduced by as much as 60% for randomly generated task sets when compared with a static scheduling approach that uses only the worst-case workload.
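Under our reading of the scheme, the online step reduces to setting the speed so that the task's worst-case remaining cycles finish exactly at its precomputed end-time; early completions then automatically pass slack to the next task. A hedged sketch with invented numbers:

```python
def dispatch_speed(wcec_remaining, now, end_time, f_min, f_max):
    """Lowest speed finishing `wcec_remaining` cycles by `end_time`."""
    needed = wcec_remaining / (end_time - now)
    return min(f_max, max(f_min, needed))

# Task A usually finishes early, so task B starts sooner and can run slower.
now = 0.0
fA = dispatch_speed(wcec_remaining=2e6, now=now, end_time=4e-3,
                    f_min=1e8, f_max=1e9)
now += 2e6 * 0.4 / fA          # A needed only 40% of its worst-case cycles
fB = dispatch_speed(wcec_remaining=3e6, now=now, end_time=10e-3,
                    f_min=1e8, f_max=1e9)
print(f"A at {fA:.2e} Hz, B at {fB:.2e} Hz")
```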
Low-power-oriented circuit optimization consists in selecting the best alternative among gate sizing, buffer insertion, and logic structure transformation to satisfy a delay constraint at minimum area cost. In this paper we use a closed-form model of delay in CMOS structures to define metrics for a deterministic selection of the optimization alternative, the target being delay constraint satisfaction at minimum area cost. We validate the design space exploration method by defining maximum and minimum delay bounds on logical paths. We then adapt this method to a "constant sensitivity method" that allows a circuit to be sized at minimum area under a delay constraint. An optimisation protocol is finally defined to manage the trade-off between the performance constraint and the circuit structure. These methods are implemented in an optimization tool (POPS) and validated by comparing, on a 0.25μm process, the optimization efficiency obtained on various benchmarks (ISCAS'85) with that of an industrial tool.
This paper presents a design flow for an improved selective multi-threshold (Selective-MT) circuit. The Selective-MT circuit is improved so that multiple MT-cells can share one switch transistor. We propose a design methodology from RTL (Register Transfer Level) to final layout that optimizes the switch transistor structure.
Many existing thermal management techniques focus on reducing the overall power consumption of the chip, and do not address location-specific temperature problems referred to as hotspots. We propose the use of dynamic runtime reconfiguration to shift the hotspot-inducing computation periodically and make the thermal profile more uniform. Our analysis shows that dynamic reconfiguration is an effective technique in reducing hotspots for NoCs.
In this paper, we investigate the impact of Tox and Vth on power-performance trade-offs for on-chip caches. We start by examining the optimization of the various components of a single-level cache and then extend this to two-level cache systems. In addition to leakage, our studies also account for the dynamic power expended as a result of cache misses. Our results show that one can often reduce overall power by increasing the size of the L2 cache if only one Vth/Tox pair is allowed in L2. However, if the memory cells and the peripherals may each have their own Vth's and Tox's, a two-level cache system with a smaller L2 yields less total leakage. We further show that two Vth's and two Tox's are sufficient to get close to an optimal solution, and that Vth is generally a better design knob than Tox for leakage optimization; thus it is better to restrict the number of Tox's rather than Vth's if cost is a concern.
This session addresses different approaches to automotive design architectures: state-of-the-art and trends in automotive system architectures from a tier one supplier's perspective, a new approach to reconfigurable architectures as well as a new trend in automotive system design, i.e. platforms to integrate several in-car services and telecommunication services in one configurable approach. The speakers are a mix of industrial and academic experts with experience in automotive system design.
Increasing functional and non-functional requirements in automotive electric/electronic vehicle development will significantly expand the integration of novel functions into the embedded networks. Major driving forces are the demand for driver assistance functions, active and passive safety systems, and the fulfillment of environmental and legal requirements. The contribution will demonstrate that this task in system design can only be managed if the non-competitive elements are developed jointly within the automotive industry, leading to infrastructure standards such as AUTOSAR, FlexRay, and LIN. Working on such a basis, OEMs can have a dedicated system design environment for competitive implementations of functions, starting already in early phases with feasibility studies. This basis consequently remains a fixed point throughout series development and even in the maintenance phase, and it enables shared functional development and exploitation, as well as in-project adaptations of hardware developments driven from outside the automotive industry.
Linear Pseudo-Boolean Optimization (PBO) is a widely used modeling framework in Electronic Design Automation (EDA). Due to significant advances in Boolean Satisfiability (SAT), new algorithms for PBO have emerged, which are effective on highly constrained instances. However, these algorithms fail to handle effectively the information provided by the cost function of PBO. This paper addresses the integration of lower bound estimation methods with SAT-related techniques in PBO solvers. Moreover, the paper shows that the utilization of lower bound estimates can dramatically improve the overall performance of PBO solvers for most existing benchmarks from EDA.
We present new techniques for improving search in a hybrid Davis-Putnam-Logemann-Loveland based constraint solver for RTL circuits (HDPLL). In earlier work on HDPLL [7], the authors combined solvers for integer and Boolean domains using finite-domain constraint propagation with heuristic conflict-based learning. In this work, we describe a new algorithm that extends the conflict-based unique-implication point learning in Boolean SAT solvers to hybrid Boolean-Integer domains in HDPLL. We describe data-structures for efficient constraint propagation on the hybrid learned relations, similar to two-literal watching in Boolean SAT. We demonstrate that these new techniques provide considerable performance benefits when compared with other combinations of decision theories.
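For readers unfamiliar with the Boolean baseline, the condensed sketch below shows two-literal watching on plain CNF (the paper's contribution is extending this style of propagation to hybrid Boolean/integer learned relations): each clause watches two literals, and work is done only when a watched literal becomes false.

```python
from collections import defaultdict

def value(lit, assign):
    """Truth value of a signed-integer literal, or None if unassigned."""
    v = assign.get(abs(lit))
    return None if v is None else (v if lit > 0 else not v)

def assign_true(lit, clauses, watch, assign):
    """Make `lit` true; return ('ok', implied) or ('conflict', clause_idx)."""
    assign[abs(lit)] = lit > 0
    implied = []
    for ci in list(watch[-lit]):                 # -lit just became false
        clause = clauses[ci]
        other = next(l for l in clause if l != -lit and ci in watch[l])
        # Look for a non-false replacement watch.
        repl = next((l for l in clause
                     if l not in (-lit, other)
                     and value(l, assign) is not False), None)
        if repl is not None:
            watch[-lit].remove(ci)
            watch[repl].add(ci)
        elif value(other, assign) is None:
            implied.append(other)                # unit: other watch is implied
        elif value(other, assign) is False:
            return 'conflict', ci
    return 'ok', implied

clauses = [[1, 2, 3]]
watch = defaultdict(set)
watch[1].add(0); watch[2].add(0)
assign = {}
print(assign_true(-1, clauses, watch, assign))   # watch moves to literal 3
print(assign_true(-3, clauses, watch, assign))   # unit: literal 2 implied
```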
Eliminating irrelevant variables from a counterexample to make it easier to understand is a hot research topic. The BFL algorithm is the most effective counterexample minimization algorithm among existing approaches, but its time overhead is very large because it makes one call to a SAT solver for each candidate variable to be eliminated. The key to reducing this overhead is to eliminate multiple variables simultaneously. We therefore propose a faster counterexample minimization algorithm based on refutation analysis. We perform refutation analysis on the UNSAT instances produced by BFL to extract the set of variables that lead to unsatisfiability; all variables not in this set can be eliminated simultaneously as irrelevant. Thus we can eliminate multiple variables with only one call to the SAT solver. Theoretical analysis and experimental results show that our algorithm can be 2 to 3 orders of magnitude faster than the existing BFL algorithm, with only a minor loss in counterexample minimization ability.
Functional verification of microprocessors is one of the most complex and expensive tasks in the current system-on-chip design process. A significant bottleneck in the validation of such systems is the lack of a suitable functional coverage metric. This paper presents a functional-coverage-based test generation technique for pipelined architectures. The proposed methodology makes three important contributions. First, a general graph-theoretic model is developed that can capture the structure and behavior (instruction set) of a wide variety of pipelined processors. Second, we propose a functional fault model that is used to define functional coverage for pipelined architectures. Finally, test generation procedures are presented that accept the graph model of the architecture as input and generate test programs to detect all the faults in the functional fault model. Our experimental results on two pipelined processor models demonstrate that the number of test programs generated by our approach to obtain full coverage of the functional fault model is an order of magnitude smaller than the number generated by traditional random or constrained-random test generation techniques.
This paper introduces a new SAT solver that integrates logic-based reasoning and integer programming methods to systems of CNF and PB constraints. Its novel features include an efficient PB literal watching strategy and several PB learning methods that take advantage of the pruning power of PB constraints while minimizing their overhead.
Current algorithms for bounded model checking use SAT methods for checking satisfiability of Boolean formulae. These methods suffer from the potential memory explosion problem. Methods based on the validity of Quantified Boolean Formulae (QBF) allow an exponentially more succinct representation of formulae to be checked, because no "unrolling" of the transition relation is required. These methods have not been widely used, because of the lack of an efficient decision procedure for QBF. We evaluate the usage of QBF in bounded model checking (BMC), using general-purpose SAT and QBF solvers. We develop a special-purpose decision procedure for QBF used in BMC, and compare our technique with the methods using general-purpose SAT and QBF solvers on real-life industrial benchmarks.
In this paper a non-canonical circuit-based state set representation is used to efficiently perform quantifier elimination. The novelty of this approach lies in adapting equivalence checking and logic synthesis techniques, to the goal of compacting circuit based state set representations resulting from existential quantification. The method can be efficiently combined with other verification approaches such as inductive and SAT-based pre-image verifications.
UML 2.0 provides a rich set of diagrams for systems documentation and specification. Many efforts have been undertaken to employ different aspects of UML for multiple domains, mainly in the area of software systems. Considering the area of electronic design automation, however, we currently see only a few approaches that investigate UML for hardware design and hardware/software co-design. In this article, we present an approach for executable UML that closes the gap from system specification to its model-based execution on reconfigurable hardware. For this purpose, we present our Abstract Execution Platform (AEP), which is based on a virtual machine running an executable UML subset for embedded software and reconfigurable hardware. This subset combines UML 2.0 Class, StateMachine, and Sequence Diagrams for complete system specification. We describe how these binary-encoded UML specifications can be directly executed and give an implementation of such a virtual machine on a Virtex II FPGA. Finally, we present evaluation results comparing the AEP implementation with C code on a C167 microcontroller.
Developing a functional prototype of a system-on-chip provides a unifying vehicle for model validation and system refinement. Keeping the prototype executable across several abstraction levels, clock domains and design tools is a key requirement to effective prototyping. This paper presents how model-level transactors address design heterogeneity by unifying event-based and cycle-based worlds from specification to implementation. Transactors are used to build a functional prototype of a software-radio component. An executable UML model is bridged to a hardware abstraction of a radio stream developed with Simulink to implement a realistic and working prototype. Model validation and performance measurements are realized through prototype execution and real-time monitoring.
In this paper, we present a SoC design methodology joining the capabilities of UML and SystemC to operate at system level. We present a UML 2.0 profile of the SystemC language that exploits the MDA capability of defining modeling languages that are platform independent and reducible to platform-dependent languages. The UML profile captures both the structural and the behavioral features of the SystemC language, and allows high-level modeling of systems-on-a-chip with straightforward translation to SystemC code.
Unified Modeling Language (UML) 2.0 is emerging in the area of embedded system design. This paper presents a new UML 2.0 profile - called TUT-Profile - that introduces a set of stereotypes and design rules for an application, platform, and mapping. The profile classifies different application and platform components, and enables their parameterization. TUT-Profile concentrates on the structure of an application and platform, and utilizes standard UML 2.0 for the behavioral modeling. The application is seen as a set of active classes with an internal behavior. Correspondingly, the platform is seen as a component library with a parameterized presentation in UML 2.0 for each library component.
UML is gaining increased attention as a system design language, as indicated by current standardization activities such as the SysML initiative and the UML for SoC Forum. Moreover the adoption of UML 2 is a significant step towards a broader range of modeling capabilities. This paper provides an overview of the impact of these recent advances on the application of UML for SoC and NoC development, proposes a model-driven development method taking benefit of the best techniques recently introduced, and investigates the design of power efficient systems with UML.
Hardware/software co-design seeks to meet performance objectives via a combination of hardware and software modules. One difficulty in reaching these objectives lies in a lack of cohesion and increased coupling amongst the implemented modules, which results in an increased inter-module communication cost. While most traditional partitioning approaches are initiated in the post-coding phase, we suggest the design stage may be a better focus of attention in addressing this problem. In this paper, we propose a novel approach that uses information from sequence diagrams in UML designs to help ease the partitioning problem.
Both the number of embedded memories and the total embedded memory content in our chips are growing steadily. It is time for chip designers, EDA makers, and test engineers to update their knowledge of memories. This Hot Topic paper provides an embedded tutorial on embedded memories, in terms of what is new and coming versus what is old and vanishing, and the associated design, test, and repair challenges related to using embedded memories.
With new sophisticated compiler technology, it is possible to schedule distant instructions efficiently. As a consequence, the amount of exploitable instruction-level parallelism (ILP) in applications has gone up considerably. However, monolithic register file VLIW architectures present scalability problems due to a centralized register file (RF) that is far slower than the functional units (FUs). Clustered VLIW architectures, with a subset of FUs connected to each RF, are a solution to this scalability problem. Recent studies with a wide variety of inter-cluster interconnection mechanisms have reported substantial gains in performance (number of cycles) over the most studied RF-to-RF type of interconnection. However, these studies have compared only one or two design points in the RF-to-RF interconnect design space. In this paper, we extend the previously reported work. We consider both multi-cycle and pipelined buses. To obtain realistic bus latencies, we synthesized the various architectures and obtained post-layout clock periods. The results demonstrate that while there is very little variation in interconnect area, all the bus-based architectures are heavily performance constrained. Also, neither multi-cycle nor pipelined buses, nor increasing the number of buses, can achieve performance comparable to point-to-point interconnects.
With the advent of multi-processor systems on a chip, the interest for message passing libraries has revived. Message passing helps in mastering the design complexity of parallel systems. However, to satisfy the stringent energy-budget of embedded applications, the message passing overhead should be limited. Recently, several hardware extensions have been proposed for reducing the transfer cost on a distributed memory architecture. Unfortunately, they ignore the synchronization cost between sender/receiver and/or require many dedicated hardware blocks. To overcome the above limitations, we present in this paper light-weight support for message passing. Moreover, we have made our library as flexible as possible such that we can optimally match the application with the target architecture. We demonstrate the benefits of our approach by means of representative benchmarks from the multimedia domain.
Embedded software continues to play an ever-increasing role in the design of complex embedded applications. In part, the elevated level of abstraction provided by a high-level programming paradigm immensely facilitates a short design cycle, fewer errors, portability, and reuse. Serializing compilers have been proposed as an alternative to traditional OS techniques, enabling a designer to develop multitasking applications without the need for OS support. In this work, we outline the inner workings of the Phantom serializing compiler and analyze the quality of the generated code with respect to memory and processing overheads. Our results show that such serializing compilers are extremely efficient, making them ideal for the design of highly parallel applications (e.g., multimedia, graphics, and signal processing applications).
Instruction Level Parallelism (ILP) extraction for multi-cluster VLIW processors is a very hard task. In this paper, we propose a retargetable architecture that can exploit ILP and thread level parallelism jointly, thus allowing an easier parallelism extraction and improving the performance with respect to traditional multicluster VLIW processors.
We propose an efficiently preconditioned generalized minimal residual (GMRES) method for fast SPICE-accurate transient simulation of parasitic-sensitive deep-submicron VLSI circuits. First, when time step-sizes vary within a predefined range, the preconditioned GMRES method is applied to solve the circuit matrix equations rather than LU factorization. The preconditioner we use comes directly from the previously factorized L and U matrices. Second, to keep using the same preconditioner during the nonlinear iterations, the successive variable chord method is applied as an alternative to the Newton-Raphson method. An improved piecewise weakly nonlinear definition of MOSFETs is adopted, and the low-rank update technique is implemented to refresh the preconditioner efficiently. With these techniques, the number of required LU factorizations during transient simulation is reduced dramatically. Experimental results on power/ground networks demonstrate that the proposed method yields SPICE-like accuracy with roughly an 18X overall CPU time speedup over SPICE3 for circuits with tens of thousands of elements.
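To make the preconditioner-reuse idea concrete, here is a minimal Python/SciPy sketch (an illustration under assumed data, not the paper's implementation): a matrix from an earlier time step is LU-factorized once, and that factorization is reused as the GMRES preconditioner for a slightly perturbed matrix later on. The matrices are hypothetical stand-ins for real circuit Jacobians.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
# "Old" circuit matrix, factorized once; its L and U become the preconditioner.
A_prev = sp.diags([1.0, 4.0, 1.0], [-1, 0, 1], shape=(n, n), format="csc")
lu = spla.splu(A_prev)

# At a later time step the matrix changes slightly, but we keep the old LU.
A_now = A_prev + sp.random(n, n, density=1e-4, format="csc") * 0.01
b = np.ones(n)

M = spla.LinearOperator((n, n), matvec=lu.solve)  # M ~ A_prev^-1
x, info = spla.gmres(A_now, b, M=M)               # no fresh LU factorization
assert info == 0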
New nanotechnology-based devices are replacing CMOS devices to overcome CMOS technology's scaling limitations. However, many such devices exhibit nonmonotonic I-V characteristics and uncertain properties, which lead to the negative differential resistance (NDR) problem and chaotic behavior. This paper proposes a new circuit simulation approach that can effectively simulate nanotechnology devices with uncertain input sources and the NDR problem. The experimental results show a 20-30 times speedup compared with existing simulators.
Variability in process parameters is making accurate timing analysis of nano-scale integrated circuits an extremely challenging task. In this paper, we propose a new algorithm for statistical timing analysis using Levelized Covariance Propagation (LCP). The algorithm simultaneously considers the impact of random placement of dopants (which makes every transistor in a die independent in terms of threshold voltage) and the spatial correlation of process parameters such as channel length, transistor width, and oxide thickness due to intra-die variations. It also considers the signal correlation due to reconvergent paths in the circuit. Results on several benchmark circuits in 70nm technology show average errors of 0.21% and 1.07% in the mean and standard deviation, respectively, for timing analysis using the proposed technique compared to Monte-Carlo analysis.
Since the advent of new nanotechnologies, the variability of gate delay due to process variations has become a major concern. This paper proposes a new gate delay model that includes the impact of both process variations and multiple input switching. The proposed model uses an orthogonal-polynomial-based probabilistic collocation method to construct an analytical delay equation from circuit timing performance. Experimental results show that our approach has less than 0.2% error on the mean delay of gates and less than 3% error on the standard deviation.
A technique based on the sensitivity of the output to input waveform is presented for accurate propagation of delay information through a gate for the purpose of static timing analysis (STA) in the presence of noise. Conventional STA tools represent a waveform by its arrival time and slope. However, this is not an accurate way of modeling the waveform for the purpose of noise analysis. The key contribution of our work is the development of a method that allows efficient propagation of equivalent waveforms throughout the circuit. Experimental results demonstrate higher accuracy of the proposed sensitivity-based gate delay propagation technique, SGDP, compared to the best of existing approaches. SGDP is compatible with the current level of gate characterization in conventional ASIC cell libraries, and as a result, it can be easily incorporated into commercial STA tools to improve their accuracy.
For Systems-on-Chip (SoCs) development, a predominant part of the design time is the simulation time. Performance evaluation and design space exploration of such systems in bit- and cycle-true fashion is becoming prohibitive. We propose a traffic generation (TG) model that provides a fast and effective Network-on-Chip (NoC) development and debugging environment. By capturing the type and the timestamp of communication events at the boundary of an IP core in a reference environment, the TG can subsequently emulate the core's communication behavior in different environments. Access patterns and resource contention in a system are dependent on the interconnect architecture, and our TG is designed to capture the resulting reactiveness. The regenerated traffic, which represents a realistic workload, can thus be used to undertake faster architectural exploration of interconnection alternatives, effectively decoupling simulation of IP cores and of interconnect fabrics. The results with the TG on an AMBA interconnect show a simulation time speedup above a factor of 2 over a complete system simulation, with close to 100% accuracy.
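The capture/replay idea can be sketched in a few lines of Python (the trace format and the FifoBus stand-in are hypothetical, not the paper's TG interface): events are recorded with deltas relative to the previous event, so contention in a new environment naturally shifts subsequent events, preserving reactiveness.

import collections

Event = collections.namedtuple("Event", "kind delta addr")

def capture(trace):
    """Turn (kind, timestamp, addr) tuples from a reference run into deltas."""
    events, prev = [], 0
    for kind, t, addr in trace:
        events.append(Event(kind, t - prev, addr))
        prev = t
    return events

class FifoBus:
    """Stand-in interconnect: one outstanding transaction, fixed latency."""
    def __init__(self, latency):
        self.latency, self.free_at = latency, 0
    def issue(self, kind, addr, t):
        start = max(t, self.free_at)   # contention delays the transaction
        self.free_at = start + self.latency
        return self.free_at

def replay(events, interconnect):
    now = 0
    for e in events:
        now += e.delta                               # IP core "think time"
        now = interconnect.issue(e.kind, e.addr, now)
    return now                                       # total simulated cycles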
Detailed modeling of processors and high performance cycle-accurate simulators are essential for today's hardware and software design. These problems are challenging enough by themselves and have seen many previous research efforts. Addressing both simultaneously is even more challenging, with many existing approaches focusing on one over another. In this paper, we propose the Reduced Colored Petri Net (RCPN) model that has two advantages: first, it offers a very simple and intuitive way of modeling pipelined processors; second, it can generate high performance cycle-accurate simulators. RCPN benefits from all the useful features of Colored Petri Nets without suffering from their exponential growth in complexity. RCPN processor models are very intuitive since they are a mirror image of the processor pipeline block diagram. Furthermore, in our experiments on the generated cycle-accurate simulators for XScale and StrongArm processor models, we achieved an order of magnitude (~15 times) speedup over the popular SimpleScalar ARM simulator.
In this paper, the application of a cycle accurate binary translator for rapid prototyping of SoCs will be presented. This translator generates code to run on a rapid prototyping system consisting of a VLIW processor and FPGAs. The generated code is annotated with information that triggers cycle generation for the hardware in parallel to the execution of the translated program. The VLIW processor executes the translated program whereas the FPGAs contain the hardware for the parallel cycle generation and the bus interface that adapts the bus of the VLIW processor to the SoC bus of the emulated processor core.
Designers of factory automation applications increasingly demand tools for rapid prototyping of hardware extensions to existing systems and for verification of the resulting behaviors through hardware/software co-simulation. This work presents a framework for the timing-accurate cosimulation of HDL models and their verification against hardware and software running on an actual embedded device, requiring only minimal knowledge of the current design. Experiments on real-life applications show that early architectural and design decisions can be taken by measuring the expected performance on models realized using the proposed framework.
This paper proposes a technique to integrate and simulate dynamic memory in a multiprocessor framework based on C/C++/SystemC. Using the host machine's memory management capabilities, dynamic data processing is supported without compromising the speed and accuracy of the simulation. A first prototype in a shared memory context is presented.
In today's embedded applications a significant portion of energy is spent in the memory subsystem. Several approaches have been proposed to minimize this energy, including the use of scratch pad memories, with many based on static analysis of a program. However, often it is not possible to perform static analysis and optimization of a program's memory access behavior unless the program is specifically written for this purpose. In this paper we introduce the FORAY model of a program that permits aggressive analysis of the application's memory behavior that further enables such optimizations since it consists of "for" loops and array accesses which are easily analyzable. We present FORAY-GEN: an automated profile-based approach for extraction of the FORAY model from the original program. We also demonstrate how FORAY-GEN enhances applicability of other memory subsystem optimization approaches, resulting in an average of two times increase in the number of memory references that can be analyzed by existing static approaches.
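A hypothetical before/after pair illustrates the FORAY form (Python for readability; FORAY-GEN itself works on profiled programs): the first routine hides its access pattern behind pointer-like control flow, while the second exposes the same computation as a "for" loop over explicit array indices that static analysis can handle.

def original(img, n):
    total, p = 0, 0
    while p is not None:            # access pattern opaque to static analysis
        total += img[p]
        p = p + 3 if p + 3 < n else None
    return total

def foray_form(img, n):
    total = 0
    for i in range(0, n, 3):        # affine loop bounds + plain array access
        total += img[i]
    return total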
Main memories can consume a large percentage of overall energy in many data-intensive embedded applications. Past research proposed and evaluated memory banking as a possible approach for reducing memory energy consumption. One of the common characteristics/assumptions made by most of the past work on banking is that all the banks are of the same size. While this makes the formulation of the problem easy, it also restricts the potential solution space. Motivated by this observation, this paper investigates the possibility of employing nonuniform bank sizes for reducing memory energy consumption. Specifically, it proposes an integer linear programming (ILP) based approach that returns the optimal nonuniform bank sizes and the accompanying data-to-bank mapping. It also studies how data migration can further improve over nonuniform banking. We implemented our approach using an ILP tool and performed extensive experiments. The results show that the proposed strategy brings important energy benefits over the uniform banking scheme, and that data migration across banks tends to increase these savings.
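A toy version of such an ILP, written with the PuLP modeling library, might look as follows; the bank sizes, energy and leakage numbers, and data blocks are invented for illustration, and the constraints are simpler than the paper's formulation.

import pulp

sizes = [1, 2, 4]                  # candidate bank sizes (KB)
e_acc = {1: 1.0, 2: 1.4, 4: 2.0}   # energy per access grows with bank size
leak  = {1: 0.2, 2: 0.3, 4: 0.5}   # leakage cost per instantiated bank
data  = {"a": (3, 900), "b": (1, 500), "c": (4, 40)}  # name: (KB, accesses)
B = 3                              # maximum number of banks

prob = pulp.LpProblem("nonuniform_banking", pulp.LpMinimize)
x = {(d, s): pulp.LpVariable(f"x_{d}_{s}", cat="Binary")
     for d in data for s in sizes}          # datum d placed in a size-s bank
u = {s: pulp.LpVariable(f"u_{s}", lowBound=0, cat="Integer") for s in sizes}

prob += (pulp.lpSum(data[d][1] * e_acc[s] * x[d, s] for d in data for s in sizes)
         + pulp.lpSum(leak[s] * u[s] for s in sizes))
for d in data:                              # every datum mapped exactly once
    prob += pulp.lpSum(x[d, s] for s in sizes) == 1
for s in sizes:                             # capacity of the chosen banks
    prob += pulp.lpSum(data[d][0] * x[d, s] for d in data) <= s * u[s]
prob += pulp.lpSum(u[s] for s in sizes) <= B    # bank count budget
prob.solve(pulp.PULP_CBC_CMD(msg=False))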
A formal methodology for the analysis of a closed-loop clock distribution and active deskewing network is proposed. In this paper an active clock distribution and deskewing network is modeled as a closed-loop feedback system using state space equations. State space analysis allows systematic analysis of any clock distribution and deskewing system to determine the conditions under which the system can over-compensate and become potentially unstable. Such an analysis can be very useful to designers, as they will be able to determine analytically how the clock deskewing system behaves. By using the proposed approach, repeated simulations can be greatly limited and may be entirely avoided. We applied the proposed method to an experimental clock deskewing system to illustrate its effectiveness. The proposed approach can be further extended to determine the performance of such systems under different configurations.
We present an optimal buffer sizing and buffer insertion methodology that uses stochastic models of the architecture and Continuous Time Markov Decision Processes (CTMDPs). Such a methodology is useful in managing the scarce buffer resources available on chip, as compared to network-based data communication, which can have large buffer space. Modeling this problem in a CTMDP framework leads to a nonlinear formulation due to the usage of bridges in the bus architecture. We present a methodology to split the problem into several smaller, linear subsystems, which we then solve.
This paper presents an exploration approach for choosing a suitable size of Scratch-Pad Memory (SPM) for maximal performance improvement of a given application. The approach uses an extended control flow graph (ECFG) to describe the application and provides a solution that reduces the additional overhead of moving nodes to the SPM. Experiments achieve on average an 11% increase in performance compared to previous approaches and a 44% decrease in the application's runtime compared to an environment without SPM.
The design productivity gap requires more efficient design methods. Software systems have faced the same challenge and seem to have mastered it with the introduction of more abstract design methods. The UML has become the standard for software systems modeling and thus the foundation of new design methods. Although the UML is defined as a general-purpose modeling language, its application to hardware and hardware/software codesign is very limited. In order to successfully apply the UML in these fields, it is essential to understand its capabilities and to map it to the new domain.
Let's be clear from the outset: SoC can most certainly make use of UML; SoC just doesn't need more UML, or even all of it. The advent of model mappings, coupled with marks that indicate which mapping rule to apply, enable a major simplification of the use of UML in SoC.
In this paper, we propose a method for integrating UML models into the current SoC design process. UML is introduced as a formal model of specification for SoC design. The consistency and completeness of the specification are validated based on the formal UML model. The implementation is validated by a systematic derivation of test scenarios from the UML model. The method has been applied to the design of a new media-processing chip for mobile devices. The application of the method shows that it is not only effective for finding logical errors in the implementation, but also eliminates errors due to inconsistency and incompleteness of the specification.
Overheating has been acknowledged as a major issue in testing complex SOCs. Several power-constrained system-level DFT solutions (power-constrained test scheduling) have recently been proposed to tackle this problem. However, as will be shown in this paper, imposing a chip-level maximum power constraint does not necessarily avoid local overheating, due to the non-uniform distribution of power across the chip. This paper proposes a new approach for dealing with overheating during test by embedding thermal awareness into test scheduling. The proposed approach facilitates rapid generation of thermally safer test schedules without requiring time-consuming thermal simulations. This is achieved by employing a low-complexity test session thermal model that guides the test schedule generation algorithm. This approach reduces the chances of a design re-spin due to potential overheating during test.
Power dissipation during test is a major challenge in testing integrated circuits. Dynamic power has been the dominant part of power dissipation in CMOS circuits; however, in future technologies the static portion of power dissipation will exceed the dynamic portion. This paper proposes an efficient technique to reduce both dynamic and static power dissipation in scan structures. Scan cell outputs that are not on the critical path(s) are multiplexed to fixed values during scan mode. These constant values and the primary inputs are selected such that the transitions occurring on non-multiplexed scan cells are suppressed and the leakage current during scan mode is decreased. A method for finding these vectors is also proposed. The effectiveness of this technique is demonstrated by experiments performed on ISCAS89 benchmark circuits.
This paper proposes a diagnosis scheme aimed at reducing the diagnosis time of distributed small embedded SRAMs (e-SRAMs). The scheme improves the one proposed in [7, 8]. The improvements are mainly two-fold. On one hand, the diagnosis of time-consuming Data Retention Faults (DRFs), which is neglected by the diagnosis architecture in [7, 8], is now considered and performed via a DFT technique referred to as the "No Write Recovery Test Mode (NWRTM)". On the other hand, a pair comprising a Serial to Parallel Converter (SPC) and a Parallel to Serial Converter (PSC) is utilized to replace the bi-directional serial interface, to avoid the problems of serial fault masking and defect-rate-dependent diagnosis. Results from our evaluations show that the proposed diagnosis scheme achieves increased diagnosis coverage and reduced diagnosis time compared to [7, 8], with negligible extra area cost.
This paper gives an overview of a new technique, named pseudo-ring testing (PRT). PRT can be applied to test a wide range of random access memories (RAMs): bit- or word-oriented, and single- or dual-port RAMs. An essential particularity of the proposed methodology is the emulation of a linear automaton over a Galois field by the memory's own components.
This paper describes a flexible logic BIST scheme that features high fault coverage achieved by fault-simulation guided test point insertion, real at-speed test capability for multi-clock designs without clock frequency manipulation, and easy physical implementation due to the use of a low-speed SE signal. Application results of this scheme to two widely used IP cores are also reported.
In this paper we present an approach to the design optimization of fault-tolerant embedded systems for safety-critical applications. Processes are statically scheduled and communications are performed using the time-triggered protocol. We use process re-execution and replication for tolerating transient faults. Our design optimization approach decides the mapping of processes to processors and the assignment of fault-tolerant policies to processes such that transient faults are tolerated and the timing constraints of the application are satisfied. We present several heuristics which are able to find fault-tolerant implementations given a limited amount of resources. The developed algorithms are evaluated using extensive experiments, including a real-life example.
Utilizing on-chip caches in embedded multiprocessor system-on-a-chip (MPSoC) based systems is critical from both performance and power perspectives. While most prior work targeting cache behavior optimization operates at the hardware and compilation levels, the operating system (OS) can also play a major role, as it sees the global access pattern information across applications. This paper proposes a cache-conscious OS process scheduling strategy based on data reuse. The proposed scheduler implements two complementary approaches. First, processes that do not share any data between them are scheduled on different cores whenever possible. Second, processes that cannot be executed at the same time (due to dependences) but share data among each other are mapped to the same processor core so that they share the cache contents. Our experimental results using this new data-locality-aware OS scheduling strategy are promising, and show significant improvements in task completion times.
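A greedy sketch of the two rules (hypothetical structures, not the paper's scheduler): co-locate processes that share data, spread the rest across cores.

def assign_cores(procs, shares, n_cores):
    """procs: process ids in schedule order; shares(p, q) -> True if p, q share data."""
    core_of, load = {}, [0] * n_cores
    for p in procs:
        sharer_cores = {core_of[q] for q in core_of if shares(p, q)}
        if sharer_cores:
            # rule 2: run on the same core as its sharers to reuse cache contents
            core = min(sharer_cores, key=lambda c: load[c])
        else:
            # rule 1: no shared data -> pick the least-loaded (ideally idle) core
            core = min(range(n_cores), key=lambda c: load[c])
        core_of[p] = core
        load[core] += 1
    return core_of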
Heterogeneous Multi-Processor SoC platforms bear the potential to optimize the conflicting performance, flexibility, and energy efficiency constraints imposed by demanding signal processing and networking applications. However, in order to take advantage of the available processing and communication resources, an optimal mapping of the application tasks onto the platform resources is of crucial importance. In this paper, we propose a SystemC-based simulation framework that enables the quantitative evaluation of application-to-platform mappings by means of an executable performance model. The key element of our approach is a configurable event-driven Virtual Processing Unit that captures the timing behavior of multi-processor/multi-threaded MP-SoC platforms. The framework features an XML-based declarative construction mechanism for the performance model to significantly accelerate navigation in large design spaces. The capabilities of the proposed framework in terms of design space exploration are presented through a case study of a commercially available MP-SoC platform for networking applications. Focusing on application-to-architecture mapping, our framework highlights the potential of an efficient design space exploration environment for optimization.
As compared to the large spectrum of performance optimizations, relatively little effort has been dedicated to optimizing other aspects of embedded applications such as memory space requirements, power, real-time predictability, and reliability. In particular, many modern embedded systems operate under tight memory space constraints. One way of satisfying these constraints is to compress executable code and data as much as possible. While research on code compression has studied efficient hardware- and software-based code strategies, many of these techniques do not take application behavior into account; that is, the same compression/decompression strategy is used irrespective of the application being optimized. This paper presents a code compression strategy based on the control flow graph (CFG) representation of the embedded program. The idea is to start with a memory image wherein all basic blocks are compressed, and decompress only the blocks that are predicted to be needed in the near future. When the current access to a basic block is over, our approach also decides the point at which the block could be compressed. We propose several compression and decompression strategies that try to reduce memory requirements without excessively increasing the original instruction cycle counts.
The advent of sensor networks presents untapped opportunities for synthesis. We examine the problem of synthesizing behavioral specifications into networks of programmable sensor blocks. The particular behavioral specification we consider is an intuitive user-created network diagram of sensor blocks, each block having a pre-defined combinational or sequential behavior. We synthesize this specification to a new network that utilizes a minimum number of programmable blocks in place of the pre-defined blocks, thus reducing network size and hence network cost and power. We focus on the main task of this synthesis problem, namely partitioning pre-defined blocks onto a minimum number of programmable blocks, introducing the efficient yet effective PareDown decomposition algorithm for the task. We describe the synthesis and simulation tools we developed. We provide results showing excellent network size reductions through such synthesis, and significant speedups of our algorithm over exhaustive search while obtaining near-optimal results for 15 real network designs as well as nearly 10,000 randomly generated designs.
In this paper, we propose a distributed online HW/SW-partitioning strategy for increasing fault tolerance in HW/SW-reconfigurable networked systems.
In this paper we present our contribution in terms of a synchronization processor for a SoC design methodology based on the theory of latency-insensitive systems (LIS) of Carloni et al. [1]. Our contribution consists of IP encapsulation into a new wrapper model whose speed and area are optimized and whose synthesizability is guaranteed. The main benefit of our approach is to preserve local IP performance when encapsulating the IPs and to reduce SoC silicon area.
Temperature affects not only the reliability but also the performance, power, and cost of the embedded system. This paper proposes a thermal-aware task allocation and scheduling algorithm for embedded systems. The algorithm is used as a sub-routine for hardware/software co-synthesis to reduce the peak temperature and achieve a thermally even distribution while meeting real time constraints. The paper investigates both power-aware and thermal-aware approaches to task allocation and scheduling. The experimental results show that the thermal-aware approach outperforms the power-aware schemes in terms of maximal and average temperature reductions. To the best of our knowledge, this is the first task allocation and scheduling algorithm that takes temperature into consideration.
One of the greatest impediments to achieving high quality placements using force-directed methods lies in the large amount of overlap initially present in these techniques. This overlap makes the determination of cell ordering difficult and can lead to the inadvertent separation of highly-connected cells by the spreading forces. We show that a multi-level clustering strategy can minimize the ill effects of overlap and improve the quality of placements generated by the force-directed tool FDP. Moreover, we present a means of improving initial cell ordering through the unification of min-cut partitioning and force-based placement, and describe an enhanced median improvement heuristic which further aids in minimizing HPWL. Numerical results are presented showing that our flow generates placements which are, on average, 15% better than mPG and 4% better than Capo 9.0 on mixed-size designs.
As feature sizes shrink, it will be necessary to use AAPSM (Alternating-Aperture Phase Shift Masking) to image critical features, especially on the polysilicon layer. This imposes additional constraints on layouts beyond traditional design rules. Of particular note is the requirement that all critical features be flanked by opposite-phase shifters, while the shifters obey minimum width and spacing requirements. A layout that satisfies these requirements is called phase-assignable; otherwise, the phase conflicts have to be removed to enable the use of AAPSM for the layout. Previous work has sought to detect a suitable set of phase conflicts to be removed, as well as to correct them [3,4,5,6,8]. The contributions of this paper are the following: (1) a new approach to detect a minimal set of phase conflicts (also referred to as AAPSM conflicts), which when corrected will produce a phase-assignable layout; (2) a novel layout modification scheme for correcting these AAPSM conflicts. The proposed approach for conflict detection shows significant improvements in the quality of results and runtime for real industrial circuits when compared to previous methods. To the best of our knowledge, this is the first time layout modification results are presented for bright-field AAPSM. Our experiments show that the percentage area increase for making a layout phase-assignable ranges from 0.7% to 11.8%.
Variability is becoming a serious problem in process technology for nanometer technology nodes. The increasing difficulty in controlling the uniformity of critical process parameters (e.g., doping levels) in the smaller devices makes the electrical properties of such scaled devices much less predictable than in the past. In this paper, we study how these technology effects influence the energy and delay of an SRAM module. Beyond the implications for the correct operation of the module, in practically all cases the affected memory implementations also become slower while consuming on average more energy than nominal. This is partly counter-intuitive, and no existing literature describes it in a systematic, generic way for SRAMs. In this paper, we identify and illustrate the different mechanisms behind this unexpected behavior and quantify the impact of these effects for on-chip SRAMs at the 65nm technology node.
In this paper, we present an experimental integrated platform for the research, development, and evaluation of new VLSI back-end algorithms and design flows. Interconnect scaling to nanometer processes presents many difficult challenges to CAD flows. Academic research on the back-end mostly focuses on specific algorithmic issues separately. However, one key issue to address is also the cooperation of multiple algorithmic tools. TSUNAMI, our platform, is based on an integrated C++ database around which all tools consistently interact and collaborate. On top of this platform, a fixed-die standard cell timing-driven placement and global routing flow has been developed.
A new routing methodology, which accounts for inductive and capacitive coupling between neighboring wires is proposed. The inductive and capacitive coupling of the wires are introduced through a "moment" based higher order RLCK cost function. The routing process guided by this cost-function ensures that the final solution has minimum ringing and delay.
The operating frequency of a pipelined circuit is determined by the delay of the slowest pipeline stage. However, under statistical delay variation in the sub-100nm technology regime, the slowest stage is not readily identifiable and the estimation of the pipeline yield with respect to a target delay is a challenging problem. We propose analytical models to estimate yield for a pipelined design based on the delay distributions of the individual pipe stages. Using the proposed models, we show that changes in logic depth and imbalance between the stage delays can improve the yield of a pipeline. A statistical methodology has been developed to optimally design a pipeline circuit for enhancing yield. Optimization results show that proper imbalance among the stage delays in a pipeline improves design yield by 9% for the same area and performance (and reduces area by about 8.4% under a yield constraint) over a balanced design.
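As a baseline for intuition (only the independence case; the paper's models also capture the effects of imbalance and logic depth): if stage delays are modeled as independent Gaussians, pipeline yield at a clock target T is the product of per-stage probabilities P(d_i <= T).

from math import erf, sqrt

def stage_yield(mu, sigma, T):
    # P(delay <= T) for a Gaussian stage delay
    return 0.5 * (1.0 + erf((T - mu) / (sigma * sqrt(2.0))))

def pipeline_yield(stages, T):
    y = 1.0
    for mu, sigma in stages:
        y *= stage_yield(mu, sigma, T)
    return y

# balanced vs. deliberately imbalanced 3-stage pipeline at T = 1.0
print(pipeline_yield([(0.9, 0.05)] * 3, 1.0))
print(pipeline_yield([(0.95, 0.02), (0.9, 0.05), (0.85, 0.08)], 1.0))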
Conventional cache models are not suited for real-time parallel processing because tasks may flush each other's data out of the cache in an unpredictable manner. In this way the system is not compositional, so the overall performance is difficult to predict and the integration of new tasks is expensive. This paper proposes a new method that imposes compositionality on the system's performance and makes different memory hierarchy optimizations possible for communicating multimedia tasks running on embedded multiprocessor architectures. The method is based on a cache allocation strategy that assigns sets of the unified cache exclusively to tasks and to the communication buffers. We also formulate the problem analytically and describe a method to compute the cache partitioning ratio that optimizes the throughput and the consumed power. When applied to a multiprocessor with a memory hierarchy, our technique also delivers a performance gain. Compared to the shared cache case, an application consisting of two JPEG decoders and one edge detection algorithm experiences 5 times fewer misses, and an MPEG-2 decoder experiences 6.5 times fewer misses.
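For intuition, the following sketch allocates S cache sets among tasks by dynamic programming over per-task miss curves; the curves are assumed given (e.g., from profiling), and the formulation is a simplification of the analytical model described above.

def partition_sets(miss_curves, S):
    """miss_curves[t][k] = misses of task t when granted k sets (non-increasing in k)."""
    n, INF = len(miss_curves), float("inf")
    best = [[INF] * (S + 1) for _ in range(n + 1)]
    choice = [[0] * (S + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for t in range(1, n + 1):
        for used in range(S + 1):
            for k in range(used + 1):
                cand = best[t - 1][used - k] + miss_curves[t - 1][k]
                if cand < best[t][used]:
                    best[t][used], choice[t][used] = cand, k
    alloc, used = [], S                      # backtrack the best allocation
    for t in range(n, 0, -1):
        alloc.append(choice[t][used])
        used -= choice[t][used]
    return list(reversed(alloc)), best[n][S]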
Increasing design complexity eventually leads to a design process that is distributed over several companies. This is already found in the automotive industry, and SoC design appears to move in the same direction. Design processes for complex systems are iterative, but iteration hardly reaches beyond company borders. Iterations require the availability of preliminary design data and estimations, but due to cost and liability issues suppliers often hesitate to provide such preliminary data. Moreover, companies are rarely able to judge the accuracy and precision of externally estimated data. So, the systems integrator experiences increased design risk. Particular mechanisms are needed to ensure that the integrated system will meet the overall requirements even if some of the early estimations are wrong or imprecise. Based on work in supply chain management, we propose an inter-company design process that is based on formal techniques from real-time systems engineering and so-called flexible quantity contracts. In this process, formal techniques control design risk and flexible contracts regulate cooperation and cost distribution. The process effectively delays the design freeze point beyond the contract conclusion to enable design iterations. We explain the process and give an example.
Wire Pipelining (WP) has been proposed in order to limit the impact of increasing wire delays. In general, the added pipeline elements alter the system such that architectural changes are needed to preserve functionality. We illustrate a proposal that, while allowing the use of IP blocks without modification, takes advantage of a minimal knowledge of the IP's communication profile to dramatically increase performance. We show the formal equivalence between the wire-pipelined and the original system and demonstrate the higher achievable performance through a relevant case study.
The memory subsystem has always been a performance bottleneck as well as a significant power contributor in memory-intensive applications. Many researchers have presented multi-layered memory hierarchies as a means to design energy- and performance-efficient systems. However, most of the previous work does not explore trade-offs systematically. We fill this gap by proposing a formalized technique that takes into consideration data reuse, the limited lifetime of the arrays of an application, and application-specific prefetching opportunities, and performs a thorough trade-off exploration for different memory layer sizes. This technique has been implemented in a prototype tool, which was tested successfully using nine real-life applications of industrial relevance. Following this approach, we have been able to reduce execution time by up to 60% and energy consumption by up to 70%.
SystemC users and tool providers are at a crossroads. More and more companies are using SystemC; however, EDA companies are hesitant to give a full commitment to SystemC tools, especially at the system level. There are several reasons for this dichotomy. While users seem excited about SystemC for its technical qualities for system-level design, tool providers may not share this excitement because of the currently small market for such tools. Are the existing (free) reference implementation and the current small market for system-level tools to blame, or are there technical issues impeding the fast development of SystemC tools? Among SystemC users from industry and academia there is currently some uncertainty about the future availability of state-of-the-art EDA tools supporting SystemC. This panel brings together industrial SystemC users as well as EDA companies to discuss these issues. The industrial panellists will present the current situation regarding the use of SystemC in industry, its importance (or lack thereof) for system design, and future needs for such tools. The tool providers will explain their current position regarding their commitment to SystemC and clarify their future plans for supporting it.
State-of-the-art statistical timing analysis (STA) tools often yield less accurate results when timing variables become correlated due to global sources of variation and path reconvergence. To the best of our knowledge, no good solution is available for dealing with both types of correlation simultaneously. In this paper, we present a novel extended pseudo-canonical timing model to retain and evaluate both types of correlation during statistical timing analysis with minimal computation cost. Also, an intelligent pruning method is introduced to enable trading off runtime against accuracy. Tested with the ISCAS benchmark suites, our method shows both high accuracy and high performance. For example, on the circuit c6288, our distribution estimation error shows a 15x accuracy improvement compared with previous approaches.
Assessing IC manufacturing process fluctuations and their impacts on IC interconnect performance has become unavoidable for modern DSM designs. However, the construction of parametric interconnect models is often hampered by the rapid increase in computational cost and model complexity. In this paper we present an efficient yet accurate parametric model order reduction algorithm for addressing the variability of IC interconnect performance. The efficiency of the approach lies in a novel combination of low-rank matrix approximation and multi-parameter moment matching. The complexity of the proposed parametric model order reduction is as low as that of a standard Krylov subspace method when applied to a nominal system. Under the projection-based framework, our algorithm also preserves the passivity of the resulting parametric models.
In this paper, we investigate the impact of interconnect and device process variations on voltage fluctuations in power grids. We consider random variations in the power grid's electrical parameters as spatial stochastic processes and propose a new and efficient method to compute the stochastic voltage response of the power grid. Our approach provides an explicit analytical representation of the stochastic voltage response using orthogonal polynomials in a Hilbert space. The approach has been implemented in a prototype software called OPERA (Orthogonal Polynomial Expansions for Response Analysis). Use of OPERA on industrial power grids demonstrated speed-ups of up to two orders of magnitude. The results also show a significant variation of about ± 35% in the nominal voltage drops at various nodes of the power grids and demonstrate the need for variation-aware power grid analysis.
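The flavor of such an expansion can be shown on a one-parameter toy problem (an illustration of polynomial chaos in general, not of OPERA itself): expand the IR drop V = I * R(xi), with R(xi) = R0(1 + 0.1 xi) and xi ~ N(0, 1), in probabilists' Hermite polynomials.

import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

R0, I = 2.0, 0.5
nodes, weights = hermegauss(8)              # quadrature for weight exp(-x^2/2)
weights = weights / np.sqrt(2.0 * np.pi)    # normalize to the N(0,1) density

def V(xi):                                  # exact response at a sample of xi
    return I * R0 * (1.0 + 0.1 * xi)

# projection: c_k = E[V(xi) He_k(xi)] / E[He_k(xi)^2]
coeffs = []
for k in range(3):
    basis = hermeval(nodes, [0.0] * k + [1.0])
    coeffs.append(np.sum(weights * V(nodes) * basis)
                  / np.sum(weights * basis ** 2))
# coeffs ~ [1.0, 0.1, 0.0]: the mean drop plus a linear random term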
A comprehensive probabilistic methodology is proposed to solve the buffer insertion problem with the consideration of process variations. In contrast to a recent work, we point out, for the first time, that the correlation between the required arrival time and the downstream loading capacitance must be considered in order to solve the problem "correctly". We develop an efficient bottom-up recursive algorithm to calculate the joint probability density function that accurately captures the above correlation, and propose effective pruning rules to exclude probabilistically inferior solutions. We verify our buffer insertion using timing analysis with both device and interconnect variations, and show that compared to the conventional buffer insertion algorithm using nominal device and interconnect parameters, our new buffer insertion methodology can reduce the probability of timing violation by up to 30%.
This paper presents a new mathematical approach to modeling EM wave coupling noise so that it can be easily integrated into chip-level noise analysis tools. The new method employs Chebyshev approximation technique to model the distributed sources arising in the Telegrapher's equations due to EM wave coupling. A uniform plane wave illumination metric is provided to determine the order of approximation. Closed-form formulas for the noise transfer functions' moments are derived. By utilizing the formulated moments, reduced order models can be efficiently obtained to generate the induced noise caused by EM wave illumination. The accuracy of the proposed method is verified by Hspice simulation.
In signal integrity analysis, the joint effect of noise propagated through library cells and of the noise injected on a quiet net by neighboring switching nets through coupling capacitances must be considered in order to accurately estimate the overall noise impact on design functionality and performance. In this work the impact of cell non-linearity on the noise glitch waveform is analyzed in detail, and a new macromodel that allows accurate and efficient modeling of the non-linear effects of the victim driver in noise analysis is presented. Experimental results demonstrate the effectiveness of our method, and confirm that existing noise analysis approaches based on linear superposition of the propagated and crosstalk-injected noise can be highly inaccurate, thus impairing the sign-off functional verification phase.
We propose a sensitivity-based method to allocate decaps incorporating leakage constraints and tighter data and clock interactions. The proposed approach attempts to allocate decaps not only based on the power grid integrity criteria, but also based on the impact of power grid noise on timing criticality and robustness. The resulting algorithm reduces the power grid noise to below a threshold and improves the performance or timing robustness of the circuit at the same time.
This paper suggests a methodology to decrease the power of a static CMOS standard cell design at the layout level by focusing on switched capacitance. The term switched is key: if a capacitance is not switched often, it may be high; if it is frequently switched, it should be minimized in order to reduce power consumption. This can be done by a force-based algorithm that automatically optimizes the position and length of every single wire segment in a routed design. The forces are proportional to the toggle activities derived from a gate-level simulation. The novelty is that this allows a new topology for the wire segments to be found iteratively. Our algorithm takes as input an already given, grid-routed layout.
The first path-implicit and exact non-robust path delay fault grading technique for non-scan sequential circuits is presented. Non-enumerative exact coverage is obtained by allowing any latched error representing a delayed transition to propagate to a primary output with the support of other potentially latched errors. The generalized error propagation is done by symbolic simulation. Appropriate data structures for function manipulation are used. The advantage of the proposed method is demonstrated experimentally, with consistent improvement in coverage over an existing pessimistic heuristic despite enforced bounds on the memory requirements.
Test model generation is common in the design cycle of custom made high performance low power designs targeted for high volume production. Logic extraction is a key step in test model generation to produce a logic level netlist from the transistor level representation. This is a semi-automated process which is error prone. This paper analyzes typical extraction errors applicable to clocking schemes seen in high-performance designs today. An automated debugging solution for these errors in designs with no state equivalence information is also presented. A suite of experiments on circuits with similar architecture to that found in the industry confirm the fitness and practicality of the solution.
In recent years, several Electronic Design Automation (EDA) problems in testing and verification have been formulated as Boolean Satisfiability (SAT) instances due to the development of efficient general-purpose SAT solvers. Problem-specific learning techniques and heuristics can be integrated into the SAT solver to further speed up the search for a satisfying assignment. In this paper, we target the problem of generating a complete test suite for the path delay fault (PDF) model. We provide an incremental satisfiability framework that learns from (1) static logic implications, (2) segment-specific clauses, and (3) unsatisfiability cores of each untestable partial PDF. These learning techniques improve the test generation for path delay faults that have common testable and/or untestable segments. The experimental results show that a significant portion of PDFs can be excluded dynamically in the proposed incremental SAT formulation for large benchmark circuits, thus potentially achieving speed-ups for PDF test generation.
We investigate a new fault ordering heuristic for test generation in full-scan circuits. The heuristic is referred to as the accidental detection index. It associates a value ADI(f) with every circuit fault f. The heuristic estimates the number of faults that will be detected by a test generated for f. Fault ordering is done such that a fault with a higher accidental detection index appears earlier in the ordered fault set and is targeted earlier during test generation. This order is effective for generating compact test sets and for obtaining a test set with a steep fault coverage curve. Such a test set has several applications. We present experimental results to demonstrate the effectiveness of the heuristic.
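One plausible way to approximate such an index from a sampled fault dictionary (the paper's exact ADI computation is not reproduced here): every test that detects f also "accidentally" detects the other faults on its detection list.

def adi_order(faults, dictionary):
    """dictionary: list of fault sets, one per sample test."""
    adi = {f: 0.0 for f in faults}
    hits = {f: 0 for f in faults}
    for detected in dictionary:
        for f in detected:
            adi[f] += len(detected) - 1   # faults detected alongside f
            hits[f] += 1
    for f in faults:
        if hits[f]:
            adi[f] /= hits[f]             # average accidental detections
    return sorted(faults, key=lambda f: -adi[f])   # high ADI targeted first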
We discuss fault equivalence and dominance relations for multiple-output combinational circuits. The conventional definition of equivalence says that "two faults are equivalent if and only if the corresponding faulty circuits have identical output functions". This definition, which is based on indistinguishability of the faults, is extended for multiple-output circuits as "two faults of a Boolean circuit are equivalent if and only if the pair of output functions is identical at each output of the circuit". This is termed diagnostic equivalence in this paper. "If all tests that detect a fault also detect another fault, not necessarily on the same output, then the two faults are called detection equivalent". Two detection-equivalent faults need not be indistinguishable. The definitions for fault dominance follow along similar lines. A novel algorithm based on redundancy identification is proposed to find the equivalence- and dominance-collapsed sets based on diagnostic and detection collapsing. Applying the algorithm to a 4-bit ALU collapses the total fault set of 502 faults to 253 and 155, respectively, according to diagnostic equivalence and dominance. The collapsed sets have 234 and 92 faults, respectively, for detection equivalence and dominance. In comparison, traditional structural equivalence and dominance collapsing results in 301 and 248 faults, respectively. Finally, we use library-based functional collapsing in a hierarchical system and find that smaller fault sets are obtained with an order-of-magnitude reduction in CPU time for very large circuits.
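With a complete fault dictionary in hand, detection equivalence reduces to grouping faults by their detecting test sets, as in this sketch (an illustration of the definition only; the paper's algorithm is based on redundancy identification, not explicit dictionaries).

def detection_collapse(faults, tests, detects):
    """detects(t, f) -> True if test t detects fault f."""
    classes = {}
    for f in faults:
        signature = frozenset(t for t in tests if detects(t, f))
        classes.setdefault(signature, []).append(f)
    # one representative per detection-equivalence class
    return [group[0] for group in classes.values()]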
With the increasing complexity of memory behavior, attempts are being made to come up with a methodical approach that employs electrical simulation to tackle the memory test problem. This paper describes a framework of algorithms and tools developed jointly by the Delft University of Technology and Infineon Technologies to systematically generate DRAM tests using Spice simulation. The proposed Spice-based test approach enjoys the advantage of being relatively inexpensive, yet highly accurate in describing the desired memory faulty behavior.
Our goal is to produce validation data that can be used as an efficient (pre) test set for structural stuck-at faults. In this paper, we detail an original test-oriented mutation sampling technique used for generating such data and we present a first evaluation on these validation data with regard to a structural test.
Fueled by an unprecedented desire for convenience and self-service, consumers are embracing embedded technology solutions that enhance their mobile lifestyles. Consequently, we witness an unprecedented proliferation of embedded/mobile applications. Most of the environments that execute these applications have severe power, performance, and memory space constraints that need to be accounted for. In particular, memory limitations can present serious challenges to embedded software designers. The current solutions to this problem include sophisticated packaging techniques and code optimizations for effective memory utilization. While the first solution is not scalable, the second one is restricted by intrinsic data dependences in the code that prevent code restructuring. In this paper, we explore an alternate approach for reducing the memory space requirements of embedded applications. The idea is to re-compute the result of a code block (potentially multiple times) instead of storing it in memory and performing a memory operation whenever it is needed. The main benefit of this approach is that it reduces memory space requirements: no memory space is reserved for storing the result of the code block in question.
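A toy contrast between the two strategies (the code block and its two uses are hypothetical): the first version keeps the block's result live in memory between uses, while the second re-executes the block per use and reserves no buffer.

def with_store(frame):
    hist = [0] * 256                  # 256-entry buffer stays live
    for px in frame:                  # the code block whose result is stored
        hist[px] += 1
    return max(hist), sum(h * h for h in hist)

def with_recompute(frame):
    def hist():                       # block re-executed at each use instead
        h = [0] * 256
        for px in frame:
            h[px] += 1
        return h
    return max(hist()), sum(v * v for v in hist())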
Memory space limitation is a serious problem for many embedded systems from diverse application domains. While circuit/packaging techniques are definitely important to squeeze large quantities of data/instructions into the small memories typically employed by embedded systems, software can also play a crucial role in reducing the memory space demands of embedded applications. This paper focuses on a software-managed two-level memory hierarchy and instruction accesses. Our goal is to reduce the on-chip memory requirements of a given application as much as possible, so that the memory space saved can be used by other simultaneously-executing applications. The proposed approach achieves this by tracking the lifetime of instructions. Specifically, when an instruction is dead (i.e., it cannot be visited again in the rest of the execution), we deallocate the on-chip memory space allocated to it. Working on the control flow graph representation of an embedded application, our approach performs basic-block-level garbage collection for on-chip memories.
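The dead-block test itself is plain CFG reachability: once control reaches a given block, any block not reachable from it can never execute again, so its on-chip space can be reclaimed. A minimal sketch, with a hypothetical CFG encoding:

def reclaimable(cfg, current, resident):
    """cfg: block -> successor blocks; resident: blocks held in on-chip memory."""
    reachable, stack = {current}, [current]
    while stack:
        for succ in cfg[stack.pop()]:
            if succ not in reachable:
                reachable.add(succ)
                stack.append(succ)
    return [b for b in resident if b not in reachable]   # dead -> deallocate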
We propose a method for fine-grain QoS control of dataflow applications. We assume that the application software is described as a composition of actions (C functions) with quality-level parameters. The method allows a QoS controller to be computed from this description together with average execution times, worst-case execution times, and deadlines for the actions. The controller dynamically computes feasible schedules and quality assignments for the actions. Furthermore, the control policy ensures optimal time budget utilization. A prototype tool implementing the method is shown, as well as experimental results for a non-trivial example. The results show the benefit of fine-grain QoS control for video encoders.
Embedded software design for real-time reactive systems has become the bottleneck in the market introduction of complex products such as automobiles, airplanes, and industrial control plants. In particular, functional correctness and reactive performance are increasingly difficult to verify. The advent of model-based design methodologies has alleviated some of the verification-related problems by making the code-generation process flow automatically from the model description. Given the relative infancy of this approach, several companies rely upon design flows based on different tools connected together by file transfer. This way of integrating tools defeats the very purpose of the methodology, introducing a high potential for errors in the transformation from one format to another and preventing formal analysis of the properties of the design. In this paper, we propose to adopt a formal transformation across different tools, and we give an example of this approach by linking two tools that are widely used in the automotive domain: Simulink and ASCET. We believe that this approach can be applied to any embedded software design flow to leverage the power of all the tools in the flow.
We introduce galsC, a language designed for programming event-driven embedded systems such as sensor networks. galsC implements the TinyGALS programming model. At the local level, software components are linked via synchronous method calls to form actors. At the global level, actors communicate with each other asynchronously via message passing, which separates the flow of control between actors. A complementary model called TinyGUYS is a guarded yet synchronous model designed to allow thread-safe sharing of global state between actors via parameters without explicitly passing messages. The galsC compiler extends the nesC compiler, which allows for better type checking and code generation. Having a well-structured concurrency model at the application level greatly reduces the risk of concurrency errors, such as deadlock and race conditions. The galsC language is implemented on the Berkeley motes and is compatible with the TinyOS/nesC component library. We use a multi-hop wireless sensor network as an example to illustrate the effectiveness of the language.
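The following Python sketch mimics the two levels of the programming model; the class names and API are ours, not galsC syntax:

```python
# Illustrative sketch of the GALS pattern galsC implements: calls inside
# an actor are ordinary synchronous calls, while actors exchange data
# only through queued, asynchronous messages.

from queue import Queue

class Actor:
    def __init__(self):
        self.inbox = Queue()
    def send(self, other, msg):
        other.inbox.put(msg)          # asynchronous: just enqueue
    def step(self):
        while not self.inbox.empty():
            self.handle(self.inbox.get())

class Sensor(Actor):
    def sample(self, sink):
        value = 42                    # synchronous work inside the actor
        self.send(sink, value)

class Logger(Actor):
    def handle(self, msg):
        print("got", msg)

sensor, logger = Sensor(), Logger()
sensor.sample(logger)                 # control returns immediately
logger.step()                         # logger processes messages later
```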
In this work, we experiment with compiler-directed instruction duplication to detect soft errors in VLIW datapaths. In the proposed approach, the compiler determines the instruction schedule by balancing the permissible performance degradation with the required degree of duplication. Our experimental results show that our algorithms allow the designer to perform trade-off analysis between performance and reliability.
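A toy model of the duplicate-and-compare idea (the helper below is illustrative; the actual technique duplicates scheduled VLIW instructions, ideally in otherwise-idle issue slots):

```python
# Minimal sketch of duplicate-and-compare soft-error detection: each
# protected operation is executed twice and the results compared; a
# mismatch signals a detected (transient) soft error.

def checked(op, *args):
    """Execute `op` twice and flag a mismatch as a detected soft error."""
    first, second = op(*args), op(*args)
    if first != second:
        raise RuntimeError("soft error detected")
    return first

print(checked(lambda a, b: a + b, 2, 3))   # 5, silently checked
```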
Demand for original OSs that can achieve high I/O performance on PC/AT-compatible hardware has been increasing recently, but conventional OS debugging environments have not been able to simultaneously assure stability, be easily customized to new OSs and new I/O devices, and execute I/O operations efficiently. We therefore developed a novel OS debugging method using a lightweight virtual machine. We evaluated this debugging method experimentally and confirmed that it can transfer data about 5.4 times as fast as a conventional virtual machine monitor.
In this paper, the program control unit of an embedded RISC processor is enhanced with a novel zero-overhead loop controller (ZOLC) supporting arbitrary loop structures with multiple-entry/exit nodes. The ZOLC has been incorporated into an open RISC processor core to evaluate the performance of the proposed unit for alternative configurations of the selected processor. Speed improvements of 8.4% to 48.2% are shown to be feasible for the benchmarks used.
Knowledge of the optimal design space boundaries of component circuits can be extremely useful in making good subsystem-level design decisions that are aware of parasitics and other second-order circuit-level details. However, direct application of popular multi-objective genetic optimization algorithms was found to produce Pareto fronts with poor diversity for analog circuit problems. This work proposes a novel approach that controls the diversity of solutions by partitioning the solution space, using local competition to promote diversity and global competition for convergence, and controlling the proportion of these two mechanisms through a simulated-annealing-based formulation. The algorithm was applied to extract numerical results for analog switched-capacitor integrator circuits with a wide range of tight specifications. The results were found to be significantly better than those of traditional uncontrolled GA-based optimization methods.
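A hedged sketch of the diversity-control mechanism follows; the annealing schedule, survival rule, and fitness function are illustrative choices of ours, not the paper's exact formulation:

```python
# Sketch of annealed local/global competition: at high temperature most
# candidates compete only inside their own partition (preserving
# diversity); as the temperature cools, global competition takes over
# and drives convergence.

import math, random

random.seed(0)  # deterministic toy run

def selection_round(pop, fitness, partition, temperature):
    """One selection round mixing local and global competition."""
    survivors = []
    for cand in pop:
        p_local = math.exp(-1.0 / max(temperature, 1e-9))
        rivals = ([x for x in pop if partition(x) == partition(cand)]
                  if random.random() < p_local else pop)
        best = max(fitness(r) for r in rivals)
        if fitness(cand) >= 0.9 * best:   # survive if near the best rival
            survivors.append(cand)
    return survivors

# Toy run: scalars partitioned by sign; fitness prefers small magnitude.
pop = [-3.0, -0.5, 0.2, 2.5]
print(selection_round(pop, fitness=lambda x: 1 / (1 + abs(x)),
                      partition=lambda x: x >= 0, temperature=5.0))
```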
An efficient methodology is presented to generate the Pareto-optimal hypersurface of the performance space of a complete mixed-signal electronic system. This Pareto-optimal front offers the designer access to all optimal design solutions: starting from the performance specifications, a satisfactory point can be selected a posteriori on the hypersurface, which immediately determines the final design parameters. Fast execution is guaranteed by using multi-objective evolutionary optimization techniques and hierarchical decomposition. The presented method takes advantage of the Pareto hypersurfaces of the subblocks to generate the overall Pareto front. The hierarchical approach combines behavioral simulation with behavioral models at the higher levels and SPICE simulations with transistor-level accuracy at the lowest level. Storing the performance data of all subblocks enables reuse for other systems later on.
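The core primitive such a hierarchical flow builds on is Pareto filtering of candidate performance points, sketched below (all objectives minimized; the subblock-composition step is elided):

```python
# Sketch of Pareto filtering: keep only the points not dominated by any
# other point, where "dominates" means no worse in every objective.

def pareto_front(points):
    """Keep points not dominated by any other (all objectives minimized)."""
    front = []
    for p in points:
        dominated = any(all(q[i] <= p[i] for i in range(len(p))) and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

print(pareto_front([(1, 4), (2, 2), (3, 3), (4, 1)]))
# -> [(1, 4), (2, 2), (4, 1)]; (3, 3) is dominated by (2, 2)
```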
The design of electrical systems demands simulations using models evaluated at different design-parameter choices. Simulating linear systems often requires modeling them as ordinary differential equations, given tabular data obtained from device simulations or measurements. Existing techniques must do this for every choice of design parameters, since the model representations do not scale smoothly with the external parameters. This paper describes a frequency-domain identification algorithm to extract the poles and zeros of linear MIMO systems. Furthermore, it expresses the poles and zeros as trajectories that are functions of the design parameter(s). The paper describes the framework used, solves the starting-value problem, presents a solution for high-order systems, and provides a model-order selection strategy. The properties of the algorithm are illustrated on microwave measurements of inductors, a variable-gain amplifier, and a high-order SAW filter. As these examples show, the proposed identification algorithm is very well suited to deriving scalable, physically relevant models from tabular frequency-response data.
This paper presents a method to automatically generate compact symbolic performance models of analog circuits with no prior specification of an equation template. The approach takes SPICE simulation data as input, which enables modeling of any nonlinear circuits and circuit characteristics. Genetic programming is applied as a means of traversing the space of possible symbolic expressions. A grammar is specially designed to constrain the search to a canonical form for functions. Novel evolutionary search operators are designed to exploit the structure of the grammar. The approach generates a set of symbolic models which collectively provide a tradeoff between error and model complexity. Experimental results show that the symbolic models generated are compact and easy to understand, making this an effective method for aiding understanding in analog design. The models also demonstrate better prediction quality than posynomials.
In this paper, we present a two-level approach to performance macromodeling based on radial-basis-function Support Vector Machines (SVMs). The two-level model consists of a feasibility model and a set of performance models. The feasibility model identifies the feasible designs that satisfy the design constraints; the performance macromodel is valid for feasible designs. We formulate feasibility macromodeling as a classification problem and performance macromodeling as a regression problem, and apply the SVM algorithm to build the classifier and the regressors, respectively. Our experiments show that performance macromodels restricted to feasible designs are much more accurate, and faster to train and evaluate, than those built without considering functional or performance constraints.
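A compact sketch of the two-level structure using scikit-learn's RBF-kernel SVMs on synthetic data (the constraint and performance functions below are made up; the paper trains on circuit simulation data):

```python
# Level 1: an SVC classifies designs as feasible/infeasible.
# Level 2: an SVR, trained on feasible designs only, predicts performance.

import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))            # design parameters
feasible = (X[:, 0] ** 2 + X[:, 1] ** 2) < 0.7   # toy design constraint
perf = X[:, 0] + 0.5 * np.sin(3 * X[:, 1])       # toy performance metric

clf = SVC(kernel="rbf").fit(X, feasible)                  # feasibility model
reg = SVR(kernel="rbf").fit(X[feasible], perf[feasible])  # performance model

query = np.array([[0.1, 0.2]])
if clf.predict(query)[0]:
    print("predicted performance:", reg.predict(query)[0])
else:
    print("design predicted infeasible")
```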
The application of microelectronics to bioanalysis is an emerging field that holds great promise. From the standpoint of electronic and system design, biochips imply a radical change of perspective, since new, completely different constraints emerge while other usual constraints can be relaxed. While the electronic parts of the system can rely on the usual established design flow, fluidic and packaging design calls for a new approach that relies significantly on experiments. We present some general considerations based on our experience in the development of biochips for cell analysis.
We describe verification techniques for embedded memory systems using efficient memory modeling (EMM), without explicitly modeling each memory bit. We extend our previously proposed EMM approach in Bounded Model Checking (BMC) from a single-memory system with one read/write port to the more commonly occurring systems with multiple memories and multiple read and write ports. More importantly, we augment EMM to provide correctness proofs, in addition to finding real bugs as before. The novelties of our verification approach are a) combining EMM with proof-based abstraction that preserves the correctness of a property up to a certain analysis depth of SAT-based BMC, and b) modeling arbitrary initial memory state precisely, thereby providing inductive proofs using SAT-based BMC for embedded memory systems. As in the previous approach, we construct a verification model by eliminating memory arrays but retaining the memory interface signals with their control logic, and adding constraints on those signals at every analysis depth to preserve the data-forwarding semantics. The size of these EMM constraints depends quadratically on the number of memory accesses and the number of read and write ports, and linearly on the address and data widths and the number of memories. We show the effectiveness of our approach on several industrial designs and software programs.
The sequential SAT solver Satori [1] was recently proposed as an alternative to combinational SAT in verification applications. This paper describes the design of Seq-SAT, an efficient sequential SAT solver with improved search strategies over Satori. The major improvements include (1) a new and better heuristic for minimizing the set of assignments to state variables, (2) a priority-based search strategy and a flexible sequential search framework that integrates different search strategies, and (3) a decision-variable selection heuristic better suited to sequential problems. We present experimental results demonstrating that our sequential SAT solver achieves orders-of-magnitude speedups over Satori. We plan to release the source code of Seq-SAT along with this paper.
Boolean Satisfiability (SAT) has seen significant use in various circuit verification tasks in recent years. A key contributor to the efficiency of contemporary SAT solvers is fast deduction using Boolean Constraint Propagation (BCP). This can be efficiently implemented with a Conjunctive Normal Form (CNF) representation of a circuit. However, most circuit verification tasks start from a logic circuit description of the problem instance. Fortunately, there is a simple conversion from a logic circuit to a CNF [12] that enables the use of the CNF representation even for circuit verification tasks. However, this process loses some information regarding the structure of the circuit. One example of such structural information is the circuit observability don't cares. Several recent papers [6] [7] [8] [9] [11] [13] have addressed the issue of handling circuit unobservability in CNF-based SAT. However, as we will demonstrate, none of these accurately captures the conditions for using this information in all stages of a CNF-based SAT solver. In this paper, we propose a broader approach that takes such don't care information into consideration in a CNF-based SAT solver. It does so by adding certain don't care literals to clauses in the CNF representation. These don't care literals are treated differently at different times during the solution process, much like don't cares in logic synthesis. The major merit of this scheme, unlike other recently proposed techniques, is that the solver can continue to use this don't care information during the learning process, which is an important part of contemporary SAT solvers. We have implemented this approach in the zChaff SAT solver, and experiments show that a significant performance gain can be obtained through its use.
We present our experience of designing a single-chip controller for an advanced digital still camera, from specification all the way to mass production. The process involved collaboration with the camera system designer, IP vendors, EDA vendors, a silicon wafer foundry, package and testing houses, and the camera maker. We also worked with academic research groups to develop a JPEG codec IP, memory BIST, and an SOC testing methodology. In this presentation, we cover the problems encountered, our solutions, and the lessons learned.
This paper summarizes our design experiences of various image and video codec IPs. The design issues and methodology of custom video codecs are discussed. The design methodology can be summarized as four stages, system analysis, algorithm optimization, architecture exploration, and code development. Based on these guidelines, several design cases are presented, including the proposed JPEG, MPEG-4, and H.264 architectures.
On a commercial digital still camera (DSC) controller chip, we apply a novel SOC test integration platform, solving real problems in test scheduling, test IO reduction, timing of functional test, scan IO sharing, embedded memory built-in self-test (BIST), etc. The chip has been fabricated and tested successfully using our approach. Test results confirm that low test integration cost, short test time, and small area overhead can be achieved. To support SOC testing, a memory BIST compiler and an SOC test integration system have been developed.
We provide a general formulation for the code-based test compression problem with fixed-length input blocks and propose a solution approach based on Evolutionary Algorithms. In contrast to existing code-based methods, we allow unspecified values in matching vectors, which allows encoding of arbitrary test sets using a relatively small number of codewords. Experimental results for both stuck-at and path delay fault test sets for ISCAS circuits demonstrate an improvement compared to existing techniques.
Keywords: test compression, code-based compression, evolutionary algorithms
A methodology for designing a reconfigurable linear decompressor is presented. A symbolic Gaussian elimination method to solve a constrained Boolean matrix is proposed and utilized for designing the reconfigurable network. The proposed scheme can be implemented in conjunction with any decompressor that has a combinational linear network. Using the given linear decompressor as a starting point, the proposed method improves the compression further. A nice feature of the proposed method is that it can be implemented with very little hardware overhead. Experimental results indicate that significant improvements can be achieved.
With increasing process fluctuations in nano-scale technology, testing for delay faults is becoming essential in manufacturing test to complement stuck-at-fault testing. Design-for-testability techniques, such as enhanced scan are typically associated with considerable overhead in die-area, circuit performance, and power during normal mode of operation. This paper presents a novel test technique, which can be used as an alternative to the enhanced scan based delay fault testing method, with significantly less design overhead. Instead of using an extra latch as in the enhanced scan method, we propose using supply gating at the first level of logic gates to hold the state of a combinational circuit. Experimental results on a set of ISCAS89 benchmarks show an average reduction of 33% in area overhead with an average improvement of 71% in delay overhead and 90% in power overhead during normal mode of operation, compared to the enhanced scan implementation.
We present a hybrid BIST approach that extracts the most frequently occurring sequences from deterministic test patterns; these extracted sequences are stored on-chip. We use cluster analysis for sequence extraction, and encode deterministic patterns on the basis of the stored sequences. Experimental results for the ISCAS-89 benchmark circuits show that the proposed approach often requires less on-chip storage and test data volume than other recent BIST methods.
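The extraction step might look like the following sketch; plain frequency counting stands in for the paper's cluster analysis, and the patterns are illustrative:

```python
# Sketch of the dictionary-based encoding idea: find the most frequently
# occurring fixed-length sequences in a deterministic test set, store
# them on-chip, and encode each pattern as indices into that dictionary.

from collections import Counter

def build_dictionary(patterns, seq_len, dict_size):
    """Return the dict_size most frequent length-seq_len chunks."""
    counts = Counter()
    for p in patterns:
        for i in range(0, len(p) - seq_len + 1, seq_len):
            counts[p[i:i + seq_len]] += 1
    return [s for s, _ in counts.most_common(dict_size)]

patterns = ["011011010110", "011001101101"]   # toy test patterns
print(build_dictionary(patterns, seq_len=4, dict_size=2))
# -> ['0110', '1101']: the two chunks worth storing on-chip
```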
Efficient architecture exploration and design of application specific instruction-set processors (ASIPs) requires retargetable software development tools, in particular C compilers that can be quickly adapted to new architectures. A widespread approach is to model the target architecture in a dedicated architecture description language (ADL) and to generate the tools automatically from the ADL specification. For C compiler generation, however, most existing systems are limited either by the manual retargeting effort or by redundancies in the ADL models that lead to potential inconsistencies. We present a new approach to retargetable compilation, based on the LISA 2.0 ADL with instruction semantics, that minimizes redundancies while simultaneously achieving a high degree of automation. The key to our approach is generating the mapping rules needed by the compiler's code selector from the instruction semantics information. We describe the required analysis and generation techniques, and present experimental results for several embedded processors.
While loop-restructuring-based code optimization for array-intensive applications has been successful in the past, it has several problems, such as the need to check dependences (legality issues) and the indiscriminate transformation of all array references within the loop body (some references may benefit from the transformation while others may not). As a result, data transformations, i.e., transformations that modify the memory layout of array data instead of the loop structure, have been proposed. One of the problems associated with data transformations is the difficulty of selecting a memory layout for an array that is acceptable to the entire program (not just to a single loop). In this paper, we formulate the problem of determining the memory layouts of arrays as a constraint network, and explore several solution methods in a systematic way. Our experiments provide strong support in favor of employing constraint processing, and point out future research directions.
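A minimal sketch of the constraint-network view (the layout choices and constraints below are illustrative; in the paper the constraints come from access-pattern analysis across loop nests):

```python
# Each array variable picks one memory layout; constraints encode what
# the loop nests require; backtracking search finds a consistent
# assignment for the whole program.

LAYOUTS = ["row-major", "col-major"]

def solve(arrays, constraints, assignment=None):
    """Backtracking search over layout assignments."""
    assignment = assignment or {}
    if len(assignment) == len(arrays):
        return assignment
    var = next(a for a in arrays if a not in assignment)
    for layout in LAYOUTS:
        trial = {**assignment, var: layout}
        if all(c(trial) for c in constraints):
            result = solve(arrays, constraints, trial)
            if result:
                return result
    return None

# Loop 1 wants A row-major; loop 2 wants A and B to share a layout.
constraints = [
    lambda s: s.get("A", "row-major") == "row-major",
    lambda s: "A" not in s or "B" not in s or s["A"] == s["B"],
]
print(solve(["A", "B"], constraints))
# -> {'A': 'row-major', 'B': 'row-major'}
```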
Scratch-pad memory is becoming an important fixture in embedded multimedia systems. It is significantly more efficient than a cache, in both performance and power, and has the added advantage of better timing predictability. Current techniques for managing the scratch-pad are quite mature for arrays accessed in a regular fashion, i.e., inside nested loops through index expressions that are affine functions of the loop iterators. Many multimedia codes, however, also use arrays as subscripted variables in the index expressions of other arrays, thereby making the access pattern irregular. Existing techniques fail in such cases, degrading performance. In this paper, we extend the existing framework to the case of irregular access. We provide a clear and precise compiler-based technique for analyzing irregular array accesses and efficiently mapping such arrays to the scratch-pad. On average, a 20% reduction in energy consumption was achieved with our methods for a set of realistic applications.
Structural testing techniques, such as statement and branch coverage, play an important role in improving dependability of software systems. However, finding a set of tests which guarantees high coverage is a time-consuming task. In this paper we present a technique for structural testing based on kernel computation. A kernel satisfies the property that any set of tests which executes all vertices (edges) of the kernel executes all vertices (edges) of the program's flowgraph. We present a linear-time algorithm for computing minimum kernels based on pre- and post-dominator relations of a flowgraph.
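The sketch below illustrates the covering idea using the classic iterative dominator computation (not the paper's linear-time algorithm): a vertex can be dropped from the kernel when executing some other vertex forces it to execute, i.e., when it dominates or post-dominates that vertex.

```python
# Kernel computation sketch: vertices covered via dominance or
# post-dominance (dominators of the reversed flowgraph) need not be
# targeted by tests explicitly.

def dominators(cfg, entry):
    """Classic iterative dominator computation: v in dom[u] iff v dominates u."""
    nodes = set(cfg)
    preds = {n: [p for p in nodes if n in cfg[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | (set.intersection(*[dom[p] for p in preds[n]])
                         if preds[n] else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def reversed_cfg(cfg):
    rcfg = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            rcfg[s].append(n)
    return rcfg

# Diamond flowgraph: entry e branches to a or b, both rejoin at exit x.
cfg = {"e": ["a", "b"], "a": ["x"], "b": ["x"], "x": []}
dom = dominators(cfg, "e")
pdom = dominators(reversed_cfg(cfg), "x")    # post-dominators via reversal
covered = {v for v in cfg
           if any(v in (dom[u] | pdom[u]) for u in cfg if u != v)}
print(set(cfg) - covered)   # {'a', 'b'}: covering both also covers e and x
```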
As the communication requirements of current and future Multiprocessor Systems on Chips (MPSoCs) continue to increase, scalable communication architectures are needed to support the heavy communication demands of the system. This is reflected in the recent trend that many of the standard bus products, such as STbus, have now introduced the capability of designing a crossbar with multiple buses operating in parallel. The crossbar configuration should be designed to closely match the application traffic characteristics and performance requirements. In this work we address this issue of application-specific design of an optimal crossbar (using the STbus crossbar architecture) that satisfies the performance requirements of the application, together with optimal binding of cores onto the crossbar resources. We present a simulation-based design approach based on the analysis of actual traffic traces of the application, considering local variations in traffic rates, temporal overlap among traffic streams, and criticality of traffic streams. Our methodology is applied to several MPSoC designs and the resulting crossbar platforms are validated for performance by cycle-accurate SystemC simulation of the designs. The experimental case studies show large reductions in packet latencies (up to 7x) and large crossbar component savings (up to 3.5x) compared to traditional design approaches.
Keywords: Systems on Chips, Networks on Chips, crossbar, bus, application-specific, SystemC
Systems on chip (SOC) are composed of intellectual property blocks (IP) and interconnect. While mature tooling exists to design the former, tooling for interconnect design is still a research area. In this paper we describe an operational design flow that generates and configures application-specific network on chip (NOC) instances, given application communication requirements. The NOC can be simulated in SystemC and RTL VHDL. An independent performance verification tool verifies analytically that the NOC instance (hardware) and its configuration (software) together meet the application performance requirements. The Æthereal NOC's guaranteed performance is essential to replace time-consuming simulation by fast analytical performance validation. As a result, application-specific NOCs that are guaranteed to meet the application's communication requirements are generated and verified in minutes, reducing the number of design iterations. A realistic MPEG SOC example substantiates our claims.
The limited scalability of current bus topologies for Systems on Chips (SoCs) dictates the adoption of Networks on Chips (NoCs) as a scalable interconnection scheme. Current SoCs are highly heterogeneous in nature, making homogeneous, preconfigured NoCs inefficient drop-in alternatives. While highly parametric, fully synthesizable (soft) NoC building blocks appear to be a good match for heterogeneous MPSoC architectures, the impact of instantiation-time flexibility on performance, power, and silicon cost has not yet been quantified. This work details xpipes Lite, a design flow for the automatic generation of heterogeneous NoCs. xpipes Lite is based on highly customizable, high-frequency, low-latency NoC modules that are fully synthesizable. Synthesis results yield modules that are directly comparable to, if not better than, currently published state-of-the-art NoCs in terms of area, power, latency, and target operating frequency.
As microfluidics-based biochips become more complex, manufacturing yield will have significant influence on production volume and product cost. We propose an interstitial redundancy approach to enhance the yield of biochips that are based on droplet-based microfluidics. In this design method, spare cells are placed in the interstitial sites within the microfluidic array, and they replace neighboring faulty cells via local reconfiguration. The proposed design method is evaluated using a set of concurrent real-life bioassays.
Microfluidics-based biochips are soon expected to revolutionize clinical diagnosis, DNA sequencing, and other laboratory procedures involving molecular biology. Most microfluidic biochips are based on the principle of continuous fluid flow and they rely on permanently-etched microchannels, micropumps, and microvalves. We focus here on the automated design of "digital" droplet-based microfluidic biochips. In contrast to continuous-flow systems, digital microfluidics offers dynamic reconfigurability; groups of cells in a microfluidics array can be reconfigured to change their functionality during the concurrent execution of a set of bioassays. We present a simulated annealing-based technique for module placement in such biochips. The placement procedure not only addresses chip area, but it also considers fault tolerance, which allows a microfluidic module to be relocated elsewhere in the system when a single cell is detected to be faulty. Simulation results are presented for a case study involving the polymerase chain reaction.
Optimal synthesis of quantum circuits is intractable and heuristic methods must be employed. Templates are a general approach to reversible and quantum circuit simplification. In this paper, we consider the use of templates to simplify a quantum circuit initially found by other means. We present and analyze templates in the general case, and then provide particular details for circuits composed of NOT, CNOT and controlled-sqrt-of-NOT gates. We introduce templates for this set of gates and apply them to simplify both known quantum realizations of Toffoli gates and circuits found by earlier heuristic Fredkin and Toffoli gate synthesis algorithms. While the number of templates is quite small, the reduction in quantum cost is often significant.
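In the degenerate case, a template is just the identity "a self-inverse gate cancels with an adjacent copy of itself"; the toy rewriter below applies only that case (real templates cover longer identity sequences, and NOT, CNOT, and Toffoli gates are all self-inverse):

```python
# Toy template-based simplification: remove adjacent self-inverse gate
# pairs until a fixed point. Gates are tuples (name, *qubits).

def simplify(circuit):
    """Cancel adjacent identical self-inverse gates in one linear pass."""
    out = []
    for gate in circuit:
        if out and out[-1] == gate:      # self-inverse gate repeated
            out.pop()                    # ... so the pair cancels
        else:
            out.append(gate)
    return out

circuit = [("CNOT", 0, 1), ("NOT", 2), ("NOT", 2), ("CNOT", 0, 1)]
print(simplify(circuit))                 # -> []: the whole circuit cancels
```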
Quantum-dot Cellular Automata (QCA) is attracting considerable attention due to its extremely small feature sizes and ultra-low power consumption. Several designs using QCA technology have been proposed to date; however, we found that not all of them function properly. Further, no general design guidelines have been proposed so far, and a straightforward extension of a simple functional design pattern may fail. This makes designing large-scale circuits using QCA technology an extremely time-consuming process. In this paper we show several critical vulnerabilities in the structures of primitive QCA gates and QCA interconnects, and propose a disciplined guideline to prevent further plausible but malfunctioning QCA designs.
CMOS-based sensor array chips provide new and attractive features as compared to today's standard tools for medical, diagnostic, and biotechnical applications. Examples for molecule- and cell-based approaches and related circuit design issues are discussed.
On-chip networks for future system-on-chip designs need simple, high performance implementations. In order to promote system-level integrity, guaranteed services (GS) need to be provided. We propose a network-on-chip (NoC) router architecture to support this, and demonstrate with a CMOS standard cell design. Our implementation is based on clockless circuit techniques, and thus inherently supports a modular, GALS-oriented design flow. Our router exploits virtual channels to provide connection-oriented GS, as well as connection-less best-effort (BE) routing. The architecture is highly flexible, in that support for different types of BE routing and GS arbitration can be easily plugged into the router.
As Moore's Law continues to fuel the ability to build ever increasingly complex system-on-chips (SoCs), achieving performance goals is rising as a critical challenge to completing designs. In particular, the system interconnect must efficiently service a diverse set of data flows with widely ranging quality-of-service (QoS) requirements. However, the known solutions for off-chip interconnects such as large-scale networks are not necessarily applicable to the on-chip environment. Latency and memory constraints for on-chip interconnects are quite different from larger-scale interconnects. This paper introduces a novel on-chip interconnect arbitration scheme. We show how this scheme can be distributed across a chip for high-speed implementation. We compare the performance of the arbitration scheme with other known interconnect arbitration schemes. Existing schemes typically focus heavily on either low latency of service for some initiators, or alternatively on guaranteed bandwidth delivery for other initiators. Our scheme allows service latency on some initiators to be traded off smoothly against jitter bounds on other initiators, while still delivering bandwidth guarantees. This scheme is a subset of the QoS controls available in the SonicsMX (SMX) product.
As packet-switching interconnection networks replace buses and dedicated wires to become the standard on-chip interconnection fabric, reducing their power consumption has been identified to be a major design challenge. Network topologies have high impact on network power consumption. Technology scaling is another important factor that affects network power since each new technology changes semiconductor physical properties. As shown in this paper, these two aspects need to be considered synergistically. In this paper, we characterize the impact of process technologies on network energy for a range of topologies, starting from 2-dimensional meshes/tori, to variants of meshes/tori that incorporate higher dimensions, multiple hierarchies and express channels. We present a method which uses an analytical model to predict the most energy-efficient topology based on network size and architecture parameters for future technologies. Our model is validated against cycle-accurate network power simulation and shown to arrive at the same predictions. We also show how our method can be applied to actual parallel benchmarks with a case study. We see this work as a starting point for defining a roadmap of future on-chip networks.
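The flavor of such an analytical model can be sketched as follows; the hop-count formula is the standard uniform-traffic estimate for a square mesh, and the energy coefficients are placeholders for the technology-characterized values the paper derives:

```python
# Estimate per-packet energy as (routers traversed) x (router energy)
# plus (hops) x (link energy); comparing such estimates across
# topologies and technology points picks the most energy-efficient one.

import math

def mesh_packet_energy(n, e_router=1.0, e_link=0.6):
    """Average per-packet energy in a square k x k 2D mesh, n = k*k."""
    k = math.isqrt(n)
    # Per-dimension average distance under uniform traffic is
    # (k*k - 1) / (3*k); a 2D mesh pays it in two dimensions.
    avg_hops = 2 * (k * k - 1) / (3 * k)
    return (avg_hops + 1) * e_router + avg_hops * e_link

for n in (16, 64, 256):
    print(n, round(mesh_packet_energy(n), 2))
```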
Customization of processor architectures through Instruction Set Extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers are able to exploit manually such regularity in the data flow graphs to generate high-quality ISEs. In this paper, we present ISEGEN, an approach that identifies high-quality ISEs by iterative improvement following the basic principles of the well-known Kernighan-Lin (K-L) min-cut heuristic. Experimental results on a number of MediaBench, EEMBC and cryptographic applications show that our approach matches the quality of the optimal solution obtained by exhaustive search. We also show that our ISEGEN technique is on average 20x faster than a genetic formulation that generates equivalent solutions. Furthermore, the ISEs identified by our technique exhibit 35% more speedup than the genetic solution on a large cryptographic application (AES) by effectively exploiting its regular structure.
Early scheduling algorithms usually adjusted the clock cycle duration to the execution time of the slowest operation. This resulted in large slack times wasted in those cycles executing faster operations. To reduce the wasted time, multi-cycle and chaining techniques have been employed. While these techniques have produced successful designs, their effectiveness is often limited by the area increase that chaining may cause and the extra latency that multicycling may introduce. In this paper we present an optimization method that solves the time-constrained scheduling problem by transforming behavioural specifications into new ones whose subsequent synthesis substantially improves circuit performance. Our proposal breaks up some of the specification operations, allowing their execution over several, possibly non-consecutive, cycles, as well as the calculation of several data-dependent operation fragments in the same cycle. To do so, it takes into account the circuit latency and the execution time of every specification operation. The experiments carried out show that circuits obtained from the optimized specification are on average 60% faster than those synthesized from the original specification, with only slight increases in circuit area.
The importance of addressing soft errors in both safety-critical applications and commercial consumer products is increasing, mainly due to ever-shrinking geometries, higher-density circuits, and the employment of power-saving techniques such as voltage scaling and component shut-down. As a result, it is becoming necessary to treat reliability as a first-class citizen in system design. In particular, reliability decisions taken early in system design can have significant benefits in terms of design quality. Motivated by this observation, this paper presents a reliability-centric high-level synthesis approach that addresses the soft error problem. The proposed approach tries to maximize the reliability of the design while observing bounds on area and performance, and makes use of our reliability characterization of hardware components such as adders and multipliers. We implemented the proposed approach, performed experiments with several designs, and compared the results with those obtained by a prior proposal.
Varying partial bypassing in pipelined processors is an effective way to trade off performance, area, and energy in embedded processors. However, performance evaluation of partial bypassing has been inaccurate, largely due to the absence of bypass-sensitive retargetable compilation techniques. Furthermore, no existing partial bypass exploration framework estimates the power and cost overhead of partial bypassing. In this paper we present PBExplore: a framework for compiler-in-the-loop exploration of partial bypassing in processors. PBExplore accurately evaluates the performance of a partially bypassed processor using a generic bypass-sensitive compilation technique. It synthesizes the bypass control logic and estimates the area and energy overhead of each bypass configuration. PBExplore is thus able to effectively perform multi-dimensional exploration of the partial bypass design space. We present experimental results on the Intel XScale architecture with MiBench benchmarks and demonstrate the need, utility, and exploration capabilities of PBExplore.
We discuss the problem of Concurrent Error Detection (CED) in a popular class of asynchronous controllers, namely Burst-Mode machines. We first outline the particularities of these clock-less circuits, including the use of redundancy to ensure hazard-free operation, and we explain how they limit the applicability and effectiveness of traditional CED methods, such as duplication. We then demonstrate how duplication can be enhanced to resolve these limitations through additional hardware for comparison synchronization and detection of error-induced hazards, which jeopardize the interaction of the circuit with its environment. Finally, we propose a Transition-Triggered CED method which employs a transition prediction function to eliminate the need for hazard detection circuitry and a hazard-free implementation of the duplicate. As indicated by experimental results, the proposed method significantly reduces the cost of CED, with average hardware savings of 22%.
The design of reliable circuits has received much attention in the past, leading to several design techniques that introduce fault detection and fault tolerance properties into systems for critical applications/environments. Such design methodologies have tackled the problem at different abstraction levels, from switch level to logic and RT level, and more recently at system level. The aim of this paper is to introduce a novel system-level technique based on redefining the functionality of the operators in the system specification. This technique provides reliability properties to the system data path, transparently to the designer. Feasibility, fault coverage, performance degradation, and overheads are investigated on a FIR circuit.
This paper addresses error-resilience as the capability to tolerate bit-flips in a compressed test data stream (which is transferred from an Automatic Test Equipment (ATE) to the Device-Under-Test (DUT)). In an ATE, bit-flips may occur in either the electronics components of the loadboard or the high-speed serial communication links (between the user interface workstation and the head). It is shown that errors caused by bit-flips can seriously degrade the test quality (as measured by coverage) of the compressed data streams. The effects of bit-flips on compression are analyzed and various test data compression techniques are evaluated. It is shown that for benchmark circuits, coverage of test sets can be reduced by 10%-30%.
Index terms: error resilience, fault tolerance, yield, reliable operation of ATE, compression
Triple Modular Redundancy (TMR) is a suitable fault tolerance technique for SRAM-based FPGAs. However, one of the main challenges in achieving 100% robustness in TMR-protected designs running on programmable platforms is preventing upsets in the routing from provoking undesirable connections between signals from distinct redundant logic parts, which can generate an error at the output. This paper investigates the optimal design of the TMR logic (e.g., by cleverly inserting voters) to ensure robustness. Four different versions of a TMR digital filter were analyzed by fault injection, with faults randomly inserted straight into the bitstream of the FPGA. The experimental results presented in this paper demonstrate that the number and placement of voters in the TMR design can directly affect the fault tolerance, reducing the fraction of routing upsets able to cause an error in the TMR circuit from 4.03% to 0.98%.
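For reference, the masking primitive TMR relies on is bitwise 2-of-3 majority voting; where such voters are inserted in the netlist is exactly the design choice evaluated by fault injection above:

```python
# Toy model of TMR majority voting: three redundant copies compute the
# same function and a voter masks any single faulty copy.

def vote(a, b, c):
    """Bitwise 2-of-3 majority."""
    return (a & b) | (a & c) | (b & c)

good, faulty = 0b1011, 0b1111          # one redundant copy upset by a fault
print(bin(vote(good, good, faulty)))   # 0b1011: the error is masked
```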
In this paper we describe a fully-automated methodology for formal verification of fused-multiply-add floating point units (FPUs). Our methodology verifies an implementation FPU against a simple reference model derived from the processor's architectural specification, which may include all aspects of the IEEE specification including denormal operands and exceptions. Our strategy uses a combination of BDD- and SAT-based symbolic simulation. To make this verification task tractable, we use a combination of case-splitting, multiplier isolation, and automatic model reduction techniques. The case-splitting is defined only in terms of the reference model, which makes this approach easily portable to new designs. The methodology is directly applicable to multi-GHz industrial implementation models (e.g., HDL or gate-level circuit representations) that contain all details of the high-performance transistor-level model, such as aggressive pipelining, clocking, etc. Experimental results are provided to demonstrate the computational efficiency of this approach.
While most of the effort in improving verification times for pipeline machine verification has focused on faster decision procedures, we show that the refinement maps used also have a drastic impact on verification times. We introduce a new class of refinement maps for pipelined machine verification, and using the state-of-the-art verification tools UCLID and Siege we show that one can attain several orders of magnitude improvements in verification times over the standard flushing-based refinement maps, even enabling the verification of machines that are too complex to otherwise automatically verify.
Development of energy and performance-efficient embedded software is increasingly relying on application of complex transformations on the critical parts of the source code. Designers applying such nontrivial source code transformations are often faced with the problem of ensuring functional equivalence of the original and transformed programs. Currently they have to rely on incomplete and time-consuming simulation. Formal automatic verification of the transformed program against the original is instead desirable. This calls for equivalence checking tools similar to the ones available for comparing digital circuits. We present such a tool to compare array-intensive programs related through a combination of important global transformations like expression propagations, loop and algebraic transformations. When the transformed program fails to pass the equivalence check, the tool provides specific feedback on the possible locations of errors.
Inductive cross-talk within IC packaging is becoming a significant bottleneck in high-speed inter-chip communication. The parasitic inductance within IC packaging causes bounce on the power supply pins in addition to glitches and rise-time degradation on the signal pins. Until recently, the parasitic inductance problem was addressed by aggressive package design. In this work we present a technique to encode the off-chip data transmission to limit bounce on the supplies and reduce inductive signal coupling due to transitions on neighboring signal lines. Both these performance limiting factors are modeled in a common mathematical framework. Our experimental results show that the proposed encoding based techniques result in reduced supply bounce and signal degradation due to inductive cross-talk, closely matching the theoretical predictions. We demonstrate that the overall bandwidth of a bus actually increases by 85% using our technique, even after accounting for the encoding overhead. The asymptotic bus size overhead is between 30% and 50%, depending on how stringent the user-specified inductive cross-talk parameters are.
Buffer insertion is a popular technique to reduce the interconnect delay. The classic buffer insertion algorithm of van Ginneken has time complexity O(n²), where n is the number of buffer positions. Lillis, Cheng and Lin extended van Ginneken's algorithm to allow b buffer types in time O(b²n²). For modern design libraries that contain hundreds of buffers, it is a serious challenge to balance the speed and performance of the buffer insertion algorithm. In this paper, we present a new algorithm that computes the optimal buffer insertion in O(bn²) time. The reduction is achieved by the observation that the (Q,C) pairs of the candidates that generate the new candidates must form a convex hull. On industrial test cases, the new algorithm is faster than the previous best buffer insertion algorithms by orders of magnitude.
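A hedged sketch of the pruning step implied by the convex-hull observation (our own simplified rendering, not the paper's algorithm): since the delay a buffer adds is an affine function of the load C, only candidates on the upper hull of the (C, Q) frontier can ever maximize the post-buffer slack Q for some buffer type.

```python
# Prune (C, Q) candidates (C = downstream capacitance, Q = slack) to the
# upper-left convex hull: small C, large Q, no point below a chord.

def prune_to_hull(cands):
    """Keep the upper convex hull of (C, Q) candidate pairs."""
    hull = []
    for c, q in sorted(cands):            # increasing C, then Q
        if hull and hull[-1][1] >= q:     # dominated: larger C, no better Q
            continue
        # Pop hull points that fall below the chord to the new point.
        while len(hull) >= 2:
            (c1, q1), (c2, q2) = hull[-2], hull[-1]
            if (q2 - q1) * (c - c1) <= (q - q1) * (c2 - c1):
                hull.pop()
            else:
                break
        hull.append((c, q))
    return hull

print(prune_to_hull([(1, 5), (2, 6), (3, 6.5), (4, 9)]))
# -> [(1, 5), (4, 9)]: interior points can never yield an optimal candidate
```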
This paper presents a novel repeater insertion algorithm for interconnect power minimization. The novelty of our approach is in the judicious integration of an analytical solver and a dynamic programming based method. Specifically, the analytical solver chooses a concise repeater library and a small set of repeater location candidates such that the dynamic programming algorithm can be performed fast with little degradation of the solution quality. In comparison with previously reported repeater insertion schemes, within comparable runtimes, our approach achieves up to 37% higher power savings. Moreover, for the same design quality, our scheme attains a speedup of two orders of magnitude.
Most of the DNA chips available on the market are based on external or internal optical detection (fluorescence or chemiluminescence) and need a bulky chip reader (optics, laser, camera or PMT). We will present a new detection strategy using direct electrochemical detection of DNA hybridisation with conductive polymers grafted on an active silicon chip. We will report results on the first step of the fabrication process, with emphasis on full-wafer electro-polymerisation of DNA probes on a modified CMOS technology.
Single-chip CMOS-based biosensors that feature microcantilevers as transducer elements are presented. The cantilevers are functionalized for the capturing of specific analytes, e.g., proteins or DNA. The binding of the analyte changes the mechanical properties of the cantilevers such as surface stress and resonant frequency, which can be detected by an integrated Wheatstone bridge. The monolithic integrated readout allows for a high signal-to-noise ratio, lowers the sensitivity to external interference and enables autonomous device operation.