Sessions: [Keynote Address] [2.2] [2.3] [2.4] [2.5] [2.6] [2.7] [2.8] [3.2] [3.3] [3.4] [3.5] [3.6] [3.7] [3.8] [IP1] [4.2] [4.3] [4.4] [4.5] [4.6] [4.7] [5.1] [5.2] [5.3] [5.4] [5.5] [5.6] [5.7] [IP2] [6.1.1] [6.1.2] [6.2] [6.3] [6.4] [6.5] [6.6] [6.7] [6.8] [7.1] [7.2] [7.3] [7.4] [7.5] [7.6] [7.7] [7.8] [IP3] [8.1] [8.2] [8.3] [8.4] [8.5] [8.6] [8.7] [8.8] [9.1] [9.2] [9.3] [9.4] [9.5] [9.6] [9.7] [IP4] [10.1.1] [10.1.2] [10.2] [10.3] [10.4] [10.5] [10.6] [10.7] [10.8] [11.1] [11.2] [11.3] [11.4] [11.5] [11.6] [11.7] [11.8] [IP5] [12.1] [12.2] [12.3] [12.4] [12.5] [12.6] [12.7] [12.8]
DATE Executive Committee
DATE Sponsors
Technical Program Topic Chairs
Technical Program Committee
Reviewers
Foreword
Best Paper Awards
Tutorials
Ph.D. Forum
Call for Papers: DATE 2012
Moore's Law continues to deliver ever more transistors on an integrated circuit, but discontinuities in the progress of technology mean that the future isn't simply an extrapolation of the past. For example, design cost and complexity constraints have recently caused the microprocessor industry to switch to multi-core architectures, even though these parallel machines present programming challenges that are far from solved. Moore's Law now translates into ever more processors on a multi- and, soon, many-core chip. The software challenge is compounded by the need for increasing fault tolerance as near-atomic-scale variability and robustness problems bite harder. We look beyond this transitional phase to a future where the availability of processor resources is effectively unlimited and computations must be optimised for energy usage rather than load balancing, and we look to biology for examples of how such systems might work. Conventional concerns such as synchronisation and determinism are abandoned in favour of real-time operation and adapting around component failure with minimal loss of system efficacy.
We address the problem of analyzing the performance of System-on-Chip (SoC) architectures in the presence of variations. Existing techniques such as gate-level statistical timing analysis compute the distributions of clock frequencies of SoC components. However, we demonstrate that translating component-level characteristics into a system-level performance distribution is a complex and challenging problem due to inter-dependencies between the components' execution, indirect effects of shared resources, and interactions between multiple system-level "execution paths". We argue that accurate variation-aware system-level performance analysis requires repeated system execution, which is prohibitively slow when based on simulation. Emulation is a widely used approach to drastically speed up system-level simulation, but it has hitherto not been applied to variation analysis. We describe a framework, Variability Emulation for SoC Performance Analysis (VESPA), that adapts and applies emulation to the problem of variation-aware SoC performance analysis. The proposed framework consists of three phases: component variability characterization, variation-aware emulation setup, and Monte Carlo-driven emulation. We demonstrate the utility of the proposed framework by applying it to design variation-aware architectures for two example SoCs, an 802.11 MAC processor and an MPEG encoder. Our results suggest that variability emulation has great potential to enable variation-aware design and exploration at the system level.
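The Monte Carlo-driven phase can be illustrated with a toy model (hypothetical distributions and system structure, not the VESPA framework itself): sample each component's clock frequency from its characterized distribution, evaluate a system-level latency that couples the components, and aggregate the results into a performance distribution.

```python
import random
import statistics

def sample_core_freqs(nominal, sigma, n_cores, rng):
    """Draw per-component clock frequencies from a characterized
    Gaussian distribution (the component variability phase);
    clamped to stay positive."""
    return [max(rng.gauss(nominal, sigma), 1e-6) for _ in range(n_cores)]

def system_latency(freqs, cycles_per_stage):
    """Toy system model: a pipeline of components, where each stage
    contributes cycles / frequency; the system-level latency depends
    on every component, not on any one in isolation."""
    return sum(c / f for c, f in zip(cycles_per_stage, freqs))

def monte_carlo_performance(nominal, sigma, cycles_per_stage,
                            trials=10000, seed=0):
    """Repeated 'emulated' executions under sampled frequencies,
    yielding a system-level performance distribution."""
    rng = random.Random(seed)
    lat = []
    for _ in range(trials):
        freqs = sample_core_freqs(nominal, sigma, len(cycles_per_stage), rng)
        lat.append(system_latency(freqs, cycles_per_stage))
    return statistics.mean(lat), statistics.stdev(lat)
```

Note that the mean latency exceeds the latency at nominal frequency (Jensen's inequality, since delay is convex in frequency), which is exactly the kind of system-level effect that per-component distributions alone do not reveal.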
Three-dimensional integrated circuits (3D ICs) have become an emerging technology in view of their advantages in packing density and flexibility in heterogeneous integration. The multi-core processor (MCP), which is able to deliver equivalent performance with less power consumption, is a candidate for 3D implementation. However, when maximizing the throughput of a 3D MCP, thermal issues must be taken into consideration due to the inherent heat-removal limitation. Furthermore, since the temperature of a core strongly depends on its location in the 3D MCP, proper task allocation helps to alleviate potential thermal problems and improve throughput. In this paper, we present a thermal-aware on-line task allocation algorithm for 3D MCPs. Our experimental results show that the proposed method achieves a 16.32X runtime speedup and a 23.18% throughput improvement; these results are comparable to the exhaustive solutions obtained from the optimization modeling software LINGO. On average, our throughput is only 0.85% worse than that of the exhaustive method. For 128 task-to-core allocations, our method takes only 0.932 ms, which is 57.74 times faster than the previous work.
Keywords: Multi-core processor, task allocation, thermal awareness, three-dimensional integration, throughput optimization, temperature uniformity.
NAND flash memory is widely used in embedded systems due to its non-volatility, shock resistance, and high cell density. In recent years, various Flash Translation Layer (FTL) schemes (especially hybrid-level FTL schemes) have been proposed. Although these FTL schemes provide good solutions in terms of endurance and wear-leveling, none of them considers reusing the free pages in both data blocks and log blocks during a merge operation. By reusing these free pages, fewer free blocks are needed and the endurance of the NAND flash memory is enhanced. We evaluate our reuse strategy using a variety of application-specific I/O traces from Windows systems. Experimental results show that the proposed scheme can effectively reduce erase counts and enhance the endurance of flash memory.
In this paper, we focus on register allocation techniques to simultaneously reduce the energy consumption and heat buildup of register accesses. The conflict between these two objectives is resolved through the introduction of a hardware rotator. A register allocation algorithm followed by a refinement method is proposed based on the access patterns and the effects of the rotator. Experimental results show that the proposed algorithms obtain notable improvements in energy consumption and temperature reduction for embedded applications.
Index Terms - Register allocation, Bit transition activity, Heat buildup, Rotator
The passivity characterization and enforcement of linear interconnect macromodels have received much attention in the recent literature. It is now widely recognized that the Hamiltonian eigensolution is a very reliable technique for such characterization. However, most available algorithms for the determination of the required Hamiltonian eigenvalues still require excessive computational resources for large-size macromodels with thousands of states. This work breaks through this complexity barrier by introducing the first parallel implementation of a specialized Hamiltonian eigensolver, designed and optimized for shared-memory multicore architectures. Our starting point is a multi-shift restarted and deflated Arnoldi process. Excellent parallel efficiency is obtained by running different Arnoldi iterations concurrently on different threads. The numerical results show that macromodels with several thousand states are characterized in a few seconds on a 16-core machine, with close-to-ideal speedup factors.
This paper proposes a highly efficient methodology for the statistical analysis of RC nets subject to manufacturing variabilities, based on the combination of parameterized RC extraction and structure-preserving parameterized model order reduction methods. The sensitivity-based layout-to-circuit extraction generates first-order Taylor series approximations of resistances and capacitances with respect to multiple geometric parameter variations. This formulation becomes the input of the parameterized model order reduction, which exploits the explicit parameter dependence to produce a linear combination of multiple non-parameterized transfer functions weighted by the parameter variations. Such a formulation enables fast computation of statistical properties such as the standard deviation of the transfer function given the process spreads of the technology. Both the extraction and the reduction techniques avoid any parameter sampling. Therefore, the proposed method achieves a significant speedup compared to Monte Carlo approaches.
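As a worked miniature of this flow (with invented sensitivities and spreads, purely for illustration): once resistance and capacitance are available as first-order Taylor series in the geometric parameters, the standard deviation of a delay metric follows in closed form, with no parameter sampling at all.

```python
import math

def delay_spread(R0, C0, kRW, kRT, kCW, sW, sT):
    """First-order statistical delay analysis of one RC segment.
    Hypothetical sensitivity model, for illustration only:
      R(dW, dT) ~ R0 * (1 - kRW*dW - kRT*dT)   # resistance vs. width/thickness
      C(dW, dT) ~ C0 * (1 + kCW*dW)            # capacitance vs. width
    Returns (nominal tau, sigma of tau) for the metric tau = R*C."""
    tau0 = R0 * C0
    # First-order sensitivities of tau = R*C to each geometric parameter:
    dtau_dW = tau0 * (kCW - kRW)   # width effects on R and C partly cancel
    dtau_dT = -tau0 * kRT          # thickness affects R only
    # Independent Gaussian parameters: variances add; no Monte Carlo needed.
    sigma_tau = math.sqrt((dtau_dW * sW) ** 2 + (dtau_dT * sT) ** 2)
    return tau0, sigma_tau

# Example with made-up numbers: 100 ohm, 1 pF, 3%/2% process spreads.
tau0, sigma = delay_spread(100.0, 1e-12, 0.8, 0.5, 0.6, 0.03, 0.02)
```

The same closed-form propagation generalizes to the linear combination of transfer functions described above, which is what lets the method skip Monte Carlo sampling entirely.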
The analysis of on-chip power grids requires the solution of large systems of linear algebraic equations with specific properties. Lately, a class of random walk based solvers has been developed that is capable of handling these systems; these solvers are especially useful when only a small part of the original system must be solved. These methods build a probabilistic network that corresponds to the power grid. However, this construction does not fully exploit the properties of the problem and can result in large variances for the random walks and, consequently, large run times. This paper presents an efficient methodology, inspired by the idea of importance sampling, to improve the runtime of random walk based solvers. Experimental results show significant speedups compared to the naive random walks used by state-of-the-art random walk solvers.
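The baseline random-walk idea (before any importance-sampling variance reduction) can be sketched on the simplest possible "grid", a uniform resistor chain tied to ideal sources at both ends; the node layout and values here are invented for illustration.

```python
import random

def walk_voltage(node, v_left, v_right, n_nodes, trials, rng):
    """Estimate the voltage at one internal node of a uniform resistor
    chain whose endpoints are ideal voltage sources. From each internal
    node the walker steps left or right with probability 1/2 (in general,
    transition probabilities are conductance-weighted); a walk absorbed
    at a source returns that source's voltage. The node voltage is the
    mean over many independent walks, so only the queried node is solved,
    never the whole system."""
    total = 0.0
    for _ in range(trials):
        pos = node
        while 0 < pos < n_nodes - 1:
            pos += rng.choice((-1, 1))
        total += v_left if pos == 0 else v_right
    return total / trials
```

For this chain the exact answer is linear interpolation between the two sources; the estimator's variance, and hence its runtime, is what importance-sampling-style reweighting of the transitions is designed to shrink.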
We propose a block-diagonal structured model order reduction (BDSM) scheme for fast power grid analysis. Compared with existing power grid model order reduction (MOR) methods, BDSM has several advantages. First, unlike many power grid reductions that are based on terminal reduction and thus error-prone, BDSM utilizes an exact column-by-column moment matching to provide higher numerical accuracy. Second, with similar accuracy and macromodel size, BDSM generates very sparse block-diagonal reduced-order models (ROMs) for massive-port systems at a lower cost, whereas traditional algorithms such as PRIMA produce full dense models inefficient for the subsequent simulation. Third, different from those MOR schemes based on extended Krylov subspace (EKS) technique, BDSM is input-signal independent, so the resulting ROM is reusable under different excitations. Finally, due to its block-diagonal structure, the obtained ROM can be simulated very fast. The accuracy and efficiency of BDSM are verified by industrial power grid benchmarks.
Panelists: G. De Micheli, P. Groeneveld, H. Hiller, E. Macii, P. Magarshack
Virtually all current integrated circuits and systems would not exist without the use of logic synthesis and physical design tools. These design technologies were developed in the last fifty years and it is hard to say if they have come to full maturity. Physical design evolved from methods used for printed-circuit boards where the classic problems of placement and routing surfaced for the first time [1]. Logic synthesis evolved in a different trajectory, starting from the classic works on switching theory [2], but took a sharp turn in the eighties when multiple-level logic synthesis, coupled to semicustom technologies, provided designers with a means to map models in hardware description languages into netlists ready for physical design [3], [4]. The clear separation between logic and physical design tasks enabled the development of effective design tool flows, where signoff could be done at the netlist level. Nevertheless, the relentless downscaling of semiconductor technologies forced this separation to disappear, once circuit delays became interconnect-dominated. Since the nineties, design flows combined logic and physical design tools to address the so-called timing closure problem, i.e., to reduce the designer effort to synthesize a design that satisfies all timing constraints. Despite many efforts in various directions, most notably with the use of the fixed timing methodology, this problem is not completely solved yet. The complexity of integrated logic and physical tool flows, as well as the decrease in design starts of large ASICs, limits the development of these flows to a few EDA companies.
With shrinking transistor sizes and supply voltages, errors in combinational logic due to radiation particle strikes are on the rise. A broad range of applications will soon require protection from this type of error, requiring an effective and inexpensive solution. Many previously proposed logic protection techniques rely on duplicate logic or latches, incurring high overheads. In this paper, we present a technique for transient error detection using parity trees for power and area efficiency. This approach is highly customizable, allowing adjustment of a number of parameters for optimal error coverage and overhead. We present simulation results comparing our scheme to latch duplication, showing, on average, greater than 55% savings in area and power overhead for the same error coverage. We also demonstrate adding protection to reach a target logic soft error rate, achieving up to a 59X reduction in the error rate with under 2% power and area overhead.
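The core parity-tree idea (a generic sketch, not the paper's customizable partitioning scheme) is that XOR-ing a group of signals into a single parity bit and comparing it against a reference parity detects any odd number of flips in the group, at far less cost than duplicating the logic.

```python
from functools import reduce

def parity(bits):
    """XOR tree over a group of logic signals (returns 0 or 1)."""
    return reduce(lambda a, b: a ^ b, bits, 0)

def transient_detected(observed, reference_parity):
    """Recompute parity from the (possibly corrupted) observed signals
    and compare with the reference parity tree's output; any odd number
    of bit flips within the protected group changes the parity and is
    therefore detected."""
    return parity(observed) != reference_parity

# Example: protect a group of eight combinational outputs.
golden = [1, 0, 1, 1, 0, 0, 1, 0]
ref = parity(golden)
struck = list(golden)
struck[3] ^= 1   # a single radiation-induced transient flip
```

Grouping more signals under one tree lowers overhead but also lowers coverage (an even number of simultaneous flips in one group escapes), which is the coverage/overhead trade-off the abstract's tunable parameters control.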
As FPGA feature sizes shrink to nanometers, soft errors increasingly become an important concern for SRAM-based FPGAs. Without considering the application-level impact, existing reliability-oriented placement and routing approaches analyze the soft error rate (SER) only at the physical level, consequently completing the design with suboptimal soft error mitigation. Our analysis shows that the statistical variation of the application-level factor is significant. Hence, in this work, we first propose a cube-based analysis to efficiently and accurately evaluate the application-level factor. We then propose a cross-layer optimized placement and routing algorithm to reduce the SER by incorporating the application-level and physical-level factors together. Experimental results show that the average difference in the application-level factor between our cube-based method and Monte Carlo golden simulation is less than 0.01. Moreover, compared with the baseline VPR placement and routing technique, the cross-layer optimized placement and routing algorithm can reduce the SER by 14% with no area or performance overhead.
Keywords- cross-layer optimization, cube-based analysis, FPGA, placement and routing, soft error rate.
We present a novel trigonometry-based probability calculation (TPC) method for analyzing circuit behavior and reliability in the presence of errors that occur with extremely low probability. Signal and error probabilities are represented by trigonometric functions controlled by their corresponding angles. By combining trigonometric identities and Taylor expansions, the effect of an error at a particular gate is simulated as a rotation. In addition, the correlations among signals caused by reconvergence are carefully handled. The TPC method is shown to be more scalable and accurate than prior approaches, especially for very low-probability errors. We measure the performance of TPC by applying it to the ISCAS and LGSyn-91 benchmark circuits. Experimental results show that TPC achieves near-linear runtime complexity even with the largest circuits, while the accuracy gradually increases with decreasing error probabilities.
Keywords- Error modeling, logic circuits, probabilistic analysis, reliability, soft errors.
In this paper, we present a very fast and accurate technique to estimate the soft error rate of digital circuits in the presence of Multiple Event Transients (METs). In the proposed technique, called Multiple Event Probability Propagation (MEPP), a four-valued logic and probability set is used to accurately propagate the effects of multiple erroneous values (transients) due to METs to the outputs and obtain the soft error rate. MEPP considers a unified treatment of all three masking mechanisms, i.e., logical, electrical, and timing, while propagating the transient glitches. Experimental results through comparisons with statistical fault injection confirm the accuracy (only 2.5% difference) and speedup (10,000X faster) of MEPP.
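For intuition, ordinary signal-probability propagation through a gate netlist (assuming independent inputs; MEPP's actual four-valued set and its logical/electrical/timing masking models go well beyond this sketch) looks like:

```python
def p_not(p):
    """P(output = 1) of an inverter, given P(input = 1)."""
    return 1.0 - p

def p_and(p1, p2):
    """P(output = 1) of an AND gate, assuming independent inputs."""
    return p1 * p2

def p_or(p1, p2):
    """P(output = 1) of an OR gate, by inclusion-exclusion."""
    return p1 + p2 - p1 * p2

def output_error_probability(p_err, depth):
    """Toy propagation: probability that at least one of `depth`
    independent transient-affected stages disturbs the output, each
    stage failing with probability p_err (no masking modeled)."""
    p = 0.0
    for _ in range(depth):
        p = p_or(p, p_err)
    return p
```

The analytical propagation is what makes such techniques orders of magnitude faster than fault injection: probabilities flow through the netlist once instead of simulating thousands of injected faults.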
It is projected that the communication data volume in electric vehicles will significantly increase compared to state-of-the-art vehicles due to additional functionalities like x-by-wire and safety functions. This paper presents a networking concept for electric vehicles to cope with the high data volume in cases where a single FlexRay bus is not sufficient. We present a FlexRay switch concept that is capable of increasing the effective bandwidth and improving the safety of existing FlexRay buses. A prototype FPGA implementation shows the feasibility of our approach. Further, a scheduling approach for the FlexRay switch that obtains optimal results based on Integer Linear Programming (ILP) is presented. Since the ILP approach becomes intractable for real-world problems, we present a heuristic three-step approach that determines the branches of the network, performs a local scheduling for each node, and finally assembles the local schedules into a global schedule. Test cases and an entire realistic in-vehicle network are used to emphasize the benefits of the proposed approach.
In this paper we present an approach for the configuration and reconfiguration of FlexRay networks to increase their fault tolerance. To guarantee correct and deterministic system behavior, the FlexRay specification does not allow reconfiguration of the schedule at run time. To avoid the necessity of a complete bus restart in case of a node failure, we propose a reconfiguration that uses redundant slots in the schedule and/or combines messages in existing frames and slots, to compensate for node failures and increase robustness. Our approach supports the developer in increasing the fault tolerance of the system during the design phase. It is a heuristic which, in addition to a determined initial configuration, calculates possible reconfigurations for the remaining nodes of the FlexRay network in case of a node failure, to keep the system working properly. An evaluation by means of realistic safety-critical automotive real-time systems revealed that it determines valid reconfigurations for up to 80% of possible individual node failures. In summary, our approach offers major support to developers of FlexRay networks, since the results provide helpful feedback about reconfiguration capabilities. In an iterative design process, this information can be used to determine and optimize valid reconfigurations.
Detecting and reacting to faults is an indispensable capability for many wireless sensor network applications. Unfortunately, implementing fault detection and error correction algorithms is challenging. Programming languages and fault tolerance mechanisms for sensor networks have historically been designed in isolation. This is the first work to combine them. Our goal is to simplify the design of fault-tolerant sensor networks. We describe a system that makes it unnecessary for sensor network application developers and users to understand the intricate implementation details of fault detection and tolerance techniques, while still using their domain knowledge to support fault detection, error correction, and error estimation mechanisms. Our FACTS system translates low-level faults into their consequences for application-level data quality, i.e., consequences domain experts can appreciate and understand. FACTS is an extension of an existing sensor network programming language; its compiler and runtime libraries have been modified to support automatic generation of code for on-line fault detection and tolerance. This code determines the impacts of faults on the accuracies of the results of potentially complex data aggregation and analysis expressions. We evaluate the overhead of the proposed system on code size, memory use, and the accuracy improvements for data analysis expressions using a small experimental testbed and simulations of large-scale networks.
In recent decades, the amount of genomic data that needs to be analyzed has grown exponentially. A very important problem in biology is the extraction of the biologically functional genomic DNA from the actual genome of an organism. Many computational biology algorithms utilizing various approaches have been proposed to solve the gene-finding problem; GlimmerHMM is considered one of the most efficient such algorithms. This paper presents two different accelerators for the GlimmerHMM algorithm. One of them is implemented on a modern FPGA platform, exploiting the parallelism that reconfigurable logic offers; the other utilizes a GPU (Graphics Processing Unit), taking advantage of a highly multithreaded operational environment. The performance of the implemented systems is compared against that achieved when the official distribution of the algorithm is executed on a high-end multi-core server; the speedup achieved, for the most compute-intensive part, is up to 200x for the FPGA-based system and up to 34x for the GPU-based system.
Keywords- Gene finding, FPGA, GPU, bioinformatics
The impact of variability on sub-45nm CMOS multimedia platforms makes it hard to provide application QoS guarantees, as speed variations across the cores may cause sub-optimal and sample-dependent utilization of the available resources and energy budget. These effects can be compensated for by an efficient allocation of the workload at run time. In the context of multimedia applications, a critical objective is to compensate for core speed variability while meeting time constraints without impacting energy consumption. In this paper we present a new approach to computing optimal task allocations at run time. The proposed strategy exploits an efficient and scalable implementation to find on-line the best possible solution in a tightly bounded time. Experimental results demonstrate the effectiveness of the compensation both in terms of deadline miss rate and energy savings. The results have been compared with those obtained by applying state-of-the-art techniques to a multithreaded MPEG2 decoder. The validation has been performed on a cycle-accurate virtual prototype of a next-generation industrial multicore platform that has been extended with process variability models.
This paper presents a new technique, called subclock power gating, for reducing leakage power in digital circuits. The proposed technique works concurrently with voltage and frequency scaling, and power reduction is achieved by power gating within the clock cycle during active mode, unlike traditional power gating, which is applied during idle mode. The proposed technique can be implemented using standard EDA tools with simple modifications to the standard power gating design flow. Using a 90nm technology library, the technique is validated on two case studies: a 16-bit parallel multiplier and an ARM Cortex-M0 microprocessor, provided by our industrial project partner. Compared to designs without subclock power gating, we show that, for a given power budget, the leakage power saved allows 45x and 2.5x improvements in energy efficiency for the multiplier and the microprocessor, respectively.
In premium vehicles, the number of distributed comfort-, safety-, and infotainment-related functions is steadily increasing. For this reason, the requirements for the underlying communication architecture are also becoming more stringent. In addition, the diversity of today's deployed communication technologies and the need for higher bandwidth complicate the design of future network architectures. Ethernet and IP, both standardized and widely used, could be one solution to homogenize communication architectures and to provide higher bandwidth. This paper focuses on a migration concept for replacing the CAN buses employed today by Ethernet/IP-based networks. It highlights several concepts to minimize the protocol header overhead by using EA- and rule-based algorithms, and presents migration results for currently deployed automotive CAN subnetworks.
Index Terms - Ethernet, IP, UDP, CAN, migration, optimization, automotive, embedded, CANoverIP, XoverIP
Increasingly intelligent energy-management and safety systems are being developed to realize safe and economic automobiles. The realization of these systems is only possible with complex and distributed software. This development poses a challenge for verification and validation. Upcoming standards like ISO 26262 provide requirements for verification and validation during the development phases. Advanced test methods are required for safety-critical functions; formal specification of requirements and appropriate testing strategies at different stages of the development cycle are part of them. In this paper we present our approach to formalizing the requirements specification by means of test models. These models serve as the basis for the subsequent testing activities, including the automated derivation of executable test cases. Test cases can be derived statistically, randomly on the basis of operational profiles, or deterministically, in order to perform different testing strategies. We have applied our approach with a large German OEM in different development stages of active safety and energy-management functionalities. The test cases were executed in model-in-the-loop and hardware-in-the-loop simulation. With our approach, errors that had not been discovered before were identified both in the requirements specification and in the implementation.
Keywords: Road Vehicles, Safety Critical Systems, Software Testing, Requirements Engineering, Automated Testing, Verification, Validation
Dynamic Voltage and Frequency Scaling (DVFS), a widely adopted technique to ensure safe thermal characteristics while delivering superior energy efficiency, is rapidly becoming inefficient with technology scaling due to two critical factors: (a) the inability to scale the supply voltage due to reliability concerns; and (b) the inability of dynamic adaptations through DVFS to alter the underlying power-hungry circuit characteristics, which are designed for the nominal frequency. In this paper, we show that DVFS-scaled circuits substantially lag in energy efficiency, by 22-86%, compared to ground-up designs for the target frequency levels. We propose Topologically Homogeneous Power-Performance Heterogeneous multicore systems (THPH), a fundamentally alternative means of designing energy-efficient multicore systems. Using a system-level CAD approach, we seamlessly integrate architecturally identical cores designed for different voltage-frequency (VF) domains. We use a combination of a standard cell library based CAD flow and full-system architectural simulation to demonstrate an 11-22% improvement in energy efficiency using our design paradigm.
Instance- and temperature-dependent leakage power variability is already a significant issue in contemporary embedded processors, and one which is expected to increase in importance with the scaling of semiconductor technology. We measure and characterize this leakage power variability in current microprocessors, and show that variability-aware duty cycle scheduling produces a 7.1x improvement in sensing quality for a desired lifetime. In contrast, pessimistic estimates of power consumption leave 61% of the energy untapped, and datasheet power specifications fail to meet required lifetimes by 14%. Finally, we introduce a duty cycle abstraction for TinyOS that allows applications to explicitly specify lifetime and minimum duty cycle requirements for individual tasks, and dynamically adjusts duty cycle rates so that overall quality of service is maximized in the presence of power variability.
Advances in chip-multiprocessor processing capabilities have led to increased power consumption and temperature hotspots. Reducing the on-die peak temperature is important from both power-reduction and reliability considerations. However, the presence of task deadlines constrains the reduction of peak temperature and thus complicates the determination of the optimal speeds for minimizing it. We formulate the determination of optimal speeds for minimizing the peak temperature of an execution with task deadlines as a quasiconvex optimization problem. The formulation includes accurate power and thermal models that capture the dependency of leakage power on temperature. Experiments demonstrate that our approach is very flexible in adapting to various scenarios of workload and deadline specifications. We obtained an 8 °C reduction in peak temperature for a sample execution of benchmarks.
Satisfiability (SAT) solvers often benefit from clauses learned by the DPLL procedure, even though they are by definition redundant. In addition to those derived from conflicts, the clauses learned by dominator analysis during the deduction procedure tend to produce smaller implication graphs and sometimes increase the deductive power of the input CNF formula. We extend dominator analysis with an efficient self-subsumption check. We also show how the information collected by dominator analysis can be used to detect redundancies in the satisfied clauses and, more importantly, how it can be used to produce supplemental conflict clauses. We characterize these transformations in terms of deductive power and proof conciseness. Experiments show that the main advantage of dominator analysis and its extensions lies in improving proof conciseness.
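Self-subsumption can be stated in generic clause notation (this is the standard rule, not the paper's specific dominator-based implementation): if clause C1 contains a literal l whose negation appears in C2, and the rest of C1 is contained in C2, then resolving on l yields a clause that subsumes C2, so ¬l can simply be deleted from C2.

```python
def self_subsumption_literal(c1, c2):
    """Return a literal that can be deleted from clause c2 by
    self-subsuming resolution with c1, or None if no such literal
    exists. Clauses are frozensets of nonzero ints (DIMACS-style
    literals: x is positive, ~x is negative)."""
    for lit in c1:
        # Need -lit in c2 and (c1 minus lit) contained in (c2 minus -lit):
        if -lit in c2 and (c1 - {lit}) <= (c2 - {-lit}):
            return -lit
    return None

def strengthen(c1, c2):
    """Apply one self-subsumption step to c2, if possible."""
    lit = self_subsumption_literal(c1, c2)
    return c2 if lit is None else c2 - {lit}

# Example: (x1 or x2) lets us drop ~x1 from (~x1 or x2 or x3),
# because resolving on x1 gives (x2 or x3), which subsumes the original.
c1 = frozenset({1, 2})
c2 = frozenset({-1, 2, 3})
```

The strengthened clause is logically equivalent in context but shorter, which is one way such checks contribute to the proof conciseness discussed above.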
In this paper we present a method for integrating two complementary solving techniques for QBF formulas, i.e., variable elimination based on an AIG framework and search with DPLL-based solving. We develop a sophisticated mechanism for coupling these techniques, enabling the transfer of partial results from the variable elimination part to the search part. This includes the definition of heuristics to (1) determine appropriate points in time to snapshot the current partial result during variable elimination (by estimating its quality) and (2) switch from variable elimination to search-based methods (applied to the best known snapshot) when the progress of variable elimination is deemed too slow or when representation sizes grow too fast. We show in the experimental section that our combined approach is clearly superior to both individual methods run in a stand-alone manner. Moreover, our combined approach significantly outperforms all other state-of-the-art solvers.
This paper presents a new SMT solver, STABLE, for formulas of the quantifier-free logic over fixed-sized bit vectors (QF-BV). The heart of STABLE is a computer-algebra-based engine which provides algorithms for simplifying arithmetic problems of an SMT instance prior to bit-blasting. As the primary application domain for STABLE we target an SMT-based property checking flow for System-on-Chip (SoC) designs. When verifying industrial data path modules we frequently encounter custom-designed arithmetic components specified at the logic level of the hardware description language being used. This results in SMT problems where arithmetic parts may include non-arithmetic constraints. STABLE includes a new technique for extracting arithmetic bit-level information for these non-arithmetic constraints. Thus, our algebraic engine can solve subproblems related to the entire arithmetic design component. STABLE was successfully evaluated in comparison with other state-of-the-art SMT solvers on a large collection of SMT formulas describing verification problems of industrial data path designs that include multiplication. In contrast to the other solvers STABLE was able to solve instances with bit-widths of up to 64 bits.
Coverage models are the main technique for evaluating the thoroughness of the dynamic verification of a Design-under-Verification (DUV). However, rather than achieving high coverage, the essential purpose of verification is to expose as many bugs as possible. In this paper, we propose a novel verification methodology that leverages early bug prediction for a DUV to guide and assess the related verification process. Specifically, this methodology utilizes predictive models built upon artificial neural networks (ANNs), which are capable of modeling the relationship between the high-level attributes of a design and its associated bug information. To evaluate the performance of the constructed predictive model, we conduct experiments on several open-source projects. Moreover, we demonstrate the usability and effectiveness of our proposed methodology by elaborating on experiences from our industrial practice. Finally, discussions on the application of our methodology are presented.
Index Terms - Verification; Complexity Metric; Bug Prediction; Empirical Study
SAT-based BMC is promising for directed test generation since it can locate the cause of an error within a small bound. However, due to the state-space explosion problem, BMC cannot handle complex designs and properties. Although various optimization methods have been proposed to address a single complex property, the test generation process cannot be fully automated. This paper presents an efficient automated approach that can scale down the falsification complexity using property decomposition and learning techniques. Our experimental results using both software and hardware benchmarks demonstrate that our approach can drastically reduce the overall test generation effort.
We present a methodology to generate input stimulus for design validation using GoldMine, an automatic assertion generation engine that uses data mining and formal verification. GoldMine mines the simulation traces of a behavioral Register Transfer Level (RTL) design using a decision-tree-based learning algorithm to produce candidate assertions. These candidate assertions are passed to a formal verification engine. If a candidate assertion is false, a counterexample trace is generated. In this work, we feed these counterexample traces back to iteratively refine the original simulation trace data. We introduce an incremental decision tree to mine the new traces in each iteration. The algorithm converges when all the candidate assertions are true. We prove that our algorithm will always converge and, on convergence, capture the complete functionality of an output. We show that our method always results in a monotonic increase in simulation coverage. We also present an output-centric notion of coverage and argue that we can attain coverage closure with respect to it. Experimental results validating our arguments are presented on several designs from Rigel, OpenRisc and SpaceWire.
The verification of embedded software has become an important subject in recent years. However, neither standalone verification approaches, such as simulation-based or formal verification, nor state-of-the-art hybrid/semiformal verification approaches are able to verify large and complex embedded software with hardware dependencies. This work presents a new scalable and extendable hybrid verification approach for the verification of temporal properties in embedded software with hardware dependencies, using for the first time a new mixed bottom-up/top-down algorithm. To this end, new algorithms and methodologies such as static parameter assignment and counterexample-guided simulation are proposed in order to combine simulation-based and formal verification in a new way. We have successfully applied this hybrid approach to embedded software applications: Motorola's Powerstone benchmark suite and complex industrial embedded automotive software. The results show that our approach scales better than standalone software model checkers in reaching deep state spaces. The whole approach is best suited for fast falsification.
Excessive test mode power-ground noise in nanometer scale chips causes large delay uncertainties in scan chains, resulting in a highly elevated rate of timing failures. The hybrid timing violation types in scan chains, plus their possibly intermittent manifestations, invalidate the traditional assumptions in scan chain fault behavior, significantly increasing the ambiguity and difficulty in diagnosis. In this paper, we propose a novel methodology to resolve the challenge of diagnosing multiple permanent or intermittent timing faults in scan chains. Instead of relying on fault simulation that is incapable of approximating the intermittent fault manifestation, the proposed technique characterizes the impact of timing faults by analyzing the phase movement of scan patterns. Extracting fault-sensitive statistical features of phase movement information provides strong signals for the precise identification of fault locations and types. The manifestation probability of each fault is furthermore computed through a mathematical transformation framework which accurately models the behavior of multiple faults as a Markov chain. The fault model utilized in the proposed scheme considers the effect of possibly asymmetric fault manifestation, thus maximally approximating the realistic failure behavior. Simulations on large benchmark circuits and two industrial designs have confirmed that the proposed methodology can yield highly accurate diagnosis results even for complicated fault manifestations such as multiple intermittent faults with mixed fault types.
While scan-based testing achieves a high fault coverage, it requires long test application times and substantial tester memory, in addition to the overhead in chip area and high test power. Functional testing, on the other hand, suffers from low coverage but can be applied at-speed. In this paper, we propose a novel three-step design-for-test (DFT) methodology which greatly enhances the performance of functional testing. In the first step we expand the state space of the circuit beyond the functionally reachable space without scan or reset. These new states create conditions to activate/propagate fault effects that are otherwise hard to detect. Since structural correlation between D flip-flops (DFFs) of a circuit restricts its state-space variation, the second step consists of partitioning the DFFs into different groups, which helps to break such correlations. In the third step, we make internal hard-to-observe points in the circuit more observable by directly XORing them with selected primary outputs. This method can be applied at-speed (since no scan shifting is involved), saving a significant amount of test application time, with area overhead comparable to scan-based DFT. Our experiments on large ISCAS'89 and ITC'99 benchmarks show that we are able to achieve very high non-scan fault coverage while simultaneously reducing the test application time (by 114x compared to scan-based techniques).
Excessive power dissipation caused by large amounts of switching activity has been a major issue in scan-based testing. For large designs, the excessive switching activity during the launch cycle can cause severe power droop, which cannot be recovered from before the capture cycle, rendering at-speed scan testing more susceptible to power droop. In this paper, we present a methodology to avoid power droop during scan capture without compromising at-speed test coverage. It is based on the use of a low-area-overhead hardware controller to control the clock gates. The methodology is ATPG (Automatic Test Pattern Generation)-independent; hence pattern generation time is not affected and pattern manipulation is not required. The effectiveness of this technique is demonstrated on several industrial designs.
Synchronous programs execute in discrete instants, called ticks. For real-time implementations, it is important to statically determine the worst-case tick length, also known as the worst-case reaction time (WCRT). While there is a considerable body of work on the timing analysis of procedural programs, such analysis for synchronous programs has received less attention. Current state-of-the-art analyses for synchronous programs use integer linear programming (ILP) combined with path-pruning techniques to achieve tight results. These approaches first convert a concurrent synchronous program into a sequential program. ILP constraints are then derived from this sequential program to compute the longest tick length. In this paper, we use an alternative approach based on model checking. Unlike conventional programs, synchronous programs are concurrent and state-space oriented, making them ideal for model-checking-based analysis. We propose an analysis of the abstracted state space of the program, combined with expressive data-flow information, to facilitate effective path pruning. We demonstrate through extensive experimentation that the proposed approach is both scalable and about 67% tighter compared to the existing approaches.
This work presents a SystemC-based simulation approach for fast performance analysis of parallel software components, using source code annotated with low-level timing properties. In contrast to other source-level approaches for performance analysis, timing attributes obtained from binary code can be annotated even if compiler optimizations are used without requiring changes in the compiler. To consider concurrent accesses to shared resources like caches accurately during a source-level simulation, an extension of the SystemC TLM-2.0 standard for reducing the necessary synchronization overhead is proposed as well. This enables the simulation of low-level timing effects without performing a full-fledged instruction set simulation and at speeds close to pure native execution. Index Terms - System analysis and design; Timing; Modeling; Software performance;
Virtual Prototypes (VPs) based on Transaction Level Models (TLMs) have become a de-facto standard for design space exploration and validation of complex software-centric multicore or multiprocessor systems. The most popular method to get timed software TLMs is to annotate timing information at the basic-block level granularity back into application source code, called source code instrumentation (SCI). The existing SCI approaches realize the back-annotation of timing information based on mapping between source code and binary code. However, optimizing compilation has a large impact on the code mapping and will lower the accuracy of the generated source-level TLMs. In this paper, we present an efficient approach to tackle this problem. We propose to use mapping between source-level and binary-level control flows as the basis for timing annotation instead of code mapping. Software TLMs generated by our approach allow for accurate evaluation of multiprocessor systems at a very high speed. This has been proven by our experiments with a set of benchmark programs and a case study.
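The back-annotation idea behind source code instrumentation can be sketched in a few lines: each basic block of the cross-compiled binary contributes its cycle cost to a simulation clock as the source-level model executes. The block names and cycle costs below are illustrative assumptions, not values from either paper.

```python
class TimedModel:
    """Minimal sketch of source-level timing annotation: a shared cycle
    counter accumulates per-basic-block costs obtained from binary-level
    analysis, while the source code runs natively."""

    def __init__(self, block_cycles):
        self.block_cycles = block_cycles  # {block name: cycle cost}
        self.cycles = 0

    def bb(self, name):
        # one annotation call per executed basic block
        self.cycles += self.block_cycles[name]


def dot(model, xs, ys):
    """Instrumented dot product: hypothetical blocks 'entry'/'loop'/'exit'."""
    model.bb("entry")
    acc = 0
    for x, y in zip(xs, ys):
        model.bb("loop")  # loop body executes once per element pair
        acc += x * y
    model.bb("exit")
    return acc
```

Running the instrumented function yields both the functional result and an estimated cycle count, which is the information a timed software TLM needs.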
With increasing demand for higher performance under limited power budgets, multicore processors are rapidly becoming the norm in today's embedded systems. Embedded software constitutes a large portion of today's systems and real-time software design on multicore platforms opens new design challenges. In this paper, we introduce a high-level, host-compiled multicore software simulator that incorporates an abstract real-time operating system (RTOS) model to enable early, fast and accurate software exploration in a symmetric multi-processing (SMP) context. Our proposed model helps designers to explore different scheduling parameters within a framework of a general SMP execution environment. A designer can easily adjust application and OS parameters to evaluate their effect on real-time system performance. We demonstrate the efficiency of our models on a suite of industrial-strength and artificial task sets. Results show that models simulate at up to 1000 MIPS with 1-3% timing error across a variety of different OS configurations.
In order to address the large variety of channel coding options specified in existing and future digital communication standards, there is an increasing need for flexible solutions. This paper presents a multi-core architecture which supports convolutional codes, binary/duo-binary turbo codes, and LDPC codes. The proposed architecture is based on Application-Specific Instruction-set Processors (ASIPs) and avoids the use of dedicated interleave/deinterleave address lookup memories. Each ASIP consists of two datapaths, one optimized for turbo and the other for LDPC mode, while efficiently sharing memories and communication resources. The logic synthesis results yield an overall area of 2.6mm2 in 90nm technology. Payload throughputs of up to 312Mbps in LDPC mode and 173Mbps in turbo mode are possible at 520MHz, faring better than existing solutions. Index Terms - ASIP; LDPC; Turbo decoding.
A new generation of telecommunication applications requires highly efficient processing units to tackle the increasing signal-processing algorithmic complexity. These units also need to be flexible enough to handle a large range of radio access technologies with fast-moving specifications. As devices with telecommunication features are, by nature, mobile, this high level of flexibility must be achieved while preserving very low power consumption. In this paper, a high-performance, low-power application-specific processor is proposed for complex signal processing. Thanks to a dedicated control architecture, this processor exhibits an average 81% utilization rate of its principal operator, a complex MAC, for a 3GPP-LTE application. The main innovations are the use of a reconfigurable profile and an instruction cache strategy to reduce power consumption, leading to a 10x reduction of the control power consumption. As a result, an average power consumption of 50 mW is measured after implementation in a low-power 65 nm technology while delivering 3.2 GOPS. Finally, a comparison with state-of-the-art low-power DSPs shows a gain of at least 24%. Keywords - Digital Baseband; Signal Processor; VLIW; Low-Power; 3GPP-LTE
Since Multiple Input Multiple Output (MIMO) transmission has become more and more popular in current and future mobile communication systems, MIMO detection is a big issue. Linear detection algorithms are less complex and well understood, but their BER performance is limited. ML detectors achieve the optimum result but have exponential computational complexity. Hence, iterative tree-search algorithms like the sphere decoder or the K-Best detector, which reduce the computational complexity, have become a major topic in research. In this paper a modified K+-Best detector is introduced which is able to achieve the BER performance of a common K-Best detector with K=12 by using a sorting algorithm for K=8. This novel sorting approach, based on Batcher's Odd-Even Mergesort, is less complex than other parallel sorting designs and saves valuable hardware resources. Due to an efficient implementation, the throughput of the detector is about 455 Mbit/s, roughly twice the LTE peak data rate of 217.6 Mbit/s for a 16-QAM modulated signal. In this paper the architecture and the implementation issues are described in detail and the BER performance of the K+-Best FPGA implementation is shown. Index Terms - K-Best Detector; MIMO; Odd-Even Mergesort; FPGA-Implementation.
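Batcher's odd-even mergesort, on which the detector's sorting network is based, is a data-independent sorting network: its compare-exchange pattern is fixed in advance, which is exactly what makes it attractive for parallel hardware. A minimal software sketch (for power-of-two input sizes, as the network assumes) is:

```python
def oddeven_merge_sort(a):
    """In-place Batcher odd-even mergesort; len(a) must be a power of two.
    Executed sequentially here, but every compare_swap is independent of
    the data, so the same schedule maps directly onto parallel comparators."""

    def compare_swap(i, j):
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]

    def merge(lo, n, r):
        m = r * 2
        if m < n:
            merge(lo, n, m)          # merge even-indexed subsequence
            merge(lo + r, n, m)      # merge odd-indexed subsequence
            for i in range(lo + r, lo + n - r, m):
                compare_swap(i, i + r)
        else:
            compare_swap(lo, lo + r)

    def sort(lo, n):
        if n > 1:
            m = n // 2
            sort(lo, m)
            sort(lo + m, m)
            merge(lo, n, 1)

    sort(0, len(a))
    return a
```

In the detector the items being sorted would be accumulated path metrics rather than plain integers, and only the K smallest survivors are kept.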
A power/area-aware design is mandatory for the MIMO (Multi-Input Multi-Output) detectors used in the LTE and WiMAX standards. The 64-QAM modulation used in the MIMO detector requires more detection effort than the smaller constellation sizes widely implemented in the literature. In this work we propose a new architecture for the K-best detector which, unlike the popular multi-stage architecture used for K-best detectors, implements just one core. We also introduce a slight modification to the K-best algorithm that reduces the number of multiplications by 44% and the total power consumption by 27%, without any noticeable performance degradation. The overall architecture consumes only 24 KGates, the smallest area among implementations reported in the literature. It also achieves at least 4-fold greater throughput efficiency (Mbps/KGate) than the other detectors, while consuming little power. The detector, implemented in a commercial 130nm process, provides a data rate of 107Mbps and consumes 54.4mW.
Keywords - MIMO; K-best; single-core; 64-QAM; LTE; WiMAX
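The breadth-first K-best search that both of these detector papers build on can be sketched generically. The toy alphabet, channel matrix, and interface below are illustrative assumptions, not either paper's fixed-point design:

```python
def k_best_detect(R, y, alphabet, K):
    """Breadth-first K-best tree search for y = R*s + noise with an
    upper-triangular R: process antenna layers from the last row upward,
    expand every surviving candidate with all constellation symbols, and
    keep only the K candidates with the smallest accumulated distance."""
    n = len(y)
    beam = [(0.0, {})]                      # (metric, {layer: symbol})
    for i in range(n - 1, -1, -1):
        expanded = []
        for metric, part in beam:
            # interference from already-decided layers below the diagonal
            interference = sum(R[i][j] * part[j] for j in range(i + 1, n))
            for s in alphabet:
                err = y[i] - interference - R[i][i] * s
                expanded.append((metric + abs(err) ** 2, {**part, i: s}))
        expanded.sort(key=lambda t: t[0])   # the sorter is the hardware bottleneck
        beam = expanded[:K]
    _, best = beam[0]
    return [best[i] for i in range(n)]
```

For the 2x2 BPSK example R = [[2, 1], [0, 2]], y = [1, -2], the search recovers the transmitted vector [1, -1]. The sort-and-truncate step is where the sorting-network and single-core architectural choices of the two papers come into play.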
Panelists: J. Biggs, C. Clavel, O. Domerego, K. Just
Two formats for specifying power intent are currently in wide use in the industry, and as designers continue to strive for more power-efficient designs, new issues arise that need new solutions to improve on today's standards. This panel will discuss areas for improving today's power formats and the direction in which these formats need to move in order to provide the most efficient flows for design and verification, especially with regard to low power. The scope of the formats and their suitability from early ESL design exploration to back-end signoff checking will also be discussed.
Keywords - Unified Power Format; UPF; Common Power Format; CPF; Low-Power; Power-Aware; Power-Efficient; Design; Verification
Emerging nanotechnology-based systems encounter new non-functional requirements. This work addresses MEMS storage, an emerging technology that promises ultra-high density and energy-efficient storage devices. We study the buffering requirement of MEMS storage in streaming applications. We show that capacity and lifetime of a MEMS device dictate the buffer size most of the time. Our study shows that trading off 10% of the optimal energy saving of a MEMS device reduces its buffer capacity by up to three orders of magnitude. Index Terms - Secondary storage, energy efficiency, layout.
To ensure the robustness of an integrated circuit, its power distribution network (PDN) must be validated beforehand against any voltage drop on VDD nets. However, due to the increasing size of PDNs, it is becoming difficult to verify them in a reasonable amount of time. Lately, much work has been done to develop Model Order Reduction (MOR) techniques to reduce the size of power grids but their focus is more on simulation. In verification, we are concerned about the safety of nodes, including the ones which have been eliminated in the reduction process. This paper proposes a novel approach to systematically reduce the power grid and accurately compute an upper bound on the voltage drops at power grid nodes which are retained. Furthermore, a criterion for the safety of nodes which are removed is established based on the safety of other nearby nodes and a user specified margin.
3D integration has the potential to increase performance and decrease energy consumption. However, there are many unsolved issues in the design of these systems. In this work we study the design of many-tier (more than 4 tiers stacked) 3D power-supply networks and demonstrate a technique specific to 3D systems that improves IR-drop over a straightforward extension of traditional design techniques. Previous work in 3D power delivery network design has simply extended 2D techniques by treating through-silicon vias (TSVs) as extensions of the C4 bumps. By exploiting the smaller size and much higher interconnect density possible with TSVs, we demonstrate a significant reduction of nearly 50% in the IR-drop of our 3D design. Simulations also show that a 3-tier stack with the distributed TSV topology actually lowers IR-drop by 20% over a non-3D system with less power dissipation. Finally, we analyze the power distribution network of an envisioned 1000-core processor with 30 stacked dies and show scaling trends related to both increased stacking and power distribution TSVs. Our 3D analysis technique is validated using commercial-grade sign-off IR-drop software from a major EDA vendor.
Existing work on fault tolerance in hybrid nanoelectronic memories (hybrid memories) assumes that faults occur only in the memory array and the encoder, not in the decoder. However, as the decoder is built from scaled CMOS devices, it is also becoming vulnerable to faults. This paper presents a cost-efficient fault-tolerant decoder for hybrid memories that are impacted by a high degree of non-permanent clustered faults. Fault tolerance is achieved by combining a partial hardware redundancy scheme with an on-line masking scheme based on Muller C-gates. In addition, a cost-efficient implementation of the decoder is realized by modifying the decoding sequence and implementing it based on time redundancy. Experimental results show that the proposed decoder provides better reliability for the overall hybrid memory system, yet requires a smaller area than a conventional decoder. For example, assuming a fault ratio of 1:10 between decoder and memory array and a 10% fault rate, the proposed decoder ensures 1% higher reliability of the overall hybrid memory system. Moreover, the proposed decoder achieves 18.4% smaller area overhead for a 64-bit word hybrid memory.
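The masking behavior of a Muller C-gate, on which the on-line masking scheme relies, can be illustrated with a minimal behavioral model (a generic C-element, not the paper's circuit):

```python
class CElement:
    """Muller C-gate: the output follows the inputs only when they all
    agree; on disagreement it holds its previous value, so a transient
    glitch on a single input is masked rather than propagated."""

    def __init__(self, initial=0):
        self.out = initial

    def step(self, *inputs):
        if all(v == inputs[0] for v in inputs):
            self.out = inputs[0]
        return self.out
```

A two-input C-element driven with (1, 1) switches to 1; a subsequent transient (1, 0) leaves the output at 1, and only a stable (0, 0) brings it back to 0. This hold-on-disagreement property is what lets the decoder tolerate non-permanent faults without explicit detection logic.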
CAN bus systems are used in many industrial control applications, particularly automotive ones. Due to growing system and functional requirements, the low capacity of the CAN bus, and the usually strict conditions under which it is used in real-time applications, the applicability of the CAN bus is severely limited. The paper presents an approach for achieving high utilization that breathes new life into CAN-bus-based systems by proposing a dynamic offset adaptation algorithm for scheduling messages and improving message response times without any changes to a standard CAN bus. This simple algorithm, which runs on all nodes of the system, results in excellent average response times at all loads and makes the approach particularly attractive for soft real-time systems. We demonstrate the performance improvement of the proposed approach by comparison with other approaches and introduce a new performance measure in the form of a rating function. Index Terms - WCRT, Controller Area Network, CAN, response time, distributed embedded systems
This work presents an orientation tracking system for 6D inertial measurement units. The system was modeled with MathWorks Simulink and experimentally tested with the Cube Demo board by SensorDynamics, used to simulate a 3D gyro and a 3D accelerometer. Quaternions were used to represent the angular position, and an Extended Kalman filter was used for the sensor fusion algorithm. The goal was to obtain a system that could be easily integrated within the logic of the new 6D sensor family produced by SensorDynamics. We propose a Kalman filter simplification for a fixed-point arithmetic implementation to reduce the system complexity with negligible performance degradation.
Keywords: orientation tracking; angular position; Kalman filter; quaternions; inertial measurement unit; sensor fusion.
This paper presents a strategy to speed up the simulation of processors with SIMD extensions using dynamic binary translation. The idea is simple: benefit from the SIMD instructions of the host processor that is running the simulation. The realization is unfortunately not easy, as the nature of all but the simplest SIMD instructions differs greatly from one manufacturer to another. To solve this issue, we propose an approach based on a simple three-address intermediate SIMD instruction set to and from which most existing instructions can easily be mapped at translation time. To still support complex instructions, we use a form of threaded code. We detail our generic solution and demonstrate its applicability and effectiveness using a parametrized synthetic benchmark exercising the ARMv7 NEON extensions, executed on a Pentium with MMX/SSE extensions.
In this paper, we present a system-level dynamic scheduling algorithm to minimize the energy consumed by a DVS processor and multiple non-DVS peripheral devices in a hard real-time system. We show that previous work, which adopts the critical speed as the lower bound for scaling, might not be the most energy-efficient when the energy overhead of shutting down/waking up is not negligible. Moreover, the widely used statically defined break-even idle time might not be energy-efficient overall, because it is independent of job execution situations. In our approach, we first present a method to compute the break-even idle time dynamically. Then a dynamic scheduling approach is proposed for speed determination and task preemption to reduce the energy consumption of the processor and devices. Compared with existing research, our approach can effectively reduce the system-level energy consumption of both the CPU and peripheral devices.
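The break-even idle time that the paper recomputes dynamically is, in its standard static form, the shortest idle interval for which shutting a device down saves energy. A sketch under the usual power-model assumptions (all parameter names illustrative, not the paper's notation):

```python
def break_even_time(p_idle, p_sleep, e_overhead, t_transition):
    """Smallest idle interval T at which sleeping pays off: staying idle
    costs p_idle * T, while sleeping costs the fixed shutdown/wake-up
    energy e_overhead plus p_sleep * (T - t_transition). Solving for the
    crossover gives T; T can never be shorter than the transition time."""
    t = (e_overhead - p_sleep * t_transition) / (p_idle - p_sleep)
    return max(t, t_transition)
```

For example, with idle power 1.0 W, sleep power 0.1 W, a 2.0 J transition overhead, and a 0.5 s transition time, the break-even interval is about 2.17 s; idle gaps shorter than that are cheaper to ride out. The paper's point is that a single statically computed value of this kind ignores job execution situations and so can be pessimistic.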
This paper makes a case for developing statistical timing error models of DSP kernels implemented in nanoscale circuit fabrics. Recently, stochastic computation techniques have been proposed [1], [2], [3], where the explicit use of error statistics in system design has been shown to significantly enhance robustness and energy efficiency. However, obtaining the error statistics at different process, voltage, and temperature (PVT) corners is hard. This paper: 1) proposes a simple additive error model for timing errors in arithmetic computations due to PVT variations, 2) analyzes the relationship between the error statistics and model parameters, specifically the input statistics, and 3) presents a characterization methodology to obtain the proposed model parameters, thereby enabling efficient implementations of emerging stochastic computing techniques. Key results include the following observations: 1) the output error statistics are a weak function of the input statistics, and 2) the output error statistics depend upon the one's probability profile of the input word. These observations enable a one-time off-line statistical error characterization of DSP kernels, similar to the delay and power characterization done presently for standard cells and IP cores. The proposed error model is derived for a number of DSP kernels in a commercial 45nm CMOS process.
We propose a roll-forward error recovery technique based on multiple scan chains for TMR systems, called Scan chained TMR (ScTMR). ScTMR reuses the scan chain flip-flops employed for testability purposes to restore the correct state of a TMR system in the presence of transient or permanent errors. In the proposed ScTMR technique, we present a voter circuitry to locate the faulty module and a controller circuitry to restore the system to the fault-free state. As a case study, we have implemented the proposed ScTMR technique on an embedded processor suited for safety-critical applications. Exhaustive fault injection experiments reveal that the proposed architecture achieves 100% error detection and recovery coverage with respect to Single Event Upsets (SEUs) while imposing negligible area and performance overhead compared to traditional TMR-based techniques.
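The two ScTMR building blocks named above, majority voting and faulty-module location, can be illustrated with a generic word-level sketch (a textbook TMR voter, not the paper's gate-level circuitry):

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three module outputs: each result bit is set
    when at least two of the three copies agree on it."""
    return (a & b) | (b & c) | (a & c)


def locate_faulty(a, b, c):
    """Index of the disagreeing module, or None if all three agree
    (single-fault assumption, as in classical TMR)."""
    if a == b == c:
        return None
    if a == b:
        return 2
    if a == c:
        return 1
    return 0
```

In ScTMR the located module is not merely outvoted: its scan chain is reloaded with the majority state so the system rolls forward from a fault-free state instead of re-executing.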
In recent years, chip multiprocessors (CMPs) have emerged as a solution for high-speed computing demands. However, power dissipation in CMPs can be high if numerous cores are simultaneously active. Dynamic voltage and frequency scaling (DVFS) is widely used to reduce the active power, but its effectiveness and cost depend on the granularity at which it is applied. Per-core DVFS allows the greatest flexibility in controlling power but incurs the expense of an unrealistically large number of on-chip voltage regulators. Per-chip DVFS, where all cores are controlled by a single regulator, overcomes this problem at the expense of greatly reduced flexibility. This work considers the problem of building an intermediate solution: clustering the cores of a multicore processor into DVFS domains and implementing DVFS on a per-cluster basis. Based on a typical workload, we propose a scheme to find similarity among the cores and cluster them based on this similarity. We also provide an algorithm to implement DVFS for the clusters, and evaluate the effectiveness of per-cluster DVFS in power reduction.
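One simple instance of the clustering idea (illustrative only, not the paper's similarity metric): sort cores by average utilization, cut the order into k contiguous groups, and run each voltage/frequency domain at the speed its most demanding core needs.

```python
def cluster_by_utilization(util, k):
    """Greedy per-cluster DVFS grouping: cores with similar average
    utilization share a domain, limiting the slack wasted when a lightly
    loaded core is dragged up to a heavy core's frequency."""
    order = sorted(range(len(util)), key=lambda c: util[c])
    size = -(-len(util) // k)        # ceiling division
    return [order[i:i + size] for i in range(0, len(util), size)]


def cluster_frequencies(util, clusters, f_max):
    # each domain must satisfy its most demanding member
    return [f_max * max(util[c] for c in cl) for cl in clusters]
```

With utilizations [0.2, 0.9, 0.3, 0.8] and two domains, the light cores (0 and 2) share a 0.3*f_max domain while the heavy cores (3 and 1) share a 0.9*f_max domain; a single per-chip regulator would have to run all four at 0.9*f_max.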
3D multi-core architectures are seen to provide increased transistor density, reduced power consumption, and improved performance through wire length reduction. However, 3D suffers from increased power density, which exacerbates thermal hotspots. In this paper, we present a novel 3D multi-core architecture that reduces processor activity on the die distant to the heat sink and a core-level dynamic thermal management technique based on the architectural adaptation, e.g. dynamically adapting core-resources depending on diverse application requirements and thermal behavior. The proposed thermal management technique synergistically combines the benefits of the architectural adaptation supported by our 3D multi-core architecture with dynamic voltage and frequency scaling. Our proposed technique provides 19.4% (maximum 24.4%, minimum 15.5%) improvement in the instruction throughput compared to the state-of-the-art thermal management techniques [4, 5] applied to the thermal-aware 3D processor architecture without considering run-time adaptation [10].
Modern systems on chip (SoCs) are rapidly becoming complex high-performance computational devices, featuring multiple general-purpose processor cores and a variety of functional IP blocks communicating with each other through an on-die fabric. While modular SoC design provides power savings and simplifies the development process, it also leaves significant room for a special type of hardware bug, interaction errors, to slip through pre- and post-silicon verification. Consequently, hard-to-fix silicon escapes may be discovered late in the production schedule or even after market release, potentially causing costly delays or recalls. In this work we propose a unified error detection and recovery framework that incorporates programmable features into the on-die fabric of an SoC, so triggers of escaped interaction bugs can be detected at runtime. Furthermore, upon detection, our solution locks the interface of an IP for a programmed time period, thus altering interactions between accesses and bypassing the bug in a manner transparent to software. For classes of errors that cannot be circumvented by this in-hardware technique, our framework is programmed to propagate the error detection to the software layer. Our experiments demonstrate that the proposed framework is capable of detecting a range of interaction errors with less than 0.01% performance penalty and 0.45% area overhead.
Supply voltage fluctuation caused by inductive noise has become a critical problem in microprocessor design. A voltage emergency occurs when the supply voltage variation exceeds the acceptable voltage margin, jeopardizing microprocessor reliability. Existing techniques assume all voltage emergencies would definitely lead to incorrect program execution and prudently activate rollbacks or flushes to recover, and consequently incur high performance overhead. We observe that not all voltage emergencies result in externally visible errors, which can be exploited to avoid unnecessary protection. In this paper, we propose a substantial-impact-filter based method to tolerate voltage emergencies, comprising three key techniques: 1) analyzing the architecture-level masking of voltage emergencies during program execution; 2) proposing a metric, the intermittent vulnerability factor for intermittent timing faults (IVF_itf), to quantitatively estimate the vulnerability of microprocessor structures (load/store queue and register file) to voltage emergencies; 3) proposing a substantial-impact-filter based method to handle voltage emergencies. Experimental results demonstrate that our approach regains nearly 57% of the performance loss compared with the once-occur-then-rollback approach.
This work revisits the formulation of interpolation sequences, in order to better understand their relationships with Bounded Model Checking and with other Unbounded Model Checking approaches relying on standard interpolation. We first focus on different Bounded Model Checking schemes (bound, exact and exact-assume), pointing out their impact on the interpolation-based strategy. Then, we compare the abstraction ability of interpolation sequences with standard interpolation, highlighting their convergence at potentially different sequential depths. We finally propose a tight integration of interpolation sequences with an abstraction-refinement strategy. Our contributions are first presented from a theoretical standpoint, then supported by experimental results (on academic and industrial benchmarks) adopting a state-of-the-art academic tool.
In the last decade, functional verification has become a major bottleneck in the design flow. To relieve this growing burden, assertion-based verification has gained popularity as a means to increase the quality and efficiency of verification. Although robust, the adoption of assertion-based verification poses new challenges to debugging due to the presence of errors in the assertions themselves. These unique challenges necessitate a departure from past automated circuit debugging techniques, which are shown to be ineffective. In this work, we present a methodology, a mutation model and additional techniques to debug errors in SystemVerilog assertions. The methodology uses the failing assertion, a counterexample and the mutation model to produce alternative properties that are verified against the design. These properties serve as a basis for possible corrections. They also provide insight into the design behavior and the failing assertion. Experimental results show that this process is effective in finding high-quality alternative assertions for all empirical instances.
The quality of network-on-chip (NoC) designs depends crucially on the size of buffers in NoC components. While buffers impose a significant area and power overhead, they are essential for ensuring high throughput and low latency. In this paper, we present a new approach for minimizing the cumulative buffer size in on-chip networks, so as to meet throughput and latency requirements, given high-level specifications on traffic behavior. Our approach uses model checking based on satisfiability modulo theories (SMT) solvers, within an overall counterexample-guided synthesis loop. We demonstrate the effectiveness of our technique on NoC designs involving arbitration, credit logic, and virtual channels.
Operating system (OS) models are widely used to alleviate the overwhelming complexity of system-level simulation of software applications running on a specific OS implementation. Nevertheless, current OS modeling approaches cannot maintain both simulation speed and accuracy when dealing with preemptive scheduling. This paper proposes a Data-dependency-Oriented Modeling (DOM) approach: by guaranteeing the order of shared-variable accesses, accurate simulation results are obtained. Meanwhile, the simulation effort of our approach is considerably less than that of conventional Cycle-Accurate (CA) modeling, leading to high simulation speeds of 42 to 223 million instructions per second (MIPS), up to 114 times faster than CA modeling, as supported by our experimental results.
Keywords-OS modeling; preemptive scheduling; simulation
Ideally, system-level simulation should provide a high simulation speed with sufficient timing detail for both functional verification and performance evaluation. However, existing cycle-accurate (CA) and cycle-approximate (CX) processor models suffer either low simulation speed due to excessive timing detail or low accuracy due to simplified timing models. To achieve high simulation speed while maintaining the timing accuracy of the system simulation, we propose the first cycle-count-accurate (CCA) processor modeling approach, which pre-abstracts the internal pipeline and cache into models with accurate cycle-count information and guarantees accurate timing and functional behavior at the processor interface. The experimental results show that the CCA model performs 50 times faster than the corresponding CA model while providing the same execution cycle-count information as the target RTL model.
This paper proposes a shared-variable-based approach for fast and accurate multi-core cache coherence simulation. While the intuitive, conventional approach of synchronizing at every cycle or at every memory access gives accurate simulation results, it performs poorly due to heavy synchronization overhead. We observe that timing synchronization is only needed before shared-variable accesses; the proposed shared-variable-based approach exploits this to maintain accuracy while improving efficiency. The experimental results show that our approach performs 6 to 8 times faster than the memory-access-based approach and 18 to 44 times faster than the cycle-based approach while maintaining accuracy.
Keywords- cache-coherence; timing synchronization
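The synchronization-reduction idea behind the abstract above, aligning simulated cores only before shared-variable accesses rather than at every cycle, can be sketched in a few lines. The trace format and the counts are our own illustration, not the paper's simulator:

```python
# Sketch: count how many timing synchronizations one simulated core needs
# under a cycle-based policy versus a shared-variable-based policy.
def sync_points(trace, policy):
    """trace: list of ("local", n_cycles) or ("shared", var_name) operations.
    Returns (total simulated cycles, number of synchronizations)."""
    cycles = 0
    syncs = 0
    for op, arg in trace:
        if op == "local":
            cycles += arg            # run ahead without synchronizing
        else:                        # shared-variable access
            cycles += 1
            if policy == "shared-variable":
                syncs += 1           # sync only right before shared accesses
    if policy == "cycle":
        syncs = cycles               # conventional: one sync per cycle
    return cycles, syncs

trace = [("local", 500), ("shared", "lock"), ("local", 2000), ("shared", "buf")]
cycles, cyc_syncs = sync_points(trace, "cycle")
_, sv_syncs = sync_points(trace, "shared-variable")
print(cycles, cyc_syncs, sv_syncs)   # prints: 2502 2502 2
```

Even on this toy trace, the shared-variable policy needs 2 synchronizations where the cycle-based policy needs 2502, which is the intuition behind the reported 18x to 44x speedup.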
Virtual platform simulation is an essential technique for early-stage system-level design space exploration and embedded software development. For exploring hardware behavior and verifying embedded software, simulation speed and accuracy are the two most critical factors. However, given the increasing complexity of Multi-Processor System-on-Chip (MPSoC) designs, even state-of-the-art virtual platform simulation algorithms may suffer from low simulation speed. In this paper, we propose an Ultra Synchronization Checking Method (USCM) for fast and robust virtual platform simulation. We devise a data dependency table (DDT) so that the memory access information of hardware modules and software programs can be predicted and checked. By reducing unnecessary synchronizations among simulation modules and utilizing the asynchronous discrete-event simulation technique, we can significantly improve virtual platform simulation speed. Our experimental results show that the proposed USCM can simulate a 32-processor SoC design at a speed of multiple million instructions per second. We also demonstrate that our method is less sensitive to the number of cores in the virtual platform simulation.
Keywords-Virtual Platform Simulation, SoC, Synchronization.
This paper presents an all-digital built-in self-test (BIST) technique for characterizing the error transfer function of RF PLLs. This BIST scheme, with on-chip stimulus synthesis and response analysis done completely in the digital domain, achieves high-accuracy characterization and is applicable to a wide range of PLL architectures. For the popular sigma-delta fractional-N RF PLLs, the added circuitry required for this BIST solution is all digital except for a bang-bang phase-frequency detector (BB-PFD), which incurs an area of only 0.0001 mm2 in our implementation in a 65nm CMOS technology. The silicon characterization results at 3.6 GHz reported by this BIST solution and by explicit measurement have a root-mean-square difference of only 0.375 dB. Index Terms - BIST, PLL, frequency synthesizer, frequency modulator.
Different built-in self-test schemes for RF circuits have been developed that resort to peak voltage detectors. These are simple to implement but provide only conditional RF power measurement accuracy, as the impedance is assumed to be known. A true power detector is presented that allows more accurate measurements to be obtained, in particular under output load variations. The theoretical foundations underlying the power detector's operating principle are presented, and simulation and experimental results obtained with a prototype chip are described which confirm the benefits of measuring true power, compared to output peak voltage, when observing output load matching deviations and complex waveforms.
Keywords-RF testing; power amplifier; power sensor
We present an application of Defect Oriented Testing (DOT) to an industrial mixed-signal device to reduce test time while maintaining quality. The device is an automotive IC product with stringent quality requirements and a mature test program that is already in volume production. A complete flow is presented, including defect extraction, defect simulation, test selection, and validation. A major challenge of DOT for mixed-signal devices is the simulation time. We address this challenge with a new fault simulation algorithm that provides significant speedup in the DOT process. Based on the fault simulations, we determine a minimal set of tests which detects all defects. The proposed minimal test set is compared with the actual test results of more than a million ICs. We show that the production tests of the device can be reduced by at least 35%.
Testing high-speed Digital-to-Analog Converters (DACs) is a challenging task, as it requires a large number of high-speed synchronized input signals with specific test patterns. To overcome this problem, we propose the use of PRBS signals with an "Alternate-Bit-Tapping" technique and eye-diagram measurement as a solution to efficiently generate the test vectors and test the DACs. This approach covers all levels and transitions necessary for completely testing the dynamic behavior of the DAC, in the minimum possible time. Circuit-level simulations are used to verify its usefulness in testing a 4-bit 20-GS/s current-steering DAC.
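As a rough illustration of PRBS-based stimulus generation, the sketch below produces a standard PRBS-7 sequence from its LFSR; the grouping of every other bit into 4-bit words is only one plausible reading of the alternate-bit-tapping idea, not the paper's exact construction:

```python
def prbs7(n, seed=0x7F):
    """Generate n bits of a PRBS-7 sequence from the x^7 + x^6 + 1 LFSR
    (maximal length, period 127)."""
    state = seed & 0x7F
    bits = []
    for _ in range(n):
        bits.append(state & 1)                  # output the LSB of the state
        newbit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | newbit) & 0x7F
    return bits

bits = prbs7(254)
# Hypothetical "alternate-bit tapping": use every other PRBS bit and pack the
# taps into 4-bit words serving as DAC test vectors.
words = [bits[i:i + 8:2] for i in range(0, 248, 8)]
print(len(words), words[0])
```

Because PRBS-7 is maximal-length, any window of 127 output bits exercises every nonzero LFSR state, which is why PRBS stimuli densely cover levels and transitions.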
Thermal issues have become critical roadblocks to achieving highly reliable three-dimensional (3D) integrated circuits. This paper performs both an evaluation and a mitigation of the impact of leakage power variations on the temperature profile of 3D Chip-Multiprocessors (CMPs). Furthermore, this paper provides a learning-based model to predict the maximum temperature, based on which a simple yet effective tier-stacking algorithm is proposed to mitigate the impact of variations on the temperature profile of 3D CMPs. Results show that (1) the proposed prediction model achieves more than 98% accuracy, (2) a 4-tier 3D implementation can be more than 40°C hotter than its 2D counterpart, and (3) the proposed tier-stacking algorithm significantly improves the thermal yield of a 3D CMP from 44.4% to 81.1%.
Keywords-thermal; leakage; process variation; 3D; stack; yield; chip-multiprocessor; statistical learning; regression
3D integration based on TSV (through-silicon via) technology enables the stacking of multiple memory layers and offers higher bandwidth at lower energy consumption for the memory interface. As energy efficiency is key in mobile applications, 3D integration is a particularly strategic technology there. In this paper we focus on the design space exploration of 3D-stacked DRAMs with respect to performance, energy and area efficiency for densities from 256 Mbit to 4 Gbit per 3D-DRAM channel. We investigate four technology nodes from 75nm down to 45nm and show the optimal design point for the currently most common commodity DRAM density of 1 Gbit. Multiple channels can be combined for main memory sizes of up to 32 GB. We present a functional SystemC model of the 3D-stacked DRAM, coupled with an SDR/DDR 3D-DRAM channel controller; parameters for this model were derived from detailed circuit-level simulations. The exploration demonstrates that an optimized 1 Gbit 3D-DRAM stack is 15x more energy efficient than a commodity Low-Power DDR SDRAM part without IO drivers and pads. To the best of our knowledge, this is the first design space exploration for 3D-stacked DRAM considering different technologies and real-world physical commodity DRAM data.
Thermal issues are one of the primary challenges in 3-D integrated circuits. Thermal through-silicon vias (TTSVs) are considered an effective means to reduce the temperature of 3-D ICs. The effect of the physical and technological parameters of TTSVs on the heat transfer process within 3-D ICs is investigated. Two resistive networks are utilized to model the physical behavior of TTSVs. Based on these models, closed-form expressions are provided describing the flow of heat through TTSVs within a 3-D IC. The accuracy of these models is compared with results from a commercial FEM tool: for an investigated three-plane circuit, the average error of the first and second models is 2% and 4%, respectively. The effect of the physical parameters of TTSVs on the resulting temperature is described through the proposed models; for example, the temperature changes non-monotonically with the thickness of the silicon substrate, a behavior not captured by the traditional single-thermal-resistance model. The proposed models are used for the thermal analysis of a 3-D DRAM-μP system, where the conventional model is shown to considerably overestimate the temperature of the system. Index Terms - 3-D ICs, Thermal through-silicon via (TTSV), thermal resistance, heat conductivity.
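A minimal sketch of the resistive-network view used in this abstract: each inter-tier layer is modeled as the parallel combination of a low-conductivity bond/dielectric path and the copper TTSVs through it, using R = t / (k·A). The material constants and geometry below are illustrative assumptions, not the paper's models:

```python
import math

def parallel(r1, r2):
    # Two thermal resistances carrying heat side by side.
    return r1 * r2 / (r1 + r2)

def plane_resistance(t, area, n_tsv, d_tsv, k_layer=1.0, k_cu=400.0):
    """Thermal resistance (K/W) of one inter-tier layer of thickness t (m)
    and area (m^2): dielectric path in parallel with n_tsv copper TTSVs of
    diameter d_tsv. Conductivities k are in W/(m*K), assumed values."""
    a_tsv = n_tsv * math.pi * (d_tsv / 2) ** 2
    r_layer = t / (k_layer * (area - a_tsv))
    if n_tsv == 0:
        return r_layer
    r_tsv = t / (k_cu * a_tsv)
    return parallel(r_layer, r_tsv)

no_tsv = plane_resistance(10e-6, 1e-6, 0, 20e-6)
with_tsv = plane_resistance(10e-6, 1e-6, 16, 20e-6)
print(round(no_tsv, 2), round(with_tsv, 2))  # TTSVs cut the layer resistance ~3x
```

Summing such per-plane resistances along the stack gives the temperature rise ΔT = P · ΣR, which is where a single lumped resistance (ignoring the TSV paths) overestimates temperature.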
Through-silicon vias (TSVs), which provide high vertical interconnection density between device tiers, offer a promising way to reduce the length of global interconnects in 3D ICs. However, several design issues hinder the volume adoption of TSVs, such as IR drop, thermal dissipation, current delivery per package pin, and the various voltage domains across tiers. To tackle these problems, the design of the power network plays an important role in 3D ICs. A new integrated architecture of stacked TSVs and power distribution network (STDN) is proposed in this paper. The STDN serves three roles: a power network that delivers larger currents and reduces IR drop, a thermal network that reduces temperature, and a decoupling-capacitor network that reduces power noise. It also helps alleviate the limitation on the number of IO power pins. For both single and multiple power domains, the proposed STDN architecture demonstrates good performance in 3D floorplan, IR drop, power noise, temperature, area, and even the total length of signal connections for selected MCNC benchmarks.
Multiprocessor systems-on-chip (MPSoCs) have become the de-facto standard in embedded systems. The use of Networks-on-Chip (NoCs) provides these platforms with scalability and support for parallel transactions. The computational power of these architectures enables the simultaneous execution of several applications with different time constraints. However, as the number of applications executing simultaneously increases, their performance may be affected by resource sharing. To ensure that application requirements are met, mechanisms for proper isolation are necessary; this feature is referred to as composability. As the NoC is the main shared component in NoC-based MPSoCs, quality-of-service (QoS) mechanisms are mandatory to meet application requirements in terms of communication. In this work, we propose a hardware/software approach to achieve application composability by means of QoS management mechanisms at the software level. The conducted experiments show the efficiency of the proposed method in terms of throughput, latency and jitter for a real-time application sharing communication resources with best-effort applications.
Keywords-MPSoC; NoC; QoS; Composability; API
In this paper, we propose a processor allocation mechanism for the run-time assignment of the communicating tasks of input applications onto the processing nodes of a Chip Multiprocessor (CMP), when the arrival order and execution lifetimes of the input applications are not known a priori. This mechanism targets on-chip communication and aims to reduce the power consumption and latency of the NoC employed as the communication infrastructure. In this work, we benefit from the advantages of non-contiguous processor allocation mechanisms by allowing the tasks of an input application to be mapped onto disjoint regions (sub-meshes) and then virtually connecting them by bypassing the router pipeline stages of the inter-region routers. The experimental results show considerable improvement over one of the best existing allocation mechanisms.
Keywords-chip multiprocessors; network-on-chip; processor allocation; contiguous allocation; non-contiguous allocation; power consumption; performance.
Quality-of-Service (QoS) has become a vital requirement in MPSoCs with NoCs. To provide it, NoCs offer guarantees on latency, jitter and bandwidth through virtual channels, but how to allocate these guaranteed-service channels remains an important question. In this paper we present and evaluate different realizations of a central hardware unit that allocates guaranteed-service virtual channels at run time, providing QoS in packet-switched NoCs. We evaluate their performance in terms of allocation success, compare it to distributed channel setup techniques for different NoC sizes and traffic scenarios, and analyze the required hardware area. We find centralized channel allocation to be very suitable for our run-time task scheduling programming model. Index Terms - Network-on-Chip, virtual channel, guaranteed service, channel allocation, Quality-of-Service.
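The allocation decision itself can be sketched as simple bookkeeping: a guaranteed-service channel is granted only if every link on its path still has a free virtual channel. The path encoding and VC accounting below are our own simplification of what such a central allocator tracks:

```python
class CentralAllocator:
    """Minimal sketch of a central run-time allocator for guaranteed-service
    virtual channels in a packet-switched NoC (illustrative, not the paper's
    hardware unit)."""

    def __init__(self, vcs_per_link):
        self.free = {}              # (src, dst) link -> remaining free VCs
        self.default = vcs_per_link

    def request(self, path):
        links = list(zip(path, path[1:]))
        # Grant only if every link along the path has a free VC.
        if all(self.free.get(l, self.default) > 0 for l in links):
            for l in links:
                self.free[l] = self.free.get(l, self.default) - 1
            return True
        return False                # reject; requester may retry or reroute

    def release(self, path):
        for l in zip(path, path[1:]):
            self.free[l] += 1

alloc = CentralAllocator(vcs_per_link=2)
print(alloc.request(["A", "B", "C"]))  # True
print(alloc.request(["A", "B"]))       # True
print(alloc.request(["A", "B", "D"]))  # False: link (A, B) has no free VC left
```

Centralizing this state is what lets the allocator decide atomically over whole paths, instead of the link-by-link negotiation of distributed channel setup.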
FPGA prototyping of recent large Systems-on-Chip (SoCs) is very challenging due to the resource limitations of a single FPGA. Moreover, external access to SoCs for verification and debug purposes is essential. In this paper, we propose partitioning a network-on-chip (NoC) based system into smaller sub-systems, each with its own NoC, and each implemented on a separate FPGA board. Multiple SoC ASICs can be bridged in the same way. The scheme that interconnects the sub-systems should offer the application connections the required quality of service (QoS). We investigate bridging schemes at different levels of the NoC protocol stack and, comparing the distinct design criteria of the proposed schemes, design a bridge. The bridge experiments show that it provides QoS in terms of bandwidth and latency.
Wireless communications has been a hot area of technology advancement for the past two decades. As memory sizes increase, the demand for higher communication data rates grows at the same scale. This means that one must understand today's high-end 10 Gbit/s wireless technology to prepare for the 100 Gbit/s and 1 Tbit/s data rates of tomorrow. This paper presents key boundary conditions, learned from today's leading-edge wireless links, for the Tbit/s technology of the year 2020.
This paper presents the evolution of CMOS image sensors. From the early works, which were strongly oriented towards image processing, the main research effort then shifted to image acquisition. To overcome the rising limitations of standard approaches and to enable new functionalities, several research directions are underway with promising results.
Keywords-image sensors; vision chips; imagers; 3D technology
This work presents a method for global routing (GR) that minimizes interconnect power. We consider designs with multiple supply voltages, where level converters are added to nets that connect driver cells to sink cells at a higher supply voltage. The level converters are modeled as additional terminals during GR. Given an initial GR solution obtained with the objective of minimizing wirelength, we propose a GR method that detours nets to further save interconnect power. When detouring routes via this procedure, overflow is not increased and the increase in wirelength is bounded. The power saving opportunities include: 1) reducing the area capacitance of the routes by detouring them from the higher metal layers to the lower ones, 2) reducing the coupling capacitance between adjacent routes by distributing congestion, and 3) assigning a different power weight to each segment of a routed net with level converters (to capture its supply voltage and activity factor). We present a mathematical formulation capturing these power saving opportunities and solve it using integer programming techniques. In our simulations, we show considerable savings in an interconnect power metric for GR, without any wirelength degradation.
Based on the width determination of each current-driven connection for electromigration and IR-drop avoidance, an area-driven multiple-source routing tree is first constructed to minimize the total wiring area while satisfying Kirchhoff's current law and the electromigration and IR-drop constraints. Furthermore, Steiner points can be assigned to feasible locations to further reduce the total wiring area under these constraints. Finally, an obstacle-aware multiple-source rectilinear Steiner tree is constructed by assigning obstacle-aware minimum-length physical paths for all connections. Compared with Lienig's multiple-source Steiner tree [7], the experimental results show that our approach without any IR-drop constraint reduces the total wiring area by 10.5%. Under 10% Vdd and 5% Vdd IR-drop constraints, our approach satisfies 100% of the electromigration and IR-drop constraints and reduces the original total wiring area by 7.5% and 4.9% on average, respectively, for the tested examples.
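The width-determination step can be illustrated with back-of-the-envelope formulas: the electromigration limit fixes a minimum cross-section for a given current, and the resulting width then determines the IR drop of the segment. The current-density limit, sheet resistance, and voltage budget below are assumed values, not the paper's data:

```python
def min_width(i_ma, j_max=1.0, thickness_um=0.5):
    """Minimum metal width (um) keeping current density below the assumed
    electromigration limit j_max (mA per um^2 of cross-section)."""
    return i_ma / (j_max * thickness_um)

def ir_drop_mv(i_ma, length_um, width_um, r_sheet=0.04):
    """IR drop (mV) along a segment; r_sheet is sheet resistance in ohm/sq,
    so resistance = r_sheet * (length / width)."""
    return i_ma * r_sheet * (length_um / width_um)

w = min_width(10.0)                    # width for a 10 mA connection
drop = ir_drop_mv(10.0, 1000.0, w)     # drop over a 1 mm segment at that width
print(w, drop, drop <= 0.05 * 1000.0)  # check against a 5% budget at Vdd = 1 V
```

Widening a wire reduces its IR drop but increases wiring area, which is exactly the trade-off the area-driven tree construction optimizes under both constraints.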
A novel rotary clock network routing method is proposed for the low-power resonant rotary clocking technology which guarantees: 1) a balanced capacitive load driven by each of the tapping points on the rotary rings, 2) customized bounded clock skew among all the registers on chip, and 3) a sub-optimally minimized total wirelength of the clock wire routes. In the proposed method, a forest of Steiner trees is first created which connects the registers so as to achieve zero skew while greedily balancing the total capacitance of each tree. Then, a balanced assignment of the Steiner trees to the tapping points is performed to guarantee a balanced capacitive load on the rotary network. The proposed routing method is tested with the ISPD clock network contest and IBM r1-r5 benchmarks. The experimental results show that the capacitive load imbalance is very limited. The total wirelength is reduced by 64.2% compared to the best previous work in the literature through the combination of Steiner tree routing and the assignment of trees to tapping points. The average clock skew simulated using HSPICE is only 8.8 ps when the bounded skew target is set to 10.0 ps.
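The balanced assignment of Steiner trees to tapping points resembles a classic load-balancing problem. A greedy longest-processing-time sketch conveys the idea, using a min-heap of tap loads; this is our own illustration, not the paper's assignment algorithm:

```python
import heapq

def assign_trees(tree_caps, n_taps):
    """Greedy balanced assignment: consider subtrees in decreasing capacitance
    and always attach the next one to the currently least-loaded tapping point
    (the LPT load-balancing heuristic)."""
    heap = [(0.0, t) for t in range(n_taps)]   # (accumulated load, tap id)
    heapq.heapify(heap)
    assignment = {}
    for tree, cap in sorted(enumerate(tree_caps), key=lambda x: -x[1]):
        load, tap = heapq.heappop(heap)
        assignment[tree] = tap
        heapq.heappush(heap, (load + cap, tap))
    loads = [0.0] * n_taps
    for tree, tap in assignment.items():
        loads[tap] += tree_caps[tree]
    return assignment, loads

caps = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0]          # illustrative subtree capacitances
_, loads = assign_trees(caps, 3)
print(loads)                                   # prints: [12.0, 11.0, 11.0]
```

Keeping the per-tap loads nearly equal is what limits the capacitive load imbalance on the rotary rings, since each tapping point sees one bin of subtrees.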
Routing for high-speed boards is still performed manually nowadays. Recent work on escape routing has addressed parts of this problem, but a more practical variant remains open: packages and components are often designed with or without input from board designers, and the boundary pin ordering is usually fixed, or given as a guideline, by the time board design starts. Previous escape routing work is therefore unlikely to apply. In this work, we describe this fixed-ordering boundary-pin escape problem and propose a practical approach to solve it. Beyond addressing the problem itself, we further plan the wires so as to preserve the precious routing resources in the limited number of board layers and to deal effectively with obstacles; our approach differs from the conventional shortest-path-based routing paradigm. In addition, we consider length-matching requirements and wire shape resemblance for high-speed signal routes on the board. Our results show that we utilize routing resources very carefully and account for the resemblance of nets in the presence of obstacles. Our approach works for board busses as well.
Dynamic stability analysis for SRAM has been growing in importance with technology scaling. This paper analyzes dynamic writability for designing low voltage SRAM in nanoscale technologies. We propose a definition for dynamic write limited VMIN. To the best of our knowledge, this is the first definition of a VMIN based on dynamic stability. We show how this VMIN is affected by the array capacity, the voltage scaling of the word-line pulse, the bitcell parasitics, and the number of cycles prior to the first read access. We observe that the array can be either dynamically or statically write limited depending on the aforementioned factors. Finally, we look at how voltage-bias based write assist techniques affect the dynamic write limited VMIN.
With increasing levels of variability in the characteristics of VLSI circuits and continued uncertainty in the operating conditions of processors, achieving predictable power efficiency and high performance in electronic systems has become a daunting, yet vital, task. This paper tackles the problem of system-level dynamic power management (DPM) in state-of-the-art chip multiprocessor (CMP) architectures that are manufactured in nanoscale CMOS technologies with large process variations or are operated under widely varying environmental conditions over their lifetime. We adopt a Markovian Decision Process (MDP) based approach to the CMP power management problem. The proposed technique models the underlying variability and uncertainty of system-level parameters as a partially observable MDP, and finds the optimal policy that stochastically minimizes energy per request. Experimental results demonstrate the high efficacy of the proposed power management framework.
Keywords - Chip multiprocessor; Dynamic power management; partially observable Markovian decision process
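For intuition, a fully observable toy version of such an MDP-based power manager can be solved by value iteration; the states, costs, and transitions below are invented for illustration, and the paper's partially observable formulation additionally maintains a belief state over the hidden parameters:

```python
# Toy DPM model: minimize discounted energy cost per step.
states = ["active", "sleep"]
actions = ["stay", "switch"]
# (state, action) -> (energy cost per step, next-state probabilities); all
# numbers are illustrative.
model = {
    ("active", "stay"):   (1.0, {"active": 1.0}),
    ("active", "switch"): (0.3, {"sleep": 1.0}),
    ("sleep", "stay"):    (0.1, {"sleep": 1.0}),
    ("sleep", "switch"):  (0.5, {"active": 1.0}),
}

def value_iteration(gamma=0.9, iters=200):
    """Standard value iteration for the minimum-cost policy."""
    v = {s: 0.0 for s in states}
    for _ in range(iters):
        v = {s: min(c + gamma * sum(p * v[n] for n, p in probs.items())
                    for (st, a), (c, probs) in model.items() if st == s)
             for s in states}
    policy = {s: min(actions, key=lambda a: model[(s, a)][0] +
                     gamma * sum(p * v[n] for n, p in model[(s, a)][1].items()))
              for s in states}
    return v, policy

v, policy = value_iteration()
print(policy)  # prints: {'active': 'switch', 'sleep': 'stay'}
```

In this toy model the optimal policy is to pay the one-time switch cost and stay asleep, since the discounted cost of sleeping (0.1 per step) beats staying active (1.0 per step); a realistic model would add request arrivals and wake-up latencies.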
In this paper, we study the problem of reducing the overall energy consumption of a real-time system while ensuring its timing and maximum-temperature constraints. We incorporate the interdependence of leakage, temperature and supply voltage into the analysis and develop a novel method to quickly estimate the overall energy consumption. Based on this method, we then propose a scheduling technique to minimize the overall energy consumption under the maximum temperature constraint. Our experimental results show that the proposed energy estimation method achieves up to four orders of magnitude of speedup over existing approaches while keeping the maximum estimation error within 4.8%. In addition, simulation results demonstrate that our proposed energy minimization method consistently and significantly outperforms previous related approaches.
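The leakage/temperature interdependence mentioned above is commonly resolved by fixed-point iteration: leakage power raises temperature, which in turn raises leakage, until the two settle. The sketch below uses an exponential leakage model with illustrative constants and is not the paper's estimation method:

```python
import math

def steady_temperature(p_dyn, t_amb=45.0, r_th=0.8, p_leak0=5.0,
                       t_ref=45.0, k=0.02, iters=50):
    """Fixed-point iteration for
        T = T_amb + R_th * (P_dyn + P_leak(T)),
        P_leak(T) = P0 * exp(k * (T - T_ref)).
    Units: W, K/W, degrees C; all constants are assumed values."""
    t = t_amb
    for _ in range(iters):
        p_leak = p_leak0 * math.exp(k * (t - t_ref))  # leakage at current temp
        t = t_amb + r_th * (p_dyn + p_leak)           # temperature it implies
    return t, p_leak

t, p_leak = steady_temperature(20.0)
print(round(t, 2), round(p_leak, 2))
```

The iteration converges quickly here because the loop gain R_th · dP_leak/dT is well below one; schedulers that need many such evaluations are exactly where a fast closed-form or table-based estimate pays off.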
This paper presents a power and performance multi-objective Tabu Search based technique for designing application-specific Network-on-Chip architectures. The topology generation approach uses an automated technique to incorporate floorplan information and obtain accurate values for wirelength and area. The method also takes dynamic effects such as contention into account, allowing performance constraints to be incorporated during topology synthesis. A new contention analysis method is presented which evaluates power and performance objectives using a Layered Queuing Network (LQN) contention model. The contention model is able to analyze rendezvous interactions between NoC components and alleviate potential bottleneck points within the system. Several experiments are conducted on various SoC benchmark applications and compared to previous works.
Keywords - Network-on-Chip, Topology Generation, Tabu Search, Layered Queuing Networks, Contention
Shared L1 memory is an interesting architectural option for building tightly-coupled multi-core processor clusters. We designed a parametric, fully combinational Mesh-of-Trees (MoT) interconnection network to support high-performance, single-cycle communication between processors and memories in L1-coupled processor clusters. Our interconnect IP is described in synthesizable RTL and is coupled with a design automation strategy mixing advanced synthesis and physical optimization to achieve optimal delay, power and area (DPA) under a wide range of design constraints. We explore DPA for a large set of network configurations in 65nm technology. Post place-and-route delay is 38 FO4 for a configuration with 8 processors and 16 32-bit memories (8x16); when the number of both processors and memories is increased by a factor of 4, the delay increases almost logarithmically to 84 FO4, confirming scalability across a significant range of configurations. DPA tradeoff flexibility is also promising: compared to the maximum-performance 16x32 configuration, power and area can be reduced by 45% and 12%, respectively, at the expense of 30% performance degradation.
Interconnection networks with adaptive routing are susceptible to deadlock, which could lead to performance degradation or system failure. Detecting deadlocks at run-time is challenging because of their highly distributed characteristics. In this paper, we present a deadlock detection method that utilizes run-time Transitive Closure (TC) computation to discover the existence of deadlock-equivalence sets, which imply loops of requests in networks-on-chip (NoC). This detection scheme guarantees the discovery of all true deadlocks without false alarms unlike state-of-the-art approximation and heuristic approaches. A distributed TC-network architecture which couples with the NoC architecture is also presented to realize the detection mechanism efficiently. Our results based on a cycle-accurate simulator demonstrate the effectiveness of the TC-network method. It drastically outperforms timing-based deadlock detection mechanisms by eliminating false detections and thus reducing energy dissipation in various traffic scenarios. For example, timing based methods may produce two orders of magnitude more deadlock alarms than the TC-network method. Moreover, the implementations presented in this paper demonstrate that the hardware overhead of TC-networks is insignificant.
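The transitive-closure test at the heart of this detection scheme can be sketched in software with Warshall's algorithm over a wait-for graph: a deadlock-equivalence set exists exactly when some node can reach itself. The graph encoding is our own illustration; the paper computes the closure in distributed hardware at run time:

```python
def has_deadlock(wait_for):
    """wait_for: dict mapping each node (e.g. a channel or router port) to the
    set of nodes whose resources it is waiting on. Returns True iff the
    wait-for graph contains a cycle, i.e. a deadlock-equivalence set."""
    nodes = sorted(wait_for)
    # Adjacency matrix of the wait-for relation.
    reach = {u: {v: v in wait_for[u] for v in nodes} for u in nodes}
    # Warshall's transitive closure: reach[i][j] becomes True iff j is
    # reachable from i through any chain of wait-for edges.
    for k in nodes:
        for i in nodes:
            if reach[i][k]:
                for j in nodes:
                    if reach[k][j]:
                        reach[i][j] = True
    return any(reach[u][u] for u in nodes)

print(has_deadlock({"A": {"B"}, "B": {"C"}, "C": {"A"}}))   # True: cyclic wait
print(has_deadlock({"A": {"B"}, "B": {"C"}, "C": set()}))   # False: acyclic
```

Because the closure is exact, this test raises no false alarms, which is precisely the advantage claimed over timeout-based heuristics that must guess whether a long wait is a deadlock.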
As the design complexity of LSI systems increases, so do the verification challenges. It is very important, yet difficult, to find all design errors and correct them in a timely manner. This paper presents our experience with a new verification and debug methodology based on the combination of formal verification and automated debugging. This methodology, applied to the development of a DDR2 memory design targeted for an FPGA, is found to significantly reduce the verification and debug tasks typically performed.
Keywords-system LSI; verification; debug; methodology
We present a compact model that provides a quick estimation of the stress and mobility patterns around arbitrary configurations of Through-Silicon Vias (TSVs). No separate TCAD simulations are required for these configurations. It estimates nFET and pFET mobility for industry-standard as well as (100)/<100> substrate orientations. As the model provides mobility information in less than 0.1 millisecond per transistor per TSV, it can be used in combination with layout tools and circuit simulators to optimise circuit layouts for digital and analog applications. The model has been integrated into the 3D PathFinding flow for steering 3D IO placement during stack definition.
We look into the validation of a power-managed ARM Cortex A-8 core used in SoCs targeted at the mobile segment. Low-power design techniques used on the chip include clock gating, voltage scaling, and power gating. We focus on the verification challenges faced in designing the processor core, including RTL modeling of power switches, isolation and level-shifting cells; simulation of voltage ramps; generation of appropriate control signals to put the device into various power states; and ensuring correct operation of the chip in these states as well as during the transitions between them.
Keywords- low power, verification, power gating, dynamic voltage scaling, power switches, isolation, ARM Cortex A-8
Constraints imposed on the design of components for mobile devices include the size of the handheld device, handling safety, heat dissipation, and in-system electromagnetic interference. This paper discusses the challenges in designing the next generation of low-power DRAM subsystems operating at multiple gigabits per second. A new mobile DRAM interface that can meet these challenges is presented, along with some test data.
Keywords-low power DRAM; thermal; EM emission; high data rate; package-on-package; mobile phone
One of the most important challenges facing the entire globe is the trend towards an aging population. By 2045, there will be more people over 60 years old than younger than 15, their number rising from 600 million to 2 billion worldwide. This will raise the number of patients with age-specific, chronic and degenerative diseases (e.g. cardio-vascular, cancer, diabetes, Alzheimer's, Parkinson's). Minimally-invasive imaging technologies such as PET (Positron Emission Tomography) and MRI (Magnetic Resonance Imaging) play a vital role in detecting and tracking the evolution of these illnesses and in determining the strategy and effectiveness of the prescribed therapies. So far, the detection unit of PET equipment has been implemented using photomultiplier tubes (PMTs). A novel solid-state photo-detector, the Silicon Photomultiplier (SiPM), can replace the PMT, offering, among many other advantages, the possibility of combined PET/MRI equipment.
Keywords: Nuclear Medicine, Photomultiplier, central nervous system's diagnostics, PET, SiPM.
Today, electronic devices are increasingly employed in different fields, including safety- and mission-critical applications, where product quality is an essential requirement. In the automotive field, on-line self-test is a dependability technique currently demanded by emerging industrial standards. This paper presents an approach employed by STMicroelectronics for evaluating, or grading, the effectiveness of Software-Based Self-Test (SBST) procedures used for on-line testing of microcontrollers to be included in safety-critical vehicle parts, such as airbag and steering systems.
Keywords-SoC, test, software-based self-test, fault grading
Multi-core architectures that are built to reap performance and energy-efficiency benefits from the parallel execution of applications often employ runtime adaptive techniques in order to achieve, among other goals, load balancing, dynamic thermal management, and enhanced system reliability. Typically, such runtime adaptation at the system level requires the ability to quickly and consistently migrate a task from one core to another. For distributed memory architectures, the policy for transferring the task context between source and destination cores is of vital importance to the performance and to the successful operation of the system. Since its performance is negatively correlated with the communication overhead, energy consumption, and dissipated heat, task migration needs to be runtime adaptive to account for the system load, chip temperature, or battery capacity. This work presents a novel context-aware runtime adaptive task migration mechanism (CARAT) that reduces the task migration latency by 93.12%, 97.03%, and 100% compared to three state-of-the-art mechanisms, and allows the trade-off between maximum migration delay and performance overhead to be controlled at runtime. This mechanism is built on an in-depth analysis of the memory access behavior of several multimedia and robotic embedded-systems applications.
In this paper, an efficient embedded software synthesis approach based on a generalized clustering algorithm for static dataflow subgraphs embedded in general dataflow graphs is proposed. The clustered subgraph is quasi-statically scheduled, thus improving the performance of the synthesized software in terms of latency and throughput compared to a dynamically scheduled execution. The proposed clustering algorithm outperforms previous approaches through faster computation and a more compact representation of the derived quasi-static schedules. This is achieved by a rule-based approach, which avoids an explicit enumeration of the state space. Experimental results show significant improvements in both performance and code size when compared to a state-of-the-art clustering algorithm. Index Terms - MPSoC Scheduling, Software Synthesis, Actor-Oriented Design
NAND flash is preferred for code and data storage in embedded devices due to its high density and low cost. However, NAND flash requires code to be copied to main memory for execution. In inexpensive devices without hardware memory management, full shadowing of an application binary is commonly used to load the program. This approach can lead to a high initial application start-up latency and poor amortization of copy overhead. To overcome these problems, we describe a software-only demand-paging approach that incrementally copies code to memory with a dynamic binary translator (DBT). This approach does not require hardware or operating system support. With careful management, savings can be achieved in total code footprint, which can offset the size of the data structures used by the DBT. For applications that cannot amortize the full shadowing cost, our approach can reduce start-up latency by 50% or more, and improve performance by 11% on average.
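The trade-off the abstract describes, full shadowing's up-front copy versus demand paging's lazy, incremental copy, can be illustrated with a toy cost model. The page size, per-page copy cost, and page counts below are invented for illustration and are not figures from the paper:

```python
# Illustrative model of full shadowing vs. software demand paging from
# NAND flash (not the paper's DBT implementation; all constants are assumed).
PAGE_SIZE = 2048          # bytes per flash page (assumed)
COPY_COST_US = 50         # microseconds to copy one page to RAM (assumed)

def full_shadow_startup(num_pages):
    """Full shadowing copies every page before execution can start."""
    return num_pages * COPY_COST_US

def demand_paged_run(executed_pages):
    """Demand paging copies a page only on its first execution.
    Start-up latency is just the first page; the rest is paid lazily."""
    startup = COPY_COST_US                       # first page only
    total = len(set(executed_pages)) * COPY_COST_US
    return startup, total

# A binary of 100 pages where execution touches only 40 distinct pages:
touched = list(range(40))
startup, total = demand_paged_run(touched)
print(full_shadow_startup(100))   # 5000 us paid entirely up-front
print(startup, total)             # 50 us start-up, 2000 us total copy cost
```

The model shows why start-up latency drops sharply and why copy overhead is only paid for the code actually executed.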
The huge investment in the design and production of
multicore processors may be put at risk because the emerging
highly miniaturized but unreliable fabrication technologies will
impose significant barriers to the life-long reliable operation of
future chips. Extremely complex, massively parallel, multi-core
processor chips fabricated in these technologies will become more
vulnerable to: (a) environmental disturbances that produce
transient (or soft) errors, (b) latent manufacturing defects as well
as aging/wearout phenomena that produce permanent (or hard)
errors, and (c) verification inefficiencies that allow important
design bugs to escape into the system. In an effort to cope with
these reliability threats, several research teams have recently
proposed multicore processor architectures that provide low-cost
dependability guarantees against hardware errors and design
bugs. This paper focuses on dependable multicore processor
architectures that integrate solutions for online error detection,
diagnosis, recovery, and repair during field operation. It
discusses a taxonomy of representative approaches and presents a
qualitative comparison based on: hardware cost, performance
overhead, types of faults detected, and detection latency. It also
describes in more detail three recently proposed effective
architectural approaches: a software-anomaly detection
technique (SWAT), a dynamic verification technique (Argus),
and a core salvaging methodology.
Keywords: multicore microprocessors; dependable architectures;
online error detection/recovery/repair.
In this paper, we propose an energy-efficient 3D-stacked CMP design based on both temporally and spatially fine-grained tuning of processor cores and caches. In particular, temporally fine-grained DVFS is employed by each core and the L2 cache to reduce dynamic energy consumption, while spatially fine-grained DVS is applied to the cache hierarchy for leakage energy reduction. Our tuning technique is implemented by integrating an array of on-chip voltage regulators into the original processor. Experimental results show that the proposed design provides energy-efficient, direct, and adaptive control of the system, leading to 20% dynamic and 89% leakage energy reductions, and an average of 34% total energy savings compared to the baseline design.
This paper addresses the problem of model checking multiple properties on the same circuit/system. Although this is a typical scenario in several industrial verification frameworks, most model checkers currently handle single properties, verifying multiple properties one at a time. Possible correlations and shared sub-problems that could be exploited while checking different properties are typically ignored, either for the sake of simplicity or to enable Cone-Of-Influence minimization. In this paper, we describe a preliminary effort oriented to exploiting possible synergies among the distinct verification tasks of several properties on the same circuit. Besides considering given sets of properties, we also show that multiple properties can be automatically extracted from individual properties, thus simplifying difficult model checking tasks. Preliminary experimental results indicate that our approach can lead to significant performance improvements.
This paper describes a new and efficient solution for distributed event-driven gate-level HDL simulation. It is based on a novel concept of spatial parallelism using accurate prediction of the input and output signals of individual local modules in local simulations, derived from a model at a higher abstraction level (RTL). Using predicted rather than actual signal values makes it possible to eliminate or greatly reduce the communication and synchronization overhead in a distributed event-driven simulation.
This paper discusses specific circuit-level and analog DFT techniques and methodologies used in integrated power management (PM) systems to overcome the challenges of mixed-signal SoC qualification. They are mainly targeted at achieving the following: 1. enabling robust digital and system-level test and burn-in (BI) with external supplies, by disabling the on-chip PM while preserving robust power-on performance; 2. minimising external on-board active components on the BI board and making the whole BI process more robust; 3. making the IDDQ tests more robust, increasing IDDQ sensitivity through less error-prone design methods, and enabling IDDQ tests on analog supplies; and 4. defining a separate BI strategy for the on-chip PM modules as a whole and enabling it through targeted analog test modes.
Keywords: Burn-in, electrical reliability qualification, IDDQ,
analog DFT, power management.
To address the prohibitive costs of advanced technologies, one solution is to reuse masks across a wide range of systems. This can be achieved with a modular circuit that can be stacked to build 3D systems whose processing performance is adapted to several applications. This paper focuses on 4G wireless telecom applications. We propose a basic circuit that meets the SISO (Single Input Single Output) transmission mode. By stacking multiple instances of this same circuit, it becomes possible to address several MIMO (Multiple Input Multiple Output) modes. The proposed circuit is composed of several processing units interconnected by a 3D NoC and controlled by a host processor. Compared to a 2D reference platform, the proposed circuit maintains at least the same performance and power consumption in the context of 4G telecom applications, while reducing total mask cost.
As the operating frequency of LSIs becomes higher and the power supply voltage lower, on-chip power supply variation has become a dominant factor influencing the signal delay of circuits. Static timing analysis (STA) considering on-chip power supply variations (IR-drop) is
therefore one of the most crucial issues in LSI design today. We propose an efficient STA method that considers on-chip power supply variations by utilizing the spatial correlations of IR-drop. The proposed method is based on the widely used LOCV (Location-based OCV) technique for STA under on-chip variations (OCV), and can therefore be easily incorporated into an existing timing analysis flow. The proposed
method is evaluated by using test data including H-tree clock
structure with various on-chip IR-drop distributions. The
experimental results show that the proposed method can reduce
the design margin with respect to power supply variations by 6-85%
(47% on average) compared with the conventional
practical approach with a constant OCV derating factor, while
requiring no additional computation cost in the static timing
analysis. Thus the proposed method can contribute to a fast
timing closure considering on-chip power supply variations.
Keywords-static timing analysis; power supply variation; OCV
SyncCharts are a synchronous Statechart variant for modeling reactive systems with a precise and deterministic semantics. Simulation and software synthesis for SyncCharts usually involve compilation into Esterel, which is then further compiled into C code. This can produce efficient code, but has two principal drawbacks: 1) the arbitrary control flow that can be expressed with SyncChart transitions cannot be mapped directly to Esterel, and 2) it is very difficult to map the resulting C code back to the original SyncChart, which hampers traceability. This paper presents an alternative software synthesis approach for SyncCharts that compiles SyncCharts directly into Synchronous C (SC). The compilation preserves the structure of the original SyncChart, which is advantageous for validation and possibly certification. We present a static thread-scheduling scheme that reflects data dependencies and optimizes both the number of threads used and the maximal priorities used. This results in SC code with competitive speed and low memory requirements.
In many computing domains, hardware accelerators can improve throughput and lower power consumption compared to executing functionally equivalent software on general-purpose microprocessor cores. While hardware accelerators are often stateless, network processing exemplifies the need for stateful hardware acceleration. The packet-oriented streaming nature of current networks enables data processing as soon as packets arrive, rather than when the data of the whole network flow is available. Due to the concurrency of many flows, an accelerator must maintain and switch contexts between the states of the various accelerated streams embodied in the flows, which increases the overhead associated with acceleration. We propose and evaluate dynamic reordering of requests of different accelerated streams in a hybrid on-chip/memory-based request queue in order to reduce this overhead.
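The context-switch overhead that motivates the reordering can be illustrated with a small functional model: serving requests of the same stream back-to-back avoids repeated state save/restore. The FIFO baseline, grouping policy, and request format here are assumptions for illustration, not the paper's queue design:

```python
# Hedged sketch: reorder accelerator requests so requests of the same
# stream are served consecutively, reducing context switches.
from collections import OrderedDict

def context_switches(schedule):
    """Count state switches: one each time the served stream changes."""
    switches, current = 0, None
    for stream, _ in schedule:
        if stream != current:
            switches += 1
            current = stream
    return switches

def reorder_by_stream(requests):
    """Group pending requests by stream while preserving per-stream order."""
    groups = OrderedDict()
    for stream, payload in requests:
        groups.setdefault(stream, []).append((stream, payload))
    return [req for reqs in groups.values() for req in reqs]

# Interleaved arrivals from three flows:
arrivals = [("A", 0), ("B", 0), ("A", 1), ("C", 0), ("B", 1), ("A", 2)]
print(context_switches(arrivals))                     # 6 switches in FIFO order
print(context_switches(reorder_by_stream(arrivals)))  # 3 switches after grouping
```

In a real accelerator the grouping window would be bounded (to limit added latency), which is precisely the trade-off a hybrid on-chip/memory queue must manage.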
Reset is one of the most important signals in many designs. Since reset is typically not timing critical, it is handled at late physical design stages. However, the large fanout of reset and the lack of routing resources at these stages can create varying delays at different targets of the reset signal, causing reset recovery problems. Traditional approaches address this problem using physical design methods such as buffer insertion or rerouting. However, these methods may invalidate previous optimization efforts, making timing closure difficult. In this work, we propose a formal method to calculate reset recovery slacks for registers at the register transfer level. Designers and physical design tools can then utilize this information throughout the design flow to reduce reset problems at later design stages.
Three-dimensional (3D) integrated circuits (ICs) are emerging as a viable solution to enhance the performance of Multi-Processor System-on-Chip (MPSoC) designs. The use of high-speed hardware and the increased density of 3D architectures present novel challenges concerning thermal dissipation and power management. Most approaches to power and thermal modeling use either static analytical models or slow, low-level analog simulations. In this paper, we propose a novel thermal modeling methodology for the evaluation of 3D MPSoCs. The integration of this methodology in a virtual platform enables efficient dynamic thermal evaluation of a chip. We present initial results for an architecture based on a 3D Network-on-Chip (NoC) interconnecting 2D processing elements (PEs). Our methodology is based on the finite difference method: we perform an initial static characterization, after which high-speed dynamic simulation is possible. Index Terms - Virtual Platform, 3D IC, MPSoC, Dynamic Evaluation of Performance, Power Estimation, Thermal Analysis
This paper describes and compares two methods for
producing digital test signals up to 24 Gbps. Prototypes are
experimentally characterized to determine signal quality, and the
two methods are demonstrated and compared. The residual
timing errors are dominated by jitter. Typical random jitter (RJ)
is about 1.17ps to 1.4ps (RMS) including system measurement
errors for the two methods. Deterministic Jitter (DJ) is between
2.4ps and 8.5ps. Total jitter (TJ) ranges between 18.9ps and
28.2ps at a bit-error-rate (BER) of 10^-12.
Keywords-multi-Gbps; Test Synthesis; Jitter; ATE
Resistive random access memory (ReRAM) has been demonstrated as a promising non-volatile memory technology with features such as high density, low power, good scalability, easy fabrication, and compatibility with the existing CMOS technology. The conventional three-dimensional (3D) bipolar ReRAM design usually stacks up multiple memory layers that are separated by isolation layers, e.g., Spin-on-Glass (SOG). In this paper, we propose a new 3D bipolar ReRAM design with interleaved complementary memory layers (3D-ICML), which can form a memory island without any isolation. The set of metal wires between two adjacent memory layers in the vertical direction can be shared. The 3D-ICML design can reduce fabrication complexity and increase memory density. Meanwhile, multiple memory cells interconnected horizontally and vertically can be accessed at the same time, which dramatically increases the memory bandwidth.
The emerging 3D technology, which stacks multiple
dies within a single chip and utilizes through-silicon vias (TSVs)
as vertical connections, is considered a promising solution for
achieving better performance and easy integration. Similarly, a
generic 2D FPGA architecture can evolve into a 3D one by
extending its signal switching scheme from 2D to 3D by means of
TSVs. However, replacing all 2D switch boxes (SBs) with 3D ones providing full vertical connectivity proves both area-consuming and wasteful of resources. It is therefore possible to greatly reduce the footprint, with only a minor delay increase, by properly tailoring
the structure and deployment strategy of 3D SB. In this paper, we
perform a comprehensive architectural exploration of 3D FPGAs.
Various architectural alternatives are proposed and then thoroughly evaluated to identify those offering the best balance between area and delay. Finally, we recommend
several configurations for generic 3D FPGA architectures, which
can save up to 52% area with virtually no delay penalty.
Keywords-3D ICs; 3D FPGAs; architectural exploration;
area/delay trade-off
For many embedded systems, data protection is becoming a major issue. On these systems, processors are often heterogeneous, which prevents the deployment of a common, trusted hypervisor on all of them. Multiple native software stacks are thus bound to share resources without protection between them. NoC-MPU is a Memory Protection Unit that supports the secure and flexible co-hosting of multiple native software stacks running in multiple protection domains on any shared-memory MPSoC using a NoC. This paper presents a complete hardware architecture for this NoC-MPU mechanism, along with its trusted software model organization.
Measurement equipment for process control in the chemical industry faces severe restrictions due to safety concerns and regulations. In this work, we discuss the challenges raised by safety concerns and explain how they lead to strong power and energy constraints in the design of industrial measurement equipment. We argue that a comprehensive strategy covering the design and implementation of hardware and software on the one hand, and power management on the other, is required to satisfy these constraints. Furthermore, we demonstrate solutions for the power-efficient design of the computing system and bus topology in an industrial environment.
Driven by the continued scaling of Moore's Law, the number of processing elements on a die is increasing dramatically. Recently there has been a surge of wide single-instruction multiple-data (SIMD) architectures designed to handle computationally intensive applications like 3D graphics, high-definition video, image processing, and wireless communication. A limit on the SIMD width of these architectures is the scalability of the interconnect network between the processing elements, in terms of both area and power. To mitigate this problem, we propose a new interconnect topology, XRAM, a low-power, high-performance matrix-style crossbar. It reuses output buses for control programming and stores multiple swizzle configurations at the cross points using SRAM cells, significantly reducing routing congestion and control signaling. We show that, compared to conventionally implemented crossbars, the area scales with the product of input and output ports while consuming almost 50% less energy. We present an application case study, color-space conversion, utilizing XRAM, and show a 1.4x gain in performance while consuming 1.5-2.5x less power.
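The idea of storing multiple swizzle configurations at the cross points can be sketched functionally: a crossbar keeps several precomputed input-to-output permutations and selects one per cycle, instead of reprogramming control lines each cycle. The class name, configuration ids, and four-port width below are illustrative assumptions, not XRAM's actual interface:

```python
# Functional model of a crossbar with stored swizzle configurations.
# Each configuration maps every output port to an input port.
class SwizzleCrossbar:
    def __init__(self, width):
        self.width = width
        self.configs = {}          # config id -> permutation (output -> input index)

    def store(self, cid, permutation):
        # A valid swizzle is a permutation of the input ports.
        assert sorted(permutation) == list(range(self.width))
        self.configs[cid] = permutation

    def route(self, cid, inputs):
        """Apply a stored swizzle to one cycle's worth of input data."""
        perm = self.configs[cid]
        return [inputs[perm[o]] for o in range(self.width)]

xbar = SwizzleCrossbar(4)
xbar.store("identity", [0, 1, 2, 3])
xbar.store("reverse", [3, 2, 1, 0])    # e.g. a byte-reversal swizzle
print(xbar.route("reverse", ["b0", "b1", "b2", "b3"]))  # ['b3', 'b2', 'b1', 'b0']
```

Selecting a stored configuration needs only a small id per cycle, which is the source of the control-signaling savings the abstract claims.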
The ever increasing demand for fast mobile internet connectivity continues to set challenges for research in radio communications. On one hand the capacity demand can be served by offloading data traffic to local networks; on the other hand using more bandwidth, and possibly dynamically allocating spectrum in a flexible way, will improve the usage of the available spectrum. The future of wireless access continues to be defined by the 3GPP and IEEE standards setting bodies. Radios can also provide innovative features that offer new functionalities for consumers, such as ultra fast local connectivity, sensing and positioning. This talk will present examples of various radio innovations and the challenges related to commercializing them.
This paper presents a new quadratic, partitioning-based placement algorithm that is able to handle non-convex and overlapping position constraints on subsets of cells, called movebounds. Our new flow-based partitioning (FBP) combines a global MinCostFlow model for computing directions with extremely fast and highly parallelizable local realization steps. Despite its global view, the size of the MinCostFlow instance is only linear in the number of partitioning regions and does not depend on the number of cells. We prove that our partitioning scheme finds a (fractional) solution for any given placement, or decides in polynomial time that none exists. In practice, BonnPlace with FBP can place huge designs with almost 10 million cells and dozens of movebounds in 90 minutes of global placement. On instances with movebounds, the netlengths of our placements are more than 32% shorter than RQL's [25], and our tool is 9-20 times faster. Even without movebounds, FBP improves the quality and runtime of BonnPlace significantly, and our tool shows the currently best results on the latest placement benchmarks [16].
Thermal problems are important for integrated circuits with high power densities. Three-dimensional stacked-wafer integrated circuit technology reduces interconnect lengths and improves performance compared to two-dimensional integration. However, it intensifies thermal problems. One remedy is to redistribute white space during floorplanning. In this paper, we propose a two-phase algorithm to redistribute white space. In the first phase, the lateral heat flow white space redistribution problem is formulated as a minimum cycle ratio problem, in which the maximum power density is minimized. Since this phase only considers lateral heat flow, it also works for traditional two-dimensional integrated circuits. In the second phase, to consider inter-layer heat flow in three-dimensional integrated circuits, we discretize the chip into an array of tiles and use a dynamic programming algorithm to minimize the maximum stacked tile power consumption. We compared our algorithms with a previously proposed technique based on mathematical programming. Our iterative minimum cycle ratio algorithm achieves 35% more reduction in peak temperature. Our two-phase algorithm achieves a 4.21x reduction in peak temperature for three-dimensional integrated circuits compared to applying the first phase alone.
Due to inappropriate assignment of bump pads or improper placement of I/O buffers, the configured delays of I/O signals may not satisfy the timing requirements inside the die core. In this paper, the problem of timing-constrained I/O buffer placement in an area-I/O flip-chip design is first formulated. An efficient two-phase approach is then proposed to place I/O buffers onto feasible buffer locations between I/O pins and bump pads under consideration of the timing constraints. Compared with Peng's SA-based approach [7] with no timing constraints, our approach reduces total wirelength by 71.82% and maximum delay by 55.74% on average over 7 test cases. Under the given timing constraints, our approach obtains a higher timing-constraint satisfaction ratio (TCSR) than the SA-based approach [7].
The Network-on-Chip (NoC) paradigm has emerged as a revolutionary methodology for integrating a large number of processing elements in a single die in current System-on-Chips (SoCs). It has the advantages of enhanced performance, scalability, and modularity compared with previous bus-based communication architectures. Recently, a new Triplet-based Hierarchical Interconnection Network (THIN) has been proposed. In this paper, we explore the three-dimensional (3D) floorplanning of THIN and present two different floorplanning and routing methods, using Manhattan routing and Y-architecture routing. A cycle-accurate simulator is developed based on the Noxim NoC simulator and the ORION 2.0 energy model. The latency, power consumption, and area requirements of both THIN and Mesh are evaluated. The experimental results indicate that the proposed design provides a 24.95% reduction in average power consumption and a 16.84% improvement in area requirement.
With the evolution of today's semiconductor technology, chip temperature
increases rapidly mainly due to the growth in power density. For modern
embedded real-time systems, it is crucial to estimate maximal temperatures in
order to take mapping or other design decisions to avoid burnout, and still be able
to guarantee meeting real-time constraints. This paper provides answers to the
question: when work-conserving scheduling algorithms such as earliest-deadline-first (EDF), rate-monotonic (RM), or deadline-monotonic (DM) are applied, what is
the worst-case peak temperature of a real-time embedded system under all possible
scenarios of task executions? We propose an analytic framework, which considers
a general event model based on network and real-time calculus. This analysis
framework has the capability to handle a broad range of uncertainties in terms of
task execution times, task invocation periods, and jitter in task arrivals. Simulations show that our framework is a cornerstone for designing real-time systems that have guarantees on both schedulability and maximal temperature.
Keywords-real-time systems; compositional analysis; worst-case peak temperature;
thermal analysis
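Why the worst-case pattern of task executions matters for peak temperature can be seen with a toy first-order (RC) thermal model: the same amount of work executed in one burst heats the chip more than the same work spread out. The constants and busy/idle traces below are invented for illustration and are unrelated to the paper's network/real-time-calculus framework:

```python
# Toy first-order thermal model: dT/dt = (P*R + T_amb - T) / (R*C).
# All thermal constants are assumptions, not values from the paper.
def simulate_peak(trace, t_amb=45.0, p_active=10.0, r=2.0, c=0.5, dt=0.01):
    """Integrate the thermal ODE over a boolean active/idle trace
    and return the peak temperature reached."""
    temp, peak = t_amb, t_amb
    for active in trace:
        power = p_active if active else 0.0
        temp += dt * ((power * r + t_amb - temp) / (r * c))
        peak = max(peak, temp)
    return peak

burst  = [True] * 300 + [False] * 300        # all work up front
spread = ([True] * 50 + [False] * 50) * 6    # same total work, spread out
print(simulate_peak(burst) > simulate_peak(spread))  # True: bursts run hotter
```

A worst-case peak-temperature analysis must therefore bound the hottest feasible execution pattern, not just the total utilization, which is what the event-model-based framework above does.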
In this paper, we present an automatic leakage power modeling method for standard cell libraries as well as SRAM compilers. This problem poses two major challenges: (1) the high sensitivity of leakage power to temperature (e.g., the leakage power of an inverter can differ by a factor of 19.28 when the temperature rises from 25°C to 100°C in a 90nm technology), and (2) the large number of models to be built (e.g., there could be 80,835 SRAM macros supported by an SRAM compiler). Our method achieves high accuracy efficiently through two formula-based prediction techniques. First, we incorporate a quick segmented exponential interpolation scheme to take the effects of temperature into account. Second, we use a MUX-oriented linear extrapolation scheme, which is so accurate that it allows us to build the leakage power models for all SRAM macros based on linear regression using only the simulation results of 9 small-sized SRAM macros. Experimental results show that this method is not only accurate but also highly efficient. Index Terms - Leakage Power Modeling, Leakage Power Estimation, Standard Cell Library, SRAM Compiler
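The first technique, segmented exponential interpolation over temperature, can be sketched as follows: between two characterized temperature points, leakage is modeled as a*exp(b*T), which matches both segment endpoints exactly. The characterization data below is invented for illustration, not taken from the paper:

```python
# Hedged sketch of segmented exponential interpolation of leakage power
# versus temperature (sample data is illustrative only).
import math

def segmented_exp_interp(points, t):
    """points: sorted list of (temperature, leakage) characterization data.
    Within each segment, fit leakage = l0 * exp(b * (t - t0))."""
    for (t0, l0), (t1, l1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            b = math.log(l1 / l0) / (t1 - t0)   # per-segment exponent
            return l0 * math.exp(b * (t - t0))
    raise ValueError("temperature outside characterized range")

# Characterized leakage (nW) at a few temperatures, growing roughly exponentially:
data = [(25, 1.0), (50, 3.0), (75, 9.0), (100, 27.0)]
print(round(segmented_exp_interp(data, 50), 3))    # 3.0 (hits a sample point)
print(round(segmented_exp_interp(data, 62.5), 3))  # 5.196, between 3 and 9
```

Because each segment only needs two characterized points, a small number of SPICE runs per cell suffices to cover the whole temperature range.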
Clock gating is an effective method of reducing power
dissipation of a high-performance circuit. However, deployment
of gated cells increases the difficulty of optimizing a clock tree. In
this paper, we propose a delay-matching approach to address this problem. Delay-matching uses gated cells whose timing
characteristics are similar to that of their clock buffer (inverter)
counterparts. It attains better slew and much smaller latency
with comparable clock skew and less area when compared to
type-matching. The skew of a delay-matching gated tree, just like
the one generated by type-matching, is insensitive to process and
operating corner variations. In addition, delay-matching ECO of a gated tree excels at preserving the original timing characteristics
of the gated tree.
Keywords- Clock gating; low power design; clock tree
Turbo codes are proposed in most of the advanced
digital communication standards, such as 3GPP-LTE. However,
due to its computational complexity, the turbo decoder is one of
the most power-hungry blocks in the digital baseband. One way to alleviate this issue is to avoid surplus computation through early termination of the iterative decoding process. The use of
stopping criteria is one of the most common algorithm level
power reduction methods in literature. These methods always
come with some hardware overhead. In this paper, a new trellis
based stopping criterion is proposed. The novelty of this
approach is the lower hardware overhead thanks to the use of
trellis states as key parameter to stop the iterative process.
Results show the importance of this added hardware for the method's efficiency. Compared to state-of-the-art Log-Likelihood Ratio (LLR) based techniques, the proposed Low-Complexity Trellis-Based (LCTB) criterion demonstrates 23% less power consumption on average, for a comparable performance level in terms of Bit Error Rate (BER) and Frame Error Rate (FER).
Keywords-stopping criteria; turbo-decoder; trellis states; low
complexity; power reduction
Tag comparisons account for a significant portion of cache power consumption in highly associative caches such as the L2 cache. In this work, we propose a novel tag access scheme that applies a partial-tag-enhanced Bloom filter to reduce tag comparisons by detecting per-way cache misses. The proposed scheme also classifies cache data into hot and cold data, and the tags of hot data are compared before those of cold data, exploiting the fact that most cache hits go to hot data. In addition, the power consumption of each tag comparison can be further reduced by dividing the tag comparison into two micro-steps, where a partial tag comparison is performed first and, only if it gives a partial hit, the remaining tag bits are compared. We applied the proposed scheme to an L2 cache with 10 programs from SPEC2000 and SPEC2006. Experimental results show average reductions of 23.69% and 8.58% in cache energy consumption compared with the conventional serial tag-data access and other existing methods, respectively.
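Two of the mechanisms, per-way miss filtering on partial tags and the two-micro-step tag comparison, can be sketched functionally. The filter size, hash function, and tag widths below are illustrative assumptions, not the paper's parameters:

```python
# Sketch: (1) a counting Bloom filter over partial tags detects definite
# per-way misses without touching the tag array; (2) tag comparison is
# split into a partial-tag step and, only on a partial hit, a full-tag step.
class WayFilter:
    """Counting Bloom filter over partial tags for one cache way."""
    def __init__(self, bits=64):
        self.bits = bits
        self.counts = [0] * bits

    def _h(self, partial_tag):
        return partial_tag % self.bits        # trivial hash, assumed

    def insert(self, partial_tag):  self.counts[self._h(partial_tag)] += 1
    def remove(self, partial_tag):  self.counts[self._h(partial_tag)] -= 1
    def may_contain(self, partial_tag):
        return self.counts[self._h(partial_tag)] > 0

PARTIAL_BITS = 6   # low-order tag bits used for the partial comparison (assumed)

def lookup(tag, stored_tags, filt):
    """Return (hit, partial_compares, full_compares)."""
    partial = tag & ((1 << PARTIAL_BITS) - 1)
    if not filt.may_contain(partial):
        return False, 0, 0                    # definite miss: no tag access at all
    partial_cmp = full_cmp = 0
    for stored in stored_tags:
        partial_cmp += 1
        if stored & ((1 << PARTIAL_BITS) - 1) == partial:
            full_cmp += 1                     # compare remaining bits only now
            if stored == tag:
                return True, partial_cmp, full_cmp
    return False, partial_cmp, full_cmp

filt = WayFilter()
stored = [0b1010_000001, 0b0110_000010]
for s in stored:
    filt.insert(s & ((1 << PARTIAL_BITS) - 1))
print(lookup(0b1010_000001, stored, filt))  # (True, 1, 1)
print(lookup(0b1111_111111, stored, filt))  # (False, 0, 0): filtered out
```

The energy saving comes from the zero-compare path on filtered misses and from the narrow partial comparison that gates the full one.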
This paper proposes a built-in self-test/self-diagnosis procedure for start-up of an on-chip network (NoC). Concurrent BIST operations are carried out after reset at each switch, thus resulting in test application time that scales well with network size. The key principle consists of exploiting the inherent structural redundancy of the NoC architecture in a cooperative way, thus detecting faults in the test pattern generators too. At-speed testing of stuck-at faults can be performed in less than 1200 cycles regardless of network size, with a hardware overhead of less than 11%.
The reliability of networks-on-chip (NoCs) is threatened by low yield and device wearout in aggressively scaled technology nodes. We propose ReliNoC, a network-on-chip architecture that can withstand failures while maintaining not only basic connectivity but also quality-of-service support based on packet priorities. Our network leverages a dual-physical-channel switch architecture that removes the control overhead of virtual channels (VCs) and utilizes the inherent redundancy within the two-channel switch to provide spares for faulty elements. Experimental results show that ReliNoC provides 1.5 to 3 times better network physical connectivity in the presence of multiple faults, and reduces the latency of both high- and low-priority traffic by 30 to 50% compared to a traditional VC architecture. Moreover, it can tolerate up to 50 faults within an 8x8 mesh at only 10% and 40% latency overhead on control and data packets, respectively, for PARSEC traces [24]. Synthesis results show that our reliable architecture incurs only 13% area overhead over the baseline two-channel switch.
In this paper, we address the problem of run-time resource management in non-ideal multiprocessor platforms where communication happens via a Network-on-Chip (NoC). More precisely, we propose a system-level fault-tolerant technique for application mapping that aims at optimizing overall system performance and communication energy consumption, while considering the occurrence of permanent, transient, and intermittent faults in the system. As the main theoretical contribution, we address the problem of spare core placement and its impact on the system's fault-tolerance (FT) properties. We then investigate several metrics and provide insight into the fault-aware resource management process for such non-ideal multiprocessor platforms. Experimental results show that our proposed resource management technique is efficient and highly scalable, and that significant throughput improvements can be achieved compared to existing solutions that do not consider failures in the system.
With the exponential growth in the number of transistors, not only may test data volume and test application time increase, but multiple faults may also exist in one chip. Test compaction has become a de-facto design-for-testability technique for reducing test cost. However, compacted test responses make multiple-fault diagnosis rather difficult. When there is no space compactor, the most likely suspect fault is the one producing failing responses most similar to those observed on the automatic test equipment. But when a compactor exists, those suspect faults may no longer have the same high likelihood of being the actual faults. To address this problem, we introduce a novel metric, explanation necessity. Using both the new metric and the traditional metric, explanation capability, we evaluate the likelihood that a suspect fault is an actual fault. For ISCAS'89 and ITC'99 benchmark circuits equipped with extreme space compactors, experimental results show that 98.8% of the top-ranked suspect faults hit the actual faults, outperforming a previous work by 11.3%.
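The interplay of the two metrics can be sketched as follows. The formulas here are a hypothetical precision/recall-style reading of the abstract (explanation capability rewards covering observed failures; explanation necessity penalizes suspects that predict failures never observed through the compactor); the paper's exact definitions may differ:

```python
def explanation_capability(observed, predicted):
    """Fraction of observed failing outputs that the suspect fault explains."""
    return len(observed & predicted) / len(observed) if observed else 1.0

def explanation_necessity(observed, predicted):
    """Fraction of the suspect's predicted failures actually observed on the
    tester; penalizes suspects that over-predict through the compactor."""
    return len(observed & predicted) / len(predicted) if predicted else 1.0

def rank_suspects(observed, suspects):
    """Rank suspect faults by the product of both metrics (a hypothetical
    combination chosen for illustration). 'suspects' maps fault name to
    its simulated set of failing compacted outputs."""
    def score(pred):
        return (explanation_capability(observed, pred)
                * explanation_necessity(observed, pred))
    return sorted(suspects, key=lambda f: score(suspects[f]), reverse=True)
```

A suspect that exactly reproduces the observed failures outranks one that explains them all but also predicts failures the tester never saw.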
Trace-based debug solutions help eliminate design errors that escape pre-silicon verification and have gained wide acceptance in industry. Existing techniques typically trace the same set of signals throughout each debug run, which is not very effective for catching design errors. In this work, we propose a multiplexed signal tracing strategy that significantly increases the debuggability of the circuit: we divide the tracing procedure in each debug run into a few periods and trace a different set of signals in each period. A novel trace-signal grouping algorithm is presented to maximize the probability of catching the evidence propagated from design errors, considering the design constraints of the trace interconnection fabric. Experimental results on benchmark circuits demonstrate the effectiveness of the proposed solution.
A critical concern for post-silicon debug is the need to control the chip at the clock-cycle level. In a single-clock chip, run-stop control can be implemented by gating the clock signal with a stop signal. However, data invalidation might occur in multiple-clock chips. In this paper, we analyze the possible data invalidation, including data repetition and data loss, when stopping and resuming a multiple-clock chip. Furthermore, we propose an efficient solution to eliminate both data repetition and data loss. Theoretical analysis and simulation experiments are both conducted for the proposed solution. We implement the proposed Design-for-Debug (DfD) circuit in SMIC 0.18μm technology and simulate the data transfer across clock domains using SPICE. The results show that both data repetition and data loss can be avoided with the proposed solution, even if metastability occurs.
Many applications contain loops with an undetermined number of iterations. These loops have to be parallelized in order to increase the throughput when executed on an embedded multiprocessor platform. This paper presents a method to automatically extract a parallel task graph, based on function-level parallelism, from a sequential nested loop program with while loops. In the parallelized task graph, loop iterations can overlap during execution. We introduce the notion of a single assignment section so that we can exploit single assignment to overlap iterations of the while loop during the execution of the parallel task graph. Synchronization is inserted in the parallelized task graph to ensure the same functional behavior as the sequential nested loop program. It is shown that the generated parallel task graph does not introduce deadlock. A DVB-T radio receiver, where the user can switch channels after an undetermined amount of time, illustrates the approach.
Nowadays, the Graphics Processing Unit (GPU), a massively parallel processor, is widely used for general-purpose computing tasks. Although mature development tools exist, writing GPU programs is not a trivial task for programmers. Based on this consideration, we propose a novel parallel computing architecture. The architecture includes a parallel programming model, named Gemma, and a programming framework, named April. Gemma is based on generalized matrix operations and helps alleviate the difficulty of describing parallel algorithms. April is a high-level framework that can compile and execute tasks described in Gemma with OpenCL. In particular, April can automatically 1) choose the best parallel algorithm and mapping scheme and generate OpenCL kernels, and 2) schedule Gemma tasks based on execution costs such as data storage and transfer. Our experimental results show that, with competitive performance, April considerably reduces the programs' code length compared with OpenCL.
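To give a flavor of what "generalized matrix operations" could mean, here is a hypothetical sketch (the actual Gemma API is not described in the abstract): the familiar triple-loop GEMM structure with the scalar operators abstracted out, which is exactly the kind of regular pattern a framework like April could map onto OpenCL work-items, one per output element.

```python
def generalized_matmul(A, B, mul, add, zero):
    """GEMM skeleton with user-supplied scalar operators:
    C[i][j] = add-reduction over l of mul(A[i][l], B[l][j])."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[zero] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = zero
            for l in range(k):
                acc = add(acc, mul(A[i][l], B[l][j]))
            C[i][j] = acc
    return C

# Ordinary matrix product:
product = generalized_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]],
                             mul=lambda a, b: a * b,
                             add=lambda a, b: a + b, zero=0)

# Min-plus ("tropical") product: one relaxation step of all-pairs
# shortest paths when the operands are distance matrices.
INF = float("inf")
shortest = generalized_matmul([[0, 1], [4, 0]], [[0, 1], [4, 0]],
                              mul=lambda a, b: a + b,
                              add=min, zero=INF)
```

Swapping the scalar operators turns one kernel skeleton into many distinct parallel algorithms, which is what makes a matrix-operation model attractive for automatic OpenCL code generation.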
Today's high-performance embedded computing applications are posing significant challenges for processing throughput. Traditionally, such applications have been realized on application-specific integrated circuits (ASICs) and/or digital signal processors (DSPs). However, the performance and power advantage of ASICs often cannot justify the rapidly increasing fabrication cost, while current DSPs offer a limited processing throughput, usually lower than 100 GFLOPS. On the other hand, current multi-core processors, especially graphics processing units (GPUs), deliver very high computing throughput while maintaining high flexibility and programmability. It is thus appealing to study the potential of GPUs for high-performance embedded computing. In this work, we perform a comprehensive performance evaluation of GPUs with the high-performance embedded computing (HPEC) benchmark suite, which consists of a broad range of signal processing benchmarks with an emphasis on radar processing applications. We develop efficient GPU implementations that outperform previous results for all the benchmarks. In addition, a systematic instruction-level analysis of the GPU implementations is conducted with a GPU micro-architecture simulator. The results provide key insights for optimizing GPU hardware and software. We also compare the performance and power efficiency of GPU and DSP with the HPEC benchmarks. The comparison reveals that the major hurdle for GPU applications in embedded computing is the GPU's relatively low power efficiency.
Keywords: GPU, multi-core, HPEC benchmark, DSP, parallel computing, Fermi, GFLOPS
The evolution to Manycore platforms is real, both in the high-performance computing domain and in embedded systems. Starting from ten or more cores today, we will see the evolution to many tens of cores, and to platforms with 100 or more, in the next few years. These platforms are heterogeneous, homogeneous, or a mixture of subsystems of both types, ranging from relatively generic to quite application-specific, and they are applied in many different application areas. When we consider the design, verification, software development, and debugging requirements for applications on these platforms, the need for virtual platform technologies for Manycore systems grows quickly as the systems evolve. As we move to Manycore, the key issue is simulation speed: keeping pace with target complexity using host-based simulation is a major challenge. New instruction-set simulation technologies, such as compiled, JIT, DBT, sampling, abstract, hybrid, and parallel simulation, have all emerged in the last few years to match the growth in complexity and requirements. At the same time, we have seen consolidation in the virtual platform industrial sector, leading to some concerns about whether the market can support the continued development of the innovations needed to deliver the required performance. This special session deals with Manycore virtual platforms from several different perspectives, highlighting new research approaches for high-speed simulation, tool and IP marketing opportunities, as well as the real-life virtual platform needs of industrial end users.
Panelists: F. Cerisier, S. Davidmann, L. Ducuosso, J. Engblom, and A. Mayer
The complexity of today's embedded software is steadily increasing. The growing number of processors in a system, and the increased communication and synchronization between all components, require scalable debug and test methods for each component as well as for the system as a whole. Given today's cost and time-to-market sensitivity, it is important to find and debug errors as early as possible and to increase the degree of test and debug automation to avoid losses in quality, cost, and time. These challenges require not only new tools and methodologies but also organizational changes, since hardware and software developers have to work together to achieve the necessary productivity and quality gains.
This panel brings together users and solution providers experienced in debugging embedded systems to discuss requirements for robust systems that are easy to debug.
Keywords: embedded systems, software debugging, model-based software debug, test automation, virtual prototype, hardware-software co-verification, silicon debug, debug standards
This paper deals with system-level design considerations for mm-size implantable electronic devices with wireless connectivity. In particular, it focuses on neural sensors as one application requiring such miniature interfaces. Common to all these implants is the need for a power supply and a wireless interface. Wireless power transfer via electromagnetic fields is identified as a promising option for powering such devices. Design methodologies, system-level trade-offs, and limitations of power supply systems based on electromagnetic coupling are discussed in detail. Further, various wireless data communication architectures are evaluated for their feasibility in this application. Reflective impulse radios are proposed as an alternative scheme for enabling highly scalable data transmission at <1 pJ/bit. Finally, design considerations for the corresponding reader system are addressed.
Keywords: implantable neural sensors, brain-machine interfaces, wireless power transfer, ultra low power, data communication
This paper presents the design of a 2.4 GHz antenna and a BAW filter for cardiac implants in the ISM band. Both components are sensitive to their environment. The modeling of the antenna inside the human body is presented in order to characterize its impedance. Connecting the BAW filter to a substrate modifies its impedance, so the link between the two components is the key to the radio-frequency transmission. The antenna-filter combination exhibits a standing wave ratio better than 2 and a maximum insertion loss of 5.6 dB in the 2.4-2.48 GHz frequency band.
Keywords: filter, antenna, modeling, cardiac implants, co-design
Emerging non-volatile memory (NVM) technologies have been maturing in recent years and have demonstrated great potential for universal memory hierarchy design. Among all the technology candidates, resistive random-access memory (RRAM) is considered the most promising, as it operates faster than phase-change memory (PCRAM) and has a simpler and smaller cell structure than magnetic memory (MRAM or STT-RAM). In contrast to a conventional MOS-accessed memory cell, memristor-based RRAM has the potential to form a cross-point structure without using access devices, achieving ultra-high density. The cross-point structure, however, brings extra challenges to the peripheral circuitry design. In this work, we study memristor-based RRAM array design and focus on the choice of peripherals to achieve the best trade-off among performance, energy, and area. In addition, a system-level model is built to estimate performance, energy, and area values.
SRAMs based on tunneling field-effect transistors (TFETs) consume very low static power, but the unidirectional conduction inherent to TFETs calls for special care when designing the SRAM cell. In this work, we make the following contributions. (i) We perform the first study of 6T TFET SRAMs based on both n-type and p-type access transistors and determine that only inward p-type TFETs are suitable as access transistors. However, even with inward p-type access transistors, the 6T TFET SRAM performs only the write or the read operation reliably, not both. (ii) In order to improve the reliability of 6T TFET SRAMs, we perform the first study of four leading write-assist (WA) and four leading read-assist (RA) techniques in TFET SRAMs. We conclude that the 6T TFET SRAM with GND-lowering RA is the most reliable 6T TFET SRAM during write and read, and we verify that it is also robust under process variations. It also achieves the best performance and reliability, as well as the least static power and area, in comparison to other existing TFET SRAM structures. Further, it not only has performance and reliability comparable to the 32nm 6T CMOS SRAM, but also consumes 6-7 orders of magnitude lower static power, making it attractive for low-power high-density SRAM applications.
Scratch Pad Memory (SPM), a software-controlled on-chip memory, has been widely adopted in many embedded systems due to its small area and low power consumption. As technology scaling reaches the deep sub-micron level, leakage energy consumption is surpassing dynamic energy consumption and becoming a critical issue. In this paper, we propose a novel hybrid SPM which consists of non-volatile memory (NVM) and SRAM, taking advantage of the ultra-low leakage power and high density of NVM as well as the efficient writes of SRAM. A novel dynamic data allocation algorithm is proposed to exploit the full potential of both NVM and SRAM. According to the experimental results, with the help of the proposed algorithm, the hybrid SPM architecture reduces memory access time by 18.17%, dynamic energy by 24.29%, and leakage power by 37.34% on average compared with a pure SRAM-based SPM of the same area.
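The core allocation idea behind such a hybrid SPM can be sketched with a greedy heuristic. This is an illustration only: the paper's dynamic algorithm is not specified in the abstract, and the per-access energy constants below are made up. The intuition it captures is that write-heavy objects gain most from SRAM, because NVM writes are far more expensive than NVM reads.

```python
def allocate_spm(objects, sram_cap, nvm_cap,
                 e_sram_rw=1.0, e_nvm_read=0.8, e_nvm_write=5.0):
    """Greedy placement of data objects into a hybrid SPM.
    'objects' is a list of (name, size, reads, writes). Objects are
    sorted by the energy saved per byte when moved from NVM to SRAM;
    whatever fits nowhere spills to off-chip DRAM. Energy units are
    illustrative, not measured values."""
    def benefit_per_byte(obj):
        _, size, reads, writes = obj
        e_nvm = reads * e_nvm_read + writes * e_nvm_write
        e_sram = (reads + writes) * e_sram_rw
        return (e_nvm - e_sram) / size

    placement = {}
    for name, size, _, _ in sorted(objects, key=benefit_per_byte,
                                   reverse=True):
        if size <= sram_cap:
            placement[name] = "SRAM"
            sram_cap -= size
        elif size <= nvm_cap:
            placement[name] = "NVM"
            nvm_cap -= size
        else:
            placement[name] = "DRAM"  # spill to off-chip memory
    return placement
```

A frequently written buffer lands in SRAM, while a read-only table is happily served by NVM at near-zero leakage.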
Power consumption is increasing dramatically for static random access memory field-programmable gate arrays (SRAM FPGAs); therefore, lower-power FPGA circuitry and new CAD tools are needed. Clock-gating methodologies have been applied in low-power FPGA designs with only minor success in reducing total average power consumption. In this paper, we develop a new structural clock-gating technique based on internal partial reconfiguration and topological modifications. The solution is based on the dynamic partial reconfiguration of the configuration memory frames related to the clock routing resources. For a set of design cases, figures for static and dynamic power consumption were obtained. The analyses have been performed on a synchronous FIFO and on an r-VEX VLIW processor. The experimental results show that the reduction in total average power consumption ranges from about 28% to 39% with respect to standard clock-gating approaches. Moreover, the proposed method is not intrusive and incurs very limited area overhead.
In embedded digital signal processing (DSP) systems, quality is set by a signal-to-noise ratio (SNR) floor. Conventional digital design strategies guarantee timing correctness of all operations, which leaves large quality margins in practical systems and sacrifices energy efficiency. This paper presents techniques to significantly improve energy efficiency by shaping the quality-energy tradeoff achievable via VDD scaling. In an unoptimized design, such scaling leads to rapid loss of quality due to the onset of timing errors. We introduce techniques that modify the behavior of the earliest and worst timing-error offenders to allow for larger VDD reduction. We demonstrate the effectiveness of the proposed techniques on a 2D-IDCT design synthesized using a 45nm standard cell library. The experiments show that up to 45% energy savings can be achieved at a cost of 10 dB in peak signal-to-noise ratio (PSNR). The resulting PSNR remains above 30 dB, a commonly accepted value for lossy image and video compression. Achieving the same energy savings by direct VDD scaling without the proposed transformations results in a 35 dB PSNR loss. The overhead of the needed control logic is less than 3% of the original design.
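At deployment time, the quality-energy trade-off the paper exploits reduces to picking an operating point on a characterized curve. A minimal sketch, using a hypothetical VDD/energy/PSNR profile (not the paper's measured data):

```python
def pick_operating_point(profile, psnr_floor_db=30.0):
    """Choose the lowest-energy (vdd, energy, psnr) point that still
    meets the PSNR floor; raises ValueError if none qualifies."""
    feasible = [p for p in profile if p[2] >= psnr_floor_db]
    if not feasible:
        raise ValueError("no operating point meets the PSNR floor")
    return min(feasible, key=lambda p: p[1])

# Hypothetical characterization of a 2D-IDCT-like block:
# (VDD in volts, energy per frame in nJ, PSNR in dB)
profile = [(1.0, 100.0, 48.0), (0.9, 80.0, 44.0),
           (0.8, 62.0, 35.0), (0.7, 50.0, 22.0)]
```

With a 30 dB floor, the 0.7 V point is rejected and the design settles at 0.8 V, trading surplus quality margin for energy.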
Inexact circuits, in which the accuracy of the output can be traded for energy or delay savings, have been receiving increasing attention of late, due to the unavoidable inaccuracies of designs as Moore's law approaches the low-nanometer range and a concomitant growing desire for ultra-low-energy systems. In this paper, we present a novel design-level technique called probabilistic pruning to realize inexact circuits. Unlike previous techniques in the literature, which relied mostly on some form of scaling of operational parameters such as the supply voltage (Vdd) to trade energy for accuracy, our technique prunes portions of the circuit that have a lower probability of being active, using this as the basis for architectural modifications that yield significant savings in energy, delay, and area. Our approach yields greater savings than conventional voltage scaling schemes for similar error values. Extensive simulations using this pruning technique, in a novel logic-synthesis-based CAD framework, on various 64-bit adder architectures demonstrate that normalized gains as great as 2X-7.5X in the energy-delay-area product can be obtained, with relative error percentages ranging from as low as 10^-6% up to 10%, compared to corresponding conventionally correct designs.
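The pruning criterion can be sketched abstractly. This is a hypothetical model: here "activity" stands for the probability that a gate influences an observable output, and the union bound on output error is an illustration, not the paper's error analysis.

```python
def probabilistic_prune(gates, activity, p_threshold):
    """Drop gates whose probability of affecting the output is below
    p_threshold (their outputs would be tied to constants in hardware).
    'gates' maps gate name -> (area, energy); 'activity' maps gate
    name -> probability the gate influences the observable output.
    Returns (kept gates, fractional area savings, union-bound error)."""
    pruned = [g for g in gates if activity[g] < p_threshold]
    kept = {g: gates[g] for g in gates if g not in pruned}
    total_area = sum(a for a, _ in gates.values())
    saved_area = sum(gates[g][0] for g in pruned)
    err_bound = min(1.0, sum(activity[g] for g in pruned))
    return kept, saved_area / total_area, err_bound
```

Sweeping `p_threshold` traces out the error-versus-savings curve that the abstract reports as 2X-7.5X energy-delay-area gains at bounded relative error.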
Micro-scale energy harvesting has become an increasingly viable and promising option for powering ultra-low-power systems. A power converter is a key component in micro-scale energy harvesting systems. Various design parameters of the power converter, most notably the number of stages in a multi-stage power converter, play a crucial role in determining how much electrical power can be extracted from a micro-scale energy transducer such as a miniature solar cell. Existing stage-number optimization techniques for switched-capacitor power converters, when used for energy harvesting systems, result in a substantial degradation of the harvested electrical power. To address this problem, this paper proposes a new stage-number optimization technique for switched-capacitor power converters that maximizes the net harvested power in micro-scale energy harvesting systems. The proposed technique is based on a new figure of merit that is well suited to energy harvesting systems. We have validated the proposed technique through circuit simulations using IBM 65nm technology. Our simulation results demonstrate that the proposed stage-number optimization technique increases the net harvested power by 60-290% compared to existing stage-number optimization techniques.
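The flavor of stage-number optimization can be shown with a toy switched-capacitor charge-pump model. The voltage-ratio conversion loss is standard idealized SC behavior, but the per-stage loss term and all parameter values are illustrative, not the paper's figure of merit:

```python
def net_harvested_power(n, v_in, v_out, i_load, p_stage_loss):
    """Toy model: an n-stage charge pump ideally produces (n+1)*v_in.
    If that cannot reach v_out, no power is delivered; any excess
    voltage ratio is burned as intrinsic SC conversion loss, and each
    stage adds a fixed switching loss."""
    v_ideal = (n + 1) * v_in
    if v_ideal < v_out:
        return 0.0
    p_conv_loss = (v_ideal - v_out) * i_load  # intrinsic SC loss
    return v_out * i_load - p_conv_loss - n * p_stage_loss

def best_stage_count(v_in, v_out, i_load, p_stage_loss, max_stages=16):
    """Pick the stage count maximizing the net harvested power."""
    return max(range(1, max_stages + 1),
               key=lambda n: net_harvested_power(n, v_in, v_out,
                                                 i_load, p_stage_loss))
```

Even in this toy model, too few stages deliver nothing while extra stages only add conversion and switching losses, so the optimum sits at the smallest count that reaches the output voltage.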
Delay-insensitive asynchronous on-chip communication links are a key element in realizing highly reliable asynchronous Network-on-Chip systems. However, even a single permanent fault, such as an interconnect fault, causes a deadlock state in the system. This paper presents an interconnect-fault-resilient delay-insensitive asynchronous communication link based on current-flow monitoring. Since the current flow on an interconnect is cut off by an open fault in the interconnect, the current is fed back to the transmitter, which increases a feedback current monotonically. Monitoring this feedback current makes it possible to detect the interconnect fault while preserving delay insensitivity. The proposed link is evaluated in a 0.13μm CMOS technology against a triple modular redundancy (TMR)-based asynchronous communication link, which is resilient to interconnect faults but not delay-insensitive. As a result, the energy consumption and the number of wires of the proposed link are reduced to 57% and 33%, respectively, of those of the conventional one.
Continuing to scale CMP performance at reasonable power budgets has forced chip designers to consider emerging silicon-photonic technologies as the primary means of on- and off-chip communication. Different designs for chip-scale photonic interconnects have been proposed, and system-level simulations have shown them to be far superior to purely electronic network solutions. However, specifying the exact geometries for all the photonic devices used in these networks is currently a time-consuming and difficult manual process. We present VANDAL, a layout tool which provides a user with semi-automatic assistance for placing silicon photonic devices, modifying their geometries, and routing waveguides for hierarchically building photonic networks. VANDAL also includes SCILL, a scripting language that can be used to automate photonic device place and route for repeatability, automation, verification, and scaling. We demonstrate some of the features and flexibility of the CAD environment with a case study, designing modulator and detector banks for integrated photonic links.
A state-of-the-art System-on-Chip (SoC) consists of hundreds of processing elements, and trends in the design of the next generation of SoCs point to the integration of thousands of processing elements, requiring high-performance interconnect for high-throughput communication. Optical on-chip interconnects are currently considered one of the most promising paradigms for the design of such next-generation Multi-Processor Systems-on-Chip (MPSoCs). They enable significantly increased bandwidth, increased immunity to electromagnetic noise, decreased latency, and decreased power. Defining new architectures that take advantage of optical interconnects therefore represents a key issue for MPSoC designers today. Moreover, new design methodologies considering the design constraints specific to these architectures are mandatory. In this paper, we present a new contention-free architecture based on an optical network-on-chip, called Optical Ring Network-on-Chip (ORNoC). We also show that our network scales well with both large 2D and 3D architectures. For efficient design, we propose automatic wavelength/waveguide assignment and demonstrate that the proposed architecture is capable of connecting 1296 nodes with only 102 waveguides and 64 wavelengths per waveguide.
This work proposes a wafer-probe parametric test-set optimization method for predicting dies that are likely to fail in the field, based on known in-field or final-test fails. Large volumes of wafer probe data across 5 lots and hundreds of parametric measurements are analyzed to find test sets that help predict actually observed test escapes and final-test failures. Simple rules are generated to explain how test limits can be tightened at wafer probe to prevent test escapes and final-test fails with minimal overkill. The proposed method is evaluated on wafer probe data from a current automotive IC with near-zero-DPPM requirements, resulting in improved test quality and reduced test cost.
Lithographic process variations, such as changes in focus, exposure, and resist thickness, introduce distortions to line shapes on a wafer. Large distortions may lead to line-open and bridge faults, and the locations of such defects vary with the lithographic process corner. Lithographic simulation easily verifies that, for a given layout, changing one or more of the process parameters shifts the defect locations. Thus, if the lithographic process corner of a die is known, test patterns can be better targeted at both hard and parametric defects. In this paper, we present the design of control structures such that preliminary testing of these structures can uniquely identify the manufacturing process corner. Once the process corner is known, we can easily attain the highest possible fault coverage for lithography-related defects during manufacturing test. Parametric defects such as delay defects are notoriously difficult to test because they may affect paths that are subcritical under nominal conditions and not ordinarily targeted for test. The proposed approach can easily flag such paths for delay tests.
Keywords: photolithography, defocus, resistance, process corner analysis, test pattern optimization
In this paper, an alternative test method for MEMS convective accelerometers is presented. It is first demonstrated that device sensitivity can be determined by simple electrical measurements, without the use of physical test stimuli. Using a previously developed behavioral model that allows efficient Monte-Carlo simulations, we have established a good correlation between electrical test parameters and device sensitivity. The proposed test method is finally evaluated for different strategies that prioritize yield, fault coverage, or test efficiency.
Keywords: MEMS testing, convective accelerometer, alternative electrical test
In semiconductor manufacturing, a wealth of wafer-level measurements, generally termed inline data, are collected from various on-die and between-die (kerf) test structures and are used to provide characterization engineers with information on the health of the process. While it is generally believed that these measurements also contain valuable information regarding die performances, the vast amount of inline data collected often thwarts efficient and informative correlation with final test outcomes. In this work, we develop a data mining approach to automatically identify and explore correlations between inline measurements and final test outcomes in analog/RF devices. Significantly, we do not depend on statistical methods in isolation, but incorporate domain expert feedback into our algorithm to identify and remove spurious autocorrelations which are frequently present in semiconductor manufacturing data. We demonstrate our method using data from an analog/RF product manufactured in IBM's 90nm low-power process, on which we successfully identify a set of key inline parameters correlating to module final test (MFT) outcomes.
The increasing electrode density in multi-electrode arrays and the use of new materials for electrode fabrication are motivating the migration from passive to active neuroprobes. Numerous circuit design challenges for the implementation of optimal integrated neural recording systems are still present and need to be addressed. In this paper we present the systematic design of a programmable low-noise multi-channel neural interface that can be used for the recording of neural activity in in vitro and in vivo experiments. The design methodology includes modeling and simulation of important parameters, allowing the definition, optimization and testing of the architecture and the circuit blocks. In the proposed architecture, individual channel programmability is provided in order to address different neural signals and electrode characteristics. A 16-channel fully-differential architecture is fabricated in a 0.35 μm CMOS technology, with a die size of 5.6 mm x 4.5 mm. Gains (40-75.6 dB) and band-pass filter cut-off frequencies (1-6000 Hz) can be digitally programmed using 7 bits per channel and a serial interface. The circuit consumes a maximum of 1.8 mA from a 3.3 V supply and the measured input-referred noise is between 2.3 and 2.9 μVrms for the different configurations. We successfully performed simultaneous recordings of action potential signals, using different electrode characteristics in in vitro experiments.
Wireless body sensor networks (WBSN) hold the promise to enable next-generation patient-centric mobile-cardiology systems. A WBSN-enabled electrocardiogram (ECG) monitor consists of wearable, miniaturized and wireless sensors able to measure and wirelessly report cardiac signals to a WBSN coordinator, which is responsible for reporting them to the tele-health provider. However, state-of-the-art WBSN-enabled ECG monitors still fall short of the required functionality, miniaturization and energy efficiency. Among others, energy efficiency can be significantly improved through embedded ECG compression, which reduces airtime over energy-hungry wireless links. In this paper, we propose a novel real-time energy-aware ECG monitoring system based on the emerging compressed sensing (CS) signal acquisition/compression paradigm for WBSN applications. For the first time, CS is demonstrated as an advantageous real-time and energy-efficient ECG compression technique, with a computationally light ECG encoder on the state-of-the-art Shimmer™ wearable sensor node and a real-time decoder running on an iPhone (acting as a WBSN coordinator). Interestingly, our results show an average CPU usage of less than 5% on the node, and of less than 30% on the iPhone.
High-end multicore processors are characterized by high power density with significant spatial and temporal variability. This leads to power and temperature hot-spots, which may cause non-uniform ageing and accelerated chip failure. These critical issues can be tackled on-line by closed-loop thermal and reliability management policies. Model predictive controllers (MPC) outperform classic feedback controllers, since they are capable of minimizing a cost function while enforcing a safe working temperature. Unfortunately, basic MPC controllers rely on a-priori knowledge of the multicore thermal model, and their complexity grows exponentially with the number of controlled cores. In this paper we present a scalable, fully-distributed, energy-aware thermal management solution. The model-predictive controller complexity is drastically reduced by splitting it into a set of simpler interacting controllers, each allocated to one core in the system. Locally, each node selects the optimal frequency to meet temperature constraints while minimizing the performance penalty and system energy. Global optimality is achieved by letting controllers exchange a limited amount of information at run-time on a neighbourhood basis. We address model uncertainty by supporting learning of the thermal model with a novel distributed self-calibration approach that matches the controller architecture well.
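One round of such a distributed policy can be sketched as follows. This is a toy linear thermal model with a cubic power law; the coefficients, frequency levels, and the greedy local rule are illustrative stand-ins for the paper's MPC formulation, and the only information exchanged between cores is each neighbour's last chosen power.

```python
def distributed_thermal_step(freqs, neighbors, t_amb, t_max,
                             k_self=2.0, k_neigh=0.5,
                             levels=(1.0, 1.5, 2.0, 2.5)):
    """One round: every core greedily picks the highest frequency level
    whose predicted steady-state temperature stays below t_max, using
    only its neighbours' previously chosen power (P ~ f^3)."""
    power = {core: f ** 3 for core, f in freqs.items()}
    chosen = {}
    for core in freqs:
        neigh_heat = k_neigh * sum(power[n] for n in neighbors[core])
        for f in sorted(levels, reverse=True):
            if t_amb + k_self * f ** 3 + neigh_heat <= t_max:
                chosen[core] = f
                break
        else:
            chosen[core] = min(levels)  # thermal emergency: lowest level
    return chosen
```

Iterating this round to a fixed point mimics the neighbourhood information exchange described above: each core throttles just enough to respect the temperature cap given the heat its neighbours inject.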
Small autonomous embedded systems powered by energy harvesting techniques have gained momentum in industry and research. This paper presents a simple, yet effective and complete, energy harvesting solution which permits the exploitation of an arbitrary number of ambient energy sources. The proposed modular architecture collects energy from each of the connected harvesting subsystems in a concurrent and independent way. The possibility of connecting a lithium-ion or nickel-metal hydride rechargeable battery protects the system against long periods of ambient energy shortage and improves its overall dependability. The simple, fully analogue design of the power management and battery monitoring circuits minimizes the component count and the parasitic consumption of the harvester. Numerical simulation of the system behavior allows an in-depth analysis of its operation under different environmental conditions and validates the effectiveness of the design.
Component-based validation techniques for parallel and distributed embedded systems should be able to deal with heterogeneous components, interactions, and specification mechanisms. This paper describes various approaches that allow the composition of subsystems with different execution and interaction semantics by combining computational and analytic models. In particular, this work shows how finite state machines, timed automata, and methods from classical real-time scheduling theory can be embedded into MPA (modular performance analysis), a contemporary framework for system-level performance analysis. The result is a powerful tool for compositional performance validation of distributed real-time systems.
Modern FPGAs enable complete system designs that include processors, interconnect systems, memory subsystems, and a number of application functions that are implemented using High-Level Synthesis tools.
Keywords: HDTV systems, processor subsystem, image processing applications, High-Level Synthesis
Designing multi-processor systems-on-chips becomes increasingly complex, as more applications with real-time requirements execute in parallel. System resources, such as memories, are shared between applications to reduce cost, causing their timing behavior to become inter-dependent. Using conventional simulation-based verification, this requires all concurrently executing applications to be verified together, resulting in a rapidly increasing verification complexity. Predictable and composable systems have been proposed to address this problem. Predictable systems provide bounds on performance, enabling formal analysis to be used as an alternative to simulation. Composable systems isolate applications, enabling them to be verified independently. Predictable and composable systems are built from predictable and composable resources. This paper presents three general techniques to implement and model predictable and composable resources, and demonstrates their applicability in the context of a memory controller. The architecture of the memory controller is general and supports both SRAM and DDR2/DDR3 SDRAM and a wide range of arbiters, making it suitable for many predictable and composable systems. The modeling approach is based on a shared-resource abstraction that covers any combination of supported memory and arbiter and enables system-level performance analysis with a variety of well-known frameworks, such as network calculus or data-flow analysis. Index Terms - predictability; composability; memory controller; memory patterns; real-time; SDRAM; arbitration; latency-rate servers
Advanced SoCs integrate a diverse set of system functions that pose different requirements on the SoC infrastructure. Predictable integration of such SoCs, with guaranteed Quality-of-Service (QoS) for the real-time functions, is becoming increasingly challenging. We present a structured approach to predictable integration based on a combination of architectural principles and associated analysis techniques. We identify four QoS classes and define the type of QoS guarantees to be supported for the two classes targeted at real-time functions. We then discuss how a SoC infrastructure can be built that provides such QoS guarantees on its interfaces and how network calculus can be applied for analyzing worst-case performance and sizing of buffers. Benefits of our approach are predictable performance and improved time-to-market, while avoiding costly over-design.
Keywords - SoC infrastructure; system integration; real-time; predictability; Quality-of-Service; network calculus;
The design of high-performance servers has always been a challenging art. Now, server designers are being asked to explore a much larger design space as they consider multicore heterogeneous architectures and the limits of advancing silicon technology. Bringing automation to the early stages of design can enable more rapid and accurate trade-off analysis. In this paper, we introduce an Early Chip Planner which allows designers to rapidly analyze microarchitecture, physical and package design trade-offs for 2D and 3D VLSI chips and generates an attributed netlist to be carried on to the implementation stage. We also describe its use in planning a 3D special-purpose server processor.
Keywords-system level design automation; early chip planning
Assuming continuous cell sizes, we have robustly achieved global minimization of the total transistor sizes needed to achieve a delay goal, thus minimizing dynamic power (and reducing leakage power). We then developed a feasible branch-and-bound algorithm that maps the continuous sizes to the discrete sizes available in the standard cell library. Results show that a typical library gives results close to the optimal continuous-size results. After using state-of-the-art commercial synthesis, the application of our discrete size selection tool results in a dynamic power reduction of 40% (on average) for large industrial designs.
Keywords- power-delay optimization, discrete cell-size selection, delay modelling, parallelism
Packet classification has long been a fundamental processing pattern of modern networking devices. Today's high-performance routers use specialized hardware for packet classification, but such solutions suffer from prohibitive cost, high power consumption, and poor extensibility. On the other hand, software-based routers offer the best flexibility, but can only deliver limited performance (<10 Gbps). Recently, graphics processing units (GPUs) have proven to be an efficient accelerator for software routers. In this work, we propose a GPU-based linear search framework for packet classification. The core of our framework is a metaprogramming technique that dramatically enhances execution efficiency. Experimental results show that our solution can outperform a CPU-based solution by a factor of 17 in terms of classification throughput. Our technique is scalable to large rule sets consisting of over 50K rules and thus provides a solid foundation for future applications of packet content inspection.
Keywords- Packet Classification; Software Router; GPU; CUDA; Metaprogramming
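As background to the linear-search approach, a minimal CPU-side reference of rule matching can be sketched as follows. The rule fields, priorities, and addresses are hypothetical placeholders; the paper's GPU kernels and metaprogramming machinery are not reproduced here.

```python
# Illustrative CPU reference for linear-search packet classification.
# Rules and their fields are hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    priority: int
    src_prefix: tuple  # (value, prefix_len) over 32-bit addresses
    dst_prefix: tuple
    proto: Optional[int]  # None means wildcard

def prefix_match(addr: int, prefix: tuple) -> bool:
    value, length = prefix
    if length == 0:
        return True  # zero-length prefix matches everything
    shift = 32 - length
    return (addr >> shift) == (value >> shift)

def classify(rules, src: int, dst: int, proto: int):
    """Return the highest-priority matching rule via linear search."""
    best = None
    for r in rules:
        if not prefix_match(src, r.src_prefix):
            continue
        if not prefix_match(dst, r.dst_prefix):
            continue
        if r.proto is not None and r.proto != proto:
            continue
        if best is None or r.priority > best.priority:
            best = r
    return best
```

A GPU version would evaluate all rules for a batch of packets in parallel; the sequential loop above is only the functional reference.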
Modern batteries (e.g., Li-ion batteries) provide high discharge efficiency, but the rate capacity effect in these batteries drastically decreases the discharge efficiency as the load current increases. Electric double layer capacitors, or simply supercapacitors, have extremely low internal resistance, and a battery-supercapacitor hybrid may mitigate the rate capacity effect for high pulsed discharging current. However, a hybrid architecture comprising a simple parallel connection does not perform well when the supercapacitor capacity is small, which is a typical situation because of the low energy density and high cost of supercapacitors. This paper presents a new battery-supercapacitor hybrid system that employs a constant-current charger. The constant-current charger isolates the battery from the supercapacitor to improve the end-to-end efficiency for energy from the battery to the load, while accounting for the rate capacity effect of Li-ion batteries and the conversion efficiencies of the converters.
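The rate capacity effect is commonly approximated with Peukert's law; the sketch below, using an illustrative Peukert exponent rather than measured battery constants, shows how delivered capacity shrinks as the load current grows, which is the behavior a constant-current charger helps avoid.

```python
def delivered_capacity(rated_ah, rated_current, load_current, k=1.2):
    """Peukert-style approximation of the rate capacity effect:
    effective capacity shrinks as discharge current grows.
    The exponent k here is illustrative, not a measured value."""
    # Discharge time: t = (rated_ah / rated_current) * (rated_current / load_current)^k
    t = (rated_ah / rated_current) * (rated_current / load_current) ** k
    return t * load_current  # Ah actually delivered at this current

# A constant-current charger lets the battery see a steady, moderate
# current even when the load pulses, keeping delivered capacity high.
```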
A strong dI/dt event in a VLSI circuit can induce a temporary voltage drop and consequent malfunctioning of logic, such as failing speed paths. This event, called power droop, usually manifests itself in at-speed scan test, where a surge in switching activity (capture phase) follows a period of quiescent circuit state (shift phase). Power droop is also present during mission-mode operation. However, because of the less predictable occurrence of switching events in mission mode, the values of power droop measured during test usually differ from those measured in mission mode. To overcome the power droop problem, different mitigation techniques have been proposed. The goal of these techniques is to create a uniform current demand throughout the test. This paper proposes a feedback-based droop mitigation technique which can adapt to the droop by reading the level of VDD and modifying in real time the current flowing through ad-hoc droop mitigators. It is shown that the proposed solution not only compensates for droop events occurring during test mode but can also be used as a method of mission-mode droop mitigation and yield enhancement if higher power consumption is acceptable.
Keywords: droop, mitigation techniques, ATPG, power supply;
This paper concerns the design and optimization of a digital hearing aid application. It aims to show that a suitably adapted ASIP can be constructed to create a highly optimized solution for the wide variety of complex algorithms that play a role in this domain. These algorithms are configurable to fit the various hearing impairments of different users. They pose significant challenges to digital hearing aids, which have strict area and power consumption constraints. First, a typical digital hearing aid application is proposed and implemented, comprising all critical parts of today's products. Then a small-area, ultra-low-power 16-bit processor is designed for the application domain. The resulting hearing aid system achieves a power reduction of ≥ 56x over the RISC implementation and can operate for > 300 hours on a typical battery.
This paper presents HypoEnergy, a framework for extending the hybrid battery-supercapacitor power supply lifetime. HypoEnergy combines high energy density and reliable workload supportability of an electrochemical battery with high power density and high number of recharge cycles of supercapacitors. The lifetime optimizations consider nonlinear battery characteristics and supercapacitors' charging overhead. HypoEnergy-KI studies the hybrid supply lifetime optimization for a preemptively known workload and for one ideal supercapacitor. We show a mapping of HypoEnergy-KI to the multiple-choice knapsack problem and use dynamic programming to address the problem. HypoEnergy-KN considers the optimization for the known workload but in the case of having a non-ideal supercapacitor bank that leaks energy. Evaluations on iPhone load measurements demonstrate the efficiency and applicability of the HypoEnergy framework in extending the system's lifetime.
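The multiple-choice knapsack formulation used by HypoEnergy-KI can be illustrated with a generic dynamic-programming sketch; the groups, weights, and values below are hypothetical placeholders, not the paper's actual lifetime-optimization variables.

```python
def mckp(groups, capacity):
    """Multiple-choice knapsack: pick exactly one (weight, value) item
    from each group, keep total weight <= capacity, maximize value.
    Returns the best value, or None if no feasible selection exists."""
    NEG = float("-inf")
    dp = [NEG] * (capacity + 1)  # dp[c]: best value at exact weight c
    dp[0] = 0
    for group in groups:
        nxt = [NEG] * (capacity + 1)
        for c in range(capacity + 1):
            if dp[c] == NEG:
                continue
            for w, v in group:  # must take exactly one item per group
                if c + w <= capacity and dp[c] + v > nxt[c + w]:
                    nxt[c + w] = dp[c] + v
        dp = nxt
    best = max(dp)
    return None if best == NEG else best
```

In a HypoEnergy-style use, each "group" would correspond to a decision point in the known workload and the DP would select the supply configuration per interval.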
We present a runtime system that uses the explicit on-chip communication mechanisms of the SARC multi-core architecture to efficiently implement the OpenMP programming model and enable the exploitation of fine-grain parallelism in OpenMP programs. We explore the design space of implementing OpenMP directives and runtime intrinsics using a family of hardware primitives: remote stores, remote DMAs, hardware counters, and hardware event queues with automatic responses, to support static and dynamic scheduling and data transfers in local memories. Using an FPGA prototype with four cores, we achieve OpenMP task creation latencies of 30-35 processor clock cycles, initiation of parallel contexts in 50 cycles and synchronization primitives in 65-210 cycles.
Excessive capture power in at-speed scan testing may cause timing failures, resulting in test-induced yield loss. This has made capture-safety checking mandatory for test vectors. This paper presents a novel metric, called the TTR (Transition-Time-Relation-based) metric, which takes transition time relations into consideration in capture-safety checking. Capture-safety checking with the TTR metric greatly improves the accuracy of test vector sign-off and low-capture-power test generation.
Organic electronics, such as OLEDs, OPVs, and polymer-based power storage units (batteries and capacitors), are rapidly becoming low-cost, viable alternatives to silicon-based devices. These organic devices, however, are still reliant on the support functions of standard silicon components such as power and logic transistors. Integration of these organic devices with standard silicon electronics into a combined heterogeneous system requires specific design and fabrication considerations. Full-scale integration with conventional silicon-based electronic components is challenging due to their incompatibility with common semiconductor fabrication processes that can damage the active organic compounds. The printable/spray/spin nature of organic electronics fabrication makes 3D integration an attractive methodology. We propose to combine the organic and inorganic portions of a heterogeneous system by fabricating the modules separately (hence enabling parallel manufacturing) in a specific 2D layout scheme, and subsequently connecting the devices together in a post-fabrication process. In this paper we discuss the 2D designs in detail and propose a 2D-3D hybrid design as well as a fully 3D stacked design for organic electronics with energy storage devices in a face-to-back configuration. The fabrication process of each device and the integration of OPVs and OLEDs with power storage devices are discussed. An overview of test procedures and fault tolerances for the proposed configuration is provided. Finally, a potential solution for a new test environment derived from a mixed configuration of different technologies and materials is proposed. Index Terms - 3D Integration, Organic Electronics, Interconnects, Photovoltaics, Polymer Battery, Capacitor, OLED.
Photovoltaic (PV) energy harvesting is commonly used to power wireless sensor nodes. To optimise harvesting efficiency, maximum power point tracking (MPPT) techniques are often used. Recently-reported techniques focus solely on outdoor applications, being too power-hungry for use under indoor lighting. Additionally, some techniques have required light sensors (or pilot cells) to control their operating point. This paper describes an ultra low-power MPPT technique which is based on a novel system design and sample-and-hold arrangement, which enables MPPT across the range of light intensities found indoors and outdoors and is capable of cold-starting. The proposed sample-and-hold based technique has been validated through a prototype system. Its performance compares favourably against state-of-the-art systems, and does not require an additional pilot cell or photodiode. This represents an important contribution, in particular for sensors which may be exposed to different types of lighting (such as body-worn or mobile sensors).
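The paper's sample-and-hold circuit is not reproduced here; as a point of reference, the classic perturb-and-observe MPPT loop (a common baseline, not the paper's technique) can be sketched with an assumed PV power curve:

```python
def perturb_and_observe(measure_power, set_voltage, v0, step=0.05, iters=50):
    """Classic perturb-and-observe MPPT baseline: nudge the operating
    voltage and keep moving in whichever direction increases harvested
    power. Step size and iteration count here are illustrative."""
    v = v0
    direction = 1.0
    last_p = measure_power(set_voltage(v))
    for _ in range(iters):
        v += direction * step
        p = measure_power(set_voltage(v))
        if p < last_p:            # power dropped: reverse the perturbation
            direction = -direction
        last_p = p
    return v                      # oscillates around the maximum power point
```

With a concave power-voltage curve, the loop settles into a small oscillation around the maximum power point, which is why real trackers also tune the step size.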
Future applications will require processors with many cores communicating through a regular interconnection network. Meanwhile, deep-submicron technology foreshadows an era of highly defective chips. In this context, not only do fault-tolerant designs become compulsory, but their performance under failures gains importance. In this paper, we present a deadlock-free fault-tolerant adaptive routing algorithm featuring Explicit Path Routing in order to limit latency degradation under failures. This is particularly interesting for streaming applications, which transfer huge amounts of data between the same source-destination pairs. The proposed routing algorithm is able to route messages in the presence of any set of multiple node and link failures, as long as a path exists, and does not use any routing table. It is scalable and can be applied to multicore chips with a 2D mesh core interconnect of any size. The algorithm is deadlock-free and avoids infinite looping in fault-free and faulty 2D meshes. We simulated the proposed algorithm using the worst-case scenario, with different failure rates. Experimental results confirmed that the algorithm tolerates multiple failures even in the most extreme failure patterns. Additionally, we monitored the interconnect traffic and average latency for faulty cases. For 20x20 meshes, the proposed algorithm reduces the average latency by up to 50%.
Panelists: P. Urard, J. Rabaey, R. Bramley, A. King-Smith, W. Burleson, and F. Perruchot
In this paper, we consider a cyber-physical architecture where multiple control applications are divided into multiple tasks, spatially distributed over various processing units that communicate over a bus implementing a hybrid communication protocol, i.e., a protocol with both time-triggered and event-triggered communication schedules (e.g., FlexRay). In spite of efficient utilization of communication bandwidth (BW), event-triggered protocols suffer from unpredictable temporal behavior, while their time-triggered counterparts exhibit exactly the opposite characteristics. In the context of communication delays experienced by the control-related messages exchanged over the shared communication bus, we observe that a distributed control application is more prone to performance deterioration in transient phases than in the steady state. We exploit this observation to re-engineer control applications to operate in two modes, depending on the state (transient or steady) of the system, in order to optimally exploit the bi-modal (time- and event-triggered) characteristics of the underlying communication medium. Using a FlexRay-based case study, we show that such a design provides a good trade-off between control performance and bus utilization.
Embedded hard real-time systems that are based on software product lines using dynamically derivable variants are prone to overestimations in static WCET analyses. This is due to the fact that infeasible paths in the code resulting from infeasible variant combinations are unknown to the analysis. This paper presents an approach to incorporate variant constraints in the calculation to exclude infeasible paths and thus to decrease the WCET overestimation. Based on feature models we propose a sound approach to identify significant infeasible paths that can be safely discarded in the analysis. The benefits of the approach are exemplified by a real world example from the automotive domain where we are able to reduce the WCET bound by up to 50 percent.
This paper introduces the concept of switched FlexRay networks and proposes two algorithms to schedule data communication for this new type of network. Switched FlexRay networks use an intelligent star coupler, called a switch, to temporarily decouple network branches, thereby increasing the effective network bandwidth. Although scheduling for basic FlexRay networks is not new, prior work in this domain does not utilize the branch parallelism that is available when a FlexRay switch is used. In addition to the novel exploitation of branch parallelism, the scheduling algorithms proposed in this paper also support all slot multiplexing options as defined in the FlexRay v3.0 protocol specification. This includes support for the newly added repetition rates and support for multiplexing frames from different sending nodes in the same slot. Our first algorithm quickly produces a schedule given the communication requirements, network topology and FlexRay parameters, but cannot guarantee an optimal schedule in terms of the bandwidth efficiency and extensibility. Therefore, a second, branch-and-price algorithm is introduced that does find optimal schedules.
Negative Bias Temperature Instability (NBTI) has become an important reliability issue in modern semiconductor processes. Recent work has attempted to address NBTI-induced degradation at the architecture level. However, such work has relied on device-level analytical models that, we argue, are limited in their flexibility to model the impact of architecture-level techniques on NBTI degradation. In this paper, we propose a flexible numerical model for NBTI degradation that can be adapted to better estimate the impact of architecture-level techniques on NBTI degradation. Our model is a numerical solution to the reaction-diffusion equations describing NBTI degradation that has been parameterized to model the impact of dynamic voltage scaling, averaging effects across logic paths, power gating, and activity management. We use this model to understand the effectiveness of different classes of architecture-level techniques that have been proposed to mitigate the effects of NBTI. We show that the potential benefits from these techniques are, for the most part, smaller than what has been previously suggested, and that guardbanding may still be an efficient way to deal with aging.
Conventional power management knobs such as voltage scaling or power gating have been shown to have a beneficial effect on the aging phenomena caused by Negative Bias Temperature Instability (NBTI). This benefit can be especially exploited in SRAM memories, which are particularly sensitive to NBTI effects: given their symmetric structure, they cannot take advantage of value-dependent recovery. We propose an architectural solution based on the idea of partitioning a memory into multiple banks of identical size. While this organization has been widely used for reducing both dynamic and static power, exploiting it for aging benefits requires proper management of the existing idleness of the various banks. This can be achieved by means of a time-varying addressing scheme in which addresses are mapped to different banks over time in such a way that the idleness is uniformly distributed over all the banks. Experimental analysis shows that it is possible to simultaneously reduce leakage power and aging in caches, with minimal overhead and without modifying the internal structure of the SRAM arrays.
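The time-varying addressing idea can be illustrated with a minimal rotation-based mapping; the bank count and rotation rule below are illustrative assumptions, not the paper's exact scheme.

```python
def bank_of(address, epoch, num_banks=4):
    """Illustrative time-varying addressing: the address-to-bank map is
    rotated each 'epoch' so that idle periods (and hence BTI recovery
    time) are spread uniformly over all banks instead of being pinned
    to the banks that happen to hold cold addresses."""
    return (address + epoch) % num_banks

# Over num_banks consecutive epochs every address visits every bank once,
# so each bank accumulates the same amount of idleness on average.
```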
We present an energy-reduction strategy for applications which are resilient, i.e., can tolerate occasional errors, based on adaptive voltage control. The voltage is lowered, possibly beyond the safe-operation region, as long as no errors are observed, and raised again when the severity of the detected errors exceeds a threshold. Due to the resilient nature of the applications, lightweight error detection logic is sufficient for operation, and no expensive error recovery circuitry is required. On a hardware block implementing texture decompression, we observe 25% to 30% energy reduction at negligible quality loss (compared to the error introduced by the lossy compression algorithm). We investigate the strategy's performance under temperature and process variations and different assumptions on the voltage-control circuitry. The strategy automatically chooses the lowest appropriate voltage, and thus the largest energy reduction, for each individual manufactured instance of the circuit.
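The threshold-based control loop described above can be sketched as a single update step; all voltage bounds, step sizes, and thresholds here are illustrative values, not the paper's.

```python
def adapt_voltage(vdd, errors, severity_threshold=3,
                  v_step=0.01, v_min=0.7, v_max=1.1):
    """One step of a threshold-based adaptive voltage controller:
    lower Vdd while no (or mild) errors are observed, raise it again
    when the detected error severity exceeds the threshold.
    All constants are illustrative placeholders."""
    if errors > severity_threshold:
        vdd = min(v_max, vdd + v_step)   # back off toward the safe region
    else:
        vdd = max(v_min, vdd - v_step)   # probe for more energy savings
    return vdd
```

Because each manufactured instance converges to its own lowest workable voltage, this kind of loop absorbs process and temperature variation without per-chip characterization.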
Approximate computing techniques that exploit the inherent resilience in algorithms through mechanisms such as voltage over-scaling (VOS) have gained significant interest. In this work, we focus on meta-functions that represent computational kernels commonly found in application domains that demonstrate significant inherent resilience, namely Multimedia, Recognition and Data Mining. We propose design techniques (dynamic segmentation with multi-cycle error compensation, and delay budgeting for chained data path components) which enable the hardware implementations of these meta-functions to scale more gracefully under voltage over-scaling. The net effect of these design techniques is improved accuracy (fewer and smaller errors) under a wide range of over-scaled voltages. Results based on extensive transistor-level simulations demonstrate that the optimized meta-function implementations consume up to 30% less energy at iso-error rates, while achieving up to 27% lower error rates at iso-energy when compared to their baseline counterparts. System-level simulations for three applications, motion estimation, support vector machine based classification and k-means based clustering, are also presented to demonstrate the impact of the improved meta-functions at the application level. Index Terms - Approximate Computing, Low Power Design, Voltage Over-scaling, Meta-functions.
Main memory plays a critical role in a computer system's performance and energy efficiency. Three key parameters define a main memory system's efficiency: latency, bandwidth, and power. Current memory systems try to balance these three parameters to achieve reasonable efficiency for most programs. However, in a multi-core system, applications with various memory demands are executed simultaneously. This paper proposes a heterogeneous main memory with three different memory modules, where each module is heavily optimized for one of the three parameters at the cost of compromising the other two. Based on the memory access characteristics of an application, the operating system allocates its pages in a memory module that satisfies its memory requirements. When compared to a homogeneous memory system, we demonstrate through cycle-accurate simulations that our design results in about a 13.5% increase in system performance and a 20% improvement in memory power.
Non-volatile memories, such as Flash and Phase- Change Memory, are replacing other memory and storage technologies. Although these new technologies have desirable energy and scalability properties, they are prone to wear-out due to excessive write operations. Because wear-out is an important phenomenon, a number of endurance management schemes have been proposed. There is a trade-off between what techniques to use, depending on the range of bit cell lifetime within a device. This range in cell durability arises from effects due to process variation. In this paper, we describe modeling techniques to analyze trade-offs for endurance management based on the anticipated distribution of cell lifetime. This analysis considers two general endurance strategies (physical capacity degradation and physical sparing) under four distributions of cell lifetime (constant, linear, normal, and bimodal). The modeling techniques can be used to determine how much redundancy is needed when a sparing endurance strategy is adopted. With the correct choice of technique, the device lifetime can be doubled.
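The physical-sparing strategy can be illustrated with a small Monte Carlo sketch; the normal lifetime distribution parameters, cell counts, and spare counts below are illustrative assumptions, not measured device data.

```python
import random

def device_lifetime(cell_lifetimes, spares):
    """Physical-sparing model sketch: the device survives until more
    cells have worn out than there are spares to replace them, so its
    lifetime is the (spares+1)-th smallest cell lifetime."""
    return sorted(cell_lifetimes)[spares]

def simulate(n_cells=1000, spares=100, trials=200, seed=1):
    """Monte Carlo of expected device lifetime under a normal cell
    lifetime distribution (mean/sigma are illustrative placeholders)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        cells = [max(0.0, rng.gauss(1e5, 2e4)) for _ in range(n_cells)]
        total += device_lifetime(cells, spares)
    return total / trials
```

Repeating the simulation for constant, linear, normal, and bimodal distributions shows how much redundancy a sparing scheme needs under each assumption.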
The emerging nanophotonic technology can avoid the limitation of I/O pin count and provide abundant memory bandwidth. However, current DRAM organization has mainly been optimized for higher storage capacity and package pin utilization. The resulting data-fetching mechanism is quite inefficient in performance and energy saving, and cannot effectively utilize the abundant optical bandwidth in off-chip communication. This paper inspects the opportunity brought by optical communication and revisits the DRAM memory architecture considering the technology trend towards multiprocessors. In our FlexMemory design, super-line prefetching is proposed to boost system performance and promote energy efficiency; it leverages the abundant photonic bandwidth to enlarge the effective data fetch size per memory cycle. To further preserve locality and maintain service parallelism for different workloads, a page folding technique is employed to achieve adaptive data mapping in photonics-connected DRAM chips via optical wavelength allocation. By combining both techniques, surplus off-chip bandwidth can be utilized and effectively managed to adapt to workload intensity. Experimental results show that our FlexMemory achieves considerable improvements in performance and energy efficiency.
Keywords-DRAM; nanophotonic; memory architecture; locality
Modern digital signal processors (DSPs) need to support a diverse array of applications ranging from digital filters to video decoding. Many of these applications have drastically different precision and on-chip memory requirements. Moreover, DSPs often employ aggressive dynamic voltage and frequency scaling (DVFS) techniques to minimize power consumption. However, at reduced voltages, process variations can significantly increase the failure rate of on-chip SRAMs designed with small transistors to achieve high integration density, resulting in low yields. Consequently, the size of transistors in SRAM cells, and hence the cell size, needs to be increased to satisfy the target yield. However, this can result in high area overhead since on-chip memories consume a significant portion of the die area. In this paper, we present a scratchpad memory design that exploits the tradeoffs between SRAM cell sizes, their failure rates, the minimum operating voltage for target yield (Vddmin), and application characteristics to achieve an on-chip memory area reduction of up to 17%. Our approach reduces Vddmin, which allows dynamic and leakage power savings of 42% and 36% respectively with DVFS. Moreover, for error-tolerant DSP applications we allow voltage scaling below Vddmin to achieve further power savings while incurring lower mean error as compared to short word-length memory. Finally, for error-sensitive applications, we propose a reconfigurable memory organization that trades memory capacity for higher precision at a lower Vddmin.
Process variability is becoming a major challenge in CMOS design in general, and in embedded SRAMs in particular, due to continuous device scaling. The main problems are increased static power and reduced operating margins, robustness and reliability. A common way to reduce the static power consumption of an SRAM memory array is to decrease its supply voltage when in memory retention mode. However, this leads to a further reduction in memory robustness. The most common tool for statistical analysis of circuits under process variability is standard Monte Carlo simulation, which has been proven to be too expensive when applied to an ultra-dense SRAM [1]-[6]. In this paper a statistical robustness analysis method is proposed based on decoupling statistical integration from robustness region determination in the parameter domain. The robustness is estimated with a ~556X speed-up relative to Monte Carlo and an error of ~1%.
Keywords-6T SRAM; Robustness Analysis; Data Retention; PVT Variability.
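For context, the standard Monte Carlo baseline that such methods accelerate can be sketched as follows; the toy robustness predicate and parameter distribution are illustrative, not a real transistor model.

```python
import random

def mc_failure_probability(is_robust, sample_params, trials=10000, seed=0):
    """Standard Monte Carlo baseline: draw process parameters, check
    the robustness predicate, count failures. For failure rates around
    1e-6 this needs far more than 1e6 samples, which is what motivates
    decoupled robustness-region approaches."""
    rng = random.Random(seed)
    fails = sum(0 if is_robust(sample_params(rng)) else 1
                for _ in range(trials))
    return fails / trials

# Hypothetical toy model: a "cell" fails when the sampled parameter
# deviation exceeds a fixed robustness margin of 3 sigma.
estimate = mc_failure_probability(
    is_robust=lambda dv: abs(dv) < 3.0,
    sample_params=lambda rng: rng.gauss(0.0, 1.0))
```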
SRAM cell stability analysis is typically based on Static Noise Margin (SNM) evaluation in hold mode, although memory errors may also occur during read operations. Given that SNM varies with each cell operation, a thorough analysis of SNM in read mode is required. In this paper we investigate the SNM of SRAM cells during write operations. Word-line voltage modulation is proposed as an alternative to improve cell stability in this mode. We show that it is possible to improve the stability of 8T SRAM cells during write operations while reducing current leakage, as opposed to present methods that improve cell stability at the cost of increased leakage.
Recent studies of BTI behavior in SRAM cells showed that for high-κ metal gate stack technology, the PBTI-induced Vth shift in NMOS is as significant as the NBTI-induced Vth shift in PMOS. Previous techniques for mitigating NBTI in SRAM focus mainly on PMOS and thus lack the ability to mitigate PBTI of NMOS transistors. In this paper, we propose a novel design to recover 4 internal gates within a SRAM cell simultaneously to mitigate both NBTI and PBTI effects. In the evaluated L2 cache, our technique effectively slows down the increase in cell failure probability, and achieves a 4.64/2.86x (best/worst case) lifetime improvement over the normal design. Index Terms - high-κ, NBTI, PBTI, recovery, SRAM
Modern reconfigurable technologies can have a number of inherent advantages for cryptanalytic applications. Aimed at the cryptanalysis of the SHA-1 hash function, this work explores this potential, showing new approaches inherently based on hardware reconfigurability, enabling algorithm and architecture exploration, input-dependent system specialization, and low-level optimizations based on static/dynamic reconfiguration. As a result of this approach, we identified a number of new techniques, at both the algorithmic and architectural level, to effectively improve the attacks against SHA-1. We also defined the architecture of a high-performance FPGA-based cluster, which turns out to be the solution with the highest speed/cost ratio for SHA-1 collision search currently available. A small-scale prototype of the cluster enabled us to reach a real collision for a 72-round version of the hash function.
SPA/SEMA (Simple Power/Electro-Magnetic Analysis) attacks performed on public-key cryptographic modules implemented on FPGA platforms are well known from the theoretical point of view. However, the practical aspect is not often developed in the literature, and researchers know that these attacks do not always work, as in the case of an RSA accelerator. Indeed, SEMA on RSA needs to distinguish between square and multiply operations, which use the same logic; this contrasts with SEMA on ECC, which is easier since doubling and addition are two different operations from the hardware point of view. In this paper, we ask what to do if a SEMA attack fails on a device. Does it mean that no attack is possible? We show that hardware demodulation techniques allow the recording of a signal with more information on the leakage than a raw recording. We then propose a generic and fast method for finding suitable demodulation frequencies. The effectiveness of our methods is demonstrated through actual experiments using an RSA processor on the SASEBO FPGA board. We show cases where only demodulated signals make it possible to defeat RSA.
Keywords: Demodulation, Simple Electro-Magnetic Analysis, Mutual Information, Modular Exponentiation.
This paper presents LOEDAR, a novel low-cost Error Detection and Recovery scheme for Montgomery-ladder-based Elliptic Curve Scalar Multiplication (ECSM). The LOEDAR scheme exploits the invariance among the intermediate results produced by the algorithm to detect errors. The error detection process can be carried out periodically during ECSM to verify data correctness, and recovers the cryptosystem to the latest checkpoint upon detecting errors. The frequency of running the error detection process can be adjusted to trade off power and time overhead against error detection latency and recovery overhead. The hardware and power overheads of LOEDAR are about 37% and 69%, respectively; each additional error detection process contributes less than 1% additional time and power overhead.
Keywords - elliptic curve cryptography (ECC); elliptic curve scalar multiplication (ECSM); concurrent error detection; Montgomery ladder
When using Elliptic Curve Cryptography (ECC) in constrained embedded devices such as RFID tags, López-Dahab's method along with the Montgomery powering ladder is considered the most suitable method. It uses only the x-coordinate for point representation, and at the same time offers intrinsic protection against simple power analysis. This paper proposes a low-cost fault detection mechanism for Elliptic Curve Scalar Multiplication (ECSM) using the López-Dahab algorithm. By introducing minimal changes to the last round of the algorithm, we make it capable of detecting faults with a very high probability. In addition, by reusing the existing resources, we significantly reduce both performance losses and area overhead compared to other methods in this scenario. The method is especially suitable for constrained devices. Index Terms - Elliptic Curve Cryptosystems (ECC), Montgomery Powering Ladder, Fault Attacks, Low Overhead, López-Dahab algorithm
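The error-detection ideas in the two ECSM abstracts above rest on the Montgomery ladder's step invariant: its two working registers always differ by the base value, so a transient fault breaks the relation and can be caught by a periodic check. A minimal illustrative sketch, transposed from elliptic-curve points to ordinary modular exponentiation (where the invariant is R1 = R0·g mod n), not the papers' actual hardware schemes:

```python
def ladder_pow(g, k, n, check_every=4):
    """Montgomery ladder computing g**k mod n with periodic
    invariant checking (illustrative sketch over modular
    exponentiation rather than elliptic-curve point arithmetic).

    The ladder maintains r1 == r0 * g (mod n) after every step;
    a fault in either register breaks this relation, so checking
    it every few iterations detects errors.
    """
    r0, r1 = 1, g % n
    for i, bit in enumerate(bin(k)[2:]):       # scan exponent MSB first
        if bit == '1':
            r0, r1 = (r0 * r1) % n, (r1 * r1) % n
        else:
            r0, r1 = (r0 * r0) % n, (r0 * r1) % n
        if (i + 1) % check_every == 0:
            assert r1 == (r0 * g) % n, "fault detected"
    return r0
```

In a fault-free run the assertion never fires; flipping a bit of either register between checks violates the invariant at the next check, which is the spirit of the periodic verification and checkpoint-recovery described above.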
Traditional engineering disciplines such as civil or mechanical engineering are based on solid theory for building artefacts with predictable behavior over their life-time. In contrast, we lack similar constructivity results for computing systems engineering: computer science provides only partial answers to particular system design problems. With few exceptions, predictability is impossible to guarantee at design time and therefore, a posteriori verification remains the only means for ensuring their correct operation.
We elaborate on the theoretical foundation and practical application of the contract-based specification method originally developed in the Integrated Project SPEEDS [11], [9] for two key use cases in embedded systems design. We demonstrate how formal contract-based component specifications for functional, safety, and real-time aspects of components can be expressed using the pattern-based requirement specification language RSL developed in the Artemis Project CESAR, and develop a formal approach for virtual integration testing of composed systems based on such contract-specifications of subsystems. We then present a methodology for multi-criteria architecture evaluation developed in the German Innovation Alliance SPES on Embedded Systems.
Motivation. The specific root causes of the design problems that are haunting system companies such as automotive and avionics companies are complex and relate to a number of issues, ranging from design processes and relationships with different departments of the same company and with suppliers1 to incomplete requirement specification and testing.2 Further, there is widespread consensus in the industry that there is much to gain by optimizing the implementation phase, which today considers only a very small subset of the design space. Some attempts at more efficient design space exploration have been made, but there is a need to formalize the problem better and to involve the different players of the supply chain in major ways. Information about the capabilities of subsystems in terms of timing, power consumption, size, weight and other physical aspects, transmitted to the system assemblers at design time, would go a long way toward enabling better design space exploration. In this landscape, a wrong turn in a system design project could cause so much economic, social and organizational upheaval that it may imperil the life of an entire company. No wonder there is much interest in risk management approaches to assess the risks associated with design errors, delays, recalls and liabilities. Finding appropriate countermeasures to lower risks and developing contingency plans is then a mainstay of the way large projects are managed today. The overarching issue is the need for a substantive evolution of the design methodology in use today in system companies. The issues to address are the understanding of the principles of system design, the necessary changes to design methodologies, and the dynamics of the supply chain. Developing this understanding is necessary to define a sound approach to the needs of system companies as they try to serve their customers better and to develop their products faster and with higher quality. 
An important approach to tackle in part these issues is component-based design.
Modern cars are equipped with hundreds of sensors, used not only in the traditional powertrain, chassis, and body areas, but also in more advanced applications related to multimedia, infotainment, and x-by-wire systems. Such a large number of sensing elements requires particular attention in the design phase of the in-vehicle communication networks. This paper provides an overview of the most commonly used automotive sensors and describes the networks traditionally used today to collect their measurements. Moreover, it considers some possible alternative solutions that could be used in the future to obtain the single uniform network sought by the automotive industry in order to reduce the weight, space, and cost of the communication system.
Wireless communication in a car has several advantages, provided that the demanded safety and real-time requirements are fulfilled. This paper presents a wireless MAC protocol designed for the needs of automotive and industrial applications. The proposed MAC protocol provides special support for network traffic prioritization in order to guarantee worst-case message delays for a set of high-priority nodes. Its performance is analyzed with a network simulator and compared with the IEEE 802.15.4 standard CSMA/CA protocol.
Using wireless communication and energy harvesting in automobiles might have significant advantages considering dependability (no wires and contacts) and weight (no cable tree). In this paper, we give a brief overview of the related technologies, surrounding conditions, and methods for design and optimization. As examples, we focus on methods for harvesting kinetic energy and wireless transmission in a tire pressure metering system (TPMS).
Mobile consumer electronics continue to converge, in terms of functionality and feature sets, bringing many challenges to the circuits required to power these applications. This paper outlines some of the technology available to address these challenges.
In sub-wavelength lithography, traditional resolution enhancement techniques (e.g., OPC) cannot guarantee the optimality of the mask. In this paper, we present a novel inverse lithography method to solve the mask optimization problem. Recognizing that, when formulated on a pixel-by-pixel basis with partially coherent optical models, the problem is a large-scale nonlinear optimization problem, we cast the optimization flow into a homotopy framework and apply an efficient numerical continuation technique. Compared to earlier pixel-based inverse lithography methods, our homotopy approach is not only more efficient, but also capable of naturally addressing the mask manufacturability problem. Experimental results in a state-of-the-art lithography environment show that our method generates high-fidelity wafer images and is 100x faster than previously reported inverse lithography methods.
A key challenge in design automation of digital microfluidic biochips is to carry out on-chip dilution/mixing of biochemical samples/reagents to achieve a desired concentration factor (CF). In a bioassay, reducing waste is crucial: waste droplet handling is cumbersome, and minimizing the number of on-chip waste reservoirs reduces the consumption of limited-volume samples and expensive reagents, and hence the cost of the biochip. Existing dilution algorithms attempt to reduce the number of mix/split steps required in the process but pay little attention to minimizing sample requirements or waste droplets. In this work, we characterize the underlying combinatorial properties of waste generation and identify the inherent limitations of two earlier mixing algorithms (the BS algorithm by Thies et al., Natural Computing 2008; the DMRW algorithm by Roy et al., IEEE TCAD 2010) in addressing this issue. Based on these properties, we design an improved dilution/mixing algorithm (IDMA) that optimizes the usage of intermediate droplets generated during the dilution process, which in turn reduces the demand for sample/reagent and the production of waste. The algorithm terminates in O(n) steps for producing a target CF with a precision of 1/2^n. Based on simulation results for all CF values ranging from 1/1024 to 1023/1024 using a sample (100% concentration) and a buffer solution (0% concentration), we present an integrated scheme for choosing the best waste-aware dilution algorithm among BS, DMRW, and IDMA for any given value of CF. Finally, an architectural layout of a DMF biochip that supports the proposed scheme is designed.
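The bit-scanning (BS) dilution scheme referenced above can be sketched concisely: a target CF of k/2^n is reached in n (1:1) mix/split steps by scanning the bits of k from LSB to MSB and, at each step, mixing the current droplet with either a sample droplet (bit 1, CF 1) or a buffer droplet (bit 0, CF 0). This is an illustrative sketch of that idea only; it tracks the concentration but not the waste-droplet accounting that IDMA optimizes:

```python
from fractions import Fraction

def bs_dilute(k, n):
    """Simulate (1:1) mix/split steps producing CF = k / 2**n.

    Each step mixes the current droplet with a sample droplet
    (bit = 1) or a buffer droplet (bit = 0) and splits the result,
    which halves the gap between the two input concentrations.
    """
    cf = Fraction(0)              # start from pure buffer (CF 0)
    steps = 0
    for i in range(n):            # scan bits of k, LSB first
        bit = (k >> i) & 1
        cf = (cf + bit) / 2       # mix with CF-1 or CF-0 droplet
        steps += 1
    return cf, steps

cf, steps = bs_dilute(613, 10)    # target CF 613/1024 in 10 steps
```

Each step also produces one unit droplet that is not carried forward; counting and reusing those intermediate droplets is precisely where waste-aware algorithms improve on this baseline.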
Many industrial systems, sensors and advanced propulsion systems demand electronics capable of functioning at high ambient temperatures in the range of 500-600°C. Conventional Si-based electronics fail to work reliably in such high temperature ranges. In this paper we propose, for the first time, a high-temperature reconfigurable computing platform capable of operating at temperatures of 500°C or higher. Such a platform is also amenable to reliable operation in high-radiation environments. The hardware reconfigurable platform follows the interleaved architecture of a conventional Field Programmable Gate Array (FPGA) and provides the usual benefits of lower design cost and time. High-temperature operation, however, is enabled by the choice of a special device material, namely silicon carbide (SiC), and a special switch structure, namely the Nano-Electro-Mechanical-System (NEMS) switch. While SiC provides excellent mechanical and chemical properties suitable for operation in extremely harsh environments, the NEMS switch provides low-voltage operation, ultra-low leakage and radiation hardness. We propose a novel multi-layer NEMS switch structure and an efficient design of each building block of the FPGA using nanoscale SiC NEMS switches. Using measured switch parameters from a number of SiC NEMS switches we fabricated, we compare the power, performance and area of an all-mechanical FPGA with alternative implementations for several benchmark circuits.
Keywords - High Temperature Electronics; SiC; NEMS; FPGA
The increasing power consumption of integrated circuits (ICs) enabled by technology scaling requires more efficient heat dissipation solutions to improve overall chip reliability and reduce hotspots. Thermal interface materials (TIMs) are widely employed to improve the thermal conductivity between the chip and the cooling facilities. In recent years, carbon nanotubes (CNTs) have been proposed as a promising TIM due to their superior thermal conductivity. Several CNT-based thermal structures for improving chip heat dissipation have been proposed and have demonstrated significant temperature reduction. In this paper, we present an improved CNT TIM design, which includes a CNT grid and thermal vias to dissipate heat more efficiently and obtain a more uniform chip thermal profile. We present simulation-based experimental results that indicate a 32% / 25% peak temperature reduction and a 48% / 22% improvement in chip reliability for two industrial processor benchmarks, showing the effectiveness of our proposed thermal structure.
The Unified Modeling Language (UML), as a de facto standard for software development, finds more and more application in the design of systems that also contain hardware components. Guaranteeing the correctness of a system specified in UML is thereby an important as well as challenging task. In recent years, the first approaches for this purpose have been introduced. However, most of them focus only on the static view of a UML model. In this paper, an automatic approach is presented which checks verification tasks for dynamic aspects of a UML model. That is, given a UML model as well as an initial system state, the approach proves whether a sequence of operation calls exists such that a desired behavior is invoked. The underlying verification problem is encoded as an instance of the satisfiability problem and subsequently solved using a SAT Modulo Theories solver. An experimental evaluation confirms the applicability of the proposed approach.
Rapidly and accurately estimating the impact of design decisions on performance metrics is critical to both the manual and automated design of wireless sensor networks. Estimating system-level performance metrics such as lifetime, data loss rate, and network connectivity is particularly challenging because they depend on many factors, including network design and structure, hardware characteristics, communication protocols, and node reliability. This paper describes a new method for automatically building efficient and accurate predictive models for a wide range of system-level performance metrics. These models can be used to eliminate or reduce the need for simulation during design space exploration. We evaluate our method by building a model for the lifetime of networks containing up to 120 nodes, considering both fault processes and battery energy depletion. With our adaptive sampling technique, only 0.27% of the potential solutions are evaluated via simulation. Notably, one such automatically produced model outperforms the most advanced manually designed analytical model, reducing error by 13% while maintaining very low model evaluation overhead. We also propose a new, more general definition of system lifetime that accurately captures application requirements and decouples the specification of requirements from implementation decisions.
Simulation is a bottleneck in the design flow of on-chip multiprocessors. This paper addresses that problem by reducing the simulation time of complex on-chip interconnects through transaction-level modelling (TLM). A particular on-chip interconnect architecture was chosen, namely a wormhole network-on-chip with priority-preemptive virtual channel arbitration, because its mechanisms can be modelled at transaction level in such a way that accurate figures for communication latency can be obtained with less simulation time than a cycle-accurate model. The proposed model produced latency figures with more than 90% accuracy and simulated more than 1000 times faster than a cycle-accurate model.
Keywords - system specification; transaction-level modeling; network-on-chip; on-chip multiprocessing; simulation
We present a high-level analytical model for chip-multiprocessors (CMPs) that encompasses processors, memory, and communication in an area-constrained, global optimization process. Applying this analytical model to the design of a symmetric CMP for speech recognition, we demonstrate a methodology for estimating model parameters prior to design exploration. Then we present an automated approach for finding the optimal high-level CMP architecture. The result is the ability to find the allocation of silicon resources for each architectural element that maximizes overall system performance. This balances the performance gains from parallelism, processor microarchitecture, and cache memory with the energy-delay costs of computation and communication.
Design and optimization of microwave passive components is one of the most critical problems for RF IC designers. However, state-of-the-art methods either are efficient but depend heavily on the accuracy of equivalent circuit models, which may fail the synthesis when the frequency is high, or rely fully on electromagnetic (EM) simulations, whose solution quality is high but which are too expensive. To address this problem, a new method, called Gaussian Process-Based Differential Evolution for Constrained Optimization (GPDECO), is proposed. In particular, GPDECO performs global optimization of the microwave structure using EM simulations, while a Gaussian process (GP) based surrogate model is constructed online to predict the results of expensive EM simulations. GPDECO is tested on two 60 GHz transformers and compared with state-of-the-art methods. The results show that GPDECO can generate high-performance RF passive components that cannot be generated by the available efficient methods. Compared with the available methods with the best solution quality, GPDECO achieves comparable results at only 20%-25% of the computational effort. Using parallel computation on an 8-core CPU, the synthesis can be finished in less than half an hour.
Keywords - Transformer synthesis, Microwave components, Microwave design, Gaussian process, Surrogate model, Differential evolution
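The surrogate-assisted loop described above can be illustrated generically: a GP regressor fitted online to all points evaluated so far prescreens differential-evolution trial vectors, so the expensive simulator is invoked only for candidates the surrogate predicts to be competitive. The sketch below is a generic toy version; the quadratic objective, RBF kernel, and DE settings are placeholders, not GPDECO's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_sim(x):
    # stand-in for an EM simulation (hypothetical cheap objective)
    return float(np.sum((x - 0.3) ** 2))

def gp_fit(X, y, ls=0.5, noise=1e-6):
    # RBF-kernel GP regression; returns what prediction needs
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * ls**2))
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return X, alpha, ls

def gp_predict(model, Xq):
    X, alpha, ls = model
    Kq = np.exp(-((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * ls**2))
    return Kq @ alpha          # posterior mean (zero prior mean)

dim, npop = 2, 12
pop = rng.random((npop, dim))
fit = np.array([expensive_sim(x) for x in pop])
X_hist, y_hist = list(pop), list(fit)
for gen in range(30):
    model = gp_fit(np.array(X_hist), np.array(y_hist))  # refit online
    for i in range(npop):
        a, b, c = pop[rng.choice(npop, 3, replace=False)]
        trial = np.clip(a + 0.8 * (b - c), 0, 1)        # DE/rand/1 mutation
        cross = rng.random(dim) < 0.9                   # binomial crossover
        trial = np.where(cross, trial, pop[i])
        # surrogate prescreen: run the expensive simulation only for
        # candidates the GP predicts to be at least as good
        if gp_predict(model, trial[None, :])[0] <= fit[i]:
            y = expensive_sim(trial)
            X_hist.append(trial); y_hist.append(y)
            if y < fit[i]:
                pop[i], fit[i] = trial, y
best = pop[np.argmin(fit)]
```

The design point is that the surrogate filters out hopeless trials before they reach the simulator, which is where the claimed 4-5x reduction in computational effort would come from.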
We propose a fast method for identifying the jitter tolerance curves of high-speed phase-locked loops. The method is based on an adaptive recursion and uses known tail-fitting methods to realize a fast optimization combined with a small number of jitter samples. It allows for efficient behavioral simulations and can also be applied to hardware measurements. A typical modeling example demonstrates applicability to both software and hardware scenarios and achieves simulated measurement times in the range of a few hundred milliseconds.
In latest CMOS technologies, Random Telegraph Noise (RTN) has emerged as an important challenge for SRAM design. Due to rapidly shrinking device sizes and heightened variability, analytical approaches are no longer applicable for characterising the circuit-level impact of non-stationary RTN. Accordingly, this paper presents SAMURAI, a computational method for accurate, trap-level, non-stationary analysis of RTN in SRAMs. The core of SAMURAI is a technique called Markov Uniformisation, which extends stochastic simulation ideas from the biological community and applies them to generate realistic traces of non-stationary RTN in SRAM cells. To the best of our knowledge, SAMURAI is the first computational approach that employs detailed trap-level stochastic RTN generation models to obtain accurate traces of non-stationary RTN at the circuit level. We have also developed a methodology that integrates SAMURAI and SPICE to achieve a simulation-driven approach to RTN characterisation in SRAM cells under (a) arbitrary trap populations, and (b) arbitrarily time-varying bias conditions. Our implementation of this methodology demonstrates that SAMURAI is capable of accurately predicting non-stationary RTN effects such as write errors in SRAM cells.
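A single two-state trap underlying RTN alternates between empty and filled states with exponentially distributed dwell times. A minimal trace generator for the stationary (fixed-bias) case might look as follows; this is an illustrative sketch only, not SAMURAI's algorithm, whose Markov Uniformisation specifically handles the harder non-stationary case where capture/emission rates vary with time-varying bias:

```python
import random

def rtn_trace(tau_c, tau_e, t_end, seed=1):
    """Generate one RTN trace for a single two-state trap.

    tau_c / tau_e are the mean capture and emission time constants
    (assumed fixed, i.e. stationary bias).  Returns a list of
    (time, state) transition events, where state 1 = trap occupied.
    """
    rng = random.Random(seed)
    t, state, events = 0.0, 0, [(0.0, 0)]
    while t < t_end:
        tau = tau_c if state == 0 else tau_e
        t += rng.expovariate(1.0 / tau)   # exponential dwell time
        state ^= 1                        # capture <-> emission
        if t < t_end:
            events.append((t, state))
    return events
```

Superposing such traces over a trap population, and letting the rates follow the bias waveform, is the kind of trap-level stochastic generation the abstract describes feeding into SPICE.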
The flexibility of an Intelligent Power Switch (IPS) designed in HV-CMOS technology for incandescent lamps in automotive scenarios has been evaluated for driving an LED in the presence of wiring parasitics. The paper presents how it is possible, through proper reconfiguration of the flexible IPS, to reduce the undesired ringing phenomenon when driving an LED with wiring parasitics, thus reducing Electromagnetic Interference (EMI) and spikes on the supply voltage. Electrical simulations and experimental measurements prove the effectiveness of the proposed IPS.
Keywords - Intelligent Power Switch; wiring parasitics; LED driving; High Voltage CMOS Circuit; Automotive Electronics
The increasing demand for "safe" vehicles requires the continuous design of innovative devices and sensors. This paper presents a methodology for an efficient energy analysis of a self-powered sensor in an ultra-low-power automotive application. To achieve this goal, new tools have been developed for storing and processing data (e.g., power consumption values, operating conditions, etc.) and for reporting the energy balance, taking into account the source (i.e., a scavenger device) that supplies the sensor. Index Terms - Low-power design, wireless sensors, energy scavenging, analysis tools
In high-power microcontrollers, a decrease in circuit lifetime is often observed in safety-critical applications, where circuitry is subjected to the most severe stresses; reliability has therefore become a major concern. Thus, ad-hoc design solutions become necessary to mitigate the impact of ageing. In this paper, we discuss hardware-software approaches that exploit distributed on-chip monitoring of wear-out parameters to perform ageing-aware allocation of computation and recovery periods on the various computational units.
We propose a new system-level methodology for relative power estimation which is independent of register transfer level models. Our methodology monitors the number of bit transitions for all input/output gate signals on a bit- and cycle-accurate SystemC virtual platform model. For absolute results and reliable technology-based predictions of system power and speed (e.g. in future 32/22nm technology nodes and variations), the relative metrics can be multiplied with bit-energy coefficients provided by semiconductor technology datasheets and device models.
Keywords - design methodology; multicore; network-on-chip; SystemC; system-on-chip; TLM
Energy efficiency is one of the most critical aspects of today's information society. The most obvious benefits of being green are reduced environmental impact and cost savings. Reducing the energy consumption of electronic devices, circuits and heterogeneous systems, however, is not trivial. It requires the development of innovative energy-aware vertical design solutions and EDA technologies for next-generation nanoelectronic circuits and systems, and for the related energy generation, conversion and management systems.
Modern Multiprocessor Systems-on-Chip (MPSoCs) are ideal platforms for co-hosting multiple applications, which may have very distinct resource requirements (e.g. data-processing-intensive or communication-intensive) and may start/stop execution independently at time instants unknown at design time. In such systems, the runtime task allocator, which is responsible for assigning appropriate resources to each task, is a key component for achieving high system performance. This paper presents a new task allocation strategy that introduces self-adaptability. By dynamically adjusting a set of key parameters at runtime, the optimization criteria of the task allocator adapt to the relative scarcity of different types of resources, so that resource bottlenecks can be effectively mitigated. Experimental results show that, compared with traditional task allocators with fixed optimization criteria, our adaptive task allocator achieves significant improvements in both hardware efficiency and stability.
While much work has addressed the energy-efficient scheduling problem for uniprocessor or multiprocessor systems, little has been done for multicore systems. We study a multicore architecture with a fixed number of cores partitioned into clusters (or islands), on each of which all cores operate at a common frequency. We develop algorithms to determine a schedule for real-time tasks that minimizes energy consumption under timing and operating-frequency constraints. As technical contributions, we first show that, when the timing constraint is not considered, the optimal frequencies resulting in the minimum energy consumption for each island depend not on the mapped workload but on the number of cores and the leakage power of the island. Then, for systems with timing constraints, we present a polynomial-time algorithm that derives the minimum energy consumption for a given task partition. Finally, we develop an efficient algorithm to determine the number of active islands, the task partition and the frequency assignment. Our simulation results show that our approach significantly outperforms related approaches in terms of energy saving.
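The workload-independence observation above can be illustrated with a common convex power model, P(f) = c_eff·f³ + P_leak (an assumed model, possibly different from the paper's): energy per cycle is E(f) = c_eff·f² + P_leak/f, and setting dE/df = 0 gives a "critical" frequency that depends only on leakage and effective capacitance, not on how many cycles the workload needs.

```python
def critical_frequency(p_leak, c_eff):
    """Frequency minimizing energy per cycle under the assumed model
    P(f) = c_eff * f**3 + p_leak.

    E(f) = c_eff * f**2 + p_leak / f
    dE/df = 2*c_eff*f - p_leak/f**2 = 0  =>  f* = (p_leak / (2*c_eff))**(1/3)

    Note f* depends only on leakage and effective capacitance -- the
    mapped workload scales E(f) but does not move its minimum.
    """
    return (p_leak / (2.0 * c_eff)) ** (1.0 / 3.0)
```

Running slower than f* wastes leakage energy; running faster wastes dynamic energy. Timing constraints, which the paper's polynomial-time algorithm handles, can force operation above f*.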
The dual effects of larger die sizes and technology scaling, combined with aggressive voltage scaling for power reduction, increase the error rates of on-chip memories. Traditional on-chip memory reliability techniques (e.g., ECC) incur significant power and performance overheads. In this paper, we propose a low-power-and-performance-overhead Embedded RAID (E-RAID) strategy and present Embedded RAIDs-on-Chip (E-RoC), a distributed, dynamically managed reliable memory subsystem. E-RoC achieves reliability through redundancy by optimizing RAID-like policies tuned for on-chip distributed memories. We achieve on-chip reliability of memories through the use of distributed dynamic scratch pad allocatable memories (DSPAMs) and their allocation policies. We exploit aggressive voltage scaling to reduce the power consumption overheads due to parallel DSPAM accesses, and rely on the E-RoC manager to automatically handle any resulting voltage-scaling-induced errors. Our experimental results on multimedia benchmarks show that E-RoC's fully distributed redundant reliable memory subsystem reduces power consumption by up to 85% and latency by up to 61% over traditional reliability approaches that use parity/cyclic hybrids for error checking and correction.
For process nodes at 22nm and below, a multitude of new manufacturing solutions have been proposed to improve the yield of devices being manufactured. With these new solutions come an increasing number of defect mechanisms. There is a need to model and characterize these new defect mechanisms so that (i) ATPG patterns can be properly targeted, and (ii) defects can be properly diagnosed and addressed at the design or manufacturing level. This presentation reviews currently available defect modeling and test solutions and summarizes open issues faced by the industry today. It also explores the topic of creating special test structures that expose manufacturing process parameters, which can be used as input to software defect models to predict die-specific defect locations for better targeting of tests.
Keywords - Manufacturing test; Photolithography; Defect Modeling; Fault Diagnosis; Layout Enhancements for Manufacturing
Anticipating silicon response in the presence of process variability is essential to avoid costly silicon re-spins. The EDA industry is trying to provide designers with the right set of tools for statistical characterization of SRAM and logic, yet design teams (including those in foundries) are still using classical corner-based characterization approaches. On the one hand, the EDA industry fails to meet the demands for appropriate tool functionality; on the other hand, design teams are not yet fully aware of the trade-offs involved when designing under extreme process variability. This paper summarizes the challenges for statistical characterization of SRAM and logic. It describes the key features of a set of prototype tools providing the required functionality, together with their application to a number of case studies aimed at enhancing yield at the product level.
This paper discusses one of the key challenges of design-for-yield: namely, the difficulty in correlating observed behavior with modeled behavior. In order to achieve good parametric yield, the design process must account for a large number of sources of variability in the silicon, ranging from those inherent in the device and wire models themselves through approximations made in library modeling, extraction, tool algorithms and so on. The problem is further complicated by defects and systematic errors that can be present in early silicon but are expected to be fixed as part of the volume ramp. In addition, environmental factors such as temperature and power delivery must be understood, and variation in the measurement equipment must also be correctly accounted for. Examples are given for validating standard-cell and memory-based designs, as well as a general methodology that can be used to enable chip bring-up.
Keywords - yield optimization, variability, silicon correlation
The yield of homogeneous network-on-chip based multi-processor chips can be improved with the addition of spare tiles. However, the impact of this reliability approach on chip energy consumption is not documented. For instance, in a homogeneous MPSoC, application tasks can be placed onto any tile of a defect-free chip, whereas a chip with a defective tile needs a special task placement in which the faulty tile is avoided. This paper presents a task placement tool and an evaluation of the energy consumption of homogeneous NoC-based MPSoCs with spare tiles. Results show NoC energy consumption overheads ranging from 1% to 10% when considering up to three faults randomly distributed over the tiles of a 3x4 mesh network. The results also indicate that faults on the central tiles typically have more impact on energy overhead.
Keywords - network-on-chip; homogeneous MPSoCs; reliability estimation
Transactional Memories (TM) have attracted much interest as an alternative to lock-based synchronization in shared-memory multiprocessors. Considering the use of TM on an embedded, NoC-based MPSoC, this work evaluates a LogTM implementation. It is shown that the time an aborted transaction waits before restarting its execution (the backoff delay) can seriously affect the overall performance and energy consumption of the system. This work also shows the difficulty of finding a general and optimal setting for this delay and analyzes three backoff policies for handling it. A new solution to this issue is presented, based on a handshake between transactions. Results suggest up to 20% performance gains and up to 53% energy savings when comparing our new solution to the best backoff delay alternative found in our experiments.
Keywords: Hardware Transactional Memories; Multiprocessor Systems-on-Chip; Networks-on-Chip; Embedded Systems; Performance; Energy Consumption
Progressive gate oxide breakdown is emerging as one of the most important sources of stability degradation in nanoscale SRAMs, especially at lower supply voltages. Low-voltage operation of SRAM arrays is critical in reducing the power consumption of embedded microprocessors, thus necessitating the lowering of Vmin. However, oxide breakdown undesirably increases Vmin, owing to an increase in dynamic write failures and eventually static write failures as the supply voltage decreases. In this work, we describe an analytical model based on the Kohlrausch-Williams-Watts (KWW) function to predict the degradation in WLcrit as the oxide breakdown progresses. The KWW model also accurately predicts the efficacy of the word-line boosting and Vdd-lowering write-assist techniques in reducing WLcrit. Simulation results from an industrial low-power 32nm SRAM show that the model is accurate to within 1% of SPICE across a range of supply voltages and severities of oxide breakdown, with orders-of-magnitude improvement in runtime.
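The KWW function referred to above is the stretched exponential φ(t) = exp(−(t/τ)^β) with 0 < β ≤ 1 (β = 1 recovers a plain exponential). How the paper maps this form onto WLcrit degradation is not reproduced here; the function itself is:

```python
import math

def kww(t, tau, beta):
    """Kohlrausch-Williams-Watts stretched exponential,
    phi(t) = exp(-(t / tau)**beta), with 0 < beta <= 1.

    beta < 1 stretches the decay: fast at first, then a long tail,
    which is why the form fits dispersive degradation processes.
    """
    return math.exp(-((t / tau) ** beta))
```

For example, kww(t, tau, 1.0) is the ordinary exponential exp(-t/tau), while smaller beta makes the same tau-scale decay markedly slower at long times.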
Modern hardware and software implementations of cryptographic algorithms are subject to multiple sophisticated attacks, such as differential power analysis (DPA) and fault-based attacks. In addition, modern integrated circuit (IC) design and manufacturing follows a horizontal business model in which different third-party vendors provide hardware, software and manufacturing services, making it difficult to ensure the trustworthiness of the entire process. Such business practices make designs vulnerable to hard-to-detect malicious modifications by an adversary, termed "Hardware Trojans". In this paper, we show that a malicious nexus between multiple parties at different stages of design, manufacturing and deployment makes attacks on cryptographic hardware more potent. We describe the general model of such an attack, which we refer to as a Multi-level Attack, and provide an example of it on a hardware implementation of the Advanced Encryption Standard (AES) algorithm, where a hardware Trojan is embedded in the design. We then analytically show that the resultant attack poses a significantly stronger threat than a Trojan attack by a single adversary. We validate our theoretical analysis using power simulation results as well as hardware measurement and emulation on an FPGA platform.
Reversible logic is an emerging technology with promising applications in quantum computing. In this work, we present a new design of a reversible BCD adder that is primarily optimized for the number of ancilla input bits and the number of garbage outputs. These counts are taken as the primary optimization criteria because it is extremely difficult to realize a quantum computer with many qubits. Since optimizing the ancilla inputs and garbage outputs may degrade the design in terms of quantum cost and delay, these two parameters are also considered, with the primary focus remaining on minimizing the number of ancilla input bits and garbage outputs. First, we propose a new design of a reversible ripple-carry adder with input carry C0 that requires no ancilla input bits; it has lower quantum cost and logic depth (delay) than its existing counterparts. The existing reversible Peres gate and a new reversible gate called the TR gate are efficiently utilized to improve the quantum cost and delay of the ripple-carry adder, and an improved quantum design of the TR gate is also illustrated. Finally, we present a reversible design of the BCD adder based on a 4-bit reversible binary adder that adds the BCD digits, followed by conversion of the binary result to BCD format using a reversible binary-to-BCD converter.
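For readers unfamiliar with the gates named above, a small sketch of their truth-table behavior may help. The mappings below use the Peres and TR gate definitions commonly given in the reversible-logic literature (the abstract itself does not spell them out), and the check confirms that each gate is a bijection on three bits, i.e. reversible:

```python
from itertools import product

def peres(a, b, c):
    # Peres gate, as commonly defined:
    # (a, b, c) -> (a, a XOR b, (a AND b) XOR c)
    return (a, a ^ b, (a & b) ^ c)

def tr(a, b, c):
    # TR gate, as commonly defined in the literature:
    # (a, b, c) -> (a, a XOR b, (a AND NOT b) XOR c)
    return (a, a ^ b, (a & (1 - b)) ^ c)

for gate in (peres, tr):
    # A gate is reversible iff it maps the 8 input patterns
    # onto 8 distinct output patterns (a bijection on 3 bits).
    outputs = {gate(*bits) for bits in product((0, 1), repeat=3)}
    assert len(outputs) == 8
    print(gate.__name__, "is reversible")
```

Note how peres(a, b, 0) produces (a, a XOR b, a AND b), i.e. the sum and carry of a half adder on its last two outputs, which is why these gates are natural building blocks for reversible adders.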
Virtual prototypes are simulators used in the consumer electronics industry; in particular, they allow for early development of embedded software. Transaction-Level Modeling (TLM) is a widely used technique for designing such virtual prototypes, and the SystemC modeling language is the current industry standard for developing them. Our experience suggests that writing TLM models exclusively in SystemC sometimes leads to confusion between modeling concepts and their implementation, and may be the root of some known bad practices. This paper introduces jTLM, an experimentation framework that allows us to study the extent to which common modeling issues come from more fundamental constraints of the TLM approach. We focus on a discussion of the two modes of simulation scheduling, cooperative and preemptive, and confront the implications of these two modes for the way TLM models are designed, the software bugs exposed by the simulators, and the simulation performance.
This paper relies on the longest closest subsequence (LCSS), a variant of the longest common subsequence (LCS), to account for noise and process variations inherent in analog circuits. The idea is to use stochastic differential equations (SDEs) to model the design and to integrate device variation due to the 0.18 μm fabrication process in a MATLAB simulation environment. LCSS is used to find the longest and closest subsequence that matches the subsequence of an ideal circuit. We illustrate the proposed approach on a Colpitts oscillator circuit. Advantages of the proposed method are its robustness and its flexibility to account for a wide range of variations.
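The LCSS idea can be sketched as a small variant of the standard LCS dynamic program: instead of requiring exact equality, two samples "match" when they lie within a tolerance eps of each other. This is a generic sketch of that recurrence, not the authors' implementation; the waveforms and tolerance below are invented for illustration:

```python
def lcss_length(x, y, eps):
    """Length of the longest common subsequence of x and y, where two
    samples are considered to match when they differ by less than eps."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(x[i - 1] - y[j - 1]) < eps:
                # Approximate match: extend the subsequence.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Hypothetical example: an ideal oscillator waveform vs. a noisy one.
ideal = [0.0, 1.0, 0.0, -1.0, 0.0]
noisy = [0.05, 0.9, 0.4, 0.02, -1.1, 0.0]
print(lcss_length(ideal, noisy, eps=0.2))
```

Here the spurious sample 0.4 in the noisy trace is simply skipped, so all five ideal samples still find close matches in order; a longer LCSS relative to the sequence length indicates a circuit response closer to the ideal one.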
This paper presents an efficient technique to perform multi-objective design space exploration of a multiprocessor platform. Instead of using semi-random search algorithms (like simulated annealing, tabu search, genetic algorithms, etc.), we use the domain knowledge derived from the platform architecture to set up the exploration as a discrete-space multi-objective Markov Decision Process (MDP). The system walks the design space changing its parameters, performing simulations only when probabilistic information becomes insufficient for a decision. The algorithm employs a novel multi-objective value function and exploration strategy, which guarantees high accuracy and minimizes the number of necessary simulations. The proposed technique has been tested with a small benchmark (to compare the results against exhaustive exploration) and two large applications (to prove effectiveness in a real case), namely the ffmpeg transcoder and the pigz parallel compressor. Results show that the exploration can be performed with 10% of the simulations necessary for state-of-the-art exploration algorithms and with unrivaled accuracy (0.6 ± 0.05% error).
We present a high-level method for rapidly and accurately predicting bus contention effects on energy and performance in multi-processor SoCs. Unlike most other approaches, which rely on Transaction-Level Modeling (TLM), we infer the information we need directly from executing the algorithmic specification, without needing to build any high-level architectural model. This results in higher estimation speed and allows us to maintain our prediction results within ~2% of gate-level estimation accuracy.
The loop buffer has mainly been explored as an effective architectural technique for low-power execution in embedded processors. Another avenue for exploiting the loop buffer, however, is to obtain a performance benefit from it. In this paper, we propose an application-specific loop buffer organization for vectorized processing kernels that achieves both low-power and high-performance goals. The vectorized loop buffer (VLB) is simplified with single-loop support for SIMD devices. Since significant data-rearrangement overhead is required in order to use SIMD capabilities, the VLB is specialized for zero-overhead implicit data permutation. We add several instructions to the baseline ISA for programming and integrate the VLB into an embedded processor for evaluation. Our results show that the VLB improves performance and power significantly compared to conventional SIMD devices.
Synthesis of reversible circuits is an active research area motivated by its applications, e.g., in quantum computation and low-power design. The number of circuit lines used is thereby a crucial criterion. In this paper, we introduce several methods (including a theoretical upper bound) for the efficient computation, or at least approximation, of the minimal number of lines needed to realize a given function in reversible logic. While the proposed exact approach requires a significant amount of run-time (exponential in the worst case), the heuristic methods lead to very precise approximations in very short run-time. Using this, it can be shown that current synthesis approaches for large functions are still far from producing circuits that are optimal with respect to the number of lines.
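One commonly cited lower bound in this setting is that embedding an irreversible function into a reversible circuit needs at least ceil(log2(mu)) garbage outputs, where mu is the largest number of input patterns mapped to the same output pattern (so that all input patterns can be distinguished at the outputs). A sketch of computing that bound from a truth table; this is the textbook bound, not the paper's own exact or heuristic method:

```python
import math
from collections import Counter

def min_garbage_lower_bound(truth_table):
    """Lower bound on the number of garbage outputs needed to embed an
    irreversible function into a reversible one: ceil(log2(mu)), where
    mu is the largest number of input patterns sharing one output."""
    mu = max(Counter(truth_table).values())
    return math.ceil(math.log2(mu))

# 2-input AND: the output 0 occurs for three input patterns, so mu = 3
# and at least ceil(log2(3)) = 2 garbage outputs are required.
print(min_garbage_lower_bound([0, 0, 0, 1]))
```

Extra garbage outputs translate directly into extra circuit lines, which is why this quantity drives the line-count minimization discussed above.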
In this paper we propose a design methodology to explore dynamic and partial reconfiguration (DPR) of modern FPGAs. We define a set of rules in order to model DPR by means of UML and design patterns. Our approach targets MPSoPC (Multiprocessor System on Programmable Chip) designs, which allows: a) area optimization through partial reconfiguration without performance penalty and b) increased system flexibility through dynamic behavior modeling and implementation. In our case, area reduction is achieved by reconfiguring co-processors connected to embedded processors, and flexibility is achieved by permitting new behavior to be easily added to the system. Most of the system is automatically generated by means of MDE techniques. Our modeling approach allows designers to target dynamic reconfiguration without being experts in modern FPGAs. Such a methodology allows design-time speed-up and a significant reduction of the gap between hardware and software modeling.
This paper presents a technique for automated generation of hierarchical classification schemes to express the main similarities and differences between analog circuits. The produced classification schemes offer insight about the uniqueness and importance of specific design features in setting various performance attributes as well as the limiting factors of designs. Hence, the classification schemes serve as a systematic way of relating one circuit design to alternatives. The automatically produced classification schemes for a set of OpAmps are discussed.
Although general-purpose GPUs have relatively high computing capacity, they also introduce high power consumption compared with general-purpose CPUs. Low-power techniques targeted at GPUs will therefore be one of the hottest topics in the coming years. On the other hand, in several application domains, users are unwilling to sacrifice performance to save power. In this paper, we propose an effective kernel fusion method to reduce the power consumption of GPUs without performance loss. Instead of executing multiple kernels serially, the proposed method fuses several kernels into one larger kernel. Since most consecutive kernels in an application have data dependencies and cannot be fused directly, we split large kernels into multiple slices with a strip-mining method and then fuse independent sliced kernels into one kernel. Based on the CUDA programming model, we propose three different kernel fusion implementations, each targeting a specific case. Based on different strip-mining methods, we also propose two fusion mechanisms, called invariant-slice fusion and variant-slice fusion; the latter can be better adapted to the requirements of the kernels to be fused. The experimental results validate that the proposed kernel fusion method can effectively reduce the power consumption of GPUs.
Keywords-GPGPU, Kernel Fusion, Strip-mining, Power Efficiency
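The core trick (splitting dependent kernels into slices so that independent slices can be fused) can be sketched in plain Python, abstracting away CUDA entirely. This toy is our own illustration, not the paper's implementation: kernel B depends on kernel A's output, so whole-kernel fusion is illegal, but slice i of B depends only on slice i of A, so the two can be interleaved slice by slice:

```python
def strip_mine(n, slice_size):
    """Split the iteration space [0, n) into contiguous slices."""
    return [(s, min(s + slice_size, n)) for s in range(0, n, slice_size)]

def fused(a, b, n, slice_size):
    """Toy software analogue of slice-level kernel fusion: each slice of
    kernel A runs immediately before the matching slice of kernel B,
    instead of running all of A and then all of B."""
    out = [0] * n
    for lo, hi in strip_mine(n, slice_size):
        for i in range(lo, hi):      # slice of kernel A
            a[i] = a[i] * 2
        for i in range(lo, hi):      # slice of kernel B, uses A's result
            out[i] = a[i] + b[i]
    return out

print(fused([1, 2, 3, 4], [10, 20, 30, 40], n=4, slice_size=2))
```

On a GPU the payoff is different from this sequential sketch: fused slices from independent kernels occupy the machine together, raising utilization so the same work finishes with less total energy.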
This paper presents a ripple-carry adder module that can serve as a basic component for Quantum-dot Cellular Automata (QCA) arithmetic circuits. The main methodological design innovation over existing state-of-the-art solutions is the adoption of so-called minority gates in addition to the more traditional majority voters. Exploiting this widened basic block set, we obtain a more compact, and thus less expensive, circuit. Moreover, the layout is designed to comply with the rules for robustness against noise paths [6].
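The majority voter and its complement, the minority gate, are the primitives referred to above. A short sketch (ours, for illustration) of why the widened block set helps: fixing one input specializes a majority gate to AND or OR, while the minority gate directly yields NAND or NOR, and the carry-out of a full adder is exactly a majority vote:

```python
def maj(a, b, c):
    """Three-input majority voter, the basic QCA logic primitive."""
    return (a & b) | (b & c) | (a & c)

def mino(a, b, c):
    """Minority gate: the complement of the majority voter."""
    return 1 - maj(a, b, c)

# Fixing one input specializes the gates:
# maj(a, b, 0) = AND   maj(a, b, 1) = OR
# mino(a, b, 0) = NAND  mino(a, b, 1) = NOR
for a in (0, 1):
    for b in (0, 1):
        assert maj(a, b, 0) == a & b
        assert mino(a, b, 0) == 1 - (a & b)

# In a ripple-carry adder, the carry-out is directly a majority vote:
print(maj(1, 0, 1))  # cout = maj(a, b, cin) -> 1, a carry is generated
```

Having inverting gates (minority) natively available avoids the extra inverters that a majority-only basic block set would require, which is one source of the compactness claimed in the abstract.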
A "smart microgrid" refers to a distribution network for electrical energy, starting from electricity generation to its transmission and storage, with the ability to respond to dynamic changes in energy supply through co-generation and demand adjustments. At the scale of a small town, a microgrid is connected to the wide-area electrical grid, which may be used for "baseline" energy supply, or in the extreme case only as a storage system for a completely self-sufficient microgrid. Distributed generation, storage and intelligence are key components of a smart microgrid. In this paper, we examine the significant role that buildings play in energy use and its management in a smart microgrid. In particular, we discuss the effect that IT equipment has on energy usage by buildings, and show that control of various building subsystems (such as IT and HVAC) can lead to significant energy savings. Using UCSD as a prototypical smart microgrid, we discuss how buildings can be enhanced and interfaced with the smart microgrid, and demonstrate the benefits that this relationship can bring as well as the challenges in implementing this vision.
The exponential increase in world energy demand, with a forecasted rise of 45% [1] between 2010 and 2030, makes energy management one of the most urgent topics of the century and a key driver for the evolution of semiconductor and electronics products. The main solutions to world energy demand and global warming issues have been divided into two main streams: an increasing offer of alternative energy sources and their integration into the new Smart Grid, and a reduction of demand through an increase in the efficiency of systems.
Maximizing the performance of the Itoh-Tsujii finite field inversion algorithm (ITA) on FPGAs requires tuning several design parameters, which is often time-consuming and difficult. This paper presents a theoretical model of the ITA for any Galois field and any k-input LUT-based FPGA (k > 3). Such a model aids a hardware designer in selecting the ideal design parameters quickly. The model is experimentally validated with the NIST-specified fields and with 4- and 6-LUT based FPGAs. Finally, it is demonstrated that the resulting designs of the Itoh-Tsujii inversion algorithm are the most optimized among contemporary works on LUT-based FPGAs.
Reconfigurable hardware such as FPGAs is increasingly being employed for accelerating compute-intensive applications. While recent advances in technology have increased the capacity of FPGAs, the lack of standard models for developing custom accelerators creates issues with scalability and compatibility. We present SHARC - Streaming Hardware Accelerator with Run-time Configurability - for an FPGA-based accelerator. This model is at a lower level than existing stream processing models and provides the hardware designer with a flexible platform for developing custom accelerators. The SHARC model provides a generic interface for each hardware module and a hierarchical structure for parallelism at multiple levels in an accelerator. It also includes a parameterization and hierarchical run-time reconfiguration framework to enable hardware reuse for flexible yet high-throughput design. This model is very well suited for compute-intensive applications in areas such as real-time vision and signal processing, where stream processing provides enormous performance benefits. We present a case study by implementing a bio-inspired saliency-based visual attention system using the proposed model and demonstrate the benefits of run-time reconfiguration. Experimental results show about 5X speedup over an existing CPU implementation and up to 14X higher performance-per-watt over a relevant GPU implementation.
Several approaches have been proposed to accelerate the NP-complete Boolean Satisfiability problem (SAT) using reconfigurable computing. In this paper, we present a five-stage pipelined SAT solver, with SAT solving broken into five stages: variable decision, variable effect fetch, clause evaluation, conflict detection, and conflict analysis. The solver performs a novel search algorithm combining the advanced techniques of state-of-the-art SAT solvers: non-chronological backjumping, dynamic backtracking, and learning without explicit traversal of the implication graph. SAT instance information is stored in FPGA block RAMs, avoiding synthesis overhead for each instance. The proposed solver achieves up to 70x speedup over other hardware SAT solvers with 200x less resource utilization.
Keywords - Boolean Satisfiability, Conflict-directed jumping.
We introduce on-demand redundancy, a set of architectural techniques that leverage the tightly-coupled nature of components in systems-on-chip to reduce the cost of safety-critical systems. On-demand redundancy eases the assumptions that traditionally segregate the execution of critical and non-critical tasks (NCTs), making resources available for critical tasks at potentially arbitrary points in both space and time, and otherwise freeing resources to execute non-critical tasks when critical tasks are not executing. Relaxed dedication is one such technique that allows non-critical tasks to execute on critical task resources. Our results demonstrate that for a wide variety of applications and architectures, relaxed dedication is more cost-effective than a traditional approach that employs dedicated resources executing in lockstep. Applied to dual-modular redundancy (DMR), relaxed dedication exposes 73% more NCT cycles than traditional DMR on average, across a wide variety of usage scenarios.
Given the projected higher variations in the availability of computational resources, adaptive static schedules have been developed to attain high-speed execution reconfiguration with no reliance on any runtime rescheduling decisions. These schedules are able to deliver predictable execution despite the increased levels of device unreliability in future multicore systems. Yet the associated runtime reconfiguration overhead is largely determined by the underlying system topology. Fully connected architectures, although they can effectively hide the overhead in execution migration, become infeasible as the core count grows to hundreds in the near future. We exploit in this paper the high locality associated with adaptive static schedules, and outline a scalable and locally shareable system organization for multicore platforms. With the incorporation of a limited set of neighborhood-centered communication links, threads are allowed to be directly migrated among adjacent cores without physical data movement. At the architecture level, a set of 2-dimensional physical topologies with such a local sharing property embedded is furthermore proposed. The inherent regularity allows these topologies to be adopted as a fixed-silicon multicore platform that can be flexibly redefined according to the parallelism characteristics and resilience needs of each application.
A novel policy for allocating reconfigurable fabric resources in multi-core processors is presented. We deploy a Minority-Game to maximize the efficient use of the reconfigurable fabric while meeting performance constraints of individual tasks running on the cores. As we will show, the Minority Game ensures a fair allocation of resources, e.g., no single core will monopolize the reconfigurable fabric. Rather, all cores receive a "fair" share of the fabric, i.e., their tasks would miss their performance constraints by approximately the same margin, thus ensuring an overall graceful degradation. The policy is implemented on a Virtex-4 FPGA and evaluated for diverse applications ranging from security to multimedia domains. Our results show that the Minority-Game policy achieves on average 2x higher application performance and a 5x improved efficiency of resource utilization compared to state-of-the-art.
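The Minority Game underlying the allocation policy is a standard game-theoretic model: in each round every agent (here, a core contending for fabric) picks one of two sides, and the agents on the minority side win. Over repeated rounds no agent can monopolize the winning side, which is the fairness property the abstract exploits. A generic sketch of the game itself, not of the paper's hardware policy:

```python
import random

def minority_game_round(choices):
    """One round: each agent picks side 0 or 1; agents on the minority
    side win. Returns the winning side, or None on a tie."""
    ones = sum(choices)
    zeros = len(choices) - ones
    if ones == zeros:
        return None
    return 1 if ones < zeros else 0

# Repeated play with random strategies: wins spread evenly, so no single
# agent dominates. The agent count and round count are arbitrary.
random.seed(0)
agents, scores = 5, [0] * 5
for _ in range(100):
    choices = [random.randint(0, 1) for _ in range(agents)]
    winner = minority_game_round(choices)
    if winner is not None:
        for i, c in enumerate(choices):
            if c == winner:
                scores[i] += 1
print(scores)
```

In the paper's setting the "sides" correspond to requesting or yielding fabric resources; the minority rule is what makes all cores miss their constraints by roughly the same margin rather than starving any one of them.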
This paper proposes a linearised state-space technique to accelerate the simulation of tunable vibration energy harvesting systems by at least two orders of magnitude. The paper provides evidence that currently available simulation tools are inadequate for simulating complete energy harvesting systems, where prohibitive CPU times are encountered due to disparate time scales. In the proposed technique, the model of a complete mixed-technology energy harvesting system is divided into component blocks whose mechanical and analogue electrical parts are modelled by local state equations and terminal variables, while the digital electrical part is modelled as a digital process. Unlike existing simulation tools that use the Newton-Raphson method, the proposed technique uses explicit integration, such as the Adams-Bashforth method, to solve the state equations of the complete energy harvester model in a short simulation time. Experimental measurements of a practical tunable energy harvester have been carried out to validate the proposed technique.
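The speedup claim rests on using an explicit multistep integrator, which advances the state with a direct formula per step instead of the iterative Newton-Raphson solves that implicit methods require. A sketch of the two-step Adams-Bashforth scheme, x_{n+1} = x_n + dt(3/2 f_n - 1/2 f_{n-1}), applied to an invented scalar test system (the paper's actual harvester equations are not reproduced here):

```python
import math

def ab2(f, x0, dt, steps):
    """Two-step Adams-Bashforth integration of dx/dt = f(x).
    Explicit: each step is a direct evaluation, with no per-step
    Newton-Raphson iteration."""
    xs = [x0]
    f_prev = f(x0)
    xs.append(x0 + dt * f_prev)  # bootstrap first step with forward Euler
    for _ in range(steps - 1):
        f_curr = f(xs[-1])
        xs.append(xs[-1] + dt * (1.5 * f_curr - 0.5 * f_prev))
        f_prev = f_curr
    return xs

# Linear test system dx/dt = -x, exact solution exp(-t).
xs = ab2(lambda x: -x, 1.0, dt=0.01, steps=100)
print(abs(xs[-1] - math.exp(-1.0)))  # small second-order discretization error
```

Explicit schemes like this trade unconditional stability for per-step cheapness, which pays off when the linearised state equations are well conditioned, as the abstract argues they are after partitioning the system into blocks.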
We report an approach targeted to aid design exploration and early decision-making in model refinement, optimization, and trade-offs. The approach consists of coupling SystemC AMS with a descriptive functional simulator. System engineering tools are typically used in the design and analysis of system prototypes captured at a very high level, where accuracy and detail of results are naturally compromised in favor of simulation speed and design effort. In the presented approach, the much-needed abstraction and simulation speed are retained during simulation of the platform architecture, while near-implementation models (RTL, SPICE) may also be co-simulated with the architecture.
This contribution proposes syntax extensions to SystemC-A that support mixed-technology system modelling where components might exhibit distributed behaviour modelled by partial differential equations. The important need for such extensions arises from the well known modelling difficulties in hardware description languages where complex electronics in a mixed-technology system interfaces with distributed components from different physical domains, e.g. mechanical, magnetic or thermal. A digital MEMS accelerometer with distributed mechanical sensing element is used as a case study to illustrate modelling capabilities offered by the proposed extended syntax of SystemC-A.
Stochastic circuit reliability analysis, as described in this work, matches the statistical attributes of underlying device fabrics and transistor aging to the spatial and temporal reliability of an entire circuit. For the first time, spatial and temporal stochastic and deterministic reliability effects are handled together in an efficient framework. The paper first introduces an equivalent transistor SPICE model comprising the currently most important aging effects (i.e., NBTI, hot carriers and soft breakdown). A simulation framework then uses this SPICE model to minimize the number of circuit factors and to build a circuit model, which allows, for example, very fast circuit yield analysis. Using experimental design techniques, the proposed method is very efficient and also proves to be very flexible. The simulation technique is demonstrated on an example 6-bit current-steering DAC, where the creation of soft breakdown spots can result in circuit failure due to increasing time-dependent transistor mismatch. Index Terms - NBTI, Hot Carrier Degradation, TDDB, SBD, HBD, Failure-Resilience, Aging, Design for Reliability.
Delay testing is performed to guarantee that a manufactured chip is free of delay defects and meets its performance specification. However, only a few delay faults are robustly testable. For robustly untestable faults, non-robust tests, which are of lesser quality, are typically generated. Due to significantly relaxed conditions, there is a large quality gap between non-robust and robust tests. This paper presents a test generation procedure for As-Robust-As-Possible (ARAP) tests to increase the overall quality of the test set. Instead of generating a non-robust test for a robustly untestable fault, an ARAP test is generated which maximizes the number of satisfiable conditions required for robust test generation by pseudo-Boolean optimization. Additionally, the problem formulation is extended to incorporate the increased significance of small delay defects, reducing the likelihood that small delay defects invalidate the test. Experimental results on large industrial circuits confirm the quality gap and show that the generated ARAP tests satisfy a large percentage of all robustness conditions on average, which signifies very high quality.
Functional broadside tests are two-pattern scan-based tests that avoid overtesting by ensuring that a circuit traverses only reachable states during the functional clock cycles of a test. On-chip test generation has the added advantage that it reduces test data volume and facilitates at-speed test application. This paper shows that on-chip generation of functional broadside tests can be done using simple hardware, and can achieve high transition fault coverage for testable circuits. With the proposed on-chip test generation method, the circuit is used for generating reachable states during test application. This alleviates the need to compute reachable states off-line.
Fault simulation of digital circuits must correctly compute fault coverage to assess test and product quality. In case of unknown values (X-values), fault simulation is pessimistic and underestimates actual fault coverage, resulting in increased test time and data volume, as well as higher overhead for design-for-test. This work proposes a novel algorithm to determine fault coverage with significantly increased accuracy, offering increased fault coverage at no cost, or the reduction of test costs for the targeted coverage. The algorithm is compared to related work and evaluated on benchmark and industrial circuits. Index Terms - Unknown values, fault coverage, precise fault simulation
Exhaustive state space exploration based verification of embedded system designs remains a challenge despite three decades of active research into Model Checking. On the other hand, simulation based verification of even critical embedded system designs is often subject to financial budget considerations in practice. In this paper, we suggest an algorithm that minimizes the overall cost of producing an embedded system including the cost of testing the embedded system and expected losses from an incompletely tested design. We seek to quantify the trade-off between the budget for testing and the potential financial loss from an incorrect design. We demonstrate that our algorithm needs only a logarithmic number of test samples in the cost of the potential loss from an incorrect validation result. We also show that our approach remains sound when only upper bounds on the potential loss and lower bounds on the cost of simulation are available. We present experimental evidence to corroborate our theoretical results.
Parallel stream processing applications are often executed on shared-memory multiprocessor systems. Synchronization between tasks is needed to guarantee correct functional behavior. An increase in the communication granularity of the tasks in the parallel application can decrease the synchronization overhead. However, using coarser-grained synchronization can result in deadlock or violation of the throughput constraint of the application in the case of cyclic data dependencies. Resynchronization tries to change the synchronization behavior in order to reduce the synchronization overhead. Determining the amount of resynchronization while preventing deadlock and satisfying the throughput constraint of the application forms a global analysis problem. In this paper we present a Linear Programming (LP) algorithm for minimizing synchronization by means of resynchronization, based on the properties of dataflow models. We demonstrate our approach with an extended Constant Modulus Algorithm (CMA) in a beam-forming application. For this application we reduce the number of synchronization statements by 30% under a memory constraint of 200 tokens. The algorithm which calculates this reduction takes less than 20 milliseconds for this problem instance.
Heterogeneous multi-core platforms are widely accepted for high-performance multimedia embedded systems. Although pipelining techniques can enhance performance on multi-core platforms, the data dependencies involved in processing compressed multimedia data make it difficult, if not impossible, to automate pipelined design. In this paper, we target multimedia streaming applications on heterogeneous multi-core platforms and develop the "Tile Piecing Algorithm" for pipelined schedule synthesis for the targeted applications and platforms. The algorithm gives an efficient way to construct a pipelined schedule. The performance evaluation shows that the algorithm utilizes the computation resources as well as the optimal algorithm does, while taking only hundreds of milliseconds to complete, less than one tenth of the running time of the optimal algorithm. Finally, the synthesized schedule is well packed. The short execution time and schedule makespan make the algorithm practical for use at run-time.
3D technologies using Through Silicon Vias (TSVs) have not yet proved their viability for deployment in a large range of products. In this paper, we investigate three promising perspectives for short- to medium-term adoption of such technology in high-end Systems-on-Chip built around multi-core architectures. The wide-bus concept will help solve high-bandwidth requirements with external memory. The 3D Network-on-Chip is a promising solution for increased modularity and scalability, and we show that an efficient implementation provides an available bandwidth outperforming classical interfaces. Finally, we put in perspective the active interposer concept, which aims at simplifying and improving power, test and debug management.
Keywords: 3D, TSV, Through Silicon Via, Network-on-Chip, NoC, power management, test, debug
3D stacked DRAM improves peak memory performance. However, its effective performance is often limited by constraints such as the row-to-row activation delay (tRRD) and the four-active-bank window (tFAW). In this paper, we present a quantitative analysis of the performance impact of such constraints. To resolve the problem, we propose balancing the budget of DRAM row activations across DRAM channels. In the proposed method, an inter-memory-controller coordinator receives the current demand for row activations from the memory controllers and redistributes the budget among them in order to improve DRAM performance. Experimental results show that sharing the row-activation budget between memory channels can give an average 4.72% improvement in the utilization of 3D stacked DRAM.
Panelists: O. Bringmann, C. Chevallaz, B. Dickman, V. Esen, and M. Rohleder
For years, people have been designing electronic and computing systems focusing on improving performance while "keeping power and energy consumption in mind". This is a way of designing energy-aware or power-efficient systems, where energy is considered a resource whose utilization must be optimized within the realm of performance constraints. Increasingly, energy and power turn from optimization criteria into constraints, sometimes as critical as, for example, reliability and timing. Furthermore, quanta of energy or specific levels of power can shape the system's action. In other words, the system's behavior, i.e. the way computation and communication are carried out, can be determined or modulated by the flow of energy into the system. This view becomes dominant when energy is harvested from the environment. In this paper, we attempt to pave the way to a systematic approach to designing computing systems that are energy-modulated. To this end, several design examples are considered where power comes from energy-harvesting sources with limited power density and unstable levels of power. Our design examples include voltage sensors based on self-timed logic and speed-independent SRAM operating in the dynamic range of Vdd 0.2-1V. Overall, this work advocates the vision of designing systems in which a certain quality of service is delivered in return for a certain amount of energy.
Keywords-charge-to-digital converter; energy; energy-frugality; energy-harvesting; power; power-proportionality; self-timed logic; SRAM; voltage sensor
Integrating coarse-grained reconfigurable architectures (CGRAs) into a System-on-a-Chip (SoC) presents many benefits as well as important challenges. One of the challenges is how to customize the architecture for the target applications efficiently and effectively without explicit design space exploration. In this paper we present a novel methodology for incremental interconnect customization of CGRAs that can suggest a new interconnection architecture that can maximize the performance for a given set of application kernels while minimizing the hardware cost. Applying the inexact graph matching analogy, we translate our problem into graph matching taking into account the cost of various graph edit operations, which we solve using the A∗ search algorithm with a heuristic tailored to our problem. Our experimental results demonstrate that our customization method can quickly find application-optimized interconnections that exhibit 70% higher performance on average compared to the base architecture, with relatively little hardware increase in interconnections and muxes.
We describe a parameterized memory system suitable as target for automatic high-level language to hardware compilers for reconfigurable computers. It fully supports the spatial computation paradigm by allowing the realization of each memory operator by a dedicated hardware memory port. Interport coherency is maintained only for those ports that actually require it, and efficient speculative execution is enabled by a dynamic scheme for arbitrating access to shared resources (such as main memory), relying on techniques inspired by the branch prediction of conventional software-programmable processors.
This paper presents an adaptable softcore chip multiprocessor (CMP). The processor instruction set architecture (ISA) is based on the VEX ISA. The issue-width of the processor can be adjusted at run-time (before an application starts). The processor has eight 2-issue cores that can run independently from each other. If not in use, each core can be taken to a lower power mode by gating off its source clock. Multiple 2-issue cores can be combined at run-time to form a variety of configurations of very long instruction word (VLIW) processors. The CMP is implemented in the Xilinx Virtex-6 XC6VLX240T FPGA. It has a single ISA and requires no specialized compiler support. The CMP can target a variety of applications having instruction and/or data level parallelism. We found that applications/kernels with larger instruction level parallelism (ILP) perform better when run on a larger issue-width core, while applications with larger data level parallelism (DLP) perform better when run on multiple 2-issue cores with the data distributed among the cores.
Conventional clock skew scheduling for sequential circuits can be formulated as a minimum cycle ratio (MCR) problem, and hence can be solved effectively by methods such as Howard's algorithm. However, its application is practically limited due to the difficulties in reliably implementing a large set of arbitrary dedicated clock delays for the flip-flops. Multi-domain clock skew scheduling was proposed to tackle this impracticality by constraining the total number of clock delays. Even though this problem can be formulated as a mixed integer linear program (MILP), it is expensive to solve optimally in general. In this paper, we show that, under mild restrictions, the underlying domain assignment problem can be formulated as a special MILP that can be solved effectively using techniques similar to those for the MCR problem. In particular, we design a generalized Howard's algorithm for solving this problem efficiently. We also develop a critical-cycle-oriented refinement algorithm to further improve the results. The experimental results on ISCAS89 benchmarks show both the accuracy and efficiency of our algorithm. For example, only 4.3% of the tests have larger than 1% degradation (3% in the worst case), and all the tests finish in less than 0.7 seconds on a laptop with a 2.1GHz processor.
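The MCR formulation underlying the abstract above can be sketched in a few lines: reweight each edge as cost − λ·time and bisect on λ using negative-cycle detection. This is a generic minimum-cycle-ratio sketch, not the paper's generalized Howard's algorithm (which converges much faster), and the tiny graph below is invented for illustration; mapping clock-skew constraints onto such a graph is not shown.

```python
def has_neg_cycle(n, edges, lam):
    """Bellman-Ford from a virtual zero source: True iff some cycle has
    sum(cost - lam*time) < 0, i.e. a cycle ratio below lam exists."""
    dist = [0.0] * n
    for _ in range(n):
        changed = False
        for u, v, cost, time in edges:
            w = cost - lam * time
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False          # converged: no cycle below lam
    return True                   # still relaxing after n passes

def min_cycle_ratio(n, edges, lo=0.0, hi=100.0, iters=60):
    """Bisect on lam: the MCR is the threshold where a negative
    (reweighted) cycle first appears."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if has_neg_cycle(n, edges, mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

For example, a graph with one cycle of cost/time ratio 4/2 and another of ratio 6/2 has MCR 2.0.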
A new class of delay-insensitive (DI) codes, called DI Bus-Invert, is introduced for timing-robust global asynchronous communication. This work builds loosely on an earlier synchronous bus-invert approach for low power by Stan and Burleson, but with significant modifications to ensure that delay-insensitivity is guaranteed. The goal is to minimize the average number of wire transitions per communication (a metric for dynamic power), while maintaining good coding efficiency. Basic implementations of the key supporting hardware blocks (encoder, completion detector, decoder) for the DI bus-invert codes are also presented. Each design was synthesized using the UC Berkeley ABC tool and technology mapped to a 90nm industrial standard cell library. When compared to the most coding-efficient systematic DI code (i.e. Berger) over a range of field sizes from 2 to 14 bits, the DI bus-invert codes had 24.6 to 42.9% fewer wire transitions per transaction, while providing comparable coding efficiency. In comparison to the most coding-efficient non-systematic DI code (i.e. m-of-n), the DI bus-invert code had similar coding efficiency and number of wire transitions per transaction, but with significantly lower hardware overhead.
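For reference, the classic synchronous bus-invert scheme of Stan and Burleson that this work builds on can be sketched as follows: invert the word whenever more than half of the bus wires would toggle, signalling the choice on one extra invert line. The paper's delay-insensitive codes modify this scheme substantially, which the sketch does not capture.

```python
def bus_invert_encode(prev_bus, data):
    """Return (bus_word, invert_bit): invert when more than half the
    data wires would toggle relative to the previous bus state."""
    n = len(data)
    toggles = sum(p != d for p, d in zip(prev_bus, data))
    if toggles > n // 2:
        return [1 - d for d in data], 1
    return list(data), 0

def bus_invert_decode(bus_word, invert_bit):
    """Recover the original data word from the bus and the invert line."""
    return [1 - b for b in bus_word] if invert_bit else list(bus_word)
```

With a 4-bit bus previously holding 0000, sending 1110 (three toggles) is encoded inverted as 0001 with the invert line asserted, cutting the data-wire transitions from three to one.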
The class of speed-independent (SI) circuits opens a promising way towards tolerating process variations. However, the fundamental assumption of SI circuits is that forks in some wires (usually, a large percentage of wires) are isochronic; this assumption is increasingly challenged by shrinking technology. This paper suggests a method to generate the weakest timing constraints under which an SI circuit works correctly with bounded wire delays. The method works for all SI circuits, and the generated timing constraints are significantly weaker than the weakest formally proven conditions suggested in the current literature.
This paper describes an approach to pipelining in high-level
synthesis that modifies the control/data flow graph before and after
scheduling. This enables the direct re-use of a pre-existing, timing- and
area-aware non-pipelined simultaneous scheduler and binder. Such an
approach ensures that the RTL output can be synthesized within the
given timing and area constraints. Results from real industrial designs
show the effectiveness of this approach in improving Pareto optimality
with respect to area, delay and power.
Keywords- pipelining, high-level synthesis, design exploration
Arithmetic blocks consume a major portion of chip area, delay and power, and the arithmetic sum-of-product (SOP) is a widely used block. We introduce a novel binary integer linear program (BLP) based algorithm for optimising a general class of mutually exclusive SOPs. Benchmarks, drawn from existing literature and standard APIs or constructed for demonstration purposes, exhibit speed improvements of up to 16% and area reductions of up to 57% in a 65nm TSMC process.
Creating parameterized "chip generators" has been proposed as one way to decrease chip NRE costs. While many approaches are available for creating or generating flexible data path elements, the design of flexible controllers is more problematic. The most common approach is to create a microcoded engine as the controller, which offers flexibility through programmable table-based lookup functions. This paper shows that after "programming" the hardware for the desired application, or applications, these flexible controller designs can be easily converted to efficient fixed (or less programmable) solutions using partial evaluation capabilities that are already present in most synthesis tools.
Heterogeneous datapaths maximize the utilization of
functional units (FUs) by customizing their widths individually
through fragmentation of wide operands. In comparison, slices in
large functional units in a homogeneous datapath can spend
many cycles not performing useful work. Various fragmentation
techniques have demonstrated benefits in minimizing the total
functional unit area. On closer inspection of fragmentation
techniques, we observe that the area savings achieved by
heterogeneous datapaths can be traded off for power
optimization. Our specific approach is to introduce choices of
functional units with power/area trade-offs for different
fragmentation and allocation choices, in order to reduce power
consumption while satisfying the area constraint imposed on the
heterogeneous datapath. As low-power FUs in the literature incur
an area penalty, a methodology is needed to introduce them into
the HLS flow while complying with the area constraint. We
propose allocation and module selection algorithms that pursue a
trade-off between area and power consumption for fragmented
datapaths under a total area constraint. Results show that it is
possible to reduce power by 37% on average (49% in the best
case). Moreover, latency and cycle time remain equal or nearly
equal to the baseline, which leads to an energy reduction as well.
Keywords: low-power, area, HLS
This work presents a high-level synthesis methodology that uses the abstract state machines (ASMs) formalism as an intermediate representation (IR). We perform scheduling and allocation on this IR, and generate synthesizable VHDL. Using ASMs as an IR has the following advantages: 1) it allows the specification of both sequential and parallel computation, 2) it supports an extension of a clean timing model based on an interpretation of the sequential semantics, and 3) it has well-defined formal semantics, which allows the integration of formal methods into the methodology. While we specify our designs using ASMs, we do not mandate this: one can instead create translators that convert algorithmic specifications from C-like languages into equivalent ASM specifications, making the hardware synthesis transparent to the designer. We evaluate our methodology on examples of an FIR filter, a microprocessor, and an edge detector. We synthesize these designs and validate them on an FPGA.
The IEEE P1687 (IJTAG) standard proposal aims
at standardizing the access to embedded test and debug logic
(instruments) via the JTAG TAP. P1687 specifies a component
called Segment Insertion Bit (SIB) which makes it possible to
construct a multitude of alternative P1687 instrument access
networks for a given set of instruments. Finding the best access
network with respect to instrument access time and the number
of SIBs is a time-consuming task in the absence of EDA support.
This paper is the first to describe a P1687 design automation
tool which constructs and optimizes P1687 networks. Our EDA
tool, called PACT, considers the concurrent and sequential access
schedule types, and is demonstrated in experiments on industrial
SOCs, reporting total access time and average access time.
Keywords-IEEE P1687 IJTAG, Design Automation, Instrument
Access, Access Time Optimisation
3D integration of ICs is an emerging technology in which
multiple silicon dies are stacked vertically. The manufacturing
itself is based on wafer-to-wafer bonding, die-to-wafer bonding or
die-to-die bonding. Wafer-to-wafer bonding has the lowest yield,
as a good die may be stacked against a bad die, wasting the good
die. Thus the latter two options are preferred to keep yield high
and manufacturing costs low. However, these methods require
dies to be tested separately before they are stacked. A problem
with testing dies separately is that the clock network of a prebond
die may be incomplete before stacking. In this paper we present a
solution to this problem, based on on-die DLL implementations
that are activated only when testing prebond, unstacked dies, in
order to synchronize disconnected clock regions. A difficulty with
using DLLs in testing is that they cannot be turned on or off
within a single cycle. Since scan-based testing requires that test
patterns be scanned in at a slow clock frequency before fast
capture clocks are applied [1], on-product clock generation
(OPCG) must be used. The proposed solution addresses both
problems. Furthermore, we show that a higher-speed DLL is
better suited not only to high-frequency system clocks but also to
lower power, owing to its smaller variable delay line.
Keywords-3D integrated circuit testing, delay lock loops, low
power testing, on-product clock generation
3D IC technology has demonstrated significant performance and power gains over 2D. However, for the technology to be viable, yield must be increased. Testing a complete 3D IC only after stacking leads to an exponential decay in yield, so pre-bond tests are required to ensure correct functionality of each die. In this work we propose a hypergraph-based biased netlist partitioning scheme for pre-bond testing of individual dies that reduces the extra hardware (flip-flops) required. Further hardware reduction is achieved by a logic-cone-based flip-flop sharing scheme. Simulation results on ISCAS89 benchmark circuits and several industrial benchmarks demonstrate the effectiveness of the proposed approach.
Production test suites include a large number of redundant test patterns due to the inclusion of multiple test types with overlapping defect detection and the use of simple fault models for test generation. Identification and elimination of ineffective test patterns promises a significant reduction in test cost. This paper proposes a test framework that learns, without extensive data collection and at no additional test time, the effectiveness of individual test patterns during production testing by getting defect detection feedback from a dynamic test flow. The proposed technique is further capable of adapting to changes in the underlying defect mechanisms by tracking the defect detection trend of test patterns.
The determination of the optical flow is a central problem in image processing, as it describes how an image changes over time by means of a numerical vector field. Estimating the optical flow is, however, a very complex problem that has been tackled with many different mathematical approaches. A large body of work has recently been published on variational methods, following the technique for total variation minimization proposed by Chambolle. Still, their hardware implementations do not offer good performance in terms of frames processed per time unit, mainly because of the complex dependency scheme among the data. In this work, we propose a highly parallel and accelerated FPGA implementation of the Chambolle algorithm, which splits the original image into a set of overlapping sub-frames and efficiently exploits the reuse of intermediate results. We validate our hardware on large frames (up to 1024x768); the proposed approach significantly improves on state-of-the-art implementations, reaching up to 76x speedups, which enables real-time frame rates even at high resolutions.
Object detection is a vital task in several emerging
applications, requiring real-time detection frame-rate and low
energy consumption for use in embedded and mobile devices.
This paper proposes a hardware-based, depth-directed search
method for reducing the search space involved in object
detection, resulting in significant speed-ups and energy savings.
The proposed architecture utilizes the disparity values computed
from a stereoscopic camera setup, in an attempt to direct the
detection classifier to regions that contain objects of interest. By
eliminating large amounts of search data, the proposed system
achieves both performance gains and reduced energy
consumption. FPGA simulation results indicate performance
speedups of up to 4.7x and energy savings ranging from 41% to
48%, compared to the traditional sliding-window approach.
Keywords- Hardware Object Detection; Stereoscopic Disparity
Computation; FPGA Image Processing
This paper presents a novel motion and disparity estimation (ME, DE) scheme in Multiview Video Coding (MVC) that addresses the high throughput challenge jointly at the algorithm and hardware levels. Our scheme is composed of a fast ME/DE algorithm and a multi-level pipelined parallel hardware architecture. The proposed fast ME/DE algorithm exploits the correlation available in the 3D-neighborhood (spatial, temporal, and view). It eliminates the search step for different frames by prioritizing and evaluating the neighborhood predictors. It thereby reduces the coding computations by up to 83% with 0.1dB quality loss. The proposed hardware architecture further improves the throughput by using parallel ME/DE modules with a shared array of SAD (Sum of Absolute Differences) accelerators and by exploiting the four levels of parallelism inherent to the MVC prediction structure (view, frame, reference frame, and macroblock levels). A multi-level pipeline schedule is introduced to reduce the pipeline stalls. The proposed architecture is implemented for a Xilinx Virtex-6 FPGA and as an ASIC with an IBM 65nm low power technology. It is compared to state-of-the-art at both algorithm and hardware levels. Our scheme achieves a real-time (30fps) ME/DE in 4-view High Definition (HD1080p) encoding with a low power consumption of 81 mW.
The objective of this work is the systematic study of the use of electrochemical readout for advanced diagnosis and drug monitoring. Whereas to date various electrochemical principles have been studied and successfully tested, they typically operate on a single target molecule and are not integrated in a full data analysis chain. The present work aims to review various sensing approaches and explore the design space for integrated realization of multi-target sensors and sensor arrays. Index Terms - biosensor, integrated circuit, metabolite, oxidase, cytochrome P450, potentiostat.
The field of Wireless Sensor Networks (WSNs) is now in a stage where serious applications of societal and economical importance are in reach. For example, it is well known that the global climate change dramatically influences the visual appearance of mountain areas like the European Alps. Very destructive geological processes may be triggered or intensified, impacting the stability of slopes, possibly inducing landslides. Unfortunately, the interactions between these complex processes are poorly understood. Therefore, one needs to develop wireless sensing technology as a new scientific instrument for environmental sensing under extreme conditions. Large variations in temperature, humidity, mechanical forces, snow coverage, and unattended operation play a crucial role in long-term deployments. We argue that, in order to significantly advance the application domain, it is inevitable that sensor networks be created as a quality scientific instrument with known and predictable properties, and not as a research toy delivering average observations at best. In this paper, key techniques for achieving highly reliable, yet resource efficient wireless sensor networks are discussed on the basis of productive wireless sensor networks measuring permafrost processes in the Swiss Alps.
New tendencies envisage 3D Multi-Processor System-On-Chip (MPSoC) design as a promising solution to keep increasing the performance of the next-generation high-performance computing (HPC) systems. However, as the power density of HPC systems increases with the arrival of 3D MPSoCs, supplying electrical power to the computing equipment and constantly removing the generated heat is rapidly becoming the dominant cost in any HPC facility. Thus, both power and thermal/cooling implications play a major role in the design of new HPC systems, given the energy constraints in our society. Therefore, EPFL, IBM and ETHZ have been working within the CMOSAIC Nano-Tera.ch program project in the last three years on the development of a holistic thermally-aware design. This paper presents the exploration in CMOSAIC of novel cooling technologies, as well as suitable thermal modeling and system-level design methods, which are all necessary to develop 3D MPSoCs with inter-tier liquid cooling systems. As a result, we develop run-time thermal control strategies that achieve energy-efficient cooling while compressing almost 1 Tera nano-sized functional units into one cubic centimeter with a 10 to 100 fold higher connectivity than otherwise possible. The proposed thermally-aware design paradigm includes exploring the synergies of hardware-, software- and mechanical-based thermal control techniques as a fundamental step to design 3D MPSoCs for HPC systems. More precisely, we target the use of inter-tier coolants ranging from liquid water and two-phase refrigerants to novel engineered environmentally friendly nano-fluids, as well as using specifically designed micro-channel arrangements, in combination with the use of dynamic thermal management at system-level to tune the flow rate of the coolant in each micro-channel to achieve thermally-balanced 3D-ICs.
Our management strategy prevents the system from surpassing the given threshold temperature while achieving up to 67% reduction in cooling energy and up to 30% reduction in system-level energy in comparison to setting the flow rate at the maximum value to handle the worst-case temperature.
Recognizing the importance of interfacing a variety of sensors and networking such sensors around the body area and by cellular services, a Swiss project within the Nano-Tera.ch Initiative is dedicated to developing a platform of circuit technologies for medical data acquisition and communication.
The paper discusses reliability threats and opportunities for analog circuit design in high-k sub-32 nanometer technologies. Compared to older SiO2 or SiON based technologies, transistor reliability is found to be worse in high-k nodes due to larger oxide electric fields, the severely aggravated PBTI effect and increased time-dependent variability. Conventional reliability margins, based on accelerated stress measurements on individual transistors, are neither sufficient nor adequate for analog circuit design. As a means to find more accurate, circuit-dependent reliability margins, advanced degradation effect models are reviewed and an efficient method for stochastic circuit reliability simulation is discussed. Also, an example 6-bit 32nm current-steering digital-to-analog converter is studied. Experiments demonstrate how the proposed simulation tool, combined with novel design techniques, can provide an up to 89% better area-power product of the analog part of the circuit under study, while still guaranteeing a 99.7% yield over a lifetime of 5 years. Index Terms - NBTI, PBTI, Hot Carriers, TDDB, SBD, HBD, Failure-Resilience, Aging, Design for Reliability, High-k CMOS.
Quantitative simulations of the statistical impact of
negative-bias-temperature-instability (NBTI) on pMOSFETs,
and positive-bias-temperature-instability (PBTI) on nMOSFETs
are carried out for a 45nm low power technology generation.
Based on the statistical simulation results, we investigate the
impact of NBTI and PBTI on the degradation of the static noise
margin (SNM) of SRAM cells. The results indicate that SNM
degradation due only to NBTI follows a different evolution
pattern compared with the impact of simultaneous NBTI and
PBTI degradation.
Keywords-NBTI; PBTI; Statistical Variability; SRAM; Static
Noise Margin
This paper describes a Design Of Experiments (DOE) based method used in computer-aided design to simulate the impact of process variations on circuit performances. The method is based on a DOE approach using simple first and second order polynomial models with multiple experiment maps. It is a technology- and circuit-independent method which allows circuit designers to perform statistical analysis with a dramatically reduced number of simulations compared to traditional methods, and hence to estimate more realistic worst cases, resulting in a reduced design cycle time. Moreover, the simple polynomial models enable direct linking of performance sensitivity to process parameters. The method is demonstrated on a set of circuits. It showed very accurate results in linking linearity, gain and noise performances to process parameters, for both RF and analog circuits.
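The response-surface idea behind such a DOE method can be sketched as follows: fit a second-order polynomial model of a circuit performance to a small factorial experiment map, then read the sensitivity to each process parameter directly off the coefficients. The two-factor setup and all numeric values below are invented for illustration; the vector `y` stands in for circuit simulation results.

```python
import numpy as np

def design_matrix(X):
    """Second-order model terms for two factors: 1, x1, x2, x1^2, x2^2, x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_second_order(X, y):
    """Least-squares fit of the polynomial coefficients to the experiments."""
    coef, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)
    return coef

# 3-level full-factorial experiment map: 9 "simulations" for 6 coefficients.
X = np.array([(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)], dtype=float)
true = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 3.0])   # hypothetical sensitivities
y = design_matrix(X) @ true                         # stands in for SPICE runs
coef = fit_second_order(X, y)
```

With noise-free data the six coefficients are recovered exactly, and each linear coefficient is directly the first-order sensitivity of the performance to that process parameter.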
In this paper, we propose a system-assisted analog mixed-signal (SAMS) design paradigm whereby the mixed-signal components of a system are designed in an application-aware manner in order to minimize power and enhance robustness in nanoscale process technologies. In a SAMS-based communication link, the digital and analog blocks from the output of the information source at the transmitter to the input of the decision device in the receiver are treated as part of the composite channel. This comprehensive systems-level view enables us to compensate for impairments of not just the physical communication channel but also the intervening circuit blocks, most notably the analog/mixed-signal blocks. This is in stark contrast to what is done today, which is to treat the analog components in the transmitter and the analog front-end at the receiver as transparent waveform preservers. The benefits of the proposed system-aware mixed-signal design approach are illustrated in the context of analog-to-digital converters (ADCs) for high-speed links. CAD challenges that arise in designing system-assisted mixed-signal circuits are also described.
In state-of-the-art multi-processor systems-on-chip (MPSoC), the interconnect of processing elements has a major impact on the system's overall average-case and worst-case performance. Moreover, in real-time applications, predictability of on-chip communication latency is imperative for bounding the response time of the overall system. In shared-memory MPSoCs, buses are still the prevalent means of on-chip communication for small to medium size chip multi-processors (CMPs). Still, bus arbitration schemes employed in current architectures either deliver good average-case performance (i.e. maximize bus utilization) or enable tight bounding of worst-case execution time. This paper presents a shared bus arbitration approach allowing high bus utilization while guaranteeing a fixed bandwidth per time frame to each master. Thus it provides high performance to both real-time and any-time applications, or even a mixture of both. The paper includes performance results obtained while executing random traffic on a shared bus implemented on an FPGA. The results show that our approach provides bus utilization close to static-priority-based arbitration, a fairer bandwidth distribution than round-robin, and latency guarantees identical to TDMA. With this, it combines the best properties of these schemes.
Due to continuing advances in wireless communication and sensors, Wireless Sensor Networks (WSN) have been widely used in several fields, such as medicine, science, industrial automation and security. A possible solution is to use CMOS System on Chip (SoC) sensor nodes as hardware platforms, owing to their extremely low power consumption and their sensing, computation and communication capabilities. This work presents the modeling of a mixed-signal SoC for WSN using a system-level approach. The digital section was modeled using SystemC Transaction Level Modeling (TLM) and consists of a 32-bit RISC microprocessor, memory, interrupt controller and serial interface. The analog block consists of an Analog-to-Digital Converter (ADC) described in SystemC-AMS. An application was implemented to test the correctness of the model and perform the communication between the SoC and a functional level node model.
This paper proposes a unified framework for the hardware/software codesign of body sensor network applications that aims to enhance both modularity and reusability. The proposed framework consists of a Unified Modeling Language (UML) 2 profile for TinyOS applications and a corresponding simulator. The UML profile allows for the description of the low-level details of the hardware simulator, thereby providing a higher level of abstraction for application developers to visually design, document and maintain their systems that consist of both hardware and software components. With the aid of a predefined component repository, minimum TinyOS knowledge is needed to construct a body sensor network system. A novel feature of our framework is that we have modeled not only software applications but the simulator platform in UML. A new instance of the simulator can be automatically generated whenever hardware changes are made. Key design issues, such as timing and energy consumption can be tested by simulating the generated software implementation on the automatically customized simulator. The framework ensures a separation of software and hardware development while maintaining a close connection between them. This paper describes the concepts and implementation of the proposed framework, and presents how the framework is used in the development of nesC-TinyOS based body sensor network applications. Two actual case studies are used to show how the proposed framework can quickly and automatically adapt the software implementation to efficiently accommodate hardware changes.
This paper presents a new asynchronous design template using single-track handshaking that targets medium-to-high performance applications. Unlike other single-track templates, the proposed template supports multiple levels of logic per pipeline stage, improving area efficiency by sharing the control logic among more logic while at the same time providing higher robustness to timing variability. The template also yields higher throughput than most four-phase templates and lower latency than bundled-data templates. The template has been incorporated into the asynchronous ASIC flow Proteus and experiments on ISCAS benchmarks show significant improvement in achievable throughput per area.
Coarse Grained Reconfigurable Arrays (CGRAs) are a promising class of architectures combining flexibility and efficiency. Devising effective methodologies to map applications onto CGRAs is a challenging task, due to their parallel execution paradigm and sparse interconnection topology. In this paper we present a scheduling framework that is able to efficiently map operations onto CGRA architectures. It leverages differences in the delays of various operations, which a reconfigurable architecture always exhibits at run-time, to effectively route data. We call this ability "slack-awareness". Experimental evidence showcases the benefit of slack-aware scheduling in a coarse-grained reconfigurable environment, as more complex applications can be mapped for a given mesh size and more efficient schedules can be achieved, compared to state-of-the-art methods.
In this paper, we propose a technique for custom instruction
(CI) extension that considers process variations, bridging the gap
between high-level custom instruction extension and chip
fabrication in nanometer technologies. In the proposed method,
instead of conventional static timing analysis (STA), statistical
static timing analysis (SSTA) is utilized, which in turn results in a
probabilistic approach to identifying and selecting different parts
of the CI extension. More precisely, we use the delay probability
density function (PDF) of the CIs in the identification and
selection phases of the CI extension. In the identification phase,
the delay of each CI is modeled by a PDF, and the performance
yield is added as a constraint. Additionally, in the selection phase,
the merit function of conventional approaches is modified to
increase the performance gain of the selected CIs at the price of
slightly sacrificing the design yield. Finally, to make the approach
computationally more efficient, we propose a method for
reducing the time needed to model the PDFs of the CIs by
reducing the number of candidate CIs before extracting the
PDFs.
Keywords- ASIP, Custom Instruction, Process Variation, PDF
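The yield-constrained selection described above can be illustrated with a small sketch: each candidate CI carries a delay PDF represented by Monte Carlo samples, and a CI is selectable only if its performance yield (probability of meeting the clock period) satisfies the constraint. The candidate data, field names, and the simple speedup merit are all invented assumptions, not the paper's actual merit function.

```python
def performance_yield(delay_samples, t_clk):
    """Fraction of process samples in which the CI meets the clock period."""
    return sum(d <= t_clk for d in delay_samples) / len(delay_samples)

def select_cis(candidates, t_clk, min_yield):
    """Keep only yield-feasible CIs, ordered by performance merit
    (here simply the cycles saved by the custom instruction)."""
    feasible = [c for c in candidates
                if performance_yield(c["delay"], t_clk) >= min_yield]
    return sorted(feasible, key=lambda c: -c["speedup"])

# Hypothetical candidates with pre-extracted delay samples (ns).
candidates = [
    {"name": "ci_mac", "speedup": 10, "delay": [1.0, 1.1, 1.2, 2.5]},
    {"name": "ci_sad", "speedup": 5,  "delay": [0.5, 0.6, 0.7, 0.8]},
]
chosen = select_cis(candidates, t_clk=1.3, min_yield=0.9)
```

Note how the higher-speedup `ci_mac` is rejected: its delay PDF puts too much probability above the clock period (yield 0.75), illustrating why an STA worst-case or mean-delay view would pick differently.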
This paper introduces a novel compact implicit model for a probabilistic set of waveforms (PSoW), which arises as a representation for uncertain signal waveforms in Statistical Static Timing Analysis (SSTA). In traditional SSTA tools, signals are represented only as (distributions of) arrival time and slew. In our approach, PSoWs are used instead to increase accuracy. However, representing PSoWs explicitly requires a very large amount of data, which can be problematic. To solve this problem, a compact implicit model is introduced, which can be characterized with just a handful of parameters. The results obtained show that the implicit model can generate real-life PSoWs with high accuracy.
A genetic programming-based circuit synthesis method is proposed that enables global optimization of the number of gates in circuits that have already been synthesized using common tools such as ABC and SIS. The main contribution is a new fitness function that significantly reduces the fitness evaluation time compared to the state of the art. The fitness function performs optimized equivalence checking using a SAT solver. It is shown that the equivalence checking time can be significantly reduced when knowledge of the parent circuit and its mutated offspring is taken into account. At the cost of additional runtime, results of conventional synthesis conducted using SIS and ABC were improved by 20-40% on the LGSynth93 benchmarks.
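The core idea of an equivalence-checking fitness function can be shown with a brute-force stand-in: for tiny circuits the miter can simply be evaluated on all input vectors. The two-input "circuits" below are hypothetical lambdas; the paper's method instead uses a SAT solver and exploits parent/offspring knowledge to scale far beyond this sketch.

```python
from itertools import product

def equivalent(f, g, n_inputs):
    """Exhaustive miter check: True iff f and g agree on all 2^n inputs."""
    return all(f(*v) == g(*v) for v in product((0, 1), repeat=n_inputs))

def fitness(candidate, reference, n_inputs, gate_count):
    """Reward smaller circuits, but only if functionality is preserved;
    non-equivalent mutants get no fitness at all."""
    return -gate_count if equivalent(candidate, reference, n_inputs) else None

xor_ref  = lambda a, b: a ^ b                            # reference function
xor_cand = lambda a, b: (a & (1 - b)) | ((1 - a) & b)    # equivalent rewrite
bad_cand = lambda a, b: a & b                            # broken mutation
```

A genetic loop would mutate gate-level candidates and keep the equivalent one with the best (least-negative) fitness.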
Statistical design approaches have been studied intensively in
the last decade to deal with process variability, and statistical
delay fault testing is one of the key techniques for statistical
design. Various techniques have been proposed to represent the
distributions of timing information such as gate delays, signal
arrival times, and slacks. Among them, the Gaussian mixture
model is distinguished from the others in that it can easily handle
any correlation, non-Gaussian distributions, and slew
distributions. However, the previous method of computing the
statistical maximum for Gaussian mixture models has a defect: in
certain cases it produces a distribution similar to a Gaussian even
though the correct distribution is far from Gaussian. In this
paper, we propose a novel method for the statistical maximum
(minimum) operation on Gaussian mixture models. It takes the
cumulative distribution function curve into consideration so as to
compute accurate criticalities (probabilities of timing violation),
which is important for detecting delay faults and for circuit
optimization. The proposed method reduces the error in
criticality by almost 80% compared with the previous method.
Keywords: criticality; probability of timing violation; statistical static timing analysis; Gaussian mixture model; cumulative distribution curve
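The abstract does not give the operator itself, but the role of the CDF curve can be illustrated with a minimal sketch: for two arrival times modeled as Gaussian mixtures, and under the simplifying assumption of independence, the CDF of their maximum is the product of their CDFs, and the criticality at clock period T is one minus that value. The independence assumption and the component values below are illustrative only, not the paper's method:

```python
import math

def gmm_cdf(t, comps):
    # comps: list of (weight, mean, std) Gaussian mixture components
    return sum(w * 0.5 * (1 + math.erf((t - mu) / (sd * math.sqrt(2))))
               for w, mu, sd in comps)

def max_cdf(t, x_comps, y_comps):
    # For independent X and Y: P(max(X, Y) <= t) = P(X <= t) * P(Y <= t)
    return gmm_cdf(t, x_comps) * gmm_cdf(t, y_comps)

def criticality(clock_period, x_comps, y_comps):
    # Probability that the max arrival time violates the clock period
    return 1.0 - max_cdf(clock_period, x_comps, y_comps)
```

Working on the CDF curve directly, as here, keeps the non-Gaussian shape of the maximum instead of collapsing it to a single Gaussian.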
Modern logic synthesis systems apply a sequence of loosely-related, function-preserving transformations to gradually improve a circuit with respect to criteria such as area, performance, and power. The order in which the transformations are applied is critical to the quality of a complete synthesis run, as different orders can produce vastly different outcomes. In practice, the transformation sequence is encoded in synthesis scripts that are derived manually based on the experience and intuition of the tool developer. These scripts are static in the sense that transformations are applied independently of the results of previous transformations or the current state of the design. Despite the importance of obtaining high-quality scripts, there have been only a few attempts to optimize them. In this paper, we present a novel method that selects transformations dynamically during the synthesis run by leveraging the theory of Markov Decision Processes. The decision to select a particular transformation is based on transition probabilities, the history of the applied synthesis steps, and expectations for future steps. We report experimental results obtained from an implementation of the approach in the logic synthesis system ABC.
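The paper's MDP formulation is not spelled out in the abstract. As a loose illustration only, a policy that conditions the choice of the next transformation on the previously applied step and on learned expected gains might look like this; the gain table, action names, and epsilon-greedy exploration are all assumptions for the sketch, not the authors' algorithm:

```python
import random

def select_transformation(history, gains, actions, epsilon=0.1):
    """Pick the next synthesis transformation.

    gains[(prev_action, action)] holds the expected quality-of-result
    improvement of applying `action` right after `prev_action`
    (hypothetically learned offline from training runs).
    """
    prev = history[-1] if history else None
    if random.random() < epsilon:
        return random.choice(actions)      # occasional exploration
    # Greedy choice conditioned on the last applied step
    return max(actions, key=lambda a: gains.get((prev, a), 0.0))
```

A static script would ignore `history` entirely; the point of the dynamic approach is precisely that the next step depends on what has been applied so far.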
The scaling of MOSFETs has improved performance and lowered the cost per function of CMOS integrated circuits and systems over the last 40 years, but devices are subject to increasing amounts of statistical variability within the decanano domain. The causes of these statistical variations and their effects on device performance have been extensively studied, but there have been few systematic studies of their impact on circuit performance. This paper describes a method for modelling the impact of random intra-die statistical variations on digital circuit timing and power consumption. The method allows the variation modelled by large-scale statistical transistor simulations to be propagated up the design flow to the circuit level, by making use of commercial STA and standard cell characterisation tools. The method provides circuit designers with the information required to analyse power, performance and yield trade-offs when fabricating a design, while removing the large levels of pessimism generated by traditional Corner Based Analysis.
Panelists: L Bomhold, T Green, A Ephrimides, C Blumstein
Until recently the electrical power industry has relied solely on traditional technologies - copper and iron in the form of cables, transformers and machines - as the mainstream solution for the generation, transmission and distribution of power. Whilst these materials and technologies are here to stay, improvements in power semiconductor technology mean that the industry is moving into a position where more, and faster, control of power systems can be achieved. This high-level control requires a sensing and communication infrastructure to be put in place across the network. At the same time, the potential of real-time consumer pricing means that the use of electricity in the home also requires new technologies. This panel session aims to bring together heavy-current electrical power engineers and light-current electronic engineers for a discussion and debate about the future role of EDA in applications brought about by changes in the functioning of the power industry. Power engineers from both industry and academia will stimulate the discussion with requirements from both a system perspective and a consumer perspective. The representatives from the EDA side will respond with the contributions they believe EDA can make, what already exists or is a simple development problem, and what research issues remain in achieving these goals. In summary, this panel aims to motivate the EDA industry to work on useful technology that can be applied to heavy power systems, with a view to improving global energy efficiency.
This paper introduces the first available tool flow for Dynamic Partial Reconfiguration on the Spartan-6 family. In addition, the paper proposes a new configuration method called Fast Start-up targeting modern FPGA architectures, in which the FPGA is configured in two steps instead of a single (monolithic) full device configuration. In this novel approach, only the timing-critical modules are loaded at power-up using a first high-priority bitstream, while the non-timing-critical modules are loaded afterwards. This two-step, prioritized FPGA start-up is used to meet the extremely tight start-up timing specifications found in many modern applications, such as PCI Express or automotive applications. Finally, the developed tool flow and methods for Fast Start-up have been used to implement a CAN-based automotive ECU on a Spartan-6 evaluation board (SP605). With this approach, it was possible to decrease the initial bitstream size and hence achieve a configuration time speed-up of up to 4.5x compared to a standard configuration solution.
Within the context of reconfigurable architectures, we define a kernel loop (K-loop) as a loop containing one or more kernels mapped on the reconfigurable hardware in the loop body. In this paper, we analyze how loop distribution can be used in the context of K-loops. We propose an algorithm for splitting K-loops that contain more than one kernel and intra-iteration dependencies. The purpose is to create smaller loops (K-sub-loops) with more speedup potential when parallelized. Making use of partial reconfigurability, the K-sub-loops can take advantage of having more area available for multiple kernel instances to execute in parallel on the FPGA. To study the potential performance improvement of using loop distribution on K-loops, we use a suite of randomly generated test cases. The results show an improvement of more than 40% over previously proposed methods in more than 60% of the cases. The algorithm is also validated with a K-loop extracted from the MJPEG application. A maximum speedup of 8.22 is achieved when mapping MJPEG on a Virtex-II Pro with partial reconfiguration, and 13.41 when statically mapping it on a Virtex-4.
We present a run-time system for a multi-grained reconfigurable processor that provides a dynamic trade-off between performance and available area budgets for both fine- and coarse-grained reconfigurable fabrics within one reconfigurable processor. Our run-time system is the first implementation of its kind that dynamically selects and steers a performance-maximizing multi-grained instruction set under run-time varying constraints. It achieves a performance improvement of more than 2x compared to state-of-the-art run-time systems for multi-grained architectures. To further elaborate the benefits of our approach, we also compare it with offline- and online-optimal instruction-set selection schemes.
This paper describes two algorithms for the selective assignment of input don't cares (DCs) for logical derating of input errors to enhance reliability. It is motivated by the observation that reliability-driven assignment of DCs can improve input error resilience by up to 49.7% in logic circuits. Two algorithms - ranking-based and complexity-factor-based - for reliability-driven DC assignment are proposed in this paper. Both algorithms use Hamming distance metrics to determine 0/1 assignments for the most critical DC terms, thereby leaving flexibility in the circuit specification for subsequent optimization. Since ranking-based DC assignment offers less control over overhead, we develop a complexity-factor-based DC assignment algorithm that can achieve up to 21.4% improvement in error rate with a simultaneous 4.3% reduction in area over conventional DC assignment. Finally, we derive analytical estimates on min-max reliability improvements to evaluate the effectiveness of the proposed algorithms.
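The two algorithms themselves are not detailed in the abstract. One way to picture a Hamming-distance-driven assignment is to give each don't-care input pattern the output of its nearest fully specified pattern, so that patterns close in Hamming distance share outputs and single-bit input errors are more likely to be derated. This nearest-neighbour rule is an illustrative stand-in, not the paper's ranking- or complexity-factor-based algorithm:

```python
def hamming(a, b):
    # Hamming distance between two equal-length bit strings
    return sum(x != y for x, y in zip(a, b))

def assign_dcs(spec, dcs):
    # spec: {input pattern: output bit}; dcs: list of don't-care patterns.
    # Illustrative rule: give each DC the output of its nearest specified
    # pattern, so neighbouring input patterns stay consistent.
    out = {}
    for dc in dcs:
        nearest = min(spec, key=lambda p: hamming(p, dc))
        out[dc] = spec[nearest]
    return out
```

The flexibility the paper mentions corresponds to leaving the remaining, less critical DCs unassigned for the downstream logic optimizer.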
Starting from a functional description or a gate-level circuit, the goal of multi-level logic optimization is to obtain a version of the circuit that implements the original function at a lower cost. For error-tolerant applications - images, video, audio, graphics, and games - it is known that errors at the outputs are tolerable provided that their severities are within application-specified thresholds. In this paper, we perform application-level analysis to show that significant errors at the circuit level are tolerable. We then develop a multi-level logic synthesis algorithm for error-tolerant applications that minimizes the cost of the circuit by exploiting the budget for approximation provided by error tolerance. We use circuit area as the cost metric and use a test generation algorithm to select faults that introduce errors of low severity but provide significant area reductions. The selected faults are injected to simplify the circuit in the experiments. Results show that our approach provides significant reductions in circuit area even for modest error tolerance budgets.
Keywords: error tolerance, circuit optimization, ATPG, DCT, redundancy removal
Device aging, which causes significant loss on circuit performance and lifetime, has been a main factor in reliability degradation of nanoscale designs. Aggressive technology scaling trends, such as thinner gate oxide without proportional downscaling of supply voltage, necessitate an aging-aware analysis and optimization flow during early design stages. Since only a small portion of critical and near-critical paths can be sensitized and may determine the circuit delay under aging, path sensitization should also be explicitly addressed for more accurate and efficient optimization. In this paper, we first investigate the impact of path sensitization on aging-aware timing analysis and then present a novel framework for aging-aware timing optimization considering path sensitization. By extracting and manipulating critical sub-circuits accounting for the effective circuit delay, our proposed framework can reduce aging-induced performance degradation to only 1.21% or one-seventh of the original performance loss with less than 2% area overhead.
This paper addresses the problem of efficient and effective parameter variation modeling and sampling in computer architecture simulations. While there has been substantial progress in accelerating simulation for circuit designs subject to manufacturing variations, these approaches are not generally suitable for architectural studies. Toward this end, we investigated two complementary avenues: (1) adapting low-discrepancy sampling methods for use in Monte Carlo architectural simulations - we apply techniques previously developed for gate-level circuit models to higher-level component models and in so doing drastically reduce the number of samples needed for detailed simulation; and (2) applying multi-resolution analysis to appropriately decompose geometric regions of a chip and achieve a more effective description of parameter variations without increasing computational complexity. Our experimental results demonstrate that the combined techniques can reduce the number of Monte Carlo trials by a factor of 3.3, maintaining the same accuracy while significantly reducing the overall simulation run-time.
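As an illustration of the low-discrepancy idea (the paper's exact sampling construction is not given in the abstract), a Halton sequence fills the unit square far more evenly than independent uniform draws, which is what reduces the number of Monte Carlo trials needed for a given accuracy:

```python
def halton(index, base):
    # Radical inverse of `index` in `base`: the classic Halton point
    f, result = 1.0, 0.0
    i = index
    while i > 0:
        f /= base
        result += f * (i % base)
        i //= base
    return result

def halton_points(n, bases=(2, 3)):
    # n low-discrepancy points in the unit square (coprime bases per axis)
    return [tuple(halton(i + 1, b) for b in bases) for i in range(n)]
```

Each parameter dimension gets its own coprime base, so successive samples progressively subdivide the gaps left by earlier ones instead of clustering randomly.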
The simulation speedup offered by distributed parallel event-driven simulation is known to be seriously limited by synchronization and communication overhead. These limiting factors are particularly severe in gate-level timing simulation. This paper describes a radically different approach to gate-level simulation based on a concept of temporal rather than conventional spatial parallelism. The proposed method partitions the entire simulation run into simulation slices in the temporal domain, and each slice is simulated separately. With the slices independent of each other, an almost linear speedup is achievable with a large number of simulation nodes. This concept naturally enables a "correct by simulation" methodology that explicitly maintains consistency between the reference and target specifications. Experimental results clearly show significant simulation speed-up.
Keywords: event-driven simulation; parallel simulation; Verilog simulation; gate-level simulation
The growing importance of post-silicon validation in ensuring functional correctness of high-end designs increases the need for synergy between the pre-silicon verification and post-silicon validation. We propose a unified functional verification methodology for the pre- and post-silicon domains. This methodology is based on a common verification plan and similar languages for test-templates and coverage models. Implementation of the methodology requires a user-directable stimuli generation tool for the post-silicon domain. We analyze the requirements for such a tool and the differences between it and its pre-silicon counterpart. Based on these requirements, we implemented a tool called Threadmill and used it in the verification of the IBM POWER7 processor chip with encouraging results.
We present HYBRO, an automatic methodology for generating high-coverage input vectors for Register Transfer Level (RTL) designs based on a branch-coverage-directed approach. HYBRO uses dynamic simulation data and static analysis of RTL control flow graphs (CFGs). A concrete simulation is applied over a fixed number of cycles, and instrumented code records the branches covered. The corresponding symbolic trace is extracted from the CFG with an RTL symbolic execution engine. A guard in the symbolic expression is then mutated; if the mutated guard has dependent branches that have not already been covered, it is passed to an SMT solver, and a satisfiable assignment yields a valid input vector. We implemented the Verilog RTL symbolic execution engine and show that branch-coverage-directed exploration avoids the path explosion caused by previous path-based approaches to input vector generation, quickly achieving full branch coverage and more than 90% functional (assertion) coverage on the ITC99 benchmarks and several OpenRISC designs. We also describe two optimizations, (a) dynamic UD-chain slicing and (b) local conflict resolution, that speed up HYBRO by 1.6-12x on different benchmarks.
We present an STA tool based on a single-pass true-path computation that efficiently determines the critical path list. Since it does not rely on a two-step process, it can be programmed to efficiently find the N true paths of a circuit. We also report and analyze the dependence of complex-gate delay on the sensitization vector and its variation (which reaches up to 15% in 65nm technologies), and consider this effect in the path delay estimation. Delay is computed from a simple polynomial analytical description that requires a one-time library parameter extraction process, making it highly scalable. Results on combinational ISCAS circuits synthesized for three technologies (130nm, 90nm and 65nm) show better computation time, number of paths reported, and delay estimation for these paths compared to a commercial tool.
Keywords: delay model, timing analysis
We propose an adaptive reliability enhancement structure for deeply-scaled CMOS and future devices that exhibit nondeterministic behavior. This structure forms the basis of a confidence-driven computing model that can be implemented with either a rollback recovery or an iterative dual modular redundancy method incorporating synchronous handshake schemes. The performance and cost of the computing model are estimated using a 45 nm CMOS technology, and its functionality is verified by FPGA-based emulation. The confidence-driven computing model is demonstrated using a 16-bit, 12-stage CORDIC processor operating under random transient errors. The model adapts to fluctuating error rates at the device substrate level to guarantee the reliability of computation at the system level. It incurs 4.2 times less area and 2.7 times less energy overhead than triple modular redundancy to guarantee a system-level mean time to failure of two years.
Keywords: reliability, transient error, confidence estimator, rollback recovery, dual modular redundancy
Drastic device shrinking, power supply reduction, and the increasing complexity and operating speeds that accompany technology scaling have reduced the reliability of today's ICs. The reliability of embedded memories is affected by particle strikes (soft errors), very-low-voltage operating modes, PVT variability, EMI, and accelerated circuit aging. Error correcting codes (ECC) are an efficient means of protecting memories against failures. A major issue with ECC is the speed penalty induced by the encoding and decoding circuits. In this paper we present an effective approach for eliminating this penalty and demonstrate its efficiency in the case of an advanced reconfigurable OFDM modulator.
Keywords: reliability, technology scaling, ECC, performance
In this paper we address the issue of improving ECC correction ability beyond that provided by the standard SEC/DED Hsiao code. We analyze the impact of the standard SEC/DED Hsiao ECC and of several double-error-correcting (DEC) codes on area overhead and cache memory access time for different codeword and code-segment sizes, as well as their correction ability as a function of codeword/code-segment size. We show the different trade-offs that can be achieved in terms of area overhead, performance, and correction ability, giving designers insight for selecting the optimal ECC and codeword organization/code-segment size for a given application.
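For readers unfamiliar with the SEC/DED baseline the paper starts from, the classic extended Hamming (8,4) code shows the mechanism in miniature: a syndrome locates a single error, and an overall parity bit distinguishes single (correctable) from double (detect-only) errors. Hsiao codes achieve the same SEC/DED property with a different, implementation-friendlier parity-check matrix; this toy version only illustrates the behaviour:

```python
def encode(d):
    # d: 4 data bits -> 8-bit extended Hamming codeword (SEC/DED)
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                       # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                       # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                       # covers positions 4,5,6,7
    body = [p1, p2, d1, p3, d2, d3, d4]     # Hamming(7,4), positions 1..7
    p0 = 0
    for b in body:
        p0 ^= b                             # overall parity over the body
    return [p0] + body

def decode(cw):
    # cw: [p0, body 1..7]; returns (status, corrected data bits or None)
    p0, body = cw[0], list(cw[1:])
    syndrome = 0
    for pos, bit in enumerate(body, start=1):
        if bit:
            syndrome ^= pos                 # XOR of positions of set bits
    overall = p0
    for b in body:
        overall ^= b                        # recomputed overall parity
    if syndrome != 0 and overall == 0:
        return "double-error", None         # detected but uncorrectable
    if syndrome != 0:
        body[syndrome - 1] ^= 1             # single error: syndrome = position
    status = "ok" if syndrome == 0 and overall == 0 else "corrected"
    return status, [body[2], body[4], body[5], body[6]]
```

Going beyond this to DEC codes, as the paper does, buys correction of the second error at the price of the larger check matrices and decoder latency the authors quantify.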
Small circuit defects occurring during manufacturing and/or enhanced or induced by various aging mechanisms represent a serious challenge in advanced scaled CMOS technologies. These defects initially manifest as small delay faults that may evolve over time until they exceed the slack in the clock cycle period. Periodic tests performed with reduced slack provide a low-cost solution for predicting failures induced by slowly evolving delay faults. Unfortunately, such tests have limited fault coverage and fault detection latency. Here, we introduce a way to complement or completely replace periodic testing with reduced slack. Delay control structures are proposed that enable arbitrarily small parts of the monitored component to switch quickly between a normal operating mode and a degraded mode characterized by a smaller slack. Only two or three additional transistors are needed for each flip-flop in the monitored logic. Micro-architectural support for a concurrent self-test of pipelined logic that takes advantage of the introduced degraded mode is presented as well. Test stimuli are produced on the fly by the last two valid operations executed before each stall cycle, and test result evaluation is facilitated by replicating the last valid operation during a stall cycle. Protection against transient faults can be achieved if each operation is replicated via stall cycle insertion.
Physically unclonable functions (PUFs) are designed on integrated circuits (ICs) to generate unique signatures that can be used for chip authentication. PUFs primarily rely on manufacturing process variations to create distinctions between chips. In this paper, we present novel PUF circuits designed to exploit inherent fluctuations in the physical layout due to the photolithography process. Variations arising from proximity effects, density effects, etch effects, and the non-rectangularity of transistors are leveraged to implement lithography-based physically unclonable functions (litho-PUFs). We show that the uniqueness level of these PUFs is adjustable and typically much higher than that of traditional ring-oscillator or tri-state-buffer based approaches.
Keywords: PUF, IC authentication, photolithography, proximity effect, chemical mechanical polishing, hardware security
Integrated circuits (ICs) are becoming increasingly vulnerable to malicious alterations, referred to as hardware Trojans. Detection of these inclusions is of utmost importance, as they may potentially be inserted into ICs bound for military, financial, or other critical applications. A novel on-chip structure including a ring oscillator network (RON), distributed across the entire chip, is proposed to verify whether the chip is Trojan-free. This structure effectively eliminates the issue of measurement noise, localizes the measurement of dynamic power, and additionally compensates for the impact of process variations. Combined with statistical data analysis, the separation of process variations from the Trojan contribution to the circuit's transient power is made possible. Simulation results featuring Trojans inserted into a benchmark circuit using 90nm technology and experimental results on Xilinx Spartan-3E FPGA demonstrate the efficiency and scalability of the RON architecture for Trojan detection.
Modern security-aware embedded systems need protection against fault attacks. These attacks rely on intentionally induced faults, which have not only a different origin but also a different nature than the errors that fault-tolerant systems usually face. For instance, an adversary who attacks the circuit with two lasers can potentially induce two errors at different positions. Such errors can defeat not only simple double modular redundancy schemes but, as we show, also naive schemes based on any linear code over GF(2). In this article, we describe arithmetic logic units (ALUs) that provide high error detection rates even in the presence of such errors. The contribution of this article is threefold. First, we show that the minimum weight of an undetected error is no longer defined by the code distance when certain arithmetic and logic operations are applied to the codewords; as a result, additional hardware is needed to preserve the minimum error weight for a given code. Second, we show that for multi-residue codes these delicate operations are rare in typical smart card applications, which allows an efficient time-area trade-off for checking the codewords and thus significantly reduces the hardware costs of such a protected ALU. Third, we implement the proposed architectures and study the influence of the register file and a multiplier on the area and the critical path.
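As background on the multi-residue idea (the moduli and word width of the paper's ALU are not given in the abstract; the values below are illustrative), a residue checksum travels with each operand and is updated independently through addition, so a fault that corrupts the main result is caught by a residue mismatch:

```python
MODULI = (3, 5, 7)   # illustrative multi-residue check bases

def tag(x):
    # Attach a residue checksum per modulus to an operand
    return x, tuple(x % m for m in MODULI)

def checked_add(a, b):
    # Residue codes are preserved by addition:
    # (a + b) mod m == ((a mod m) + (b mod m)) mod m
    (xa, ra), (xb, rb) = a, b
    result = xa + xb
    pred = tuple((p + q) % m for p, q, m in zip(ra, rb, MODULI))
    ok = pred == tuple(result % m for m in MODULI)
    return result, pred, ok
```

Addition is a "safe" operation in this sense; the delicate operations the paper discusses are those (such as certain logic operations) where the residues are not preserved so simply and extra checking hardware is needed.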
The SHA-3 competition organized by NIST has triggered significant efforts in the performance evaluation of cryptographic hardware and software. These benchmarks are used to compare the implementation efficiency of competing hash candidates. However, such benchmarks test the algorithms in an ideal setting and ignore the effects of system integration. In this contribution, we analyze the performance of hash candidates on a high-end computing platform consisting of a multi-core Xeon processor with an FPGA-based hardware accelerator. We implement two hash candidates, Keccak and SIMD, in various configurations of multi-core hardware and multi-core software. Next, we vary application parameters such as message length, message multiplicity, and message source. We show that, depending on the application parameter set, overall system performance is limited by three possible bottlenecks: computation speed, communication bandwidth, and buffer storage. Our key result demonstrates the dependence of these bottlenecks on the application parameters. We conclude that, to make sound system design decisions, selecting the right hash candidate is only half of the solution: one must also understand the nature of the data stream being hashed.
While traditional cluster computers are increasingly constrained by power and cooling costs when solving extreme-scale (or exascale) problems, continuing progress in silicon technologies and integration levels makes complete end-user systems on a single chip possible. This massive level of integration makes modern multicore chips pervasive in domains ranging from climate forecasting and astronomical data analysis to consumer electronics, smart phones, and biological applications. Consequently, designing multicore chips for exascale computing using embedded systems design principles looks like a promising alternative to traditional cluster-based solutions. This paper presents an overview of new, far-reaching design methodologies and run-time optimization techniques that can help break the energy efficiency wall in massively integrated single-chip computing platforms.
Keywords: exascale computing, multicore, small-world, game theory, consensus theory, fractal behavior
This contribution presents a fully automated approach for the explorative topology synthesis of small analog circuit blocks. Circuits are composed from a library of basic building blocks, and various algorithms are used to explore the entire design space, even allowing unusual circuits to be generated. Correct combination of the basic blocks is ensured through generic electrical rules, which guarantee the fundamental electrical functionality of the generated circuit. Additionally, symmetry constraints are introduced to narrow the design space, which leads to more reasonable circuits. Furthermore, a replaceable bias-voltage generator is included in the circuit to replicate real-world operating conditions. For the first evaluation and selection of the best candidate circuits, fast symbolic analysis techniques are used. The final sizing is done with a parallelized, industrial sizing method. Experimental results show the feasibility of this synthesis approach.
In this paper, a new frequency compensation method based on automatic topology modification of analog amplifier circuits is presented. Starting from an uncompensated circuit topology in closed-loop configuration, a capacitance is inserted between each pair of nodes. Subsequently, the set of inserted capacitances is reduced to a manageable size using a selection algorithm based on eigenvalue sensitivity calculation. Finally, the remaining capacitances are sized by a numerical optimization method. The presented method is demonstrated on a transimpedance amplifier design for an industrial HDTV application.
This work presents novel analog sizing flows based on analytical techniques. A graph-based, operating-point-driven sizing approach provides operating point voltages and a rough sizing with respect to constraints. A voltage-range analysis method using linearized operating-point models obtains information about feasible voltage ranges. A direct-sizing method solves nonlinear algebraic circuit equations directly to obtain design parameters from specifications. All three methods require no or only minimal simulation effort and can provide quick insight into the circuit design space and constraints at an early design stage. They allow flexible inclusion into state-of-the-art simulation-based optimization flows, where they lead to improved results with less optimization effort and prevent unnecessary simulation effort on infeasible circuit topologies. The sizing flows are enhanced by a commercial optimization tool in order to obtain reliable circuits.
Layout generation remains a critical bottleneck in analog circuit design. It is especially burdensome when re-using an existing design for a similar specification or when transferring a working design to a new technology. This paper presents a new methodology for the layout generation of analog circuits based on a modular circuit design and a so-called "executable design flow description". This description is created once, manually, and allows the layout to be described in a technology-independent and parameterizable manner, ensuring a consistent view of circuit and layout design. Complex layouts can be created in negligible time, allowing layout effects to be taken into account early in the circuit design. Furthermore, the parameterization of the design description allows simplified technology transfer and seamless access to sizing tools.
Keywords: circuit design, layout design, analog circuits, parameterizable design cells