Sessions: [Keynote Address] [2.2] [2.3] [2.4] [2.5] [2.6] [2.7] [2.8] [3.2] [3.3] [3.4] [3.5] [3.6] [3.7] [3.8] [IP1] [4.2] [4.3] [4.4] [4.5] [4.6] [4.7] [5.1] [5.2] [5.3] [5.4] [5.5] [5.6] [5.7] [IP2] [6.1.1] [6.1.2] [6.2] [6.3] [6.4] [6.5] [6.6] [6.7] [6.8] [7.1] [7.2] [7.3] [7.4] [7.5] [7.6] [7.7] [7.8] [IP3] [8.1] [8.2] [8.3] [8.4] [8.5] [8.6] [8.7] [8.8] [9.1] [9.2] [9.3] [9.4] [9.5] [9.6] [9.7] [IP4] [10.1.1] [10.1.2] [10.2] [10.3] [10.4] [10.5] [10.6] [10.7] [10.8] [11.1] [11.2] [11.3] [11.4] [11.5] [11.6] [11.7] [11.8] [IP5] [12.1] [12.2] [12.3] [12.4] [12.5] [12.6] [12.7] [12.8]
DATE Executive Committee
DATE Sponsors
Technical Program Topic Chairs
Technical Program Committee
Reviewers
Foreword
Best Paper Awards
Tutorials
Ph.D. Forum
Call for Papers: DATE 2012
Moore's Law continues to deliver ever more transistors on an integrated circuit, but discontinuities in the progress of technology mean that the future isn't simply an extrapolation of the past. For example, design cost and complexity constraints have recently caused the microprocessor industry to switch to multi-core architectures, even though these parallel machines present programming challenges that are far from solved. Moore's Law now translates into ever more processors on a multi- and, soon, many-core chip. The software challenge is compounded by the need for increasing fault tolerance as near-atomic-scale variability and robustness problems bite harder. We look beyond this transitional phase to a future where the availability of processor resources is effectively unlimited and computations must be optimised for energy usage rather than load balancing, and we look to biology for examples of how such systems might work. Conventional concerns such as synchronisation and determinism are abandoned in favour of real-time operation and adapting around component failure with minimal loss of system efficacy.
We address the problem of analyzing the performance of System-on-Chip (SoC) architectures in the presence of variations. Existing techniques such as gate-level statistical timing analysis compute the distributions of clock frequencies of SoC components. However, we demonstrate that translating component-level characteristics into a system-level performance distribution is a complex and challenging problem due to inter-dependencies between the components' execution, indirect effects of shared resources, and interactions between multiple system-level "execution paths". We argue that accurate variation-aware system-level performance analysis requires repeated system execution, which is prohibitively slow when based on simulation. Emulation is a widely used approach to drastically speed up system-level simulation, but it has hitherto not been applied to variation analysis. We describe a framework, Variability Emulation for SoC Performance Analysis (VESPA), that adapts and applies emulation to the problem of variation-aware SoC performance analysis. The proposed framework consists of three phases: component variability characterization, variation-aware emulation setup, and Monte Carlo-driven emulation. We demonstrate the utility of the proposed framework by applying it to design variation-aware architectures for two example SoCs, an 802.11 MAC processor and an MPEG encoder. Our results suggest that variability emulation has great potential to enable variation-aware design and exploration at the system level.
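The Monte Carlo-driven phase can be illustrated with a toy model (hypothetical distributions and system structure, not the VESPA framework itself): sample each component's clock frequency from its characterized distribution, evaluate a system-level latency that couples the components, and aggregate the results into a performance distribution.

```python
import random
import statistics

def sample_core_freqs(nominal, sigma, n_cores, rng):
    """Draw per-component clock frequencies from a characterized
    Gaussian distribution (the component variability phase);
    clamped to stay positive."""
    return [max(rng.gauss(nominal, sigma), 1e-6) for _ in range(n_cores)]

def system_latency(freqs, cycles_per_stage):
    """Toy system model: a pipeline of components, where each stage
    contributes cycles / frequency; the system-level latency depends
    on every component, not on any one in isolation."""
    return sum(c / f for c, f in zip(cycles_per_stage, freqs))

def monte_carlo_performance(nominal, sigma, cycles_per_stage,
                            trials=10000, seed=0):
    """Repeated 'emulated' executions under sampled frequencies,
    yielding a system-level performance distribution."""
    rng = random.Random(seed)
    lat = []
    for _ in range(trials):
        freqs = sample_core_freqs(nominal, sigma, len(cycles_per_stage), rng)
        lat.append(system_latency(freqs, cycles_per_stage))
    return statistics.mean(lat), statistics.stdev(lat)
```

Note that the mean latency exceeds the latency at nominal frequency (Jensen's inequality, since delay is convex in frequency), which is exactly the kind of system-level effect that per-component distributions alone do not reveal.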
Three-dimensional integrated circuits (3D ICs) have become an emerging technology in view of their advantages in packing density and flexibility in heterogeneous integration. The multi-core processor (MCP), which is able to deliver equivalent performance with less power consumption, is a candidate for 3D implementation. However, when maximizing the throughput of a 3D MCP, thermal issues must be taken into consideration due to the inherent heat-removal limitation. Furthermore, since the temperature of a core strongly depends on its location in the 3D MCP, proper task allocation helps to alleviate potential thermal problems and improve throughput. In this paper, we present a thermal-aware on-line task allocation algorithm for 3D MCPs. Our experimental results show that the proposed method achieves a 16.32X runtime speedup and a 23.18% throughput improvement; these results are comparable to the exhaustive solutions obtained from the optimization modeling software LINGO. On average, our throughput is only 0.85% worse than that of the exhaustive method. For 128 task-to-core allocations, our method takes only 0.932 ms, which is 57.74 times faster than the previous work.
Keywords: Multi-core processor, task allocation, thermal awareness, three-dimensional integration, throughput optimization, temperature uniformity.
NAND flash memory is widely used in embedded systems due to its non-volatility, shock resistance, and high cell density. In recent years, various Flash Translation Layer (FTL) schemes (especially hybrid-level FTL schemes) have been proposed. Although these FTL schemes provide good solutions in terms of endurance and wear-leveling, none of them considers reusing the free pages in both data blocks and log blocks during a merge operation. By reusing these free pages, fewer free blocks are needed and the endurance of the NAND flash memory is enhanced. We evaluate our reuse strategy using a variety of application-specific I/O traces from Windows systems. Experimental results show that the proposed scheme can effectively reduce erase counts and enhance the endurance of flash memory.
In this paper, we focus on register allocation techniques to simultaneously reduce the energy consumption and heat buildup of register accesses. The conflict between these two objectives is resolved through the introduction of a hardware rotator. A register allocation algorithm followed by a refinement method is proposed based on the access patterns and the effects of the rotator. Experimental results show that the proposed algorithms obtain notable improvements in energy consumption and temperature reduction for embedded applications.
Index Terms - Register allocation, Bit transition activity, Heat buildup, Rotator
The passivity characterization and enforcement of linear interconnect macromodels have received much attention in the recent literature. It is now widely recognized that the Hamiltonian eigensolution is a very reliable technique for such characterization. However, most available algorithms for the determination of the required Hamiltonian eigenvalues still require excessive computational resources for large-size macromodels with thousands of states. This work breaks through this complexity barrier by introducing the first parallel implementation of a specialized Hamiltonian eigensolver, designed and optimized for shared-memory multicore architectures. Our starting point is a multi-shift restarted and deflated Arnoldi process. Excellent parallel efficiency is obtained by running different Arnoldi iterations concurrently on different threads. The numerical results show that macromodels with several thousand states are characterized in a few seconds on a 16-core machine, with close-to-ideal speedup factors.
This paper proposes a highly efficient methodology for the statistical analysis of RC nets subject to manufacturing variabilities, based on the combination of parameterized RC extraction and structure-preserving parameterized model order reduction methods. The sensitivity-based layout-to-circuit extraction generates first-order Taylor series approximations of resistances and capacitances with respect to multiple geometric parameter variations. This formulation becomes the input of the parameterized model order reduction, which exploits the explicit parameter dependence to produce a linear combination of multiple non-parameterized transfer functions weighted by the parameter variations. Such a formulation enables fast computation of statistical properties such as the standard deviation of the transfer function given the process spreads of the technology. Both the extraction and the reduction techniques avoid any parameter sampling. Therefore, the proposed method achieves a significant speedup compared to Monte Carlo approaches.
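As a worked miniature of this flow (with invented sensitivities and spreads, purely for illustration): once resistance and capacitance are available as first-order Taylor series in the geometric parameters, the standard deviation of a delay metric follows in closed form, with no parameter sampling at all.

```python
import math

def delay_spread(R0, C0, kRW, kRT, kCW, sW, sT):
    """First-order statistical delay analysis of one RC segment.
    Hypothetical sensitivity model, for illustration only:
      R(dW, dT) ~ R0 * (1 - kRW*dW - kRT*dT)   # resistance vs. width/thickness
      C(dW, dT) ~ C0 * (1 + kCW*dW)            # capacitance vs. width
    Returns (nominal tau, sigma of tau) for the metric tau = R*C."""
    tau0 = R0 * C0
    # First-order sensitivities of tau = R*C to each geometric parameter:
    dtau_dW = tau0 * (kCW - kRW)   # width effects on R and C partly cancel
    dtau_dT = -tau0 * kRT          # thickness affects R only
    # Independent Gaussian parameters: variances add; no Monte Carlo needed.
    sigma_tau = math.sqrt((dtau_dW * sW) ** 2 + (dtau_dT * sT) ** 2)
    return tau0, sigma_tau

# Example with made-up numbers: 100 ohm, 1 pF, 3%/2% process spreads.
tau0, sigma = delay_spread(100.0, 1e-12, 0.8, 0.5, 0.6, 0.03, 0.02)
```

The same closed-form propagation generalizes to the linear combination of transfer functions described above, which is what lets the method skip Monte Carlo sampling entirely.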
The analysis of on-chip power grids requires the solution of large systems of linear algebraic equations with specific properties. Lately, a class of random walk based solvers has been developed that is capable of handling these systems; these solvers are especially useful when only a small part of the original system must be solved. These methods build a probabilistic network that corresponds to the power grid. However, this construction does not fully exploit the properties of the problem and can result in large variances for the random walks and, consequently, large run times. This paper presents an efficient methodology, inspired by the idea of importance sampling, to improve the runtime of random walk based solvers. Experimental results show significant speedups compared to the naive random walks used by state-of-the-art random walk solvers.
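The baseline random-walk idea (before any importance-sampling variance reduction) can be sketched on the simplest possible "grid", a uniform resistor chain tied to ideal sources at both ends; the node layout and values here are invented for illustration.

```python
import random

def walk_voltage(node, v_left, v_right, n_nodes, trials, rng):
    """Estimate the voltage at one internal node of a uniform resistor
    chain whose endpoints are ideal voltage sources. From each internal
    node the walker steps left or right with probability 1/2 (in general,
    transition probabilities are conductance-weighted); a walk absorbed
    at a source returns that source's voltage. The node voltage is the
    mean over many independent walks, so only the queried node is solved,
    never the whole system."""
    total = 0.0
    for _ in range(trials):
        pos = node
        while 0 < pos < n_nodes - 1:
            pos += rng.choice((-1, 1))
        total += v_left if pos == 0 else v_right
    return total / trials
```

For this chain the exact answer is linear interpolation between the two sources; the estimator's variance, and hence its runtime, is what importance-sampling-style reweighting of the transitions is designed to shrink.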
We propose a block-diagonal structured model order reduction (BDSM) scheme for fast power grid analysis. Compared with existing power grid model order reduction (MOR) methods, BDSM has several advantages. First, unlike many power grid reductions that are based on terminal reduction and thus error-prone, BDSM utilizes an exact column-by-column moment matching to provide higher numerical accuracy. Second, with similar accuracy and macromodel size, BDSM generates very sparse block-diagonal reduced-order models (ROMs) for massive-port systems at a lower cost, whereas traditional algorithms such as PRIMA produce full dense models inefficient for the subsequent simulation. Third, different from those MOR schemes based on extended Krylov subspace (EKS) technique, BDSM is input-signal independent, so the resulting ROM is reusable under different excitations. Finally, due to its block-diagonal structure, the obtained ROM can be simulated very fast. The accuracy and efficiency of BDSM are verified by industrial power grid benchmarks.
Panelists: G. De Micheli, P. Groeneveld, H. Hiller, E. Macii, P. Magarshack
Virtually all current integrated circuits and systems would not exist without the use of logic synthesis and physical design tools. These design technologies were developed in the last fifty years and it is hard to say if they have come to full maturity. Physical design evolved from methods used for printed-circuit boards where the classic problems of placement and routing surfaced for the first time [1]. Logic synthesis evolved in a different trajectory, starting from the classic works on switching theory [2], but took a sharp turn in the eighties when multiple-level logic synthesis, coupled to semicustom technologies, provided designers with a means to map models in hardware description languages into netlists ready for physical design [3], [4]. The clear separation between logic and physical design tasks enabled the development of effective design tool flows, where signoff could be done at the netlist level. Nevertheless, the relentless downscaling of semiconductor technologies forced this separation to disappear, once circuit delays became interconnect-dominated. Since the nineties, design flows combined logic and physical design tools to address the so-called timing closure problem, i.e., to reduce the designer effort to synthesize a design that satisfies all timing constraints. Despite many efforts in various directions, most notably with the use of the fixed timing methodology, this problem is not completely solved yet. The complexity of integrated logic and physical tool flows, as well as the decrease in design starts of large ASICs, limits the development of these flows to a few EDA companies.
With shrinking transistor sizes and supply voltages, errors in combinational logic due to radiation particle strikes are on the rise. A broad range of applications will soon require protection from this type of error, requiring an effective and inexpensive solution. Many previously proposed logic protection techniques rely on duplicate logic or latches, incurring high overheads. In this paper, we present a technique for transient error detection using parity trees for power and area efficiency. This approach is highly customizable, allowing adjustment of a number of parameters for optimal error coverage and overhead. We present simulation results comparing our scheme to latch duplication, showing, on average, greater than 55% savings in area and power overhead for the same error coverage. We also demonstrate adding protection to reach a target logic soft error rate, achieving up to a 59X reduction in the error rate with under 2% power and area overhead.
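The core parity-tree idea (a generic sketch, not the paper's customizable partitioning scheme) is that XOR-ing a group of signals into a single parity bit and comparing it against a reference parity detects any odd number of flips in the group, at far less cost than duplicating the logic.

```python
from functools import reduce

def parity(bits):
    """XOR tree over a group of logic signals (returns 0 or 1)."""
    return reduce(lambda a, b: a ^ b, bits, 0)

def transient_detected(observed, reference_parity):
    """Recompute parity from the (possibly corrupted) observed signals
    and compare with the reference parity tree's output; any odd number
    of bit flips within the protected group changes the parity and is
    therefore detected."""
    return parity(observed) != reference_parity

# Example: protect a group of eight combinational outputs.
golden = [1, 0, 1, 1, 0, 0, 1, 0]
ref = parity(golden)
struck = list(golden)
struck[3] ^= 1   # a single radiation-induced transient flip
```

Grouping more signals under one tree lowers overhead but also lowers coverage (an even number of simultaneous flips in one group escapes), which is the coverage/overhead trade-off the abstract's tunable parameters control.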
As FPGA feature sizes shrink to nanometers, soft errors increasingly become an important concern for SRAM-based FPGAs. Without considering the application-level impact, existing reliability-oriented placement and routing approaches analyze the soft error rate (SER) only at the physical level, consequently completing the design with suboptimal soft error mitigation. Our analysis shows that the statistical variation of the application-level factor is significant. Hence, in this work, we first propose a cube-based analysis to efficiently and accurately evaluate the application-level factor. We then propose a cross-layer optimized placement and routing algorithm to reduce the SER by incorporating the application-level and physical-level factors together. Experimental results show that the average difference in the application-level factor between our cube-based method and Monte Carlo golden simulation is less than 0.01. Moreover, compared with the baseline VPR placement and routing technique, the cross-layer optimized placement and routing algorithm can reduce the SER by 14% with no area or performance overhead.
Keywords- cross-layer optimization, cube-based analysis, FPGA, placement and routing, soft error rate.
We present a novel trigonometry-based probability calculation (TPC) method for analyzing circuit behavior and reliability in the presence of errors that occur with extremely low probability. Signal and error probabilities are represented by trigonometric functions controlled by their corresponding angles. By combining trigonometric identities and Taylor expansions, the effect of an error at a particular gate is simulated as a rotation. In addition, the correlations among signals caused by reconvergence are carefully handled. The TPC method is shown to be more scalable and accurate than prior approaches, especially for very low-probability errors. We measure the performance of TPC by applying it to the ISCAS and LGSyn-91 benchmark circuits. Experimental results show that TPC achieves near-linear runtime complexity even with the largest circuits, while the accuracy gradually increases with decreasing error probabilities.
Keywords- Error modeling, logic circuits, probabilistic analysis, reliability, soft errors.
In this paper, we present a very fast and accurate technique to estimate the soft error rate of digital circuits in the presence of Multiple Event Transients (METs). In the proposed technique, called Multiple Event Probability Propagation (MEPP), a four-valued logic and probability set is used to accurately propagate the effects of multiple erroneous values (transients) due to METs to the outputs and obtain the soft error rate. MEPP considers a unified treatment of all three masking mechanisms, i.e., logical, electrical, and timing, while propagating the transient glitches. Experimental results through comparisons with statistical fault injection confirm the accuracy (only 2.5% difference) and speedup (10,000X faster) of MEPP.
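For intuition, ordinary signal-probability propagation through a gate netlist (assuming independent inputs; MEPP's actual four-valued set and its logical/electrical/timing masking models go well beyond this sketch) looks like:

```python
def p_not(p):
    """P(output = 1) of an inverter, given P(input = 1)."""
    return 1.0 - p

def p_and(p1, p2):
    """P(output = 1) of an AND gate, assuming independent inputs."""
    return p1 * p2

def p_or(p1, p2):
    """P(output = 1) of an OR gate, by inclusion-exclusion."""
    return p1 + p2 - p1 * p2

def output_error_probability(p_err, depth):
    """Toy propagation: probability that at least one of `depth`
    independent transient-affected stages disturbs the output, each
    stage failing with probability p_err (no masking modeled)."""
    p = 0.0
    for _ in range(depth):
        p = p_or(p, p_err)
    return p
```

The analytical propagation is what makes such techniques orders of magnitude faster than fault injection: probabilities flow through the netlist once instead of simulating thousands of injected faults.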
It is projected that the communication data volume in electric vehicles will significantly increase compared to state-of-the-art vehicles due to additional functionalities like x-by-wire and safety functions. This paper presents a networking concept for electric vehicles to cope with the high data volume in cases where a single FlexRay bus is not sufficient. We present a FlexRay switch concept that is capable of increasing the effective bandwidth and improving the safety of existing FlexRay buses. A prototype FPGA implementation shows the feasibility of our approach. Further, a scheduling approach for the FlexRay switch that obtains optimal results based on Integer Linear Programming (ILP) is presented. Since the ILP approach becomes intractable for real-world problems, we present a heuristic three-step approach that determines the branches of the network, performs a local scheduling for each node, and finally assembles the local schedules into a global schedule. Test cases and an entire realistic in-vehicle network are used to emphasize the benefits of the proposed approach.
In this paper we present an approach for the configuration and reconfiguration of FlexRay networks to increase their fault tolerance. To guarantee correct and deterministic system behavior, the FlexRay specification does not allow reconfiguration of the schedule at run time. To avoid the necessity of a complete bus restart in case of a node failure, we propose a reconfiguration that uses redundant slots in the schedule and/or combines messages in existing frames and slots, to compensate for node failures and increase robustness. Our approach supports the developer in increasing the fault tolerance of the system during the design phase. It is a heuristic which, in addition to a determined initial configuration, calculates possible reconfigurations for the remaining nodes of the FlexRay network in case of a node failure, to keep the system working properly. An evaluation by means of realistic safety-critical automotive real-time systems revealed that it determines valid reconfigurations for up to 80% of possible individual node failures. In summary, our approach offers major support to developers of FlexRay networks, since the results provide helpful feedback about reconfiguration capabilities. In an iterative design process, this information can be used to determine and optimize valid reconfigurations.
Detecting and reacting to faults is an indispensable capability for many wireless sensor network applications. Unfortunately, implementing fault detection and error correction algorithms is challenging. Programming languages and fault tolerance mechanisms for sensor networks have historically been designed in isolation. This is the first work to combine them. Our goal is to simplify the design of fault-tolerant sensor networks. We describe a system that makes it unnecessary for sensor network application developers and users to understand the intricate implementation details of fault detection and tolerance techniques, while still using their domain knowledge to support fault detection, error correction, and error estimation mechanisms. Our FACTS system translates low-level faults into their consequences for application-level data quality, i.e., consequences domain experts can appreciate and understand. FACTS is an extension of an existing sensor network programming language; its compiler and runtime libraries have been modified to support automatic generation of code for on-line fault detection and tolerance. This code determines the impacts of faults on the accuracies of the results of potentially complex data aggregation and analysis expressions. We evaluate the overhead of the proposed system on code size, memory use, and the accuracy improvements for data analysis expressions using a small experimental testbed and simulations of large-scale networks.
In recent decades, the amount of genomic data that needs to be analyzed has grown exponentially. A very important problem in biology is the extraction of the biologically functional genomic DNA from the actual genome of an organism. Many computational biology algorithms utilizing various approaches have been proposed to solve the gene-finding problem; GlimmerHMM is considered one of the most efficient such algorithms. This paper presents two different accelerators for the GlimmerHMM algorithm. One of them is implemented on a modern FPGA platform, exploiting the parallelism that reconfigurable logic offers; the other utilizes a GPU (Graphics Processing Unit), taking advantage of a highly multithreaded operational environment. The performance of the implemented systems is compared against that achieved when the official distribution of the algorithm is executed on a high-end multi-core server; the speedup achieved, for the most compute-intensive part, is up to 200x for the FPGA-based system and up to 34x for the GPU-based system.
Keywords- Gene finding, FPGA, GPU, bioinformatics
The impact of variability on sub-45nm CMOS multimedia platforms makes it hard to provide application QoS guarantees, as speed variations across the cores may cause sub-optimal and sample-dependent utilization of the available resources and energy budget. These effects can be compensated for by an efficient allocation of the workload at run time. In the context of multimedia applications, a critical objective is to compensate for core speed variability while meeting time constraints without impacting energy consumption. In this paper we present a new approach to computing optimal task allocations at run time. The proposed strategy exploits an efficient and scalable implementation to find on-line the best possible solution in a tightly bounded time. Experimental results demonstrate the effectiveness of the compensation both in terms of deadline miss rate and energy savings. The results have been compared with those obtained by applying state-of-the-art techniques to a multithreaded MPEG2 decoder. The validation has been performed on a cycle-accurate virtual prototype of a next-generation industrial multicore platform that has been extended with process variability models.
This paper presents a new technique, called subclock power gating, for reducing leakage power in digital circuits. The proposed technique works concurrently with voltage and frequency scaling, and power reduction is achieved by power gating within the clock cycle during active mode, unlike traditional power gating, which is applied during idle mode. The proposed technique can be implemented using standard EDA tools with simple modifications to the standard power gating design flow. Using a 90nm technology library, the technique is validated on two case studies: a 16-bit parallel multiplier and an ARM Cortex-M0 microprocessor, provided by our industrial project partner. Compared to designs without subclock power gating, we show that, for a given power budget, the leakage power saved allows 45x and 2.5x improvements in energy efficiency for the multiplier and the microprocessor, respectively.
In premium vehicles, the number of distributed comfort-, safety-, and infotainment-related functions is steadily increasing. For this reason, the requirements for the underlying communication architecture are also becoming more stringent. In addition, the diversity of today's deployed communication technologies and the need for higher bandwidth complicate the design of future network architectures. Ethernet and IP, both standardized and widely used, could be one solution to homogenize communication architectures and to provide higher bandwidth. This paper focuses on a migration concept for replacing the CAN buses employed today by Ethernet/IP-based networks. It highlights several concepts to minimize the protocol header overhead by using EA- and rule-based algorithms, and presents migration results for currently deployed automotive CAN subnetworks.
Index Terms - Ethernet, IP, UDP, CAN, migration, optimization, automotive, embedded, CANoverIP, XoverIP
Increasingly intelligent energy-management and safety systems are being developed to realize safe and economic automobiles. The realization of these systems is only possible with complex and distributed software. This development poses a challenge for verification and validation. Upcoming standards like ISO 26262 provide requirements for verification and validation during the development phases. Advanced test methods are required for safety-critical functions; formal specification of requirements and appropriate testing strategies at different stages of the development cycle are part of them. In this paper we present our approach to formalizing the requirements specification by means of test models. These models serve as the basis for the subsequent testing activities, including the automated derivation of executable test cases. Test cases can be derived statistically, randomly on the basis of operational profiles, or deterministically, in order to perform different testing strategies. We have applied our approach with a large German OEM in different development stages of active safety and energy-management functionalities. The test cases were executed in model-in-the-loop and hardware-in-the-loop simulation. With our approach, errors that had not been discovered before were identified both in the requirements specification and in the implementation.
Keywords: Road Vehicles, Safety Critical Systems, Software Testing, Requirements Engineering, Automated Testing, Verification, Validation
Dynamic Voltage and Frequency Scaling (DVFS), a widely adopted technique to ensure safe thermal characteristics while delivering superior energy efficiency, is rapidly becoming inefficient with technology scaling due to two critical factors: (a) the inability to scale the supply voltage due to reliability concerns; and (b) the inability of dynamic adaptations through DVFS to alter the underlying power-hungry circuit characteristics, which are designed for the nominal frequency. In this paper, we show that DVFS-scaled circuits substantially lag in energy efficiency, by 22-86%, compared to ground-up designs for the target frequency levels. We propose Topologically Homogeneous Power-Performance Heterogeneous multicore systems (THPH), a fundamentally alternative means of designing energy-efficient multicore systems. Using a system-level CAD approach, we seamlessly integrate architecturally identical cores designed for different voltage-frequency (VF) domains. We use a combination of a standard cell library based CAD flow and full-system architectural simulation to demonstrate an 11-22% improvement in energy efficiency using our design paradigm.
Instance- and temperature-dependent leakage power variability is already a significant issue in contemporary embedded processors, and one which is expected to increase in importance with the scaling of semiconductor technology. We measure and characterize this leakage power variability in current microprocessors, and show that variability-aware duty cycle scheduling produces a 7.1x improvement in sensing quality for a desired lifetime. In contrast, pessimistic estimates of power consumption leave 61% of the energy untapped, and datasheet power specifications fail to meet required lifetimes by 14%. Finally, we introduce a duty cycle abstraction for TinyOS that allows applications to explicitly specify lifetime and minimum duty cycle requirements for individual tasks, and dynamically adjusts duty cycle rates so that overall quality of service is maximized in the presence of power variability.
Advances in chip-multiprocessor processing capabilities have led to increased power consumption and temperature hotspots. Reducing the on-die peak temperature is important from both power-reduction and reliability considerations. However, the presence of task deadlines constrains the reduction of peak temperature and thus complicates the determination of the optimal speeds for minimizing it. We formulate the determination of optimal speeds for minimizing the peak temperature of an execution with task deadlines as a quasiconvex optimization problem. The formulation includes accurate power and thermal models that capture the dependency of leakage power on temperature. Experiments demonstrate that our approach is very flexible in adapting to various scenarios of workload and deadline specifications. We obtained an 8 °C reduction in peak temperature for a sample execution of benchmarks.
Satisfiability (SAT) solvers often benefit from clauses learned by the DPLL procedure, even though they are by definition redundant. In addition to those derived from conflicts, the clauses learned by dominator analysis during the deduction procedure tend to produce smaller implication graphs and sometimes increase the deductive power of the input CNF formula. We extend dominator analysis with an efficient self-subsumption check. We also show how the information collected by dominator analysis can be used to detect redundancies in the satisfied clauses and, more importantly, how it can be used to produce supplemental conflict clauses. We characterize these transformations in terms of deductive power and proof conciseness. Experiments show that the main advantage of dominator analysis and its extensions lies in improving proof conciseness.
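Self-subsumption can be stated in generic clause notation (this is the standard rule, not the paper's specific dominator-based implementation): if clause C1 contains a literal l whose negation appears in C2, and the rest of C1 is contained in C2, then resolving on l yields a clause that subsumes C2, so ¬l can simply be deleted from C2.

```python
def self_subsumption_literal(c1, c2):
    """Return a literal that can be deleted from clause c2 by
    self-subsuming resolution with c1, or None if no such literal
    exists. Clauses are frozensets of nonzero ints (DIMACS-style
    literals: x is positive, ~x is negative)."""
    for lit in c1:
        # Need -lit in c2 and (c1 minus lit) contained in (c2 minus -lit):
        if -lit in c2 and (c1 - {lit}) <= (c2 - {-lit}):
            return -lit
    return None

def strengthen(c1, c2):
    """Apply one self-subsumption step to c2, if possible."""
    lit = self_subsumption_literal(c1, c2)
    return c2 if lit is None else c2 - {lit}

# Example: (x1 or x2) lets us drop ~x1 from (~x1 or x2 or x3),
# because resolving on x1 gives (x2 or x3), which subsumes the original.
c1 = frozenset({1, 2})
c2 = frozenset({-1, 2, 3})
```

The strengthened clause is logically equivalent in context but shorter, which is one way such checks contribute to the proof conciseness discussed above.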
In this paper we present a method for integrating two complementary solving techniques for QBF formulas, i.e., variable elimination based on an AIG framework and search with DPLL-based solving. We develop a sophisticated mechanism for coupling these techniques, enabling the transfer of partial results from the variable elimination part to the search part. This includes the definition of heuristics to (1) determine appropriate points in time to snapshot the current partial result during variable elimination (by estimating its quality) and (2) switch from variable elimination to search-based methods (applied to the best known snapshot) when the progress of variable elimination is deemed too slow or when representation sizes grow too fast. We show in the experimental section that our combined approach is clearly superior to both individual methods run in a stand-alone manner. Moreover, our combined approach significantly outperforms all other state-of-the-art solvers.
This paper presents a new SMT solver, STABLE, for formulas of the quantifier-free logic over fixed-sized bit vectors (QF-BV). The heart of STABLE is a computer-algebra-based engine which provides algorithms for simplifying arithmetic problems of an SMT instance prior to bit-blasting. As the primary application domain for STABLE we target an SMT-based property checking flow for System-on-Chip (SoC) designs. When verifying industrial data path modules we frequently encounter custom-designed arithmetic components specified at the logic level of the hardware description language being used. This results in SMT problems where arithmetic parts may include non-arithmetic constraints. STABLE includes a new technique for extracting arithmetic bit-level information for these non-arithmetic constraints. Thus, our algebraic engine can solve subproblems related to the entire arithmetic design component. STABLE was successfully evaluated in comparison with other state-of-the-art SMT solvers on a large collection of SMT formulas describing verification problems of industrial data path designs that include multiplication. In contrast to the other solvers STABLE was able to solve instances with bit-widths of up to 64 bits.
Coverage models are the main technique for evaluating the thoroughness of the dynamic verification of a Design-under-Verification (DUV). However, rather than achieving high coverage, the essential purpose of verification is to expose as many bugs as possible. In this paper, we propose a novel verification methodology that leverages early bug prediction for a DUV to guide and assess the related verification process. Specifically, this methodology utilizes predictive models built upon artificial neural networks (ANNs), which are capable of modeling the relationship between the high-level attributes of a design and its associated bug information. To evaluate the performance of the constructed predictive model, we conduct experiments on several open-source projects. Moreover, we demonstrate the usability and effectiveness of our proposed methodology by elaborating on experiences from our industrial practice. Finally, discussions on the application of our methodology are presented.
Index Terms - Verification; Complexity Metric; Bug Prediction; Empirical Study
SAT-based BMC is promising for directed test generation since it can locate the cause of an error within a small bound. However, due to the state-space explosion problem, BMC cannot handle complex designs and properties. Although various optimization methods have been proposed to address a single complex property, the test generation process cannot be fully automated. This paper presents an efficient automated approach that can scale down the falsification complexity using property decomposition and learning techniques. Our experimental results using both software and hardware benchmarks demonstrate that our approach can drastically reduce the overall test generation effort.
We present a methodology to generate input stimulus for design validation using GoldMine, an automatic assertion generation engine that uses data mining and formal verification. GoldMine mines the simulation traces of a behavioral Register Transfer Level (RTL) design using a decision-tree-based learning algorithm to produce candidate assertions. These candidate assertions are passed to a formal verification engine. If a candidate assertion is false, a counterexample trace is generated. In this work, we feed these counterexample traces back to iteratively refine the original simulation trace data. We introduce an incremental decision tree to mine the new traces in each iteration. The algorithm converges when all the candidate assertions are true. We prove that our algorithm will always converge and, on convergence, capture the complete functionality of an output. We show that our method always results in a monotonic increase in simulation coverage. We also present an output-centric notion of coverage and argue that we can attain coverage closure with respect to it. Experimental results validating our arguments are presented on several designs from Rigel, OpenRisc and SpaceWire.
The verification of embedded software has become an important subject in recent years. However, neither standalone verification approaches, such as simulation-based or formal verification, nor state-of-the-art hybrid/semiformal verification approaches are able to verify large and complex embedded software with hardware dependencies. This work presents a new scalable and extendable hybrid verification approach for the verification of temporal properties in embedded software with hardware dependencies, using for the first time a new mixed bottom-up/top-down algorithm. To this end, new algorithms and methodologies such as static parameter assignment and counterexample-guided simulation are proposed in order to combine simulation-based and formal verification in a new way. We have successfully applied this hybrid approach to embedded software applications: Motorola's Powerstone benchmark suite and complex industrial embedded automotive software. The results show that our approach scales better than standalone software model checkers in reaching deep state spaces. The whole approach is best suited for fast falsification.
Excessive test mode power-ground noise in nanometer scale chips causes large delay uncertainties in scan chains, resulting in a highly elevated rate of timing failures. The hybrid timing violation types in scan chains, plus their possibly intermittent manifestations, invalidate the traditional assumptions in scan chain fault behavior, significantly increasing the ambiguity and difficulty in diagnosis. In this paper, we propose a novel methodology to resolve the challenge of diagnosing multiple permanent or intermittent timing faults in scan chains. Instead of relying on fault simulation that is incapable of approximating the intermittent fault manifestation, the proposed technique characterizes the impact of timing faults by analyzing the phase movement of scan patterns. Extracting fault-sensitive statistical features of phase movement information provides strong signals for the precise identification of fault locations and types. The manifestation probability of each fault is furthermore computed through a mathematical transformation framework which accurately models the behavior of multiple faults as a Markov chain. The fault model utilized in the proposed scheme considers the effect of possibly asymmetric fault manifestation, thus maximally approximating the realistic failure behavior. Simulations on large benchmark circuits and two industrial designs have confirmed that the proposed methodology can yield highly accurate diagnosis results even for complicated fault manifestations such as multiple intermittent faults with mixed fault types.
While scan-based testing achieves a high fault coverage, it requires long test application times and substantial tester memory, in addition to the overhead in chip area and high test power. Functional testing, on the other hand, suffers from low coverage but can be applied at-speed. In this paper, we propose a novel three-step design-for-test (DFT) methodology which greatly enhances the performance of functional testing. In the first step we expand the state space of the circuit beyond the functionally reachable space without scan or reset. These new states create conditions to activate/propagate fault effects that are otherwise hard to detect. Since structural correlation between D flip-flops (DFFs) of a circuit restricts its state-space variation, the second step consists of partitioning the DFFs into different groups, which helps to break such correlations. In the third step, we make internal hard-to-observe points in the circuit more observable by directly XORing them with selected primary outputs. This method can be applied at-speed (since no scan shifting is involved), saving a significant amount of test application time, with area overhead comparable to scan-based DFT. Our experiments on large ISCAS'89 and ITC'99 benchmarks show that we are able to achieve very high non-scan fault coverage while simultaneously reducing the test application time (by 114x compared to scan-based techniques).
Excessive power dissipation caused by large amounts of switching activity has been a major issue in scan-based testing. For large designs, the excessive switching activity during the launch cycle can cause severe power droop, which cannot be recovered from before the capture cycle, rendering at-speed scan testing more susceptible to power droop. In this paper, we present a methodology to avoid power droop during scan capture without compromising at-speed test coverage. It is based on the use of a low-area-overhead hardware controller to control the clock gates. The methodology is ATPG (Automatic Test Pattern Generation)-independent; hence pattern generation time is not affected and pattern manipulation is not required. The effectiveness of this technique is demonstrated on several industrial designs.
Synchronous programs execute in discrete instants, called ticks. For real-time implementations, it is important to statically determine the worst-case tick length, also known as the worst-case reaction time (WCRT). While there is a considerable body of work on the timing analysis of procedural programs, such analysis for synchronous programs has received less attention. Current state-of-the-art analyses for synchronous programs use integer linear programming (ILP) combined with path-pruning techniques to achieve tight results. These approaches first convert a concurrent synchronous program into a sequential program. ILP constraints are then derived from this sequential program to compute the longest tick length. In this paper, we use an alternative approach based on model checking. Unlike conventional programs, synchronous programs are concurrent and state-space oriented, making them ideal for model-checking-based analysis. We propose an analysis of the abstracted state space of the program, combined with expressive data-flow information, to facilitate effective path pruning. We demonstrate through extensive experimentation that the proposed approach is both scalable and about 67% tighter compared to the existing approaches.
This work presents a SystemC-based simulation approach for fast performance analysis of parallel software components, using source code annotated with low-level timing properties. In contrast to other source-level approaches for performance analysis, timing attributes obtained from binary code can be annotated even if compiler optimizations are used without requiring changes in the compiler. To consider concurrent accesses to shared resources like caches accurately during a source-level simulation, an extension of the SystemC TLM-2.0 standard for reducing the necessary synchronization overhead is proposed as well. This enables the simulation of low-level timing effects without performing a full-fledged instruction set simulation and at speeds close to pure native execution. Index Terms - System analysis and design; Timing; Modeling; Software performance;
Virtual Prototypes (VPs) based on Transaction Level Models (TLMs) have become a de-facto standard for design space exploration and validation of complex software-centric multicore or multiprocessor systems. The most popular method to get timed software TLMs is to annotate timing information at the basic-block level granularity back into application source code, called source code instrumentation (SCI). The existing SCI approaches realize the back-annotation of timing information based on mapping between source code and binary code. However, optimizing compilation has a large impact on the code mapping and will lower the accuracy of the generated source-level TLMs. In this paper, we present an efficient approach to tackle this problem. We propose to use mapping between source-level and binary-level control flows as the basis for timing annotation instead of code mapping. Software TLMs generated by our approach allow for accurate evaluation of multiprocessor systems at a very high speed. This has been proven by our experiments with a set of benchmark programs and a case study.
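The back-annotation idea behind source code instrumentation can be sketched in a few lines: each basic block of the cross-compiled binary contributes its cycle cost to a simulation clock as the source-level model executes. The block names and cycle costs below are illustrative assumptions, not values from either paper.

```python
class TimedModel:
    """Minimal sketch of source-level timing annotation: a shared cycle
    counter accumulates per-basic-block costs obtained from binary-level
    analysis, while the source code runs natively."""

    def __init__(self, block_cycles):
        self.block_cycles = block_cycles  # {block name: cycle cost}
        self.cycles = 0

    def bb(self, name):
        # one annotation call per executed basic block
        self.cycles += self.block_cycles[name]


def dot(model, xs, ys):
    """Instrumented dot product: hypothetical blocks 'entry'/'loop'/'exit'."""
    model.bb("entry")
    acc = 0
    for x, y in zip(xs, ys):
        model.bb("loop")  # loop body executes once per element pair
        acc += x * y
    model.bb("exit")
    return acc
```

Running the instrumented function yields both the functional result and an estimated cycle count, which is the information a timed software TLM needs.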
With increasing demand for higher performance under limited power budgets, multicore processors are rapidly becoming the norm in today's embedded systems. Embedded software constitutes a large portion of today's systems and real-time software design on multicore platforms opens new design challenges. In this paper, we introduce a high-level, host-compiled multicore software simulator that incorporates an abstract real-time operating system (RTOS) model to enable early, fast and accurate software exploration in a symmetric multi-processing (SMP) context. Our proposed model helps designers to explore different scheduling parameters within a framework of a general SMP execution environment. A designer can easily adjust application and OS parameters to evaluate their effect on real-time system performance. We demonstrate the efficiency of our models on a suite of industrial-strength and artificial task sets. Results show that models simulate at up to 1000 MIPS with 1-3% timing error across a variety of different OS configurations.
In order to address the large variety of channel coding options specified in existing and future digital communication standards, there is an increasing need for flexible solutions. This paper presents a multi-core architecture which supports convolutional codes, binary/duo-binary turbo codes, and LDPC codes. The proposed architecture is based on Application-Specific Instruction-set Processors (ASIPs) and avoids the use of dedicated interleave/deinterleave address lookup memories. Each ASIP consists of two datapaths, one optimized for turbo and the other for LDPC mode, while efficiently sharing memories and communication resources. The logic synthesis results yield an overall area of 2.6mm2 in 90nm technology. Payload throughputs of up to 312Mbps in LDPC mode and 173Mbps in turbo mode are possible at 520MHz, faring better than existing solutions. Index Terms - ASIP; LDPC; Turbo decoding.
A new generation of telecommunication applications requires highly efficient processing units to tackle the increasing signal-processing algorithmic complexity. These units also need to be flexible enough to handle a large range of radio access technologies with fast-moving specifications. As devices with telecommunication features are, by nature, mobile, this high level of flexibility must be achieved while preserving very low power consumption. In this paper, a high-performance, low-power application-specific processor is proposed for complex signal processing. Thanks to a dedicated control architecture, this processor exhibits an average 81% utilization rate of its principal operator, a complex MAC, for a 3GPP-LTE application. The main innovations are the use of a reconfigurable profile and an instruction cache strategy to reduce power consumption, leading to a 10x reduction of the control power consumption. As a result, an average power consumption of 50 mW is measured after implementation in a low-power 65 nm technology while delivering 3.2 GOPS. Finally, a comparison with state-of-the-art low-power DSPs shows a gain of at least 24%. Keywords - Digital Baseband; Signal Processor; VLIW; Low-Power; 3GPP-LTE
Since Multiple Input Multiple Output (MIMO) transmission has become more and more popular in current and future mobile communication systems, MIMO detection is a big issue. Linear detection algorithms are less complex and well understood, but their BER performance is limited. ML detectors achieve the optimum result but have exponential computational complexity. Hence, iterative tree-search algorithms like the sphere decoder or the K-Best detector, which reduce the computational complexity, have become a major topic in research. In this paper a modified K+-Best detector is introduced which is able to achieve the BER performance of a common K-Best detector with K=12 by using a sorting algorithm for K=8. This novel sorting approach, based on Batcher's Odd-Even Mergesort, is less complex than other parallel sorting designs and saves valuable hardware resources. Due to an efficient implementation, the throughput of the detector is about 455 Mbit/s, roughly twice the LTE peak data rate of 217.6 Mbit/s for a 16-QAM modulated signal. In this paper the architecture and the implementation issues are described in detail and the BER performance of the K+-Best FPGA implementation is shown. Index Terms - K-Best Detector; MIMO; Odd-Even Mergesort; FPGA-Implementation.
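Batcher's odd-even mergesort, on which the detector's sorting network is based, is a data-independent sorting network: its compare-exchange pattern is fixed in advance, which is exactly what makes it attractive for parallel hardware. A minimal software sketch (for power-of-two input sizes, as the network assumes) is:

```python
def oddeven_merge_sort(a):
    """In-place Batcher odd-even mergesort; len(a) must be a power of two.
    Executed sequentially here, but every compare_swap is independent of
    the data, so the same schedule maps directly onto parallel comparators."""

    def compare_swap(i, j):
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]

    def merge(lo, n, r):
        m = r * 2
        if m < n:
            merge(lo, n, m)          # merge even-indexed subsequence
            merge(lo + r, n, m)      # merge odd-indexed subsequence
            for i in range(lo + r, lo + n - r, m):
                compare_swap(i, i + r)
        else:
            compare_swap(lo, lo + r)

    def sort(lo, n):
        if n > 1:
            m = n // 2
            sort(lo, m)
            sort(lo + m, m)
            merge(lo, n, 1)

    sort(0, len(a))
    return a
```

In the detector the items being sorted would be accumulated path metrics rather than plain integers, and only the K smallest survivors are kept.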
A power/area-aware design is mandatory for the MIMO (Multi-Input Multi-Output) detectors used in the LTE and WiMAX standards. The 64-QAM modulation used in the MIMO detector requires more detection effort than the smaller constellation sizes widely implemented in the literature. In this work we propose a new architecture for the K-best detector which, unlike the popular multi-stage architecture used for K-best detectors, implements just one core. We also introduce a slight modification to the K-best algorithm that reduces the number of multiplications by 44% and the total power consumption by 27%, without any noticeable performance degradation. The overall architecture consumes only 24 KGates, the smallest area among implementations reported in the literature. It also achieves at least 4-fold greater throughput efficiency (Mbps/KGate) than the other detectors, while consuming little power. The detector, implemented in a commercial 130nm process, provides a data rate of 107Mbps and consumes 54.4mW.
Keywords - MIMO; K-best; single-core; 64-QAM; LTE; WiMAX
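The breadth-first K-best search that both of these detector papers build on can be sketched generically. The toy alphabet, channel matrix, and interface below are illustrative assumptions, not either paper's fixed-point design:

```python
def k_best_detect(R, y, alphabet, K):
    """Breadth-first K-best tree search for y = R*s + noise with an
    upper-triangular R: process antenna layers from the last row upward,
    expand every surviving candidate with all constellation symbols, and
    keep only the K candidates with the smallest accumulated distance."""
    n = len(y)
    beam = [(0.0, {})]                      # (metric, {layer: symbol})
    for i in range(n - 1, -1, -1):
        expanded = []
        for metric, part in beam:
            # interference from already-decided layers below the diagonal
            interference = sum(R[i][j] * part[j] for j in range(i + 1, n))
            for s in alphabet:
                err = y[i] - interference - R[i][i] * s
                expanded.append((metric + abs(err) ** 2, {**part, i: s}))
        expanded.sort(key=lambda t: t[0])   # the sorter is the hardware bottleneck
        beam = expanded[:K]
    _, best = beam[0]
    return [best[i] for i in range(n)]
```

For the 2x2 BPSK example R = [[2, 1], [0, 2]], y = [1, -2], the search recovers the transmitted vector [1, -1]. The sort-and-truncate step is where the sorting-network and single-core architectural choices of the two papers come into play.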
Panelists: J. Biggs, C. Clavel, O. Domerego, K. Just
Two formats for specifying power intent are currently in wide use in the industry, and as designers continue to strive for more power-efficient designs, new issues arise that need new solutions to improve on today's standards. This panel will discuss areas for improving today's power formats and the direction in which these formats need to move in order to provide the most efficient flows for design and verification, especially with regard to low power. The scope of the formats and their suitability from early ESL design exploration to back-end signoff checking will also be discussed.
Keywords - Unified Power Format; UPF; Common Power Format; CPF; Low-Power; Power-Aware; Power-Efficient; Design; Verification
Emerging nanotechnology-based systems encounter new non-functional requirements. This work addresses MEMS storage, an emerging technology that promises ultra-high density and energy-efficient storage devices. We study the buffering requirement of MEMS storage in streaming applications. We show that capacity and lifetime of a MEMS device dictate the buffer size most of the time. Our study shows that trading off 10% of the optimal energy saving of a MEMS device reduces its buffer capacity by up to three orders of magnitude. Index Terms - Secondary storage, energy efficiency, layout.
To ensure the robustness of an integrated circuit, its power distribution network (PDN) must be validated beforehand against any voltage drop on VDD nets. However, due to the increasing size of PDNs, it is becoming difficult to verify them in a reasonable amount of time. Lately, much work has been done to develop Model Order Reduction (MOR) techniques to reduce the size of power grids but their focus is more on simulation. In verification, we are concerned about the safety of nodes, including the ones which have been eliminated in the reduction process. This paper proposes a novel approach to systematically reduce the power grid and accurately compute an upper bound on the voltage drops at power grid nodes which are retained. Furthermore, a criterion for the safety of nodes which are removed is established based on the safety of other nearby nodes and a user specified margin.
3D integration has the potential to increase performance and decrease energy consumption. However, there are many unsolved issues in the design of these systems. In this work we study the design of many-tier (more than 4 tiers stacked) 3D power-supply networks and demonstrate a technique specific to 3D systems that improves IR-drop over a straightforward extension of traditional design techniques. Previous work in 3D power delivery network design has simply extended 2D techniques by treating through-silicon vias (TSVs) as extensions of the C4 bumps. By exploiting the smaller size and much higher interconnect density possible with TSVs, we demonstrate a significant reduction of nearly 50% in the IR-drop of our 3D design. Simulations also show that a 3-tier stack with the distributed TSV topology actually lowers IR-drop by 20% over a non-3D system with less power dissipation. Finally, we analyze the power distribution network of an envisioned 1000-core processor with 30 stacked dies and show scaling trends related to both increased stacking and power distribution TSVs. Our 3D analysis technique is validated using commercial-grade sign-off IR-drop software from a major EDA vendor.
Existing work on fault tolerance in hybrid nanoelectronic memories (hybrid memories) assumes that faults occur only in the memory array and the encoder, not in the decoder. However, as the decoder is built from scaled CMOS devices, it is also becoming vulnerable to faults. This paper presents a cost-efficient fault-tolerant decoder for hybrid memories that are impacted by a high degree of non-permanent clustered faults. Fault tolerance is achieved by combining a partial hardware redundancy scheme with an on-line masking scheme based on Muller C-gates. In addition, a cost-efficient implementation of the decoder is realized by modifying the decoding sequence and implementing it based on time redundancy. Experimental results show that the proposed decoder provides better reliability for the overall hybrid memory system, yet requires a smaller area than a conventional decoder. For example, assuming a fault ratio of 1:10 between decoder and memory array and a 10% fault rate, the proposed decoder ensures 1% higher reliability of the overall hybrid memory system. Moreover, the proposed decoder achieves 18.4% smaller area overhead for a 64-bit word hybrid memory.
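The masking behavior of a Muller C-gate, on which the on-line masking scheme relies, can be illustrated with a minimal behavioral model (a generic C-element, not the paper's circuit):

```python
class CElement:
    """Muller C-gate: the output follows the inputs only when they all
    agree; on disagreement it holds its previous value, so a transient
    glitch on a single input is masked rather than propagated."""

    def __init__(self, initial=0):
        self.out = initial

    def step(self, *inputs):
        if all(v == inputs[0] for v in inputs):
            self.out = inputs[0]
        return self.out
```

A two-input C-element driven with (1, 1) switches to 1; a subsequent transient (1, 0) leaves the output at 1, and only a stable (0, 0) brings it back to 0. This hold-on-disagreement property is what lets the decoder tolerate non-permanent faults without explicit detection logic.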
CAN bus systems are used in many industrial control applications, particularly automotive ones. Due to growing system and functional requirements, the low capacity of the CAN bus, and the usually strict conditions under which it is used in real-time applications, the applicability of the CAN bus is severely limited. The paper presents an approach for achieving high utilization that breathes new life into CAN-bus-based systems by proposing a dynamic offset adaptation algorithm for scheduling messages and improving message response times without any changes to a standard CAN bus. This simple algorithm, which runs on all nodes of the system, results in excellent average response times at all loads and makes the approach particularly attractive for soft real-time systems. We demonstrate the performance improvement of the proposed approach by comparison with other approaches and introduce a new performance measure in the form of a rating function. Index Terms - WCRT, Controller Area Network, CAN, response time, distributed embedded systems
This work presents an orientation tracking system for 6D inertial measurement units. The system was modeled with MathWorks Simulink and experimentally tested with the Cube Demo board by SensorDynamics, used to simulate a 3D gyro and a 3D accelerometer. Quaternions were used to represent the angular position, and an Extended Kalman filter was used for the sensor fusion algorithm. The goal was to obtain a system that could be easily integrated within the logic of the new 6D sensor family produced by SensorDynamics. We propose a Kalman filter simplification for a fixed-point arithmetic implementation to reduce the system complexity with negligible performance degradation.
Keywords: orientation tracking; angular position; Kalman filter; quaternions; inertial measurement unit; sensor fusion.
This paper presents a strategy to speed up the simulation of processors with SIMD extensions using dynamic binary translation. The idea is simple: benefit from the SIMD instructions of the host processor that is running the simulation. The realization is unfortunately not easy, as the nature of all but the simplest SIMD instructions differs greatly from one manufacturer to another. To solve this issue, we propose an approach based on a simple three-address intermediate SIMD instruction set to and from which most existing instructions can easily be mapped at translation time. To still support complex instructions, we use a form of threaded code. We detail our generic solution and demonstrate its applicability and effectiveness using a parametrized synthetic benchmark exercising the ARMv7 NEON extensions, executed on a Pentium with MMX/SSE extensions.
In this paper, we present a system-level dynamic scheduling algorithm to minimize the energy consumed by a DVS processor and multiple non-DVS peripheral devices in a hard real-time system. We show that previous work, which adopts the critical speed as the lower bound for scaling, might not be the most energy-efficient when the energy overhead of shutting down/waking up is not negligible. Moreover, the widely used statically defined break-even idle time might not be energy-efficient overall, because it is independent of job execution situations. In our approach, we first present a method to compute the break-even idle time dynamically. Then a dynamic scheduling approach is proposed for speed determination and task preemption to reduce the energy consumption of the processor and devices. Compared with existing research, our approach can effectively reduce the system-level energy consumption of both the CPU and peripheral devices.
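The break-even idle time that the paper recomputes dynamically is, in its standard static form, the shortest idle interval for which shutting a device down saves energy. A sketch under the usual power-model assumptions (all parameter names illustrative, not the paper's notation):

```python
def break_even_time(p_idle, p_sleep, e_overhead, t_transition):
    """Smallest idle interval T at which sleeping pays off: staying idle
    costs p_idle * T, while sleeping costs the fixed shutdown/wake-up
    energy e_overhead plus p_sleep * (T - t_transition). Solving for the
    crossover gives T; T can never be shorter than the transition time."""
    t = (e_overhead - p_sleep * t_transition) / (p_idle - p_sleep)
    return max(t, t_transition)
```

For example, with idle power 1.0 W, sleep power 0.1 W, a 2.0 J transition overhead, and a 0.5 s transition time, the break-even interval is about 2.17 s; idle gaps shorter than that are cheaper to ride out. The paper's point is that a single statically computed value of this kind ignores job execution situations and so can be pessimistic.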
This paper makes a case for developing statistical timing error models of DSP kernels implemented in nanoscale circuit fabrics. Recently, stochastic computation techniques have been proposed [1], [2], [3], where the explicit use of error statistics in system design has been shown to significantly enhance robustness and energy efficiency. However, obtaining the error statistics at different process, voltage, and temperature (PVT) corners is hard. This paper: 1) proposes a simple additive error model for timing errors in arithmetic computations due to PVT variations, 2) analyzes the relationship between the error statistics and model parameters, specifically the input statistics, and 3) presents a characterization methodology to obtain the proposed model parameters, thereby enabling efficient implementations of emerging stochastic computing techniques. Key results include the following observations: 1) the output error statistics are a weak function of the input statistics, and 2) the output error statistics depend upon the one's probability profile of the input word. These observations enable a one-time off-line statistical error characterization of DSP kernels, similar to the delay and power characterization done presently for standard cells and IP cores. The proposed error model is derived for a number of DSP kernels in a commercial 45nm CMOS process.
We propose a roll-forward error recovery technique based on multiple scan chains for TMR systems, called Scan chained TMR (ScTMR). ScTMR reuses the scan chain flip-flops employed for testability purposes to restore the correct state of a TMR system in the presence of transient or permanent errors. In the proposed ScTMR technique, we present a voter circuitry to locate the faulty module and a controller circuitry to restore the system to the fault-free state. As a case study, we have implemented the proposed ScTMR technique on an embedded processor suited for safety-critical applications. Exhaustive fault injection experiments reveal that the proposed architecture achieves 100% error detection and recovery coverage with respect to Single Event Upsets (SEUs) while imposing negligible area and performance overhead compared to traditional TMR-based techniques.
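The two ScTMR building blocks named above, majority voting and faulty-module location, can be illustrated with a generic word-level sketch (a textbook TMR voter, not the paper's gate-level circuitry):

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three module outputs: each result bit is set
    when at least two of the three copies agree on it."""
    return (a & b) | (b & c) | (a & c)


def locate_faulty(a, b, c):
    """Index of the disagreeing module, or None if all three agree
    (single-fault assumption, as in classical TMR)."""
    if a == b == c:
        return None
    if a == b:
        return 2
    if a == c:
        return 1
    return 0
```

In ScTMR the located module is not merely outvoted: its scan chain is reloaded with the majority state so the system rolls forward from a fault-free state instead of re-executing.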
In recent years, chip multiprocessors (CMPs) have emerged as a solution for high-speed computing demands. However, power dissipation in CMPs can be high if numerous cores are simultaneously active. Dynamic voltage and frequency scaling (DVFS) is widely used to reduce the active power, but its effectiveness and cost depend on the granularity at which it is applied. Per-core DVFS allows the greatest flexibility in controlling power but incurs the expense of an unrealistically large number of on-chip voltage regulators. Per-chip DVFS, where all cores are controlled by a single regulator, overcomes this problem at the expense of greatly reduced flexibility. This work considers the problem of building an intermediate solution: clustering the cores of a multicore processor into DVFS domains and implementing DVFS on a per-cluster basis. Based on a typical workload, we propose a scheme to find similarity among the cores and cluster them based on this similarity. We also provide an algorithm to implement DVFS for the clusters, and evaluate the effectiveness of per-cluster DVFS in power reduction.
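One simple instance of the clustering idea (illustrative only, not the paper's similarity metric): sort cores by average utilization, cut the order into k contiguous groups, and run each voltage/frequency domain at the speed its most demanding core needs.

```python
def cluster_by_utilization(util, k):
    """Greedy per-cluster DVFS grouping: cores with similar average
    utilization share a domain, limiting the slack wasted when a lightly
    loaded core is dragged up to a heavy core's frequency."""
    order = sorted(range(len(util)), key=lambda c: util[c])
    size = -(-len(util) // k)        # ceiling division
    return [order[i:i + size] for i in range(0, len(util), size)]


def cluster_frequencies(util, clusters, f_max):
    # each domain must satisfy its most demanding member
    return [f_max * max(util[c] for c in cl) for cl in clusters]
```

With utilizations [0.2, 0.9, 0.3, 0.8] and two domains, the light cores (0 and 2) share a 0.3*f_max domain while the heavy cores (3 and 1) share a 0.9*f_max domain; a single per-chip regulator would have to run all four at 0.9*f_max.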
3D multi-core architectures are seen to provide increased transistor density, reduced power consumption, and improved performance through wire length reduction. However, 3D suffers from increased power density, which exacerbates thermal hotspots. In this paper, we present a novel 3D multi-core architecture that reduces processor activity on the die distant to the heat sink and a core-level dynamic thermal management technique based on the architectural adaptation, e.g. dynamically adapting core-resources depending on diverse application requirements and thermal behavior. The proposed thermal management technique synergistically combines the benefits of the architectural adaptation supported by our 3D multi-core architecture with dynamic voltage and frequency scaling. Our proposed technique provides 19.4% (maximum 24.4%, minimum 15.5%) improvement in the instruction throughput compared to the state-of-the-art thermal management techniques [4, 5] applied to the thermal-aware 3D processor architecture without considering run-time adaptation [10].
Modern systems on chip (SoCs) are rapidly becoming complex high-performance computational devices, featuring multiple general-purpose processor cores and a variety of functional IP blocks communicating with each other through an on-die fabric. While modular SoC design provides power savings and simplifies the development process, it also leaves significant room for a special type of hardware bug, interaction errors, to slip through pre- and post-silicon verification. Consequently, hard-to-fix silicon escapes may be discovered late in the production schedule or even after market release, potentially causing costly delays or recalls. In this work we propose a unified error detection and recovery framework that incorporates programmable features into the on-die fabric of an SoC, so triggers of escaped interaction bugs can be detected at runtime. Furthermore, upon detection, our solution locks the interface of an IP for a programmed time period, thus altering interactions between accesses and bypassing the bug in a manner transparent to software. For classes of errors that cannot be circumvented by this in-hardware technique, our framework is programmed to propagate the error detection to the software layer. Our experiments demonstrate that the proposed framework is capable of detecting a range of interaction errors with less than 0.01% performance penalty and 0.45% area overhead.
Supply voltage fluctuation caused by inductive noise has become a critical problem in microprocessor design. A voltage emergency occurs when the supply voltage variation exceeds the acceptable voltage margin, jeopardizing microprocessor reliability. Existing techniques assume all voltage emergencies would definitely lead to incorrect program execution and prudently activate rollbacks or flushes to recover, and consequently incur high performance overhead. We observe that not all voltage emergencies result in externally visible errors, which can be exploited to avoid unnecessary protection. In this paper, we propose a substantial-impact-filter based method to tolerate voltage emergencies, comprising three key techniques: 1) analyzing the architecture-level masking of voltage emergencies during program execution; 2) proposing a metric, the intermittent vulnerability factor for intermittent timing faults (IVF_itf), to quantitatively estimate the vulnerability of microprocessor structures (load/store queue and register file) to voltage emergencies; 3) proposing a substantial-impact-filter based method to handle voltage emergencies. Experimental results demonstrate that our approach regains nearly 57% of the performance loss compared with the once-occur-then-rollback approach.
This work revisits the formulation of interpolation sequences, in order to better understand their relationships with Bounded Model Checking and with other Unbounded Model Checking approaches relying on standard interpolation. We first focus on different Bounded Model Checking schemes (bound, exact and exact-assume), pointing out their impact on the interpolation-based strategy. Then, we compare the abstraction ability of interpolation sequences with standard interpolation, highlighting their convergence at potentially different sequential depths. We finally propose a tight integration of interpolation sequences with an abstraction-refinement strategy. Our contributions are first presented from a theoretical standpoint, then supported by experimental results (on academic and industrial benchmarks) adopting a state-of-the-art academic tool.
In the last decade, functional verification has become a major bottleneck in the design flow. To relieve this growing burden, assertion-based verification has gained popularity as a means to increase the quality and efficiency of verification. Although robust, the adoption of assertion-based verification poses new challenges to debugging due to the presence of errors in the assertions themselves. These unique challenges necessitate a departure from past automated circuit debugging techniques, which are shown to be ineffective. In this work, we present a methodology, a mutation model and additional techniques to debug errors in SystemVerilog assertions. The methodology uses the failing assertion, a counterexample and the mutation model to produce alternative properties that are verified against the design. These properties serve as a basis for possible corrections. They also provide insight into the design behavior and the failing assertion. Experimental results show that this process is effective in finding high-quality alternative assertions for all empirical instances.
The quality of network-on-chip (NoC) designs depends crucially on the size of buffers in NoC components. While buffers impose a significant area and power overhead, they are essential for ensuring high throughput and low latency. In this paper, we present a new approach for minimizing the cumulative buffer size in on-chip networks, so as to meet throughput and latency requirements, given high-level specifications on traffic behavior. Our approach uses model checking based on satisfiability modulo theories (SMT) solvers, within an overall counterexample-guided synthesis loop. We demonstrate the effectiveness of our technique on NoC designs involving arbitration, credit logic, and virtual channels.
Operating system (OS) models are widely used to alleviate the overwhelming complexity of system-level simulation of software applications running on a specific OS implementation. Nevertheless, current OS modeling approaches cannot maintain both simulation speed and accuracy when dealing with preemptive scheduling. This paper proposes a Data-dependency-Oriented Modeling (DOM) approach: by guaranteeing the order of shared-variable accesses, accurate simulation results are obtained. Meanwhile, the simulation effort of our approach is considerably less than that of conventional Cycle-Accurate (CA) modeling, leading to high simulation speeds of 42 to 223 million instructions per second (MIPS), up to 114 times faster than CA modeling, as supported by our experimental results.
Keywords-OS modeling; preemptive scheduling; simulation
Ideally, system-level simulation should provide a high simulation speed with sufficient timing detail for both functional verification and performance evaluation. However, existing cycle-accurate (CA) and cycle-approximate (CX) processor models suffer either low simulation speed due to excessive timing detail or low accuracy due to simplified timing models. To achieve high simulation speed while maintaining the timing accuracy of the system simulation, we propose the first cycle-count-accurate (CCA) processor modeling approach, which pre-abstracts the internal pipeline and cache into models with accurate cycle-count information and guarantees accurate timing and functional behavior at the processor interface. The experimental results show that the CCA model performs 50 times faster than the corresponding CA model while providing the same execution cycle-count information as the target RTL model.
This paper proposes a shared-variable-based approach for fast and accurate multi-core cache coherence simulation. While the intuitive, conventional approach of synchronizing at every cycle or at every memory access gives accurate simulation results, it performs poorly due to heavy synchronization overhead. We observe that timing synchronization is only needed before shared-variable accesses; the proposed shared-variable-based approach exploits this to maintain accuracy while improving efficiency. The experimental results show that our approach performs 6 to 8 times faster than the memory-access-based approach and 18 to 44 times faster than the cycle-based approach while maintaining accuracy.
Keywords- cache-coherence; timing synchronization
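The synchronization-reduction idea behind the abstract above, aligning simulated cores only before shared-variable accesses rather than at every cycle, can be sketched in a few lines. The trace format and the counts are our own illustration, not the paper's simulator:

```python
# Sketch: count how many timing synchronizations one simulated core needs
# under a cycle-based policy versus a shared-variable-based policy.
def sync_points(trace, policy):
    """trace: list of ("local", n_cycles) or ("shared", var_name) operations.
    Returns (total simulated cycles, number of synchronizations)."""
    cycles = 0
    syncs = 0
    for op, arg in trace:
        if op == "local":
            cycles += arg            # run ahead without synchronizing
        else:                        # shared-variable access
            cycles += 1
            if policy == "shared-variable":
                syncs += 1           # sync only right before shared accesses
    if policy == "cycle":
        syncs = cycles               # conventional: one sync per cycle
    return cycles, syncs

trace = [("local", 500), ("shared", "lock"), ("local", 2000), ("shared", "buf")]
cycles, cyc_syncs = sync_points(trace, "cycle")
_, sv_syncs = sync_points(trace, "shared-variable")
print(cycles, cyc_syncs, sv_syncs)   # prints: 2502 2502 2
```

Even on this toy trace, the shared-variable policy needs 2 synchronizations where the cycle-based policy needs 2502, which is the intuition behind the reported 18x to 44x speedup.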
Virtual platform simulation is an essential technique for early-stage system-level design space exploration and embedded software development. For exploring hardware behavior and verifying embedded software, simulation speed and accuracy are the two most critical factors. However, given the increasing complexity of Multi-Processor System-on-Chip (MPSoC) designs, even state-of-the-art virtual platform simulation algorithms may suffer from low simulation speed. In this paper, we propose an Ultra Synchronization Checking Method (USCM) for fast and robust virtual platform simulation. We devise a data dependency table (DDT) so that the memory access information of hardware modules and software programs can be predicted and checked. By reducing unnecessary synchronizations among simulation modules and utilizing the asynchronous discrete-event simulation technique, we can significantly improve virtual platform simulation speed. Our experimental results show that the proposed USCM can simulate a 32-processor SoC design at a speed of multiple million instructions per second. We also demonstrate that our method is less sensitive to the number of cores in the virtual platform simulation.
Keywords-Virtual Platform Simulation, SoC, Synchronization.
This paper presents an all-digital built-in self-test (BIST) technique for characterizing the error transfer function of RF PLLs. This BIST scheme, with on-chip stimulus synthesis and response analysis done completely in the digital domain, achieves high-accuracy characterization and is applicable to a wide range of PLL architectures. For the popular sigma-delta fractional-N RF PLLs, the added circuitry required for this BIST solution is all digital except for a bang-bang phase-frequency detector (BB-PFD), which incurs an area of only 0.0001 mm2 in our implementation in a 65nm CMOS technology. The silicon characterization results at 3.6 GHz reported by this BIST solution and by explicit measurement have a root-mean-square difference of only 0.375 dB. Index Terms - BIST, PLL, frequency synthesizer, frequency modulator.
Different built-in self-test schemes for RF circuits have been developed that resort to peak voltage detectors. These are simple to implement but provide only conditional RF power measurement accuracy, as the impedance is assumed to be known. A true power detector is presented that allows more accurate measurements to be obtained, in particular under output load variations. The theoretical foundations underlying the power detector's operating principle are presented, and simulation and experimental results obtained with a prototype chip are described which confirm the benefits of measuring true power, compared to output peak voltage, when observing output load matching deviations and complex waveforms.
Keywords-RF testing; power amplifier; power sensor
We present an application of Defect Oriented Testing (DOT) to an industrial mixed-signal device to reduce test time while maintaining quality. The device is an automotive IC product with stringent quality requirements and a mature test program that is already in volume production. A complete flow is presented, including defect extraction, defect simulation, test selection, and validation. A major challenge of DOT for mixed-signal devices is the simulation time. We address this challenge with a new fault simulation algorithm that provides significant speedup in the DOT process. Based on the fault simulations, we determine a minimal set of tests which detects all defects. The proposed minimal test set is compared with the actual test results of more than a million ICs. We show that the production tests of the device can be reduced by at least 35%.
Testing high-speed Digital-to-Analog Converters (DACs) is a challenging task, as it requires a large number of high-speed synchronized input signals with specific test patterns. To overcome this problem, we propose the use of PRBS signals with an "Alternate-Bit-Tapping" technique and eye-diagram measurement as a solution to efficiently generate the test vectors and test the DACs. This approach covers all levels and transitions necessary for completely testing the dynamic behavior of the DAC, in the minimum possible time. Circuit-level simulations are used to verify its usefulness in testing a 4-bit 20-GS/s current-steering DAC.
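As a rough illustration of PRBS-based stimulus generation, the sketch below produces a standard PRBS-7 sequence from its LFSR; the grouping of every other bit into 4-bit words is only one plausible reading of the alternate-bit-tapping idea, not the paper's exact construction:

```python
def prbs7(n, seed=0x7F):
    """Generate n bits of a PRBS-7 sequence from the x^7 + x^6 + 1 LFSR
    (maximal length, period 127)."""
    state = seed & 0x7F
    bits = []
    for _ in range(n):
        bits.append(state & 1)                  # output the LSB of the state
        newbit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | newbit) & 0x7F
    return bits

bits = prbs7(254)
# Hypothetical "alternate-bit tapping": use every other PRBS bit and pack the
# taps into 4-bit words serving as DAC test vectors.
words = [bits[i:i + 8:2] for i in range(0, 248, 8)]
print(len(words), words[0])
```

Because PRBS-7 is maximal-length, any window of 127 output bits exercises every nonzero LFSR state, which is why PRBS stimuli densely cover levels and transitions.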
Thermal issues have become critical roadblocks to achieving highly reliable three-dimensional (3D) integrated circuits. This paper performs both an evaluation and a mitigation of the impact of leakage power variations on the temperature profile of 3D Chip-Multiprocessors (CMPs). Furthermore, this paper provides a learning-based model to predict the maximum temperature, based on which a simple yet effective tier-stacking algorithm is proposed to mitigate the impact of variations on the temperature profile of 3D CMPs. Results show that (1) the proposed prediction model achieves more than 98% accuracy, (2) a 4-tier 3D implementation can be more than 40°C hotter than its 2D counterpart, and (3) the proposed tier-stacking algorithm significantly improves the thermal yield of a 3D CMP from 44.4% to 81.1%.
Keywords-thermal; leakage; process variation; 3D; stack; yield; chip-multiprocessor; statistical learning; regression
3D integration based on TSV (through-silicon via) technology enables the stacking of multiple memory layers and offers higher bandwidth at lower energy consumption for the memory interface. As energy efficiency is key in mobile applications, 3D integration is a particularly strategic technology there. In this paper we focus on the design space exploration of 3D-stacked DRAMs with respect to performance, energy and area efficiency for densities from 256 Mbit to 4 Gbit per 3D-DRAM channel. We investigate four technology nodes from 75nm down to 45nm and show the optimal design point for the currently most common commodity DRAM density of 1 Gbit. Multiple channels can be combined for main memory sizes of up to 32 GB. We present a functional SystemC model of the 3D-stacked DRAM, coupled with an SDR/DDR 3D-DRAM channel controller; parameters for this model were derived from detailed circuit-level simulations. The exploration demonstrates that an optimized 1 Gbit 3D-DRAM stack is 15x more energy efficient than a commodity Low-Power DDR SDRAM part without IO drivers and pads. To the best of our knowledge, this is the first design space exploration for 3D-stacked DRAM considering different technologies and real-world physical commodity DRAM data.
Thermal issues are one of the primary challenges in 3-D integrated circuits. Thermal through-silicon vias (TTSVs) are considered an effective means to reduce the temperature of 3-D ICs. The effect of the physical and technological parameters of TTSVs on the heat transfer process within 3-D ICs is investigated. Two resistive networks are utilized to model the physical behavior of TTSVs. Based on these models, closed-form expressions are provided describing the flow of heat through TTSVs within a 3-D IC. The accuracy of these models is compared with results from a commercial FEM tool: for an investigated three-plane circuit, the average error of the first and second models is 2% and 4%, respectively. The effect of the physical parameters of TTSVs on the resulting temperature is described through the proposed models; for example, the temperature changes non-monotonically with the thickness of the silicon substrate, a behavior not captured by the traditional single-thermal-resistance model. The proposed models are used for the thermal analysis of a 3-D DRAM-μP system, where the conventional model is shown to considerably overestimate the temperature of the system. Index Terms - 3-D ICs, Thermal through-silicon via (TTSV), thermal resistance, heat conductivity.
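A minimal sketch of the resistive-network view used in this abstract: each inter-tier layer is modeled as the parallel combination of a low-conductivity bond/dielectric path and the copper TTSVs through it, using R = t / (k·A). The material constants and geometry below are illustrative assumptions, not the paper's models:

```python
import math

def parallel(r1, r2):
    # Two thermal resistances carrying heat side by side.
    return r1 * r2 / (r1 + r2)

def plane_resistance(t, area, n_tsv, d_tsv, k_layer=1.0, k_cu=400.0):
    """Thermal resistance (K/W) of one inter-tier layer of thickness t (m)
    and area (m^2): dielectric path in parallel with n_tsv copper TTSVs of
    diameter d_tsv. Conductivities k are in W/(m*K), assumed values."""
    a_tsv = n_tsv * math.pi * (d_tsv / 2) ** 2
    r_layer = t / (k_layer * (area - a_tsv))
    if n_tsv == 0:
        return r_layer
    r_tsv = t / (k_cu * a_tsv)
    return parallel(r_layer, r_tsv)

no_tsv = plane_resistance(10e-6, 1e-6, 0, 20e-6)
with_tsv = plane_resistance(10e-6, 1e-6, 16, 20e-6)
print(round(no_tsv, 2), round(with_tsv, 2))  # TTSVs cut the layer resistance ~3x
```

Summing such per-plane resistances along the stack gives the temperature rise ΔT = P · ΣR, which is where a single lumped resistance (ignoring the TSV paths) overestimates temperature.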
Through-silicon vias (TSVs), which provide high vertical interconnection density between device tiers, offer a promising way to reduce the length of global interconnects in 3D ICs. However, several design issues hinder the volume adoption of TSVs, such as IR drop, thermal dissipation, current delivery per package pin, and the various voltage domains across tiers. To tackle these problems, the design of the power network plays an important role in 3D ICs. A new integrated architecture of stacked TSVs and power distribution network (STDN) is proposed in this paper. The STDN serves three roles: a power network that delivers larger currents and reduces IR drop, a thermal network that reduces temperature, and a decoupling-capacitor network that reduces power noise. It also helps alleviate the limitation on the number of IO power pins. For both single and multiple power domains, the proposed STDN architecture demonstrates good performance in 3D floorplan, IR drop, power noise, temperature, area, and even the total length of signal connections for selected MCNC benchmarks.
Multiprocessor systems-on-chip (MPSoCs) have become the de-facto standard in embedded systems. The use of Networks-on-Chip (NoCs) provides these platforms with scalability and support for parallel transactions. The computational power of these architectures enables the simultaneous execution of several applications with different time constraints. However, as the number of applications executing simultaneously increases, their performance may be affected by resource sharing. To ensure that application requirements are met, mechanisms for proper isolation are necessary; this feature is referred to as composability. As the NoC is the main shared component in NoC-based MPSoCs, quality-of-service (QoS) mechanisms are mandatory to meet application requirements in terms of communication. In this work, we propose a hardware/software approach to achieve application composability by means of QoS management mechanisms at the software level. The conducted experiments show the efficiency of the proposed method in terms of throughput, latency and jitter for a real-time application sharing communication resources with best-effort applications.
Keywords-MPSoC; NoC; QoS; Composability; API
In this paper, we propose a processor allocation mechanism for the run-time assignment of the communicating tasks of input applications onto the processing nodes of a Chip Multiprocessor (CMP), when the arrival order and execution lifetimes of the input applications are not known a priori. This mechanism targets on-chip communication and aims to reduce the power consumption and latency of the NoC employed as the communication infrastructure. In this work, we benefit from the advantages of non-contiguous processor allocation mechanisms by allowing the tasks of an input application to be mapped onto disjoint regions (sub-meshes) and then virtually connecting them by bypassing the router pipeline stages of the inter-region routers. The experimental results show considerable improvement over one of the best existing allocation mechanisms.
Keywords-chip multiprocessors; network-on-chip; processor allocation; contiguous allocation; non-contiguous allocation; power consumption; performance.
Quality-of-Service (QoS) has become a vital requirement in MPSoCs with NoCs. To provide it, NoCs offer guarantees on latency, jitter and bandwidth through virtual channels, but how to allocate these guaranteed-service channels remains an important question. In this paper we present and evaluate different realizations of a central hardware unit that allocates guaranteed-service virtual channels at run time, providing QoS in packet-switched NoCs. We evaluate their performance in terms of allocation success, compare it to distributed channel setup techniques for different NoC sizes and traffic scenarios, and analyze the required hardware area. We find centralized channel allocation to be very suitable for our run-time task scheduling programming model. Index Terms - Network-on-Chip, virtual channel, guaranteed service, channel allocation, Quality-of-Service.
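The allocation decision itself can be sketched as simple bookkeeping: a guaranteed-service channel is granted only if every link on its path still has a free virtual channel. The path encoding and VC accounting below are our own simplification of what such a central allocator tracks:

```python
class CentralAllocator:
    """Minimal sketch of a central run-time allocator for guaranteed-service
    virtual channels in a packet-switched NoC (illustrative, not the paper's
    hardware unit)."""

    def __init__(self, vcs_per_link):
        self.free = {}              # (src, dst) link -> remaining free VCs
        self.default = vcs_per_link

    def request(self, path):
        links = list(zip(path, path[1:]))
        # Grant only if every link along the path has a free VC.
        if all(self.free.get(l, self.default) > 0 for l in links):
            for l in links:
                self.free[l] = self.free.get(l, self.default) - 1
            return True
        return False                # reject; requester may retry or reroute

    def release(self, path):
        for l in zip(path, path[1:]):
            self.free[l] += 1

alloc = CentralAllocator(vcs_per_link=2)
print(alloc.request(["A", "B", "C"]))  # True
print(alloc.request(["A", "B"]))       # True
print(alloc.request(["A", "B", "D"]))  # False: link (A, B) has no free VC left
```

Centralizing this state is what lets the allocator decide atomically over whole paths, instead of the link-by-link negotiation of distributed channel setup.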
FPGA prototyping of recent large Systems-on-Chip (SoCs) is very challenging due to the resource limitations of a single FPGA. Moreover, external access to SoCs for verification and debug purposes is essential. In this paper, we propose partitioning a network-on-chip (NoC) based system into smaller sub-systems, each with its own NoC, and each implemented on a separate FPGA board. Multiple SoC ASICs can be bridged in the same way. The scheme that interconnects the sub-systems should offer the application connections the required quality of service (QoS). We investigate bridging schemes at different levels of the NoC protocol stack and, comparing the distinct design criteria of the proposed schemes, design a bridge. The bridge experiments show that it provides QoS in terms of bandwidth and latency.
Wireless communications has been a hot area of technology advancement for the past two decades. As memory sizes increase, the demand for higher communication data rates grows at the same scale. This means that one must understand today's high-end 10 Gbit/s wireless technology to prepare for the 100 Gbit/s and 1 Tbit/s data rates of tomorrow. This paper presents key boundary conditions, learned from today's leading-edge wireless links, for the Tbit/s technology of the year 2020.
This paper presents the evolution of CMOS image sensors. From the early works, which were strongly oriented towards image processing, the main research effort then shifted to image acquisition. To overcome the rising limitations of standard approaches and to enable new functionalities, several research directions are underway with promising results.
Keywords-image sensors; vision chips; imagers; 3D technology
This work presents a method for global routing (GR) that minimizes interconnect power. We consider designs with multiple supply voltages, where level converters are added to nets that connect driver cells to sink cells at a higher supply voltage. The level converters are modeled as additional terminals during GR. Given an initial GR solution obtained with the objective of minimizing wirelength, we propose a GR method that detours nets to further save interconnect power. When detouring routes via this procedure, overflow is not increased and the increase in wirelength is bounded. The power saving opportunities include: 1) reducing the area capacitance of the routes by detouring them from the higher metal layers to the lower ones, 2) reducing the coupling capacitance between adjacent routes by distributing congestion, and 3) assigning a different power weight to each segment of a routed net with level converters (to capture its supply voltage and activity factor). We present a mathematical formulation capturing these power saving opportunities and solve it using integer programming techniques. In our simulations, we show considerable savings in an interconnect power metric for GR, without any wirelength degradation.
Based on the width determination of each current-driven connection for electromigration and IR-drop avoidance, an area-driven multiple-source routing tree is first constructed to minimize the total wiring area while satisfying Kirchhoff's current law and the electromigration and IR-drop constraints. Furthermore, Steiner points can be assigned to feasible locations to further reduce the total wiring area under these constraints. Finally, an obstacle-aware multiple-source rectilinear Steiner tree is constructed by assigning obstacle-aware minimum-length physical paths for all connections. Compared with Lienig's multiple-source Steiner tree [7], the experimental results show that our approach without any IR-drop constraint reduces the total wiring area by 10.5%. Under 10% Vdd and 5% Vdd IR-drop constraints, our approach satisfies 100% of the electromigration and IR-drop constraints and reduces the original total wiring area by 7.5% and 4.9% on average, respectively, for the tested examples.
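The width-determination step can be illustrated with back-of-the-envelope formulas: the electromigration limit fixes a minimum cross-section for a given current, and the resulting width then determines the IR drop of the segment. The current-density limit, sheet resistance, and voltage budget below are assumed values, not the paper's data:

```python
def min_width(i_ma, j_max=1.0, thickness_um=0.5):
    """Minimum metal width (um) keeping current density below the assumed
    electromigration limit j_max (mA per um^2 of cross-section)."""
    return i_ma / (j_max * thickness_um)

def ir_drop_mv(i_ma, length_um, width_um, r_sheet=0.04):
    """IR drop (mV) along a segment; r_sheet is sheet resistance in ohm/sq,
    so resistance = r_sheet * (length / width)."""
    return i_ma * r_sheet * (length_um / width_um)

w = min_width(10.0)                    # width for a 10 mA connection
drop = ir_drop_mv(10.0, 1000.0, w)     # drop over a 1 mm segment at that width
print(w, drop, drop <= 0.05 * 1000.0)  # check against a 5% budget at Vdd = 1 V
```

Widening a wire reduces its IR drop but increases wiring area, which is exactly the trade-off the area-driven tree construction optimizes under both constraints.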
A novel rotary clock network routing method is proposed for the low-power resonant rotary clocking technology which guarantees: 1) a balanced capacitive load driven by each of the tapping points on the rotary rings, 2) customized bounded clock skew among all the registers on chip, and 3) a sub-optimally minimized total wirelength of the clock wire routes. In the proposed method, a forest of Steiner trees is first created which connects the registers so as to achieve zero skew while greedily balancing the total capacitance of each tree. Then, a balanced assignment of the Steiner trees to the tapping points is performed to guarantee a balanced capacitive load on the rotary network. The proposed routing method is tested with the ISPD clock network contest and IBM r1-r5 benchmarks. The experimental results show that the capacitive load imbalance is very limited. The total wirelength is reduced by 64.2% compared to the best previous work in the literature through the combination of Steiner tree routing and the assignment of trees to tapping points. The average clock skew simulated using HSPICE is only 8.8 ps when the bounded skew target is set to 10.0 ps.
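The balanced assignment of Steiner trees to tapping points resembles a classic load-balancing problem. A greedy longest-processing-time sketch conveys the idea, using a min-heap of tap loads; this is our own illustration, not the paper's assignment algorithm:

```python
import heapq

def assign_trees(tree_caps, n_taps):
    """Greedy balanced assignment: consider subtrees in decreasing capacitance
    and always attach the next one to the currently least-loaded tapping point
    (the LPT load-balancing heuristic)."""
    heap = [(0.0, t) for t in range(n_taps)]   # (accumulated load, tap id)
    heapq.heapify(heap)
    assignment = {}
    for tree, cap in sorted(enumerate(tree_caps), key=lambda x: -x[1]):
        load, tap = heapq.heappop(heap)
        assignment[tree] = tap
        heapq.heappush(heap, (load + cap, tap))
    loads = [0.0] * n_taps
    for tree, tap in assignment.items():
        loads[tap] += tree_caps[tree]
    return assignment, loads

caps = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0]          # illustrative subtree capacitances
_, loads = assign_trees(caps, 3)
print(loads)                                   # prints: [12.0, 11.0, 11.0]
```

Keeping the per-tap loads nearly equal is what limits the capacitive load imbalance on the rotary rings, since each tapping point sees one bin of subtrees.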
Routing for high-speed boards is still performed manually nowadays. Recent work on escape routing has addressed parts of this problem, but a more practical variant remains open: packages and components are often designed with or without input from board designers, and the boundary pin ordering is usually fixed, or given as a guideline, by the time board design starts. Previous escape routing work is therefore unlikely to apply. In this work, we describe this fixed-ordering boundary-pin escape problem and propose a practical approach to solve it. Beyond addressing the problem itself, we further plan the wires so as to preserve the precious routing resources in the limited number of board layers and to deal effectively with obstacles; our approach differs from the conventional shortest-path-based routing paradigm. In addition, we consider length-matching requirements and wire shape resemblance for high-speed signal routes on the board. Our results show that we utilize routing resources very carefully and account for the resemblance of nets in the presence of obstacles. Our approach works for board busses as well.
Dynamic stability analysis for SRAM has been growing in importance with technology scaling. This paper analyzes dynamic writability for designing low voltage SRAM in nanoscale technologies. We propose a definition for dynamic write limited VMIN. To the best of our knowledge, this is the first definition of a VMIN based on dynamic stability. We show how this VMIN is affected by the array capacity, the voltage scaling of the word-line pulse, the bitcell parasitics, and the number of cycles prior to the first read access. We observe that the array can be either dynamically or statically write limited depending on the aforementioned factors. Finally, we look at how voltage-bias based write assist techniques affect the dynamic write limited VMIN.
With increasing levels of variability in the characteristics of VLSI circuits and continued uncertainty in the operating conditions of processors, achieving predictable power efficiency and high performance in electronic systems has become a daunting, yet vital, task. This paper tackles the problem of system-level dynamic power management (DPM) in state-of-the-art chip multiprocessor (CMP) architectures that are manufactured in nanoscale CMOS technologies with large process variations or are operated under widely varying environmental conditions over their lifetime. We adopt a Markovian Decision Process (MDP) based approach to the CMP power management problem. The proposed technique models the underlying variability and uncertainty of system-level parameters as a partially observable MDP, and finds the optimal policy that stochastically minimizes energy per request. Experimental results demonstrate the high efficacy of the proposed power management framework.
Keywords - Chip multiprocessor; Dynamic power management; partially observable Markovian decision process
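For intuition, a fully observable toy version of such an MDP-based power manager can be solved by value iteration; the states, costs, and transitions below are invented for illustration, and the paper's partially observable formulation additionally maintains a belief state over the hidden parameters:

```python
# Toy DPM model: minimize discounted energy cost per step.
states = ["active", "sleep"]
actions = ["stay", "switch"]
# (state, action) -> (energy cost per step, next-state probabilities); all
# numbers are illustrative.
model = {
    ("active", "stay"):   (1.0, {"active": 1.0}),
    ("active", "switch"): (0.3, {"sleep": 1.0}),
    ("sleep", "stay"):    (0.1, {"sleep": 1.0}),
    ("sleep", "switch"):  (0.5, {"active": 1.0}),
}

def value_iteration(gamma=0.9, iters=200):
    """Standard value iteration for the minimum-cost policy."""
    v = {s: 0.0 for s in states}
    for _ in range(iters):
        v = {s: min(c + gamma * sum(p * v[n] for n, p in probs.items())
                    for (st, a), (c, probs) in model.items() if st == s)
             for s in states}
    policy = {s: min(actions, key=lambda a: model[(s, a)][0] +
                     gamma * sum(p * v[n] for n, p in model[(s, a)][1].items()))
              for s in states}
    return v, policy

v, policy = value_iteration()
print(policy)  # prints: {'active': 'switch', 'sleep': 'stay'}
```

In this toy model the optimal policy is to pay the one-time switch cost and stay asleep, since the discounted cost of sleeping (0.1 per step) beats staying active (1.0 per step); a realistic model would add request arrivals and wake-up latencies.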
In this paper, we study the problem of reducing the overall energy consumption of a real-time system while ensuring its timing and maximum-temperature constraints. We incorporate the interdependence of leakage, temperature and supply voltage into the analysis and develop a novel method to quickly estimate the overall energy consumption. Based on this method, we then propose a scheduling technique to minimize the overall energy consumption under the maximum temperature constraint. Our experimental results show that the proposed energy estimation method achieves up to four orders of magnitude of speedup over existing approaches while keeping the maximum estimation error within 4.8%. In addition, simulation results demonstrate that our proposed energy minimization method consistently and significantly outperforms previous related approaches.
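The leakage/temperature interdependence mentioned above is commonly resolved by fixed-point iteration: leakage power raises temperature, which in turn raises leakage, until the two settle. The sketch below uses an exponential leakage model with illustrative constants and is not the paper's estimation method:

```python
import math

def steady_temperature(p_dyn, t_amb=45.0, r_th=0.8, p_leak0=5.0,
                       t_ref=45.0, k=0.02, iters=50):
    """Fixed-point iteration for
        T = T_amb + R_th * (P_dyn + P_leak(T)),
        P_leak(T) = P0 * exp(k * (T - T_ref)).
    Units: W, K/W, degrees C; all constants are assumed values."""
    t = t_amb
    for _ in range(iters):
        p_leak = p_leak0 * math.exp(k * (t - t_ref))  # leakage at current temp
        t = t_amb + r_th * (p_dyn + p_leak)           # temperature it implies
    return t, p_leak

t, p_leak = steady_temperature(20.0)
print(round(t, 2), round(p_leak, 2))
```

The iteration converges quickly here because the loop gain R_th · dP_leak/dT is well below one; schedulers that need many such evaluations are exactly where a fast closed-form or table-based estimate pays off.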
This paper presents a power and performance multi-objective Tabu Search based technique for designing application-specific Network-on-Chip architectures. The topology generation approach uses an automated technique to incorporate floorplan information and obtain accurate values for wirelength and area. The method also takes dynamic effects such as contention into account, allowing performance constraints to be incorporated during topology synthesis. A new contention analysis method is presented which evaluates power and performance objectives using a Layered Queuing Network (LQN) contention model. The contention model is able to analyze rendezvous interactions between NoC components and alleviate potential bottleneck points within the system. Several experiments are conducted on various SoC benchmark applications and compared to previous works.
Keywords - Network-on-Chip, Topology Generation, Tabu Search, Layered Queuing Networks, Contention
Shared L1 memory is an interesting architectural option for building tightly-coupled multi-core processor clusters. We designed a parametric, fully combinational Mesh-of-Trees (MoT) interconnection network to support high-performance, single-cycle communication between processors and memories in L1-coupled processor clusters. Our interconnect IP is described in synthesizable RTL and is coupled with a design automation strategy mixing advanced synthesis and physical optimization to achieve optimal delay, power and area (DPA) under a wide range of design constraints. We explore DPA for a large set of network configurations in 65nm technology. Post place-and-route delay is 38 FO4 for a configuration with 8 processors and 16 32-bit memories (8x16); when the number of both processors and memories is increased by a factor of 4, the delay increases almost logarithmically to 84 FO4, confirming scalability across a significant range of configurations. DPA tradeoff flexibility is also promising: compared to the maximum-performance 16x32 configuration, power and area can be reduced by 45% and 12%, respectively, at the expense of 30% performance degradation.
Interconnection networks with adaptive routing are susceptible to deadlock, which could lead to performance degradation or system failure. Detecting deadlocks at run-time is challenging because of their highly distributed characteristics. In this paper, we present a deadlock detection method that utilizes run-time Transitive Closure (TC) computation to discover the existence of deadlock-equivalence sets, which imply loops of requests in networks-on-chip (NoC). This detection scheme guarantees the discovery of all true deadlocks without false alarms unlike state-of-the-art approximation and heuristic approaches. A distributed TC-network architecture which couples with the NoC architecture is also presented to realize the detection mechanism efficiently. Our results based on a cycle-accurate simulator demonstrate the effectiveness of the TC-network method. It drastically outperforms timing-based deadlock detection mechanisms by eliminating false detections and thus reducing energy dissipation in various traffic scenarios. For example, timing based methods may produce two orders of magnitude more deadlock alarms than the TC-network method. Moreover, the implementations presented in this paper demonstrate that the hardware overhead of TC-networks is insignificant.
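The transitive-closure test at the heart of this detection scheme can be sketched in software with Warshall's algorithm over a wait-for graph: a deadlock-equivalence set exists exactly when some node can reach itself. The graph encoding is our own illustration; the paper computes the closure in distributed hardware at run time:

```python
def has_deadlock(wait_for):
    """wait_for: dict mapping each node (e.g. a channel or router port) to the
    set of nodes whose resources it is waiting on. Returns True iff the
    wait-for graph contains a cycle, i.e. a deadlock-equivalence set."""
    nodes = sorted(wait_for)
    # Adjacency matrix of the wait-for relation.
    reach = {u: {v: v in wait_for[u] for v in nodes} for u in nodes}
    # Warshall's transitive closure: reach[i][j] becomes True iff j is
    # reachable from i through any chain of wait-for edges.
    for k in nodes:
        for i in nodes:
            if reach[i][k]:
                for j in nodes:
                    if reach[k][j]:
                        reach[i][j] = True
    return any(reach[u][u] for u in nodes)

print(has_deadlock({"A": {"B"}, "B": {"C"}, "C": {"A"}}))   # True: cyclic wait
print(has_deadlock({"A": {"B"}, "B": {"C"}, "C": set()}))   # False: acyclic
```

Because the closure is exact, this test raises no false alarms, which is precisely the advantage claimed over timeout-based heuristics that must guess whether a long wait is a deadlock.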
As the design complexity of LSI systems increases, so do the verification challenges. It is very important, yet difficult, to find all design errors and correct them in a timely manner. This paper presents our experience with a new verification and debug methodology based on the combination of formal verification and automated debugging. This methodology, applied to the development of a DDR2 memory design targeted for an FPGA, is found to significantly reduce the verification and debug tasks typically performed.
Keywords-system LSI; verification; debug; methodology
We present a compact model that provides a quick estimation of the stress and mobility patterns around arbitrary configurations of Through-Silicon Vias (TSVs). No separate TCAD simulations are required for these configurations. It estimates nFET and pFET mobility for industry-standard as well as (100)/<100> substrate orientations. As the model provides mobility information in less than 0.1 millisecond per transistor per TSV, it can be used in combination with layout tools and circuit simulators to optimise circuit layouts for digital and analog applications. The model has been integrated into the 3D PathFinding flow for steering 3D IO placement during stack definition.
We look into the validation of a power-managed ARM Cortex A-8 core used in SoCs targeted at the mobile segment. Low-power design techniques used on the chip include clock gating, voltage scaling, and power gating. We focus on the verification challenges faced in designing the processor core, including RTL modeling of power switches, isolation and level-shifting cells; simulation of voltage ramps; generation of appropriate control signals to put the device into various power states; and ensuring correct operation of the chip in these states as well as during the transitions between them.
Keywords- low power, verification, power gating, dynamic voltage scaling, power switches, isolation, ARM Cortex A-8
Constraints imposed on the design of components for mobile devices include the size of the handheld device, handling safety, heat dissipation, and in-system electromagnetic interference. This paper discusses the challenges in designing the next generation of low-power DRAM subsystems operating at multiple gigabits per second. A new mobile DRAM interface that can meet these challenges is presented, along with some test data.
Keywords-low power DRAM; thermal; EM emission; high data rate; package-on-package; mobile phone
One of the most important challenges facing the entire globe is the trend towards an aging population. By 2045, there will be more people over 60 years old than younger than 15, their number rising from 600 million to 2 billion worldwide. This will raise the number of patients with age-specific, chronic and degenerative diseases (e.g. cardio-vascular, cancer, diabetes, Alzheimer's, Parkinson's). Minimally-invasive imaging technologies such as PET (Positron Emission Tomography) and MRI (Magnetic Resonance Imaging) play a vital role in detecting and tracking the evolution of these illnesses and in determining the strategy and effectiveness of the prescribed therapies. So far, the detection unit of PET equipment has been implemented using photomultiplier tubes (PMTs). A novel solid-state photo-detector, the Silicon Photomultiplier (SiPM), can replace the PMT, offering, among many other advantages, the possibility of combined PET/MRI equipment.
Keywords: Nuclear Medicine, Photomultiplier, central nervous system's diagnostics, PET, SiPM.
Today, electronic devices are increasingly employed in different fields, including safety- and mission-critical applications, where product quality is an essential requirement. In the automotive field, on-line self-test is a dependability technique currently demanded by emerging industrial standards. This paper presents an approach employed by STMicroelectronics for evaluating, or grading, the effectiveness of Software-Based Self-Test (SBST) procedures used for on-line testing of microcontrollers to be included in safety-critical vehicle parts, such as airbag and steering systems.
Keywords-SoC, test, software-based self-test, fault grading
Multi-core architectures that are built to reap performance and energy-efficiency benefits from the parallel execution of applications often employ runtime adaptive techniques in order to achieve, among other goals, load balancing, dynamic thermal management, and enhanced system reliability. Typically, such runtime adaptation at the system level requires the ability to quickly and consistently migrate a task from one core to another. For distributed memory architectures, the policy for transferring the task context between source and destination cores is of vital importance to the performance and to the successful operation of the system. Since its performance is negatively correlated with the communication overhead, energy consumption, and dissipated heat, task migration needs to be runtime adaptive to account for the system load, chip temperature, or battery capacity. This work presents a novel context-aware runtime adaptive task migration mechanism (CARAT) that reduces the task migration latency by 93.12%, 97.03%, and 100% compared to three state-of-the-art mechanisms, and allows the trade-off between maximum migration delay and performance overhead to be controlled at runtime. This mechanism is built on an in-depth analysis of the memory access behavior of several multimedia and robotic embedded-systems applications.
In this paper, an efficient embedded software synthesis approach based on a generalized clustering algorithm for static dataflow subgraphs embedded in general dataflow graphs is proposed. The clustered subgraph is quasi-statically scheduled, thus improving the performance of the synthesized software in terms of latency and throughput compared to a dynamically scheduled execution. The proposed clustering algorithm outperforms previous approaches through faster computation and a more compact representation of the derived quasi-static schedules. This is achieved by a rule-based approach, which avoids an explicit enumeration of the state space. Experimental results show significant improvements in both performance and code size when compared to a state-of-the-art clustering algorithm. Index Terms - MPSoC Scheduling, Software Synthesis, Actor-Oriented Design
NAND flash is preferred for code and data storage in embedded devices due to its high density and low cost. However, NAND flash requires code to be copied to main memory for execution. In inexpensive devices without hardware memory management, full shadowing of an application binary is commonly used to load the program. This approach can lead to a high initial application start-up latency and poor amortization of copy overhead. To overcome these problems, we describe a software-only demand-paging approach that incrementally copies code to memory with a dynamic binary translator (DBT). This approach does not require hardware or operating system support. With careful management, savings can be achieved in total code footprint, which can offset the size of the data structures used by the DBT. For applications that cannot amortize the full shadowing cost, our approach can reduce start-up latency by 50% or more, and improve performance by 11% on average.
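The trade-off the abstract describes, full shadowing's up-front copy versus demand paging's lazy, incremental copy, can be illustrated with a toy cost model. The page size, per-page copy cost, and page counts below are invented for illustration and are not figures from the paper:

```python
# Illustrative model of full shadowing vs. software demand paging from
# NAND flash (not the paper's DBT implementation; all constants are assumed).
PAGE_SIZE = 2048          # bytes per flash page (assumed)
COPY_COST_US = 50         # microseconds to copy one page to RAM (assumed)

def full_shadow_startup(num_pages):
    """Full shadowing copies every page before execution can start."""
    return num_pages * COPY_COST_US

def demand_paged_run(executed_pages):
    """Demand paging copies a page only on its first execution.
    Start-up latency is just the first page; the rest is paid lazily."""
    startup = COPY_COST_US                       # first page only
    total = len(set(executed_pages)) * COPY_COST_US
    return startup, total

# A binary of 100 pages where execution touches only 40 distinct pages:
touched = list(range(40))
startup, total = demand_paged_run(touched)
print(full_shadow_startup(100))   # 5000 us paid entirely up-front
print(startup, total)             # 50 us start-up, 2000 us total copy cost
```

The model shows why start-up latency drops sharply and why copy overhead is only paid for the code actually executed.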
The huge investment in the design and production of
multicore processors may be put at risk because the emerging
highly miniaturized but unreliable fabrication technologies will
impose significant barriers to the life-long reliable operation of
future chips. Extremely complex, massively parallel, multi-core
processor chips fabricated in these technologies will become more
vulnerable to: (a) environmental disturbances that produce
transient (or soft) errors, (b) latent manufacturing defects as well
as aging/wearout phenomena that produce permanent (or hard)
errors, and (c) verification inefficiencies that allow important
design bugs to escape into the system. In an effort to cope with
these reliability threats, several research teams have recently
proposed multicore processor architectures that provide low-cost
dependability guarantees against hardware errors and design
bugs. This paper focuses on dependable multicore processor
architectures that integrate solutions for online error detection,
diagnosis, recovery, and repair during field operation. It
discusses a taxonomy of representative approaches and presents a
qualitative comparison based on: hardware cost, performance
overhead, types of faults detected, and detection latency. It also
describes in more detail three recently proposed effective
architectural approaches: a software-anomaly detection
technique (SWAT), a dynamic verification technique (Argus),
and a core salvaging methodology.
Keywords: multicore microprocessors; dependable architectures;
online error detection/recovery/repair.
In this paper, we propose an energy-efficient 3D-stacked CMP design based on both temporally and spatially fine-grained tuning of processor cores and caches. In particular, temporally fine-grained DVFS is employed by each core and the L2 cache to reduce dynamic energy consumption, while spatially fine-grained DVS is applied to the cache hierarchy for leakage energy reduction. Our tuning technique is implemented by integrating an array of on-chip voltage regulators into the original processor. Experimental results show that the proposed design provides energy-efficient, direct, and adaptive control of the system, leading to 20% dynamic and 89% leakage energy reductions, and an average of 34% total energy savings compared to the baseline design.
This paper addresses the problem of model checking multiple properties on the same circuit/system. Although this is a typical scenario in several industrial verification frameworks, most model checkers currently handle single properties, verifying multiple properties one at a time. Possible correlations and shared sub-problems that could be exploited while checking different properties are typically ignored, either for the sake of simplicity or to enable Cone-Of-Influence minimization. In this paper, we describe a preliminary effort oriented to exploiting possible synergies among the distinct verification tasks of several properties on the same circuit. Besides considering given sets of properties, we also show that multiple properties can be automatically extracted from individual properties, thus simplifying difficult model checking tasks. Preliminary experimental results indicate that our approach can lead to significant performance improvements.
This paper describes a new and efficient solution for distributed event-driven gate-level HDL simulation. It is based on a novel concept of spatial parallelism using accurate prediction of the input and output signals of individual local modules in local simulations, derived from a model at a higher abstraction level (RTL). Using predicted rather than actual signal values makes it possible to eliminate or greatly reduce the communication and synchronization overhead in a distributed event-driven simulation.
This paper discusses specific circuit-level and analog DFT techniques and methodologies used in integrated power management (PM) systems to overcome the challenges of mixed-signal SoC qualification. They are mainly targeted at achieving the following: 1. enabling robust digital and system-level test and burn-in (BI) with external supplies, by disabling the on-chip PM while preserving robust power-on performance; 2. minimising external on-board active components on the BI board and making the whole BI process more robust; 3. making the IDDQ tests more robust, increasing IDDQ sensitivity through less error-prone design methods, and enabling IDDQ tests on analog supplies; and 4. defining a separate BI strategy for the on-chip PM modules as a whole and enabling it through targeted analog test modes.
Keywords: Burn-in, electrical reliability qualification, IDDQ,
analog DFT, power management.
To address the prohibitive costs of advanced technologies, one solution is to reuse masks across a wide range of systems. This can be achieved with a modular circuit that can be stacked to build 3D systems whose processing performance is adapted to several applications. This paper focuses on 4G wireless telecom applications. We propose a basic circuit that meets the SISO (Single Input Single Output) transmission mode. By stacking multiple instances of this same circuit, it becomes possible to address several MIMO (Multiple Input Multiple Output) modes. The proposed circuit is composed of several processing units interconnected by a 3D NoC and controlled by a host processor. Compared to a 2D reference platform, the proposed circuit maintains at least the same performance and power consumption in the context of 4G telecom applications, while reducing total mask cost.
As the operating frequency of LSIs becomes higher and the power supply voltage lower, on-chip power supply variation has become a dominant factor influencing the signal delay of circuits. Static timing analysis (STA) considering on-chip power supply variations (IR-drop) is
therefore one of the most crucial issues in LSI design today. We propose an efficient STA method that considers on-chip power supply variations by utilizing the spatial correlations of IR-drop. The proposed method is based on the widely used LOCV (Location-based OCV) technique for STA under on-chip variations (OCV), and can therefore be easily incorporated into an existing timing analysis flow. The proposed
method is evaluated by using test data including H-tree clock
structure with various on-chip IR-drop distributions. The
experimental results show that the proposed method can reduce
the design margin with respect to power supply variations by 6-85%
(47% on average) compared with the conventional
practical approach with a constant OCV derating factor, while
requiring no additional computation cost in the static timing
analysis. Thus the proposed method can contribute to a fast
timing closure considering on-chip power supply variations.
Keywords-static timing analysis; power supply variation; OCV
SyncCharts are a synchronous Statechart variant for modeling reactive systems with a precise and deterministic semantics. Simulation and software synthesis for SyncCharts usually involve compilation into Esterel, which is then further compiled into C code. This can produce efficient code, but has two principal drawbacks: 1) the arbitrary control flow that can be expressed with SyncChart transitions cannot be mapped directly to Esterel, and 2) it is very difficult to map the resulting C code back to the original SyncChart, which hampers traceability. This paper presents an alternative software synthesis approach for SyncCharts that compiles SyncCharts directly into Synchronous C (SC). The compilation preserves the structure of the original SyncChart, which is advantageous for validation and possibly certification. We present a static thread-scheduling scheme that reflects data dependencies and optimizes both the number of threads used and the maximal priorities used. This results in SC code with competitive speed and low memory requirements.
In many computing domains, hardware accelerators can improve throughput and lower power consumption compared to executing functionally equivalent software on general-purpose microprocessor cores. While hardware accelerators are often stateless, network processing exemplifies the need for stateful hardware acceleration. The packet-oriented streaming nature of current networks enables data processing as soon as packets arrive, rather than when the data of the whole network flow is available. Due to the concurrency of many flows, an accelerator must maintain and switch contexts between the states of the various accelerated streams embodied in the flows, which increases the overhead associated with acceleration. We propose and evaluate dynamic reordering of requests of different accelerated streams in a hybrid on-chip/memory-based request queue in order to reduce this overhead.
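The context-switch overhead that motivates the reordering can be illustrated with a small functional model: serving requests of the same stream back-to-back avoids repeated state save/restore. The FIFO baseline, grouping policy, and request format here are assumptions for illustration, not the paper's queue design:

```python
# Hedged sketch: reorder accelerator requests so requests of the same
# stream are served consecutively, reducing context switches.
from collections import OrderedDict

def context_switches(schedule):
    """Count state switches: one each time the served stream changes."""
    switches, current = 0, None
    for stream, _ in schedule:
        if stream != current:
            switches += 1
            current = stream
    return switches

def reorder_by_stream(requests):
    """Group pending requests by stream while preserving per-stream order."""
    groups = OrderedDict()
    for stream, payload in requests:
        groups.setdefault(stream, []).append((stream, payload))
    return [req for reqs in groups.values() for req in reqs]

# Interleaved arrivals from three flows:
arrivals = [("A", 0), ("B", 0), ("A", 1), ("C", 0), ("B", 1), ("A", 2)]
print(context_switches(arrivals))                     # 6 switches in FIFO order
print(context_switches(reorder_by_stream(arrivals)))  # 3 switches after grouping
```

In a real accelerator the grouping window would be bounded (to limit added latency), which is precisely the trade-off a hybrid on-chip/memory queue must manage.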
Reset is one of the most important signals in many designs. Since reset is typically not timing critical, it is handled at late physical design stages. However, the large fanout of reset and the lack of routing resources at these stages can create varying delays at different targets of the reset signal, causing reset recovery problems. Traditional approaches address this problem using physical design methods such as buffer insertion or rerouting. However, these methods may invalidate previous optimization efforts, making timing closure difficult. In this work, we propose a formal method to calculate reset recovery slacks for registers at the register transfer level. Designers and physical design tools can then utilize this information throughout the design flow to reduce reset problems at later design stages.
Three-dimensional (3D) integrated circuits (ICs) are emerging as a viable solution to enhance the performance of Multi-Processor System-on-Chip (MPSoC) designs. The use of high-speed hardware and the increased density of 3D architectures present novel challenges concerning thermal dissipation and power management. Most approaches to power and thermal modeling use either static analytical models or slow, low-level analog simulations. In this paper, we propose a novel thermal modeling methodology for the evaluation of 3D MPSoCs. The integration of this methodology in a virtual platform enables efficient dynamic thermal evaluation of a chip. We present initial results for an architecture based on a 3D Network-on-Chip (NoC) interconnecting 2D processing elements (PEs). Our methodology is based on the finite difference method: we perform an initial static characterization, after which high-speed dynamic simulation is possible. Index Terms - Virtual Platform, 3D IC, MPSoC, Dynamic Evaluation of Performance, Power Estimation, Thermal Analysis
This paper describes and compares two methods for
producing digital test signals up to 24 Gbps. Prototypes are
experimentally characterized to determine signal quality, and the
two methods are demonstrated and compared. The residual
timing errors are dominated by jitter. Typical random jitter (RJ)
is about 1.17ps to 1.4ps (RMS) including system measurement
errors for the two methods. Deterministic Jitter (DJ) is between
2.4ps and 8.5ps. Total jitter (TJ) ranges between 18.9ps and
28.2ps at a bit-error-rate (BER) of 10^-12.
Keywords-multi-Gbps; Test Synthesis; Jitter; ATE
Resistive random access memory (ReRAM) has been demonstrated as a promising non-volatile memory technology with features such as high density, low power, good scalability, easy fabrication, and compatibility with the existing CMOS technology. The conventional three-dimensional (3D) bipolar ReRAM design usually stacks up multiple memory layers that are separated by isolation layers, e.g., Spin-on-Glass (SOG). In this paper, we propose a new 3D bipolar ReRAM design with interleaved complementary memory layers (3D-ICML), which can form a memory island without any isolation. The set of metal wires between two adjacent memory layers in the vertical direction can be shared. The 3D-ICML design can reduce fabrication complexity and increase memory density. Meanwhile, multiple memory cells interconnected horizontally and vertically can be accessed at the same time, which dramatically increases the memory bandwidth.
The emerging 3D technology, which stacks multiple
dies within a single chip and utilizes through-silicon vias (TSVs)
as vertical connections, is considered a promising solution for
achieving better performance and easy integration. Similarly, a
generic 2D FPGA architecture can evolve into a 3D one by
extending its signal switching scheme from 2D to 3D by means of
TSVs. However, replacing all 2D switch boxes (SBs) with 3D ones providing full vertical connectivity proves both area-consuming and wasteful of resources. It is therefore possible to greatly reduce the footprint, with only a minor delay increase, by properly tailoring
the structure and deployment strategy of 3D SB. In this paper, we
perform a comprehensive architectural exploration of 3D FPGAs.
Various architectural alternatives are proposed and then thoroughly evaluated to identify those offering the best balance between area and delay. Finally, we recommend
several configurations for generic 3D FPGA architectures, which
can save up to 52% area with virtually no delay penalty.
Keywords-3D ICs; 3D FPGAs; architectural exploration;
area/delay trade-off
For many embedded systems, data protection is becoming a major issue. On these systems, processors are often heterogeneous, which prevents the deployment of a common, trusted hypervisor on all of them. Multiple native software stacks are thus bound to share resources without protection between them. NoC-MPU is a Memory Protection Unit that supports the secure and flexible co-hosting of multiple native software stacks running in multiple protection domains on any shared-memory MPSoC using a NoC. This paper presents a complete hardware architecture for this NoC-MPU mechanism, along with its trusted software model organization.
Measurement equipment for process control in the chemical industry faces severe restrictions due to safety concerns and regulations. In this work, we discuss the challenges raised by safety concerns and explain how they lead to strong power and energy constraints in the design of industrial measurement equipment. We argue that a comprehensive strategy covering the design and implementation of hardware and software on the one hand, and power management on the other, is required to satisfy these constraints. Furthermore, we demonstrate solutions for the power-efficient design of the computing system and bus topology in an industrial environment.
Driven by the continued scaling of Moore's Law, the number of processing elements on a die is increasing dramatically. Recently there has been a surge of wide single-instruction multiple-data (SIMD) architectures designed to handle computationally intensive applications like 3D graphics, high-definition video, image processing, and wireless communication. A limit on the SIMD width of these architectures is the scalability of the interconnect network between the processing elements, in terms of both area and power. To mitigate this problem, we propose a new interconnect topology, XRAM, a low-power, high-performance matrix-style crossbar. It reuses output buses for control programming and stores multiple swizzle configurations at the cross points using SRAM cells, significantly reducing routing congestion and control signaling. We show that, compared to conventionally implemented crossbars, the area scales with the product of input and output ports while consuming almost 50% less energy. We present an application case study, color-space conversion, utilizing XRAM, and show a 1.4x gain in performance while consuming 1.5-2.5x less power.
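The idea of storing multiple swizzle configurations at the cross points can be sketched functionally: a crossbar keeps several precomputed input-to-output permutations and selects one per cycle, instead of reprogramming control lines each cycle. The class name, configuration ids, and four-port width below are illustrative assumptions, not XRAM's actual interface:

```python
# Functional model of a crossbar with stored swizzle configurations.
# Each configuration maps every output port to an input port.
class SwizzleCrossbar:
    def __init__(self, width):
        self.width = width
        self.configs = {}          # config id -> permutation (output -> input index)

    def store(self, cid, permutation):
        # A valid swizzle is a permutation of the input ports.
        assert sorted(permutation) == list(range(self.width))
        self.configs[cid] = permutation

    def route(self, cid, inputs):
        """Apply a stored swizzle to one cycle's worth of input data."""
        perm = self.configs[cid]
        return [inputs[perm[o]] for o in range(self.width)]

xbar = SwizzleCrossbar(4)
xbar.store("identity", [0, 1, 2, 3])
xbar.store("reverse", [3, 2, 1, 0])    # e.g. a byte-reversal swizzle
print(xbar.route("reverse", ["b0", "b1", "b2", "b3"]))  # ['b3', 'b2', 'b1', 'b0']
```

Selecting a stored configuration needs only a small id per cycle, which is the source of the control-signaling savings the abstract claims.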
The ever increasing demand for fast mobile internet connectivity continues to set challenges for research in radio communications. On one hand the capacity demand can be served by offloading data traffic to local networks; on the other hand using more bandwidth, and possibly dynamically allocating spectrum in a flexible way, will improve the usage of the available spectrum. The future of wireless access continues to be defined by the 3GPP and IEEE standards setting bodies. Radios can also provide innovative features that offer new functionalities for consumers, such as ultra fast local connectivity, sensing and positioning. This talk will present examples of various radio innovations and the challenges related to commercializing them.
This paper presents a new quadratic, partitioning-based placement algorithm that is able to handle non-convex and overlapping position constraints on subsets of cells, called movebounds. Our new flow-based partitioning (FBP) combines a global MinCostFlow model for computing directions with extremely fast and highly parallelizable local realization steps. Despite its global view, the size of the MinCostFlow instance is only linear in the number of partitioning regions and does not depend on the number of cells. We prove that our partitioning scheme finds a (fractional) solution for any given placement, or decides in polynomial time that none exists. In practice, BonnPlace with FBP can place huge designs with almost 10 million cells and dozens of movebounds in 90 minutes of global placement. On instances with movebounds, the netlengths of our placements are more than 32% shorter than RQL's [25], and our tool is 9-20 times faster. Even without movebounds, FBP improves the quality and runtime of BonnPlace significantly, and our tool shows the currently best results on the latest placement benchmarks [16].
Thermal problems are important for integrated circuits with high power densities. Three-dimensional stacked-wafer integrated circuit technology reduces interconnect lengths and improves performance compared to two-dimensional integration. However, it intensifies thermal problems. One remedy is to redistribute white space during floorplanning. In this paper, we propose a two-phase algorithm to redistribute white space. In the first phase, the lateral heat flow white space redistribution problem is formulated as a minimum cycle ratio problem, in which the maximum power density is minimized. Since this phase only considers lateral heat flow, it also works for traditional two-dimensional integrated circuits. In the second phase, to consider inter-layer heat flow in three-dimensional integrated circuits, we discretize the chip into an array of tiles and use a dynamic programming algorithm to minimize the maximum stacked tile power consumption. We compared our algorithms with a previously proposed technique based on mathematical programming. Our iterative minimum cycle ratio algorithm achieves 35% more reduction in peak temperature. Our two-phase algorithm achieves a 4.21x reduction in peak temperature for three-dimensional integrated circuits compared to applying the first phase alone.
Due to inappropriate assignment of bump pads or improper placement of I/O buffers, the configured delays of I/O signals may not satisfy the timing requirements inside the die core. In this paper, the problem of timing-constrained I/O buffer placement in an area-I/O flip-chip design is first formulated. An efficient two-phase approach is then proposed to place I/O buffers onto feasible buffer locations between I/O pins and bump pads under consideration of the timing constraints. Compared with Peng's SA-based approach [7] with no timing constraints, our approach reduces total wirelength by 71.82% and maximum delay by 55.74% on average over 7 test cases. Under the given timing constraints, our approach obtains a higher timing-constraint satisfaction ratio (TCSR) than the SA-based approach [7].
The Network-on-Chip (NoC) paradigm has emerged as a revolutionary methodology for integrating a large number of processing elements in a single die in current System-on-Chips (SoCs). It has the advantages of enhanced performance, scalability, and modularity compared with previous bus-based communication architectures. Recently, a new Triplet-based Hierarchical Interconnection Network (THIN) has been proposed. In this paper, we explore the three-dimensional (3D) floorplanning of THIN and present two different floorplanning and routing methods, using Manhattan routing and Y-architecture routing. A cycle-accurate simulator is developed based on the Noxim NoC simulator and the ORION 2.0 energy model. The latency, power consumption, and area requirements of both THIN and Mesh are evaluated. The experimental results indicate that the proposed design provides a 24.95% reduction in average power consumption and a 16.84% improvement in area requirement.
With the evolution of today's semiconductor technology, chip temperature
increases rapidly mainly due to the growth in power density. For modern
embedded real-time systems, it is crucial to estimate maximal temperatures in
order to take mapping or other design decisions to avoid burnout, and still be able
to guarantee meeting real-time constraints. This paper provides answers to the
question: when work-conserving scheduling algorithms such as earliest-deadline-first (EDF), rate-monotonic (RM), or deadline-monotonic (DM) are applied, what is
the worst-case peak temperature of a real-time embedded system under all possible
scenarios of task executions? We propose an analytic framework, which considers
a general event model based on network and real-time calculus. This analysis
framework has the capability to handle a broad range of uncertainties in terms of
task execution times, task invocation periods, and jitter in task arrivals. Simulations show that our framework is a cornerstone for designing real-time systems that have guarantees on both schedulability and maximal temperature.
Keywords-real-time systems; compositional analysis; worst-case peak temperature;
thermal analysis
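Why the worst-case pattern of task executions matters for peak temperature can be seen with a toy first-order (RC) thermal model: the same amount of work executed in one burst heats the chip more than the same work spread out. The constants and busy/idle traces below are invented for illustration and are unrelated to the paper's network/real-time-calculus framework:

```python
# Toy first-order thermal model: dT/dt = (P*R + T_amb - T) / (R*C).
# All thermal constants are assumptions, not values from the paper.
def simulate_peak(trace, t_amb=45.0, p_active=10.0, r=2.0, c=0.5, dt=0.01):
    """Integrate the thermal ODE over a boolean active/idle trace
    and return the peak temperature reached."""
    temp, peak = t_amb, t_amb
    for active in trace:
        power = p_active if active else 0.0
        temp += dt * ((power * r + t_amb - temp) / (r * c))
        peak = max(peak, temp)
    return peak

burst  = [True] * 300 + [False] * 300        # all work up front
spread = ([True] * 50 + [False] * 50) * 6    # same total work, spread out
print(simulate_peak(burst) > simulate_peak(spread))  # True: bursts run hotter
```

A worst-case peak-temperature analysis must therefore bound the hottest feasible execution pattern, not just the total utilization, which is what the event-model-based framework above does.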
In this paper, we present an automatic leakage power modeling method for standard cell libraries as well as SRAM compilers. This problem poses two major challenges: (1) the high sensitivity of leakage power to temperature (e.g., the leakage power of an inverter can differ by a factor of 19.28 when the temperature rises from 25°C to 100°C in a 90nm technology), and (2) the large number of models to be built (e.g., there could be 80,835 SRAM macros supported by an SRAM compiler). Our method achieves high accuracy efficiently through two formula-based prediction techniques. First, we incorporate a quick segmented exponential interpolation scheme to take the effects of temperature into account. Second, we use a MUX-oriented linear extrapolation scheme, which is so accurate that it allows us to build the leakage power models for all SRAM macros based on linear regression using only the simulation results of 9 small-sized SRAM macros. Experimental results show that this method is not only accurate but also highly efficient. Index Terms - Leakage Power Modeling, Leakage Power Estimation, Standard Cell Library, SRAM Compiler
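The first technique, segmented exponential interpolation over temperature, can be sketched as follows: between two characterized temperature points, leakage is modeled as a*exp(b*T), which matches both segment endpoints exactly. The characterization data below is invented for illustration, not taken from the paper:

```python
# Hedged sketch of segmented exponential interpolation of leakage power
# versus temperature (sample data is illustrative only).
import math

def segmented_exp_interp(points, t):
    """points: sorted list of (temperature, leakage) characterization data.
    Within each segment, fit leakage = l0 * exp(b * (t - t0))."""
    for (t0, l0), (t1, l1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            b = math.log(l1 / l0) / (t1 - t0)   # per-segment exponent
            return l0 * math.exp(b * (t - t0))
    raise ValueError("temperature outside characterized range")

# Characterized leakage (nW) at a few temperatures, growing roughly exponentially:
data = [(25, 1.0), (50, 3.0), (75, 9.0), (100, 27.0)]
print(round(segmented_exp_interp(data, 50), 3))    # 3.0 (hits a sample point)
print(round(segmented_exp_interp(data, 62.5), 3))  # 5.196, between 3 and 9
```

Because each segment only needs two characterized points, a small number of SPICE runs per cell suffices to cover the whole temperature range.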
Clock gating is an effective method of reducing power
dissipation of a high-performance circuit. However, deployment
of gated cells increases the difficulty of optimizing a clock tree. In
this paper, we propose a delay-matching approach to address this problem. Delay-matching uses gated cells whose timing
characteristics are similar to that of their clock buffer (inverter)
counterparts. It attains better slew and much smaller latency
with comparable clock skew and less area when compared to
type-matching. The skew of a delay-matching gated tree, just like
the one generated by type-matching, is insensitive to process and
operating corner variations. In addition, delay-matching ECO of a gated tree excels at preserving the original timing characteristics
of the gated tree.
Keywords- Clock gating; low power design; clock tree
Turbo codes are proposed in most of the advanced
digital communication standards, such as 3GPP-LTE. However,
due to its computational complexity, the turbo decoder is one of
the most power-hungry blocks in the digital baseband. One way to alleviate this issue is to avoid surplus computation through early termination of the iterative decoding process. The use of
stopping criteria is one of the most common algorithm level
power reduction methods in literature. These methods always
come with some hardware overhead. In this paper, a new trellis
based stopping criterion is proposed. The novelty of this
approach is the lower hardware overhead thanks to the use of
trellis states as key parameter to stop the iterative process.
Results show the importance of this added hardware for the method's efficiency. Compared to state-of-the-art Log-Likelihood Ratio (LLR) based techniques, the proposed Low-Complexity Trellis-Based (LCTB) criterion demonstrates 23% less power consumption on average, for a comparable performance level in terms of Bit Error Rate (BER) and Frame Error Rate (FER).
Keywords-stopping criteria; turbo-decoder; trellis states; low
complexity; power reduction
Tag comparisons account for a significant portion of cache power consumption in highly associative caches such as the L2 cache. In this work, we propose a novel tag access scheme that applies a partial-tag-enhanced Bloom filter to reduce tag comparisons by detecting per-way cache misses. The proposed scheme also classifies cache data into hot and cold data, and the tags of hot data are compared before those of cold data, exploiting the fact that most cache hits go to hot data. In addition, the power consumption of each tag comparison can be further reduced by dividing the tag comparison into two micro-steps, where a partial tag comparison is performed first and, only if it gives a partial hit, the remaining tag bits are compared. We applied the proposed scheme to an L2 cache with 10 programs from SPEC2000 and SPEC2006. Experimental results show average reductions of 23.69% and 8.58% in cache energy consumption compared with the conventional serial tag-data access and other existing methods, respectively.
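Two of the mechanisms, per-way miss filtering on partial tags and the two-micro-step tag comparison, can be sketched functionally. The filter size, hash function, and tag widths below are illustrative assumptions, not the paper's parameters:

```python
# Sketch: (1) a counting Bloom filter over partial tags detects definite
# per-way misses without touching the tag array; (2) tag comparison is
# split into a partial-tag step and, only on a partial hit, a full-tag step.
class WayFilter:
    """Counting Bloom filter over partial tags for one cache way."""
    def __init__(self, bits=64):
        self.bits = bits
        self.counts = [0] * bits

    def _h(self, partial_tag):
        return partial_tag % self.bits        # trivial hash, assumed

    def insert(self, partial_tag):  self.counts[self._h(partial_tag)] += 1
    def remove(self, partial_tag):  self.counts[self._h(partial_tag)] -= 1
    def may_contain(self, partial_tag):
        return self.counts[self._h(partial_tag)] > 0

PARTIAL_BITS = 6   # low-order tag bits used for the partial comparison (assumed)

def lookup(tag, stored_tags, filt):
    """Return (hit, partial_compares, full_compares)."""
    partial = tag & ((1 << PARTIAL_BITS) - 1)
    if not filt.may_contain(partial):
        return False, 0, 0                    # definite miss: no tag access at all
    partial_cmp = full_cmp = 0
    for stored in stored_tags:
        partial_cmp += 1
        if stored & ((1 << PARTIAL_BITS) - 1) == partial:
            full_cmp += 1                     # compare remaining bits only now
            if stored == tag:
                return True, partial_cmp, full_cmp
    return False, partial_cmp, full_cmp

filt = WayFilter()
stored = [0b1010_000001, 0b0110_000010]
for s in stored:
    filt.insert(s & ((1 << PARTIAL_BITS) - 1))
print(lookup(0b1010_000001, stored, filt))  # (True, 1, 1)
print(lookup(0b1111_111111, stored, filt))  # (False, 0, 0): filtered out
```

The energy saving comes from the zero-compare path on filtered misses and from the narrow partial comparison that gates the full one.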
This paper proposes a built-in self-test/self-diagnosis procedure for start-up of an on-chip network (NoC). Concurrent BIST operations are carried out after reset at each switch, thus resulting in test application time that scales well with network size. The key principle consists of exploiting the inherent structural redundancy of the NoC architecture in a cooperative way, thus detecting faults in the test pattern generators too. At-speed testing of stuck-at faults can be performed in less than 1200 cycles regardless of network size, with a hardware overhead of less than 11%.
The reliability of networks-on-chip (NoCs) is threatened by low yield and device wearout in aggressively scaled technology nodes. We propose ReliNoC, a network-on-chip architecture that can withstand failures while maintaining not only basic connectivity but also quality-of-service support based on packet priorities. Our network leverages a dual-physical-channel switch architecture that removes the control overhead of virtual channels (VCs) and utilizes the inherent redundancy within the two-channel switch to provide spares for faulty elements. Experimental results show that ReliNoC provides 1.5 to 3 times better network physical connectivity in the presence of multiple faults, and reduces the latency of both high- and low-priority traffic by 30 to 50% compared to a traditional VC architecture. Moreover, it can tolerate up to 50 faults within an 8x8 mesh at only 10% and 40% latency overhead on control and data packets, respectively, for PARSEC traces [24]. Synthesis results show that our reliable architecture incurs only 13% area overhead over the baseline two-channel switch.
In this paper, we address the problem of run-time resource management in non-ideal multiprocessor platforms where communication happens via a Network-on-Chip (NoC). More precisely, we propose a system-level fault-tolerant technique for application mapping that aims at optimizing overall system performance and communication energy consumption, while considering the occurrence of permanent, transient, and intermittent faults in the system. As the main theoretical contribution, we address the problem of spare core placement and its impact on the system's fault-tolerance (FT) properties. We then investigate several metrics and provide insight into the fault-aware resource management process for such non-ideal multiprocessor platforms. Experimental results show that our proposed resource management technique is efficient and highly scalable, and that significant throughput improvements can be achieved compared to existing solutions that do not consider failures in the system.
With the exponential growth in the number of transistors, not only may test data volume and test application time increase, but multiple faults may also exist in one chip. Test compaction has become a de-facto design-for-testability technique for reducing test cost. However, compacted test responses make multiple-fault diagnosis rather difficult. When there is no space compactor, the most likely suspect fault is the one producing failing responses most similar to those observed on the automatic test equipment. But when a compactor exists, those suspect faults may no longer have the same high likelihood of being the actual faults. To address this problem, we introduce a novel metric, explanation necessity. Using both the new metric and the traditional metric, explanation capability, we evaluate the likelihood that a suspect fault is an actual fault. For ISCAS'89 and ITC'99 benchmark circuits equipped with extreme space compactors, experimental results show that 98.8% of the top-ranked suspect faults hit the actual faults, outperforming a previous work by 11.3%.
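The interplay of the two metrics can be sketched as follows. The formulas here are a hypothetical precision/recall-style reading of the abstract (explanation capability rewards covering observed failures; explanation necessity penalizes suspects that predict failures never observed through the compactor); the paper's exact definitions may differ:

```python
def explanation_capability(observed, predicted):
    """Fraction of observed failing outputs that the suspect fault explains."""
    return len(observed & predicted) / len(observed) if observed else 1.0

def explanation_necessity(observed, predicted):
    """Fraction of the suspect's predicted failures actually observed on the
    tester; penalizes suspects that over-predict through the compactor."""
    return len(observed & predicted) / len(predicted) if predicted else 1.0

def rank_suspects(observed, suspects):
    """Rank suspect faults by the product of both metrics (a hypothetical
    combination chosen for illustration). 'suspects' maps fault name to
    its simulated set of failing compacted outputs."""
    def score(pred):
        return (explanation_capability(observed, pred)
                * explanation_necessity(observed, pred))
    return sorted(suspects, key=lambda f: score(suspects[f]), reverse=True)
```

A suspect that exactly reproduces the observed failures outranks one that explains them all but also predicts failures the tester never saw.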
Trace-based debug solutions help eliminate design errors that escape pre-silicon verification and have gained wide acceptance in industry. Existing techniques typically trace the same set of signals throughout each debug run, which is not very effective for catching design errors. In this work, we propose a multiplexed signal tracing strategy that significantly increases the debuggability of the circuit: we divide the tracing procedure in each debug run into a few periods and trace a different set of signals in each period. A novel trace-signal grouping algorithm is presented to maximize the probability of catching the evidence propagated from design errors, considering the design constraints of the trace interconnection fabric. Experimental results on benchmark circuits demonstrate the effectiveness of the proposed solution.
A critical concern for post-silicon debug is the need to control the chip at the clock-cycle level. In a single-clock chip, run-stop control can be implemented by gating the clock signal with a stop signal. However, data invalidation might occur in multiple-clock chips. In this paper, we analyze the possible data invalidation, including data repetition and data loss, when stopping and resuming a multiple-clock chip. Furthermore, we propose an efficient solution to eliminate both data repetition and data loss. Theoretical analysis and simulation experiments are both conducted for the proposed solution. We implement the proposed Design-for-Debug (DfD) circuit in SMIC 0.18μm technology and simulate the data transfer across clock domains using SPICE. The results show that both data repetition and data loss can be avoided with the proposed solution, even if metastability occurs.
Many applications contain loops with an undetermined number of iterations. These loops have to be parallelized in order to increase the throughput when executed on an embedded multiprocessor platform. This paper presents a method to automatically extract a parallel task graph, based on function-level parallelism, from a sequential nested loop program with while loops. In the parallelized task graph, loop iterations can overlap during execution. We introduce the notion of a single assignment section so that we can exploit single assignment to overlap iterations of the while loop during the execution of the parallel task graph. Synchronization is inserted in the parallelized task graph to ensure the same functional behavior as the sequential nested loop program. It is shown that the generated parallel task graph does not introduce deadlock. A DVB-T radio receiver, where the user can switch channels after an undetermined amount of time, illustrates the approach.
Nowadays, the Graphics Processing Unit (GPU), a massively parallel processor, is widely used for general-purpose computing tasks. Although mature development tools exist, writing GPU programs is not a trivial task for programmers. Based on this consideration, we propose a novel parallel computing architecture. The architecture includes a parallel programming model, named Gemma, and a programming framework, named April. Gemma is based on generalized matrix operations and helps alleviate the difficulty of describing parallel algorithms. April is a high-level framework that can compile and execute tasks described in Gemma with OpenCL. In particular, April can automatically 1) choose the best parallel algorithm and mapping scheme and generate OpenCL kernels, and 2) schedule Gemma tasks based on execution costs such as data storage and transfer. Our experimental results show that, with competitive performance, April considerably reduces the programs' code length compared with OpenCL.
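To give a flavor of what "generalized matrix operations" could mean, here is a hypothetical sketch (the actual Gemma API is not described in the abstract): the familiar triple-loop GEMM structure with the scalar operators abstracted out, which is exactly the kind of regular pattern a framework like April could map onto OpenCL work-items, one per output element.

```python
def generalized_matmul(A, B, mul, add, zero):
    """GEMM skeleton with user-supplied scalar operators:
    C[i][j] = add-reduction over l of mul(A[i][l], B[l][j])."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[zero] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = zero
            for l in range(k):
                acc = add(acc, mul(A[i][l], B[l][j]))
            C[i][j] = acc
    return C

# Ordinary matrix product:
product = generalized_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]],
                             mul=lambda a, b: a * b,
                             add=lambda a, b: a + b, zero=0)

# Min-plus ("tropical") product: one relaxation step of all-pairs
# shortest paths when the operands are distance matrices.
INF = float("inf")
shortest = generalized_matmul([[0, 1], [4, 0]], [[0, 1], [4, 0]],
                              mul=lambda a, b: a + b,
                              add=min, zero=INF)
```

Swapping the scalar operators turns one kernel skeleton into many distinct parallel algorithms, which is what makes a matrix-operation model attractive for automatic OpenCL code generation.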
Today's high-performance embedded computing applications are posing significant challenges for processing throughput. Traditionally, such applications have been realized on application-specific integrated circuits (ASICs) and/or digital signal processors (DSPs). However, the performance and power advantage of ASICs often cannot justify the rapidly increasing fabrication cost, while current DSPs offer a limited processing throughput, usually lower than 100 GFLOPS. On the other hand, current multi-core processors, especially graphics processing units (GPUs), deliver very high computing throughput while maintaining high flexibility and programmability. It is thus appealing to study the potential of GPUs for high-performance embedded computing. In this work, we perform a comprehensive performance evaluation of GPUs with the high-performance embedded computing (HPEC) benchmark suite, which consists of a broad range of signal processing benchmarks with an emphasis on radar processing applications. We develop efficient GPU implementations that outperform previous results for all the benchmarks. In addition, a systematic instruction-level analysis of the GPU implementations is conducted with a GPU micro-architecture simulator. The results provide key insights for optimizing GPU hardware and software. We also compare the performance and power efficiency of GPU and DSP with the HPEC benchmarks. The comparison reveals that the major hurdle for GPU applications in embedded computing is the GPU's relatively low power efficiency.
Keywords: GPU, multi-core, HPEC benchmark, DSP, parallel computing, Fermi, GFLOPS
The evolution to Manycore platforms is real, both in the high-performance computing domain and in embedded systems. Starting from ten or more cores today, we will see the evolution to many tens of cores, and to platforms with 100 or more, in the next few years. These platforms are heterogeneous, homogeneous, or a mixture of subsystems of both types, ranging from relatively generic to quite application-specific, and they are applied in many different application areas. When we consider the design, verification, software development, and debugging requirements for applications on these platforms, the need for virtual platform technologies for Manycore systems grows quickly as the systems evolve. As we move to Manycore, the key issue is simulation speed: keeping pace with target complexity using host-based simulation is a major challenge. New instruction-set simulation technologies, such as compiled, JIT, DBT, sampling, abstract, hybrid, and parallel simulation, have all emerged in the last few years to match the growth in complexity and requirements. At the same time, we have seen consolidation in the virtual platform industrial sector, leading to some concerns about whether the market can support the continued development of the innovations needed to deliver the required performance. This special session deals with Manycore virtual platforms from several different perspectives, highlighting new research approaches for high-speed simulation, tool and IP marketing opportunities, as well as the real-life virtual platform needs of industrial end users.
Panelists: F. Cerisier, S. Davidmann, L. Ducuosso, J. Engblom, and A. Mayer
The complexity of today's embedded software is steadily increasing. The growing number of processors in a system, and the increased communication and synchronization between all components, require scalable debug and test methods for each component as well as for the system as a whole. Given today's cost and time-to-market sensitivity, it is important to find and debug errors as early as possible and to increase the degree of test and debug automation to avoid losses in quality, cost, and time. These challenges require not only new tools and methodologies but also organizational changes, since hardware and software developers have to work together to achieve the necessary productivity and quality gains.
This panel brings together users and solution providers experienced in debugging embedded systems to discuss requirements for robust systems that are easy to debug.
Keywords: embedded systems, software debugging, model-based software debug, test automation, virtual prototype, hardware-software co-verification, silicon debug, debug standards
This paper deals with system-level design considerations for mm-size implantable electronic devices with wireless connectivity. In particular, it focuses on neural sensors as one application requiring such miniature interfaces. Common to all these implants is the need for a power supply and a wireless interface. Wireless power transfer via electromagnetic fields is identified as a promising option for powering such devices. Design methodologies, system-level trade-offs, and limitations of power supply systems based on electromagnetic coupling are discussed in detail. Further, various wireless data communication architectures are evaluated for their feasibility in this application. Reflective impulse radios are proposed as an alternative scheme for enabling highly scalable data transmission at <1 pJ/bit. Finally, design considerations for the corresponding reader system are addressed.
Keywords: implantable neural sensors, brain-machine interfaces, wireless power transfer, ultra low power, data communication
This paper presents the design of a 2.4 GHz antenna and a BAW filter for cardiac implants in the ISM band. Both components are sensitive to their environment. The modeling of the antenna inside the human body is presented in order to characterize its impedance. Connecting the BAW filter to a substrate modifies its impedance, so the link between the two components is the key to the radio-frequency transmission. The antenna-filter combination exhibits a standing wave ratio better than 2 and a maximum insertion loss of 5.6 dB in the 2.4-2.48 GHz frequency band.
Keywords: filter, antenna, modeling, cardiac implants, co-design
Emerging non-volatile memory (NVM) technologies have been maturing in recent years and have demonstrated great potential for universal memory hierarchy design. Among all the technology candidates, resistive random-access memory (RRAM) is considered the most promising, as it operates faster than phase-change memory (PCRAM) and has a simpler and smaller cell structure than magnetic memory (MRAM or STT-RAM). In contrast to a conventional MOS-accessed memory cell, memristor-based RRAM has the potential to form a cross-point structure without using access devices, achieving ultra-high density. The cross-point structure, however, brings extra challenges to the peripheral circuitry design. In this work, we study memristor-based RRAM array design and focus on the choice of peripherals to achieve the best trade-off among performance, energy, and area. In addition, a system-level model is built to estimate performance, energy, and area values.
SRAMs based on tunneling field-effect transistors (TFETs) consume very low static power, but the unidirectional conduction inherent to TFETs calls for special care when designing the SRAM cell. In this work, we make the following contributions. (i) We perform the first study of 6T TFET SRAMs based on both n-type and p-type access transistors and determine that only inward p-type TFETs are suitable as access transistors. However, even with inward p-type access transistors, the 6T TFET SRAM performs only the write or the read operation reliably, not both. (ii) In order to improve the reliability of 6T TFET SRAMs, we perform the first study of four leading write-assist (WA) and four leading read-assist (RA) techniques in TFET SRAMs. We conclude that the 6T TFET SRAM with GND-lowering RA is the most reliable 6T TFET SRAM during write and read, and we verify that it is also robust under process variations. It also achieves the best performance and reliability, as well as the least static power and area, in comparison to other existing TFET SRAM structures. Further, it not only has performance and reliability comparable to the 32nm 6T CMOS SRAM, but also consumes 6-7 orders of magnitude lower static power, making it attractive for low-power high-density SRAM applications.
Scratch Pad Memory (SPM), a software-controlled on-chip memory, has been widely adopted in many embedded systems due to its small area and low power consumption. As technology scaling reaches the deep sub-micron level, leakage energy consumption is surpassing dynamic energy consumption and becoming a critical issue. In this paper, we propose a novel hybrid SPM which consists of non-volatile memory (NVM) and SRAM, taking advantage of the ultra-low leakage power and high density of NVM as well as the efficient writes of SRAM. A novel dynamic data allocation algorithm is proposed to exploit the full potential of both NVM and SRAM. According to the experimental results, with the help of the proposed algorithm, the hybrid SPM architecture reduces memory access time by 18.17%, dynamic energy by 24.29%, and leakage power by 37.34% on average compared with a pure SRAM-based SPM of the same area.
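The core allocation idea behind such a hybrid SPM can be sketched with a greedy heuristic. This is an illustration only: the paper's dynamic algorithm is not specified in the abstract, and the per-access energy constants below are made up. The intuition it captures is that write-heavy objects gain most from SRAM, because NVM writes are far more expensive than NVM reads.

```python
def allocate_spm(objects, sram_cap, nvm_cap,
                 e_sram_rw=1.0, e_nvm_read=0.8, e_nvm_write=5.0):
    """Greedy placement of data objects into a hybrid SPM.
    'objects' is a list of (name, size, reads, writes). Objects are
    sorted by the energy saved per byte when moved from NVM to SRAM;
    whatever fits nowhere spills to off-chip DRAM. Energy units are
    illustrative, not measured values."""
    def benefit_per_byte(obj):
        _, size, reads, writes = obj
        e_nvm = reads * e_nvm_read + writes * e_nvm_write
        e_sram = (reads + writes) * e_sram_rw
        return (e_nvm - e_sram) / size

    placement = {}
    for name, size, _, _ in sorted(objects, key=benefit_per_byte,
                                   reverse=True):
        if size <= sram_cap:
            placement[name] = "SRAM"
            sram_cap -= size
        elif size <= nvm_cap:
            placement[name] = "NVM"
            nvm_cap -= size
        else:
            placement[name] = "DRAM"  # spill to off-chip memory
    return placement
```

A frequently written buffer lands in SRAM, while a read-only table is happily served by NVM at near-zero leakage.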
Power consumption is increasing dramatically for static random access memory field-programmable gate arrays (SRAM FPGAs); therefore, lower-power FPGA circuitry and new CAD tools are needed. Clock-gating methodologies have been applied in low-power FPGA designs with only minor success in reducing total average power consumption. In this paper, we develop a new structural clock-gating technique based on internal partial reconfiguration and topological modifications. The solution is based on the dynamic partial reconfiguration of the configuration memory frames related to the clock routing resources. For a set of design cases, figures for static and dynamic power consumption were obtained. The analyses have been performed on a synchronous FIFO and on an r-VEX VLIW processor. The experimental results show that the reduction in total average power consumption ranges from about 28% to 39% with respect to standard clock-gating approaches. Moreover, the proposed method is not intrusive and incurs very limited area overhead.
In embedded digital signal processing (DSP) systems, quality is set by a signal-to-noise ratio (SNR) floor. Conventional digital design strategies guarantee timing correctness of all operations, which leaves large quality margins in practical systems and sacrifices energy efficiency. This paper presents techniques to significantly improve energy efficiency by shaping the quality-energy tradeoff achievable via VDD scaling. In an unoptimized design, such scaling leads to rapid loss of quality due to the onset of timing errors. We introduce techniques that modify the behavior of the earliest and worst timing-error offenders to allow for larger VDD reduction. We demonstrate the effectiveness of the proposed techniques on a 2D-IDCT design synthesized using a 45nm standard cell library. The experiments show that up to 45% energy savings can be achieved at a cost of 10 dB in peak signal-to-noise ratio (PSNR). The resulting PSNR remains above 30 dB, a commonly accepted value for lossy image and video compression. Achieving the same energy savings by direct VDD scaling without the proposed transformations results in a 35 dB PSNR loss. The overhead of the needed control logic is less than 3% of the original design.
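At deployment time, the quality-energy trade-off the paper exploits reduces to picking an operating point on a characterized curve. A minimal sketch, using a hypothetical VDD/energy/PSNR profile (not the paper's measured data):

```python
def pick_operating_point(profile, psnr_floor_db=30.0):
    """Choose the lowest-energy (vdd, energy, psnr) point that still
    meets the PSNR floor; raises ValueError if none qualifies."""
    feasible = [p for p in profile if p[2] >= psnr_floor_db]
    if not feasible:
        raise ValueError("no operating point meets the PSNR floor")
    return min(feasible, key=lambda p: p[1])

# Hypothetical characterization of a 2D-IDCT-like block:
# (VDD in volts, energy per frame in nJ, PSNR in dB)
profile = [(1.0, 100.0, 48.0), (0.9, 80.0, 44.0),
           (0.8, 62.0, 35.0), (0.7, 50.0, 22.0)]
```

With a 30 dB floor, the 0.7 V point is rejected and the design settles at 0.8 V, trading surplus quality margin for energy.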
Inexact circuits, in which the accuracy of the output can be traded for energy or delay savings, have been receiving increasing attention of late, due to the unavoidable inaccuracies of designs as Moore's law approaches the low-nanometer range and a concomitant growing desire for ultra-low-energy systems. In this paper, we present a novel design-level technique called probabilistic pruning to realize inexact circuits. Unlike previous techniques in the literature, which relied mostly on some form of scaling of operational parameters such as the supply voltage (Vdd) to trade energy for accuracy, our technique prunes portions of the circuit that have a lower probability of being active, using this as the basis for architectural modifications that yield significant savings in energy, delay, and area. Our approach yields greater savings than conventional voltage scaling schemes for similar error values. Extensive simulations using this pruning technique, in a novel logic-synthesis-based CAD framework, on various 64-bit adder architectures demonstrate that normalized gains as great as 2X-7.5X in the energy-delay-area product can be obtained, with relative error percentages ranging from as low as 10^-6% up to 10%, compared to corresponding conventionally correct designs.
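The pruning criterion can be sketched abstractly. This is a hypothetical model: here "activity" stands for the probability that a gate influences an observable output, and the union bound on output error is an illustration, not the paper's error analysis.

```python
def probabilistic_prune(gates, activity, p_threshold):
    """Drop gates whose probability of affecting the output is below
    p_threshold (their outputs would be tied to constants in hardware).
    'gates' maps gate name -> (area, energy); 'activity' maps gate
    name -> probability the gate influences the observable output.
    Returns (kept gates, fractional area savings, union-bound error)."""
    pruned = [g for g in gates if activity[g] < p_threshold]
    kept = {g: gates[g] for g in gates if g not in pruned}
    total_area = sum(a for a, _ in gates.values())
    saved_area = sum(gates[g][0] for g in pruned)
    err_bound = min(1.0, sum(activity[g] for g in pruned))
    return kept, saved_area / total_area, err_bound
```

Sweeping `p_threshold` traces out the error-versus-savings curve that the abstract reports as 2X-7.5X energy-delay-area gains at bounded relative error.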
Micro-scale energy harvesting has become an increasingly viable and promising option for powering ultra-low-power systems. A power converter is a key component in micro-scale energy harvesting systems. Various design parameters of the power converter, most notably the number of stages in a multi-stage power converter, play a crucial role in determining how much electrical power can be extracted from a micro-scale energy transducer such as a miniature solar cell. Existing stage-number optimization techniques for switched-capacitor power converters, when used for energy harvesting systems, result in a substantial degradation of the harvested electrical power. To address this problem, this paper proposes a new stage-number optimization technique for switched-capacitor power converters that maximizes the net harvested power in micro-scale energy harvesting systems. The proposed technique is based on a new figure of merit that is well suited to energy harvesting systems. We have validated the proposed technique through circuit simulations using IBM 65nm technology. Our simulation results demonstrate that the proposed stage-number optimization technique increases the net harvested power by 60-290% compared to existing stage-number optimization techniques.
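The flavor of stage-number optimization can be shown with a toy switched-capacitor charge-pump model. The voltage-ratio conversion loss is standard idealized SC behavior, but the per-stage loss term and all parameter values are illustrative, not the paper's figure of merit:

```python
def net_harvested_power(n, v_in, v_out, i_load, p_stage_loss):
    """Toy model: an n-stage charge pump ideally produces (n+1)*v_in.
    If that cannot reach v_out, no power is delivered; any excess
    voltage ratio is burned as intrinsic SC conversion loss, and each
    stage adds a fixed switching loss."""
    v_ideal = (n + 1) * v_in
    if v_ideal < v_out:
        return 0.0
    p_conv_loss = (v_ideal - v_out) * i_load  # intrinsic SC loss
    return v_out * i_load - p_conv_loss - n * p_stage_loss

def best_stage_count(v_in, v_out, i_load, p_stage_loss, max_stages=16):
    """Pick the stage count maximizing the net harvested power."""
    return max(range(1, max_stages + 1),
               key=lambda n: net_harvested_power(n, v_in, v_out,
                                                 i_load, p_stage_loss))
```

Even in this toy model, too few stages deliver nothing while extra stages only add conversion and switching losses, so the optimum sits at the smallest count that reaches the output voltage.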
Delay-insensitive asynchronous on-chip communication links are a key element in realizing highly reliable asynchronous Network-on-Chip systems. However, even a single permanent fault, such as an interconnect fault, causes a deadlock state in the system. This paper presents an interconnect-fault-resilient delay-insensitive asynchronous communication link based on current-flow monitoring. Since the current flow on an interconnect is cut off by an open fault in the interconnect, the current is fed back to the transmitter, which increases a feedback current monotonically. Monitoring this feedback current makes it possible to detect the interconnect fault while preserving delay insensitivity. The proposed link is evaluated in a 0.13μm CMOS technology against a triple modular redundancy (TMR)-based asynchronous communication link, which is resilient to interconnect faults but not delay-insensitive. As a result, the energy consumption and the number of wires of the proposed link are reduced to 57% and 33%, respectively, of those of the conventional one.
Continuing to scale CMP performance at reasonable power budgets has forced chip designers to consider emerging silicon-photonic technologies as the primary means of on- and off-chip communication. Different designs for chip-scale photonic interconnects have been proposed, and system-level simulations have shown them to be far superior to purely electronic network solutions. However, specifying the exact geometries for all the photonic devices used in these networks is currently a time-consuming and difficult manual process. We present VANDAL, a layout tool which provides a user with semi-automatic assistance for placing silicon photonic devices, modifying their geometries, and routing waveguides for hierarchically building photonic networks. VANDAL also includes SCILL, a scripting language that can be used to automate photonic device place and route for repeatability, automation, verification, and scaling. We demonstrate some of the features and flexibility of the CAD environment with a case study, designing modulator and detector banks for integrated photonic links.
A state-of-the-art System-on-Chip (SoC) consists of hundreds of processing elements, and trends in the design of the next generation of SoCs point to the integration of thousands of processing elements, requiring high-performance interconnect for high-throughput communication. Optical on-chip interconnects are currently considered one of the most promising paradigms for the design of such next-generation Multi-Processor Systems-on-Chip (MPSoCs). They enable significantly increased bandwidth, increased immunity to electromagnetic noise, decreased latency, and decreased power. Defining new architectures that take advantage of optical interconnects therefore represents a key issue for MPSoC designers today. Moreover, new design methodologies considering the design constraints specific to these architectures are mandatory. In this paper, we present a new contention-free architecture based on an optical network-on-chip, called Optical Ring Network-on-Chip (ORNoC). We also show that our network scales well with both large 2D and 3D architectures. For efficient design, we propose automatic wavelength/waveguide assignment and demonstrate that the proposed architecture is capable of connecting 1296 nodes with only 102 waveguides and 64 wavelengths per waveguide.
This work proposes a wafer-probe parametric test-set optimization method for predicting dies that are likely to fail in the field, based on known in-field or final-test fails. Large volumes of wafer probe data across 5 lots and hundreds of parametric measurements are analyzed to find test sets that help predict actually observed test escapes and final-test failures. Simple rules are generated to explain how test limits can be tightened at wafer probe to prevent test escapes and final-test fails with minimal overkill. The proposed method is evaluated on wafer probe data from a current automotive IC with near-zero-DPPM requirements, resulting in improved test quality and reduced test cost.
Lithographic process variations, such as changes in focus, exposure, and resist thickness, introduce distortions to line shapes on a wafer. Large distortions may lead to line-open and bridge faults, and the locations of such defects vary with the lithographic process corner. Lithographic simulation easily verifies that, for a given layout, changing one or more of the process parameters shifts the defect locations. Thus, if the lithographic process corner of a die is known, test patterns can be better targeted at both hard and parametric defects. In this paper, we present the design of control structures such that preliminary testing of these structures can uniquely identify the manufacturing process corner. Once the process corner is known, we can easily attain the highest possible fault coverage for lithography-related defects during manufacturing test. Parametric defects such as delay defects are notoriously difficult to test because they may affect paths that are subcritical under nominal conditions and not ordinarily targeted for test. The proposed approach can easily flag such paths for delay tests.
Keywords: photolithography, defocus, resistance, process corner analysis, test pattern optimization
In this paper, an alternative test method for MEMS convective accelerometers is presented. It is first demonstrated that device sensitivity can be determined by simple electrical measurements, without the use of physical test stimuli. Using a previously developed behavioral model that allows efficient Monte-Carlo simulations, we have established a good correlation between electrical test parameters and device sensitivity. The proposed test method is finally evaluated for different strategies that prioritize yield, fault coverage, or test efficiency.
Keywords: MEMS testing, convective accelerometer, alternative electrical test
In semiconductor manufacturing, a wealth of wafer-level measurements, generally termed inline data, are collected from various on-die and between-die (kerf) test structures and are used to provide characterization engineers with information on the health of the process. While it is generally believed that these measurements also contain valuable information regarding die performances, the vast amount of inline data collected often thwarts efficient and informative correlation with final test outcomes. In this work, we develop a data mining approach to automatically identify and explore correlations between inline measurements and final test outcomes in analog/RF devices. Significantly, we do not depend on statistical methods in isolation, but incorporate domain expert feedback into our algorithm to identify and remove spurious autocorrelations which are frequently present in semiconductor manufacturing data. We demonstrate our method using data from an analog/RF product manufactured in IBM's 90nm low-power process, on which we successfully identify a set of key inline parameters correlating to module final test (MFT) outcomes.
The increasing electrode density in multi-electrode arrays and the use of new materials for electrode fabrication are motivating the migration from passive to active neuroprobes. Numerous circuit design challenges for the implementation of optimal integrated neural recording systems are still present and need to be addressed. In this paper we present the systematic design of a programmable low-noise multi-channel neural interface that can be used for the recording of neural activity in in vitro and in vivo experiments. The design methodology includes modeling and simulation of important parameters, allowing the definition, optimization and testing of the architecture and the circuit blocks. In the proposed architecture, individual channel programmability is provided in order to address different neural signals and electrode characteristics. A 16-channel fully-differential architecture is fabricated in a 0.35 μm CMOS technology, with a die size of 5.6 mm x 4.5 mm. Gains (40-75.6 dB) and band-pass filter cut-off frequencies (1-6000 Hz) can be digitally programmed using 7 bits per channel and a serial interface. The circuit consumes a maximum of 1.8 mA from a 3.3 V supply and the measured input-referred noise is between 2.3 and 2.9 μVrms for the different configurations. We successfully performed simultaneous recordings of action potential signals, using different electrode characteristics in in vitro experiments.
Wireless body sensor networks (WBSN) hold the promise to enable next-generation patient-centric mobile-cardiology systems. A WBSN-enabled electrocardiogram (ECG) monitor consists of wearable, miniaturized and wireless sensors able to measure and wirelessly report cardiac signals to a WBSN coordinator, which is responsible for reporting them to the tele-health provider. However, state-of-the-art WBSN-enabled ECG monitors still fall short of the required functionality, miniaturization and energy efficiency. Among others, energy efficiency can be significantly improved through embedded ECG compression, which reduces airtime over energy-hungry wireless links. In this paper, we propose a novel real-time energy-aware ECG monitoring system based on the emerging compressed sensing (CS) signal acquisition/compression paradigm for WBSN applications. For the first time, CS is demonstrated as an advantageous real-time and energy-efficient ECG compression technique, with a computationally light ECG encoder on the state-of-the-art Shimmer™ wearable sensor node and a real-time decoder running on an iPhone (acting as a WBSN coordinator). Interestingly, our results show an average CPU usage of less than 5% on the node, and of less than 30% on the iPhone.
High-end multicore processors are characterized by high power density with significant spatial and temporal variability. This leads to power and temperature hot-spots, which may cause non-uniform ageing and accelerated chip failure. These critical issues can be tackled on-line by closed-loop thermal and reliability management policies. Model predictive controllers (MPC) outperform classic feedback controllers, since they are capable of minimizing a cost function while enforcing a safe working temperature. Unfortunately, basic MPC controllers rely on a-priori knowledge of the multicore thermal model, and their complexity grows exponentially with the number of controlled cores. In this paper we present a scalable, fully-distributed, energy-aware thermal management solution. The model-predictive controller complexity is drastically reduced by splitting it into a set of simpler interacting controllers, each allocated to one core in the system. Locally, each node selects the optimal frequency to meet temperature constraints while minimizing the performance penalty and system energy. Global optimality is achieved by letting controllers exchange a limited amount of information at run-time on a neighbourhood basis. We address model uncertainty by supporting learning of the thermal model with a novel distributed self-calibration approach that matches the controller architecture well.
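One round of such a distributed policy can be sketched as follows. This is a toy linear thermal model with a cubic power law; the coefficients, frequency levels, and the greedy local rule are illustrative stand-ins for the paper's MPC formulation, and the only information exchanged between cores is each neighbour's last chosen power.

```python
def distributed_thermal_step(freqs, neighbors, t_amb, t_max,
                             k_self=2.0, k_neigh=0.5,
                             levels=(1.0, 1.5, 2.0, 2.5)):
    """One round: every core greedily picks the highest frequency level
    whose predicted steady-state temperature stays below t_max, using
    only its neighbours' previously chosen power (P ~ f^3)."""
    power = {core: f ** 3 for core, f in freqs.items()}
    chosen = {}
    for core in freqs:
        neigh_heat = k_neigh * sum(power[n] for n in neighbors[core])
        for f in sorted(levels, reverse=True):
            if t_amb + k_self * f ** 3 + neigh_heat <= t_max:
                chosen[core] = f
                break
        else:
            chosen[core] = min(levels)  # thermal emergency: lowest level
    return chosen
```

Iterating this round to a fixed point mimics the neighbourhood information exchange described above: each core throttles just enough to respect the temperature cap given the heat its neighbours inject.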
Small autonomous embedded systems powered by energy harvesting techniques have gained momentum in industry and research. This paper presents a simple, yet effective and complete, energy harvesting solution which permits the exploitation of an arbitrary number of ambient energy sources. The proposed modular architecture collects energy from each of the connected harvesting subsystems in a concurrent and independent way. The possibility of connecting a lithium-ion or nickel-metal hydride rechargeable battery protects the system against long periods of ambient energy shortage and improves its overall dependability. The simple, fully analogue design of the power management and battery monitoring circuits minimizes the component count and the parasitic consumption of the harvester. Numerical simulation of the system behavior allows an in-depth analysis of its operation under different environmental conditions and validates the effectiveness of the design.
Component-based validation techniques for parallel and distributed embedded systems should be able to deal with heterogeneous components, interactions, and specification mechanisms. This paper describes various approaches that allow the composition of subsystems with different execution and interaction semantics by combining computational and analytic models. In particular, this work shows how finite state machines, timed automata, and methods from classical real-time scheduling theory can be embedded into MPA (modular performance analysis), a contemporary framework for system-level performance analysis. The result is a powerful tool for compositional performance validation of distributed real-time systems.
Modern FPGAs enable complete system designs that include processors, interconnect systems, memory subsystems, and a number of application functions that are implemented using High-Level Synthesis tools.
Keywords: HDTV systems, processor subsystem, image processing applications, High-Level Synthesis
Designing multi-processor systems-on-chips becomes increasingly complex, as more applications with real-time requirements execute in parallel. System resources, such as memories, are shared between applications to reduce cost, causing their timing behavior to become inter-dependent. Using conventional simulation-based verification, this requires all concurrently executing applications to be verified together, resulting in a rapidly increasing verification complexity. Predictable and composable systems have been proposed to address this problem. Predictable systems provide bounds on performance, enabling formal analysis to be used as an alternative to simulation. Composable systems isolate applications, enabling them to be verified independently. Predictable and composable systems are built from predictable and composable resources. This paper presents three general techniques to implement and model predictable and composable resources, and demonstrates their applicability in the context of a memory controller. The architecture of the memory controller is general and supports both SRAM and DDR2/DDR3 SDRAM and a wide range of arbiters, making it suitable for many predictable and composable systems. The modeling approach is based on a shared-resource abstraction that covers any combination of supported memory and arbiter and enables system-level performance analysis with a variety of well-known frameworks, such as network calculus or data-flow analysis. Index Terms - predictability; composability; memory controller; memory patterns; real-time; SDRAM; arbitration; latency-rate servers
Advanced SoCs integrate a diverse set of system functions that pose different requirements on the SoC infrastructure. Predictable integration of such SoCs, with guaranteed Quality-of-Service (QoS) for the real-time functions, is becoming increasingly challenging. We present a structured approach to predictable integration based on a combination of architectural principles and associated analysis techniques. We identify four QoS classes and define the type of QoS guarantees to be supported for the two classes targeted at real-time functions. We then discuss how a SoC infrastructure can be built that provides such QoS guarantees on its interfaces and how network calculus can be applied for analyzing worst-case performance and sizing of buffers. Benefits of our approach are predictable performance and improved time-to-market, while avoiding costly over-design.
Keywords - SoC infrastructure; system integration; real-time; predictability; Quality-of-Service; network calculus;
The design of high-performance servers has always been a challenging art. Now, server designers are being asked to explore a much larger design space as they consider multicore heterogeneous architectures and the limits of advancing silicon technology. Bringing automation to the early stages of design can enable more rapid and accurate trade-off analysis. In this paper, we introduce an Early Chip Planner which allows designers to rapidly analyze microarchitecture, physical and package design trade-offs for 2D and 3D VLSI chips and generates an attributed netlist to be carried on to the implementation stage. We also describe its use in planning a 3D special-purpose server processor.
Keywords-system level design automation; early chip planning
Assuming continuous cell sizes, we have robustly achieved global minimization of the total transistor sizes needed to achieve a delay goal, thus minimizing dynamic power (and reducing leakage power). We then developed a feasible branch-and-bound algorithm that maps the continuous sizes to the discrete sizes available in the standard cell library. Results show that a typical library gives results close to the optimal continuous-size results. After using state-of-the-art commercial synthesis, the application of our discrete size selection tool results in a dynamic power reduction of 40% (on average) for large industrial designs.
Keywords- power-delay optimization, discrete cell-size selection, delay modelling, parallelism
Packet classification has long been a fundamental processing pattern of modern networking devices. Today's high-performance routers use specialized hardware for packet classification, but such solutions suffer from prohibitive cost, high power consumption, and poor extensibility. On the other hand, software-based routers offer the best flexibility, but can only deliver limited performance (<10 Gbps). Recently, graphics processing units (GPUs) have proven to be an efficient accelerator for software routers. In this work, we propose a GPU-based linear search framework for packet classification. The core of our framework is a metaprogramming technique that dramatically enhances execution efficiency. Experimental results show that our solution can outperform a CPU-based solution by a factor of 17 in terms of classification throughput. Our technique is scalable to large rule sets consisting of over 50K rules and thus provides a solid foundation for future applications of packet content inspection.
Keywords- Packet Classification; Software Router; GPU; CUDA; Metaprogramming
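As background to the linear-search approach, a minimal CPU-side reference of rule matching can be sketched as follows. The rule fields, priorities, and addresses are hypothetical placeholders; the paper's GPU kernels and metaprogramming machinery are not reproduced here.

```python
# Illustrative CPU reference for linear-search packet classification.
# Rules and their fields are hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    priority: int
    src_prefix: tuple  # (value, prefix_len) over 32-bit addresses
    dst_prefix: tuple
    proto: Optional[int]  # None means wildcard

def prefix_match(addr: int, prefix: tuple) -> bool:
    value, length = prefix
    if length == 0:
        return True  # zero-length prefix matches everything
    shift = 32 - length
    return (addr >> shift) == (value >> shift)

def classify(rules, src: int, dst: int, proto: int):
    """Return the highest-priority matching rule via linear search."""
    best = None
    for r in rules:
        if not prefix_match(src, r.src_prefix):
            continue
        if not prefix_match(dst, r.dst_prefix):
            continue
        if r.proto is not None and r.proto != proto:
            continue
        if best is None or r.priority > best.priority:
            best = r
    return best
```

A GPU version would evaluate all rules for a batch of packets in parallel; the sequential loop above is only the functional reference.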
Modern batteries (e.g., Li-ion batteries) provide high discharge efficiency, but the rate capacity effect in these batteries drastically decreases the discharge efficiency as the load current increases. Electric double layer capacitors, or simply supercapacitors, have extremely low internal resistance, and a battery-supercapacitor hybrid may mitigate the rate capacity effect for high pulsed discharging current. However, a hybrid architecture comprising a simple parallel connection does not perform well when the supercapacitor capacity is small, which is a typical situation because of the low energy density and high cost of supercapacitors. This paper presents a new battery-supercapacitor hybrid system that employs a constant-current charger. The constant-current charger isolates the battery from the supercapacitor to improve the end-to-end efficiency for energy from the battery to the load, while accounting for the rate capacity effect of Li-ion batteries and the conversion efficiencies of the converters.
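The rate capacity effect is commonly approximated with Peukert's law; the sketch below, using an illustrative Peukert exponent rather than measured battery constants, shows how delivered capacity shrinks as the load current grows, which is the behavior a constant-current charger helps avoid.

```python
def delivered_capacity(rated_ah, rated_current, load_current, k=1.2):
    """Peukert-style approximation of the rate capacity effect:
    effective capacity shrinks as discharge current grows.
    The exponent k here is illustrative, not a measured value."""
    # Discharge time: t = (rated_ah / rated_current) * (rated_current / load_current)^k
    t = (rated_ah / rated_current) * (rated_current / load_current) ** k
    return t * load_current  # Ah actually delivered at this current

# A constant-current charger lets the battery see a steady, moderate
# current even when the load pulses, keeping delivered capacity high.
```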
A strong dI/dt event in a VLSI circuit can induce a temporary voltage drop and consequent malfunctioning of logic, such as failing speed paths. This event, called power droop, usually manifests itself in at-speed scan test, where a surge in switching activity (capture phase) follows a period of quiescent circuit state (shift phase). Power droop is also present during mission-mode operation. However, because of the less predictable occurrence of switching events in mission mode, the values of power droop measured during test usually differ from those measured in mission mode. To overcome the power droop problem, different mitigation techniques have been proposed. The goal of these techniques is to create a uniform current demand throughout the test. This paper proposes a feedback-based droop mitigation technique which can adapt to the droop by reading the level of VDD and modifying in real time the current flowing through ad-hoc droop mitigators. It is shown that the proposed solution not only compensates for droop events occurring during test mode but can also be used as a method of mission-mode droop mitigation and yield enhancement if higher power consumption is acceptable.
Keywords: droop, mitigation techniques, ATPG, power supply;
This paper concerns the design and optimization of a digital hearing aid application. It aims to show that a suitably adapted ASIP can be constructed to create a highly optimized solution for the wide variety of complex algorithms that play a role in this domain. These algorithms are configurable to fit the various hearing impairments of different users. They pose significant challenges to digital hearing aids, which have strict area and power consumption constraints. First, a typical digital hearing aid application is proposed and implemented, comprising all critical parts of today's products. Then a small-area, ultra-low-power 16-bit processor is designed for the application domain. The resulting hearing aid system achieves a power reduction of ≥ 56x over the RISC implementation and can operate for > 300 hours on a typical battery.
This paper presents HypoEnergy, a framework for extending the hybrid battery-supercapacitor power supply lifetime. HypoEnergy combines high energy density and reliable workload supportability of an electrochemical battery with high power density and high number of recharge cycles of supercapacitors. The lifetime optimizations consider nonlinear battery characteristics and supercapacitors' charging overhead. HypoEnergy-KI studies the hybrid supply lifetime optimization for a preemptively known workload and for one ideal supercapacitor. We show a mapping of HypoEnergy-KI to the multiple-choice knapsack problem and use dynamic programming to address the problem. HypoEnergy-KN considers the optimization for the known workload but in the case of having a non-ideal supercapacitor bank that leaks energy. Evaluations on iPhone load measurements demonstrate the efficiency and applicability of the HypoEnergy framework in extending the system's lifetime.
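The multiple-choice knapsack formulation used by HypoEnergy-KI can be illustrated with a generic dynamic-programming sketch; the groups, weights, and values below are hypothetical placeholders, not the paper's actual lifetime-optimization variables.

```python
def mckp(groups, capacity):
    """Multiple-choice knapsack: pick exactly one (weight, value) item
    from each group, keep total weight <= capacity, maximize value.
    Returns the best value, or None if no feasible selection exists."""
    NEG = float("-inf")
    dp = [NEG] * (capacity + 1)  # dp[c]: best value at exact weight c
    dp[0] = 0
    for group in groups:
        nxt = [NEG] * (capacity + 1)
        for c in range(capacity + 1):
            if dp[c] == NEG:
                continue
            for w, v in group:  # must take exactly one item per group
                if c + w <= capacity and dp[c] + v > nxt[c + w]:
                    nxt[c + w] = dp[c] + v
        dp = nxt
    best = max(dp)
    return None if best == NEG else best
```

In a HypoEnergy-style use, each "group" would correspond to a decision point in the known workload and the DP would select the supply configuration per interval.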
We present a runtime system that uses the explicit on-chip communication mechanisms of the SARC multi-core architecture to efficiently implement the OpenMP programming model and enable the exploitation of fine-grain parallelism in OpenMP programs. We explore the design space of implementing OpenMP directives and runtime intrinsics using a family of hardware primitives: remote stores, remote DMAs, hardware counters, and hardware event queues with automatic responses, to support static and dynamic scheduling and data transfers in local memories. Using an FPGA prototype with four cores, we achieve OpenMP task creation latencies of 30-35 processor clock cycles, initiation of parallel contexts in 50 cycles and synchronization primitives in 65-210 cycles.
Excessive capture power in at-speed scan testing may cause timing failures, resulting in test-induced yield loss. This has made capture-safety checking mandatory for test vectors. This paper presents a novel metric, called the TTR (Transition-Time-Relation-based) metric, which takes transition time relations into consideration in capture-safety checking. Capture-safety checking with the TTR metric greatly improves the accuracy of test vector sign-off and low-capture-power test generation.
Organic electronics, such as OLEDs, OPVs, and polymer-based power storage units (batteries and capacitors), are rapidly becoming low-cost, viable alternatives to silicon-based devices. These organic devices, however, are still reliant on the support functions of standard silicon components such as power and logic transistors. Integration of these organic devices with standard silicon electronics into a combined heterogeneous system requires specific design and fabrication considerations. Full-scale integration with conventional silicon-based electronic components is challenging due to their incompatibility with common semiconductor fabrication processes that can damage the active organic compounds. The printable/spray/spin nature of organic electronics fabrication makes 3D integration an attractive methodology. We propose to combine the organic and inorganic portions of a heterogeneous system by fabricating the modules separately (hence enabling parallel manufacturing) in a specific 2D layout scheme, and subsequently connecting the devices together in a post-fabrication process. In this paper we discuss the 2D designs in detail and propose a 2D-3D hybrid design as well as a fully 3D stacked design for organic electronics with energy storage devices in a face-to-back configuration. The fabrication process of each device and the integration of OPVs and OLEDs with power storage devices are discussed. An overview of test procedures and fault tolerances for the proposed configuration is provided. Finally, a potential solution for a new test environment derived from a mixed configuration of different technologies and materials is proposed. Index Terms - 3D Integration, Organic Electronics, Interconnects, Photovoltaics, Polymer Battery, Capacitor, OLED.
Photovoltaic (PV) energy harvesting is commonly used to power wireless sensor nodes. To optimise harvesting efficiency, maximum power point tracking (MPPT) techniques are often used. Recently-reported techniques focus solely on outdoor applications, being too power-hungry for use under indoor lighting. Additionally, some techniques have required light sensors (or pilot cells) to control their operating point. This paper describes an ultra low-power MPPT technique which is based on a novel system design and sample-and-hold arrangement, which enables MPPT across the range of light intensities found indoors and outdoors and is capable of cold-starting. The proposed sample-and-hold based technique has been validated through a prototype system. Its performance compares favourably against state-of-the-art systems, and does not require an additional pilot cell or photodiode. This represents an important contribution, in particular for sensors which may be exposed to different types of lighting (such as body-worn or mobile sensors).
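The paper's sample-and-hold circuit is not reproduced here; as a point of reference, the classic perturb-and-observe MPPT loop (a common baseline, not the paper's technique) can be sketched with an assumed PV power curve:

```python
def perturb_and_observe(measure_power, set_voltage, v0, step=0.05, iters=50):
    """Classic perturb-and-observe MPPT baseline: nudge the operating
    voltage and keep moving in whichever direction increases harvested
    power. Step size and iteration count here are illustrative."""
    v = v0
    direction = 1.0
    last_p = measure_power(set_voltage(v))
    for _ in range(iters):
        v += direction * step
        p = measure_power(set_voltage(v))
        if p < last_p:            # power dropped: reverse the perturbation
            direction = -direction
        last_p = p
    return v                      # oscillates around the maximum power point
```

With a concave power-voltage curve, the loop settles into a small oscillation around the maximum power point, which is why real trackers also tune the step size.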
Future applications will require processors with many cores communicating through a regular interconnection network. Meanwhile, deep-submicron technology foreshadows an era of highly defective chips. In this context, not only do fault-tolerant designs become compulsory, but their performance under failures gains importance. In this paper, we present a deadlock-free fault-tolerant adaptive routing algorithm featuring Explicit Path Routing in order to limit latency degradation under failures. This is particularly interesting for streaming applications, which transfer huge amounts of data between the same source-destination pairs. The proposed routing algorithm is able to route messages in the presence of any set of multiple node and link failures, as long as a path exists, and does not use any routing table. It is scalable and can be applied to multicore chips with a 2D mesh core interconnect of any size. The algorithm is deadlock-free and avoids infinite looping in fault-free and faulty 2D meshes. We simulated the proposed algorithm using the worst-case scenario, with different failure rates. Experimental results confirmed that the algorithm tolerates multiple failures even in the most extreme failure patterns. Additionally, we monitored the interconnect traffic and average latency for faulty cases. For 20x20 meshes, the proposed algorithm reduces the average latency by up to 50%.
Panelists: P. Urard, J. Rabaey, R. Bramley, A. King-Smith, W. Burleson, and F. Perruchot
In this paper, we consider a cyber-physical architecture where multiple control applications are divided into multiple tasks, spatially distributed over various processing units that communicate over a bus implementing a hybrid communication protocol, i.e., a protocol with both time-triggered and event-triggered communication schedules (e.g., FlexRay). In spite of efficient utilization of communication bandwidth (BW), event-triggered protocols suffer from unpredictable temporal behavior, while their time-triggered counterparts exhibit exactly the opposite characteristics. In the context of communication delays experienced by the control-related messages exchanged over the shared communication bus, we observe that a distributed control application is more prone to performance deterioration in transient phases than in the steady state. We exploit this observation to re-engineer control applications to operate in two modes, depending on the state (transient or steady) of the system, in order to optimally exploit the bi-modal (time- and event-triggered) characteristics of the underlying communication medium. Using a FlexRay-based case study, we show that such a design provides a good trade-off between control performance and bus utilization.
Embedded hard real-time systems that are based on software product lines using dynamically derivable variants are prone to overestimations in static WCET analyses. This is due to the fact that infeasible paths in the code resulting from infeasible variant combinations are unknown to the analysis. This paper presents an approach to incorporate variant constraints in the calculation to exclude infeasible paths and thus to decrease the WCET overestimation. Based on feature models we propose a sound approach to identify significant infeasible paths that can be safely discarded in the analysis. The benefits of the approach are exemplified by a real world example from the automotive domain where we are able to reduce the WCET bound by up to 50 percent.
This paper introduces the concept of switched FlexRay networks and proposes two algorithms to schedule data communication for this new type of network. Switched FlexRay networks use an intelligent star coupler, called a switch, to temporarily decouple network branches, thereby increasing the effective network bandwidth. Although scheduling for basic FlexRay networks is not new, prior work in this domain does not utilize the branch parallelism that is available when a FlexRay switch is used. In addition to the novel exploitation of branch parallelism, the scheduling algorithms proposed in this paper also support all slot multiplexing options as defined in the FlexRay v3.0 protocol specification. This includes support for the newly added repetition rates and support for multiplexing frames from different sending nodes in the same slot. Our first algorithm quickly produces a schedule given the communication requirements, network topology and FlexRay parameters, but cannot guarantee an optimal schedule in terms of the bandwidth efficiency and extensibility. Therefore, a second, branch-and-price algorithm is introduced that does find optimal schedules.
Negative Bias Temperature Instability (NBTI) has become an important reliability issue in modern semiconductor processes. Recent work has attempted to address NBTI-induced degradation at the architecture level. However, such work has relied on device-level analytical models that, we argue, are limited in their flexibility to model the impact of architecture-level techniques on NBTI degradation. In this paper, we propose a flexible numerical model for NBTI degradation that can be adapted to better estimate the impact of architecture-level techniques on NBTI degradation. Our model is a numerical solution to the reaction-diffusion equations describing NBTI degradation that has been parameterized to model the impact of dynamic voltage scaling, averaging effects across logic paths, power gating, and activity management. We use this model to understand the effectiveness of different classes of architecture-level techniques that have been proposed to mitigate the effects of NBTI. We show that the potential benefits from these techniques are, for the most part, smaller than what has been previously suggested, and that guardbanding may still be an efficient way to deal with aging.
Conventional power management knobs such as voltage scaling or power gating have been shown to have a beneficial effect on the aging phenomena caused by Negative Bias Temperature Instability (NBTI). This benefit can be especially exploited in SRAM memories, which are particularly sensitive to NBTI effects: given their symmetric structure, they cannot take advantage of value-dependent recovery. We propose an architectural solution based on the idea of partitioning a memory into multiple banks of identical size. While this organization has been widely used for reducing both dynamic and static power, exploiting it for aging benefits requires proper management of the existing idleness of the various banks. This can be achieved by means of a time-varying addressing scheme in which addresses are mapped to different banks over time in such a way that the idleness is uniformly distributed over all the banks. Experimental analysis shows that it is possible to simultaneously reduce leakage power and aging in caches, with minimal overhead and without modifying the internal structure of the SRAM arrays.
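The time-varying addressing idea can be illustrated with a minimal rotation-based mapping; the bank count and rotation rule below are illustrative assumptions, not the paper's exact scheme.

```python
def bank_of(address, epoch, num_banks=4):
    """Illustrative time-varying addressing: the address-to-bank map is
    rotated each 'epoch' so that idle periods (and hence BTI recovery
    time) are spread uniformly over all banks instead of being pinned
    to the banks that happen to hold cold addresses."""
    return (address + epoch) % num_banks

# Over num_banks consecutive epochs every address visits every bank once,
# so each bank accumulates the same amount of idleness on average.
```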
We present an energy-reduction strategy for applications which are resilient, i.e., can tolerate occasional errors, based on adaptive voltage control. The voltage is lowered, possibly beyond the safe-operation region, as long as no errors are observed, and raised again when the severity of the detected errors exceeds a threshold. Due to the resilient nature of the applications, lightweight error detection logic is sufficient for operation, and no expensive error recovery circuitry is required. On a hardware block implementing texture decompression, we observe 25% to 30% energy reduction at negligible quality loss (compared to the error introduced by the lossy compression algorithm). We investigate the strategy's performance under temperature and process variations and different assumptions on the voltage-control circuitry. The strategy automatically chooses the lowest appropriate voltage, and thus the largest energy reduction, for each individual manufactured instance of the circuit.
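The threshold-based control loop described above can be sketched as a single update step; all voltage bounds, step sizes, and thresholds here are illustrative values, not the paper's.

```python
def adapt_voltage(vdd, errors, severity_threshold=3,
                  v_step=0.01, v_min=0.7, v_max=1.1):
    """One step of a threshold-based adaptive voltage controller:
    lower Vdd while no (or mild) errors are observed, raise it again
    when the detected error severity exceeds the threshold.
    All constants are illustrative placeholders."""
    if errors > severity_threshold:
        vdd = min(v_max, vdd + v_step)   # back off toward the safe region
    else:
        vdd = max(v_min, vdd - v_step)   # probe for more energy savings
    return vdd
```

Because each manufactured instance converges to its own lowest workable voltage, this kind of loop absorbs process and temperature variation without per-chip characterization.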
Approximate computing techniques that exploit the inherent resilience in algorithms through mechanisms such as voltage over-scaling (VOS) have gained significant interest. In this work, we focus on meta-functions that represent computational kernels commonly found in application domains that demonstrate significant inherent resilience, namely Multimedia, Recognition and Data Mining. We propose design techniques (dynamic segmentation with multi-cycle error compensation, and delay budgeting for chained data path components) which enable the hardware implementations of these meta-functions to scale more gracefully under voltage over-scaling. The net effect of these design techniques is improved accuracy (fewer and smaller errors) under a wide range of over-scaled voltages. Results based on extensive transistor-level simulations demonstrate that the optimized meta-function implementations consume up to 30% less energy at iso-error rates, while achieving up to 27% lower error rates at iso-energy when compared to their baseline counterparts. System-level simulations for three applications, motion estimation, support vector machine based classification and k-means based clustering, are also presented to demonstrate the impact of the improved meta-functions at the application level. Index Terms - Approximate Computing, Low Power Design, Voltage Over-scaling, Meta-functions.
Main memory plays a critical role in a computer system's performance and energy efficiency. Three key parameters define a main memory system's efficiency: latency, bandwidth, and power. Current memory systems try to balance these three parameters to achieve reasonable efficiency for most programs. However, in a multi-core system, applications with various memory demands are executed simultaneously. This paper proposes a heterogeneous main memory with three different memory modules, where each module is heavily optimized for one of the three parameters at the cost of compromising the other two. Based on the memory access characteristics of an application, the operating system allocates its pages in a memory module that satisfies its memory requirements. When compared to a homogeneous memory system, we demonstrate through cycle-accurate simulations that our design results in about a 13.5% increase in system performance and a 20% improvement in memory power.
Non-volatile memories, such as Flash and Phase- Change Memory, are replacing other memory and storage technologies. Although these new technologies have desirable energy and scalability properties, they are prone to wear-out due to excessive write operations. Because wear-out is an important phenomenon, a number of endurance management schemes have been proposed. There is a trade-off between what techniques to use, depending on the range of bit cell lifetime within a device. This range in cell durability arises from effects due to process variation. In this paper, we describe modeling techniques to analyze trade-offs for endurance management based on the anticipated distribution of cell lifetime. This analysis considers two general endurance strategies (physical capacity degradation and physical sparing) under four distributions of cell lifetime (constant, linear, normal, and bimodal). The modeling techniques can be used to determine how much redundancy is needed when a sparing endurance strategy is adopted. With the correct choice of technique, the device lifetime can be doubled.
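The physical-sparing strategy can be illustrated with a small Monte Carlo sketch; the normal lifetime distribution parameters, cell counts, and spare counts below are illustrative assumptions, not measured device data.

```python
import random

def device_lifetime(cell_lifetimes, spares):
    """Physical-sparing model sketch: the device survives until more
    cells have worn out than there are spares to replace them, so its
    lifetime is the (spares+1)-th smallest cell lifetime."""
    return sorted(cell_lifetimes)[spares]

def simulate(n_cells=1000, spares=100, trials=200, seed=1):
    """Monte Carlo of expected device lifetime under a normal cell
    lifetime distribution (mean/sigma are illustrative placeholders)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        cells = [max(0.0, rng.gauss(1e5, 2e4)) for _ in range(n_cells)]
        total += device_lifetime(cells, spares)
    return total / trials
```

Repeating the simulation for constant, linear, normal, and bimodal distributions shows how much redundancy a sparing scheme needs under each assumption.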
The emerging nanophotonic technology can avoid the limitation of I/O pin count and provide abundant memory bandwidth. However, current DRAM organization has mainly been optimized for higher storage capacity and package pin utilization. The resulting data-fetching mechanism is quite inefficient in performance and energy saving, and cannot effectively utilize the abundant optical bandwidth in off-chip communication. This paper inspects the opportunity brought by optical communication and revisits the DRAM memory architecture considering the technology trend towards multiprocessors. In our FlexMemory design, super-line prefetching is proposed to boost system performance and promote energy efficiency; it leverages the abundant photonic bandwidth to enlarge the effective data fetch size per memory cycle. To further preserve locality and maintain service parallelism for different workloads, a page folding technique is employed to achieve adaptive data mapping in photonics-connected DRAM chips via optical wavelength allocation. By combining both techniques, surplus off-chip bandwidth can be utilized and effectively managed to adapt to workload intensity. Experimental results show that our FlexMemory achieves considerable improvements in performance and energy efficiency.
Keywords-DRAM; nanophotonic; memory architecture; locality
Modern digital signal processors (DSPs) need to support a diverse array of applications ranging from digital filters to video decoding. Many of these applications have drastically different precision and on-chip memory requirements. Moreover, DSPs often employ aggressive dynamic voltage and frequency scaling (DVFS) techniques to minimize power consumption. However, at reduced voltages, process variations can significantly increase the failure rate of on-chip SRAMs designed with small transistors to achieve high integration density, resulting in low yields. Consequently, the size of transistors in SRAM cells, and hence the cell size, needs to be increased to satisfy the target yield. However, this can result in high area overhead since on-chip memories consume a significant portion of the die area. In this paper, we present a scratchpad memory design that exploits the tradeoffs between SRAM cell sizes, their failure rates, the minimum operating voltage for target yield (Vddmin), and application characteristics to achieve an on-chip memory area reduction of up to 17%. Our approach reduces Vddmin, which allows dynamic and leakage power savings of 42% and 36% respectively with DVFS. Moreover, for error-tolerant DSP applications we allow voltage scaling below Vddmin to achieve further power savings while incurring lower mean error as compared to short word-length memory. Finally, for error-sensitive applications, we propose a reconfigurable memory organization that trades memory capacity for higher precision at a lower Vddmin.
Process variability is becoming a major challenge in CMOS design in general, and in embedded SRAMs in particular, due to continuous device scaling. The main problems are increased static power and reduced operating margins, robustness and reliability. A common way to reduce the static power consumption of an SRAM memory array is to decrease its supply voltage when in memory retention mode. However, this leads to a further reduction in memory robustness. The most common tool for statistical analysis of circuits under process variability is standard Monte Carlo simulation, which has been proven to be too expensive when applied to an ultra-dense SRAM [1]-[6]. In this paper a statistical robustness analysis method is proposed based on decoupling statistical integration from robustness region determination in the parameter domain. The robustness is estimated with a ~556X speed-up relative to Monte Carlo and an error of ~1%.
Keywords-6T SRAM; Robustness Analysis; Data Retention; PVT Variability.
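For context, the standard Monte Carlo baseline that such methods accelerate can be sketched as follows; the toy robustness predicate and parameter distribution are illustrative, not a real transistor model.

```python
import random

def mc_failure_probability(is_robust, sample_params, trials=10000, seed=0):
    """Standard Monte Carlo baseline: draw process parameters, check
    the robustness predicate, count failures. For failure rates around
    1e-6 this needs far more than 1e6 samples, which is what motivates
    decoupled robustness-region approaches."""
    rng = random.Random(seed)
    fails = sum(0 if is_robust(sample_params(rng)) else 1
                for _ in range(trials))
    return fails / trials

# Hypothetical toy model: a "cell" fails when the sampled parameter
# deviation exceeds a fixed robustness margin of 3 sigma.
estimate = mc_failure_probability(
    is_robust=lambda dv: abs(dv) < 3.0,
    sample_params=lambda rng: rng.gauss(0.0, 1.0))
```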
SRAM cell stability analysis is typically based on Static Noise Margin (SNM) evaluation in hold mode, although memory errors may also occur during read operations. Given that SNM varies with each cell operation, a thorough analysis of SNM in read mode is required. In this paper we investigate the SNM of SRAM cells during write operations. Word-line voltage modulation is proposed as an alternative to improve cell stability in this mode. We show that it is possible to improve the stability of 8T SRAM cells during write operations while reducing current leakage, as opposed to present methods that improve cell stability at the cost of increased leakage.
Recent studies of BTI behavior in SRAM cells showed that for high-κ metal gate stack technology, the PBTI-induced Vth shift in NMOS is as significant as the NBTI-induced Vth shift in PMOS. Previous techniques for mitigating NBTI in SRAM focus mainly on PMOS and thus lack the ability to mitigate PBTI of NMOS transistors. In this paper, we propose a novel design to recover 4 internal gates within a SRAM cell simultaneously to mitigate both NBTI and PBTI effects. In the evaluated L2 cache, our technique effectively slows down the increase in cell failure probability, and achieves a 4.64/2.86x (best/worst case) lifetime improvement over the normal design. Index Terms - high-κ, NBTI, PBTI, recovery, SRAM
Modern reconfigurable technologies can have a number of inherent advantages for cryptanalytic applications. Aimed at the cryptanalysis of the SHA-1 hash function, this work explores this potential, showing new approaches inherently based on hardware reconfigurability, enabling algorithm and architecture exploration, input-dependent system specialization, and low-level optimizations based on static/dynamic reconfiguration. As a result of this approach, we identified a number of new techniques, at both the algorithmic and architectural level, to effectively improve the attacks against SHA-1. We also defined the architecture of a high-performance FPGA-based cluster, which turns out to be the solution with the highest speed/cost ratio for SHA-1 collision search currently available. A small-scale prototype of the cluster enabled us to reach a real collision for a 72-round version of the hash function.
SPA/SEMA (Simple Power/Electro-Magnetic Analysis) attacks performed on public-key cryptographic modules implemented on FPGA platforms are well known from the theoretical point of view. However, the practical aspect is not often developed in the literature, and researchers know that these attacks do not always work, as in the case of an RSA accelerator. Indeed, SEMA on RSA needs to distinguish between square and multiply operations, which use the same logic; this contrasts with SEMA on ECC, which is easier since doubling and addition are two different operations from the hardware point of view. In this paper, we ask what to do if a SEMA attack fails on a device. Does it mean that no attack is possible? We show that hardware demodulation techniques allow the recording of a signal with more information on the leakage than a raw recording. We then propose a generic and fast method for finding suitable demodulation frequencies. The effectiveness of our methods is demonstrated through actual experiments using an RSA processor on the SASEBO FPGA board. We show cases where only demodulated signals make it possible to defeat RSA.
Keywords: Demodulation, Simple Electro-Magnetic Analysis, Mutual Information, Modular Exponentiation.
This paper presents LOEDAR, a novel low-cost Error Detection and Recovery scheme for Montgomery-ladder-based Elliptic Curve Scalar Multiplication (ECSM). The LOEDAR scheme exploits the invariance among the intermediate results produced by the algorithm to detect errors. The error detection process can be carried out periodically during ECSM to verify data correctness, and recovers the cryptosystem to the latest checkpoint upon detecting errors. The frequency of running the error detection process can be adjusted to trade off power and time overhead against error detection latency and recovery overhead. The hardware and power overheads of LOEDAR are about 37% and 69%, respectively; each additional error detection process contributes less than 1% additional time and power overhead.
Keywords - elliptic curve cryptography (ECC); elliptic curve scalar multiplication (ECSM); concurrent error detection; Montgomery ladder
When using Elliptic Curve Cryptography (ECC) in constrained embedded devices such as RFID tags, López-Dahab's method along with the Montgomery powering ladder is considered the most suitable method. It uses only the x-coordinate for point representation, and at the same time offers intrinsic protection against simple power analysis. This paper proposes a low-cost fault detection mechanism for Elliptic Curve Scalar Multiplication (ECSM) using the López-Dahab algorithm. By introducing minimal changes to the last round of the algorithm, we make it capable of detecting faults with a very high probability. In addition, by reusing the existing resources, we significantly reduce both performance losses and area overhead compared to other methods in this scenario. The method is especially suitable for constrained devices. Index Terms - Elliptic Curve Cryptosystems (ECC), Montgomery Powering Ladder, Fault Attacks, Low Overhead, López-Dahab algorithm
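The error-detection ideas in the two ECSM abstracts above rest on the Montgomery ladder's step invariant: its two working registers always differ by the base value, so a transient fault breaks the relation and can be caught by a periodic check. A minimal illustrative sketch, transposed from elliptic-curve points to ordinary modular exponentiation (where the invariant is R1 = R0·g mod n), not the papers' actual hardware schemes:

```python
def ladder_pow(g, k, n, check_every=4):
    """Montgomery ladder computing g**k mod n with periodic
    invariant checking (illustrative sketch over modular
    exponentiation rather than elliptic-curve point arithmetic).

    The ladder maintains r1 == r0 * g (mod n) after every step;
    a fault in either register breaks this relation, so checking
    it every few iterations detects errors.
    """
    r0, r1 = 1, g % n
    for i, bit in enumerate(bin(k)[2:]):       # scan exponent MSB first
        if bit == '1':
            r0, r1 = (r0 * r1) % n, (r1 * r1) % n
        else:
            r0, r1 = (r0 * r0) % n, (r0 * r1) % n
        if (i + 1) % check_every == 0:
            assert r1 == (r0 * g) % n, "fault detected"
    return r0
```

In a fault-free run the assertion never fires; flipping a bit of either register between checks violates the invariant at the next check, which is the spirit of the periodic verification and checkpoint-recovery described above.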
Traditional engineering disciplines such as civil or mechanical engineering are based on solid theory for building artefacts with predictable behavior over their life-time. In contrast, we lack similar constructivity results for computing systems engineering: computer science provides only partial answers to particular system design problems. With few exceptions, predictability is impossible to guarantee at design time and therefore, a posteriori verification remains the only means for ensuring their correct operation.
We elaborate on the theoretical foundation and practical application of the contract-based specification method originally developed in the Integrated Project SPEEDS [11], [9] for two key use cases in embedded systems design. We demonstrate how formal contract-based component specifications for functional, safety, and real-time aspects of components can be expressed using the pattern-based requirement specification language RSL developed in the Artemis Project CESAR, and develop a formal approach for virtual integration testing of composed systems based on such contract-specifications of subsystems. We then present a methodology for multi-criteria architecture evaluation developed in the German Innovation Alliance SPES on Embedded Systems.
Motivation. The specific root causes of the design problems that are haunting system companies such as automotive and avionics companies are complex and relate to a number of issues, ranging from design processes and relationships with different departments of the same company and with suppliers1 to incomplete requirement specification and testing.2 Further, there is widespread consensus in the industry that there is much to gain by optimizing the implementation phase, which today considers only a very small subset of the design space. Some attempts at more efficient design space exploration have been made, but there is a need to formalize the problem better and to involve the different players of the supply chain in major ways. Information about the capabilities of subsystems in terms of timing, power consumption, size, weight and other physical aspects, transmitted to the system assemblers at design time, would go a long way toward enabling better design space exploration. In this landscape, a wrong turn in a system design project could cause so much economic, social and organizational upheaval that it may imperil the life of an entire company. No wonder there is much interest in risk management approaches to assess the risks associated with design errors, delays, recalls and liabilities. Finding appropriate countermeasures to lower risks and developing contingency plans is then a mainstay of the way large projects are managed today. The overarching issue is the need for a substantive evolution of the design methodology in use today in system companies. The issues to address are the understanding of the principles of system design, the necessary changes to design methodologies, and the dynamics of the supply chain. Developing this understanding is necessary to define a sound approach to the needs of system companies as they try to serve their customers better and to develop their products faster and with higher quality. 
An important approach to tackle in part these issues is component-based design.
Modern cars are equipped with hundreds of sensors, used not only in the traditional powertrain, chassis, and body areas, but also in more advanced applications related to multimedia, infotainment, and x-by-wire systems. Such a large number of sensing elements requires particular attention in the design phase of the in-vehicle communication networks. This paper provides an overview of the most commonly used automotive sensors and describes the networks traditionally used today to collect their measurements. Moreover, it considers some possible alternative solutions that could be used in the future to obtain the single uniform network sought by the automotive industry in order to reduce the weight, space, and cost of the communication system.
Wireless communication in a car has several advantages, provided that the demanded safety and real-time requirements are fulfilled. This paper presents a wireless MAC protocol designed for the needs of automotive and industrial applications. The proposed MAC protocol provides special support for network traffic prioritization in order to guarantee worst-case message delays for a set of high-priority nodes. Its performance is analyzed with a network simulator and compared with the IEEE 802.15.4 standard CSMA/CA protocol.
Using wireless communication and energy harvesting in automobiles might have significant advantages considering dependability (no wires and contacts) and weight (no cable tree). In this paper, we give a brief overview of the related technologies, surrounding conditions, and methods for design and optimization. As examples, we focus on methods for harvesting kinetic energy and wireless transmission in a tire pressure metering system (TPMS).
Mobile consumer electronics continue to converge, in terms of functionality and feature sets, bringing many challenges to the circuits required to power these applications. This paper outlines some of the technology available to address these challenges.
In sub-wavelength lithography, traditional resolution enhancement techniques (e.g., OPC) cannot guarantee the optimality of the mask. In this paper, we present a novel inverse lithography method to solve the mask optimization problem. Recognizing that, when formulated on a pixel-by-pixel basis with partially coherent optical models, the problem is a large-scale nonlinear optimization problem, we cast the optimization flow into a homotopy framework and apply an efficient numerical continuation technique. Compared to earlier pixel-based inverse lithography methods, our homotopy approach is not only more efficient, but also capable of naturally addressing the mask manufacturability problem. Experimental results in a state-of-the-art lithography environment show that our method generates high-fidelity wafer images and is 100x faster than previously reported inverse lithography methods.
A key challenge in design automation of digital microfluidic biochips is to carry out on-chip dilution/mixing of biochemical samples/reagents to achieve a desired concentration factor (CF). In a bioassay, reducing waste is crucial: waste droplet handling is cumbersome, and minimizing the number of on-chip waste reservoirs reduces the consumption of limited-volume samples and expensive reagents, and hence the cost of the biochip. Existing dilution algorithms attempt to reduce the number of mix/split steps required in the process but pay little attention to minimizing sample requirements or waste droplets. In this work, we characterize the underlying combinatorial properties of waste generation and identify the inherent limitations of two earlier mixing algorithms (the BS algorithm by Thies et al., Natural Computing 2008; the DMRW algorithm by Roy et al., IEEE TCAD 2010) in addressing this issue. Based on these properties, we design an improved dilution/mixing algorithm (IDMA) that optimizes the usage of intermediate droplets generated during the dilution process, which in turn reduces the demand for sample/reagent and the production of waste. The algorithm terminates in O(n) steps for producing a target CF with a precision of 1/2^n. Based on simulation results for all CF values ranging from 1/1024 to 1023/1024 using a sample (100% concentration) and a buffer solution (0% concentration), we present an integrated scheme for choosing the best waste-aware dilution algorithm among BS, DMRW, and IDMA for any given value of CF. Finally, an architectural layout of a DMF biochip that supports the proposed scheme is designed.
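The bit-scanning (BS) dilution scheme referenced above can be sketched concisely: a target CF of k/2^n is reached in n (1:1) mix/split steps by scanning the bits of k from LSB to MSB and, at each step, mixing the current droplet with either a sample droplet (bit 1, CF 1) or a buffer droplet (bit 0, CF 0). This is an illustrative sketch of that idea only; it tracks the concentration but not the waste-droplet accounting that IDMA optimizes:

```python
from fractions import Fraction

def bs_dilute(k, n):
    """Simulate (1:1) mix/split steps producing CF = k / 2**n.

    Each step mixes the current droplet with a sample droplet
    (bit = 1) or a buffer droplet (bit = 0) and splits the result,
    which halves the gap between the two input concentrations.
    """
    cf = Fraction(0)              # start from pure buffer (CF 0)
    steps = 0
    for i in range(n):            # scan bits of k, LSB first
        bit = (k >> i) & 1
        cf = (cf + bit) / 2       # mix with CF-1 or CF-0 droplet
        steps += 1
    return cf, steps

cf, steps = bs_dilute(613, 10)    # target CF 613/1024 in 10 steps
```

Each step also produces one unit droplet that is not carried forward; counting and reusing those intermediate droplets is precisely where waste-aware algorithms improve on this baseline.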
Many industrial systems, sensors and advanced propulsion systems demand electronics capable of functioning at high ambient temperatures in the range of 500-600°C. Conventional Si-based electronics fail to work reliably in such high temperature ranges. In this paper we propose, for the first time, a high-temperature reconfigurable computing platform capable of operating at temperatures of 500°C or higher. Such a platform is also amenable to reliable operation in high-radiation environments. The hardware reconfigurable platform follows the interleaved architecture of a conventional Field Programmable Gate Array (FPGA) and provides the usual benefits of lower design cost and time. High-temperature operation, however, is enabled by the choice of a special device material, namely silicon carbide (SiC), and a special switch structure, namely the Nano-Electro-Mechanical-System (NEMS) switch. While SiC provides excellent mechanical and chemical properties suitable for operation in extremely harsh environments, the NEMS switch provides low-voltage operation, ultra-low leakage and radiation hardness. We propose a novel multi-layer NEMS switch structure and an efficient design of each building block of the FPGA using nanoscale SiC NEMS switches. Using measured switch parameters from a number of SiC NEMS switches we fabricated, we compare the power, performance and area of an all-mechanical FPGA with alternative implementations for several benchmark circuits.
Keywords - High Temperature Electronics; SiC; NEMS; FPGA
The increasing power consumption of integrated circuits (ICs) enabled by technology scaling requires more efficient heat dissipation solutions to improve overall chip reliability and reduce hotspots. Thermal interface materials (TIMs) are widely employed to improve the thermal conductivity between the chip and the cooling facilities. In recent years, carbon nanotubes (CNTs) have been proposed as a promising TIM due to their superior thermal conductivity. Several CNT-based thermal structures for improving chip heat dissipation have been proposed and have demonstrated significant temperature reduction. In this paper, we present an improved CNT TIM design, which includes a CNT grid and thermal vias to dissipate heat more efficiently and obtain a more uniform chip thermal profile. We present simulation-based experimental results that indicate a 32% / 25% peak temperature reduction and a 48% / 22% improvement in chip reliability for two industrial processor benchmarks, showing the effectiveness of our proposed thermal structure.
The Unified Modeling Language (UML), as a de facto standard for software development, finds more and more application in the design of systems that also contain hardware components. Guaranteeing the correctness of a system specified in UML is thereby an important as well as challenging task. In recent years, the first approaches for this purpose have been introduced. However, most of them focus only on the static view of a UML model. In this paper, an automatic approach is presented which checks verification tasks for dynamic aspects of a UML model. That is, given a UML model as well as an initial system state, the approach proves whether a sequence of operation calls exists such that a desired behavior is invoked. The underlying verification problem is encoded as an instance of the satisfiability problem and subsequently solved using a SAT Modulo Theories solver. An experimental evaluation confirms the applicability of the proposed approach.
Rapidly and accurately estimating the impact of design decisions on performance metrics is critical to both the manual and automated design of wireless sensor networks. Estimating system-level performance metrics such as lifetime, data loss rate, and network connectivity is particularly challenging because they depend on many factors, including network design and structure, hardware characteristics, communication protocols, and node reliability. This paper describes a new method for automatically building efficient and accurate predictive models for a wide range of system-level performance metrics. These models can be used to eliminate or reduce the need for simulation during design space exploration. We evaluate our method by building a model for the lifetime of networks containing up to 120 nodes, considering both fault processes and battery energy depletion. With our adaptive sampling technique, only 0.27% of the potential solutions are evaluated via simulation. Notably, one such automatically produced model outperforms the most advanced manually designed analytical model, reducing error by 13% while maintaining very low model evaluation overhead. We also propose a new, more general definition of system lifetime that accurately captures application requirements and decouples the specification of requirements from implementation decisions.
Simulation is a bottleneck in the design flow of on-chip multiprocessors. This paper addresses that problem by reducing the simulation time of complex on-chip interconnects through transaction-level modelling (TLM). A particular on-chip interconnect architecture was chosen, namely a wormhole network-on-chip with priority-preemptive virtual channel arbitration, because its mechanisms can be modelled at transaction level in such a way that accurate figures for communication latency can be obtained with less simulation time than a cycle-accurate model. The proposed model produced latency figures with more than 90% accuracy and simulated more than 1000 times faster than a cycle-accurate model.
Keywords - system specification; transaction-level modeling; network-on-chip; on-chip multiprocessing; simulation
We present a high-level analytical model for chip-multiprocessors (CMPs) that encompasses processors, memory, and communication in an area-constrained, global optimization process. Applying this analytical model to the design of a symmetric CMP for speech recognition, we demonstrate a methodology for estimating model parameters prior to design exploration. Then we present an automated approach for finding the optimal high-level CMP architecture. The result is the ability to find the allocation of silicon resources for each architectural element that maximizes overall system performance. This balances the performance gains from parallelism, processor microarchitecture, and cache memory with the energy-delay costs of computation and communication.
Design and optimization of microwave passive components is one of the most critical problems for RF IC designers. However, state-of-the-art methods either are efficient but depend heavily on the accuracy of equivalent circuit models, which may fail the synthesis when the frequency is high, or rely fully on electromagnetic (EM) simulations, whose solution quality is high but which are too expensive. To address this problem, a new method, called Gaussian Process-Based Differential Evolution for Constrained Optimization (GPDECO), is proposed. In particular, GPDECO performs global optimization of the microwave structure using EM simulations, while a Gaussian process (GP) based surrogate model is constructed online to predict the results of expensive EM simulations. GPDECO is tested on two 60 GHz transformers and compared with state-of-the-art methods. The results show that GPDECO can generate high-performance RF passive components that cannot be generated by the available efficient methods. Compared with the available methods with the best solution quality, GPDECO achieves comparable results at only 20%-25% of the computational effort. Using parallel computation on an 8-core CPU, the synthesis can be finished in less than half an hour.
Keywords - Transformer synthesis, Microwave components, Microwave design, Gaussian process, Surrogate model, Differential evolution
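The surrogate-assisted loop described above can be illustrated generically: a GP regressor fitted online to all points evaluated so far prescreens differential-evolution trial vectors, so the expensive simulator is invoked only for candidates the surrogate predicts to be competitive. The sketch below is a generic toy version; the quadratic objective, RBF kernel, and DE settings are placeholders, not GPDECO's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_sim(x):
    # stand-in for an EM simulation (hypothetical cheap objective)
    return float(np.sum((x - 0.3) ** 2))

def gp_fit(X, y, ls=0.5, noise=1e-6):
    # RBF-kernel GP regression; returns what prediction needs
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * ls**2))
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return X, alpha, ls

def gp_predict(model, Xq):
    X, alpha, ls = model
    Kq = np.exp(-((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * ls**2))
    return Kq @ alpha          # posterior mean (zero prior mean)

dim, npop = 2, 12
pop = rng.random((npop, dim))
fit = np.array([expensive_sim(x) for x in pop])
X_hist, y_hist = list(pop), list(fit)
for gen in range(30):
    model = gp_fit(np.array(X_hist), np.array(y_hist))  # refit online
    for i in range(npop):
        a, b, c = pop[rng.choice(npop, 3, replace=False)]
        trial = np.clip(a + 0.8 * (b - c), 0, 1)        # DE/rand/1 mutation
        cross = rng.random(dim) < 0.9                   # binomial crossover
        trial = np.where(cross, trial, pop[i])
        # surrogate prescreen: run the expensive simulation only for
        # candidates the GP predicts to be at least as good
        if gp_predict(model, trial[None, :])[0] <= fit[i]:
            y = expensive_sim(trial)
            X_hist.append(trial); y_hist.append(y)
            if y < fit[i]:
                pop[i], fit[i] = trial, y
best = pop[np.argmin(fit)]
```

The design point is that the surrogate filters out hopeless trials before they reach the simulator, which is where the claimed 4-5x reduction in computational effort would come from.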
We propose a fast method for identifying the jitter tolerance curves of high-speed phase-locked loops. The method is based on an adaptive recursion and uses known tail-fitting methods to realize a fast optimization combined with a small number of jitter samples. It allows for efficient behavioral simulations and can also be applied to hardware measurements. A typical modeling example demonstrates applicability to both software and hardware scenarios and achieves simulated measurement times in the range of a few hundred milliseconds.
In latest CMOS technologies, Random Telegraph Noise (RTN) has emerged as an important challenge for SRAM design. Due to rapidly shrinking device sizes and heightened variability, analytical approaches are no longer applicable for characterising the circuit-level impact of non-stationary RTN. Accordingly, this paper presents SAMURAI, a computational method for accurate, trap-level, non-stationary analysis of RTN in SRAMs. The core of SAMURAI is a technique called Markov Uniformisation, which extends stochastic simulation ideas from the biological community and applies them to generate realistic traces of non-stationary RTN in SRAM cells. To the best of our knowledge, SAMURAI is the first computational approach that employs detailed trap-level stochastic RTN generation models to obtain accurate traces of non-stationary RTN at the circuit level. We have also developed a methodology that integrates SAMURAI and SPICE to achieve a simulation-driven approach to RTN characterisation in SRAM cells under (a) arbitrary trap populations, and (b) arbitrarily time-varying bias conditions. Our implementation of this methodology demonstrates that SAMURAI is capable of accurately predicting non-stationary RTN effects such as write errors in SRAM cells.
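A single two-state trap underlying RTN alternates between empty and filled states with exponentially distributed dwell times. A minimal trace generator for the stationary (fixed-bias) case might look as follows; this is an illustrative sketch only, not SAMURAI's algorithm, whose Markov Uniformisation specifically handles the harder non-stationary case where capture/emission rates vary with time-varying bias:

```python
import random

def rtn_trace(tau_c, tau_e, t_end, seed=1):
    """Generate one RTN trace for a single two-state trap.

    tau_c / tau_e are the mean capture and emission time constants
    (assumed fixed, i.e. stationary bias).  Returns a list of
    (time, state) transition events, where state 1 = trap occupied.
    """
    rng = random.Random(seed)
    t, state, events = 0.0, 0, [(0.0, 0)]
    while t < t_end:
        tau = tau_c if state == 0 else tau_e
        t += rng.expovariate(1.0 / tau)   # exponential dwell time
        state ^= 1                        # capture <-> emission
        if t < t_end:
            events.append((t, state))
    return events
```

Superposing such traces over a trap population, and letting the rates follow the bias waveform, is the kind of trap-level stochastic generation the abstract describes feeding into SPICE.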
The flexibility of an Intelligent Power Switch (IPS) designed in HV-CMOS technology for incandescent lamps in automotive scenarios has been evaluated for driving an LED in the presence of wiring parasitics. The paper presents how it is possible, through proper reconfiguration of the flexible IPS, to reduce the undesired ringing phenomenon when driving an LED with wiring parasitics, thus reducing Electromagnetic Interference (EMI) and spikes on the supply voltage. Electrical simulations and experimental measurements prove the effectiveness of the proposed IPS.
Keywords - Intelligent Power Switch; wiring parasitics; LED driving; High Voltage CMOS Circuit; Automotive Electronics
The increasing demand for "safe" vehicles requires the continuous design of innovative devices and sensors. This paper presents a methodology for an efficient energy analysis of a self-powered sensor in an ultra-low-power automotive application. To achieve this goal, new tools have been developed for storing and processing data (e.g., power consumption values, operating conditions, etc.) and for reporting the energy balance, taking into account the source (i.e., a scavenger device) that supplies the sensor. Index Terms - Low-power design, wireless sensors, energy scavenging, analysis tools
In high-power microcontrollers, a decrease in circuit lifetime is often observed in safety-critical applications, where circuitry is subjected to the most severe stresses; reliability has therefore become a major concern. Thus, ad-hoc design solutions become necessary to mitigate the impact of ageing. In this paper, we discuss hardware-software approaches that exploit distributed on-chip monitoring of wear-out parameters to perform ageing-aware allocation of computation and recovery periods on the various computational units.
We propose a new system-level methodology for relative power estimation which is independent of register transfer level models. Our methodology monitors the number of bit transitions for all input/output gate signals on a bit- and cycle-accurate SystemC virtual platform model. For absolute results and reliable technology-based predictions of system power and speed (e.g. in future 32/22nm technology nodes and variations), the relative metrics can be multiplied with bit-energy coefficients provided by semiconductor technology datasheets and device models.
Keywords - design methodology; multicore; network-on-chip; SystemC; system-on-chip; TLM
Energy efficiency is one of the most critical aspects of today's information society. The most obvious benefits of being green are reduced environmental impact and cost savings. Reducing the energy consumption of electronic devices, circuits and heterogeneous systems, however, is not trivial. It requires the development of innovative energy-aware vertical design solutions and EDA technologies for next-generation nanoelectronic circuits and systems, and for the related energy generation, conversion and management systems.
Modern Multiprocessor Systems-on-Chip (MPSoCs) are ideal platforms for co-hosting multiple applications, which may have very distinct resource requirements (e.g. data-processing-intensive or communication-intensive) and may start/stop execution independently at time instants unknown at design time. In such systems, the runtime task allocator, which is responsible for assigning appropriate resources to each task, is a key component for achieving high system performance. This paper presents a new task allocation strategy that introduces self-adaptability. By dynamically adjusting a set of key parameters at runtime, the optimization criteria of the task allocator adapt to the relative scarcity of different types of resources, so that resource bottlenecks can be effectively mitigated. Experimental results show that, compared with traditional task allocators with fixed optimization criteria, our adaptive task allocator achieves significant improvements in both hardware efficiency and stability.
While much work has addressed the energy-efficient scheduling problem for uniprocessor or multiprocessor systems, little has been done for multicore systems. We study a multicore architecture with a fixed number of cores partitioned into clusters (or islands), on each of which all cores operate at a common frequency. We develop algorithms to determine a schedule for real-time tasks that minimizes energy consumption under timing and operating-frequency constraints. As technical contributions, we first show that, when the timing constraint is not considered, the optimal frequencies resulting in the minimum energy consumption for each island depend not on the mapped workload but on the number of cores and the leakage power of the island. Then, for systems with timing constraints, we present a polynomial-time algorithm that derives the minimum energy consumption for a given task partition. Finally, we develop an efficient algorithm to determine the number of active islands, the task partition and the frequency assignment. Our simulation results show that our approach significantly outperforms related approaches in terms of energy saving.
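The workload-independence observation above can be illustrated with a common convex power model, P(f) = c_eff·f³ + P_leak (an assumed model, possibly different from the paper's): energy per cycle is E(f) = c_eff·f² + P_leak/f, and setting dE/df = 0 gives a "critical" frequency that depends only on leakage and effective capacitance, not on how many cycles the workload needs.

```python
def critical_frequency(p_leak, c_eff):
    """Frequency minimizing energy per cycle under the assumed model
    P(f) = c_eff * f**3 + p_leak.

    E(f) = c_eff * f**2 + p_leak / f
    dE/df = 2*c_eff*f - p_leak/f**2 = 0  =>  f* = (p_leak / (2*c_eff))**(1/3)

    Note f* depends only on leakage and effective capacitance -- the
    mapped workload scales E(f) but does not move its minimum.
    """
    return (p_leak / (2.0 * c_eff)) ** (1.0 / 3.0)
```

Running slower than f* wastes leakage energy; running faster wastes dynamic energy. Timing constraints, which the paper's polynomial-time algorithm handles, can force operation above f*.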
The dual effects of larger die sizes and technology scaling, combined with aggressive voltage scaling for power reduction, increase the error rates of on-chip memories. Traditional on-chip memory reliability techniques (e.g., ECC) incur significant power and performance overheads. In this paper, we propose a low-power-and-performance-overhead Embedded RAID (E-RAID) strategy and present Embedded RAIDs-on-Chip (E-RoC), a distributed, dynamically managed reliable memory subsystem. E-RoC achieves reliability through redundancy by optimizing RAID-like policies tuned for on-chip distributed memories. We achieve on-chip reliability of memories through the use of distributed dynamic scratch pad allocatable memories (DSPAMs) and their allocation policies. We exploit aggressive voltage scaling to reduce the power consumption overheads due to parallel DSPAM accesses, and rely on the E-RoC manager to automatically handle any resulting voltage-scaling-induced errors. Our experimental results on multimedia benchmarks show that E-RoC's fully distributed redundant reliable memory subsystem reduces power consumption by up to 85% and latency by up to 61% over traditional reliability approaches that use parity/cyclic hybrids for error checking and correction.
For process nodes at 22nm and below, a multitude of new manufacturing solutions have been proposed to improve the yield of devices being manufactured. With these new solutions come an increasing number of defect mechanisms. There is a need to model and characterize these new defect mechanisms so that (i) ATPG patterns can be properly targeted, and (ii) defects can be properly diagnosed and addressed at the design or manufacturing level. This presentation reviews currently available defect modeling and test solutions and summarizes open issues faced by the industry today. It also explores the topic of creating special test structures that expose manufacturing process parameters, which can be used as input to software defect models to predict die-specific defect locations for better targeting of tests.
Keywords - Manufacturing test; Photolithography; Defect Modeling; Fault Diagnosis; Layout Enhancements for Manufacturing
Anticipating silicon response in the presence of process variability is essential to avoid costly silicon re-spins. The EDA industry is trying to provide designers with the right set of tools for statistical characterization of SRAM and logic, yet design teams (including those in foundries) are still using classical corner-based characterization approaches. On the one hand, the EDA industry fails to meet the demands for appropriate tool functionality; on the other hand, design teams are not yet fully aware of the trade-offs involved when designing under extreme process variability. This paper summarizes the challenges for statistical characterization of SRAM and logic. It describes the key features of a set of prototype tools providing the required functionality, together with their application to a number of case studies aimed at enhancing yield at the product level.
This paper discusses one of the key challenges of design-for-yield: namely, the difficulty in correlating observed behavior with modeled behavior. In order to achieve good parametric yield, the design process must account for a large number of sources of variability in the silicon, ranging from those inherent in the device and wire models themselves through approximations made in library modeling, extraction, tool algorithms and so on. The problem is further complicated by defects and systematic errors that can be present in early silicon but are expected to be fixed as part of the volume ramp. In addition, environmental factors such as temperature and power delivery must be understood, and variation in the measurement equipment must also be correctly accounted for. Examples are given for validating standard-cell and memory-based designs, as well as a general methodology that can be used to enable chip bring-up.
Keywords - yield optimization, variability, silicon correlation
The yield of homogeneous network-on-chip based multi-processor chips can be improved with the addition of spare tiles. However, the impact of this reliability approach on chip energy consumption is not documented. For instance, in a homogeneous MPSoC, application tasks can be placed onto any tile of a defect-free chip, whereas a chip with a defective tile needs a special task placement in which the faulty tile is avoided. This paper presents a task placement tool and an evaluation of the energy consumption of homogeneous NoC-based MPSoCs with spare tiles. Results show NoC energy consumption overheads ranging from 1% to 10% when considering up to three faults randomly distributed over the tiles of a 3x4 mesh network. The results also indicate that faults on the central tiles typically have more impact on energy overhead.
Keywords - network-on-chip; homogeneous MPSoCs; reliability estimation
Transactional Memories (TM) have attracted much interest as an alternative to lock-based synchronization in shared-memory multiprocessors. Considering the use of TM on an embedded, NoC-based MPSoC, this work evaluates a LogTM implementation. It is shown that the time an aborted transaction waits before restarting its execution (the backoff delay) can seriously affect the overall performance and energy consumption of the system. This work also shows the difficulty of finding a general and optimal setting for this delay and analyzes three backoff policies for handling it. A new solution to this issue is presented, based on a handshake between transactions. Results suggest up to 20% performance gains and up to 53% energy savings when comparing our new solution to the best backoff delay alternative found in our experiments.
Keywords: Hardware Transactional Memories; Multiprocessor Systems-on-Chip; Networks-on-Chip; Embedded Systems; Performance; Energy Consumption
Progressive gate oxide breakdown is emerging as one of the most important sources of stability degradation in nanoscale SRAMs, especially at lower supply voltages. Low-voltage operation of SRAM arrays is critical in reducing the power consumption of embedded microprocessors, thus necessitating the lowering of Vmin. However, oxide breakdown undesirably increases Vmin, owing to an increase in dynamic write failures and eventually static write failures as the supply voltage decreases. In this work, we describe an analytical model based on the Kohlrausch-Williams-Watts (KWW) function to predict the degradation in WLcrit as the oxide breakdown progresses. The KWW model also accurately predicts the efficacy of the word-line boosting and Vdd-lowering write-assist techniques in reducing WLcrit. Simulation results from an industrial low-power 32nm SRAM show that the model is accurate to within 1% of SPICE across a range of supply voltages and severities of oxide breakdown, with orders-of-magnitude improvement in runtime.
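The KWW function referred to above is the stretched exponential φ(t) = exp(−(t/τ)^β) with 0 < β ≤ 1 (β = 1 recovers a plain exponential). How the paper maps this form onto WLcrit degradation is not reproduced here; the function itself is:

```python
import math

def kww(t, tau, beta):
    """Kohlrausch-Williams-Watts stretched exponential,
    phi(t) = exp(-(t / tau)**beta), with 0 < beta <= 1.

    beta < 1 stretches the decay: fast at first, then a long tail,
    which is why the form fits dispersive degradation processes.
    """
    return math.exp(-((t / tau) ** beta))
```

For example, kww(t, tau, 1.0) is the ordinary exponential exp(-t/tau), while smaller beta makes the same tau-scale decay markedly slower at long times.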
Modern hardware and software implementations of cryptographic algorithms are subject to multiple sophisticated attacks, such as differential power analysis (DPA) and fault-based attacks. In addition, modern integrated circuit (IC) design and manufacturing follows a horizontal business model in which different third-party vendors provide hardware, software and manufacturing services, making it difficult to ensure the trustworthiness of the entire process. Such business practices make designs vulnerable to hard-to-detect malicious modifications by an adversary, termed "Hardware Trojans". In this paper, we show that a malicious nexus between multiple parties at different stages of design, manufacturing and deployment makes attacks on cryptographic hardware more potent. We describe the general model of such an attack, which we refer to as a Multi-level Attack, and provide an example of it on a hardware implementation of the Advanced Encryption Standard (AES) algorithm, where a hardware Trojan is embedded in the design. We then analytically show that the resultant attack poses a significantly stronger threat than a Trojan attack by a single adversary. We validate our theoretical analysis using power simulation results as well as hardware measurement and emulation on an FPGA platform.
Reversible logic is an emerging technology with promising applications in quantum computing. In this work, we present a new design of a reversible BCD adder that is primarily optimized for the number of ancilla input bits and the number of garbage outputs. These counts are taken as the primary optimization criteria because it is extremely difficult to realize a quantum computer with many qubits. Since optimizing the ancilla inputs and garbage outputs may degrade the design in terms of quantum cost and delay, these two parameters are also considered, with the primary focus remaining on minimizing the number of ancilla input bits and garbage outputs. First, we propose a new design of a reversible ripple-carry adder with input carry C0 that requires no ancilla input bits; it has lower quantum cost and logic depth (delay) than its existing counterparts. The existing reversible Peres gate and a new reversible gate called the TR gate are efficiently utilized to improve the quantum cost and delay of the ripple-carry adder, and an improved quantum design of the TR gate is also illustrated. Finally, we present a reversible design of the BCD adder based on a 4-bit reversible binary adder that adds the BCD digits, followed by conversion of the binary result to BCD format using a reversible binary-to-BCD converter.
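For readers unfamiliar with the gates named above, a small sketch of their truth-table behavior may help. The mappings below use the Peres and TR gate definitions commonly given in the reversible-logic literature (the abstract itself does not spell them out), and the check confirms that each gate is a bijection on three bits, i.e. reversible:

```python
from itertools import product

def peres(a, b, c):
    # Peres gate, as commonly defined:
    # (a, b, c) -> (a, a XOR b, (a AND b) XOR c)
    return (a, a ^ b, (a & b) ^ c)

def tr(a, b, c):
    # TR gate, as commonly defined in the literature:
    # (a, b, c) -> (a, a XOR b, (a AND NOT b) XOR c)
    return (a, a ^ b, (a & (1 - b)) ^ c)

for gate in (peres, tr):
    # A gate is reversible iff it maps the 8 input patterns
    # onto 8 distinct output patterns (a bijection on 3 bits).
    outputs = {gate(*bits) for bits in product((0, 1), repeat=3)}
    assert len(outputs) == 8
    print(gate.__name__, "is reversible")
```

Note how peres(a, b, 0) produces (a, a XOR b, a AND b), i.e. the sum and carry of a half adder on its last two outputs, which is why these gates are natural building blocks for reversible adders.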
Virtual prototypes are simulators used in the consumer electronics industry; in particular, they allow for early development of embedded software. Transaction-Level Modeling (TLM) is a widely used technique for designing such virtual prototypes, and the SystemC modeling language is the current industry standard for developing them. Our experience suggests that writing TLM models exclusively in SystemC sometimes leads to confusion between modeling concepts and their implementation, and may be the root of some known bad practices. This paper introduces jTLM, an experimentation framework that allows us to study the extent to which common modeling issues come from more fundamental constraints of the TLM approach. We focus on a discussion of the two modes of simulation scheduling, cooperative and preemptive, and confront the implications of these two modes for the way TLM models are designed, the software bugs exposed by the simulators, and the simulation performance.
This paper relies on the longest closest subsequence (LCSS), a variant of the longest common subsequence (LCS), to account for noise and process variations inherent in analog circuits. The idea is to use stochastic differential equations (SDEs) to model the design and to integrate device variation due to the 0.18 μm fabrication process in a MATLAB simulation environment. LCSS is used to find the longest and closest subsequence that matches the subsequence of an ideal circuit. We illustrate the proposed approach on a Colpitts oscillator circuit. Advantages of the proposed method are its robustness and its flexibility to account for a wide range of variations.
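The LCSS idea can be sketched as a small variant of the standard LCS dynamic program: instead of requiring exact equality, two samples "match" when they lie within a tolerance eps of each other. This is a generic sketch of that recurrence, not the authors' implementation; the waveforms and tolerance below are invented for illustration:

```python
def lcss_length(x, y, eps):
    """Length of the longest common subsequence of x and y, where two
    samples are considered to match when they differ by less than eps."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(x[i - 1] - y[j - 1]) < eps:
                # Approximate match: extend the subsequence.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Hypothetical example: an ideal oscillator waveform vs. a noisy one.
ideal = [0.0, 1.0, 0.0, -1.0, 0.0]
noisy = [0.05, 0.9, 0.4, 0.02, -1.1, 0.0]
print(lcss_length(ideal, noisy, eps=0.2))
```

Here the spurious sample 0.4 in the noisy trace is simply skipped, so all five ideal samples still find close matches in order; a longer LCSS relative to the sequence length indicates a circuit response closer to the ideal one.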
This paper presents an efficient technique to perform multi-objective design space exploration of a multiprocessor platform. Instead of using semi-random search algorithms (like simulated annealing, tabu search, genetic algorithms, etc.), we use the domain knowledge derived from the platform architecture to set up the exploration as a discrete-space multi-objective Markov Decision Process (MDP). The system walks the design space changing its parameters, performing simulations only when probabilistic information becomes insufficient for a decision. The algorithm employs a novel multi-objective value function and exploration strategy, which guarantees high accuracy and minimizes the number of necessary simulations. The proposed technique has been tested with a small benchmark (to compare the results against exhaustive exploration) and two large applications (to prove effectiveness in a real case), namely the ffmpeg transcoder and the pigz parallel compressor. Results show that the exploration can be performed with 10% of the simulations necessary for state-of-the-art exploration algorithms and with unrivaled accuracy (0.6 ± 0.05% error).
We present a high-level method for rapidly and accurately predicting bus contention effects on energy and performance in multi-processor SoCs. Unlike most other approaches, which rely on Transaction-Level Modeling (TLM), we infer the information we need directly from executing the algorithmic specification, without needing to build any high-level architectural model. This results in higher estimation speed and allows us to maintain our prediction results within ~2% of gate-level estimation accuracy.
The loop buffer has mainly been explored as an effective architectural technique for low-power execution in embedded processors. Another avenue for exploiting the loop buffer, however, is to obtain a performance benefit from it. In this paper, we propose an application-specific loop buffer organization for vectorized processing kernels that achieves both low-power and high-performance goals. The vectorized loop buffer (VLB) is simplified with single-loop support for SIMD devices. Since significant data-rearrangement overhead is required in order to use SIMD capabilities, the VLB is specialized for zero-overhead implicit data permutation. We add several instructions to the baseline ISA for programming and integrate the VLB into an embedded processor for evaluation. Our results show that the VLB improves performance and power significantly compared to conventional SIMD devices.
Synthesis of reversible circuits is an active research area motivated by its applications, e.g., in quantum computation and low-power design. The number of circuit lines used is thereby a crucial criterion. In this paper, we introduce several methods (including a theoretical upper bound) for the efficient computation, or at least approximation, of the minimal number of lines needed to realize a given function in reversible logic. While the proposed exact approach requires a significant amount of run-time (exponential in the worst case), the heuristic methods lead to very precise approximations in very short run-time. Using this, it can be shown that current synthesis approaches for large functions are still far from producing circuits that are optimal with respect to the number of lines.
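One commonly cited lower bound in this setting is that embedding an irreversible function into a reversible circuit needs at least ceil(log2(mu)) garbage outputs, where mu is the largest number of input patterns mapped to the same output pattern (so that all input patterns can be distinguished at the outputs). A sketch of computing that bound from a truth table; this is the textbook bound, not the paper's own exact or heuristic method:

```python
import math
from collections import Counter

def min_garbage_lower_bound(truth_table):
    """Lower bound on the number of garbage outputs needed to embed an
    irreversible function into a reversible one: ceil(log2(mu)), where
    mu is the largest number of input patterns sharing one output."""
    mu = max(Counter(truth_table).values())
    return math.ceil(math.log2(mu))

# 2-input AND: the output 0 occurs for three input patterns, so mu = 3
# and at least ceil(log2(3)) = 2 garbage outputs are required.
print(min_garbage_lower_bound([0, 0, 0, 1]))
```

Extra garbage outputs translate directly into extra circuit lines, which is why this quantity drives the line-count minimization discussed above.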
In this paper we propose a design methodology to explore dynamic and partial reconfiguration (DPR) of modern FPGAs. We define a set of rules in order to model DPR by means of UML and design patterns. Our approach targets MPSoPC (Multiprocessor System on Programmable Chip) designs, which allows: a) area optimization through partial reconfiguration without performance penalty and b) increased system flexibility through dynamic behavior modeling and implementation. In our case, area reduction is achieved by reconfiguring co-processors connected to embedded processors, and flexibility is achieved by permitting new behavior to be easily added to the system. Most of the system is automatically generated by means of MDE techniques. Our modeling approach allows designers to target dynamic reconfiguration without being experts in modern FPGAs. Such a methodology allows design-time speed-up and a significant reduction of the gap between hardware and software modeling.
This paper presents a technique for automated generation of hierarchical classification schemes to express the main similarities and differences between analog circuits. The produced classification schemes offer insight about the uniqueness and importance of specific design features in setting various performance attributes as well as the limiting factors of designs. Hence, the classification schemes serve as a systematic way of relating one circuit design to alternatives. The automatically produced classification schemes for a set of OpAmps are discussed.
Although general-purpose GPUs have relatively high computing capacity, they also introduce high power consumption compared with general-purpose CPUs. Low-power techniques targeted at GPUs will therefore be one of the hottest topics in the coming years. On the other hand, in several application domains, users are unwilling to sacrifice performance to save power. In this paper, we propose an effective kernel fusion method to reduce the power consumption of GPUs without performance loss. Instead of executing multiple kernels serially, the proposed method fuses several kernels into one larger kernel. Since most consecutive kernels in an application have data dependencies and cannot be fused directly, we split large kernels into multiple slices with a strip-mining method and then fuse independent sliced kernels into one kernel. Based on the CUDA programming model, we propose three different kernel fusion implementations, each targeting a specific case. Based on different strip-mining methods, we also propose two fusion mechanisms, called invariant-slice fusion and variant-slice fusion; the latter can be better adapted to the requirements of the kernels to be fused. The experimental results validate that the proposed kernel fusion method can effectively reduce the power consumption of GPUs.
Keywords-GPGPU, Kernel Fusion, Strip-mining, Power Efficiency
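The core trick (splitting dependent kernels into slices so that independent slices can be fused) can be sketched in plain Python, abstracting away CUDA entirely. This toy is our own illustration, not the paper's implementation: kernel B depends on kernel A's output, so whole-kernel fusion is illegal, but slice i of B depends only on slice i of A, so the two can be interleaved slice by slice:

```python
def strip_mine(n, slice_size):
    """Split the iteration space [0, n) into contiguous slices."""
    return [(s, min(s + slice_size, n)) for s in range(0, n, slice_size)]

def fused(a, b, n, slice_size):
    """Toy software analogue of slice-level kernel fusion: each slice of
    kernel A runs immediately before the matching slice of kernel B,
    instead of running all of A and then all of B."""
    out = [0] * n
    for lo, hi in strip_mine(n, slice_size):
        for i in range(lo, hi):      # slice of kernel A
            a[i] = a[i] * 2
        for i in range(lo, hi):      # slice of kernel B, uses A's result
            out[i] = a[i] + b[i]
    return out

print(fused([1, 2, 3, 4], [10, 20, 30, 40], n=4, slice_size=2))
```

On a GPU the payoff is different from this sequential sketch: fused slices from independent kernels occupy the machine together, raising utilization so the same work finishes with less total energy.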
This paper presents a ripple-carry adder module that can serve as a basic component for Quantum-dot Cellular Automata (QCA) arithmetic circuits. The main methodological design innovation over existing state-of-the-art solutions is the adoption of so-called minority gates in addition to the more traditional majority voters. Exploiting this widened basic block set, we obtain a more compact, and thus less expensive, circuit. Moreover, the layout is designed to comply with the rules for robustness against noise paths [6].
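The majority voter and its complement, the minority gate, are the primitives referred to above. A short sketch (ours, for illustration) of why the widened block set helps: fixing one input specializes a majority gate to AND or OR, while the minority gate directly yields NAND or NOR, and the carry-out of a full adder is exactly a majority vote:

```python
def maj(a, b, c):
    """Three-input majority voter, the basic QCA logic primitive."""
    return (a & b) | (b & c) | (a & c)

def mino(a, b, c):
    """Minority gate: the complement of the majority voter."""
    return 1 - maj(a, b, c)

# Fixing one input specializes the gates:
# maj(a, b, 0) = AND   maj(a, b, 1) = OR
# mino(a, b, 0) = NAND  mino(a, b, 1) = NOR
for a in (0, 1):
    for b in (0, 1):
        assert maj(a, b, 0) == a & b
        assert mino(a, b, 0) == 1 - (a & b)

# In a ripple-carry adder, the carry-out is directly a majority vote:
print(maj(1, 0, 1))  # cout = maj(a, b, cin) -> 1, a carry is generated
```

Having inverting gates (minority) natively available avoids the extra inverters that a majority-only basic block set would require, which is one source of the compactness claimed in the abstract.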
A "smart microgrid" refers to a distribution network for electrical energy, starting from electricity generation to its transmission and storage, with the ability to respond to dynamic changes in energy supply through co-generation and demand adjustments. At the scale of a small town, a microgrid is connected to the wide-area electrical grid, which may be used for "baseline" energy supply, or in the extreme case only as a storage system for a completely self-sufficient microgrid. Distributed generation, storage and intelligence are key components of a smart microgrid. In this paper, we examine the significant role that buildings play in energy use and its management in a smart microgrid. In particular, we discuss the effect that IT equipment has on energy usage by buildings, and show that control of various building subsystems (such as IT and HVAC) can lead to significant energy savings. Using UCSD as a prototypical smart microgrid, we discuss how buildings can be enhanced and interfaced with the smart microgrid, and demonstrate the benefits that this relationship can bring as well as the challenges in implementing this vision.
The exponential increase in world energy demand, with a forecasted rise of 45% [1] between 2010 and 2030, makes energy management one of the most urgent topics of the century and a key driver for the evolution of semiconductor and electronics products. The main solutions to world energy demand and global warming issues have been divided into two main streams: an increasing offer of alternative energy sources and their integration into the new Smart Grid, and a reduction of demand through an increase in the efficiency of systems.
Maximizing the performance of the Itoh-Tsujii finite field inversion algorithm (ITA) on FPGAs requires tuning several design parameters, which is often time-consuming and difficult. This paper presents a theoretical model of the ITA for any Galois field and any k-input LUT-based FPGA (k > 3). Such a model aids a hardware designer in selecting the ideal design parameters quickly. The model is experimentally validated with the NIST-specified fields and with 4- and 6-LUT based FPGAs. Finally, it is demonstrated that the resulting designs of the Itoh-Tsujii inversion algorithm are the most optimized among contemporary works on LUT-based FPGAs.
Reconfigurable hardware such as FPGAs is increasingly being employed for accelerating compute-intensive applications. While recent advances in technology have increased the capacity of FPGAs, the lack of standard models for developing custom accelerators creates issues with scalability and compatibility. We present SHARC - Streaming Hardware Accelerator with Run-time Configurability - for an FPGA-based accelerator. This model is at a lower level than existing stream processing models and provides the hardware designer with a flexible platform for developing custom accelerators. The SHARC model provides a generic interface for each hardware module and a hierarchical structure for parallelism at multiple levels in an accelerator. It also includes a parameterization and hierarchical run-time reconfiguration framework to enable hardware reuse for flexible yet high-throughput design. This model is very well suited for compute-intensive applications in areas such as real-time vision and signal processing, where stream processing provides enormous performance benefits. We present a case study by implementing a bio-inspired saliency-based visual attention system using the proposed model and demonstrate the benefits of run-time reconfiguration. Experimental results show about 5X speedup over an existing CPU implementation and up to 14X higher performance-per-watt over a relevant GPU implementation.
Several approaches have been proposed to accelerate the NP-complete Boolean Satisfiability problem (SAT) using reconfigurable computing. In this paper, we present a five-stage pipelined SAT solver, with SAT solving broken into five stages: variable decision, variable effect fetch, clause evaluation, conflict detection, and conflict analysis. The solver performs a novel search algorithm combining the advanced techniques of state-of-the-art SAT solvers: non-chronological backjumping, dynamic backtracking, and learning without explicit traversal of the implication graph. SAT instance information is stored in FPGA block RAMs, avoiding synthesis overhead for each instance. The proposed solver achieves up to 70x speedup over other hardware SAT solvers with 200x less resource utilization.
Keywords - Boolean Satisfiability, Conflict-directed jumping.
We introduce on-demand redundancy, a set of architectural techniques that leverage the tightly-coupled nature of components in systems-on-chip to reduce the cost of safety-critical systems. On-demand redundancy eases the assumptions that traditionally segregate the execution of critical and non-critical tasks (NCTs), making resources available for critical tasks at potentially arbitrary points in both space and time, and otherwise freeing resources to execute non-critical tasks when critical tasks are not executing. Relaxed dedication is one such technique that allows non-critical tasks to execute on critical task resources. Our results demonstrate that for a wide variety of applications and architectures, relaxed dedication is more cost-effective than a traditional approach that employs dedicated resources executing in lockstep. Applied to dual-modular redundancy (DMR), relaxed dedication exposes 73% more NCT cycles than traditional DMR on average, across a wide variety of usage scenarios.
Given the projected higher variations in the availability of computational resources, adaptive static schedules have been developed to attain high-speed execution reconfiguration with no reliance on any runtime rescheduling decisions. These schedules are able to deliver predictable execution despite the increased levels of device unreliability in future multicore systems. Yet the associated runtime reconfiguration overhead is largely determined by the underlying system topology. Fully connected architectures, although they can effectively hide the overhead in execution migration, become infeasible as the core count grows to hundreds in the near future. We exploit in this paper the high locality associated with adaptive static schedules, and outline a scalable and locally shareable system organization for multicore platforms. With the incorporation of a limited set of neighborhood-centered communication links, threads are allowed to be directly migrated among adjacent cores without physical data movement. At the architecture level, a set of 2-dimensional physical topologies with such a local sharing property embedded is furthermore proposed. The inherent regularity allows these topologies to be adopted as a fixed-silicon multicore platform that can be flexibly redefined according to the parallelism characteristics and resilience needs of each application.
A novel policy for allocating reconfigurable fabric resources in multi-core processors is presented. We deploy a Minority-Game to maximize the efficient use of the reconfigurable fabric while meeting performance constraints of individual tasks running on the cores. As we will show, the Minority Game ensures a fair allocation of resources, e.g., no single core will monopolize the reconfigurable fabric. Rather, all cores receive a "fair" share of the fabric, i.e., their tasks would miss their performance constraints by approximately the same margin, thus ensuring an overall graceful degradation. The policy is implemented on a Virtex-4 FPGA and evaluated for diverse applications ranging from security to multimedia domains. Our results show that the Minority-Game policy achieves on average 2x higher application performance and a 5x improved efficiency of resource utilization compared to state-of-the-art.
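The Minority Game underlying the allocation policy is a standard game-theoretic model: in each round every agent (here, a core contending for fabric) picks one of two sides, and the agents on the minority side win. Over repeated rounds no agent can monopolize the winning side, which is the fairness property the abstract exploits. A generic sketch of the game itself, not of the paper's hardware policy:

```python
import random

def minority_game_round(choices):
    """One round: each agent picks side 0 or 1; agents on the minority
    side win. Returns the winning side, or None on a tie."""
    ones = sum(choices)
    zeros = len(choices) - ones
    if ones == zeros:
        return None
    return 1 if ones < zeros else 0

# Repeated play with random strategies: wins spread evenly, so no single
# agent dominates. The agent count and round count are arbitrary.
random.seed(0)
agents, scores = 5, [0] * 5
for _ in range(100):
    choices = [random.randint(0, 1) for _ in range(agents)]
    winner = minority_game_round(choices)
    if winner is not None:
        for i, c in enumerate(choices):
            if c == winner:
                scores[i] += 1
print(scores)
```

In the paper's setting the "sides" correspond to requesting or yielding fabric resources; the minority rule is what makes all cores miss their constraints by roughly the same margin rather than starving any one of them.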
This paper proposes a linearised state-space technique to accelerate the simulation of tunable vibration energy harvesting systems by at least two orders of magnitude. The paper provides evidence that currently available simulation tools are inadequate for simulating complete energy harvesting systems, where prohibitive CPU times are encountered due to disparate time scales. In the proposed technique, the model of a complete mixed-technology energy harvesting system is divided into component blocks whose mechanical and analogue electrical parts are modelled by local state equations and terminal variables, while the digital electrical part is modelled as a digital process. Unlike existing simulation tools that use the Newton-Raphson method, the proposed technique uses explicit integration, such as the Adams-Bashforth method, to solve the state equations of the complete energy harvester model in a short simulation time. Experimental measurements of a practical tunable energy harvester have been carried out to validate the proposed technique.
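The speedup claim rests on using an explicit multistep integrator, which advances the state with a direct formula per step instead of the iterative Newton-Raphson solves that implicit methods require. A sketch of the two-step Adams-Bashforth scheme, x_{n+1} = x_n + dt(3/2 f_n - 1/2 f_{n-1}), applied to an invented scalar test system (the paper's actual harvester equations are not reproduced here):

```python
import math

def ab2(f, x0, dt, steps):
    """Two-step Adams-Bashforth integration of dx/dt = f(x).
    Explicit: each step is a direct evaluation, with no per-step
    Newton-Raphson iteration."""
    xs = [x0]
    f_prev = f(x0)
    xs.append(x0 + dt * f_prev)  # bootstrap first step with forward Euler
    for _ in range(steps - 1):
        f_curr = f(xs[-1])
        xs.append(xs[-1] + dt * (1.5 * f_curr - 0.5 * f_prev))
        f_prev = f_curr
    return xs

# Linear test system dx/dt = -x, exact solution exp(-t).
xs = ab2(lambda x: -x, 1.0, dt=0.01, steps=100)
print(abs(xs[-1] - math.exp(-1.0)))  # small second-order discretization error
```

Explicit schemes like this trade unconditional stability for per-step cheapness, which pays off when the linearised state equations are well conditioned, as the abstract argues they are after partitioning the system into blocks.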
We report an approach targeted to aid design exploration and early decision-making in model refinement, optimization, and trade-offs. The approach consists of coupling SystemC AMS with a descriptive functional simulator. System engineering tools are typically used in the design and analysis of system prototypes captured at a very high level, where accuracy and detail of results are naturally compromised in favor of simulation speed and design effort. In the presented approach, the much-needed abstraction and simulation speed are retained during simulation of the platform architecture, while near-implementation models (RTL, SPICE) may also be co-simulated with the architecture.
This contribution proposes syntax extensions to SystemC-A that support mixed-technology system modelling where components might exhibit distributed behaviour modelled by partial differential equations. The important need for such extensions arises from the well known modelling difficulties in hardware description languages where complex electronics in a mixed-technology system interfaces with distributed components from different physical domains, e.g. mechanical, magnetic or thermal. A digital MEMS accelerometer with distributed mechanical sensing element is used as a case study to illustrate modelling capabilities offered by the proposed extended syntax of SystemC-A.
Stochastic circuit reliability analysis, as described in this work, matches the statistical attributes of underlying device fabrics and transistor aging to the spatial and temporal reliability of an entire circuit. For the first time, spatial and temporal stochastic and deterministic reliability effects are handled together in an efficient framework. The paper first introduces an equivalent transistor SPICE model comprising the currently most important aging effects (i.e., NBTI, hot carriers and soft breakdown). A simulation framework then uses this SPICE model to minimize the number of circuit factors and to build a circuit model, which allows, for example, very fast circuit yield analysis. Using experimental design techniques, the proposed method is very efficient and also proves to be very flexible. The simulation technique is demonstrated on an example 6-bit current-steering DAC, where the creation of soft breakdown spots can result in circuit failure due to increasing time-dependent transistor mismatch. Index Terms - NBTI, Hot Carrier Degradation, TDDB, SBD, HBD, Failure-Resilience, Aging, Design for Reliability.
Delay testing is performed to guarantee that a manufactured chip is free of delay defects and meets its performance specification. However, only a few delay faults are robustly testable. For robustly untestable faults, non-robust tests, which are of lesser quality, are typically generated. Due to significantly relaxed conditions, there is a large quality gap between non-robust and robust tests. This paper presents a test generation procedure for As-Robust-As-Possible (ARAP) tests to increase the overall quality of the test set. Instead of generating a non-robust test for a robustly untestable fault, an ARAP test is generated which maximizes the number of satisfiable conditions required for robust test generation by pseudo-Boolean optimization. Additionally, the problem formulation is extended to incorporate the increased significance of small delay defects, reducing the likelihood that small delay defects invalidate the test. Experimental results on large industrial circuits confirm the quality gap and show that the generated ARAP tests satisfy a large percentage of all robustness conditions on average, which signifies very high quality.
Functional broadside tests are two-pattern scan-based tests that avoid overtesting by ensuring that a circuit traverses only reachable states during the functional clock cycles of a test. On-chip test generation has the added advantage that it reduces test data volume and facilitates at-speed test application. This paper shows that on-chip generation of functional broadside tests can be done using simple hardware, and can achieve high transition fault coverage for testable circuits. With the proposed on-chip test generation method, the circuit is used for generating reachable states during test application. This alleviates the need to compute reachable states off-line.
Fault simulation of digital circuits must correctly compute fault coverage to assess test and product quality. In case of unknown values (X-values), fault simulation is pessimistic and underestimates actual fault coverage, resulting in increased test time and data volume, as well as higher overhead for design-for-test. This work proposes a novel algorithm to determine fault coverage with significantly increased accuracy, offering increased fault coverage at no cost, or the reduction of test costs for the targeted coverage. The algorithm is compared to related work and evaluated on benchmark and industrial circuits. Index Terms - Unknown values, fault coverage, precise fault simulation
Exhaustive state space exploration based verification of embedded system designs remains a challenge despite three decades of active research into Model Checking. On the other hand, simulation based verification of even critical embedded system designs is often subject to financial budget considerations in practice. In this paper, we suggest an algorithm that minimizes the overall cost of producing an embedded system including the cost of testing the embedded system and expected losses from an incompletely tested design. We seek to quantify the trade-off between the budget for testing and the potential financial loss from an incorrect design. We demonstrate that our algorithm needs only a logarithmic number of test samples in the cost of the potential loss from an incorrect validation result. We also show that our approach remains sound when only upper bounds on the potential loss and lower bounds on the cost of simulation are available. We present experimental evidence to corroborate our theoretical results.
Parallel stream processing applications are often executed on shared-memory multiprocessor systems. Synchronization between tasks is needed to guarantee correct functional behavior. An increase in the communication granularity of the tasks in the parallel application can decrease the synchronization overhead. However, using coarser-grained synchronization can result in deadlock or violation of the throughput constraint of the application in the case of cyclic data dependencies. Resynchronization tries to change the synchronization behavior in order to reduce the synchronization overhead. Determining the amount of resynchronization while preventing deadlock and satisfying the throughput constraint of the application forms a global analysis problem. In this paper we present a Linear Programming (LP) algorithm for minimizing synchronization by means of resynchronization, based on the properties of dataflow models. We demonstrate our approach with an extended Constant Modulus Algorithm (CMA) in a beam-forming application. For this application we reduce the number of synchronization statements by 30% under a memory constraint of 200 tokens. The algorithm which calculates this reduction takes less than 20 milliseconds for this problem instance.
Heterogeneous multi-core platforms are widely accepted for high-performance multimedia embedded systems. Although pipelining techniques can enhance performance on multi-core platforms, the data dependencies involved in processing compressed multimedia data make it difficult, if not impossible, to automate pipelined design. In this paper, we target multimedia streaming applications on heterogeneous multi-core platforms and develop the "Tile Piecing Algorithm" for pipelined schedule synthesis for the targeted applications and platforms. The algorithm gives an efficient way to construct a pipelined schedule. The performance evaluation shows that the algorithm utilizes the computation resources as well as the optimal algorithm does, while taking only hundreds of milliseconds to complete, less than one tenth of the running time of the optimal algorithm. Finally, the synthesized schedule is well packed. The short execution time and schedule makespan make the algorithm practical for use at run-time.
3D technologies using Through Silicon Vias (TSVs) have not yet proved their viability for deployment in a large range of products. In this paper, we investigate three promising perspectives for short- to medium-term adoption of such technology in high-end Systems-on-Chip built around multi-core architectures. The wide-bus concept will help solve high-bandwidth requirements with external memory. The 3D Network-on-Chip is a promising solution for increased modularity and scalability, and we show that an efficient implementation provides an available bandwidth outperforming classical interfaces. Finally, we put in perspective the active interposer concept, which aims at simplifying and improving power, test and debug management.
Keywords: 3D, TSV, Through Silicon Via, Network-on-Chip, NoC, power management, test, debug
3D stacked DRAM improves peak memory performance. However, its effective performance is often limited by constraints such as the row-to-row activation delay (tRRD) and the four-active-bank window (tFAW). In this paper, we present a quantitative analysis of the performance impact of such constraints. To resolve the problem, we propose balancing the budget of DRAM row activations across DRAM channels. In the proposed method, an inter-memory-controller coordinator receives the current demand for row activations from the memory controllers and redistributes the budget among them in order to improve DRAM performance. Experimental results show that sharing the row-activation budget between memory channels can give an average 4.72% improvement in the utilization of 3D stacked DRAM.
Panelists: O. Bringmann, C. Chevallaz, B. Dickman, V. Esen, and M. Rohleder
For years, people have been designing electronic and computing systems focusing on improving performance while "keeping power and energy consumption in mind". This is a way of designing energy-aware or power-efficient systems, where energy is considered a resource whose utilization must be optimized within the realm of performance constraints. Increasingly, energy and power turn from optimization criteria into constraints, sometimes as critical as, for example, reliability and timing. Furthermore, quanta of energy or specific levels of power can shape the system's action. In other words, the system's behavior, i.e. the way computation and communication are carried out, can be determined or modulated by the flow of energy into the system. This view becomes dominant when energy is harvested from the environment. In this paper, we attempt to pave the way to a systematic approach to designing computing systems that are energy-modulated. To this end, several design examples are considered where power comes from energy-harvesting sources with limited power density and unstable levels of power. Our design examples include voltage sensors based on self-timed logic and speed-independent SRAM operating in the dynamic range of Vdd 0.2-1V. Overall, this work advocates the vision of designing systems in which a certain quality of service is delivered in return for a certain amount of energy.
Keywords-charge-to-digital converter; energy; energy-frugality; energy-harvesting; power; power-proportionality; self-timed logic; SRAM; voltage sensor
Integrating coarse-grained reconfigurable architectures (CGRAs) into a System-on-a-Chip (SoC) presents many benefits as well as important challenges. One of the challenges is how to customize the architecture for the target applications efficiently and effectively without explicit design space exploration. In this paper we present a novel methodology for incremental interconnect customization of CGRAs that can suggest a new interconnection architecture that can maximize the performance for a given set of application kernels while minimizing the hardware cost. Applying the inexact graph matching analogy, we translate our problem into graph matching taking into account the cost of various graph edit operations, which we solve using the A∗ search algorithm with a heuristic tailored to our problem. Our experimental results demonstrate that our customization method can quickly find application-optimized interconnections that exhibit 70% higher performance on average compared to the base architecture, with relatively little hardware increase in interconnections and muxes.
We describe a parameterized memory system suitable as target for automatic high-level language to hardware compilers for reconfigurable computers. It fully supports the spatial computation paradigm by allowing the realization of each memory operator by a dedicated hardware memory port. Interport coherency is maintained only for those ports that actually require it, and efficient speculative execution is enabled by a dynamic scheme for arbitrating access to shared resources (such as main memory), relying on techniques inspired by the branch prediction of conventional software-programmable processors.
This paper presents an adaptable softcore chip multiprocessor (CMP). The processor instruction set architecture (ISA) is based on the VEX ISA. The issue-width of the processor can be adjusted at run-time (before an application starts). The processor has eight 2-issue cores that can run independently from each other. If not in use, each core can be taken to a lower power mode by gating off its source clock. Multiple 2-issue cores can be combined at run-time to form a variety of configurations of very long instruction word (VLIW) processors. The CMP is implemented in the Xilinx Virtex-6 XC6VLX240T FPGA. It has a single ISA and requires no specialized compiler support. The CMP can target a variety of applications having instruction and/or data level parallelism. We found that applications/kernels with larger instruction level parallelism (ILP) perform better when run on a larger issue-width core, while applications with larger data level parallelism (DLP) perform better when run on multiple 2-issue cores with the data distributed among the cores.
Conventional clock skew scheduling for sequential circuits can be formulated as a minimum cycle ratio (MCR) problem, and hence can be solved effectively by methods such as Howard's algorithm. However, its application is practically limited due to the difficulties in reliably implementing a large set of arbitrary dedicated clock delays for the flip-flops. Multi-domain clock skew scheduling was proposed to tackle this impracticality by constraining the total number of clock delays. Even though this problem can be formulated as a mixed integer linear program (MILP), it is expensive to solve optimally in general. In this paper, we show that, under mild restrictions, the underlying domain assignment problem can be formulated as a special MILP that can be solved effectively using techniques similar to those for the MCR problem. In particular, we design a generalized Howard's algorithm for solving this problem efficiently. We also develop a critical-cycle-oriented refinement algorithm to further improve the results. The experimental results on ISCAS89 benchmarks show both the accuracy and efficiency of our algorithm. For example, only 4.3% of the tests have larger than 1% degradation (3% in the worst case), and all the tests finish in less than 0.7 seconds on a laptop with a 2.1GHz processor.
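The MCR formulation underlying the abstract above can be sketched in a few lines: reweight each edge as cost − λ·time and bisect on λ using negative-cycle detection. This is a generic minimum-cycle-ratio sketch, not the paper's generalized Howard's algorithm (which converges much faster), and the tiny graph below is invented for illustration; mapping clock-skew constraints onto such a graph is not shown.

```python
def has_neg_cycle(n, edges, lam):
    """Bellman-Ford from a virtual zero source: True iff some cycle has
    sum(cost - lam*time) < 0, i.e. a cycle ratio below lam exists."""
    dist = [0.0] * n
    for _ in range(n):
        changed = False
        for u, v, cost, time in edges:
            w = cost - lam * time
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False          # converged: no cycle below lam
    return True                   # still relaxing after n passes

def min_cycle_ratio(n, edges, lo=0.0, hi=100.0, iters=60):
    """Bisect on lam: the MCR is the threshold where a negative
    (reweighted) cycle first appears."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if has_neg_cycle(n, edges, mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

For example, a graph with one cycle of cost/time ratio 4/2 and another of ratio 6/2 has MCR 2.0.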
A new class of delay-insensitive (DI) codes, called DI Bus-Invert, is introduced for timing-robust global asynchronous communication. This work builds loosely on an earlier synchronous bus-invert approach for low power by Stan and Burleson, but with significant modifications to ensure that delay-insensitivity is guaranteed. The goal is to minimize the average number of wire transitions per communication (a metric for dynamic power), while maintaining good coding efficiency. Basic implementations of the key supporting hardware blocks (encoder, completion detector, decoder) for the DI bus-invert codes are also presented. Each design was synthesized using the UC Berkeley ABC tool and technology mapped to a 90nm industrial standard cell library. When compared to the most coding-efficient systematic DI code (i.e. Berger) over a range of field sizes from 2 to 14 bits, the DI bus-invert codes had 24.6 to 42.9% fewer wire transitions per transaction, while providing comparable coding efficiency. In comparison to the most coding-efficient non-systematic DI code (i.e. m-of-n), the DI bus-invert code had similar coding efficiency and number of wire transitions per transaction, but with significantly lower hardware overhead.
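For reference, the classic synchronous bus-invert scheme of Stan and Burleson that this work builds on can be sketched as follows: invert the word whenever more than half of the bus wires would toggle, signalling the choice on one extra invert line. The paper's delay-insensitive codes modify this scheme substantially, which the sketch does not capture.

```python
def bus_invert_encode(prev_bus, data):
    """Return (bus_word, invert_bit): invert when more than half the
    data wires would toggle relative to the previous bus state."""
    n = len(data)
    toggles = sum(p != d for p, d in zip(prev_bus, data))
    if toggles > n // 2:
        return [1 - d for d in data], 1
    return list(data), 0

def bus_invert_decode(bus_word, invert_bit):
    """Recover the original data word from the bus and the invert line."""
    return [1 - b for b in bus_word] if invert_bit else list(bus_word)
```

With a 4-bit bus previously holding 0000, sending 1110 (three toggles) is encoded inverted as 0001 with the invert line asserted, cutting the data-wire transitions from three to one.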
The class of speed-independent (SI) circuits opens a promising way towards tolerating process variations. However, the fundamental assumption of SI circuits is that forks in some wires (usually, a large percentage of wires) are isochronic; this assumption is increasingly challenged by shrinking technology. This paper suggests a method to generate the weakest timing constraints under which an SI circuit works correctly with bounded wire delays. The method works for all SI circuits, and the generated timing constraints are significantly weaker than the weakest formally proven conditions suggested in the current literature.
This paper describes an approach to pipelining in high-level
synthesis that modifies the control/data flow graph before and after
scheduling. This enables the direct re-use of a pre-existing, timing- and
area-aware non-pipelined simultaneous scheduler and binder. Such an
approach ensures that the RTL output can be synthesized within the
given timing and area constraints. Results from real industrial designs
show the effectiveness of this approach in improving Pareto optimality
with respect to area, delay and power.
Keywords- pipelining, high-level synthesis, design exploration
Arithmetic blocks consume a major portion of chip area, delay and power, and the arithmetic sum-of-product (SOP) is a widely used block. We introduce a novel binary integer linear program (BLP) based algorithm for optimising a general class of mutually exclusive SOPs. Benchmarks, drawn from existing literature and standard APIs or constructed for demonstration purposes, exhibit speed improvements of up to 16% and area reductions of up to 57% in a 65nm TSMC process.
Creating parameterized "chip generators" has been proposed as one way to decrease chip NRE costs. While many approaches are available for creating or generating flexible data path elements, the design of flexible controllers is more problematic. The most common approach is to create a microcoded engine as the controller, which offers flexibility through programmable table-based lookup functions. This paper shows that after "programming" the hardware for the desired application, or applications, these flexible controller designs can be easily converted to efficient fixed (or less programmable) solutions using partial evaluation capabilities that are already present in most synthesis tools.
Heterogeneous datapaths maximize the utilization of
functional units (FUs) by customizing their widths individually
through fragmentation of wide operands. In comparison, slices in
large functional units in a homogeneous datapath can spend
many cycles not performing useful work. Various fragmentation
techniques have demonstrated benefits in minimizing the total
functional unit area. On closer inspection of fragmentation
techniques, we observe that the area savings achieved by
heterogeneous datapaths can be traded off for power
optimization. Our specific approach is to introduce choices of
functional units with power/area trade-offs for different
fragmentation and allocation choices, in order to reduce power
consumption while satisfying the area constraint imposed on the
heterogeneous datapath. As low-power FUs in the literature incur
an area penalty, a methodology is needed to introduce them into
the HLS flow while complying with the area constraint. We
propose allocation and module selection algorithms that pursue a
trade-off between area and power consumption for fragmented
datapaths under a total area constraint. Results show that it is
possible to reduce power by 37% on average (49% in the best
case). Moreover, latency and cycle time remain equal or nearly
equal to the baseline, which leads to an energy reduction as well.
Keywords: low-power, area, HLS
This work presents a high-level synthesis methodology that uses the abstract state machines (ASMs) formalism as an intermediate representation (IR). We perform scheduling and allocation on this IR, and generate synthesizable VHDL. Using ASMs as an IR has the following advantages: 1) it allows the specification of both sequential and parallel computation, 2) it supports an extension of a clean timing model based on an interpretation of the sequential semantics, and 3) it has well-defined formal semantics, which allows the integration of formal methods into the methodology. While we specify our designs using ASMs, we do not mandate this: one can instead create translators that convert algorithmic specifications from C-like languages into equivalent ASM specifications, making the hardware synthesis transparent to the designer. We evaluate our methodology on examples of an FIR filter, a microprocessor, and an edge detector. We synthesize these designs and validate them on an FPGA.
The IEEE P1687 (IJTAG) standard proposal aims
at standardizing the access to embedded test and debug logic
(instruments) via the JTAG TAP. P1687 specifies a component
called Segment Insertion Bit (SIB) which makes it possible to
construct a multitude of alternative P1687 instrument access
networks for a given set of instruments. Finding the best access
network with respect to instrument access time and the number
of SIBs is a time-consuming task in the absence of EDA support.
This paper is the first to describe a P1687 design automation
tool which constructs and optimizes P1687 networks. Our EDA
tool, called PACT, considers the concurrent and sequential access
schedule types, and is demonstrated in experiments on industrial
SOCs, reporting total access time and average access time.
Keywords-IEEE P1687 IJTAG, Design Automation, Instrument
Access, Access Time Optimisation
3D integration of ICs is an emerging technology in which
multiple silicon dies are stacked vertically. The manufacturing
itself is based on wafer-to-wafer bonding, die-to-wafer bonding or
die-to-die bonding. Wafer-to-wafer bonding has the lowest yield,
as a good die may be stacked against a bad die, wasting the good
die. Thus the latter two options are preferred to keep yield high
and manufacturing costs low. However, these methods require
dies to be tested separately before they are stacked. A problem
with testing dies separately is that the clock network of a prebond
die may be incomplete before stacking. In this paper we present a
solution to this problem, based on on-die DLL implementations
that are activated only when testing prebond, unstacked dies, in
order to synchronize disconnected clock regions. A difficulty with
using DLLs in testing is that they cannot be turned on or off
within a single cycle. Since scan-based testing requires that test
patterns be scanned in at a slow clock frequency before fast
capture clocks are applied [1], on-product clock generation
(OPCG) must be used. The proposed solution addresses both
problems. Furthermore, we show that a higher-speed DLL is
better suited not only to high-frequency system clocks but also to
lower power, owing to its smaller variable delay line.
Keywords-3D integrated circuit testing, delay lock loops, low
power testing, on-product clock generation
3D IC technology has demonstrated significant performance and power gains over 2D. However, for the technology to be viable, yield must be increased. Testing a complete 3D IC only after stacking leads to an exponential decay in yield, so pre-bond tests are required to ensure correct functionality of each die. In this work we propose a hypergraph-based biased netlist partitioning scheme for pre-bond testing of individual dies that reduces the extra hardware (flip-flops) required. Further hardware reduction is achieved by a logic-cone-based flip-flop sharing scheme. Simulation results on ISCAS89 benchmark circuits and several industrial benchmarks demonstrate the effectiveness of the proposed approach.
Production test suites include a large number of redundant test patterns due to the inclusion of multiple test types with overlapping defect detection and the use of simple fault models for test generation. Identification and elimination of ineffective test patterns promises a significant reduction in test cost. This paper proposes a test framework that learns, without extensive data collection and at no additional test time, the effectiveness of individual test patterns during production testing by getting defect detection feedback from a dynamic test flow. The proposed technique is further capable of adapting to changes in the underlying defect mechanisms by tracking the defect detection trend of test patterns.
The determination of the optical flow is a central problem in image processing, as it describes how an image changes over time by means of a numerical vector field. Estimating the optical flow is, however, a very complex problem that has been tackled with many different mathematical approaches. A large body of work has recently been published on variational methods, following the technique for total variation minimization proposed by Chambolle. Still, their hardware implementations do not offer good performance in terms of frames processed per time unit, mainly because of the complex dependency scheme among the data. In this work, we propose a highly parallel and accelerated FPGA implementation of the Chambolle algorithm, which splits the original image into a set of overlapping sub-frames and efficiently exploits the reuse of intermediate results. We validate our hardware on large frames (up to 1024x768); the proposed approach significantly improves on state-of-the-art implementations, reaching up to 76x speedups, which enables real-time frame rates even at high resolutions.
Object detection is a vital task in several emerging
applications, requiring real-time detection frame-rate and low
energy consumption for use in embedded and mobile devices.
This paper proposes a hardware-based, depth-directed search
method for reducing the search space involved in object
detection, resulting in significant speed-ups and energy savings.
The proposed architecture utilizes the disparity values computed
from a stereoscopic camera setup, in an attempt to direct the
detection classifier to regions that contain objects of interest. By
eliminating large amounts of search data, the proposed system
achieves both performance gains and reduced energy
consumption. FPGA simulation results indicate performance
speedups of up to 4.7x and energy savings ranging from 41% to
48%, compared to the traditional sliding-window approach.
Keywords- Hardware Object Detection; Stereoscopic Disparity
Computation; FPGA Image Processing
This paper presents a novel motion and disparity estimation (ME, DE) scheme in Multiview Video Coding (MVC) that addresses the high throughput challenge jointly at the algorithm and hardware levels. Our scheme is composed of a fast ME/DE algorithm and a multi-level pipelined parallel hardware architecture. The proposed fast ME/DE algorithm exploits the correlation available in the 3D-neighborhood (spatial, temporal, and view). It eliminates the search step for different frames by prioritizing and evaluating the neighborhood predictors. It thereby reduces the coding computations by up to 83% with 0.1dB quality loss. The proposed hardware architecture further improves the throughput by using parallel ME/DE modules with a shared array of SAD (Sum of Absolute Differences) accelerators and by exploiting the four levels of parallelism inherent to the MVC prediction structure (view, frame, reference frame, and macroblock levels). A multi-level pipeline schedule is introduced to reduce the pipeline stalls. The proposed architecture is implemented for a Xilinx Virtex-6 FPGA and as an ASIC with an IBM 65nm low power technology. It is compared to state-of-the-art at both algorithm and hardware levels. Our scheme achieves a real-time (30fps) ME/DE in 4-view High Definition (HD1080p) encoding with a low power consumption of 81 mW.
The objective of this work is the systematic study of the use of electrochemical readout for advanced diagnosis and drug monitoring. Whereas to date various electrochemical principles have been studied and successfully tested, they typically operate on a single target molecule and are not integrated in a full data analysis chain. The present work aims to review various sensing approaches and explore the design space for integrated realization of multi-target sensors and sensor arrays. Index Terms - biosensor, integrated circuit, metabolite, oxidase, cytochrome P450, potentiostat.
The field of Wireless Sensor Networks (WSNs) is now in a stage where serious applications of societal and economical importance are in reach. For example, it is well known that the global climate change dramatically influences the visual appearance of mountain areas like the European Alps. Very destructive geological processes may be triggered or intensified, impacting the stability of slopes, possibly inducing landslides. Unfortunately, the interactions between these complex processes are poorly understood. Therefore, one needs to develop wireless sensing technology as a new scientific instrument for environmental sensing under extreme conditions. Large variations in temperature, humidity, mechanical forces, snow coverage, and unattended operation play a crucial role in long-term deployments. We argue that, in order to significantly advance the application domain, it is inevitable that sensor networks be created as a quality scientific instrument with known and predictable properties, and not as a research toy delivering average observations at best. In this paper, key techniques for achieving highly reliable, yet resource efficient wireless sensor networks are discussed on the basis of productive wireless sensor networks measuring permafrost processes in the Swiss Alps.
New tendencies envisage 3D Multi-Processor System-On-Chip (MPSoC) design as a promising solution to keep increasing the performance of the next-generation high-performance computing (HPC) systems. However, as the power density of HPC systems increases with the arrival of 3D MPSoCs, supplying electrical power to the computing equipment and constantly removing the generated heat is rapidly becoming the dominant cost in any HPC facility. Thus, both power and thermal/cooling implications play a major role in the design of new HPC systems, given the energy constraints in our society. Therefore, EPFL, IBM and ETHZ have been working within the CMOSAIC Nano-Tera.ch program project in the last three years on the development of a holistic thermally-aware design. This paper presents the exploration in CMOSAIC of novel cooling technologies, as well as suitable thermal modeling and system-level design methods, which are all necessary to develop 3D MPSoCs with inter-tier liquid cooling systems. As a result, we develop run-time thermal control strategies that achieve energy-efficient cooling while compressing almost 1 Tera nano-sized functional units into one cubic centimeter with a 10 to 100 fold higher connectivity than otherwise possible. The proposed thermally-aware design paradigm includes exploring the synergies of hardware-, software- and mechanical-based thermal control techniques as a fundamental step to design 3D MPSoCs for HPC systems. More precisely, we target the use of inter-tier coolants ranging from liquid water and two-phase refrigerants to novel engineered environmentally friendly nano-fluids, as well as using specifically designed micro-channel arrangements, in combination with the use of dynamic thermal management at system-level to tune the flow rate of the coolant in each micro-channel to achieve thermally-balanced 3D-ICs.
Our management strategy prevents the system from surpassing the given threshold temperature while achieving up to 67% reduction in cooling energy and up to 30% reduction in system-level energy in comparison to setting the flow rate at the maximum value to handle the worst-case temperature.
Recognizing the importance of interfacing a variety of sensors and networking such sensors around the body area and by cellular services, a Swiss project within the Nano-Tera.ch Initiative is dedicated to developing a platform of circuit technologies for medical data acquisition and communication.
The paper discusses reliability threats and opportunities for analog circuit design in high-k sub-32 nanometer technologies. Compared to older SiO2 or SiON based technologies, transistor reliability is found to be worse in high-k nodes due to larger oxide electric fields, the severely aggravated PBTI effect and increased time-dependent variability. Conventional reliability margins, based on accelerated stress measurements on individual transistors, are neither sufficient nor adequate for analog circuit design. As a means to find more accurate, circuit-dependent reliability margins, advanced degradation effect models are reviewed and an efficient method for stochastic circuit reliability simulation is discussed. Also, an example 6-bit 32nm current-steering digital-to-analog converter is studied. Experiments demonstrate how the proposed simulation tool, combined with novel design techniques, can provide an up to 89% better area-power product of the analog part of the circuit under study, while still guaranteeing a 99.7% yield over a lifetime of 5 years. Index Terms - NBTI, PBTI, Hot Carriers, TDDB, SBD, HBD, Failure-Resilience, Aging, Design for Reliability, High-k CMOS.
Quantitative simulations of the statistical impact of
negative-bias-temperature-instability (NBTI) on pMOSFETs,
and positive-bias-temperature-instability (PBTI) on nMOSFETs
are carried out for a 45nm low power technology generation.
Based on the statistical simulation results, we investigate the
impact of NBTI and PBTI on the degradation of the static noise
margin (SNM) of SRAM cells. The results indicate that SNM
degradation due only to NBTI follows a different evolution
pattern compared with the impact of simultaneous NBTI and
PBTI degradation.
Keywords-NBTI; PBTI; Statistical Variability; SRAM; Static
Noise Margin
This paper describes a Design Of Experiments (DOE) based method used in computer-aided design to simulate the impact of process variations on circuit performances. The method is based on a DOE approach using simple first and second order polynomial models with multiple experiment maps. It is a technology- and circuit-independent method which allows circuit designers to perform statistical analysis with a dramatically reduced number of simulations compared to traditional methods, and hence to estimate more realistic worst cases, resulting in a reduced design cycle time. Moreover, the simple polynomial models enable direct linking of performance sensitivity to process parameters. The method is demonstrated on a set of circuits. It showed very accurate results in linking linearity, gain and noise performances to process parameters, for both RF and analog circuits.
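The response-surface idea behind such a DOE method can be sketched as follows: fit a second-order polynomial model of a circuit performance to a small factorial experiment map, then read the sensitivity to each process parameter directly off the coefficients. The two-factor setup and all numeric values below are invented for illustration; the vector `y` stands in for circuit simulation results.

```python
import numpy as np

def design_matrix(X):
    """Second-order model terms for two factors: 1, x1, x2, x1^2, x2^2, x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_second_order(X, y):
    """Least-squares fit of the polynomial coefficients to the experiments."""
    coef, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)
    return coef

# 3-level full-factorial experiment map: 9 "simulations" for 6 coefficients.
X = np.array([(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)], dtype=float)
true = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 3.0])   # hypothetical sensitivities
y = design_matrix(X) @ true                         # stands in for SPICE runs
coef = fit_second_order(X, y)
```

With noise-free data the six coefficients are recovered exactly, and each linear coefficient is directly the first-order sensitivity of the performance to that process parameter.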
In this paper, we propose a system-assisted analog mixed-signal (SAMS) design paradigm whereby the mixed-signal components of a system are designed in an application-aware manner in order to minimize power and enhance robustness in nanoscale process technologies. In a SAMS-based communication link, the digital and analog blocks from the output of the information source at the transmitter to the input of the decision device in the receiver are treated as part of the composite channel. This comprehensive systems-level view enables us to compensate for impairments of not just the physical communication channel but also the intervening circuit blocks, most notably the analog/mixed-signal blocks. This is in stark contrast to what is done today, which is to treat the analog components in the transmitter and the analog front-end at the receiver as transparent waveform preservers. The benefits of the proposed system-aware mixed-signal design approach are illustrated in the context of analog-to-digital converters (ADCs) for high-speed links. CAD challenges that arise in designing system-assisted mixed-signal circuits are also described.
In state-of-the-art multi-processor systems-on-chip (MPSoC), the interconnect of processing elements has a major impact on the system's overall average-case and worst-case performance. Moreover, in real-time applications, predictability of on-chip communication latency is imperative for bounding the response time of the overall system. In shared-memory MPSoCs, buses are still the prevalent means of on-chip communication for small to medium size chip multi-processors (CMPs). Still, bus arbitration schemes employed in current architectures either deliver good average-case performance (i.e. maximize bus utilization) or enable tight bounding of worst-case execution time. This paper presents a shared bus arbitration approach allowing high bus utilization while guaranteeing a fixed bandwidth per time frame to each master. Thus it provides high performance to both real-time and any-time applications, or even a mixture of both. The paper includes performance results obtained while executing random traffic on a shared bus implemented on an FPGA. The results show that our approach provides bus utilization close to static-priority-based arbitration, a fairer bandwidth distribution than round-robin, and latency guarantees identical to TDMA. With this, it combines the best properties of these schemes.
Due to continuing advances in wireless communication and sensors, Wireless Sensor Networks (WSN) have been widely used in several fields, such as medicine, science, industrial automation and security. A possible solution is to use CMOS System on Chip (SoC) sensor nodes as hardware platforms, owing to their extremely low power consumption and their sensing, computation and communication capabilities. This work presents the modeling of a mixed-signal SoC for WSN using a system-level approach. The digital section was modeled using SystemC Transaction Level Modeling (TLM) and consists of a 32-bit RISC microprocessor, memory, interrupt controller and serial interface. The analog block consists of an Analog-to-Digital Converter (ADC) described in SystemC-AMS. An application was implemented to test the correctness of the model and perform the communication between the SoC and a functional level node model.
This paper proposes a unified framework for the hardware/software codesign of body sensor network applications that aims to enhance both modularity and reusability. The proposed framework consists of a Unified Modeling Language (UML) 2 profile for TinyOS applications and a corresponding simulator. The UML profile allows for the description of the low-level details of the hardware simulator, thereby providing a higher level of abstraction for application developers to visually design, document and maintain their systems that consist of both hardware and software components. With the aid of a predefined component repository, minimum TinyOS knowledge is needed to construct a body sensor network system. A novel feature of our framework is that we have modeled not only software applications but the simulator platform in UML. A new instance of the simulator can be automatically generated whenever hardware changes are made. Key design issues, such as timing and energy consumption can be tested by simulating the generated software implementation on the automatically customized simulator. The framework ensures a separation of software and hardware development while maintaining a close connection between them. This paper describes the concepts and implementation of the proposed framework, and presents how the framework is used in the development of nesC-TinyOS based body sensor network applications. Two actual case studies are used to show how the proposed framework can quickly and automatically adapt the software implementation to efficiently accommodate hardware changes.
This paper presents a new asynchronous design template using single-track handshaking that targets medium-to-high performance applications. Unlike other single-track templates, the proposed template supports multiple levels of logic per pipeline stage, improving area efficiency by sharing the control logic among more logic while at the same time providing higher robustness to timing variability. The template also yields higher throughput than most four-phase templates and lower latency than bundled-data templates. The template has been incorporated into the asynchronous ASIC flow Proteus and experiments on ISCAS benchmarks show significant improvement in achievable throughput per area.
Coarse Grained Reconfigurable Arrays (CGRAs) are a promising class of architectures combining flexibility and efficiency. Devising effective methodologies to map applications onto CGRAs is a challenging task, due to their parallel execution paradigm and sparse interconnection topology. In this paper we present a scheduling framework that is able to efficiently map operations onto CGRA architectures. It leverages differences in the delays of various operations, which a reconfigurable architecture always exhibits at run-time, to effectively route data. We call this ability "slack-awareness". Experimental evidence showcases the benefit of slack-aware scheduling in a coarse-grained reconfigurable environment, as more complex applications can be mapped for a given mesh size and more efficient schedules can be achieved, compared to state-of-the-art methods.
In this paper, we propose a technique for custom instruction
(CI) extension that considers process variations, bridging the gap
between high-level custom instruction extension and chip
fabrication in nanometer technologies. In the proposed method,
instead of conventional static timing analysis (STA), statistical
static timing analysis (SSTA) is utilized, which in turn results in a
probabilistic approach to identifying and selecting different parts
of the CI extension. More precisely, we use the delay probability
density function (PDF) of the CIs in the identification and
selection phases of the CI extension. In the identification phase,
the delay of each CI is modeled by a PDF, and the performance
yield is added as a constraint. Additionally, in the selection phase,
the merit function of conventional approaches is modified to
increase the performance gain of the selected CIs at the price of
slightly sacrificing the design yield. Finally, to make the approach
computationally more efficient, we propose a method for
reducing the time needed to model the PDFs of the CIs by
reducing the number of candidate CIs before extracting the
PDFs.
Keywords- ASIP, Custom Instruction, Process Variation, PDF
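The yield-constrained selection described above can be illustrated with a small sketch: each candidate CI carries a delay PDF represented by Monte Carlo samples, and a CI is selectable only if its performance yield (probability of meeting the clock period) satisfies the constraint. The candidate data, field names, and the simple speedup merit are all invented assumptions, not the paper's actual merit function.

```python
def performance_yield(delay_samples, t_clk):
    """Fraction of process samples in which the CI meets the clock period."""
    return sum(d <= t_clk for d in delay_samples) / len(delay_samples)

def select_cis(candidates, t_clk, min_yield):
    """Keep only yield-feasible CIs, ordered by performance merit
    (here simply the cycles saved by the custom instruction)."""
    feasible = [c for c in candidates
                if performance_yield(c["delay"], t_clk) >= min_yield]
    return sorted(feasible, key=lambda c: -c["speedup"])

# Hypothetical candidates with pre-extracted delay samples (ns).
candidates = [
    {"name": "ci_mac", "speedup": 10, "delay": [1.0, 1.1, 1.2, 2.5]},
    {"name": "ci_sad", "speedup": 5,  "delay": [0.5, 0.6, 0.7, 0.8]},
]
chosen = select_cis(candidates, t_clk=1.3, min_yield=0.9)
```

Note how the higher-speedup `ci_mac` is rejected: its delay PDF puts too much probability above the clock period (yield 0.75), illustrating why an STA worst-case or mean-delay view would pick differently.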
This paper introduces a novel compact implicit model for a probabilistic set of waveforms (PSoW), which arises as a representation for uncertain signal waveforms in Statistical Static Timing Analysis (SSTA). In traditional SSTA tools, signals are represented only as (distributions of) arrival time and slew. In our approach, PSoWs are used instead to increase accuracy. However, representing PSoWs explicitly requires a very large amount of data, which can be problematic. To solve this problem, a compact implicit model is introduced, which can be characterized with just a handful of parameters. The results obtained show that the implicit model can generate real-life PSoWs with high accuracy.
A genetic programming-based circuit synthesis method is proposed that enables global optimization of the number of gates in circuits that have already been synthesized using common tools such as ABC and SIS. The main contribution is a new fitness function that significantly reduces the fitness evaluation time compared to the state of the art. The fitness function performs optimized equivalence checking using a SAT solver. It is shown that the equivalence checking time can be significantly reduced when knowledge of the parent circuit and its mutated offspring is taken into account. At the cost of additional runtime, results of conventional synthesis conducted using SIS and ABC were improved by 20-40% on the LGSynth93 benchmarks.
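The core idea of an equivalence-checking fitness function can be shown with a brute-force stand-in: for tiny circuits the miter can simply be evaluated on all input vectors. The two-input "circuits" below are hypothetical lambdas; the paper's method instead uses a SAT solver and exploits parent/offspring knowledge to scale far beyond this sketch.

```python
from itertools import product

def equivalent(f, g, n_inputs):
    """Exhaustive miter check: True iff f and g agree on all 2^n inputs."""
    return all(f(*v) == g(*v) for v in product((0, 1), repeat=n_inputs))

def fitness(candidate, reference, n_inputs, gate_count):
    """Reward smaller circuits, but only if functionality is preserved;
    non-equivalent mutants get no fitness at all."""
    return -gate_count if equivalent(candidate, reference, n_inputs) else None

xor_ref  = lambda a, b: a ^ b                            # reference function
xor_cand = lambda a, b: (a & (1 - b)) | ((1 - a) & b)    # equivalent rewrite
bad_cand = lambda a, b: a & b                            # broken mutation
```

A genetic loop would mutate gate-level candidates and keep the equivalent one with the best (least-negative) fitness.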
Statistical design approaches have been studied intensively in
the last decade to deal with process variability, and statistical
delay fault testing is one of the key techniques for statistical
design. Various techniques have been proposed to represent the
distributions of timing information such as gate delays, signal
arrival times, and slacks. Among them, the Gaussian mixture
model is distinguished from the others in that it can easily handle
any correlation, non-Gaussian distributions, and slew
distributions. However, the previous method of computing the
statistical maximum for Gaussian mixture models has a defect: in
certain cases it produces a distribution similar to a Gaussian even
though the correct distribution is far from Gaussian. In this
paper, we propose a novel method for the statistical maximum
(minimum) operation on Gaussian mixture models. It takes the
cumulative distribution function curve into consideration so as to
compute accurate criticalities (probabilities of timing violation),
which is important for detecting delay faults and for circuit
optimization. The proposed method reduces the error in
criticality by almost 80% compared with the previous method.
Keywords: criticality; probability of timing violation; statistical static timing analysis; Gaussian mixture model; cumulative distribution curve
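The abstract does not give the operator itself, but the role of the CDF curve can be illustrated with a minimal sketch: for two arrival times modeled as Gaussian mixtures, and under the simplifying assumption of independence, the CDF of their maximum is the product of their CDFs, and the criticality at clock period T is one minus that value. The independence assumption and the component values below are illustrative only, not the paper's method:

```python
import math

def gmm_cdf(t, comps):
    # comps: list of (weight, mean, std) Gaussian mixture components
    return sum(w * 0.5 * (1 + math.erf((t - mu) / (sd * math.sqrt(2))))
               for w, mu, sd in comps)

def max_cdf(t, x_comps, y_comps):
    # For independent X and Y: P(max(X, Y) <= t) = P(X <= t) * P(Y <= t)
    return gmm_cdf(t, x_comps) * gmm_cdf(t, y_comps)

def criticality(clock_period, x_comps, y_comps):
    # Probability that the max arrival time violates the clock period
    return 1.0 - max_cdf(clock_period, x_comps, y_comps)
```

Working on the CDF curve directly, as here, keeps the non-Gaussian shape of the maximum instead of collapsing it to a single Gaussian.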
Modern logic synthesis systems apply a sequence of loosely-related, function-preserving transformations to gradually improve a circuit with respect to criteria such as area, performance, and power. The order in which the transformations are applied is critical to the quality of a complete synthesis run, as different orders can produce vastly different outcomes. In practice, the transformation sequence is encoded in synthesis scripts that are derived manually based on the experience and intuition of the tool developer. These scripts are static in the sense that transformations are applied independently of the results of previous transformations or the current state of the design. Despite the importance of obtaining high-quality scripts, there have been only a few attempts to optimize them. In this paper, we present a novel method that selects transformations dynamically during the synthesis run by leveraging the theory of Markov Decision Processes. The decision to select a particular transformation is based on transition probabilities, the history of the applied synthesis steps, and expectations for future steps. We report experimental results obtained from an implementation of the approach in the logic synthesis system ABC.
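The paper's MDP formulation is not spelled out in the abstract. As a loose illustration only, a policy that conditions the choice of the next transformation on the previously applied step and on learned expected gains might look like this; the gain table, action names, and epsilon-greedy exploration are all assumptions for the sketch, not the authors' algorithm:

```python
import random

def select_transformation(history, gains, actions, epsilon=0.1):
    """Pick the next synthesis transformation.

    gains[(prev_action, action)] holds the expected quality-of-result
    improvement of applying `action` right after `prev_action`
    (hypothetically learned offline from training runs).
    """
    prev = history[-1] if history else None
    if random.random() < epsilon:
        return random.choice(actions)      # occasional exploration
    # Greedy choice conditioned on the last applied step
    return max(actions, key=lambda a: gains.get((prev, a), 0.0))
```

A static script would ignore `history` entirely; the point of the dynamic approach is precisely that the next step depends on what has been applied so far.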
The scaling of MOSFETs has improved performance and lowered the cost per function of CMOS integrated circuits and systems over the last 40 years, but devices are subject to increasing amounts of statistical variability within the decanano domain. The causes of these statistical variations and their effects on device performance have been extensively studied, but there have been few systematic studies of their impact on circuit performance. This paper describes a method for modelling the impact of random intra-die statistical variations on digital circuit timing and power consumption. The method allows the variation modelled by large-scale statistical transistor simulations to be propagated up the design flow to the circuit level, by making use of commercial STA and standard cell characterisation tools. The method provides circuit designers with the information required to analyse power, performance and yield trade-offs when fabricating a design, while removing the large levels of pessimism generated by traditional Corner Based Analysis.
Panelists: L Bomhold, T Green, A Ephrimides, C Blumstein
Until recently the electrical power industry has relied solely on traditional technologies - copper and iron in the form of cables, transformers and machines - as the mainstream solution for the generation, transmission and distribution of power. Whilst these materials and technologies are here to stay, improvements in power semiconductor technology mean that the industry is moving into a position where more, and faster, control of power systems can be achieved. This high-level control requires a sensing and communication infrastructure to be put in place across the network. At the same time, the potential of real-time consumer pricing means that the use of electricity in the home also requires new technologies. This panel session aims to bring together heavy-current electrical power engineers and light-current electronic engineers for a discussion and debate about the future role of EDA in applications brought about by changes in the functioning of the power industry. Power engineers from both industry and academia will stimulate the discussion with requirements from both a system perspective and a consumer perspective. The representatives from the EDA side will respond with the contributions they believe EDA can make, what already exists or is a simple development problem, and what research issues remain in achieving these goals. In summary, this panel aims to motivate the EDA industry to work on useful technology that can be applied to heavy power systems, with a view to improving global energy efficiency.
This paper introduces the first available tool flow for Dynamic Partial Reconfiguration on the Spartan-6 family. In addition, the paper proposes a new configuration method called Fast Start-up targeting modern FPGA architectures, in which the FPGA is configured in two steps instead of a single (monolithic) full device configuration. In this novel approach, only the timing-critical modules are loaded at power-up using a first high-priority bitstream, while the non-timing-critical modules are loaded afterwards. This two-step, prioritized FPGA start-up is used to meet the extremely tight start-up timing specifications found in many modern applications, such as PCI Express or automotive applications. Finally, the developed tool flow and methods for Fast Start-up have been used to implement a CAN-based automotive ECU on a Spartan-6 evaluation board (SP605). With this approach, it was possible to decrease the initial bitstream size and hence achieve a configuration time speed-up of up to 4.5x compared to a standard configuration solution.
Within the context of reconfigurable architectures, we define a kernel loop (K-loop) as a loop containing one or more kernels mapped on the reconfigurable hardware in the loop body. In this paper, we analyze how loop distribution can be used in the context of K-loops. We propose an algorithm for splitting K-loops that contain more than one kernel and intra-iteration dependencies. The purpose is to create smaller loops (K-sub-loops) with more speedup potential when parallelized. Making use of partial reconfigurability, the K-sub-loops can take advantage of having more area available for multiple kernel instances to execute in parallel on the FPGA. To study the potential performance improvement of using loop distribution on K-loops, we use a suite of randomly generated test cases. The results show an improvement of more than 40% over previously proposed methods in more than 60% of the cases. The algorithm is also validated with a K-loop extracted from the MJPEG application. A maximum speedup of 8.22 is achieved when mapping MJPEG on a Virtex-II Pro with partial reconfiguration, and 13.41 when statically mapping it on a Virtex-4.
We present a run-time system for a multi-grained reconfigurable processor that provides a dynamic trade-off between performance and available area budgets for both fine- and coarse-grained reconfigurable fabrics within one reconfigurable processor. Our run-time system is the first implementation of its kind that dynamically selects and steers a performance-maximizing multi-grained instruction set under run-time varying constraints. It achieves a performance improvement of more than 2x compared to state-of-the-art run-time systems for multi-grained architectures. To further elaborate the benefits of our approach, we also compare it with offline- and online-optimal instruction-set selection schemes.
This paper describes two algorithms for the selective assignment of input don't cares (DCs) for logical derating of input errors to enhance reliability. It is motivated by the observation that reliability-driven assignment of DCs can improve input error resilience by up to 49.7% in logic circuits. Two algorithms - ranking-based and complexity-factor-based - for reliability-driven DC assignment are proposed in this paper. Both algorithms use Hamming distance metrics to determine 0/1 assignments for the most critical DC terms, thereby leaving flexibility in the circuit specification for subsequent optimization. Since ranking-based DC assignment offers less control over overhead, we develop a complexity-factor-based DC assignment algorithm that can achieve up to 21.4% improvement in error rate with a simultaneous 4.3% reduction in area over conventional DC assignment. Finally, we derive analytical estimates on min-max reliability improvements to evaluate the effectiveness of the proposed algorithms.
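The two algorithms themselves are not detailed in the abstract. One way to picture a Hamming-distance-driven assignment is to give each don't-care input pattern the output of its nearest fully specified pattern, so that patterns close in Hamming distance share outputs and single-bit input errors are more likely to be derated. This nearest-neighbour rule is an illustrative stand-in, not the paper's ranking- or complexity-factor-based algorithm:

```python
def hamming(a, b):
    # Hamming distance between two equal-length bit strings
    return sum(x != y for x, y in zip(a, b))

def assign_dcs(spec, dcs):
    # spec: {input pattern: output bit}; dcs: list of don't-care patterns.
    # Illustrative rule: give each DC the output of its nearest specified
    # pattern, so neighbouring input patterns stay consistent.
    out = {}
    for dc in dcs:
        nearest = min(spec, key=lambda p: hamming(p, dc))
        out[dc] = spec[nearest]
    return out
```

The flexibility the paper mentions corresponds to leaving the remaining, less critical DCs unassigned for the downstream logic optimizer.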
Starting from a functional description or a gate-level circuit, the goal of multi-level logic optimization is to obtain a version of the circuit that implements the original function at a lower cost. For error-tolerant applications - images, video, audio, graphics, and games - it is known that errors at the outputs are tolerable provided that their severities are within application-specified thresholds. In this paper, we perform application-level analysis to show that significant errors at the circuit level are tolerable. We then develop a multi-level logic synthesis algorithm for error-tolerant applications that minimizes the cost of the circuit by exploiting the budget for approximation provided by error tolerance. We use circuit area as the cost metric and use a test generation algorithm to select faults that introduce errors of low severity but provide significant area reductions. The selected faults are injected to simplify the circuit in the experiments. Results show that our approach provides significant reductions in circuit area even for modest error tolerance budgets.
Keywords: error tolerance, circuit optimization, ATPG, DCT, redundancy removal
Device aging, which causes significant loss on circuit performance and lifetime, has been a main factor in reliability degradation of nanoscale designs. Aggressive technology scaling trends, such as thinner gate oxide without proportional downscaling of supply voltage, necessitate an aging-aware analysis and optimization flow during early design stages. Since only a small portion of critical and near-critical paths can be sensitized and may determine the circuit delay under aging, path sensitization should also be explicitly addressed for more accurate and efficient optimization. In this paper, we first investigate the impact of path sensitization on aging-aware timing analysis and then present a novel framework for aging-aware timing optimization considering path sensitization. By extracting and manipulating critical sub-circuits accounting for the effective circuit delay, our proposed framework can reduce aging-induced performance degradation to only 1.21% or one-seventh of the original performance loss with less than 2% area overhead.
This paper addresses the problem of efficient and effective parameter variation modeling and sampling in computer architecture simulations. While there has been substantial progress in accelerating simulation for circuit designs subject to manufacturing variations, these approaches are not generally suitable for architectural studies. Toward this end, we investigated two complementary avenues: (1) adapting low-discrepancy sampling methods for use in Monte Carlo architectural simulations - we apply techniques previously developed for gate-level circuit models to higher-level component models and in so doing drastically reduce the number of samples needed for detailed simulation; and (2) applying multi-resolution analysis to appropriately decompose geometric regions of a chip and achieve a more effective description of parameter variations without increasing computational complexity. Our experimental results demonstrate that the combined techniques can reduce the number of Monte Carlo trials by a factor of 3.3, maintaining the same accuracy while significantly reducing the overall simulation run-time.
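As an illustration of the low-discrepancy idea (the paper's exact sampling construction is not given in the abstract), a Halton sequence fills the unit square far more evenly than independent uniform draws, which is what reduces the number of Monte Carlo trials needed for a given accuracy:

```python
def halton(index, base):
    # Radical inverse of `index` in `base`: the classic Halton point
    f, result = 1.0, 0.0
    i = index
    while i > 0:
        f /= base
        result += f * (i % base)
        i //= base
    return result

def halton_points(n, bases=(2, 3)):
    # n low-discrepancy points in the unit square (coprime bases per axis)
    return [tuple(halton(i + 1, b) for b in bases) for i in range(n)]
```

Each parameter dimension gets its own coprime base, so successive samples progressively subdivide the gaps left by earlier ones instead of clustering randomly.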
The simulation speedup offered by distributed parallel event-driven simulation is known to be seriously limited by synchronization and communication overhead. These limiting factors are particularly severe in gate-level timing simulation. This paper describes a radically different approach to gate-level simulation based on a concept of temporal rather than conventional spatial parallelism. The proposed method partitions the entire simulation run into simulation slices in the temporal domain, and each slice is simulated separately. With the slices independent of each other, an almost linear speedup is achievable with a large number of simulation nodes. This concept naturally enables a "correct by simulation" methodology that explicitly maintains consistency between the reference and target specifications. Experimental results clearly show significant simulation speed-up.
Keywords: event-driven simulation; parallel simulation; Verilog simulation; gate-level simulation
The growing importance of post-silicon validation in ensuring functional correctness of high-end designs increases the need for synergy between the pre-silicon verification and post-silicon validation. We propose a unified functional verification methodology for the pre- and post-silicon domains. This methodology is based on a common verification plan and similar languages for test-templates and coverage models. Implementation of the methodology requires a user-directable stimuli generation tool for the post-silicon domain. We analyze the requirements for such a tool and the differences between it and its pre-silicon counterpart. Based on these requirements, we implemented a tool called Threadmill and used it in the verification of the IBM POWER7 processor chip with encouraging results.
We present HYBRO, an automatic methodology for generating high-coverage input vectors for Register Transfer Level (RTL) designs based on a branch-coverage-directed approach. HYBRO uses dynamic simulation data and static analysis of RTL control flow graphs (CFGs). A concrete simulation is applied over a fixed number of cycles, and instrumented code records the branches covered. The corresponding symbolic trace is extracted from the CFG with an RTL symbolic execution engine. A guard in the symbolic expression is then mutated; if the mutated guard has dependent branches that have not already been covered, it is passed to an SMT solver, and a satisfiable assignment yields a valid input vector. We implemented the Verilog RTL symbolic execution engine and show that branch-coverage-directed exploration avoids the path explosion caused by previous path-based approaches to input vector generation, quickly achieving full branch coverage and more than 90% functional (assertion) coverage on the ITC99 benchmarks and several OpenRISC designs. We also describe two optimizations, (a) dynamic UD-chain slicing and (b) local conflict resolution, that speed up HYBRO by 1.6-12x on different benchmarks.
We present an STA tool based on a single-pass true-path computation that efficiently determines the critical path list. Since it does not rely on a two-step process, it can be programmed to efficiently find the N true paths of a circuit. We also report and analyze the dependence of complex-gate delay on the sensitization vector and its variation (which reaches up to 15% in 65nm technologies), and consider this effect in the path delay estimation. Delay is computed from a simple polynomial analytical description that requires a one-time library parameter extraction process, making it highly scalable. Results on combinational ISCAS circuits synthesized for three technologies (130nm, 90nm and 65nm) show better computation time, number of paths reported, and delay estimation for these paths compared to a commercial tool.
Keywords: delay model, timing analysis
We propose an adaptive reliability enhancement structure for deeply-scaled CMOS and future devices that exhibit nondeterministic behavior. This structure forms the basis of a confidence-driven computing model that can be implemented with either a rollback recovery or an iterative dual modular redundancy method incorporating synchronous handshake schemes. The performance and cost of the computing model are estimated using a 45 nm CMOS technology, and its functionality is verified by FPGA-based emulation. The confidence-driven computing model is demonstrated using a 16-bit, 12-stage CORDIC processor operating under random transient errors. The model adapts to fluctuating error rates at the device substrate level to guarantee the reliability of computation at the system level. It incurs 4.2 times less area and 2.7 times less energy overhead than triple modular redundancy to guarantee a system-level mean time to failure of two years.
Keywords: reliability, transient error, confidence estimator, rollback recovery, dual modular redundancy
Drastic device shrinking, power supply reduction, and the increasing complexity and operating speeds that accompany technology scaling have reduced the reliability of today's ICs. The reliability of embedded memories is affected by particle strikes (soft errors), very-low-voltage operating modes, PVT variability, EMI, and accelerated circuit aging. Error correcting codes (ECC) are an efficient means of protecting memories against failures. A major issue with ECC is the speed penalty induced by the encoding and decoding circuits. In this paper we present an effective approach for eliminating this penalty and demonstrate its efficiency in the case of an advanced reconfigurable OFDM modulator.
Keywords: reliability, technology scaling, ECC, performance
In this paper we address the issue of improving ECC correction ability beyond that provided by the standard SEC/DED Hsiao code. We analyze the impact of the standard SEC/DED Hsiao ECC and of several double-error-correcting (DEC) codes on area overhead and cache memory access time for different codeword and code-segment sizes, as well as their correction ability as a function of codeword/code-segment size. We show the different trade-offs that can be achieved in terms of area overhead, performance, and correction ability, giving designers insight for selecting the optimal ECC and codeword organization/code-segment size for a given application.
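For readers unfamiliar with the SEC/DED baseline the paper starts from, the classic extended Hamming (8,4) code shows the mechanism in miniature: a syndrome locates a single error, and an overall parity bit distinguishes single (correctable) from double (detect-only) errors. Hsiao codes achieve the same SEC/DED property with a different, implementation-friendlier parity-check matrix; this toy version only illustrates the behaviour:

```python
def encode(d):
    # d: 4 data bits -> 8-bit extended Hamming codeword (SEC/DED)
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                       # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                       # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                       # covers positions 4,5,6,7
    body = [p1, p2, d1, p3, d2, d3, d4]     # Hamming(7,4), positions 1..7
    p0 = 0
    for b in body:
        p0 ^= b                             # overall parity over the body
    return [p0] + body

def decode(cw):
    # cw: [p0, body 1..7]; returns (status, corrected data bits or None)
    p0, body = cw[0], list(cw[1:])
    syndrome = 0
    for pos, bit in enumerate(body, start=1):
        if bit:
            syndrome ^= pos                 # XOR of positions of set bits
    overall = p0
    for b in body:
        overall ^= b                        # recomputed overall parity
    if syndrome != 0 and overall == 0:
        return "double-error", None         # detected but uncorrectable
    if syndrome != 0:
        body[syndrome - 1] ^= 1             # single error: syndrome = position
    status = "ok" if syndrome == 0 and overall == 0 else "corrected"
    return status, [body[2], body[4], body[5], body[6]]
```

Going beyond this to DEC codes, as the paper does, buys correction of the second error at the price of the larger check matrices and decoder latency the authors quantify.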
Small circuit defects occurring during manufacturing and/or enhanced or induced by various aging mechanisms represent a serious challenge in advanced scaled CMOS technologies. These defects initially manifest as small delay faults that may evolve over time until they exceed the slack in the clock cycle period. Periodic tests performed with reduced slack provide a low-cost solution for predicting failures induced by slowly evolving delay faults. Unfortunately, such tests have limited fault coverage and fault detection latency. Here, we introduce a way to complement or completely replace periodic testing with reduced slack. Delay control structures are proposed that enable arbitrarily small parts of the monitored component to switch quickly between a normal operating mode and a degraded mode characterized by a smaller slack. Only two or three additional transistors are needed for each flip-flop in the monitored logic. Micro-architectural support for a concurrent self-test of pipelined logic that takes advantage of the introduced degraded mode is presented as well. Test stimuli are produced on the fly by the last two valid operations executed before each stall cycle, and test result evaluation is facilitated by replicating the last valid operation during a stall cycle. Protection against transient faults can be achieved if each operation is replicated via stall cycle insertion.
Physically unclonable functions (PUFs) are designed on integrated circuits (ICs) to generate unique signatures that can be used for chip authentication. PUFs primarily rely on manufacturing process variations to create distinctions between chips. In this paper, we present novel PUF circuits designed to exploit inherent fluctuations in the physical layout due to the photolithography process. Variations arising from proximity effects, density effects, etch effects, and the non-rectangularity of transistors are leveraged to implement lithography-based physically unclonable functions (litho-PUFs). We show that the uniqueness level of these PUFs is adjustable and typically much higher than that of traditional ring-oscillator or tri-state-buffer based approaches.
Keywords: PUF, IC authentication, photolithography, proximity effect, chemical mechanical polishing, hardware security
Integrated circuits (ICs) are becoming increasingly vulnerable to malicious alterations, referred to as hardware Trojans. Detection of these inclusions is of utmost importance, as they may potentially be inserted into ICs bound for military, financial, or other critical applications. A novel on-chip structure including a ring oscillator network (RON), distributed across the entire chip, is proposed to verify whether the chip is Trojan-free. This structure effectively eliminates the issue of measurement noise, localizes the measurement of dynamic power, and additionally compensates for the impact of process variations. Combined with statistical data analysis, the separation of process variations from the Trojan contribution to the circuit's transient power is made possible. Simulation results featuring Trojans inserted into a benchmark circuit using 90nm technology and experimental results on Xilinx Spartan-3E FPGA demonstrate the efficiency and scalability of the RON architecture for Trojan detection.
Modern security-aware embedded systems need protection against fault attacks. These attacks rely on intentionally induced faults, which have not only a different origin but also a different nature than the errors that fault-tolerant systems usually face. For instance, an adversary who attacks the circuit with two lasers can potentially induce two errors at different positions. Such errors can defeat not only simple double modular redundancy schemes but, as we show, also naive schemes based on any linear code over GF(2). In this article, we describe arithmetic logic units (ALUs) that provide high error detection rates even in the presence of such errors. The contribution of this article is threefold. First, we show that the minimum weight of an undetected error is no longer defined by the code distance when certain arithmetic and logic operations are applied to the codewords; as a result, additional hardware is needed to preserve the minimum error weight for a given code. Second, we show that for multi-residue codes these delicate operations are rare in typical smart card applications, which allows an efficient time-area trade-off for checking the codewords and thus significantly reduces the hardware costs of such a protected ALU. Third, we implement the proposed architectures and study the influence of the register file and a multiplier on the area and the critical path.
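As background on the multi-residue idea (the moduli and word width of the paper's ALU are not given in the abstract; the values below are illustrative), a residue checksum travels with each operand and is updated independently through addition, so a fault that corrupts the main result is caught by a residue mismatch:

```python
MODULI = (3, 5, 7)   # illustrative multi-residue check bases

def tag(x):
    # Attach a residue checksum per modulus to an operand
    return x, tuple(x % m for m in MODULI)

def checked_add(a, b):
    # Residue codes are preserved by addition:
    # (a + b) mod m == ((a mod m) + (b mod m)) mod m
    (xa, ra), (xb, rb) = a, b
    result = xa + xb
    pred = tuple((p + q) % m for p, q, m in zip(ra, rb, MODULI))
    ok = pred == tuple(result % m for m in MODULI)
    return result, pred, ok
```

Addition is a "safe" operation in this sense; the delicate operations the paper discusses are those (such as certain logic operations) where the residues are not preserved so simply and extra checking hardware is needed.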
The SHA-3 competition organized by NIST has triggered significant efforts in the performance evaluation of cryptographic hardware and software. These benchmarks are used to compare the implementation efficiency of competing hash candidates. However, such benchmarks test the algorithms in an ideal setting and ignore the effects of system integration. In this contribution, we analyze the performance of hash candidates on a high-end computing platform consisting of a multi-core Xeon processor with an FPGA-based hardware accelerator. We implement two hash candidates, Keccak and SIMD, in various configurations of multi-core hardware and multi-core software. Next, we vary application parameters such as message length, message multiplicity, and message source. We show that, depending on the application parameter set, overall system performance is limited by three possible bottlenecks: computation speed, communication bandwidth, and buffer storage. Our key result demonstrates the dependence of these bottlenecks on the application parameters. We conclude that, to make sound system design decisions, selecting the right hash candidate is only half of the solution: one must also understand the nature of the data stream being hashed.
While traditional cluster computers are increasingly constrained by power and cooling costs when solving extreme-scale (or exascale) problems, continuing progress in silicon technologies and integration levels makes complete end-user systems on a single chip possible. This massive level of integration makes modern multicore chips pervasive in domains ranging from climate forecasting and astronomical data analysis to consumer electronics, smart phones, and biological applications. Consequently, designing multicore chips for exascale computing using embedded systems design principles looks like a promising alternative to traditional cluster-based solutions. This paper presents an overview of new, far-reaching design methodologies and run-time optimization techniques that can help break the energy efficiency wall in massively integrated single-chip computing platforms.
Keywords: exascale computing, multicore, small-world, game theory, consensus theory, fractal behavior
This contribution presents a fully automated approach for the explorative topology synthesis of small analog circuit blocks. Circuits are composed from a library of basic building blocks, and various algorithms are used to explore the entire design space, even allowing unusual circuits to be generated. Correct combination of the basic blocks is ensured through generic electrical rules, which guarantee the fundamental electrical functionality of the generated circuit. Additionally, symmetry constraints are introduced to narrow the design space, which leads to more reasonable circuits. Furthermore, a replaceable bias-voltage generator is included in the circuit to replicate real-world operating conditions. For the first evaluation and selection of the best candidate circuits, fast symbolic analysis techniques are used. The final sizing is done with a parallelized, industrial sizing method. Experimental results show the feasibility of this synthesis approach.
In this paper, a new frequency compensation method based on automatic topology modification of analog amplifier circuits is presented. Starting from an uncompensated circuit topology in closed-loop configuration, a capacitance is inserted between each pair of nodes. Subsequently, the set of inserted capacitances is reduced to a manageable size using a selection algorithm based on eigenvalue sensitivity calculation. Finally, the remaining capacitances are sized by a numerical optimization method. The presented method is demonstrated on a transimpedance amplifier design for an industrial HDTV application.
This work presents novel analog sizing flows based on analytical techniques. A graph-based, operating-point-driven sizing approach provides operating point voltages and a rough sizing with respect to constraints. A voltage-range analysis method using linearized operating-point models obtains information about feasible voltage ranges. A direct-sizing method solves nonlinear algebraic circuit equations directly to obtain design parameters from specifications. All three methods require no or only minimal simulation effort and can provide quick insight into the circuit design space and constraints at an early design stage. They allow flexible inclusion into state-of-the-art simulation-based optimization flows, where they lead to improved results with less optimization effort and prevent unnecessary simulation effort on infeasible circuit topologies. The sizing flows are enhanced by a commercial optimization tool in order to obtain reliable circuits.
Layout generation remains a critical bottleneck in analog circuit design. It is especially burdensome when re-using an existing design for a similar specification or when transferring a working design to a new technology. This paper presents a new methodology for the layout generation of analog circuits based on a modular circuit design and a so-called "executable design flow description". This description is created once, manually, and allows the layout to be described in a technology-independent and parameterizable manner, ensuring a consistent view of circuit and layout design. Complex layouts can be created in negligible time, allowing layout effects to be taken into account early in the circuit design. Furthermore, the parameterization of the design description allows simplified technology transfer and seamless access to sizing tools.
Keywords: circuit design, layout design, analog circuits, parameterizable design cells