DATE 2013 ABSTRACTS

Sessions: [Keynotes] [2.2] [2.3] [2.4] [2.5] [2.6] [2.7] [2.8] [3.2] [3.3] [3.4] [3.5] [3.6] [3.7] [3.8] [4.2] [4.3] [4.4] [4.5] [4.6] [4.7] [5.1] [5.2] [5.3] [5.4] [5.5] [5.6] [5.7] [6.1] [6.2] [6.3] [6.4] [6.5] [6.6] [6.7] [7.1] [7.2] [7.3] [7.4] [7.5] [7.6] [7.7] [8.1] [8.2] [8.3] [8.4] [8.5] [8.6] [8.7] [8.8] [9.1] [9.2] [9.3] [9.4] [9.5] [9.6] [9.7] [10.1] [10.2] [10.3] [10.4] [10.5] [10.6] [10.7] [10.8] [11.1] [11.2] [11.3] [11.4] [11.5] [11.6] [11.7] [11.8] [12.1] [12.2] [12.3] [12.4] [12.5] [12.6] [12.7] [12.8]

DATE Executive Committee
DATE Sponsors Committee
Technical Program Topic Chairs
Technical Program Committee
Reviewers
Foreword
Best Paper Awards
Ph.D. Forum
Call for Papers: DATE 2014


Keynotes

Smart Systems for Internet of Things [p. 1]
Benedetto Vigna

Smart systems represent a broad class of devices incorporating functionalities like sensing, actuation, and control, with sensors adding the intelligence at the core of smart components and subsystems. The challenge in the realization of such smart systems therefore goes beyond the design of the individual components and subsystems: it consists of accommodating a multitude of functionalities, technologies, and materials that play a key role in augmenting our daily life.

Creating a Sustainable Information and Communication Infrastructure [p. 2]
Massoud Pedram

Modern society's dependence on information and communication infrastructure (ICI) is so deeply entrenched that it should be treated on par with other critical lifelines of our existence, such as water and electricity. As is the case with any true lifeline, ICI must be reliable, affordable, and sustainable. Meeting these requirements (especially sustainability) is a continuing critical challenge, which will be the focus of my talk. More precisely, I will provide an overview of information and communication technology trends in light of various societal and environmental mandates, followed by a review of the technologies, systems, and hardware/software solutions required to create a sustainable ICI.


2.2: Acceleration and Verification of ESL and Analog Systems

Moderators: Alper Sen - Bogazici University, TR; Daniel Große - University of Bremen, DE
Optimized Out-of-Order Parallel Discrete Event Simulation Using Predictions [p. 3]
Weiwei Chen and Rainer Dömer

Parallel Discrete Event Simulation (PDES) enables efficient validation of ESL models on multi-core simulation hosts. Out-of-order PDES is an advanced scheduling technique which allows multiple threads to run in parallel even in different simulation cycles. To maintain simulation semantics and timing accuracy, the compiler performs complex static conflict analysis so that the scheduler can make quick and safe decisions at run time and issue threads early. Often, however, out-of-order scheduling is prevented because of the unknown future behavior of the threads. In this paper, we extend the analysis in order to predict the future of candidate threads. Looking ahead of the current simulation state allows the scheduler to issue more threads in parallel, resulting in significantly reduced simulator run time. Our experimental results show simulation speedup up to 1.92x with only negligible increase in compile time.

Parallel Programming with SystemC for Loosely Timed Models: A Non-Intrusive Approach [p. 9]
Matthieu Moy

The SystemC/TLM technologies are widely accepted in the industry for fast system-level simulation. An important limitation of SystemC regarding performance is that the reference implementation is sequential, and the official semantics makes parallel execution difficult. As the number of cores in computers increases quickly, the ability to take advantage of the host's parallelism during a simulation is becoming a major concern. Most existing work on the parallelization of SystemC targets cycle-accurate simulation and would be inefficient on loosely timed systems, since such approaches can only parallelize processes that execute simultaneously. We propose an approach that explicitly targets loosely timed systems and offers the user a set of primitives to express tasks with duration, as opposed to the notion of time in SystemC, which allows only instantaneous computations and time elapses without computation. Our tool exploits this notion of duration to run the simulation in parallel. It runs on top of any (unmodified) SystemC implementation, which lets legacy SystemC code continue running as is. This allows the user to focus on the performance-critical parts of the program that need to be parallelized.

Accuracy vs Speed Tradeoffs in the Estimation of Fixed-Point Errors on Linear Time-Invariant Systems [p. 15]
David Novo, Sara El Alaoui and Paolo Ienne

The fixed-point format is essential to most efficient Digital Signal Processing (DSP) implementations. The conversion of an algorithm specification to fixed-point precision targets the minimization of the implementation cost while guaranteeing a minimal processing accuracy. However, measuring such processing accuracy can be extremely time consuming and lead to long design cycles. In this paper, we study reference approaches to measure fixed-point errors of Linear Time-Invariant (LTI) systems without feedback. Unsurprisingly, we find the existing analytical approach significantly faster than a straightforward simulation-based estimation. However, we also show that such an analytical approach can incur high estimation errors for some particular bitwidth configurations. Accordingly, we propose a new hybrid approach, which is able to reduce the error of the analytical estimation by up to 4 times while still being more than 10 times faster than the simulation-based estimation.
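
As a concrete illustration of the two estimation styles the paper compares, the following sketch (ours, not the authors' tool; the FIR filter and stimulus are arbitrary assumptions) measures the fixed-point error power of a feedback-free LTI system by simulation, and with the classical analytical model that treats each quantizer as a white noise source of variance q²/12:

    # Illustrative sketch: simulation-based vs. analytical estimation of the
    # fixed-point error power at the output of a feedback-free LTI system.
    import numpy as np

    def quantize(x, frac_bits):
        """Round x onto a fixed-point grid with `frac_bits` fractional bits."""
        q = 2.0 ** -frac_bits
        return np.round(x / q) * q

    rng = np.random.default_rng(0)
    h = rng.normal(size=32)             # impulse response of the LTI system
    x = rng.uniform(-1, 1, 100_000)     # white stimulus
    frac_bits = 10
    q = 2.0 ** -frac_bits

    # Simulation-based estimate: run the filter with quantized input and
    # coefficients and measure the output error power directly.
    y_ref = np.convolve(x, h)
    y_fxp = np.convolve(quantize(x, frac_bits), quantize(h, frac_bits))
    p_sim = np.mean((y_fxp - y_ref) ** 2)

    # Analytical estimate: input rounding noise (variance q^2/12) scaled by
    # the filter energy, plus the deterministic coefficient-rounding term.
    e_h = quantize(h, frac_bits) - h
    p_ana = (q ** 2 / 12) * np.sum(h ** 2) + np.mean(x ** 2) * np.sum(e_h ** 2)

    print(f"simulated noise power: {p_sim:.3e}")
    print(f"analytical estimate:   {p_ana:.3e}")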

Runtime Verification of Nonlinear Analog Circuits Using Incremental Time-Augmented RRT Algorithm [p. 21]
Seyed Nematollah Ahmadyan, Jayanand Asok Kumar and Shobha Vasudevan

Because of the complexity of analog circuits, their verification presents many challenges. We propose a runtime verification algorithm to verify design properties of nonlinear analog circuits. Our algorithm is based on performing exploratory simulations in the state-time space using the Time-augmented Rapidly Exploring Random Tree (TRRT) algorithm. The proposed runtime verification methodology consists of i) incremental construction of the TRRT to explore the state-time space and ii) use of an incremental online monitoring algorithm to check whether the incremented TRRT satisfies or violates specification properties at each iteration. In comparison to Monte Carlo simulations, we utilize a logarithmic order of memory and time to provide the same state-space coverage.

An Automated Parallel Simulation Flow for Heterogeneous Embedded Systems [p. 27]
Seyed Hosein Attarzadeh Niaki and Ingo Sander

Simulation of complex embedded and cyber-physical systems requires exploitation of the computation power of available parallel architectures. Current simulation environments either do not address this parallelism or use separate models for parallel simulation and for analysis and synthesis, which might lead to model mismatches. We extend a formal modeling framework targeting heterogeneous systems with elements that enable parallel simulation. An automated flow is then proposed that, starting from a serial executable specification, generates an efficient MPI-based parallel simulation model using a constraint-based method. The proposed flow generates parallel models with acceptable speedups for a representative example.

Mutation Analysis with Coverage Discounting [p. 31]
Peter Lisherness, Nicole Lesperance and Kwang-Ting (Tim) Cheng

Mutation testing is an established technique for evaluating validation thoroughness, but its adoption has been limited by the manual effort required to analyze the results. This paper describes the use of coverage discounting for mutation analysis, where undetected mutants are explained in terms of functional coverpoints, simplifying their analysis and saving effort. Two benchmarks are shown to compare this improved flow against regular mutation analysis. We also propose a confidence metric and simulation ordering algorithm optimized for coverage discounting, potentially reducing overall simulation time.

Scalable Fault Localization for SystemC TLM Designs [p. 35]
Hoang M. Le, Daniel Große and Rolf Drechsler

SystemC and Transaction Level Modeling (TLM) have become the de-facto standard for Electronic System Level (ESL) design. For the costly task of verification at ESL, simulation is the most widely used and scalable approach. Besides the Design Under Test (DUT), the TLM verification environment typically consists of stimuli generators and checkers where the latter are responsible for detecting errors. However, in case of an error, the subsequent debugging process is still very time-consuming. In this paper, we present a scalable fault localization approach for SystemC TLM designs. The approach targets the described standard TLM verification environment and can be easily integrated into one. Our approach is inspired by software diagnosis techniques. We extend the concept of execution profiles of software programs, also known as program spectra, to handle the TLM simulation. The whole simulation consists of several runs; each run corresponds to the request-DUT-response path. During simulation our approach individually collects spectra for each run. Then, based on analyzing the differences of passed and failed runs we determine possible fault locations. We demonstrate the quality of our approach by several experiments including TLM-2.0 designs. As shown in the experiments, the fault locations are identified accurately and very fast.
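
For background, the ranking step that spectrum-based approaches like this build on can be sketched in a few lines (our illustration; the metric choice and the run data are assumptions, not taken from the paper). Locations executed predominantly by failing runs receive high suspiciousness scores, here via the common Ochiai metric:

    # Spectrum-based fault localization sketch: each run records the set of
    # executed locations plus a pass/fail verdict; locations correlated with
    # failing runs are ranked as more suspicious.
    from math import sqrt

    def ochiai(spectra):
        """spectra: list of (executed_locations: set, passed: bool)."""
        total_failed = sum(1 for _, passed in spectra if not passed)
        locations = set().union(*(locs for locs, _ in spectra))
        scores = {}
        for loc in locations:
            ef = sum(1 for locs, p in spectra if loc in locs and not p)
            ep = sum(1 for locs, p in spectra if loc in locs and p)
            denom = sqrt(total_failed * (ef + ep))
            scores[loc] = ef / denom if denom else 0.0
        return sorted(scores.items(), key=lambda kv: -kv[1])

    runs = [
        ({"init", "decode", "respond"}, True),
        ({"init", "decode", "timeout"}, False),   # failing run
        ({"init", "respond"}, True),
        ({"init", "decode", "timeout"}, False),   # failing run
    ]
    for loc, score in ochiai(runs):
        print(f"{loc:8s} suspiciousness = {score:.2f}")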


2.3: Energy Optimization in Multi-core Systems

Moderators: Thidapat Chantem - Utah State University, US; William Fornaciari - Politecnico di Milano, IT
Cherry-Picking: Exploiting Process Variations in Dark-Silicon Homogeneous Chip Multi-Processors [p. 39]
Bharathwaj Raghunathan, Yatish Turakhia, Siddharth Garg and Diana Marculescu

It is projected that increasing on-chip integration with technology scaling will lead to the so-called dark silicon era, in which more transistors are available on a chip than can be simultaneously powered on. It is conventionally assumed that the dark silicon will be provisioned with heterogeneous resources, for example dedicated hardware accelerators. In this paper we challenge the conventional assumption and build a case for homogeneous dark-silicon CMPs that exploit the inherent variations in process parameters that exist in scaled technologies to offer increased performance. Since process variations result in core-to-core variations in power and frequency, the idea is to cherry-pick the best subset of cores for an application so as to maximize performance within the power budget. To this end, we propose a polynomial-time algorithm for optimal core selection, thread mapping and frequency assignment for a large class of multi-threaded applications. Our experimental results based on the Sniper multi-core simulator show performance improvements of up to 22% and 30% for homogeneous CMPs with 33% and 50% dark silicon, respectively.
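
A toy version of the core-selection step conveys the idea (a simplification under the assumption that a k-threaded application runs at the pace of its slowest core; this is not the authors' polynomial-time algorithm): try each core frequency as the candidate floor and greedily take the cheapest eligible cores.

    # Toy sketch: pick k cores so the slowest selected core is as fast as
    # possible while total core power stays within the budget.
    def cherry_pick(cores, k, power_budget):
        """cores: list of (freq_ghz, power_w). Returns (min_freq, chosen)."""
        best = None
        for f_min, _ in cores:
            eligible = sorted((c for c in cores if c[0] >= f_min),
                              key=lambda c: c[1])          # cheapest first
            chosen = eligible[:k]
            if len(chosen) == k and sum(p for _, p in chosen) <= power_budget:
                if best is None or f_min > best[0]:
                    best = (f_min, chosen)
        return best

    # Per-core (frequency, power) pairs reflecting process variation:
    cores = [(3.0, 9.0), (2.9, 7.5), (2.8, 6.0), (2.7, 5.5),
             (2.6, 5.0), (2.5, 4.5), (2.4, 4.0), (2.3, 3.5)]
    print(cherry_pick(cores, k=4, power_budget=22.0))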

Energy Optimization with Worst-Case Deadline Guarantee for Pipelined Multiprocessor Systems [p. 45]
Gang Chen, Kai Huang, Christian Buckl and Alois Knoll

Pipelined computing is a promising paradigm for embedded system design. Designing the scheduling policy for a pipelined system is, however, more involved. In this paper, we study the problem of energy minimization for coarse-grained pipelined systems under hard real-time constraints and propose a method based on an inverse use of the pay-burst-only-once principle. We formulate the problem by means of the resource demands of individual pipeline stages and solve it by quadratic programming. Our approach is scalable w.r.t. the number of pipeline stages. Simulation results using real-life applications as well as commercial processors are presented to demonstrate the effectiveness of our method.

Self-Adaptive Hybrid Dynamic Power Management for Many-Core Systems [p. 51]
Muhammad Shafique, Benjamin Vogel and Jörg Henkel

We present a self-adaptive, hybrid Dynamic Power Management (DPM) scheme for many-core systems that targets concurrently executing applications with what we call "expanding" and "shrinking" resource allocations as, for example, in [13]-[15], [27]. To avoid frequent allocation and de-allocation, it enables applications to temporarily reserve their resources and to perform local power management decisions. The expand-to-shrink time periods and resource demands are predicted on-the-fly based on application-specific knowledge and the monitored system information. Experimental results demonstrate up to 15%-40% Energy-Delay² Product reduction for our scheme compared to state-of-the-art power management schemes like [4][8]. Self-adaptive local power-management decisions make our scheme scalable to large-scale many-core systems, as illustrated by numerous experiments.

SmartCap: User Experience-Oriented Power Adaptation for Smartphone's Application Processor [p. 57]
Xueliang Li, Guihai Yan, Yinhe Han and Xiaowei Li

Power efficiency is increasingly critical to battery-powered smartphones. Given that the experience of use is what the user values most, we propose that power optimization should directly respect the user experience. We conduct a statistical sample survey and study the correlation among the user experience, the system runtime activities, and the minimal required frequency of an application processor. This study motivates an intelligent self-adaptive scheme, SmartCap, which automatically identifies the most power-efficient state of the application processor according to system activities. Compared to prior Linux power adaptation schemes, SmartCap can help save power from 11% to 84%, depending on the application, with little decline in user experience.

Runtime Power Estimation of Mobile AMOLED Displays [p. 61]
Dongwon Kim, Wonwoo Jung and Hojung Cha

Modeling and estimating the power consumption of OLED displays is necessary to understand the energy behavior of emerging mobile devices. Although previous studies exist that model and estimate the power consumption of stationary display images, to the best of our knowledge no prior work deals with the runtime power behavior of OLED displays running real applications. This paper proposes a runtime power estimation scheme for OLED displays that monitors the kernel activities capturing the screen change events of running applications. The experimental results show that the proposed scheme estimates the display energy consumption of running applications with reasonable accuracy.
Index Terms - Power, energy, modeling, estimation


2.4: Memory and Cache Architectures

Moderators: Georgi Gaydadjiev - Chalmers University of Technology, SE; Todd Austin - University of Michigan, Ann Arbor, US
AVICA: An Access-time Variation Insensitive L1 Cache Architecture [p. 65]
Seokin Hong and Soontae Kim

Ever-scaling process technology increases variations in transistors. These process variations cause large fluctuations in the access times of SRAM cells. Caches made of such SRAM cells cannot be accessed within the target clock cycle time, which reduces processor yield. To combat these access time failures in caches, many schemes have been proposed; they are, however, limited in their coverage and do not scale well at high failure rates. We propose a new L1 cache architecture (AVICA) employing asymmetric pipelining and pseudo multi-banking. Asymmetric pipelining eliminates all access time failures in L1 caches. Pseudo multi-banking minimizes the performance impact of asymmetric pipelining. For further performance improvement, architectural techniques are proposed. Our experimental results show that our proposed L1 cache architecture incurs less than 1% performance hit compared to a conventional cache architecture with no access time failures. Our proposed architecture is not sensitive to access time failure rates and has low overheads compared to previously proposed competing schemes.

Dual-addressing Memory Architecture for Two-dimensional Memory Access Patterns [p. 71]
Yen-Hao Chen and Yi-Yu Liu

Cache performance is an important factor in modern computing systems due to large memory access latency. To exploit the principle of spatial locality, a requested data set and its adjacent data sets are often loaded from memory into a cache block simultaneously. However, the definition of adjacent data sets is strongly correlated with the memory organization. Commodity memory is a two-dimensional structure with two (row and column) access phases to locate the requested data set; the adjacent data sets are therefore neighbors of the requested data set in a linear order. In this paper, we propose a novel memory organization with dual-addressing modes as well as orthogonal memory access mechanisms. Our dual-addressing memory can be efficiently applied to two-dimensional memory access patterns. Furthermore, we propose a cache coherence protocol to tackle the coherence issue caused by synonym data sets in the dual-addressing memory. For benchmark kernels with two-dimensional memory access patterns, the dual-addressing memory achieves a 60% performance improvement compared to conventional memory. Both cache hit rate and cache utilization are improved after removing two-dimensional memory access patterns from the conventional memory.

Adaptive Cache Management for a Combined SRAM and DRAM Cache Hierarchy for Multi-cores [p. 77]
Fazal Hameed, Lars Bauer and Jörg Henkel

On-chip DRAM caches may alleviate the memory bandwidth problem in future multi-core architectures by reducing off-chip accesses via increased cache capacity. For memory-intensive applications, recent research has demonstrated the benefits of introducing a high-capacity on-chip L4-DRAM as Last-Level-Cache between the L3-SRAM and off-chip memory. These multi-core cache hierarchies attempt to exploit the latency benefits of L3-SRAM and the capacity benefits of L4-DRAM caches. However, not taking into consideration the cache access patterns of complex applications can cause inter-core DRAM interference and inter-core cache contention. In this paper, we re-architect existing cache hierarchies by proposing a hybrid cache architecture where the Last-Level-Cache is a combination of SRAM and DRAM caches. We propose an adaptive DRAM placement policy that responds to the diverse requirements of complex applications with different cache access behaviors. It reduces inter-core DRAM interference and inter-core cache contention in SRAM/DRAM-based hybrid cache architectures, increasing the harmonic mean instruction-per-cycle throughput by 23.3% (max. 56%) and 13.3% (max. 35.1%) compared to the state-of-the-art.

Combining RAM Technologies for Hard-error Recovery in L1 Data Caches Working at Very-low Power Modes [p. 83]
Vicente Lorente, Alejandro Valero, Julio Sahuquillo, Salvador Petit, Ramon Canal, Pedro López and José Duato

Low-power modes in modern microprocessors rely on low frequencies and low voltages to reduce the energy budget. Nevertheless, manufacturing-induced parameter variations can make SRAM cells unreliable, producing hard errors at supply voltages below Vccmin. Recent proposals provide rather low fault coverage due to the fault-coverage/overhead trade-off. We propose a new fault-tolerant L1 cache, which combines SRAM and eDRAM cells in L1 data caches to provide 100% SRAM hard-error fault coverage. Results show that, compared to a conventional cache and assuming 50% failure probability in low-power mode, leakage and dynamic energy savings are 85% and 62%, respectively, with a minimal impact on performance.

A Dual Grain Hit-Miss Detector for Large Die-Stacked DRAM Caches [p. 89]
Michel El-Nacouzi, Islam Atta, Myrto Papadopoulou, Jason Zebchuk, Natalie Enright Jerger and Andreas Moshovos

Die-Stacked DRAM caches offer the promise of improved performance and reduced energy by capturing a larger fraction of an application's working set than on-die SRAM caches. However, given that their latency is only 50% lower than that of main memory, DRAM caches considerably increase latency for misses. They also incur a significant energy overhead for remote lookups in snoop-based multi-socket systems. Ideally, it would be possible to detect in advance that a request will miss in the DRAM cache and thus selectively bypass it. This work proposes a "dual grain filter" which successfully predicts whether an access is a hit or a miss in most cases. Experimental results with commercial and scientific workloads show that a 158KB dual-grain filter can correctly predict data block residency for 85% of all accesses to a 256MB DRAM cache. As a result, average off-die latency with our filter is within 8% of that possible with a perfectly accurate filter, which is impractical to implement.

Reducing Writes in Phase-Change Memory Environments by Using Efficient Cache Replacement Policies [p. 93]
Roberto Rodríguez-Rodríguez, Fernando Castro, Daniel Chaver, Luis Piñuel and Francisco Tirado

Phase Change Memory (PCM) is currently postulated as the best alternative for replacing Dynamic Random Access Memory (DRAM) as the technology used for implementing main memories, thanks to significant advantages such as good scalability and low leakage. However, PCM also presents some drawbacks compared to DRAM, like its lower endurance. This work presents an analysis of the behavior of conventional cache replacement policies in terms of the number of writes to main memory. In addition, new last-level cache (LLC) replacement algorithms are proposed, aimed at reducing the number of writes to PCM and hence increasing its lifetime, without significantly degrading system performance.
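
One plausible write-aware policy in this spirit (our sketch; the concrete heuristic is an assumption, not necessarily one of the paper's algorithms) prefers evicting clean blocks so that dirty write-backs reach the PCM less often:

    # Toy write-aware LLC model: an LRU variant that, on eviction, prefers
    # the least-recently-used *clean* block to reduce dirty write-backs.
    from collections import OrderedDict

    class CleanFirstLRU:
        def __init__(self, capacity):
            self.capacity = capacity
            self.blocks = OrderedDict()     # addr -> dirty flag, LRU order
            self.pcm_writes = 0

        def access(self, addr, is_write):
            if addr in self.blocks:
                dirty = self.blocks.pop(addr) or is_write
            else:
                if len(self.blocks) >= self.capacity:
                    self._evict()
                dirty = is_write
            self.blocks[addr] = dirty       # move to most-recently-used

        def _evict(self):
            # Prefer the LRU clean block; fall back to the LRU dirty block.
            victim = next((a for a, d in self.blocks.items() if not d),
                          next(iter(self.blocks)))
            if self.blocks.pop(victim):
                self.pcm_writes += 1        # dirty eviction writes to PCM

    # Evicting the clean block 2 instead of the dirty block 1 avoids a write:
    cache = CleanFirstLRU(capacity=2)
    for addr, is_write in [(1, True), (2, False), (3, False), (1, True)]:
        cache.access(addr, is_write)
    print("dirty write-backs to PCM:", cache.pcm_writes)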


2.5: Communications, Multimedia, and Consumer Electronics

Moderators: Theocharis Theocharides - University of Cyprus, CY; Amer Baghdadi - Telecom Bretagne / Lab-STICC, FR
Low Complexity QR-Decomposition Architecture Using the Logarithmic Number System [p. 97]
Jochen Rust, Frank Ludwig and Steffen Paul

In this paper we propose a QR-decomposition hardware implementation that processes complex calculations in the logarithmic number system. To this end, low-complexity numeric format converters are employed, using nonuniform piecewise and multiplier-less function approximation. The proposed algorithm is simulated with several different configurations in a downlink precoding environment for 4x4 and 8x8 multi-antenna wireless communication systems. In addition, the results are compared to default CORDIC-based architectures. In a second step, HDL implementation as well as logical and physical CMOS synthesis are performed. The comparison to current references highlights our approach as highly efficient in terms of hardware complexity and accuracy.
Index Terms - QR-Decomposition, Nonuniform function approximation, LNS

Perceptual Quality Preserving SRAM Architecture for Color Motion Pictures [p. 103]
Wen Yueh, Minki Cho and Saibal Mukhopadhyay

This work proposes a low-power methodology for video framebuffers that preserves perceptual quality while reducing SRAM power. Bank-wise voltage scaling combined with error-masking circuitry is proposed, where voltage domains are separated according to the importance of the luminance and color channels. The implementation may be applied to standard embedded memory cores without redesigning specialized hardware within the SRAM bank. The simulation results show that the proposed channel protection technique produces a better energy-quality trade-off than conventional higher-order-bit protection for uncompressed as well as compressed motion image frames.
Keywords - parametric failure, color image protection, low power circuit, static random access memory (SRAM).

Parameterized Area-efficient Multi-standard Turbo Decoder [p. 109]
Purushotham Murugappa, Amer Baghdadi and Michel Jézéquel

Emerging wireless digital communication standards specify a large variety of channel coding options, each suitable for specific application needs. In this context, several recent efforts have been conducted to propose flexible channel decoder implementations. However, the need for optimal solutions in terms of performance, area, and power consumption is increasing and cannot be neglected in favor of flexibility. In this paper we present a novel parameterized architecture for multi-standard Turbo decoding which illustrates how flexibility, architectural efficiency, and rapid design time can be combined. The proposed architecture supports both the single-binary Turbo codes (SBTC) of 3GPP-LTE and the double-binary Turbo codes (DBTC) of the WiMAX and DVB-RCS standards. It achieves, in both modes, a high architectural efficiency of 4.37 bits/cycle/iteration/mm². A major contribution of this work concerns the rapid design time enabled by the well-established design concepts and tools of application-specific instruction-set processors (ASIPs). Using such a tool, the paper illustrates the possibility of designing application-specific parameterized cores, removing the need for the program memory and the related instruction decoder.

An H.264 Quad-FullHD Low-Latency Intra Video Encoder [p. 115]
Muhammad Usman Karim Khan, Jan Micha Borrmann, Lars Bauer, Muhammad Shafique and Jörg Henkel

Video applications are moving from Full-HD capability (1920x1080) to even higher resolutions such as Quad-FullHD (3840x2160). The H.264 Intra-mode can be used by embedded devices to trade off the better encoding efficiency of H.264 temporal prediction (Inter-mode) against savings in area and power as well as saving the massive computational overhead of the sub-pixel motion estimation by using only spatial prediction (Intra-mode). Still, the H.264 Intra-mode requires a large computational effort and imposes severe challenges when targeting Quad-FullHD 25 fps real-time video encoding at moderate operating frequencies (we target 150 MHz) and limited area budget. Therefore, in this work we address the strong sequential data dependencies within H.264 Intra-mode that restrict the parallelism and inhibit high resolution encoding by a) decoupling of DC and AC transform paths, b) cycle-budget aware mode prediction scheduling while c) being area efficient. Using our proposed techniques, Quad-FullHD (3840x2160) 28 fps video encoding is achieved at 150 MHz, making our architecture applicable for high definition recording.

A 100 GOPS ASP Based Baseband Processor for Wireless Communication [p. 121]
Zhu Ziyuan, Tang Shan, Su Yongtao, Han Juan, Sun Gang and Shi Jinglin

This paper presents an ASP (application-specific processor) with a 512-bit SIMD (Single Instruction Multiple Data) and 192-bit VLIW (Very Long Instruction Word) architecture optimized for wireless baseband processing. It employs an optimized architecture and address generation unit to accelerate the kernel algorithms. Based on the ASP, a multi-core baseband processor is developed which can work in a 2x2 MIMO, 20 MHz physical bandwidth configuration for the LTE inner receiver and meets the requirements of Category 3 User Equipment (CAT3 UE). Furthermore, a silicon implementation of the baseband processor in 130nm CMOS technology is presented. Experimental results show that the baseband processor provides 100 GOPS of computing ability at 117.6 MHz.
Keywords - Application Specific Processor; VLIW; AGU; Baseband processor; LTE

Hardware-Software Collaborative Complexity Reduction Scheme for the Emerging HEVC Intra Encoder [p. 125]
Muhammad Usman Karim Khan, Muhammad Shafique, Mateus Grellert and Jörg Henkel

High Efficiency Video Coding (HEVC/H.265) is an emerging standard for video compression that provides almost double the compression efficiency of the current industry-standard Advanced Video Coding (AVC/H.264), at the cost of a major increase in computational complexity. This work proposes a collaborative hardware and software scheme for complexity reduction in an HEVC Intra encoding system, with run-time adaptivity. Our scheme leverages video content properties which drive the complexity management layer (software) to generate a highly probable coding configuration. The intra prediction size and direction are estimated for the prediction unit, which reduces computational complexity. At the hardware layer, specialized coprocessors with enhanced reusability are employed as accelerators. Additionally, depending upon the video properties, the software layer administers the energy management of the hardware coprocessors. Experimental results show that a complexity reduction of up to 60% and an energy reduction of up to 42% are achieved.


2.6: HOT TOPIC: Reliability Challenges of Real-time Systems in Forthcoming Technology Nodes

Organizers and Moderators: Said Hamdioui - Delft University of Technology, NL; Dimitris Gizopoulos - University of Athens, GR
Reliability Challenges of Real-Time Systems in Forthcoming Technology Nodes [p. 129]
Said Hamdioui, Dimitris Gizopoulos, Guido Groeseneken, Michael Nicolaidis, Arnaud Grasset and Philippe Bonnot

Forthcoming technology nodes are posing major challenges on the manufacturing of reliable (real-time) systems: process variations, accelerated degradation aging, as well as external and internal noise are key examples. This paper focuses on real-time systems reliability and analyzes the state-of-the-art and the emerging reliability bottlenecks from three different perspectives: technology, circuit/IP and full system.
Keywords - Circuit reliability, embedded real-time systems, dependable computing


2.7: Safety Critical Real-Time Systems

Moderators: Michael Paulitsch - EADS, DE; Giuseppe Lipari - ENS - Cachan, FR
Sensitivity Analysis for Arbitrary Activation Patterns in Real-time Systems [p. 135]
Moritz Neukirchner, Sophie Quinton, Tobias Michaels, Philip Axer and Rolf Ernst

Response time analysis, which determines whether timing guarantees are satisfied for a given system, has matured into industrial practice and is able to consider even complex activation patterns modelled through arrival curves or minimum distance functions. On the other hand, sensitivity analysis, which determines bounds on parameter variations under which constraints are still satisfied, is largely restricted to the variation of single-valued parameters such as task periods. In this paper we provide a sensitivity analysis to determine the bounds on the admissible activation pattern of a task, modelled through a minimum distance function. In an evaluation on a set of synthetic test cases, we show that the proposed algorithm provides significantly tighter bounds than previous exact analyses that determine allowable parametrizations of activation patterns.
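
For readers unfamiliar with the analysis being made sensitive, the sketch below shows the classical fixed-point response-time iteration with period/jitter arrival curves (textbook material, not the paper's sensitivity algorithm; the task parameters are made up for illustration):

    # Worst-case response time of a task under static priorities, with
    # higher-priority activations bounded by eta+(dt) = ceil((dt + J) / P).
    import math

    def wcrt(c_own, interferers, max_iter=1000):
        """interferers: list of (wcet, period, jitter) of higher-prio tasks."""
        w = c_own
        for _ in range(max_iter):
            w_next = c_own + sum(math.ceil((w + j) / p) * c
                                 for c, p, j in interferers)
            if w_next == w:
                return w                 # fixed point reached
            w = w_next
        raise RuntimeError("no convergence: task set may be unschedulable")

    print(wcrt(2, [(1, 5, 0), (2, 7, 1)]))   # prints 5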

PT-AMC: Integrating Preemption Thresholds into Mixed-Criticality Scheduling [p. 141]
Qingling Zhao, Zonghua Gu and Haibo Zeng

Mixed-Criticality Scheduling (MCS) is an effective approach to addressing diverse certification requirements of safety-critical systems that integrate multiple subsystems with different levels of criticality. Preemption Threshold Scheduling (PTS) is a well-known technique for controlling the degree of preemption, ranging from fully-preemptive to fully-non-preemptive scheduling. We present schedulability analysis algorithms to enable integration of PTS with MCS, in order to bring the rich benefits of PTS into MCS, including minimizing the application stack space requirement, reducing the number of runtime task preemptions, and improving schedulability.

An Elastic Mixed-Criticality Task Model and Its Scheduling Algorithm [p. 147]
Hang Su and Dakai Zhu

To address the abrupt service interruptions experienced by low-criticality tasks in existing mixed-criticality scheduling algorithms, we study an Elastic Mixed-Criticality (E-MC) task model, whose key idea is to have variable periods (i.e., service intervals) for low-criticality tasks. The minimum service requirement of a low-criticality task is ensured by its largest period. At runtime, however, low-criticality tasks can be released early by exploiting the slack time generated from the over-provisioned execution times of high-criticality tasks, reducing their service intervals and thus improving their service levels. We propose an Early-Release EDF (ER-EDF) scheduling algorithm, which can judiciously manage the early release of low-criticality tasks without affecting the timeliness of high-criticality tasks. Compared to the state-of-the-art EDF-VD scheduling algorithm, our simulation results show that ER-EDF can successfully schedule many more task sets. Moreover, the achieved execution frequencies of low-criticality tasks can also be significantly improved under ER-EDF.

An Open Platform for Mixed-Criticality Real-time Ethernet [p. 153]
Gonzalo Carvajal and Sebastian Fischmeister

For more than a decade, researchers have considered Ethernet a natural replacement for legacy fieldbuses in modern distributed applications. However, Ethernet components require special modifications and hardware support to provide strict timing guarantees. In general, the high cost of deploying hardware components limits the experimental validation of proposed solutions in real-world applications. Despite the vast literature, only a few solutions report real implementations, and they are all closed to the research community, hindering further development for constantly evolving applications. This paper introduces Atacama, an on-going effort to deploy the first hardware-accelerated, open-source framework for mixed-criticality communication on multi-hop networks. Specialized modules exploit the principles of traditional fieldbus systems to coordinate communication tasks on real-time stations, and can be easily integrated with and coexist alongside Commercial Off-The-Shelf (COTS) devices operating with best-effort traffic. Experimental characterization of implemented prototypes reports minimal jitter on 1 Gbps links and shows that the real-time guarantees are resilient to injected best-effort traffic. The framework is available as an open-source project, enabling researchers to verify the results and to explore, test, and deploy new networking solutions for modern distributed systems in real-world scenarios.


2.8: HOT TOPIC: IP Subsystems: The Next Productivity Wave?

Organizers and Moderators: Wido Kruijtzer - Synopsys, NL; Luciano Lavagno - Politecnico di Torino, IT
Modular SoC Integration with Subsystems: The Audio Subsystem Case [p. 157]
Pieter van der Wolf and Ruud Derwig

We explore the potential of subsystem-based design to reduce cost and time-to-market in the design of advanced Systems-on-Chips (SoCs) while retaining low-power and high performance processing. Using a concrete audio subsystem as an example, we illustrate the benefits of modular SoC integration with subsystems and identify challenges to be addressed. Well-designed subsystems pre-integrate hardware and software modules to implement complete system functions and offer high-level hardware and software interfaces for easy SoC integration. Configurability of subsystems enables reuse across SoCs. Subsystems can offer software plug-ins to support integration into a software stack on a host processor while making core crossings transparent for the application programmer. We conclude that subsystems can indeed be the next reuse paradigm for efficient SoC integration.
Keywords - System-on-Chip (SoC); subsystem; audio;

Configurability in IP Subsystems: Baseband Examples [p. 163]
Pierre-Xavier Thomas, Grant Martin, David Heine, Dennis Moolenaar, and James Kim

Configurability in IP subsystems has two major motivations: the first is the requirements of the IP subsystem itself; the second is the particular customer requirements, as every customer has unique things they want to change in a subsystem. Configurability manifests itself at two levels: the individual components, such as processors (ideally configurable), memories, and hardware blocks for specialized processing; and the subsystem level, where component choices, interconnect, and interfaces may all vary considerably. This paper discusses these concepts applied to practical, real baseband subsystems for wireless communications. Configurability allows both scalability of a reference IP subsystem, e.g. to handle a variety of standards and use cases, and differentiation, so that customers get the optimal IP subsystem for their unique needs. This is illustrated with existing product-ready systems and cores, and with future subsystem concepts that will allow even better scalability, performance, and adaptability for the next generation.
Keywords - IP, IP subsystems, configurable processors, configurable subsystems, baseband

Configurable IO Integration to Reduce System-on-Chip Time to Market: DDR, PCIe Examples [p. 169]
Frank Martin and Peter Bennett

The availability of protocol features with iterative configurability is central to the successful adoption of reusable IP in SoC development. However, the promise of ultimately shrinking SoC development TTM while also allowing greater resourcing efficiency can only be realized with a comprehensive approach to delivering the software, digital, and analog components of the protocol to the SoC top-level integration as IP subsystems with the correct integration views. This talk discusses quantitatively how the combination of configurability, quality, and integration at the IO protocol level can systematically reduce the SoC development and resource plan. This is demonstrated with examples for the DDR and PCIe IO protocols as well as examples from application-specific SoCs.

High-performance Imaging Subsystems and Their Integration in Mobile Devices [p. 170]
Menno Lindwer and Mark Ruvald Pedersen

Within today's SoCs, functionality such as video, audio, graphics, and imaging is increasingly integrated through IP blocks, which are subsystems in their own right. Integration of IP blocks within SoCs has always brought software integration aspects with it. However, since these subsystems increasingly consist of programmable processors, many more layers of firmware and software need to be integrated. This is particularly true in the imaging domain. Imaging subsystems are typically highly heterogeneous, with high levels of parallelism. The construction of their firmware requires target-specific optimization, yet needs to take interoperability with sensor input systems and graphics/display subsystems into account. Hard real-time scheduling within the subsystem needs to cooperate with less stringent image analytics and SoC-level (OS) scheduling; in many of today's systems, the latter often only supports soft scheduling deadlines. At the HW level, IP subsystems need to be integrated such that they can efficiently exchange both short-latency control signals and high-bandwidth data-plane blocks. Solutions exist, but need to be properly configured. At the SW level, however, no support currently exists that provides (i) efficient programmability, (ii) SW abstraction of all the different HW features of these blocks, and (iii) interoperability of these blocks. Starting points could be OpenCL and OpenCV, which do provide some abstractions but are not yet sufficiently versatile.


3.2: PANEL: The Heritage of Mead & Conway: What Has Remained the Same, What Was Missed, What Has Changed, What Lies Ahead

Organizer: Marco Casale-Rossi - Synopsys, US
Moderators: Alberto Sangiovanni-Vincentelli - UCB, US; Marco Casale-Rossi - Synopsys, US
PANEL: The Heritage of Mead & Conway: What Has Remained the Same, What Was Missed, What Has Changed, What Lies Ahead
Panelists: Luca Carloni, Bernard Courtois, Hugo de Man, Antun Domic, and Jan Rabaey [p. 171]

Thirty-two years ago, Electronics Magazine honored Carver Mead and Lynn Conway with its Achievement Award for their contributions to VLSI chip design. The "Mead & Conway methods" were being taught at 100+ universities all over the world, and "not only have helped spawn a common design culture so necessary in the VLSI era, but have greatly increased interaction between university and industry so as to stimulate research by both." Concepts such as simplified design methods, new electronic representations of digital design data, scalable design rules, "clean" formalized digital interfaces between design and manufacturing, and widely accessible silicon foundries suddenly enabled many thousands of chip designers to create many tens of thousands of chip designs. Today, as Moore's Law - a term coined by Carver Mead - has brought us from 10 microns to 10 nanometers, what is the heritage of Mead & Conway? UCB Professor Alberto Sangiovanni-Vincentelli will moderate an industry and research panel to discuss what has remained the same, what was missed, what has changed, and what lies ahead.


3.3: Addressing Process and Delay Variation in High-Level Synthesis

Moderators: Lars Bauer - Karlsruhe Institute of Technology, DE; Hiroyuki Tomiyama - Ritsumeikan University, JP
Profit Maximization through Process Variation Aware High Level Synthesis with Speed Binning [p. 176]
Mengying Zhao, Alex Orailoglu and Chun Jason Xue

As integrated circuits continue to scale, process variation plays an increasingly significant role in system design and semiconductor economic return. In this paper, we explore the potential for profit improvement under the inherent semiconductor variability based on the speed binning technique. We first propose a corresponding set of high-level synthesis techniques, including allocation, scheduling and resource binding, thus constructing designs that maximize the number of chips that can be sold at the most advantageous price and thereby maximizing the overall profit. We subsequently explore the optimal bin placement strategy for further profit improvement. Experimental results confirm the superiority of the high-level synthesis results and the associated improvement in profit margins.

Instruction-Set Extension under Process Variation and Aging Effects [p. 182]
Yuko Hara-Azumi, Farshad Firouzi, Saman Kiamehr and Mehdi Tahoori

We propose a novel custom instruction (CI) selection technique for process-variation- and transistor-aging-aware instruction-set architecture synthesis. For aggressive clocking, we select CIs based on statistical static timing analysis (SSTA), which achieves an efficient speedup over the target lifetime while mitigating the degradation of timing yield (i.e., the probability of satisfying the timing). Furthermore, we consider process variation and aging not only on CIs but also on basic instructions (BIs). Even if basic functional units (BFUs), e.g., an ALU, get slower due to aging, only a few BIs with critical propagation delay may violate the timing, whereas the other BIs running on the same BFU can still satisfy it. We therefore introduce "customized BFUs", which execute only such aging-critical BIs. The customized BFUs, used as spare BFUs for the aging-critical BIs, can extend the lifetime of the system. Combining the two approaches enables speedup as well as lifetime extension with no or negligibly small area/power overhead. Experiments demonstrate that our work outperforms conventional worst-case work (by an average speedup of about 49%) and existing SSTA-based work (16x or more lifetime extension with comparable speedup).

Multispeculative Additive Trees in High-Level Synthesis [p. 188]
Alberto A. Del Barrio, Román Hermida, Seda Ogrenci Memik, José M. Mendías and María C. Molina

Multispeculative Functional Units (MSFUs) are arithmetic functional units that operate using several predictors for the carry signal. The carry prediction helps to shorten the critical path of the functional unit, and the average performance of these units is determined by the hit rate of the prediction. In spite of utilizing more than one predictor, no or only one additional cycle is needed to produce the correct result in the majority of cases. In this paper we present multispeculation as a way of increasing the performance of tree structures with a negligible area penalty. By judiciously introducing these structures into computation trees, it is only necessary to predict in certain selected nodes, thus minimizing the number of operations that can potentially mispredict. Hence, the average latency is diminished and performance is increased. Our experiments show that it is possible to improve execution time by 24% and 38% on average when considering logarithmic and linear modules, respectively.
Index Terms - Speculation, operation trees, High-Level Synthesis.
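
The mechanism behind carry speculation can be pictured with a toy two-segment adder (our model, not the paper's MSFU design): the upper half starts from a predicted carry, and only a misprediction costs an extra cycle.

    # Toy model of carry speculation on a two-segment adder: compute the
    # upper half with a predicted carry-in and recompute it only on a miss.
    def speculative_add(a, b, bits=32, predict=0):
        half = bits // 2
        mask = (1 << half) - 1
        lo = (a & mask) + (b & mask)
        actual_carry = lo >> half
        hi = ((a >> half) + (b >> half) + predict) & mask
        cycles = 1
        if actual_carry != predict:       # misprediction: one extra cycle
            hi = ((a >> half) + (b >> half) + actual_carry) & mask
            cycles += 1
        return ((hi << half) | (lo & mask)), cycles

    # A carry ripples into the upper half, so the predict-0 guess misses:
    print(speculative_add(0x0000FFFF, 1))   # (65536, 2)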

Multi-Pumping for Resource Reduction in FPGA High-Level Synthesis [p. 194]
Andrew Canis, Jason H. Anderson and Stephen D. Brown

Resource sharing is a classic high-level synthesis (HLS) optimization that saves area by mapping multiple operations to a single functional unit. With resource sharing, only operations scheduled in separate cycles can be assigned to shared hardware, which can result in longer schedules. In this paper, we propose a new approach to resource sharing that allows multiple operations to be performed by a single functional unit in one clock cycle. Our approach is based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2x, allowing multiple computations to complete in a single system cycle. Our approach is particularly effective for DSP blocks on an FPGA, which are used to perform multiply and/or accumulate operations. Our results show that resource sharing using multi-pumping is comparable to traditional resource sharing in terms of area saved, but provides significant performance advantages. Specifically, when targeting a 50% reduction in DSP blocks, traditional resource sharing decreases circuit speed by 80% on average, whereas multi-pumping decreases circuit speed by just 5%. Multi-pumping is a viable approach to achieving the area reductions of resource sharing with considerably less negative impact on circuit performance.

Resource-Constrained High-Level Datapath Optimization in ASIP Design [p. 198]
Yuankai Chen and Hai Zhou

In this work, we study the problem of optimizing the datapath under resource constraints in the high-level synthesis of an Application-Specific Instruction Processor (ASIP). We propose a two-level dynamic programming (DP) based heuristic algorithm. At the inner level, the instructions are sorted in topological order, and a DP algorithm is applied to optimize the datapath for that order. At the outer level, the space of topological orders of the instructions is explored to iteratively improve the solution. Compared with an optimal brute-force algorithm, the proposed algorithm achieves near-optimal solutions, with only 3% more performance overhead on average but a significant reduction in runtime. Compared with a greedy algorithm that replaces the DP inner level with a greedy heuristic, the proposed algorithm achieves a 48% reduction in performance overhead.


3.4: Microarchitectural Techniques for Reliability

Moderators: Todd Austin - University of Michigan, Ann Arbor, US; Mladen Berekovic - Technical University of Braunschweig, DE
Extracting Useful Computation from Error-Prone Processors for Streaming Applications [p. 202]
Yavuz Yetim, Margaret Martonosi and Sharad Malik

As semiconductor fabrics scale closer to fundamental physical limits, their reliability is decreasing due to process variation, noise margin effects, aging effects, and increased susceptibility to soft errors. Reliability can be regained through redundancy, error checking with recovery, voltage scaling and other means, but these techniques impose area/energy costs. Since some applications (e.g. media) can tolerate limited computation errors and still provide useful results, error-tolerant computation models have been explored, with both the application and computation fabric having stochastic characteristics. Stochastic computation has, however, largely focused on application-specific hardware solutions, and is not general enough to handle arbitrary bit errors that impact memory addressing or control in processors. In response, this paper addresses requirements for error-tolerant execution by proposing and evaluating techniques for running error-tolerant software on a general-purpose processor built from an unreliable fabric. We study the minimum error-protection required, from a microarchitecture perspective, to still produce useful results at the application output. Even with random errors as frequent as every 250μs, our proposed design allows JPEG and MP3 benchmarks to sustain good output quality - 14dB and 7dB respectively. Overall, this work establishes the potential for error-tolerant single-threaded execution, and details its required hardware/system support.

Orchestrator: A Low-cost Solution to Reduce Voltage Emergencies for Multi-threaded Applications [p. 208]
Xing Hu, Guihai Yan, Yu Hu and Xiaowei Li

Voltage emergencies have become a major challenge for multi-core processors because core-to-core resonance may endanger all cores, jeopardizing system reliability. We observe that applications following the SPMD (Single Program, Multiple Data) programming model tend to spark domain-wide voltage resonance, because multiple threads sharing the same function body exhibit similar power activity. When threads are judiciously relocated among the cores, the voltage droops can be greatly reduced. We propose "Orchestrator", a sensor-free, non-intrusive scheme for multi-core architectures that smooths voltage droops. Orchestrator focuses on inter-core voltage interactions and maximally leverages thread diversity to avoid synergistic voltage droops among cores. Experimental results show that Orchestrator can reduce up to 64% of voltage emergencies on average, while also improving performance.

Memory Array Protection: Check on Read or Check on Write? [p. 214]
Panagiota Nikolaou, Yiannakis Sazeides, Lorena Ndreu, Emre Özer and Sachin Idgunji

This work introduces Check-on-Write: a memory array error protection approach that enables a trade-off between a memory array's fault coverage and energy. The presented approach checks for errors in a stored value before it is overwritten rather than, as currently done, when it is read (check-on-read). This aims at reducing the number and energy of error code checks. This lazy protection approach can be used for caches in systems that support failure atomicity to recover from state corrupted by a fault. The paper proposes and evaluates an adaptive memory protection scheme that is capable of both check-on-read and check-on-write and switches between the two protection modes depending on the energy to be saved and the fault coverage requirements. Experimental analysis shows that our technique reduces the average dynamic energy of the L1 instruction cache tag and data arrays by 18.6% and 17.7%, respectively. For the L1 data cache, the savings are 17.2% and 2.9%, and they are 13.4% for the L2 tag array. The paper also quantifies the implications of the proposed scheme for fault coverage by analyzing the mean time to failure as a function of the transient failure rate.
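
The lazy-check idea can be pictured with a minimal sketch (our illustration using simple parity; the actual arrays, codes, and recovery path in the paper differ): the stored code is verified only when an entry is about to be overwritten, so reads stay cheap.

    # Check-on-write sketch: validate an entry's error-detecting code lazily,
    # just before it is overwritten, instead of on every read.
    class CheckOnWriteArray:
        def __init__(self, size):
            self.data = [0] * size
            self.code = [self._parity(0)] * size

        @staticmethod
        def _parity(word):
            return bin(word).count("1") & 1

        def read(self, i):
            return self.data[i]            # no check: reads stay cheap

        def write(self, i, word):
            # Lazy check: validate the old entry only now, before it is lost;
            # a mismatch would trigger recovery in a failure-atomic system.
            if self._parity(self.data[i]) != self.code[i]:
                raise RuntimeError(f"fault detected in entry {i} on overwrite")
            self.data[i] = word
            self.code[i] = self._parity(word)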

FaulTM: Error Detection and Recovery Using Hardware Transactional Memory [p. 220]
Gulay Yalcin, Osman Unsal and Adrian Cristal

Reliability is an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly on chip multi-processors (CMPs) provides high reliability, since it can detect both transient and permanent faults; it also minimizes the Silent Data Corruption rate. However, comparing the results of the instruction streams, checkpointing the entire system, and recovering from the detected errors can lead to substantial performance degradation. In this study we propose FaulTM, an error detection and recovery scheme that utilizes Hardware Transactional Memory (HTM) to reduce these performance degradations. We show how a minimally modified HTM featuring lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency. Compared with lockstepping, FaulTM reduces the performance degradation by 2.5x for the SPEC2006 benchmarks.

Phoenix: Reviving MLC Blocks as SLC to Extend NAND Flash Devices Lifetime [p. 226]
Xavier Jimenez, David Novo and Paolo Ienne

On a Multi-Level Cell (MLC) flash memory, a flash block that is becoming unreliable to store multiple bits per cell can be "revived" by storing only a single bit per cell. While the revived-block capacity is halved, its lifetime is significantly extended without jeopardizing the stored data. We present Phoenix, a technique that benefits from this feature to extend a device lifetime, and we evaluate its potential through detailed trace simulation on realistic benchmarks. Phoenix shows systematic lifetime extensions ranging from 3% up to 17%, without extra memory requirements or performance loss.


3.5: Energy Efficient Mobile and Cloud Computing Systems

Moderators: Tajana Rosing - University of California San Diego, US; Theocharis Theocharides - University of Cyprus, CY
SCC Thermal Model Identification via Advanced Bias-Compensated Least-Squares [p. 230]
Roberto Diversi, Andrea Bartolini, Andrea Tilli, Francesco Beneventi and Luca Benini

Compact thermal models and modeling strategies are today a cornerstone of advanced power management to counteract the emerging thermal crisis in many-core systems-on-chip. System identification techniques allow such models to be extracted directly from the target device's thermal response. Unfortunately, standard Least Squares techniques cannot effectively cope with both the model approximation and the measurement noise typical of real systems. In this work, we present a novel distributed identification strategy capable of coping with real-life temperature sensor noise and effectively extracting a set of low-order predictive thermal models for the tiles of Intel's Single-chip Cloud Computer (SCC) many-core prototype.

System and Circuit Level Power Modeling of Energy-Efficient 3D-Stacked Wide I/O DRAMs [p. 236]
Karthik Chandrasekar, Christian Weis, Benny Akesson, Norbert Wehn and Kees Goossens

JEDEC recently introduced its new standard for 3D-stacked Wide I/O DRAM memories, which defines their architecture, design, features and timing behavior. With improved performance/power trade-offs over previous generation DRAMs, Wide I/O DRAMs provide an extremely energy-efficient green memory solution required for next-generation embedded and high-performance computing systems. With both industry and academia pushing to evaluate and employ these highly anticipated memories, there is an urgent need for an accurate power model targeting Wide I/O DRAMs that enables their efficient integration and energy management in DRAM stacked SoC architectures. In this paper, we present the first system-level power model of 3D-stacked Wide I/O DRAM memories that is almost as accurate as detailed circuit-level power models of 3D-DRAMs. To verify its accuracy, we experimentally compare its power and energy estimates for different memory workloads and operations against those of a circuit-level 3D-DRAM power model and show less than 2% difference between the two sets of estimates.

Design of Low Energy, High Performance Synchronous and Asynchronous 64-Point FFT [p. 242]
William Lee, Vikas S. Vij, Anthony R. Thatcher and Kenneth S. Stevens

A case study exploring multi-frequency design is presented for a low-energy, high-performance FFT circuit implementation. An FFT architecture with concurrent data stream computation is selected. Asynchronous and synchronous implementations of a 16-point and a 64-point FFT circuit were designed and compared for energy, performance and area. Both versions are structurally similar and are generated using similar ASIC CAD tools and flows. The asynchronous design shows a benefit of 2.4x, 2.4x and 3.2x in terms of area, energy and performance, respectively, over its synchronous counterpart. The circuit is further compared with other published designs and shows 0.4x, 4.8x and 32.4x benefits with respect to area, energy and performance.
Index Terms - Asynchronous circuits, FFT, synthesis, timing analysis, low power digital, low energy digital, synchronous circuits, high performance

PDF icon A Multi-Level Monte Carlo FPGA Accelerator for Option Pricing in the Heston Model [p. 248]
Christian de Schryver, Pedro Torruella and Norbert Wehn

The increasing demand for fast and accurate product pricing and risk computation, together with high energy costs, is currently making finance and insurance institutes rethink their IT infrastructure. Heterogeneous systems including specialized accelerator devices are a promising alternative to current CPU and GPU clusters for hardware-accelerated computing. Previous work has already shown that complex state-of-the-art computations that must be performed very frequently in this domain can be sped up highly efficiently by FPGA accelerators. A very common task is the pricing of credit derivatives, in particular options, under realistic market models. Monte Carlo methods are typically employed for complex or path-dependent products. It has been shown that multi-level Monte Carlo can provide much better convergence behavior than standard single-level methods. In this work we present the first hardware architecture for pricing European barrier options in the Heston model based on the advanced multi-level Monte Carlo method. The presented architecture uses industry-standard AXI4-Stream flow control, is constructed in a modular way, and can easily be extended to more products. We show that it computes around 100 million steps per second with a total power consumption of 3.58 W on a Xilinx Virtex-6 FPGA.
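
A minimal multi-level Monte Carlo estimator in Python conveys the core idea of spending most samples on cheap coarse levels; Euler-discretized geometric Brownian motion and a plain European call stand in for the paper's Heston-model barrier options, and all parameters are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(42)
    S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0

    def euler_paths(dW, dt):
        """Euler-Maruyama terminal values for dS = r*S*dt + sigma*S*dW."""
        S = np.full(dW.shape[0], S0)
        for k in range(dW.shape[1]):
            S *= 1.0 + r * dt + sigma * dW[:, k]
        return S

    def level_estimator(l, n, M=4):
        """Mean of P_l - P_{l-1}; coarse and fine paths reuse the same noise."""
        nf = M ** l                                   # fine steps at level l
        dt = T / nf
        dW = rng.normal(0.0, np.sqrt(dt), (n, nf))
        Pf = np.maximum(euler_paths(dW, dt) - K, 0.0)
        if l == 0:
            return np.exp(-r * T) * Pf.mean()
        dWc = dW.reshape(n, nf // M, M).sum(axis=2)   # aggregated coarse increments
        Pc = np.maximum(euler_paths(dWc, dt * M) - K, 0.0)
        return np.exp(-r * T) * (Pf - Pc).mean()

    # Sample counts shrink on the finer, costlier levels.
    price = sum(level_estimator(l, n) for l, n in enumerate([200000, 40000, 8000]))
    print(f"MLMC estimate: {price:.2f}")   # Black-Scholes reference is ~10.45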

PDF icon Non-Speculative Double-Sampling Technique to Increase Energy-Efficiency in a High-Performance Processor [p. 254]
Junyoung Park, Ameya Chaudhari and Jacob A. Abraham

In the past few years, many techniques have been introduced that try to utilize the excessive timing margins of a processor. However, these techniques have limitations for one of the following reasons: first, they are not suitable for high-performance processor designs due to the power and design overhead they impose; second, they are not accurate enough to effectively exploit the timing margins, requiring a substantial safety margin to guarantee correct operation of the processor. In this paper, we introduce an alternative, more effective technique that is suitable for high-performance processor designs, in which the processor predicts timing errors in the critical paths and undertakes preventive steps to avoid the errors whenever the timing margins fall below a critical level. This technique allows a processor to exploit timing margins while requiring only the minimum safety margin. Our simulation results show that the proposed technique yields 12% and 6% improvements in energy and Energy-Delay Product (EDP), respectively, over a Razor-based speculative method.

PDF icon User-Aware Energy Efficient Streaming Strategy for Smartphone Based Video Playback Applications [p. 258]
Hao Shen and Qinru Qiu

We propose a methodology to design user-aware streaming strategies for energy-efficient smartphone video playback applications (e.g. YouTube). Our goal is to manage the streaming process to minimize the sleep and wake penalties of the cellular module while avoiding the energy waste of excessive downloading. The problem is modeled as a stochastic inventory system, where the length of video playback actually requested by the smartphone user is treated as a demand that follows a stochastic process. Through user behavior analysis, a Gaussian Mixture Model (GMM) is constructed to predict the user's demand for video playback, and an energy-efficient video downloading strategy is then determined progressively during the playback process. Experimental results show that, compared to a static downloading strategy optimized by exhaustive trial, our method reduces the wasted energy by 10 percent on average.
Key words: smartphone, video download, 3G, energy, Inventory Theory, Gaussian Mixture Model
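
A toy version of the prediction step, assuming scikit-learn is available: fit a GMM to historical watch fractions and forecast the remaining demand of the current session. The two-cluster user behavior below is an invented example; the downloading policy the paper builds on this forecast comes from inventory theory.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Fit a GMM to historical watch durations (fraction of video actually played),
    # then size the next download chunk to cover the predicted remaining demand.
    rng = np.random.default_rng(1)
    watched = np.concatenate([rng.normal(0.15, 0.05, 300),   # abandon early
                              rng.normal(0.90, 0.08, 700)])  # watch to the end
    gmm = GaussianMixture(n_components=2).fit(watched.reshape(-1, 1))

    def expected_remaining(progress):
        """E[total watched | total watched > progress], by sampling the GMM."""
        s = gmm.sample(100_000)[0].ravel()
        s = s[s > progress]
        return s.mean() if s.size else progress

    print(expected_remaining(0.3))   # demand forecast once 30% has been played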

PDF icon Utility-Aware Deferred Load Balancing in the Cloud Driven by Dynamic Pricing of Electricity [p. 262]
Muhammad Abdullah Adnan and Rajesh Gupta

Distributed computing resources in a cloud computing environment provide an opportunity to reduce energy and its cost by shifting loads in response to dynamically varying availability of energy. This variation in electrical power availability is reflected in its dynamically changing price, which can be used to drive workload deferral against performance requirements. But such deferral may cause user dissatisfaction. In this paper, we quantify the impact of deferral on user satisfaction and utilize the flexibility for deferral in service level agreements (SLAs) to adapt to dynamic price variation. We differentiate among jobs based on their requirements for responsiveness and schedule them for energy saving while meeting deadlines and user satisfaction. Representing utility as decaying functions along with workload deferral, we strike a balance between loss of user satisfaction and energy efficiency. We model delay as decaying functions, guarantee that no job violates its maximum deadline, and minimize the overall energy cost. Our simulation on MapReduce traces shows that energy consumption can be reduced by ~15% with such utility-aware deferred load balancing. We also find that treating utility as a decaying function gives better cost reduction than load balancing with a fixed deadline.
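
The trade-off can be made concrete with a toy Python calculation: a job's utility decays exponentially with deferral while its energy cost follows the hourly electricity price, and the job runs in the slot with the best net benefit. The decay rate, price series, and deadline are illustrative, not the paper's model.

    import numpy as np

    # Defer a job only while the electricity savings exceed the utility lost.
    price = np.array([0.30, 0.28, 0.12, 0.10, 0.25])   # $/kWh per hourly slot

    def best_slot(value=1.0, decay=0.08, energy=1.0, deadline=4):
        net = [value * np.exp(-decay * t) - price[t] * energy
               for t in range(deadline + 1)]
        return int(np.argmax(net)), max(net)

    slot, net = best_slot()
    print(f"run in slot {slot}, net benefit {net:.3f}")
    # picks slot 2: cheap enough, before the utility has decayed too far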

PDF icon Leakage and Temperature Aware Server Control for Improving Energy Efficiency in Data Centers [p. 266]
Marina Zapater, José L. Ayala, José M. Moya, Kalyan Vaidyanathan, Kenny Gross and Ayse K. Coskun

Reducing the energy consumption for computation and cooling in servers is a major challenge considering the data center energy costs today. To ensure energy-efficient operation of servers in data centers, the relationship among computational power, temperature, leakage, and cooling power needs to be analyzed. By means of an innovative setup that enables monitoring and controlling the computing and cooling power consumption separately on a commercial enterprise server, this paper studies temperature-leakage-energy tradeoffs, obtaining an empirical model for the leakage component. Using this model, we design a controller that continuously seeks and settles at the optimal fan speed to minimize the energy consumption for a given workload. We run a customized dynamic load-synthesis tool to stress the system. Our proposed cooling controller achieves up to 9% energy savings and 30W reduction in peak power in comparison to the default cooling control scheme.
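
The following Python sketch mimics the controller's search in miniature: total power is the sum of a cubic fan-power law and temperature-dependent leakage, and a hill-climbing loop settles at the fan speed minimizing that sum. All model constants are invented; the paper fits its leakage model empirically on a real server.

    import math

    def total_power(fan_rpm, workload_w=150.0):
        fan_w = 1e-9 * fan_rpm ** 3                    # cubic fan power law
        temp = 35.0 + workload_w / (0.004 * fan_rpm)   # crude thermal resistance
        leak_w = 8.0 * math.exp(0.025 * (temp - 50.0)) # leakage grows with temp
        return fan_w + leak_w

    def seek_optimum(rpm=3000.0, step=200.0):
        """Hill-climb on fan speed until neither direction lowers total power."""
        while step >= 25.0:
            best = min([rpm - step, rpm, rpm + step], key=total_power)
            if best == rpm:
                step /= 2                              # settle: refine the search
            rpm = max(1000.0, best)
        return rpm

    rpm = seek_optimum()
    print(f"optimal fan speed ~{rpm:.0f} RPM, power {total_power(rpm):.1f} W")
    # settles around 1.3-1.4k RPM for these made-up constants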


3.6: Dealing with Timing Variation in Advanced Technologies

Moderators: Hans Manhaeve - Ridgetop Europe, BE; Saqib Khursheed - University of Southampton, UK
PDF icon MTTF-Balanced Pipeline Design [p. 270]
Fabian Oboril and Mehdi B. Tahoori

As CMOS technologies enter nanometer scales, microprocessors become more vulnerable to transistor aging, mainly due to Bias Temperature Instability and Hot Carrier Injection. These phenomena lead to increasing device delays during the operational lifetime, which results in increasing pipeline stage delays. However, the aging rates of different stages differ. Hence, a previously delay-balanced pipeline becomes increasingly imbalanced, resulting in a design that is non-optimal in terms of Mean Time to Failure (MTTF), frequency, area and power consumption. In this paper, we propose an MTTF-balanced pipeline design, in which the pipeline stage delays are balanced at the desired lifetime rather than at design time. This can lead to significant MTTF (lifetime) improvements as well as additional performance, area, and power benefits. Our experimental results show that the MTTF of the FabScalar microprocessor can be improved by 2x (or frequency by 3%) while achieving an additional 4% power and 1% area reduction.

PDF icon Efficient Variation-Aware Statistical Dynamic Timing Analysis for Delay Test Applications [p. 276]
Marcus Wagner and Hans-Joachim Wunderlich

Increasing parameter variations, caused by variations in process, temperature, power supply, and wear-out, have emerged as one of the most important challenges in semiconductor manufacturing and test. As a consequence for gate delay testing, a single test vector pair is no longer sufficient to provide the required low test escape probabilities for a single delay fault. Recently proposed statistical test generation methods are therefore guided by a metric, which defines the probability of detecting a delay fault with a given test set. However, since runtime and accuracy are dominated by the large number of required metric evaluations, more efficient approximation methods are mandatory for any practical application. In this work, a new statistical dynamic timing analysis algorithm is introduced to tackle this problem. The associated approximation error is very small and predominantly caused by the impact of delay variations on path sensitization and hazards. The experimental results show a large speedup compared to classical Monte Carlo simulations.

PDF icon SlackProbe: A Low Overhead In Situ On-line Timing Slack Monitoring Methodology [p. 282]
Liangzhen Lai, Vikas Chandra, Robert Aitken and Puneet Gupta

In situ monitoring is an accurate way to monitor circuit delay or timing slack, but usually incurs significant overhead. We observe that most existing slack monitoring methods focus exclusively on monitoring path-ending registers, which is not cost-efficient from power and area perspectives. In this paper, we propose the SlackProbe methodology, which inserts timing slack monitors like "probes" at a selected set of nets, including intermediate nets along critical paths. SlackProbe can significantly reduce the total number of monitors required at the cost of some additional delay margin. It can be used to detect impending delay failures due to various causes (process variations, ambient fluctuations, circuit aging, etc.) and can be combined with various preventive actions (e.g. voltage/frequency scaling, clock stretching/time borrowing, etc.). Though we focus on monitor selection in this work, we give an example of using SlackProbe with adaptive voltage scaling. Experimental results on commercial processors show that, with 5% additional timing margin, SlackProbe can reduce the number of monitors by 15-18X compared to inserting monitors at path-ending pins.

PDF icon Capturing Post-Silicon Variation by Layout-aware Path-delay Testing [p. 288]
Xiaolin Zhang, Jing Ye, Yu Hu and Xiaowei Li

With aggressive device scaling, the impact of parameter variation is becoming more prominent, which results in uncertainty in a chip's performance. Techniques that capture post-silicon variation by deploying on-chip monitors suffer from serious area overhead and low testing reliability, while techniques using non-invasive tests are limited to small-scale circuits. In this paper, a novel layout-aware post-silicon variation extraction method based on non-invasive path-delay testing is proposed. The key technique of the proposed method is a novel layout-aware heuristic path selection algorithm that takes the spatial correlation and linear dependence between paths into consideration. Experimental results show that the proposed technique can obtain an accurate timing variation distribution with zero area overhead. Moreover, the test cost is much smaller than that of the existing non-invasive method.
Keywords: variation extraction, path selection, path-delay testing, layout-aware

PDF icon Adaptive Reduction of the Frequency Search Space for Multi-Vdd Digital Circuits [p. 292]
Chandra K.H. Suresh, Ender Yilmaz, Sule Ozev and Ozgur Sinanoglu

Increasing process variations, coupled with the need for highly adaptable circuits, bring about tough new challenges in circuit testing. Circuit adaptation for process and workload variability requires costly characterization/test cycles for each chip in order to extract the particular Vdd/fmax behavior of the die under test. This paper aims at adaptively reducing the search space for fmax at multiple levels by reusing information previously obtained from the DUT during test time. The proposed adaptive solution reduces test/characterization time and cost with no area or test overhead.


3.7: Timing Analysis

Moderators: Stefan Petters - CISTER/INESC-TEC, ISEP, PT; Michael Paulitsch - EADS, DE
PDF icon FIFO Cache Analysis for WCET Estimation: A Quantitative Approach [p. 296]
Nan Guan, Xinping Yang, Mingsong Lv and Wang Yi

Although most previous work on cache analysis for WCET estimation assumes the LRU replacement policy, in practice more processors use simpler non-LRU policies for lower cost, power consumption and thermal output. This paper focuses on the analysis of FIFO, one of the most widely used cache replacement policies. Previous analysis techniques for FIFO caches are based on the same framework as for LRU caches, using qualitative always-hit/always-miss classifications. This approach, though it works well for LRU caches, is not suitable for analyzing FIFO and usually leads to poor WCET estimation quality. In this paper, we propose a quantitative approach for FIFO cache analysis. Roughly speaking, the proposed quantitative analysis derives an upper bound on the "miss ratio" of an instruction (set), which better captures FIFO cache behavior and supports more accurate WCET estimation. Experiments with benchmarks show that our quantitative FIFO analysis can drastically improve WCET estimation accuracy over previous techniques (the average overestimation ratio is reduced from around 70% to 10% under a typical setting).
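
A small Python simulation makes the "miss ratio" notion concrete: replaying a trace against a FIFO cache yields the per-address miss ratios that a quantitative analysis would bound statically. The fully associative 4-line cache and the trace are toy inputs for illustration.

    from collections import deque

    def fifo_miss_ratios(trace, ways=4):
        cache, hits, misses = deque(maxlen=ways), {}, {}
        for addr in trace:
            if addr in cache:
                hits[addr] = hits.get(addr, 0) + 1
            else:
                misses[addr] = misses.get(addr, 0) + 1
                cache.append(addr)              # FIFO: evict the oldest line
        return {a: misses.get(a, 0) / (hits.get(a, 0) + misses.get(a, 0))
                for a in set(hits) | set(misses)}

    trace = [0, 1, 2, 3, 0, 4, 0, 1, 2, 3, 0, 4] * 10
    print(fifo_miss_ratios(trace))   # per-address miss ratios over the toy trace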

PDF icon Timing Analysis of Multi-Mode Applications on AUTOSAR Conform Multi-Core Systems [p. 302]
Mircea Negrean, Sebastian Klawitter, Rolf Ernst

Many real-time embedded systems execute multi-mode applications, i.e. applications that can change their functionality over time. With the advent of multi-core embedded architectures, the system design process requires appropriate support for accommodating multi-mode applications on multiple cores which share common resources. Various mode change and resource arbitration protocols, and corresponding timing analysis solutions were proposed for either multi-mode or multi-core real-time applications. However, no attention was given to multi-mode applications that share resources when executing on multi-core systems. In this paper, we address this subject in the context of automotive multi-core processors using AUTOSAR. We present an approach for safely handling shared resources across mode changes and provide a corresponding timing analysis method.

PDF icon Bounding SDRAM Interference: Detailed Analysis vs. Latency-Rate Analysis [p. 308]
Hardik Shah, Alois Knoll and Benny Akesson

The transition towards multi-processor systems with shared resources is challenging for real-time systems, since resource interference between concurrent applications must be bounded using timing analysis. There are two common approaches to this problem: 1) Detailed analysis that models the particular resource and arbiter cycle-accurately to achieve tight bounds. 2) Using temporal abstractions, such as latency-rate (LR) servers, to enable unified analysis for different resources and arbiters using well-known timing analysis frameworks. However, the use of abstraction typically implies reducing the tightness of analysis that may result in over-dimensioned systems, although this pessimism has not been properly investigated. This paper compares the two approaches in terms of worst-case execution time (WCET) of applications sharing an SDRAM memory under Credit-Controlled Static-Priority (CCSP) arbitration. The three main contributions are: 1) A detailed interference analysis of the SDRAM memory and CCSP arbiter. 2) Based on the detailed analysis, two optimizations are proposed to the LR analysis that increase the tightness of its interference bounds. 3) An experimental comparison of the two approaches that quantifies their impact on the WCET of applications from the CHStone benchmark.


3.8: HOT TOPIC: Design for Variability, Manufacturability, Reliability, and Debug: Many Faces of the Same Coin?

Organizer: Vikas Chandra - ARM, US
Moderators: Vikas Chandra - ARM, US; Kartik Mohanram - University of Pittsburgh, US
PDF icon Role of Design in Multiple Patterning: Technology Development, Design Enablement and Process Control [p. 314]
Rani S. Ghaida and Puneet Gupta

Multiple-patterning optical lithography is inevitable for technology scaling beyond the 22nm technology node. Multiple patterning imposes several counter-intuitive restrictions on layout and carries serious challenges for design methodology. This paper examines the role of design at different stages of the development and adoption of multiple patterning: technology development, design enablement, and process control. We discuss how explicit design involvement can enable timely adoption of multi-patterning with reduced costs both in design and manufacturing.

PDF icon Overcoming Post-Silicon Validation Challenges through Quick Error Detection (QED) [p. 320]
David Lin, Ted Hong, Yanjing Li, Farzan Fallah, Donald S. Gardner, Nagib Hakim and Subhasish Mitra

Existing post-silicon validation techniques are generally ad hoc, and their cost and complexity are rising faster than design cost. Hence, systematic approaches to post-silicon validation are essential. Our research indicates that many of the bottlenecks of existing post-silicon validation approaches are direct consequences of very long error detection latencies. Error detection latency is the time elapsed between the activation of a bug during post-silicon validation and its detection or manifestation as a system failure. In our earlier papers, we created the Quick Error Detection (QED) technique to overcome this significant challenge. QED systematically creates a wide variety of post-silicon validation tests to detect bugs in processor cores and uncore components of multi-core System-on-Chips (SoCs) very quickly, i.e., with very short error detection latencies. In this paper, we present an overview of QED and summarize key results: 1. Error detection latencies of "typical" post-silicon validation tests can range up to billions of clock cycles. 2. QED shortens error detection latencies by up to 6 orders of magnitude. 3. QED enables 2- to 4-fold improvement in bug coverage. QED does not require any hardware modification. Hence, it is readily applicable to existing designs.
Keywords - Debug, Post-Silicon Validation, Quick Error Detection, Testing, Verification

PDF icon Stochastic Degradation Modeling and Simulation for Analog Integrated Circuits in Nanometer CMOS [p. 326]
Georges Gielen and Elie Maricau

Reliability is one of the major concerns in designing integrated circuits in nanometer CMOS technologies. Problems related to transistor degradation mechanisms like NBTI/PBTI or soft gate breakdown cause time-dependent circuit performance degradation. Variability and mismatch between transistors only make this more severe, while at the same time transistor aging can increase the variability and mismatch in the circuit over time. Finally, in advanced nanometer CMOS, the aging phenomena themselves become discrete, with both the time and the impact of degradation being fully stochastic. This paper explores these problems by means of a circuit example, indicating the time-dependent stochastic nature of offset in a comparator and its impact on flash A/D converters.
Keywords - analog integrated circuits; aging; reliability modeling and simulation


4.2: The Quest for Better NoCs

Moderators: Pascal Vivet - CEA-LETI, FR; Riccardo Locatelli - ST Microelectronics, FR
PDF icon A Transition-Signaling Bundled Data NoC Switch Architecture for Cost-effective GALS Multicore Systems [p. 332]
Alberto Ghiribaldi, Davide Bertozzi and Steven M. Nowick

Asynchronous networks-on-chip (NoCs) are an appealing solution to the synchronization challenge in modern multicore systems through the implementation of a GALS paradigm. However, they have found only limited applicability so far, for two main reasons: the lack of proper design tool flows and their significant area footprint compared to their synchronous counterparts. This paper proposes a largely unexplored design point for asynchronous NoCs, relying on transition-signaling bundled data, which helps break the above barriers. Compared to an existing lightweight synchronous switch architecture, xpipesLite, the post-layout asynchronous switch achieved a 71% reduction in area, up to 85% reduction in overall power consumption, and a 44% average reduction in energy-per-flit, while mastering the more stringent timing assumptions of this solution with a semi-automated synthesis flow.

PDF icon SMART: A Single-Cycle Reconfigurable NoC for SoC Applications [p. 338]
Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, Suvinay Subramanian, Anantha P. Chandrakasan and Li-Shiuan Peh

As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on the chip. Given aggressive SoC design targets, NoCs have to deliver low latency, high bandwidth, at low power and area overheads. In this paper, we propose Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded within the router crossbars, that allows packets to potentially bypass all the way from source to destination core within a single clock cycle, without being latched at any intermediate router. Our clockless repeater link has been proven in silicon in 45nm SOI. Results show that at 2GHz, we can traverse 8mm within a single cycle, i.e. 8 hops with 1mm cores. We implement the SMART NoC to layout and show that SMART NoC gives 60% latency savings, and 2.2X power savings compared to a baseline mesh NoC.

PDF icon Switch Folding: Network-on-Chip Routers with Time-Multiplexed Output Ports [p. 344]
G. Dimitrakopoulos, N. Georgiadis, C. Nicopoulos and E. Kalligeros

On-chip interconnection networks simplify the increasingly challenging process of integrating multiple functional modules in modern Systems-on-Chip (SoCs). The routers are the heart and backbone of such networks, and their implementation cost (area/power) determines the cost of the whole network. In this paper, we explore the time-multiplexing of a router's output ports via a folded datapath and control, where only a portion of the router's arbiters and crossbar multiplexers are implemented, as a means to reduce the cost of the router without sacrificing performance. In parallel, we propose the incorporation of the switch-folded routers into a new form of heterogeneous network topologies, comprising both folded (time-multiplexed) and unfolded (conventional) routers, which leads to effectively the same network performance, but at lower area/energy, as compared to topologies composed entirely of full-fledged wormhole or virtual-channel-based router designs.

PDF icon An Efficient Network-on-Chip Architecture Based on Isolating Local and Non-Local Communications [p. 350]
Vahideh Akhlaghi, Mehdi Kamal, Ali Afzali-Kusha and Massoud Pedram

In this paper, we propose a scheme for reducing the latency of packets transmitted via the on-chip interconnect network in MultiProcessor Systems-on-Chips (MPSoCs). In this scheme, the network architecture separates packets sent to nearby destinations from those sent to distant ones by using two network layers. These two layers are realized by dividing the channel width among the cores. The optimal ratio for the channel width division is a function of the relative significance of the two types of communication. Simulation results indicate that for non-uniform traffic consisting of more than 30 percent local traffic, the proposed network on average provides 64% and 70% improvements over the conventional one in terms of average network latency and Energy-Delay Product (EDP), respectively. Also, for uniform and NED traffic patterns, by adjusting the number of hops between local nodes so that approximately 55 percent of total communications are local, the proposed architecture provides a latency reduction of 50%.

PDF icon SVR-NoC: A Performance Analysis Tool for Network-on-Chips Using Learning-based Support Vector Regression Model [p. 354]
Zhiliang Qian, Da-Cheng Juan, Paul Bogdan, Chi-Ying Tsui, Diana Marculescu and Radu Marculescu

In this work, we propose SVR-NoC, a learning-based support vector regression (SVR) model for evaluating Network-on-Chip (NoC) latency performance. Different from the state-of-the-art NoC analytical model, which uses classical queuing theory to directly compute the average channel waiting time, the proposed SVR-NoC model performs NoC latency analysis based on learning the typical training data. More specifically, we develop a systematic machine-learning framework that uses the kernel-based support vector regression method to predict the channel average waiting time and the traffic flow latency. Experimental results show that SVR-NoC can predict the average packet latency accurately while achieving about 120X speed-up over simulation-based evaluation methods.
Index Terms - Network-on-Chip, learning, performance model
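
In the same spirit, the following sketch (assuming scikit-learn) trains a kernel SVR to predict channel waiting time from traffic features; the feature set and the synthetic queueing-like training data are illustrative assumptions, not the paper's setup.

    import numpy as np
    from sklearn.svm import SVR

    # Learn channel waiting time from per-flow features, then query the model.
    rng = np.random.default_rng(7)
    inj_rate = rng.uniform(0.05, 0.45, 500)            # flits/cycle per channel
    hops = rng.integers(1, 8, 500)
    wait = hops * inj_rate / (1.0 - 2.0 * inj_rate) + rng.normal(0, 0.05, 500)

    X = np.column_stack([inj_rate, hops])
    model = SVR(kernel="rbf", C=10.0).fit(X, wait)
    print(model.predict([[0.30, 4]]))                  # predicted waiting time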


4.3: EMBEDDED TUTORIAL: Reliability Analysis Reloaded: How Will We Survive?

Organizers: Goerschwin Fey - University of Bremen, DE; Matteo Sonza Reorda - Politecnico di Torino, IT
Moderators: Bernd Becker - University of Freiburg, DE; Xavier Vera - Intel, ES
PDF icon Reliability Analysis Reloaded: How Will We Survive? [p. 358]
Robert Aitken, Görschwin Fey, Zbigniew T. Kalbarczyk, Frank Reichenbach, Matteo Sonza Reorda

In safety-related applications and in products with long lifetimes, reliability is a must. Moreover, in future integrated-circuit technology nodes, device-level reliability may decrease, i.e., counter-measures have to be taken to ensure product-level reliability. But assessing the reliability of a large system is not a trivial task. This paper revisits the state of the art in reliability evaluation, starting from the physical device level, through the software system level, all the way up to the product level. Relevant standards and future trends are discussed.


4.4: Emerging Solutions to Manage Energy/Performance Trade-Offs along the Memory Hierarchy

Moderators: Mladen Berekovic - Technical University of Braunschweig, DE; Cristina Silvano - Politecnico di Milano, IT
PDF icon MALEC: A Multiple Access Low Energy Cache [p. 368]
Matthias Boettcher, Giacomo Gabrielli, Bashir M. Al-Hashimi and Danny Kershaw

This paper addresses the dynamic energy consumption of L1 data cache interfaces in out-of-order superscalar processors. The proposed Multiple Access Low Energy Cache (MALEC) is based on the observation that consecutive memory references tend to access the same page. It exhibits a performance level similar to state-of-the-art caches, but consumes approximately 48% less energy. This is achieved by deliberately restricting accesses to only one page per cycle, allowing the use of single-ported TLBs and cache banks, and simplified lookup structures for Store and Merge Buffers. To mitigate performance penalties, it shares memory address translation results between multiple memory references, and shares data among loads to the same cache line. In addition, it uses a Page-Based Way Determination scheme that holds way information of recently accessed cache lines in small storage structures called way tables, which are closely coupled to TLB lookups and able to simultaneously service all accesses to a particular page. Moreover, it removes the need for redundant tag-array accesses usually required to confirm way predictions. For the analyzed workloads, MALEC achieves average energy savings of 48% in the L1 data memory subsystem over a high-performance cache interface that supports up to 2 loads and 1 store in parallel. Comparing MALEC and the high-performance interface against a low-power configuration limited to only 1 load or 1 store per cycle reveals 14% and 15% performance gains, requiring 22% less and 48% more energy, respectively. Furthermore, Page-Based Way Determination exhibits coverage of 94%, a 16% improvement over the originally proposed line-based way determination.
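
A Python sketch of the page-based way-determination bookkeeping: a small table keyed by page records the cache way of each recently touched line, so an access that hits the table can skip the full tag lookup. Table sizes and the eviction rule are illustrative assumptions.

    LINES_PER_PAGE, WAYS = 64, 4

    class WayTable:
        def __init__(self, pages=8):
            self.pages = pages
            self.table = {}                      # page -> [way or None] per line

        def lookup(self, page, line):
            entry = self.table.get(page)
            return None if entry is None else entry[line]  # None => full lookup

        def update(self, page, line, way):
            if page not in self.table:
                if len(self.table) >= self.pages:          # evict an old page
                    self.table.pop(next(iter(self.table)))
                self.table[page] = [None] * LINES_PER_PAGE
            self.table[page][line] = way

    wt = WayTable()
    wt.update(page=0x2a, line=5, way=2)
    print(wt.lookup(0x2a, 5), wt.lookup(0x2a, 6))   # 2 None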

PDF icon TreeFTL: Efficient RAM Management for High Performance of NAND Flash-based Storage Systems [p. 374]
Chundong Wang and Weng-Fai Wong

NAND flash memory is widely used for secondary storage today. The flash translation layer (FTL) is the embedded software responsible for managing and operating the flash storage system. One important module of the FTL performs RAM management, which is well known to have a significant impact on the flash storage system's performance. This paper proposes an efficient RAM management scheme called TreeFTL. As the name suggests, TreeFTL organizes address translation pages and data pages in RAM in a tree structure, through which it dynamically adapts to workloads by adjusting the partitions for address mapping and data buffering. TreeFTL also employs a lightweight mechanism to implement the least-recently-used (LRU) algorithm for RAM cache evictions. Experiments show that, compared to the two latest schemes for RAM management in flash storage systems, TreeFTL reduces service time by 46.6% and 49.0% on average, respectively, with a 64MB RAM cache.
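
The eviction mechanics can be sketched with a minimal LRU page cache in Python; TreeFTL's tree-organized cache with dynamic mapping/data partitioning is more elaborate, so treat this only as the baseline behavior it builds on (capacity and page contents are toy values).

    from collections import OrderedDict

    class PageCache:
        def __init__(self, capacity=4):
            self.capacity = capacity
            self.pages = OrderedDict()           # page id -> contents

        def access(self, page):
            if page in self.pages:
                self.pages.move_to_end(page)     # refresh recency on a hit
                return "hit"
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # evict the LRU page
            self.pages[page] = f"data-{page}"
            return "miss"

    c = PageCache()
    print([c.access(p) for p in (1, 2, 3, 1, 4, 5, 1)])
    # ['miss', 'miss', 'miss', 'hit', 'miss', 'miss', 'hit']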

PDF icon DA-RAID-5: A Disturb Aware Data Protection Technique for NAND Flash Storage Systems [p. 380]
Jie Guo, Wujie Wen, Yaojun Zhang Li, Sicheng Li, Hai Li and Yiran Chen

Program disturb, read disturb and retention time limits are three major sources of bit errors in NAND flash memory. The adoption of multi-level cell (MLC) technology and technology scaling further aggravate this reliability issue by narrowing threshold voltage noise margins and introducing larger device variations. Besides implementing error correction codes (ECC) in NAND flash modules, RAID-5 is often deployed at the system level to protect the data integrity of NAND flash storage systems (NFSS), albeit with significant performance degradation. In this work, we propose a technique called "DA-RAID-5" (where DA stands for "disturb aware") to improve the performance of enterprise NFSS under RAID-5 protection without harming reliability. Three schemes, namely unbound-disturb limiting (UDL), PE-aware RAID-5 and Hybrid Caching (HC), are proposed to protect the NFSS at different stages of its lifetime. The experimental results show that, compared to the best prior work, DA-RAID-5 improves NFSS response time by 9.7% on average.

PDF icon Exploiting Subarrays inside a Bank to Improve Phase Change Memory Performance [p. 386]
Jianhui Yue and Yifeng Zhu

Enabling subarrays reduces memory latency by allowing concurrent accesses to different subarrays within the same bank in DRAM systems. However, this technology faces great challenges in PCM systems, since an on-going write cannot overlap with other accesses due to the large electric current drawn by writes. This paper proposes two new mechanisms (PASAK and WAVAK) that leverage subarray-level parallelism to enable a bank to serve a write and multiple reads in parallel without violating power constraints. PASAK exploits the electric current difference between writing a bit 0 and a bit 1, and provides a new power allocation strategy that better utilizes the power budget to mitigate the performance degradation due to bank conflicts. WAVAK adds a simple coding method that inverts all bits to be written if there are more zeros than ones, with the goal of reducing the electric current of writes and creating a larger power surplus to serve more reads when there is no subarray conflict. Experimental results under 4-core SPEC CPU 2006 workloads show that our proposed mechanisms reduce memory latency by 68.7% and running time by 34.8% on average, compared with the standard PCM system. In addition, our mechanisms outperform Flip-N-Write by 14.6% in latency and 8.5% in running time on average.
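
The inversion coding behind WAVAK is easy to state in code: if a word to be written contains more zeros than ones, store its complement plus a one-bit flip flag, keeping the high-current bit value in the minority. The 8-bit word size is an illustrative choice.

    def encode(word, bits=8):
        zeros = bits - bin(word).count("1")
        if zeros > bits - zeros:
            return (~word) & ((1 << bits) - 1), 1   # store inverted, flag = 1
        return word, 0

    def decode(stored, flag, bits=8):
        return (~stored) & ((1 << bits) - 1) if flag else stored

    w = 0b00000110                                  # six zeros -> invert
    s, f = encode(w)
    print(bin(s), f, decode(s, f) == w)             # 0b11111001 1 True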

PDF icon Future of GPGPU Micro-Architectural Parameters [p. 392]
Cedric Nugteren, Gert-Jan van den Braak and Henk Corporaal

As graphics processing units (GPUs) are becoming increasingly popular for general-purpose workloads (GPGPU), the question arises how such processors will evolve architecturally in the near future. In this work, we identify and discuss trade-offs for three GPU architecture parameters: active thread count, compute-memory ratio, and cluster and warp sizing. For each parameter, we propose changes to improve GPU design, keeping in mind trends such as dark silicon and the increasing popularity of GPGPU architectures. A key enabler is dynamism and workload-adaptiveness, enabling among others: dynamic register file sizing, latency-aware scheduling, roofline-aware DVFS, run-time cluster fusion, and dynamic warp sizing.

PDF icon Synchronizing Code Execution on Ultra-Low-Power Embedded Multi-Channel Signal Analysis Platforms [p. 396]
Ahmed Yasir Dogan, Rubén Braojos, Jeremy Constantin, Giovanni Ansaloni, Andreas Burg and David Atienza

Embedded biosignal analysis involves a considerable amount of parallel computation, which can be exploited by employing low-voltage, ultra-low-power (ULP) parallel computing architectures. By allowing data and instruction broadcasting, the single instruction multiple data (SIMD) processing paradigm enables considerable power savings and application speedup, in turn allowing a lower supply voltage for a given workload. State-of-the-art multi-core architectures for biosignal analysis, however, lack a bare yet smart synchronization technique among the cores that allows lockstep execution of algorithm parts that can be performed using SIMD, even in the presence of data-dependent execution flows. In this paper, we propose a lightweight synchronization technique to enhance a ULP multi-core processor, resulting in improved energy efficiency through lockstep SIMD execution. Our results show that the proposed improvements accomplish tangible power savings, up to 64% for an 8-core system operating at a workload of 89 MOps/s while exploiting voltage scaling.

PDF icon Using Synchronization Stalls in Power-aware Accelerators [p. 400]
Ali Jooya and Amirali Baniasadi

GPUs spend significant time on synchronization stalls. Such stalls provide ample opportunity to save leakage energy in GPU structures left idle during such periods. In this paper we focus on the register file structure of NVIDIA GPUs and introduce sync-aware low leakage solutions to reduce power. Accordingly, we show that applying the power gating technique to the register file during synchronization stalls can improve power efficiency without considerable performance loss. To this end, we equip the register file with two leakage power saving modes with different levels of power saving and wakeup latencies.


4.5: Device Identification and Protection

Moderators: Patrick Koeberl - Intel Labs, DE; Roel Maes - Intrinsic-ID, NL
PDF icon Comprehensive Analysis of Software Countermeasures against Fault Attacks [p. 404]
Nikolaus Theißing, Dominik Merli, Michael Smola, Frederic Stumpf and Georg Sigl

Fault tolerant software against fault attacks constitutes an important class of countermeasures for embedded systems. In this work, we implemented and systematically analyzed a comprehensive set of 19 different strategies for software countermeasures with respect to protection effectiveness as well as time and memory efficiency. We evaluated the performance and security of all implementations by fault injections into a microcontroller simulator based on an ARM Cortex-M3. Our results show that some rather simple countermeasures outperform other more sophisticated methods due to their low memory and/or performance overhead. Further, combinations of countermeasures show strong characteristics and can lead to a high fault coverage, while keeping additional resources at a minimum. The results obtained in this study provide developers of secure software for embedded systems with a solid basis to decide on the right type of fault attack countermeasure for their application.

PDF icon An EDA-Friendly Protection Scheme against Side-Channel Attacks [p. 410]
Ali Galip Bayrak, Nikola Velickovic, Francesco Regazzoni, David Novo, Philip Brisk and Paolo Ienne

This paper introduces a generic and automated methodology to protect hardware designs from side-channel attacks in a manner that is fully compatible with commercial standard cell design flows. The paper describes a tool that artificially adds jitter to the clocks of the sequential elements of a cryptographic unit, which increases the non-determinism of signal timing, thereby making the physical device more difficult to attack. Timing constraints are then specified to commercial EDA tools, which restore the circuit functionality and efficiency while preserving the introduced randomness. The protection scheme is applied to an AES-128 hardware implementation that is synthesized using both ASIC and FPGA design flows.

PDF icon Design and Implementation of a Group-based RO PUF [p. 416]
Chi-En Yin, Gang Qu and Qiang Zhou

The silicon physical unclonable function (PUF) utilizes the uncontrollable variations in the integrated circuit (IC) fabrication process to facilitate security-related applications such as IC authentication. In this paper, we describe a new framework to generate a secure PUF secret from a ring oscillator (RO) PUF with improved hardware efficiency. Our work is based on the recently proposed group-based RO PUF, with the following novel concepts: an entropy distiller to filter the systematic variation; a simplified grouping algorithm to partition the ROs into groups; a new syndrome coding scheme to facilitate error correction; and an entropy packing method to enhance coding efficiency and security. Using an RO PUF dataset available in the public domain, we demonstrate that these concepts can create a PUF secret that passes the NIST randomness and stability tests. Compared to other state-of-the-art RO PUF designs, our approach can generate on average 72% more PUF secret bits with the same amount of hardware.

PDF icon ClockPUF: Physical Unclonable Functions Based on Clock Networks [p. 422]
Yida Yao, MyungBo Kim, Jianmin Li, Igor L. Markov and Farinaz Koushanfar

Physical Unclonable Functions (PUFs) extract unique chip signatures from process variations. They are used in identification, authentication, integrity verification, and anti-counterfeiting tasks. We introduce new PUF techniques that extract bits from pairwise skews between sinks of a clock network. These techniques inherit the stability of the clock network, but require a return network to deliver clock pulses to a common region, where they are compared. Our algorithms select equidistant sinks and route the return network, then derive chip-specific random bits from the available data with a moderate overhead. SPICE-based evaluation of ClockPUFs using a 45nm CMOS technology validates their operability, stability, uniqueness, randomness, and low overhead.
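
A toy Python model of bit extraction from pairwise skews: each fabricated chip gets slightly different clock-sink arrival times, and comparing sink pairs yields a chip-unique bitstring. The Gaussian skew model and the adjacent-pair comparison scheme are illustrative; the paper selects equidistant sinks and routes a real return network.

    import random

    def chip_instance(n_sinks=16, sigma_ps=5.0, seed=None):
        """Per-chip clock-sink arrival-time deviations (ps), from process variation."""
        rng = random.Random(seed)
        return [rng.gauss(0.0, sigma_ps) for _ in range(n_sinks)]

    def puf_bits(arrivals):
        # One bit per adjacent sink pair: the sign of the pairwise skew
        return [int(arrivals[i] > arrivals[i + 1]) for i in range(len(arrivals) - 1)]

    chip_a, chip_b = chip_instance(seed=1), chip_instance(seed=2)
    print(puf_bits(chip_a))
    print(puf_bits(chip_b))   # a different chip yields a different signature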

PDF icon Memristor PUFs: A New Generation of Memory-based Physically Unclonable Functions [p. 428]
Patrick Koeberl, Ünal Kocabas and Ahmad-Reza Sadeghi

Memristors are emerging as a potential candidate for next-generation memory technologies, promising to deliver non-volatility at performance and density targets which were previously the domain of SRAM and DRAM. Silicon Physically Unclonable Functions (PUFs) have been introduced as a relatively new security primitive which exploit manufacturing variation resulting from the IC fabrication process to uniquely fingerprint a device instance or generate device-specific cryptographic key material. While silicon PUFs have been proposed which build on traditional memory structures, in particular SRAM, in this paper we present a memristor-based PUF which utilizes a weak-write mechanism to obtain cell behaviour which is influenced by process variation and hence usable as a PUF response. Using a model-based approach we evaluate memristor PUFs under random process variations and present results on the performance of this new PUF variant.

PDF icon Wireless Sensor Network Simulation for Security and Performance Analysis [p. 432]
A. Díaz, P. Sanchez, J. Sancho and J. Rico

In recent years, Wireless Sensor Networks (WSNs) have been deployed at an accelerating rate, and the complexity and low-power requirements of these networks have also been growing. WSN developers are therefore beginning to require efficient methodologies for network simulation and embedded SW performance analysis. These tools should also include security analysis, which has to evaluate the vulnerability of a WSN to the wide variety of attacks these networks can suffer. WSN attacks can also affect the power consumption and performance of a node's software, so security analysis has to be integrated into a complete performance analysis framework. This work proposes a methodology to simulate the most common and dangerous attacks that a WSN can suffer today. The impact of these attacks on power consumption and software execution time is also analyzed. This provides developers with important information about the effects that one or multiple attacks could have on the WSN, helping them to develop more secure software.
Index Terms - WSN, Attack Simulation, Power Consumption, Performance Analysis, Security.


4.6: New Techniques for Test Pattern Generation

Moderators: Sudhakar Reddy - University of Iowa, US; Matteo Sonza Reorda - Politecnico di Torino, IT
PDF icon Accurate QBF-based Test Pattern Generation in Presence of Unknown Values [p. 436]
Stefan Hillebrecht, Michael A. Kochte, Dominik Erb, Hans-Joachim Wunderlich and Bernd Becker

Unknown (X) values may emerge during the design process as well as during system operation and test application. Sources of X-values are, for example, black boxes, clock-domain boundaries, analog-to-digital converters, or uncontrolled or uninitialized sequential elements. To compute a detecting pattern for a given stuck-at fault, well-defined logic values are required both for fault activation and for fault effect propagation to observing outputs. In the presence of X-values, classical test generation algorithms, based on topological algorithms or on formal Boolean satisfiability (SAT) or BDD-based reasoning, may fail to generate test patterns or to prove faults untestable. This work proposes the first efficient stuck-at fault ATPG algorithm able to prove testability or untestability of faults in the presence of X-values. It overcomes the principal inaccuracy and pessimism of classical algorithms when X-values are considered. This accuracy is achieved by mapping the test generation problem to an instance of quantified Boolean formula (QBF) satisfiability. The resulting fault coverage improvement is shown by experimental results on ISCAS benchmarks and larger industrial circuits.
Index Terms - Unknown values, test generation, ATPG, QBF

PDF icon Test Solution for Data Retention Faults in Low-Power SRAMs [p. 442]
L. B. Zordan, A. Bosio, L. Dilillo, P. Girard, A. Todri, A. Virazel and N. Badereddine

Low-power SRAMs embed mechanisms for reducing static power consumption. When the SRAM is not accessed for a long period, it switches into an intermediate low-power mode. In this mode, a voltage regulator is used to reduce the voltage supplied to the core-cells as far as possible without data loss. Thus, fault-free behavior of the voltage regulator is crucial for ensuring data retention in core-cells when the SRAM is in low-power mode. This paper investigates the root cause of data retention faults due to voltage regulator malfunctions. This analysis is done under realistic conditions (i.e., industrial core-cells affected by process variations). Based on this analysis, we propose an efficient test flow for detecting data retention faults in low-power SRAMs.
Keywords - SRAM, low-power design, test algorithm, memory test.

PDF icon Efficient SAT-based Dynamic Compaction and Relaxation for Longest Sensitizable Paths [p. 448]
Matthias Sauer, Sven Reimer, Tobias Schubert, Ilia Polian and Bernd Becker

Comprehensive coverage of small-delay faults under massive process variations is achieved when multiple paths through the fault locations are sensitized by the test pair set. Using one test pair per path may lead to impractical test set sizes and test application times due to the large number of near-critical paths in state-of-the-art circuits. We present a novel SAT-based dynamic test-pattern compaction and relaxation method for sensitized paths in sequential and combinational circuits. The method identifies necessary assignments for path sensitization and encodes them as a SAT instance. An efficient implementation of a bitonic sorting network is used to find test patterns maximizing the number of simultaneously sensitized paths. The compaction is combined with an efficient lifting-based relaxation technique. An innovative implication-based path-conflict analysis is used for fast identification of conflicting paths. Detailed experimental results demonstrate the applicability and quality of the method for academic and industrial benchmark circuits. Compared to fault dropping, the number of patterns is significantly reduced, by over 85% on average, while at the same time leaving more than 70% of the inputs unspecified.

PDF icon Process-Variation-Aware Iddq Diagnosis for Nano-Scale CMOS Designs - The First Step [p. 454]
Chia-Ling (Lynn) Chang, Charles H.-P. Wen and Jayanta Bhadra

Along with the shrinking CMOS process and rapid design scaling, both the Iddq values of chips and their variation increase. As a result, defect leakages become less significant when compared to the full-chip currents, making them harder to distinguish for traditional Iddq diagnosis. Therefore, in this paper, a new approach called σ-Iddq diagnosis is proposed for intelligently reinterpreting the original data and diagnosing failing chips. The overall flow consists of two key components: (1) σ-Iddq transformation and (2) defect-syndrome matching. σ-Iddq transformation first manifests defect leakages by excluding both the process-variation and design-scaling impacts. Defect-syndrome matching then applies data mining with a pre-built library to identify the type and locations of defects on the fly. Experimental results show that an average of 93.68% accuracy with a resolution of 1.75 defect suspects can be achieved on ISCAS'89 and IWLS'05 benchmark circuits using a 45nm technology, demonstrating the effectiveness of σ-Iddq diagnosis.
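
A minimal sketch of the transformation idea, assuming a linear Iddq-versus-process trend: re-express each chip's current as a deviation, in sigmas, from the trend fitted across the population, so a small defect leakage stands out. The trend model, the ring-oscillator speed proxy, and the data are invented for illustration.

    import numpy as np

    def sigma_iddq(iddq, speed):
        trend = np.polyfit(speed, iddq, 1)       # expected Iddq vs. process proxy
        resid = iddq - np.polyval(trend, speed)
        return resid / resid.std()               # defect syndrome in sigmas

    rng = np.random.default_rng(3)
    speed = rng.normal(1.0, 0.05, 200)           # per-chip process proxy
    iddq = 50 * speed + rng.normal(0, 1.0, 200)  # population trend plus noise
    iddq[17] += 6.0                              # one chip with a leaky defect
    scores = sigma_iddq(iddq, speed)
    print(np.argmax(scores), scores[17])         # chip 17 flags as an outlier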


4.7: HOT TOPIC: Security Challenges in Automotive Hardware/Software Architecture Design

Organizer: Samarjit Chakraborty - TU Munich, DE
Moderators: Jason Xue - City Univ. of Hong Kong, HK; Dip Goswami - TU Munich, DE
PDF icon Security Challenges in Automotive Hardware/Software Architecture Design [p. 458]
Florian Sagstetter, Martin Lukasiewycz, Sebastian Steinhorst; Marko Wolf, Alexandre Bouard, William R. Harris, Somesh Jha, Thomas Peyrin, Axel Poschmann, Samarjit Chakraborty

This paper is an introduction to security challenges for the design of automotive hardware/software architectures. State-of-the-art automotive architectures are highly heterogeneous and complex systems that rely on distributed functions based on electronics and software. As cars are getting more connected with their environment, the vulnerability to attacks is rapidly growing. Examples for such wireless communication are keyless entry systems, WiFi, or Bluetooth. Despite this increasing vulnerability, the design of automotive architectures is still mainly driven by safety and cost issues rather than security. In this paper, we present potential threats and vulnerabilities, and outline upcoming security challenges in automotive architectures. In particular, we discuss the challenges arising in electric vehicles, like the vulnerability to attacks involving tampering with the battery safety. Finally, we discuss future automotive architectures based on Ethernet/IP and how formal verification methods might be used to increase their security.


5.1: HOT TOPIC - System Approaches to Energy-Efficiency

Organizer: Ahmed Jerraya - CEA-LETI-MINATEC, FR
Moderators: Patrick Blouet - ST Ericsson, FR; Ahmed Jerraya - CEA-LETI-MINATEC, FR
PDF icon Experiences with Mobile Processors for Energy Efficient HPC [p. 464]
Nikola Rajovic, Alejandro Rico, James Vipond, Isaac Gelado, Nikola Puzovic and Alex Ramirez

The performance of High Performance Computing (HPC) systems is already limited by their power consumption. The majority of top HPC systems today are built from commodity server components that were designed to maximize compute performance. The Mont-Blanc project aims at using low-power parts from the mobile domain for HPC. In this paper we present our first experiences with the use of mobile processors and accelerators for the HPC domain, based on the research performed in the project. We show an initial evaluation of the NVIDIA Tegra 2 and Tegra 3 mobile SoCs and the NVIDIA Quadro 1000M GPU with a set of HPC microbenchmarks to evaluate their potential for energy-efficient HPC.

PDF icon What Designs for Coming Supercomputers? [p. 469]
Xavier Vigouroux

The next grail sought by the HPC community is the exascale, 100 times the current scale. This target will not be reached easily, as many challenges are arising. The first challenge, energy consumption, has become a strict constraint, with a limit set to 20MW (twice that of current top supercomputers). Multiplying the computing elements will require drastically reducing the power consumption of each of them. The second challenge will be to keep the machine cool: first, the overall power envelope of 20MW includes the energy for cooling, and second, the 20MW will be turned into heat by the Joule effect. The operating temperature of the electronics must be bounded; otherwise, leakage (and thus power consumption) increases and reliability decreases. This brings us to a third challenge regarding the reliability of the machine: the number of components will be tremendous, so the probability that some fail will increase. Failures have to be managed in such a way that applications are not impacted. The last challenge relates to the software stack of these supercomputers: how will we manage billions of threads, and how will we debug them? New paradigms that try to tackle these aspects, for instance bag-of-tasks, are currently being studied. In this presentation, brightened up with insight into the Bull roadmap, we present a possible future.

PDF icon Energy-Efficient In-Memory Database Computing [p. 470]
Wolfgang Lehner

The efficient and flexible management of large datasets is one of the core requirements of modern business applications. Having access to consistent and up-to-date information is the foundation for operational, tactical, and strategic decision making. Within the last few years, the database community sparked a large number of extremely innovative research projects to push the envelope in the context of modern database system architectures. In this paper, we outline requirements and influencing factors to identify some of the hot research topics in database management systems. We argue that, even after 30 years of active database research, the time is right to rethink some of the core architectural principles and come up with novel approaches to meet the requirements of the next decades in data management. The sheer number of diverse and novel (e.g., scientific) application areas, the existence of modern hardware capabilities, and the need of large data centers to become more energy-efficient will be the drivers for database research in the years to come.

PDF icon Performance Analysis of HPC Applications on Low-Power Embedded Platforms [p. 475]
Luka Stanisic, Brice Videau, Johan Cronsioe, Augustin Degomme, Vania Marangozova-Martin, Arnaud Legrand, Jean-François Méhaut

This paper presents performance evaluation and analysis of well-known HPC applications and benchmarks running on low-power embedded platforms. The performance to power consumption ratios are compared to classical x86 systems. Scalability studies have been conducted on the Mont-Blanc Tibidabo cluster. We have also investigated optimization opportunities and pitfalls induced by the use of these new platforms, and proposed optimization strategies based on auto-tuning.


5.2: PANEL: Can Energy Harvesting Deliver Enough Power for Automotive Electronics?

Organizers: Tom Kazmierski - University of Southampton, UK; Christoph Grimm - TU Kaiserslautern, DE
Moderators: Peter Neumann - Edacentrum, DE; Norbert Wehn - TU Kaiserslautern, DE
PDF icon Alternative Power Supply Concepts for Self-Sufficient Wireless Sensor Nodes by Energy Harvesting [p. 481]
Robert Kappel, Günter Hofer, Gerald Holweg, Thomas Herndl

Replacing batteries in wireless sensor nodes by energy harvesting enables maintenance-free operation and an increasing degree of miniaturization, at the cost of higher power management effort. The limited power capability of environmental sources requires a careful investigation of the different harvesting opportunities to find the optimal source in a specific application scenario. Promising resources in the automotive area are kinetic and thermoelectric harvesters. In this talk, the physical properties of energy converters are analyzed to show their restrictions and to allow power estimation. In addition, examples of already established self-sufficient sensors are presented.

PDF icon Adaptable, High Performance Energy Harvesters [p. 482]
Paul D. Mitcheson

Energy harvesting has become a very popular research topic over the last 12 years, but has only made an industrial impact in a few areas, notably in process plant monitoring, including the water and petrochemical processing industries. Like most technologies, greater adoption needs to be realized if performance is to increase and cost to decrease. Batteries cost only tens of pence per Wh, and whilst harvesters can in theory generate a very large amount of energy over a long enough period of operation, a typical harvester can require a capital expenditure of tens to hundreds of pounds, making them unattractive in many applications. The automotive sector is a potential area in which harvesters could provide useful functionality and gain from economies of scale, if they can be made reliable enough, with a high enough power density, and able to work well in a wide enough variety of scenarios. Recent work on increasing the power density of energy harvesters has focused on improving the power electronic interface, tuning the resonant frequency of motion-driven harvesters and reducing the power consumption of the load electronics.
Keywords - energy harvesting; adaptive systems; power density

PDF icon Ultra-Low Power: An EDA Challenge [p. 483]
Christoph Grimm, Javier Moreno, Xiao Pan

Visions such as the Internet of Things require vast numbers of sensors distributed in our environment that strongly rely on circuits that are energy autonomous. However, the design of such circuits is a challenge that is currently mastered by experts only: designers must cope with circuit-level design and even technology while designing an application. Unfortunately, tools and methods that support cross-layer and cross-domain optimizations are missing.
Keywords - ultra-low power, cross-layer optimization

PDF icon DoE-based Performance Optimization of Energy Management in Sensor Nodes Powered by Tunable Energy-Harvesters [p. 484]
Tom J. Kazmierski, Leran Wang, Bashir Al-Hashimi, Geoff Merrett

An energy-harvester-powered wireless sensor node is a complicated system with many design parameters. To investigate the various trade-offs among these parameters, it is desirable to explore the multi-dimensional design space quickly. However, due to the large number of parameters and costly simulation CPU times, it is often difficult or even impossible to explore the design space via simulation. A design of experiment (DoE) approach using the response surface model (RSM) technique can enable fast design space exploration of a complete wireless sensor node powered by a tunable energy harvester. As a proof of concept, a software toolkit has been developed which implements the DoE-based design flow and incorporates the energy harvester, tuning controller and wireless sensor node. Several test scenarios are considered, which illustrate how the proposed approach permits the designer to adjust a wide range of system parameters and evaluate the effect almost instantly but still with high accuracy.
Keywords - energy harvesters, design of experiment, wireless sensor nodes
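
To make the response-surface idea concrete, the following minimal sketch (invented parameter names and data, not the paper's toolkit) fits a quadratic RSM to a handful of simulated design points and then predicts the metric for an unsimulated configuration instantly:

```python
# Minimal response-surface-model sketch (illustrative; not the paper's toolkit).
# Fit a quadratic surface to a few simulated design points, then predict the
# metric (e.g., energy per task) over the design space without re-simulating.
import numpy as np

def quadratic_features(x1, x2):
    # Design matrix for a full quadratic model in two parameters.
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# Hypothetical DoE sample: (duty_cycle, buffer_size) -> simulated energy.
x1 = np.array([0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.3, 0.7])
x2 = np.array([8.0, 64.0, 8.0, 64.0, 8.0, 64.0, 32.0, 32.0])
y  = np.array([3.1, 4.0, 2.2, 2.9, 2.8, 3.5, 2.4, 2.6])   # invented, in mJ

coeffs, *_ = np.linalg.lstsq(quadratic_features(x1, x2), y, rcond=None)

# Instant prediction for an unsimulated configuration:
probe = quadratic_features(np.array([0.4]), np.array([16.0]))
print(f"predicted energy: {(probe @ coeffs)[0]:.2f} mJ (hypothetical units)")
```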


5.3: Post-Silicon Debug Techniques

Moderators: Jaan Raik - Tallinn University of Technology, EE; Adrian Evans - iRoC Technologies, FR
PDF icon A Hybrid Approach for Fast and Accurate Trace Signal Selection for Post-Silicon Debug [p. 485]
Min Li and Azadeh Davoodi

The main challenge in post-silicon debug is the lack of observability of the internal signals of a chip. Trace buffer technology provides one avenue to address this challenge by online tracing of a few selected state elements. Due to the limited bandwidth of the trace buffer, only a few state elements can be selected for tracing. Recent research has focused on the automated trace signal selection problem, aiming to maximize restoration of the untraced state elements from the few traced signals. Existing techniques can be categorized into high-quality but slow "simulation-based" techniques and lower-quality but much faster "metric-based" techniques. This work presents a new trace signal selection technique whose quality is comparable to or better than that of simulation-based techniques, with a fast runtime comparable to that of metric-based techniques.

PDF icon Machine Learning-based Anomaly Detection for Post-silicon Bug Diagnosis [p. 491]
Andrew DeOrio, Qingkun Li, Matthew Burgess and Valeria Bertacco

The exponentially growing complexity of modern processors intensifies verification challenges. Traditional pre-silicon verification covers less and less of the design space, resulting in increasing post-silicon validation effort. A critical challenge is the manual debugging of intermittent failures on prototype chips, where multiple executions of the same test do not yield a consistent outcome. We leverage the power of machine learning to support automatic diagnosis of these difficult, inconsistent bugs. During post-silicon validation, lightweight hardware logs a compact measurement of observed signal activity over multiple executions of the same test: some may pass, some may fail. Our novel algorithm applies anomaly detection techniques, similar to those used to detect credit card fraud, to identify the approximate cycle of a bug's occurrence and a set of candidate root-cause signals. Compared against other state-of-the-art solutions in this space, our new approach can locate the time of a bug's occurrence with nearly 4x better accuracy when applied to the complex OpenSPARC T2 design.
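
The following toy sketch (synthetic data; not the authors' algorithm) illustrates the underlying anomaly-detection idea: learn per-signal activity statistics from passing runs, then flag the (cycle, signal) pairs in a failing run that deviate strongly from that baseline:

```python
# Toy anomaly-detection sketch (not the paper's algorithm): learn per-signal
# activity statistics from passing runs, then flag (cycle, signal) pairs in a
# failing run whose activity deviates strongly from the baseline.
import numpy as np

def train_baseline(passing_runs):
    # passing_runs: array of shape (runs, cycles, signals) of activity counts.
    mu = passing_runs.mean(axis=0)              # per-(cycle, signal) mean
    sigma = passing_runs.std(axis=0) + 1e-9     # avoid division by zero
    return mu, sigma

def flag_anomalies(failing_run, mu, sigma, z_thresh=5.0):
    z = np.abs(failing_run - mu) / sigma
    cycles, signals = np.nonzero(z > z_thresh)
    return list(zip(cycles.tolist(), signals.tolist()))

rng = np.random.default_rng(0)
passing = rng.poisson(5.0, size=(20, 100, 8)).astype(float)
failing = passing[0].copy()
failing[42, 3] += 30.0                          # synthetic bug signature
mu, sigma = train_baseline(passing)
print(flag_anomalies(failing, mu, sigma))       # -> [(42, 3)]
```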

PDF icon Space Sensitive Cache Dumping for Post-silicon Validation [p. 497]
Sandeep Chandran, Smruti R. Sarangi and Preeti Ranjan Panda

The internal state of complex modern processors often needs to be dumped out frequently during post-silicon validation. Since the last-level cache (considered to be L2 in this paper) holds most of the state, the volume of data dumped and the transfer time are dominated by the L2 cache. The limited bandwidth for transferring data off-chip, coupled with the large size of the L2 cache, results in stalling the processor for long durations when dumping the cache contents off-chip. To alleviate this, we propose to transfer only those cache lines that were updated since the previous dump. Since maintaining a bit-vector with a separate bit to track the status of each individual cache line is expensive, we propose two methods: (i) a bit-vector in which each bit tracks multiple cache lines, and (ii) an Interval Table which stores only the starting and ending addresses of continuous runs of updated cache lines. Both methods require significantly less space than a full bit-vector, and they allow the designer to choose the amount of space to allocate for this design-for-debug (DFD) feature. The impact of reducing storage space is that some non-updated cache lines are dumped too; we attempt to minimize such overheads. Further, the size of the Interval Table is independent of the cache size, which makes it ideal for large caches. Through experimentation, we also determine the break-even point below which a t-lines/bit bit-vector is beneficial compared to an Interval Table.
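
A minimal sketch of the Interval Table idea follows (the capacity-overflow policy shown, merging the two closest runs, is one plausible choice assumed here for illustration; the merge is exactly what causes some non-updated lines to be dumped):

```python
# Minimal interval-table sketch: track runs of updated cache lines as
# (start, end) pairs so that only those lines are dumped.
import bisect

class IntervalTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.runs = []                           # sorted, disjoint (start, end)

    def mark_updated(self, line):
        i = bisect.bisect_left(self.runs, (line, line))
        if i < len(self.runs) and self.runs[i][0] <= line <= self.runs[i][1]:
            return                               # already covered
        if i > 0 and self.runs[i - 1][1] >= line:
            return                               # covered by previous run
        left = i > 0 and self.runs[i - 1][1] == line - 1
        right = i < len(self.runs) and self.runs[i][0] == line + 1
        if left and right:                       # bridges two adjacent runs
            self.runs[i - 1] = (self.runs[i - 1][0], self.runs[i][1])
            del self.runs[i]
        elif left:
            self.runs[i - 1] = (self.runs[i - 1][0], line)
        elif right:
            self.runs[i] = (line, self.runs[i][1])
        else:
            self.runs.insert(i, (line, line))
        if len(self.runs) > self.capacity:       # over budget: merge the two
            gaps = [self.runs[j + 1][0] - self.runs[j][1]
                    for j in range(len(self.runs) - 1)]
            j = gaps.index(min(gaps))            # closest runs (cheapest merge)
            self.runs[j] = (self.runs[j][0], self.runs[j + 1][1])
            del self.runs[j + 1]

t = IntervalTable(capacity=2)
for line in (5, 6, 7, 100, 101, 300):
    t.mark_updated(line)
# [(5, 101), (300, 300)]: lines 8..99 get dumped although never updated,
# the overhead incurred when the table capacity is exceeded.
print(t.runs)
```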

PDF icon Fast and Accurate BER Estimation Methodology for I/O Links Based on Extreme Value Theory [p. 503]
Alessandro Cevrero, Nestor Evmorfopoulos, Charalampos Antoniadis, Paolo Ienne, Yusuf Leblebici, Andreas Burg and George Stamoulis

This paper introduces a novel approach to the statistical analysis of modern high-speed I/O and similar communication links, capable of reliably determining extremely low (~10^-12 or lower) bit error rates (BER) by using techniques from extreme value theory (EVT). The new method requires only a small number of voltage samples at the received eye center, which can be generated by running circuit/system-level simulations or by measuring fabricated I/O circuits, to predict link BERs. Unlike conventional techniques, no simplifying assumptions on link noise and interference sources are required, making this approach extremely portable to any communication system operating with a very low BER. Our experimental results show that the BER estimates from the proposed methodology are on the same order of magnitude as traditional time-domain, transient eye diagram simulations for links with BERs of 10^-6 and 10^-5 operating at 9.6 and 10.1 Gbps, respectively.
Index Terms - BER, EVT, I/O Links
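
The following peaks-over-threshold sketch (illustrative data and thresholds, assuming SciPy is available; not the paper's exact procedure) shows how EVT extrapolates an error probability far below what a short simulation can observe directly:

```python
# Peaks-over-threshold sketch of EVT-based BER extrapolation (illustrative).
# Bit errors for a logic '1' occur when the eye-center voltage drops below
# the decision level (0 V here), an event too rare to observe directly.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
# Invented eye-center samples for logic '1': heavy-tailed noise around 0.6 V.
v = 0.6 + 0.04 * rng.standard_t(4, size=200_000)

u = np.quantile(v, 0.001)                 # low-tail threshold
exceed = u - v[v < u]                     # distances below the threshold
c, _, scale = genpareto.fit(exceed, floc=0.0)

# P(error) = P(v < 0) = P(v < u) * P(exceedance > u | v < u)
p_tail = len(exceed) / len(v)
ber = p_tail * genpareto.sf(u, c, loc=0.0, scale=scale)
print(f"extrapolated BER ~ {ber:.2e}")
```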

PDF icon Automated Determination of Top Level Control Signals [p. 509]
Rohit Kumar Jain, Praveen Tiwari and Soumen Ghosh

During various stages of hardware design, different types of control signals get introduced: clock and reset are specified and connected at the RTL stage, whereas signals like scan enable, isolation enable, and power switch enable are added to the implemented device later in the flow. The quality of Top Level Control Signals (TLCS) has a direct impact on the quality of static verification, which is used to verify the intended connectivity and functionality of the fan-out networks corresponding to the TLCS. Typically, users need to specify these TLCS (along with their intended types) for such static verification. But when the TLCS are not known to the verification engineer, reverse-engineering the clock, reset, and scan networks implemented in a design becomes a non-trivial task. This paper proposes a framework to automatically generate a list of TLCS for the implemented design. The framework describes a heuristic-based analysis of fan-in cones, traversing backwards from the leaf cell instance pins. It is independent of design style(s), as its core strength lies in its capability to dynamically adapt to new discoveries of design elements made during the traversal.
Keywords - Inference of Top level control signals, Static Verification, Low Power


5.4: Novel Approaches for Real-Time Architectures

Moderators: Cristina Silvano - Politecnico di Milano, IT; Andreas Moshovos - University of Toronto, CA
PDF icon A Cache Design for Probabilistically Analysable Real-time Systems [p. 513]
Leonidas Kosmidis, Jaume Abella, Eduardo Quiñones and Francisco J. Cazorla

Caches provide significant performance improvements, though their use in the real-time industry is low because current WCET analysis tools require detailed knowledge of a program's cache accesses to provide tight WCET estimates. Probabilistic Timing Analysis (PTA) has emerged as a solution to reduce the amount of information needed to provide tight WCET estimates, although it imposes new requirements on hardware design. At the cache level, so far only fully-associative random-replacement caches have been proven to fulfill the needs of PTA, but they are expensive in size and energy. In this paper we propose a cache design that allows set-associative and direct-mapped caches to be analysed with PTA techniques. In particular, we propose a novel parametric random placement suitable for PTA that is proven to have low hardware complexity and energy consumption while providing performance comparable to that of conventional modulo placement.
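
The following sketch illustrates the idea behind random placement (the keyed hash below is invented for illustration, not the proposed hardware): the set index is derived from the line address and a per-run random key, so conflict patterns change randomly from run to run, as PTA requires:

```python
# Illustrative random-placement sketch (not the paper's hardware hash): the
# set index of a set-associative or direct-mapped cache is derived from a
# keyed hash of the line address, so placement, and thus conflicts, vary
# randomly across runs, enabling probabilistic timing analysis.
import random

NUM_SETS = 64

def make_placement(seed):
    rng = random.Random(seed)
    key = rng.getrandbits(32)
    def set_index(line_addr):
        x = ((line_addr ^ key) * 0x9E3779B1) & 0xFFFFFFFF   # cheap keyed mix
        return (x >> 16) % NUM_SETS
    return set_index

# Two lines that conflict under one key likely will not under another.
p1, p2 = make_placement(seed=1), make_placement(seed=2)
print([p1(a) for a in range(4)])            # placement in run 1
print([p2(a) for a in range(4)])            # a different placement in run 2
```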

PDF icon MARTHA: Architecture for Control and Emulation of Power Electronics and Smart Grid Systems [p. 519]
Michel A. Kinsy, Ivan Celanovic, Omer Khan and Srinivas Devadas

This paper presents a novel Multicore Architecture for Real-Time Hybrid Applications (MARTHA) with time-predictable execution, low computational latency, and high performance that meets the requirements for control, emulation, and estimation of next-generation power electronics and smart grid systems. General-purpose architectures running real-time operating systems (RTOS) or quality-of-service (QoS) schedulers have not been able to meet the hard real-time constraints required by these applications. We present a framework based on switched hybrid automata for modeling power electronics applications. Our approach allows a large class of power electronics circuits to be expressed as switched hybrid models which can be executed on a single hardware platform.

PDF icon Conservative Open-Page Policy for Mixed Time-Criticality Memory Controllers [p. 525]
Sven Goossens, Benny Akesson and Kees Goossens

Complex Systems-on-Chip (SoCs) are mixed time-criticality systems that have to support firm real-time (FRT) and soft real-time (SRT) applications running in parallel. This is challenging for critical SoC components, such as memory controllers. Existing memory controllers focus on either firm real-time or soft real-time applications. FRT controllers use a close-page policy that maximizes worst-case performance, and they ignore opportunities to exploit locality, since it cannot be guaranteed. Conversely, SRT controllers try to reduce latency, and consequently processor stalling, by speculating on locality. They often use an open-page policy that sacrifices guaranteed performance but is beneficial in the average case. This paper proposes a conservative open-page policy that improves the average-case performance of an FRT controller in terms of bandwidth and latency without sacrificing real-time guarantees. As a result, the memory controller efficiently handles both FRT and SRT applications. The policy keeps pages open as long as possible without sacrificing guarantees and captures locality in this window. Experimental results show that on average 70% of the locality is captured for applications in the CHStone benchmark, reducing execution time by 17% compared to a close-page policy. The effectiveness of the policy is also evaluated in a multi-application use-case, and we show that the overall average-case performance improves if there is at least one FRT or SRT application that exploits locality.
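
A much-simplified sketch of the conservative window follows (the timing numbers are invented; the real controller reasons over the full set of DRAM timing constraints): a row stays open only for as long as a newly arriving request to any row could still meet its guaranteed worst-case latency:

```python
# Simplified sketch of a conservative open-page decision (invented timing
# numbers, not the paper's controller): a row is left open only while a newly
# arriving request, even one to a different row that then needs precharge and
# activate, can still be served within the guaranteed worst-case latency.
T_ACCESS, T_PRECHARGE, T_ACTIVATE = 4, 3, 3     # illustrative cycles
WCL = 12                                        # guaranteed worst-case latency

def serve(open_row, req_row, cycles_row_open):
    miss_latency = T_PRECHARGE + T_ACTIVATE + T_ACCESS
    window = WCL - miss_latency                 # how long a row may stay open
    if open_row == req_row and cycles_row_open <= window:
        return T_ACCESS, "hit captured inside the conservative window"
    return miss_latency, "handled as a close-page miss (guarantee preserved)"

print(serve(open_row=7, req_row=7, cycles_row_open=1))   # locality exploited
print(serve(open_row=7, req_row=9, cycles_row_open=1))   # bound still met
```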

PDF icon An Efficient and Flexible Hardware Support for Accelerating Synchronization Operations on the STHORM Many-Core Architecture [p. 531]
Farhat Thabet, Yves Lhuillier, Caaliph Andriamisaina, Jean-Marc Philippe and Raphaël David

The current trend in embedded computing is to increase the number of processing resources on a chip. Following this paradigm, the STMicroelectronics/CEA Platform 2012 (P2012) project designed an area- and power-efficient many-core accelerator as an answer to the computing power needs of next-generation data-intensive embedded applications. Synchronization handling on this architecture is critical, since the speed-ups of parallel implementations of embedded applications strongly depend on the ability to exploit the largest possible number of cores while limiting task management overhead. This paper presents the HardWare Synchronizer (HWS), a flexible hardware accelerator for synchronization operations in the P2012 architecture. Experiments on a multi-core test chip showed that the HWS has less than 1% area overhead while reducing synchronization latencies (by up to 2.8 times) and contentions.


5.5: Error-Aware Adaptive Modern Computing Architectures

Moderators: Marco Santambroglio - Politecnico di Milano, IT; Marian Verhelst - Katholieke Universiteit Leuven, BE
PDF icon Hot-Swapping Architecture with Back-biased Testing for Mitigation of Permanent Faults in Functional Unit Array [p. 535]
Zoltán Endre Rákossy, Masayuki Hiromoto, Hiroshi Tsutsui, Takashi Sato, Yukihiro Nakamura and Hiroyuki Ochi

Due to the latest advances in semiconductor integration, systems are becoming more susceptible to faults leading to temporary or permanent failures. We propose a new architecture extension suitable for arrays of functional units (FUs) that provides testing and replacement of faulty units without interrupting normal system operation. The extension relies on datapath switching realized by the proposed hot-swapping algorithm and structures, by means of which functional units are tested and replaced by spares at lower overhead than traditional modular redundancy. For a case-study architecture, hot-swapping support could be added with only 29% area overhead. In this paper we focus on the experimental evaluation of the hot-swapping system on a chip fabricated in a 65nm CMOS process. Autonomous testing of the hot-swapping system is enhanced with back-bias circuitry to attain an early fault detection and restoration system. Experimental measurements prove that the proposed concept works well, predicting fault occurrence with a configurable prediction interval, while power measurements reveal that with only 20% power overhead the proposed system can attain reliability levels similar to triple modular redundancy. Additionally, measurements reveal that manufacturing randomness across the die can significantly influence the reliability of identical sub-circuits located in different parts of the die, even though an identical layout has been employed.

PDF icon Variation-tolerant OpenMP Tasking on Tightly-coupled Processor Clusters [p. 541]
Abbas Rahimi, Andrea Marongiu, Paolo Burgio, Rajesh K. Gupta and Luca Benini

We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling advances across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline OpenMP scheduler [22]; consequently, the instructions per cycle (IPC) of a 16-core processor cluster is increased by up to 1.51x (1.17x on average). We evaluate the effectiveness of our approach with various numbers of cores (4, 8, 12, 16) and across a wide temperature range (ΔT = 90°C).

PDF icon Accurate and Efficient Reliability Estimation Techniques during ADL-Driven Embedded Processor Design [p. 547]
Zheng Wang, Kapil Singh, Chao Chen and Anupam Chattopadhyay

The downscaling of technology features has brought an important design criterion, reliability, into prime consideration for system developers. Due to external radiation effects and temperature gradients, CMOS devices are no longer guaranteed to function flawlessly. On the other hand, allowing errors to occur permits extending the power budget. The power-performance-reliability trade-off compounds the system design challenge, for which an efficient design exploration framework is needed. In this work, we present a high-level processor design framework extended with two reliability estimation techniques: first, a simulation-based technique, which allows a generic instruction-set simulator to estimate reliability via a high-level fault injection capability; second, a novel analytical technique, which is based on a reliability model for coarse arithmetic-logical operator blocks within a processor instruction. The techniques are tested with a RISC processor and several embedded application kernels. Our results show the efficiency and accuracy of these techniques against an HDL-level reliability estimation framework.
Keywords - Reliability Estimation; High-level Processor Design; Fault Simulation
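
As a toy illustration of high-level fault injection (the "ISS" below is a stand-in kernel, not the paper's framework), one can flip a random register bit at a random cycle and measure how often the program output is corrupted:

```python
# Toy high-level fault-injection sketch: flip one random register-file bit at
# a random cycle of a toy kernel and check whether the output is corrupted.
# Flips into a register that is overwritten before use are naturally masked.
import random

def run_program(fault_cycle=None, fault_reg=None, fault_bit=None):
    regs = [0] * 8
    regs[1], regs[2] = 1, 1
    for cycle in range(20):                    # toy Fibonacci kernel
        regs[3] = regs[1] + regs[2]
        regs[1], regs[2] = regs[2], regs[3]
        if cycle == fault_cycle:
            regs[fault_reg] ^= 1 << fault_bit  # single-event-upset model
    return regs[3]

golden = run_program()
rng = random.Random(7)
failures = sum(
    run_program(rng.randrange(20), rng.randrange(1, 4), rng.randrange(16))
    != golden
    for _ in range(1000))
print(f"observed failure rate: {failures / 1000:.2%}")
```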


5.6: Advances in Mixed-Signal, RF, and MEMS Testing

Moderators: Salvador Mir - TIMA Laboratory, FR; Adoración Rueda - University of Seville, ES
PDF icon Handling Discontinuous Effects in Modeling Spatial Correlation of Wafer-level Analog/RF Tests [p. 553]
Ke Huang, Nathan Kupp, John M. Carulli, Jr. and Yiorgos Makris

In an effort to reduce the cost of specification testing in analog/RF circuits, spatial correlation modeling of wafer-level measurements has recently attracted increased attention. Existing approaches for capturing and leveraging such correlation, however, rely on the assumption that spatial variation is smooth and continuous. This, in turn, limits the effectiveness of these methods on actual production data, which often exhibits localized, discontinuous spatial effects. In this work, we propose a novel approach which enables spatial correlation modeling of wafer-level analog/RF tests to handle such effects and, thereby, to drastically reduce prediction error for measurements exhibiting discontinuous spatial patterns. The core of the proposed approach is a k-means algorithm which partitions a wafer into the k clusters caused by discontinuous effects. Individual correlation models are then constructed within each cluster, revoking the assumption that spatial patterns should be smooth and continuous across the entire wafer. The effectiveness of the proposed approach is evaluated on industrial probe test data from more than 3,400 wafers, revealing significant error reduction over existing approaches.
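
The following sketch (synthetic wafer data) illustrates the proposed flow: k-means partitions the dies using coordinates and measurement values, and a separate spatial model is then fit per cluster, so that a discontinuity such as an edge ring no longer pollutes the fit:

```python
# Sketch of the cluster-then-model flow on synthetic wafer data: k-means
# separates a discontinuous edge region from the wafer center, then a plane
# is fit to the measurement within each cluster.
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([points[labels == j].mean(0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

rng = np.random.default_rng(3)
xy = rng.uniform(-1, 1, size=(500, 2))                 # die coordinates
r = np.linalg.norm(xy, axis=1)
meas = 0.3 * xy[:, 0] + np.where(r > 0.9, 1.5, 0.0)    # discontinuous edge

labels = kmeans(np.column_stack([xy, meas]), k=2)
for j in range(2):
    X = np.column_stack([np.ones((labels == j).sum()), xy[labels == j]])
    beta, *_ = np.linalg.lstsq(X, meas[labels == j], rcond=None)
    print(f"cluster {j}: plane coefficients {np.round(beta, 2)}")
```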

PDF icon Fault Detection, Real-Time Error Recovery, and Experimental Demonstration for Digital Microfluidic Biochips [p. 559]
Kai Hu, Bang-Ning Hsu, Andrew Madison, Krishnendu Chakrabarty and Richard Fair

Advances in digital microfluidics and integrated sensing hold promise for a new generation of droplet-based biochips that can perform multiplexed assays to determine the identity of target molecules. Despite these benefits, defects and erroneous fluidic operations remain a major barrier to the adoption and deployment of these devices. We describe the first integrated demonstration of cyberphysical coupling in digital microfluidics, whereby errors in droplet transportation on the digital microfluidic platform are detected using capacitive sensors, the test outcome is interpreted by control hardware, and software-based error recovery is accomplished using dynamic reconfiguration. The hardware/software interface is realized through seamless interaction between control software, an off-the-shelf microcontroller, and a frequency divider implemented on an FPGA. Experimental results are reported for a fabricated silicon device, and links to videos are provided for the first-ever experimental demonstration of cyberphysical coupling and dynamic error recovery in digital microfluidic biochips.

PDF icon Fault Analysis and Simulation of Large Scale Industrial Mixed-Signal Circuits [p. 565]
Ender Yilmaz, Geoff Shofner, LeRoy Winemberg and Sule Ozev

High test quality can be achieved through defect-oriented testing using an analog fault modeling approach. However, this approach is computationally demanding and typically hard to apply to large-scale circuits. In this work, we use an improved inductive fault analysis approach to locate potential faults at the layout level and calculate the relative probability of each fault. Our proposed method yields actionable results, such as the fault coverage of each test, the potential faults, and the probability of each fault. We show that the computational requirements can be significantly reduced by incorporating fault probabilities. These results can be used to improve fault coverage or to improve the defect resilience of the circuit.

PDF icon Electrical Calibration of Spring-Mass MEMS Capacitive Accelerometers [p. 571]
Lingfei Deng, Vinay Kundur, Naveen Sai Jangala Naga, Muhlis Kenan Ozel, Ender Yilmaz, Sule Ozev, Bertan Bakkaloglu, Sayfe Kiaei, Divya Pratab and Tehmoor Dar

Testing and calibration of MEMS devices require a physical stimulus, which results in the need for specialized test equipment and thus high test cost. It has been shown for various types of sensors that electrical stimulation can be used to facilitate lower-cost calibration. In this paper, we present an electrical-stimulus-based test and calibration technique for overdamped spring-mass capacitive accelerometers, which requires the characterization of stationary and dynamic calibration coefficients. We show that these two coefficients can be obtained electrically.


5.7: Compilers and Software Synthesis for Embedded Systems

Moderators: Björn Franke - University of Edinburgh, UK; Heiko Falk - Ulm University, DE
PDF icon Optimizing Remote Accesses for Offloaded Kernels: Application to High-Level Synthesis for FPGA [p. 575]
Christophe Alias, Alain Darte and Alexandru Plesco

Some data- and compute-intensive applications can be accelerated by offloading portions of code to platforms such as GPGPUs or FPGAs. However, to get high performance for these kernels, it is mandatory to restructure the application, to generate adequate communication mechanisms for the transfer of remote data, and to make good use of the memory bandwidth. In the context of high-level synthesis (HLS), from a C program, of hardware accelerators on FPGAs, we show how to automatically generate optimized remote accesses for an accelerator communicating with an external DDR memory. Loop tiling is used to enable block communications, suitable for DDR memories. Pipelined communication processes are generated to overlap communications and computations, thereby hiding some latencies, in a way similar to double buffering. Finally, not only intra-tile but also inter-tile data reuse is exploited to avoid remote accesses when data are already available in the local memory. Our first contribution is to show how to generate the sets of data to be read from (resp. written to) the external memory just before (resp. after) each tile so as to reduce communications and reuse data as much as possible in the accelerator. The main difficulty arises when some data may be (re)defined in the accelerator and should be kept locally. Our second contribution is an optimized code generation scheme, entirely at source level, i.e., in C, that allows us to compile all the necessary glue (the communication processes) with the same HLS tool as the computation kernel. Both contributions use advanced polyhedral techniques for program analysis and transformation. Experiments with Altera HLS tools demonstrate how to use our techniques to efficiently map C kernels to FPGAs.
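
The communication/computation overlap generated by the approach can be pictured with the following plain-Python stand-in (a thread plays the role of the pipelined communication process; tile count and size are invented):

```python
# Schematic of double-buffered tile processing: while tile i is being
# computed from one buffer, the transfer of tile i+1 into the other buffer
# proceeds concurrently, hiding part of the DDR latency.
import threading

def load_tile(memory, i, buf):                 # stands in for a DDR burst read
    buf[:] = memory[i * 4:(i + 1) * 4]

def compute(buf):
    return sum(x * x for x in buf)

memory = list(range(32))                       # 8 tiles of 4 elements each
bufs = [[0] * 4, [0] * 4]
results = []

load_tile(memory, 0, bufs[0])                  # prologue: fetch the first tile
for i in range(8):
    loader = None
    if i + 1 < 8:                              # prefetch the next tile into
        loader = threading.Thread(             # the other buffer, in parallel
            target=load_tile, args=(memory, i + 1, bufs[(i + 1) % 2]))
        loader.start()
    results.append(compute(bufs[i % 2]))       # overlap compute with transfer
    if loader:
        loader.join()                          # next tile ready before reuse
print(results)
```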

PDF icon Sequentially Constructive Concurrency - A Conservative Extension of the Synchronous Model of Computation [p. 581]
Reinhard von Hanxleden, Michael Mendler, Joaquin Aguado, Björn Duderstadt, Insa Fuhrmann, Christian Motika, Stephen Mercer and Owen O'Brien

Synchronous languages ensure deterministic concurrency, but at the price of heavy restrictions on what programs are considered valid, or constructive. Meanwhile, sequential languages such as C and Java offer an intuitive, familiar programming paradigm but provide no guarantees with regard to deterministic concurrency. The sequentially constructive model of computation (SC MoC) presented here harnesses the synchronous execution model to achieve deterministic concurrency while addressing concerns that synchronous languages are unnecessarily restrictive and difficult to adopt. In essence, the SC MoC extends the classical synchronous MoC by allowing variables to be read and written in any order as long as sequentiality expressed in the program provides sufficient scheduling information to rule out race conditions. The SC MoC is a conservative extension in that programs considered constructive in the common synchronous MoC are also SC and retain the same semantics. In this paper, we identify classes of variable accesses, define sequential constructiveness based on the concept of SC-admissible scheduling, and present a priority-based scheduling algorithm for analyzing and compiling SC programs.

PDF icon Fast and Accurate Cache Modeling in Source-Level Simulation of Embedded Software [p. 587]
Zhonglei Wang and Jörg Henkel

Source-level software models are increasingly used for software simulation in TLM (Transaction Level Modeling)-based virtual prototypes of multicore systems. A source-level model is generated by annotating timing information into the application source code and allows for very fast software simulation. Accurate cache simulation is a key issue in multicore systems design, because the memory subsystem accounts for a large portion of system performance. However, cache simulation at source level faces two major problems: (1) as target data addresses cannot be statically resolved during source code instrumentation, accurate data cache simulation is very difficult at source level, and (2) cache simulation brings a large simulation performance overhead and can therefore cancel out the gains of source-level simulation. In this paper, we present a novel approach for accurate data cache simulation at source level. In addition, we also propose a cache modeling method to accelerate both instruction and data cache simulation. Our experiments show that simulation with the fast cache model achieves 450.7 MIPS (million simulated instructions per second) on a standard x86 laptop, a 2.3x speedup compared with a standard cache model. The source-level models with cache simulation achieve accuracy comparable to an Instruction Set Simulator (ISS). We also use a complex multimedia application to demonstrate the efficiency of the proposed approach for multicore systems design.
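
The annotation idea can be illustrated with the following toy sketch (invented cache geometry; not the paper's model): the instrumented source calls into a lightweight cache model, so misses are counted during native-speed execution rather than in an ISS:

```python
# Toy illustration of source-level cache annotation: the application source
# is instrumented with calls into a lightweight direct-mapped cache model,
# so misses are counted during fast host execution.
LINE, SETS = 32, 256                           # illustrative geometry (8 KB)

class FastCache:
    def __init__(self):
        self.tags = [None] * SETS
        self.misses = 0

    def access(self, addr):
        line = addr // LINE
        idx, tag = line % SETS, line // SETS
        if self.tags[idx] != tag:              # miss: fill the set
            self.tags[idx] = tag
            self.misses += 1

dcache = FastCache()

def annotated_row_sum(base, rows, cols):       # "instrumented" source function
    total = 0
    for r in range(rows):
        for c in range(cols):
            addr = base + (r * cols + c) * 4   # address touched by the load
            dcache.access(addr)                # call inserted by instrumenter
            total += 1
    return total

annotated_row_sum(base=0, rows=64, cols=64)
print("data-cache misses:", dcache.misses)     # 512 = 64*64*4 B / 32 B lines
```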

PDF icon Automatic and Efficient Heap Data Management for Limited Local Memory Multicore Architectures [p. 593]
Ke Bai and Aviral Shrivastava

Limited Local Memory (LLM) multi-core architectures substitute scratchpad memories (SPMs) for caches and therefore have much lower power consumption. As they lack automatic memory management, programming such architectures becomes challenging, in the sense that the programmer/compiler must efficiently manage the limited local memory. Managing the heap data of the tasks executing on the cores of an LLM multi-core is an important problem. This paper presents a fully automated and efficient scheme for heap data management. Specifically, we propose (i) code transformations that automate heap management, with seamless support for multi-level pointers, and (ii) improved data structures to more efficiently manage unlimited heap data. Experimental results on several benchmarks from MiBench demonstrate an average 43% performance improvement over a previous approach [1].
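
A minimal sketch of software-managed heap data on a scratchpad follows (slot granularity and the LRU eviction policy are invented for illustration; the pointer rewriting and multi-level pointer support of the paper are not modeled):

```python
# Minimal sketch of software-managed heap data on a scratchpad (SPM): objects
# live in a large global store; every access first ensures the object is in
# one of the few local SPM slots, evicting the least-recently-used object
# back to global memory when necessary.
from collections import OrderedDict

class SPMHeap:
    def __init__(self, slots):
        self.slots = slots
        self.local = OrderedDict()             # handle -> object, LRU order
        self.global_store = {}                 # spilled objects live here

    def malloc(self, handle, obj):
        self.global_store[handle] = obj        # objects start in global memory

    def access(self, handle):
        if handle in self.local:               # SPM hit
            self.local.move_to_end(handle)
        else:                                  # SPM miss: fetch, maybe evict
            if len(self.local) >= self.slots:
                victim, obj = self.local.popitem(last=False)
                self.global_store[victim] = obj
            self.local[handle] = self.global_store.pop(handle)
        return self.local[handle]

heap = SPMHeap(slots=2)
for h in "abc":
    heap.malloc(h, {"payload": h})
print(heap.access("a"), heap.access("b"), heap.access("c"))  # 'a' evicted
```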

PDF icon Software Enabled Wear-Leveling for Hybrid PCM Main Memory on Embedded Systems [p. 599]
Jingtong Hu, Qingfeng Zhuge, Chun Jason Xue, Wei-Che Tseng and Edwin H.-M. Sha

Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics. However, its relatively low endurance has limited its practical application. In this paper, in addition to existing hardware-level optimizations, we propose software-enabled wear-leveling techniques to further extend PCM's lifetime when it is adopted in embedded systems. A polynomial-time algorithm, the Software Wear-Leveling (SWL) algorithm, is proposed to achieve wear-leveling without hardware overhead. According to the experimental results, the proposed technique can reduce the number of writes to the most-written bits by more than 80% compared with a greedy algorithm, and by around 60% compared with the existing Optimal Data Allocation (ODA) algorithm, with under 6% memory access overhead.
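
The following sketch (not the SWL algorithm itself; epoch structure and write counts are invented) illustrates the principle of software wear-leveling: re-placing hot objects on the least-written pages each epoch evens out wear compared with a fixed layout:

```python
# Illustration of the wear-leveling principle on PCM (not the paper's SWL
# algorithm): each epoch, data objects are placed on the pages with the least
# accumulated writes, flattening the wear profile versus a fixed layout.
def place(objects, page_writes):
    # objects: list of (name, writes_this_epoch); place hottest objects first.
    for _, writes in sorted(objects, key=lambda o: -o[1]):
        coolest = min(range(len(page_writes)), key=page_writes.__getitem__)
        page_writes[coolest] += writes

fixed = [0] * 4
leveled = [0] * 4
epoch = [("buf", 1000), ("idx", 300), ("cfg", 10), ("log", 5)]
for _ in range(8):
    for i, (_, w) in enumerate(epoch):
        fixed[i] += w                          # fixed layout: same page always
    place(epoch, leveled)                      # wear-aware placement
print("fixed layout writes per page: ", fixed)    # one page absorbs 8000
print("wear-leveled writes per page:", leveled)   # maximum far lower
```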

PDF icon Probabilistic Timing Analysis on Conventional Cache Designs [p. 603]
Leonidas Kosmidis, Charlie Curtsinger, Eduardo Quiñones, Jaume Abella, Emery Berger and Francisco J. Cazorla

Probabilistic timing analysis (PTA), a promising alternative to traditional worst-case execution time (WCET) analyses, enables pairing time bounds (named probabilistic WCET or pWCET) with an exceedance probability (e.g., 10^-16), resulting in far tighter bounds than conventional analyses. However, the applicability of PTA has been limited because of its dependence on relatively exotic hardware: fully-associative caches using random replacement. This paper extends the applicability of PTA to conventional cache designs via a software-only approach. We show that, by using a combination of compiler techniques and runtime system support to randomise the memory layout of both code and data, conventional caches behave as fully-associative ones with random replacement.


6.1: EMBEDDED TUTORIAL - HW-SW Architecture Approaches to Energy-Efficiency

Organizer: Ahmed Jerraya - CEA-LETI-MINATEC, FR
Moderators: Agnès Fritsch - Thales Group, FR; Ahmed Jerraya - CEA-LETI-MINATEC, FR
PDF icon HW-SW Integration for Energy-Efficient/Variability-Aware Computing [p. 607]
Gasser Ayad, Andrea Acquaviva, Enrico Macii, Brahim Sahbi, Romain Lemaire

Recent trends in embedded system architectures have brought a rapid shift towards multicore, heterogeneous, and reconfigurable platforms. This imposes a large effort on programmers to develop their applications so that they efficiently exploit the underlying architecture. In addition, process variability issues lead to performance and power uncertainties, impacting the expected quality of service and the energy efficiency of the running software. In particular, variability may lead to sub-optimal runtime task allocation. In this paper we present a holistic approach to tackle these issues, exploiting high-level HW/SW modeling to customize the runtime library. The customization introduces variability awareness into task allocation decisions, with the final purpose of optimizing a given objective: execution time, power consumption, or overall energy consumption. We present a complete walkthrough, from top-level modeling down to variability-aware execution, using a parallelized computational kernel running on a next-generation, NoC-based, heterogeneous multicore simulation platform.


6.2: HOT TOPIC: Emerging Nanoscale Devices: A Booster for High Performance Computing

Organizers: Pierre-Emmanuel Gaillardon - EPFL, CH; Giovanni De Micheli - EPFL, CH
Moderators: Giovanni De Micheli - EPFL, CH; Ahmed Jerraya - CEA, LETI, Minatec, FR
PDF icon Near-Threshold Voltage Design in Nanoscale CMOS [p. 612]
Vivek De

Near-Threshold Voltage (NTV) operation of a CMOS design is defined as the voltage-frequency operating point where the energy consumed per compute operation (pJ/op) reaches a minimum, or the energy efficiency (Mops/Watt) peaks. Typically, this operating voltage is above the nominal threshold voltage of the transistor. The peak efficiency is achieved by a balance of switching energy and idle or leakage energy.

PDF icon Ultra-Wide Voltage Range Designs in Fully-Depleted Silicon-On-Insulator FETs [p. 613]
E. Beigne, A. Valentian, B. Giraud, O. Thomas, T. Benoist, Y. Thonnart, S. Bernard, G. Moritz, O. Billoint, Y. Maneglia, P. Flatresse, J.P. Noel, F. Abouzeid, B. Pelloux- Prayer, A. Grover, S. Clerc, P. Roche, J. Le Coz, S. Engels and R. Wilson

Today's MPSoC applications require a convergence between very high speed and ultra-low power. Ultra-Wide Voltage Range (UWVR) capability appears as a solution for high energy efficiency, with the objective of improving speed at very low voltage and decreasing power at high speed. Using Fully Depleted Silicon-On-Insulator (FDSOI) devices significantly improves the trade-off between leakage, variability, and speed, even at low voltage. A full design framework is presented for UWVR operation using FDSOI Ultra-Thin Body and Box technology, considering power management, multi-VT enablement, standard cell design, and SRAM bitcells. The technology's performance is demonstrated on an ARM A9 critical path, showing a speed increase of 40% to 200% without added energy cost. Conversely, when performance is not required, FDSOI enables leakage power reductions of up to 10X using reverse body biasing.
Keywords - energy efficiency, low voltage, adaptive architectures, FDSOI, Ultra Thin Body and Box

PDF icon Carbon Nanotube Circuits: Opportunities and Challenges [p. 619]
Hai Wei, Max Shulaker, Gage Hills, Hong-Yu Chen, Chi-Shuen Lee, Luckshitha Liyanage, Jie Zhang, H.-S. Philip Wong and Subhasish Mitra

Carbon Nanotube Field-Effect Transistors (CNFETs) are excellent candidates for building highly energy-efficient digital systems. However, imperfections inherent in carbon nanotubes (CNTs) pose significant hurdles to realizing practical CNFET circuits. In order to achieve CNFET VLSI systems in the presence of these inherent imperfections, careful orchestration of design and processing is required: from device processing and circuit integration, all the way to large-scale system design and optimization. In this paper, we summarize the key ideas that enabled the first experimental demonstration of CNFET arithmetic and storage elements. We also present an overview of a probabilistic framework to analyze the impact of various CNFET circuit design techniques and CNT processing options on system-level energy and delay metrics. We demonstrate how this framework can be used to improve the energy-delay-product (EDP) of CNFET-based digital systems.
Keywords - Carbon Nanotube; CNT; CNFET; Nanotechnology; Modeling; Imperfection; Variation; Three-Dimensional Circuits

PDF icon Vertically-Stacked Double-Gate Nanowire FETs with Controllable Polarity: From Devices to Regular ASICs [p. 625]
Pierre-Emmanuel Gaillardon, Luca Gaetano Amarù, Shashikanth Bobba, Michele De Marchi, Davide Sacchetto, Yusuf Leblebici and Giovanni De Micheli

Vertically stacked nanowire FETs (NWFETs) with a gate-all-around structure are the natural and most advanced extension of FinFETs. At advanced technology nodes, many devices exhibit ambipolar behavior, i.e., the device shows n- and p-type characteristics simultaneously. In this paper, we show that, by engineering the contacts and by constructing independent double-gate structures, the device polarity can be electrostatically programmed to be either n- or p-type. Such a device enables a compact realization of XOR-based logic functions at the cost of a denser interconnect. To mitigate the added area/routing overhead caused by the additional gate, an approach for designing an efficient regular layout, called Sea-of-Tiles, is presented. Then, specific logic synthesis techniques supporting the higher expressive power provided by this technology are introduced and used to showcase the performance of controllable-polarity NWFET circuits in comparison with traditional CMOS circuits.
Keywords - Nanowire transistors; controllable polarity; regular fabrics; XOR logic synthesis


6.3: Verification and Simulation Support for Architecture

Moderators: Valeria Bertacco - University of Michigan, US; Elena Vatajelu - LIRMM, FR
PDF icon On-the-fly Verification of Memory Consistency with Concurrent Relaxed Scoreboards [p. 631]
Leandro S. Freitas, Eberle A. Rambo and Luiz C. V. dos Santos

Parallel programming requires the definition of shared-memory semantics by means of a consistency model, which affects how the parallel hardware is designed. Therefore, verifying hardware compliance with a consistency model is a relevant problem, whose complexity depends on the observability of memory events. Post-silicon checkers analyze a single sequence of events per core, and so do most pre-silicon checkers, although one reported method samples two sequences per core. Besides, most are post-mortem checkers, requiring the whole sequence of events to be available prior to verification. In contrast, this paper describes a novel on-the-fly technique for verifying memory consistency from an executable representation of a multicore system. To increase efficiency without hampering verification guarantees, three points are monitored per core. The sampling points are selected to be largely independent of the core's microarchitecture. The technique relies on concurrent relaxed scoreboards to check for consistency violations in each core. To check for global violations, it employs a linear order of events induced by a given test case. We prove that the technique indicates neither false negatives nor false positives when the test case exposes an error that affects the sampled sequences, making it the first on-the-fly checker with full guarantees. We compare our technique with two post-mortem checkers under 2400 scenarios for platforms with 2 to 8 cores. The results show that our technique is at least 100 times faster than a checker sampling a single sequence per processor, and that it needs approximately 1/4 to 3/4 of the overall verification effort required by a post-mortem checker sampling two sequences per processor.

PDF icon Fast Cache Simulation for Host-Compiled Simulation of Embedded Software [p. 637]
Kun Lu, Daniel Müller-Gritschneder and Ulf Schlichtmann

Host-compiled simulation has been proposed for software performance estimation because of its high simulation speed. However, simulation speed may be significantly lowered by the cache simulation overhead. In this paper, we propose an approach that removes much of the cache simulation overhead while still calculating cache misses precisely. For the instruction cache, we statically analyze possible cache conflicts and perform cache-conflict-aware annotation for host-compiled simulation. Within loops, conflicts are dynamically captured by tagging the basic blocks instead of performing the expensive cache simulation. In this way, the vast majority of cache accesses can be excluded from simulation. For the data cache, aggregated cache simulation is used for large data blocks. Further, data locality can be bounded by considering the data allocation principle of a program. Experiments show that our approach improves the speed of host-compiled simulation by one order of magnitude, while providing cache miss numbers with high accuracy.

PDF icon A Critical-Section-Level Timing Synchronization Approach for Deterministic Multi-Core Instruction-Set Simulations [p. 643]
Fan-Wei Yu, Bo-Han Zeng, Yu-Hung Huang, Hsin-I Wu, Che-Rung Lee and Ren-Song Tsay

This paper proposes a Critical-Section-Level timing synchronization approach for deterministic Multi-Core Instruction-Set Simulation (MCISS). By synchronizing at each lock access instead of at every shared-variable access, and by using a simple lock-usage status managing scheme, our approach significantly improves simulation performance while executing all critical sections in a deterministic order. Experiments show that our approach performs 295% faster on average than the shared-variable synchronization approach and can effectively facilitate system-level software/hardware co-simulation.
Keywords - Deterministic, Multi-core instruction-set simulations, Timing Synchronization

PDF icon Multi-level Phase Analysis for Sampling Simulation [p. 649]
Jiaxin Li, Weihua Zhang, Haibo Chen and Binyu Zang

The extremely long simulation times of architectural simulators have been a major impediment to their wide applicability. To accelerate architectural simulation, prior researchers have proposed representative sampling simulation to trade a small loss of accuracy for a notable speed improvement. Generally, they use fine-grained phase analysis to select only a small representative portion of program execution intervals for detailed cycle-accurate simulation, while functionally simulating the remaining portion. However, though phase granularity is one of the most important factors in simulation speed, it has not been well investigated, and most prior research explores a fine-grained scheme. This limits the effectiveness in further improving simulation speed given increasingly complex architectural designs and new, lengthy benchmarks. In this paper, by analyzing the impact of phase granularity on simulation speed, we observe that coarse-grained phases can better capture the overall program characteristics with a smaller number of phases, and that the last representative phase may be classified at a very early program position, leading to fewer execution intervals being functionally simulated. By contrast, fine-grained phases usually have much shorter execution intervals, and thus the overall detailed simulation time can be reduced. Based on these observations, we design a multi-level sampling simulation technique that combines both fine-grained and coarse-grained phase analysis. Such a scheme uses fine-grained simulation points to represent only the selected coarse-grained simulation points instead of the entire program execution; thus it can further reduce both functional and detailed simulation time. Experimental results using SPEC2000 show that such a framework is effective: using the SimPoint method as a baseline, it reduces functional simulation time by about 90% and detailed simulation time by about 50%. It finally achieves a geometric average speedup of 14.04X over SimPoint with comparable accuracy.

PDF icon Hypervised Transient SPICE Simulations of Large Netlists & Workloads on Multi-Processor Systems [p. 655]
Grigorios Lyras, Dimitrios Rodopoulos, Antonis Papanikolaou and Dimitrios Soudris

The need for detailed simulation of integrated circuits has received significant attention since the early stages of design automation. Given the increasing device integration, these simulations have extreme memory footprints, especially within unified memory hierarchies. This paper overcomes the infeasible memory demands of modern circuit simulators. Structural partitioning of the netlist and temporal partitioning of the input signals allow distributed execution with minimal memory requirements. The proposed framework is validated with simulations of a circuit with more than 10^6 MOSFET devices. In comparison to a commercial tool, we observe minimal error and even a 2.35x speedup for moderate netlist sizes. The proposed framework is proven highly reusable across a variety of execution platforms.


6.4: Design Space Exploration for Application Specific Architectures

Moderators: Andreas Moshovos - University of Toronto, CA; Georgi Gaydadjiev - Chalmers University of Technology, SE
PDF icon A Meta-Model Assisted Coprocessor Synthesis Framework for Compiler/Architecture Parameters Customization [p. 659]
Sotirios Xydis, Gianluca Palermo, Vittorio Zaccaria and Cristina Silvano

Hardware coprocessors are extensively used in modern heterogeneous system-on-chip (SoC) designs to provide efficient implementations of application-specific functions. Customized coprocessor synthesis exploits design space exploration to derive Pareto-optimal design configurations for a set of targeted metrics. Existing exploration strategies for coprocessor synthesis have focused on either time-consuming iterative scheduling approaches or ad-hoc sampling of the solution space guided by the designer's experience. In this paper, we introduce a meta-model-assisted exploration framework that eliminates the aforementioned drawbacks by using response surface models (RSMs) for generating customized coprocessor architectures. The methodology is based on the construction of analytical delay and area models for predicting the quality of design points without resorting to costly architectural synthesis procedures. Various RSM techniques are evaluated with respect to their accuracy and convergence. We show that the targeted solution space can be accurately modeled through RSMs, thus enabling a speedup of the overall exploration runtime without compromising the quality of results. Comparative experimental results, over a set of real-life benchmarks, prove the effectiveness of the proposed approach in terms of quality improvements of the design solutions and exploration runtime reductions. An MPEG-2 decoder case study describes how the proposed approach can be exploited for customizing the architecture of two hardware-accelerated kernels.

PDF icon Energy-Efficient Memory Hierarchy for Motion and Disparity Estimation in Multiview Video Coding [p. 665]
Felipe Sampaio, Bruno Zatt, Muhammad Shafique, Luciano Agostini, Sergio Bampi and Jörg Henkel

This work presents an energy-efficient memory hierarchy for motion and disparity estimation in Multiview Video Coding, employing a Reference Frames-Centered Data Reuse (RCDR) scheme. In RCDR, the reference search window becomes the center of the motion/disparity estimation processing flow and calls for processing all blocks requesting its data. By doing so, RCDR avoids multiple search-window retransmissions, leading to a reduced number of external memory accesses and thus to memory energy reduction. To deal with out-of-order processing and further reduce external memory traffic, a statistics-based partial-results compressor is developed. On-chip video memory energy is reduced by employing a statistical power-gating scheme and candidate-block reordering. Experimental results show that our reference-centered memory hierarchy outperforms the state-of-the-art [7][13], providing reductions of up to 71% in external memory energy, 88% in on-chip memory static energy, and 65% in on-chip memory dynamic energy.
Index Terms - Multiview Video Coding, MVC, 3D-Video, Low-Power Design, On-Chip Video Memory, Application-Aware DPM, Memory Hierarchy, Energy Efficiency, Motion Estimation, Disparity Estimation.

PDF icon Improving Simulation Speed and Accuracy for Many-Core Embedded Platforms with Ensemble Models [p. 671]
E. Paone, N. Vahabi, V. Zaccaria, C. Silvano, D. Melpignano, G. Haugou and T. Lepley

In this paper, we introduce a novel modeling technique to reduce the time associated with cycle-accurate simulation of parallel applications deployed on many-core embedded platforms. We introduce an ensemble model based on artificial neural networks that exploits (in the training phase) multiple levels of simulation abstraction, from cycle-accurate to cycle-approximate, to predict the cycle-accurate results for unknown application configurations. We show that high-level modeling can be used to significantly reduce the number of low-level model evaluations provided that a suitable artificial neural network is used to aggregate the results. We propose a methodology for the design and optimization of such an ensemble model and we assess the proposed approach for an industrial simulation framework based on STMicroelectronics STHORM (P2012) many-core computing fabric.

PDF icon Statically-scheduled Application-specific Processor Design: A Case-study on MMSE MIMO Equalization [p. 677]
Mostafa Rizk, Amer Baghdadi, Michel Jézéquel, Yasser Mohana and Youssef Atat

Many application-specific processor design approaches are being proposed and investigated nowadays. All of them aim to cope with the emerging flexibility requirements combined with the best performance efficiency. The Application-Specific Instruction-set Processor (ASIP) design approach is among the most explored, in many application domains. However, this concept implies dynamic scheduling of a set of instructions, which generally leads to an overhead related to instruction decoding. To reduce this overhead, other approaches have been proposed using static scheduling of datapath control signals. In this paper, we explore this latter approach and illustrate its benefits through a design case-study on MMSE MIMO equalization. The proposed design shares the main architectural choices of a state-of-the-art ASIP for comparison purposes. The obtained results illustrate a significant improvement in execution time while using identical computational resources and supporting the same flexibility parameters.

PDF icon Exploring Resource Mapping Policies for Dynamic Clustering on NoC-based MPSoCs [p. 681]
Gustavo Girão, Thiago Santini and Flávio R. Wagner

The dramatic increase in the number of processors, memories, and other components on the same chip calls for resource-aware mechanisms to improve performance. This paper proposes four different resource mapping policies for NoC-based MPSoCs that leverage distinct aspects of the parallel nature of the applications and of architectural constraints, such as off-chip memory latency. Results show that the use of these policies can improve performance by up to 22.5% on average and, in some cases, depending on the parallel programming model of each application, the improvement may reach up to 32%.

PDF icon Characterizing the Performance Benefits of Fused CPU/GPU Systems Using FusionSim [p. 685]
Vitaly Zakharenko, Tor Aamodt and Andreas Moshovos

We use FusionSim to characterize the performance of the Rodinia benchmarks on fused and discrete systems. We demonstrate that the speedup due to fusion is highly correlated with the input data size. We demonstrate that for the benchmarks that benefit most from fusion, a 9.72x speedup is possible for small problem sizes. This speedup reduces to 1.84x with medium or large problem sizes. We study a simple, software-managed coherence solution for the fused system. We find that it imposes a minor performance overhead of 2% for most benchmarks and as high as 5% for some. Finally, we develop an analytical model for the performance benefit to be expected from fusion for applications with a simple communication and computation pattern, and show that FusionSim follows the predicted performance trend.
Keywords - CPU and GPU Fusion


6.5: Reliable Multi-Processor Computing Systems Design

Moderators: Jose Ayala - Complutense University of Madrid, ES; Vincenzo Rana - EPFL, CH
PDF icon Reliability-Driven Task Mapping for Lifetime Extension of Networks-on-Chip Based Multiprocessor Systems [p. 689]
Anup Das, Akash Kumar and Bharadwaj Veeravalli

Shrinking transistor geometries, aggressive voltage scaling, and higher operating frequencies have negatively impacted the lifetime reliability of embedded multi-core systems. In this paper, a convex-optimization-based task-mapping technique is proposed to extend the lifetime of multiprocessor systems-on-chip (MPSoCs). The proposed technique generates mappings for every application enabled on the platform with a variable number of cores. Based on these results, a novel 3D-optimization technique is developed to distribute the cores of an MPSoC among multiple applications enabled simultaneously. Additionally, the reliability of the underlying network-on-chip links is also addressed by incorporating the aging of links in the objective function. Our formulations are developed for directed acyclic graphs (DAGs) and synchronous dataflow graphs (SDFGs), making our approach applicable to streaming as well as non-streaming applications. Experiments conducted with synthetic and real-life application graphs demonstrate that the proposed approach extends the lifetime of an MPSoC by more than 30%, whether applications are enabled individually or in tandem.

PDF icon A Work-Stealing Scheduling Framework Supporting Fault Tolerance [p. 695]
Yizhuo Wang, Weixing Ji, Feng Shi and Qi Zuo

Fault tolerance and load balancing are critical points for executing long-running parallel applications on multicore clusters. This paper addresses both fault tolerance and load balancing on multicore clusters by presenting a novel work-stealing task scheduling framework which supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework. This framework provides low-overhead fault-tolerance and optimal load balancing by fully exploiting task parallelism.
Keywords - fault tolerance; work-stealing; multicore; cluster
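
The following single-threaded sketch (simulated workers and faults; not the paper's runtime) combines the two ingredients: a work-stealing loop over per-worker deques, with task-granularity fault detection by duplicate execution and recovery by re-execution:

```python
# Single-threaded sketch of fault-tolerant work stealing: each worker owns a
# deque, steals from the busiest victim when idle, and re-executes any task
# whose duplicated execution disagrees (task-level transient-fault detection).
# Assumes two simultaneous faults do not produce identical corruption.
import random
from collections import deque

rng = random.Random(42)

def run_task(x, faulty):
    result = x * x
    return result + rng.randrange(1, 1000) if faulty else result

def execute_reliably(x):
    while True:                                # recovery = re-execution
        a = run_task(x, faulty=rng.random() < 0.1)
        b = run_task(x, faulty=rng.random() < 0.1)
        if a == b:                             # duplicate-and-compare
            return a

workers = [deque(range(i * 10, (i + 1) * 10)) for i in range(4)]
done = []
while any(workers):
    for w in workers:
        if not w:                              # idle: steal from the busiest
            victim = max(workers, key=len)
            if victim:
                w.append(victim.popleft())     # steal from the opposite end
        if w:
            done.append(execute_reliably(w.pop()))
print(len(done), "tasks completed correctly")
```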

PDF icon A Cost-Effective Selective TMR for Heterogeneous Coarse-Grained Reconfigurable Architectures Based on DFG-Level Vulnerability Analysis [p. 701]
Takashi Imagawa, Hiroshi Tsutsui, Hiroyuki Ochi and Takashi Sato

This paper proposes a method to determine a priority for applying selective triple modular redundancy (selective TMR) against single-event upsets (SEUs) to achieve a cost-effective, reliable implementation of an application circuit on a coarse-grained reconfigurable architecture (CGRA). The priority is determined by an estimate of the vulnerability of each node in the data flow graph (DFG) of the application circuit. The estimate is based on a weighted sum of the features and parameters of each node in the DFG, which characterize the impact of an SEU in the node on the output data. The method requires neither time-consuming placement-and-routing processes nor extensive fault simulations for various triplication patterns, which allows us to identify the set of nodes to be triplicated to minimize vulnerability under a given area constraint at an early stage of the design flow. Therefore, the proposed method enables efficient design space exploration of reliability-oriented CGRAs and their applications.
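
A sketch of the priority computation follows (the features, weights, and areas are invented for illustration): each DFG node receives a vulnerability estimate as a weighted sum of cheap static features, and nodes are then triplicated greedily in priority order until the area budget is exhausted:

```python
# Sketch of weighted-sum vulnerability scoring and greedy selective TMR
# (features, weights, and areas invented; not the paper's trained model).
WEIGHTS = {"fanout": 0.5, "live_cycles": 0.3, "on_output_path": 2.0}

def vulnerability(node):
    return sum(WEIGHTS[f] * v for f, v in node["features"].items())

nodes = [
    {"name": "mul1", "area": 4,
     "features": {"fanout": 3, "live_cycles": 8, "on_output_path": 1}},
    {"name": "add1", "area": 1,
     "features": {"fanout": 1, "live_cycles": 2, "on_output_path": 1}},
    {"name": "add2", "area": 1,
     "features": {"fanout": 2, "live_cycles": 1, "on_output_path": 0}},
]

budget = 6                                      # spare area for redundancy
for node in sorted(nodes, key=vulnerability, reverse=True):
    cost = 2 * node["area"]                     # two extra copies (plus voter)
    if cost <= budget:
        budget -= cost
        print("triplicate", node["name"])       # mul1 (cost 8) does not fit;
                                                # add1 and add2 are triplicated
```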

PDF icon CSER: HW/SW Configurable Soft-Error Resiliency for Application Specific Instruction-Set Processors [p. 707]
Tuo Li, Muhammad Shafique, Semeen Rehman, Swarnalatha Radhakrishnan, Roshan Ragel, Jude Angelo Ambrose, Jörg Henkel and Sri Parameswaran

Soft errors have been identified as one of the major challenges for CMOS-technology-based computing systems. To mitigate this problem, error recovery is a key component, which usually accounts for a substantial cost, since it must introduce redundancy in either time or space. Consequently, using state-of-the-art recovery techniques can heavily strain the design constraints, which are fairly stringent in embedded system design. In this paper, we propose a HW/SW methodology that generates a processor performing finely configured error recovery under the given design constraints (e.g., performance, area, and power). Our methodology employs three application-specific optimization heuristics, which generate an optimized composition and configuration based on two primitive error recovery techniques. The resulting processor is composed of the selected primitive techniques at the corresponding instruction executions, and is configured to perform error recovery at run time according to the scheme determined at design time. The experimental results show that our methodology can achieve up to nine times the reliability while maintaining the given constraints, in comparison to the state of the art.

PDF icon Reliability Analysis for Integrated Circuit Amplifiers Used in Neural Measurement Systems [p. 713]
Nico Hellwege, Nils Heidmann, Dagmar Peters-Drolshagen and Steffen Paul

NBTI and HCI are present not only in digital circuits but also in analog circuitry. Integrated circuit amplifiers as used in neural measurement systems (NMS) need to be resilient against degradation, since these systems cannot be replaced easily. A topology-driven design methodology to increase the reliability of amplifiers used for intracortical neural recording is proposed in this work. This approach decreases the degradation of some system performance metrics by a factor of three. It is shown that the degradation of a circuit depends strongly on the selected current mirror and biasing circuit.
Index Terms - Analog circuits, negative bias temperature instability (NBTI), neural measurement system (NMS), circuit reliability.

PDF icon On-Line Testing of Permanent Radiation Effects in Reconfigurable Systems [p. 717]
Luca Cassano, Dario Cozzi, Sebastian Korf, Jens Hagemeyer, Mario Porrmann and Luca Sterpone

Partially reconfigurable systems are increasingly employed in many application fields, including aerospace. SRAM-based FPGAs represent an extremely interesting hardware platform for this kind of system, because they offer flexibility as well as processing power. In this paper, we report on the ongoing development of a software flow for the generation of hard macros for on-line testing and diagnosis of permanent faults due to radiation in SRAM-based FPGAs used in space missions. Once faults have been detected and diagnosed, the flow allows the generation of fine-grained patch hard macros that can be used to mask out the discovered faulty resources, allowing partially faulty regions of the FPGA to remain available for further use.
Keywords - Automatic Test Pattern Generation, Fault Diagnosis; On-Line Testing; Permanent Radiation Effects; SRAM-FPGA

PDF icon An Approach for Redundancy in FlexRay Networks Using FPGA Partial Reconfiguration [p. 721]
Shanker Shreejith, Kizheppatt Vipin, Suhaib A Fahmy and Martin Lukasiewycz

Safety-critical in-vehicle electronic control units (ECUs) demand high levels of determinism and isolation, since they directly influence vehicle behaviour and passenger safety. As modern vehicles incorporate more complex computational systems, ensuring the safety of critical systems becomes paramount. One-to-one redundant units have previously been proposed as measures for evolving critical functions like x-by-wire. However, these may not be viable solutions for power-constrained systems like next-generation electric vehicles. Reconfigurable architectures offer alternative approaches to implementing reliable safety-critical systems using more efficient hardware. In this paper, we present an approach for implementing redundancy in safety-critical in-car systems that uses FPGA partial reconfiguration and a customised bus controller to offer fast recovery from faults. Results show that such an integrated design is better than alternatives that use discrete bus interface modules.


6.6: HOT TOPIC: Energy-Efficient Design and Test Techniques for Future Multi-Core Systems

Organizer: Krishnendu Chakrabarty - Duke University, US
Moderators: Mehdi Tahoori - Karlsruhe Institute of Technology, DE; Paul Pop - Technical University of Denmark, DK
PDF icon Energy-Efficient Multicore Chip Design through Cross-Layer Approach [p. 725]
Paul Wettin, Jacob Murray, Partha Pande, Behrooz Shirazi and Amlan Ganguly

Traditional multi-core designs, based on the Network-on-Chip (NoC) paradigm, suffer from high latency and power dissipation as the system size scales up, due to the inherent multi-hop nature of communication. Introducing long-range, low-power, and high-bandwidth single-hop links between far-apart cores can significantly enhance the performance of NoC fabrics. In this paper, we propose the design of a small-world network-based NoC architecture with on-chip millimeter (mm)-wave wireless links. The millimeter-wave small-world NoC (mSWNoC) improves the overall latency and energy dissipation characteristics compared to its conventional mesh-based counterpart. The mSWNoC improves the energy dissipation, and hence the thermal profile, even further in the presence of network-level dynamic voltage and frequency scaling (DVFS), without incurring any additional latency penalty.
Keywords - NoC, wireless, mm-wave, small world, DVFS

PDF icon Breaking the Energy Barrier in Fault-Tolerant Caches for Multicore Systems [p. 731]
Paul Ampadu, Meilin Zhang and Vladimir Stojanovic

Balancing cache energy efficiency and reliability is a major challenge for future multicore system design. Supply voltage reduction is an effective tool to minimize cache energy consumption, usually at the expense of an increased number of errors. To achieve substantial energy reduction without degrading reliability, we propose an adaptive fault-tolerant cache architecture, which provides appropriate error control for each cache line based on the number of faulty cells detected at reduced supply voltages. Our experiments show that the proposed approach can improve energy efficiency by more than 25% and the energy-execution time product by over 10%, while improving reliability by up to 4X using the Mean-Error-To-Failure (METF) metric, compared to the next-best solution, at the cost of 0.08% storage overhead.
Keywords - Energy efficiency, fault tolerance, cache, VLSI, multicore.
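
The line-level adaptation described above can be pictured as a mapping from each line's low-voltage fault count to a protection scheme of increasing strength. The following Python sketch is only an illustration of that idea, with assumed thresholds and an assumed menu of codes, not the paper's actual mechanism:

def select_protection(faulty_cells_per_line):
    """Map each cache line's fault count (measured at the reduced
    supply voltage) to an error-control scheme."""
    schemes = []
    for n in faulty_cells_per_line:
        if n == 0:
            schemes.append("parity")         # detection only, cheapest
        elif n == 1:
            schemes.append("SECDED")         # corrects the single weak cell
        elif n <= 4:
            schemes.append("multi-bit ECC")  # stronger code, more storage
        else:
            schemes.append("disable")        # remap or disable the line
    return schemes

print(select_protection([0, 1, 3, 7]))
# ['parity', 'SECDED', 'multi-bit ECC', 'disable']

Since stronger codes cost storage and latency, reserving them for the few lines that actually exhibit faults is what recovers the energy headroom of voltage scaling.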

PDF icon Testing for SoCs with Advanced Static and Dynamic Power-Management Capabilities [p. 737]
Xrysovalantis Kavousianos and Krishnendu Chakrabarty

Many multicore chips today employ advanced power management techniques. Multi-threshold CMOS (MTCMOS) is very effective for reducing standby leakage power. Dynamic voltage scaling and voltage islands, which operate at multiple power-supply voltage levels, minimize dynamic power consumption. Effective defect screening for such chips requires advanced test techniques that target defects in the embedded cores and in the power management structures. We describe recent advances in test generation and test scheduling techniques for SoCs that support power switches, voltage islands, and dynamic voltage scaling schemes.
Index Terms - Dynamic power, dynamic voltage scaling, power switches, SoC test scheduling, static power.

PDF icon Towards Adaptive Test of Multi-core RF SoCs [p. 743]
Rajesh Mittal, Lakshmanan Balasubramanian, Chethan Kumar Y.B., V. R. Devanathan, Mudasir Kawoosa and Rubin A. Parekhji

This paper discusses how adaptive test techniques can be applied to multi-core RF SoCs, together with the associated design implementation and test challenges. Various techniques specific to RF circuits, covering calibration trims, power management modules, co-existence issues, concurrent testing, and test measurements, are explained. Results on different designs are presented. Together, they highlight the need for and scope of adaptive test for RF circuits, and open a new dimension in the test of multi-core circuits under different constraints of design, test and test equipment.
Keywords: Adaptive test, RF test, multi-core chips, test time optimization.


6.7: Model-Based Design and Verification for Embedded Systems

Moderators: Wang Yi - Uppsala University, SE; Saddek Bensalem - Verimag, FR
PDF icon A Satisfiability Approach to Speed Assignment for Distributed Real-Time Systems [p. 749]
Pratyush Kumar, Devesh B. Chokshi and Lothar Thiele

We study the problem of assigning speeds to resources serving distributed applications with delay, buffer and energy constraints. We argue that the considered problem does not have any straightforward solution due to the intricately related constraints. As a solution, we propose using Real-Time Calculus (RTC) to analyse the constraints and a SATisfiability solver to efficiently explore the design space. To this end, we develop an SMT solver by using the OpenSMT framework and the Modular Performance Analysis (MPA) toolbox. Two key enablers for this implementation are the analysis of incomplete models and generation of conflict clauses in RTC. The results on problem instances with very large decision spaces indicate that the proposed SMT solver performs very well in practice.

PDF icon Data Mining MPSoC Simulation Traces to Identify Concurrent Memory Access Patterns [p. 755]
Sofiane Lagraa, Alexandre Termier and Frédéric Pétrot

Due to a growing need for flexibility, massively parallel Multiprocessor SoC (MPSoC) architectures are currently being developed. This leads to a need for parallel software, but poses the problem of efficiently deploying that software on these architectures. The usual practice is to execute the parallel program on the platform with software tracing enabled and to visualize the traces in order to detect irregular timing behavior. This is error-prone, as it relies on software logs and human analysis, and it requires an existing platform. To overcome these issues and automate the process, we propose the joint use of a virtual platform that logs memory accesses at the hardware level and a data-mining approach that automatically reports unexpected instruction timings together with the context in which they occur. We demonstrate the approach on a multiprocessor platform running a video decoding application.

PDF icon Model-Based Energy Optimization of Automotive Control Systems [p. 761]
Joost-Pieter Katoen, Thomas Noll, Hao Wu, Thomas Santen and Dirk Seifert

Reducing the energy consumption of controllers in vehicles requires sophisticated regulation mechanisms. Better power management can be enabled by allowing the controller to shut down sensors, actuators or embedded control units in a way that keeps the car safe and comfortable for the user, with the goal of optimizing the (average or maximal) energy consumption. This paper proposes an approach to systematically explore the design space of SW/HW mappings to determine energy-optimal deployments. It employs constraint-solving techniques for generating deployment candidates and probabilistic analyses for computing the expected energy consumption of each deployment. The feasibility and scalability of the method are demonstrated by several case studies.

PDF icon Formal Analysis of Sporadic Bursts in Real-Time Systems [p. 767]
Sophie Quinton, Mircea Negrean and Rolf Ernst

In this paper we propose a new method for the analysis of response times in uni-processor real-time systems where task activation patterns may contain sporadic bursts. We use a burst model to calculate how often response times may exceed the worst-case response time bound obtained while ignoring bursts. This work is of particular interest for dealing with dual-cyclic frames in the analysis of CAN buses. Our approach can handle arbitrary activation patterns and both static-priority preemptive and non-preemptive scheduling policies. Experiments show the applicability and the benefits of the proposed method.
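
As background for the bound the paper starts from: under static-priority preemptive scheduling, the worst-case response time "obtained while ignoring bursts" is the classical fixed-point recurrence R = C + sum over higher-priority tasks j of ceil(R/Tj)*Cj. Below is a minimal Python version assuming a periodic task model (C, T); this is the textbook analysis, not the paper's burst extension:

import math

def wcrt(C, hp):
    """Iterate R = C + sum(ceil(R/Tj) * Cj) over higher-priority tasks hp
    until a fixed point is reached (assumes total utilization below 1)."""
    R = C
    while True:
        R_next = C + sum(math.ceil(R / Tj) * Cj for (Cj, Tj) in hp)
        if R_next == R:
            return R
        R = R_next

print(wcrt(2, [(1, 4), (2, 10)]))   # -> 6

The paper's contribution is then to quantify how often sporadic bursts make actual response times exceed such a bound, rather than to recompute the bound itself.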


7.1: HOT TOPIC - Many-Core SoC Approaches to Energy-Efficiency

Organizer: Ahmed Jerraya - CEA-LETI-MINATEC, FR
Moderators: Marc Duranton - CEA, FR; Ahmed Jerraya - CEA-LETI-MINATEC, FR
PDF icon Development of Low Power Many-Core SoC for Multimedia Applications [p. 773]
Takashi Miyamori, Hui Xu, Takeshi Kodaka, Hiroyuki Usui, Toru Sano and Jun Tanabe

New media processing applications such as image recognition and AR (Augmented Reality) have become practical on embedded systems for automotive, digital-consumer and mobile products. Many-core processors have been proposed to realize much higher performance than multi-core processors. We have developed a low-power many-core SoC for multimedia applications in 40nm CMOS technology. Within a 210mm2 die, two 32-core clusters are integrated with dynamically reconfigurable processors, hardware accelerators, 2-channel DDR3 interfaces, and other peripherals. Processor cores in each cluster share a 2MB L2 cache connected through a tree-based Network-on-Chip (NoC). The total peak performance exceeds 1.5TOPS (Tera Operations Per Second). High scalability and low power consumption are accomplished by parallelized firmware for the multimedia applications. The SoC performs 1080p 30fps H.264 decoding at about 400mW and 4K2K 15fps super-resolution under 800mW.
Keywords - Many-core; Network-on-Chip; VLIW; Low power; Power gating; H.264; Super resolution

PDF icon SoC Low-Power Practices for Wireless Applications [p. 778]
Nicolas Darbel and Stephane Lecomte

This paper describes current practices regarding low-power SoCs aimed at wireless applications.
Keywords - microprocessor, wireless application, gate-level models, DVFS, AVS, voltage stack, power management, low-power

PDF icon 3D Integration for Power-Efficient Computing [p. 779]
D. Dutoit, E. Guthmuller, I. Miro-Panades

3D stacking is currently seen as a breakthrough technology for improving bandwidth and energy efficiency in multi-core architectures. The expectation is to solve major issues such as external memory pressure and latency while maintaining reasonable power consumption. In this paper, we present some advances in this field of research, starting with memory interface solutions such as WideIO, demonstrated on a real chip to address the DRAM access issue. We explain the integration of a 512-bit memory interface in a Network-on-Chip multi-core framework and show the achievable performance, based on a 65nm prototype integrating 10μm-diameter Through Silicon Vias. We then present the potential of new fine-grained 3D stacking technology for a power-efficient memory hierarchy. We expose an innovative 3D-stacked multi-cache strategy aimed at lowering memory latency and external memory bandwidth requirements, demonstrating how 3D stacking allows architectures to be rethought to obtain unequalled power efficiency.


7.2: Formal Verification Algorithms and Models

Moderators: Christoph Scholl - University of Freiburg, DE; Jason Baumgartner - IBM, US
PDF icon Verifying Safety and Liveness for the FlexTM Hybrid Transactional Memory [p. 785]
Parosh Abdulla, Sandhya Dwarkadas, Ahmed Rezine, Arrvindh Shriraman and Yunyun Zhu

We consider the verification of safety (strict serializability and abort consistency) and liveness (obstruction and livelock freedom) for the hybrid transactional memory framework FLEXTM. This framework allows for flexible implementations of transactional memories based on an adaptation of the MESI coherence protocol, and supports both eager and lazy conflict resolution strategies. As in the case of Software Transactional Memories, the verification problem is not trivial, since the number of concurrent transactions, their size, and the number of accessed shared variables cannot be bounded a priori. This complexity is exacerbated by aspects that are specific to hardware and hybrid transactional memories. Our work takes into account intricate behaviours such as cache-line-based conflict detection, false sharing, invisible reads and non-transactional instructions. We carry out the first automatic verification of a hybrid transactional memory and establish, by adopting a small model approach, challenging properties such as strict serializability, abort consistency, and obstruction freedom for both the eager and the lazy conflict resolution strategies. We also find an example that refutes livelock freedom. To achieve this, our prototype tool makes use of the latest antichain-based techniques to handle systems with tens of thousands of states.

PDF icon QF_BV Model Checking with Property Directed Reachability [p. 791]
Tobias Welp and Andreas Kuehlmann

In 2011, property directed reachability (PDR) was proposed as an efficient algorithm for hardware model checking problems. Recent experimentation suggests that it outperforms interpolation-based verification, which had been considered the best known algorithm for this purpose for almost a decade. In this work, we present a generalization of PDR to the theory of quantifier-free formulae over bitvectors (QF_BV), illustrate the new algorithm with representative examples, and provide experimental results obtained with a prototype implementation.

PDF icon A Semi-Canonical Form for Sequential AIGs [p. 797]
Alan Mishchenko, Niklas Een, Robert Brayton, Michael Case, Pankaj Chauhan and Nikhil Sharma

In numerous EDA flows, time-consuming computations are repeatedly applied to sequential circuits. This motivates developing methods to determine whether a circuit has already been processed by a tool. This paper proposes an algorithm for semi-canonical labeling of the nodes in a sequential AIG, allowing problems or sub-problems solved by an EDA tool to be cached together with their computed results. This can speed up the tool when applied to designs with isomorphic components or to design suites exhibiting substantial structural similarity.
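
A toy version conveys the idea of such a labeling: sweep the AIG level by level and, within each level, order nodes by a deterministic signature of their fanin labels, so that structurally isomorphic AIGs receive identical label sequences regardless of the original node numbering. The signature below (the sorted pair of fanin labels) is an illustrative assumption; the actual algorithm necessarily handles complemented edges, latches and ties more carefully:

def semi_canonical_labels(inputs, ands, levels):
    """inputs: primary-input names; ands: node -> (fanin0, fanin1);
    levels: AND nodes grouped level by level, fanins always earlier.
    Returns node -> integer label."""
    label = {pi: i for i, pi in enumerate(sorted(inputs))}
    signature = lambda n: tuple(sorted((label[ands[n][0]],
                                        label[ands[n][1]])))
    nxt = len(label)
    for level in levels:
        for node in sorted(level, key=signature):
            label[node] = nxt
            nxt += 1
    return label

inputs = ["a", "b", "c"]
ands = {"n1": ("a", "b"), "n2": ("b", "c"), "n3": ("n1", "n2")}
print(semi_canonical_labels(inputs, ands, [["n2", "n1"], ["n3"]]))
# {'a': 0, 'b': 1, 'c': 2, 'n1': 3, 'n2': 4, 'n3': 5}

Nodes whose signatures tie can still be ordered arbitrarily, which is exactly why such a labeling is semi-canonical rather than canonical.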

PDF icon Fast Cone-Of-Influence Computation and Estimation in Problems with Multiple Properties [p. 803]
C. Loiacono, M. Palena, P. Pasini, D. Patti, S. Quer, S. Ricossa, D. Vendraminetto and J. Baumgartner

This paper introduces a new technique for fast computation of the Cone-Of-Influence (COI) of multiple properties. It specifically addresses frameworks where multiple properties belong to the same model and partially or fully share their COIs. In order to avoid repeated visits of the same circuit sub-graph representation, it proposes a new algorithm which performs a single topological visit of the variable dependency graph. It also studies mutual relationships among different properties, based on the overlap of their COIs. It finally considers state variable scoring, based on the variables' own COIs and/or their appearance in multiple COIs, as a new statistic for variable sorting and grouping/clustering in various Model Checking algorithms. Preliminary results show the advantages and potential applications of these ideas.
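
The single-visit idea can be illustrated with per-variable bitmasks: seed one bit per property at the variables that property reads, then sweep the dependency graph once in topological order, OR-ing each variable's mask into its fanins. A minimal Python sketch (the data structures are assumed for illustration, not taken from the paper):

def multi_property_coi(fanins, prop_roots, topo_order):
    """fanins[v]: variables v depends on; prop_roots: property id ->
    variables it reads; topo_order: every v precedes its fanins.
    Returns variable -> bitmask of the properties whose COI contains it."""
    mask = {v: 0 for v in topo_order}
    for p, roots in prop_roots.items():
        for v in roots:
            mask[v] |= 1 << p
    for v in topo_order:              # single visit per variable
        for u in fanins.get(v, ()):
            mask[u] |= mask[v]        # membership flows toward the inputs
    return {v: m for v, m in mask.items() if m}

fanins = {"p": ["x"], "q": ["x", "y"], "x": ["r"]}
print(multi_property_coi(fanins, {0: ["p"], 1: ["q"]},
                         ["p", "q", "x", "y", "r"]))
# {'p': 1, 'q': 2, 'x': 3, 'y': 2, 'r': 3}

The same masks directly yield the COI overlaps and the per-variable scores (in how many COIs a variable appears) that the paper proposes for sorting and clustering.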

PDF icon Using Cubes of Non-state Variables with Property Directed Reachability [p. 807]
John D. Backes and Marc D. Riedel

A new SAT-based algorithm for symbolic model checking has been gaining popularity. This algorithm, referred to as "Incremental Construction of Inductive Clauses for Indubitable Correctness" (IC3) or "Property Directed Reachability" (PDR), uses information learned from SAT instances of isolated time frames to either prove that an invariant exists or provide a counterexample. The information learned between time frames is recorded in the form of cubes over the state variables. In this work, we study the effect of extending PDR to use cubes over intermediate variables representing the logic gates in the transition relation. We demonstrate that we can improve the runtime for satisfiable benchmarks by up to 3.2X, with an average speedup of 1.23X. Our approach also provides a speedup of up to 3.84X for unsatisfiable benchmarks.

PDF icon Bridging the Gap between Dual Propagation and CNF-based QBF Solving [p. 811]
Alexandra Goultiaeva, Martina Seidl and Armin Biere

Conjunctive Normal Form (CNF) representation as used by most modern Quantified Boolean Formula (QBF) solvers is simple and powerful when reasoning about conflicts, but is not efficient at dealing with solutions. To overcome this inefficiency, a number of specialized non-CNF solvers were created and shown to have great advantages. Unfortunately, non-CNF solvers cannot benefit from the sophisticated CNF-based techniques developed over the years. This paper demonstrates how the power of non-CNF structure can be harvested without the need for specialized solvers; in fact, it is easily incorporated into most existing CNF-based QBF solvers using the pre-existing mechanism of cube learning. We demonstrate this using the state-of-the-art QBF solver DepQBF, and experimentally show the effectiveness of our approach.


7.3: Dynamic Reconfiguration

Moderators: Diana Goehringer - Karlsruhe Institute of Technology, DE; Fabrizio Ferrandi - Politecnico di Milano, IT
PDF icon Dynamic Configuration Prefetching Based on Piecewise Linear Prediction [p. 815]
Adrian Lifa, Petru Eles and Zebo Peng

Modern systems demand high performance as well as high degrees of flexibility and adaptability. Many current applications exhibit dynamic and nonstationary behavior: their characteristics in one phase of execution change as they enter new phases, in a manner unpredictable at design time. In order to meet the performance requirements of such systems, it is important to have on-line optimization algorithms, coupled with adaptive hardware platforms, that together can adjust to run-time conditions. We propose an optimization technique that minimizes the expected execution time of an application by dynamically scheduling hardware prefetches. We use a piecewise linear predictor to capture correlations and predict which hardware modules will be reached. Experiments show that the proposed algorithm outperforms the previous state of the art, reducing the expected execution time by up to 27% on average.
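
Predictors of this kind are typically perceptron-style: one signed weight per (target, history position), a dot product with the recent outcome history for the prediction, and a thresholded training rule. The sketch below illustrates the principle for module reachability; the history length, threshold and update rule are assumptions borrowed from piecewise linear branch prediction, not the paper's exact design:

class PiecewisePredictor:
    def __init__(self, hist_len=16, theta=20):
        self.hist_len, self.theta = hist_len, theta
        self.w = {}                    # (module, position) -> signed weight
        self.hist = [0] * hist_len     # recent outcomes: 1 reached, 0 not

    def _score(self, module):
        return sum(self.w.get((module, i), 0) * (1 if b else -1)
                   for i, b in enumerate(self.hist))

    def predict(self, module):
        return self._score(module) >= 0    # True -> prefetch its bitstream

    def train(self, module, reached):
        s = self._score(module)
        if (s >= 0) != reached or abs(s) <= self.theta:
            t = 1 if reached else -1       # perceptron-style update
            for i, b in enumerate(self.hist):
                self.w[(module, i)] = (self.w.get((module, i), 0)
                                       + t * (1 if b else -1))
        self.hist = self.hist[1:] + [1 if reached else 0]

Training only on mispredictions or low-confidence scores keeps the weights bounded, which is what makes such predictors cheap enough to sit inside a run-time prefetch scheduler.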

PDF icon An Automatic Tool Flow for the Combined Implementation of Multi-mode Circuits [p. 821]
Brahim Al Farisi, Karel Bruneel, João M. P. Cardoso and Dirk Stroobandt

A multi-mode circuit implements the functionality of a limited number of circuits, called modes, of which only one needs to be realised at any given time. Using run-time reconfiguration of an FPGA, all the modes can be implemented on the same reconfigurable region, requiring only an area that can contain the biggest mode. Typically, conventional run-time reconfiguration techniques generate a configuration for every mode separately. To switch between modes, the complete reconfigurable region is rewritten, which often leads to very long reconfiguration times. In this paper we present a novel, fully automated tool flow that exploits similarities between the modes and uses Dynamic Circuit Specialization to drastically reduce the reconfiguration time. Experimental results show that the number of bits rewritten in the configuration memory is reduced by a factor of 4.6x to 5.1x without significant performance penalties.

PDF icon Support for Dynamic Issue Width in VLIW Processors Using Generic Binaries [p. 827]
Anthony Brandon and Stephan Wong

Different applications exhibit different behavior that cannot be optimally captured by a fixed organization of a VLIW processor. However, by exploiting reconfigurable hardware we can optimize the organization when running different applications. In this paper, we propose a novel way to execute the same binary on processors with different issue widths without major hardware modifications, changing the compiler and assembler to ensure correct results. Our experiments show an average slowdown of around 1.3x compared to binaries compiled for specific issue widths. This can be further improved to less than 1.09x on average with additional compiler optimizations. Even though the flexibility comes at a price, it can be exploited for many other purposes, such as dynamic performance/energy trade-offs and energy-saving mechanisms, dynamic hardware sharing, and dynamic code insertion for hardware fault detection mechanisms.

PDF icon The RecoBlock SoC Platform: A Flexible Array of Reusable Run-Time-Reconfigurable IP-Blocks [p. 833]
Byron Navas, Ingo Sander and Johnny Öberg

Run-time reconfigurable (RTR) FPGAs combine the flexibility of software with the high efficiency of hardware. Still, their potential cannot be fully exploited due to the increased complexity of the design process. Consequently, to enable an efficient design flow, we devise a set of prerequisites to increase the flexibility and reusability of current FPGA-based RTR architectures. We apply these principles to design and implement the RecoBlock SoC platform, whose main characteristics are (1) an RTR plug-and-play IP-core whose functionality is configured at run time; (2) flexible inter-block communication configured via software; and (3) built-in buffers to support data-driven streams and inter-process communication. We illustrate the potential of our platform with a tutorial case study that uses an adaptive streaming application to investigate different combinations of reconfigurable arrays and schedules. The experiments underline the benefits of the platform and report its resource utilization.
Keywords - reconfigurable architectures; partial and run-time reconfiguration; system-on-chip; adaptivity; embedded systems

PDF icon DANCE: Distributed Application-aware Node Configuration Engine in Shared Reconfigurable Sensor Networks [p. 839]
Chih-Ming Hsieh, Zhonglei Wang and Jörg Henkel

Wireless sensor networks (WSNs) are often tailored to a single application to achieve one specific mission. Considering that the same physical phenomenon can be used by multiple applications, the benefit of sharing the WSN infrastructure is obvious in terms of development and deployment cost. However, allocating tasks to the WSN so as to meet the requirements of all applications while preserving energy efficiency is very challenging. Introducing reconfigurable nodes in shared sensor networks can improve performance, energy efficiency and flexibility, but it increases system complexity. In this paper, we propose a biologically inspired node configuration scheme for shared reconfigurable sensor networks, named DANCE, which can adapt to a changing environment and efficiently utilize WSN resources. Our experiments show that our scheme reduces the energy consumption by up to 76%.

PDF icon Hybrid Interconnect Design for Heterogeneous Hardware Accelerators [p. 843]
Cuong Pham-Quoc, Jan Heisswolf, Stephan Werner, Zaid Al-Ars, Jürgen Becker and Koen Bertels

The communication infrastructure is one of the important components of a multicore system, along with the computing cores and memories. A good interconnect design plays a key role in improving the performance of such systems. In this paper, we introduce a hybrid communication infrastructure that combines a standard bus with our area-efficient and delay-optimized network-on-chip for heterogeneous multicore systems, especially hardware accelerator systems. An adaptive, data-communication-based mapping for reconfigurable hardware accelerators is proposed to obtain a low-overhead, low-latency interconnect. Experimental results show that the proposed communication infrastructure and the adaptive mapping achieve a speed-up of 2.4x with respect to a similar system using only a bus as interconnect. Moreover, our proposed system achieves a 56% reduction in energy consumption compared to the original system.


7.4: Emerging Memory

Moderators: Ian O'Connor - Lyon Institute of Nanotechnology, FR; Siddharth Garg - University of Waterloo, CA
PDF icon OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches [p. 847]
Jue Wang, Xiangyu Dong and Yuan Xie

Emerging memory technologies are being explored as potential alternatives to the traditional SRAM/DRAM-based memory architecture in future microprocessor designs. Among them, Spin-Torque Transfer RAM (STT-RAM) has the benefits of fast read latency, low leakage power, and high density, and has therefore been investigated as a promising candidate for the last-level cache (LLC). One of the major disadvantages of STT-RAM is the latency and energy overhead associated with write operations. In particular, a long-latency write to an STT-RAM cache may obstruct other cache accesses and result in severe performance degradation. Consequently, mitigation techniques that minimize the write overhead are required in order to successfully adopt this new technology for cache design. In this paper, we propose an obstruction-aware cache management policy called OAP. OAP monitors the cache, periodically detects LLC-obstructing processes, and manages the cache accesses from different processes accordingly. Experimental results on a 4-core architecture with an 8MB STT-RAM L3 cache show that performance can be improved by 14% on average and up to 42%, with a 64% reduction in energy consumption.
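
One plausible shape for such a policy, with the period, threshold and response all being illustrative assumptions (the abstract does not spell out the mechanism): sample per-process write-induced stall counters every period, flag the processes whose writes obstructed the LLC beyond a threshold, and steer their accesses, here by bypassing the LLC, in the next period.

class OAPSketch:
    def __init__(self, period=100_000, threshold=0.2):
        self.period, self.threshold = period, threshold
        self.stalls = {}        # process -> cycles the LLC was blocked
        self.bypass = set()

    def record_write_stall(self, pid, cycles):
        self.stalls[pid] = self.stalls.get(pid, 0) + cycles

    def end_of_period(self):
        """Re-classify processes once per monitoring period."""
        self.bypass = {p for p, s in self.stalls.items()
                       if s / self.period > self.threshold}
        self.stalls.clear()

    def should_bypass_llc(self, pid):
        return pid in self.bypass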

PDF icon STT-RAM Designs Supporting Dual-Port Accesses [p. 853]
Xiuyuan Bi, Mohamed Anis Weldon and Hai Li

Spin-transfer torque random access memory (STT-RAM) has been widely investigated as a promising candidate to replace static random access memory (SRAM) in on-chip caches. However, existing STT-RAM cell designs support only single-port accesses, which limits the memory access bandwidth and constrains system performance. In this work, we propose design solutions that provide dual-port accesses for STT-RAM. The area increase incurred by the additional port is reduced by leveraging a shared source-line structure. A detailed analysis of the performance and reliability degradation caused by dual-port accesses, together with the corresponding design optimization, is performed. We propose two types of dual-port STT-RAM cell structures, with two read/write ports (2RW) or one read and one write port (1R/1W), respectively. Comparison shows that a 2RW STT-RAM cell consumes only 42% of the area of a dual-port SRAM cell. The 1R/1W design further reduces the cell area by 7.7% under the same performance target.

PDF icon Low Cost Power Failure Protection for MLC NAND Flash Storage Systems with PRAM/DRAM Hybrid Buffer [p. 859]
Jie Guo, Jun Yang, Youtao Zhang and Yiran Chen

In the latest PRAM/DRAM hybrid MLC NAND flash storage systems (NFSS), DRAM is used to temporarily store file system data in order to reduce system response time. To ensure data integrity, super-capacitors are deployed to supply backup power for moving the data from DRAM to NAND flash during power failures. However, the capacitance degradation of super-capacitors severely impairs system robustness. In this work, we propose a low-cost power failure protection scheme that reduces the energy consumption of power failure protection and increases the robustness of NFSS with a PRAM/DRAM hybrid buffer. Our scheme enables the adoption of more reliable regular capacitors to replace the super-capacitor as the backup power source. The experimental results show that our scheme can reduce the capacitance budget of the power failure protection circuitry by 75.1%, with very marginal performance and energy overheads.

PDF icon SPaC: A Segment-based Parallel Compression for Backup Acceleration in Nonvolatile Processors [p. 865]
Xiao Sheng, Yiqun Wang, Yongpan Liu and Huazhong Yang

The nonvolatile processor (NVP) has become an emerging topic in recent years. A conventional NV processor equips each flip-flop with nonvolatile storage for data backup, which results in very fast backup but significant area overheads. A compression-based architecture (PRLC) solves the area problem, but at the cost of a nontrivial increase in backup time. This paper presents a segment-based parallel compression (SPaC) architecture to trade off area against backup speed. Furthermore, we use a hybrid off-line/on-line method to balance the workloads of the different compression modules in SPaC. Experimental results show that SPaC achieves a 76% speedup over PRLC while reducing area by 16% compared to conventional NV processors.

PDF icon The Design of Sustainable Wireless Sensor Network Node Using Solar Energy and Phase Change Memory [p. 869]
Ping Zhou, Youtao Zhang and Jun Yang

The sustainability of a wireless sensor network (WSN) is crucial to its economy and efficiency. While previous works have focused on overcoming energy source limitations through solar energy harvesting, we reveal in this paper that a sensor node's lifespan can also be limited by memory wear-out and battery cycle life. We propose a sustainable sensor node design that takes all three limiting factors into consideration. Our design uses Phase Change Memory (PCM) to solve Flash memory's endurance issue. By leveraging PCM's adjustable write width, we propose a low-cost, fine-grained load tuning technique that allows the sensor node to match the current maximum power point (MPP) of the solar panel and reduces the number of discharge/charge cycles on the battery. Our modeling and experiments show that the proposed design achieves on average 5.1 years of node lifetime, more than 2x over the baseline.

PDF icon Optical Look Up Table [p. 873]
Zhen Li, Sébastien Le Beux, Christelle Monat, Xavier Letartre and Ian O'Connor

The computation capacity of conventional FPGAs is directly proportional to the size and expressive power of their Look-Up Table (LUT) resources. Individual LUT performance is limited by transistor switching time and power dissipation, defined by the CMOS fabrication process. In this paper we propose OLUT, an optical core implementation of the LUT, which has the potential for low-latency and low-power computation. In addition, the use of Wavelength Division Multiplexing (WDM) allows parallel computation, which can further increase computation capacity. Preliminary experimental results demonstrate the potential for optically assisted on-chip computation.
Index Terms - silicon photonic architectures, WDM, LUT

PDF icon A Verilog-A Model for Reconfigurable Logic Gates Based on Graphene pn-Junctions [p. 877]
Sandeep Miryrala, Mehrdad Montazeri, Andrea Calimera, Enrico Macii and Massimo Poncino

Single-layer sheets of graphene show special electrical properties that can enable the next generation of smart ICs. Recent works have demonstrated an electrostatically controlled pn-junction upon which it is possible to design multifunction reconfigurable logic devices that naturally behave as multiplexers. In this work we introduce a stable large-signal Verilog-A model that mimics the behavior of these devices. The proposed model, validated through the SPICE characterization of a MUX-based standard cell library designed as a benchmark, represents a first step towards Electronic Design Automation tools that support the design of all-graphene ICs.


7.5: Energy-efficient Architectures and Software Design for Power-constrained Systems

Moderators: Geoff Merrett - University of Southampton, UK; Gangadhar Garipelli - EPFL, CH
PDF icon Optimal Control of a Grid-Connected Hybrid Electrical Energy Storage System for Homes [p. 881]
Yanzhi Wang, Xue Lin, Massoud Pedram, Sangyoung Park and Naehyuck Chang

Integrating residential photovoltaic (PV) power generation and electrical energy storage (EES) systems into the Smart Grid is an effective way of utilizing renewable power and reducing the consumption of fossil fuels. This has become a particularly interesting problem with the introduction of dynamic electricity pricing models, since electricity consumers can use their PV-based energy generation and EES systems for peak shaving of their power demand profile from the grid, and thereby minimize their electricity bill. Due to the characteristics of realistic electricity price functions and the energy storage capacity limitation, the control algorithm for a residential EES system should accurately account for the various energy loss components during operation. Hybrid electrical energy storage (HEES) systems exploit the strengths of each type of EES element and hide its weaknesses, so as to achieve a combination of performance metrics superior to that of any individual EES component. This paper introduces the problem of how best to utilize a HEES system for a residential Smart Grid user equipped with PV power generation facilities. An optimal control algorithm for the HEES system is developed, which aims at minimizing the total electricity cost over a billing period under a general electricity price function. The proposed algorithm is based on dynamic programming and has polynomial time complexity. Experimental results demonstrate that the proposed HEES system and optimal control algorithm achieve a 73.9% average profit enhancement over baseline homogeneous EES systems.
Keywords - hybrid electrical energy storage system; smart grid; optimal control
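
The dynamic program the abstract refers to can be pictured over discretised time slots and storage levels. The toy model below - a single storage bank, a constant round-trip efficiency and no grid sell-back - is a deliberately simplified stand-in for the paper's HEES formulation, but it shows why the approach is polynomial: the table has one entry per (slot, state-of-charge) pair.

def optimal_schedule(prices, demand, pv, cap=10, eff=0.9):
    """Minimise the grid bill over the horizon; state = stored units."""
    T, INF = len(prices), float("inf")
    cost = [[INF] * (cap + 1) for _ in range(T + 1)]
    cost[0][0] = 0.0
    for t in range(T):
        for s in range(cap + 1):
            if cost[t][s] == INF:
                continue
            for a in range(-s, cap - s + 1):   # a > 0 charge, a < 0 discharge
                flow = a / eff if a > 0 else a * eff       # losses both ways
                grid = max(0.0, demand[t] - pv[t] + flow)  # no sell-back
                c = cost[t][s] + grid * prices[t]
                if c < cost[t + 1][s + a]:
                    cost[t + 1][s + a] = c
    return min(cost[T])

print(round(optimal_schedule([0.1, 0.4, 0.4], [1, 3, 3], [2, 0, 0]), 3))
# 0.789: charge cheaply in the first slot, discharge in the expensive ones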

PDF icon Radar Signature in Multiple Target Tracking System for Driver Assistant Application [p. 887]
Haisheng Liu and Smail Niar

This paper presents a new Driver Assistance System (DAS) using radar signatures. The system is able, on the one hand, to track multiple obstacles and, on the other hand, to identify obstacles during vehicle movement. Combining these two functions in the same DAS has the benefit of avoiding false alarms, and makes it possible to generate alarms that take the identification of the obstacles into account. The obstacle tracking process is simplified thanks to the identification stage. Hence, our low-cost FPGA-based System-on-Chip is able to detect, recognize and track a large number of obstacles in a relatively short time period. Our experimental results show that a speedup of 32% can be obtained compared to the standard system.
Index Terms - FPGA, Driver Assistance System, Radar signature, MTT, System-on-Chip

PDF icon Development of a Fully Implantable Recording System for ECoG Signals [p. 893]
Jonas Pistor, Janpeter Hoeffmann, David Rotermund, Elena Tolstosheeva, Tim Schellenberg, Dmitriy Boll, Victor Gordillo-Gonzales, Sunita Mandon, Dagmar Peters-Drolshagen, Andreas Kreiter, Martin Schneider, Walter Lang, Klaus Pawelzik and Steffen Paul

This paper presents a fully implantable neural recording system for the simultaneous recording of 128 channels. The electrocorticography (ECoG) signals are sensed with 128 gold electrodes embedded in a 10 μm thick polyimide foil. The signals are picked up by eight amplifier-array ICs and digitized with a resolution of 16 bit at 10 kHz. The digitized measurement data is processed in a reconfigurable digital ASIC, which is fabricated in a 0.35 μm CMOS technology and occupies an area of 2.8x2.8mm2. After data reduction, the measurement data is fed into a transceiver IC, which transmits it at up to 495 kbit/s to a base station, using an RF loop antenna on a flexible PCB. The system's power consumption of 84mW is supplied via inductive coupling from the base station.

PDF icon A Methodology for Embedded Classification of Heartbeats Using Random Projections [p. 899]
Rubén Braojos, Giovanni Ansaloni and David Atienza

Smart Wireless Body Sensor Nodes (WBSNs) are a novel class of unobtrusive, battery-powered devices allowing the continuous monitoring and real-time interpretation of a subject's bio-signals, one of the most relevant applications being the acquisition and analysis of Electrocardiograms (ECGs). While able to perform advanced signal processing to extract information on the heart condition of subjects, these low-power WBSN designs are usually constrained in terms of computational power and transmission bandwidth. It is therefore beneficial to identify in the early stages of analysis which parts of an ECG acquisition are critical, and to activate detailed (and computationally intensive) diagnosis algorithms only in these cases. In this paper, we introduce and study the performance of a real-time optimized neuro-fuzzy classifier based on random projections, which is able to discern normal and pathological heartbeats on an embedded WBSN while offering high confidence and low computational and memory requirements. Indeed, by focusing on abnormal heartbeat morphologies, we show that a WBSN system can effectively enhance its efficiency, obtaining energy savings of as much as 63% in the signal processing stage and 68% in the subsequent wireless transmission when the proposed classifier is employed.
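
At the core of such a classifier, a random projection is simply a fixed random matrix applied to each heartbeat window: it compresses the signal to a handful of features while approximately preserving distances (the Johnson-Lindenstrauss property), which is what makes the front-end cheap enough for a WBSN node. The dimensions below are illustrative:

import numpy as np

rng = np.random.default_rng(seed=7)
d, k = 256, 16                                 # samples per beat -> features
R = rng.standard_normal((k, d)) / np.sqrt(k)   # fixed projection matrix

def project(beat):
    """beat: length-d array of ECG samples -> length-k feature vector."""
    return R @ beat

print(project(rng.standard_normal(d)).shape)   # (16,)

The projected features then feed the neuro-fuzzy stage; only beats flagged as abnormal trigger the expensive diagnosis and transmission path.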

PDF icon A Survey of Multi-Source Energy Harvesting Systems [p. 905]
Alex S. Weddell, Michele Magno, Geoff V. Merrett, Davide Brunelli, Bashir M. Al-Hashimi and Luca Benini

Energy harvesting allows low-power embedded devices to be powered from naturally-occurring or unwanted environmental energy (e.g. light, vibration, or temperature difference). While a number of systems incorporating energy harvesters are now available commercially, they are specific to certain types of energy source. Energy availability varies over time as well as space. To address this issue, "hybrid" energy harvesting systems combine multiple harvesters on the same platform, but the design of these systems is not straightforward. This paper surveys their design, including the trade-offs affecting their efficiency, applicability, and ease of deployment. This survey, and the taxonomy of multi-source energy harvesting systems that it presents, will be of benefit to designers of future systems. Furthermore, we identify and comment upon current and future research directions in this field.

PDF icon Capital Cost-Aware Design and Partial Shading-Aware Architecture Optimization of a Reconfigurable Photovoltaic System [p. 909]
Yanzhi Wang, Xue Lin, Massoud Pedram, Jaemin Kim and Naehyuck Chang

Photovoltaic (PV) systems are often subject to partial shading, which significantly degrades the output power of the whole system. Reconfiguration methods have been proposed to adaptively change the PV panel configuration according to the current partial shading pattern. The reconfigurable PV panel architecture integrates every PV cell with three programmable switches to facilitate the PV panel reconfiguration. The additional switches, however, increase the capital cost of the PV system. In this paper, we group a number of PV cells into a PV macro-cell, and the PV panel reconfiguration only changes the connections between adjacent PV macro-cells. The size and internal structure (i.e., the series-parallel connection of PV cells) of all PV macro-cells are the same and will not be changed after the PV system is installed in the field. Determining the optimal size of the PV macro-cell is the result of a trade-off between the decreased PV system capital cost and the enhanced PV system performance: a larger PV macro-cell reduces the cost overhead, whereas a smaller PV macro-cell achieves better performance. We set out to calculate the optimal size of the PV macro-cells such that maximum system performance is achieved subject to an overall system cost limitation. This "design" problem is solved using an efficient search algorithm. In addition, we provide in-field reconfigurability of the PV panel by enabling the formation of series-connected groups of parallel-connected macro-cells, ensuring maximum output power of the PV system in response to any occurring partial shading pattern. This "architecture optimization" problem is solved using dynamic programming.

PDF icon An Ultra-Low Power Hardware Accelerator Architecture for Wearable Computers Using Dynamic Time Warping [p. 913]
Reza Lotfian and Roozbeh Jafari

Movement monitoring using wearable computers is widely used in healthcare and wellness applications. To reduce the form factor of wearable nodes, which is dominated by battery size, ultra-low-power signal processing is crucial. In this paper, we propose an architecture that can be viewed as a hardware accelerator and employs dynamic time warping (DTW) in a hierarchical fashion. The proposed architecture removes events that are not of interest from the signal processing chain as early as possible, deactivating all remaining modules. We consider tunable parameters, such as the sampling frequency and bit resolution of the incoming sensor readings, to balance the trade-off between power consumption and classification precision for DTW. We formulate a methodology for determining the optimal set of tunable parameters and provide a solution using an active-set algorithm. We synthesized the architecture in 45nm CMOS and show that a three-tiered module achieves 98% accuracy within a power budget of 1.23μW, while a single-level DTW consumes 6.3μW at the same accuracy. We furthermore propose a fast approximation methodology that runs 3200 times faster than the original optimization for determining total power consumption, while introducing less than 3% error.
Keywords - Hardware Accelerator; Activity Recognition; Granular Decision Making Module; Dynamic Time Warping
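
Dynamic time warping itself is the textbook quadratic dynamic program below; the hierarchy described above would evaluate it first on aggressively downsampled, low-resolution samples and wake the more accurate tiers only for candidate events. Plain Python for illustration, not the synthesized hardware:

def dtw(a, b):
    """Classic O(n*m) DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

print(dtw([0, 1, 2, 1], [0, 1, 1, 2, 1]))   # 0.0: same shape, one sample warped

Lowering the sampling frequency shrinks n and m quadratically, and lowering the bit resolution cheapens each |a[i] - b[j]| computation, which is exactly the power/precision trade-off the tunable parameters expose.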

PDF icon Efficient Cache Architectures for Reliable Hybrid Voltage Operation Using EDC Codes [p. 917]
Bojan Maric, Jaume Abella and Mateo Valero

Semiconductor technology evolution enables the design of sensor-based battery-powered ultra-low-cost chips (e.g., below 1 C) required for new market segments such as body, urban life and environment monitoring. Caches have been shown to be the highest energy and area consumer in those chips. This paper proposes a novel, hybrid-operation (high Vcc, ultra-low Vcc), single-Vcc domain cache architecture based on replacing energy-hungry bitcells (e.g., 10T) by more energy-efficient and smaller cells (e.g., 8T) enhanced with Error Detection and Correction (EDC) features for high reliability and performance predictability. Our architecture is proven to largely outperform existing solutions in terms of energy and area.
Index Terms - Caches, Low Energy, Reliability, Real-Time


7.6: On-Line Approaches towards Processor Resilience

Moderators: Yiorgos Makris - University of Dallas, US; Xavier Vera - Intel, ES
PDF icon Efficient Software-Based Fault Tolerance Approach on Multicore Platforms [p. 921]
Hamid Mushtaq, Zaid Al-Ars and Koen Bertels

This paper describes a low-overhead software-based fault tolerance approach for shared-memory multicore systems. The scheme is implemented at user-space level and requires almost no changes to the original application. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme ensures that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses, and provides a very low-overhead mechanism to achieve this. Moreover, it implements a fast error detection and recovery mechanism. The overhead incurred by our approach ranges from 0% to 18% for the selected benchmarks, which is lower than that of comparable systems published in the literature.

PDF icon Using Explicit Output Comparisons for Fault Tolerant Scheduling (FTS) on Modern High-Performance Processors [p. 927]
Yue Gao, Sandeep K. Gupta and Melvin A. Breuer

Soft errors and errors caused by intermittent faults are a major concern for modern processors. In this paper we provide a drastically different approach for fault tolerant scheduling (FTS) of tasks in such processors. Traditionally in FTS, error detection is performed implicitly and concurrently with task execution, and associated overheads are incurred as increases in software run-time or hardware area. However, such embedded error detection (EED) techniques, e.g., watchdog processor assisted control flow checking, only provide approximately 70% error coverage [1, 2]. We propose the idea of utilizing straightforward explicit output comparison (EOC) which provides nearly 100% error coverage. We construct a framework for utilizing EOC in FTS, identify new challenges and tradeoffs, and develop a new off-line scheduling algorithm for EOC. We show that our EOC based approach provides higher error coverage and an average performance improvement of nearly 10% over EED-based FTS approaches, without increasing resource requirements. In our ongoing research we are identifying a richer set of ways of applying EOC, by itself and in conjunction with EED, to obtain further improvements.
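
In its simplest form, EOC is exactly what the name says: schedule the task twice and compare the two outputs, trading roughly doubled execution for near-complete error coverage. A schematic sketch assuming a deterministic task; the paper's contribution lies in when and where the scheduler places the redundant executions, which is abstracted away here:

def run_with_eoc(task, inputs):
    out1 = task(inputs)    # primary execution
    out2 = task(inputs)    # redundant execution placed by the FTS schedule
    if out1 != out2:
        raise RuntimeError("EOC mismatch: soft or intermittent fault")
    return out1

An error escapes only if both executions fail identically, which is why EOC approaches 100% coverage where embedded error detection tops out near 70%.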

PDF icon Low Cost Permanent Fault Detection Using Ultra-Reduced Instruction Set Co-Processors [p. 933]
Sundaram Ananthanarayan, Siddharth Garg, Hiren D. Patel

In this paper, we propose a new, low-hardware-overhead solution for permanent fault detection at the micro-architecture/instruction level. The proposed technique is based on an ultra-reduced instruction set co-processor (URISC) that, in its simplest form, executes only one Turing-complete instruction - the subleq instruction. Thus, any instruction on the main core can be redundantly executed on the URISC using a sequence of subleq instructions, and the results can be compared, also on the URISC, to detect faults. A number of novel software and hardware techniques are proposed to decrease the performance overhead of online fault detection while keeping the error detection latency bounded, including: (i) URISC routines and hardware support to check both control and data flow instructions; (ii) checking only a subset of instructions in the code based on a novel check window criterion; and (iii) URISC instruction set extensions. Our experimental results, based on FPGA synthesis and RTL simulations, illustrate the benefits of the proposed techniques.
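
The subleq instruction ("subtract and branch if less than or equal to zero") is worth seeing concretely, since the checking scheme rests on re-expressing main-core instructions as short subleq sequences. Below, a minimal interpreter and an addition built from three subleqs; the memory layout is chosen purely for illustration:

def run_subleq(mem, pc=0, max_steps=100):
    """mem holds both code and data; each instruction is a triple
    (a, b, c): mem[b] -= mem[a]; branch to c if the result is <= 0."""
    while 0 <= pc < len(mem) - 2 and max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        max_steps -= 1
    return mem

A, B, Z = 9, 10, 11          # data addresses: two operands, a zero temp
prog = [A, Z, 3,             # Z -= A   (Z becomes -5, branch taken)
        Z, B, 6,             # B -= Z   (B becomes 7 + 5 = 12)
        Z, Z, -1,            # clear Z and branch out of range: halt
        5, 7, 0]             # data: A = 5, B = 7, Z = 0
print(run_subleq(prog)[B])   # 12 - ADD realised purely with subleq

Because one instruction suffices, the co-processor's datapath stays tiny, which is what keeps the hardware cost of the redundant checking low.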

PDF icon Improving Fault Tolerance Utilizing Hardware-Software-Co-Synthesis [p. 939]
Heinz Riener, Stefan Frehse and Görschwin Fey

Embedded systems consist of hardware and software and are ubiquitous in safety-critical and mission-critical fields. The increasing integration density of modern digital circuits causes an increasing vulnerability of embedded systems to transient faults. Techniques to improve fault tolerance are often implemented either in hardware or in software. In this paper, we focus on synthesis techniques that improve the fault tolerance of embedded systems considering hardware and software together. A greedy algorithm is presented which iteratively assesses the fault tolerance of a processor-based system and decides which components of the system have to be hardened, choosing from a set of existing techniques. We evaluate the algorithm in a simple case study using a Traffic Collision Avoidance System (TCAS).
Index Terms - Fault tolerance, Formal methods, Synthesis, Optimization
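
The greedy loop can be written down generically: assess the system, tentatively apply each remaining hardening candidate, commit the one with the largest measured gain, and repeat until the fault-tolerance target is met. The callbacks below stand in for the paper's formal assessment engine and its set of hardening techniques:

def greedy_harden(candidates, assess, harden, unharden, target):
    """candidates: (component, technique) pairs; assess() returns the
    current fault-tolerance metric (higher is better); harden/unharden
    apply or revert one candidate on the system model."""
    chosen = []
    while assess() < target and candidates:
        gains = {}
        for c in candidates:
            harden(c)
            gains[c] = assess()    # metric with this candidate applied
            unharden(c)
        best = max(gains, key=gains.get)
        harden(best)               # commit the most beneficial hardening
        candidates.remove(best)
        chosen.append(best)
    return chosen

Greedy selection is not guaranteed to be globally optimal, but each iteration needs only one assessment per remaining candidate, which keeps the design-space exploration tractable.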

PDF icon A Dynamic Self-Adaptive Correction Method for Error Resilient Application [p. 943]
Luming Yan, Huaguo Liang, Zhengfeng Huang

Aggressive technology scaling has made transistor aging a new challenge for circuit reliability. Transistor aging causes gradual degradation of circuit performance and eventually leads to timing errors. In this paper, a dynamic self-adaptive method is proposed to protect circuits from the influence of transistor aging. It makes use of aging detection sensors and a self-adaptive clock-scaling cell: when an error occurs, the aging sensors automatically wake up the clock-scaling cell to shift the clock phase of the circuit, and the timing error is then masked by a second sampling with the shifted clock. The method is simulated in HSPICE using a 65nm technology. The evaluation results show that the method provides error resilience with no impact on the normal function of the circuit, improving the MTTF by 1.16 times with 22.73% circuit overhead on average when the phase difference is 20% of the clock cycle.


7.7: EMBEDDED TUTORIAL: From Multi-core SoC to Scale-out Processors

Organizer: Luca Fanucci - University of PISA, IT
Moderators: Marcello Coppola - STMicroelectronics, FR; Luca Fanucci - University of Pisa, IT
PDF icon From Embedded Multi-core SoCs to Scale-out Processors [p. 947]
Marcello Coppola, Babak Falsafi, John Goodacre and George Kornaros

Information technology is now an indispensable pillar of modern-day society. However, CMOS technologies, which lay the foundation for all digital platforms, are experiencing a major inflection point due to a slowdown in voltage scaling. The net result is that power is emerging as the key design constraint for all platforms, from embedded systems to datacenters. This tutorial presents emerging design paradigms ranging from embedded multicore SoCs to server processors for scale-out datacenters based on mobile cores.
Keywords - MPSoC, ARMv8 architecture, hardware virtualization, IOMMU, scale-out processors, total cost of ownership


8.1: HOT TOPIC - Fabrication Technology Approaches to Energy-Efficiency

Organizer: Ahmed Jerraya - CEA-LETI-MINATEC, FR
Moderator: Ahmed Jerraya - CEA-LETI-MINATEC, FR
PDF icon UTBB FD-SOI: A Process/Design Symbiosis for Breakthrough Energy-efficiency [p. 952]
Philippe Magarshack, Philippe Flatresse and Giorgio Cesana

UTBB FD-SOI technology has become mainstream within STMicroelectronics, with the objective of serving a wide spectrum of mobile multimedia products. This breakthrough technology brings a significant improvement in performance and power saving, complemented by an excellent responsiveness to power-management design techniques for energy efficiency optimization. The symbiosis between process and design is key to this achievement, enabling, already at the 28nm node, real differentiation in terms of flexibility, cost and energy efficiency with respect to any other process available on the market.
Keywords: UTBB FD-SOI, CMOS, high-performance, low-power, mobile application, SoC, energy efficiency, Back-Bias

PDF icon Wireless Interconnect for Board and Chip Level [p. 958]
Gerhard P. Fettweis, Najeeb ul Hassan, Lukas Landau and Erik Fischer

Electronic systems of the future require a very high-bandwidth communications infrastructure within the system, so that the massive compute power that will become available can be interconnected to realize powerful future electronic systems. Electronic inter-connects between 3D chip-stacks, as well as intra-connects within 3D chip-stacks, will soon approach data rates of 100 Gbit/s. Hence, the question to be answered is how to efficiently design the communications infrastructure within electronic systems. This paper addresses approaches and results for building this infrastructure for future electronics.

PDF icon Future Memory and Interconnect Technologies [p. 964]
Yuan Xie

Improvements in computer system performance are constrained by the well-known memory wall and power wall. It has been recognized that the memory architecture and the interconnect architecture are becoming the overwhelming bottleneck in computer performance. Disruptive technologies, such as emerging non-volatile memory (NVM) technologies, 3D integration, and optical interconnects, are envisioned as promising future memory and interconnect technologies that can fundamentally change the landscape of future computer architecture design. This invited survey paper gives a brief introduction to these technologies and discusses the opportunities and challenges they present for future computer system designs.


8.2: Scheduling for Real-Time Embedded Systems

Moderators: Wido Kruijtzer - Synopsys, NL; Jan Madsen - Technical University of Denmark, DK
PDF icon Optimized Scheduling of Multi-IMA Partitions with Exclusive Region for Synchronized Real- Time Multi-Core Systems [p. 970]
Jung-Eun Kim, Man-Ki Yoon, Sungjin Im, Richard Bradford and Lui Sha

Integrated Modular Avionics (IMA) architecture has been widely adopted by the avionics industry due to its strong temporal and spatial isolation capabilities for safety-critical real-time systems. The fundamental challenge in integrating an existing set of single-core IMA partitions into a multi-core system is to ensure that the isolation of the partitions is maintained without incurring huge redevelopment and recertification costs. To address this challenge, we develop an optimized partition scheduling algorithm which considers exclusive regions to achieve synchronization between partitions across cores. We show that the problem of finding the optimal partition schedule is NP-complete and present a Constraint Programming formulation. In addition, we relax the problem to find the minimum number of cores needed to schedule a given set of partitions, and propose an approximation algorithm which is guaranteed to find a feasible schedule of partitions whenever a feasible schedule of exclusive regions exists.

PDF icon Quality-Aware Media Scheduling on MPSoC Platforms [p. 976]
Deepak Gangadharan, Samarjit Chakraborty and Roger Zimmermann

Applications that stream multiple video/audio or video+audio clips are increasingly being implemented on embedded devices. A Picture-in-Picture (PiP) application, where two videos are played simultaneously, is one such scenario. Although PiP is handled very efficiently in televisions and personal computers by providing maximum quality of service to the multiple streams, it is a difficult task on devices with resource constraints. In order to utilize the resources efficiently, it is essential to derive the processor cycles necessary for the multiple video streams to be displayed under a prespecified quality constraint. We therefore propose a network-calculus-based formal framework to help schedule multiple media streams in the presence of buffer constraints. Our framework also provides a schedulability analysis condition to check whether the multimedia streams can be scheduled such that a prespecified quality constraint is satisfied with the available service. We present this framework in the context of a PiP application, but it is applicable to multiple media streams in general. The results obtained using the formal framework were further verified through system simulation experiments.

PDF icon Priority Assignment for Event-triggered Systems Using Mathematical Programming [p. 982]
Martin Lukasiewycz, Sebastian Steinhorst and Samarjit Chakraborty

This paper presents a methodology based on mathematical programming for the priority assignment of processes and messages in event-triggered systems with tight end-to-end real-time deadlines. For this purpose, the problem is converted into a Quadratically Constrained Quadratic Program (QCQP) and addressed with a state-of-the-art solver. The formulation includes preemptive as well as non-preemptive schedulers and avoids cyclic dependencies that may lead to intractable real-time analysis problems. For problems with stringent real-time requirements, the proposed method is capable of finding a feasible solution efficiently where other approaches suffer from poor scalability. In case no feasible solution exists, an algorithm is presented that uses the proposed method to find a minimal reason for the infeasibility, which may be fed back to the designer. To give evidence of the scalability of the proposed method and to show its clear benefit over existing approaches, a set of synthetic test cases is evaluated. Finally, a large realistic case study is introduced and solved, showing the applicability of the proposed method in the automotive domain.

PDF icon Efficient and Scalable OpenMP-based System-level Design [p. 988]
Alessandro Cilardo, Luca Gallo, Antonino Mazzeo and Nicola Mazzocca

In this work we present an experimental environment for electronic system-level design based on the OpenMP programming paradigm. Fully compliant with the OpenMP standard, the environment allows the generation of heterogeneous hardware/software systems exhibiting good scalability with respect to the number of threads and limited performance overheads. Based on well-established OpenMP benchmarks, the paper also presents comparisons with high-performance software implementations as well as with previous proposals oriented towards pure hardware translation. The results confirm that the proposed approach improves both efficiency and scalability.

PDF icon Utilizing Voltage-Frequency Islands in C-to-RTL Synthesis for Streaming Applications [p. 992]
Xinyu He, Shuangchen Li, Yongpan Liu, X. Sharon Hu and Huazhong Yang

Automatic C-to-RTL (C2RTL) synthesis can greatly benefit hardware design for streaming applications. However, stringent throughput/area constraints, and especially the demand for power optimization at the system level, are rather challenging for existing C2RTL synthesis tools. This paper considers a power-aware C2RTL framework using voltage-frequency islands (VFIs) to address these challenges. Given the throughput, area, and power constraints, an MILP-based approach is introduced to synthesize C code into an RTL design by simultaneously considering three design knobs, i.e., partitioning, parallelization, and VFI assignment, to obtain a globally optimal solution. A heuristic solution is also discussed to deal with the scalability challenge facing the MILP formulation. Experimental results based on four well-known multimedia applications demonstrate the effectiveness of both solutions.


8.3: Logic Synthesis Techniques

Moderators: Michel Berkelaar - Delft University of Technology, NL; Jordi Cortadella - Universitat Politècnica Catalunya, ES
PDF icon Minimization of P-Circuits Using Boolean Relations [p. 996]
Anna Bernasconi, Valentina Ciriani, Gabriella Trucco and Tiziano Villa

In this paper, we investigate how to use the complete flexibility of P-circuits, which realize a Boolean function by projecting it onto overlapping subsets given by a generalized Shannon decomposition. It is known how to compute the complete flexibility of P-circuits, but the algorithms proposed so far for its exploitation do not guarantee to find the best implementation, because they cast the problem as the minimization of an incompletely specified function. Instead, here we show that to explore all solutions we must set up the problem as the minimization of a Boolean relation, because there are don't care conditions that cannot be expressed by single cubes. In the experiments we report major improvements with respect to the previously published results.

PDF icon Intuitive ECO Synthesis for High Performance Circuits [p. 1002]
Haoxing Ren, Ruchir Puri, Lakshmi Reddy, Smita Krishnaswamy, Cindy Washburn, Joel Earl and Joachim Keinert

In the IC industry, chip design cycles are becoming more compressed, while designs themselves are growing in complexity. These trends necessitate efficient methods to handle late-stage engineering change orders (ECOs) to the functional specification, often in response to errors discovered after much of the implementation is finished. Past ECO synthesis algorithms have typically treated ECOs as functional errors and applied error diagnosis techniques to solve them. However, error diagnosis methods are primarily geared towards finding a single change, and moreover, tend to be computationally complex. In this paper, we propose a unique methodology that can systematically incorporate human intuition into the ECO process. Our methodology involves finding a set of directly substitutable points known as functional correspondences between the original implementation and the new specification by using name-preserving synthesis and user hints, to diminish the size of the ECO problem. On average, our approach can reduce the size of logic changes by 94% from those reported in current literature. We then incorporate our logic ECO changes into an incremental physical synthesis flow to demonstrate its usability in an industrial setting. Our ECO synthesis methodology is evaluated on high-performance industrial designs. Results indicate that post-ECO worst negative slack (WNS) improved 14% and total negative slack (TNS) improved 46% over pre-ECO.
Keywords - Engineering Change Order, Logic Synthesis, Physical Synthesis

PDF icon Retiming for Soft Error Minimization under Error-Latching Window Constraints [p. 1008]
Yinghai Lu and Hai Zhou

Soft error has become a critical reliability issue in nanoscale integrated circuits, especially in sequential circuits, where a latched error can be propagated for many cycles and affect many outputs at different times. Retiming is a structural operation that relocates registers in a circuit without changing its functionality. In this paper, the effect of retiming on the soft error rate (SER) of a sequential circuit is investigated considering both logic masking and timing masking. A minimum observability retiming problem under error-latching window constraints is formulated to reduce the SER of the circuit, and an efficient algorithm is proposed to solve the problem optimally. Experimental results show on average a 32.7% reduction in SER from the original circuits and a 15% improvement over the existing method.

PDF icon Biconditional BDD: A Novel Canonical BDD for Logic Synthesis Targeting XOR-rich Circuits [p. 1014]
Luca Amarú, Pierre-Emmanuel Gaillardon and Giovanni De Micheli

We present a novel class of decision diagrams, called Biconditional Binary Decision Diagrams (BBDDs), that enable efficient logic synthesis for XOR-rich circuits. BBDDs are binary decision diagrams where Shannon's expansion is replaced by the biconditional expansion. Since the biconditional expansion is based on the XOR/XNOR operations, XOR-rich logic circuits are efficiently represented and manipulated with canonical Reduced and Ordered BBDDs (ROBBDDs). Experimental results show that ROBBDDs have 37% fewer nodes on average compared to traditional ROBDDs. To exploit this opportunity in logic synthesis for XOR-rich circuits, we developed a BBDD-based One-Pass Synthesis (OPS) methodology. The BBDD-based OPS is capable of harnessing the potential of novel XOR-efficient devices, such as ambipolar transistors. Experimental results show that our logic synthesis methodology reduces the number of ambipolar transistors by 49.7% on average with respect to a state-of-the-art commercial logic synthesis tool. Considering CMOS technology, the BBDD-based OPS reduces the device count by 31.5% on average compared to the same commercial tool.
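
For reference, the two expansions contrasted above can be transcribed as follows (our transcription of the standard definitions, with v and w two distinct variables):

    % Shannon's expansion with respect to a single variable v
    f = v \cdot f\big|_{v=1} \;+\; \overline{v} \cdot f\big|_{v=0}

    % Biconditional expansion with respect to a variable pair (v, w)
    f = (v \oplus w) \cdot f\big|_{v=\overline{w}} \;+\; (v \odot w) \cdot f\big|_{v=w}

Because every node branches on the equality or inequality of a variable pair rather than on a single variable's value, an XOR/XNOR chain collapses into a chain of nodes that is linear in the number of variables.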

PDF icon Optimizing BDDs for Time-Series Dataset Manipulation [p. 1018]
Stergios Stergiou and Jawahar Jain

In this work we advocate the adoption of Binary Decision Diagrams (BDDs) for storing and manipulating time-series datasets. We first propose a generic BDD transformation which identifies and removes 50% of all BDD edges without any loss of information. Next, we optimize the core operation for adding samples to a dataset and characterize its complexity. We identify time-range queries as one of the core operations executed on time-series datasets, and describe explicit Boolean function constructions that aid in executing them efficiently directly on BDDs. We exhibit significant space and performance gains when applying our algorithms on synthetic and real-life biosensor time-series datasets collected from field trials.
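
As a plain illustration of why time-range queries mesh well with BDDs, note that the predicate lo <= t <= hi is itself a Boolean function of the timestamp bits; the toy sketch below evaluates it bitwise on explicit bit-vectors, where a BDD would apply the same comparator structure symbolically. Everything here is invented for illustration and is not the paper's construction.

    # Toy illustration (not the paper's construction): the range predicate
    # lo <= t <= hi is a Boolean function of the timestamp bits, so a BDD
    # can apply it symbolically instead of scanning records one by one.

    WIDTH = 8                                    # timestamp bit-width, invented

    def bits(t):
        """Timestamp as a tuple of WIDTH bits, most significant bit first."""
        return tuple((t >> i) & 1 for i in reversed(range(WIDTH)))

    def geq(tb, cb):
        """t >= c as an MSB-first comparator chain; this chain shape is what
        gives the predicate a small decision-diagram representation."""
        for x, y in zip(tb, cb):
            if x != y:
                return x > y
        return True                              # all bits equal

    def in_range(tb, lo, hi):
        return geq(tb, bits(lo)) and geq(bits(hi), tb)

    # Dataset held as explicit bit-vectors; a BDD would instead store the
    # characteristic function of this set and intersect it with the predicate.
    dataset = {bits(t) for t in (3, 17, 42, 99, 200)}
    hits = {tb for tb in dataset if in_range(tb, 10, 100)}
    print(len(hits), "samples fall in [10, 100]")   # -> 3 (t = 17, 42, 99)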

PDF icon Incorporating the Impacts of Workload-Dependent Runtime Variations into Timing Analysis [p. 1022]
Farshad Firouzi, Saman Kiamehr, Mehdi Tahoori and Sani Nassif

In the nanometer era, runtime variations due to workload-dependent voltage and temperature variations as well as transistor aging introduce remarkable uncertainty and unpredictability into nanoscale VLSI designs. Consideration of short-term and long-term workload-dependent runtime variations at design time, and of the interdependence of various parameters, remains a major challenge. Here, we propose a static timing analysis framework to accurately capture the combined effects of various workload-dependent runtime variations happening at different time scales, making the link between system-level runtime effects and circuit-level design. The proposed framework is fully integrated with an existing commercial EDA toolset, making it scalable to very large designs. We observe that for benchmark circuits, treating each aspect independently and ignoring their intrinsic interactions is optimistic and results in considerable underestimation of the timing margin.


8.4: High-Speed Robust NoCs

Moderators: Luca Carloni - Columbia University, US; Georgios Dimitrakopoulos - Thrace University, GR
PDF icon Exploring Topologies for a Source-synchronous Ring-based Network-on-Chip [p. 1026]
Ayan Mandal, Sunil P. Khatri and Rabi N. Mahapatra

The mesh interconnection network has been preferred by the Network-on-Chip (NoC) community due to its simple implementation, high bandwidth and overall scalability. Most existing mesh-based NoC designs operate the mesh at the same or lower clock speed as the processing elements (PEs). Recently, a new source synchronous ring-based NoC architecture has been proposed, which runs significantly faster than the PEs and offers a significantly higher bandwidth and lower communication latency. The authors implement the NoC topology as a mesh of rings, which occupies the same area as that of a mesh. In this work, we evaluate two alternate source synchronous ring-based NoC topologies called the ring of stars (ROS) and the spine with rings (SWR), which occupy a much lower area and are able to provide better performance in terms of communication latency compared to a state-of-the-art mesh. In our proposed topologies, the clock and the data NoC are routed in parallel, yielding a fast, synchronous, robust design. Our design allows the PEs to extract a low-jitter clock from the high-speed ring clock by division. The area and performance of these ring-based NoC topologies are quantified. Experimental results on synthetic traffic show that the new ring-based NoC designs can provide significantly lower latency (up to 4.6x) compared to a state-of-the-art mesh. The proposed floorplan-friendly topologies use fewer buffers (up to 50% fewer) and lower wire length (up to 64.3% lower) compared to the mesh. Depending on the performance and the area desired, a NoC designer can select among the topologies presented.

PDF icon Proactive Aging Management in Heterogeneous NoCs through a Criticality-driven Routing Approach [p. 1032]
Dean Michael Ancajas, Koushik Chakraborty and Sanghamitra Roy

The emergence of power efficient heterogeneous NoCs presents an intriguing challenge in NoC reliability, particularly due to aging degradation. To effectively tackle this challenge, this work presents a dynamic routing algorithm that exploits the architecture level criticality of network packets while routing. Our proposed framework uses a Wearout Monitoring System (to track NBTI effect) and architecture-level criticality information to create a routing policy that restricts aging degradation with minimal impact on system level performance. Compared to the state-of-the-art BRAR (Buffered-Router Aware Routing), our best scheme achieves 38%, 53% and 29% improvements on network latency, system performance and Energy Delay Product per Flit (EDPPF) overheads, respectively.

PDF icon Sensor-wise Methodology to Face NBTI Stress of NoC Buffers [p. 1038]
Davide Zoni and William Fornaciari

Networks-on-Chip (NoCs) are a key component of the new many-core architectures, from the performance and reliability standpoints. Unfortunately, continuous scaling of CMOS technology poses severe concerns regarding failure mechanisms such as NBTI and stress migration. Process variation makes the scenario harder, decreasing device lifetime and performance predictability during chip fabrication. This paper presents a novel cooperative sensor-wise methodology to reduce NBTI degradation in the network-on-chip (NoC) virtual channel (VC) buffers, considering process variation effects as well. The changes introduced to the reference NoC model exhibit an area overhead below 4%. Experimental validation is obtained using a cycle-accurate simulator considering both real and synthetic traffic patterns. We compare our methodology to the best sensor-less round-robin approach, used as reference model. The proposed sensor-wise strategy achieves up to 26.6% and 18.9% activity factor improvement over the reference policy on synthetic and real traffic patterns, respectively. Moreover, a net NBTI Vth saving of up to 54.2% is shown against the baseline NoC that does not account for NBTI.

PDF icon An Area-efficient Network Interface for a TDM-based Network-on-Chip [p. 1044]
Jens Sparsø, Evangelia Kasapaki and Martin Schoeberl

Network interfaces (NIs) are used in multi-core systems where they connect processors, memories, and other IP-cores to a packet switched Network-on-Chip (NOC). The functionality of a NI is to bridge between the read/write transaction interfaces used by the cores and the packet-streaming interface used by the routers and links in the NOC. The paper addresses the design of a NI for a NOC that uses time division multiplexing (TDM). By keeping the essence of TDM in mind, we have developed a new area-efficient NI micro-architecture. The new design completely eliminates the need for FIFO buffers and credit based flow control - resources which are reported to account for 50-85% of the area in existing NI designs. The paper discusses the design considerations, presents the new NI micro-architecture, and reports area figures for a range of implementations.
Index Terms - Multiprocessor interconnection networks; Realtime systems; Time division multiplexing;

PDF icon CARS: Congestion-Aware Request Scheduler for Network Interfaces in NoC-based Manycore Systems [p. 1048]
Masoud Daneshtalab, Masoumeh Ebrahimi, Juha Plosila and Hannu Tenhunen

Network congestion is a critical issue of memory parallelism in network-based manycore systems, where multiple memories can be accessed simultaneously, and a congestion-aware method is therefore needed to deal with it. In this paper, we present a streamlined method to reduce network congestion. The idea is to use global congestion information as a metric in network interfaces to reduce the congestion level of highly congested areas. Network interfaces connected to memory modules are equipped with an adaptive scheduler that uses the global congestion information to reduce additional traffic to congested areas. Experimental results with synthetic test cases demonstrate that an on-chip network utilizing the proposed adaptive scheduler achieves up to 23% improvement in average latency.
Keywords: Network-on-Chip, Congestion-Aware Scheduler, Network Interfaces
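
To make the scheduling idea concrete, the toy sketch below defers memory replies headed toward more congested regions; the region names, congestion metric and tie-breaking rule are all invented for illustration and do not reflect the paper's hardware.

    import heapq

    # Toy congestion-aware request scheduler at a memory-side network
    # interface: each pending reply is tagged with the congestion level
    # reported for its destination region; less-congested destinations first.

    congestion = {"NW": 0.7, "NE": 0.2, "SW": 0.1, "SE": 0.4}   # global info

    def schedule(pending):
        """pending: list of (request_id, dest_region). Returns service order."""
        heap = [(congestion[region], seq, rid)      # seq breaks ties FIFO
                for seq, (rid, region) in enumerate(pending)]
        heapq.heapify(heap)
        return [rid for _, _, rid in
                (heapq.heappop(heap) for _ in range(len(heap)))]

    pending = [("r0", "NW"), ("r1", "SW"), ("r2", "NE"), ("r3", "NW"), ("r4", "SE")]
    print(schedule(pending))    # replies toward congested NW are deferred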


8.5: Industrial Experiences with Embedded System Design

Moderators: Roberto Zafalon - ST Microelectronics, IT; Ralf Pferdmenges - Infineon Technologies, DE
PDF icon Designing Tightly-coupled Extension Units for the STxP70 Processor [p. 1052]
Yves Janin, Valérie Bertin, Hervé Chauvet, Thomas Deruyter, Christophe Eichwald, Olivier-André Giraud, Vincent Lorquet and Thomas Thery

Designed by STMicroelectronics for the embedded market, the STxP70 processor is a small but extensible processor: designers can define tightly-coupled extensions that can be reused across designs. We explain why this modularity has a strong impact on the toolchain, detail the hardware/software flows and give results for two extensions.

PDF icon A Fast and Accurate Methodology for Power Estimation and Reduction of Programmable Architectures [p. 1054]
Erwan Piriou, Raphaël David, Fahim Rahim and Solaiman Rahim

We present a power optimization methodology that provides a fast and accurate power model for programmable architectures. The approach is based on a new tool that estimates power consumption from a register transfer level (RTL) module description, activity files and a technology library. It efficiently provides an instruction-level accurate power model and allows design space exploration for the register file. We demonstrate a 19% improvement for a standard RISC processor.

PDF icon A Gate Level Methodology for Efficient Statistical Leakage Estimation in Complex 32nm Circuits [p. 1056]
Smriti Joshi, Anne Lombardot, Marc Belleville, Edith Beigne and Stephane Girard

A fast and accurate statistical method that estimates, at gate level, the leakage power consumption of CMOS digital circuits is demonstrated. Means, variances and correlations of logic gate leakages are extracted at the library characterization step and used for subsequent statistical computation on the circuit. In this paper, the methodology is applied to an eleven-thousand-cell ST test IP. The circuit leakage analysis computation time is 400 times faster than a single fast-SPICE corner analysis, while providing coherent results.
Index Terms- Static Power, 32nm, leakage variability, correlation coefficients, covariance method, statistical leakage estimation

PDF icon A Near-Future Prediction Method for Low Power Consumption on a Many-Core Processor [p. 1058]
Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa, Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui, Jun Tanabe, Takashi Miyamori and Nobu Matsumoto

We developed a method that predicts the number of cores required for executing threads in the near future on a many-core processor. It is designed for low power consumption without performance degradation. The evaluation results confirm that the proposed method is effective on a 32-core processor.

PDF icon Time- and Angle-triggered Real-time Kernel [p. 1060]
Damien Chabrol, Didier Roux, Vincent David, Mathieu Jan, Moha Ait Hmid, Patrice Oudin and Gilles Zeppa

Powertrain controllers are automotive applications that impose real-time constraints on software treatments based on the angular position of the teeth in the motor. These constraints depend on the engine speed and can be as short as 100 μs. A time-triggered approach provides a predictable and reproducible execution of real-time systems but cannot cope with such tight constraints and does not allow directly specifying the temporal behavior of the system based on angles. The contribution of this work is to present how the ability of the PharOS technology to combine several time domains (time- and angle-triggered) allows designing and executing powertrain controllers in a deterministic way on multi-core architectures. To this end, we present a prototype of a subset of a powertrain controller from Delphi based on PharOS.
Index Terms - time-triggered, angle-triggered, automotive powertrain controller.

PDF icon An Extremely Compact JPEG Encoder for Adaptive Embedded Systems [p. 1063]
Josef Schneider and Sri Parameswaran

JPEG Encoding is a commonly performed application that is also very process and memory intensive, and not suited for low-power embedded systems with narrow data buses and small amounts of memory. An embedded system may also need to adapt its application in order to meet varying system constraints such as power, energy, time or bandwidth. We present here an extremely compact JPEG encoder that uses very few system resources, and which is capable of dynamically changing its Quality of Service (QoS) on the fly. The application was tested on a NIOS II core, AVR, and PIC24 microcontrollers with excellent results.


8.6: DfT Methods

Moderators: Peter Harrod - ARM, UK; Luigi Dillilo - LIRMM, FR
PDF icon Non-Invasive Pre-Bond TSV Test Using Ring Oscillators and Multiple Voltage Levels [p. 1065]
Sergej Deutsch and Krishnendu Chakrabarty

Defects in TSVs due to fabrication steps decrease the yield and reliability of 3D stacked ICs, hence these defects need to be screened early in the manufacturing flow. Before wafer thinning, TSVs are buried in silicon and cannot be mechanically contacted, which severely limits test access. Although TSVs become exposed after wafer thinning, probing on them is difficult because of TSV dimensions and the risk of probe-induced damage. To circumvent these problems, we propose a non-invasive method for pre-bond TSV test that does not require TSV probing. We use open TSVs as capacitive loads of their driving gates and measure the propagation delay by means of ring oscillators. Defects in TSVs cause variations in their RC parameters and therefore lead to variations in the propagation delay. By measuring these variations, we can detect resistive open and leakage faults. We exploit different voltage levels to increase the sensitivity of the test and its robustness against random process variations. Results on fault detection effectiveness are presented through HSPICE simulations using realistic models for 45nm CMOS technology. The estimated DfT area cost of our method is negligible for realistic dies.
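
To convey the measurement principle, the toy model below treats the TSV as the capacitive load inside a ring oscillator and flags a fault when the oscillation period drifts outside a guard band at either of two supply settings; all constants and the delay model are invented and far cruder than the paper's 45nm HSPICE setup.

    # Toy model of the pre-bond TSV test idea: the open TSV is the capacitive
    # load of a gate inside a ring oscillator, so defects that change its RC
    # shift the oscillation period. Constants are illustrative only.

    R_DRV, C_TSV = 1.0e3, 50e-15     # driver resistance (ohm), healthy TSV cap (F)
    STAGES, T_STAGE = 5, 20e-12      # ring stages and per-stage delay (s)

    def ro_period(c_load, vdd_scale=1.0):
        # Crude delay model: stage delays plus a 0.69*RC term at the TSV net;
        # lowering VDD slows the ring, magnifying small RC shifts in practice.
        return (2 * STAGES * T_STAGE + 2 * 0.69 * R_DRV * c_load) / vdd_scale

    def test_tsv(measured_cap):
        for vdd_scale in (1.0, 0.8):             # nominal and lowered supply
            nominal = ro_period(C_TSV, vdd_scale)
            measured = ro_period(measured_cap, vdd_scale)
            if abs(measured - nominal) / nominal > 0.05:   # 5% guard band
                return "fault detected"
        return "pass"

    print(test_tsv(50e-15))   # healthy TSV
    print(test_tsv(20e-15))   # resistive open decouples part of the TSV cap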

PDF icon LFSR Seed Computation and Reduction Using SMT-Based Fault-Chaining [p. 1071]
Dhrumeel Bakshi and Michael S. Hsiao

We propose a new method to derive a small number of LFSR seeds for Logic BIST to cover all detectable faults as a first-order satisfiability problem involving extended theories. We use an SMT (Satisfiability Modulo Theories) formulation to efficiently combine the tasks of test generation and seed computation. We make use of this formulation in an iterative seed-reduction flow which enables the "chaining" of hard-to-test faults using very few seeds. Experimental results demonstrate that up to 79% reduction in the number of seeds can be achieved.
Index Terms - LFSR Reseeding, Logic BIST, Test generation, Satisfiability Modulo Theories.
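
To illustrate the seed-reduction objective (though not the SMT-based chaining itself), the sketch below expands candidate seeds through a software LFSR and greedily keeps a small set of seeds whose pattern windows cover all faults; the polynomial, fault table and window length are invented.

    # Illustrative sketch (not the paper's SMT flow): expand candidate LFSR
    # seeds into patterns, then greedily keep few seeds covering all faults.

    TAPS, NBITS = (16, 14, 13, 11), 16   # x^16+x^14+x^13+x^11+1, a common LFSR

    def lfsr_patterns(seed, n):
        state, out = seed, []
        for _ in range(n):
            out.append(state)
            fb = 0
            for t in TAPS:
                fb ^= (state >> (t - 1)) & 1
            state = ((state << 1) | fb) & ((1 << NBITS) - 1)
        return out

    def covered(seed, fault_table, n=8):
        """Faults detected by any of the first n patterns from this seed.
        fault_table maps pattern -> detected faults (from fault simulation)."""
        return set().union(*(fault_table.get(p, set())
                             for p in lfsr_patterns(seed, n)))

    def greedy_seeds(candidates, fault_table, all_faults):
        seeds, remaining = [], set(all_faults)
        while remaining:
            best = max(candidates,
                       key=lambda s: len(covered(s, fault_table) & remaining))
            gain = covered(best, fault_table) & remaining
            if not gain:
                break                    # leftover faults need dedicated seeds
            seeds.append(best)
            remaining -= gain
        return seeds, remaining

    # Invented demo: which faults each pattern detects.
    p = lfsr_patterns(0xACE1, 8)
    table = {p[0]: {"f1"}, p[3]: {"f2"}, p[5]: {"f2", "f3"}}
    print(greedy_seeds([0xACE1, 0x1234], table, {"f1", "f2", "f3"}))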

PDF icon Scan Design with Shadow Flip-flops for Low Performance Overhead and Concurrent Delay Fault Detection [p. 1077]
Sébastien Sarrazin, Samuel Evain, Lirida Alves de Barros Naviner, Yannick Bonhomme and Valentin Gherman

This paper presents new scan solutions with low latency overhead and on-line monitoring support. Shadow flip-flops with scan design are associated with system flip-flops in order to (a) provide concurrent delay fault detection and (b) avoid the scan chain insertion of system flip-flops. A mixed scan architecture is proposed which involves flip-flops with shadow scan design at the end of timing-critical paths and flip-flops with standard scan at non-critical locations. In order to preserve system controllability during test, system flip-flops with shadow scan can be set in scan mode and selectively reset before switching to capture mode. It is shown that shadow scan design with asynchronous set and reset may have a lower latency overhead than standard scan design. A shadow scan solution is proposed which, in addition to concurrent delay fault detection, provides simultaneous scan and capture capability.
Keywords - shadow scan; dynamic variations; delay faults; monitoring; concurrent fault detection

PDF icon On Candidate Fault Sets for Fault Diagnosis and Dominance Graphs of Equivalence Classes [p. 1083]
Irith Pomeranz

The goal of fault diagnosis is to identify a set of candidate faults, or fault locations, that explain an observed faulty output response of a chip. In fault diagnosis procedures that are based on specific fault models, a scoring algorithm can be used for defining sets of candidate faults that include the faults with the highest scores. This paper shows that it is possible to capture the underlying concepts that make fault scoring effective through a graph, which is referred to as the dominance graph. With a test set T used for fault diagnosis, the graph represents the dominance relations between the equivalence classes obtained with respect to T. The observed response Robs of a chip-under-diagnosis is associated with an equivalence class Cobs, and Cobs is added to the dominance graph. A candidate fault set is defined based on the dominance relations that are added to the graph due to the addition of Cobs. Certain properties of these dominance relations point to the type of defect present in the chip, and to the most appropriate algorithm for defining a set of candidate faults based on it.

PDF icon A Fast and Efficient DFT for Test and Diagnosis of Power Switches in SoCs [p. 1089]
Xiaoyu Huang, Jimson Mathew, Rishad A. Shafik, Subhasis Bhattacharjee and Dhiraj K Pradhan

Power switches are increasingly becoming the dominant leakage power reduction technique for sub-100nm CMOS technologies. Hence, a fast and effective DFT solution for the test and diagnosis of power switches is much needed to facilitate faster identification of potential faults and their locations. In this paper, we present a novel, coarse-grain DFT solution enabling divide-and-conquer test and diagnosis of power switches. The proposed solution benefits from exponential time savings compared to previously reported solutions: our DFT solution requires only (2⌈log2 m⌉ + 3) clock cycles in the worst case for the test and diagnosis of m-segment power switches. These time savings are further substantiated by an effective discharge circuit design, which eliminates the possibility of a false test and hence significantly reduces the charge and discharge times. We validated the effectiveness of our proposed solution through SPICE simulations on a number of ISCAS benchmark circuits, synthesized using 90nm gate libraries.
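
The logarithmic cycle bound points to a binary search over the m segments; the sketch below shows that divide-and-conquer skeleton with the circuit-level probe abstracted into a predicate (it illustrates the shape of the bound, not the paper's DFT procedure).

    import math

    def diagnose(m, faulty_segment):
        """Toy divide-and-conquer diagnosis over m power-switch segments.
        The comparison abstracts one probe: does the left half hold the fault?"""
        lo, hi, steps = 0, m - 1, 0
        while lo < hi:
            mid = (lo + hi) // 2
            steps += 1
            if faulty_segment <= mid:    # probe: enable only segments lo..mid
                hi = mid
            else:
                lo = mid + 1
        return lo, steps

    m = 16
    loc, steps = diagnose(m, faulty_segment=11)
    print(loc, steps, "<= ceil(log2 m) =", math.ceil(math.log2(m)))
    # The paper's bound of 2*ceil(log2 m) + 3 cycles is consistent with each
    # probe costing a couple of clock cycles plus a fixed setup overhead.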


8.7: Monitoring and Control of Cyber Physical Systems

Moderators: Rolf Ernst - Technische Universität Braunschweig, DE; Haibo Zeng - McGill University, CA
PDF icon Control-Quality Driven Design of Cyber-Physical Systems with Robustness Guarantees [p. 1093]
Amir Aminifar, Petru Eles, Zebo Peng and Anton Cervin

Many cyber-physical systems comprise several control applications sharing communication and computation resources. The design of such systems requires special attention due to the complex timing behavior that can lead to poor control quality or even instability. The two main requirements of control applications are: (1) robustness and, in particular, stability and (2) high control quality. Although it is essential to guarantee stability and provide a certain degree of robustness even in the worst-case scenario, a design procedure which merely takes the worst-case scenario into consideration can lead to a poor expected (average-case) control quality, since the design is solely tuned to a scenario that occurs very rarely. On the other hand, considering only the expected quality of control does not necessarily provide robustness and stability in the worst-case. Therefore, both the robustness and the expected control quality should be taken into account in the design process. This paper presents an efficient and integrated approach for designing high-quality cyber-physical systems with robustness guarantees.

PDF icon Compositional Analysis of Switched Ethernet Topologies [p. 1099]
Reinhard Schneider, Licong Zhang, Dip Goswami, Alejandro Masrur and Samarjit Chakraborty

In this paper we study distributed automotive control applications whose tasks are mapped onto different ECUs communicating via a switched Ethernet network. As traditional automotive communication buses like CAN, FlexRay, LIN and MOST are gradually reaching their performance limits because of the increasing complexity of automotive architectures and applications, Ethernet-based in-vehicle communication systems have attracted a lot of attention in recent times. However, currently there is very little work on systematic timing analysis for Ethernet which is important for its deployment in safety-critical scenarios like in an automotive architecture. In this work, we propose a compositional timing analysis technique that takes various features of switched Ethernet into account like network topology, frame priorities, communication delay, memory requirement on switches, performance, etc. Such an analysis technique is particularly suitable during early design phases of automotive architectures and control software deployment. We demonstrate its use in analyzing mixed-criticality traffic patterns consisting of messages from performance-oriented control loops and timing-sensitive real-time tasks. We further evaluate the tightness of the obtained analytical bounds with an OMNeT++ based network simulation environment, which involves long simulation time and does not provide formal guarantees.

PDF icon Supervisor Synthesis for Controller Upgrades [p. 1105]
Johannes Kloos and Rupak Majumdar

During the life cycle of a cyber-physical system, it is sometimes necessary to upgrade a working controller with a new, but unverified, one which provides better performance or additional functionality. To make sure that system invariants are not broken because of bugs in the new controller, an architecture is used in which both controllers are implemented on the platform, and a supervisor process checks that the actions of the new controller keep the system within its safe states. If an invariant may be violated, the supervisor switches over to the old controller, which ensures correct behavior but possibly degraded performance. A key question in the design of such supervisors is the switching strategy: when should the supervisor reinstate the new controller after it has switched to the old one? In general, one would prefer to use the new controller as much as possible, provided it does not violate safety. However, arbitrarily switching back to the new controller can cause the system to become unstable, even when each controller in isolation ensures stability. We provide a supervisor synthesis procedure that uses a simple counting strategy for the supervisor. The synthesized supervisor guarantees that switching between the controllers preserves stability of the system, while maintaining its safety properties and providing a lower bound on the use of the new controller. We prove the correctness of the strategy and show on an example that it can provide close-to-optimal use of the new controller under many disturbance scenarios.
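
A toy sketch of such a counting strategy follows: after a safety-triggered switch to the old controller, the supervisor counts out a fixed dwell of steps before retrying the new one, which bounds the switching frequency. The threshold and the safety predicate are illustrative placeholders, not the values a synthesis procedure like the paper's would compute.

    # Toy counting supervisor (illustrative; thresholds are made up).

    class Supervisor:
        def __init__(self, dwell=10):
            self.use_new, self.count, self.dwell = True, 0, dwell

        def step(self, new_action_safe):
            if self.use_new:
                if not new_action_safe:      # invariant would be violated
                    self.use_new, self.count = False, 0
            else:
                self.count += 1              # steps spent on the old controller
                if self.count >= self.dwell: # dwelled long enough: retry the
                    self.use_new = True      # new controller; bounds switching
            return "new" if self.use_new else "old"

    sup = Supervisor(dwell=3)
    trace = [True, True, False, True, True, True, True]   # safety of new actions
    print([sup.step(ok) for ok in trace])
    # -> ['new', 'new', 'old', 'old', 'old', 'new', 'new']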

PDF icon Event Density Analysis for Event Triggered Control Systems [p. 1111]
Tobias Bund, Benjamin Menhorn and Frank Slomka

In event-triggered control systems, events occur aperiodically. For the real-time analysis of such systems, an appropriate approximation of the events' stimulation is necessary. Upper bounds have already been found for event-triggered systems; until now, however, lower bounds have been assumed to be zero in the real-time analysis of such systems. This work derives an approximated lower bound representing the maximum inter-sampling time. The bounds depend on the control system and the event-generating mechanism. The beneficial effect is shown by analyzing an event-triggered control system in a real-time analysis framework.

PDF icon Model Predictive Control over Delay-Based Differentiated Services Control Networks [p. 1117]
Riccardo Muradore, Davide Quaglia and Paolo Fiorini

Networked control systems are a well-known sub-set of cyber-physical systems in which the plant is controlled by sending commands through a digital packet-based network. Current control networks provide advanced channel access mechanisms to guarantee low delay on a limited fraction of packets (low-delay class) while the other packets (un-protected class) experience a higher delay which increases with channel utilization. We investigate the extension of model predictive control to choose both the command value and its assignment to one of the two classes according to the predicted state of the plant and the knowledge of network condition. Experimental results show that more commands are assigned to the low-delay class when either the tracking error is high or the network condition is bad.

PDF icon Multirate Controller Design for Resource- and Schedule-Constrained Automotive ECUs [p. 1123]
Dip Goswami, Alejandro Masrur, Reinhard Schneider, Chun Jason Xue and Samarjit Chakraborty

Automotive software mostly consists of a set of applications controlling the vehicle dynamics, engine and many other processes or plants. Since automotive systems design is highly cost driven, an important goal is to maximize the number of control applications to be packed onto a single processor or electronic control unit (ECU). Current design methods start with a controller design step, where the sampling period and controller gain values are decided based on given control performance objectives. However, operating systems (OS) on the ECU (e.g., ERCOSek) are usually pre-configured and offer only a limited set of sampling periods. Hence, a controller is implemented using an available sampling period, which is the shorter period closest to the one determined in the controller design step. However, this increases the load on the ECU (i.e., the processor runs the controller more often than what is actually required by design). This reduces the number of applications that can be mapped, and increases costs of the system. To overcome this predicament, we propose a multirate controller, which switches between multiple available sampling periods offered by the OS on the ECU. Apart from meeting all control objectives, this avoids the unnecessary ECU overload resulting from always sampling at a constant, higher rate.

PDF icon Design of an Ultra-low Power Device for Aircraft Structural Health Monitoring [p. 1127]
Alessandro Perelli, Carlo Caione, Luca De Marchi, Davide Brunelli, Alessandro Marzani and Luca Benini

One of the popular structural health monitoring (SHM) applications in both the automotive and aeronautic fields is devoted to the non-destructive localization of impacts in plate-like structures. The aim of this paper is to develop a miniaturized, self-contained and low-power device for automated impact detection that can be used in a distributed fashion without central coordination. The proposed device connects an array of four piezoelectric transducers, bonded to the plate and capable of detecting the guided waves generated by an impact, to an STM32F4 board equipped with an ARM Cortex-M4 microcontroller and an IEEE 802.15.4 wireless transceiver. The wave processing and the localization algorithm are implemented on-board and optimized for speed and power consumption. In particular, the localization of the impact point is obtained by cross-correlating the signals related to the same event acquired by the different sensors in the warped frequency domain. Finally, the performance of the whole system is analysed in terms of localization accuracy and power consumption, showing the effectiveness of the proposed implementation.


8.8: HOT TOPIC: Countering Counterfeit Attacks on Micro-Electronics

Organizers: Erik Jan Marinissen - IMEC, BE; Ingrid Verbauwhede - KU Leuven, BE
Moderators: Steven Jeter - Infineon Technologies, DE; Ingrid Verbauwhede - KU Leuven, BE
PDF icon Qualification and Testing Process to Implement Anti-Counterfeiting Technologies into IC Packages [p. 1131]
Nathalie Kae-Nune and Stephanie Pesseguier

Counterfeiting is no longer limited to fashion or luxury goods; the phenomenon has now reached electronic components, whose failure represents a high risk to the safety and security of human communities. One way for the semiconductor (SC) industry to fight counterfeiting of electronic parts is to add technological innovation at the component level itself. The target is to enable product authentication in a fast and reliable way. Because semiconductor manufacturing is a complex and delicate operation producing highly complex products which are sensitive to many environmental factors, any introduction of changes into its production - with which the implementation of anti-counterfeiting (A/C) technologies must also comply - must undergo thorough testing and qualification steps. This is mandatory to verify compliance with the strict delivery requirements and the quality and reliability levels the industry has established, in line with the product performance specifications. This paper aims to explain the comprehensive requirements specification developed by members of the semiconductor and related industries in Europe for adding authentication technology solutions into IC packages. It also describes the qualification processes and testing plans to implement the most adequate and effective A/C technology. One of the main challenges in this A/C task is to make sure that the added A/C feature in electronic components does not create any additional reliability or failure issue, nor introduce additional risks that will benefit counterfeiters.
Keywords - Anti-counterfeiting technologies, authentication, remarking, re-packaging, component counterfeiting, failure analysis, failure prevention, reliability testing

PDF icon Anti-Counterfeiting with Hardware Intrinsic Security [p. 1137]
Vincent van der Leest and Pim Tuyls

Counterfeiting of goods and electronic devices is a growing problem that has a huge economic impact on the electronics industry. Sometimes the consequences are even more dramatic, when critical systems start failing due to the use of counterfeit lower-quality components. Hardware Intrinsic Security (i.e., security systems built on the unique electronic fingerprint of a device) offers the potential to reduce the counterfeiting problem drastically. In this paper we will show how Hardware Intrinsic Security (HIS) can be used to prevent various forms of counterfeiting and over-production. HIS technology can also be used to bind software or user data to specific hardware devices, which provides additional security to both software and hardware vendors as well as consumers using HIS-enabled products. Besides showing the benefits of HIS, we will also provide an extensive overview of the results (both scientific and industrial) that Intrinsic-ID has achieved studying and implementing HIS.


9.1: HOT TOPIC: Smart Grid and Buildings

Organizer: Luca Benini - Università di Bologna, IT
Moderators: Andrea Acquaviva - Politecnico di Torino, IT; Luca Benini - Università di Bologna, IT
PDF icon Sustainable Energy Policies: Research Challenges and Opportunities [p. 1143]
Michela Milano

Designing sustainable energy policies heavily impacts economic development, environmental resource management and social acceptance. There are four main steps in the policy making process: planning, environmental assessment, implementation and monitoring. We focus here on the first three steps, which are performed ex-ante, and describe them as tailored to the energy policy process. We also propose enabling technologies for implementing a decision support system for energy policy making.

PDF icon Self-aware Cyber-physical Systems and Applications in Smart Buildings and Cities [p. 1149]
Levent Gurgen, Ozan Gunalp, Yazid Benazzouz and Mathieu Galissot

The world is facing several challenges that must be dealt with in the coming years, such as efficient energy management, the need for economic growth, and the security and quality of life of its inhabitants. The increasing concentration of the world population in urban areas puts cities at the center of these preoccupations and makes them important actors in the world's sustainable development strategy. ICT has a substantial potential to help cities respond to the growing demands for efficiency, sustainability, and increased quality of life, and thus to make them "smarter". Smartness is directly proportional to awareness. Cyber-physical systems can extract awareness information from the physical world and process this information in the cyber-world. Thus, a holistic, integrated approach, from the physical to the cyber-world, is necessary for a successful and sustainable smart city outcome. This paper introduces research challenges that we believe will be important in the coming years and provides guidelines and recommendations to achieve self-aware smart city objectives.
Keywords - Cyber-physical systems, Autonomic computing, Self-aware systems, Smart city

PDF icon Perpetual and Low-cost Power Meter for Monitoring Residential and Industrial Appliances [p. 1155]
Danilo Porcarelli, Domenico Balsamo, Davide Brunelli and Giacomo Paci

The recent research efforts in smart grids and residential power management are oriented toward pervasively monitoring the power consumption of appliances in domestic and non-domestic buildings. Knowing the status of a residential grid is fundamental to keeping reliability levels high, while real-time monitoring of electric appliances is important to minimize power waste in buildings and to lower the overall energy cost. Wireless Sensor Networks (WSNs) are a key enabling technology for this application field because they consist of low-power, non-invasive and cost-effective intelligent sensor devices. We present a wireless current sensor node (WCSN) for measuring the current drawn by single appliances. The node can self-sustain its operations by harvesting energy from the monitored current. Two AAA batteries are used only as a secondary power supply to guarantee a fast start-up of the system. An active ORing subsystem automatically selects the suitable power source, minimizing the power losses typical of the classic diode configuration. The node harvests energy when the power consumed by the device under measurement is in the range of 10W to 10kW, which corresponds to a current range of 50mA to 50A drawn directly from the mains. Finally, the node features a low-power, 32-bit microcontroller for data processing and a wireless transceiver to send data via the IEEE 802.15.4 standard protocol.
Index Terms - Wireless sensor networks, smart metering, energy harvesting, active ORing, energy measuring.


9.2: System-Level Analysis and Simulation

Moderators: Wolfgang Müller - University of Paderborn, DE; Christian Haubelt - University of Rostock, DE
PDF icon Analytical Timing Estimation for Temporally Decoupled TLMs Considering Resource Conflicts [p. 1161]
Kun Lu, Daniel Müller-Gritschneder and Ulf Schlichtmann

Transaction level models (TLMs) can use temporal decoupling to increase the simulation speed. However, there is a lack of modeling support for timing temporally decoupled TLMs. In this paper, we propose a timing estimation mechanism for TLMs with temporal decoupling. This mechanism features an analytical model and novel delay formulas. Concepts such as resource usage and availability are used to derive the delay formulas. Based on them, a fast scheduling algorithm resolves resource conflicts and dynamically determines the timing of concurrent transaction sequences. Experiments show that the delay estimation formulas are capable of capturing the timing effects of resource conflicts. At the same time, the overhead of the scheduling algorithm is very low, hence the simulation speed remains high.

PDF icon Towards Performance Analysis of SDFGs Mapped to Shared-Bus Architectures Using Model-Checking [p. 1167]
Maher Fakih, Kim Grüttner, Martin Fränzle and Achim Rettberg

The timing predictability of embedded systems with hard real-time requirements is fundamental for guaranteeing their safe usage. With the emergence of multicore platforms this task became very challenging. In this paper, a model-checking based approach will be described which allows us to guarantee timing bounds of multiple Synchronous Data Flow Graphs (SDFG) running on shared-bus multicore architectures. Our approach utilizes Timed Automata (TA) as a common semantic model to represent software components (SDF actors) and hardware components of the multicore platform. These TA are explored using the UPPAAL model-checker for providing the timing guarantees. Our approach shows a significant precision improvement compared with the worst-case bounds estimated based on maximal delay for every bus access. Furthermore, scalability is examined to demonstrate analysis feasibility for small parallel systems.

PDF icon Toward Polychronous Analysis and Validation for Timed Software Architectures in AADL [p. 1173]
Yue Ma, Huafeng Yu, Thierry Gautier, Paul Le Guernic, Jean-Pierre Talpin, Loïc Besnard and Maurice Heitz

High-level architecture modeling languages, such as the Architecture Analysis & Design Language (AADL), are gradually being adopted in the design of embedded systems so that design choice verification, architecture exploration, and system property checking can be carried out as early as possible. This paper presents our recent contributions to clock-based timing analysis and validation of software architectures specified in AADL. In order to avoid semantic ambiguities of AADL, we mainly consider the AADL features related to real-time and logical-time properties. We endow them with a semantics in the polychronous model of computation; this semantics is quickly reviewed. The semantics enables timing analysis, formal verification and simulation. In addition, thread-level scheduling based on affine clock relations is also briefly presented here. A tutorial avionic case study, provided by C-S, has been adopted to illustrate our overall contribution.
Keywords - AADL; MDE; Polychrony; timing analysis

PDF icon Tuning Dynamic Data Flow Analysis to Support Design Understanding [p. 1179]
Jan Malburg, Alexander Finder and Görschwin Fey

Modern chip designs are getting more and more complex. To fulfill tight time-to-market constraints, third-party blocks and parts from previous designs are reused. However, these are often poorly documented, making it hard for a designer to understand the code. Therefore, automatic approaches are required which extract information about the design and support developers in understanding the design. In this paper we introduce a new dynamic data flow analysis tuned to automate design understanding. We present the use of the approach for feature localization and for understanding the design's data flow. In the evaluation, our analysis improves feature localization by reducing the uncertainty by 41% to 98% compared to a previous approach using coverage metrics.

PDF icon Fast and Accurate TLM Simulations Using Temporal Decoupling for FIFO-based Communications [p. 1185]
Claude Helmstetter, Jérôme Cornet, Bruno Galilée, Matthieu Moy and Pascal Vivet

A known approach to improve the timing accuracy of an untimed or loosely timed TLM model is to add timing annotations into the code and to reduce the number of costly context switches using temporal decoupling, meaning that a process can go ahead of the simulation time before synchronizing again. Our current goal is to apply temporal decoupling to the TLM platform of a heterogeneous many-core SoC dedicated to high performance computing. Part of this SoC communicates using classic memory-mapped buses, but it can be extended with hardware accelerators communicating using FIFOs. Whereas temporal decoupling for memory-based transactions has been widely studied, FIFO-based communications raise issues that have not been addressed before. In this paper, we provide an efficient solution to combine temporal decoupling and FIFO-based communications.
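
To illustrate the combination sketched above, the toy model below lets a producer annotate FIFO tokens with its local (decoupled) time, so a consumer only has to synchronize when it would otherwise read a token "from the future". The class and its API are invented for illustration and do not reflect the paper's implementation.

    import collections

    # Minimal sketch of temporal decoupling over a FIFO channel: each process
    # keeps a local time ahead of the global time and only synchronizes when
    # a FIFO access cannot be resolved locally.

    class TimedFifo:
        def __init__(self):
            self.items = collections.deque()   # (timestamp, payload)

        def put(self, t_local, payload):
            self.items.append((t_local, payload))

        def get(self, t_local):
            """Return the next item visible at local time t_local, or None if
            the consumer ran ahead of the producer and must synchronize."""
            if self.items and self.items[0][0] <= t_local:
                return self.items.popleft()
            return None

    fifo = TimedFifo()
    producer_time = 0
    for i in range(3):                 # producer runs ahead, annotating durations
        producer_time += 5
        fifo.put(producer_time, f"token{i}")

    consumer_time = 7
    while (item := fifo.get(consumer_time)) is not None:
        print("consumed", item, "at local time", consumer_time)
        consumer_time += 1
    # The consumer stops at token1 (timestamp 10 > 7): that is the point where
    # a real simulator would synchronize with the global simulation time.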

PDF icon Determining Relevant Model Elements for the Verification of UML/OCL Specifications [p. 1189]
Julia Seiter, Robert Wille, Mathias Soeken and Rolf Drechsler

Modeling languages such as UML or SysML have received significant attention over the last years. They allow for an abstract description of systems even in the absence of a precise implementation or a hardware/software partitioning. Additionally considering textual constraints, for example provided by means of OCL, makes it possible to automatically check the specified systems, e.g., for consistency of the structure or reachability of certain system states. However, for the majority of verification tasks, not the entire model has to be considered. In this work, we propose an approach that automatically determines reduced system models, i.e., system descriptions that only include model elements which are relevant for the considered verification task. Considering reduced models eases access for the designer and supports incremental design and verification schemes. Most importantly, it improves the efficiency of the applied formal verification engine. Experiments demonstrate that already small reductions in the model lead to significant accelerations in the run-time of the verification engine.

PDF icon Towards a Generic Verification Methodology for System Models [p. 1193]
Robert Wille, Martin Gogolla, Mathias Soeken, Mirco Kuhlmann and Rolf Drechsler

The use of modeling languages such as UML or SysML enables to formally specify and verify the behavior of digital systems already in the absence of a specific implementation. However, for each modeling method and verification task usually a separate verification solution has to be applied today. In this paper, a methodology is envisioned that aims at stopping this "inflation" of different verification approaches and instead employs a generic methodology. For this purpose, a given specification as well as the verification shall be transformed into a basic model which itself is specified by means of a generic modeling language. Then, a range of automatic reasoning engines shall uniformly be applied to perform the actual verification. A feasibility study demonstrates the applicability of the envisioned approach.


9.3: Thermal/Power Management Techniques for Energy-Efficient Systems

Moderators: Wolfgang Nebel - University of Oldenburg, DE; Alberto Macii - Politecnico di Torino, IT
PDF icon A Sub-μA Power Management Circuit in 0.18μm CMOS for Energy Harvesters [p. 1197]
Biswajit Mishra, Cyril Botteron, Gabriele Tasselli, Christian Robert and Pierre-André Farine

We explore a miniature sensor node that could be placed in an environment and would interrogate, take decisions and transmit autonomously and seamlessly without the need for a battery. With the system completely powered by an energy harvester for autonomous operation, power management becomes crucial. In this paper, we propose an ultra-low-power management circuit implemented in 0.18μm CMOS technology. Given the stringent power requirements and the very limited power offered by energy harvesters, the proposed circuit provides a nanowatt power management scheme. Using post-layout simulation, we have evaluated the power consumption of the proposed power management unit (PMU) and report results that compare favorably to the state of the art.

PDF icon Saliency Aware Display Power Management [p. 1203]
Yang Xiao, Kevin Irick, Vijay Narayanan, Dongwha Shin and Naehyuck Chang

In this paper, a bio-inspired technique for finding the regions of highest visual importance within an image is proposed for reducing power consumption in modern liquid crystal displays (LCDs) that utilize a 2D light-emitting diode (LED) backlighting system. The conspicuity map generated from this neuromorphic saliency model, along with an adaptive dimming method, is applied to the backlighting array to reduce the luminance of regions of least interest as perceived by a human viewer. Corresponding image compensation is applied to the saliency-modulated image to minimize distortion and retain the original image quality. Experimental results show that on average 65% of the power can be saved when the original display system is integrated with a low-overhead real-time hardware implementation of the saliency model.
Keywords - FPGA and ASIC design; LED; LCD; system level power management
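
A back-of-the-envelope sketch of the dimming-plus-compensation idea follows; the zone layout, dimming law and power model are invented for illustration and are unrelated to the paper's neuromorphic saliency hardware.

    import numpy as np

    # Toy saliency-driven backlight dimming with pixel compensation.
    image = np.random.rand(8, 8)               # normalized luminance per pixel
    saliency = np.zeros((2, 2))                # one 2x2 grid of LED zones
    saliency[0, 1] = 1.0                       # a single salient zone

    backlight = 0.3 + 0.7 * saliency           # dim non-salient zones to 30%
    bl_full = np.kron(backlight, np.ones((4, 4)))   # expand zones to pixels

    # Boost pixel values so perceived luminance stays close to the original
    # wherever the (dimmed) backlight still allows it.
    compensated = np.clip(image / np.maximum(bl_full, 1e-6), 0.0, 1.0)
    perceived = compensated * bl_full

    distortion = np.abs(perceived - image).mean()
    power_saving = 1.0 - backlight.mean()      # LED power ~ mean backlight level
    print(f"approx. backlight power saving: {power_saving:.0%}, "
          f"mean luminance error: {distortion:.3f}")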

PDF icon Active-Mode Leakage Reduction with Data-Retained Power Gating [p. 1209]
Andrew B. Kahng, Seokhyeong Kang and Bongil Park

Power gating is one of the most effective solutions available to reduce leakage power. However, power gating is not practically usable in an active mode due to the overheads of inrush current and data retention. In this work, we propose a data-retained power gating (DRPG) technique which enables power gating of flip-flops during active mode. More precisely, we combine clock gating and power gating techniques, with the flip-flops being power-gated during clock masked periods. We introduce a retention switch which retains data during the power gating. With the retention switch, correct logic states and functionalities are guaranteed without additional control circuitry. The proposed technique can achieve significant active-mode leakage reduction over conventional designs with small area and performance overheads. In studies with a 65nm foundry library and open-source benchmarks, DRPG achieves up to 25.7% active-mode leakage savings (11.8% savings on average) over conventional designs.

PDF icon A Power-Driven Thermal Sensor Placement Algorithm for Dynamic Thermal Management [p. 1215]
Hai Wang, Sheldon X.-D. Tan, Sahana Swarup and Xue-Xin Liu

On-chip physical thermal sensors play a vital role for accurately estimating the full-chip thermal profile. How to place physical sensors such that both the number of thermal sensors and the temperature estimation errors are minimized becomes important for on-chip dynamic thermal management of today's high-performance microprocessors. In this paper, we present a new systematic thermal sensor placement algorithm. Different from the traditional thermal sensor placement algorithms where only the temperature information is explored, the new placement method takes advantage of functional unit power information by exploiting the correlation of power estimation errors among functional blocks. The new power-driven placement algorithm applies the correlation clustering algorithm to determine both the locations of sensors and the number of sensors automatically such that the temperature estimation errors can be minimized. Experimental results on a dual-core architecture show that the new thermal sensor placements yield more accurate full-chip temperature estimation compared to the uniform and the k-means based placement approaches.

PDF icon Active Power-Gating-Induced Power/Ground Noise Alleviation Using Parasitic Capacitance of On-Chip Memories [p. 1221]
Xuan Wang, Jiang Xu, Wei Zhang, Xiaowen Wu, Yaoyao Ye, Zhehui Wang, Mahdi Nikdast and Zhe Wang

By integrating multiple processing units and memories on a single chip, multiprocessor system-on-chip (MPSoC) can provide higher performance per energy and lower cost per function to applications with growing complexity. In order to stay within the power budget, the power gating technique is widely used to reduce leakage power. However, it introduces significant power/ground (P/G) noises and threatens the reliability of MPSoCs. Traditional methods rely on reinforced circuits or fixed protection strategies to reduce P/G noises caused by power gating, at significant area, power and performance overheads. In this paper, we propose a systematic approach to actively alleviating P/G noises using the parasitic capacitance of on-chip memories through a sensor network on-chip (SENoC). We utilize the parasitic capacitance of on-chip memories as dynamic decoupling capacitance to suppress P/G noises and develop a detailed HSPICE model for the related study. SENoC is developed not only to monitor and report P/G noises but also to coordinate processing units and memories to alleviate such transient threats at run time. Extensive evaluations show that compared with traditional methods, our approach saves 11.7% to 62.2% of energy consumption and achieves 13.3% to 69.3% performance improvement for different applications and MPSoCs of different scales. We implement the circuit details of our approach and show its low area and energy consumption overheads.

PDF icon Adaptive Thermal Management for Portable System Batteries by Forced Convection Cooling [p. 1225]
Qing Xie, Siyu Yue, Massoud Pedram, Donghwa Shin and Naehyuck Chang

Cycle life of a battery largely varies according to the battery operating conditions, especially the battery temperature. In particular, batteries age much faster at high temperature. Extensive experiments have shown that the battery temperature varies dramatically during continuous charge or discharge process. This paper introduces a forced convection cooling technique for the batteries that power a portable system. Since the cooling fan is also powered by the same battery, it is critical to develop a highly effective, low power-consuming solution. In addition, there is a fundamental tradeoff between the service time of a battery equipped with fans and the cycle life of the same battery. In particular, as the fan speed is increased, the power dissipated by the fan goes up and hence the full charge capacity of the battery is lost at a faster rate, but at the same time, the battery temperature remains lower and hence the battery longevity increases. This is the first work that formulates the adaptive thermal management problem for batteries (ATMB) in portable systems and provides a systematic solution for it. A hierarchical algorithm combining reinforcement learning at the lower level and dynamic programming at the upper level is proposed to derive the ATMB policy.
Keywords - battery system; adaptive thermal management; forced convection cooling;


9.4: Emerging Architectures

Moderators: Yvain Thonnart - CEA-LETI, FR; Michael Niemier - University of Notre Dame, US
PDF icon Sparse-Rotary Oscillator Array (SROA) Design for Power and Skew Reduction [p. 1229]
Ying Teng and Baris Taskin

This paper presents a unique rotary oscillator array (ROA) topology - the sparse-ROA (SROA). The SROA eliminates the need for redundant rings in a typical, mesh-like rotary topology optimizing the global distribution network of the resonant clocking technology. To this end, a design methodology is proposed for SROA construction based on the distribution of the synchronous components. The methodology eliminates the redundant rings of the ROA and reduces the tapping wirelength, which leads to a power saving of 32.1%. Furthermore, a skew control function is implemented into the SROA design methodology as a part of the optimization of the connections among tapping points and subtree roots. This control function leads to a clock skew reduction of 47.1% compared to a square-shaped ROA network design, which is verified through HSPICE.

PDF icon Reversible Logic Synthesis of k-Input, m-Output Lookup Tables [p. 1235]
Alireza Shafaei, Mehdi Saeedi and Massoud Pedram

Improving the circuit realization of known quantum algorithms by CAD techniques has benefits for quantum experimentalists. In this paper, we address the problem of synthesizing a given k-input, m-output lookup table (LUT) as a reversible circuit. This problem has interesting applications in Shor's famous number-factoring algorithm and in quantum walk on sparse graphs. For LUT synthesis, our approach targets the number of control lines in multiple-control Toffoli gates to reduce synthesis cost. To achieve this, we propose a multi-level optimization technique for reversible circuits to benefit from shared cofactors. To reuse output qubits and/or zero-initialized ancillae, we un-compute intermediate cofactors. Our experiments reveal that the proposed LUT synthesis has a significant impact on reducing the size of modular exponentiation circuits for Shor's quantum factoring algorithm, oracle circuits in quantum walk on sparse graphs, and the well-known MCNC benchmarks.
Keywords - Lookup tables; Logic synthesis; Reversible circuits; Shor's quantum number-factoring algorithm; Binary welded tree.

PDF icon 3D-MMC: A Modular 3D Multi-Core Architecture with Efficient Resource Pooling [p. 1241]
Tiansheng Zhang, Alessandro Cevrero, Giulia Beanato, Panagiotis Athanasopoulos, Ayse K. Coskun and Yusuf Leblebici

This paper demonstrates, for the first time, a fully functional hardware and software design for a 3D stacked multi-core system. Our 3D system is a low-power 3D Modular Multi-Core (3D-MMC) architecture built by vertically stacking identical layers. Each layer consists of cores, private and shared memory units, and communication infrastructure. The system uses shared-memory communication and Through-Silicon Vias (TSVs) to transfer data across layers. A serialization scheme is employed for inter-layer communication to minimize the overall number of TSVs. The proposed architecture has been implemented in HDL and verified on a test chip targeting an operating frequency of 400MHz with a vertical bandwidth of 3.2Gbps. The paper first evaluates the performance, power and temperature characteristics of the architecture using a set of software applications we have designed. We demonstrate quantitatively that the proposed modular 3D design improves upon the cost and performance bottlenecks of traditional 2D multi-core design. In addition, a novel resource pooling approach is introduced to efficiently manage the shared memory of the 3D stacked system. Our approach reduces application execution time significantly compared to 2D and 3D systems with conventional memory sharing.

PDF icon Cache Coherence Enabled Adaptive Refresh for Volatile STT-RAM [p. 1247]
Jianhua Li, Liang Shi, Qing'an Li, Chun Jason Xue, Yiran Chen and Yinlong Xu

Spin-Transfer Torque RAM (STT-RAM) has been extensively studied in recent years. Recent work proposed improving the write performance of STT-RAM by relaxing the retention time of the STT-RAM cell's magnetic tunnel junction (MTJ). Unfortunately, the frequent refresh operations of volatile STT-RAM can dissipate significant extra energy. In addition, refresh operations can severely conflict with normal read/write operations and result in degraded cache performance. This paper proposes Cache Coherence Enabled Adaptive Refresh (CCear) to minimize refresh operations for volatile STT-RAM. Through novel modifications to the cache coherence protocol, CCear can effectively minimize the number of refresh operations on volatile STT-RAM. Full-system simulation results show that CCear approaches the performance of the ideal refresh policy with negligible overhead.

PDF icon Is TSV-based 3D Integration Suitable for Inter-die Memory Repair? [p. 1251]
Mihai Lefter, George R. Voicu, Mottaqiallah Taouil, Marius Enachescu, Said Hamdioui and Sorin D. Cotofana

In this paper we address lower-level issues related to 3D inter-die memory repair in an attempt to evaluate the actual potential of this approach for current and foreseeable technology developments. We propose several implementation schemes for both inter-die row and column repair and evaluate their impact in terms of area and delay. Our analysis suggests that current state-of-the-art TSV dimensions allow inter-die column repair schemes at the expense of reasonable area overhead. For row repair, however, most memory configurations require TSV dimensions to scale down by at least one order of magnitude to make this approach a possible candidate for 3D memory repair. We also performed a theoretical analysis of the implications of the proposed 3D repair schemes on memory access time, which indicates that no substantial delay overhead is expected and that many delay versus energy consumption tradeoffs are possible.

PDF icon Thermomechanical Stress-Aware Management for 3D IC Designs [p. 1255]
Qiaosha Zou, Tao Zhang, Eren Kursun and Yuan Xie

Thermomechanical stress is considered one of the most challenging problems in three-dimensional integrated circuits (3D ICs), due to the thermal expansion coefficient mismatch between through-silicon vias (TSVs) and the silicon substrate, and the presence of elevated thermal gradients. To address the stress issue, we propose a thorough solution that combines design-time and run-time techniques for the relief of thermomechanical stress and the associated reliability issues. A sophisticated TSV stress-aware floorplan policy is proposed to minimize the possibility of wafer cracking and interfacial delamination. In addition, the run-time thermal management scheme effectively eliminates large thermal gradients between layers. The experimental results show that the reliability of 3D designs can be significantly improved due to the reduced TSV thermal load and the elimination of mechanically damaging thermal cycling patterns.


9.5: Manufacturing and Design Security

Moderators: Francesco Regazzoni - TU Delft / University of Lugano, CH; Patrick Schaumont - Virginia Tech, US
PDF icon Is Split Manufacturing Secure? [p. 1259]
Jeyavijayan (JV) Rajendran, Ozgur Sinanoglu and Ramesh Karri

Split manufacturing of integrated circuits (IC) is being investigated as a way to simultaneously alleviate the cost of owning a trusted foundry and eliminate the security risks associated with outsourcing IC fabrication. In split manufacturing, a design house (with a low-end, in-house, trusted foundry) fabricates the Front End Of Line (FEOL) layers (transistors and lower metal layers) in advanced technology nodes at an untrusted high-end foundry. The Back End Of Line (BEOL) layers (higher metal layers) are then fabricated at the design house's trusted low-end foundry. Split manufacturing is considered secure (prevents reverse engineering and IC piracy) as it hides the BEOL connections from an attacker in the FEOL foundry. We show that an attacker in the FEOL foundry can exploit the heuristics used in typical floorplanning, placement, and routing tools to bypass the security afforded by straightforward split manufacturing. We developed an attack where an attacker in the FEOL foundry can connect 96% of the missing BEOL connections correctly. To overcome this security vulnerability in split manufacturing, we developed a fault analysis-based defense. This defense improves the security of split manufacturing by deceiving the FEOL attacker into making wrong connections.

PDF icon Trojan Detection via Delay Measurements: A New Approach to Select Paths and Vectors to Maximize Effectiveness and Minimize Cost [p. 1265]
Byeongju Cha and Sandeep K. Gupta

One of the growing issues in IC design is how to establish the trustworthiness of chips fabricated by untrusted vendors. This process, often called Trojan detection, is challenging since the specifics of hardware Trojans inserted by intelligent adversaries are difficult to predict and most Trojans do not affect the logic behavior of the circuit unless they are activated. Also, Trojan detection via parametric measurements becomes increasingly difficult with increasing levels of process variations. In this paper we propose a method that maximizes the resolution of each path delay measurement, in terms of its ability to detect the targeted Trojan. In particular, for each Trojan, our approach accentuates the Trojan's impact by generating a vector that sensitizes the shortest path passing through the Trojan's site. We estimate the minimum number of chips to which each vector must be applied to detect the Trojan with sufficient confidence for a given level of process variations. Finally, we demonstrate the significant improvements in effectiveness and cost provided by our approach under high levels of process variations. Experimental results on several benchmark circuits show that we can achieve a dramatic reduction in test cost using our approach compared to classical path delay testing.
Keywords - Hardware Trojan; security; parametric test.

PDF icon High-Sensitivity Hardware Trojan Detection Using Multimodal Characterization [p. 1271]
Kangqiao Hu, Abdullah Nazma Nowroz, Sherief Reda and Farinaz Koushanfar

The vulnerability of modern integrated circuits (ICs) to hardware Trojans has been increasing considerably due to the globalization of semiconductor design and fabrication processes. The large number of parts and the decreased controllability and observability of complex ICs' internals make it difficult to efficiently perform Trojan detection using typical structural tests like path latency and leakage power. In this paper, we present new accurate methods for Trojan detection that are based upon post-silicon multimodal thermal and power characterization techniques. Our approach first estimates the detailed post-silicon spatial power consumption using thermal maps of the IC, then applies 2DPCA to extract features of the spatial power consumption, and finally uses statistical tests against the features of authentic ICs to detect the Trojan. To characterize real-world ICs accurately, we perform our experiments in the presence of 20% - 40% CMOS process variation. Our results reveal that our new methodology can detect Trojans with power consumptions 3-4 orders of magnitude smaller than the total power usage of the chip, and that it scales very well thanks to the spatial view into the IC's internals provided by the thermal mapping.

PDF icon Reverse Engineering Digital Circuits Using Functional Analysis [p. 1277]
Pramod Subramanyan, Nestan Tsiskaridze, Kanika Pasricha, Dillon Reisman, Adriana Susnea and Sharad Malik

Integrated circuits (ICs) are now designed and fabricated in a globalized multi-vendor environment making them vulnerable to malicious design changes, the insertion of hardware trojans/malware and intellectual property (IP) theft. Algorithmic reverse engineering of digital circuits can mitigate these concerns by enabling analysts to detect malicious hardware, verify the integrity of ICs and detect IP violations. In this paper, we present a set of algorithms for the reverse engineering of digital circuits starting from an unstructured netlist and resulting in a high-level netlist with components such as register files, counters, adders and subtracters. Our techniques require no manual intervention and experiments show that they determine the functionality of more than 51% and up to 93% of the gates in each of the practical test circuits that we examine.

PDF icon A Practical Testing Framework for Isolating Hardware Timing Channels [p. 1281]
Jason Oberg, Sarah Meiklejohn, Timothy Sherwood and Ryan Kastner

This work identifies a new formal basis for hardware information flow security by providing a method to separate timing flows from other flows of information. By developing a framework for identifying these different classes of information flow at the gate-level, one can either confirm or rule out the existence of such flows in a provable manner. To demonstrate the effectiveness of our presented model, we discuss its usage on a practical example: a CPU cache in a MIPS processor written in Verilog HDL and simulated in a scenario which accurately models previous cache-timing attacks. We demonstrate how our framework can be used to isolate the timing channel used in these attacks.


9.6: Improving IC Quality and Lifetime Through Advanced Characterization

Moderators: Rob Aitken - ARM, US; Mehdi Tahoori - Karlsruhe Institute of Technology, DE
PDF icon Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling [p. 1285]
Yu Cai, Erich F. Haratsch, Onur Mutlu and Ken Mai

With continued scaling of NAND flash memory process technology and multiple bits programmed per cell, NAND flash reliability and endurance are degrading. Understanding, characterizing, and modeling the distribution of the threshold voltages across different cells in a modern multi-level cell (MLC) flash memory can enable the design of more effective and efficient error correction mechanisms to combat this degradation. We show the first published experimental measurement-based characterization of the threshold voltage distribution of flash memory. To accomplish this, we develop a testing infrastructure that uses the read retry feature present in some 2Y-nm (i.e., 20-24nm) flash chips. We devise a model of the threshold voltage distributions taking into account program/erase (P/E) cycle effects, analyze the noise in the distributions, and evaluate the accuracy of our model. A key result is that the threshold voltage distribution can be modeled, with more than 95% accuracy, as a Gaussian distribution with additive white noise, which shifts to the right and widens as P/E cycles increase. The novel characterization and models provided in this paper can enable the design of more effective error tolerance mechanisms for future flash memories.
Index Terms - NAND Flash, Memory Reliability, Memory Signal Processing, Threshold Voltage Distribution, Read Retry
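
As a concrete illustration of the model's shape (not the paper's fitted coefficients, which come from real 2Y-nm chip measurements), the following Python sketch draws threshold-voltage samples from a Gaussian whose mean shifts right and whose spread widens with P/E cycling, plus an additive white-noise term; every numeric value here is hypothetical:

```python
import numpy as np

def threshold_voltage_samples(pe_cycles, n=100_000, seed=0):
    """Illustrative MLC NAND Vth model: Gaussian + additive white noise,
    shifting right and widening as program/erase (P/E) cycles accumulate.
    All coefficients are made up for demonstration purposes."""
    rng = np.random.default_rng(seed)
    mu = 2.0 + 1e-5 * pe_cycles       # mean shifts right with wear (V)
    sigma = 0.10 + 2e-6 * pe_cycles   # distribution widens with wear
    noise = rng.normal(0.0, 0.02, n)  # additive white noise term
    return rng.normal(mu, sigma, n) + noise

fresh = threshold_voltage_samples(0)
worn = threshold_voltage_samples(30_000)
print(f"fresh: mean={fresh.mean():.3f} V, std={fresh.std():.3f} V")
print(f"worn:  mean={worn.mean():.3f} V, std={worn.std():.3f} V")
```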

PDF icon Efficient Importance Sampling for High-sigma Yield Analysis with Adaptive Online Surrogate Modeling [p. 1291]
Jian Yao, Zuochang Ye and Yan Wang

Massively repeated structures such as SRAM cells usually require extremely low failure rates. This poses a challenging issue for Monte Carlo based statistical yield analysis, as a huge number of samples has to be drawn in order to observe one single failure. Fast Monte Carlo methods, e.g. importance sampling methods, are still quite expensive as the anticipated failure rate is very low. In this paper, a new method is proposed to tackle this issue. The key idea is to improve the traditional importance sampling method with an efficient online surrogate model. The proposed method improves the performance of both stages in importance sampling, i.e., finding the distorted probability density function and sampling from it. Experimental results show that the proposed method is 1e2X~1e5X faster than the standard Monte Carlo approach and achieves 5X~22X speedup over existing state-of-the-art techniques without sacrificing estimation accuracy.
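
The importance-sampling core of such analyses, stripped of the paper's online surrogate model, can be sketched as follows: samples are drawn from a distorted density centered near the failure region and reweighted by the likelihood ratio p(x)/q(x). The one-dimensional failure threshold and mean shift below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def fails(x):
    # Hypothetical failure indicator: the cell fails when a single
    # normalized process parameter drifts beyond 4.5 sigma.
    return x > 4.5

# Plain Monte Carlo would need ~1/P(fail), i.e. roughly 3e5 samples per
# observed failure. Instead, sample from a distorted density q = N(4.5, 1)
# centered on the failure region, and reweight by p(x)/q(x) with p = N(0, 1).
shift, n = 4.5, 20_000
x = rng.normal(shift, 1.0, n)
w = np.exp(-0.5 * x**2 + 0.5 * (x - shift)**2)   # likelihood ratio p/q
p_fail = np.mean(fails(x) * w)
print(f"estimated failure rate: {p_fail:.2e}")   # true value is ~3.4e-6
```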

PDF icon Metastability Challenges for 65nm and Beyond: Simulation and Measurements [p. 1297]
Salomon Beer, Ran Ginosar, Jerome Cox, Tom Chaney and David M. Zar

Recent synchronizer metastability measurements indicate degradation of MTBF with technology scaling, calling for measurement and calibration circuits in 65nm and below. Degradation of parameters can be even worse if the system is operated at extreme supply voltages and temperature conditions. In this work we study the behavior of synchronizers in a broad range of supply voltage and temperature corners. A digital on-chip measurement system is presented that helps to characterize synchronizers in future technologies and a new calibrating system is shown that accounts for changes in delay values due to supply voltage and temperature changes. We present a detailed comparison of measurements and simulations for a fabricated 65nm bulk CMOS circuit and discuss implications of the measurements for synchronization systems in 65nm and beyond. We propose an adaptive self-calibrating synchronizer to account for supply voltage, temperature, global process variations and DVFS.

PDF icon Design and Implementation of an Adaptive Proactive Reconfiguration Technique for SRAM Caches [p. 1303]
Peyman Pouyan, Esteve Amat, Francesc Moll and Antonio Rubio

Scaling of device dimensions toward the nano-scale regime has made it essential to develop novel design techniques for improving circuit robustness. This work proposes an implementation of an adaptive proactive reconfiguration methodology that first monitors process variability and BTI aging among 6T SRAM memory cells and then applies a recovery mechanism to extend the SRAM lifetime. Our proposed technique can extend the memory lifetime by 2X to 4.5X with a silicon area overhead of around 10% for the monitoring units, in a 1kB 6T SRAM memory chip.


9.7: Design and Scheduling

Moderators: Giuseppe Lipari - ENS - Cachan, FR; Stefan Petters - CISTER/INESC-TEC, ISEP, PT
PDF icon Architecture and Optimal Configuration of a Real-Time Multi-Channel Memory Controller [p. 1307]
Manil Dev Gomony, Benny Akesson and Kees Goossens

Optimal utilization of a multi-channel memory, such as Wide IO DRAM, as shared memory in multi-processor platforms depends on the mapping of memory clients to the memory channels, the granularity at which the memory requests are interleaved in each channel, and the bandwidth and memory capacity allocated to each memory client in each channel. Firm real-time applications in such platforms impose strict requirements on shared memory bandwidth and latency, which must be guaranteed at design time to reduce verification effort. However, there is currently no real-time memory controller for multi-channel memories, and there is no methodology to optimally configure multi-channel memories in real-time systems. This paper has four key contributions: (1) A real-time multi-channel memory controller architecture with a new programmable Multi-Channel Interleaver unit. (2) A novel method for logical-to-physical address translation that enables interleaving memory requests across multiple memory channels at different granularities. (3) An optimal algorithm based on an Integer Linear Program (ILP) formulation to map memory clients to memory channels considering their communication dependencies, and to configure the memory controller for minimum bandwidth utilization. (4) An experimental evaluation of the run-time of the algorithm, showing that an optimal solution can be found within 15 minutes for realistically sized problems. We also demonstrate configuring a multi-channel Wide IO DRAM in a High-Definition (HD) video and graphics processing system to emphasize the effectiveness of our approach.
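
To make the mapping problem concrete, the sketch below exhaustively searches client-to-channel assignments for minimum peak channel load; the paper's actual formulation is an ILP that additionally models interleaving granularity, memory capacity, and communication dependencies. The client names and bandwidth figures are hypothetical:

```python
from itertools import product

# Hypothetical memory clients: (name, bandwidth demand in GB/s).
clients = [("cpu", 3.2), ("gpu", 6.4), ("video", 2.1), ("dma", 1.3)]
n_channels = 2
capacity = 8.0  # per-channel bandwidth in GB/s (made-up figure)

# Enumerate every client->channel mapping and keep the one with the
# lowest peak channel load that respects the capacity bound.
best = None
for mapping in product(range(n_channels), repeat=len(clients)):
    load = [0.0] * n_channels
    for (_, bw), ch in zip(clients, mapping):
        load[ch] += bw
    if max(load) <= capacity and (best is None or max(load) < best[0]):
        best = (max(load), mapping)

peak, mapping = best
for (name, _), ch in zip(clients, mapping):
    print(f"{name} -> channel {ch}")
print(f"peak channel load: {peak:.1f} GB/s")
```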

PDF icon Holistic Design Parameter Optimization of Multiple Periodic Resources in Hierarchical Scheduling [p. 1313]
Man-Ki Yoon, Jung-Eun Kim, Richard Bradford and Lui Sha

Hierarchical scheduling of periodic resources has been increasingly applied to a wide variety of real-time systems due to its ability to accommodate various applications on a single system through strong temporal isolation. This raises the question of how one can optimize the resource parameters while satisfying the timing requirements of real-time applications. A great deal of research has been devoted to deriving the analytic model for the bounds on the design parameters of a single resource, as well as its optimization. The optimization of multiple periodic resources, however, requires a holistic approach due to the resources' conflicting demands on the limited computational capacity of a system. Thus, this paper addresses a holistic optimization of multiple periodic resources with regard to minimum system utilization. We extend the existing analysis of a single resource so that the variable interferences among resources are captured in the resource bound, and then solve the problem with Geometric Programming (GP). The experimental results show that the proposed method can find a solution very close to the one optimized via an exhaustive search and that it can explore more solutions than a known heuristic method.

PDF icon Robust and Extensible Task Implementations of Synchronous Finite State Machines [p. 1319]
Qi Zhu, Peng Deng, Marco Di Natale and Haibo Zeng

Model-based design using synchronous reactive (SR) models is widespread for the development of embedded control software. SR models ease verification and validation, and enable the automatic generation of implementations. In SR models, synchronous finite state machines (FSMs) are commonly used to capture changes of the system state under trigger events. The implementation of a synchronous FSM may be improved by using multiple software tasks instead of the traditional single-task solution. In this work, we propose methods to quantitatively analyze task implementations with respect to a breakdown factor that measures the timing robustness, and an action extensibility metric that measures the capability to accommodate upgrades. We propose an algorithm to generate a correct and efficient task implementation of synchronous FSMs for these two metrics, while guaranteeing the schedulability constraints.

PDF icon FBLT: A Real-Time Contention Manager with Improved Schedulability [p. 1325]
Mohammed Elshambakey and Binoy Ravindran

We consider software transactional memory (STM) concurrency control for embedded multicore real-time software, and present a novel contention manager for resolving transactional conflicts, called FBLT. We upper bound transactional retries and task response times under FBLT, and identify when FBLT has better real-time schedulability than the previous best contention manager, PNF. Our implementation in the Rochester STM framework reveals that FBLT yields shorter or comparable retry costs than competitor methods.

PDF icon A Virtual Prototyping Platform for Real-time Systems with a Case Study for a Two-wheeled Robot [p. 1331]
Daniel Mueller-Gritschneder, Kun Lu, Erik Wallander, Marc Greim and Ulf Schlichtmann

In today's real-time system design, a virtual prototype can help to increase both design speed and quality. Developing a virtual prototyping platform requires realistic modeling of the HW system, accurate simulation of the real-time SW, and integration with a reactive real-time environment. Such a VP simulation platform is often difficult to develop. In this paper, we present a case study of an autonomous two-wheeled robot to show how to rapidly develop a virtual prototyping platform in SystemC/TLM to adequately aid the design of this unstable system with hard real-time constraints. Our approach is an integration of four major model components. Firstly, an accurate physical model of the robot is provided. Secondly, a virtual world is modeled in Java that offers a 3D environment for the robot to move in. Thirdly, the embedded control SW is developed. Finally, the overall HW system is modeled in SystemC at transaction level. This HW model wraps the physical model, interacts with the virtual world, and simulates the real-time SW by integrating an Instruction Set Simulator of the embedded CPU. By integrating these components into a platform, designers can efficiently optimize the embedded SW architecture, explore the design space, and check real-time conditions for different system parameters such as buffer sizes, CPU frequency, or cache configurations.
Keywords - Virtual Prototyping, Transaction Level Modeling, Real-time Constraints, Embedded Systems

PDF icon Sufficient Real-Time Analysis for an Engine Control Unit with Constant Angular Velocities [p. 1335]
Victor Pollex, Timo Feld, Frank Slomka, Ulrich Margull, Ralph Mader and Gerhard Wirrer

Engine control units in the automotive industry are particularly challenging real-time systems regarding their real-time analysis. Some of the tasks of such an engine control unit are triggered by the engine, i.e., the faster the angular velocity of the engine, the more frequently the tasks are executed. Furthermore, the execution time of a task may vary with the angular velocity of the engine. As a result, the worst case does not necessarily occur when all tasks are activated simultaneously. Hence this behavior cannot be addressed appropriately with the currently available real-time analysis methods. In this paper we take a first step towards a real-time analysis for an engine control unit. We present a sufficient real-time analysis assuming that the angular velocity of the engine is arbitrary but fixed.


10.1: HOT TOPIC: Smart Data Centers Design and Optimization

Organizer: David Atienza - EPFL, CH
Moderators: Roman Hermida - UCM, ES; Ayse Coskun - Boston University, US
PDF icon Roadmap towards Ultimately-Efficient Zeta-Scale Datacenters [p. 1339]
Patrick Ruch, Thomas Brunschwiler, Stephan Paredes, Ingmar Meijer, and Bruno Michel

Chip-level microscale liquid cooling reduces thermal resistance and improves datacenter efficiency: higher coolant temperatures eliminate chillers and allow thermal energy re-use in cold climates. Liquid cooling enables an unprecedented density in future computers, approaching that of a human brain. This is mediated by a dense 3D architecture for interconnects, fluid cooling, and power delivery through energetic chemical compounds transported in the same fluid. Vertical integration improves memory proximity, and electrochemical power delivery creates valuable space for communication. This strongly improves large-system efficiency, thereby allowing computers to grow beyond exa-scale.
Keywords - datacenter; energy; reuse; packaging; cooling; power-supply; stacking

PDF icon Correlation-Aware Virtual Machine Allocation for Energy-Efficient Datacenters [p. 1345]
Jungsoo Kim, Martino Ruggiero, David Atienza and Marcel Lederberger

Server consolidation plays a key role in mitigating the continuous power increase of datacenters. The recent advent of scale-out applications (e.g., web search, MapReduce, etc.) necessitates revisiting existing server consolidation solutions due to their distinctively different characteristics compared to traditional high-performance computing (HPC): they are user-interactive, latency-critical, and operate on large data sets split across a number of servers. This paper presents a power saving solution for datacenters that especially targets the distinctive characteristics of scale-out applications. More specifically, we take into account the correlation of core utilization among virtual machines (VMs) during server consolidation to lower the actual peak server utilization. Then, we exploit this reduction to achieve further power savings by aggressively-yet-safely lowering the server operating voltage and frequency level. We have validated the effectiveness of the proposed solution using 1) multiple clusters of real-life scale-out application workloads based on web search and 2) utilization traces obtained from real datacenter setups. According to our experiments, the proposed solution provides up to 13.7% power savings with up to 15.6% improvement in Quality-of-Service (QoS) compared to existing correlation-aware VM allocation schemes for datacenters.
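
The intuition behind correlation-aware allocation can be sketched in a few lines: VMs whose utilization traces are anti-correlated can be consolidated with a lower actual peak than correlated ones, which is the headroom that voltage/frequency scaling then exploits. The traces below are synthetic, not from the paper's datacenter setups:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-VM core-utilization traces (fraction of a core).
t = np.linspace(0, 4 * np.pi, 240)
vms = {
    "web-a": 0.4 + 0.3 * np.sin(t) + 0.05 * rng.standard_normal(t.size),
    "web-b": 0.4 + 0.3 * np.sin(t) + 0.05 * rng.standard_normal(t.size),
    "batch": 0.4 - 0.3 * np.sin(t) + 0.05 * rng.standard_normal(t.size),
}

def peak_if_colocated(a, b):
    # Peak combined utilization if VMs a and b share a server.
    return np.max(vms[a] + vms[b])

# Correlated VMs peak together; anti-correlated VMs smooth each other out.
print("corr(web-a, web-b):", round(np.corrcoef(vms["web-a"], vms["web-b"])[0, 1], 2))
print("corr(web-a, batch):", round(np.corrcoef(vms["web-a"], vms["batch"])[0, 1], 2))
print("peak(web-a + web-b):", round(peak_if_colocated("web-a", "web-b"), 2))
print("peak(web-a + batch):", round(peak_if_colocated("web-a", "batch"), 2))
```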

PDF icon Resource Efficient Computing for Warehouse-scale Datacenters [p. 1351]
Christos Kozyrakis

An increasing amount of information technology services and data are now hosted in the cloud, primarily due to the cost and scalability benefits for both the end-users and the operators of the warehouse-scale datacenters (DCs) that host cloud services. Hence, it is vital to continuously improve the capabilities and efficiency of these large-scale systems. Over the past ten years, capability has improved by increasing the number of servers in a DC and the bandwidth of the network that connects them. Cost and energy efficiency have improved by eliminating the high overheads of the power delivery and cooling infrastructure. To achieve further improvements, we must now examine how well we are utilizing the servers themselves, which are the primary determinant for DC performance, cost, and energy efficiency. This is particularly important since the semiconductor chips used in servers are now energy limited and their efficiency does not scale as fast as in the past. This paper motivates the need for resource efficient computing in large-scale datacenters and reviews the major challenges and research opportunities.


10.2: EMBEDDED TUTORIAL: On the Use of GP-GPUs for Accelerating Computing Intensive EDA Applications

Organizer: Franco Fummi - University of Verona, IT
Moderators: Franco Fummi - University of Verona, IT; Florian Letombe - SpringSoft, FR
PDF icon On the Use of GP-GPUs for Accelerating Compute-intensive EDA Applications [p. 1357]
Valeria Bertacco, Debapriya Chatterjee, Nicola Bombieri, Franco Fummi, Sara Vinco, A.M. Kaushik, Hiren D. Patel

General purpose graphics processing units (GPGPUs) have recently been explored as a new computing paradigm for accelerating compute-intensive EDA applications. Such massively parallel architectures have been applied in accelerating the simulation of digital designs during several phases of their development - corresponding to different abstraction levels, specifically: (i) gate-level netlist descriptions, (ii) register-transfer level and (iii) transaction-level descriptions. This embedded tutorial presents a comprehensive analysis of the best results obtained by adopting GP-GPUs in all these EDA applications.


10.3: Thermal Analysis and Power Optimization Techniques

Moderators: Siddharth Garg - University of Waterloo, CA; Yiran Chen - University of Pittsburgh, US
PDF icon Substitute-and-Simplify: A Unified Design Paradigm for Approximate and Quality Configurable Circuits [p. 1367]
Swagath Venkataramani, Kaushik Roy and Anand Raghunathan

Many applications are inherently resilient to inexactness or approximations in their underlying computations. Approximate circuit design is an emerging paradigm that exploits this inherent resilience to realize hardware implementations that are highly efficient in energy or performance. In this work, we propose Substitute-And-SIMplIfy (SASIMI), a new systematic approach to the design and synthesis of approximate circuits. The key insight behind SASIMI is to identify signal pairs in the circuit that assume the same value with high probability, and substitute one for the other. While these substitutions introduce functional approximations, if performed judiciously they allow some logic to be eliminated from the circuit while also enabling downsizing of gates on critical paths (simplification), resulting in significant power savings. We propose an automatic synthesis framework that performs substitution and simplification iteratively, while ensuring that a user-specified quality constraint is satisfied. We extend the proposed framework to perform automatic synthesis of quality-configurable circuits that can dynamically operate at different accuracy levels depending on application requirements. We used SASIMI to automatically synthesize approximate and quality-configurable implementations of a wide range of arithmetic units (adders, multipliers, MAC), complex data paths (SAD, FFT butterfly, Euclidean distance) and ISCAS85 benchmarks, using various error metrics such as error rate and average error magnitude. The synthesized approximate circuits demonstrate power improvements of 10%-28% for tight error constraints, and 30%-60% for relaxed error constraints. The quality-configurable circuits obtain a 14%-40% improvement in energy in the approximate mode, while incurring no energy overhead in the accurate mode.
Index Terms - Low Power Design, Approximate Computing, Approximate Circuits, Logic Synthesis
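
The first step of SASIMI - identifying signal pairs that assume the same value with high probability - can be sketched by random simulation of a toy netlist; the actual framework's substitution, gate downsizing, and quality-constraint checks are omitted here, and the three-gate circuit is invented for illustration:

```python
import random
from itertools import combinations

random.seed(3)

def simulate(a, b, c):
    """Toy combinational netlist with three internal signals."""
    n1 = a and b
    n2 = b or c
    n3 = (b or c) and not (a and b and c)  # differs from n2 only at a=b=c=1
    return {"n1": n1, "n2": n2, "n3": n3}

# Estimate, over random input vectors, how often each signal pair agrees.
# SASIMI would substitute one signal of a frequently-agreeing pair for the
# other, then simplify away the logic that becomes unobservable.
samples = 10_000
agree = {}
for _ in range(samples):
    vals = simulate(*(random.random() < 0.5 for _ in range(3)))
    for s1, s2 in combinations(sorted(vals), 2):
        agree[(s1, s2)] = agree.get((s1, s2), 0) + (vals[s1] == vals[s2])

for pair, cnt in sorted(agree.items(), key=lambda kv: -kv[1]):
    print(pair, f"agree {100 * cnt / samples:.1f}%")  # (n2, n3) ~ 87.5%
```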

PDF icon Enhancing Multicore Reliability through Wear Compensation in Online Assignment and Scheduling [p. 1373]
Thidapat Chantem, Yun Xiang, X. Sharon Hu and Robert P. Dick

System reliability is a crucial concern, especially in multicore systems, which tend to have high power density and hence high temperature. Existing reliability-aware methods are either slow and non-adaptive (offline techniques) or do not use task assignment and scheduling to compensate for uneven core wear states (online techniques). In this article, we present a dynamically-activated task assignment and scheduling algorithm, based on theoretical results, that explicitly optimizes system lifetime. We also propose a data distillation method that dramatically reduces the size of the thermal profiles to make full-system reliability analysis viable online. Simulation results show that our algorithm yields a 27-291% improvement in system lifetime compared to existing techniques for four-core systems.

PDF icon NUMANA: A Hybrid Numerical and Analytical Thermal Simulator for 3-D ICs [p. 1379]
Yu-Min Lee, Tsung-Heng Wu, Pei-Yu Huang and Chi-Ping Yang

By combining analytical and numerical simulation techniques, this work develops a hybrid thermal simulator, NUMANA, which can effectively deal with complicated material structures, to estimate the temperature profile of a 3-D IC. Compared with a commercial tool, ANSYS, its maximum relative error is only 1.84%. Compared with a well-known linear system solver, SuperLU [1], it achieves orders of magnitude speedup.

PDF icon Explicit Transient Thermal Simulation of Liquid-Cooled 3D ICs [p. 1385]
Alain Fourmigue, Giovanni Beltrame and Gabriela Nicolescu

The high heat flux and compact structure of three-dimensional circuits (3D ICs) make conventional air-cooled devices more susceptible to overheating. Liquid cooling is an alternative that can improve heat dissipation and reduce thermal issues. Fast and accurate thermal models are needed to appropriately dimension the cooling system at design time. Several models have been proposed to study different designs, but generally with low simulation performance. In this paper, we present an efficient model of the transient thermal behaviour of liquid-cooled 3D ICs. In our experiments, our approach is 60 times faster and uses 600 times less memory than state-of-the-art models, while maintaining the same level of accuracy.
Index Terms - 3D ICs, Liquid-cooling, Compact Thermal Model, Finite Difference Method

PDF icon Mitigating Dark Silicon Problems Using Superlattice-based Thermoelectric Coolers [p. 1391]
Francesco Paterna and Sherief Reda

Dark silicon is an emerging problem in multi-core processors, where it is not possible to enable all cores simultaneously because of either insufficient parallelism in software applications or high spatial power densities that generate hot-spot constraints. Superlattice-based thermoelectric cooling (TEC) is a promising technology that offers large heat pumping capability and the ability to target the hot spots of each core independently. In this paper, we devise novel system-level methods that address the two main sources of dark silicon using superlattice TECs. Our methods leverage the TECs in conjunction with dynamic voltage and frequency scaling and the number of threads to maximize the performance of a multicore processor under thermal and power constraints. Using an experimental setup based on a quad-core processor, we provide an evaluation of the trade-offs among performance, temperature and power consumption arising from the use of superlattice-based TECs. Our results demonstrate the potential of this emerging cooling technology in mitigating dark silicon problems and in improving the performance of multi-core processors.

PDF icon Run-time Probabilistic Detection of Miscalibrated Thermal Sensors in Many-core Systems [p. 1395]
Jia Zhao, Shiting (Justin) Lu, Wayne Burleson and Russell Tessier

Many-core architectures use large numbers of small temperature sensors to detect thermal gradients and guide thermal management schemes. In this paper, a technique to identify thermal sensors that are operating outside a required accuracy is described. Unlike previous on-chip temperature estimation approaches, our algorithms are optimized to run online while thermal management decisions are being made. The accuracy of a sensor is determined by comparing its readings to expected values from a probability distribution function determined from surrounding sensors. Experiments show that a sensor operating outside a desired accuracy can be identified with a detection rate of over 90% and an average false alarm rate of less than 6%, at a confidence level of 90%. The run time of our method is shown to be around 3x lower than that of a recently-published temperature estimation method, enhancing its suitability for runtime implementation.
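
A minimal sketch of the detection idea, substituting a simple Gaussian fitted to the neighboring sensors for the paper's spatially-derived probability distribution function, with a hypothetical z-score threshold standing in for the 90% confidence level:

```python
import statistics

def flag_miscalibrated(reading, neighbor_readings, z_threshold=1.645):
    """Flag a sensor whose reading is improbable given its neighbors.
    A Gaussian over the neighbor readings is a stand-in for the actual
    distribution derived from spatial thermal correlation."""
    mu = statistics.fmean(neighbor_readings)
    sigma = statistics.stdev(neighbor_readings)
    return abs(reading - mu) / sigma > z_threshold

neighbors = [61.2, 60.8, 61.5, 60.9, 61.1]  # degrees C, hypothetical
print(flag_miscalibrated(61.0, neighbors))  # False: consistent reading
print(flag_miscalibrated(66.0, neighbors))  # True: likely miscalibrated
```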


10.4: Abstraction Techniques and SAT/SMT-Based Optimizations

Moderators: Fahim Rahim - Atrenta, FR; Julian Schmaltz - Open University of the Netherlands, NL
PDF icon GLA: Gate-Level Abstraction Revisited [p. 1399]
Alan Mishchenko, Niklas Een, Robert Brayton, Jason Baumgartner, Hari Mony and Pradeep Nalla

Verification benefits from removing logic that is not relevant to a proof. Techniques for doing this are known as localization abstraction. Abstraction is often performed by selecting a subset of gates to be included in the abstracted model; the signals feeding into this subset become unconstrained cut-points. In this paper, we propose several improvements to substantially increase the scalability of automated abstraction. In particular, we show how a better integration between the BMC engine and the SAT solver is achieved, resulting in a new hybrid abstraction engine that is faster and uses less memory. This engine speeds up computation by constant propagation and circuit-based structural hashing while collecting UNSAT cores for the intermediate proofs in terms of a subset of the original variables. Experimental results show improvements in abstraction depth and size.

PDF icon Lemma Localization: A Practical Method for Downsizing SMT-Interpolants [p. 1405]
Florian Pigorsch and Christoph Scholl

Craig interpolation has become a powerful and universal tool in the formal verification domain, where it is used not only for Boolean systems, but also for timed systems, hybrid systems, and software programs. The latter systems demand interpolation for fragments of first-order logic. When it comes to model checking, the structural compactness of interpolants is necessary for efficient algorithms. In this paper, we present a method to reduce the size of interpolants derived from proofs of unsatisfiability produced by SMT (Satisfiability Modulo Theories) solvers. Our novel method uses structural arguments to modify the proof in such a way that the resulting interpolant is guaranteed to have smaller size. To show the effectiveness of our approach, we apply it to an extensive set of formulas from symbolic hybrid model checking.

PDF icon Core Minimization in SAT-based Abstraction [p. 1411]
Anton Belov, Huan Chen, Alan Mishchenko and Joao Marques-Silva

Automatic abstraction is an important component of modern formal verification flows. A number of effective SAT-based automatic abstraction methods use unsatisfiable cores to guide the construction of abstractions. In this paper we analyze the impact of unsatisfiable core minimization, using state-of-the-art algorithms for the computation of minimally unsatisfiable subformulas (MUSes), on the effectiveness of a hybrid (counterexample-based and proof-based) abstraction engine. We demonstrate empirically that core minimization can lead to a significant reduction in the total verification time, particularly on difficult testcases. However, the resulting abstractions are not necessarily smaller. We notice that by varying the minimization effort the abstraction size can be controlled in a non-trivial manner. Based on this observation, we achieve a further reduction in the total verification time.

PDF icon Optimization Techniques for Craig Interpolant Compaction in Unbounded Model Checking [p. 1417]
G. Cabodi, C. Loiacono and D. Vendraminetto

This paper addresses the problem of reducing the size of Craig interpolants generated within inner steps of SAT-based Unbounded Model Checking. Craig interpolants are obtained from refutation proofs of unsatisfiable SAT runs, in terms of and/or circuits of linear size w.r.t. the proof. Existing techniques address proof reduction, whereas interpolant compaction is typically considered an implementation problem, tackled using standard logic synthesis techniques. We propose an integrated three-step process, in which we: (1) exploit an existing technique to detect and remove redundancies in refutation proofs, (2) apply combinational logic reductions (constant propagation, ODC-based simplifications, and BDD-based sweeping) directly on the proof graph data structure, (3) eventually apply ad hoc combinational logic synthesis steps on interpolant circuits. The overall procedure is novel (as well as parts of the above listed steps), and represents an advance over the state of the art. The paper includes an experimental evaluation, showing the benefits of the proposed technique, on a set of benchmarks from the Hardware Model Checking Competition 2011.

PDF icon Formal Analysis of Steady State Errors in Feedback Control Systems Using HOL-Light [p. 1423]
Osman Hasan and Muhammad Ahmad

The accuracy of control systems analysis is of paramount importance as even minor design flaws can lead to disastrous consequences in this domain. This paper provides a higher-order-logic theorem proving based framework for the formal analysis of steady state errors in feedback control systems. In particular, we present the formalization of control system foundations, like transfer functions, summing junctions, feedback loops and pickoff points, and steady state error models for the step, ramp and parabola cases. These foundations can be built upon to formally specify a wide range of feedback control systems in higher-order logic and reason about their steady state errors within the sound core of a theorem prover. The proposed formalization is based on the complex number theory of the HOL-Light theorem prover. For illustration purposes, we present the steady state error analysis of a solar tracking control system.

PDF icon A Novel Concurrent Cache-friendly Binary Decision Diagram Construction for Multi-core Platforms [p. 1427]
Mahmoud Elbayoumi, Michael S. Hsiao and Mustafa ElNainay

Current BDD packages such as CUDD depend on chained hash tables. Although these are efficient in terms of memory usage, they exhibit poor cache performance due to dynamic allocation and data indirection. Moreover, they are less appealing for concurrent environments as they need thread-safe garbage collectors. Furthermore, to take advantage of multi-core platforms, it is best to re-engineer the underlying algorithms, such as deciding whether traditional depth-first search (DFS) construction, breadth-first search (BFS) construction, or a hybrid of BFS and DFS would be best. In this paper, we introduce a novel BDD package friendly to multi-core platforms that builds on a number of heuristics. Firstly, we restructure the Unique Table (UT) using concurrency-friendly hopscotch hashing to improve cache performance. Secondly, we re-engineer the BFS queues with hopscotch hashing. Thirdly, we propose a novel technique that lets the BFS queues simultaneously serve as a Computed Table (CT). Finally, we propose a novel incremental mark-sweep garbage collector (GC). We report results for both BFS and hybrid BFS-DFS construction methods. With these techniques, even with a single-threaded BDD, we achieve a speedup of up to 8x compared to the conventional single-threaded CUDD package. When two threads are launched, another 1.5x speedup is obtained.


10.5: Design and Verification of Mixed-Signal Circuits

Moderators: Catherine Dehollain - EPFL, CH; Gunhan Dundar - Bogazici University, TR
PDF icon A Low-Power and Low-Voltage BBPLL-Based Sensor Interface in 130nm CMOS for Wireless Sensor Networks [p. 1431]
Jelle Van Rethy, Hans Danneels, Valentijn De Smedt, Wim Dehaene and Georges Gielen

A low-power and low-voltage BBPLL-based sensor interface for resistive sensors in Wireless Sensor Networks is presented. The interface is optimized towards low power, fast start-up time and fast conversion time, making it primarily useful in autonomous wireless sensor networks. The interface is time/frequency-based, making it less sensitive to lower supply voltages and other analog non-idealities, whereas conventional amplitude-based interfaces suffer significantly from these non-idealities, especially in smaller CMOS technologies. The sensor-to-digital conversion is based on the locking behavior of a digital PLL, which also includes transient behavior after startup. Several techniques such as VDD scaling, coarse and fine tuning, and pulse-width modulated feedback are implemented to decrease the transient and acquisition time and the power, optimizing the total energy consumption. In this way the sensor interface consumes only 61μW from a 0.8V DC power supply with a one-sample conversion time of less than 20μs worst-case. The sensor interface is designed and implemented in UMC130 CMOS technology and outputs 8 bits in parallel with 7.72 ENOB. Due to its fast start-up time, fast conversion time and low power consumption, it consumes only 5.79 pJ/bit-conversion, which is a state-of-the-art energy efficiency compared to recent resistive sensor interfaces.

PDF icon Reachability Analysis of Nonlinear Analog Circuits through Iterative Reachable Set Reduction [p. 1436]
Seyed Nematollah Ahmadyan and Shobha Vasudevan

We propose a methodology for reachability analysis of nonlinear analog circuits to verify safety properties. Our iterative reachable set reduction algorithm initially considers the entire state space as reachable. Our algorithm iteratively determines which regions in the state space are unreachable and removes those unreachable regions from the over approximated reachable set. We use the State Partitioning Tree (SPT) algorithm to recursively partition the reachable set into convex polytopes. We determine the reachability of adjacent neighbor polytopes by analyzing the direction of state space trajectories at the common faces between two adjacent polytopes. We model the direction of the trajectories as a reachability decision function that we solve using a sound root counting method. We are faithful to the nonlinearities of the system. We demonstrate the memory efficiency of our algorithm through computation of the reachable set of Van der Pol oscillation circuit.

PDF icon Formal Verification of Analog Circuit Parameters across Variation Utilizing SAT [p. 1442]
Merritt Miller and Forrest Brewer

A fast technique for proving steady-state analog circuit operation constraints is described. Based on SAT, the technique is applicable to practical circuit design and modeling scenarios as it does not require algebraic device models. Despite the complexity of representing accurate transistor I/V characteristics, run-time and problem-scaling behavior are excellent.
Index Terms - Analog Verification, Discrete Representation, Circuit Modeling, SAT

PDF icon Extracting Analytical Nonlinear Models from Analog Circuits by Recursive Vector Fitting of Transfer Function Trajectories [p. 1448]
Dimitri De Jonghe, Dirk Deschrijver, Tom Dhaene and Georges Gielen

This paper presents a technique for automatically extracting analytical behavioral models from the netlist of a nonlinear analog circuit. Subsequent snapshots of the internal circuit Jacobian are sampled during time-domain analysis and are then processed into Transfer Function Trajectories (TFTs). The TFT data project the nonlinear dynamics of the system onto a hyperplane in the mixed state-space/frequency domain. Next, a Recursive Vector Fitting (RVF) algorithm is used to extract an analytical Hammerstein model from the TFT data in an automated fashion. The resulting RVF model equations are implemented as an accurate nonlinear behavioral model in the time domain. The model is guaranteed stable by construction and can trade off complexity for accuracy. The technique is validated on a high-speed analog buffer circuit containing 70 linear and nonlinear components, showing a 7X speedup.

PDF icon Statistical Modeling with the Virtual Source MOSFET Model [p. 1454]
Li Yu, Lan Wei, Dimitri Antoniadis, Ibrahim Elfadel and Duane Boning

A statistical extension of the ultra-compact Virtual Source (VS) MOSFET model is developed here for the first time. The characterization uses a statistical extraction technique based on the backward propagation of variance (BPV) with variability parameters derived directly from the nominal VS model. The resulting statistical VS model is extensively validated using Monte Carlo simulations, and the statistical distributions of several figures of merit for logic and memory cells are compared with those of a BSIM model from a 40-nm CMOS industrial design kit. The comparisons show almost identical distributions with distinct run time advantages for the statistical VS model. Additional simulations show that the statistical VS model accurately captures non-Gaussian features that are important for low-power designs.
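
In sketch form, BPV inverts the first-order variance relation var(F_j) = sum_i (dF_j/dp_i)^2 var(p_i): given measured figure-of-merit variances and sensitivities from the nominal model, it solves for the underlying process-parameter variances. The sensitivities and variances below are hypothetical, chosen so the system is exactly consistent:

```python
import numpy as np

# Rows: figures of merit (e.g., Idsat, Idlin, Ioff); columns: process
# parameters (e.g., Vt, mobility). Entries are nominal-model sensitivities.
S = np.array([[2.0, 0.5],
              [1.2, 1.5],
              [3.0, 0.2]])
var_F = np.array([0.450, 0.594, 0.908])  # measured FoM variances

# Least-squares solve of (S^2) var_p = var_F for the parameter variances.
var_p, *_ = np.linalg.lstsq(S**2, var_F, rcond=None)
print("extracted parameter variances:", np.round(var_p, 3))   # [0.1 0.2]
print("reconstructed FoM variances:", np.round(S**2 @ var_p, 3))
```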

PDF icon Automatic Circuit Sizing Technique for the Analog Circuits with Flexible TFTs Considering Process Variation and Bending Effects [p. 1458]
Yen-Lung Chen, Wan-Rong Wu, Guan-Ruei Lu and Chien-Nan Jimmy Liu

Flexible electronics are a possible alternative for portable consumer applications, with many advantages. However, circuit design for flexible electronics is still challenging, especially for sensitive analog circuits. Significant parameter variations and bending effects of flexible TFTs further increase the difficulties for circuit designers. In this paper, an automatic circuit sizing technique is proposed for analog circuits with flexible TFTs. The process variation and bending effects of flexible TFTs are considered simultaneously in the optimization flow. As shown in the experimental results, the proposed approach can further improve the design yield and significantly reduce the design overhead.


10.6: On-Line Testing Techniques

Moderators: Cecilia Metra - University of Bologna, IT; Cristiana Bolchini - Politecnico Di Milano, IT
PDF icon On-Line Functionally Untestable Fault Identification in Embedded Processor Cores [p. 1462]
P. Bernardi, M. Bonazza, E. Sanchez, M. Sonza Reorda and O. Ballan

Functional testing of embedded processors is a challenging task, and additional constraints are imposed when a functional test procedure has to be executed online. In the latter case, a significant fraction of the processor faults cannot be detected, since they are related to the debug/test circuitry or excluded by memory configuration constraints. In this paper we identify several sources of online functional untestability and propose a set of techniques to exactly measure their impact on the fault coverage. Experimental results related to an industrial case study are reported, showing that the fault coverage loss due to the considered untestability sources may exceed 13%.

PDF icon Capturing Vulnerability Variations for Register Files [p. 1468]
Javier Carretero, Enric Herrero, Matteo Monchiero, Tanausú Ramírez and Xavier Vera

Soft error rates are estimated based on the worst-case architectural vulnerability factor (AVF). This makes tracking accurate AVF in real time very attractive to computer designers: more accurate AVF numbers allow turning on more features at runtime while keeping the promised SDC and DUE rates. This paper presents a hardware mechanism based on linear regressions to estimate the AVF (SDC and DUE) of the register file for out-of-order cores. Our results show that we are able to achieve a high correlation factor at low cost.

PDF icon Error Detection in Ternary CAMs Using Bloom Filters [p. 1474]
Salvatore Pontarelli, Marco Ottavi, Adrian Evans and Shi-Jie Wen

This paper presents an innovative approach to detect soft errors in Ternary Content Addressable Memories (TCAMs) based on the use of Bloom Filters. The proposed approach is described in detail and its performance results are presented. The advantages of the proposed method are that no modifications to the TCAM device are required, the checking is done on-line and the approach has low power and area overheads.
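
A minimal Bloom filter shadowing the TCAM contents might look as follows; ternary "don't care" bits - the hard part the paper actually addresses - are ignored here, and the stored entries are hypothetical:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # k independent bit positions derived from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Mirror the TCAM entries in the filter. A TCAM lookup result that
# disagrees with the filter (e.g., a miss for a key the filter reports
# present) flags a likely soft error in the TCAM array.
tcam_entries = ["10.0.0.1", "10.0.0.2", "192.168.1.7"]
bf = BloomFilter()
for entry in tcam_entries:
    bf.add(entry)
print(bf.maybe_contains("10.0.0.1"))    # True
print(bf.maybe_contains("172.16.0.9"))  # False (with high probability)
```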

PDF icon AVF-driven Parity Optimization for MBU Protection of In-core Memory Arrays [p. 1480]
Michail Maniatakos, Maria K. Michael and Yiorgos Makris

We propose an AVF-driven parity selection method for protecting modern microprocessor in-core memory arrays against MBUs. As MBUs constitute more than 50% of the upsets in the latest technologies, error correcting codes or physical interleaving are typically employed to effectively protect out-of-core memory structures, such as caches. However, such methods are not applicable to high-performance in-core arrays, due to computational complexity, high delay and area overhead. To this end, we revisit parity as an effective mechanism to detect errors and resort to pipeline flushing and checkpointing for correction. We demonstrate that optimal parity tree construction for MBU detection is a computationally complex problem, which we then formulate as an integer linear program (ILP). Experimental results on Alpha 21264 and Intel P6 in-core memory arrays demonstrate that optimal parity tree selection can achieve a substantial vulnerability reduction, even when a small number of bits are added to the parity trees, compared to simple heuristics. Furthermore, the ILP formulation allows us to find better solutions by effectively exploring the solution space in the presence of multiple parity trees; results show that the presence of 2 parity trees offers a vulnerability reduction of more than 50% over a single parity tree.

PDF icon An Enhanced Double-TSV Scheme for Defect Tolerance in 3D-IC [p. 1486]
Hsiu-Chuan Shih and Cheng-Wen Wu

Die stacking based on Through-Silicon Vias (TSVs) is considered an efficient way to reduce power consumption and form factor. At the current stage, the failure rate of TSVs is still high, so some type of defect tolerance scheme is required. The concept of double-via, normally used in traditional layer-to-layer interconnection, can be one of the feasible tolerance schemes. Double-via/TSV has a benefit compared to TSV repair: it eliminates the fuse configuration procedure as well as the fuse layer. However, double-TSV has a problem of signal degradation and leakage caused by short defects. In this work, an enhanced scheme for double-TSV is proposed to solve the short-defect problem through signal path division and VDD isolation. Results show that the enhanced double-TSV can tolerate both open and short defects, with reasonable area and timing overhead.
Keywords - TSV; 3D-IC; open-defect; short-defect; defect tolerance; yield improvement

PDF icon Mempack: An Order of Magnitude Reduction in the Cost, Risk, and Time for Memory Compiler Certification [p. 1490]
Kartik Mohanram, Matthew Wartell and Sundar Iyer

Advances in memory compiler technology have helped accelerate the integration of hundreds of unique embedded memory macros in contemporary low-power, high-speed SoCs. The heavy use of compiled memories poses multiple challenges on the characterization, validation, and reliability fronts. This motivates solutions that can reduce overall cost, time, and risk to certify memories through the identification of a reduced set of "fundamental" memory macros that can be used one or more times to realize all memory instances in the design. This paper describes MemPack, a fast, general method based upon the classical change-making algorithm for the identification of such fundamental memory macros. By relaxing the need for exact realization of memories and tolerating wastage within the context of change-making, MemPack enables tradeoffs between memory capacity and reduction in the number of fundamental macros. It also controls multiplexing and instantiation costs, minimizing the impact on critical path delay and address line loading. Results on industrial and synthetic benchmarks for three different optimization objectives (performance, balance, and minimization) show that MemPack is effective in identifying fundamental sets that are as much as 16x smaller than the original set for 0.8-4.7% wasted bits.
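
The change-making core of such an approach can be sketched as a textbook dynamic program extended to tolerate wastage: each required capacity is rounded up to the cheapest realizable sum of fundamental macro sizes. The macro sizes and waste bound below are hypothetical, and the sketch ignores the multiplexing and instantiation costs MemPack also controls:

```python
def min_macros(target_kb, sizes_kb, max_waste_kb):
    """Fewest fundamental macros realizing at least target_kb capacity,
    tolerating up to max_waste_kb of wasted bits (change-making DP)."""
    limit = target_kb + max_waste_kb
    INF = float("inf")
    dp = [0] + [INF] * limit  # dp[c] = fewest macros summing exactly to c
    for c in range(1, limit + 1):
        for s in sizes_kb:
            if s <= c and dp[c - s] + 1 < dp[c]:
                dp[c] = dp[c - s] + 1
    # Accept the smallest realizable capacity at or above the target.
    for c in range(target_kb, limit + 1):
        if dp[c] < INF:
            return c, dp[c]
    return None

fundamental = [8, 32, 128]  # candidate fundamental macro sizes (KB)
for need in [24, 72, 140]:
    realized, count = min_macros(need, fundamental, max_waste_kb=16)
    print(f"need {need} KB -> realize {realized} KB with {count} macros")
```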

PDF icon Exploiting Replicated Checkpoints for Soft Error Detection and Correction [p. 1494]
Fahrettin Koc, Kenan Bozdas, Burak Karsli and Oguz Ergin

Register renaming is a widely used technique to remove false dependencies in contemporary superscalar microprocessors. A register alias table (RAT) is formed to hold the current locations of the values that correspond to the architectural registers. Some recently designed processors take a copy of the rename table at each branch instruction, in order to recover its contents when a misspeculation occurs. In this paper, we first investigate the RAT's vulnerability to transient errors. We then analyze the vulnerability of RAT checkpoints and propose two techniques for soft error detection and correction that utilize the redundantly taken copies of entries whose content is the same as in the previous and/or next checkpoints. Simulation results for the SPEC 2006 benchmarks reveal that, on average, RAT vulnerability is 25% and checkpoint vulnerability is 6%. The results also reveal that redundancy exists in sequential checkpoint copies and can be used for error detection and correction purposes. We propose techniques that exploit this redundancy and show that faults in 41% of all checkpoints and 44% of rolled-back checkpoints can be detected, and errors in 33% of the rolled-back checkpoints can be corrected. Since we exploit already available storage, the proposed error detection and correction techniques can be implemented with minimal hardware overhead.
Keywords - Microprocessors, Register Rename, Checkpoint, RAT Vulnerability, Soft Error, Error Detection and Correction
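
To illustrate the underlying observation, the following sketch counts how many on-chip copies of each checkpoint entry exist across a three-checkpoint window: two copies permit detection by comparison, and three permit majority-vote correction. The register-to-physical mappings are invented, and this is only a plausible reading of the redundancy argument, not the authors' hardware mechanism.

# Illustrative sketch (not the paper's exact mechanism): a RAT checkpoint
# entry that is unchanged with respect to the previous and/or next
# checkpoint exists redundantly on chip, so a mismatch between the copies
# signals a soft error, and three copies allow 2-of-3 correction.

def classify_entries(prev_ckpt, ckpt, next_ckpt):
    detectable, correctable = set(), set()
    for reg, mapping in ckpt.items():
        copies = 1
        copies += prev_ckpt.get(reg) == mapping
        copies += next_ckpt.get(reg) == mapping
        if copies >= 2:
            detectable.add(reg)   # duplicate copy -> mismatch detection
        if copies >= 3:
            correctable.add(reg)  # triplicate copy -> majority correction
    return detectable, correctable

prev_c = {"r1": "p7", "r2": "p3", "r3": "p9"}
curr_c = {"r1": "p7", "r2": "p3", "r3": "p5"}
next_c = {"r1": "p7", "r2": "p8", "r3": "p5"}
print(classify_entries(prev_c, curr_c, next_c))
# ({'r1', 'r2', 'r3'}, {'r1'}): all entries detectable, r1 also correctable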


10.7: Embedded Software for Many-Core Architectures

Moderators: Oliver Bringmann - University of Tübingen, DE; Sébastien Le Beux - Lyon Institute of Nanotechnology, FR
PDF icon Game-Theoretic Analysis of Decentralized Core Allocation Schemes on Many-Core Systems [p. 1498]
Stefan Wildermann, Tobias Ziermann and Jürgen Teich

Many-core architectures used in embedded systems will contain hundreds of processors in the near future. Already now, it is necessary to study how to manage such systems when dynamically scheduling applications with different phases of parallelism and resource demands. A recent research area called invasive computing proposes a decentralized workload management scheme for such systems: applications may dynamically claim additional processors during execution and later release them again. In this paper, we study how to apply the concepts of invasive computing to realize decentralized core allocation schemes in homogeneous many-core systems, with the goal of maximizing the average speedup of the running applications at any point in time. A theoretical analysis based on game theory shows that it is possible to define a core allocation scheme that uses only local information exchange between applications, yet still provably converges to optimal results. The experimental evaluation demonstrates that this allocation scheme reduces the overhead in terms of exchanged messages by up to 61.4%, and even the convergence time by up to 13.4%, compared to an allocation scheme in which all applications exchange information globally with each other.
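
A toy version of such a local, best-response dynamic can be sketched as follows; the Amdahl-style speedup curves and the pairwise one-core trades are illustrative assumptions, not the paper's exact protocol.

# Toy best-response dynamic (illustrative): an application takes a core
# from another whenever the trade raises the total speedup -- the flavor
# of decentralized allocation the paper analyzes game-theoretically.

def speedup(parallel_fraction, cores):
    # Amdahl's law as an assumed speedup model.
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / cores)

def best_response(apps, alloc, rounds=100):
    """apps: parallel fraction per app; alloc: cores per app (>= 1 each)."""
    for _ in range(rounds):
        improved = False
        for i in range(len(apps)):
            for j in range(len(apps)):
                if i == j or alloc[j] <= 1:
                    continue
                gain = speedup(apps[i], alloc[i] + 1) - speedup(apps[i], alloc[i])
                loss = speedup(apps[j], alloc[j]) - speedup(apps[j], alloc[j] - 1)
                if gain > loss + 1e-12:   # trade only if total speedup rises
                    alloc[i] += 1
                    alloc[j] -= 1
                    improved = True
        if not improved:
            break                          # no profitable trade: equilibrium
    return alloc

# Three apps with different parallelism competing for 16 cores; the most
# parallel application ends up with the most cores.
print(best_response([0.95, 0.80, 0.50], [6, 5, 5]))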

PDF icon Enabling Fine-Grained OpenMP Tasking on Tightly-Coupled Shared Memory Clusters [p. 1504]
Paolo Burgio, Giuseppe Tagliavini, Andrea Marongiu and Luca Benini

Cluster-based architectures are increasingly being adopted for the design of embedded many-cores. These platforms can deliver very high peak performance within a contained power envelope, provided that programmers can make effective use of the available parallel cores. This is becoming an extremely difficult task, as embedded applications are growing in complexity and exhibit irregular and dynamic parallelism. The OpenMP tasking extensions represent a powerful abstraction for capturing this form of parallelism. However, supporting it efficiently on cluster-based embedded SoCs is not easy, because the fine-grained parallel workloads present in embedded applications cannot tolerate high memory and run-time overheads. In this paper we present our design of the runtime layer supporting OpenMP tasking on an embedded shared-memory cluster, identifying the key aspects for achieving performance and discussing important architectural support for removing major bottlenecks.

PDF icon ARTM: A Lightweight Fork-join Framework for Many-core Embedded Systems [p. 1510]
Maroun Ojail, Raphael David, Yves Lhuillier and Alexandre Guerre

Embedded architectures are moving to multi-core and many-core concepts in order to sustain ever-growing computing requirements within complexity and power budgets. Programming many-core architectures not only requires parallel programming skills, but also the efficient exploitation of fine-grained parallelism at both the architecture and runtime levels. Scheduler reactivity becomes increasingly important as task granularity is reduced, in order to keep the scheduling overhead to a minimum. This paper presents a lightweight fork-join framework for scheduling fine-grained parallel tasks on embedded many-core systems. The asynchronous nature of the fork-join model used in this framework dramatically decreases its scheduling overhead. Experiments conducted in this paper show that the overhead induced by the framework is 33 cycles per scheduled task. We also show that near-ideal speedup can be obtained by the ARTM framework for data-parallel applications and that ARTM achieves better results than other state-of-the-art parallelization techniques.

PDF icon Pipelets: Self-Organizing Software Pipelines for Many-core Architectures [p. 1516]
Janmartin Jahn and Jörg Henkel

We present the novel concept of Pipelets: self-organizing stages of software pipelines that monitor their computational demands and communication patterns and interact to optimize the performance of the application they belong to. They enable dynamic task remapping and exploit application-specific properties. Our experiments show that they improve performance by up to 31.2% compared to the state of the art when the resource demands of applications change at runtime, as is the case for many complex applications.

PDF icon An Integrated Approach for Managing the Lifetime of Flash-Based SSDs [p. 1522]
Sungjin Lee, Taejin Kim, Ji-Sung Park and Jihong Kim

As the semiconductor process is scaled down, the endurance of NAND flash memory deteriorates greatly. To overcome this poor endurance and provide a reasonable storage lifetime, system-level endurance enhancement techniques are rapidly being adopted in recent NAND flash-based storage devices like solid-state drives (SSDs). In this paper, we propose an integrated lifetime management approach for SSDs. The proposed technique combines several lifetime-enhancement schemes, including lossless compression, deduplication, and performance throttling, in an integrated fashion so that the lifetime of SSDs can be maximally extended. By selectively disabling less effective lifetime-enhancement schemes, the proposed technique achieves both high performance and high energy efficiency while meeting the required lifetime. Our evaluation results show that, compared to SSDs with no lifetime management, the proposed technique improves write performance by up to 55% and reduces energy consumption by up to 43% while satisfying a 5-year lifetime warranty.
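
The "selectively disabling less effective schemes" idea can be illustrated with a small search over scheme combinations; the write-reduction factors, overhead costs, and linear lifetime model below are all invented for the example.

from itertools import combinations

# Illustrative sketch: keep only the cheapest combination of
# lifetime-extension schemes that still meets the lifetime target.
# Factors and costs are made up; the real trade-offs are workload-dependent.

SCHEMES = {                  # write-reduction factor, perf. cost (%)
    "compression":   (0.70, 10),
    "deduplication": (0.85, 5),
    "throttling":    (0.60, 30),
}

def lifetime_years(write_factor, base_years=3.0):
    # Fewer physical writes -> proportionally longer flash lifetime.
    return base_years / write_factor

def cheapest_config(target_years):
    best = None
    for r in range(len(SCHEMES) + 1):
        for combo in combinations(SCHEMES, r):
            factor, cost = 1.0, 0
            for s in combo:
                f, c = SCHEMES[s]
                factor *= f
                cost += c
            if lifetime_years(factor) >= target_years:
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best

print(cheapest_config(5.0))
# (15, ('compression', 'deduplication')): throttling can stay disabled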


10.8: PANEL: Will 3D-IC Remain a Technology of the Future...Even in the Future?

Organizer: Marco Casale-Rossi - Synopsys, US
Moderators: Giovanni De Micheli - EPFL, CH; Marco Casale-Rossi - Synopsys, US

Invited Speaker: Patrick Leduc
Panelists: Patrick Blouet, Brendan Farley, Anna Fontanelli, Dragomir Milojevic, Steve Smith

PDF icon PANEL: Will 3D-IC Remain a Technology of the Future...Even in the Future? [p. 1526]

If asked "who needs faster planes?" the vast majority of the 2.75 billion airline passengers (source: IATA 2011) would say that they do need faster planes, and that they need them right now. Still, the commercial aircrafts cruising speed has remained the same - 800 km/hour - over the last 50+ years, and after the sad end of the Concorde project, neither Airbus nor Boeing are seriously working on the topic. Along the same lines, when asked "who needs 3D-IC?", most IC designers say that they desperately need 3D-IC to keep advancing electronic products performance, whilst addressing the needs of higher bandwidth, lower cost, heterogeneous integration, and power constraints. Still, 3D-IC continues to be the technology of the future. What are the road blocks towards 3D-IC adoption? Is it process technology, foundry or OSAT commercial offering, or EDA, or the business economics that is holding 3D-IC on the ground? In the introductory presentation of this panel session, LETI Patrick Leduc will illustrate the state-of-the-art of commercial, mainstream 3D-IC. EPFL Professor Giovanni de Micheli will then moderate an industry and research panel, to understand what are the key factors preventing 3D-IC from becoming the technology of today


11.1: HOT TOPIC: Smart Health

Organizers and Moderators: Daniela De Venuto - Politecnico di Bari, IT; Alberto Sangiovanni Vincentelli - University of California, Berkeley, US
PDF icon Dr. Frankenstein's Dream Made Possible: Implanted Electronic Devices [p. 1531]
Daniela De Venuto and Alberto Sangiovanni Vincentelli

The developments in micro- and nano-electronics, biology, and neuroscience make it possible to imagine a new world where vital signs can be monitored continuously, artificial organs can be implanted in human bodies, and interfaces between the human brain and the environment can extend human capabilities, thus making Dr. Frankenstein's dream come true. This paper surveys some of the most innovative implantable devices and offers some perspectives on the ethical issues that come with the introduction of this technology.

PDF icon Addressing the Healthcare Cost Dilemma by Managing Health instead of Managing Illness - An Opportunity for Wearable Wireless Sensors [p. 1537]
Chris Van Hoof and Julien Penders

The cost of healthcare is increasing worldwide. Without disruptive changes, a large part of the population in many developed countries will no longer be able to afford healthcare by 2040. Part of the solution will come from focusing on prevention. Personal tools at everyone's disposal that help people monitor their health and change their behavior can enable disease prevention. Managing weight and managing stress are two societal challenges where behavioral change can yield huge cost savings. In this paper, it is shown how wearable sensor devices can detect energy expenditure as well as monitor stress levels. System aspects and validation are discussed. Because convenience and user acceptance are key to making these tools a success, smaller form factors and more convenient sensor locations on the body are required.
Keywords - wireless sensors, body-area networks, healthcare

PDF icon Electronic Implants: Power Delivery and Management [p. 1540]
Jacopo Olivo, Sara S. Ghoreishizadeh, Sandro Carrara and Giovanni De Micheli

A power delivery system for implantable biosensors is presented. The system, embedded into a skin patch and located directly over the implantation area, is able to transfer up to 15 mW wirelessly through the body tissues by means of an inductive link. The inductive link is also used to achieve bidirectional data communication with the implanted device. Downlink communication (ASK) is performed at 100 kbps; uplink communication (LSK) is performed at 66.6 kbps. The received power is managed by an integrated system including a voltage rectifier, an amplitude demodulator and a load modulator. The power management system is presented and evaluated by means of simulations.
Index Terms - Remote powering, inductive link, energy harvesting, implantable biosensors, lactate measurement.

PDF icon Cyborg Insects, Neural Interfaces and Other Things: Building Interfaces between the Synthetic and the Multicellular [p. 1546]
J. Van Kleef, T. Massey, P. Ledochowitsch, R. Muller, R. Tiefenauer, T. Blanche, Hirotaka Sato and M.M. Maharbiz

Keywords - neural engineering, MEMS, BMI, neural interfaces


11.2: High-Level Synthesis and Coarse-Grained Reconfigurable Architectures

Moderators: Philippe Coussy - Universite de Bretagne-Sud/Lab-STICC, FR; Fadi Kurdahi - University of California Irvine, US
PDF icon Share with Care: A Quantitative Evaluation of Sharing Approaches in High-level Synthesis [p. 1547]
Alex Kondratyev, Luciano Lavagno, Mike Meyer and Yosinori Watanabe

This paper focuses on the resource sharing problem in high-level synthesis. It argues that the conventionally accepted synthesis flow, in which resource sharing is done after scheduling, is sub-optimal because it cannot account for the timing penalties of resource merging. The paper describes a competitive approach in which resource sharing and scheduling are performed simultaneously. It provides a quantitative evaluation of both approaches and shows that performing sharing during scheduling wins over the conventional approach in terms of quality of results.

PDF icon FPGA Latency Optimization Using System-level Transformations and DFG Restructuring [p. 1553]
Daniel Gomez-Prado, Maciej Ciesielski and Russell Tessier

This paper describes a system-level approach to improve the latency of FPGA designs by performing optimization of the design specification on a functional level prior to high-level synthesis. The approach uses Taylor Expansion Diagrams (TEDs), a functional graph-based design representation, as a vehicle to optimize the dataflow graph (DFG) used as input to the subsequent synthesis. The optimization focuses on critical path compaction in the functional representation before translating it into a structural DFG representation. Our approach engages several passes of a traditional high-level synthesis (HLS) process in a simulated annealing-based loop to make efficient cost trade-offs. The algorithm is time efficient and can be used for fast design space exploration. The results indicate a latency performance improvement of 22% on average versus HLS with the initial DFG for a series of designs mapped to Altera Stratix II devices.
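
A simulated annealing loop of the kind described can be sketched generically as follows; the permutation-based design state and the stub cost function stand in for the TED/DFG restructuring moves and the HLS latency evaluation, which are specific to the paper.

import math, random

# Generic simulated-annealing skeleton of the sort the flow wraps around
# HLS passes: perturb the design, keep improvements, and occasionally
# accept worsenings to escape local minima.

def anneal(initial, cost, neighbor, t0=100.0, cooling=0.95, steps=2000):
    state, best = initial, initial
    t = t0
    for _ in range(steps):
        cand = neighbor(state)
        delta = cost(cand) - cost(state)
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = cand                  # accept move
            if cost(state) < cost(best):
                best = state
        t *= cooling                      # cool the temperature
    return best

ops = list(range(8))
random.shuffle(ops)
# Stub cost: pretend latency is minimized by the identity ordering.
cost = lambda p: sum(abs(v - i) for i, v in enumerate(p))

def swap_two(p):
    q = p[:]
    i, j = random.sample(range(len(q)), 2)
    q[i], q[j] = q[j], q[i]
    return q

print(anneal(ops, cost, swap_two))        # converges toward [0, 1, ..., 7]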

PDF icon A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Simultaneous ILP and TLP Exploitation [p. 1559]
Mateus Beck Rutzig, Antonio Carlos S. Beck and Luigi Carro

As the number of embedded applications increases, companies are launching new platforms within short periods of time to execute software efficiently with the lowest possible energy consumption. However, each new platform deployment requires a new tool chain, with additional libraries, debuggers, and compilers, breaking binary compatibility. This strategy implies high hardware and software redesign costs. In this scenario, we propose the exploitation of Custom Reconfigurable Arrays for Multiprocessor Systems (CReAMS). CReAMS is composed of multiple adaptive reconfigurable processors that simultaneously exploit instruction- and thread-level parallelism. It works in a transparent fashion, so binary compatibility is maintained, with no need to change the software development process or environment. We also show that CReAMS delivers higher performance per watt in comparison to a 4-issue superscalar processor when the same power budget is considered for both designs.
keywords- reconfigurable system, multiprocessor, embedded systems

PDF icon High-Level Modeling and Synthesis for Embedded FPGAs [p. 1565]
Xiaolin Chen, Shuai Li, Jochen Schleifer, Thomas Coenen, Anupam Chattopadhyay, Gerd Ascheid and Tobias G. Noll

The fast-evolving applications in modern digital signal processing have an increasing demand for components with high computational power and energy efficiency that do not compromise flexibility. The embedded FPGA, a customized FPGA with heterogeneous fine-grained application-specific operations and routing resources, has shown significantly improved efficiency in terms of throughput, power dissipation, and chip area for the target application domain. On the other hand, the complexity of such an architecture makes it difficult to perform efficient architecture exploration and application synthesis without tool support. In this work, we propose a framework for the design of embedded FPGA (eFPGA) architectures, extended from an existing framework for Coarse-Grained Reconfigurable Architectures (CGRAs). The framework is composed of a high-level modeling formalism for eFPGAs to explore the mapping space and a retargetable application synthesis flow. To enable fast design space exploration, a force-directed placement algorithm is proposed. Finally, we demonstrate the efficacy of this framework with demanding application kernels.

PDF icon Scheduling Independent Liveness Analysis for Register Binding in High Level Synthesis [p. 1571]
Vito Giovanni Castellana and Fabrizio Ferrandi

Classical techniques for register allocation and binding require the definition of the program execution order, since a partial ordering relation between operations must be induced to perform liveness analysis through data-flow equations. In High-Level Synthesis (HLS) flows this is commonly obtained through the scheduling task. However, for some HLS approaches such a relation can be difficult to compute, or not statically computable at all, and adopting conventional register binding techniques, even when feasible, cannot guarantee maximum performance. To overcome these issues we introduce a novel scheduling-independent liveness analysis methodology, suitable for dynamic scheduling architectures. This liveness analysis is exploited for register binding using standard graph coloring techniques, and unlike other approaches it avoids the insertion of structural dependencies, which are otherwise introduced to prevent run-time resource conflicts in dynamic scheduling environments. The absence of additional dependencies avoids performance degradation and makes parallelism exploitation independent of the register binding task, while on average not impacting area, as shown by the experimental results.
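
Standard graph-coloring register binding, which the methodology feeds with its scheduling-independent liveness information, can be sketched as a first-fit greedy coloring of the interference graph; the live ranges below are invented.

# Register binding as graph coloring (the standard technique the paper
# builds on): values whose live ranges overlap must get different
# registers.  First-fit greedy coloring in decreasing-degree order.

def color_interference_graph(values, interferes):
    """interferes: dict value -> set of values live at the same time."""
    order = sorted(values, key=lambda v: len(interferes[v]), reverse=True)
    reg_of = {}
    for v in order:
        used = {reg_of[u] for u in interferes[v] if u in reg_of}
        reg = 0
        while reg in used:      # first register not used by a neighbor
            reg += 1
        reg_of[v] = reg
    return reg_of

# Five live ranges a-e; edges mark overlapping lifetimes.
interference = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
print(color_interference_graph(interference.keys(), interference))
# {'b': 0, 'a': 1, 'c': 2, 'd': 1, 'e': 0} -- three registers suffice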

PDF icon Fast Shared On-Chip Memory Architecture for Efficient Hybrid Computing with CGRAs [p. 1575]
Jongeun Lee, Yeonghun Jeong and Sungsok Seo

While Coarse-grained Reconfigurable Architectures (CGRAs) are very efficient at handling regular, compute-intensive loops, their weakness at control-intensive processing and the need for frequent reconfiguration require another processor, for which usually the main processor is used. To minimize the overhead arising in such collaborative execution, we integrate a dedicated sequential processor (SP) with a reconfigurable array (RA), where the crucial problem is how to share the memory between the SP and the RA while keeping the SP's memory access latency very short. We present a detailed architecture, control scheme, and program example of our approach, focusing on our optimized on-chip shared memory organization between the SP and the RA. Our preliminary results demonstrate that our optimized memory architecture is very effective in reducing kernel execution times (by 23.5% compared to a more straightforward alternative), and that our approach can significantly reduce the RA control overhead and other sequential code execution time in kernels, resulting in up to 23.1% reduction in kernel execution time compared to a conventional system using the main processor for sequential code execution.

PDF icon Compiling Control-Intensive Loops for CGRAs with State-Based Full Predication [p. 1579]
Kyuseung Han, Kiyoung Choi and Jongeun Lee

Predication is an essential technique for accelerating kernels with control flow on CGRAs. While state-based full predication (SFP) removes the wasteful power that conventional full predication spends on issuing and decoding instructions, generating code for SFP is challenging for general CGRAs, especially when there are multiple conditionals to handle as a result of exploiting data-level parallelism. In this paper, we present a novel compiler framework addressing central issues such as how to express the parallelism between multiple conditionals and how to allocate resources to them to maximize that parallelism. In particular, by separating the handling of control flow and data flow, our framework can be integrated with conventional mapping algorithms for mapping data flow. Experimental results demonstrate that our framework can find and exploit parallelism between multiple conditionals, thereby achieving 2.21 times higher performance on average than a naive approach.
Index Terms - CGRA; reconfigurable architecture; predication; predicated execution; conditional; compilation


11.3: Efficient NoC Routing Mechanisms

Moderators: Fabien Clermidy - CEA-LETI, FR; Jose Flich - Technical University of Valencia, ES
PDF icon DeBAR: Deflection Based Adaptive Router with Minimal Buffering [p. 1583]
John Jose, Bhawna Nayak, Kranthi Kumar and Madhu Mutyam

Energy efficiency of the underlying communication framework plays a major role in the performance of multicore systems. NoCs with buffer-less routing are gaining popularity due to simplicity in the router design, low power consumption, and load balancing capacity. With a minimal number of buffers, deflection routers evenly distribute the traffic across links. In this paper, we propose an adaptive deflection router, DeBAR, that uses a minimal set of central buffers to accommodate a fraction of mis-routed flits. DeBAR incorporates a hybrid flit ejection mechanism that gives the effect of dual ejection with a single ejection port, an innovative adaptive routing algorithm, and selective flit buffering based on flit marking. Our proposed router design reduces the average flit latency and the deflection rate, and improves the throughput with respect to the existing minimally buffered deflection routers without any change in the critical path.
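
The essence of a minimally buffered deflection router can be sketched in a few lines: flits are ranked each cycle (oldest first here, one common policy) and every flit always leaves through some port, productive or not. This is a generic illustration, not DeBAR's exact pipeline.

# Core of a deflection router (illustrative): no flit ever waits, so
# almost no buffering is needed; losers of port arbitration are deflected.
# Assumes at most as many flits as output ports per cycle.

def route_cycle(flits, ports=("N", "E", "S", "W")):
    """flits: list of (age, preferred_ports); returns idx -> (port, productive)."""
    free = set(ports)
    out = {}
    for idx, (age, wanted) in sorted(enumerate(flits),
                                     key=lambda x: -x[1][0]):  # oldest first
        productive = [p for p in wanted if p in free]
        port = productive[0] if productive else sorted(free)[0]  # deflect
        free.remove(port)
        out[idx] = (port, bool(productive))
    return out

# Three flits all wanting port 'E'; only the oldest gets it.
flits = [(3, ["E"]), (9, ["E", "S"]), (1, ["E"])]
print(route_cycle(flits))
# {1: ('E', True), 0: ('N', False), 2: ('S', False)}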

PDF icon Contrasting Wavelength-Routed Optical NoC Topologies for Power-Efficient 3D-Stacked Multicore Processors Using Physical-Layer Analysis [p. 1589]
Luca Ramini, Paolo Grani, Sandro Bartolini and Davide Bertozzi

Optical networks-on-chip (ONoCs) are currently still in the concept stage, and would benefit from explorative studies capable of bridging the gap between abstract analysis frameworks and the constraints and challenges posed by the physical layer. This paper aims to go beyond the traditional comparison of wavelength-routed ONoC topologies based only on their abstract properties, and for the first time assesses their physical implementation efficiency in a homogeneous experimental setting of practical relevance. As a result, the paper demonstrates the significant and differing deviation of topology layouts from their logic schemes under the effect of placement constraints on the target system. This becomes the preliminary step for the accurate characterization of technology-specific metrics, such as the insertion-loss critical path, and for deriving the ultimate impact on the power efficiency and feasibility of each design.

PDF icon Topology-Agnostic Fault-Tolerant NoC Routing Method [p. 1595]
Eduardo Wachter, Augusto Erichsen, Alexandre Amory and Fernando Moraes

Routing algorithms for NoCs have been studied extensively over the last 12 years, and proposals for algorithms targeting some cost function, such as latency reduction or congestion avoidance, abound in the literature. Fault-tolerant routing algorithms have also been proposed, with the table-based approach being the most widely adopted method. With SoCs expected to contain hundreds of cores in the near future, features such as scalability, reachability, and fault assumptions should be considered in fault-tolerant routing methods. However, current proposals have some limitations: (1) cost that increases with the NoC size, compromising scalability; (2) some healthy routers may not be reached even if a source-target path exists; (3) some algorithms restrict the number of faults and their locations in order to operate correctly. The present work presents a method, inspired by VLSI routing algorithms, to search for the path between source-target pairs in a way that abstracts the network topology. Results present the routing paths for different topologies (mesh, torus, Spidergon, and Hierarchical Spidergon) in the presence of faulty routers. The silicon area overhead and the total execution time of the path computation are small, demonstrating that the proposed method may be adopted in NoC designs.
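
A topology-agnostic search of this kind is essentially maze (Lee) routing on the router graph; a minimal breadth-first sketch, with an invented 2x3 mesh and fault set, looks as follows.

from collections import deque

# Topology-agnostic path search in the spirit of VLSI maze routing: flood
# outward from the source over whatever router graph exists, skip faulty
# routers, and backtrack the first wave that reaches the target.  Only the
# `links` table changes between mesh, torus, Spidergon, etc.

def find_path(links, src, dst, faulty):
    """links: dict router -> iterable of neighbor routers."""
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}
    frontier = deque([src])
    while frontier:
        r = frontier.popleft()
        if r == dst:                      # backtrack to recover the path
            path = []
            while r is not None:
                path.append(r)
                r = parent[r]
            return path[::-1]
        for n in links[r]:
            if n not in parent and n not in faulty:
                parent[n] = r
                frontier.append(n)
    return None                           # unreachable with the given faults

# 2x3 mesh (0-1-2 / 3-4-5) with router 1 faulty.
mesh = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
        3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
print(find_path(mesh, 0, 2, faulty={1}))  # [0, 3, 4, 5, 2]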

PDF icon Fault-Tolerant Routing Algorithm for 3D NoC Using Hamiltonian Path Strategy [p. 1601]
Masoumeh Ebrahimi, Masoud Daneshtalab and Juha Plosila

While Networks-on-Chip (NoCs) have been increasing in popularity with industry and academia, they are threatened by the decreasing reliability of aggressively scaled transistors. In this paper, we address the problem of faulty elements by means of routing algorithms. Fault-tolerant routing algorithms are commonly complex, since they must support different fault models while preventing deadlock. When moving from a 2D to a 3D network, the complexity increases significantly due to the possibility of creating cycles within and between layers. In this paper, we take advantage of the Hamiltonian path to tolerate faults in the network. The presented approach is not only very simple but also able to tolerate almost all single faulty unidirectional links in 2D and 3D NoCs.

PDF icon Modeling and Analysis of Fault-tolerant Distributed Memories for Networks-on-Chip [p. 1605]
Abbas BanaiyanMofrad, Nikil Dutt and Gustavo Girão

Advances in technology scaling increasingly make Networks-on-Chip (NoCs) more susceptible to failures that cause various reliability challenges. With the increasing area occupied by different on-chip memories, strategies for maintaining the fault tolerance of distributed on-chip memories become a major design challenge. We propose a system-level design methodology for scalable fault tolerance of distributed on-chip memories in NoCs. We introduce a novel reliability clustering model for fault-tolerance analysis and shared redundancy management of on-chip memory blocks. We perform extensive design space exploration, applying the proposed reliability clustering to a block-redundancy fault-tolerant scheme to evaluate the trade-offs between reliability, performance, and overheads. Evaluations on a 64-core chip multiprocessor (CMP) with an 8x8 mesh NoC show that distinct strategies of our case study may yield up to 20% improvement in performance and 25% improvement in energy savings across different benchmarks, and uncover interesting design configurations.


11.4: System-Level Modelling for Physical Properties

Moderators: Frank Oppenheimer - OFFIS, DE; François Pêcheux - UPMC, FR
PDF icon System-Level Modeling of Energy in TLM for Early Validation of Power and Thermal Management [p. 1609]
Tayeb Bouhadiba, Matthieu Moy and Florence Maraninchi

Modern systems-on-chip are equipped with power architectures that allow the consumption of individual components or subsystems to be controlled. These mechanisms are controlled by a power-management policy, often implemented in the embedded software with hardware support. Today's circuits have significant static power consumption, which low-power designs address with techniques like DVFS or power gating. A correct and efficient management of these mechanisms is therefore becoming nontrivial. Validating the effect of the power-management policy must be done very early in the design cycle, as part of the architecture exploration activity. High-level models of the hardware must be annotated with consumption information. Temperature must also be taken into account, since leakage current increases exponentially with it. Existing annotation techniques applied to loosely timed or temporally decoupled models would create bad simulation artifacts on the temperature profile (e.g., unrealistic peaks). This paper addresses the instrumentation of a timed transaction-level model of the hardware with information on the power consumption of the individual components. It can cope not only with power-state models but also with Joule-per-bit traffic models, and it avoids simulation artifacts when used in a functional/power/temperature co-simulation.

PDF icon System-Level Modeling and Microprocessor Reliability Analysis for Backend Wearout Mechanisms [p. 1615]
Chang-Chih Chen and Linda Milor

Backend wearout mechanisms are major reliability concerns for modern microprocessors. In this paper, a framework containing modules for backend time-dependent dielectric breakdown (BTDDB), electromigration (EM), and stress-induced voiding (SIV) is proposed to analyze circuit layout geometries and interconnects in order to accurately estimate state-of-the-art microprocessor lifetime due to each mechanism. Our methodology incorporates the detailed electrical stress, temperature, linewidth, and cross-sectional area of each interconnect within the microprocessor system. We analyze several layouts using our methodology and highlight the lifetime-limiting wearout mechanisms, along with the reliability-critical microprocessor functional units, using standard benchmarks.
Keywords - Wearout Mechanisms; Microprocessor; Reliability; EM; SIV; SM; TDDB; Aging

PDF icon Automatic Success Tree-Based Reliability Analysis for the Consideration of Transient and Permanent Faults [p. 1621]
Hananeh Aliee, Michael Glaß, Felix Reimann and Jürgen Teich

Success tree analysis is a well-known method to quantify the dependability features of many systems. This paper presents a system-level methodology to automatically generate a success tree from a given embedded system implementation and subsequently analyze its reliability based on a state-of-the-art Monte Carlo simulation. This enables the efficient analysis of transient as well as permanent faults while considering methods such as task and resource redundancy to compensate for them. As a case study, the proposed technique is compared with two analysis techniques successfully applied at the system level: (1) a BDD-based reliability analysis technique and (2) a SAT-assisted approach, both suffering from exponential complexity in either space or time. Experimental results on an extensive test suite show that: (a) As opposed to the Success Tree (ST) and SAT-assisted approaches, the BDD-based approach is highly prone to exhausting available memory during construction, even for moderate and large test cases. (b) The proposed ST technique is competitive with the SAT-assisted analysis in speed and accuracy, while being the only technique suitable for handling large and complex system implementations in which permanent and transient faults may occur concurrently.
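
The Monte Carlo evaluation of a success tree can be sketched compactly; the tree below (two redundant cores AND one memory) and the component reliabilities are illustrative assumptions.

import random

# Minimal Monte Carlo evaluation of a success tree: sample which
# components survive, evaluate the tree bottom-up, and estimate system
# reliability as the fraction of successful samples.

def evaluate(node, alive):
    kind = node[0]
    if kind == "leaf":
        return alive[node[1]]
    children = (evaluate(c, alive) for c in node[1])
    return all(children) if kind == "and" else any(children)

def simulate(reliabilities, tree, samples=100_000):
    ok = 0
    for _ in range(samples):
        alive = {c: random.random() < r for c, r in reliabilities.items()}
        ok += evaluate(tree, alive)
    return ok / samples

# Task succeeds if (core0 OR core1) works AND the memory works.
tree = ("and", [("or", [("leaf", "core0"), ("leaf", "core1")]),
                ("leaf", "mem")])
rel = {"core0": 0.9, "core1": 0.9, "mem": 0.99}
print(simulate(rel, tree))  # analytic value: (1 - 0.1**2) * 0.99 = 0.9801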

PDF icon Hybrid Prototyping of Multicore Embedded Systems [p. 1627]
Ehsan Saboori and Samar Abdi

This paper presents a novel modeling technique for multicore embedded systems, called Hybrid Prototyping. The fundamental idea is to simulate a design with multiple cores by creating an emulation kernel in software on top of a single physical instance of the core. The emulation kernel switches between tasks mapped to different cores and manages the logical simulation times of the individual cores. As a result, we can achieve fast and cycle-accurate simulation of symmetric multicore designs, thereby overcoming the accuracy concerns of virtual prototyping and the scalability issues of physical prototyping. Our experiments with industrial multicore designs show that the simulation time with hybrid prototyping grows only linearly with the number of cores and the inter-core communication traffic, while providing 100% accuracy.
Keywords - Embedded systems; Validation; Multicore design; Virtual prototyping; FPGA prototyping


11.5: Energy Challenges for Multi-Core and NoC Architectures

Moderators: Alberto Garcia-Ortiz - University of Bremen, DE; Domenik Helms - OFFIS, DE
PDF icon Communication and Migration Energy Aware Design Space Exploration for Multicore Systems with Intermittent Faults [p. 1631]
Anup Das, Akash Kumar and Bharadwaj Veeravalli

Shrinking transistor geometries, aggressive voltage scaling, and higher operating frequencies have negatively impacted the dependability of embedded multicore systems. Most existing research on fault tolerance has focused on transient and permanent faults of cores. Intermittent faults are a separate class of defects, resulting from on-chip temperature, pressure, and voltage variations and lasting from a few cycles to several seconds or more. Cores impacted by intermittent faults are suspended during these cycles but come back alive when conditions become favorable. This paper proposes a technique to model the availability of multiprocessor systems-on-chip (MPSoCs) with intermittent and repairable device defects. The model is based on a Markov chain with a stochastic fault distribution and can be applied even to permanent faults. Based on this model, a design space pruning technique is proposed to select a set of task mappings (with variable resource usage) that minimizes the task communication energy while satisfying the MPSoC availability constraint. Moreover, the task migration overhead is also minimized, which is an important consideration for frequently occurring intermittent and temperature-related faults, where prolonged system downtime during task re-mapping is not desired. Experiments conducted with real-life and synthetic application task graphs demonstrate that the proposed technique reduces communication energy by 30% and migration overhead by 50% compared to existing approaches.
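
For intuition, a two-state Markov model of a repairable core gives a closed-form steady-state availability, from which a k-out-of-n system availability follows; the rates and the k-out-of-n structure below are assumptions for illustration, not the paper's full model.

from math import comb

# Two-state Markov chain per core: "up" fails at rate lam (intermittent
# fault strikes), "down" recovers at rate mu.  Steady-state availability
# of this chain is mu / (lam + mu).

def core_availability(lam, mu):
    return mu / (lam + mu)

def system_availability(n, k, a):
    """Probability that at least k of n independent cores are up."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

a = core_availability(lam=0.01, mu=1.0)   # brief, frequent intermittents
print(round(a, 4))                        # 0.9901 per core
print(round(system_availability(n=8, k=6, a=a), 6))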

PDF icon 40.4fJ/bit/mm Low-Swing On-Chip Signaling with Self-Resetting Logic Repeaters Embedded within a Mesh NoC in 45nm SOI CMOS [p. 1637]
Sunghyun Park, Masood Qazi, Li-Shiuan Peh and Anantha P. Chandrakasan

Mesh NoCs are the most widely used fabric in high-performance many-core chips today. They are, however, becoming increasingly power-constrained under the higher on-chip bandwidth requirements of high-performance SoCs. In particular, the physical datapath of a mesh NoC consumes significant energy. Low-swing signaling circuit techniques can substantially reduce the NoC datapath energy, but existing low-swing circuits involve huge area footprints, unreliable signaling, or considerable system overheads such as an additional supply voltage, so embedding them into a mesh datapath is not attractive. In this paper, we propose a novel low-swing signaling circuit, a self-resetting logic repeater (SRLR), to meet these design challenges. The SRLR enables single-ended low-swing pulses to be asynchronously repeated and hence consumes less energy than differential, clocked low-swing signaling. To mitigate global process variations while delivering high energy efficiency, three circuit techniques are incorporated. Fabricated in 45nm SOI CMOS, our 10mm SRLR-based low-swing datapath achieves 6.83Gb/s/μm bandwidth density with 40.4fJ/bit/mm energy at a 4.1Gb/s data rate at 0.8V.

PDF icon 3D Reconfigurable Power Switch Network for Demand-Supply Matching between Multi-output Power Converters and Many-core Microprocessors [p. 1643]
Kanwen Wang, Hao Yu, Benfei Wang and Chun Zhang

A 3D reconfigurable power switch network is introduced to optimally provide demand-supply matching between on-chip multi-output power converters and many-core microprocessors. For effective DVFS power management of many cores by area-efficient on-chip power converters, the reconfigurable power switch network supports space- and time-multiplexed access between power converters and cores. Integer linear programming is deployed to find a space-time multiplexing configuration that matches supply and demand with balanced utilization. The overall power-management system is verified in SystemC-AMS based models. Experimental results show that the proposed design achieves 35.36% power savings on average compared to a design without the proposed power management.

PDF icon Thermal-Aware Datapath Merging for Coarse-Grained Reconfigurable Processors [p. 1649]
Sotirios Xydis, Gianluca Palermo and Cristina Silvano

The increased power densities of deep-submicron process technologies have made on-chip temperature a critical design issue for high-performance integrated circuits. In this paper, we address the datapath merging problem faced during the design of coarse-grained reconfigurable processors from a thermal-aware perspective. Assuming a reconfigurable processor able to execute a sequence of datapath configurations, we formulate and efficiently solve the thermal-aware datapath merging problem as a minimum-cost network flow. In addition, we integrate floorplan awareness of the underlying reconfigurable processor, guiding the merging decisions to also account for the effects of heat diffusion. Extensive experimentation with different configuration scenarios, technology nodes, and clock frequencies showed that the proposed thermal-aware methodology delivers peak temperature reductions of up to 8.27 K and achieves better temperature flattening in comparison to a low-power but thermal-unaware approach.


11.6: Modelling and Design for Signal and Power Integrity

Moderators: Stefano Grivet-Talocia - Politecnico di Torino, IT; Piero Triverio - University of Toronto, CA
PDF icon Placement Optimization of Power Supply Pads Based on Locality [p. 1655]
Pingqiang Zhou, Vivek Mishra and Sachin S. Sapatnekar

This paper presents an efficient algorithm for the placement of power supply pads in flip-chip packaging for high-performance VLSI circuits. The placement problem is formulated as a mixed-integer linear program (MILP), subject to constraints on the mean time to failure (MTTF) of the pads and the voltage drop in the power grid. To improve the performance of the optimizer, the pad placement problem is solved based on the divide-and-conquer principle, and the locality properties of the power grid are exploited by modeling distant nodes and sources coarsely, following the coarsening stage in a multigrid-like approach. An accurate electromigration (EM) model that captures current crowding and Joule heating effects is developed and integrated with our C4 placement approach. The effectiveness of the proposed approach is demonstrated on several designs adapted from publicly released benchmarks.
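
A drastically simplified version of such an MILP can be written with the pulp package (an assumed tool here); a distance-based covering constraint stands in for the paper's voltage-drop and MTTF constraints.

from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

# Toy MILP in the spirit of the formulation: choose the fewest pad sites
# such that every grid node has a pad within a given radius.  The 4x4
# candidate grid, node positions, and radius are invented for the example.

sites = [(x, y) for x in range(4) for y in range(4)]        # candidate pads
nodes = [(x + 0.5, y + 0.5) for x in range(3) for y in range(3)]
RADIUS = 1.2

prob = LpProblem("pad_placement", LpMinimize)
pad = LpVariable.dicts("pad", sites, cat=LpBinary)
prob += lpSum(pad[s] for s in sites)                        # minimize pad count
for n in nodes:
    near = [s for s in sites
            if (s[0] - n[0])**2 + (s[1] - n[1])**2 <= RADIUS**2]
    prob += lpSum(pad[s] for s in near) >= 1                # coverage proxy

prob.solve()
print([s for s in sites if value(pad[s]) > 0.5])            # chosen pad sites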

PDF icon GPU-Friendly Floating Random Walk Algorithm for Capacitance Extraction of VLSI Interconnects [p. 1661]
Kuangya Zhai, Wenjian Yu and Hao Zhuang

The floating random walk (FRW) algorithm is an important field-solver algorithm for capacitance extraction, with several merits compared to other boundary element method (BEM) based algorithms. In this paper, the FRW algorithm is accelerated with modern graphics processing units (GPUs). We propose an iterative GPU-based FRW algorithm flow and a technique using an inverse cumulative probability array (ICPA) to reduce the divergence among walks and the global-memory accesses. A variant FRW scheme is proposed to exploit the benefits of the ICPA, accelerating the extraction of multi-dielectric structures. A technique for extracting multiple nets concurrently is also discussed. Numerical results show that our GPU-based FRW brings over 20X speedup for various test cases with a 0.5% convergence criterion over the CPU counterpart. For the extraction of multiple nets, our GPU-based FRW outperforms the CPU counterpart by up to 59X.
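
The principle behind FRW can be shown with a "walk on grid" Monte Carlo estimate of a Laplace potential: a walker started at an interior point reports the potential of whatever boundary it first hits. The grid, boundary voltages, and walk count below are illustrative, and the sketch omits the maximal-cube transitions and GPU techniques that are the paper's contribution.

import random

# Bare-bones random-walk solver for a Laplace problem: the potential at an
# interior grid point equals the expected boundary potential seen by a
# random walker -- the same principle the FRW capacitance solver exploits.

def walk_potential(start, size, boundary_v, walks=20_000):
    total = 0.0
    for _ in range(walks):
        x, y = start
        while 0 < x < size and 0 < y < size:   # wander until a wall is hit
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_v(x, y)
    return total / walks

# Square domain: left wall at 1 V, the other three walls grounded.
size = 8
v = lambda x, y: 1.0 if x == 0 else 0.0
print(walk_potential((4, 4), size, v))         # approx 0.25 by symmetry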

PDF icon Periodic Jitter and Bounded Uncorrelated Jitter Decomposition Using Incoherent Undersampling [p. 1667]
Nicholas L. Tzou, Debesh Bhatta, Sen-Wen Hsiao and Abhijit Chatterjee

Jitter measurement is an essential part of testing high-speed digital I/O and clock distribution networks. Precise jitter characterization of signals at critical internal nodes provides valuable information for hardware fault diagnosis and next-generation design. Recently, incoherent undersampling has been proposed as a low-cost solution for signal-integrity characterization at high data rates. Incoherent undersampling drastically reduces the sampling rate compared to Nyquist-rate sampling without relying on the availability of a data-synchronous clock. In this paper, we propose a jitter decomposition and characterization method based on incoherent undersampling. Associated fundamental-period estimation techniques, along with properties of incoherent undersampling, are used to isolate the effects of periodic jitter and periodic crosstalk jitter. Mathematical analysis and hardware experiments using commercial off-the-shelf components are performed to prove the viability of the proposed method.
Keywords - Incoherent Undersampling; Jitter Separation; Periodic Jitter; Bounded Uncorrelated Jitter; Crosstalk Jitter

PDF icon Crosstalk Avoidance Codes for 3D VLSI [p. 1673]
Rajeev Kumar and Sunil P. Khatri

In 3D VLSI, through-silicon vias (TSVs) are relatively large and closely spaced. This results in a situation in which noise on one or more TSVs may deteriorate the delay and signal integrity of neighboring TSVs. In this paper, we first quantify the parasitics of contemporary TSVs and then develop a classification of crosstalk sequences as 0C, 1C, ..., 8C sequences. Next, we present inductive approaches to quantify the exact overhead of 8C, 6C, and 4C crosstalk avoidance codes (CACs) for a 3xn mesh arrangement of TSVs. These overheads for different CACs for a 3xn mesh arrangement of TSVs are used to calculate lower bounds on the corresponding overheads for an nxn mesh arrangement of TSVs. We also discuss an efficient way to implement the coding and decoding (CODEC) circuitry for limiting the maximum crosstalk to 6C. Our experimental results show that, for a TSV mesh arrangement driven by inverters implemented in a 22nm technology, the coding-based approaches yield improvements in line with the theoretical predictions.

PDF icon Large-Scale Flip-Chip Power Grid Reduction with Geometric Templates [p. 1679]
Zhuo Feng

Realizable power grid reduction is key to the efficient design and verification of today's large-scale power delivery networks (PDNs). Existing state-of-the-art realizable reduction techniques for interconnect circuits, such as the TICER algorithm, are not well suited for effective power grid reduction, since reducing mesh-structured power grids with TICER's nodal elimination scheme may introduce an excessive number of new edges in the reduced grids, which can be even harder to solve than the original grid due to the drastically increased sparse-matrix density. In this work, we present a novel geometric-template-based reduction technique for reducing large-scale flip-chip power grids. Our method first creates a geometric template according to the original power grid topology and then performs novel iterative grid corrections to improve accuracy by matching the electrical behavior of the reduced template grid with that of the original grid. Our experimental results show that the proposed reduction method can reduce industrial power grid designs by up to 95% with very satisfactory solution quality.
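
The densification problem attributed to TICER is easy to see from the star-mesh elimination rule itself; a minimal sketch, using a four-neighbor mesh node as the example, is shown below.

from itertools import combinations

# Star-mesh (TICER-style) elimination of one node: removing node k inserts,
# between every pair of its neighbors i and j, a conductance
# g_ik * g_jk / G_k, where G_k is the total conductance at k.  A node with
# d neighbors thus turns d edges into d*(d-1)/2 -- the densification the
# paper argues against for mesh-structured grids.

def eliminate_node(g, k):
    """g: dict of frozenset({a, b}) -> conductance (Siemens)."""
    nbrs = {next(iter(e - {k})): c for e, c in g.items() if k in e}
    gk = sum(nbrs.values())
    reduced = {e: c for e, c in g.items() if k not in e}
    for i, j in combinations(nbrs, 2):
        e = frozenset({i, j})
        reduced[e] = reduced.get(e, 0.0) + nbrs[i] * nbrs[j] / gk
    return reduced

# An interior mesh node k with four 1-S neighbors: 4 edges become 6.
grid = {frozenset({"k", n}): 1.0 for n in "abcd"}
print(eliminate_node(grid, "k"))
# six pairwise edges of 0.25 S each among a, b, c, d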


11.7: Powerful Aging

Moderators: Jose Pineda de Gyvez - NXP Semiconductors, NL; Mehdi Tahoori - Karlsruhe Institute of Technology, DE
PDF icon Impact of Adaptive Voltage Scaling on Aging-Aware Signoff [p. 1683]
Tuck-Boon Chan, Wei-Ting Jonas Chan and Andrew B. Kahng

Transistor aging due to bias temperature instability (BTI) is a major reliability concern in sub-32nm technologies. Aging decreases the performance of digital circuits over the entire IC lifetime. To compensate for aging, designs now typically apply adaptive voltage scaling (AVS) to mitigate performance degradation by elevating the supply voltage. Varying the supply voltage of a circuit using AVS also causes the BTI degradation to vary over the lifetime. This presents a new challenge for margin reduction in conventional signoff methodology, which characterizes timing libraries based on transistor models with pre-calculated BTI degradations for a given IC lifetime. Many works have separately addressed predictive models of BTI and the analysis of AVS, but there is no published work that considers BTI-aware signoff accounting for the use of AVS during the IC lifetime. This motivates us to study how the presence of AVS should affect aging-aware signoff. In this paper, we first simulate and analyze circuit performance degradation due to BTI in the presence of AVS. Based on our observations, we propose a rule of thumb for chip designers to characterize an aging-derated standard-cell timing library that accounts for the impact of AVS. According to our experimental results, this aging-aware signoff approach avoids both overestimation and underestimation of aging - either of which results in a power or area penalty - in AVS-enabled systems.

PDF icon A Parallel Fast Transform-Based Preconditioning Approach for Electrical-Thermal Co- Simulation of Power Delivery Networks [p. 1689]
Konstantis Daloukas, Alexia Marnari, Nestor Evmorfopoulos, Panagiota Tsompanopoulou and George I. Stamoulis

Efficient analysis of massive on-chip power delivery networks is among the most challenging problems facing the EDA industry today. Due to the Joule heating effect and the temperature dependence of resistivity, temperature is one of the most important factors that affect IR drop and must be taken into account in power grid analysis. However, the sheer size of modern power delivery networks (comprising several thousands or millions of nodes) usually forces designers to neglect thermal effects during IR drop analysis in order to simplify and accelerate simulation. As a result, the absence of accurate estimates of the Joule heating effect on IR drop analysis introduces significant uncertainty in the evaluation of circuit functionality. This work presents a new approach for fast electrical-thermal co-simulation of large-scale power grids found in contemporary nanometer-scale ICs. A state-of-the-art iterative method is combined with an efficient and extremely parallel preconditioning mechanism, which enables harnessing the computational resources of massively parallel architectures, such as graphics processing units (GPUs). Experimental results demonstrate that the proposed method achieves a speedup of 66.1X for a 3.1M-node design over a state-of-the-art direct method and a speedup of 22.2X for a 20.9M-node design over a state-of-the-art iterative method when GPUs are utilized.
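
As a point of reference, the structure of a preconditioned iterative solve can be sketched with conjugate gradient and a Jacobi (diagonal) preconditioner, one of the simplest parallel-friendly choices; the paper's fast-transform preconditioner is more sophisticated than this.

import numpy as np

# Preconditioned conjugate gradient with a Jacobi preconditioner:
# applying M^-1 = 1/diag(A) is elementwise and thus trivially parallel,
# which is why preconditioners of this flavor map well to GPUs.

def pcg(A, b, tol=1e-8, max_iter=1000):
    d = np.diag(A).copy()            # Jacobi preconditioner M = diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = r / d                        # apply M^-1
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / d
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Tiny SPD system standing in for a power-grid conductance matrix.
A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
print(pcg(A, b))                     # matches np.linalg.solve(A, b)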

PDF icon Hierarchically Focused Guardbanding: An Adaptive Approach to Mitigate PVT Variations and Aging [p. 1695]
Abbas Rahimi, Luca Benini and Rajesh K. Gupta

This paper proposes a new model of functional units for variation-induced timing errors due to PVT variations and device Aging (PVTA). The model takes into account PVTA parameter variations, clock frequency, and the physical details of Placed-and-Routed (P&R) functional units in a 45nm TSMC analysis flow. Using this model and PVTA monitoring circuits, we propose Hierarchically Focused Guardbanding (HFG) as a method to adaptively mitigate PVTA variations. We demonstrate the effectiveness of HFG on a GPU architecture at two granularities of observation and adaptation: (i) fine-grained instruction level; and (ii) coarse-grained kernel level. Using coarse-grained PVTA monitors with kernel-level adaptation, throughput increases by 70% on average. By comparison, instruction-by-instruction monitoring and adaptation enhances throughput by a factor of 1.8x-2.1x, depending on the configuration of the PVTA monitors and the type of instructions executed in the kernels.
Keywords - adaptive guardbanding; PVT variation; aging; GPU;

PDF icon Effective Power Network Prototyping via Statistical-Based Clustering and Sequential Linear Programming [p. 1701]
Sean Shih-Ying Liu, Chieh-Jui Lee, Chuan-Chia Huang, Hung-Ming Chen, Chang-Tzu Lin and Chia-Hsin Lee

In this paper, we propose a framework that automatically generates a power network for a given placed design and verifies it with a commercial tool to ensure that there are no IR-drop or electromigration (EM) violations. Our framework integrates the synthesis, optimization, and analysis of power networks. A deterministic method is proposed to decide the number and locations of power stripes based on clustering analysis. After an initial power network is synthesized, we propose a sensitivity matrix Gs that captures the correlation between updates in stripe resistance and nodal voltage. An optimization scheme based on Sequential Linear Programming (SLP) is applied to iteratively adjust the power network to satisfy a given IR-drop constraint. The proposed framework constantly updates the voltage distribution in response to incremental changes in the power network. To accurately capture the voltage distribution on a given chip, our power network model includes every power stripe and via resistance on each layer. Experimental results demonstrate that our power network analysis can accurately capture the voltage distribution on a given chip and effectively minimize power network area. The proposed methodology is evaluated on two real designs in TSMC 90nm and UMC 90nm technologies, respectively, and achieves a 9%-32% reduction in power network area compared with the results from a modern commercial PG synthesizer.

PDF icon A Network-Flow Based Algorithm for Power Density Mitigation at Post-Placement Stage [p. 1707]
Sean Shih-Ying Liu, Ren-Guo Luo and Hung-Ming Chen

In this paper, we propose a power density mitigation algorithm for the post-placement stage. Our framework first identifies clusters of bins with high temperature, then propagates power density away from high-temperature regions by balancing regional power density. The problem of balancing regional power density is modeled as a supply-demand problem, and a solution is obtained with minimal displacement of cells. An analytical temperature profiling algorithm is tightly integrated within the framework to constantly update the temperature profile in response to incremental perturbations of the placement. Our proposed approach can effectively reduce the maximum temperature compared to previous works on temperature mitigation.

PDF icon An Efficient Wirelength Model for Analytical Placement [p. 1711]
B.N.B. Ray and Shankar Balachandran

Smooth approximations to half-perimeter wirelength are being actively investigated because of the recent increase in interest in analytical placement. It is necessary not just to provide smooth approximations but also to provide error analysis and convergence properties of these approximations. We present a new approximation scheme that uses a non-recursive approximation to the max function. We also show its convergence properties and error bounds. The accuracy of our proposed scheme is better than those of the popular Logarithm-Sum-Exponential (LSE) wirelength model [7] and the recently proposed Weighted Average (WA) wirelength model [3]. We also experimentally validate the comparison using global and detailed placements produced by NTU Placer [1] on the ISPD 2004 benchmark suite. The experiments on the benchmarks confirm that the error bounds of our model are lower, with an average of 4% error in total wirelength.
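
For context, the LSE model used as a baseline approximates the span max(x) - min(x) of a net's pin coordinates with two log-sum-exp terms; a numerically stable sketch for one net and one dimension is shown below (pin coordinates invented). The smooth max overestimates the true max by at most gamma*log(n), so the total overestimate is bounded by 2*gamma*log(n).

import numpy as np

# LSE wirelength for one net, one dimension: smooth-max(x) + smooth-max(-x)
# approximates max(x) - min(x).  gamma trades smoothness for accuracy.

def lse_wirelength(x, gamma):
    x = np.asarray(x, dtype=float)
    def smooth_max(v):
        m = np.max(v)                      # max-shift for numerical stability
        return m + gamma * np.log(np.sum(np.exp((v - m) / gamma)))
    return smooth_max(x) + smooth_max(-x)

pins = [2.0, 5.0, 9.0, 4.0]                # exact span: 9 - 2 = 7
for gamma in (1.0, 0.1, 0.01):
    print(gamma, lse_wirelength(pins, gamma))   # tends to 7.0 as gamma -> 0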


11.8: EMBEDDED TUTORIAL: Advances in Asynchronous Logic: From Principles to GALS & NoC, Recent Industry Applications, and Commercial CAD Tools

Organizer: Pascal Vivet - CEA-LETI, FR
Moderators: Robin Wilson - STMicroelectronics, FR; Edith Beigné - CEA-LETI, FR
PDF icon Advances in Asynchronous Logic: From Principles to GALS & NoC, Recent Industry Applications, and Commercial CAD Tools [p. 1715]
Alex Yakovlev, Pascal Vivet, Marc Renaudin

The growing variability and complexity of advanced CMOS technologies make the physical design of clocked logic in large Systems-on-Chip more and more challenging. Asynchronous logic has been studied for many years and has become an attractive solution for a broad range of applications, from massively parallel multimedia systems to systems with ultra-low-power and low-noise constraints, such as cryptography, energy-autonomous systems, and sensor-network nodes. The objective of this embedded tutorial is to give a comprehensive and recent overview of asynchronous logic. The tutorial will cover the basic principles and advantages of asynchronous logic, offer some insights into new research challenges, and present the GALS scheme as an intermediate design style, with recent results in asynchronous Networks-on-Chip for future many-core architectures. Regarding industrial acceptance, recent asynchronous logic applications within the microelectronics industry will be presented, with a main focus on the commercial CAD tools available today.
Keywords - asynchronous design, handshake circuits, GALS, CAD flow


12.1: HOT TOPIC: Internet of Energy - Connecting Smart Mobility in the Cloud

Organizer: Ovidiu Vermesan - SINTEF, NO
Moderators: TBA
PDF icon Interactions of Large Scale EV Mobility and Virtual Power Plants [p. 1725]
R. Mock, J. Reinschke, T. S. Cinotti, L. Bononi

The complex interactions between large-scale electric mobility and the electric distribution grid constitute a considerable challenge regarding the feasibility, efficiency, and stability of smart electric distribution grids. On the one hand, the steadily increasing share of decentralized power generation from renewable sources entails a move away from electro-mechanical generators with huge inertia towards systems with distributed small- and medium-scale generators that are coupled to the grid via inverters. On the other hand, large-scale electric mobility interacting with such a decentralized grid will have a huge impact on the power generation, storage potential, and consumption patterns of the grid. Grid infrastructure simulations that take into account the details of these interactions, and that are backed by comprehensive demonstrators, may help to shed light on crucial aspects of both the energy and the information exchange between the traffic and electric-energy infrastructure regimes. This will be highlighted through selected topics that illustrate the scope and the challenges inherent in this area of simulation.

PDF icon Innovative Energy Storage Solutions for Future Electromobility in Smart Cities [p. 1730]
Kevin Green, Salvador Rodriguez González, Ruud Wijtvliet

The stochastic nature of renewable energy sources will no doubt place strain upon the electrical distribution networks as power generation is converted to environmentally friendly methods. The use of energy storage technologies could significantly improve the usability of these energy sources. A domestic installation, based on a 4 kWh energy storage unit, is under development and modeling shows that the proposed unit would improve the energy autonomy of a household.
Keywords - Battery energy storage, photo-voltaics, smart grid.

PDF icon Automotive Ethernet: In-vehicle Networking and Smart Mobility [p. 1735]
Peter Hank, Steffen Müller, Ovidiu Vermesan, Jeroen Van Den Keybus

This paper discusses novel communication network topologies and components and describes an evolutionary path for bringing Ethernet into automotive applications, with a focus on electric mobility. For next-generation in-vehicle networking, the automotive industry has identified Ethernet as a promising candidate besides CAN and FlexRay. Ethernet is an IEEE standard and is broadly used in the consumer and industrial domains. It will bring a number of changes to the design and management of in-vehicle networks and provides significant re-use of components, software, and tools. Ethernet is intended to connect high-speed sub-systems inside the vehicle, such as Advanced Driver Assistance Systems (ADAS), navigation and positioning, multimedia, and connectivity systems. For hybrid (HEVs) or electric vehicles (EVs), Ethernet will be a powerful part of the communication architecture layer that enables the link between the vehicle electronics and the Internet, where the vehicle is part of a typical Internet of Things (IoT) application. Using Ethernet for vehicle connectivity will effectively manage the huge amount of data to be transferred between the outside world and the vehicle through vehicle-to-x (V2V and V2I or V2I+I) communication systems and cloud-based services for advanced energy management solutions. Ethernet is an enabling technology for introducing advanced features into the automotive domain and needs further optimization in terms of scalability, cost, power, and electrical robustness in order to be adopted and widely used by the industry.
Keywords - Ethernet; automotive; electric vehicle; smart grid; EV communication architecture; domain based communication; in-vehicle networking; vehicle network topology

PDF icon Smart, Connected and Mobile: Architecting Future Electric Mobility Ecosystems [p. 1740]
Ovidiu Vermesan, Lars-Cyril Blystad, Reiner John, Peter Hank, Roy Bahr, Alessandro Moscatelli

This paper provides an overview of facts and trends towards the introduction of the connected electric vehicle (EV) and discusses how, and to what extent, electric mobility will be integrated into the Internet of Energy (IoE) and smart grid infrastructure to provide novel energy management solutions. In this context, EVs are evolving from mere means of transportation to advanced mobile connectivity ecosystem platforms.
Keywords - electric vehicle; Internet of Energy; in-vehicle communication; telematics; connected vehicle

PDF icon e-Mobility - The Next Frontier for Automotive Industry [p. 1745]
Roberto Zafalon, Giovanni Coppola, Ovidiu Vermesan

This paper provides an overview of the introduction of electric vehicles (EVs) and discusses how electric mobility will influence developments in the automotive industry by integrating EVs into the Internet of Energy (IoE) and smart grid infrastructure, providing novel business models and requiring new semiconductor devices and modules. In this context, EVs are evolving from mere means of transportation to advanced mobile connectivity ecosystem platforms.
Keywords - electric vehicle; Internet of Energy; in-vehicle communication; telematics; connected vehicle

PDF icon Semiconductor Technologies for Smart Mobility Management [p. 1749]
Reiner John, Martin Schulz, Ovidiu Vermesan, Kai Kriegel

This paper provides an overview of the latest advances in semiconductor devices for the electronic modules of EVs and HEVs, for charging stations, and for the interface with the smart grid infrastructure. The design choices are influenced by the power levels of the different applications.
Keywords - electric vehicle; Internet of Energy; semiconductor technologies; MOS; IGBT


12.2: Methodologies to Improve Yield, Reliability and Security in Embedded Systems

Moderators: Luciano Lavagno - Politecnico di Torino, IT; Jürgen Teich - University of Erlangen-Nuremberg, DE
PDF icon A New Paradigm for Trading Off Yield, Area and Performance to Enhance Performance per Wafer [p. 1753]
Yue Gao, Melvin A. Breuer and Yanzhi Wang

In this paper we outline a novel way to 1) predict the revenue associated with a wafer, 2) maximize the projected revenue through unconventional yield enhancement techniques, and 3) produce dice from the same mask that may have different performances and selling prices. Unlike speed binning, such heterogeneity is intentional by design. To achieve these goals we overturn the traditional concepts of redundancy and present a novel design flow for yield enhancement called "Reduced Redundancy Insertion", in which spares may have smaller area and lower performance than the modules they back up. We develop a model for the revenue associated with the new design methodology that integrates system configuration and leverages yield, area and performance. The primary metric used in this model is termed "Expected Performance per Area", a measure that can be reliably estimated for different system architectures and maximized using the algorithms proposed in this paper. We present theoretical models and case studies that characterize our designs, and experimental results that validate our predictions. We show that using Reduced Redundancy can improve wafer revenue by 10-30%.
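
For illustration, the following minimal sketch computes an expected-performance-per-area figure for a die with a reduced-redundancy spare. It assumes a simple Poisson defect-yield model; all numbers and names (DEFECT_DENSITY, the area and performance values) are invented for the example and are not taken from the paper.

```python
# Hypothetical illustration of the "Expected Performance per Area" idea:
# compare a baseline die against a variant that adds a smaller, slower
# spare, under a simple Poisson defect-yield model (assumed).
import math

DEFECT_DENSITY = 0.1  # defects per mm^2 (assumed)

def poisson_yield(area_mm2):
    """Probability that a block of the given area is defect-free."""
    return math.exp(-DEFECT_DENSITY * area_mm2)

def expected_perf_per_area(core_area, core_perf, spare_area=0.0, spare_perf=0.0):
    """E[performance] / total area for one core plus an optional spare.

    The die delivers core_perf if the core is good; otherwise it can
    still be sold at spare_perf if the (smaller, slower) spare is good.
    """
    y_core = poisson_yield(core_area)
    y_spare = poisson_yield(spare_area) if spare_area > 0 else 0.0
    expected_perf = y_core * core_perf + (1 - y_core) * y_spare * spare_perf
    return expected_perf / (core_area + spare_area)

# Baseline: full core only.  Variant: add a half-area spare at 70% speed.
print(expected_perf_per_area(10.0, 1.0))             # no redundancy
print(expected_perf_per_area(10.0, 1.0, 5.0, 0.7))   # reduced-redundancy spare
```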

PDF icon Leveraging Variable Function Resilience for Selective Software Reliability on Unreliable Hardware [p. 1759]
Semeen Rehman, Muhammad Shafique, Pau Vilimelis Aceituno, Florian Kriebel, Jian-Jia Chen and Jörg Henkel

State-of-the-art reliability optimization schemes deploy spatial or temporal redundancy for the complete functionality. This introduces significant performance/area overhead, which is often prohibitive within the stringent design constraints of embedded systems. This paper presents a novel scheme for selective software reliability optimization under a user-provided tolerable performance overhead constraint. To enable this scheme, statistical models for quantifying software resilience and error-masking properties at the function and instruction level are proposed. These models enable a whole new range of reliability optimizations. Given a tolerable performance overhead, our scheme selectively protects the reliability-wise most important instructions based on their masking probability, vulnerability, and redundancy overhead. Compared to the state of the art [7], our scheme provides a 4.84X reliability improvement at a 50% tolerable performance overhead.
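
The following sketch illustrates the selection principle in its simplest form: greedily protect the instructions with the best vulnerability reduction per cycle of overhead until the budget is exhausted. The Instr fields, the benefit formula, and all numbers are my own simplification for illustration, not the paper's actual statistical models.

```python
# Hypothetical sketch of budget-constrained selective protection.
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    vulnerability: float   # estimated failure contribution if unprotected
    masking_prob: float    # probability an error is masked anyway
    overhead_cycles: int   # cost of duplicating/checking this instruction

def select_for_protection(instrs, budget_cycles):
    """Return the set of instructions to protect within the budget."""
    def benefit(i):
        # Errors that would be masked anyway gain nothing from protection.
        return i.vulnerability * (1.0 - i.masking_prob) / i.overhead_cycles
    chosen = []
    for i in sorted(instrs, key=benefit, reverse=True):
        if i.overhead_cycles <= budget_cycles:
            chosen.append(i)
            budget_cycles -= i.overhead_cycles
    return chosen

instrs = [Instr("ld1", 0.9, 0.1, 4), Instr("add2", 0.2, 0.8, 1),
          Instr("st3", 0.7, 0.3, 3)]
print([i.name for i in select_for_protection(instrs, budget_cycles=5)])
```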

PDF icon Optimization of Secure Embedded Systems with Dynamic Task Sets [p. 1765]
Ke Jiang, Petru Eles and Zebo Peng

In this paper, we approach embedded systems design from a new angle that considers not only quality of service but also security as part of the design process. Moreover, we also take into consideration the dynamic aspect of modern embedded systems in which the number and nature of active tasks are variable during run-time. In this context, providing both high quality of service and guaranteeing the required level of security becomes a difficult problem. Therefore, we propose a novel secure embedded systems design framework that efficiently solves the problem of runtime quality optimization with security constraints. Experiments demonstrate the efficiency of our proposed techniques.


12.3: NoC Mapping and Synthesis

Moderators: Andreas Hansson - ARM, UK; Jaime Murillo - EPFL, CH
PDF icon Shared Memory Aware MPSoC Software Deployment [p. 1771]
Timo Schönwald, Alexander Viehl, Oliver Bringmann and Wolfgang Rosenstiel

In this paper we present a novel approach for mapping interconnected software components onto the cores of homogeneous MPSoC architectures. The analytic mapping process considers shared memory communication as well as the routing algorithm controlling packet-based communication. The software components are mapped under the constraints of avoiding communication conflicts as well as access conflicts to shared memory resources. The core of the elaborated approach is an algorithm for software mapping inspired by force-directed scheduling from high-level synthesis. Experimental results show that the presented approach increases overall system performance by 22% while reducing the average communication latency by 35%. To demonstrate the major advantages of the developed solution, we optimized an advanced driver assistance system on the Tilera TILEPro64 processor.
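
As a rough illustration of the force-directed idea (a heavy simplification of my own, not the paper's algorithm), the sketch below iteratively moves each component to the free core that minimizes the communication-weighted Manhattan distance to its partners; the grid size and traffic weights are invented.

```python
# Minimal force-directed-style mapping sketch (assumptions mine).
import itertools

GRID = 4                                    # 4x4 homogeneous MPSoC (assumed)
comm = {(0, 1): 8, (1, 2): 5, (0, 3): 2, (2, 3): 7}   # traffic weights (invented)

def dist(p, q):                             # hop distance under XY routing
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def force_directed_map(n_comp, iters=10):
    cores = list(itertools.product(range(GRID), range(GRID)))
    place = {c: cores[c] for c in range(n_comp)}       # trivial initial mapping
    for _ in range(iters):
        for c in range(n_comp):
            partners = [(w, place[b if a == c else a])
                        for (a, b), w in comm.items() if c in (a, b)]
            free = [p for p in cores
                    if p not in place.values() or p == place[c]]
            # move c to the core minimizing its communication "force"
            place[c] = min(free, key=lambda p: sum(w * dist(p, q)
                                                   for w, q in partners))
    return place

print(force_directed_map(4))
```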

PDF icon Fast and Optimized Task Allocation Method for Low Vertical Link Density 3-Dimensional Networks-on-Chip Based Many Core Systems [p. 1777]
Haoyuan Ying, Thomas Hollstein and Klaus Hofmann

The advantages of moving from 2-Dimensional Networks-on-Chip (NoCs) to 3-Dimensional NoCs for any application must be justified by the improvements in performance, power, latency and overall system cost, especially the cost of Through-Silicon Vias (TSVs). The trade-off between the number of TSVs and 3D NoC system performance has become one of the most critical design issues. In this paper, we present a fast and optimized task allocation method for many-core systems based on 3D NoCs with low vertical link density (TSV count); compared with classic methods such as Genetic Algorithms (GA) and Simulated Annealing (SA), our method saves a considerable amount of design time. We simulate the designs achieved by our method on several state-of-the-art benchmarks and on the generic scalable pseudo application (GSPA) at different network scales; compared with the designs achieved by the GA and SA methods, our technique achieves better performance and lower cost. All experiments were performed in the GSNOC framework (written in SystemC-RTL), which provides cycle accuracy and good flexibility.

PDF icon A Spectral Clustering Approach to Application-Specific Network-on-Chip Synthesis [p. 1783]
Vladimir Todorov, Daniel Mueller-Gritschneder, Helmut Reinig and Ulf Schlichtmann

Modern System-on-Chip (SoC) design relies heavily on efficient interconnects such as Networks-on-Chip (NoCs). They provide an effective, flexible and cost-efficient means of communication between the individual processing elements of the SoC. Therefore, the choice of topology and the design of the NoC itself play a crucial role in the performance of the system. Depending on the field of application, standard topologies like meshes, fat-trees, and tori may be suboptimal in terms of power consumption, latency and area. This calls for a custom topology design methodology based on the requirements imposed by the application, function and use-cases of the SoC in question. This work proposes a fast approach that uses spectral clustering and cluster ensembles to partition the system using normalized cuts and insert the necessary routers. Then, using delay-constrained minimum spanning trees, links between the individual routers are created such that any present latency constraints are satisfied at minimum cost. Results from applying the methodology to a smartphone SoC are presented.
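
A compact illustration of the normalized-cut spectral clustering step (the setup, graph, and naive k-means are my own toy instance, not the authors' tool): cores that exchange heavy traffic end up in the same partition and would share a router in the synthesized topology.

```python
# Normalized-cut spectral partitioning of a communication graph (sketch).
import numpy as np

def spectral_partition(W, k):
    """Partition nodes of weighted adjacency matrix W into k clusters."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt   # normalized Laplacian
    _, eigvecs = np.linalg.eigh(L)                     # ascending eigenvalues
    X = eigvecs[:, :k]                                 # spectral embedding
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # row-normalize
    rng = np.random.default_rng(0)                     # naive k-means below
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(50):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Two traffic communities: cores 0-2 and cores 3-5 (invented weights).
W = np.array([[0, 9, 8, 1, 0, 0], [9, 0, 7, 0, 1, 0], [8, 7, 0, 0, 0, 1],
              [1, 0, 0, 0, 9, 8], [0, 1, 0, 9, 0, 7], [0, 0, 1, 8, 7, 0]], float)
print(spectral_partition(W, k=2))   # e.g. [0 0 0 1 1 1]
```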


12.4: Emerging Logic

Moderators: Aida Todri-Sanial - CNRS-LIRMM, FR; Marco Ottavi - University of Rome "Tor Vergata", IT
PDF icon A SPICE-Compatible Model of Graphene Nano-Ribbon Field-Effect Transistors Enabling Circuit-Level Delay and Power Analysis under Process Variation [p. 1789]
Ying-Yu Chen, Artem Rogachev, Amit Sangai, Giuseppe Iannaccone, Gianluca Fiori and Deming Chen

This paper presents the first parameterized, SPICE-compatible compact model of a Graphene Nano-Ribbon Field-Effect Transistor (GNRFET) with doped reservoirs that also supports process variation. The current and charge models closely match numerical TCAD simulations. In addition, process variations in transistor dimensions, edge roughness, and reservoir doping level are accurately modeled. Our model provides a means to analyze the delay and power of graphene-based circuits under process variation, and offers design and fabrication insights for future graphene circuits. We show that edge roughness severely degrades the advantages of GNRFET circuits; however, the GNRFET remains a good candidate for low-power applications.

PDF icon Systematic Design of Nanomagnet Logic Circuits [p. 1795]
Indranil Palit, X. Sharon Hu, Joseph Nahas and Michael Niemier

Nanomagnet Logic (NML) is an emerging device architecture that performs logic operations through fringing-field interactions between nano-scale magnets. The design space for NML circuits is large, and so far there exists no systematic approach for determining the parameter values (e.g., device-to-device spacings, clocking field strength, etc.) that generate a predictable design solution. This paper presents a formal methodology for designing NML circuits that marshals the design parameters to generate a layout guaranteed to evolve correctly in time at 0K. The approach is further augmented to identify functional design targets when considering the thermal noise associated with higher temperatures. The approach is applied to identify layouts for a 2-input AND gate, a "corner turn," and a 3-input majority gate. Layouts are verified through simulations both at 0K and at room temperature (300K).

PDF icon Defect-Tolerant Logic Hardening for Crossbar-based Nanosystems [p. 1801]
Yehua Su and Wenjing Rao

Crossbar-based architectures are promising for future nanoelectronic systems. However, due to the inherent unreliability of nanoscale devices, the implementation of any logic function relies on aggressive defect-tolerant schemes applied at the post-manufacturing stage. Most such defect-tolerant approaches explore mapping choices between logic variables/products and crossbar vertical/horizontal wires. In this paper, we develop a new approach, namely fine-grained logic hardening, based on the idea of adding redundancy to a logic function so as to boost the success rate of logic implementation. We propose an analytical framework to evaluate and fine-tune the amount and location of the redundancy to be added for a given logic function. Furthermore, we devise a method to optimally harden the logic function so as to maximize its defect tolerance capability. Simulation results show that the proposed logic hardening scheme significantly boosts defect tolerance in terms of yield improvement, compared to mapping-only schemes with the same hardware cost.
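
A toy yield model (the survival rule and all parameters are my assumptions, not the paper's framework) shows why hardening helps: replicating a product term r times lets the function survive as long as one copy of every term lands on a defect-free horizontal wire.

```python
# Monte Carlo estimate of implementation yield under term replication.
import random

def hardened_yield(n_terms, r, p_defect, trials=50_000):
    """Success probability when each of n_terms products has r copies."""
    wins = 0
    for _ in range(trials):
        ok = all(any(random.random() > p_defect for _ in range(r))
                 for _ in range(n_terms))
        wins += ok
    return wins / trials

for r in (1, 2, 3):
    est = hardened_yield(n_terms=8, r=r, p_defect=0.1)
    exact = (1 - 0.1 ** r) ** 8        # closed form for this simple model
    print(f"r={r}: simulated {est:.3f}, exact {exact:.3f}")
```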

PDF icon On Reconfigurable Single-Electron Transistor Arrays Synthesis Using Reordering Techniques [p. 1807]
Chang-En Chiang, Li-Fu Tang, Chun-Yao Wang, Ching-Yi Huang, Yung-Chih Chen, Suman Datta and Vijaykrishnan Narayanan

Power consumption has become one of the primary challenges to sustaining Moore's law. Fortunately, the Single-Electron Transistor (SET) operating at room temperature has been demonstrated as a promising device for extending Moore's law due to its ultra-low power consumption during operation. An automated mapping approach for the SET architecture has recently been proposed to facilitate design realization. In this paper, we propose an enhanced approach consisting of variable reordering, product term reordering, and mapping constraint relaxation techniques to minimize the area of mapped SET arrays. The experimental results show that our enhanced approach, on average, saves 40% in area and 17% in mapping time compared to the state-of-the-art approach for a set of MCNC and IWLS 2005 benchmarks.


12.5: Emerging Technology Architectures for Energy-Efficient Memories

Moderators: Marisa López-Vallejo - Universidad Politecnica Madrid, ES; Naehyuck Chang - Seoul National University, KR
PDF icon D-MRAM Cache: Enhancing Energy Efficiency with 3T-1MTJ DRAM / MRAM Hybrid Memory [p. 1813]
Hiroki Noguchi, Kumiko Nomura, Keiko Abe, Shinobu Fujita, Eishi Arima, Kyundong Kim, Takashi Nakada, Shinobu Miwa and Hiroshi Nakamura

This paper proposes a non-volatile cache architecture utilizing a novel DRAM / MRAM cell-level hybrid structured memory (D-MRAM) that enables effective power reduction for high-performance mobile SoCs without area overhead. The key to reducing active power is an intermittent refresh process in DRAM mode. D-MRAM has the advantage of lower static power consumption than conventional SRAM, because there are no static leakage paths in the D-MRAM cell and no supply voltage is needed when cells are used in MRAM mode. Moreover, with advanced perpendicular magnetic tunnel junctions (p-MTJ), which decrease the write energy and latency without shortening retention time, D-MRAM can reduce power by replacing traditional SRAM caches. In a 65-nm CMOS technology, the access latencies of a 1MB memory macro are 2.2 ns / 1.5 ns for read / write in DRAM mode and 2.2 ns / 4.5 ns in MRAM mode, whereas those of SRAM are 1.17 ns. The SPEC CPU2006 benchmarks reveal that the energy per instruction (EPI) of the total cache memory can be dramatically reduced, by 71% on average, while the instructions-per-cycle (IPC) performance of the D-MRAM cache architecture degrades by only approximately 4% on average in spite of its latency overhead.

PDF icon Leveraging Sensitivity Analysis for Fast, Accurate Estimation of SRAM Dynamic Write Vmin [p. 1819]
James Boley, Vikas Chandra, Robert Aitken and Benton Calhoun

Circuit reliability in the presence of variability is a major concern for SRAM designers. With memory sizes ever increasing, Monte Carlo simulations have become too time-consuming for margining and yield evaluation. In addition, dynamic write-ability metrics have an advantage over static metrics because they take timing constraints into account; however, these metrics are much more expensive in terms of runtime. Statistical blockade is one method that reduces the number of simulations by filtering out non-tail samples, but the total number of simulations required still remains relatively large. In this paper, we present a method that uses sensitivity analysis to provide a total speedup of ~112X compared with recursive statistical blockade, with only a 3% average loss in accuracy. In addition, we show how this method can be used to calculate dynamic Vmin and to evaluate several write assist methods.
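
A highly simplified sketch of the underlying idea (the linear margin model, sensitivities, and all constants are mine, not the paper's flow): once the write margin is linearized around the nominal point via sensitivity analysis, the k-sigma worst case, and hence dynamic Vmin, follows in closed form instead of requiring a full Monte Carlo sweep per supply voltage.

```python
# Sensitivity-based Vmin estimate with a Monte Carlo cross-check.
import numpy as np

rng = np.random.default_rng(1)
N_TX = 6                            # transistors in the bit-cell
sens = rng.normal(0.0, 0.3, N_TX)   # dMargin/dVth sensitivities (assumed)

def margin(vdd, dvth):
    """Toy linear write-margin model; margin > 0 means the write succeeds."""
    return (vdd - 0.55) + sens @ dvth

# For a linear margin with i.i.d. N(0, sigma^2) Vth shifts, the k-sigma
# worst case is margin(vdd, 0) - k * sigma * ||sens||, giving Vmin directly:
sigma_vth, k = 0.03, 4.0
vmin = 0.55 + k * sigma_vth * np.linalg.norm(sens)
print(f"estimated 4-sigma dynamic write Vmin: {vmin:.3f} V")

# Brute-force Monte Carlo cross-check at that supply voltage:
x = rng.normal(0.0, sigma_vth, (200_000, N_TX))
print(f"MC failure rate at Vmin: {np.mean(margin(vmin, x.T) <= 0):.1e}")
```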

PDF icon DWM-TAPESTRI - An Energy Efficient All-Spin Cache Using Domain Wall Shift Based Writes [p. 1825]
Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy and Anand Raghunathan

Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach - shift-based write - that offers a fast and energy-efficient alternative for performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift-based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes that are tailored to the differing requirements of the different levels of the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the shift latency inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves an 8.2X improvement in energy and a 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around a 1.6X improvement in both area and energy under iso-performance conditions.


12.6: Clock Distribution and Analogue Circuit Synthesis

Moderators: Tiziano Villa - University of Verona, IT; Georges Gielen - Katholieke Universiteit Leuven, BE
PDF icon Co-Synthesis of Data Paths and Clock Control Paths for Minimum-Period Clock Gating [p. 1831]
Wen-Pin Tu, Shih-Hsu Huang and Chun-Hua Cheng

Although intentional clock skew can be utilized to reduce the clock period, its application to gated clock designs has not been well studied. A gated clock design includes both data paths and clock control paths, but conventional clock skew scheduling focuses only on data paths. Based on that observation, in this paper we propose an approach to co-synthesize the data paths and clock control paths of a nonzero-skew gated clock design. Our objective is to minimize the inserted delay required to achieve the lower bound of the clock period (under the clocking constraints of both data paths and clock control paths). Unlike previous works, our approach guarantees no clocking constraint violation in the presence of clock gating. Experimental results show our approach can effectively enhance circuit speed with almost no penalty in power consumption.
Keywords - Clock Period Minimization, Delay Insertion, Clock Gating, Data Path Synthesis.

PDF icon Slack Budgeting and Slack to Length Converting for Multi-Bit Flip-Flop Merging [p. 1837]
Chia-Chieh Lu and Rung-Bin Lin

In this paper we propose a flexible slack budgeting approach for post-placement multi-bit flip-flop (MBFF) merging. Our approach considers the existing wiring topology and flip-flop delay changes to achieve more accurate slack budgeting. In addition, we propose a slack-to-length conversion approach that translates timing slack into equivalent wirelength, simplifying the merging process. We also develop a merging method to evaluate our slack budgeting approach. Our slack budgeting and MBFF merging programs are fully integrated into an industrial design flow. Experimental results show that our approach on average achieves 3.4% area savings, 50% clock tree power savings, and 5.3% total power savings.
Keywords - Multi-bit flip-flop; slack budgeting; low power
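
The slack-to-length conversion can be pictured with back-of-the-envelope arithmetic (the constant DELAY_PER_UM is an assumed stand-in, not the paper's calibrated model): a flip-flop's timing slack becomes the extra wirelength budget available when it is merged into an MBFF.

```python
# Slack-to-length conversion sketch (assumed delay-per-length constant).
DELAY_PER_UM = 0.8e-12      # incremental wire delay per um (assumed, s/um)

def slack_to_length(slack_s):
    """Equivalent wirelength budget (um) for a given timing slack (s)."""
    return max(0.0, slack_s) / DELAY_PER_UM

for slack_ps in (10, 40, 120):
    print(f"slack {slack_ps} ps -> "
          f"{slack_to_length(slack_ps * 1e-12):.0f} um budget")
```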

PDF icon Area Optimization on Fixed Analog Floorplans Using Convex Area Functions [p. 1843]
A. Unutulmaz, G. Dündar and F.V. Fernández

A methodology to optimize the area of a fixed non-slicing floorplan is presented in this paper. The areas of transistors, capacitors and resistors are formulated as convex functions, and the area is minimized by solving a sequence of convex problems. The methodology is practical even with many components and variants. Moreover, symmetry constraints are satisfied during optimization.
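
A minimal, geometric-programming-flavoured sketch of the convex-area idea (a one-row toy instance of my own, not the paper's formulation): writing cell widths as w_i = exp(x_i) makes the log of the row's bounding area a sum of a log-sum-exp term and a max-of-affine term, both convex in x, so a generic solver finds the optimum widths.

```python
# Convex area minimization for a one-row floorplan (toy instance).
import numpy as np
from scipy.optimize import minimize

A = np.array([2.0, 5.0, 3.0])       # required cell areas (assumed)
W_MIN, W_MAX = 1.5, 2.0             # width bounds from the fixed floorplan

def log_bounding_area(x):           # x = log widths; convex in x
    row_width = np.log(np.sum(np.exp(x)))       # log-sum-exp
    row_height = np.max(np.log(A) - x)          # max of affine functions
    return row_width + row_height

bounds = [(np.log(W_MIN), np.log(W_MAX))] * len(A)
res = minimize(log_bounding_area, x0=np.full(len(A), np.log(1.75)),
               method="Powell", bounds=bounds)
print(f"widths: {np.round(np.exp(res.x), 3)}  area: {np.exp(res.fun):.3f}")
```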

PDF icon PAGE: Parallel Agile Genetic Exploration towards Utmost Performance for Analog Circuit Design [p. 1849]
Po-Cheng Pan, Hung-Ming Chen and Chien-Chih Lin

This paper presents an agile hierarchical synthesis framework for analog circuits. To identify the performance limits of a given analog circuit topology, this hierarchical synthesis work proposes a performance exploration technique and a non-uniform-step simulation process. Beyond spec-targeted designs, the proposed approach can find solutions that exceed designers' expectations. A parallel agile genetic exploration (PAGE) method is employed for performance exploration. Unlike other evolution-based topology explorations, this is the first method that regards performance constraints as the input genome for evolution and resolves the multiple-objective problem with a multiple-population feature. Populations of selected performances are transferred to device variables by a re-targeting technique. Based on a normalization of the device variable distribution, a probabilistic stochastic simulation significantly reduces the convergence time needed to find the global optimum of circuit performance. Experimental results on a radio-frequency distributed amplifier (RFDA) and a folded-cascode operational amplifier (Op-Amp) in different technologies show that our approach achieves better runtime and higher quality in analog synthesis.
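
A bare-bones genetic exploration loop gives the flavor of such evolution-based sizing (PAGE itself is parallel, multi-population, and simulator-driven; the fom function below is an invented stand-in for a circuit simulation, and all bounds and gains are illustrative).

```python
# Minimal genetic-algorithm sizing sketch (toy figure of merit).
import random

def fom(w):                      # stand-in for a circuit simulation
    gain = 20 * w[0] / (1 + w[0]) * w[1] / (1 + w[1])
    power = 0.5 * (w[0] + w[1] + w[2])
    return gain / power

def evolve(pop_size=40, gens=60, n_var=3, lo=0.5, hi=20.0):
    pop = [[random.uniform(lo, hi) for _ in range(n_var)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fom, reverse=True)
        survivors = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            child = [random.choice(g) for g in zip(a, b)]        # crossover
            i = random.randrange(n_var)                          # mutation
            child[i] = min(hi, max(lo, child[i] * random.uniform(0.8, 1.25)))
            children.append(child)
        pop = survivors + children
    return max(pop, key=fom)

best = evolve()
print("best sizing:", [round(x, 2) for x in best], "FoM:", round(fom(best), 3))
```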


12.7: Physical Design

Moderators: Carl Sechen - University of Texas at Dallas, US; Bill Swartz - InternetCAD, US
PDF icon Fast and Efficient Lagrangian Relaxation-Based Discrete Gate Sizing [p. 1855]
Vinicius S. Livramento, Chrystian Guth, José Luís Güntzel and Marcelo O. Johann

Discrete gate sizing has attracted much attention recently as the EDA industry faces the challenge of optimizing large standard-cell-based circuits. The discreteness of the problem, along with complex timing models, stringent constraints and ever-increasing circuit sizes, makes the problem very difficult to tackle. Lagrangian Relaxation is an effective technique for handling complex constrained optimization problems and has therefore been used for gate sizing. In this paper, we propose an improved Lagrangian Relaxation formulation for leakage power minimization that accounts for maximum gate input slew and maximum gate output capacitance in addition to the circuit timing constraints. We also present a fast topological greedy heuristic to solve the Lagrangian Relaxation subproblem and a complementary procedure to fix the few remaining slew and capacitance violations. The experimental results, generated using the ISPD 2012 Discrete Gate Sizing Contest infrastructure, show that our technique is able to optimize a circuit with up to 959K gates within only 51 minutes. Compared to the ISPD Contest's top three teams, our technique obtained on average 18.9%, 16.7% and 43.8% less leakage power, while being 38, 31 and 39 times faster.
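
A stripped-down Lagrangian-relaxation sketch (a chain of identical gates with additive delays and invented size tables, far simpler than the paper's timing model) shows the decomposition at work: the dual multiplier prices delay, after which the subproblem reduces to an independent table lookup per gate.

```python
# Lagrangian relaxation for discrete sizing: leakage vs. delay (toy).
# sizes: (leakage_uW, delay_ps) options per gate (invented numbers)
SIZES = [[(1, 50), (2, 35), (4, 25)]] * 5   # a chain of 5 identical gates
T_MAX = 150.0                               # delay constraint (ps)

def solve_subproblem(lmbda):
    """Per-gate choice minimizing leakage + lambda * delay (decomposed)."""
    picks = [min(opts, key=lambda o: o[0] + lmbda * o[1]) for opts in SIZES]
    return picks, sum(p[0] for p in picks), sum(p[1] for p in picks)

lmbda, best = 0.0, None
for k in range(1, 201):                     # subgradient ascent on the dual
    picks, leak, delay = solve_subproblem(lmbda)
    if delay <= T_MAX and (best is None or leak < best[0]):
        best = (leak, delay)                # remember best feasible solution
    lmbda = max(0.0, lmbda + (delay - T_MAX) / (100.0 * k))  # diminishing step
print(f"best feasible: {best[0]} uW leakage at {best[1]} ps")
```

Because of the duality gap, the dual iterates alone may miss mixed size assignments; this is where a greedy repair pass, like the one the paper describes, refines the relaxed solution into a better feasible one.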

PDF icon Enhanced Metamodeling Techniques for High-Dimensional IC Design Estimation Problems [p. 1861]
Andrew B. Kahng, Bill Lin and Siddhartha Nath

Accurate estimators of key design metrics (power, area, delay, etc.) are increasingly required to achieve IC cost reductions in optimizations ranging from the system level down to physical layout. At the same time, identifying physical or analytical models of design metrics has become very challenging due to interactions among many parameters that span technology, architecture and implementation. Metamodeling techniques can simplify this problem by deriving surrogate models from samples of actual implementation data. However, the use of metamodeling techniques in IC design estimation is still in its infancy, and practitioners need a more systematic understanding. In this work, we study the accuracy of metamodeling techniques across several axes: (1) low- and high-dimensional estimation problems, (2) sampling strategies, (3) sample sizes, and (4) accuracy metrics. To obtain more general conclusions, we study these axes for three very distinct chip design estimation problems: (1) area and power of network-on-chip routers, (2) delay and output slew of standard cells under power delivery network noise, and (3) wirelength and buffer area of clock trees. Our results show that (1) adaptive sampling can effectively reduce the sample size required to derive surrogate models by up to 64% (or, equivalently, increase estimation accuracy by up to 77%) compared with Latin hypercube sampling; (2) for low-dimensional problems, Gaussian process-based models can be 1.5x more accurate than tree-based models, whereas for high-dimensional problems, tree-based models can be up to 6x more accurate than Gaussian process-based models; and (3) a variant of weighted surrogate modeling [7], which we call hybrid surrogate modeling, can improve estimation accuracy by up to 3x. Finally, to aid architects, design teams, and CAD developers in selecting appropriate metamodeling techniques, we propose guidelines based on the insights gained from our studies.
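
The surrogate-modeling comparison can be reproduced in miniature with off-the-shelf regressors (the synthetic "power" metric and design knobs below are assumed stand-ins for real implementation samples, not the paper's data).

```python
# Gaussian-process vs. tree-based surrogate models on synthetic samples.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (400, 3))                      # 3 design knobs (assumed)
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.1 * X[:, 2]   # toy "power" metric

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
for model in (GaussianProcessRegressor(),
              RandomForestRegressor(random_state=0)):
    pred = model.fit(Xtr, ytr).predict(Xte)
    print(f"{type(model).__name__}: MAE = {mean_absolute_error(yte, pred):.4f}")
```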

PDF icon Sub-Quadratic Objectives in Quadratic Placement [p. 1867]
Markus Struzyna

This paper presents a new flexible quadratic and partitioning-based global placement approach which is able to optimize a wide class of objective functions, including linear, sub-quadratic, and quadratic net lengths as well as positive linear combinations of them. Based on iteratively re-weighted quadratic optimization, our algorithm extends previous linearization techniques. If l is the length of some connection, most placement algorithms try to optimize l^1 or l^2. We show that optimizing l^p with 1 < p < 2 helps to improve even linear connection lengths. With this new objective, our new version of the flow-based partitioning placement tool BonnPlace [25] is able to outperform the state-of-the-art force-directed algorithms SimPL, RQL, and ComPLx, and closes the gap to MAPLE in terms of (linear) HPWL.
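
The re-weighting trick is easy to see in one dimension (a toy instance of my own, not the paper's placer): minimizing sum_i |x - a_i|^p for 1 < p < 2 by repeatedly solving a weighted least-squares problem with weights w_i = |x - a_i|^(p-2).

```python
# Iteratively re-weighted quadratic optimization for an l^p objective.
import numpy as np

def irls_placement(pins, p=1.5, iters=30, eps=1e-6):
    """Optimal 1-D position of a movable cell connected to fixed pins."""
    x = np.mean(pins)                       # quadratic (p=2) solution as start
    for _ in range(iters):
        w = (np.abs(x - pins) + eps) ** (p - 2.0)   # re-weighting step
        x = np.sum(w * pins) / np.sum(w)            # weighted quadratic optimum
    return x

pins = np.array([0.0, 1.0, 1.2, 9.0])
for p in (2.0, 1.5, 1.01):
    print(f"p={p}: x = {irls_placement(pins, p):.3f}")
```

As p falls from 2 toward 1, the solution moves from the mean of the pin positions toward their median, which is exactly why sub-quadratic objectives improve linear net lengths.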

PDF icon CATALYST: Planning Layer Directives for Effective Design Closure [p. 1873]
Yaoguang Wei, Zhuo Li, Cliff Sze, Shiyan Hu, Charles J. Alpert and Sachin S. Sapatnekar

For the last several technology generations, VLSI designs in new technology nodes have had to confront the challenges associated with the reduced scaling of wire delays. The solution from industrial back-end-of-line processes has been to add more and more thick metal layers to the wiring stacks. However, existing physical synthesis tools are usually not effective at handling these new thick layers for design closure. To fully leverage these degrees of freedom, it is essential for the design flow to provide better communication among the timer, the router, and the different optimization engines. This work proposes a new algorithm, CATALYST, to perform congestion- and timing-aware layer directive assignment. Our flow balances routing resources among the metal stacks so that designs benefit from the availability of thick metal layers, achieving improved timing and reduced buffer usage while maintaining routability. Experiments demonstrate the effectiveness of the proposed algorithm.


12.8: EMBEDDED TUTORIAL: Closed-Loop Control for Power and Thermal Management in Multi-core Processors: Formal Methods and Industrial Practice

Organizer: Ibrahim Elfadel - Masdar Institute of Science and Technology, AE
Moderators: Petru Eles - Linkopings University, SE; Jose Ayala - Complutense University of Madrid, ES
PDF icon Closed-Loop Control for Power and Thermal Management in Multi-core Processors: Formal Methods and Industrial Practice [p. 1879]
Ibrahim (Abe) M. Elfadel, Radu Marculescu and David Atienza

The need to use feedback to come up with context-dependent and workload-aware strategies for runtime power and thermal management (PTM) in high-end and mobile processors has been advocated since the early 2000. Two seminal papers that appeared in 2002 [1], [2] defined a framework for the use of feedback mechanisms for power and temperature control. In [1], the focus was on power management with the goal being to extend battery life on the AMD Mobile Athlon. This was one of the earliest papers to use DVFS settings as actuators to guarantee a given energy level in the battery at the end of a given time interval. The controller was implemented using a combination of OS files and Linux kernel modules. Almost simultaneously, [2] posed the dynamic thermal management task as a formal control-theoretic problem requiring the thermal modeling of the processor and the use of the established control structures of classical feedback theory. Some of the defining features of [2] include the development of layout-based thermal RC models for the processor; the use of an architecturally-driven control mechanism, namely, the instruction fetching rate; and the use of the SPEC2000 benchmarks to illustrate temperature control action under various workloads. The controller used in [2] is a Proportional-Integral-Differential (PID) structure whose input is the deviation of the sensed temperature from the target temperature and whose output is the toggle rate of the instruction fetching mechanism.