1.1 Opening Session

Date: Tuesday 25 March 2014
Time: 08:30 - 10:30
Location / Room: Grosser Saal

Organiser:
Gerhard Fettweis, Technische Universität Dresden, DE

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>1.1.1 WELCOME ADDRESSES</td>
<td>Speakers: Gerhard Fettweis¹ and Luca Fanucci²</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹Technische Universität Dresden, DE; ²University of Pisa, IT</td>
<td></td>
</tr>
<tr>
<td>08:50</td>
<td>1.1.2 PRESENTATION OF DISTINGUISHED AWARDS</td>
<td>Speaker: DATE Executive Committee</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract: DATE 2014 Best Paper Awards</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>EDAA Lifetime Achievement Award 2014 (Rolf Ernst, TU Braunschweig, DE)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>EDAA Outstanding Dissertation Awards 2013</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>ACM SIGDA Distinguished Service Award (Peter Marwedel, TU Dortmund, DE)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>DATE Fellow Award (Enrico Macii, Politecnico di Torino, IT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>IEEE/EDAC Outstanding Service Contribution Award 2013 (Enrico Macii, Politecnico di Torino, IT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>IEEE CS TTTC Outstanding Contribution Award (Enrico Macii, Politecnico di Torino, IT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>IEEE Fellow Award (Cecilia Metra, University of Bologna, IT)</td>
<td></td>
</tr>
</tbody>
</table>
1.1.4 KEYNOTE ADDRESS: THE GROWING IMPORTANCE OF MICROELECTRONICS FROM A FOUNDRY PERSPECTIVE

Speaker: Gerd Teepe, GLOBALFOUNDRIES, DE

Abstract
Microelectronics is the dominant industrial technology of today. Its rate of innovation, spelled out by Moore’s Law, is exceptional by any commercial metric, especially as it has been on this trajectory for almost 40 years. It is not surprising that other industrial sectors are taking advantage of the semiconductor innovation engine for their own product innovation: cars are safer and more economical, medical diagnostics perform at a significantly higher level, and energy is used far more efficiently from generation to consumer. “The Internet” has become the basis for communication, organization and living in our economies, with significant impact on our society. However, the semiconductor industry is undergoing a powerful transformation marked by the following trends:

- Design Complexity is facing new challenges, as technological complexity is transferred to the design space at an accelerated pace
- The SOC is dominating the design space
- Intelligent Things are emerging with unprecedented cognitive and motion capabilities
- The supply chain transformation is in full motion, with the foundry model at the forefront

With these powerful trends in motion, we will have to rethink our approach towards semiconductors as part of the industrial system. It will not be sufficient any more to “enhance” traditional products like cars, TVs, machines or phones with semiconductor content to make them perform at a higher level and increase their value to consumers. We need to rethink the connected world around us to truly assess the next generation of intelligent applications, which we are about to enter.

10:30 End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

UB01 Session 1

Date: Tuesday 25 March 2014
Time: 10:30 - 12:30
Location / Room: University Booth, Booth 3, Exhibition Area

UB01.01 QUANTUMEDA: A VISUALIZATION AND DESIGN ENVIRONMENT FOR TOPOLOGICAL QUANTUM CIRCUITS

Authors: Ilia Polian, Wolfgang Wallner and Alexandru Paler, University of Passau, DE

Abstract
Quantum circuits use quantum-mechanical properties of certain physical systems, such as superposition and entanglement, to perform massively parallel calculations. They provide polynomial algorithms for problems for which only inefficient algorithms with asymptotically exponential running time are known in conventional models of computation. Building a scalable quantum computer that can process a large number of quantum bits (qubits) is one of the grand challenges of modern science. While the first small quantum computers have been experimentally demonstrated and a number of implementation technologies have been suggested, all of them encounter difficulties when it comes to scaling. The central difficulty is the high susceptibility of such circuits to noise and decoherence, which necessitates the use of special quantum error correction. Topological quantum computing (TQC) is a paradigm that offers a pathway to scalability. It strikes a balance between systematic, intuitive methods to design large computations, and relatively loose requirements on the vulnerability of individual qubits to errors. The availability of a platform for implementing large quantum algorithms creates the need for methods to manage design complexity, including automatic synthesis, optimization, compaction, verification and visualization of TQC circuits. Topological quantum circuits are based on a three-dimensional cluster of qubits which supports high error rates. We will present the first environment for design of TQC circuits. The environment allows the user to graphically enter the structure of a circuit, add, delete and re-shape individual qubits, and perform optimization and compaction (both manually and by global replacement). The circuits are represented on an intermediate technology-independent level, where “logical qubits” that consist of a large number of physical qubits perform error-corrected operations. For example, the circuit in Fig. 1 shows an error-corrected CNOT gate implemented by four logical qubits represented by colored structures. The optimized representation can be translated into instruction sequences for a classical computer that operates the actual quantum hardware.

UB01.02 AN AUTOMATED DESIGN FLOW FOR FAST PROTOTYPING OF SIMULINK MODELS ONTO MPSoC

Authors: Francesco Robino and Johnny Öberg, Royal Institute of Technology, SE

Abstract
Simulink is a modelling environment suitable for modelling embedded systems at system level. However, there is no standard way to rapidly prototype Simulink models onto modern multiprocessor systems-on-chip (MPSoCs). In this demonstration we show how our NoC System Generator tool can be used as part of an automated platform-based design flow to synthesize a Simulink model to a network-on-chip based MPSoC implementation on FPGA. The performance of the generated prototype scales with the number of processors.

UB01.03 HEROES^2: A SYSTEMC FRAMEWORK FOR MODELING, SIMULATION AND TESTING OF HETEROGENEOUS SOFTWARE-INTENSIVE SYSTEMS

Authors: Markus Becker¹, Wolfgang Mueller¹, Ulrich Kiffmeier² and Joachim Stroop²

¹University of Paderborn/C-LAB, DE, ²DSPACE GmbH, DE

Abstract
Heroes^2 is a SystemC framework for modeling and simulation of heterogeneous SW-intensive systems. It has 8 abstraction levels for co-refinement of application/environment models from continuous/discrete models to networked embedded SW stacks. Support of various SW/communication abstractions is achieved by combining AMS Mcs, TLM, Ns models (MW, RTOS, HAL) and the QEMU user mode/system emulator. Interfacing with a commercial AUTOSAR toolchain is provided, i.e., code generators, integration and experimentation tools.

End of session

2.1 EXECUTIVE SESSION: How to Handle Today’s Design Complexity

Executives:
Sanjive Agarwala, Fellow & Silicon Director, Texas Instruments, US
Paul Lo, Senior Vice President, Synopsys, US
Rainer Kress, Head Design Methodology, Infineon, DE
Wolfgang Maier, Director, IBM, DE

The widening gap between growing SOC complexity and designer productivity limits traditional chip design methods and flows. This has led to several new approaches and innovative methods that work to alleviate the limitations of different aspects of complex SOC design. Executives in this session will discuss the impact of complexity and the new opportunities it may bring in designing today’s SOC.

13:00 End of session
Lunch Break in Exhibition Area
Sandwich lunch

2.2 Panel: Emerging vs. Established Technologies: a Two Sphinxes’ Riddle at the Crossroads?

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 6

Organiser:
Marco Casale-Rossi, Synopsys, Inc., US

Chair:
Giovanni De Micheli, EPFL, CH

Crossroads have always been challenging: they require a decision; in Egyptian and Greek mythology they were often guarded by two sphinxes trying to cheat the traveler with their riddles. The two sphinxes, the knight and the knave, the lady and the tiger, are just a few instances of difficult puzzles that have kept logicians and mathematicians busy for the last 5,000 years. Today, you are walking down Moore's Law road when you come to a crossroads: one road brings you into the land of emerging technologies: 14, 10 and 7 nanometer, FDSOI, FinFET, 3D-IC, ... beyond and below; the other road keeps you in the land of established technologies: 28, 40, 65, and 90 nanometers, possibly even above, A&M/S, MEMS, ... Choosing the right road is critical to lead your project and your company to success, but making the right decision is increasingly difficult, as it encompasses complex technical and economic considerations. However, unlike the mythological traveler, you won’t run into the sphinxes but, rather, into some of our industry's best experts; unlike the sphinxes, they will strive to provide you with honest advice about the ‘road conditions’, and you are allowed to ask them multiple questions to figure out which road is the best for you.

Panelists:
- Rob Aitken, ARM Ltd., US
- Antun Domic, Synopsys, Inc., US
- Manfred Horstmann, GLOBALFOUNDRIES, DE
- Robert Hum, Mentor Graphics Corp., US
- Philippe Magarshack, STMicroelectronics, FR

13:00 End of Session
Lunch Break in Exhibition Area
Sandwich lunch

2.3 Making automotive systems safer and more energy efficient

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 1

Chair:
Bart Vermeulen, NXP, NL

Co-Chair:
Sebastian Steinhorst, TUM-CREATE, SG

With the transition from hydraulic and mechanical automotive systems to electronic systems, the requirements on safety and energy efficiency are becoming increasingly important. The papers in this session address these issues by presenting robustness improvements at component and system level, advanced energy management at network level, and multi-variant design space exploration.

11:30 2.3.1 EMULATION-BASED ROBUSTNESS ASSESSMENT FOR AUTOMOTIVE SMART-POWER ICS

Speakers:
Manuel Harrant¹, Thomas Nirmaier¹, Jerome Kirscher¹, Christoph Grimm² and Georg Pelz¹
¹Infineon Technologies AG, DE; ²TU Kaiserslautern, DE

Abstract
In this paper we present a concept for assessing the robustness of automotive smart power ICs through lab measurements with respect to application variance and parameter spread. Classical compliance to the product specification, where only minimum and maximum values are defined, is not enough to assess device robustness, since complex transients of application components cannot be defined within single specification parameters. That is why assessing application fitness becomes necessary to reduce device failures that may occur in the application. One solution would be to enhance traditional lab verification methods with a concept that considers application and parameter spread. This innovative concept is demonstrated on an electronic throttle control application. It has been emulated in real-time, including power amplification and application-relevant parameters. Within this application space, Monte Carlo experiments were carried out to evaluate the influence of parameter spread on selected system characteristics. Finally, an appropriate metric was used to quantify the robustness of the micro-electronic device within its application.

Research and development on in-vehicle networks (IVNs) is driven by two main requirements: bandwidth and robustness. In this paper we address the robustness requirement. We focus on FlexRay IVNs that are used for safety-critical applications. We analyze and discuss faults that may affect the startup and operation of a FlexRay network. These failures may not only occur during the startup phase of the vehicle, but they may also happen due to a bus problem that requires the bus to be reinitialized during normal operation. Here any startup failure leads to a critical situation like a brake system failure. The fault scenarios we discuss in this paper are the resetting leading coldstart node (RLCN), the deaf coldstart node (DCN), and the babbling idiot (BI). These faults are described in literature, but neither the precise behavior of all involved nodes nor a clear solution to contain their impact is provided. The idea of a bus guardian (BG) is given in a draft specification of the FlexRay consortium, but no details are given. In this paper, we extend these ideas by investigating and implementing a detailed BG concept, based on our fault analysis. We subsequently evaluate the successful containment of the three fault types in simulation. We also quantify the chip area cost of our solution.

**Abstract**

We perform a case study on a circuit implementation of a well-known adaptive filter algorithm. The results from the analytical and simulation models show the effectiveness of our approach. The analytical model is accurate enough to estimate the effects of transient errors on the performance of a digital circuit. Our analytical method also reduces the run time significantly in a design phase.

**Keywords**

Analytical model, transient error resiliency, digital circuit, adaptive filters.

**Acknowledgments**

This work was supported by the German Research Foundation (DFG) within the Transregio SFB 109.

**Authors**

Bart Vermeulen, Michael Nijssen, Jürgen Teich, and Bart Steyaert

University of Erlangen-Nuremberg, DE; AUDI AG Ingolstadt, DE
13:00 2.4.4 (Best Paper Award Candidate)
ZONOTOPE-BASED NONLINEAR MODEL ORDER REDUCTION FOR FAST PERFORMANCE BOUND ANALYSIS OF ANALOG CIRCUITS WITH MULTIPLE-INTERVAL-VALUED PARAMETER VARIATIONS

Speakers:
Yang Song, Sai Manoj P. D. and Hao Yu, Nanyang Technological University, SG

Abstract
It is challenging to efficiently evaluate the performance bound of high-precision analog circuits with multiple parameter variations at nano-scale. In this paper, a nonlinear model order reduction is proposed that deploys a zonotope-based model for multiple-interval-valued parameter variations. As such, one can use a zonotope-based reachability analysis to generate a set of trajectories with the performance bound defined. By further constraining local parameterized subspaces to approximate a number of zonotopes along the set of trajectories, one can perform nonlinear model order reduction to generate the performance bound under parameter variations. As shown by numerical experiments, the zonotope-based nonlinear macromodeling by order of 19 achieves up to 500x speedup when compared to Monte Carlo simulations of the original model, and up to 50% smaller error when compared to previous parameterized nonlinear macromodeling under the same order.

13:00 2.4.3 IMPLEMENTATION ISSUES IN THE HIERARCHICAL COMPOSITION OF PERFORMANCE MODELS OF ANALOG CIRCUITS

Speakers:
Manuel Velasco-Jiménez, Rafael Castro-López, Elisenda Roca and Francisco Fernández, IMSE-CNM, CSIC and Universidad de Sevilla, ES

Abstract
Emerging hierarchical design methodologies based on the use of Pareto-optimal fronts (PoFs) are promising candidates to reduce the bottleneck caused by the design of complex analog circuits. However, little work has been reported about how to transmit the information provided by the PoFs of low hierarchical level blocks through the hierarchy to compose the performance models of higher level blocks. This composition actually poses several problems such as the dependence of the PoF performances on the surrounding circuitry and the complexity of dealing with multi-dimensional PoFs in order to explore more efficiently the design space. To deal with these problems, this paper proposes new mechanisms to represent and select candidate solutions from multi-dimensional PoFs that are transformed to the changing operating conditions enforced by the surrounding circuitry. These mechanisms are demonstrated with the generation of the performance model of an active filter by composing previously generated PoFs of operational amplifiers.

13:01 2.4.3 A NOVEL LOW POWER 11-BIT HYBRID ADC USING FLASH AND DELAY LINE ARCHITECTURES

Speakers:
Hsun-Cheng Lee and Jacob Abraham, The University of Texas at Austin, US

Abstract
This paper presents a novel low power 11-bit hybrid ADC using flash and delay line architectures, where a 4-bit flash ADC is followed by a 7-bit delay-line ADC. This hybrid ADC inherits accuracy and power efficiency from flash ADCs and delay-line ADCs, respectively. Also, in order to reduce the power of the first-stage flash ADC, a power-saving technique is adopted: in stand-by mode, the DAC of the first-stage flash is biased at 1.7 μA instead of the operational current of 17 μA. The hybrid ADC was designed and simulated in a commercial 65 nm process. With a 1.1 V supply and 100 MS/s, the ADC achieves an SNDR of 60 dB and consumes 1.6 mW, which results in a figure of merit (FOM) of 19.4 fJ/conversion-step without any calibration technique. Also, Monte Carlo simulations are performed with a 3σ device mismatch for the SNDR estimation, and the SNDR is observed to be better than 58.5 dB.
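
The quoted figure of merit can be sanity-checked with the commonly used Walden formula, FOM = P / (2^ENOB · f_s), where ENOB = (SNDR − 1.76) / 6.02. The abstract does not state which FOM definition the authors used, so the Walden form is an assumption here, and the function names below are purely illustrative. A minimal sketch:

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits for a given SNDR in dB (ideal-quantizer relation)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom_fj(power_w: float, sndr_db: float, sample_rate_hz: float) -> float:
    """Walden figure of merit in femtojoules per conversion step (assumed definition)."""
    enob = enob_from_sndr(sndr_db)
    return power_w / (2.0 ** enob * sample_rate_hz) * 1e15


# Values quoted in the abstract: 1.6 mW, 60 dB SNDR, 100 MS/s.
fom = walden_fom_fj(power_w=1.6e-3, sndr_db=60.0, sample_rate_hz=100e6)
print(f"FOM = {fom:.1f} fJ/conversion-step")
```

With the rounded inputs from the abstract this gives a value between 19 and 20 fJ/conversion-step, in line with the reported 19.4 fJ; the small difference is plausibly due to rounding of the SNDR and power figures.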
This session presents three papers on energy efficiency in memory-intensive systems. The first paper aims at energy-efficient scheduling of cooperative-thread arrays on GPGPUs for memory-intensive workloads through throttling of warps on different cores. The second paper leverages the application-specific knowledge of the next-generation parallelized high-efficiency video encoder to design a distributed scratchpad memory system with adaptive SPM data allocation and power management. The third paper explores the feasibility of non-volatile memories for instruction caches to improve energy efficiency. To handle the write delay and energy issues of NVMs, an analysis and extensions to the miss status handling registers are proposed.

For energy-efficient scheduling of cooperative-thread arrays (CTAs) on GPGPUs running memory-intensive workloads, the TCS algorithm requires the global CTA or thread block scheduler to reduce the number of cores with assigned thread blocks, while leveraging the local warp scheduler to throttle memory requests for some of the cores to further reduce power consumption. The proposed TCS scheduling does not require off-line analysis but can be done dynamically during execution. Instead of relying on conventional metrics such as misses per kilo-instruction (MPKI), we leverage the memory access latency metric to determine the memory intensity of the workloads. Our evaluations show that TCS scheduling allows the system to save power when compared to existing techniques.

An energy-efficient distributed Scratchpad Video Memory Architecture (dSVM) for the next-generation parallel High Efficiency Video Coding (HEVC) is presented. Our approach exploits the parallelism and locality of HEVC to design a scalable and energy-efficient dSVM that can handle the memory requirements of HEVC. The key is to leverage the HEVC and video content knowledge. Furthermore, we integrate an adaptive power management policy for SPMs to manage the power states of different memory parts at run time depending upon the varying video content properties. Our dSVM external memory energy savings increase with an increasing number of parallel HEVC threads and size of search window. Moreover, our SPM power management reacts to the current video properties and achieves up to 50% on-chip leakage energy savings.

Speakers: Carolina Radojicic, Christoph Grimm, Javier Moreno and Xiao Pan, TU Kaiserslautern, DE

Abstract

The paper describes an approach for semi-symbolic analysis of mixed-signal systems that contain discontinuous functions, e.g. due to modeling comparators. For modeling and semi-symbolic simulation, we use extended Affine Arithmetic. Affine Arithmetic is currently limited to accurate analysis of linear and mildly non-linear functions, and cannot yet handle discontinuities. In this paper we extend the approach to also handle discontinuities. For demonstration, we symbolically analyze a Δ-modulator.
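
For readers unfamiliar with the underlying technique, the following is a minimal sketch of standard Affine Arithmetic restricted to linear operations. It is not the authors' extended formulation for discontinuities; the class name `AffineForm` and the noise-symbol labels are hypothetical. Each quantity is a center value plus a sum of coefficients times shared noise symbols, each ranging over [-1, 1], so correlations between quantities are preserved.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class AffineForm:
    """x0 + sum(xi * eps_i), with every noise symbol eps_i in [-1, 1]."""
    center: float
    terms: Dict[str, float] = field(default_factory=dict)

    def __add__(self, other: "AffineForm") -> "AffineForm":
        # Coefficients of shared noise symbols add up, which keeps correlations.
        merged = dict(self.terms)
        for symbol, coeff in other.terms.items():
            merged[symbol] = merged.get(symbol, 0.0) + coeff
        return AffineForm(self.center + other.center, merged)

    def scale(self, factor: float) -> "AffineForm":
        return AffineForm(factor * self.center,
                          {s: factor * c for s, c in self.terms.items()})

    def __sub__(self, other: "AffineForm") -> "AffineForm":
        return self + other.scale(-1.0)

    def interval(self) -> Tuple[float, float]:
        # Enclosing interval: center plus/minus the sum of absolute coefficients.
        radius = sum(abs(c) for c in self.terms.values())
        return (self.center - radius, self.center + radius)


x = AffineForm(1.0, {"e1": 0.5})   # x in [0.5, 1.5]
y = AffineForm(2.0, {"e2": 0.25})  # y in [1.75, 2.25]
print((x + y).interval())  # (2.25, 3.75)
print((x - x).interval())  # (0.0, 0.0)
```

The second result illustrates the appeal over plain interval arithmetic, which would return [-1, 1] for `x - x` because it forgets that both operands depend on the same uncertainty.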

Speakers: Cristian Ferent and Alex Doboli, Stony Brook University, US

Abstract

This paper presents a reasoning-based approach to analog circuit synthesis using ordered node clustering representations (ONCR) to describe alternative circuit features and symbolic circuit comparison to characterize performance trade-offs of synthesized solutions. Case studies illustrate application of the proposed methods to topology selection and refinement.

Speakers: Jinbo Wan and Hans Kerkhoff, CAES-TDT, CTIT, University of Twente, NL

Abstract

Analog and mixed-signal IPs are increasingly required to use digital fabrication technologies and are deeply embedded into systems-on-chip (SoCs). These developments impose more requirements and challenges on analog testing methodologies. Traditional analog testing methods suffer from limited accessibility and control with regard to these embedded analog circuits in SoCs. As an alternative, an embedded instrument for analog OpAmp IP tests is proposed in this paper. It can provide the exact gain and offset values of OpAmps instead of only a pass/fail result. Moreover, it is a non-invasive monitor and can work online without isolating the DUT OpAmp from its surrounding feedback networks. Nor does it require accurate test stimuli. In addition, the monitor can remove its own offsets without additional complex self-calibration circuits. All self-calibrations are completed in the digital domain after each measurement in real time. Therefore it is also suitable for aging-sensitive applications, in which the monitor itself may suffer from aging mechanisms and exhibit additional offset drifts. The monitor measurement range for offset is from 0.2 mV to 70 mV and for gain it is from 0 dB to 40 dB. The error for offset measurements can be 10% of the measured value plus/minus 0.1 mV, and -2.5 dB for gain measurements.

2.5 Low-Power and Efficient Architectures

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 3

Chair:
Todd Austin, University of Michigan, US

Co-Chair:
Cristina Silvano, Politecnico di Milano, IT

**2.6 Real-Time memory hierarchies**

The papers in this session deal with analysis and management of memory hierarchies for complex real-time systems, both from the deterministic and the probabilistic point of view.

**TIME-PREDICTABLE EXECUTION OF MULTITHREADED APPLICATIONS ON MULTICORE SYSTEMS**

**Speakers:**
Ahmed Alhammad and Rodolfo Pellizzoni, University of Waterloo, CA

**Abstract**
In multicore systems, contention for access to main memory between application threads complicates timing analysis and may lead to pessimistic bounds on execution time. This is particularly problematic for real-time applications, which require provable bounds on worst-case performance. In this work, we employ a predictable execution model to schedule memory accesses performed by application threads without relying on unpredictable hardware arbiters. In addition, we statically schedule the application's threads to cores to form a combined approach that improves precision, without significantly increasing its complexity. The performance of the various approaches is compared on benchmark programs.

---

**MINIMIZING STACK MEMORY FOR HARD REAL-TIME APPLICATIONS ON MULTICORE PLATFORMS**

**Speakers:**
Chuansheng Dong and Haibo Zeng, McGill University, CA

---

**ON THE CORRECTNESS, OPTIMALITY AND PRECISION OF STATIC PROBABILISTIC TIMING ANALYSIS**

**Speakers:**
Sebastian Altmeyer and Robert Davis

**Abstract**
In this paper, we investigate Static Probabilistic Timing Analysis (SPTA) for single processor systems that use a cache with an evict-on-miss random replacement policy. We show that previously published formulae for the probability of a cache hit can produce results that are optimistic and unsound when used to compute probabilistic Worst-Case Execution Time (pWCET) distributions. We investigate the correctness, optimality, and precision of different approaches to SPTA. We prove that one of the previously published formulae for the probability of a cache hit is optimal with respect to the limited information that it uses. We improve upon this formulation by using extra information about cache contention. To investigate the precision of various approaches to SPTA, we introduce a simple exhaustive method that computes a precise pWCET distribution, albeit at the cost of exponential complexity. Further, we integrate this precise approach, applied to small numbers of frequently accessed memory blocks, with imprecise analysis of other memory blocks, to form a combined approach that improves precision without significantly increasing its complexity. The performance of the various approaches is compared on benchmark programs.

---

**COMPREHENSIVE ANALYSIS OF ALPHA AND NEUTRON PARTICLE-INDUCED SOFT ERRORS IN AN EMBEDDED PROCESSOR AT NANOSCALES**

**Speakers:**
Mojtaba Ebrahimi, Adrian Evans, Mehdi B. Tahoori, Razi Seyyedi, Enrico Costenaro, and Dan Alexandrescu

**Authors**
Mojtaba Ebrahimi, Adrian Evans, Mehdi B. Tahoori, Razi Seyyedi, Enrico Costenaro, and Dan Alexandrescu

**Abstract**
We present results of Soft Error Rate (SER) analysis of an embedded processor. Our SER analysis platform accurately models all generation, propagation and masking effects starting from a technology response model derived using TCAD simulations at the device level all the way to application masking. The platform employs a combination of empirical models at the device level, analytical error propagation at logic level and fault emulation at the architecture/application level to provide the detailed contribution of each component (flip-flops, combinational gates, and SRAMs) to the overall SER. At each stage in the modeling hierarchy, an appropriate level of abstraction is used to propagate the effect of errors to the next higher level. Unlike previous studies which are based on very simple test chips, analyzing the entire processor gives more insight into the contributions of different components to the overall SER. The results of this analysis can assist circuit designers to adopt effective hardening techniques to reduce the overall SER while meeting required power and performance constraints.

---

**2.7 Yield and Reliability for Robust Systems**

**Date:** Tuesday 25 March 2014

**Time:** 11:30 - 13:00

**Location / Room:** Konferenz 5

**Chair:**
Joan Figueras, UPC, ES

**Co-Chair:**
Jose Pineda de Gyvez, NXP, NL

Robustness is increasingly a requirement for SoCs and memories, and effects such as wearout, BTI, and soft errors are important to consider as part of design. Another important component of robust design is tolerance of rare events. Understanding design robustness helps predict and enhance yield.

---

12:00 2.7.2 BIAS TEMPERATURE INSTABILITY ANALYSIS OF FINFET BASED SRAM CELLS
Speakers: Seyab Khan1, Innocent Agbo2, Said Hamdioui3, Halil Kukner4, Ben Kaczer4, Praveen Raghavan4 and Francky Catthoor4
1Technical University Delft, NL; 2TU Delft, NL; 3Delft University of Technology, NL; 4IMEC, BE; 5imec, BE
Abstract
Bias Temperature Instability (BTI) poses a major reliability challenge for today’s and future nano-devices as it degrades their performance. This paper provides a comprehensive analysis of BTI impact, in terms of time-dependent degradation, on FinFET based SRAM cells. The evaluation metrics are read Static Noise Margin (SNM), hold SNM and Write Trip Point (WTP), while the aspects investigated include dependence on the supply voltage, cell strength, and design style (6 versus 8 transistor cells). A comparison between the degradation of FinFET and planar CMOS based SRAM cells is also covered. The simulation results show that: (a) FinFET based cells show lower degradation (by 16.72% in WTP and 14.19% in hold SNM); (b) a 12% increment in the cell’s supply voltage enhances its read SNM by 9%; (c) strengthening only the pull-down transistors in the cell by 1.5 reduces BTI induced read SNM degradation by 26.61%; (d) 8T SRAM cells have 1.43 higher WTP than 6T cells; however, they suffer from 31.13% higher read SNM and 8.05% higher hold SNM degradation than 6T SRAM cells; and (e) FinFET based SRAM cells are more vulnerable to BTI degradation than planar CMOS based cells.

12:30 2.7.3 SSFB: A HIGHLY-EFFICIENT AND SCALABLE SIMULATION REDUCTION TECHNIQUE FOR SRAM YIELD ANALYSIS
Speakers: Manish Rana and Ramon Canal, Universitat Politecnica de Catalunya, ES
Abstract
We present a methodology that uses information on both fresh and aged ICs and tries to distinguish between the fresh and aged population based on metrics rapidly and accurately. The tool relies on accurate circuit-level simulations of failure mechanisms such as ageing, soft-errors and parametric failures. The obtained results can then help couple low-level metrics with higher-level design choices. A new technique for rapid estimation of low-probability failure events is also proposed. We present three use-cases of our prototype tool to demonstrate its diverse capabilities in autonomously guiding large SRAM based robust memory designs.

13:00 IP1-12 Shrikanth Ganapathy1, Ramon Canal2, Dan Alexandrescu2, Enrico Costenaro2, Antonio Gonzalez2 and Antonio Rubio2
1Universitat Politecnica de Catalunya, ES; 2iRoC Technologies, FR; 3Intel and Universitat Politecnica de Catalunya, ES
Abstract
WEAR-OUT ANALYSIS OF ERROR CORRECTION TECHNIQUES IN PHASE-CHANGE MEMORY
Speakers: Caio Hoffman, Luiz Ramos, Rodolfo Azevedo and Guido Araujo, University of Campinas, BR
Abstract
Phase-Change Memory (PCM) is a new memory technology and a possible replacement for DRAM, whose scaling limitations require new lithography technologies. Despite being promising, PCM has limited endurance (its cells withstand roughly 10^8 bit-flips before failing), which prompted the adoption of Error Correction Techniques (ECTs). However, previous lifetime analyses of ECTs did not consider the difference between the bit-flip frequencies of data and code bits, which may lead to inaccurate wear-out analyses for the ECTs. In this work, we improve the wear-out analysis of PCM by modeling and analyzing the bit-flip probabilities of five ECTs. Our models also enable an accurate estimation of energy consumption and analysis of the endurance-energy trade-off for each ECT.

13:02 IP1-14 APPROXIMATING THE AGE OF RF/ANALOG CIRCUITS THROUGH RE-CHARACTERIZATION AND STATISTICAL ESTIMATION
Speakers: Doohwang Chang1, Sule Ozev1, Ozgur Sinanoglu2 and Ramesh Karri3
1Arizona State University, US; 2New York University Abu Dhabi, AE; 3Polytechnic Institute of New York University, US
Abstract
Counterfeit ICs have become an issue for semiconductor manufacturers due to impacts on their reputation and lost revenue. Counterfeit ICs are either products that are intentionally mislabeled or legitimate products that are extracted from electronic waste. The former is easier to detect whereas the latter is harder since they are identical to new devices but display degraded performance due to environmental and use stress conditions. Detecting counterfeit ICs that are extracted from electronic waste requires an approach that can approximate the age of manufactured devices based on their parameters. In this paper, we present a methodology that uses information on both fresh and aged ICs and tries to distinguish between the fresh and aged population based on an estimate of the age. Since analog devices age mainly due to their bias stress, input signals play less of a role. Hence, it is possible to use simulation models to approximate the aging process, which would give us access to a large population of aged devices. Using this information, we can construct a statistical model that approximates the age of a given circuit. We use a low noise amplifier (LNA) and an NMOS LC oscillator to demonstrate that individual aged devices can be accurately classified using the proposed method.

2.8 Hot Topic: Technology Transfer towards Horizon 2020

Date: Tuesday 25 March 2014
Location / Room: Exhibition Theatre
Organiser: Rainer Leupers, RWTH Aachen, DE
Chair: Norbert Wehn, TU Kaiserslautern, DE

European research projects produce many excellent results, and the quality of research papers at DATE and other major European conferences is often outstanding. But how many academic research results in computing technologies and EDA actually make it into industrial practice? In the context of the transition into the Horizon 2020 framework program, the European research community is currently investigating novel ways of stimulating additional academia-industry technology transfer. This special session contributes by discussing concrete transfer experiences and new concepts. Furthermore it will exemplify several success stories from both academic and industrial perspectives.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>13:00</td>
<td>End of session</td>
<td></td>
<td></td>
</tr>
<tr>
<td>12:45</td>
<td>Lunch Break in Exhibition Area</td>
<td>Sandwich lunch</td>
<td></td>
</tr>
</tbody>
</table>

Abstract:

TETRACOM (Technology Transfer in Computing Systems) introduces a novel instrument called the Technology Transfer Project (TTP). TTPs help to lower the barrier for researchers to make the first steps towards commercialisation of their research results. TTPs are designed to provide incentives for technology transfer (TT) at small to medium scale via partial funding of dedicated, well-defined, and short-term academia-industry collaborations that bring concrete R&D results into industrial use. This will be implemented via competitive calls for TTP proposals. It is expected to fund up to 50 TTPs. The TTP activities will be complemented by Technology Transfer Infrastructures (TTIs) that provide training, service, and dissemination actions. These are designed to encourage a larger fraction of the R&D community to engage in TTPs, possibly even for the first time. Altogether, TETRACOM is conceived as the major pilot project of its kind in the area of Computing Systems, acting as a TT catalyst for the mutual benefit of academia and industry. It is expected to acquire more than 20 new contractors over the project duration. TETRACOM complements and actually precedes the use of existing financial instruments such as venture capital or business angels based funding.

Open questions:

- How do you measure the success of a technology transfer project?
- What are the key factors for a successful technology transfer?
- What are the challenges faced by researchers when translating their research into industry applications?
- How can academic institutions support researchers in the transition from research to industry?
- What role do technology transfer infrastructures (TTIs) play in facilitating technology transfer?

**UB02.01 QUANTUMEDA: A VISUALIZATION AND DESIGN ENVIRONMENT FOR TOPOLOGICAL QUANTUM CIRCUITS**

**Authors:** Ilia Polian, Wolfgang Wallner and Alexandru Paler, University of Passau, DE

**Abstract**
Quantum circuits use quantum-mechanical properties of certain physical systems, such as superposition and entanglement, to perform massively parallel calculations. They provide polynomial algorithms for problems for which only inefficient algorithms with asymptotically exponential running time are known in conventional models of computation. Building a scalable quantum computer that can process a large number of quantum bits (qubits) is one of the grand challenges of modern science. While the first small quantum computers have been experimentally demonstrated and a number of implementation technologies have been suggested, all of them encounter difficulties when it comes to scaling. The central difficulty is the high susceptibility of such circuits to noise and decoherence, which necessitates the use of special quantum error correction. Topological quantum computing (TQC) is a paradigm that offers a path to scalability. It strikes a balance between systematic, intuitive methods to design large computations, and relatively loose requirements on the vulnerability of individual qubits to errors. The availability of a platform for implementing large quantum algorithms creates the need for methods to manage design complexity, including automatic synthesis, optimization, compaction, verification and visualization of TQC circuits. Topological quantum circuits are based on a three-dimensional cluster of qubits which supports highly efficient topological quantum error-correcting codes. In this way, the circuits can operate even though their individual qubits are subject to relatively high error rates. We will present the first environment for design of TQC circuits. The environment allows the user to graphically enter the structure of a circuit, add, delete and re-shape individual qubits, and perform optimization and compaction (both manually and by global replacement). 
The circuits are represented on an intermediate technology-independent level, where "logical qubits" that consist of a large number of physical qubits perform error-corrected operations. For example, the circuit in Fig. 1 shows an error-corrected CNOT gate implemented by four logical qubits represented by colored squares. The optimized representation can be translated into instruction sequences for a classical computer that operates the actual quantum hardware.
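
As a reminder of what a CNOT gate does, the following minimal sketch applies it to an ideal two-qubit state vector. It works at the level of plain, error-free amplitudes and does not model the error-corrected logical qubits or the tool described above. The function names and the amplitude ordering `[a00, a01, a10, a11]` are assumptions made for illustration.

```python
import math


def apply_hadamard_on_first(state):
    """Apply a Hadamard gate to the first qubit of a two-qubit state.

    The state is a list of amplitudes [a00, a01, a10, a11], where the
    first index digit is the first qubit.
    """
    s = 1 / math.sqrt(2)
    a00, a01, a10, a11 = state
    return [s * (a00 + a10), s * (a01 + a11), s * (a00 - a10), s * (a01 - a11)]


def apply_cnot(state):
    """CNOT with the first qubit as control and the second as target.

    The target is flipped only when the control is 1, so the amplitudes
    of |10> and |11> are swapped.
    """
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]


# Start in |00>, create a superposition on the first qubit, then entangle.
bell = apply_cnot(apply_hadamard_on_first([1.0, 0.0, 0.0, 0.0]))
```

The resulting state has equal amplitude on |00> and |11> and none on |01> or |10>, which is the standard entangled Bell state.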

**More information ...**

---

**UB02.02 AN AUTOMATED DESIGN FLOW FOR FAST PROTOTYPING OF SIMULINK MODELS ONTO MPSOC**

**Authors:** Francesco Robino and Johnny Öberg, Royal Institute of Technology, SE

**Abstract**
Simulink is a modelling environment suitable to model embedded systems at system-level. However there is no standard to rapidly prototype Simulink models onto modern multiprocessor system-on-chip (MPSoC). In this demonstration we show how our NoC System Generator tool can be used as part of an automated platform-based design flow to synthesize a Simulink model to a network-on-chip based MPSoC implementation on FPGA. The performance of the generated prototype scales with the number of processors.

**More information ...**

---

**UB02.03 CUCUMBER-VERILOG: BEHAVIOR DRIVEN DEVELOPMENT FOR CIRCUIT DESIGN AND VERIFICATION**

**Authors:** Melanie Diepenbeck, Mathias Soeken, Ulrich Kühne and Rolf Drechsler, University of Bremen, DE

**Abstract**
When designing hardware, one usually applies a top-down approach in which, starting from a natural language specification, a design is implemented and afterwards tested and verified for correctness. In contrast, software development is pushed towards agile techniques such as Test Driven Development (TDD), where tests play a central role in driving the implementation. Behavior Driven Development (BDD) extends TDD by using natural language style scenarios to describe the tests. Essentially, in both techniques testing and implementation are interleaved: first, test cases are written, and second, the implementation is extended to satisfy them. Since nowadays 70% of the effort to design hardware systems is spent on verification, test and verification should receive more attention and be applied as soon as possible. We present a BDD tool tailored for the Verilog hardware description language which enables a new design flow for hardware design, test, and verification. BDD acceptance tests are readily given by means of the natural language specification. Assigning test code to their sentences yields a testbench which serves as a starting point for the implementation. At the same time, the natural language scenarios form a test documentation that is easily accessible also to non-experts. Furthermore, our tool allows for the generalization of test cases to properties suitable for formal verification. As properties are typically more difficult to formalize than test cases, our approach facilitates the access to formal verification. In our demonstration, we will show how to implement hardware designs using our BDD tool and how properties are generalized from test cases which can then be verified by a model checker automatically.
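
The idea of assigning test code to natural language sentences can be sketched in a few lines. This is a generic illustration of the BDD pattern, not the API of the tool presented here. The `step` decorator, the `Counter` stand-in for a design under test, and the scenario wording are all hypothetical.

```python
import re

STEPS = []  # registry of (compiled sentence pattern, handler function)


def step(pattern):
    """Register a handler for sentences matching the given regex."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


class Counter:
    """Hypothetical stand-in for a design under test: a 4-bit wrapping counter."""
    def __init__(self):
        self.value = 0

    def tick(self):
        self.value = (self.value + 1) % 16


@step(r"a counter reset to (\d+)")
def given_reset(ctx, start):
    ctx["dut"] = Counter()
    ctx["dut"].value = int(start)


@step(r"the clock ticks (\d+) times")
def when_ticks(ctx, count):
    for _ in range(int(count)):
        ctx["dut"].tick()


@step(r"the counter value is (\d+)")
def then_value(ctx, expected):
    assert ctx["dut"].value == int(expected)


def run_scenario(text):
    """Run each sentence through its matching step; return the number of steps run."""
    ctx, executed = {}, 0
    for line in text.strip().splitlines():
        sentence = re.sub(r"^\s*(Given|When|Then|And)\s+", "", line).strip()
        for pattern, handler in STEPS:
            match = pattern.fullmatch(sentence)
            if match:
                handler(ctx, *match.groups())
                executed += 1
                break
        else:
            raise LookupError(f"no step matches: {line!r}")
    return executed


SCENARIO = """
Given a counter reset to 14
When the clock ticks 3 times
Then the counter value is 1
"""
steps_run = run_scenario(SCENARIO)
```

The scenario text doubles as readable documentation, while the registered handlers play the role of the testbench code attached to each sentence.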

**More information ...**

---

**UB02.04 BUILDING A PROTOTYPING PLATFORM FOR INVESTIGATING THE IMPACT OF ATTACKS AGAINST AUTOMOTIVE NETWORKS**

**Authors:** Alexander Stühring¹, Günter Ehmen¹ and Sibylle Fröschle²
¹University of Oldenburg, DE; ²OFFIS, DE

**Abstract**
The University of Oldenburg is working on solutions to ensure a secure communication in the automotive domain. This is a key requirement for safe applications in the context of future Car2X applications. In order to achieve this goal we are using a self-developed prototyping platform to analyze and demonstrate the impact of attacks on in-vehicle buses and wireless networks. Moreover, the visitors are able to start attacks and observe the consequences in a simulated driving scenario.

**More information ...**

---

**UB02.05 HWDEBLUR: DESIGN OF A HIGH PERFORMANCE CORE FOR REMOVING BLUR EFFECT ON IMAGES**

**Authors:** Giuseppe Airo' Farulla, Giulio Gambardella, Marco Indaco, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT

**Abstract**
This work aims at developing a high performance FPGA-based IP-core able to perform a deblurring algorithm in real-time. Modern approaches to deblurring usually handle only simple types of blur or need heavy user interaction. Moreover, they usually require several minutes (or even whole hours) to process a single image. Our purpose is to study the current state-of-the-art and identify the best deblurring algorithms that are suitable for a hardware implementation. The selected algorithm is optimized and implemented in hardware in order to perform the deblurring task with the highest possible performance.

**More information ...**

---

**UB02.06 ENERGY-MODULATED COMPUTING**

**Authors:** Maxim Rykunov, Reza Ramezani, Abdullah Baz, Xuefu Zhang, Delong Shang, Andrey Mokhov, Danil Sokolov, Fei Xia and Alex Yakovlev, Newcastle University, GB

**Abstract**
This demo will illustrate the principle of energy-modulated computing according to which the flow of energy entering a computing system determines its computational flow. This principle will be fundamental for building future autonomous systems, such as those powered by energy harvesting sources and aimed for survival in power-deficient conditions. The demo includes a set of experimental circuits (with three VLSI chips and PCBs) to work in variable power supply conditions and software tools for digital and analogue co-design (Workcraft, Petrify, MPSAT).

**More information ...**

Contrary to common belief, the world is not going digital! Analog and mixed-signal electronics is more and more important and pervasive. This is due both to increasing systems integration, by nature leading to heterogeneity, and to the complex digital computing functions being complemented by scores of on-chip analog functions interfacing and interacting with people, environment, and other systems. Specialty silicon foundries are now stable members of top ten revenue rankings. This technology trend demands more design automation in both implementation and verification domains. Lossless interfaces between digital and analog design environments and multi-technology mixed-signal simulation engines, but also debugging aids, are no longer a nice to have. According to IBS, the cost of implementing and verifying the mixed-signal ...

### 3.1 EXECUTIVE SESSION: Advanced Technology Challenges & Opportunities

**Date:** Tuesday 25 March 2014  
**Time:** 14:30 - 16:00  
**Location / Room:** Saal 1

**Organiser:**  
Yervant Zorian, Fellow & Chief Architect, Synopsys, US

**Executives:**  
Lorent Remont, VP, Global Foundries, DE  
Wenchi Chang, Senior Manager, TSMC, NL  
Joachim Kunkel, Senior Vice President & GM, Synopsys, US  
Gerd Teepe, VP, Global Foundries, DE

The continuous technology scaling and its new applications are dramatically impacting the semiconductor industry. This may also significantly affect the dependency between ecosystem players, necessitating smooth interdependency between them. The executives in this session will discuss upcoming innovations in the semiconductor industry and their impact on the solutions offered by the ecosystem players.

**Time** | **Presentation Title** | **Authors**
---|---|---
16:00 | End of session |  
16:00 | **Coffee Break in Exhibition Area** | On Tuesday–Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).


3.3 Secure Hardware Primitives and Implementations

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 1
Chair: Paolo Maistri, TIMA, FR
Co-Chair: Patrick Schaumont, Virginia Tech, US

System designers need secure building blocks for robust system protection against physical attacks. This session presents novel hardware designs and analysis on code-based cryptography, random number generators and IP protection mechanisms using watermarking.

3.3.1 LIGHTWEIGHT CODE-BASED CRYPTOGRAPHY: QC-MDPC McEliece Encryption on Reconfigurable Devices

Speakers: Ingo von Maurich and Tim Güneysu, Ruhr-Universität Bochum, DE

Abstract: With the break of RSA and ECC cryptosystems in an era of quantum computing, asymmetric code-based cryptography is an established alternative that can be a potential replacement. A major drawback is the large keys, in the range from 50 kByte to several MByte, that have prevented real-world applications of code-based cryptosystems so far. A recent proposal by Misoczki et al. showed that quasi-cyclic moderate density parity-check (QC-MDPC) codes can be used in McEliece encryption, reducing the public key to just 0.6 kByte to achieve an 80-bit security level. Despite reasonably small key sizes that could also enable small designs, previous work only reports high-performance implementations with high resource consumption of more than 13,000 slices on a large Xilinx Virtex-6 FPGA for a combined en-/decryption unit. In this work we focus on lightweight implementations of code-based cryptography and demonstrate that McEliece encryption using QC-MDPC codes can be implemented with a significantly smaller resource footprint while still achieving reasonable performance sufficient for many applications, e.g., challenge-response protocols or hybrid firmware encryption. More precisely, our design requires just 68 slices for the encryption and around 150 slices for the decryption unit and is able to en-/decrypt an input block in 2.2ms and 13.4ms, respectively.
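
The compact key size follows from the quasi-cyclic structure: a circulant block is fully determined by its first row, so only that row needs to be stored. The sketch below multiplies such a block by a bit vector over GF(2). It is an illustration of the structure only, not the authors' hardware design and not a secure implementation. The function name and the toy block size are assumptions.

```python
def circulant_mul_gf2(first_row, vec):
    """Multiply a circulant binary matrix by a bit vector over GF(2).

    Row i of the matrix is first_row rotated right by i positions, so the
    full n x n matrix is never stored, only its n-bit first row.
    """
    n = len(first_row)
    if len(vec) != n:
        raise ValueError("vector length must match the block size")
    result = []
    for i in range(n):
        acc = 0
        for j in range(n):
            # Entry (i, j) of the circulant matrix is first_row[(j - i) mod n].
            acc ^= first_row[(j - i) % n] & vec[j]
        result.append(acc)
    return result


# Toy 4-bit block: multiplying by the first unit vector selects column 0.
column0 = circulant_mul_gf2([1, 1, 0, 0], [1, 0, 0, 0])
```

Storing one row instead of the whole matrix reduces the storage for an n x n block from n² bits to n bits.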

3.3.2 ON THE ASSUMPTION OF MUTUAL INDEPENDENCE OF JITTER REALIZATIONS IN P-TRNG STOCHASTIC MODELS

Speakers: Patrick Haddad

1STMicroelectronics, FR; 2Laboratory Hubert Curien, University of Lyon, UJM Saint-Etienne, FR; 3Hubert Curien Laboratory, Jean Monnet University, FR

Abstract: Security in true random number generation in cryptography is based on entropy per bit at the generator output. The entropy is evaluated using stochastic models. Several recent works propose stochastic models based on assumptions related to selected physical analog phenomena such as noisy signals and on the knowledge of the principle of randomness extraction from the obtained noisy analog signal. However, these assumptions often considerably simplify the underlying analog processes, which include several noise sources. In this paper, we present a new comprehensive multilevel approach, which makes it possible to build the stochastic model based on detailed analysis of noise sources starting at transistor level and on conversion of the noise to the clock jitter exploited at the generator level. Using this approach, we can estimate the proportion of the jitter coming only from the thermal noise, which is included in the total clock jitter.
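
For orientation, the simplest notion of entropy per bit is the Shannon entropy of a single biased bit. The sketch below computes it under the assumption that output bits are independent and identically distributed. That is the kind of simplifying assumption the paper examines, so this is a baseline formula and not the stochastic model proposed by the authors. The function name is illustrative.

```python
import math


def shannon_entropy_per_bit(p_one):
    """Shannon entropy (in bits) of one output bit that is 1 with probability p_one.

    Assumption: bits are independent and identically distributed.
    """
    if not 0.0 <= p_one <= 1.0:
        raise ValueError("p_one must be a probability")
    if p_one in (0.0, 1.0):
        return 0.0  # a constant output carries no entropy
    p_zero = 1.0 - p_one
    return -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))


unbiased = shannon_entropy_per_bit(0.5)   # maximum: one full bit
biased = shannon_entropy_per_bit(0.25)    # noticeably below one bit
```

Any bias, or any dependence between successive bits, lowers the entropy per bit below the ideal value of 1.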

3.3.3 CLOCK-MODULATION BASED WATERMARK FOR PROTECTION OF EMBEDDED PROCESSORS

Speakers: Jedrzej Kufel, Peter Wilson, Stephen Hill, Bashir Al-Hashimi, Paul N. Whatmough and James Myers

1University of Southampton, GB; 2ARM, GB; 3ARM, US

Abstract: This paper presents a novel watermark generation technique for the protection of embedded processors. In previous work, a load circuit is used to generate detectable watermark patterns in the ASIC power supply. This approach leads to hardware area overheads. We propose removing the dedicated load circuit entirely. Instead, to compensate for the reduced power consumption, the watermark power pattern is emulated by reusing existing clock-gated sequential logic as a zero-overhead load circuit and modulating the clock-gating enable signal with the watermark sequence. The proposed technique has been validated through experiments using two ASICs in 65nm CMOS, one with an ARM Cortex-M0 microcontroller and one with a Cortex-A5 microprocessor. Silicon measurement results verify the viability of the technique for embedded processors. Furthermore, the proposed clock modulation technique demonstrates a significant area reduction without compromising the detection performance. In our experiments an area overhead reduction of 98% was achieved. Through reuse of existing logic and reduction of watermark hardware implementation costs, the proposed clock modulation technique offers improved robustness against removal attacks.

16:00 End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.4 Modeling and Optimization of Power Distribution Networks

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 2
Chair: Luca Daniel, MIT, US
Co-Chair: Stefano Grivet-Talocia, Politecnico di Torino, IT

The performance and robustness of 3D power distribution networks are of critical importance for state-of-the-art electronic designs. The papers in this session discuss new modeling and optimization approaches for their efficient characterization and robust design, including order reduction, variability impact, via planning, decoupling capacitor selection, and thermal effects.

### 15:00 3.4.2 EFFICIENT ANALYSIS OF VARIABILITY IMPACT ON INTERCONNECT LINES AND RESISTOR NETWORKS

**Speakers:**
Jorge Fernandez Villena1 and Luis Miguel Silveira2
1INESC ID, PT; 2INESC ID/IST - Lisbon University, PT

**Abstract**
Continued technology scaling coupled with limited lithographic capabilities is a leading cause of increased design variability. In the nanometer regime, lithography fails to keep pace with Moore's Law and printed feature sizes are a small fraction of the wavelength of light used in current processes. Such sub-wavelength printing makes features highly susceptible to perturbations in the lithographic process conditions, which leads to printed designs exhibiting increased variability. Such variability directly affects design behavior and performance in multiple ways. One of the areas of concern is power grid (PG) design, where lithographic errors may locally modify the wire widths. These variations, which may affect any and all wires in the grid, have a critical impact on the power distribution across the chip, introducing considerable current fluctuations which are a potential cause of electromigration effects. Analyzing and accounting for the impact of these errors requires a complete extraction of the PG, which generates a large resistive network, potentially with several million elements, whose simulation is computationally challenging. This paper proposes a fast and accurate variability analysis of very large resistive networks, such as PG extracted netlists, that allows estimating the effects of multiple parameter settings in reasonable time. The proposed model can be easily combined with Litho/CMP simulators in order to boost much needed design-aware lithography.

### 15:30 3.4.3 IMPLICIT INDEX-AWARE MODEL ORDER REDUCTION FOR RLC/RC NETWORKS

**Speakers:**
Nicodemus Banagaaya1, Giuseppe Ali'2, Wil H. A. Schilders1 and Caren Tischendorf3
1Eindhoven University of Technology, NL; 2University of Calabria and INFN, Gruppo collegato di Cosenza, IT; 3Institute of Mathematics, Humboldt-Universität zu Berlin, DE

**Abstract**
This paper introduces the implicit-IMOR method for differential algebraic equations. This method is a modification of the Index-aware model order reduction (IMOR) method proposed in our earlier papers which is the explicit-IMOR method. It also involves first splitting the differential-algebraic equations (DAEs) into differential and algebraic parts using a basis of projectors. In contrast with the explicit-IMOR method, the implicit-IMOR method leads to implicit differential and algebraic parts. We demonstrate the implicit-IMOR method using the RLC/RC networks, but it can also be applied to other problems which are modeled with differential-algebraic equations.

### 16:00 3.4.4 P/G TSV PLANNING FOR IR-DROP REDUCTION IN 3D-ICS

**Speakers:**
Shengcheng Wang1, Farshad Firouzi2, Fabian Oboril1 and Mehdi Tahoori1
1Karlsruhe Institute of Technology, DE; 2Karlsruhe Institute of Technology (KIT), DE

**Abstract**
In recent years, interconnect issues emerged as major performance challenges for Two-Dimensional-Integrated-Circuits (2D-ICs). In this context, Three-Dimensional-ICs (3D-ICs), which consist of several active layers stacked above each other, offer a very attractive alternative to conventional 2D-ICs. However, 3D-ICs also face many challenges associated with the Power Distribution Network (PDN) design due to the increasing power density and larger supply current compared to 2D-ICs. As an important part of 3D-IC PDNs, Power/Ground (P/G) Through-Silicon-Vias (TSVs) should be well-managed. Excessive or ill-placed P/G TSVs impact the power integrity (e.g. IR-drop), and also consume a considerable amount of chip real estate. In this work, we propose a Mixed-Integer-Linear-Programming (MILP)-based technique to plan the P/G TSVs. The goal of our approach is to minimize the average IR-drop while satisfying the total area constraint of TSVs by optimizing the P/G TSV placement. Therefore, the locations, sizes and the total number of the P/G TSVs are co-optimized simultaneously. The experimental results show that the average IR-drop can be reduced by 11.8% on average using the proposed method compared to a random placement technique with a much smaller runtime.
3.5 Robust Architectures

Date: Tuesday 25 March 2014
Location / Room: Konferenz 3
Chair: Todd Austin, University of Michigan, US
Co-Chair: Muhammad Shafique, Karlsruhe Institute of Technology, DE

This session presents the design of novel architectures to support real-time and secure systems. The first paper couples a time-division multiplexed NoC with a real-time memory controller to design a cost-effective real-time system with improved worst-case latency at reduced area and power consumption. The next paper proposes bus designs for multi-core systems, including the trade-offs between area, power consumption, and performance. The final paper presents a hardware solution using lockstep shadow thread execution to design a cost-effective real-time system with improved worst-case latency at reduced area and power consumption.

1. COST-EFFECTIVE DECAP SELECTION FOR BEYOND DIE POWER INTEGRITY
Authors: Yi-En Chen¹, Tu-Hsung Tsai¹, Shih-Hao Chen² and Hung-Ming Chen¹
Abstract
In many-core systems, it is essential to stabilize the power supply and maintain transmission quality (PQ) during operation. Power decaps are a vital element in power delivery networks (PDNs) to ensure the stability of power supply. The design of power decaps introduces new challenges for the PDN architecture. This paper presents a new methodology to solve the problem. Encouraging experimental results are reported to demonstrate the effectiveness of our approach.

2. CHARACTERIZING POWER DELIVERY SYSTEMS WITH ON/OFF-CHIP VOLTAGE REGULATORS FOR MANY-CORE PROCESSORS
Authors: Xuan Wang, Jiang Xu, Zhe Wang, Kevin J. Chen, Xiaowen Wu and Zhehui Wang
Abstract
The design of the power delivery system has a great influence on power management in many-core processor systems. Moving voltage regulators from off-chip to on-chip is gaining more and more interest in power delivery system design because it can provide fast voltage scaling and multiple power domains. Previous works have proposed power-efficient on-chip regulators. It is also important to analyze the characteristics of the entire power delivery system to optimize the performance and costs of employing on-chip regulators. In this work, we develop an analytical model to evaluate important characteristics of the power delivery system, including on-chip/off-chip voltage regulators and the passive on-chip/counter parasitic. Compared with SPICE simulations, our model achieves a fast system-level evaluation with comparable accuracy. Based on the model, geometric programming is utilized to find the optimal power efficiency of different architectures of power delivery systems under constraints of output voltage stability and area. Experiments show that, compared with the conventional architecture using off-chip regulators, the hybrid one using both on-chip and off-chip voltage regulators achieves 10% power efficiency improvement and 60% area reduction of voltage regulators on average. We conclude that the hybrid architecture has potential for high power efficiency and small area at heavy workload, but the overhead of on-chip regulators must be carefully accounted for.

3. MASK-COST-AWARE ECO ROUTING
Authors: Hsi-An Chien¹, Zhen-Yu Peng¹, Yun-Ru Wu², Ting-Hsüng Wang², Hsin-Chang Lin², Chi-Feng Wu² and Ting-Chi Wang²
Abstract
In this paper, we study a mask-cost-aware routing problem for engineering change order (ECO). By taking into account old routes for possible reuse, we present an approach for the problem. Encouraging experimental results are reported to demonstrate the effectiveness of our approach.

1.30 Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

Time | Label | Presentation Title | Authors
--- | --- | --- | ---
15:00 | 3.5.2 | BUS DESIGNS FOR TIME-PROBABILISTIC MULTICORE PROCESSORS | Javier Jalle1, Leonidas Kosmidis1, Jaume Abellà2, Eduardo Quinones1 and Francisco Cazorla3
1Barcelona Supercomputing Center, ES; 2Barcelona Supercomputing Center (BSC-CNS), ES; 3Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
Probabilistic Timing Analysis (PTA) reduces the amount of information needed to provide tight WCET estimates in real-time systems with respect to classic timing analysis. PTA imposes new requirements on hardware design that have been shown implementable for single-core architectures. However, no support has been proposed for multicores so far. In this paper, we propose several probabilistically-analyzable bus designs for multicore processors ranging from 4 cores connected with a single bus, to 16 cores deploying a hierarchical bus design. We derive analytical models of the probabilistic timing behaviour for the different bus designs, show their suitability for PTA and evaluate their hardware cost. Our results show that the proposed bus designs (i) fulfill PTA requirements, (ii) allow deriving WCET estimates with the same cost and complexity as in single-core processors, and (iii) provide higher guaranteed performance than single-core processors, 3.4x and 6.6x on average for an 8-core and a 16-core setup respectively.

15:30 | 3.5.3 | PROGRAMMABLE DECODER AND SHADOW THREADS: TOLERATE REMOTE CODE INJECTION EXPLOITS WITH DIVERSIFIED REDUNDANCY | Weidong Shi1, Ziyi Liu1, Shouhuai Xu2 and Zhiqiang Lin3
1University of Houston, US; 2University of Texas at San Antonio, US; 3University of Texas at Dallas, US
Abstract
We present a lightweight hardware framework for providing high assurance detection and prevention of code injection attacks using a lockstep diversified shadow execution. Recent studies show that hardware diversification can detect software attacks by checking the consistency of their behavior simultaneously. Unfortunately, the severe performance degradation and extra system costs caused by these methods are unacceptable in many applications. This paper presents a hardware-level, lockstep shadow thread framework to enrich the diversity of the software execution, with the facilitation from programmable hardware decoder and novel CPU support of tightly coupled non-executing shadow thread technique. Specifically, given a piece of (legacy) binary code, we first generate diversified binary versions using an offline binary rewriter and programmable hardware binary translator at runtime. Two diversified binary code images are launched as dual simultaneous threads in the hardware layer with one as the primary thread and the other one as shadow thread. Instructions from the shadow thread are not executed but just compared, and thus incur no OS side-effects. The extended CPU is able to decode instructions from both threads, and dispatch them to next stage pipeline for a lockstep comparison. Any mismatch of the decoded instructions from the two threads caused by remotely injected binary code will be detected. Our design provides instruction set randomization (ISR) with minimal cost in performance, when compared with straight-forward ISR implementation. The simulation results indicate that our framework incurs very small overheads and provides a protection against code injection attacks.
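
The lockstep comparison described in this abstract can be illustrated with a small software model. The sketch below is not the authors' hardware design: the XOR-based diversification, the per-image keys, and the function names `diversify`, `decode` and `lockstep_check` are all assumptions chosen for illustration. It only shows why a payload injected identically into two differently encoded images decodes to different instructions and is therefore caught by the comparison.

```python
# Toy model of lockstep comparison between a primary and a shadow image.
# ASSUMPTION: XOR with a per-image key stands in for binary diversification;
# the real framework uses a binary rewriter and a programmable hardware decoder.

def diversify(code: bytes, key: int) -> bytes:
    """Produce a diversified image by XOR-ing every byte with a per-image key."""
    return bytes(b ^ key for b in code)


def decode(image: bytes, key: int) -> bytes:
    """Model of the programmable decoder: undo the per-image encoding."""
    return bytes(b ^ key for b in image)


def lockstep_check(primary: bytes, shadow: bytes, key_p: int, key_s: int) -> int:
    """Return the index of the first decoded mismatch, or -1 if both streams agree."""
    dec_p = decode(primary, key_p)
    dec_s = decode(shadow, key_s)
    for i, (a, b) in enumerate(zip(dec_p, dec_s)):
        if a != b:
            return i
    if len(dec_p) != len(dec_s):
        # One stream ended early: report the first position without a partner.
        return min(len(dec_p), len(dec_s))
    return -1


if __name__ == "__main__":
    legit = bytes([0x01, 0x02, 0x03, 0x04])
    primary = diversify(legit, 0x5A)
    shadow = diversify(legit, 0xC3)
    print(lockstep_check(primary, shadow, 0x5A, 0xC3))

    # The attacker writes the same raw payload into both images.
    payload = b"\x90\x90"
    print(lockstep_check(primary[:2] + payload, shadow[:2] + payload, 0x5A, 0xC3))
```

Because the attacker does not know either key, the same raw bytes decode differently in the two images, so the first injected position is flagged.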

16:00 | IP1-19 | EXPLOITING NARROW-WIDTH VALUES FOR IMPROVING NON-VOLATILE CACHE LIFETIME | Guanshan Duan and Shuai Wang, Nanjing University, CN
Abstract
Due to the high cell density, low leakage power consumption, and less vulnerability to soft errors, the non-volatile memory technologies are among the most promising alternatives for replacing the traditional DRAM and SRAM technologies used in implementing main memory and caches in the modern microprocessor.

16:01 | IP1-20 | PARTIAL-SET: WRITE SPEEDUP OF PCM MAIN MEMORY | Li Bing1, Shan Shuchang2, Hu Yu2 and Li XiaoWei3
1ICT, UCAS, CN; 2ICT, CAS, CN; 3ICT, CAS, CN
Abstract
Phase change memory (PCM) is a promising nonvolatile memory technology developed as a possible DRAM replacement. Although it offers read latency close to that of DRAM, PCM generally suffers from long write latency. A long write request may block read requests on the critical path of cache/memory access, with an adverse impact on system performance. Besides, the write performance of PCM is very asymmetric, i.e., the SET operation (writing '1') is much slower than the RESET operation (writing '0'). In this work, we re-examine the resistance transform process during the SET operation of PCM and propose a novel Partial-SET scheme to alleviate the long write latency issue of PCM. During a write access to a memory line, a short Partial-SET pulse is applied first to program the PCM cells to a pre-stable state, achieving the same write latency as RESET. The partially-SET cells are then fully programmed within the retention window to preserve data integrity. Experimental results show that our Partial-SET scheme can improve the memory access performance of PCM by more than 45% on average with very marginal storage overhead.

16:00 | End of session

3.6 Cyber Physical Systems: Security and Co-design
Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 4
Chair: Rolf Ernst, Technische Universität Braunschweig, DE
Co-Chair: Anuradha Annaswamy, MIT, US
This session showcases recent results in cybersecurity and codesign in CPS. The first paper analyzes a stealth cyberattack scenario where a distributed sensor system is disturbed by an attacker who tries to reduce the sensor fusion quality and suggests an algorithmic approach to increase robustness against this attack. The second paper addresses the joint design of a feedback controller and a server-based resource reservation mechanism to guarantee closed-loop stability. The third paper describes a codesign approach formally guaranteeing control robustness for a communication channel with a bounded number of frame losses.

3.7 On line Strategies for Reliability

**Time**  14:30  3.6.1  (Best Paper Award Candidate)  ATTACK-RESILIENT SENSOR FUSION  
**Speakers:**  Radoslav Ivanov, Miroslav Pajic and Insup Lee, University of Pennsylvania, US  
**Abstract**  This work considers the problem of attack-resilient sensor fusion in an autonomous system where multiple sensors measure the same physical variable. A malicious attacker may corrupt a subset of these sensors and send wrong measurements to the controller on their behalf, potentially compromising the safety of the system. We formalize the goals and constraints of such an attacker who also wants to avoid detection by the system. We argue that the attacker's capabilities depend on the amount of information she has about the correct sensors' measurements. In the presence of a shared bus where messages are broadcast to all components connected to the network, the attacker may consider all other measurements before sending her own in order to achieve maximal impact. Consequently, we investigate effects of communication schedules on sensor fusion performance. We provide worst- and average-case results in support of the Ascending schedule, where sensors send their measurements in a fixed succession based on their precision, starting from the most precise sensors. Finally, we provide a case study to illustrate the use of this approach.
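
The abstract does not name the fusion algorithm, so the following is only a sketch under a stated assumption: each sensor reports an interval that should contain the true value, and fusion is done in the style of Marzullo's interval algorithm, keeping the span of points covered by at least `n - f` intervals when at most `f` sensors may be corrupted. The function names `fuse` and `ascending_schedule` are hypothetical; the latter orders sensors from most to least precise, as the Ascending schedule in the abstract does.

```python
# Sketch of interval-based sensor fusion tolerating up to f corrupted sensors.
# ASSUMPTION: sensors report closed intervals (lo, hi); fusion follows the
# Marzullo-style rule "points covered by at least n - f intervals".
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def fuse(intervals: List[Interval], f: int) -> Optional[Interval]:
    """Return the span of points lying in at least n - f intervals, or None."""
    need = len(intervals) - f
    if need <= 0:
        raise ValueError("f must be smaller than the number of sensors")
    events = []
    for lo, hi in intervals:
        if lo > hi:
            raise ValueError("interval bounds are reversed")
        events.append((lo, 0))  # 0 = start; sorts before an end at the same point
        events.append((hi, 1))  # 1 = end
    events.sort()

    count = 0
    fused_lo = fused_hi = None
    for x, kind in events:
        if kind == 0:
            count += 1
            if count >= need and fused_lo is None:
                fused_lo = x
        else:
            if count >= need:
                fused_hi = x  # still covered by enough intervals up to x
            count -= 1
    if fused_lo is None:
        return None
    return (fused_lo, fused_hi)


def ascending_schedule(intervals: List[Interval]) -> List[int]:
    """Transmission order: indices sorted from most precise (narrowest) sensor."""
    return sorted(range(len(intervals)), key=lambda i: intervals[i][1] - intervals[i][0])


if __name__ == "__main__":
    readings = [(9.0, 11.0), (9.5, 10.5), (12.0, 12.4)]  # third sensor is off
    print(fuse(readings, f=1))
    print(ascending_schedule(readings))
```

In this model an attacker who transmits last can see every other interval before choosing her own; sending the most precise sensors first limits how much a later, less precise (and possibly corrupted) sensor can shift the fused result.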

**Time**  15:00  3.6.2  BANDWIDTH-EFFICIENT CONTROLLER-SERVER CO-DESIGN WITH STABILITY GUARANTEES  
**Speakers:**  Amir Aminifar\(^1\), Enrico Bini\(^2\), Petru Eles\(^1\) and Zebo Peng\(^1\)  
\(^1\)Linköping University, SE; \(^2\)Lund University, SE  
**Abstract**  Many cyber-physical systems comprise several control applications implemented on a shared platform, for which stability is a fundamental requirement. This is in contrast to classical hard real-time systems, where the criterion is often meeting the deadline. However, the stability of control applications depends not only on the delay experienced, but also on the jitter. Therefore, the notion of deadline is considered to be artificial for control applications, which motivates the need for new techniques for designing cyber-physical systems. The approach in this paper is built on a server-based resource reservation mechanism, which provides compositionality, isolation, and the opportunity of systematic controller-server co-design. We address the controller-server co-design of such systems to obtain design solutions with the minimal bandwidth to guarantee stability.

**Time**  15:30  3.6.3  FAULT-TOLERANT CONTROL SYNTHESIS AND VERIFICATION OF DISTRIBUTED EMBEDDED SYSTEMS  
**Speakers:**  Matthias Kauer\(^1\), Damoon Soudbakhsh\(^2\), Dip Goswami\(^3\), Samarjit Chakraborty\(^4\) and Anuradha Annaswamy\(^5\)  
\(^1\)TUM CREATE Ltd., SG; \(^2\)Massachusetts Institute of Technology, US; \(^3\)Eindhoven University of Technology, NL; \(^4\)TU Munich, DE; \(^5\)MIT, US  
**Abstract**  We deal with synthesis of distributed embedded control systems closed over a faulty or severely constrained communication network. Such overloaded communication networks are common in cost-sensitive domains such as automotive. Design of such systems aims to meet all deadlines following the traditional notion of schedulability. In this paper, we assume robustness of the controller and propose a novel implementation approach to achieve a tighter design. Toward this, we answer two research questions: (i) given a distributed architecture, how to characterize and formally verify the bound on deadline misses, (ii) given such a bound, how to design a controller such that desired stability and Quality of Control (QoC) requirements are met. We address question (i) by modeling a distributed embedded architecture as a network of Event Count Automata (ECA), and subsequently introducing and formally verifying a property formulation with reduced complexity. We address question (ii) by introducing a novel fault-tolerant control strategy which adjusts the control input at runtime based on the occurrence of a fault or drop. We show that QoC under faulty communication improves significantly using the proposed fault-tolerant strategy.

**Time**  16:00  IP1-21, 195  GARBAGE COLLECTION FOR MULTI-VERSION INDEX ON FLASH MEMORY  
**Speakers:**  Kam-Yiu Lam\(^1\), Jian-Tao Wang\(^1\), Yuan-Hao Chang\(^1\), Jen-Wei Hsieh\(^1\), Po-Chun Huang\(^1\), Chung Keung Poon\(^1\) and ChunJiang Zhu\(^1\)  
\(^1\)City University of Hong Kong, HK; \(^2\)Academia Sinica, TW; \(^3\)National Taiwan University of Science and Technology, TW; \(^4\)Academia Sinica, TW; \(^5\)City University of Hong Kong, HK  
**Abstract**  In this paper, we study the important performance issues in using the purging-range query to reclaim old data versions to be free blocks in a flash-based multi-version database. To reduce the overheads for using the purging-range query in garbage collection, the physical block labeling (PBL) scheme is proposed to provide a better estimation on the purging version number to be used for purging old data versions. With the use of the frequency-based

**Time**  16:01  IP1-22, 395  D2CYBER: A DESIGN AUTOMATION TOOL FOR DEPENDABLE CYBERCARS  
**Speakers:**  Arslan Munir and Farinaz Koushanfar, Rice University, US  
**Abstract**  The next generation of automobiles (also known as cybercars) will increasingly incorporate electronic control units (ECUs) in novel automotive control applications. Recent work has demonstrated the vulnerability of modern car control systems to security attacks that directly impact the cybercar's physical safety and dependability. In this paper, we provide an integrated approach for the design of secure and dependable cybercars using a case study: a steer-by-wire (SBW) application over controller area network (CAN). The challenge is to embed both security and dependability over CAN while ensuring that the requirements for this subset of cybercar applications are not violated. Our approach enables early design feasibility analysis by embedding essential security (i.e., confidentiality, integrity, and authentication) protocols over CAN subject to the real-time constraints imposed by the desired quality of service and behavioral reliability. Our method leverages multi-core ECUs for providing fault-tolerance using redundant multi-threading (RMT) and also further enhances RMT for quick error detection. We quantify the error resilience of our approach and evaluate the interplay of performance, fault-tolerance, security, and scalability for our SBW case study.

**Time**  16:02  IP1-23, 819  CONTRACT-BASED DESIGN OF CONTROL PROTOCOLS FOR SAFETY-CRITICAL CYBER-PHYSICAL SYSTEMS  
**Speakers:**  Pierluigi Nuzzo, John Finn, Antonio Iannopollo and Alberto Sangiovanni-Vincentelli, University of California at Berkeley, US  
**Abstract**  We introduce a platform-based design methodology that addresses the complexity and heterogeneity of cyber-physical systems by using assume-guarantee contracts to formalize the design process and enable realization of control protocols in a hierarchical and compositional manner. Given the architecture of the physical plant to be controlled, the design is carried out as a sequence of refinement steps from an initial specification to a final implementation, including synthesis from requirements and mapping of higher-level functional and non-functional models into a set of candidate solutions built out of a library of components at the lower level. Initial top-level requirements are captured as contracts and expressed using linear temporal logic (LTL) and signal temporal logic (STL) formulas to enable requirement analysis and early detection of inconsistencies. Requirements are then refined into a controller architecture by combining reactive synthesis steps from LTL specifications with simulation-based design space exploration steps. We demonstrate our approach on the design of embedded controllers for aircraft electric power distribution.

**Time**  16:00  End of session  
This session presents different approaches to improving the reliability of circuits and systems by using on-line techniques. It shows different methods that can be applied to caches, processors and multicore architectures.

**Abstract**

Technology scaling leads to significant faulty bit rates in on-chip caches. In this work, we propose a methodology to mitigate the impact of defective bits (due to permanent faults) in first-level set-associative data caches. Our technique assumes that faulty caches are enhanced with the ability to disable their defective parts at cache subblock granularity. Our experimental findings reveal that, while hard faults in faulty caches may have a significant impact on performance, there is considerable room for improvement if the spatial reuse patterns of the to-be-referenced blocks are taken into account (not all the data fetched into the cache is accessed). To this end, we propose frugal PC-indexed spatial predictors (with very small storage requirements) to orchestrate the (re)placement decisions among the fully and partially unusable faulty blocks.

This paper designs and implements Redundant Multi-Threading (RMT) in a Data-flow scheduled Multi-Threaded (DMT) multicore processor, called Data-Flow schedulable Redundant Multi-Threading (DRMT). It also presents Asynchronous Output Comparison (AOC) for RMT techniques to avoid fault-detection-related inter-core communication and alleviate the performance and hardware overheads induced by output comparison. Results show that the performance overhead of DRMT is less than 60% even when the number of threads is four times the number of processing elements. The performance and hardware overheads of AOC are also insignificant.

In this paper, we demonstrate that the sensitized path delays in various microprocessor pipe stages exhibit intriguing temporal and spatial variations during the execution of real world applications. To effectively exploit these delay variations, we propose Dynamically Adaptable Resilient Pipeline (DARP)--a series of runtime techniques to boost power performance efficiency and fault tolerance in a pipelined microprocessor. DARP employs early error prediction to avoid a major portion of timing errors. Using a rigorous circuit-architectural infrastructure, we demonstrate substantial improvements in the performance (9.4-20%) and energy efficiency (6.4-27.9%), compared to state-of-the-art techniques.

**Abstract**

In this paper, we design and implement a fault-detection mechanism in a data-flow scheduled multi-threaded processor. We present a methodology to detect and avoid faults in a multi-core processor. Our approach is based on the observation that faults can be detected in the execution of real-world applications. We present a novel methodology to detect faults in real-time applications. Our methodology exploits the data-flow scheduling of tasks in a multi-core processor to detect and avoid faults.

MISSION PROFILES - SOLUTION OR CHALLENGE? THE OEM PERSPECTIVE

Speaker:
Ulrich Abelein, AUDI AG, DE

Abstract
The original equipment manufacturer (OEM) is driven by its own quality and innovation goals to implement the newest available and suitable semiconductor technologies. In this talk the OEM perspective with regard to mission profiles will be presented and discussed. The difference between the current use of standard sets of requirements and a mission profile approach will be evaluated. This will be demonstrated by actual and upcoming challenges in the automotive industry. Therefore the use of up-to-date technologies in accordance with declining maturing and product development times has to be considered. Mission profiles become increasingly important as they provide the opportunity to cover these requirements. A necessary step to assemble a mission profile is the derivation of relevant functional load and environmental stress conditions of an electronic component and its sub-components. Therefore a formalized communication within the supply chain is necessary to ensure a consistent availability of all relevant data. Dominant loads must be determined and appropriately allocated. One challenge is to consider the influence of singular events on sporadic failures. Another challenge is the different time frame of the product engineering process of OEM, Tier 1 and semiconductor manufacturer. Despite the existence of multiple challenges to derive mission profiles, the mission profiles approach shows great promise to enable the design of robust electronic components for specific applications even in the presence of yet immature technologies.

MISSION PROFILE AWARE IC DESIGN - A CASE STUDY

Speakers:
Goeran Jerke1 and Andrew Kahng2
1Robert Bosch GmbH, DE; 2University of California, San Diego, US

Abstract
In this paper we propose to exploit so called Mission Profiles to address increasing requirements on safety and power efficiency for automotive power ICs. Mission profile awareness aids the automation of robustness aware design by formalizing and partially automating the generation, transformation, propagation and usage of all component-specific functional loads and environmental conditions for design implementation and validation. In addition, it aids the development of electronic components in yet immature technologies or in technologies with tight property variation bounds. This paper introduces the general concept, requirements and context of mission profile aware design. The general design approach is presented along with key differences and enhancements to existing design approaches. A case study focusing on mission profile usage and electromigration failure avoidance is presented to demonstrate various aspects of mission profile aware design.

MISSION PROFILE AWARE ROBUSTNESS ASSESSMENT OF AUTOMOTIVE POWER DEVICES

Speakers:
Thomas Nirmaier1, Andreas Burger2, Manuel Harrant1, Alexander Viehl1, Oliver Bringmann3, Wolfgang Rosenstiel1 and Georg Pelz1
1Infineon Technologies AG, DE; 2FZI Research Center for Information Technology, DE; 3University of Tuebingen, DE

Abstract
In this paper we propose to exploit so-called Mission Profiles to address increasing requirements on safety and power efficiency for automotive power ICs. These Mission Profiles constrain the required device performance space to valid application scenarios. Mission Profile data can be represented in arbitrary forms such as temperature histograms or cumulated drive cycle data. Hence, the derivation of realistic verification scenarios on device level requires the generation of environmental properties such as temperatures, board net conditions or currents. For the assessment of real application robustness we present a methodology to extract finite state machines from measured vehicle data and integrate them into Mission Profiles. Subsequently, Markov processes are derived from these finite state machines in order to automatically generate Mission Profile compliant test scenarios for the design and verification process. As a motivating example we show industry fault cases in which missing application fitness to power transient variations finally results in device failure. Verification results based on lab data are outlined and show the benefits of a fully mission profile driven IC verification flow.
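
The step from measured data to generated test scenarios can be sketched as a first-order Markov chain estimated from an observed state sequence. This is a minimal illustration, not the authors' flow: the state names, the trace, and the functions `estimate_transitions` and `generate_scenario` are hypothetical, and the sketch assumes the vehicle data has already been abstracted into a sequence of discrete states.

```python
# Sketch: estimate a Markov chain from an observed state trace and sample
# scenarios that only use transitions seen in the data.
# ASSUMPTION: the measured vehicle data is already discretised into states.
import random
from collections import Counter, defaultdict
from typing import Dict, List


def estimate_transitions(trace: List[str]) -> Dict[str, Dict[str, float]]:
    """Relative transition frequencies between consecutive states of the trace."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        counts[cur][nxt] += 1
    return {
        state: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for state, nxts in counts.items()
    }


def generate_scenario(
    probs: Dict[str, Dict[str, float]], start: str, length: int, rng: random.Random
) -> List[str]:
    """Random walk over the estimated chain; stops early at a state with no successor."""
    scenario = [start]
    state = start
    for _ in range(length - 1):
        successors = probs.get(state)
        if not successors:
            break
        states = list(successors)
        weights = [successors[s] for s in states]
        state = rng.choices(states, weights=weights, k=1)[0]
        scenario.append(state)
    return scenario


if __name__ == "__main__":
    # Hypothetical discretised supply-state trace of a power device.
    trace = ["idle", "crank", "run", "run", "idle", "crank", "run"]
    probs = estimate_transitions(trace)
    print(probs)
    print(generate_scenario(probs, "idle", 10, random.Random(0)))
```

Every generated scenario follows only transitions observed in the trace, which is one simple reading of "Mission Profile compliant"; a real flow would additionally attach physical quantities such as temperatures or currents to each state.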

APPLICATION OF MISSION PROFILES TO ENABLE CROSS-DOMAIN CONSTRAINT-DRIVEN DESIGN

Speakers:
Carolin Katzschke1, Marc-Philipp Sohn1, Markus Olbrich1, Volker Meyer zu Bexten 2, Markus Tristl2 and Erich Barke1
1Institute of Microelectronic Systems, Leibniz Universität Hannover, DE; 2Infineon Technologies AG, DE

Abstract
Mission Profiles contain top-level stress information for the design of future systems. These profiles are refined and transformed into design constraints. We present methods to propagate the constraints between design domains such as package and chip. We also introduce a cross-domain methodology for the corresponding constraint transformation system, ConDUTC. The proposed methods are demonstrated on the basis of an automotive analog/mixed-signal application.

MISSION PROFILES - SOLUTION OR CHALLENGE? THE OEM PERSPECTIVE

Speaker:
Ulrich Abelein, AUDI AG, DE

Abstract
The original equipment manufacturer (OEM) is driven by its own quality and innovation goals to implement the newest available and suitable semiconductor technologies. In this talk the OEM perspective with regard to mission profiles will be presented and discussed. The difference between the current use of standard sets of requirements and a mission profile approach will be evaluated. This will be demonstrated by actual and upcoming challenges in the automotive industry. Therefore the use of up-to-date technologies in accordance with declining maturing and product development times has to be considered. Mission profiles become increasingly important as they provide the opportunity to cover these requirements. A necessary step to assemble a mission profile is the derivation of relevant functional load and environmental stress conditions of an electronic component and its sub-components. Therefore a formalized communication within the supply chain is necessary to ensure a consistent availability of all relevant data. Dominant loads must be determined and appropriately allocated. One challenge is to consider the influence of singular events on sporadic failures. Another challenge is the different time frame of the product engineering process of OEM, Tier 1 and semiconductor manufacturer. Despite the existence of multiple challenges to derive mission profiles, the mission profiles approach shows great promise to enable the design of robust electronic components for specific applications even in the presence of yet immature technologies.
UB03.01 LARA: THE LARA COMPILER SUITE
Authors: Joao Bispo, Pedro Pinto, Ricardo Nobre, Tiago Carvalho and Joao Cardoso, Universidade do Porto, PT

Abstract
LARA is an aspect-oriented programming (AOP) language which allows the description of sophisticated code instrumentation schemes, advanced mapping strategies including conditional decisions based on hardware/software resources, and sophisticated sequences of compiler transformations. Furthermore, LARA provides mechanisms for controlling all elements of a toolchain in a consistent and systematic way, using a unified programming interface. We present three compiler tools developed around the LARA technology: MATISSE, MANET and ReflectC. MATISSE is a compiler which 1) allows analyses and transformations on MATLAB code and 2) generates C code from the MATLAB code. MATISSE can be fully controlled by LARA aspects, which can define the type and shape of MATLAB variables, specify code insertion/removal actions, and define specialization directives and other additional information. MATISSE can output transformed MATLAB code and specialized C code. The knowledge provided by the LARA aspects allows MATISSE to generate C code tailored to specific targets (e.g., using statically declared arrays to be compliant with high-level synthesis tools such as Catapult C). MANET is a source-to-source compiler for ANSI C based on Cetus, and is controlled using LARA aspects. MANET leverages the expressiveness and modularity of LARA to query and manipulate the Cetus AST, providing an easy compilation flow with the main goal of code instrumentation and code transformation. LARA aspects allow for a simple selection of program elements in the code which can be analyzed or transformed, by either consulting their attributes or applying actions. Thus, MANET can be used to provide information reports based on compiler analyses, to implement sophisticated code instrumentation strategies, or to perform code optimizations and transformations. ReflectC is a C compiler based on CoSy's compiler framework. CoSy's configurability and retargetability make ReflectC particularly effective for exploration of compiler transformations and optimizations on possible architecture variations, and it is being used for hardware/software co-design and design space exploration (DSE). We will present demos of the tools and the use of LARA aspects and strategies to guide our suite of compilation tools, providing: 1) C code generation from MATLAB code, according to information provided by LARA aspects; 2) instrumentation of C code to be used for collecting specific compile-time and runtime information (e.g., execution time, range of values for specific variables, custom profiling); 3) user-controlled compiler optimizations targeting several architectures and DSE of sequences of compiler optimizations aimed at performance improvements. In addition to presenting examples for each of the tools of the LARA compilation suite, we show an execution of the complete toolchain, controlled by LARA aspects.

UB03.02 AN AUTOMATED DESIGN FLOW FOR FAST PROTOTYPING OF SIMULINK MODELS ONTO MPSoC
Authors: Francesco Robino and Johnny Öberg, Royal Institute of Technology, SE

Abstract
Simulink is a modelling environment suitable for modelling embedded systems at system level. However, there is no standard way to rapidly prototype Simulink models onto a modern multiprocessor system-on-chip (MPSoC). In this demonstration we show how our NoC System Generator tool can be used as part of an automated platform-based design flow to synthesize a Simulink model to a network-on-chip based MPSoC implementation on FPGA. The performance of the generated prototype scales with the number of processors.

UB03.03 PATN: A PERFORMANCE ANALYSIS TOOL FOR NOC
Authors: Yang Chen and Zhonghai Lu, KTH Royal Institute of Technology, SE

Abstract
With more processors integrated onto a single chip and more time-sensitive applications added to on-chip systems, performance bound analysis becomes essential for the design and evaluation of QoS Networks-on-Chip (NoCs). To provide reliable and automated analysis for QoS NoCs, we propose PATN (Performance Analysis Tool for NoC), which automatically computes the end-to-end delay bounds of data flows and the backlog bounds of buffers for NoCs with arbitrary topology. PATN is based on network calculus, which rests on solid mathematical foundations and provides well-founded guarantees on the accuracy of the results. Network-calculus-based analysis has been successfully employed for various communication networks, such as SpaceWire and AFDX. For example, Airbus adopted and approved network-calculus-based analysis for certification of its A380 aircraft. In this demonstration, we give an overview of PATN in two parts. First, we explore the architecture and main functions: we show the workflow and the tracing log by analysing the end-to-end delay bound of a data flow in a simple network. The log shows that the analysis follows the theoretical methodology exactly and therefore obtains correct results that are as tight as the theory can achieve. Second, we use PATN to analyse the delay bounds and backlog bounds for three NoCs with different topologies: binary tree, mesh, and a hierarchical topology combining binary tree and mesh. The analyses demonstrate the computation speed and scalability of PATN. Moreover, we compare the delay bounds computed with different configuration parameters of the flows and routers, showing how the delay bound is affected by these parameters.
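
To make the kind of bound discussed above concrete, the sketch below evaluates the textbook network calculus results for a single flow with a token-bucket arrival curve (burst `b`, rate `r`) crossing rate-latency servers (rate `R`, latency `T`): delay at most `T + b/R` and backlog at most `b + r*T`, provided `r <= R`. This is only the standard closed form, not PATN's algorithm; the function names and the numeric values (flits, cycles) are assumptions made for illustration.

```python
def concatenate(servers):
    """Combine rate-latency servers (rate, latency) traversed in sequence.

    The tandem behaves like one rate-latency server with the minimum
    rate and the sum of the latencies.
    """
    rate = min(r for r, _ in servers)
    latency = sum(t for _, t in servers)
    return rate, latency

def delay_bound(burst, rate, service_rate, latency):
    """Worst-case delay for a token-bucket flow at a rate-latency server."""
    if rate > service_rate:
        raise ValueError("flow rate exceeds service rate; bound is infinite")
    return latency + burst / service_rate

def backlog_bound(burst, rate, service_rate, latency):
    """Worst-case backlog for a token-bucket flow at a rate-latency server."""
    if rate > service_rate:
        raise ValueError("flow rate exceeds service rate; bound is infinite")
    return burst + rate * latency

# Hypothetical flow: burst of 8 flits, 0.25 flits/cycle, crossing two routers
# that each guarantee 1 flit/cycle after at most 3 cycles of latency.
R, T = concatenate([(1.0, 3.0), (1.0, 3.0)])
end_to_end_delay = delay_bound(8.0, 0.25, R, T)   # cycles
total_backlog = backlog_bound(8.0, 0.25, R, T)    # flits
```

Concatenating the servers before computing the delay charges the burst only once, which is why an end-to-end bound is tighter than the sum of per-router bounds.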

UB03.04 COMPILER FOR MAPPING STREAM PROCESSING APPLICATIONS ONTO REAL-TIME HETEROGENEOUS MULTIPROCESSOR SYSTEMS
Authors: Stefan Geuns, Berend Dekens, Philip Wilmanns, Joost Hausmans, Guus Kuiper and Marco Bekooij, University of Twente, NL

Abstract
Heterogeneous multiprocessor systems are employed for power-efficiency reasons in wearable software-defined radios. These systems are hardware cost-effective and offer superior performance compared to their homogeneous counterparts. However, these systems are notoriously hard to program without tool support, which makes it desirable to simplify programming with the help of an optimizing multiprocessor compiler for stream processing applications. This demonstration shows our multiprocessor compiler for mapping real-time stream processing applications onto our real-time heterogeneous multi-core system. The applications are described as sequential programs and are compiled into parallel task graphs. Buffer capacities are computed using dataflow analysis techniques given the real-time constraints of the application. Our multi-core system contains 16 MicroBlaze processor cores as well as two hardware accelerators and is prototyped on a Xilinx Virtex-6 FPGA. A connection-less communication ring is used for inter-processor communication. Our system is equipped with an analog RF front-end, which enables us to demonstrate PAL video reception and decoding.

UB03.05 HWEDEBLUR: DESIGN OF A HIGH PERFORMANCE CORE FOR REMOVING BLUR EFFECT ON IMAGES
Authors: Giuseppe Airo' Farulla, Giulio Gambardella, Marco Indaco, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT

Abstract
This work aims at developing a high-performance FPGA-based IP core able to perform a deblurring algorithm in real time. Modern approaches to deblurring usually either handle only simple types of blur or need heavy user interaction. Moreover, they usually require several minutes (or even hours) to process a single image. Our purpose is to study the current state of the art and identify the best deblurring algorithms that are suitable for a hardware implementation. The selected algorithm is optimized and implemented in hardware in order to perform the deblurring task with the highest possible performance.

UB03.06 PHARAOH: PARALLEL AND HETEROGENEOUS ARCHITECTURES FOR REAL-TIME APPLICATIONS
Authors: Luciano Lavagno1, Mihai Lazarescu1, Hector Posadas2 and Eugenio Villar2
1Politecnico di Torino, IT; 2Universidad de Cantabria, ES

Abstract
In this demo, we will present the work in progress of the EU FP7 PHARAOH project, started in September 2011. The first objective of the project is the development of new techniques and tools capable of assisting the designer in the development of parallel embedded systems, from executable specifications to target-specific implementation and debugging on a multicore platform. This tool chain offers and implements several parallelization strategies, reflecting the functional and non-functional constraints of the system, and guiding the designer through incremental parallelization and adaptation steps. The second objective of the project is to develop monitoring and control techniques in the middleware of the system capable of automatically adapting platform services to application requirements and thereby reducing power consumption transparently. Specifically, the demo will cover the software parallelization tool suite and the parallel software modeling and code generation suite.

... executed in isolation. The VEPs are also predictable, meaning that all interference is bounded. This makes them virtualized also in terms of performance bounds, which enables firm real-time applications to be verified using formal performance analysis frameworks. The CompSOC platform uses the CoMik microkernel to implement virtual processors on each processor tile through temporal partitioning. Each application can use its own operating system (e.g. Compose, μCOS-III) and model of computation (e.g. CSDF, KPN, TT) in its VEP, to suit its level of time criticality. As more applications are integrated on a single SOC, the need arises for more dynamic behaviour. The system should be able to start, modify and stop applications at run time without affecting running applications. For this purpose the CompSOC platform has been extended with a predictable and composable resource management framework. It manages application bundles that contain 1) an application in the form of executables (ELFs on multiple processors), and also 2) the specifications of the (one or more) particular VEPs that the application executes in, consisting of virtual processors, NOC connections, virtualised memories, etc. At run time, the resource management framework can dynamically load and start application bundles by creating a VEP and then loading, booting, and executing an application within it. VEPs can also be modified, stopped, and deleted at run time. Our University Booth will present virtual-execution-platform and application-bundle concepts using an interactive demonstrator. It will show that the CompSOC has been extended with dynamic functionality, without sacrificing its key strengths: composability and predictability.
We will demonstrate this through the use of the resource management framework and application bundles, showing that we can create, modify and delete virtual execution platforms running a mixed time-criticality application dynamically at run-time.

A HOLISTIC APPROACH TO POWER MANAGEMENT FOR ENERGY HARVESTING EMBEDDED SYSTEMS

Authors: Kyungsoo Lee, Hideki Takase and Tohru Ishihara, Kyoto University, JP

Abstract
We present a holistic approach to maximizing the energy efficiency of energy harvesting embedded systems, which consist of a processor system and an energy harvesting system. A power management program integrated into a real-time OS optimally switches the operation mode of the processor and the configuration of the energy harvesting system according to the processor workload and the harvesting situation. The demonstration will show that our prototype system, consisting of our processor chip and harvesting system board, runs stably using harvested energy only. The processor has multiple cores, each with a different performance level, so that it can be operated efficiently by our power management program implemented on Toppers OS.

FAULTIFY: PROBABILISTIC CIRCUIT FAULT EMULATION

Authors: David May and Walter Stechele, TUM, DE

Abstract
We demonstrate an FPGA-based probability-aware fault emulator and its corresponding algorithms in the context of a real-time H.264 decoder. The demo will show that reliability constraints can be relaxed inside the circuit without noticeable degradation of the image quality, provided that the locations where the constraints can be relaxed are investigated carefully. We will show how this investigation can be done using our emulator, and we will show the effect of relaxed circuit robustness in real time.

RTL+: DESIGN ENVIRONMENT: WALK BEFORE YOU RUN.

Authors: Somayeh Sadeghi-Kohan, Behnaz Pourmohseni, Amir Reza Neekoei, Hanieh Hashemi, Hamed Najafi Haghi and Zainalabedin Navabi, University of Tehran, IR

Abstract
To enable development of high-level designs with hardware correspondence, synthesizability must be satisfied in a top-down manner. Thus, in this work, instead of using TLM-2.0, which is not established for synthesis, we start with a level above the RT level, "RTL+". "RTL+" essentially uses TLM-1.0 channels and includes abstract communication and handshaking that are mainly hidden from the designer. We develop a package of SystemC channels with hardware correspondence (synthesizable HDL) for the communication between various cores (with simple interfaces) and standard buses.

17:30 End of session
18:30 Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level)

The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

IP1 Interactive Presentations

Date: Tuesday 25 March 2014
Time: 16:00 - 16:30
Location / Room: Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated to the IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award ‘Best IP of the Day’ is given.

Label Presentation Title

IP1-1 SAFE: SECURITY-AWARE FLEXRAY SCHEDULING ENGINE

Speakers: Gang Han1, Haibo Zeng2, Yaping Li3 and Wenhua Dou4

1National University of Defense Technology, CN; 2McGill University, CA; 3The Chinese University of Hong Kong, CN

Abstract
In this paper, we propose SAFE (Security Aware FlexRay scheduling Engine), to provide a problem definition and a design framework for FlexRay static segment scheduling to address the new challenge on security. From a high level specification of the application, the architecture and communication middleware are synthesized to satisfy security requirements, in addition to extensibility, costs, and end-to-end latencies. The proposed design process is applied to two industrial case studies consisting of a set of active safety functions and an X-by-wire system respectively.
**IP1-2**  
**TRANSIENT ERRORS RESILIENCY ANALYSIS TECHNIQUE FOR AUTOMOTIVE SAFETY CRITICAL APPLICATIONS**  
Speakers: Sujan Pandey and Bart Vermeulen, NXP Semiconductors, NL  

**Abstract**  
When a single bit is flipped as a result of a transient error in an electronic circuit, its effect can have a severe impact if the circuit is deployed in safety critical domains such as automotive, aeronautics, and industrial automation. In the design phase it is therefore essential to evaluate, and where necessary improve, the resilience of a circuit to all possible transient errors. In this paper, we present a method to analyze the transient error resiliency of a digital circuit. This method is based on an analytical model. It models a transient error as a random function and finds the vulnerable number of bits for each node. We perform a case study on a circuit implementation of a well-known adaptive filter algorithm. The results from the analytical and simulation models show that the analytical model is accurate enough to estimate the effects of transient errors on the performance of a digital circuit. Our analytical method also reduces the run time significantly in a design phase.

**IP1-3**  
**MODEL BASED HIERARCHICAL OPTIMIZATION STRATEGIES FOR ANALOG DESIGN AUTOMATION**  
Speakers: Engin Aflacan, Gunhan Dundar, Falk Baskaya, Simge Ay and Francisco Fernandez  

**Abstract**  
The design of complex analog circuits using flat optimization-based approaches is inefficient, even impossible, due to the high number of design variables and the growth of the cost of performance evaluation with the circuit size. Over the past two decades, top-down hierarchical design approaches have been developed and applied. They are based on hierarchical circuit decomposition and specification transmission from top-level to lower-level blocks. However, such specification transmission is usually performed with little knowledge of the feasibility of the specifications, leading to costly redesign iterations. Even if the specification transmission is successful, there is no guarantee that it is optimal in terms of, e.g., power consumption or area occupation. To alleviate this problem, two novel model-based hierarchical synthesis methods are proposed in this paper: Model-Based Hierarchical Optimization (MBHO) and Improved Model-Based Hierarchical Optimization (IMBHO). They are based on concurrent design at higher and lower hierarchical levels and appropriate communication between the different processes. Experimental results on a filter example comparing the new approaches and the conventional top-down design approach are provided.

**IP1-4**  
**A NOVEL LOW POWER 11-BIT HYBRID ADC USING FLASH AND DELAY LINE ARCHITECTURES**  
Speakers: Hsun-Cheng Lee and Jacob Abraham, the University of Texas at Austin, US  

**Abstract**  
This paper presents a novel low power 11-bit hybrid ADC using flash and delay line architectures, where a 4-bit flash ADC is followed by a 7-bit delay-line ADC. This hybrid ADC inherits accuracy and power efficiency from flash ADCs and delay-line ADCs, respectively. Also, in order to reduce the power of the first-stage flash ADC, a power-saving technique is adopted that biases the DC tail current of the pre-amplifiers at $5 \mu A$ in stand-by mode instead of the operational current of $47 \mu A$. The hybrid ADC was designed and simulated in a commercial 65nm process. With a 1.1 V supply and 100 MS/s, the ADC achieves an SNDR of 60 dB and consumes 1.6 mW, which results in a figure of merit (FOM) of 19.4 fJ/conversion-step without any calibration technique. Also, Monte Carlo simulations are performed with a 3σ device mismatch for the SNDR estimation, and the SNDR is observed to be better than 58.5 dB.
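
The reported figure of merit can be sanity-checked from the other numbers in the abstract. The sketch below assumes the commonly used Walden definition, FOM = P / (2^ENOB · f_s) with ENOB = (SNDR − 1.76) / 6.02; the abstract does not say which definition the authors used, so this is an assumption, and the function names are illustrative.

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB (ideal-quantizer relation)."""
    return (sndr_db - 1.76) / 6.02

def walden_fom(power_w, sndr_db, sample_rate_hz):
    """Walden figure of merit in joules per conversion step."""
    return power_w / (2 ** enob(sndr_db) * sample_rate_hz)

# Values quoted in the abstract: 1.6 mW, 60 dB SNDR, 100 MS/s.
fom_fj = walden_fom(1.6e-3, 60.0, 100e6) * 1e15  # femtojoules per step
```

With these rounded inputs the result is roughly 19.6 fJ/conversion-step, close to the reported 19.4 fJ; the small gap is consistent with the SNDR being slightly above the rounded 60 dB.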

**IP1-5**  
**SEMI-SYMBOLIC ANALYSIS OF MIXED-SIGNAL SYSTEMS INCLUDING DISCONTINUITIES**  
Speakers: Carma Radojicic, Christoph Grimm, Javier Moreno and Xiao Pan, TU Kaiserslautern, DE  

**Abstract**  
The paper describes an approach for semi-symbolic analysis of mixed-signal systems that contain discontinuous functions, e.g. due to modeling comparators. For modeling and semi-symbolic simulation, we use extended Affine Arithmetic. Affine Arithmetic is currently limited to accurate analysis of linear and mildly non-linear functions, and cannot yet handle discontinuities. In this paper we extend the approach to also handle discontinuities. For demonstration, we symbolically analyze a ΣΔ-modulator.
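
For readers unfamiliar with Affine Arithmetic, the following minimal sketch shows the standard (non-extended) form: a quantity is a central value plus a weighted sum of noise symbols, each ranging over [-1, 1], and linear operations are exact because shared symbols are tracked. The class and symbol names are hypothetical, and the paper's extension for discontinuities is not reproduced here; the last lines only show why a comparator is problematic for the plain form.

```python
class AffineForm:
    """x = center + sum(coeff_i * eps_i), with each eps_i in [-1, 1]."""

    def __init__(self, center, terms=None):
        self.center = center
        self.terms = dict(terms or {})

    def __add__(self, other):
        terms = dict(self.terms)
        for symbol, coeff in other.terms.items():
            terms[symbol] = terms.get(symbol, 0.0) + coeff
        return AffineForm(self.center + other.center, terms)

    def scale(self, factor):
        scaled = {s: c * factor for s, c in self.terms.items()}
        return AffineForm(self.center * factor, scaled)

    def __sub__(self, other):
        return self + other.scale(-1.0)

    def bounds(self):
        radius = sum(abs(c) for c in self.terms.values())
        return self.center - radius, self.center + radius

x = AffineForm(1.0, {"e1": 0.5})    # 1.0 +/- 0.5
y = AffineForm(2.0, {"e2": 0.25})   # 2.0 +/- 0.25

total = x + y   # independent deviations add up
zero = x - x    # shared symbol cancels exactly, unlike interval arithmetic

# A comparator input whose range straddles the threshold has no single
# affine output: the result is 0 for some deviations and 1 for others.
low, high = (x - AffineForm(1.2)).bounds()
comparator_is_ambiguous = low < 0.0 < high
```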

**IP1-6**  
**NOVEL CIRCUIT TOPOLOGY SYNTHESIS METHOD USING CIRCUIT FEATURE MINING AND SYMBOLIC COMPARISON**  
Speakers: Cristian Ferent and Alex Doboli, Stony Brook University, US  

**Abstract**  
This paper presents a reasoning-based approach to analog circuit synthesis using ordered node clustering representations (ONCR) to describe alternative circuit features and symbolic circuit comparison to characterize performance trade-offs of synthesized solutions. Case studies illustrate application of the proposed methods to topology selection and refinement.

**IP1-7**  
**AN EMBEDDED OFFSET AND GAIN INSTRUMENT FOR OPAMP IPS**  
Speakers: Jinbo Wan and Hans Kerkhoff, CAES-TDT, CTIT, University of Twente, NL  

**Abstract**  
Analog and mixed-signal IPs are increasingly required to use digital fabrication technologies and are deeply embedded into systems-on-chip (SoCs). These developments add more requirements and challenges to analog testing methodologies. Traditional analog testing methods suffer from reduced accessibility and control with regard to these embedded analog circuits in SoCs. As an alternative, an embedded instrument for analog OpAmp IP tests is proposed in this paper. It can provide the exact gain and offset values of OpAmps instead of only a pass/fail result. Moreover, it is a non-invasive monitor and can work online without isolating the DUT OpAmp from its surrounding feedback networks. Nor does it require accurate test stimuli. In addition, the monitor can remove its own offsets and mismatch.

**IP1-8**  
**EVX: VECTOR EXECUTION ON LOW POWER EDGE CORES**  
Speakers: Milovan Duric, Oscar Palomar, Aaron Smith, Osman Unsal, Adrian Cristal, Mateo Valero and Doug Burger  

**Abstract**  
In this paper, we present a vector execution model that provides the advantages of vector processors on low power, general purpose cores, with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases the efficiency and hardware resources utilization. We use a modest dual issue core based on an Explicit Data Graph Execution (EDGE) architecture to implement our approach, called EVX. Unlike most DLP accelerators which utilize additional hardware and increase the complexity of low power processors, EVX leverages the available resources of EDGE cores, and with minimal costs allows for specialization of the resources. EVX adds a control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x compared to a scalar baseline and outperforms multimedia SIMD extensions.
PROGRAM AFFINITY PERFORMANCE MODELS FOR PERFORMANCE AND UTILIZATION

Speakers:
Ryan Moore and Bruce Childers, University of Pittsburgh, US

Abstract
Multithreaded applications have a wide variety of behavior, causing complex interactions with today’s chip multiprocessor machines. Application threads may have large private working sets, and may compete for cache space and memory bandwidth. These threads benefit from large private caches. Other threads may share data or communicate, and thus execute more quickly if using shared caches. Many applications fall somewhere in between, requiring careful thread-to-core assignments to maximize performance. Yet because of the large number of possible thread-to-core assignments on today’s chip multiprocessors, it is prohibitive in time and energy to exhaustively try them and determine the best assignment. In this paper, we present and demonstrate application performance models that predict application performance given a proposed thread-to-core assignment. We show how these models can be quickly built and used to select thread-to-core assignments for multiple programs and to improve system utilization.

ADVANCED SIMD: EXTENDING THE REACH OF CONTEMPORARY SIMD ARCHITECTURES

Speakers:
Matthias Boettcher1, Giacomo Gabrielli2, Mbou Eyou2, Alastair Reid2 and Bashir M. Al-Hashimi1
1University of Southampton, GB; 2ARM Ltd., GB

Abstract
SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures such as the Intel SSE/AVX have evolved by adding support for wider registers and data types, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity. This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/data/stack width and L1 data cache configuration) is quantified and discussed. We developed an ARMv7 NEON based ISA extension (ARGON), augmented a cycle-accurate simulation framework for it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.

A TIGHTLY-COUPLED HARDWARE CONTROLLER TO IMPROVE SCALABILITY AND PROGRAMMABILITY OF SHARED-MEMORY HETEROGENEOUS CLUSTERS

Speakers:
Paolo Burgio1, Robin Danilo2, Andrea Marongiu3, Philippe Coussy4 and Luca Benini5
1University of Bologna, Université de Bretagne-Sud, IT; 2Université de Bretagne-Sud, FR; 3University of Bologna, IT; 4Université de Bretagne-Sud / Lab-STICC, FR; 5Università di Bologna, IT

Abstract
Modern embedded systems are designed in such a way that new application-specific units (e.g., ...

INFORMER: AN INTEGRATED FRAMEWORK FOR EARLY-STAGE MEMORY ROBUSTNESS ANALYSIS

Speakers:
Shrikanth Ganapathy1, Ramon Canal1, Dan Alexandrescu2, Enrico Costanero2, Antonio Gonzalez3 and Antonio Rubio1
1Universitat Politècnica de Catalunya, ES; 2iRoC Technologies, FR; 3Intel and Universitat Politècnica de Catalunya, ES

Abstract
With the growing importance of memory-related failure and aging, it becomes necessary to design reliable, fast and low-power embedded memories. Adopting a variation-aware design paradigm requires a holistic perspective of memory-wide metrics such as yield, power and performance. However, accurate assessment of memory failures is highly dependent on circuit implementation styles, technology parameters and architecture-level specifics. In this paper, we propose a fully automated tool, INFORMER, that helps high-level designers estimate memory reliability metrics rapidly and accurately.

WEAR-OUT ANALYSIS OF ERROR CORRECTION TECHNIQUES IN PHASE-CHANGE MEMORY

Speakers:
Caio Hoffman, Luiz Ramos, Rodolfo Azevedo and Guido Araújo, University of Campinas, BR

Abstract
Phase-Change Memory (PCM) is a new memory technology and a possible replacement for DRAM, whose scaling limitations require new lithography technologies. Despite being promising, PCM has limited endurance (its cells withstand roughly 10^8 bit-flips before failing), which prompted the adoption of Error Correction Techniques (ECTs). However, previous lifetime analyses of ECTs did not consider the difference between the bit-flip frequencies of data and code bits, which may lead to inaccurate wear-out analyses for the ECTs. In this work, we improve the wear-out analysis of PCM by modeling and analyzing the bit-flip probabilities of five ECTs. Our models also enable an accurate estimation of energy consumption and analysis of the endurance-energy trade-off for each ECT.

APPROXIMATING THE AGE OF RF/ANALOG CIRCUITS THROUGH RE-COLORIZATION AND STATISTICAL ESTIMATION

Speakers:
Doohwang Chang1, Sule Ozev1, Ozgur Sinanoglu2 and Ramesh Karri3
1Arizona State University, US; 2New York University Abu Dhabi, AE; 3Polytechnic Institute of New York University, US

Abstract
Counterfeit ICs have become an issue for semiconductor manufacturers due to impacts on their reputation and lost revenue. Counterfeit ICs are either products that are intentionally mislabeled or legitimate products that are extracted from electronic waste. The former are easier to detect, whereas the latter are harder since they are identical to new devices but display degraded performance due to environmental and use stress conditions. Detecting counterfeit ICs that are extracted from electronic waste requires an approach that can approximate the age of manufactured devices based on their parameters. In this paper, we present a methodology that uses information on both fresh and aged ICs and tries to distinguish between the fresh and aged population based on an estimate of the age. Since analog devices are subject to their bias stress, input signals play less of a role. Hence, it is possible to use simulation models to approximate the aging process, which allows us to access a large population of aged devices. Using this information, we can construct a large population model that approximates the age of a given circuit. We use a low-noise amplifier (LNA) and an NMOS LC oscillator to demonstrate that individual aged devices can be accurately classified using the proposed method.

PACKAGE GEOMETRIC AWARE THERMAL ANALYSIS BY INFRARED-RADIATION THERMAL IMAGES

Speakers:
Jui-Hung Chien1, Hao Yu2, Rwei-Siang Hu2, Hsu-Hu-Lin3 and Shih-Chieh Chang3
1Industrial Technology Research Institute, TW; 2None, TW; 3NTHU, TW

Abstract
Since packages affect the amount of heat transfer, it is important to include the package and heat sink in thermal analysis. In this paper, we study the full-chip thermal response with different packages. We first discuss the difficulties of obtaining accurate package models for simulation. To help designers perform thermal simulation with different packages, we propose a matrix, called the package-transfer matrix, that transforms the temperature profile of one package into the temperature profile of the desired package. To estimate and verify a package-transfer matrix, we propose an efficient method that uses Infrared Radiation (IR) images from two carefully designed test chips with PBGA packages. Our experimental results show that the default package model CBGA in HotSpot can be accurately transferred to any other package through the package-transfer matrix.


COST-EFFECTIVE DECAP SELECTION FOR BEYOND DIE POWER INTEGRITY

Speakers:
Yi-En Chen1, Tu-Hzung Tsai1, Shi-Hao Chen2 and Hung-Ming Chen1
1Department of Electronics Engineering National Chiao Tung University Hsinchu, Taiwan 300, R.O.C., TW; 2Global Unichip Corp, Hsinchu, Taiwan, TW

Abstract
In designing reliable power distribution networks (PDN) for power integrity (PI), it is essential to stabilize the voltage supply to devices on chip. Decoupling capacitors (decaps) are usually employed to suppress the noise generated by the switching of devices. There have been numerous prior works on how to select and insert decaps in the chip, package, or board to maintain PI. However, optimal decap selection is usually not applicable due to design budget and manufacturability constraints, and design cost is seldom addressed. In this research, we propose an efficient methodology, "PDPSO", to automatically optimize the selection of available decaps. This algorithm not only takes advantage of particle swarm optimization (PSO) to stochastically search the design space, but also takes the most effective range of decaps into consideration to outperform the basic PSO. We apply this to three real package designs, and the results show that, compared to the original decap selection by rules of thumb, our approach can shorten the design period and find a better combination of decaps at the same or lower cost. In addition, our methodology can also consider package-board co-design when optimizing for different operating frequencies.
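
The stochastic search mentioned in this abstract can be illustrated with a minimal sketch of basic particle swarm optimization. This is plain PSO, not the PDPSO method of the paper; the function `pso_minimize`, its parameters, and the toy objective `toy_cost` (two hypothetical decap values with an assumed optimum) are illustrative assumptions only.

```python
import random


def pso_minimize(cost, bounds, n_particles=20, n_iters=200, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Basic PSO: returns (best_position, best_cost) found within bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # per-particle best positions
    best_cost = [cost(p) for p in pos]        # per-particle best costs
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]  # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to personal best + pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move and clamp to the allowed range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], c
                if c < g_cost:
                    g_pos, g_cost = pos[i][:], c
    return g_pos, g_cost


def toy_cost(x):
    """Hypothetical objective: squared distance from an assumed ideal pair."""
    return (x[0] - 10.0) ** 2 + (x[1] - 47.0) ** 2


best, best_value = pso_minimize(toy_cost, bounds=[(0.0, 100.0), (0.0, 100.0)])
print(best, best_value)
```

A real decap selection would replace `toy_cost` with a PDN noise and cost model and would restrict candidates to the discrete set of available decaps.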

CHARACTERIZING POWER DELIVERY SYSTEMS WITH ON/OFF-CHIP VOLTAGE REGULATORS FOR MANY-CORE PROCESSORS

Speakers:
Xuan Wang, Jiang Xu, Zhe Wang, Kevin J. Chen, Xiaowen Wu and Zhehui Wang, HKUST, HK

Abstract
The design of the power delivery system has a great influence on power management in many-core processor systems. Moving voltage regulators from off-chip to on-chip is attracting growing interest in power delivery system design, because it enables fast voltage scaling and multiple power domains. Previous works have proposed power-efficient on-chip regulators. It is also important to analyze the characteristics of the entire power delivery system to explore the tradeoff between the promising properties and the costs of employing on-chip regulators. In this work, we develop an analytical model to evaluate important characteristics of the power delivery system, including on-chip/off-chip voltage regulators and the passive on-chip/on-board parasitics. Compared with SPICE simulations, our model achieves a fast system-level evaluation with comparable accuracy. Based on the model, geometric programming is used to find the optimal power efficiency of different power delivery system architectures under constraints on output voltage stability and area. Experiments show that, compared with the conventional architecture using off-chip regulators, the hybrid one using both on-chip and off-chip voltage regulators achieves a 1.0% power efficiency improvement and a 68% area reduction of voltage regulators on average. We conclude that the hybrid architecture has potential for high power efficiency and small area at heavy workload, but the overhead of on-chip regulators needs to be accounted for carefully.

PARTIAL-SET: WRITE SPEEDUP OF PCM MAIN MEMORY

Speakers:
Li Bing1, Shan Shuchang2, Hu Yu2 and Li Xiaowei3
1ICT, UCAS, CN; 2ICT, CAS, CN; 3ICT, CAS, CN

Abstract
Phase change memory (PCM) is a promising nonvolatile memory technology developed as a possible DRAM replacement. Although it offers read latency close to that of DRAM, PCM generally suffers from long write latency. Long write requests may block read requests on the critical path of cache/memory access, with an adverse impact on system performance. Moreover, the write performance of PCM is very asymmetric, i.e., the SET operation (writing ‘1’) is much slower than the RESET operation (writing ‘0’). In this work, we re-examine the resistance transform process during the SET operation of PCM and propose a novel Partial-set Scheme to alleviate the long write latency of PCM. During a write access to a memory line, a short Partial-set pulse is applied first to program the PCM cells to a pre-stable state, achieving the same write latency as RESET. The partially-set cells are then fully programmed within the retention window to preserve data integrity. Experimental results show that by combining all our proposed schemes, the lifetime of the non-volatile cells can be improved by 245% on average.

GARBAGE COLLECTION FOR MULTI-VERSION INDEX ON FLASH MEMORY

Speakers:
Kam-Yiu Lam1, Jian-Tao Wang1, Yuan-Hao Chang2, Jen-Wei Hsieh3, Po-Chun Huang4, Chung Keung Poon5 and Chunjiang Zhu1
1City University of Hong Kong, HK; 2Academia Sinica, TW; 3National Taiwan University of Science and Technology, TW; 4Academia Sinica, TW; 5City University of Hong Kong, TW

Abstract
In this paper, we study the important performance issues in using the purging-range query to reclaim old data versions to be free blocks in a flash-based multi-version database. To reduce the overheads for using the purging-range query in garbage collection, the physical block labeling (PBL) scheme is proposed to provide a better estimation on the purging version number to be used for purging old data versions. With the use of the frequency-based placement (FPB) scheme to place data versions in a block, the efficiency in garbage collection can be further enhanced by increasing the deadspans of data versions and reducing reallocation cost especially when the spaces of the flash memory for the databases are limited.

D2CYBER: A DESIGN AUTOMATION TOOL FOR DEPENDABLE CYBERCARS

Speakers:
Arslan Munir and Farinaz Koushanfar, Rice University, US

Abstract
The next generation of automobiles (also known as cybercars) will increasingly incorporate electronic control units (ECUs) in novel automotive control applications. Recent work has demonstrated the vulnerability of modern car control systems to security attacks that directly impact the cybercar’s physical safety and dependability. In this paper, we provide an integrated approach for the design of secure and dependable cybercars using a case study: a steer-by-wire (SBW) application over controller area network (CAN). The challenge is to embed both security and dependability over CAN while ensuring that the real-time constraints of the cybercar applications are not violated. Our approach enables early design feasibility analysis by embedding essential security primitives (i.e., confidentiality, integrity, and authentication) over CAN subject to the real-time constraints imposed by the desired quality of service and behavioral reliability. Our method leverages multi-core ECUs to provide fault tolerance through redundant multi-threading (RMT) and further enhances RMT for quick error detection. We quantify the error resilience of our approach and evaluate the interplay of performance, fault-tolerance, security, and scalability for our SBW case study.

A FAULT DETECTION MECHANISM IN A DATA-FLOW SCHEDULED MULTITHREADED PROCESSOR

Speakers: Jian Fu1, Qiang Yang1, Raphael Poss1, Chris Jesshope1 and Chunyuan Zhang2
1University of Amsterdam, NL; 2National University of Defense Technology, CN

Abstract
This paper designs and implements Redundant Multi-Threading (RMT) in a Data-flow scheduled Multi-Threaded (DMT) multicore processor, called Data-flow scheduled Redundant Multi-Threading (DRMT). It also presents Asynchronous Output Comparison (AOC) for RMT techniques to avoid fault-detection-related inter-core communication and to alleviate the performance and hardware overheads induced by output comparison. Results show that the performance overhead of DRMT is less than 60% even when the number of threads is four times the number of processing elements. The performance and hardware overheads of AOC are also insignificant.

CONTRACT-BASED DESIGN OF CONTROL PROTOCOLS FOR SAFETY-CRITICAL CYBER-PHYSICAL SYSTEMS

Speakers: Pierluigi Nuzzo, John Finn, Antonio Iannapolio and Alberto Sangiovanni-Vincentelli, University of California at Berkeley, US

Abstract
We introduce a platform-based design methodology that addresses the complexity and heterogeneity of cyber-physical systems by using assume-guarantee contracts to formalize the design process and enable realization of control protocols in a hierarchical and compositional manner. Given the architecture of the physical plant to be controlled, the design is carried out as a sequence of refinement steps from an initial specification to a final implementation, including synthesis from requirements and mapping of higher-level functional and non-functional models into a set of candidate solutions built out of a library of components at the lower level. Initial top-level requirements are captured as contracts and expressed using linear temporal logic (LTL) and signal temporal logic (STL) formulas to enable requirement analysis and early detection of inconsistencies. Requirements are then refined into a controller architecture by combining reactive synthesis steps from LTL specifications with simulation-based design space exploration steps. We demonstrate our approach on the design of embedded controllers for aircraft electric power distribution.

4.1 EXECUTIVE SESSION: Addressing Challenges of Reliable Chips

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Saal 1
Organiser: Yervant Zorian, Fellow & Chief Architect, Synopsys, US

Executives:
Dan Alexandrescu, President & CEO, iROC Technologies, FR
Robert Aitken, Fellow, ARM, US
Robert Hum, GM & VP, Mentor Graphics, US
Stefan Singer, Fellow, Freescale, DE

While today's SOCs systematically use semiconductor production quality assessment and optimization solutions, meeting end-product requirements for reliability and availability augments the need to prepare the SOC design in advance to address such requirements. The speakers in this executive session will address the current trends and challenges in semiconductor reliability and discuss the level of readiness needed in a chip to meet today's SOC requirements.

4.2 Hot Topic: Multicore Systems in Safety Critical Electronic Control Units for Automotive and Avionics

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 6
Organisers:
Jürgen Becker, KIT, DE
Oliver Sander, KIT, DE
Chair:
Jürgen Becker, KIT, DE
Co-Chair:
Oliver Sander, KIT, DE

Future applications in automotive and avionics show an ever-increasing demand for computational processing power. The use of multicore devices is now emerging in embedded electronics. However, these solutions are not directly applicable because of the technical requirements that come with the domain of safety-critical and mixed-critical applications, such as in automotive or avionics. The major challenge for the deployment of multicore devices in such safety-critical applications is the lack of determinism and of support for segregation due to shared resources. The goal of this session is to present the challenges that arise from the use of multicore devices in embedded safety-critical and mixed-critical systems.

17:00 4.2.1 AUTOSAR AND MULTICORE

Speakers: Stefan Kuntz1 and Rolf Schneider2
1Continental Automotive GmbH, DE; 2AUDI AG, DE

Abstract
AUTOSAR already supports developing applications for, and integrating software components onto, multicore-based platforms. In addition, these capabilities pave the way for migrating existing applications, originally developed for single-core platforms, to multicore-based platforms. This talk provides a brief introduction to the current state of AUTOSAR's multicore support and presents some scenarios that draw attention to multicore-specific questions and challenges in this particular context. Possible future directions for improving the AUTOSAR standard with regard to multicore, and for gaining more benefit from the availability of multiple cores as independent execution units, are sketched out.

17:30 4.2.2 CONCEPTS TO VALIDATE THE SAFE APPLICATION OF MULTICORE ARCHITECTURES IN THE AVIONICS DOMAIN

Speaker: Ottmar Bender, Airbus Defence and Space, DE

Abstract
This presentation explains how commercially available multicore processors can be applied for safety critical applications in avionics systems. It also describes remaining difficulties which need to be solved for a full exploitation of multicore technology in the avionics domain. Furthermore a concept of an airborne radar application demonstrator built on multicore architecture is shown. This demonstrator shall allow the validation of essential solutions for the specific difficulties emerging from current multicore architectures.

18:00 4.2.3 MONITORING AND WCET ANALYSIS IN COTS MULTI-CORE-SOC-BASED MIXED-CRITICALITY SYSTEMS

Speakers: Jan Nowotcki1, Michael Paulitsch2, Arne Henrichsen3, Werner Pongratz3 and Andreas Schacht3
1EADS Innovation Works, DE; 2EADS Innovation Works, DE; 3Cassidian, DE

Abstract
The performance and power efficiency of multi-core processors are attractive features for safety-critical applications, for example in avionics. But the inherent use of shared resources complicates timing analysability. In this paper we discuss a novel approach to compute the Worst-Case Execution Time (WCET) of multiple hard real-time applications scheduled on a Commercial Off-The-Shelf (COTS) multi-core processor. The analysis is closely coupled with mechanisms for temporal partitioning as, for instance, required in ARINC 653-based systems. Based on a discussion of the challenges for temporal partitioning and timing analysis of complex multi-core architectures, we present a new generic architecture for reasoning about the requirements for re-usability and incremental development and certification. We use this model to describe our integrated analysis approach.

18:15 4.2.4 HARDWARE VIRTUALIZATION SUPPORT FOR SHARED RESOURCES IN MIXED-CRITICALITY MULTICORE SYSTEMS

Speakers: Oliver Sander1, Timo Sandmann2, Viet Vu Duy3, Steffen Bahr3, Falco Bapp1, Juergen Becker3, Hans Ulrich Michel4, Dirk Kaule4, Daniel Adam4, Enno Luebben4, Jürgen Hairbucher4, Andre Richter4, Christian Herber7 and Andreas Herkersdorf8
1KIT, DE; 2Karlsruhe Institute of Technology (KIT), DE; 3Karlsruhe Institute of Technology, DE; 4BMW F+T, DE; 5Intel GmbH, DE; 6TUM, DE; 7Technische Universität München, DE; 8TU München, DE

Abstract
Electric/Electronic architectures in modern automobiles evolve towards a hierarchical approach where functionalities from several ECUs are consolidated into a few domain computers. Performance requirements directly lead to multicore solutions but also to a combination of very different requirements on such ECUs. Using virtualization in addition is one promising way of achieving segregation in time and space of shared resources. Based on examples taken from the automotive domain, several concepts for efficient hardware extensions of coprocessors and I/O devices are shown in this contribution. These provide mechanisms to ensure quality of service (QoS) levels in terms of execution time, throughput and latency. The resulting infotainment architecture is a feasibility study and is integrated into a vehicle demonstrator as a centralized infotainment platform (VCT).

18:30 End of session

4.3 Secure Device Identification

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 1

Chair:
Tim Gueneyssu, RUB, DE
Co-Chair:
Patrick Schaumont, Virginia Tech, US

Physically Unclonable Functions (PUFs) have received much attention for fingerprinting of electronic devices. This session presents novel constructions of, and threats to, Ring-Oscillator-based and Sense-Amplifier-based PUFs.

Time Label Presentation Title Authors

17:00 4.3.1 ARO-PUF: AN AGING-RESISTANT RING OSCILLATOR PUF DESIGN

Speakers:

Md. Tauhidur Rahman1, Domenic Forte2, Jim Fahmy2 and Mohammad Tehrani2

1University of Connecticut, US; 2Comcast, US

Abstract

Physically Unclonable Functions (PUFs) have emerged as a strong basis for generating chip-specific identifiers and cryptographic keys. However, it has been shown that the stability of these identifiers and keys is heavily impacted by aging and environmental variations. Previous techniques have mostly focused on improving PUF robustness against supply noise and temperature, but aging has largely been neglected. In this paper, we propose a new aging-resistant design for the popular ring-oscillator (RO)-PUF. Simulation results demonstrate that our aging-resistant RO-PUF (called ARO-PUF) can produce unique, random, and more reliable keys. On average, only 7.7% of the bits of an ARO-PUF flip due to aging over a 10-year operating period, compared with 32% for a conventional RO-PUF. The ARO-PUF shows an average inter-chip Hamming distance (HD) of 49.67%, which is close to the ideal value of 50% and better than that of the conventional RO-PUF (≈45%). With its lower error rate, the ARO-PUF offers an ≈24X area reduction for a 128-bit key because of reduced ECC complexity and a smaller PUF footprint.
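
The inter-chip HD figure quoted in this abstract is a uniqueness metric: the average pairwise Hamming distance between the responses of different chips, as a percentage of the response length. A minimal sketch of that calculation follows; the function names and the toy 8-bit responses are hypothetical and are not data from the paper.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("responses must have equal length")
    return sum(x != y for x, y in zip(a, b))


def inter_chip_hd_percent(responses):
    """Average pairwise Hamming distance across chips, in percent of bits."""
    pairs = list(combinations(responses, 2))
    n_bits = len(responses[0])
    total = sum(hamming_distance(a, b) for a, b in pairs)
    return 100.0 * total / (len(pairs) * n_bits)


# Hypothetical 8-bit responses from four chips (illustrative only).
toy_responses = ["10110010", "01100111", "11011000", "00101101"]
hd = inter_chip_hd_percent(toy_responses)
print(f"inter-chip HD: {hd:.2f}%")
```

A value near 50% means that, on average, any two chips disagree on about half of their response bits, which is the ideal case for distinguishing devices.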

17:30 4.3.2 (Best Paper Award Candidate)

AN EFFICIENT RELIABLE PUF-BASED CRYPTOGRAPHIC KEY GENERATOR IN 65NM CMOS

Speakers:

Mudit Bharagava1 and Ken Mai2

1ARM, US; 2Carnegie Mellon University, US

Abstract

Physical unclonable functions (PUFs) are primitives that generate high-entropy, tamper-resistant bits for use in secure systems. For applications such as cryptographic key generation, the PUF response bits must be highly reliable, i.e., consistent across multiple evaluations under voltage and temperature variations. Conventionally, error correcting codes (ECC) have been used to improve response reliability, but these techniques have significant area, power, and delay overheads and are vulnerable to information leakage. In this work, we present a hardware solution that uses no ECC and instead relies on built-in self-test to determine which PUF bits are reliable, using only those bits for key generation. We implemented a prototype of the key generator in a 65nm bulk CMOS testchip. The key generator generates 1213 bits in an area of < 50k μm^2 with a measured bit error rate of < 5 × 10^-9 in both the nominal and worst-case corners (100k measurements each). This is equivalent to a 128-bit key failure rate of < 10^-6. The system can generate a 128-bit key in 1.15μs. Finally, we present a realization of a "strong" PUF that uses 128 of these highly reliable bits in conjunction with an Advanced Encryption Standard (AES) cryptographic primitive, has a response time of 40ns, and is realized in an area of 84k μm^2.
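
The step from the measured bit error rate to the 128-bit key failure rate can be checked with a short calculation. The sketch below assumes independent bit errors, which the abstract does not state explicitly; the function name `key_failure_rate` is illustrative.

```python
import math


def key_failure_rate(bit_error_rate, key_bits):
    """Probability that at least one of key_bits independent bits is wrong.

    Computes 1 - (1 - p)^n via log1p/expm1 to stay accurate for tiny p.
    """
    return -math.expm1(key_bits * math.log1p(-bit_error_rate))


# Upper bound on the bit error rate quoted in the abstract, 128-bit key.
rate = key_failure_rate(5e-9, 128)
print(f"128-bit key failure rate: {rate:.2e}")
```

Under this assumption the failure rate is about 128 × 5 × 10^-9 ≈ 6.4 × 10^-7, which is consistent with the stated bound of < 10^-6.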

Time | Label | Presentation Title / Authors
--- | --- | ---
18:00 | 4.3.3 | INCREASING THE EFFICIENCY OF SYNDROME CODING FOR PUFS WITH HELPER DATA COMPRESSION<br>**Speakers:** Matthias Hiller and Georg Sigl, Institute for Security in Information Technology; Technische Universität München, DE<br>**Abstract:**<br>Physical Unclonable Functions (PUFs) provide secure cryptographic keys for resource constrained embedded systems without secure storage. A PUF measures internal manufacturing variations to create a unique, but noisy secret inside a device. Syndrome coding schemes create and store helper data about the structure of a specific PUF to correct errors within subsequent PUF measurements and generate a reliable key. This helper data can contain redundancy. We analyze existing schemes and show that data compression can be applied to decrease the size of the helper data of existing implementations. We introduce compressed Differential Sequence Coding (DSC), which is the most efficient syndrome coding scheme known to date for a popular reference scenario. Adding helper data compression to the DSC algorithm leads to an overall decrease of 68% in helper data size compared to other algorithms in a reference scenario. This is achieved without increasing the number of PUF bits and with only a minimal increase in logic size.
18:15 | 4.3.4 | KEY-RECOVERY ATTACKS ON VARIOUS RO PUF CONSTRUCTIONS VIA HELPER DATA MANIPULATION<br>**Speakers:** Jeroen Delvaux and Ingrid Verbauwhede<br>1KU Leuven, BE; 2KU Leuven - COSIC, BE<br>**Abstract:**<br>Physically Unclonable Functions (PUFs) are security primitives that exploit the unique manufacturing variations of an integrated circuit (IC). They are mainly used to generate secret keys. Ring oscillator (RO) PUFs are among the most widely researched PUFs. In this work, we claim various RO PUF constructions to be vulnerable to manipulation of their public helper data. Partial/full key-recovery is a threat for the following constructions, in chronological order: (1) Temperature-aware cooperative RO PUFs, proposed at HOST 2009. (2) The sequential pairing algorithm, proposed at HOST 2010. (3) Group-based RO PUFs, proposed at DATE 2013. (4) More generally, all entropy distiller constructions proposed at DAC 2013.
18:30 | End of session |

### 4.4 "Almost there" emerging technologies

**Date:** Tuesday 25 March 2014<br>**Time:** 17:00 - 18:30<br>**Location / Room:** Konferenz 2

**Chair:** Ian O'Connor, University of Lyon, FR<br>**Co-Chair:** Michael Niemier, University of Notre Dame, US

The three papers in this session all address "nearer-term" emerging technologies. Stochastic computing techniques are becoming increasingly relevant as CMOS becomes more error prone, numerous industrial and academic efforts are targeting 3D integration, and integrated microfluidics promise to have a profound impact on healthcare and other domains.

Time | Label | Presentation Title / Authors
--- | --- | ---
17:00 | 4.4.1 | IIR FILTERS USING STOCHASTIC ARITHMETIC<br>**Speakers:** Naman Saraf, Kia Bazargan, David J Lilja and Marc D Riedel, University of Minnesota, Twin Cities, US<br>**Abstract:**<br>We consider the design of IIR filters operating on oversampled sigma-delta modulated bit streams using stochastic arithmetic. Conventional digital filters process multi-bit data at the Nyquist rate using multi-bit multipliers and adders. High resolution ADCs based on the sigma-delta modulation generate random bits at an oversampled rate as intermediate data. We propose to filter the sigma-delta modulated bit streams directly and present first and second order low pass IIR filters based on the stochastic integrator. Experimental results show a significant reduction in hardware area by using stochastic filters.
17:30 | 4.4.2 | EFFICIENT TRANSIENT THERMAL SIMULATION OF 3D ICS WITH LIQUID-COOLING AND THROUGH SILICON VIAS<br>**Speakers:** Alain Fourmigue, Giovanni Beltrame and Gabriela Nicolescu, Polytechnique Montreal, CA<br>**Abstract:**<br>Three-dimensional integrated circuits (3D ICs) with advanced cooling systems are emerging as a viable solution for many-core platforms. These architectures generate a high and rapidly changing thermal flux. Their design requires accurate transient thermal models. Several models have been proposed, but they have either limited capabilities or poor simulation performance. This work introduces an efficient algorithm based on the Finite Difference Method to compute the transient temperature in 3D ICs. Our experiments show a 5x speedup versus state-of-the-art models, while maintaining the same level of accuracy, and demonstrate the effect of large through-silicon via arrays on thermal dissipation.
18:00 | 4.4.3 | A LOGIC INTEGRATED OPTIMAL PIN-COUNT DESIGN FOR DIGITAL MICROFLUIDIC BIOCHIPS<br>**Speakers:** Trung Anh Dinh, Shigeru Yamashita and Tsung-Yi Ho<br>1Ritsumeikan University, JP; 2National Cheng Kung University, TW<br>**Abstract:**<br>Digital microfluidic biochips have become one of the most promising technologies for biomedical experiments. In modern microfluidic technology, reducing the number of independent control pins, which accounts for most of the fabrication cost, power consumption and reliability of a microfluidic system, is a key challenge for every digital microfluidic biochip design. However, all previous chip designs sacrifice the optimality of the problem, and only a limited reduction in the number of control pins is observed. Moreover, most existing designs cannot satisfy the high-throughput demand of bioassays, and are thus inapplicable in practical contexts. In this paper, we propose the first optimal pin-count design scheme for digital microfluidic biochips. By integrating a very simple combinational logic circuit into the original chip, the proposed scheme can provide high throughput for bioassays with an information-theoretic minimum number of control pins. Furthermore, to cope with the rapid growth of the chip's scale, we also propose a scalable and efficient heuristic. Experiments demonstrate that the proposed scheme requires far fewer control pins than the previous state-of-the-art works.
A single parallel application running on a multi-core system shows sub-linear speedup because of the slow progress of one or more threads, known as critical threads. Some of the reasons for the slow progress of threads are (1) load imbalance, (2) frequent cache misses and (3) the effect of synchronization primitives. Identifying critical threads and minimizing their cache miss latencies can improve system performance. One way to hide and tolerate cache misses is hardware prefetching, one of the most commonly used memory latency hiding techniques. Previous studies have shown the effectiveness of hardware prefetchers for multiprogrammed workloads (multiple sequential applications running independently on different cores). In contrast to multiprogrammed workloads, the performance of a single parallel application depends on the progress of its slow-progress (critical) threads. This paper introduces Thread Criticality-aware Prefetcher Aggressiveness Control (TCPAC). TCPAC controls the aggressiveness of prefetchers at the L2 prefetching controllers (known as TCPAC-P), at the DRAM controller (known as TCPAC-D) and at the Last Level Cache (LLC) controller (known as TCPAC-C), based on the prefetch accuracy and the thread progress. Each TCPAC sub-technique outperforms the respective state-of-the-art technique, such as HPAC [2], PADC [4], and PACMan [3]. The combination of all the TCPAC sub-techniques, named TCPAC-PDC, outperforms the combination of HPAC, PADC, and PACMan. On average, on an 8-core system, TCPAC-PDC improves execution time over the combination of HPAC, PADC, and PACMan by 7.61%. For 12 and 16 cores, TCPAC-PDC beats the state-of-the-art combination by 7.21% and 8.32%, respectively.
4.6 Code Generation and Optimization for Embedded Platforms

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 4

Chair: Heiko Falk, Ulm University, DE
Co-Chair: Florence Maraninchi, Grenoble INP/VERIMAG, FR

This session covers a broad spectrum of topics in compilers, code optimization, and validation for today’s embedded platforms. The first paper addresses the automated validation of binary translators. The second paper focuses on the on-device optimization of apps and system libraries on mobile platforms. The third paper deals with code generation for Android image processing applications on heterogeneous GPU-based architectures. The session is rounded off by short presentations of work-in-progress ideas on model transformation, energy and wear-leveling optimization, and scheduling/register allocation.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td></td>
<td><strong>REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3)</strong></td>
<td>Alen Bardizbanyan, Magnus Själander, David Whalley and Per Larsson-Edefors</td>
</tr>
<tr>
<td>17:10</td>
<td>IP2-3</td>
<td>DRAM-BASED COHERENT CACHES AND HOW TO TAKE ADVANTAGE OF THE COHERENCE PROTOCOL TO REDUCE THE REFRESH ENERGY</td>
<td>Zoran Jaksic and Ramon Canal, Universitat Politècnica de Catalunya, ES</td>
</tr>
<tr>
<td>18:00</td>
<td>IP2-4</td>
<td>DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP</td>
<td>Preethi Parayar Mana Damodaran, Stefan Wallentowitz and Andreas Herkersdorf</td>
</tr>
</tbody>
</table>

Presentation Title: A MULTI BANKED - MULTI PORTED - NON BLOCKING SHARED L2 CACHE FOR MPSOC PLATFORMS
Speakers: Igor Loi and Luca Benini

Abstract
On-chip L2 cache architectures, well established in high-performance parallel computing systems, are now becoming a performance-critical component also for multi/many-core architectures targeted at lower-power, embedded applications. The very stringent requirements on power and cost of these systems result in one of the key challenges in many-core designs, mandating the deployment of highly efficient L2 caches. In this perspective, sharing the L2 cache layer among all system cores has important advantages, such as increased utilization, fast inter-core communication, and reduced aggregate footprint because no undesired replication of lines occurs. This paper presents and explores a novel architecture for a shared L2 cache system with multi-port and multi-bank features. We target this L2 cache to a many-core platform based on hierarchical cluster structure that does not employ private data caches, and therefore does not require complex coherence mechanisms. In fact, our shared L2 cache can be seen logically as a Last Level Cache (LLC) adopting the terminology of high-performance many-core products, although in these latter the LLC is more often an L3 layer. Our experimental results show a maximum aggregate bandwidth of 28GB/s (99% of the maximum channel capacity) for 100% hit traffic with random banking conflicts, as a realistic case. Physical implementation results in 28nm Fully-Depleted-Silicon-on-Insulator (FDSOI) show that our L2 cache can operate at up to 1GHz with a memory density loss of only 20% with respect to an L2 scratchpad for a 2 MB configuration.

Presentation Title: REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3)
Speakers: Alen Bardizbanyan, Magnus Själander, David Whalley and Per Larsson-Edefors

Abstract
Fast set-associative level-one data caches (L1 DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1 DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependency detection (ELD3) technique for in-order pipelines. This technique makes it possible to detect if a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are sequentially performed so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD3 technique enables about 49% of all load operations to access the L1 DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1 DC energy by 13%.
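
The decision rule described in this abstract (sequential tag-then-data access when no nearby instruction needs the loaded value, parallel access otherwise) can be sketched in software as follows. This is only an illustration of the idea, not the hardware mechanism from the paper: the `Instr` record, the register names, and the `window` parameter (how many following instructions are inspected) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    op: str                      # e.g. "load", "add"
    dest: Optional[str] = None   # destination register, if any
    srcs: Tuple[str, ...] = ()   # source registers


def access_mode(program: List[Instr], load_idx: int, window: int = 1) -> str:
    """Choose the L1 DC access mode for the load at `load_idx`.

    Returns "parallel" if one of the next `window` instructions reads the
    loaded register (the value is needed early), else "sequential".
    """
    load = program[load_idx]
    if load.op != "load":
        raise ValueError("instruction at load_idx is not a load")
    for instr in program[load_idx + 1 : load_idx + 1 + window]:
        if load.dest in instr.srcs:
            return "parallel"
        if instr.dest == load.dest:
            # Register overwritten without being read: later readers
            # do not depend on this load.
            break
    return "sequential"


if __name__ == "__main__":
    prog = [
        Instr("load", "r1", ("r10",)),
        Instr("add", "r2", ("r1", "r3")),   # uses r1 right away
        Instr("load", "r4", ("r11",)),
        Instr("add", "r5", ("r2", "r3")),   # independent of r4
        Instr("sub", "r6", ("r4", "r5")),
    ]
    print(access_mode(prog, 0), access_mode(prog, 2))
```

With the default window of one instruction, the first load is served in parallel and the second sequentially; widening the window makes the check more conservative.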

Presentation Title: DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP
Speakers: Preethi Parayar Mana Damodaran, Stefan Wallentowitz and Andreas Herkersdorf

Abstract
In a shared-memory based tiled many-core system-on-chip architecture, memory accesses present a huge performance bottleneck in terms of access latency as well as bandwidth requirements. The best practice approach to address this issue is to provide a multi-level cache hierarchy and a suitable cache-coherency mechanism. This paper presents a method to increase the memory access performance in distributed-directory-coherency-protocol based tiled many-core systems. The proposed method introduces an alternate design for the system-wide shared last-level caches (LLC) placed between the memory and the node private caches (NPC). The proposed system-wide shared LLC layer is distributed over the entire network and it interacts with the home directories of specific cache lines. Results from simulating SPEC2000 benchmark applications executed on a SystemC model of the proposed design show a minimum performance improvement of 20-25% when compared to a model without the shared cache layer at the expense of an additional 2% of the total cache memory space (NPC + LLC memory). In addition, the proposed design shows a minimum 7-15% and an average 14-15% improvement in performance in comparison to centralized system-wide shared LLC of equivalent size and dynamic mapped distributed LLC of equivalent size respectively.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>18:30</td>
<td></td>
<td><strong>A MULTI BANKED - MULTI PORTED - NON BLOCKING SHARED L2 CACHE FOR MPSOC PLATFORMS</strong></td>
<td>Igor Loi and Luca Benini</td>
</tr>
<tr>
<td>18:40</td>
<td></td>
<td><strong>REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3)</strong></td>
<td>Alen Bardizbanyan, Magnus Själander, David Whalley and Per Larsson-Edefors</td>
</tr>
<tr>
<td>18:50</td>
<td></td>
<td><strong>DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP</strong></td>
<td>Preethi Parayar Mana Damodaran, Stefan Wallentowitz and Andreas Herkersdorf</td>
</tr>
</tbody>
</table>
17:00  4.6.1  EATBit: EFFECTIVE AUTOMATED TEST FOR BINARY TRANSLATION WITH HIGH CODE COVERAGE
Speakers:
Hui Guo1, Zhenjiang Wang1, Chenggang Wu1 and Ruining He2
1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2University of California, San Diego, US

Abstract
Binary translation makes it convenient to emulate one instruction set by another. It is growing in popularity in various applications, especially on embedded platforms. When it comes to testing binary translators, traditional methodologies, which still mainly rely on manual unit tests, are costly, labor-intensive and often inadequate for testing the complicated algorithms in the translators. Some standard benchmark suites, like SPEC CPU2006, are compiled with different compilation options for further tests. However, according to our experimental results, the translation modules still have over 30% of their code unexecuted after such tests. Methodologies based on randomization can generate a vast variety of tests, thus improving the code coverage in the translation system. In this paper, we propose such an approach, named EATBit. Test binaries are generated with randomly selected instructions and operands. The binaries and a large amount of input data are then refined to exclude invalid ones. Experimental results on a real binary translator demonstrate that EATBit not only improves code coverage by over 20%, but also finds new bugs in the translator.

17:30  4.6.2  ON-DEVICE OBJECTIVE-C APPLICATION OPTIMIZATION FRAMEWORK FOR HIGH-PERFORMANCE MOBILE PROCESSORS
Speakers:
Garo Bournoutian and Alex Orailoglu, University of California, San Diego, US

Abstract
Smartphones provide applications that are increasingly similar to interactive desktop programs, with rich graphics and animations. To simplify the creation of these interactive applications, mobile operating systems employ high-level object-oriented programming languages and shared libraries to manipulate the device’s peripherals and provide common user-interface frameworks. The presence of dynamic dispatch and polymorphism allows for robust and extensible application coding. Unfortunately, dynamic dispatch also introduces significant overheads during method calls, which directly impact execution time. Furthermore, since these applications rely heavily on shared libraries and helper routines, the number of these method calls is higher than in typical desktop-based programs. Optimizing these method calls centrally, before consumers download the application onto a given phone, is hampered by the large diversity of hardware and operating system versions that the application could run on. This paper proposes a methodology to tailor a given Objective-C application and its associated device-specific shared library codebase using on-device post-compilation code optimization and transformation. In doing so, many polymorphic sites can be resolved statically, improving the overall application performance.

18:00  4.6.3  CODE GENERATION FOR EMBEDDED HETEROGENEOUS ARCHITECTURES ON ANDROID
Speakers:
Richard Membarth, Oliver Reiche, Frank Hannig and Jürgen Teich, University of Erlangen-Nuremberg, DE

Abstract
The success of Android is based on its unified Java programming model, which allows writing platform-independent programs for a variety of different target platforms. However, this comes at the cost of performance. As a consequence, Google introduced APIs that allow developers to write native applications and to exploit multiple cores as well as embedded GPUs for compute-intensive parts. This paper proposes code generation techniques to target the Renderscript and Filterscript APIs. Renderscript harnesses multi-core CPUs and unified shader GPUs, while the more restricted Filterscript also supports GPUs with earlier shader models. Our techniques focus on image processing applications and make it possible to target these APIs and OpenCL from a common description. We further supersede memory transfers by sharing the same memory region among different processing elements on HSA platforms. As a reference, we use an embedded platform hosting a multi-core ARM CPU and an ARM Mali GPU. We show that our generated source code is faster than native implementations in OpenCV as well as the pre-implemented script intrinsics provided by Google for acceleration on the embedded GPU.

18:30  4.6.4  ENERGY OPTIMIZATION IN ANDROID APPLICATIONS THROUGH WAKELOCK PLACEMENT
Speakers:
Faisal Alam1, Preeti Ranjan Panda1, Nikhil Tripathi2, Namita Sharma1 and Sanjiv Narayan2
1Newcastle University, GB; 2Newcastle University, ZW; 3Newcastle University, BB

Abstract
An increasingly large number of safety-critical embedded systems rely on software to prevent and mitigate hazards occurring due to design errors and unexpected interactions of the system with its users and the environment. Implementing a safety instrumented function in the way advocated by the traditional software methods requires an intimate understanding and thorough validation of a complex ecosystem of programming languages, compilers, operating systems and hardware. We propose to consider an alternative where a system designer, for each individual problem, creates in a correct-by-construction manner both the design of a system and its compilation and execution infrastructure. This permits an uninterrupted chain of a formal correctness argument spanning from formalised requirements all the way to the gate-level characterisation of an execution environment. The past decade of advances in verification technology turned the mechanical verification of large-scale models into a reality while the pressure of certification makes the cost of a formally verified development routine increasingly acceptable. The proposed technique fits the Grand Challenge for Computer Research posed by Hoare in 2003, namely, development of a Verifying Compiler which not only mechanically translates a given program from one language to another but also verifies its correctness according to a formal specification. This allows meeting the most stringent software certification requirements such as SIL 4. We illustrate the idea with a small case-study developed using the Event-B modelling notation and tools.

18:31  4.6.5  A WEAR-LEVELING-AWARE DYNAMIC STACK FOR PCM MEMORY IN EMBEDDED SYSTEMS
Speakers:
Qingan Li1, Yanxiang He2, Yong Chen2, Chun Xue1, Nan Jiang2 and Chao Xu2
1Huazhong University & City University of Hong Kong, CN; 2City University of Hong Kong, CN

Abstract
Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics, such as extremely low leakage power, high storage density and good scalability. However, PCM’s low endurance constrains its practical applications. In this paper, we propose a wear-leveling-aware dynamic stack to extend PCM’s lifetime when it is adopted as main memory in embedded systems. Through a dynamic stack, the memory space is circularly allocated to stack objects, and thus an even usage of PCM memory is achieved. The experimental results show that the proposed method can significantly reduce the write variation on PCM cells and enhance the lifetime of PCM memory.
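
To see why circular allocation evens out wear, the toy model below counts writes per memory cell for a sequence of stack frames, once with every frame placed at a fixed base and once with each frame placed where the previous one ended, wrapping around. It is a deliberately simplified sketch, not the authors' mechanism: frames are assumed to be written once and popped before the next call, and `write_counts` is a hypothetical helper.

```python
def write_counts(frame_sizes, mem_size, circular):
    """Count writes per cell for a sequence of non-nested stack frames.

    circular=False: every frame starts at cell 0 (conventional stack base).
    circular=True:  each frame starts where the previous one ended,
                    wrapping around the stack region.
    """
    counts = [0] * mem_size
    base = 0
    for size in frame_sizes:
        if size > mem_size:
            raise ValueError("frame larger than the stack region")
        for offset in range(size):
            counts[(base + offset) % mem_size] += 1
        if circular:
            base = (base + size) % mem_size
    return counts


if __name__ == "__main__":
    frames = [4] * 8
    fixed = write_counts(frames, 16, circular=False)
    rotated = write_counts(frames, 16, circular=True)
    print(max(fixed), max(rotated))  # worst-case cell wear in each scheme
```

In this example the fixed-base scheme writes the same four cells eight times each, while the circular scheme spreads the same 32 writes over all 16 cells.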

18:32  4.6.6  ENERGY SAVINGS IN MOBILE OPERATING SYSTEMS THROUGH CODE OPTIMIZATION
Speakers:
Alex Iliasov, Arseniy Alekseyev, Danil Sokolov and Andrey Mokhov, Newcastle University, GB

### 4.7 Dependable System Design

**Date:** Tuesday 25 March 2014  
**Time:** 17:00 - 18:30  
**Location / Room:** Konferenz 5  
**Chair:** Yiorgos Makris, University of Texas at Dallas, US  
**Co-Chair:** Haralampos Stratigopoulos, TIMA, FR

This session presents a variety of techniques to improve the dependability of digital systems, showing how to improve security and fault tolerance at the system level.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>4.7.1</td>
<td>REAL-TIME TRUST EVALUATION IN INTEGRATED CIRCUITS</td>
<td>Yier Jin and Dean Sullivan, The University of Central Florida, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>The use of side-channel measurements and fingerprinting, in conjunction with statistical analysis, has proven to be the most effective method for accurately detecting hardware Trojans in fabricated integrated circuits. However, these post-fabrication trust evaluation methods overlook the advanced design skills that attackers can use in designing sophisticated Trojans. To this end, we have designed a Trojan using power-gating techniques and dynamic scheduling so that it can be masked from advanced side-channel fingerprinting detection while dormant. We then propose a real-time trust evaluation framework that continuously monitors the on-board global power consumption to assess chip trustworthiness. The measurements obtained corroborate our framework's effectiveness for detecting Trojans. Finally, the results presented are experimentally verified by performing measurements on fabricated Trojan-free and Trojan-infected variants of a reconfigurable linear feedback shift register (LFSR) array.</td>
</tr>
<tr>
<td>17:30</td>
<td>4.7.2</td>
<td>(Best Paper Award Candidate) VERIFICATION-GUIDED VOTER MINIMIZATION IN TRIPLE-MODULAR REDUNDANT CIRCUITS</td>
<td>Dmitry Burlyaev, Pascal Fradet and Alain Girault, INRIA, FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>We present a formal approach to minimize the number of voters in triple-modular redundant sequential circuits. Our technique actually works on a single copy of the circuit and considers a user-defined fault model (under the form 'at most 1 bit-flip every k clock cycles'). Verification-based voter minimization guarantees that the resulting circuit (i) is fault tolerant to the soft-errors defined by the fault model and (ii) is functionally equivalent to the initial one. Our approach operates at the logic level and takes into account the input and output interface specifications of the circuit. Its implementation makes use of graph traversal algorithms, fixed-point iterations, and BDDs. Experimental results on the ITC'99 benchmark suite indicate that our method significantly decreases the number of inserted voters which entails a hardware reduction of up to 55% and a clock frequency increase of up to 35% compared to full TMR. We address scalability issues arising from formal verification with approximations and assess their efficiency and precision.</td>
</tr>
<tr>
<td>18:00</td>
<td>4.7.3</td>
<td>TRADE-OFFS IN EXECUTION SIGNATURE COMPRESSION FOR RELIABLE PROCESSOR SYSTEMS</td>
<td>Jonah Caplan1, Maria Mera1, Peter Milder1 and Brett Meyer1</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>As semiconductor processes scale, making transistors more vulnerable to transient upset, a wide variety of microarchitectural and system-level strategies are emerging to perform efficient error detection and correction in computer systems. While these approaches often target various application domains and address error detection and correction at different granularities and with different overheads, an emerging trend is the use of state compression, e.g., cyclic redundancy check (CRC), to reduce the cost of redundancy checking. Prior work in the literature has shown that Fletcher's checksum (FC), while less effective where error detection probability is concerned, is less computationally complex when implemented in software than the more effective CRC. In this paper, we reexamine the suitability of CRC and FC as compression algorithms when implemented in hardware for embedded safety-critical systems. We have developed and evaluated parameterizable implementations of CRC and FC in FPGA, and we observe that what was true for software implementations does not hold in hardware: CRC is more efficient than FC across a wide variety of target input bandwidths and compression strengths.</td>
</tr>
<tr>
<td>18:15</td>
<td>4.7.4</td>
<td>AN ENERGY-AWARE FAULT TOLERANT SCHEDULING FRAMEWORK FOR SOFT ERROR RESILIENT CLOUD COMPUTING SYSTEMS</td>
<td>Yue Gao, Sandeep Gupta, Yanzhi Wang and Massoud Pedram, University of Southern California, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>For modern high-performance systems, aggressive technology and voltage scaling has drastically increased susceptibility to soft errors. At the grand scale of cloud computing, it is clear that soft-error-induced failures will occur far more frequently, but it is unclear how to effectively apply current error detection and fault tolerance techniques at scale. In this paper, we focus on energy-aware fault tolerant scheduling in public, multi-user cloud systems, and explore the three-way tradeoff between reliability (in terms of soft error resiliency), performance and energy. Through a systematically optimized resource allocation, error detection approach selection, virtual machine placement, spatial/temporal redundancy augmentation and task scheduling process, the cloud service provider can achieve high error coverage and fault tolerance confidence while minimizing global energy costs under user deadline constraints. Our scheduling algorithm includes a static scheduling phase that operates on task-graph-based workload inputs prior to execution, and a light-weight dynamic scheduler that migrates tasks during execution in case of excessive re-executions. All schedules are evaluated on a runtime simulation engine that (1) mimics the performance fluctuations in cloud systems, and (2) supports the injection of arbitrary fault patterns. Compared to current virtual machine or task replication techniques, we are able to reduce overall application failure rates by over 50% with approximately 76% total energy overhead.</td>
</tr>
</tbody>
</table>
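
For readers unfamiliar with the two compression schemes compared in presentation 4.7.3, the sketch below gives plain software versions of a 16-bit Fletcher checksum and a bitwise CRC-16. It is illustrative only and unrelated to the authors' parameterizable FPGA implementations; the specific variants chosen here (Fletcher-16 with modulus 255, and CRC-16 with polynomial 0x1021 and initial value 0xFFFF) are assumptions made for the example.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255, packed into 16 bits."""
    s1 = 0
    s2 = 0
    for byte in data:
        s1 = (s1 + byte) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16 with polynomial 0x1021, MSB first, no final XOR."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    print(hex(fletcher16(b"abcde")), hex(crc16_ccitt(b"123456789")))
    # Fletcher-16 cannot tell 0x00 bytes from 0xFF bytes (255 % 255 == 0),
    # one reason its error detection is weaker than that of a CRC.
    print(fletcher16(b"\x00\x00") == fletcher16(b"\xff\xff"))
```

The Fletcher loop needs only additions, which is why it is cheap in software, whereas the CRC's shift-and-XOR structure maps naturally onto hardware.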
**Presentation Title:** A LOW-POWER, HIGH-PERFORMANCE APPROXIMATE MULTIPLIER WITH CONFIGURABLE PARTIAL ERROR RECOVERY

**Authors:**

Cong Liu1, Jie Han1 and Fabrizio Lombardi2
1University of Alberta, CA; 2Northeastern University, US

**Abstract**

Approximate circuits have been considered for error-tolerant applications that can accept some loss of accuracy in exchange for improved performance and energy efficiency. Multipliers are key arithmetic circuits in many such applications, for example digital signal processing (DSP). In this paper, a novel approximate multiplier with a lower power consumption and a shorter critical path than traditional multipliers is proposed for high-performance DSP applications. This multiplier leverages a newly designed approximate adder that limits its carry propagation to the nearest neighbors for fast partial product accumulation. Different levels of accuracy can be achieved through a configurable error recovery by using different numbers of most significant bits (MSBs) for error reduction. The approximate multiplier has a low mean error distance, i.e., most of the errors are not significant in magnitude. Compared to the Wallace multiplier, a 16-bit approximate multiplier implemented in a 28nm CMOS process shows a reduction in delay and power of 20% and up to 69%, respectively. It is shown that by utilizing an appropriate error recovery, the proposed approximate multiplier achieves processing accuracy similar to that of traditional exact multipliers, but with significant improvements in power and performance.

**End of session**

18:30 Exhibition Reception at several serving points inside the Exhibition Area (Terrace Level)

The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

---

**4.8 State-of-the-art in Verification: European Tertulia IC Design - Enabling AMS Structured Verification / Verification in FPGA & IP design flows**

**Organiser:**

Andreas Brüning, Silicon Saxony, DE

**Date:**

Tuesday 25 March 2014

**Time:**

17:00 - 18:30

**Location / Room:** Exhibition Theatre

---

**Presentation Title:** BRING ASIC-ALIKE VERIFICATION TO YOUR FPGA & IP DESIGN FLOW

**Authors:**

Scott Calkins, Blue Pearl Software Inc, US

**Abstract**

This talk will highlight how successful design teams and IP firms such as PLDA are able to develop high-quality code by using a process to control and optimize HDL that is developed by different designers in different locations, even those with varied skill sets. PLDA designs and sells intellectual property (IP) cores and prototyping tools for ASIC and FPGA that aim to accelerate time-to-market for embedded electronic designers. PLDA specializes in high-speed interface protocols and technologies such as PCIe. Through the use of Blue Pearl Software’s Symbolic Engine, which maps code to the RTL level and then analyzes it for known structures, PLDA is able to generate deterministic results for the handful of synthesizers and target fabrics its customers demand. Analyzing the HDL before it is brought into cycle-based simulators allows designers to run FPGA-centric structural checks for Xilinx and Altera. This helps to detect bugs and specific optimizations earlier in the flow, and automatically, for the success and satisfaction of our customers’ designers: “Blue Pearl Software’s design analysis tool enables integration of formal verification techniques to our design flow, in order for us to detect structural bugs at the very early stage of code integration, and thus to deliver highest quality IP to our customers. On top, we definitely recommend Blue Pearl Software’s solution to anyone who needs to increase design team productivity.” Hugues Deneux, R&D Director of PLDA

---

**Presentation Title:** TOWARDS CO-DESIGN AND CO-VERIFICATION OF HW, SW, AND ANALOG SYSTEMS

**Authors:**

Christoph Grimm, TU Kaiserslautern, DE

**Abstract**

Today we can design and verify digital hardware and software in a way that deserves the word co-design. Co-design achieves significantly higher design productivity and better product performance. Unfortunately, co-design and co-verification are not yet done in a similarly productive way for analog and RF systems. The presentation will give an overview of methodology, tools, and languages that include analog and RF design in a comprehensive co-design methodology. Particular focus is on tool integration and power profiling crossing the discrete-analog border.

---

**Presentation Title:** ENABLING AMS STRUCTURED VERIFICATION

**Authors:**

Gunter Strube 1, Jie Han 2
1MunEDA GmbH, DE; 2ZMDI, DE

**Abstract**

The verification of the robustness of design specifications with respect to all combinations of worst-case parameter conditions not only improves design confidence, but is increasingly becoming a requirement for quality assurance and documentation for norms. It is a complex task consuming significant manpower and compute power, and it tends to be sacrificed under time pressure in the final stage of a project. We present an automated structured approach that differentiates itself through its thoroughness, its efficiency and, most of all, its ease of use. It enables even novice designers to apply advanced state-of-the-art statistical tools to create a report including a measure of robustness for each specification and for the circuit.

---

**Presentation Title:** TERTULIA IC-DESIGN - EUROPE TEAMS UP

**Authors:**

Jürgen Haase, edacentrum, DE

**Abstract**

The clusters of Grenoble and Dresden have developed into leading clusters of world-wide importance. Now these clusters have launched substantial initiatives for collaboration in order to strengthen Europe’s position in the world-wide competition of microelectronics sites. This talk gives an overview of current initiatives, including the tertulia IC-Design.

---

**UB04.01 QUANTUMEDA: A VISUALIZATION AND DESIGN ENVIRONMENT FOR TOPOLOGICAL QUANTUM CIRCUITS**

**Authors:** Ilia Polian, Wolfgang Wallner and Alexandru Paler, University of Passau, DE

**Abstract**
Quantum circuits use quantum-mechanical properties of certain physical systems, such as superposition and entanglement, to perform massively parallel calculations. They provide polynomial algorithms for problems for which only inefficient algorithms with asymptotically exponential running time are known in conventional models of computation. Building a scalable quantum computer that can process a large number of quantum bits (qubits) is one of the grand challenges of modern science. While the first small quantum computers have been experimentally demonstrated and a number of implementation technologies have been suggested, all of them encounter difficulties when it comes to scaling. The central difficulty is the high susceptibility of such circuits to noise and decoherence, which necessitates the use of special quantum error correction. Topological quantum computing (TQC) is a paradigm that offers a path to scalability. It strikes a balance between systematic, intuitive methods to design large computations, and relatively loose requirements on the vulnerability of individual qubits to errors. The availability of a platform for implementing large quantum algorithms creates the need for methods to manage design complexity, including automatic synthesis, optimization, compaction, verification and visualization of TQC circuits. Topological quantum circuits are based on a two-dimensional cluster of qubits that supports highly efficient topological quantum error-correcting codes. In this way, the circuits can operate even though their individual qubits are subject to relatively high error rates. We will present the first environment for the design of TQC circuits. The environment allows the user to graphically enter the structure of a circuit, add, delete and re-shape individual qubits, and perform optimization and compaction (both manually and by global replacement).
The circuits are represented on an intermediate technology-independent level, where "logical qubits" that consist of a large number of physical qubits perform error-corrected operations. For example, the circuit in Fig. 1 shows an error-corrected CNOT gate implemented by four logical qubits represented by colored structures. The optimized representation can be translated into instruction sequences for a classical computer that operates the actual quantum hardware.

**More information ...**

**UB04.02 AIDA: ANALOG IC DESIGN AUTOMATION**

**Authors:** Nuno Horta, Nuno Lourenço, Ricardo Martins, Ricardo Póvoa, António Canelas and Pedro Ventura

1 Instituto de Telecomunicacoes, PT; 2 Instituto de Telecomunicacoes / Instituto Superior Técnico, PT

**Abstract**
This demonstration presents AIDA, an analog integrated circuit (IC) design automation environment. AIDA includes two main modules, namely, AIDA-C and AIDA-L. AIDA-C is a circuit-level synthesis tool which uses state-of-the-art multi-objective multi-constrained optimization kernels, based on evolutionary computation techniques, where the robustness of the solutions is attained by considering a layout-aware approach and, also, extreme process variations by means of PVT corner analysis. The circuit's performance is measured using Spectre®, Eldo® or Hspice® electrical simulators as evaluation engines. AIDA-L considers the device sizes and the best floorplan, obtained with AIDA-C, and generates the complete layout by placing and routing the devices, while fulfilling the technology design rules by using an in-house design-rule check (DRC) and layout-versus-schematic (LVS) procedures. In order to demonstrate AIDA design environment several analog circuit structures, e.g., OTAs, LNAs, LC-Oscillators, etc., will be synthesized in a 130nm CMOS technology. AIDA-C is demonstrated for circuit-level sizing and optimization by generating a family of Pareto Optimal solutions based on user performance and functional specifications. AIDA-L is demonstrated by generating the layout of a user selected solution from AIDA-C, taking into account electrical currents information to mitigate electromigration and IR-drop effects, and also wiring symmetry for multiplex multi-terminal signal nets of analog ICs.

**More information ...**

**UB04.03 PATN: A PERFORMANCE ANALYSIS TOOL FOR NOC**

**Authors:** Yang Chen and Zhonghai Lu, KTH Royal Institute of Technology, SE

**Abstract**
With the number of processors on a single chip increasing, and more and more time-sensitive applications added to on-chip systems, performance bound analysis becomes essential for QoS Network-on-Chip (NoC) design and evaluation. To provide reliable and automated analysis for QoS NoCs, we propose PATN (Performance Analysis Tool for NoC), which automatically computes the end-to-end delay bounds of data flows and the backlog bounds of buffers for NoCs with arbitrary topology. PATN is based on network calculus, which rests on solid mathematical foundations and provides well-guaranteed accuracy of the results. Network-calculus-based analysis has been successfully employed for various communication networks, such as SpaceWire and AFDX. For example, Airbus adopted and approved network-calculus-based analysis for certification of its A380 aircraft. In this demonstration, we give an overview of PATN in two parts. First, we explain the architecture and main functions, and show the workflow and printed log by analysing the end-to-end delay bound of a data flow in a simple network. The log shows that the analysis follows the theoretical methodology exactly and hence obtains correct results that are as tight as the theory can achieve. Second, we use PATN to analyse the delay bounds and backlog bounds for three NoCs with different topologies: binary tree, mesh, and a hierarchical topology of binary tree and mesh. The analyses demonstrate the computation speed and scalability of PATN. Moreover, we compare the delay bounds computed with different configuration parameters of the flows and routers, which shows how the delay bound is affected by these parameters.
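
As a rough illustration of the kind of bound that network calculus yields, the sketch below uses the textbook case of a token-bucket arrival curve served by rate-latency servers in tandem. It is not PATN's implementation: the class names, the parameter values, and the assumption that the flow sees no contention from other flows are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenBucket:
    """Arrival curve alpha(t) = burst + rate * t (illustrative flow model)."""
    burst: float  # sigma, e.g. in flits
    rate: float   # rho, e.g. in flits per cycle


@dataclass
class RateLatency:
    """Service curve beta(t) = rate * max(0, t - latency)."""
    rate: float     # R, guaranteed service rate
    latency: float  # T, worst-case initial latency


def concatenate(servers: List[RateLatency]) -> RateLatency:
    """Service curve of rate-latency servers in tandem:
    the rates combine by minimum and the latencies add up."""
    return RateLatency(rate=min(s.rate for s in servers),
                       latency=sum(s.latency for s in servers))


def delay_bound(flow: TokenBucket, server: RateLatency) -> float:
    """Worst-case delay T + sigma / R, valid only when rho <= R."""
    if flow.rate > server.rate:
        raise ValueError("flow rate exceeds service rate: no finite bound")
    return server.latency + flow.burst / server.rate


def backlog_bound(flow: TokenBucket, server: RateLatency) -> float:
    """Worst-case buffer occupancy sigma + rho * T, valid only when rho <= R."""
    if flow.rate > server.rate:
        raise ValueError("flow rate exceeds service rate: no finite bound")
    return flow.burst + flow.rate * server.latency


if __name__ == "__main__":
    # Hypothetical flow crossing two routers (values are made up).
    flow = TokenBucket(burst=4.0, rate=0.25)
    path = concatenate([RateLatency(rate=1.0, latency=2.0),
                        RateLatency(rate=0.5, latency=3.0)])
    print(delay_bound(flow, path))    # 13.0
    print(backlog_bound(flow, path))  # 5.25
```

Concatenating the servers before computing the bound gives a tighter result than adding up per-router delay bounds, because the burst is only paid for once along the path.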

**More information ...**

**UB04.04 GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES**

**Authors:** Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Maci, Politecnico di Torino, IT

**Abstract**
Gemini is a synthesis and optimization tool for graphene-based digital devices. Given a combinational circuit description through its Boolean representation, Gemini produces a SPICE netlist mapped with graphene PN-junction gates. The tool is composed of a parser layer to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library used to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. Usable as stand-alone software or as a library that is easy to integrate into state-of-the-art tools, Gemini represents a first step towards an enabling technology for future synthesis and optimization processes for graphene-based devices.

**More information ...**

**UB04.05 HWDLOBUR: DESIGN OF A HIGH PERFORMANCE CORE FOR REMOVING BLUR EFFECT ON IMAGES**

**Authors:** Giuseppe Airo' Farulla, Giulio Gambardella, Marco Indaco, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT

**Abstract**
This work aims at developing a high-performance FPGA-based IP core able to perform a deblurring algorithm in real time. Modern approaches to deblurring usually either handle only simple types of blur or need heavy user interaction. Moreover, they usually require several minutes (or even hours) to process a single image. Our purpose is to study the current state of the art and identify the best deblurring algorithms that are suitable for a hardware implementation. The selected algorithm is optimized and implemented in hardware in order to perform the deblurring task with the highest possible performance.

**More information ...**

**UB04.06 ENERGY-MODULATED COMPUTING**

**Authors:** Maxim Rykunov, Reza Ramezani, Abdullah Baz, Xuefu Zhang, Delong Shang, Andrey Mokhov, Danil Sokolov, Fei Xia and Alex Yakovlev, Newcastle University, GB

**Abstract**
This demo will illustrate the principle of energy-modulated computing according to which the flow of energy entering a computing system determines its computational flow. This principle will be fundamental for building future autonomous systems, such as those powered by energy harvesting sources and aimed for survival in power-deficient conditions. The demo includes a set of experimental circuits (with three VLSI chips and PCBs) to work in variable power supply conditions and software tools for digital and analogue co-design (Workcraft, Petrify, MPSat).

**ID.Fix: An EDA Tool for Fixed-Point Refinement of Embedded Systems**

**Authors:**
Olivier Sentieys\(^1\), Daniel Menard\(^2\) and Nicolas Simon\(^3\)

\(^1\)INRIA, FR; \(^2\)INSA Rennes, FR; \(^3\)University of Rennes, FR

**Abstract**

Most digital image and signal processing algorithms are implemented on architectures based on fixed-point arithmetic to satisfy the cost and power consumption constraints of embedded systems. The fixed-point conversion process (or refinement) is crucial for reducing the time-to-market. Design tools to automate this phase and to explore the design space are thus required. The ID.Fix EDA tool, based on the compiler infrastructure GECOS, converts floating-point C source code into C code using fixed-point data types. The data word-lengths are optimized by minimizing the implementation cost under an accuracy constraint. To keep optimization time low, an analytical approach is used to evaluate the fixed-point computation accuracy. This approach is valid for systems made up of any (smooth) arithmetic operations.
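
To make the idea of analytical accuracy evaluation concrete, the sketch below compares the measured rounding error of a fixed-point conversion with the classic uniform quantization noise model, in which rounding to a step q gives a noise power of about q²/12. This is not ID.Fix's method or API: the function names are hypothetical, and the model shown covers a single rounding step only, not the propagation of noise through a whole system.

```python
import random
from typing import Sequence


def to_fixed(x: float, frac_bits: int) -> int:
    """Round x to the nearest multiple of 2**-frac_bits, stored as an integer."""
    return round(x * (1 << frac_bits))


def to_float(q: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)


def measured_noise_power(samples: Sequence[float], frac_bits: int) -> float:
    """Mean squared quantization error observed on the given samples."""
    errors = [to_float(to_fixed(x, frac_bits), frac_bits) - x for x in samples]
    return sum(e * e for e in errors) / len(errors)


def analytical_noise_power(frac_bits: int) -> float:
    """Uniform rounding-noise model: q**2 / 12 with step q = 2**-frac_bits."""
    q = 2.0 ** -frac_bits
    return q * q / 12.0


if __name__ == "__main__":
    random.seed(0)
    data = [random.uniform(-1.0, 1.0) for _ in range(20000)]
    for bits in (6, 8, 10, 12):
        print(bits, measured_noise_power(data, bits), analytical_noise_power(bits))
```

The analytical estimate needs no simulation run per candidate word-length, which is why an analytical model helps keep a word-length search fast.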

---

**Faultify: Probabilistic Circuit Fault Emulation**

**Authors:**
David May and Walter Stechele, TUM, DE

**Abstract**

We want to demonstrate an FPGA-based probability-aware fault emulator and its corresponding algorithms in the context of a real-time H.264 decoder. The demo will show that reliability constraints can be relaxed inside the circuit without noticeable degradation of the image quality when carefully investigating where the constraints can be relaxed. We will show how this investigation can be done using our emulator and we will show the effect of a relaxed robustness of the circuit in real-time.

---

**Exhibition-Reception Exhibition Reception**

**Date:** Tuesday 25 March 2014  
**Time:** 18:30 - 19:30  
**Location / Room:** Several serving points inside the Exhibition Area (Terrace Level)

The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

---

**5.1 SPECIAL DAY Hot Topic: Predictable Multi-Core Computing**

**Date:** Wednesday 26 March 2014  
**Time:** 08:30 - 10:00  
**Location / Room:** Saal 1

**Organiser:**  
Jürgen Teich, University of Erlangen-Nuremberg, DE

**Chair:**  
Petru Eles, Linköping University, SE

**Co-Chair:**  
Jürgen Teich, University of Erlangen-Nuremberg, DE

The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. This session treats this important problem of time predictability of applications on multi-core platforms by presenting results of the impact of resource sharing on performance, an architecture that has been designed to meet predictability requirements as well as new results on scheduling mixed critical applications on multi-core platforms.

**Presentation Title**

**Impact of Resource Sharing on Performance and Performance Prediction**

**Speakers:**  
Jan Reineke and Reinhard Wilhelm, Informatik, Universität des Saarlandes, DE

**Abstract**

Multi-core processors are increasingly considered as execution platforms for embedded systems because of their good performance/energy ratio. However, the interference on shared resources poses several problems. It may severely reduce the performance of tasks executed on the cores, and it increases the complexity of timing analysis and/or decreases the precision of its results. Many applications implemented on multi-core platforms are safety- and some also time-critical. A critical issue for these applications is the reduced predictability of such systems resulting from the interference of different applications on shared resources. These interferences can be at least of two kinds: Several applications may request a resource at the same time, but the resource can only admit one access at a time. As a consequence, an arbitration mechanism may delay the request of all but one application, thus slowing down the other applications. This is the case of resources like buses, typically called bandwidth resources. On the other hand, one application may also change the state of a shared resource such that another application using that resource will suffer from a slowdown. This is the case with shared caches, which fall into the class of storage resources. Interference of shared resources makes worst-case execution time (WCET) analysis of applications more difficult since a task or a thread can no longer be analyzed for its timing behavior in isolation. All potential interferences slowing down the task under analysis have to be considered. This leads to a combinatorial explosion of the analysis complexity, as all possible interleavings of different threads have to be analyzed.
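
The combinatorial explosion mentioned above can be quantified with a small sketch. It counts the orderings of shared-resource accesses that preserve each thread's own program order, which is a multinomial coefficient. The function name and the access counts are hypothetical and serve only as an illustration; real timing analyses use abstractions precisely to avoid enumerating these orderings.

```python
from math import factorial
from typing import Sequence


def interleavings(accesses_per_thread: Sequence[int]) -> int:
    """Number of global orderings of accesses that keep every thread's
    own accesses in program order: (n1 + ... + nk)! / (n1! * ... * nk!)."""
    total = factorial(sum(accesses_per_thread))
    for n in accesses_per_thread:
        total //= factorial(n)  # exact: every partial quotient is an integer
    return total


if __name__ == "__main__":
    print(interleavings([2, 2]))     # 6
    print(interleavings([10] * 4))   # four threads, ten accesses each
```

Even four threads with ten shared accesses each already yield more than 10^21 orderings, which is why exhaustive enumeration is not a practical basis for WCET analysis.
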
5.2 Hot Topic: Hacking and Protecting Hardware: Threats and Challenges

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 6

Organisers:
Said Hamdioui, TU Delft, NL
Giorgio Di Natale, LIRMM, FR

Chair:
Said Hamdioui, TU Delft, NL

Co-Chair:
Giorgio Di Natale, LIRMM, FR

For this Hot-Topic Session, we will have four leading researchers and experienced speakers from different companies to address both hacking and protecting ICs for chip data. Two speakers will focus on the weaknesses of ICs and systems and the ways they can be hacked to retrieve secret data, while the other two will cover smart schemes for protecting them.

Time Label Presentation Title Authors
09:00 5.1.2 TIME-CRITICAL COMPUTING ON A SINGLE CHIP MASSIVELY PARALLEL PROCESSOR
Speaker: Benoît Dupont de Dinechin, Kalray, FR
Abstract: The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. We illustrate how this problem has been addressed by suitably designing the architecture, implementation, and programming model of the Kalray MPPA-256 single-chip many-core processor. The MPPA-256 (Multi-Purpose Processing Array) processor integrates 256 processing engine (PE) cores and 32 resource management (RM) cores on a single 28nm CMOS chip. These VLIW cores are distributed across 16 compute clusters and 4 I/O subsystems, each with a locally shared memory. On-chip communication and synchronization are supported by an explicitly addressed dual network-on-chip (NoC), with one node per compute cluster and 4 nodes per I/O subsystem. Off-chip interfaces include DDR, PCI, and Ethernet, and a direct access to the NoC for low-latency processing of data streams. The key architectural features that support time-critical applications are timing compositional cores, independent memory banks inside the compute clusters, and the data NoC whose guaranteed services are determined by network calculus. The programming model provides communicators that effectively support distributed computing primitives such as remote writes, barrier synchronizations, active messages, and communication by sampling. POSIX time functions expose synchronous clocks inside compute clusters and mesosynchronous clocks across the MPPA-256 processor.

09:30 5.1.3 MAPPING MIXED-CRITICALITY APPLICATIONS ON MULTI-CORE ARCHITECTURES
Speakers: Georgia Giannopoulou1, Nikolay Stoimenov1, Pengcheng Huang2 and Lothar Thiele3
1ETH Zurich, CH; 2ETHZ, CH; 3Swiss Federal Institute of Technology Zurich, CH
Abstract: A common trend in real-time embedded systems is to integrate multiple applications on a single platform. Such systems are known as mixed-criticality (MC) systems when the applications are characterized by different criticality levels. Nowadays, multicore platforms are promoted due to cost and performance benefits. However, certification of multicore MC systems is challenging as concurrently executed applications of different criticalities may block each other when accessing shared platform resources. Most of the existing research on multicore MC scheduling ignores the effects of resource sharing on the response times of applications. Recently, an MC scheduling strategy was proposed which explicitly accounts for these effects. This paper discusses how to combine this policy with an optimization method for the partitioning of tasks to cores as well as the static mapping of memory blocks, i.e., task data and communication buffers, to the banks of a shared memory architecture. Optimization is performed at design time, aiming to minimize the worst-case response times of tasks and to achieve efficient resource utilization. The proposed optimization method is evaluated using an industrial application.

10:00 End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

Time | Label | Presentation Title | Authors
--- | --- | --- | ---
09:37 | 5.2.4 | SILICONAP: A SILICON AUTHENTICATION PLATFORM FOR SECURITY AND ANTI-COUNTERFEITING | Mohammad Tehrani, TrueLogic, US
Speaker: Mohammad Tehrani, TrueLogic, US
Abstract: He will talk about design for security and anti-counterfeiting. His talk includes new design techniques for Trojan detection, Trojan prevention, vulnerability analysis, as well as design techniques for preventing counterfeiting of integrated circuits and providing means for easy detection.
10:00 | | End of session | |

5.3 Reliable Systems in the Age of Variability

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 1
Chair: Antonio Miele, Politecnico di Milano, IT
Co-Chair: José L. Ayala, Complutense University of Madrid, ES

The evolution of the silicon industry over past decades has been fueled by continued scaling. This has motivated the rapid evolution of integration technologies. In future technology nodes, reliability is expected to become a first-order design constraint. This session tackles this with novel techniques, spanning from memoization to latency-insensitive systems, proposing to tolerate, recover and manage reliability issues in a more variable scenario.

Time | Label | Presentation Title | Authors
--- | --- | --- | ---
08:30 | 5.3.1 | TEMPORAL MEMOIZATION FOR ENERGY-EFFICIENT TIMING ERROR RECOVERY IN GPGPUS | Abbas Rahimi, Luca Benini and Rajesh Gupta
1UC San Diego, US; 2Università di Bologna, IT
Abstract: Manufacturing and environmental variability lead to timing errors in computing systems that are typically corrected by error detection and correction mechanisms at the circuit level. The cost and speed of recovery can be improved by memoization-based optimization methods that exploit spatial or temporal parallelisms in suitable computing fabrics such as general-purpose graphics processing units (GPGPUs). We propose here a temporal memoization technique for use in floating-point units (FPUs) in GPGPUs that uses value locality inside data-parallel programs. The technique recalls (memorizes) the context of error-free execution of an instruction on an FPU. To enable scalable and independent recovery, a single-cycle lookup table (LUT) is tightly coupled to each FPU to maintain contexts of recent error-free executions. The LUT reuses these memorized contexts to exactly, or approximately, correct errant FP instructions based on application needs. In real-world applications, the temporal memoization technique achieves an average energy saving of 8%-28% for a wide range of timing error rates (9%-4%) and outperforms recent advances in resilient architectures. This technique also enhances robustness in the voltage overscaling regime and achieves relative average energy saving of 66% with 11% voltage overscaling.

09:00 | 5.3.2 | RELIABILITY-AWARE EXCEPTIONS: TOLERATING INTERMITTENT FAULTS IN MICROPROCESSOR ARRAY STRUCTURES | Waleed Dweik, Murali Annavaram and Michel Dubois, University of Southern California, US
Abstract: In future technology nodes, reliability is expected to become a first-order design constraint. Faults encountered in a chip can be classified into three categories: transient, intermittent, and permanent. Fault classification allows a chip to take the appropriate corrective action. Mechanisms have been proposed to distinguish transient from non-transient faults where all non-transient faults are handled as permanent. Intermittent faults induced by wearout phenomena have become the dominant reliability concern in nanoscale technology, yet there is no mechanism that provides finer classification of non-transient faults into intermittent and permanent faults. In this paper, we present a new class of exceptions called Reliability-Aware Exceptions (RAEs) which provide the ability to distinguish intermittent faults in microprocessor array structures. The RAE handlers have the ability to manipulate microprocessor array structures to recover from all three categories of faults. Using RAEs, we demonstrate that the reliability of two representative microarchitecture structures, load/store queue and reorder buffer in an out-of-order processor, is improved by average factors of 1.3 and 1.95, respectively.

09:30 | 5.3.3 | TEMPERATURE AWARE ENERGY-RELIABILITY TRADE-OFFS FOR MAPPING OF THROUGHPUT-CONSTRAINED APPLICATIONS ON MULTIMEDIA MPSoCs | Anup Das, Akash Kumar and Bharadwaj Veeravalli, National University of Singapore, SG
Abstract: This paper proposes a design-time (offline) analysis technique to determine application task mapping and scheduling on a multiprocessor system and the voltage and frequency levels of each core (offline DVFS) that minimize application computation and communication energy while simultaneously minimizing processor aging. The proposed technique incorporates (1) the effect of the voltage and frequency on the temperature of a core; (2) the effect of neighboring core voltage and frequency on the temperature (spatial effect); (3) pipelined execution and cyclic dependencies among tasks; and (4) the communication energy component, which often constitutes a significant fraction of the total energy for multimedia applications. The temperature model proposed here can be easily integrated in the design space exploration for multiprocessor systems. Experiments conducted with applications modeled as synchronous data-flow graphs in conjunction with the HotSpot tool for temperature modeling clearly demonstrate the quality and the speed-up achieved using the proposed approach. Further, they also show 40% savings in energy consumption with a 6% increase in system lifetime.

09:45 | 5.3.4 | RECOVERY-BASED RESILIENT LATENCY-INSENSITIVE SYSTEMS | Yuankai Chen, Xuan Zeng and Hai Zhou
1Northwestern University, US; 2Fudan University, CN
Abstract: As the interconnect delay is becoming a larger fraction of the clock cycle time, the conventional global stalling mechanism, which is used to correct error in general synchronous circuits, would be no longer feasible because of the expensive timing cost for the stalling signal to travel across the circuit. In this paper, we propose recovery-based resilient latency-insensitive systems (RLIS) that efficiently integrate error-recovery techniques with latency-insensitive design to replace the global stalling. We first demonstrate a baseline RLIS as the motivation of our work that uses additional output buffer which guarantees that only correct data can enter the output channel. However this baseline RLIS suffers from performance degradation even when errors do not occur. We propose a novel improved RLIS that allows erroneous data to propagate in the system. Equipped with improved queues that prevent accumulation of erroneous data, the improved RLIS retains the system performance. We provide theoretical studies that analyze the impact of errors on system performance and the queue sizing problem. We also theoretically prove that the improved RLIS performs no worse than the global stalling mechanism. Experimental results show that the improved RLIS has 40.3% and even 3.1% throughput improvements compared to the baseline RLIS and the infeasible global stalling mechanism respectively, with less than 10% hardware overhead.
5.4 Prediction and optimization of timing variations

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 2

Chair:
Antonio Rubio, UPC Barcelona, ES

Co-Chair:
Marisa López Vallejo, UPM Madrid, ES

The session addresses yield analysis due to timing variations as well as various flip flop design techniques improving timing margins under variability.

08:30 5.4.1 EFFICIENT HIGH-SIGMA YIELD ANALYSIS FOR HIGH DIMENSIONAL PROBLEMS
Speakers: Moning Zhang, Zuochang Ye and Yan Wang, Tsinghua National Laboratory for Information Science and Technology, Institute of Microelectronics, Tsinghua University, CN

Abstract
High-sigma analysis is important for estimating the probability of rare events. Traditional high-sigma analysis only works for small (low-dimensional) problems, limited to 10-20 random variables, mostly due to the difficulty of finding optimal boundary points. In this paper we propose an efficient method to deal with high-dimensional problems. The proposed method is based on performing optimization in a series of low-dimensional parameter spaces. The final solution can be regarded as a greedy version of the global optimization. Experiments show that the proposed method can work efficiently on problems with >100 independent variables.
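
To show why rare-event estimation needs dedicated methods, the sketch below converts a sigma level into a one-sided Gaussian tail probability and estimates how many plain Monte Carlo samples would be needed to observe a given number of failures. It is only a back-of-the-envelope illustration and not the method of the paper: the function names, the single Gaussian metric, and the target of 100 observed failures are assumptions made for this example.

```python
from math import erfc, sqrt


def tail_probability(sigma: float) -> float:
    """One-sided tail P(X > sigma) for a standard normal variable X."""
    return 0.5 * erfc(sigma / sqrt(2.0))


def mc_samples_needed(sigma: float, failures_wanted: int = 100) -> float:
    """Rough number of plain Monte Carlo samples needed to observe
    `failures_wanted` failures at the given sigma level (assumed rule of thumb)."""
    return failures_wanted / tail_probability(sigma)


if __name__ == "__main__":
    for s in (3.0, 4.0, 5.0, 6.0):
        print(s, tail_probability(s), mc_samples_needed(s))
```

At six sigma the tail probability is on the order of 1e-9, so brute-force sampling would need on the order of 1e11 circuit simulations.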

09:00 5.4.2 SUB-THRESHOLD LOGIC CIRCUIT DESIGN USING FEEDBACK EQUALIZATION
Speakers: Mahmoud Zangeneh and Ajay Joshi, Boston University, US

Abstract
Low energy has become one of the primary constraints in the design of digital VLSI circuits in recent years. Minimum-energy consumption can be achieved in digital circuits by operating in the sub-threshold regime. However, in this regime process variation can result in up to an order of magnitude variations in Ion/Ioff ratios leading to timing errors, which can have a detrimental impact on the functionality of the sub-threshold circuits. These timing errors become more frequent in scaled technology nodes where process variations are highly prevalent. Therefore, mechanisms to mitigate these timing errors while minimizing the energy consumption in sub-threshold circuits are required. In this paper, we propose the use of a variable threshold feedback equalizer circuit with combinational logic blocks to mitigate the timing errors, which can then be leveraged to reduce the dominant leakage energy by scaling supply voltage or decreasing the propagation delay. At a fixed supply voltage, we can decrease the propagation delay of the critical path using equalizer circuits and correspondingly decrease the leakage energy consumption. For an 8-bit carry lookahead adder designed in a UMC 130 nm process, the operating frequency can be increased by 22.8% (on average), while reducing the leakage energy by 22.6% in the sub-threshold regime. Overall, the feedback equalization technique provides up to 35.4% lower energy-delay product compared to conventional non-equalized logic. Alternatively, for an 8-bit carry lookahead adder, the proposed technique enables us to reduce the critical voltage (beyond which timing errors occur) from 300 mV (nominal design) to 270 mV (design with feedback circuit), and provides a 16.72% decrease in energy per operation while maintaining performance.

While the industrial usage of formal methods has proliferated in the past decade, the capacity limitations of these techniques remain a challenge to their applicability.

Speakers:
Guowei Zhang and Peter Beerel
1Tsinghua University, CN; 2Univ. of Southern California, US

Abstract
Bubble Razor has been proposed to eliminate required timing margins in synchronous design caused by increasing delay variation due to process variation and aging. However, the theoretical analysis of its performance under variability is unknown. This paper presents a Markov Chain model to describe the behavior of Bubble Razor. Using this model, we analyze its performance and provide an optimizing strategy to maximize its benefits.

Speakers:
Yanzhi Wang, Xue Lin, Qing Xie, Naehyuck Chang and Massoud Pedram
1University of Southern California, US; 2Seoul National University, KR

Abstract
Hybrid electrical energy storage (HEES) systems consisting of heterogeneous electrical energy storage (EES) elements are proposed to exploit the strengths of different EES elements and hide their weaknesses. The cycle life of the EES elements is one of the most important metrics. The cycle life is directly related to the state-of-health (SoH), which is defined as the ratio of the full charge capacity of an aged EES element to its designed (or nominal) capacity. The SoH degradation models of batteries in the previous literature can only be applied to charging/discharging cycles with the same state-of-charge (SoC) swing. To address this shortcoming, this paper derives a novel battery SoH degradation model for charging/discharging cycles with arbitrary patterns. Based on the proposed model, this paper presents a near-optimal charge management policy focusing on extending the cycle life of battery elements in HEES systems while simultaneously improving the overall cycle efficiency.

Speakers:
Mehrzad Nejat, Bijan Alizadeh and Ali Afzali Kusha, School of Electrical and Computer Eng., College of Eng., University of Tehran, IR

Abstract
A novel time borrowing method called dynamic Flip-Flop conversion is presented in this paper. A timing violation predictor detects violations halfway along the critical path and dynamically converts the critical Flip-Flop to a latch. This way, the time borrowing benefits of latches are utilized in a Flip-Flop based design, which is more compatible with Computer-Aided-Design tools. The overhead of this method is smaller than that of similar methods due to the elimination of delay elements. According to post-synthesis simulations and Monte-Carlo analysis of Spice simulations on some ITC’99 benchmark circuits, the power overhead of the proposed method is about 15% and 19% smaller than that of Soft-Edge-Flip-Flop and Dynamic-Clock-Stretching circuits respectively in a simple case of about 40% yield improvement. This overhead would be relatively even smaller for higher performance and yield improvements.

Speakers:
Luo Sun1, Jimson Mathew1, Rishad Shafik2, Dhiraj Pradhan1 and Zhen Li1
1University of Bristol, GB; 2University of Southampton, GB

Abstract
Carbon nanotube field-effect transistor (CNTFET) is envisioned as a promising device to overcome the limitations of traditional CMOS based MOSFETs due to its favourable physical properties. This paper presents a novel six-transistor (6T) static random access memory (SRAM) bitcell design using CNTFETs. Extensive validations and comparative analyses are carried out with the proposed SRAM design using Spice based simulations. We show that the proposed CNTFET based SRAM has a significantly better static noise margin (SNM) and write margin (WAM) compared to a CNTFET-based standard 6T bitcell, especially to isolated load-port 6T cell based on CNTFET, while consuming less dynamic power. We further demonstrate that it exhibits higher robustness under process, voltage and temperature (PVT) variations when compared with the traditional CMOS SRAM cell designs. Furthermore, metallic CNTs removal technique is used considering metallic tolerance to make the proposed SRAM design more reliable.

SCALABLE LIVENESS VERIFICATION FOR COMMUNICATION FABRICS

Speakers:
Sebastiaan Joosten and Julien Schmaltz, Open University, NL

Abstract
Communication fabrics can be seen as a set of queues and flops interconnected by combinatorial logic. Based on this simple but powerful observation, we propose a novel method for liveness verification. Our method directly applies to Register Transfer Level designs. The essential aspects of our approach are (1) abstracting away from the details of queue implementations and (2) efficiently encoding liveness properties in an SMT instance. Experimental results are promising. Designs with hundreds of queues can be analysed for liveness within minutes.

**5.6 Emerging logic technologies**

**Date:** Wednesday 26 March 2014  
**Time:** 08:30 - 10:00  
**Location / Room:** Konferenz 4

**Chair:**  
Mehdi Tahoori, KIT, DE

**Co-Chair:**  
Marco Ottavi, University of Rome "Tor Vergata", IT

The papers in this session consider new ways to realize both Boolean and non-Boolean logic. Potential implementations are based on graphene, spin, and resonance energy transfer.

**Time** | **Label** | **Presentation Title** | **Authors**
--- | --- | --- | ---
08:30 | 5.6.1 | **RETLAB: A FAST DESIGN-AUTOMATION FRAMEWORK FOR ARBITRARY RET NETWORKS** | Mohammad Mottaghi, Arjun Rallapalli and Chris Dwyer, Duke University, US

**Abstract**  
Resonance energy transfer (RET) circuits are networks of photo-active molecules that can implement arbitrary logic functions. The nanoscale size of these structures can bring high-density computation to new domains, e.g., in vivo sensing and computation. A key challenge in the design of a RET network is to find, among a huge set of configurations (i.e., design space), the optimum choice and arrangement of molecules on a nanostructure. The prohibitively large size of the design space makes it impractical to evaluate every possible configuration, motivating the need for design-space pruning to be integrated into the design flow. To this end, we have developed a computer-aided design framework, called RETLab, that enables structured pruning of the design space.

**5.6.3 HIGHLY ACCURATE SPICE-COMPATIBLE MODELING FOR SINGLE- AND DOUBLE-GATE GNRFETS WITH STUDIES ON TECHNOLOGY SCALING**

**Speakers:** Morteza Gholipour, Ying-Yu Chen, Amit Sangai and Deming Chen

**University of Tehran, IR; University of Illinois at Urbana-Champaign, US**

**Abstract**

In this paper, we present a highly accurate closed-form compact model for Schottky-Barrier-type Graphene Nano-Ribbon Field-Effect Transistors (SB-GNRFETs). This is a physics-based analytical model for the current-voltage (I-V) characteristics of SB-GNRFETs. We carry out accurate approximations of Schottky barrier tunneling, channel charge and current, which provide improved accuracy while maintaining compactness. This SPICE-compatible compact model surpasses the existing model [15] in accuracy, and enables efficient circuit-level simulations of futuristic GNRFET-based circuits. The proposed model considers various design parameters and process variation effects, including graphene-specific edge roughness, which allows complete and thorough exploration and evaluation of SB-GNRFET circuits. We are able to model both single- and double-gate SB-GNRFETs, so we can evaluate and compare these two types of SB-GNRFET. We also compare the circuit-level performance of SB-GNRFETs with multi-gate (MG) Si-CMOS for a scalability study in future-generation technology. Our circuit simulations indicate that the SB-GNRFET has an energy-delay product (EDP) advantage over Si-CMOS only in the ideal case: the EDP of the ideal SB-GNRFET (assuming no process variation) is ~1.3% of that of Si-CMOS, while the EDP of the non-ideal case with process variation is 136% of that of Si-CMOS. Finally, we study technology scaling with SB-GNRFET and MG Si-CMOS. We show that the EDP of the ideal (non-ideal) SB-GNRFET is ~0.88% (54%) of that of Si-CMOS as the technology node scales down to 7 nm.

**5.6.4 REWIRING FOR THRESHOLD LOGIC CIRCUIT MINIMIZATION**

**Speakers:** Chia-Chun Lin, Chun-Yao Wang, Yung-Chih Chen and Ching-Yi Huang

**Dept. of Computer Science, National Tsing Hua University, TW; Dept. of Computer Science and Engineering, Yuan Ze University, TW**

**Abstract**

Recently, many works have focused on the synthesis, verification, and testing of threshold circuits, owing to rapid progress in the efficient implementation of threshold logic circuits. To minimize the hardware cost of threshold circuit implementation, this paper proposes a heuristic that consists of rewiring operations and a simplification procedure. Additionally, this paper proves that a subset of the input vectors of a gate, called critical-effect vectors, is complete for formally verifying the equivalence of two threshold logic gates, so the whole truth table need not be checked. This result can accelerate the equivalence checking of two threshold logic gates. The experimental results show that the proposed heuristic can efficiently reduce the cost.
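
As a point of reference for the equivalence-checking claim, the sketch below shows the baseline that the abstract improves on: comparing two threshold logic gates over the whole truth table. It is a minimal illustration, not the paper's method; the gate representation (a weight list plus a threshold) and the function names are assumptions made here, and the critical-effect vectors themselves are not defined in the abstract, so they are not modelled.

```python
from itertools import product


def threshold_gate(weights, threshold, inputs):
    """Output 1 when the weighted sum of the binary inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def equivalent(gate_a, gate_b, num_inputs):
    """Baseline check: compare both gates on all 2**num_inputs input vectors."""
    return all(
        threshold_gate(*gate_a, vec) == threshold_gate(*gate_b, vec)
        for vec in product((0, 1), repeat=num_inputs)
    )


# Hypothetical gates, each given as (weights, threshold).
and_a = ([1, 1], 2)    # 2-input AND
and_b = ([2, 3], 5)    # same function with different weights
or_gate = ([1, 1], 1)  # 2-input OR

print(equivalent(and_a, and_b, 2))
print(equivalent(and_a, or_gate, 2))
```

The exhaustive loop grows exponentially with the number of gate inputs, which is why a smaller complete set of vectors is attractive.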

**10:00 Width Minimization in the Single-Electron Transistor Array Synthesis**

**Speakers:** Chien-Wen Liu, Chang-En Chiang, Ching-Yi Huang, Chun-Yao Wang, Yung-Chih Chen, Suman Datta and Vijaykrishnan Narayanan

**Dept. of Electrical Engineering, The Pennsylvania State University, US; Department of Computer Science and Engineering, The Pennsylvania State University, US**

**Abstract**

Power consumption has become one of the primary challenges facing Moore's law. The Single-Electron Transistor (SET) operating at room temperature has been demonstrated as a promising device for extending Moore's law due to its ultra-low power consumption during operation. Prior work has proposed an automated mapping approach for the SET architecture which focuses on minimizing the number of hexagons in an SET array. However, the area of an SET array is more closely related to its width. Consequently, in this work, we propose an approach for width minimization of SET arrays. The experimental results show that the proposed approach saves 26% of width compared with the state-of-the-art for a set of MCNC and IWLS 2005 benchmarks while spending similar CPU time.

**10:01 Area Minimization Synthesis for Reconfigurable Single-Electron Transistor Arrays With Fabrication Constraints**

**Speakers:** Yi-Hang Chen, Jian-Yu Chen and Juin-Dar Huang, Department of Electronics Engineering, National Chiao Tung University, TW

**Abstract**

As fabrication processes exploit even deeper submicron technology, power dissipation has become a crucial issue for most electronic circuit and system designs nowadays. In particular, leakage power is becoming a dominant source of power consumption. Recently, the reconfigurable single-electron transistor (SET) array has been proposed as an emerging circuit design style for continuing Moore's Law due to its ultra-low power consumption. Several automated synthesis approaches have been developed for the reconfigurable SET array in the past few years. Nevertheless, all of those existing methods consider fabrication constraints, which are mandatory, merely in late synthesis stages. In this paper, we propose a synthesis algorithm, featuring both variable reordering and product term reordering, for area minimization. In addition, our algorithm takes those mandatory fabrication constraints into account in early stages for better outcomes. Experimental results show that our new method can achieve an area reduction of up to 24% as compared to current state-of-the-art techniques.

**10:02 Software-based Pauli Tracking in Fault-Tolerant Quantum Circuits**

**Speakers:** Alexandru Paler, Simon Devitt, Kae Nemoto and Ilya Polian

**University of Passau, DE; National Institute of Informatics, JP**

**Abstract**

The realisation of large-scale quantum computing is no longer simply a hardware question. The rapid development of quantum technology has resulted in dozens of fundamental problems that should be directed towards the classical computer science and engineering community. One such problem is known as Pauli tracking. Methods for implementing quantum algorithms that are compatible with crucial error correction technology utilize extensive quantum teleportation protocols. These protocols are intrinsically probabilistic and result in correcting operators that occur as byproducts of teleportation. These byproduct operators do not need to be corrected in the quantum hardware itself, but are tracked through the circuit, and the output results are reinterpreted. This tracking is routinely ignored in quantum information as it is assumed that tracking algorithms will eventually be developed. In this work we help fill this gap and present an algorithm for tracking byproduct operators through a quantum computation.
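
The idea of tracking byproduct operators in software instead of correcting them in hardware can be sketched with a textbook Pauli frame. The sketch below is not the algorithm presented in the paper; it assumes a circuit built only from H and CNOT gates, ignores global phases, and uses a hypothetical `PauliFrame` class that stores one X bit and one Z bit per qubit.

```python
class PauliFrame:
    """Classical record of pending Pauli byproducts, one (x, z) bit pair per qubit."""

    def __init__(self, num_qubits):
        self.x = [0] * num_qubits
        self.z = [0] * num_qubits

    def apply_byproduct(self, qubit, pauli):
        """Record a byproduct X, Y or Z (Y is treated as X and Z together)."""
        if pauli in ("X", "Y"):
            self.x[qubit] ^= 1
        if pauli in ("Z", "Y"):
            self.z[qubit] ^= 1

    def h(self, qubit):
        """Hadamard exchanges X and Z."""
        self.x[qubit], self.z[qubit] = self.z[qubit], self.x[qubit]

    def cnot(self, control, target):
        """CNOT copies X from control to target and Z from target to control."""
        self.x[target] ^= self.x[control]
        self.z[control] ^= self.z[target]

    def interpret_z_measurement(self, qubit, raw_outcome):
        """A pending X byproduct flips the meaning of a Z-basis measurement result."""
        return raw_outcome ^ self.x[qubit]


frame = PauliFrame(2)
frame.apply_byproduct(0, "X")   # byproduct reported by a teleportation step
frame.cnot(0, 1)                # the X byproduct spreads to the target qubit
print(frame.x, frame.z)
print(frame.interpret_z_measurement(1, 0))
```

No quantum gate is applied for the byproduct itself; only the classical record changes, and the measured bit is reinterpreted at the end.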

**5.7 Test Generation and Optimization**

The session covers the generation of tests for different fault models, including interconnect opens, interconnects for 3D memories, and small-delay faults. Additionally, test optimization for SoC designs is presented.

**Time** | **Label** | **Presentation Title**
---|---|---
08:30 | 5.7.1 | (Best Paper Award Candidate) **EFFICIENT SMT-BASED ATPG FOR INTERCONNECT OPEN DEFECTS**
**Speakers:**
Dominik Erb\(^1\), Karsten Scheibler\(^1\), Matthias Sauer\(^2\) and Bernd Becker\(^2\)
\(^1\)University of Freiburg, Chair of Computer Architecture, DE; \(^2\)University of Freiburg, DE

**Abstract**
Interconnect opens are known to be one of the predominant defects in nanoscale technologies. However, automatic test pattern generation for open faults is challenging, because of their rather unstable behaviour and the numerous electric parameters which need to be considered. Thus, most approaches try to avoid accurate modeling of all constraints and use simplified fault models in order to detect as many faults as possible or make assumptions which decrease both complexity and accuracy. This paper presents a new SMT-based approach which for the first time supports the Robust Enhanced Aggressor Victim model without restrictions and handles oscillations. It is combined with the first open fault simulator fully supporting the Robust Enhanced Aggressor Victim model and thereby accurately considering unknown values. Experimental results show the high efficiency of the new method outperforming previous approaches by up to two orders of magnitude.

09:00 | 5.7.2 | **INTERCONNECT TEST FOR 3D STACKED MEMORY-ON-LOGIC**
**Speakers:**
Mottaqiallah Taouil\(^1\), Mahmoud Masadeh\(^1\), Said Hamdioui\(^1\) and Erik Jan Marinissen\(^2\)
\(^1\)Delft University of Technology, NL; \(^2\)IMEC, BE

**Abstract**
Three-dimensional stacked IC (3D-SIC) technology based on Through-Silicon Vias (TSVs) provides numerous advantages compared to traditional 2D-ICs. A potential application is memory stacked on logic, providing enhanced throughput and reduced latency and power consumption. However, testing the TSV interconnects between the two dies is challenging, because the memory and the logic die might come from different manufacturers. Currently, no standard exists and the proposed solutions fail to address dynamic and time-critical faults (at-speed testing). In addition, memory vendors have not been in favor of putting additional DFT structures such as JTAG for interconnect testing on their memory devices. This paper proposes a new Memory Based Interconnect Test (MBIT) approach for 3D stacked memories. Our test patterns are applied by read and write instructions to the memory and are validated by a case study where a 3D memory is assumed to be stacked on a MIPS64 processor. The main benefits of the MBIT approach are: (1) zero area overhead, (2) the ability to detect both static and dynamic faults and perform at-speed testing, (3) flexibility in applying any test pattern, as this can be executed by the CPU on the logic die, and (4) extremely short test execution time.

09:30 | 5.7.3 | **AN EFFECTIVE APPROACH TO AUTOMATIC FUNCTIONAL PROCESSOR TEST GENERATION FOR SMALL-DELAY FAULTS**
**Speakers:**
Andreas Riefert\(^1\), Lyl Ciganda\(^2\), Matthias Sauer\(^1\), Paolo Bernardi\(^2\), Matteo Sonza Reorda\(^3\) and Bernd Becker\(^1\)
\(^1\)University of Freiburg, DE; \(^2\)Politecnico di Torino, IT; \(^3\)Politecnico di Torino - DAUIN, IT

**Abstract**
Functional microprocessor test methods provide several advantages compared to DFT approaches, such as reduced chip cost and at-speed execution. However, the automatic generation of functional test patterns is an open issue. In this work we present an approach for the automatic generation of functional microprocessor test sequences for small-delay faults based on Bounded Model Checking. We utilize an ATPG framework for small-delay faults in sequential, non-scan circuits and propose a method for constraining the input space for generating functional test sequences (i.e., test programs). We verify our approach by evaluating the miniMIPS microprocessor. In our experiments we were able to reach over 97% fault efficiency. To the best of our knowledge, this is the first fully automated approach to functional microprocessor test for small-delay faults.

09:45 | 5.7.4 | **MULTI-SITE TEST OPTIMIZATION FOR MULTI-VDD SOCS USING SPACE- AND TIME-DIVISION MULTIPLEXING**
**Speakers:**
Fotios Vartziotis\(^1\), Chrysovalantis Kavousianos\(^2\), Krishnendu Chakrabarty\(^3\), Rubin Parekhji\(^4\) and Arvind Jain\(^4\)
\(^1\)University of Ioannina, GR; \(^2\)Department of Computer Science and Engineering, University of Ioannina, GR; \(^3\)Duke University, US; \(^4\)Texas Instruments, IN

**Abstract**
Even though system-on-chip (SoC) testing at multiple voltage settings significantly increases test complexity, the use of a different shift frequency at each voltage setting offers parallelism that can be exploited by time-division multiplexing (TDM) to reduce test length. We show that TDM is especially effective for small-bitwidth and highly loaded test-access mechanisms (TAMs), thereby tangibly increasing the effectiveness of multi-site testing. However, TDM suffers from some inherent limitations that do not allow the fullest possible exploitation of TAM bandwidth. To overcome these limitations, we propose space-division multiplexing (SDM), which complements TDM and offers higher multi-site test efficiency. We implement space- and time-division multiplexing (STDM) using a new, scalable test-time minimization method based on a combination of bin packing and simulated annealing. Results for industrial SoCs highlight the advantages of the proposed optimization method.

10:00 | IP2-20, 50 | **AN EFFICIENT TEMPERATURE-GRADIENT BASED BURN-IN TECHNIQUE FOR 3D STACKED ICS**
**Speakers:**
Nima Aghaee, Zebo Peng and Petru Eles, Linköping University, SE

**Abstract**
Burn-in is usually carried out with high temperature and elevated voltage. Since some early-life failures depend not only on high temperature but also on temperature gradients, simply raising the temperature of an IC is not sufficient to detect them. This is especially true for 3D stacked ICs, since they usually have very large temperature gradients. The efficient detection of these early-life failures requires that specific temperature gradients are enforced as part of the burn-in process. This paper presents an efficient method to do so by applying high-power stimuli to the cores of the IC under burn-in through the test access mechanism. Therefore, no external heating equipment is required. The scheduling of the heating and cooling intervals to achieve the required temperature gradients is based on thermal simulations and is guided by functions derived from a set of thermal equations. Experimental results demonstrate the efficiency of the proposed method.
5.8 Hot Topic: System Integration - The Bridge between More than Moore and More Moore

Date: Wednesday 26 March 2014

Time: 08:30 - 10:00

Location / Room: Exhibition Theatre

Organisers:
Manfred Dietrich, Fraunhofer IIS/EAS Dresden, DE
Kai Hahn, University Siegen, DE

Chair:
Manfred Dietrich, Fraunhofer IIS/EAS Dresden, DE

Co-Chair:
Kai Hahn, University Siegen, DE

System Integration using 3D technology is a very promising way to cope with current and future requirements for electronic systems. Since the pure shrinking of devices (known as "More Moore") will come to an end due to physical and economic restrictions, the integration of systems (e.g. by stacking dies, or by adding sensor functions) shows a way to maintain the growth in complexity as well as in diversity which is necessary for future applications. This so-called "More than Moore" approach complements conventional SoC product engineering. This session gives insights into System Integration design challenges from different perspectives, ranging from design technology and MEMS to 3D integration.

DESIGN TECHNOLOGY FOR 3-D INTEGRATED SYSTEMS

Andy Heinig, Fraunhofer IIS/EAS, DE

Abstract
More-than-Moore (MM) technologies enable the dense integration of different circuits in a package. The short length and small spacing of wires enable high-speed and highly parallel interconnects between system parts such as processor and memory. In the first part of the presentation we give an overview of the current status of MM, from system-in-package up to 3D stacking with through-silicon vias, either on interposers or stacked directly. The second part of the presentation is dedicated to the design of MM systems. One challenge is the tight integration of analog and digital dies, which requires stronger consideration of several electrical and multi-physical interactions (e.g. thermal management, power distribution and electromagnetic compatibility) than in 2D-SoC design. The second challenge is the wide design space opened by MM. It requires new methods that guide the designer to the best trade-off between system performance and production costs. Using the integration of a processor and WideIO memory on a silicon interposer, which increases the memory bandwidth in future high-end applications, we demonstrate new EDA methods for design space exploration, estimation of routing congestion and interposer routing.

SEMICONDUCTOR PACKAGING IS BACK TO EUROPE - ADVANCES IN SYSTEM INTEGRATION IN WAFER LEVEL PACKAGING

Steffen Kroehnert, NANIUM S.A. - Niederlassung Dresden, DE

Abstract
Market segments ranging from mobile communication and consumer to automotive see an increasing need to focus on system integration in less space instead of single components or functional groups. This drives advanced semiconductor packaging to diversify and become considerably more complex, while at the same time becoming an integrated functional part of the system. The demand for more and more diversified functionality in the same or even less space drives the development of "More-than-Moore" (MtM) solutions in the packaging world. The keyword is again "System-in-Package" (SiP). Chip-Package-Board Co-Design and Co-Development are essential keys to success. Besides some theory, the paper will show some real product examples where system integration in the package saved up to 4X space on the board for the same functionality with even more performance. While today the majority of SiP is still realized using laminated organic substrate interposers, the need to close the gap to System-on-Chip (SoC) performance is driving the single functional elements closer to each other. This can be realized by Fan-Out Wafer Level Packaging (FO-WLP) technologies, like eWLB (embedded Wafer Level Ball Grid Array), which overcomes Fan-In Wafer Level Packaging (FI-WLP) limitations, especially in terms of system integration, while keeping the advantages of scalability and cost-efficient batch processing. The paper will show the good progress made in developing eWLB as a technology platform, mainly using FO-WLP as an enabler for System-in-Package on Wafer Level (WLSiP).

MEMS AND 3D-IC PRODUCT ENGINEERING - TECHNOLOGY DESIGN FOR SYSTEM INTEGRATION

Kai Hahn, University Siegen, DE

Abstract
Taking into account the diversity of technologies from die manufacturing to packaging it becomes clear that for product engineering of integrated systems such as MEMS or stacked 3D circuits the constraints and inter-dependencies of design and manufacturing are of special interest. The configuration of these technologies is strongly application specific and design methods differ completely from the approach known from the development of conventional two dimensional ICs. The presentation will cover methods and tools for technology design in the area of MEMS as well as for 3D integration.
**3D-TSV-HUB: POTENTIALS AND CHALLENGES FOR VERTICAL INTERCONNECTS IN NETWORKS-ON-CHIPS**

**Speaker:** Andreas Herkersdorf, TU München, DE

**Abstract**
Sophisticated Networks-on-Chip (NoCs) will form the backbone for on-chip communication in future System-on-Chip designs. Already in conventional planar systems, the synthesis of application-specific NoCs is a complex task. When shifting to a stacked-die environment, further degrees of freedom are added and a large design space is created. Through-Silicon Vias (TSVs) are deployed for building vertical NoC links in a 3D system. However, TSVs are cost intensive in several respects.

**SENSORS AND POWER DRIVERS, BRIDGE BETWEEN SYSTEM ENVIRONMENT AND COMPUTING**

**Speaker:** Jochen Reisinger, Infineon Technologies Austria AG, AT

**Abstract**
There are many main drivers enabling the most significant innovations in system solutions which are based on, or only supported by, electronics. No doubt, the best known driver is the availability of deep submicron technology nodes for the implementation of computing functions (More Moore). But competitive system architectures and functional system partitionings (technology selections) strongly depend on highly efficient interfaces to the system environment. Those interfaces support many functions, for example: a) sensing of physical parameters (temperature, pressure, speed, power, ...), b) providing power and control signals for actuators (drivers for motors, pumps, ...), c) providing power for the computing system (including safety, power up/down), d) interfacing to human bodies and/or other system elements, and e) communication of system operative and control data (WiFi, Bluetooth, ...). Those interfaces call for highly efficient (in terms of space, power, performance, ..., and cost) 3D integration technologies and design methodologies. Infineon examples for sensors and drivers will be presented.

**FAST AND ACCURATE COMPUTATION USING STOCHASTIC CIRCUITS**

**Speakers:** Armin Alaghi and John P. Hayes, University of Michigan - Ann Arbor, US

**Abstract**
Stochastic computing (SC) is a low-cost design technique that has great promise in applications such as image processing. SC enables arithmetic operations to be performed on stochastic bit-streams using ultra-small and low-power circuitry. However, accurate computations tend to require long run-times due to the random fluctuations inherent in stochastic numbers (SNs). We present novel techniques for SN generation that lead to better accuracy/run-time trade-offs. First, we analyze a property called progressive precision (PP) which allows computational accuracy to grow systematically with run-time. Second, borrowing from Monte Carlo methods, we show that SC performance can be greatly improved by replacing the usual pseudo-random number sources by low-discrepancy (LD) sequences that are predictably progressive. Finally, we evaluate the use of LD stochastic numbers in SC, and show they can produce significantly faster and more accurate results than existing stochastic designs.
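
The accuracy/run-time trade-off can be made concrete with a small software model. The sketch below is an illustration under stated assumptions, not the authors' design: it multiplies two unipolar stochastic numbers with a bitwise AND, and it compares a pseudo-random source with low-discrepancy van der Corput sequences in bases 2 and 3 (a choice made here so that the two streams are not correlated). All function names are hypothetical.

```python
import random


def van_der_corput(index, base):
    """Return element `index` of the van der Corput sequence for `base`, in [0, 1)."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def stochastic_number(p, samples):
    """Encode probability p as a bit-stream: bit i is 1 when samples[i] < p."""
    return [int(s < p) for s in samples]


def sc_multiply(a, b, samples_a, samples_b):
    """Multiply two unipolar stochastic numbers with one AND gate per bit."""
    stream_a = stochastic_number(a, samples_a)
    stream_b = stochastic_number(b, samples_b)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / len(stream_a)


N = 1024          # stream length, i.e. run-time in clock cycles
a, b = 0.5, 0.75  # operands; the exact product is 0.375

rng = random.Random(0)
pseudo = sc_multiply(a, b,
                     [rng.random() for _ in range(N)],
                     [rng.random() for _ in range(N)])
low_disc = sc_multiply(a, b,
                       [van_der_corput(i, 2) for i in range(N)],
                       [van_der_corput(i, 3) for i in range(N)])

print("pseudo-random error:  ", abs(pseudo - a * b))
print("low-discrepancy error:", abs(low_disc - a * b))
```

Shortening `N` models a shorter run-time; with the low-discrepancy source the estimate stays close to the exact product for every prefix length, which is the progressive-precision behaviour the abstract describes.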

**DRAM-BASED COHERENT CACHES AND HOW TO TAKE ADVANTAGE OF THE COHERENCE PROTOCOL TO REDUCE THE REFRESH ENERGY**

**Speakers:** Zoran Jaksic and Ramon Canal, Universitat Politècnica de Catalunya, ES

**Abstract**
Recent technology trends have turned DRAMs into an interesting candidate to replace traditional SRAM-based on-chip memory structures (e.g. register files, cache memories). Nevertheless, a major obstacle to introducing these cells is that they lose their state (i.e. value) over time and have to be refreshed. This paper proposes the implementation of coherent caches with DRAM cells. Furthermore, we propose to use the coherence state to tune the refresh overhead. According to our analysis, an average of up to 57% of refresh energy can be saved. Also, compared to caches implemented in SRAM, total energy savings are on average up to 39%, depending on the refresh policy, with a performance loss below 8%.

**REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3)**

**Speakers:**
- Alan Bardizbanyan1, Magnus Själander2, David Whalley2 and Per Larsson-Edefors1
  1Chalmers University of Technology, SE; 2Florida State University, US

**Abstract**
Fast set-associative level-one data caches (L1-DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1-DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependency detection (ELD3) technique for in-order pipelines. This technique makes it possible to detect if a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are sequentially performed so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD3 technique enables about 49% of all load operations to access the L1-DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1-DC energy by 13%.
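The decision described in the abstract (sequential access when no nearby instruction consumes the loaded value, parallel access otherwise) can be modelled in a few lines. This is only an illustrative software model, not the hardware mechanism of the paper: the `Instr` record, the `access_modes` function and the one-instruction `window` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    op: str                       # e.g. "load", "add"
    dest: Optional[str] = None    # destination register, if any
    srcs: Tuple[str, ...] = ()    # source registers


def access_modes(program, window=1):
    """For each load, choose 'parallel' if one of the next `window`
    instructions reads the loaded register, else 'sequential'."""
    modes = {}
    for i, ins in enumerate(program):
        if ins.op != "load":
            continue
        followers = program[i + 1 : i + 1 + window]
        dependent = any(ins.dest in nxt.srcs for nxt in followers)
        modes[i] = "parallel" if dependent else "sequential"
    return modes


program = [
    Instr("load", "r1", ("r10",)),
    Instr("add", "r2", ("r1", "r3")),   # uses r1 right after the load
    Instr("load", "r4", ("r11",)),
    Instr("add", "r5", ("r2", "r3")),   # does not use r4
    Instr("sub", "r6", ("r4", "r5")),
]

print(access_modes(program))
```

In this model the first load is followed immediately by a consumer, so it keeps the fast parallel tag/data access, while the second load can afford the energy-saving sequential access.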

**INTERACTIVE PRESENTATIONS**

**Time:** 09:15
**Label:** 5.8.4
**Presentation Title:** 3D-TSV-HUB: POTENTIALS AND CHALLENGES FOR VERTICAL INTERCONNECTS IN NETWORKS-ON-CHIPS

**Authors:** Andreas Herkersdorf, TU München, DE

**Time:** 09:30
**Label:** 5.8.5
**Presentation Title:** SENSORS AND POWER DRIVERS, BRIDGE BETWEEN SYSTEM ENVIRONMENT AND COMPUTING

**Authors:** Jochen Reisinger, Infineon Technologies Austria AG, AT

**Time:** 09:45
**Label:** 5.8.6
**Presentation Title:** CONCLUSIONS AND DISCUSSION

**Authors:**
- Alan Bardizbanyan
- Magnus Själander
- David Whalley
- Per Larsson-Edefors

**Time:** 10:00
End of session
Coffee Break in Exhibition Area

**INTERACTIVE PRESENTATIONS**

**Date:** Wednesday 26 March 2014
**Time:** 10:00 - 10:30
**Location / Room:** Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated to the IP paper is on display throughout the morning. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award 'Best IP of the Day' is given.

DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP

Speakers:
Preethi Parayil Mana Damodaran 1, Stefan Wallentowitz 2 and Andreas Herkersdorf 3
1LIS, Technical University of Munich, DE; 2Technische Universität München, Institute for Integrated Systems, DE; 3TU München, DE

Abstract
In a shared-memory based tiled many-core system-on-chip architecture, memory accesses present a huge performance bottleneck in terms of access latency as well as bandwidth requirements. The best practice approach to address this issue is to provide a multi-level cache hierarchy and a suitable cache-coherency mechanism. This paper presents a method to increase the memory access performance in distributed-directory-coherency-protocol based tiled many-core systems. The proposed method introduces an alternate design for the system-wide shared last-level caches (LLC) placed between the memory and the node private caches (NPC). The proposed system-wide shared LLC layer is distributed over the entire network and it interacts with the home directories of specific cache lines. Results from simulating SPEC2000 benchmark applications executed on a SystemC model of the proposed design show a minimum performance improvement of 20-25% when compared to a similar system without the shared cache layer, at the expense of an additional 3% of the total cache memory space (NPC + LLC memory). In addition, the proposed design shows a minimum 7-15% and an average 14-15% improvement in performance in comparison to a centralized system-wide shared LLC of equivalent size and a dynamically mapped distributed LLC of equivalent size, respectively.

DESIGN OF SAFETY CRITICAL SYSTEMS BY REFINEMENT

Speakers:
Alexei Iliasov, Arseniy Alekseyev, Danil Sokolov and Andrey Mokhov, Newcastle University, GB

Abstract
An increasingly large number of safety-critical embedded systems rely on software to prevent and mitigate hazards occurring due to design errors and unexpected interactions of the system with its users and the environment. Implementing a safety instrumented function in the way advocated by the traditional software methods requires an intimate understanding and thorough validation of a complex ecosystem of programming languages, compilers, operating systems and hardware. We propose to consider an alternative where a system designer, for each individual problem, creates in a correct-by-construction manner both the design of a system and its compilation and execution infrastructure. This permits an uninterrupted chain of a formal correctness argument spanning from formalisation requirements all the way to the gate-level characterisation of an execution environment. The past decade of advances in verification technology turned the mechanical verification of large-scale models into a reality while the pressure of certification makes the cost of a formally verified development routine increasingly acceptable. The proposed technique fits the Grand Challenge for Computer Research posed by Hoare in 2003, namely, development of a Verifying Compiler which not only mechanically translates a given program from one language to another but also verifies its correctness according to a formal specification. This allows meeting the most stringent software certification requirements such as SIL 4. We illustrate the idea with a small case-study developed using the Event-B modelling notation and tools.

ENERGY OPTIMIZATION IN ANDROID APPLICATIONS THROUGH WAKELOCK PLACEMENT

Speakers:
Faisal Alam 1, Preeti Ranjan Panda 1, Nikhil Tripathi 2, Namita Sharma 3 and Sanjiv Narayan 2
1IIT Delhi, IN; 2Calypto Design Systems, IN; 3Indian Institute of Technology Delhi, IN

Abstract
Energy efficiency is a critical factor in mobile systems, and a significant body of recent research has focused on reducing the energy dissipation in mobile hardware and applications. The Android OS Power Manager provides programming interface routines called wakelocks for controlling the activation state of devices on a mobile system. An appropriate placement of wakelock acquire and release functions in the application can make a significant difference to the energy consumption. In this paper, we propose a data flow analysis based strategy for determining the placement of wakelock statements corresponding to the uses of devices in an application. Our experimental evaluation on a set of Android applications shows significant (up to 32%) energy savings with the proposed optimization strategy.

A WEAR-LEVELING-AWARE DYNAMIC STACK FOR PCM MEMORY IN EMBEDDED SYSTEMS

Speakers:
Qingan Li 1, Yanxiang He 2, Yong Chen 2, Chun Xue 3, Nan Jiang 2 and Chao Xu 2
1Wuhan University & City University of Hong Kong, CN; 2Wuhan University, CN; 3City University of Hong Kong, CN

Abstract
Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics such as extremely low leakage power, high storage density and good scalability. However, PCM’s low endurance constrains its practical applications. In this paper, we propose a Wear Leveling aware dynamic stack to extend PCM’s lifetime when it is adopted in embedded systems as main memory. Through a dynamic stack, the memory space is circularly allocated to stack objects, and thus an even usage of PCM memory is achieved. The experimental results show that the proposed method can significantly reduce the write variation on PCM cells and enhance the lifetime of PCM memory.
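The effect of circular allocation on write distribution can be seen in a toy simulation. The sketch below is a deliberately simplified model and not the authors' implementation: it assumes every call writes each word of its frame exactly once, that frames never nest, and that the circular variant simply advances the allocation base by one frame after each call. The function name and parameters are hypothetical.

```python
def simulate(calls, frame_words, region_words, circular):
    """Count writes per memory word for a sequence of non-nested calls.

    circular=False: every frame starts at word 0 (conventional stack).
    circular=True : each new frame starts where the previous one ended,
                    wrapping around the region.
    """
    writes = [0] * region_words
    base = 0
    for _ in range(calls):
        for offset in range(frame_words):
            writes[(base + offset) % region_words] += 1
        if circular:
            base = (base + frame_words) % region_words
    return writes


conventional = simulate(calls=64, frame_words=4, region_words=16, circular=False)
dynamic = simulate(calls=64, frame_words=4, region_words=16, circular=True)

print("conventional max/min writes:", max(conventional), min(conventional))
print("circular     max/min writes:", max(dynamic), min(dynamic))
```

The same total number of writes is spread over the whole region in the circular case, so the most heavily written cell sees far fewer writes, which is the property that matters for PCM endurance.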

LIFETIME HOLES AWARE REGISTER ALLOCATION FOR CLUSTERED VLIW PROCESSORS

Speakers:
Xuemeng Zhang 1, Hui Wu 2, Haiyan Sun 1 and Jingling Xue 3
1National University of Defense Technology, CN; 2The University of New South Wales, AU; 3UNSW, AU

Abstract
This paper presents an on-the-fly register allocator which dynamically detects and utilises lifetime holes for clustered VLIW processors. A lifetime hole is an interval in which a variable does not contain a valid value. A register holding a lifetime hole can be allocated to another variable whose live range fits in the lifetime hole, leading to more efficient utilisation of registers. We propose efficient techniques for dynamically utilising lifetime holes and incorporate these techniques into our on-the-fly register allocator. We have simulated our register allocator and a linear scan register allocator without considering lifetime holes by using the MediaBench II benchmark suite. Our simulation results show that our register allocator reduces the number of spills by 12.5%, 11.7% and 12.7% for three different processor models, respectively.
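The notion of a lifetime hole can be made concrete with interval arithmetic. The sketch below assumes a variable's liveness is given as a list of half-open segments `(start, end)` over instruction indices; the helper names `holes` and `fits_in_hole` are hypothetical and the check is only the basic fit test, not the allocator described in the paper.

```python
def holes(segments):
    """Return the gaps between consecutive live segments.

    segments: list of (start, end) pairs, end exclusive.
    """
    ordered = sorted(segments)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end
    ]


def fits_in_hole(live_range, segments):
    """True if live_range lies entirely inside one lifetime hole."""
    start, end = live_range
    return any(h_start <= start and end <= h_end
               for h_start, h_end in holes(segments))


# Variable v is live in [0, 4) and again in [10, 14); it is dead in between.
v_segments = [(0, 4), (10, 14)]

print(holes(v_segments))
print(fits_in_hole((5, 9), v_segments))   # w fits in v's hole
print(fits_in_hole((3, 8), v_segments))   # w overlaps v's first segment
```

A variable whose whole live range passes the fit test can share the register of `v` without a spill, which is the saving the abstract refers to.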

A LOW-POWER, HIGH-PERFORMANCE APPROXIMATE MULTIPLIER WITH CONFIGURABLE PARTIAL ERROR RECOVERY

Speakers:
Cong Liu 1, Jie Han 1 and Fabrizio Lombardi 2
1University of Alberta, CA; 2Northeastern University, US

Abstract
Approximate circuits have been considered for error-tolerant applications that can tolerate some loss of accuracy with improved performance and energy efficiency. Multipliers are key arithmetic circuits in many such applications such as digital signal processing (DSP). In this paper, a novel approximate multiplier with a lower power consumption and a shorter critical path than traditional multipliers is proposed for high-performance DSP applications. This multiplier leverages a newly-designed approximate adder that limits its carry propagation to the nearest neighbors for fast partial product accumulation. Different levels of accuracy can be achieved through a configurable error recovery by using different numbers of most significant bits (MSBs) for error reduction. The approximate multiplier has a low mean error distance, i.e., most of the errors are not significant in magnitude. Compared to the Wallace multipliers, a 16-bit approximate multiplier implemented in a 28nm CMOS process shows a reduction in delay and power of 20% and up to 69%, respectively. It is shown that by utilizing an appropriate error recovery, the proposed approximate multiplier achieves similar processing accuracy as traditional exact multipliers but with significant improvements in power and performance.

Floating-point arithmetic is widely used in scientific computing. While many programmers are subliminally aware that floating-point numbers only approximate the real numbers, few are cognizant of the dangers this entails for programming. Such dangers range from tolerable rounding errors in sequential programs, to unexpected, catastrophic failures in parallel programs.

This paper presents a novel Dynamic Reliability Management System (DyReMS) for on-chip systems that performs reliability-driven resource allocation and mapping. It accounts for both the tasks’ resilience properties and heterogeneous error recovery features of different cores. DyReMS also chooses a reliable task version (out of multiple reliability-aware transformed options) depending upon the reliability level of the allocated core. In case of error detection, rollbacks are performed. Our system provides up to 70%–80% improved task reliability compared to a timing reliability-optimizing core assignment, i.e., minimizing the probability of deadline misses (with EDF scheduling).

A LOW POWER AND ROBUST CARBON NANOTUBE 6T SRAM DESIGN WITH METALLIC TOLERANCE

Speakers:
Luo Lin1, Jimson Mathew1, Rishad Shafik2, Dhiraj Pradhan1 and Zhe Li3
1University of Bristol, GB; 2University of Southampton, GB

Abstract
Carbon nanotube field-effect transistor (CNTFET) is envisioned as a promising device to overcome the limitations of traditional CMOS based MOSFETs due to its favourable physical properties. This paper presents a novel six-transistor (6T) static random access memory (SRAM) bitcell design using CNTFETs. Extensive validations and comparative analyses are carried out with the proposed SRAM design using SPICE based simulations. We show that the proposed CNTFET based SRAM has a significantly better static noise margin (SNM) and write ability margin (WAM) compared to a CNTFET-based standard 6T bitcell, equivalent to isolated read-port 3T bitcell based on CNTFETs, while consuming less dynamic power. We further demonstrate that it exhibits higher robustness under process, voltage and temperature (PVT) variations when compared with the traditional CMOS SRAM cell designs. Furthermore, a metallic CNT removal technique is used.

PB-17  WIDTH MINIMIZATION IN THE SINGLE-ELECTRON TRANSISTOR ARRAY SYNTHESIS

Speakers:
Chian-Wei Liu1, Chang-En Chiang1, Ching-Yi Huang1, Chun-Yao Wang1, Yung-Chih Chen2, Suman Datta3 and Vijaykrishnan Narayanan4
1Dept. of Computer Science, National Tsing Hua University, TW; 2Dept. of Computer Science and Engineering, Yuan Ze University, TW; 3Department of Electrical Engineering, The Pennsylvania State University, US; 4Department of Computer Science and Engineering, The Pennsylvania State University, US

Abstract
Power consumption has become one of the primary challenges in sustaining Moore’s law. The Single-Electron Transistor (SET) operating at room temperature has been demonstrated as a promising device for extending Moore’s law due to its ultra-low power consumption during operation. Prior work has proposed an automated mapping approach for SET architecture which focuses on minimizing the number of hexagons in an SET array. However, the area of an SET array is more closely related to its width. Consequently, in this work, we propose an approach for width minimization of SET arrays. The experimental results show that the proposed approach saves 26% of width compared with the state-of-the-art for a set of MCNC and IWLS 2005 benchmarks while spending similar CPU time.

PB-18  AREA MINIMIZATION SYNTHESIS FOR RECONFIGURABLE SINGLE-ELECTRON TRANSISTOR ARRAYS WITH FABRICATION CONSTRAINTS

Speakers:
Yi-Hang Chen, Jian-Yu Chen and Juinn-Dar Huang, Department of Electronics Engineering, National Chiao Tung University, TW

Abstract
As fabrication processes exploit even deeper submicron technology, power dissipation has become a crucial issue for most electronic circuit and system designs nowadays. In particular, leakage power is becoming a dominant source of power consumption. Recently, the reconfigurable single-electron transistor (SET) array has been proposed as an emerging circuit design style for continuing Moore’s Law due to its ultra-low power consumption. Several automated synthesis approaches have been developed for the reconfigurable SET array in the past few years. Nevertheless, all of these existing methods consider fabrication constraints, which are mandatory, merely in late synthesis stages. In this paper, we propose a synthesis algorithm, featuring both variable reordering and product term reordering, for area minimization. In addition, our algorithm takes those mandatory fabrication constraints into account in early stages for better outcomes. Experimental results show that our new method can achieve an area reduction of up to 24% as compared to current state-of-the-art techniques.

PB-19  SOFTWARE-BASED PAULI TRACKING IN FAULT-TOLERANT QUANTUM CIRCUITS

Speakers:
Alexandru Paler1, Simon Devitt2, Kae Nemoto2 and Ilia Polian1
1University of Passau, DE; 2National Institute of Informatics, JP

Abstract
The realisation of large-scale quantum computing is no longer simply a hardware question. The rapid development of quantum technology has resulted in dozens of control and programming problems that should be directed towards the classical computer science and engineering community. One such problem is known as Pauli tracking. Methods for implementing quantum algorithms that are compatible with crucial error correction technology utilise extensive quantum teleportation protocols. These protocols result in circuit operations that do not need to be corrected in the quantum hardware itself, but are instead tracked through the circuit, with the output results reinterpreted. This tracking is routinely ignored in quantum information as it is assumed that tracking algorithms will eventually be developed. In this work we help fill this gap and present an algorithm for tracking byproduct operators through a quantum computation.
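As background for what tracking byproduct operators means, the sketch below shows the standard textbook way of propagating a Pauli frame through Clifford gates in software. It is not the algorithm presented in the paper. It assumes each qubit's pending byproduct is stored as a pair of bits `(x, z)`, ignores global phases, and supports only H, S and CNOT; the function names and gate tuples are hypothetical.

```python
def track(frame, circuit):
    """Propagate a Pauli frame through a Clifford circuit.

    frame  : dict mapping qubit -> (x, z) bits of the pending Pauli.
    circuit: list of ("H", q), ("S", q) or ("CNOT", control, target).
    Returns the updated frame; global phases are ignored.
    """
    frame = dict(frame)
    for gate in circuit:
        name = gate[0]
        if name == "H":                      # H swaps X and Z
            q = gate[1]
            x, z = frame.get(q, (0, 0))
            frame[q] = (z, x)
        elif name == "S":                    # S maps X to Y (X and Z), keeps Z
            q = gate[1]
            x, z = frame.get(q, (0, 0))
            frame[q] = (x, z ^ x)
        elif name == "CNOT":                 # X copies control->target, Z copies target->control
            c, t = gate[1], gate[2]
            xc, zc = frame.get(c, (0, 0))
            xt, zt = frame.get(t, (0, 0))
            frame[c] = (xc, zc ^ zt)
            frame[t] = (xt ^ xc, zt)
        else:
            raise ValueError("unsupported gate: %r" % (name,))
    return frame


def reinterpret(outcome, frame, qubit):
    """Flip a Z-basis measurement outcome if an X byproduct is pending."""
    return outcome ^ frame.get(qubit, (0, 0))[0]


start = {0: (1, 0)}                          # an X byproduct on qubit 0
final = track(start, [("CNOT", 0, 1), ("H", 0)])
print(final)
print(reinterpret(0, final, 1))
```

No correction gate is ever applied to the simulated hardware; the classical frame alone determines how each measured bit is read.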

PB-20  AN EFFICIENT TEMPERATURE-GRADIENT BASED BURN-IN TECHNIQUE FOR 3D STACKED ICs

Speakers:
Nima Aghaee, Zebo Peng and Petru Eles, Linköping University, SE

Abstract
Burn-in is usually carried out with high temperature and elevated voltage. Since some of the early-life failures depend not only on high temperature but also on temperature gradients, simply raising the temperature of an IC is not sufficient to detect them. This is especially true for 3D stacked ICs, since they usually have very large temperature gradients. The efficient detection of these early-life failures requires that specific temperature gradients are enforced as a part of the burn-in process. This paper presents an efficient method to do so by applying high power stimuli to the cores of the IC under burn-in through the test access mechanism. Therefore, no external heating equipment is required. The scheduling of the heating and cooling intervals to achieve the required temperature gradients is based on thermal simulations and is guided by functions derived from a set of thermal equations. Experimental results demonstrate the efficiency of the proposed method.

PB-21  TEST AND NON-TEST CUBES FOR DIAGNOSTIC TEST GENERATION BASED ON MERGING OF TEST CUBES

Speaker:
Irith Pomeranz, Purdue University, US

Abstract
Test generation by merging of test cubes supports test compaction and test data compression. This paper describes a new approach to the use of test cube merging for the generation of compact diagnostic test sets. For this the paper uses the new concept of non-test cubes. While a test cube for a fault f0 detects the fault, a non-test cube for a fault f1 prevents the fault from being detected. Merging a test cube for a fault f0 and a non-test cube for a fault f1 produces a diagnostic test cube that distinguishes the two faults. The paper describes a procedure for diagnostic test generation based on merging of test and non-test cubes. Experimental results demonstrate that compact diagnostic test sets are obtained.
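The merging step itself is simple to state in code. The sketch below assumes cubes are strings over `0`, `1` and `x` (unspecified), and that two cubes can be merged when no position carries conflicting specified values. The cube values are made up for illustration, and how a non-test cube is derived for a fault is not shown, since the abstract does not describe it.

```python
def merge(cube_a, cube_b):
    """Merge two cubes over {'0', '1', 'x'}; return None if they conflict."""
    if len(cube_a) != len(cube_b):
        raise ValueError("cubes must have the same length")
    merged = []
    for a, b in zip(cube_a, cube_b):
        if a == "x":
            merged.append(b)
        elif b == "x" or a == b:
            merged.append(a)
        else:
            return None          # one cube needs 0 where the other needs 1
    return "".join(merged)


test_cube_f0 = "1x0x"       # hypothetical cube that detects fault f0
non_test_cube_f1 = "x10x"   # hypothetical cube that prevents detection of f1

print(merge(test_cube_f0, non_test_cube_f1))
print(merge(test_cube_f0, "0xxx"))
```

When the merge succeeds, any test obtained by filling the remaining `x` positions detects f0 but not f1, so it distinguishes the two faults.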

PB-22  NEW IMPLEMENTATIONS OF PREDICTIVE ALTERNATE ANALOG/RF TEST WITH AUGMENTED MODEL REDUNDANCY

Speakers:
Haithem Ayari, Florence Azais, Serge Bernard, Mariane Comte, Vincent Kerzerho and Michel Renovell, LIRMM, CNRS/Univ. Montpellier 2, FR

Abstract
This paper discusses new implementations of the predictive alternate test strategy that exploit model redundancy in order to improve test confidence. The key idea is to build during the training phase, not only one regression model for each specification as in the classical implementation, but several regression models. This redundancy is then used during the testing phase to identify suspect predictions and remove the corresponding devices from the alternate test flow.

Abstract
This demonstration presents AIDA, an analog integrated circuit (IC) design automation environment. AIDA includes two main modules, namely, AIDA-C and AIDA-L. AIDA-C is a circuit-level synthesis tool which uses state-of-the-art multi-objective multi-constrained optimization kernels, based on evolutionary computation techniques, where the robustness of the solutions is attained by considering a layout-aware approach and, also, extreme process variations by means of PVT corner analysis. The circuit's performance is measured using Spectre®, ELDO® or HSPICE® electrical simulators as evaluation engines. AIDA-L considers the device sizes and the best floorplan, obtained with AIDA-C, and generates the complete layout by placing and routing the devices, while fulfilling the technology design rules by using built-in design-rule check (DRC) and layout-versus-schematic (LVS) procedures. In order to demonstrate AIDA design environment several analog circuit structures, e.g., OTAs, LNAs, LC-Oscillators, etc., will be synthesized in a 130nm CMOS technology. AIDA-C is demonstrated for circuit-level sizing and optimization by generating a family of Pareto Optimal solutions based on user performance and functional specifications. AIDA-L is demonstrated by generating the layout of a user selected solution from AIDA-C, taking into account electrical currents information to mitigate electromigration and IR-drop effects, and also wiring symmetry for multiphase multi-terminal signal nets of analog ICs.

LARA: THE LARA COMPILER SUITE

Authors: João Bispo, Pedro Pinto, Ricardo Nobre, Tiago Carvalho and Joao Cardoso, Universidade do Porto, PT

Abstract
LARA is an aspect-oriented programming (AOP) language which allows the description of sophisticated code instrumentation schemes, advanced mapping strategies including conditional decisions, based on hardware/software resources, and of sophisticated sequences of compiler transformations. Furthermore, LARA provides mechanisms for controlling all elements of a toolchain in a consistent and systematic way, using a unified programming interface. We present three compiler tools developed around the LARA technology, MATISSE, MANET and ReflectC. MATISSE is a compiler which 1) allows analyses and transformations on MATLAB code and 2) generates C code from the MATLAB code. MATISSE can be fully controlled through LARA aspects, which can define the type and shape of MATLAB variables, specify code insertion/removal actions, and define specialization directives and other additional information. MATISSE can output transformed MATLAB code and specialized C code. The knowledge provided by the LARA aspects allows MATISSE to generate C tailored to specific targets (e.g., use statically declared arrays to be compliant with the high-level synthesis tools such as Catapult C). MANET is a source-to-source compiler for ANSI C based on Cetus, and is controlled using LARA aspects. MANET manages to leverage the expressiveness and modularity of LARA to query and manipulate the Cetus AST, providing an easy compilation flow with main goal of code instrumentation and code transformations. LARA aspects allow for a simple selection of program elements in the code which can be analyzed or transformed, by either consulting their attributes or applying actions. Thus, MANET can be used to provide information reports based on compiler analyses, to implement sophisticated code instrumentation strategies, or to perform code optimizations and transformations. ReflectC is a C compiler based on Cosy’s compiler framework. 
Cosy’s configurability and retraceability make ReflectC particularly effective for exploration of compiler transformations and optimizations on possible architecture variations, and it is being used for hardware/software co-design and design space exploration (DSE). We will present demos of the tools and the use of LARA aspects and strategies to guide our suite of compilation tools, providing: 1) C code generation from MATLAB code, according to information provided by LARA aspects; 2) instrumentation of C code to be used for collecting specific compile and runtime information (e.g., execution time, range of values for specific variables, custom profiling); 3) user-controlled compiler optimizations targeting several architectures and DSE of sequences of compiler optimizations with performance improvements in mind. In addition to presenting examples for each of the tools of the LARA compilation suite, we show an execution of the complete toolchain, controlled by LARA aspects.

MICROTESK: RECONFIGURABLE OPEN-SOURCE FRAMEWORK FOR TEST PROGRAM GENERATION

Authors: Andrei Tatarnikov, Alexander Kamkin and Artem Kotsynyak, Institute for System Programming of the Russian Academy of Sciences (ISP RAS), RU

Abstract
Test program generation plays a major role in functional verification of microprocessors. Due to tremendous growth in complexity of modern designs and rigid constraints on time to market, it becomes an increasingly difficult task. In spite of powerful test program generation tools available in the market, development of functional tests is still known to be the bottleneck of the microprocessor design cycle. The common problem is that it takes a significant effort to reconfigure a test program generation environment for a new microprocessor design. The model-based approach applied in the state-of-the-art tools, like Genesys-Pro (IBM Research), still does not provide enough flexibility since creating a microprocessor model is difficult and requires special knowledge and skills. MicroTESK, the open-source test program generation framework being developed at ISP RAS, offers an approach to ease customization by using light-weight formal specifications to describe the target microprocessor architecture. The approach helps reduce the effort needed to create a microprocessor model and, consequently, minimize the time required to create functional tests. In addition to gaining flexibility, the use of formal specifications also allows automated extraction of knowledge about test situations that occur in a microprocessor (coverage model), thus facilitating the creation of directed tests and improving test coverage. At present, a demo prototype of MicroTESK has been implemented. It uses the Sim-nML architecture description language to specify the target microprocessor architecture and provides a convenient Ruby-based language for creating test templates that serve as an abstract description of test programs to be generated. The current version of the framework focuses primarily on RISC microprocessors including ARM, MIPS and SPARC. Supported test generation methods include random, combinatorial, template-based and model-based generation. The flexible architecture of the framework allows adding support for new test generation methods.

LEVERAGING DYNAMIC RECONFIGURATION TO INCREASE FAULT-TOLERANCE IN FPGA-BASED SATELLITE SYSTEMS

Authors: Sebastian Korf1, Dario Cozzi1, Dirk Jungewelter1, Jens Hagemeier1, Mario Porrmann1 and Jorgen Ilstad2

Abstract
This demonstrator shows how modern SoCs for satellite payload processing can be extended with high-speed interfaces and computing power utilizing commercial dynamically reconfigurable FPGAs. The use of these FPGAs in a space environment will lead to faults due to radiation. Therefore, special methods have been developed to increase the system reliability. We will demonstrate an environment for automatic fault detection and correction in relevant applications like image and video processing.

RTL++: DESIGN ENVIRONMENT: WALK BEFORE YOU RUN.

Authors: Soamayeh Sadeghi-Kohan, Behnaz Pourmohseni, Amir Reza Nezamai, Hanieh Hashemi, Hamed Najafi Haghi and Zainalabedin Navabi, University of Tehran, IR

Abstract
To enable development of high-level designs with hardware correspondence, synthesizability must be satisfied in a top-down manner. Thus, in this work, instead of using TLM-2.0, which is not established for synthesis, we start with a level above the RT level, "RTL++". RTL++ basically uses TLM-1.0 channels and includes abstract communication and handshaking that are mainly hidden from the designer. We develop a package of SystemC channels with hardware correspondence (synthesizable HDL) for communication between various cores (with simple interfaces) and standard buses. In addition, it is being used for hardware/software co-design and design space exploration (DSE).

6.1 SPECIAL DAY Hot Topic: The fight against Dark Silicon

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Saal 1
Organiser: Jörg Henkel, Karlsruhe Institute of Technology, DE
Chair: Jörg Henkel, Karlsruhe Institute of Technology, DE
Co-Chair: Jürgen Teich, University of Erlangen-Nuremberg, DE

Dark Silicon is predicted to dominate the chip footage of upcoming many-core systems within a decade, since Dennard Scaling fails mainly due to the voltage-scaling problem that results in higher power densities. This would render upcoming technology nodes inefficient, since a majority of cores would lie fallow. Significant research efforts have started within the last couple of years to investigate and mitigate Dark Silicon effects to ensure an effective use of available chip footage. This special session gives a snapshot of current research activities on this grand challenge. In particular, the three talks present the newest trends and developments, starting with the problem of Dennard Scaling and how it mandates new design constraints, followed by the problem of power delivery and cooling, and concluding with the newest directions in efficient resource management for many-core systems.
### A LANDSCAPE OF THE NEW DARK SILICON DESIGN REGIME

**Speaker:** Michael Taylor, University of California, San Diego, US  
**Abstract**

Due to the breakdown of Dennard scaling, the percentage of a silicon chip that can switch at full frequency is dropping exponentially with each process generation. This utilization wall forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark silicon, i.e., significantly underclocked or idle for large periods of time. As exponentially larger fractions of a chip’s transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift is driving a new class of architectural techniques that “spend” area to “buy” energy efficiency. All of these techniques seek to introduce new forms of heterogeneity into the computational stack. This work examines four key approaches—the four horsemen—that have emerged as top contenders for thriving in the dark silicon age. Each class carries with its virtues deep-seated restrictions that require a careful understanding of the underlying tradeoffs and benefits. Further, we present a set of dark silicon design principles, and examine how one of the darkest computing architectures of all, the human brain, trades off energy and area in ways that provide potential insights into future directions for computer architecture.
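
The utilization wall described in this abstract can be illustrated with a first-order calculation. The sketch below assumes a commonly used post-Dennard model that the abstract itself does not spell out: with a linear scaling factor `s` per process generation, a fixed supply voltage and a fixed power budget, full-chip power at full frequency grows by roughly `s**2`, so the fraction of the chip that can switch at full frequency shrinks by `1 / s**2`. The function name and the default `s = 1.4` are illustrative assumptions.

```python
from typing import List


def utilization_by_generation(generations: int, s: float = 1.4,
                              initial: float = 1.0) -> List[float]:
    """Fraction of the chip usable at full frequency, per generation.

    Assumed model (not taken from the abstract): under a fixed power
    budget the usable fraction shrinks by 1 / s**2 each generation.
    """
    fractions = []
    fraction = initial
    for _ in range(generations):
        fractions.append(fraction)
        fraction /= s ** 2
    return fractions


if __name__ == "__main__":
    for gen, frac in enumerate(utilization_by_generation(5)):
        print(f"generation {gen}: {frac:.1%} of the chip at full frequency")
```

Under these assumptions, `s = 1.4` gives `s**2 = 1.96`, so the usable fraction roughly halves with every generation.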

---

### INTEGRATED MICROFLUIDIC POWER GENERATION AND COOLING FOR BRIGHT SILICON MPSoCs

**Speakers:** Mohamed M. Sabry1, Arvind Sridhar1, Patrick Ruch1, David Atienza1 and Bruno Michel2  
1EPFL, CH; 2IBM Research, CH  
**Abstract**

The demand for computing power in our digital information age has produced, as undesirable collateral effects, a surge in power consumption and heat density for Multiprocessor Systems-on-Chip (MPSoCs). Accordingly, a significant portion of the energy consumed in state-of-the-art MPSoCs is dissipated in cooling. The remaining energy is used for computation, and causes the temperature to ramp up to operating conditions that already preclude operating all the cores at maximum performance levels, in order to prevent system overheating and failures. This situation is set to worsen as shipments of high-end (i.e., even denser) many-core servers are increasing at a 25% compound annual growth rate. With more power demands, MPSoCs will face a power delivery wall due to the reliability limitations of the underlying power delivery medium. Thus, state-of-the-art worst-case power and cooling delivery solutions are reaching their limits, and it will no longer be possible to power up all the available on-chip cores simultaneously (a situation known as dark silicon), drastically limiting the benefits of technology scaling. In this paper we propose a disruptive approach to overcome the prevailing worst-case power and cooling provisioning paradigm for MPSoCs. The proposed approach integrates the MPSoC with an on-chip microfluidic fuel cell network for joint cooling delivery and power supply (i.e., local power generation and delivery). By providing an alternative means of power delivery integrated with cooling, MPSoCs are expected to gain in I/O connectivity. Thanks to this disruptive technology, we can envision the removal of the current limits of power delivery and heat dissipation in server designs, subsequently avoiding dark silicon in future MPSoCs and enabling new perspectives in future energy-proportional computing architecture designs.

---

### EFFECTIVE RESOURCE MANAGEMENT TOWARDS EFFICIENT COMPUTING

**Speaker:** Per Stenström, Chalmers University of Technology, SE  
**Abstract**

Improving the performance of computers at historical rates, as dictated by Moore’s Law, is becoming increasingly challenging, especially because we are hitting the chip power-budget wall. But challenges usually direct us to focus on opportunities we have neglected in the past. I will focus on some of these overlooked opportunities in this talk. One such opportunity is to question what meaningful performance goals for individual applications are. I will present a resource management framework in which architectural resources are assigned to applications based on their performance requirements. I will also talk about some innovations that enable us to compute more power-efficiently by using memory resources more effectively, for example by exploiting value locality.

---

### ENERGY EFFICIENT COMPUTING WITH TUNNEL FETS

**Speakers:** Adrian Ionescu, Arnab Biswas, Nilay Dagtekin and Livio Lattanzio, Nanolab, Ecole Polytechnique Fédérale de Lausanne, CH  
**Abstract**

This paper will review the state of the art in energy-efficient computing using tunnel FETs from device to circuit level, including digital IC and memory applications. At device level we will discuss the major challenges remaining for tunnel FETs, with particular emphasis on: (i) selection of the most appropriate material systems and band-gap engineering of heterostructure Tunnel FETs to simultaneously offer best performance trade-offs: low Ioff, high Ion, high Ion/Ioff, subthreshold swing over more than 4 decades of current, and operation below 0.3V, (ii) specifically optimized device design (i.e. field aligned to the tunneling path, avoidance of super-linear onset, minimization of the Miller effect), (iii) understanding the role of defects for BTBT and providing appropriate control, (iv) understanding and controlling parameter sensitivity and variability, (v) accurate physics-based BTBT modeling of heterojunction tunnel FETs. We will detail the Electron-Hole Bilayer Tunnel FET (EHBTFET) as a switch candidate for sub-0.1V operation, exploiting tunneling through a bias-induced electron-hole bilayer, based on a calibrated quantum-mechanical simulator. We will make performance projections for EHBTFET complementary logic compared to CMOS logic of the same dimensions, using recent energy benchmarking. Finally, the design and use of Tunnel FETs as capacitorless DRAM cells, implemented as a double-gate (DG) fully-depleted Silicon-On-Insulator (FD-SOI) architecture, will be reported and its principle, embodiment and scalability discussed. We will present recent experimental results on Tunnel FET DRAM memory operation schemes and demonstrate its potential for ultra-low power memories. In conclusion, this paper demonstrates that Tunnel FETs stand as the most promising steep slope switch candidates to reduce the supply voltage below 0.3 V and offer significant power dissipation savings for digital computing.
6.3 Management of Micro/Macro Renewable Energy Storage Systems

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 1
Chair: Geoff Merrett, University of Southampton, UK
Co-Chair: Davide Brunelli, University of Trento, IT

Modern energy storage systems affect all areas of power electronics, from micro-power energy harvesting systems to mega-watt Smart Grid systems. Papers in this session address novel approaches for on-chip power electronics operating under variable Vdd, and optimisation approaches to efficient design of smart grid energy storage.

**Time** | **Label** | **Presentation Title** | **Authors**
--- | --- | --- | ---
11:00 | 6.3.1 | (Best Paper Award Candidate) ASYNCHRONOUS DESIGN FOR NEW ON-CHIP WIDE DYNAMIC RANGE POWER ELECTRONICS | Delong Shang1, Xuefu Zhang2, Fei Xia3 and Alex Yakovlev2; 1School of EEE, Newcastle University, GB; 2School of EEE, Newcastle University, GB; 3School of EEE, GB

**Abstract**
Asynchronous circuits will play an important role in future microelectronic systems, especially in energy harvesting and autonomous (EHA) systems, where such circuits will be able to offer robustness and deliver high efficiency across a wide range of power-energy conditions. The concept of Capacitor Bank Block (CBB) mechanisms was proposed to form the basis of electronics for powering asynchronous loads. These mechanisms will benefit EHA systems by enabling them to handle a wide range of tasks and energy supply conditions. This paper demonstrates how the CBB mechanisms can be controlled by asynchronous circuits, thereby forming a new type of power delivery unit (PDU) that will be able to deliver power to intelligent digital logic in future EHA systems. These PDUs are superior to traditional power converters largely because the latter can only regulate sufficiently high power and energy levels (regular and periodic), and their controllers themselves require stable power levels. This makes traditional converters unsuitable for the intermittent and sporadic conditions inherent to EHA systems. In this paper, a novel asynchronous control for the CBB is described. Experiments and analysis of the new PDUs, comprising CBBs and asynchronous control, are presented and discussed in detail.

11:30 | 6.3.2 | REAL-TIME OPTIMIZATION OF THE BATTERY BANKS LIFETIME IN HYBRID RESIDENTIAL ELECTRICAL SYSTEMS | Maurizio Rossi, Alessandro Toppano and Davide Brunelli, University of Trento, IT

**Abstract**
We present a real-time optimization framework to manage Hybrid Residential Electrical Systems (HRES) with multiple energy sources and heterogeneous storage units. HRES represent urban buildings where photovoltaic (PV) or other renewable sources are installed along with the traditional connection to the main grid. In this paper, heterogeneous storage units are used to realize energy buffers for the excess energy produced by the renewable sources when the buildings and the grid are not available to accept it. We considered two different battery banks as electric energy storage: lead-acid as the primary one, for its low price and low self-discharge rate, and lithium-ion chemistry as the secondary bank, because of its higher energy density and higher number of cycles. The proposed optimization strategy aims to maximize the lifetime of the battery banks and to reduce the energy bill by managing the variability of the PV source in price-varying scenarios. We used a Dynamic-Programming (DP) algorithm to schedule off-line the use of the lead-acid bank, minimizing the number of cycles and the Depth-of-Discharge (DOD) under given irradiance forecasts and user load profiles. Forecasts of the user loads and of the renewable energy intake are introduced in the optimization. Moreover, a real-time scheme is introduced to manage the lithium bank and to minimize the need to purchase energy from the grid when the actual demand does not fit the forecast. Our simulation results outperform the state of the art, in which the efficiency of both banks is not taken into consideration even when complex approaches based on DP are used.

**Speakers:**
Swaminathan Narayanasamy1, Sebastian Steinhorst1, Martin Lukasiewycz2, Matthias Kauer3 and Samarjit Chakraborty4
1TUM CREATE, SG; 2TUM CREATE Singapore, SG; 3TUM CREATE Ltd., SG; 4TU Munich, DE

**Abstract**

This paper presents an approach to optimal dimensioning of active cell balancing architectures, which are of increasing relevance in EES for EV or stationary applications such as smart grids. Active cell balancing equalizes the state of charge of cells within a battery pack via charge transfers, increasing the effective capacity and lifetime. While optimization approaches have been introduced into the design process of several aspects of EES, active cell balancing architectures have, until now, not been systematically optimized in terms of their components. Therefore, this paper analyzes existing architectures to develop design metrics for energy dissipation, installation volume, and balancing current. Based on these design metrics, a methodology to efficiently obtain Pareto-optimal configurations for a wide range of inductors and transistors at different balancing currents is developed. Our methodology is then applied to a case study, optimizing two state-of-the-art architectures using realistic balancing algorithms. The results give evidence of the applicability of systematic optimization in the domain of cell balancing, leading to higher energy efficiencies with minimized installation space.
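
To make the notion of Pareto-optimal configurations concrete, the following minimal sketch filters a set of candidate configurations over the three design metrics named in the abstract. It is a generic dominance filter, not the paper's methodology; the `Config` type, its field names, and the convention that energy dissipation and volume are minimized while balancing current is maximized are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """Hypothetical candidate: one inductor/transistor choice."""
    name: str
    energy_loss: float  # energy dissipation, lower is better
    volume: float       # installation volume, lower is better
    current: float      # balancing current, higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = (a.energy_loss <= b.energy_loss
                and a.volume <= b.volume
                and a.current >= b.current)
    better = (a.energy_loss < b.energy_loss
              or a.volume < b.volume
              or a.current > b.current)
    return no_worse and better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]
```

The quadratic scan is adequate for small candidate sets; a large component library would call for a more efficient search.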
to certify the correctness of photonic systems.

Formal methods are traditionally used to verify the correctness of hardware, software, or protocols. This session introduces a set of applications which extend the use of formal methods into new domains. The first three papers demonstrate novel ways to bridge formal verification results into the synthesis domain. The fourth leverages formal reasoning to compute the cost of a circuit initialization.

Chair: Gianpiero Cabodi, Politecnico di Torino, IT
Co-Chair: Christoph Scholl, University of Freiburg, DE
Location / Room: Konferenz 3

Time: 11:00
Presentation Title: USING MAXBMC FOR PARETO-OPTIMAL CIRCUIT INITIALIZATION
Authors: Sven Reimer, Matthias Sauer, Tobias Schubert and Bernd Becker, University of Freiburg, DE
Abstract
In this paper we present MaxBMC, a novel formalism for solving optimization problems in sequential systems. Our approach combines techniques from symbolic SAT-based Bounded Model Checking (BMC) and incremental MaxSAT, leading to the first MaxBMC solver. In traditional BMC safety and liveness properties are validated. We extend this formalism: in case the required property is satisfied, an optimization problem is defined to minimize the cost of the reached solutions. We compare its quality in different depths of the system, leading to Pareto-optimal solutions. We state a sound and complete algorithm that not only tackles the optimization problem but moreover verifies whether a global optimum has been identified by using a complete BMC solver as back-end. As a first reference application we present the problem of circuit initialization. Additionally, we give pointers to other tasks which can be covered by our formalism quite naturally and further demonstrate the efficiency and effectiveness of our approach.
**TOWARDS VERIFYING DETERMINISM OF SYSTEMC DESIGNS**

Speakers: Hoang M. Le and Rolf Drechsler, University of Bremen, DE

Abstract

Ensuring the correctness of high-level SystemC designs is an important and challenging problem in today's Electronic System Level (ESL) methodology. Prevalently, a design is checked against a functional specification given by, e.g., a testcase with reference output or a user-defined property. Another research direction takes the view of a SystemC design as a piece of concurrent software. The design is then checked for common concurrency problems, and thus a functional specification is not required. Along this line, several methods for deadlock detection and race analysis have been developed. In this work, we propose to consider a new concurrency verification problem, namely input-output determinism, for SystemC designs. That means that for each possible input, the design must produce the same output under any valid process schedule. We argue that determinism verification is stronger than both deadlock detection and race analysis. Besides being an attractive correctness criterion itself, proven determinism helps to accelerate both simulative and formal verification.
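
The determinism criterion (same output under any valid process schedule) can be illustrated on a toy model. The sketch below is not the verification method of the paper and does not model SystemC scheduling semantics such as delta cycles; it assumes that each process is a list of atomic steps over a shared dictionary and simply enumerates every interleaving. All names (`write`, `interleavings`, `is_deterministic`) are hypothetical.

```python
from typing import Callable, Dict, Iterator, List

State = Dict[str, int]
Step = Callable[[State], None]


def write(key: str, value: int) -> Step:
    """Build an atomic step that stores a constant into shared state."""
    def step(state: State) -> None:
        state[key] = value
    return step


def interleavings(processes: List[List[Step]]) -> Iterator[List[Step]]:
    """Yield every schedule that preserves the step order of each process."""
    if not any(processes):
        yield []
        return
    for i, steps in enumerate(processes):
        if steps:
            rest = processes[:i] + [steps[1:]] + processes[i + 1:]
            for tail in interleavings(rest):
                yield [steps[0]] + tail


def is_deterministic(processes: List[List[Step]], initial: State) -> bool:
    """True if every schedule ends in the same final state."""
    finals = set()
    for schedule in interleavings(processes):
        state = dict(initial)
        for step in schedule:
            step(state)
        finals.add(tuple(sorted(state.items())))
    return len(finals) == 1
```

Two processes that write different values to the same variable are reported as non-deterministic, while processes writing to disjoint variables pass. Exhaustive enumeration grows factorially with the number of steps, so it only serves to illustrate the property.
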
6.6.2 ISOCHRONOUS NETWORKS BY CONSTRUCTION

Speakers: Yu Bai and Klaus Schneider, University of Kaiserslautern, DE

Abstract
While synchronous system models have many advantages over asynchronous models concerning verification and validation, many implementation platforms do not provide efficient means for synchronization. For this reason, we consider a design flow that starts with a synchronous system model that is then transformed into an asynchronous one for synthesis. In essence, it partitions the synchronous system into a set of asynchronous components that communicate with each other via FIFO buffers. Of course, the synthesized system still has to behave as the original synchronous model, i.e., for each variable exactly the same flow of data values must be observed and only the synchronization to synchronous reaction steps is no longer explicitly given. In this paper, we prove that this correctness guarantee is given provided that (1) each component knows which of the input values have to be used for the next reaction (endochrony), (2) the synchronous system is able to perform the reaction (constructiveness), and (3) components agree on the clocks of their shared variables (isochrony/clock-consistency).

6.6.4 P-OFTL: AN OBJECT-BASED SEMANTIC-AWARE PARALLEL FLASH TRANSLATION LAYER

Speakers: Wei Wang, Youyou Lu and Jiwu Shu, Tsinghua University, CN

Abstract
With increased density and decreased price, flash memory has been widely used in storage systems for its low latency and low power features. However, traditional storage systems are designed and excessively optimized for magnetic disks, and the potential of flash memory is not brought into full play in the form of Solid State Drives (SSDs). In this paper, we propose p-OFTL, an object-based semantic-aware parallel flash translation layer (FTL). p-OFTL removes the dependencies in the FTL and directly manages the flash memory in file objects, which enables optimization of the data layout in the flash using object semantics. While removing the mapping table improves system performance, a challenge remains: exploiting the internal parallelism while maintaining the continuity of logical addresses in each object, which is essential for efficient garbage collection. To address this challenge, p-OFTL statically remaps the addresses by shifting the bits in the addresses, which spreads writes to different internal parallel units without another mapping table. Also, p-OFTL employs a semantic-aware data grouping algorithm to group data pages by trading off hot-cold clustering against the continuity of logical addresses, so as to reduce page movement in garbage collection. Experiments show that p-OFTL improves system performance by 4.0% ~ 10.3% and reduces garbage collection overhead by 15.1% ~ 32.5% with semantic-aware data grouping compared to semantic-unaware data grouping algorithms.
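
The idea of a static, table-free remapping by bit shifting can be sketched as follows. The abstract does not give the actual bit layout used by p-OFTL, so this example assumes a hypothetical one: page addresses are `addr_bits` wide, the top `unit_bits` bits of the remapped address select the parallel unit, and the remapping rotates the address right by `unit_bits`. The function names are illustrative.

```python
def remap(addr: int, addr_bits: int, unit_bits: int) -> int:
    """Rotate the address right so its low bits become the unit selector."""
    low = addr & ((1 << unit_bits) - 1)
    return (addr >> unit_bits) | (low << (addr_bits - unit_bits))


def unit_of(remapped: int, addr_bits: int, unit_bits: int) -> int:
    """Parallel unit selected by the top bits of a remapped address."""
    return remapped >> (addr_bits - unit_bits)


def unmap(remapped: int, addr_bits: int, unit_bits: int) -> int:
    """Inverse rotation: recover the logical address without any table."""
    mask = (1 << addr_bits) - 1
    high = remapped >> (addr_bits - unit_bits)
    return ((remapped << unit_bits) & mask) | high


if __name__ == "__main__":
    for logical in range(8):
        physical = remap(logical, addr_bits=8, unit_bits=2)
        print(logical, "-> unit", unit_of(physical, 8, 2))
```

Because a rotation is a bijection, consecutive logical pages land on different units while every address remains recoverable by the inverse rotation, which is the property that makes a second mapping table unnecessary in this sketch.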

6.7 Hardening Approaches at Different Design Levels

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 5
Chair: Cecilia Metra, University of Bologna, IT
Co-Chair: Lorena Anghel, TIMA, FR

New solutions for the design of hardened hardware components, from circuit to processor level.

11:00 6.7.1 NOSTRADAMUS: LOW-COST HARDWARE-ONLY ERROR DETECTION FOR PROCESSOR CORES

**Speakers:**
Ralph Nathan and Daniel Sorin, Duke University, US

**Abstract**
We propose Nostradamus, a new, low-cost, hardware-only scheme to detect errors in superscalar, out-of-order processor cores. For each instruction decoded, Nostradamus compares what the instruction is expected to do against what the instruction actually does. We implement Nostradamus in RTL on top of a baseline superscalar, out-of-order core, and we experimentally evaluate its ability to detect injected errors. We also evaluate Nostradamus’s area and power overheads.

11:30 6.7.2 WORD-LINE POWER SUPPLY SELECTOR FOR STABILITY IMPROVEMENT OF EMBEDDED SRAMS IN HIGH RELIABILITY APPLICATIONS

**Speakers:**
Bartomeu Alorda, Cristian Carmona and Sebastia Bota, Balearic Islands University, ES

**Abstract**
Embedded SRAM yield dominates the overall ASIC yield; therefore, methodologies centered on improving SRAM cell stability will become a mandatory part of the design. Word-line voltage modulation has been shown to improve cell stability during access operations. The high variability of physical and performance parameters introduces the need for adaptable solutions to adequately improve SRAM cell stability. In this work, we present a word-line voltage selector circuit designed to modulate the word-line power-supply voltage at each individual embedded SRAM block. The final area overhead is minimal, and several strategies can be implemented with the embedded SRAM, allowing the word-line voltage value to be adjusted during the life of the ASIC, taking into account different operational, voltage and degradation effects.

12:00 6.7.3 A HIGH PERFORMANCE SEU-TOLERANT LATCH FOR NANOSCALE CMOS TECHNOLOGY

**Speaker:**
Zhengfeng Huang, Hefei University of Technology, CN

**Abstract**
This paper presents a high performance latch to tolerate radiation-induced single event upset in 45 nm CMOS technology. The latch can improve robustness by masking the soft errors utilizing Muller C-element and dual modular redundancy hardening. The power dissipation, propagation delay and reliability of the presented SEU-tolerant latch are analyzed by SPICE simulations. The results show that the presented latch provides a higher robustness and lower power-delay product than classical implementations and alternative hardened solutions.
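
The masking role of a Muller C-element in a redundant latch can be shown with a small behavioural model. This is a logic-level sketch of the standard C-element (the output follows the inputs when they agree and holds its previous value when they disagree), not the transistor-level latch evaluated in the paper; the class name and the way the two redundant copies are fed in are assumptions.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, a: int, b: int) -> int:
        """Follow the inputs when they agree, otherwise hold the output."""
        if a == b:
            self.out = a
        return self.out


if __name__ == "__main__":
    c = CElement()
    print(c.update(1, 1))  # both redundant copies store 1
    print(c.update(0, 1))  # one copy is upset; the output is held
```

With dual modular redundancy, a single upset makes the two copies disagree, so the C-element keeps the last agreed value instead of propagating the flipped bit.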

12:15 6.7.4 A LOW-COST RADIATION HARDENED FLIP-FLOP

**Speakers:**
Yang Lin, Mark Zwolinski and Basel Halak, University of Southampton, GB

**Abstract**
The aggressive scaling of semiconductor devices has caused a significant increase in the soft error rate caused by radiation hits. This has led to an increasing need for fault-tolerant techniques to maintain system reliability. Conventional radiation hardening techniques, typically used in safety-critical applications, are prohibitively expensive for non-safety-critical electronics. This work proposes a novel flip-flop architecture named SETTOFF which significantly improves circuit resilience to radiation hits over previous techniques. In addition, compared to other techniques such as a TMR latch, SETTOFF reduces the area and performance overhead by up to 50% and 80%, respectively; the power consumption is also reduced by up to 85%. In addition, a novel reliability metric called radiation-induced failure rate is developed which can be a valuable tool to predict the impact of radiation hits and quantitatively compare the reliability of various radiation hardened techniques. Our analysis shows that the proposed technique can achieve zero SEU failure rate, and significantly reduce the SET failure rate.

12:30 IP3-8, 98

**Presentation Title:**
PSP-CACHE: A LOW-COST FAULT-TOLERANT CACHE MEMORY ARCHITECTURE

**Speakers:**
Hamed Farbeh and Seyed Ghassem Miremadi, Sharif University of Technology, IR

**Abstract**
Cache memories constitute a large fraction of processor chip area and are highly vulnerable to soft errors caused by energetic particles. To protect these memories, most of the modern processors employ Error Detection Codes (EDCs) or Error Correction Codes (ECCs). EDCs/ECCs impose significant overheads in terms of area and energy; these overheads increase as a function of interleaving EDCs/ECCs to detect/correct multiple errors. This paper proposes a new cache architecture to minimize the area and energy overheads of EDCs/ECCs in set-associative L1-caches. Simulation results for a 4-way set-associative cache show that the proposed architecture reduces both the area and static power overheads of parity code by about 75% and the dynamic energy overhead by about 73% in comparison to conventional cache architecture. These reduction figures are about 68% and about 66%, respectively, for SEC-DED code. The above reductions are achieved without affecting the error coverage.
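
As background for the error detection codes mentioned in this abstract, the sketch below shows plain even parity over a data word. It is a generic illustration of an EDC and of why a single parity bit cannot detect an even number of flipped bits; it does not model the proposed cache architecture, and the function names are hypothetical.

```python
from typing import Tuple


def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def store(word: int) -> Tuple[int, int]:
    """Return the word together with the parity bit kept alongside it."""
    return word, parity(word)


def error_detected(word: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    return parity(word) != stored_parity


if __name__ == "__main__":
    word, p = store(0b10110010)
    print(error_detected(word, p))          # clean read
    print(error_detected(word ^ 0b100, p))  # one flipped bit
    print(error_detected(word ^ 0b110, p))  # two flipped bits
```

A single flipped bit changes the recomputed parity and is detected, whereas two flipped bits cancel out and go unnoticed, which is one reason interleaved codes are used when multiple errors must be covered.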

12:31 IP3-9, 31

**Presentation Title:**
A HYBRID NON-VOLATILE SRAM CELL WITH CONCURRENT SEU DETECTION AND CORRECTION

**Speakers:**
Pilin Junsangsri1, Fabrizio Lombardi1 and Jie Han2
1Northeastern University, US; 2University of Alberta, CA

**Abstract**
This paper presents a hybrid non-volatile (NV) SRAM cell with a new scheme for SEU tolerance. The proposed NVSRAM cell consists of a 6T SRAM core and a Resistive RAM (RRAM), made of a 1T and a Programmable Metallization Cell (PMC). The proposed cell has concurrent error detection (CED) and correction capabilities; CED is accomplished using a dual-rail checker, while correction is accomplished by utilizing the restore operation, in which data from the non-volatile memory element is copied back to the SRAM core. The dual-rail checker utilizes two XOR gates, each made of 2 inverters and 2 ambipolar transistors; hence, it has a hybrid nature. Extensive simulation results are provided. The simulation results show that the proposed scheme is very efficient in terms of numerous figures of merit such as delay and circuit complexity, and is thus applicable to integrated circuits such as FPGAs requiring secure on-chip non-volatile storage (i.e. LUTs) for multi-context configurability.

6.8 First Time Right in Analog Design Enabling New Business Cases

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Exhibition Theatre

End of session
Lunch Break in Exhibition Area
Sandwich lunch
Today’s demanding analog and mixed-signal applications often do not allow for a “second shot”: due to both schedule and budget requirements, costly and time-consuming re-spins of all components need to be avoided to be successful. "First-Time-Right" is the goal for these designs. The presentation will outline the challenges involved in achieving first-time-right analog designs. It will highlight what impact the choice of process architecture makes, and will discuss the pros and cons of different process architectures, such as BCD and SOI. Fabless companies rely on their foundry to provide not only the right processes, but also excellent modeling of process and devices, and high-quality, feature-rich process design kits. The influence of the design kit quality will be discussed in a second part of the presentation, as well as the choice of the right EDA tools and design flows. Finally, it will be discussed how the relationship between foundry, fabless company and EDA provider needs to be developed in order to better support First-Time-Right in analog and mixed-signal designs.

The demand for lower supply voltages, faster processing speeds and smaller technology nodes, together with the accompanying higher impact of variation under constantly shortening product cycles, significantly increases the need for automation in the design of analog modules. This presentation demonstrates recent progress in research on the "Fully Automated Analog Topology Synthesis Framework" (FAATS) by introducing its unique approach to elevating automated analog circuit design to the next level. How an extensive design-space exploration can support "First-Time-Right" requirements is shown in different design case studies and an exclusive peek into an ongoing ASIC development strongly driven by FAATS.

Increasing complexity of systems-on-chip (SoC) also increases the complexity of their verification. The reason is that we must verify not only the functions of separate logic blocks, but also their interconnections, timing and functional collaboration. Therefore, there is still a great demand for verification tools that are time-effective, fast and as automated as possible. Our solution targets exactly these issues. You are welcome to see the live demonstration at our booth!

We present the Heterogeneous Image Processing Acceleration (HIPAcc) framework. It allows programmers to develop image preprocessing applications while providing high productivity, flexibility, and portability as well as competitive performance. The same algorithm description serves as the basis for targeting different GPU accelerators and low-level languages. Imaging algorithms can thus be expressed in a compact and productive way by using a domain-specific language (DSL) that is embedded into C++ code. Using the HIPAcc source-to-source compiler, DSL code is compiled to CUDA, OpenCL, C/C++, or even Renderscript code, which targets heterogeneous architectures on recent MPSoCs running Android. Programming those MPSoCs can be challenging, in particular when targeting different architectures (CPU/GPU/DSP). HIPAcc lifts this burden from programmers by automatically applying source code transformations based on domain knowledge and a built-in architecture model. This demonstration shows the seamless integration of HIPAcc into the Android Developer Tools and provides a live comparison of generated code to functionally identical handwritten naive implementations of image filters on recent MPSoCs running Android.

More information ...

Hardware and software are typically composed as if they were separate components, whereas their interactions yield more than the sum of the two parts. In the scope of the demonstration, we present our enhanced version of TTool/DiplodocusDF, a UML model-driven engineering tool and methodology for the design of heterogeneous data processing systems. Our contributions enrich the modeling and design space exploration.
UB06.09  PIGGY’S WEAVER: A DEMONSTRATION FOR FOCUSING ON SEPARATION OF DEBUGGING CONCERNS BASED ON DYNAMIC PROGRAM REWRITING TOOL: PIGGY’S WEAVER
Authors: Ikuta Tanigawa1, Nobuhiko Ogura2, Midori Sugaya3 and Harumi Watanabe1
1Tokai University, JP; 2Tokyo City University, JP; 3Shibaura Institute of Technology, JP
Abstract
Dynamic program rewriting is needed for continuous operation and reduces maintenance costs. We propose a dynamic rewriting tool, “Piggy’s Weaver”, for C# programs. The tool attaches and detaches pieces of code at arbitrary points in a program, separately for each concern. These attachments focus in particular on the debugging concern. In the demonstration, we will apply the tool to a cloud and embedded system, “Piggy Net”, which is a cooperating charity pot with SNS and was awarded 2nd prize at D2C2012 by Microsoft Japan.
More information ...

UB06.10  UNISON: ASSEMBLY CODE GENERATION USING CONSTRAINT PROGRAMMING
Authors: Roberto CASTAÑEDA LOZANO1, Gabriel HJORT BLINDELL2, Mats CARLSSON1 and Christian SCHULTE2
1Swedish Institute of Computer Science, SE; 2KTH Royal Institute of Technology, SE
Abstract
We demonstrate Unison - a simple, flexible and potentially optimal code generator that solves interdependent code generation tasks together using constraint programming as a modern combinatorial optimization method. We show how Unison takes into account the task interdependencies and their combinatorial nature to improve the speed of the code generated by LLVM (a state-of-the-art compiler) for Hexagon (a digital signal processor ubiquitous in modern mobile platforms).
More information ...

14:00  End of session
16:00  Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.0  Special Day Keynote
Date: Wednesday 26 March 2014
Time: 13:30 – 14:00
Location / Room: Saal 1
The automotive industry is undergoing a radical change driven by technology. On the one hand, the proliferation of communication technologies into the car leads to internet-connected vehicles. The vehicle will become an integral part of the internet, opening new processing paradigms for the car itself. On the other hand, the vehicle itself significantly expands its sensor and processing capabilities through the use of radar, video and ultrasound sensors and state-of-the-art CPU and GPU processor architectures. In our talk we will address both developments and outline foreseen applications such as future driver assistance and infotainment systems as well as highly automated driving. We will discuss major requirements for future electrical architectures and implications for future automotive chips.

Time  Label  Presentation Title  Authors
13:30  7.0.1  SPECIAL DAY KEYNOTE: THE CONNECTED CAR AND ITS IMPLICATION TO THE AUTOMOTIVE CHIP ROADMAP
Speaker: Dr.-Ing. Michael Bolle, Robert Bosch GmbH, DE
14:00  End of session
16:00  Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

UB07 Session 7
Date: Wednesday 26 March 2014
Time: 14:00 – 16:00
Location / Room: University Booth, Booth 3, Exhibition Area

Label  Presentation Title  Authors
UB07.01  VIDEO-BASED ABSOLUTE NAVIGATION APPROACH: A NOVEL APPROACH FOR VIDEO-BASED ABSOLUTE NAVIGATION IN SPACE EXPLORATION MISSIONS
Authors: Pascal Trotta, Tadewos Getahun Tadewos, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT
Abstract
Nowadays, space agencies have increased their research efforts in order to enhance the success rate of space exploration missions. Future space missions will increasingly adopt Video Based Navigation (VBN) systems to assist the entry, descent and landing (EDL) phase of space modules. This poster will show a preliminary work on a novel approach for Video-based Absolute Navigation (VBNM). Moreover, the poster depicts how a VBN processing chain can exploit FPGA devices to achieve high throughput. Several visual results will be shown to highlight the peculiarities of the proposed approach.
More information ...

UB07.02  AIDA: ANALOG IC DESIGN AUTOMATION
Authors: Nuno Hortas 1, Nuno Lourenço 2, Ricardo Martins 2, Ricardo Pôvoa 2, António Canelas 2 and Pedro Ventura 1
1Instituto de Telecomunicacoes, PT; 2Instituto de Telecomunicacoes / Instituto Superior Técnico, PT
Abstract
This demonstration presents AIDA, an analog integrated circuit (IC) design automation environment. AIDA includes two main modules, namely, AIDA-C and AIDA-L. AIDA-C is a circuit-level synthesis tool which uses state-of-the-art multi-objective multi-constrained optimization kernels, based on evolutionary computation techniques, where the robustness of the solutions is attained by considering a layout-aware approach and, also, extreme process variations by means of PVT corner analysis. The circuit’s performance is measured using Spectre®, ELDO® or HSPICE® electrical simulators as evaluation engines. AIDA-L considers the device sizes and the best floorplan, obtained with AIDA-C, and generates the complete layout by placing and routing the devices, while fulfilling the technology design rules by using built-in design-rule check (DRC) and layout-versus-schematic (LVS) procedures. In order to demonstrate the AIDA design environment, several analog circuit structures, e.g., OTAs, LNAs, LC-Oscillators, etc., will be synthesized in a 130nm CMOS technology. AIDA-C is demonstrated for circuit-level sizing and optimization by generating a family of Pareto-optimal solutions based on user performance and functional specifications. AIDA-L is demonstrated by generating the layout of a user-selected solution from AIDA-C, taking into account electrical current information to mitigate electromigration and IR-drop effects, and also wiring symmetry for multiport multi-terminal signal nets of analog ICs.
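
As a rough illustration of what a "family of Pareto-optimal solutions" means in the abstract above, the sketch below filters a set of candidate sizings down to the non-dominated ones. It is not AIDA's optimization kernel; the two objectives (power to minimize, gain to maximize) and all numeric values are assumptions chosen purely for illustration.

```python
from typing import List, Tuple

# Hypothetical objectives for one candidate sizing: (power in mW, gain in dB).
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is at least as good as b in both objectives
    (lower power, higher gain) and strictly better in at least one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented candidate sizings, not measured data.
candidates = [(1.0, 40.0), (1.5, 55.0), (2.0, 50.0), (2.5, 60.0), (1.2, 38.0)]
front = pareto_front(candidates)
print(front)
```

A designer would then pick one point from `front` according to the specification, which corresponds to the "user-selected solution" that the abstract says is handed to layout generation.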
More information ...
**UB07.03** BICONDITIONAL BINARY DECISION DIAGRAM MANIPULATION PACKAGE

**Authors:** Luca Amarú1, Alexios Balatsoukas-Stimming2, Pierre-Emmanuel Gaillardon3, Andreas Burg2 and Giovanni De Micheli3

1EPFL, CH; 2EPFL-TCL, CH; 3EPFL-LSI, CH

**Abstract**
In this software demonstration, we present a logic manipulation package based on Biconditional Binary Decision Diagrams (BBDDs). BBDDs are a novel class of canonical binary decision diagrams where the branching condition, and its associated logic expansion, is biconditional on two variables. We show how Verilog files from real life designs can be rapidly read and processed by the BBDD manipulation package, for verification, testing or synthesis purposes. In particular, we demonstrate the benefit deriving from BBDD re-writing of arithmetic circuits in the synthesis of a product code iterative decoder.
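
The branching condition described above can be illustrated with a small check. A biconditional expansion on two variables v and w can be written as f(v, w, ...) = (v XOR w)·f(NOT w, w, ...) + (v XNOR w)·f(w, w, ...). The sketch below is not the BBDD package itself; it only verifies this identity exhaustively for every three-input Boolean function, and all function names in it are hypothetical.

```python
from itertools import product


def biconditional_expand(f, v, w, rest):
    """Evaluate f through its two biconditional cofactors on (v, w)."""
    f_neq = f(1 - w, w, *rest)  # cofactor for the branch v != w
    f_eq = f(w, w, *rest)       # cofactor for the branch v == w
    differ = v ^ w
    return (differ & f_neq) | ((1 - differ) & f_eq)


def check_all_three_input_functions():
    """Check the expansion against every 3-input truth table (256 functions)."""
    inputs = list(product((0, 1), repeat=3))
    for table in product((0, 1), repeat=8):
        lookup = dict(zip(inputs, table))

        def f(v, w, x, _t=lookup):
            return _t[(v, w, x)]

        for v, w, x in inputs:
            if biconditional_expand(f, v, w, (x,)) != f(v, w, x):
                return False
    return True


print(check_all_three_input_functions())
```

Each decision node thus tests whether two variables are equal or different, rather than testing a single variable as in a Shannon expansion.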

More information ...

---

**UB07.04** GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES

**Authors:** Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT

**Abstract**
Gemini is a synthesis and optimization software for graphene-based digital devices. Given a combinational circuit description through its boolean representation, Gemini produces a SPICE netlist mapped with graphene PN-Junction gates. The software is composed of a parser library to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library utilized to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. As a stand-alone software or as a library easy to integrate into state-of-the-art tools, Gemini represents a first step of an enabling technology for future synthesis and optimization processes for graphene-based devices.

More information ...

---

**UB07.05** TOMAHAWK: PERFORMANCE IMPACT OF INSTRUCTION SET ARCHITECTURE EXTENSIONS FOR DYNAMIC TASK SCHEDULING UNITS

**Author:** Oliver Arnold, Technische Universität Dresden, DE

**Abstract**
In this demo a heterogeneous MPSoC is controlled by a dynamic task scheduling unit called CoreManager. The instruction set architecture of this unit has been extended to improve performance for dynamic data dependency checking, task scheduling, processing element allocation and data transfer management. The MPSoC as well as the NoC are integrated in a cycle-accurate virtual system prototype. The performance impact of the CoreManager is analyzed on component as well as on system level.

More information ...

---

**UB07.06** LEGO: TOOLS FOR HYBRID INTEGRATION

**Author:** Fredrik Jonsson, Royal Institute of Technology, SE

**Abstract**
Performance of printed devices depends on the geometry, but is also affected by processing steps of other components integrated onto the same substrate. Since different designs use different devices, process stack, models and design rules must be dynamically determined. In this work we propose and demonstrate an experimental design flow to allow efficient design of hybrid and printed electronic circuits.

More information ...

---

**UB07.07** UVM-SYSTEMC-AMS: UVM STANDARD-COMPLIANT SYSTEMC (AMS)-BASED VERIFICATION FRAMEWORK FOR HETEROGENEOUS SYSTEMS

**Authors:** Zhi Wang1, Yao Li2, Marie-Minerve Louerat2, Francois Pecheux2, Martin Barnasconi3, Thilo Völtter4 and Karsten Einwich4

1Laboratoire d'informatique de Paris 6, FR; 2UPMC-LIP6, FR; 3NXP, NL; 4Fraunhofer IIS, DE

**Abstract**
Today's societal needs for innovative products in terms of communication, mobility, health, entertainment, and safety directly impact microelectronics design methodologies. The embedded systems are simultaneously software-driven, digitally assisted, complex and heterogeneous, but existing verification methodologies are mostly focused on pure digital devices and are completely decoupled from analog verification. This presentation shows how the principles of the new UVM methodology can be soundly enhanced to offer to the test designer a flexible framework for the virtual prototyping of multi-discipline testbenches that supports both digital and Analog Mixed-Signal (AMS) at the architectural level.

More information ...

---

**UB07.08** TTOOL/DIPLODOCUSDF: A UML ENVIRONMENT FOR HARDWARE/SOFTWARE CO-DESIGN OF DATA-DOMINATED SYSTEMS-ON-CHIP

**Authors:** Andrea Enrici, Ludovic Apvrille and Renaud Pacalet, Telecom ParisTech, FR

**Abstract**
The development of new Systems on Chip commonly relies on previous products for which, due to factors such as system complexity and time and cost constraints, little design space exploration can be performed. Hardware and software are typically composed as if they were separate components, whereas their interactions yield more than the sum of the two parts. In the scope of the demonstration, we present our enhanced version of TTool/DiplodocusDF, a UML model-driven engineering tool and methodology for the design of heterogeneous data processing systems. Our contributions enrich the modeling and design space exploration, usually focused on a higher level, to target complex transfer schemes and control information exchange at different abstraction levels. Our improved methodology is applied to two signal processing applications, showing the analysis of novel interactions between typically conflicting aspects such as computations vs communications and dataflows vs controlflows.

More information ...

---

**UB07.09** A HOLISTIC APPROACH TO POWER MANAGEMENT FOR ENERGY HARVESTING EMBEDDED SYSTEMS

**Authors:** Kyungsoo Lee, Hideki Takase and Taeheu Ishihara, Kyoto University, JP

**Abstract**
We present a holistic approach to maximizing the energy efficiency of energy harvesting embedded systems, which consist of a processor system and an energy harvesting system. A power management program integrated into a real-time OS optimally switches the operation mode of the processor and the configuration of the energy harvesting system according to the workload of the processor and the harvesting situation. The demonstration will show that our prototype system, consisting of a processor chip and a harvesting system board, runs stably using harvested energy only. The processor has multiple cores, each with a different performance level, to improve the energy efficiency of computation. The energy harvesting board has a high transfer efficiency to reduce power loss. The entire system is controlled efficiently by our power management program implemented on Toppers OS.

More information ...

---

**UB07.10** STMC TOOLS: A STATE TRANSITION MODEL DESCRIPTION LANGUAGE STMC AND ITS TOOLS - AN EXTENSION OF THE C PROGRAMMING LANGUAGE FOR DEVELOPING DRIVER SOFTWARE AND SOFTWARE WITH MODELS

**Authors:** Nobuhiko Ogura1, Ikuta Tanigawa2, Takuya Todoroki1, Kenji Arai1 and Harumi Watanabe2

1Tokyo City University, JP; 2Tokai University, JP

**Abstract**
We present a state transition model description programming language. It can be translated to pure standard C programs without any OS or handwritten frameworks; hence it is suited to developing low-level driver software and firmware, unlike many other tools for automatic software generation from software models, which usually focus on higher-level models. We show the language, its translator to executable software, a visual diagram generator and analysis tools, with embedded software examples.

More information ...
7.1 SPECIAL DAY Panel: HW/SW Co-Development - The Industrial Workflow

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Saal 1

Organiser:
Johannes Stahl, Synopsys, US

Chair:
Iris Stroh, Markt & Technik, DE

This panel brings together the entire supply chain for the use of virtual prototyping, starting with the end users at an automotive Tier1, a semiconductor supplier, IP providers and the virtual prototyping and software development tool providers. The panelists will discuss the benefits and challenges of accelerating software development using virtual prototyping for deployment in industrial projects.

Panelists:
- Andreas Schwerin, Siemens, DE
- Martin Vaupel, Bosch, DE
- Albrecht Mayer, Infineon, DE
- Nick Gatherer, ARM, GB
- Frank Schirmeister, Cadence, US
- Stephan Lauterbach, Lauterbach, US
- Colin Walls, Mentor Graphics, US
- Andreas Hoffmann, Synopsys, US

16:00 End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.2 Embedded Tutorial: Cross Layer Resiliency in Real World

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 6

Organiser:
Vikas Chandra, ARM, US

Chair:
Yanjing Li, Intel, US

Co-Chair:
Ulf Schlichtmann, TUM, DE

Resilience at different levels of the design hierarchy will be needed in complex SoCs to handle failures due to variability, reliability and design errors (logical or electrical). The main reasons for the marginal behavior are sheer design complexity, uncertainties in manufacturing processes, temporal variability and operating conditions. In this session, we will cover the basics of cross-layer resiliency and explore the reliability challenges in both embedded processors and large-scale computing resources.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:30</td>
<td>7.2.1</td>
<td>CROSS-LAYER RESILIENCE EXPLORATION AND OPTIMIZATION</td>
<td>Subhasish Mitra, Stanford University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>This talk will discuss systematic methodologies for exploring cross-layer resilience, encompassing error detection, correction and recovery techniques, for complex SoCs. The objective is to address several key questions such as: 1. Given a design, is cross-layer resilience always the best option? 2. What are the right models that link resilience techniques across multiple layers for quick, yet accurate, estimation of coverage and costs? 3. What is the proper framework to explore the large space of existing resilience techniques for error detection, correction, and recovery across various abstraction layers?</td>
</tr>
<tr>
<td>15:00</td>
<td>7.2.2</td>
<td>RELIABILITY CHALLENGES IN EMBEDDED PROCESSORS</td>
<td>Vikas Chandra, ARM, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Embedded processors are now at the heart of the mobile revolution and have the aspirations to power even high performance data centers. It is of utmost importance to understand the reliability challenges in embedded processors and find ways to tackle them across different layers of design abstraction. In this talk, I will talk about the reliability requirements in embedded processors, the challenges we are facing and our approach to make the design more robust. We will discuss our approaches of measuring wearout in commercial processors as well as efficient design of in-situ monitors to track timing errors.</td>
</tr>
<tr>
<td>15:30</td>
<td>7.2.3</td>
<td>BILLION CHIPS OF TRILLION TRANSISTORS: HOW TO MAKE THEM RELIABLE?</td>
<td>Chen-Yong Cher1 and Silvia Mueller2</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Affiliations</strong></td>
<td>1IBM Research, US; 2IBM Boeblingen, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Due to increasing demand for personal devices, high performance computing systems and commercial data centers, microprocessor and main memory designers face numerous challenges in delivering a large number of chips cost-effectively. While frequency scaling has effectively ended, technology scaling continues to provide an increasing number of transistors. To effectively utilize these transistors for performance, designers turn to sophisticated and highly integrated chip designs such as multi-core (e.g., Intel i7, IBM POWER7, BlueGene/Q), GPGPU (e.g., NVIDIA Tegra) and heterogeneous SoC (e.g., IBM ZPower). The increasing demand for chips and transistors presents numerous challenges in reliability, power and manufacturing costs. In large-scale HPC systems and data centers, the increasing number of chips also raises the per-chip reliability requirement in order to achieve system reliability targets.</td>
</tr>
</tbody>
</table>

16:00 End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
7.3 Low power methods and multicore architectures for mobile health applications

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 1

Chair: Giovanni Ansaloni, EPFL, CH
Co-Chair: Andrea Bartolini, University of Bologna, IT

Achieving low power operation is essential for battery operated mobile health applications. In this session, the papers address this important issue. The first two papers present multicore architectural methods for bio-signal processing, dealing with synchronisation and innovative memory architecture design. The last two papers focus on low power design of applications for bio-signal processing: tuning of sensor usage based on applications and methods to selectively drop computations to save power, without affecting the accuracy.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:30</td>
<td>7.3.1</td>
<td>HARDWARE/SOFTWARE APPROACH FOR CODE SYNCHRONIZATION IN LOW-POWER MULTI-CORE SENSOR NODES</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speakers: Rubén Braojos 1, Ahmed Dogan 2, Ivan Beretta 2, Giovanni Ansaloni 2 and David Atienza 2</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract Latest embedded bio-signal analysis applications, targeting low-power Wireless Body Sensor Nodes (WBSNs), present conflicting requirements. On one hand, bio-signal analysis applications are continuously increasing their demand for high computing capabilities. On the other hand, long-term signal processing in WBSNs must be provided within their highly constrained energy budget. In this context, parallel processing effectively increases the power efficiency of WBSNs, but only if the execution can be properly synchronized among computing elements. To address this challenge, in this work we propose a hardware/software approach to synchronize the execution of bio-signal processing applications in multi-core WBSNs. This new approach requires little hardware resources and very few adaptations in the source code. Moreover, it provides the necessary flexibility to execute applications with an arbitrarily large degree of complexity and parallelism, enabling considerable reductions in power consumption for all multi-core WBSN execution conditions. Experimental results show that a multi-core WBSN architecture using the illustrated approach can obtain energy savings of up to 40%, with respect to an equivalent single-core architecture, when performing advanced bio-signal analysis.</td>
</tr>
<tr>
<td>15:00</td>
<td>7.3.2</td>
<td>HYBRID MEMORY ARCHITECTURE FOR VOLTAGE SCALING IN ULTRA-LOW POWER MULTI-CORE BIOMEDICAL PROCESSORS</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speakers: Daniele Bartolotti 1, Andrea Bartolini 1, Christian Weis 2, Davide Rossi 2 and Luca Benini 1</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract Technology scaling today enables the design of sensor-based ultra-low cost chips well suited for emerging applications such as wireless body sensor networks, urban life and environment monitoring. Energy consumption is the key limiting factor of this upcoming revolution, and memories are often the energy bottleneck, mainly due to leakage power. This paper proposes an ultra-low power multi-core architecture targeting eHealth monitoring systems, where applications involve collection of sequences of slow biomedical signals and highly parallel computations at very low voltage. We propose a hybrid memory architecture that combines 6T-SRAM and 8T-SRAM operating in the same voltage domain. At high voltage it supports normal operation, and at low voltage it provides a fully reliable small memory partition (8T) while the rest of the memory (6T) is state-retentive. Our architecture offers significant energy savings with a low area overhead in typical eHealth Compressed Sensing-based applications.</td>
</tr>
<tr>
<td>15:30</td>
<td>7.3.3</td>
<td>CONTEXT AWARE POWER MANAGEMENT FOR MOTION-SENSING BODY AREA NETWORK NODES</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speakers: Filippo Casamassima 1, Elisabetta Farella 2 and Luca Benini 3</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract Body Area Networks (BANs) are widely used, mainly for healthcare and fitness purposes. In both cases, the lifetime of the sensor nodes included in the BAN is a key aspect that may affect the functionality of the whole system. Typical approaches to power management are based on a trade-off between the data rate and the monitoring time. Our work introduces a power management layer capable of opportunistically using data sampled by sensors to detect contextual information such as user activity and adapting the node operating point accordingly. The use of this layer has been demonstrated on a commercial sensor node, increasing its battery lifetime by up to a factor of 5.</td>
</tr>
<tr>
<td>15:45</td>
<td>7.3.4</td>
<td>A QUALITY-SCALABLE AND ENERGY-EFFICIENT APPROACH FOR SPECTRAL ANALYSIS OF HEART RATE VARIABILITY</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speakers: Georgios Karakostantinou 1, Avinash Sankaranarayanan 2, Mohamed Sabry 1, David Atienza 1 and Andreas Burg 1</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract Today there is a growing interest in the integration of health monitoring applications in portable devices necessitating the development of methods that improve the energy efficiency of such systems. In this paper, we present a systematic approach that enables energy-quality trade-offs in spectral analysis systems for bio-signals, which are useful in monitoring various health conditions as those associated with the heart-rate. To enable such trade-offs, the processed signals are expressed initially in a basis in which significant components that carry most of the relevant information can be easily distinguished from the parts that influence the output to a lesser extent. Such a classification allows the pruning of operations associated with the less significant signal components leading to power savings with minor quality loss since only less useful parts are pruned under the given requirements. To exploit the attributes of the modified spectral analysis system, thresholding rules are determined and adopted at design- and run-time, allowing the static or dynamic pruning of less-useful operations based on the accuracy and energy requirements. The proposed algorithm is implemented on a typical sensor node simulator and results show up to 82% energy savings when static pruning is combined with voltage and frequency scaling, compared to the conventional algorithm in which such trade-offs were not available. In addition, experiments with numerous cardiac samples of various patients show that such energy savings come with a 4.9% average accuracy loss, which does not affect the system detection capability of sinus-arrhythmia which was used as a test case.</td>
</tr>
<tr>
<td>16:00</td>
<td>7.3.5</td>
<td>BATTERY AWARE STOCHASTIC QoS BOOSTING IN MOBILE COMPUTING DEVICES</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speakers: Hao Shen, Qiwen Chen and Qinru Qiu, Syracuse University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract Mobile computing has been woven into everyday life to a great extent. The usage of mobile devices is clearly imprinted with the user's personal signature. The ability to learn such a signature offers immense potential for workload prediction and resource management. In this work, we investigate user behavior modeling and apply the model to energy management. Our goal is to maximize the quality of service (QoS) provided by the mobile device (e.g., a smartphone), while keeping the risk of battery depletion below a given threshold. A Markov Decision Process (MDP) is constructed from historical user behavior. The optimal management policy is solved using linear programming. Simulations based on real user traces validate that, compared to existing battery energy management techniques, the stochastic control performs better in boosting the mobile devices' QoS without significantly increasing the chance of battery depletion.</td>
</tr>
</tbody>
</table>

End of session

Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
# 7.4 Runtime memory optimization and GPU/manycore architectures

**Date:** Wednesday 26 March 2014  
**Time:** 14:30 - 16:00  
**Location / Room:** Konferenz 2

**Chair:** Alberto Nannarelli, DTU Copenhagen, DK  
**Co-Chair:** Alberto Macii, PoliTo Torino, IT

The session starts with memory design techniques under PVT variation and ageing for DRAMs and SRAM caches. Afterwards, bus, memory and partitioning techniques for 2D and 3D GPUs and manycores are presented.

| Time | Label | Presentation Title, Abstract and Authors |
|---|---|---|
| 14:30 | 7.4.1 | **EXPLOITING EXPENDABLE PROCESS-MARGINS IN DRAMS FOR RUN-TIME PERFORMANCE OPTIMIZATION**  
*Abstract*: Manufacturing-time process (P) variations and runtime variations in voltage (V) and temperature (T) can severely affect a DRAM's performance (internal delays). To counter the effects of these variations, DRAM vendors provide substantial design-time PVT margins to guarantee correct DRAM functionality under worst-case conditions. Unfortunately, with technology scaling these design margins have become large and very pessimistic for a majority of the manufactured DRAMs. While runtime variations are specific to operating conditions and their margins are difficult to optimize, process variations are manufacturing-time effects, and excessive process margins can be reduced on a per-device basis if properly identified. In this paper, we propose a generic post-manufacturing performance characterization methodology for DRAMs that identifies this excess in process margins for any given DRAM device at runtime, while retaining the requisite margins for voltage (noise) and temperature variations. By doing so, the methodology ascertains the actual impact of process variations on the particular DRAM device and optimizes its access latencies, thereby improving its overall performance. We evaluate this methodology on 48 DDR3 devices (from 12 DIMMs) and verify the derived timings under worst-case operating conditions, showing up to 33.3% and 25.9% reduction in DRAM read and write latencies, respectively.  
*Speakers*: Karthik Chandrasekar¹, Sven Goossens², Christian Weis³, Martijn Koedam⁴, Benny Akesson⁵, Norbert Wehn⁵ and Kees Goossens⁵  
¹Delft University of Technology, NL; ²Eindhoven University of Technology, NL; ³University of Kaiserslautern, DE; ⁴Czech Technical University in Prague, CZ; ⁵Eindhoven University of Technology, NL |
| 15:00 | 7.4.2 | **CACHE AGING REDUCTION WITH IMPROVED PERFORMANCE USING DYNAMICALLY RE-SIZABLE CACHE**  
*Abstract*: Aging of transistors is a limiting factor for the long-term reliability of devices in sub-100nm technologies. It is a worst-case metric, where the lifetime of a device is determined by the earliest failing component. The impact is more serious on memory arrays, where failure of a single SRAM cell would cause the failure of the whole system. Previous works have shown that partitioning strategies based on power management techniques can effectively control aging effects and can extend the lifetime of the cache significantly. However, such a benefit comes at the cost of performance, which degrades proportionally as time elapses. To address this problem and provide a single solution that concurrently improves aging, energy and performance of the cache, we propose an architectural solution based on dynamically re-sizable cache and cache partitioning approaches. With this strategy, the cache is dynamically re-sized and reconfigured whenever a cache block becomes unreliable. Coupling this aging mitigation technique with the dynamically re-sizable cache approach provides on average 30% lifetime improvement with less than 0.4x degradation in performance whereas, in previous solutions, performance degradation sometimes goes up to 10x.  
*Speakers*: Haroon Mahmood, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT |
| 15:15 | 7.4.3 | **ON GPU BUS POWER REDUCTION WITH 3D IC TECHNOLOGIES**  
*Abstract*: Complex buses consume significant power in graphics processing units (GPUs). In this paper, we demonstrate how the power consumption of buses in GPUs can be reduced with 3D IC technologies. Based on layout simulations, we found that the partitioning and floorplanning of 3D ICs affect the amount of power benefit, as do the technology setup, target clock frequency, and circuit switching activity. With 3D IC technologies, we achieved a total power reduction of up to 21.5% for our GPU.  
*Speakers*: Young-Joon Lee¹ and Sung Kyu Lim²  
¹Intel Corporation, US; ²Georgia Institute of Technology, US |
| 15:45 | 7.4.4 | **PROCESS VARIATION-AWARE WORKLOAD PARTITIONING ALGORITHMS FOR GPUs SUPPORTING SPATIAL MULTITASKING**  
*Abstract*: High-level programming languages have transformed graphics processing units (GPUs) from domain-restricted devices into powerful compute platforms. Yet many "general-purpose GPU" (GPGPU) applications fail to fully utilize the GPU resources. Executing multiple applications simultaneously on different regions of the GPU (spatial multitasking) thus improves system performance. However, within-die process variations lead to significantly different maximum operating frequencies (Fmax) of the streaming multiprocessors (SMs) within a GPU. As the chip size and number of SMs per chip increase, the frequency variation is also expected to increase, exacerbating the problem. The increased number of SMs also provides a unique opportunity: we can allocate resources to concurrently executing applications based on how those applications are affected by the different available Fmax values. In this paper, we study the effects of per-SM clocking on spatial multitasking-capable GPUs. We demonstrate two factors that affect the performance of simultaneously running applications: (i) the SM partitioning algorithm that decides how many resources to assign to each application, and (ii) the assignment of SMs to applications based on the operating frequencies of those SMs and the applications' characteristics. Our experimental results show that spatial multitasking that partitions SMs based on application characteristics, when combined with per-SM clocking, can greatly improve application performance by up to 46% on average compared to cooperative multitasking with global clocking.  
*Speakers*: Paula Aguilera¹, Jungseob Lee², Amin Farmahini Farahani¹, Michael Schultz³, Katherine Morrow¹ and Nam Sung Kim¹  
¹University of Wisconsin-Madison, US; ²AMD, US |
| 16:00 | IP3-11, 240 | **A THERMAL RESILIENT INTEGRATION OF MANY-CORE MICROPROCESSORS AND MAIN MEMORY BY 2.5D TSI I/OS**  
*Abstract*: A memory-logic integration design platform is developed in this paper, with thermal reliability analysis provided for 2.5D through-silicon-interposer (TSI) and 3D through-silicon-via (TSV) based integrations. Temperature-dependent delay and power models have been developed at the microarchitecture level for 2.5D and 3D integrations of many-core microprocessors and main memory, respectively. Experiments are performed with general-purpose benchmarks from SPEC CPU2006 and cloud-oriented benchmarks from Phoenix, with the following observations. Memory-logic integration by 3D RC-interconnected TSV I/Os can result in thermal runaway failures due to strong electrical-thermal coupling. On the other hand, integration by 2.5D transmission-line-interconnected TSI I/Os has shown almost the same energy efficiency and better thermal resilience.  
*Speakers*: Sih-Sian Wu¹, Kanwen Wang², Sai Manoj P. D.¹, Tsung-Yi Ho² and Hao Yu¹  
¹Nanyang Technological University, SG; ²National Cheng Kung University, TW |

# 7.5 Emerging memory technologies

**Date:** Wednesday 26 March 2014  
**Time:** 14:30 - 16:00  
**Location / Room:** Konferenz 3  
**Chair:** Aida Todri, CNRS, FR  
**Co-Chair:** Lars Bauer, KIT, DE

The papers in this session consider ways to improve the energy, performance, and reliability of emerging memory technologies. STT-RAM and PCRAM are addressed.

#### 7.5.1 ASYNCHRONOUS ASYMMETRICAL WRITE TERMINATION (AAWT) FOR A LOW POWER STT-MRAM

**Speakers:**  
Rajendra Bishnoi, Mojtaba Ebrahimi, Fabian Oboril and Mehdi Tahoori, Karlsruhe Institute of Technology, DE

**Abstract**  
Spin Transfer Torque (STT) memory is an emerging and promising non-volatile storage technology. However, the high write current is still a major challenge which leads to a huge power consumption of the memory. Due to an inherent torque asymmetry of the Magnetic Tunnel Junction (MTJ) device employed in STT memories, the switching time between parallel and anti-parallel is significantly different. Hence, the write latencies for writing '0' and '1' are also considerably different. In this paper, we propose a technique called Asynchronous Asymmetrical Write Termination (AAWT) which utilizes this asymmetrical behavior to terminate the write operations asynchronously and as a result significantly reduces the write power consumption. Furthermore, we present two different AAWT implementations to determine the actual write termination times. The first one makes use of a clock signal and the second one employs a self-timing approach based on an internal delay element. As shown by our experimental results, AAWT can considerably reduce the write latency.

#### 7.5.2 WRITE-ONCE-MEMORY-CODE PHASE CHANGE MEMORY

**Speakers:**  
Jiayin Li and Kartik Mohanram, University of Pittsburgh, US

**Abstract**  
This paper describes a write-once-memory-code phase change memory (WOM-code PCM) architecture for next-generation non-volatile memory applications. Specifically, we address the long latency of the write operation in PCM -- attributed to PCM SET -- by proposing a novel PCM memory architecture that integrates WOM-codes at the memory organization and memory controller levels. The proposed (2^2)-to-3 WOM-code PCM architecture is able to reduce memory write (read) latency by 20.1% (10.2%) on average across general-purpose (SPEC CPU2006), embedded (Mibench), and high-performance (SPLASH-2) benchmarks. To further improve the write latency of WOM-code PCM, we propose a PCM-refresh approach that uses idle cycles to preemptively set PCM rows to the initial WOM-code state. Results show that WOM-code PCM with PCM-refresh can reduce memory write (read) latency by 54.9% (47.9%) on average across the benchmarks. Finally, to balance write latency improvements against WOM-code PCM overhead, we propose a WOM-code cached PCM (WCPCM) architecture that uses WOM-code PCM as the cache alongside conventional PCM main memory. For just 4.7% memory overhead, WCPCM reduces memory write (read) latency by 47.2% (44.0%) on average across the benchmarks.
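The idea of a write-once-memory code can be illustrated with a minimal sketch. It assumes the code meant in the abstract behaves like the classic two-write Rivest-Shamir WOM code, which stores a 2-bit value twice in 3 cells whose bits may only change from 0 to 1. The names `FIRST`, `SECOND`, `decode` and `write` are illustrative and not taken from the paper, and the mapping of cell values onto physical PCM states is not modeled.

```python
# Sketch of a two-write WOM code: 2 data bits, 3 write-once cells (0 -> 1 only).
# Assumption: the classic Rivest-Shamir code; not necessarily the paper's exact scheme.

# First-generation codewords (at most one cell set).
FIRST = {0b00: (0, 0, 0), 0b01: (1, 0, 0), 0b10: (0, 1, 0), 0b11: (0, 0, 1)}
# Second-generation codewords are the bitwise complements of the first generation.
SECOND = {value: tuple(1 - bit for bit in cw) for value, cw in FIRST.items()}


def decode(cells):
    """Recover the 2-bit value from either generation of codeword."""
    a, b, c = cells
    return ((b ^ c) << 1) | (a ^ c)


def write(cells, value):
    """Return cell contents that store `value`, never changing a cell from 1 to 0."""
    if decode(cells) == value:
        return cells  # value already stored, nothing to program
    target = FIRST[value] if cells == (0, 0, 0) else SECOND[value]
    if any(old == 1 and new == 0 for old, new in zip(cells, target)):
        # Both generations are used up: the row must be returned to the initial state.
        raise ValueError("cells exhausted; reset to the initial WOM-code state first")
    return target


cells = write((0, 0, 0), 0b01)   # first write
cells = write(cells, 0b10)       # second write, still only 0 -> 1 transitions
print(cells, decode(cells))
```

A third differing write raises an error in this sketch, which corresponds to the point at which a row would have to be brought back to the initial WOM-code state, the step that the abstract's PCM-refresh approach performs during idle cycles.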
#### 7.5.3 IMPROVING STT-MRAM DENSITY THROUGH MULTI-BIT ERROR CORRECTION

**Time:** 15:30  
**Speakers:**  
Brandon Del Bel, Jongyeon Kim, Chris H. Kim and Sachin S. Sapatnekar, University of Minnesota, US

**Abstract**  
STT-MRAMs are prone to data corruption due to inadvertent bit flips. Traditional methods enhance robustness at the cost of area/energy by using larger cell sizes to improve the thermal stability of the MTJ cells. This paper employs multi-bit error correction with DRAM-style refreshing to mitigate errors and provides a methodology for determining the optimal level of correction. A detailed analysis demonstrates that the reduction in non-volatility requirements afforded by strong error correction translates to significantly lower area for the memory array compared to simpler ECC schemes, even when accounting for the increased overhead of error correction.

16:00  IP3-14,  
Speakers:  
Yuhao Wang, Pingfan Kong, Hao Yu and Dennis Sylvester  
¹Nanyang Technological University, SG; ²University of Michigan, US  
Abstract  
The widely applied Advanced Encryption Standard (AES) encryption algorithm is critical in secure big-data storage. Data-oriented applications impose high-throughput and low-power, i.e., energy efficiency (J/bit), requirements when applying AES encryption. This paper explores in-memory AES encryption using the newly introduced domain-wall nanowire. We show that all AES operations can be fully mapped to a logic-in-memory architecture based on the non-volatile domain-wall nanowire, called DW-AES. The experimental results show that DW-AES can achieve the best energy efficiency of 24 pJ/bit, which is 9x and 6.5x better than CMOS ASIC and ReRAM-CMOL implementations, respectively. Under the same area budget, the proposed DW-AES exhibits 6.4x higher throughput and 29% power saving compared to a CMOS ASIC implementation, and 1.7x higher throughput and 74% power reduction compared to a ReRAM-CMOL implementation.

16:01  IP3-15,  
Speakers:  
Boxun Li, Yu Wang, Yiran Chen, Helen Li and Huazhong Yang  
¹Tsinghua University, CN; ²University of Pittsburgh, US  
Abstract  
Emerging neuromorphic computing provides a revolutionary alternative computing architecture and effectively extends Moore’s law. The discovery of the memristor presents a promising hardware realization of neuromorphic systems with very high power efficiency, allowing efficient execution of analog matrix-vector multiplication on the memristor crossbar architecture. However, during computations, the memristor will slowly drift from its initial programmed state, leading to a gradual decline in the computation precision of the memristor crossbar-based computing engine (MCE). In this paper, we propose an inline calibration mechanism to guarantee the computation quality of the MCE. The inline calibration mechanism collects the MCE’s computation error through interrupt-and-benchmark (I&B) operations and predicts the best calibration time through polynomial fitting of the computation error data. We also develop an adaptive technique to adjust the time interval between two neighboring I&B operations and minimize the negative impact of the I&B operation on system performance. The experimental results demonstrate that the proposed inline calibration mechanism achieves a calibration efficiency of 91.18% on average with negligible performance overhead (i.e., 0.439%).

16:02  IP3-16,  
Speakers:  
Yuanfan Yang, Jimson Mathew, Dhiraj K Pradhan, Marco Ottavi and Salvatore Pontarelli  
¹University of Bristol, GB; ²University of Rome "Tor Vergata", IT  
Abstract  
Memristor-based logic and memories are increasingly becoming fundamental building blocks for future system design. Hence, it is important to explore various methodologies for implementing these blocks. In this paper, we present novel Complementary Resistive Switching (CRS) based stateful logic operations using material implication. The proposed solution benefits from an exponential reduction in sneak path current in crossbar-implemented logic. We validated the effectiveness of our solution through SPICE simulations on a number of logic circuits. It is shown that only 4 steps are required to implement an N-input NAND gate, whereas memristor-based stateful logic needs N+1 steps.
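The N+1 step count quoted for conventional memristor-based stateful logic can be reproduced with a small truth-level sketch: one reset of a work cell followed by one material implication per input. This illustrates the baseline only, not the 4-step CRS scheme the paper proposes; the function names are hypothetical, and device behaviour is reduced to 0/1 values.

```python
from itertools import product


def imply(p, q):
    """Material implication p -> q on 0/1 values, i.e. (NOT p) OR q."""
    return (1 - p) | q


def nand_by_implication(inputs):
    """N-input NAND built from a reset plus one IMPLY per input.

    Returns the result and the number of sequential steps used.
    """
    work = 0          # step 1: reset the work cell to logic 0
    steps = 1
    for x in inputs:  # steps 2 .. N+1: work <- x IMPLY work
        work = imply(x, work)
        steps += 1
    return work, steps


# Exhaustive check for a 3-input gate.
for bits in product((0, 1), repeat=3):
    result, steps = nand_by_implication(bits)
    print(bits, result, steps)
```

After the k-th implication the work cell holds the OR of the negated inputs seen so far, which after N inputs is exactly the NAND of all of them.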

16:00  End of session  

# 7.6 Performance and timing analysis

**Date:** Wednesday 26 March 2014  
**Time:** 14:30 - 16:00  
**Location / Room:** Konferenz 4  
**Chair:** Wang Yi, Uppsala University, SE  
**Co-Chair:** Petru Eles, Linköping University, SE

This session includes three papers. The first uses data mining techniques to detect performance bottlenecks to improve the scalability of multicore platforms for embedded applications. The second proposes to use regular expressions for specifying the patterns of deadline misses and hits to relax schedulability analysis for cyber physical systems. The third presents an approach to the scheduling of streaming applications, considering latency constraints and minimization of the number of processors required.
### COMPUTING A LANGUAGE-BASED GUARANTEE FOR TIMING PROPERTIES OF CYBER-PHYSICAL SYSTEMS

**Speakers:**
Neil Dhruva, Pratyush Kumar, Georgia Giannopoulou and Lothar Thiele, ETH Zurich, CH

**Abstract**
Real-time systems are often guaranteed in terms of schedulability, which verifies whether or not all jobs meet their deadlines. However, such a guarantee can be insufficient in certain applications. In this paper, we propose a method to compute a language-based guarantee which provides a more detailed description of the deadline miss patterns of an observed task. The only requirement of our method is that the timing behavior of the real-time system be modelled by a network of timed automata. We compute the language-based guarantee by constructing an equivalent finite state automaton in an iterative manner, using a counter-example guided procedure. We illustrate the language-based guarantee for two applications: design of a networked control system and scheduling in a mixed criticality system. In both cases, we show that the language-based guarantee leads to a more efficient design than the schedulability guarantee.
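To make the contrast with plain schedulability concrete, a language over deadline outcomes can be written as a regular expression and checked against observed traces. The sketch below is only an illustration of that idea: the trace encoding (`h` for a deadline hit, `m` for a miss) and the example language "no two consecutive misses" are assumptions, and the paper's construction from timed automata is not reproduced here.

```python
import re

# Hypothetical guarantee: every miss is immediately followed by a hit,
# except possibly a miss at the very end of the observed trace.
NO_TWO_CONSECUTIVE_MISSES = re.compile(r"(h|mh)*m?")


def satisfies(trace, language=NO_TWO_CONSECUTIVE_MISSES):
    """Return True if the whole hit/miss trace is a word of the given language."""
    return language.fullmatch(trace) is not None


print(satisfies("hhmhmh"))  # isolated misses only
print(satisfies("hmmh"))    # two misses in a row
```

A schedulability guarantee corresponds to the much smaller language `h*`, so a system that fails it may still satisfy a weaker pattern that is sufficient for the application.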

### RESOURCE OPTIMIZATION FOR CSDF-MODELED STREAMING APPLICATIONS WITH LATENCY CONSTRAINTS

**Speakers:**
Di Liu¹, Jelena Spasic¹, Jiali Teddy Zhai¹, Todor Stefanov¹ and Gang Chen²
¹Leiden University, NL; ²Technical University Munich, DE

**Abstract**
In this paper, we study the problem of minimizing the number of processors required for scheduling latency-constrained streaming applications modeled as CSDF graphs, where the actors of a CSDF are executed as strictly periodic tasks. We formalize the problem and prove that due to the strict periodicity of the actors in the problem, the optimization problem is a convex programming problem, that can be solved efficiently by using an existing convex programming solver. We evaluate our solution approach on a set of 13 real-life streaming applications modeled as CSDF graphs and demonstrate that it can reduce the number of processors in more than 20% of the conducted experiments in comparison to an existing approach.

### A LAYERED APPROACH FOR TESTING TIMING IN THE MODEL-BASED IMPLEMENTATION

**Speakers:**
BaekGyu Kim¹, Hyeon Il Hwang², Taejoon Park¹, Sanghyuk Son² and Insup Lee¹
¹University of Pennsylvania, US; ²Daegu Gyeongbuk Institute of Science & Technology, KR

**Abstract**
Model-based implementation derives an implementation from a model that has been shown to meet its requirements. Even though this approach can be used to guarantee that an implementation satisfies functional requirements that are shown to be correct at the model level, it is still challenging to assure timing requirements at the implementation level. We propose a layered approach to testing the timing requirements conformance of implemented systems developed by model-based implementation. In our approach, the abstraction boundary of the implemented system is formally defined using Parnas’ four-variable model. Then, the proposed approach tests timing aspects of the interaction between the auto-generated code and the target platform-dependent code based on the four variables. This approach aims not only at detecting timing requirement violations, but also at measuring the delay segments that contribute to the timing deviation of the implemented system w.r.t. the model. We present a case study of testing the timing requirements of an infusion pump system to illustrate the applicability of the proposed framework.

### MODEL-BASED PROTOCOL LOG GENERATION FOR TESTING A TELECOMMUNICATION TEST HARNESS USING CLP

**Speakers:**
Kenneth Balck¹, Olga Grinchtein¹ and Justin Pearson²
¹Ericsson AB, SE; ²Uppsala University, SE

**Abstract**
Within telecommunications development it is vital to have frameworks and systems to replay complicated scenarios on equipment under test, yet often there are not enough scenarios available. In this paper we study the problem of testing a test harness, which replays scenarios and analyses protocol logs for the Public Warning System Service, which is a part of the Long Term Evolution (LTE) 4G standard. Protocol logs are sequences of messages with timestamps and are generated by different mobile network entities. In our case study we focus on user equipment protocol logs. In order to test the test harness we require logs with both correct and incorrect behaviour. It is easy to collect logs from real system runs, but these logs do not show much variation in the behaviour of the system under test. We present an approach where we use constraint logic programming (CLP) for both modelling and test generation, where each test case is a protocol log. In this case study, we uncovered previously unknown faults in the test harness.

### TIME-DECOUPLED PARALLEL SYSTEM SIMULATION

**Speakers:**
Jan Weinstock¹, Christoph Schumacher¹, Rainer Leupers¹, Gerd Ascheid¹ and Laura Tosoratto²
¹RWTH Aachen, DE; ²Istituto Nazionale di Fisica Nucleare, Sezione di Roma, IT

**Abstract**
With increasing system size and complexity, designers of embedded systems face the challenge of efficiently simulating these systems in order to enable target specific software development and design space exploration as early as possible. Today’s multicore workstations offer enormous computational power, but traditional simulation engines like the OSCI SystemC kernel only operate on a single thread, thereby leaving a lot of computational potential unused. Most modern embedded system designs include multiple processors. This work proposes SScope, a SystemC kernel that aims at exploiting the inherent parallelism of such systems by simulating the processors on different threads. A lookahead mechanism is employed to reduce the required synchronization between the simulation threads, thereby further increasing simulation speed. The virtual prototype of the European FP7 project EURETILE system simulator is used as demonstrator for the proposed work, showing a speedup of 4.01x on a four core host system compared to sequential simulation.

### A UNIFIED METHODOLOGY FOR A FAST BENCHMARKING OF PARALLEL ARCHITECTURE

**Speakers:**
Alexandre Guerre, Jean-Thomas Acquaviva and Yves Lhuillier, CEA LIST, FR

**Abstract**
Benchmarking of architectures is today jeopardized by the explosion of parallel architectures and the dispersion of parallel programming models. Parallel programming requires architecture-dependent compilers and languages as well as high programmer expertise. Thus, an objective comparison has become a hard task. This paper presents a novel methodology to evaluate and compare parallel architectures in order to ease the programmer's work. It is based on the usage of micro-benchmarks, code profiling and characterization tools. The main contribution of this methodology is a semi-automatic prediction of the performance of sequential applications on a set of parallel architectures. In addition, the performance estimation is correlated with the cost of other criteria such as power or portability. Our methodology's prediction was validated on an industrial application. Results are within a range of 20%.

16:00 End of session


**Date:** Wednesday 26 March 2014  
**Time:** 14:30 - 16:00  
**Location / Room:** Exhibition Theatre

**Moderator:** Oliver Bringmann, University of Tübingen, DE

Fully-Depleted Silicon On Insulator (FD-SOI) is emerging as a promising solution to continue the CMOS scaling roadmap at the 22nm technology node and beyond, especially for low-power and System-on-Chip applications. After a short introduction to FD-SOI technology, this panel discusses the role of FD-SOI as the key enabling technology to tackle the challenges of the major European application domains. This includes the creation of a European ecosystem that gives industry and SMEs easy access to a leading-edge semiconductor technology at manageable cost. The panelists look at FD-SOI from different perspectives and discuss the technology, SME, application, EDA, and research viewpoints on FD-SOI and its impact on European industry.
Interactive Presentations run simultaneously during a 30-minute slot. A poster associated to the IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award 'Best IP of the Day' is given.

**IP3 Interactive Presentations**

**Date:** Wednesday 26 March 2014  
**Time:** 16:00 - 16:30  
**Location / Room:** Conference Level, foyer


| Label | Presentation Title | Authors |
|---|---|---|
| IP3-1 | **DESIGN AND FABRICATION OF A 315 μH BONDWIRE MICRO-TRANSFORMER FOR ULTRA-LOW VOLTAGE ENERGY HARVESTING** | Enrico Macrelli, Ningning Wang, Saibal Roy, Michael Hayes, Marco Tartagni and Aldo Romani; I-DEI, University of Bologna, IT; Tyndall National Institute, UCC, IE; ICN-IEIIT, University of Bologna, IT |

**Abstract**  
This paper presents a design study of a new topology for miniaturized bondwire transformers fabricated and assembled with standard IC bonding wires and a toroidal ferrite (Fair-Rite 5975000801) as the magnetic core. The micro-transformer, realized on a PCB substrate, enables magnetics to be built on top of the chip, thus leading to the design of high power density components. Impedance measurements in a frequency range from 100 kHz to 5 MHz show that the secondary self-inductance is enhanced from 0.3 μH with an epoxy core to 315 μH with the ferrite core. Moreover, the micro-machined ferrite improves the coupling coefficient from 0.1 to 0.9 and increases the effective turns ratio from 0.5 to 35. Finally, a low-voltage IC DC-DC converter solution, with the transformer mounted on top, is proposed for energy harvesting applications.

| IP3-2 | **PROVIDING REGULATION SERVICES AND MANAGING DATA CENTER PEAK POWER BUDGETS** | Baris Aksanli and Tajana Rosing, University of California San Diego, US |

**Abstract**  
Data centers are good candidates for providing regulation services in the power markets due to their large power consumption and flexibility. In this paper, we develop a framework that explores the feasibility of data center participation in these markets. We use a battery-based design that can not only help with providing ancillary services, but can also limit peak power costs without any workload performance degradation. The results of our study using data for a 21MW data center show up to $480,000/year savings can be obtained, corresponding to 1280 more servers providing services.

| IP3-3 | **THE ENERGY BENEFIT OF LEVEL-CROSSING SAMPLING INCLUDING THE ACTUATOR’S ENERGY CONSUMPTION** | Burkhard Hensel and Klaus Kabitzsch, Dresden University of Technology, DE |

**Abstract**  
When using level-crossing (also called send-on-delta) sampling in control loops, messages can be saved compared to periodic sampling without degrading control performance. While it is clear that reducing messages also improves the energy efficiency of battery-powered sensor devices, it can be disadvantageous for the energy efficiency of the actuator device. This paper addresses the question of under which conditions level-crossing sampling is more energy-efficient than periodic sampling for the actuator device as well. It is shown that there is an optimum inter-sample interval. Methods for reaching this optimum by appropriate controller and transmission settings are given. The theory is demonstrated using several known, standardized wireless network protocols.
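For readers unfamiliar with the sampling scheme, a minimal sketch of send-on-delta on the sensor side follows: a new message is transmitted only when the measurement has moved at least `delta` away from the last transmitted value. The function name, the threshold convention (greater than or equal) and the example data are assumptions for illustration; the energy model discussed in the paper is not part of this sketch.

```python
def send_on_delta(samples, delta):
    """Return the (index, value) pairs a send-on-delta sensor would transmit.

    The first sample is always sent; afterwards a sample is sent only if it
    differs from the last transmitted value by at least `delta`.
    """
    if not samples:
        return []
    sent = [(0, samples[0])]
    last_sent = samples[0]
    for index, value in enumerate(samples[1:], start=1):
        if abs(value - last_sent) >= delta:
            sent.append((index, value))
            last_sent = value
    return sent


measurements = [0, 1, 2, 6, 7, 12, 12]
messages = send_on_delta(measurements, delta=5)
print(messages)                      # transmitted samples
print(len(messages), "of", len(measurements), "samples sent")
```

With periodic sampling all seven values would be transmitted; here only three are, which is the message saving the abstract refers to on the sensor side.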

| IP3-4 | **SKETCHLOG: SKETCHING COMBINATIONAL CIRCUITS** | Andrew Becker, David Novo and Paolo Ienne, École Polytechnique Fédérale de Lausanne, CH |

**Abstract**  
Despite the progress of higher-level languages and tools, Register Transfer Level (RTL) is still by far the dominant input format for high performance digital designs. Experienced designers can directly express their microarchitectural intuitions in RTL. Yet, RTL is terribly verbose, burdened with trivial details, and thus error prone. In this paper, we augment a modern RTL language (Chisel) with new semantic elements to express an imprecise specification: a sketch. We show how, in combination with a naive, unoptimized, but functionally correct reference, a designer can utilize the language and supporting infrastructure to focus on the key design intuition and omit some of the necessary details. The resulting design is exactly or almost exactly as good as the one the designer could have achieved by spending the time to manually complete the sketch. We show that, even limiting ourselves to combinational circuits, realistic instances of meaningful design problems are solved quickly, saving considerable design and debugging effort.

| IP3-5 | **TOWARDS VERIFYING DETERMINISM OF SYSTEMC DESIGNS** | Hoang M. Le and Rolf Drechsler, University of Bremen, DE |

**Abstract**  
Ensuring the correctness of high-level SystemC designs is an important and challenging problem in today’s Electronic System Level (ESL) methodology. Prevalently, a design is checked against a functional specification given by, e.g., a testcase with reference output or a user-defined property. Another research direction takes the view of a SystemC design as a piece of concurrent software. The design is then checked for common concurrency problems and thus a functional specification is not required. Along this line, several methods for deadlock detection and race analysis have been developed. In this work, we propose to consider a new concurrency verification problem, namely input-output determinism, for SystemC designs. This means that for each possible input, the design must produce the same output under any valid process schedule. We argue that determinism verification is stronger than both deadlock detection and race analysis. Besides being an attractive correctness criterion in itself, proven determinism helps to accelerate both simulative and formal verification. We also present a preliminary study to show the feasibility of determinism verification for SystemC designs.

| IP3-6 | **USING GUIDED LOCAL SEARCH FOR ADAPTIVE RESOURCE RESERVATION IN LARGE-SCALE EMBEDDED SYSTEMS** | Timon ter Braak, University of Twente, NL |

**Abstract**  
To maintain a predictable execution environment, an embedded system must ensure that applications are, in advance, provided with sufficient resources to process tasks, exchange information and to control peripherals. The problem of assigning tasks to processing elements with limited resources, and routing communication channels through a capacitated interconnect is combined into an integer linear programming formulation. We describe a guided local search algorithm to solve this problem at run-time. This algorithm allows for a hybrid strategy where configurations computed at design-time may be used as references to lower the computational overhead at run-time. Computational experiments on a dataset with 100 tasks and 20 processing elements show the effectiveness of this algorithm compared to state-of-the-art solvers CPLEX and Gurobi. The guided local search algorithm finds an initial solution within 100 milliseconds, is competitive for small platforms, scales better with the size of the platform, and has lower memory usage (2-19%).
**IP3-7**

**ACCELERATING GRAPH COMPUTATION WITH RACETRACK MEMORY AND POINTER-ASSISTED GRAPH REPRESENTATION**

**Speakers:**
Eunhyek Park¹, Helen Li², Sungjoon Yoo¹ and Sunggu Lee¹

¹POSTECH, KR; ²Univ. of Pittsburgh, US

**Abstract**

The poor performance of NAND Flash memory, such as long access latency and large granularity access, is the major bottleneck of graph processing. This paper proposes a hybrid non-volatile (NV) SRAM cell with a new scheme for SEU tolerance. The proposed NVRAM cell consists of a 6T SRAM core and a Resistive RAM (RRAM), made of a 1T and a Programmable Metallization Cell (PMC). The proposed cell has concurrent error detection (CED) and correction capabilities; CED is accomplished using a dual-rail checker, while correction is accomplished by utilizing the restore operation; data from the non-volatile memory element is copied back to the SRAM core. The dual-rail checker utilizes two XOR gates each made of 2 inverters and 2 ambipolar transistors, hence, it has a hybrid nature. Extensive simulation results are provided. The simulation results show that the proposed scheme is very efficient in terms of numerous figures of merit such as delay and circuit complexity and thus applicable to integrated circuits such as FPGAs requiring secure on-chip non-volatile storage (i.e. LUTs) for multi-context configurability.

**IP3-8**

**BATTERY AWARE STOCHASTIC QOS BOOSTING IN MOBILE COMPUTING DEVICES**

**Speakers:**
Hao Shen, Qiuwen Chen and Qinru Qiu, Syracuse University, US

**Abstract**

Battery-aware stochastic QoS boosting is proposed to enhance the mobile device's QoS without significantly increasing the risk of battery depletion. We develop an online stochastic control policy to balance the resource management and battery depletions for energy-efficient mobile devices. Users' resource requirements are assumed to be independently and identically distributed (i.i.d.) over time. Our solution is composed of two components: a) a resource allocation algorithm in the mobile device, and b) an online stochastic QoS control policy. Mobile devices' QoS is improved by using a two-level control policy: a coarse-grained control that optimizes the overall QoS based on users' QoS requirements and a fine-grained control that optimizes the QoS for each user. The proposed solution has been implemented in the mobile device and validated using real-world mobile applications. Simulation results show that the proposed solution achieves significant improvements in performance and QoS without significantly increasing the chance of battery depletion.

**IP3-9**

**A THERMAL RESILIENT INTEGRATION OF MANY-CORE MICROPROCESSORS AND MAIN MEMORY BY 2.5D TSI I/OS**

**Speakers:**
Sih-Sian Wu¹, Eunhyek Park¹, Helen Li², Sungjoon Yoo¹ and Hao Yu¹

¹National Cheng Kung University, TW; ²University of Pittsburgh, US

**Abstract**

This paper presents an intelligent storage for graph processing which is based on fast and low cost racetrack memory and a pointer-assisted graph representation. Our goal is to maximize the quality of service (QoS) provided by the mobile device (i.e., smartphone), while keeping the risk of battery depletion below a given threshold. A Markov Decision Process (MDP) is constructed from the user behavior. The optimal management policy is solved using linear programming. Simulations based on real user traces validate that, compared to existing battery energy management techniques, the stochastic control performs better in boosting the mobile devices’ QoS without significantly increasing the chance of battery depletion.

**IP3-10**

**LEVERAGING ON-CHIP NETWORKS FOR EFFICIENT PREDICTION ON MULTICORE COHERENCE**

**Speaker:**
Libo Huang, National University of Defense Technology, CN

**Abstract**

Coherent data prediction is introduced as a promising architectural technique for reducing cache-to-cache accesses in directory protocol. However, limited on-chip resources cause the accuracy of current prediction to be generally low. Low accuracy would result in a large number of unnecessary or incorrect predictions, which would consequently generate excessive network traffic. This leads to large power and performance overhead for coherent memory access. This paper proposes an easy abort mechanism (EBT) that leverages NoC design to reduce the negative effect of wrong prediction operations, thus facilitating overall performance improvement and traffic reduction. Using detailed full-system simulations, we conclude that EBT provides a cost-effective solution for designing efficient multicore processors. To the best of our knowledge, this study is the first to leverage on-chip network for the prediction optimization on multicore coherence.
IP3-13

AN ADAPTIVE MEMORY INTERFACE CONTROLLER FOR IMPROVING BANDWIDTH UTILIZATION OF HYBRID AND RECONFIGURABLE SYSTEMS

Speakers:
Vito Giovanni Castellana 1, Antonino Tumeo 2 and Fabrizio Ferrandi 1
1 Politecnico di Milano, DEIB, IT; 2 Pacific Northwest National Laboratory, US

Abstract

Data mining, bioinformatics, knowledge discovery, and social network analysis are emerging irregular applications that exploit data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and generally are memory bandwidth driven, but also present large amounts of inherent dynamic parallelism because they can potentially spawn concurrent activities for each one of the elements they are exploring. Hybrid architectures, which integrate general purpose processors with reconfigurable devices, appear to be promising target platforms for accelerating irregular applications. These systems often connect to distributed and multi-ported memories, potentially enabling parallel memory operations. However, these memory architectures introduce several challenges, such as the necessity to manage concurrency and synchronization to avoid structural conflicts on shared memory locations and to guarantee consistency. In this paper we present an adaptive Memory Interface Controller (MIC) that addresses these issues. The MIC is a general and customizable solution that can target several different memory structures, and is suitable for High Level Synthesis frameworks. It implements a dynamic arbitration scheme, which avoids conflicts on memory resources at runtime, and supports atomic memory operations, commonly exploited for synchronization directives in parallel programming paradigms. The MIC simultaneously maps multiple accesses to different memory ports, allowing fine grained parallelism exploitation and ensuring correctness also in the presence of irregular and statically unpredictable memory access patterns. We evaluated the effectiveness of our approach on a typical irregular kernel, graph Breadth First Search (BFS), exploring different design alternatives.

IP3-14

ENERGY EFFICIENT IN-MEMORY AES ENCRYPTION BASED ON NONVOLATILE DOMAIN-WALL NANOWIRE

Speakers:
Yuhao Wang 1, Pingfan Kong 1, Hao Yu 1 and Dennis Sylvester 2
1 Nanyang Technological University, SG; 2 University of Michigan, US

Abstract

The widely applied Advanced Encryption Standard (AES) encryption algorithm is critical in secure big-data storage. Data oriented applications have imposed high throughput and low power, i.e., energy efficiency (J/bit), requirements when applying AES encryption. This paper explores an in-memory AES encryption using the newly introduced domain-wall nanowire. We show that all AES operations can be fully mapped to a logic-in-memory architecture by non-volatile domain-wall nanowire, called DW-AES. The experimental results show that DW-AES can achieve the best energy efficiency of 24 pJ/bit, which is 9X and 6.5X better than CMOS ASIC and ReRAM-CMOL implementations, respectively. Under the same area budget, the proposed DW-AES exhibits 6.4X higher throughput and 29% power saving compared to a CMOS ASIC implementation, and 1.7X higher throughput and 74% power reduction compared to a ReRAM-CMOL implementation.

IP3-15

ICE: INLINE CALIBRATION FOR MEMRISTOR CROSSBAR-BASED COMPUTING ENGINE

Speakers:
Boxun Li 1, Yu Wang 1, Yiran Chen 2, Helen Li 3 and Huazhong Yang 1
1 Tsinghua University, CN; 2 University of Pittsburgh, US

Abstract

The emerging neuromorphic computing provides a revolutionary solution to the alternative computing architecture and effectively extends Moore’s Law. The discovery of the memristor presents a promising hardware realization of neuromorphic systems with incredible power efficiency, allowing efficient execution of analog matrix-vector multiplication on the memristor crossbar architecture. However, during computations, the memristor will slowly drift from its initial programmed state, leading to a gradual decline of the computation precision of the memristor crossbar-based computing engine (MCE). In this paper, we propose an inline calibration mechanism to guarantee the computation quality of the MCE. The inline calibration mechanism collects the MCE’s computation error through interrupt-and-benchmark (I&B) operations and predicts the best calibration time through polynomial fitting of the computation error data. We also develop an adaptive technique to adjust the time interval between two neighboring I&B operations and minimize the negative impact of the I&B operation on system performance. The experimental results demonstrate that the proposed inline calibration mechanism achieves a calibration efficiency of 91.18% on average and negligible performance overhead (i.e., 0.439%).
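
The idea of predicting a calibration time from fitted error data can be sketched as follows. This is an illustration under stated assumptions, not the ICE mechanism itself: it uses a degree-1 polynomial (a straight line), made-up sample values, and a hypothetical rule that calibration is due when the fitted error reaches a chosen threshold. The function names are invented for the example.

```python
def fit_line(times, errors):
    """Least-squares fit of errors ~ intercept + slope * time (degree-1 polynomial)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_e = sum(errors) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (e - mean_e) for t, e in zip(times, errors))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_t
    return intercept, slope

def predict_crossing(intercept, slope, threshold):
    """Time at which the fitted error line reaches the threshold."""
    return (threshold - intercept) / slope

# Hypothetical error samples, as if collected by periodic benchmark runs.
times = [0.0, 1.0, 2.0, 3.0]
errors = [0.01, 0.02, 0.03, 0.04]

intercept, slope = fit_line(times, errors)
t_cal = predict_crossing(intercept, slope, threshold=0.10)
print(f"fitted slope {slope:.4f}, predicted calibration time {t_cal:.2f}")
```

A higher-degree fit would follow the same pattern, with the crossing time found numerically instead of in closed form.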

IP3-16

COMPLEMENTARY RESISTIVE SWITCH BASED STATEFUL LOGIC OPERATIONS USING MATERIAL IMPLICATION

Speakers:
Yuanfan Yang 1, Jingxin Mathew 1, Dhiraj K Pradhan 1, Marco Ottavi 2 and Salvatore Pontarelli 2
1 University of Bristol, GB; 2 University of Rome “Tor Vergata”, IT

Abstract

Memristor based logic and memories are increasingly becoming one of the fundamental building blocks for future system design. Hence, it is important to explore various methodologies for implementing these blocks. In this paper, we present novel Complementary Resistive Switching (CRS) based stateful logic operations using material implication. The proposed solution benefits from an exponential reduction in sneak path current in crossbar-implemented logic. We validated the effectiveness of our solution through SPICE simulations on a number of logic circuits. It has been shown that only 4 steps are required to implement an N-input NAND gate, whereas memristor-based stateful logic needs N+1 steps.
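
The N+1 step count quoted for conventional memristor-based stateful logic can be reproduced with a small Python sketch. It models only the Boolean behaviour of the commonly described IMPLY sequence (one reset, then one implication per input); it does not model the CRS-based 4-step scheme, whose details are not given in the abstract. The function names are hypothetical.

```python
from itertools import product

def imply(p, q):
    """Material implication p -> q, i.e. (not p) or q, on 0/1 values."""
    return (1 - p) | q

def nand_via_imply(inputs):
    """N-input NAND as one reset followed by N IMPLY steps (N + 1 in total)."""
    work = 0                      # step 1: reset the working device to 0
    steps = 1
    for p in inputs:              # steps 2..N+1: work <- p IMPLY work
        work = imply(p, work)
        steps += 1
    return work, steps

def nand_reference(inputs):
    """Plain Boolean NAND used as the expected result."""
    return 0 if all(inputs) else 1

# Exhaustively compare against the reference for 2-, 3- and 4-input gates.
for n in (2, 3, 4):
    for bits in product((0, 1), repeat=n):
        value, steps = nand_via_imply(bits)
        assert value == nand_reference(bits)
        assert steps == n + 1
```

After the reset, each step ORs the complement of one input into the working value, so the final value is the OR of all complemented inputs, which is the NAND.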

IP3-17

A LAYERED APPROACH FOR TESTING TIMING IN THE MODEL-BASED IMPLEMENTATION

Speakers:
BaekGyu Kim 1, Hyeon I Hwang 1, Taejoon Park 2, Sanghyuk Son 2 and Insub Lee 1
1 University of Pennsylvania, US; 2 Daegu Gyeongbuk Institute of Science & Technology, KR

Abstract

Model-based implementation is an approach that derives an implementation from a model that has been shown to meet its requirements. Even though this approach can be used to guarantee that an implementation satisfies functional requirements that are shown to be correct at the model level, it is still challenging to assure timing requirements at the implementation level. We propose a layered approach to testing the timing requirements conformance of implemented systems developed by model-based implementation. In our approach, the abstraction boundary of the implemented system is formally defined using Parnas’ four-variables model. Then, the proposed approach tests timing aspects of the interaction between the auto-generated code and the target platform-dependent code based on the four variables. This approach aims not only at detecting timing requirement violations, but also at measuring the delay segments that contribute to the timing deviation of the implemented system w.r.t. the model. We present a case study of testing the timing requirements of an infusion pump system to illustrate the applicability of the proposed framework.

IP3-18

MODEL-BASED PROTOCOL LOG GENERATION FOR TESTING A TELECOMMUNICATION TEST HARNESS USING CLP

Speakers:
Kenneth Balck 1, Olga Grinchtein 1 and Justin Pearson 2
1 Ericsson AB, SE; 2 Uppsala University, SE

Abstract

Within telecommunications development it is vital to have frameworks and systems to replay complicated scenarios on equipment under test, but often there are not enough scenarios available. In this paper we study the problem of testing a test harness, which replays scenarios and analyses protocol logs for the Public Warning System service, which is a part of the Long Term Evolution (LTE) 4G standard. Protocol logs are sequences of messages with timestamps and are generated by different mobile network entities. In our case study we focus on user equipment protocol logs. In order to test the test harness we require that logs have both incorrect and correct behaviour. It is easy to collect logs from real system runs, but these logs do not show much variation in the behaviour of the system under test. We present a proposal where we use constraint logic programming (CLP) for both modelling and test generation, where each test case is a protocol log. In this case study, we uncovered previously unknown faults in the test harness.
TIME-DECOUPLED PARALLEL SYSTEMC SIMULATION

Speakers:
Jan Weinstock1, Christoph Schumacher1, Rainer Leupers1, Gerd Ascheid1 and Laura Tosoratto2
1RWTH Aachen, DE; 2Istituto Nazionale di Fisica Nucleare, Sezione di Roma, IT

Abstract
With increasing system size and complexity, designers of embedded systems face the challenge of efficiently simulating these systems in order to enable target specific software development and design space exploration as early as possible. Today’s multicore workstations offer enormous computational power, but traditional simulation engines like the OSCI SystemC kernel only operate on a single thread, thereby leaving a lot of computational potential unused. Most modern embedded system designs include multiple processors. This work proposes SCore, a SystemC kernel that aims at exploiting the inherent parallelism of such systems by simulating the processors on different threads. A lookahead mechanism is employed to reduce the required synchronization between the simulation threads, thereby further increasing simulation speed. The virtual prototype of the European FP7 project EURETILE system simulator is used as demonstrator for the proposed work, showing a speedup of 4.01x on a four-core host system compared to sequential simulation.

A UNIFIED METHODOLOGY FOR A FAST BENCHMARKING OF PARALLEL ARCHITECTURE

Speakers:
Alexandre Guerre, Jean-Thomas Acquaviva and Yves Lhuillier, CEA LIST, FR

Abstract
Benchmarking of architectures is today jeopardized by the explosion of parallel architectures and the dispersion of parallel programming models. Parallel programming requires architecture dependent compilers and languages as well as high programmer expertise. Thus, an objective comparison has become a harder task. This paper presents a novel methodology to evaluate and to compare parallel architectures in order to ease the programmer's work. It is based on the usage of micro-benchmarks, code profiling and characterization tools. The main contribution of this methodology is a semi-automatic prediction of the performance for sequential applications on a set of parallel architectures. In addition, the performance estimation is correlated with the cost of other criteria such as power or portability. Our methodology prediction was validated on an industrial application. Results are within a range of 20%.

VIDEO-BASED ABSOLUTE NAVIGATION APPROACH: A NOVEL APPROACH FOR VIDEO-BASED ABSOLUTE NAVIGATION IN SPACE EXPLORATION MISSIONS

Authors:
Pascal Trotta, Tadewos Getahun Tadewos, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT

Abstract
Nowadays, space agencies have increased their research efforts in order to enhance the success rate of space exploration missions. Future space missions will increasingly adopt Video Based Navigation (VBN) systems to assist the entry, descent and landing (EDL) phase of space modules. This poster will show a preliminary work on a novel approach for Video-based Absolute Navigation (VBNM). Moreover, the poster depicts how a VBA processing chain can exploit FPGA devices to achieve high throughput. Several visual results will be shown to highlight the peculiarities of the proposed approach.

HIPACC: AUTOMATIC GPU CODE GENERATION FOR ANDROID

Authors:
Oliver Reichel1, Richard Membarth2, Frank Hannig1 and Jürgen Teich1
1University of Erlangen-Nuremberg, DE; 2Saarland University, DE

Abstract
We present the Heterogeneous Image Processing Acceleration (HIPAcc) framework. It allows programmers to develop image preprocessing applications while providing high productivity, flexibility and portability as well as competitive performance. The same algorithm description serves as the basis for targeting different GPU accelerators and low-level languages. Imaging algorithms can thus be expressed in a compact and productive way by using a domain-specific language (DSL) that is embedded into C++ code. Using the HIPAcc source-to-source compiler, DSL code is compiled to CUDA, OpenCL, C/C++, or even Renderscript code, which targets heterogeneous architectures on recent MPSoCs running Android. Programming those MPSoCs can be challenging, in particular when targeting different architectures (CPU/GPU/DSP). HIPAcc lifts this burden from programmers by automatically applying source code transformations based on domain knowledge and a built-in architecture model. This demonstration shows the seamless integration of HIPAcc into the Android Developer Tools and provides a live comparison of generated code to functionally identical handwritten naive implementations of image filters on recent MPSoCs running Android.

GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES

Authors:
Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT

Abstract
Gemini is a synthesis and optimization software for graphene-based digital devices. Given a combinational circuit description through its boolean representation, Gemini produces a SPICE netlist mapped with graphene PN-Junction gates. The software is composed of a parser library to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library used to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. As a stand-alone software or as a library easy to integrate into state-of-the-art tools, Gemini represents a first step of an enabling technology for future synthesis and optimization processes for graphene-based devices.

TOMAHAWK2: PERFORMANCE IMPACT OF INSTRUCTION SET ARCHITECTURE EXTENSIONS FOR DYNAMIC TASK SCHEDULING UNITS

Author:
Oliver Arnold, Technische Universität Dresden, DE

Abstract
In this demo a heterogeneous MPSoC is controlled by a dynamic task scheduling unit called CoreManager. The instruction set architecture of this unit has been extended to improve performance for dynamic data dependency checking, task scheduling, processing element allocation and data transfer management. The MPSoC as well as the NOC are integrated in a cycle-accurate virtual system prototype. The performance impact of the CoreManager is analyzed on component as well as on system level.

LEGO: TOOLS FOR HYBRID INTEGRATION

Author:
Fredrik Jonsson, Royal Institute of Technology, SE

Abstract
Performance of printed devices depends on the geometry, but is also affected by processing steps of other components integrated onto the same substrate. Since different designs use different devices, process stack, models and design rules must be dynamically determined. In this work we propose and demonstrate an experimental design flow to allow efficient design of hybrid and printed electronic circuits.
© 2014 IEEE. All rights reserved.

UB08.07

**UVM-SystemC-AMS: UVM Standard-Compliant SystemC (AMS)-Based Verification Framework for Heterogeneous Systems**

*Authors:*
Zhi Wang1, Yao Li1, Marie-Minerve Louerat1, François Pecheux2, Martin Barnasconi1, Thilo Vörtler1 and Karsten Einwich1

1Laboratoire d’Informatique de Paris 6, FR; 2UPMC-LIP6, FR; 3NXP, NL; 4Fraunhofer IIS, DE

**Abstract**

Today’s societal needs for innovative products in terms of communication, mobility, health, entertainment, and safety directly impact microelectronics design methodologies. The presented systems are increasingly software-driven, digitally assisted, complex and heterogeneous, but existing verification methodologies are mostly focused on pure digital devices and are completely decoupled from analog verification. This presentation shows how the principles of the new UVM methodology can be soundly enhanced to offer the test designer a flexible framework for the virtual prototyping of multi-discipline testbenches that supports both digital and Analog Mixed-Signal (AMS) at the architectural level.


---

19:30 **DATE Party** in “Gläserne Manufaktur” of the Volkswagen AG

The DATE Party is again scheduled on the second conference day, Wednesday, March 26, 2014, starting from 19:30 h. This year, it will take place in one of Dresden’s most exciting and modern buildings, the “Gläserne Manufaktur” of the car manufacturer Volkswagen AG (www.glaesernemanufaktur.de/en). The party will feature a flying buffet style dinner with various catering points and accompanying drinks. Light background music and the possibility of guided visits through the extraordinary premises will round off the evening. It provides a perfect opportunity to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities. Please kindly note that it is not a seated dinner. All delegates, exhibitors and their guests are encouraged to attend the party. Please be aware that entrance is only possible with a party ticket. Each full conference registration includes a ticket for the DATE Party. Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Ticket price for the full Evening Social Programme: 75 € per person.

---

**8.1 SPECIAL DAY System Simulation and Virtual Prototyping**

**Date:** Wednesday 26 March 2014

**Time:** 17:00 - 18:30

**Location / Room:** Saal 1

---

**Organiser:**
Johannes Stahl, Synopsys, US

**Chair:**
Johannes Stahl, Synopsys, US

In this session we will review several practical applications of virtual prototyping for architecture design work and software development across different markets such as mobile, industrial and automotive. The authors will share their practical experiences in using the virtual prototyping methodology and current commercial tools.

| Time | Label | Presentation Title | Authors |
| --- | --- | --- | --- |
| 17:00 | 8.1.1 | **POWER MODELING AND ANALYSIS IN EARLY DESIGN PHASES** | Bernhard Fischer, Christian Cech and Hannes Muhr, Siemens, AT |
| | | **Abstract** | Low power consumption of electronic devices has been an important requirement for many cyber-physical systems in the field. Today, power dissipation is often estimated by spreadsheet-based power analysis. A leading-edge high-level power analysis method has the objective of providing high confidence levels in early design stages, where power design decisions have severe impact. This work examines and compares three high-level power analysis approaches (spreadsheet-based, Synopsys Platform Architect MCO, and DOCEA Aceplorer) by means of an industrial use case. |
| 17:30 | 8.1.2 | **SYSTEM-LEVEL DESIGN METHODOLOGY ENABLING FAST DEVELOPMENT OF BASEBAND MP-SOC FOR 4G SMALL CELL BASE STATION** | Shan Tang, Zhu Ziyuan and Yongtao Su, Institute of Computing Technology, Chinese Academy of Sciences, CN |
| | | **Abstract** | “Small Cell” is regarded as the solution to optimize 4G wireless networks with improved coverage and capacity, and is expected to be deployed in large numbers. To meet performance requirements and special constraints on cost and size, we design a heterogeneous multi-processor SoC for small cell base stations, which is composed of ASP (Application Specific Processor) cores, hardware accelerators, a general-purpose processor core, and infrastructure and interface blocks. The challenges of developing such a complex chip drive us to employ a system-level design methodology in both single-core and multi-core architecture optimizations. The paper discusses in detail the LISA (Language for Instruction-Set Architectures)/SystemC based ASP-algorithm joint optimization, and task-graph driven multi-core architecture exploration. Finally, the results of silicon implementation on SMIC 55nm technology are presented. |
| 18:00 | 8.1.3 | **VIRTUAL PROTOTYPE LIFE CYCLE IN AUTOMOTIVE APPLICATIONS** | Manfred Thanner, Freescale, Germany, DE |
| | | **Abstract** | Virtual prototypes for automotive applications see a unique life cycle in the context of the supply chain from semiconductor vendors to OEMs and within the ecosystem. The presentation gives an overview of current experiences and findings in this field and of the challenges observed. The virtual platforms target the mid to high end application spaces, from chassis to powertrain and driver information systems. The use cases today primarily address semiconductor-internal developments and Tier1-level deployment. Additionally, different software vendors use the models in their development cycle, which drives model requirements like stimulus and abstraction levels. The development of virtual prototypes often starts with the reuse of existing cores, accelerators and IP models. These models had certain use cases to address and were created accordingly. Therefore the models sometimes do not fully match the requirements of the overall virtual prototype, and compromises are made. Further to this, models often come from different design centers, vendors, etc. This can lead to model features that conflict with the primary use case requirements of the virtual platform for the intended usage. Examples are cycle accuracy vs. functional accuracy, and correct behavior vs. error behavior and error injection. The virtual platform life cycle is also affected by the availability and integration of 3rd party IP models, which adds commercial terms and license dependencies. Further to this, the virtual prototypes need to be integrated or connected to the EDA environments of the “receiving companies”. In the deployment phase of the virtual prototype within the automotive ecosystem, a supply chain needs to be in place. This creates challenges in terms of model interfaces, tool compatibility and integration, and the support chain. |

**8.2 Hot Topic: Near Threshold Computing (NTC)**

**Date:** Wednesday 26 March 2014

**Time:** 17:00 - 18:30

---

To face the power/utilization wall, Near-Threshold Computing (NTC) has emerged as one of the most promising approaches to achieve an order of magnitude improvement or more in energy efficiency of microprocessors and reconfigurable hardware. NTC takes advantage of the quadratic relation between the supply voltage (Vdd) and the dynamic power, by lowering the supply voltage of chips to a value only slightly higher than the threshold voltage.
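
The quadratic relation mentioned above can be made concrete with the standard CMOS switching-power model, P = α · C · Vdd² · f. The short Python sketch below uses that textbook model with made-up parameter values (activity factor, capacitance, voltages and frequency are all assumptions for illustration) and holds the frequency fixed to isolate the voltage term; in practice the achievable frequency also drops as Vdd approaches the threshold voltage.

```python
def dynamic_power(alpha, capacitance, vdd, frequency):
    """Textbook CMOS switching-power model: P = alpha * C * Vdd^2 * f."""
    return alpha * capacitance * vdd ** 2 * frequency

# Hypothetical operating points: same circuit and clock, two supply voltages.
nominal = dynamic_power(alpha=0.1, capacitance=1e-9, vdd=1.0, frequency=1e9)
near_threshold = dynamic_power(alpha=0.1, capacitance=1e-9, vdd=0.5, frequency=1e9)

# Halving Vdd quarters the dynamic power when everything else is unchanged.
ratio = near_threshold / nominal
print(f"dynamic power at 0.5 V relative to 1.0 V: {ratio:.2f}")
```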

The power-wall problem, driven by the stagnation of supply voltages in deep-submicron technology nodes, is now the major scaling barrier for moving towards the manycore era. Although technology scaling enables extreme volumes of computational power, power budget violations will permit only a limited portion to be actually exploited, leading to so-called dark silicon. Near-Threshold voltage Computing (NTC) has emerged as a promising approach to overcome the manycore power-wall, at the expense of reduced performance and higher sensitivity to process variations. Given that several application domains operate under specific performance constraints, performance sustainability is considered a major issue for the wide adoption of NTC.

This year, we investigate how performance guarantees can be ensured when moving towards NTC manycores through variability-aware voltage and frequency allocation schemes. We propose three aggressive NTC voltage tuning and allocation strategies, showing that NTC performance can be efficiently sustained or even optimized at the NTC regime. Finally, we show that NTC highly depends on the underlying workload characteristics, delivering average power gains of 65% for thread-parallel workloads and up to 90% for process-parallel workloads, while offering an extensive analysis on the effects of different voltage tuning/allocation strategies and voltage regulator configurations.
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>8.3.1</td>
<td>EFFICIENCY OF A GLITCH DETECTOR AGAINST ELECTROMAGNETIC FAULT INJECTION</td>
<td>Loic Zussa¹, Amine Dehbaoui¹, Karim Tobich², Jean-Max Dutertre¹, Philippe Maurine², Ludovic Guillaume-Sage², Jessy Clediere³ and Assia Tria³</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>The use of electromagnetic glitches has recently emerged as an effective fault injection technique for the purpose of conducting physical attacks against integrated circuits. First research works have shown that electromagnetic faults are induced by timing constraint violations and that they are also located in the vicinity of the injection probe. This paper reports the study of the efficiency of a glitch detector against EM injection. This detector was originally designed to detect any attempt at inducing timing violations by means of clock or power glitches. Because electromagnetic disturbances are more local than global, the use of a single detector proved to be inefficient. Our subsequent investigation of the use of several detectors to obtain full fault detection coverage is reported; it also provides further insights into the properties of electromagnetic injection and into the key role played by the injection probe.</td>
</tr>
</tbody>
</table>

| Time | Label | Presentation Title | Authors |
|---|---|---|---|
| 17:30 | 8.3.2 | ANALYZING AND ELIMINATING THE CAUSES OF FAULT SENSITIVITY ANALYSIS | Nahid Farhady Ghalaty, Aydin Aysu and Patrick Schaumont, Virginia Tech, US |
| | | **Abstract** | Fault Sensitivity Analysis (FSA) is a new type of side-channel attack that exploits the relation between the sensitive data and the faulty behavior of a circuit, the so-called fault sensitivity. This paper analyzes the behavior of different implementations of AES S-box architectures against FSA, and proposes a systematic countermeasure against this attack. This paper has two contributions. First, we study the behavior and structure of several S-box implementations, to understand the causes behind the fault sensitivity. We identify two factors: the timing of fault sensitive paths, and the number of logic levels of fault sensitive gates within the netlist. Next, we propose a systematic countermeasure against FSA. The countermeasure masks the effect of these factors by intelligent insertion of delay elements. Compared to earlier work, our method operates at the logic level, is systematic, and can be easily generalized to bigger circuits. |
| 18:00 | 8.3.3 | A SMALLER AND FASTER VARIANT OF RSM | Noritaka Yamashita, Kazuhiko Minematsu, Toshihiko Okamura and Yukiyasu Tsunoo, NEC, JP |
| | | **Abstract** | Masking is one of the major countermeasures against side-channel attacks on cryptographic modules. Nassar et al. recently proposed a highly efficient masking method, called Rotating S-boxes Masking (RSM), which can be applied to a block cipher based on a Substitution-Permutation Network. It arranges multiple masked S-boxes in parallel, which are rotated in each round. This rotation requires a remasking process in each round to adjust the current masks to those of the S-boxes. In this paper, we propose a method for further reducing the complexity of RSM by omitting the remasking process when the linear diffusion layer of the encryption algorithm has a certain algebraic property. Our method can be applied to AES with reduced complexity compared to RSM, while keeping an equivalent security level. |
| 18:30 | IP4-1, 140 | A MULTIPLE FAULT INJECTION METHODOLOGY BASED ON CONE PARTITIONING TOWARDS RTL MODELING OF LASER ATTACKS | Athanasios Papadimitriou¹, David Hely¹, Vincent Beroulle², Paolo Maistri³ and Regis Leveugle³ |
| | | **Abstract** | Laser attacks, especially on circuits manufactured with recent deep submicron semiconductor technologies, pose a threat to secure integrated circuits due to the multiplicity of errors induced by a single attack. An efficient way to neutralize such effects is the design of appropriate countermeasures, according to the circuit implementation and characteristics. Therefore, tools which allow the early evaluation of security implementations are necessary. Our efforts involve the development of an RTL fault injection approach more representative of laser attacks than random multi-bit fault injections, and the utilization and evolution of state-of-the-art emulation techniques to reduce the duration of fault injection campaigns. This will ultimately lead to the design and validation of new countermeasures against laser attacks on ASICs implementing cryptographic algorithms. |

8.4 Efficient Designs for Telecom and Financial Applications

**Date:** Wednesday 26 March 2014

**Time:** 17:00 - 18:30

**Location / Room:** Konferenz 2

**Chair:**
Sergio Saponara, University of Pisa, IT

**Co-Chair:**
Amer Baghdadi, Telecom Bretagne, FR

The session presents energy and performance efficient implementations of wireless communication and financial applications.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>19:30</td>
<td>DATE Party</td>
<td>DATE PARTY IN &quot;GLÄSERNE MANUFAKTUR&quot; OF THE VOLKSWAGEN AG</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>The DATE Party is again scheduled on the second conference day, Wednesday, March 26, 2014, starting from 19:30 h. This year, it will take place in one of Dresden's most exciting and modern buildings, the &quot;Gläserne Manufaktur&quot; of the car manufacturer Volkswagen AG (<a href="http://www.glaesernemanufaktur.de/en/">www.glaesernemanufaktur.de/en/</a>). The party will feature a flying buffet style dinner with various catering points and accompanying drinks. Light background music and the possibility of guided visits through the extraordinary premises will round off the evening. It provides a perfect opportunity to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities. Please kindly note that it is no seated dinner. All delegates, exhibitors and their guests are encouraged to attend the party. Please be aware that entrance is only possible with a party ticket. Each full conference registration includes a ticket for the DATE Party. Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Ticket price for the full Evening Social Programme: 75 € per person.</td>
</tr>
</tbody>
</table>

MODE-CONTROLLED DATAFLOW BASED MODELING & ANALYSIS OF A 4G-LTE RECEIVER

Abstract

The principle can be realized in a large number of valid processing sequences that differ significantly in the number of memory accesses and computations, and hence, the overall implementation energy. With modern low power embedded processors evolving towards register files with wide memory interfaces and vector functional units (FUs), the data flow in matrix decomposition algorithms needs to be carefully devised to achieve energy efficient implementation. In this paper, we present an efficient data flow transformation strategy for the Givens Rotation based QRD that optimizes data memory accesses. We also explore different possible implementations for QRD of multiple matrices using the SIMD feature of the processor. With the proposed data flow transformation, a reduction of up to 36% is achieved in the overall energy over conventional QRD sequences.

Speakers:
Namita Sharma, Preeti Ranjan Panda, Min Li, Prashant Agrawal and Francky Catthoor

Institutes:
1Indian Institute of Technology Delhi, IN; 2IMEC, BE
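
For readers unfamiliar with the Givens Rotation based QRD named in the abstract above, the following is a minimal textbook sketch in Python. It is not the authors' data flow transformation, and the function name `givens_qr` is an illustrative assumption. It only shows the basic rotation sequence whose ordering such a transformation would rearrange.

```python
import math


def givens_qr(a):
    """Textbook Givens-rotation QR: returns (q, r) such that a = q * r.

    `a` is a list of rows (m x n). This is an illustrative sketch only;
    it makes no attempt to optimize memory accesses.
    """
    m, n = len(a), len(a[0])
    r = [list(map(float, row)) for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the entries below the diagonal of column j, bottom-up.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue  # already zero, no rotation needed
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Apply the rotation to rows i-1 and i of r.
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # Accumulate the transposed rotation into columns i-1 and i of q.
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
    return q, r
```

Each rotation touches two full rows, so the order in which rotations are issued determines how often rows are reloaded from memory; that ordering is the degree of freedom the abstract refers to.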

HARDWARE IMPLEMENTATION OF A REED-SOLOMON SOFT DECODER BASED ON INFORMATION SET DECODING

Abstract

Soft decision decoding of Reed-Solomon codes can largely improve frame error rates over currently used hard decision decoding. In this paper, we present a new hardware implementation for soft decoding of Reed-Solomon codes based on information set decoding. To the best of our knowledge, this is the first hardware implementation of information set decoding for long Reed-Solomon codes. We propose a reduced complexity version of the decoding algorithm that is optimized for efficient hardware implementation and enables high throughput. The decoder was implemented on a Virtex 7 FPGA, achieving a gain of 0.75 dB compared to conventional hard decision decoding and a throughput of up to 1.19 GBit/s for the widely used RS(255,239). This gain in FER is achieved with less complexity and more than 15x larger throughput than other state-of-the-art architectures.

Speakers:
Stefan Scholl and Norbert Wehn, TU Kaiserslautern, DE
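
As a small aid for reading the RS(255,239) figure in the abstract above, the sketch below derives the standard parameters of an RS(n, k) code in Python. The helper name `rs_parameters` and the default of 8-bit symbols are assumptions made for illustration; the formulas are the usual textbook ones for hard decision decoding and say nothing about the soft decoder itself.

```python
def rs_parameters(n, k, bits_per_symbol=8):
    """Basic parameters of a Reed-Solomon RS(n, k) code.

    Assumes the conventional bounded-distance hard decision decoder,
    which corrects up to (n - k) // 2 symbol errors per codeword.
    """
    parity = n - k
    return {
        "parity_symbols": parity,
        "correctable_symbol_errors": parity // 2,
        "code_rate": k / n,
        "codeword_bits": n * bits_per_symbol,
    }


params = rs_parameters(255, 239)
```

For RS(255,239) this gives 16 parity symbols and 8 correctable symbol errors per 2040-bit codeword, which is the hard decision baseline the reported 0.75 dB gain is measured against.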

END OF SESSION

The first presentation proposes an analytical model to estimate the contention and the resulting delays on accessing shared components in a multi-core environment. In order to find the right granularity for design space exploration, the second presentation provides an algorithm for automatic aggregation of design blocks based upon their static computation demands. Finally, the last presentation proposes a novel formal notation for reactive system requirements in order to reduce translational efforts and thus make specifications both easier and quicker to create.

17:00  8.5.1 AN ACTIVITY-SENSITIVE CONTENTION DELAY MODEL FOR HIGHLY EFFICIENT DETERMINISTIC FULL-SYSTEM SIMULATIONS

Speakers: Shu-Yung Chen, Chien-Hao Chen and Ren-Song Tsay, Department of Computer Science, National Tsing Hua University, TW

Abstract

As modern systems are integrating an increasing number of components for better performance and functionality, early full-system simulation tools have become essential for validating complex concurrent system interaction activities. In the past decades, many useful timing-accurate system simulation tools have been developed; however, we find that even for the most efficient techniques, more than 90% of the overhead occurs when simulating shared devices, such as buses. Instead of adopting the constant-delay model that compromises accuracy or using the time-consuming precise scheduling approach, we propose in this paper an effective system activity-sensitive contention delay model that can dynamically capture runtime contention situations and system configuration changes. To verify the idea, we construct an analytical bus delay model and integrate it into a system simulation tool. The experimental results show up to 80 times performance improvement over the scheduling-based bus model on full-system simulations, with an estimated timing difference of less than 3%.

17:30  8.5.2 AUTOMATIC SPECIFICATION GRANULARITY TUNING FOR DESIGN SPACE EXPLORATION

Speakers: Jiaxing Zhang and Gunar Schirner, Northeastern University, US

Abstract

Algorithm Design Environments (ADE), such as Simulink, have been shown to be efficient for development, analysis, and evaluation of algorithms. Recent tools have been proposed to facilitate algorithm/architecture co-design by bridging the gap from ADE to System-Level Design Environments (SLDE) through automatic synthesis from algorithm models to SLDL specifications. With the wide range of block characteristics (from simple logic functions to complex kernels) in the algorithm model, however, it is challenging to select a suitable compositional granularity for SLDL blocks in the synthesized specifications. A high volume of SLDL blocks with little computation will increase the number of mapping possibilities, whereas large blocks with heavy computation allow inter-block fusion, reducing the computational demands of the overall specification yet sacrificing mapping flexibility. In this paper, we introduce an automatic specification granularity tuning mechanism to determine the granularity in the synthesized specification model hierarchy, guided by the computational demands of algorithm blocks. Our granularity selection significantly simplifies early design space exploration, as only a meaningful block decomposition is exposed in the synthesized specification. It leads to an overall system with lower computational demands by leveraging the block fusion capabilities in the ADE. At the same time, our granularity decision ensures that sufficient flexibility remains in the system for exploring heterogeneous mapping of the algorithm. Our results on real world examples show that specification models can be synthesized with 80% efficiency through block fusion with 70-90% fewer but coarser grained blocks.

18:00  8.5.3 EDT: A SPECIFICATION NOTATION FOR REACTIVE SYSTEMS

Speakers: Murali Krishna Goldsmith, Venkatesh R, Ulka Shrotri and Supriya Agrawal, Tata Research Development and Design Centre, Tata Consultancy Services Limited, IN

Abstract

Requirements of reactive systems express the relationship between sensors and actuators and are usually described in natural language using a mix of state-based and stream-based paradigms. Translating these into a formal language is an important prerequisite for automating the verification of requirements. The analysis effort required for the translation is a prime hurdle to formalization gaining acceptance among software engineers and testers. We present Expressive Decision Tables (EDT), a novel formal notation designed to reduce the translation effort from both state-based and stream-based informal requirements. We have also built a tool, EDTTool, to generate test data and expected output from EDT specifications. In a case study consisting of more than 200 informal requirements of a real-life automotive application, translating the informal requirements into EDT took 43% less time than translating them into Statecharts. Further, we tested the Statecharts using test data generated by EDTTool from the corresponding EDT specifications. This testing detected one bug in a mature feature and exposed several missing requirements in another. The paper presents the EDT notation, a comparison with other similar notations, and the details of the case study.

18:30  636 MODEL-BASED ACTOR MULTIPLEXING WITH APPLICATION TO COMPLEX COMMUNICATION PROTOCOLS

Speakers: Christian Zebelein1, Christian Haubelt1, Joachim Falk2, Tobias Schwarzer2 and Jürgen Teich2
1University of Rostock, DE; 2University of Erlangen-Nuremberg, DE

Abstract

We propose a dynamic scheduling approach for the concurrent execution of logical actor instances on a single synthesized actor instance. Based on a formal dataflow model of computation, the proposed approach can be applied to a wide range of applications in a model-based design flow. As a case study, we evaluate a bus-cycle-accurate SystemC RTL model based on an InfiniBand network adapter in a PCI Express system.

18:31  743 A NOVEL MODEL FOR SYSTEM-LEVEL DECISION MAKING WITH COMBINED ASP AND SMT SOLVING

Speakers: Alexander Biewer1, Jens Gladigau1 and Christian Haubelt2
1Robert Bosch GmbH, DE; 2University of Rostock, DE

Abstract

In this paper, we present a novel model enabling system-level decision making for time-triggered many-core architectures in automotive systems. The proposed application model includes shared data entities that need to be bound to memories during decision making. As a key enabler to our approach, we explicitly separate computation and shared memory communication over a network-on-chip (NoC). To deal with contention on a NoC, we model the necessary hardware to implement a time-triggered schedule that guarantees freedom of interference. We compute fundamental design decisions, namely (a) spatial binding, (b) multi-hop routing, and (c) time-triggered scheduling, by a novel coupling of answer set programming (ASP) with satisfiability modulo theories (SMT) solvers. First results of an automotive case study demonstrate the applicability of our method for complex real-world applications.
8.6 Mapping and Scheduling for Many-Core Embedded Systems

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 4
Chair:
Marc Geilen, Eindhoven University of Technology, NL

Co-Chair:
Sébastien Le Beux, Ecole Centrale de Lyon, FR

This session discusses novel ideas for embedded software implementation on many-core architectures. The first presentation deals with an optimized implementation of an H.265 video coding algorithm on many-core architectures. The second presentation introduces a run-time scheduling approach for priority-based systems on GPGPU architectures. The third talk presents an efficient run-time resource manager heuristic for many-core architectures based on a Lagrangian relaxation technique.

8.6.1 SOFTWARE ARCHITECTURE OF HIGH EFFICIENCY VIDEO CODING FOR MANY-CORE SYSTEMS WITH POWER-EFFICIENT WORKLOAD BALANCING

Speakers:
Muhammad Usman Karim Khan, Muhammad Shafique and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE

Abstract
The High Efficiency Video Coding (HEVC) standard aims at providing ~50% better compression compared to its predecessor (H.264) at the cost of high computational complexity. To enable HEVC video encoding in real-time scenarios, special coding support for parallelization is provided in HEVC that can be exploited by many-core systems. In this work, we present a HEVC software architecture where a video frame is adaptively divided into independent video frame regions (i.e. so-called video tiles) which are processed concurrently on multiple cores. By balancing the workload of each video tile mapped to a particular core, the total power consumption of a system is reduced (through dynamically scaling the operating frequency) under a given frame-rate constraint. We also exploit user tolerance to further curtail the HEVC workload with insignificant video quality degradation. Experimental results illustrate that the proposed approach results in ~43% power savings on a many-core system.
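
The link between workload balancing and frequency scaling described in the abstract above can be illustrated with a toy Python calculation. This is not the authors' algorithm: the function name, the per-tile cycle counts and the frequency levels are all hypothetical values chosen for illustration.

```python
def lowest_feasible_frequency(cycles_per_frame, fps, freqs_hz):
    """Lowest available frequency that finishes one tile per frame period.

    A core needs cycles_per_frame * fps cycles per second to keep up.
    Returns None if no available frequency is fast enough.
    """
    required_hz = cycles_per_frame * fps
    for f in sorted(freqs_hz):
        if f >= required_hz:
            return f
    return None


# Hypothetical frequency levels and tile workloads (cycles per frame).
FREQS_HZ = [400_000_000, 600_000_000, 800_000_000, 1_000_000_000]
unbalanced = [30_000_000, 10_000_000]
balanced = [20_000_000, 20_000_000]

unbalanced_freqs = [lowest_feasible_frequency(c, 30, FREQS_HZ) for c in unbalanced]
balanced_freqs = [lowest_feasible_frequency(c, 30, FREQS_HZ) for c in balanced]
```

With these made-up numbers, the unbalanced split forces one core to 1 GHz at 30 frames per second, while the balanced split lets both cores run at 600 MHz. Since dynamic power typically grows faster than linearly with frequency, avoiding the single fast core is where a saving would come from.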

8.6.2 GPU-EVR: RUN-TIME EVENT BASED REAL-TIME SCHEDULING FRAMEWORK ON GPGPU PLATFORM

Speakers:
Haeseung Lee and Mohammad Abdullah Al Faruque, University of California, Irvine, US

Abstract
GPU architecture has traditionally been used in graphics applications because of its enormous computing capability. More recently, it has also been used for general purpose computing. Most of the current scheduling frameworks developed to handle GPGPU workloads operate sequentially. This is problematic because the sequential approach cannot support preemption and therefore may not scale for real-time systems. We propose a novel scheduling framework that provides real-time support for the GPGPU platform. In contrast to existing frameworks, our proposed framework considers both concurrent execution of applications on the GPU and the mapping between streaming multiprocessors and thread blocks. By considering both concurrent execution and mapping, our framework is able to guarantee timing for up to 6.4 times as many applications compared to TimeGraph and Global EDF. In addition, our experimental applications use up to 20% less power under our scheduling framework compared to TimeGraph and Global EDF.

8.6.3 MULTI-OBJECTIVE DISTRIBUTED RUN-TIME RESOURCE MANAGEMENT FOR MANY-CORES

Speakers:
Stefan Wildermann, Michael Glaß and Jürgen Teich, University of Erlangen-Nuremberg, DE

Abstract
Dynamic usage scenarios of many-core systems require sophisticated run-time resource management that can deal with multiple, often conflicting, application and system objectives. This paper proposes an approach based on non-linear programming techniques that is able to trade off between objectives while respecting targets regarding their values. We propose a distributed application embedding for dealing with soft system-wide constraints as well as a centralized one for strict constraints. The experiments show that both approaches may significantly outperform related heuristics.
18:30  IP4-7, 323  COMIK: A PREDICTABLE AND CYCLE-ACCURATELY COMPOSABLE REAL-TIME MICROKERNEL

Speakers:  Andrew Nelson1, Ashkan Beyranvand Nejad1, Anca Molnos2, Martijn Koedam3 and Kees Goossens3 1TU Delft, NL; 2CEA Leti, FR; 3TU Eindhoven, NL

Abstract  The functionality of embedded systems is ever increasing. This has led to mixed time-criticality systems, where applications with a variety of real-time requirements co-exist on the same platform and share resources. Due to inter-application interference, verifying the real-time requirements of such systems is generally non-trivial. In this paper, we present the CoMik microkernel that provides temporally predictable and composable processor virtualization. CoMik's virtual processors are cycle-accurately composable, i.e. their timing cannot affect the timing of co-existing virtual processors by even a single cycle. Real-time applications executing on dedicated virtual processors can therefore be verified and executed in isolation, simplifying the verification of mixed time-criticality systems. We demonstrate these properties through experimentation on an FPGA prototyped hardware platform.

18:31  IP4-8, 71  UTILIZATION-AWARE LOAD BALANCING FOR THE ENERGY EFFICIENT OPERATION ON THE BIG.LITTLE PROCESSOR

Speakers:  Myungsun Kim1, Kibeom Kim1, James Geraci1 and Seongsoo Hong2 1Samsung Electronics, KR; 2Seoul National University, KR

Abstract  ARM's big.LITTLE architecture introduces the opportunity to optimize power consumption by selecting the core type most suitable for a level of processing demand. To take advantage of this new axis of optimization, we introduce the processor utilization factor into the Linux kernel's load balancing algorithm after carefully analyzing the power management mechanism of the big.LITTLE processor's port of Linux and deriving its state diagram representation. Our mechanism improves the Linux kernel's ability to assign tasks to cores in an energy efficient manner without having to make it directly aware of the available core types. Our experiments on a real test bed show that our algorithm reduces energy consumption by up to 11.35% compared to the standard Linux scheduler, with almost no corresponding reduction in performance.

18:32  IP4-9, 538  HEVC/DM: APPLICATION-DRIVEN DYNAMIC THERMAL MANAGEMENT FOR HIGH EFFICIENCY VIDEO CODING

Speakers:  Daniel Palomino1, Muhammad Shafique2, Hussam Amrouch3, Altamiro Susin3 and Jörg Henkel2 1Karlsruhe Institute of Technology (KIT), BR; 2Karlsruhe Institute of Technology (KIT), DE; 3Federal University of Rio Grande do Sul, BR

Abstract  This paper presents an application-driven algorithm for Dynamic Thermal Management (DTM) for High Efficiency Video Coding (HEVC). For efficient design of such a DTM policy, we perform an offline thermal analysis of an HEVC encoder and demonstrate the impact of different video sequences and different coding configurations on the processor temperature. Our thermal analysis is leveraged to develop an efficient application-driven DTM policy that performs temperature-aware coding along with an application-driven control of DTM knobs (e.g., frequency scaling) in order to meet the temperature constraints while still providing high video quality (i.e. PSNR loss < 0.01dB). For accurate thermal analysis and evaluation, we deploy an infrared camera-based thermal measurement setup that, in contrast to state-of-the-art setups, does not require adding any extra layer on top of the measured chip, thus allowing the camera to accurately capture the infrared emissions from the die.

18:33  IP4-10, 714  IMPROVING EFFICIENCY OF EXTENSIBLE PROCESSORS BY USING APPROXIMATE CUSTOM INSTRUCTIONS

Speakers:  Mehdi Kamal¹, Amin Ghasemazar¹, Ali Afzali-Kusha¹ and Massoud Pedram² 1University of Tehran, IR; 2University of Southern California, US

Abstract  In this paper, we propose to move the conventional extensible processor design flow to the approximate computing domain to gain more speedup. In this domain, the instruction set architecture (ISA) design flow selects both exact and approximate custom instructions (CIs). The proposed approach could be used for the applications where imprecise results may be tolerated. In the CI identification phase of the flow, the CIs which do not satisfy the maximum accuracy requirement are used for the applications where imprecise results may be tolerated. The efficacy of the proposed approximate design flow is investigated using the case studies of the discrete cosine transform (DCT) and inverse DCT (IDCT) of the MPEG2 application. Also, the impact of the process variation on the impreciseness of the results is investigated.

18:30  End of session


8.7 Performance Modeling and Delay Test

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 5
Chair: Robert Aitken, ARM, US
Co-Chair: Mehdi Tahoori, KIT, DE

As technology dimensions shrink and process complexity increases, it becomes vital to accurately model performance limits such as device and metal variability, as well as to determine when these effects become so critical that delay requirements are exceeded.
8.8 Hot Topic: Beyond CMOS Ultra-low-power Computing

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Exhibition Theatre
Organiser: Saibal Mukhopadhyay, Georgia Institute of Technology, US
Toward Ultralow-Power Computing at Extreme with Silicon Carbide (SiC) NanoElectromechanical Logic

Abstract

Growing number of important application areas, including automotive and industrial applications as well as space, avionics, combustion engine, intelligent propulsion systems, and geo-thermal exploration require electronics that can work reliable at extreme conditions - in particular at a temperature > 250°C and at high radiation (1-30 Mrad), where conventional electronics fail to work reliably. Traditionally, existing wide-band-gap semiconductors, e.g., silicon carbide (SiC) transistor-based electronics have been considered most viable for high temperature and high radiation applications. However, the large-size, high threshold voltage, low switching speed and high leakage current make logic design with these devices unattractive. Additionally, the leakage current markedly increases at high temperature (in the range of 10 µA for a 2-input NAND gate), which induces self-heating effect and makes power delivery at high temperature very challenging. To address these issues, in this paper we present a computing platform for low-power reliable operation at extreme environment using SiC electromechnanical switches. We show that a device-circuit-architecture co-design approach can provide reliable long-term operation with virtually zero leakage power.

Abstract

Swarup Bhunia, 1 Vaishnavi Ranganathan, 2 Tina He, 2 Srihari Rajgopal 2, Rui Wang, 2 Mehran Mehregany 2 and Philip Feng 2

1 Case Western Reserve University, US; 2 Case Western Reserve U., US

18:00 8.8.3 TOWARD ULTRALOW-POWER COMPUTING AT EXTREME WITH SILICON CARBIDE (SIC) NANOELECTROMECHANICAL LOGIC

Speakers:
Swarup Bhunia, Vaishnavi Ranganathan, Tina He, Srihari Rajgopal, Rui Wang, Mehran Mehregany and Philip Feng

Abstract

In this paper we discuss the potential of emerging spin-torque devices for computing applications. Recent proposals for spin-based computing schemes may be differentiated as ‘all-spin’ vs. hybrid, programmable vs. fixed, and, Boolean vs. non-Boolean. All-spin logic-styles may offer high area-density due to small form-factor of nano-magnetic devices. However, circuit and system-level design techniques need to be explored that leverage the specific spin-device characteristics to achieve energy-efficiency, performance and reliability comparable to those of CMOS. The non-volatility of nano-magnets can be exploited in the design of energy and area-efficient programmable logic. In such logic-styles, spin-devices and memory elements may play the dual-role of computing as well as memory-elements that provide field-programmability. Spin-based threshold logic design is presented as an example. Emerging spintronic phenomena may lead to ultra-low-voltage, current-mode, spin-torque switches that can offer attractive computing capabilities, beyond digital switches. Such devices may be suitable for non-Boolean data-processing applications which involve analog processing. Integration of such spin-torque devices with charge-based devices like CMOS and resistive memory can lead to highly energy-efficient information processing hardware for applications like pattern-matching, neuromorphic-computing, image-processing and data-conversion. Finally, we discuss the possibility of using coupled spin-torque nano oscillators for low-power non-Boolean computing.

Speakers:
Amit Trivedi, Mohammad Faisal Amir and Saibal Mukhopadhyay

Authors

Case Western Reserve University, US; Georgia Institute of Technology, US

17:00 8.8.1 ULTRA-LOW POWER ELECTRONICS WITH Si/Ge TUNNEL FET

Speakers:
Amit Trivedi, Mohammad Faisal Amir and Saibal Mukhopadhyay

Abstract

Si/Ge Tunnel FET (TFET), with its steep subthreshold swing, is attractive for low-power analog and digital designs. The greater Ion/Ioff ratio of TFET can reduce the dynamic power in digital designs, while its higher gm/IDS can lower the bias power of analog amplifiers. However, these benefits of TFET are eclipsed by MOSFET at higher power/performance points. Ultra-low-power programmability of the key analog and digital circuits, SRAM and operational transconductance amplifier (OTA), with TFET is demonstrated. By analyzing a TFET-based cellular neural network, this work shows the feasibility of ultra-low-power neuromorphic computing with TFET.

Speakers:
Kaushik Roy, Mrigank Sharad, Deliang Fan and Karthik Yogendra

18:30 End of session

19:30 DATE Party in "Gläserne Manufaktur" of the Volkswagen AG

The DATE Party is again scheduled on the second conference day, Wednesday, March 26, 2014, starting from 19:30 h. This year, it will take place in one of Dresden’s most exciting and modern buildings, the "Gläserne Manufaktur" of the car manufacturer Volkswagen AG (www.gläserne-manufaktur.de/en/). The party will feature a flying buffet style dinner with various catering points and accompanying drinks. Light background music and the possibility of guided visits through the extraordinary premises will round off the evening. It provides a perfect opportunity to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities.

Authors

Saibal Mukhopadhyay, NamLab gGmbH, DE

Speakers

Saibal Mukhopadhyay, NamLab gGmbH, DE

Organisers

Thomas Mikolajick, NamLab gGmbH, DE

Transistors as switches have now scaled down to a point where the classical bulk structure is no longer tenable and it is necessary to change the nature of the channel structure. In this session, the three principal contenders for following on from conventional devices will be examined. The first paper looks at the use of III-V nanowires, with expected benefits in terms of speed and energy, as well as integration challenges. The second paper looks at how the use of switches with controllable polarity, such as in silicon nanowire devices, can improve the energy efficiency of systems on chip. The devices themselves are explored in detail in the third paper, with the concept of fine-grain reconfigurability at the fore. The fourth and final paper gives a reality check on carbon electronics and the most promising devices in this class.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>9.1.1</td>
<td>III-V SEMICONDUCTOR NANOWIRES FOR FUTURE DEVICES</td>
<td>H. Schmid, B. Borg, K. Moselund, P. Das Kanungo, G. Signorello, S. Karg, P. Mensch, V. Schmidt and H. Riel, IBM Research, CH</td>
</tr>
<tr>
<td>09:05</td>
<td>9.1.2</td>
<td>ADVANCED SYSTEM ON A CHIP DESIGN BASED ON CONTROLLABLE-POLARITY FETS</td>
<td>Pierre-Emmanuel Gaillardon, Luca Amarù, Jian Zhang and Giovanni De Micheli, Integrated Systems Laboratory – Swiss Federal Institute of Technology, CH</td>
</tr>
<tr>
<td>09:35</td>
<td>9.1.3</td>
<td>RECONFIGURABLE SILICON NANOWIRE DEVICES AND CIRCUITS: OPPORTUNITIES AND CHALLENGES</td>
<td>Walter Weber1, André Heinzig2, Jens Trommer1, Markus König2, Matthias Grube1 and Thomas Mikolajick1</td>
</tr>
<tr>
<td>09:35</td>
<td>9.1.4</td>
<td>ADVANCING CMOS WITH CARBON ELECTRONICS</td>
<td>Franz Kreupl, TU Munich, DE</td>
</tr>
<tr>
<td>10:00</td>
<td></td>
<td>End of session</td>
<td></td>
</tr>
</tbody>
</table>

9.2 Low-Cost, High-Performance NoCs

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 6

Chair: Kees Goossens, Eindhoven University, NL
Co-Chair: Luca Ramini, University of Ferrara, IT

This session pushes the boundaries of NoC performance optimization while at the same time accounting for implementation constraints. The first paper takes a perspective where express channels are added to the topology, and then smart application mapping is performed. The second paper instead chooses the TDM NoC route to provide guaranteed performance, and significantly optimizes the TDM scheduling process. Finally, the last paper reduces buffer sizes, while also providing elasticity, in a router's virtual channel buffers.

Hardware features are used as a trust anchor in many secure systems. This includes design obfuscation techniques, encrypted processing, and biometric systems, which are discussed in this session.

Abstract

With the emergence of many-core multiprocessor system-on-chips (MPSoCs), on-chip networks are facing serious challenges in providing fast communication for various tasks and cores. One promising solution shown in recent studies is to add express channels to the network as shortcuts to bypass intermediate routers, thereby reducing packet latency. However, this approach also greatly changes the packet delay estimation and traffic behaviors of the network, both of which have not yet been exploited in existing mapping algorithms. In this paper, we explore the opportunities in optimizing application mapping for express channel-based on-chip networks. Specifically, we derive a new delay model for this type of network, identify its unique characteristics, and propose an efficient heuristic mapping algorithm that increases the bypassing opportunities by reducing unnecessary turns that would otherwise impose the entire router pipeline delay on packets. Simulation results show that the proposed algorithm can achieve a 2-4× reduction in the number of turns and a 10-26% reduction in the average packet delay.

**Abstract**

Hardware is the foundation and the root of trust of any security system. However, in today's global IC industry, an IP provider, an IC design house, a CAD company, or a foundry may subvert a VLSI system with back doors or logic bombs. Such a supply chain adversary's capability is rooted in their knowledge of the hardware design. Successful hardware design obfuscation would severely limit a supply chain adversary's capability, if not prevent all supply chain attacks. However, not all designs are obfuscatable in traditional technologies. We propose to achieve ASIC design obfuscation based on embedded reconfigurable logic which is determined by the end user and unknown to any party in the supply chain. Combined with other security techniques, embedded reconfigurable logic can provide the root of ASIC design obfuscation, data confidentiality and tamper-proofness. As a case study, we evaluate hardware-based code injection attacks and reconfigurable-logic-based instruction set obfuscation on the open source SPARC processor LEON2. We prevent program monitor Trojan attacks and increase the area of a minimum code injection Trojan with a 1KB ROM by 2.38% for every 1% area increase of the LEON2 processor.

---

**Speakers:** Bao Liu¹ and Brandon Wang²

¹University of Texas at San Antonio, US; ²Cadence Design Systems, Inc., US

---

**Abstract**

Embedded computing devices increasingly permeate many aspects of modern life: from medical to automotive, from building and factory automation to weapons, from critical infrastructures to home entertainment. Despite their specialized nature as well as limited resources and connectivity, these devices are becoming an increasingly popular and attractive target for attacks, especially malware infections. A number of approaches have been suggested to detect and/or mitigate such attacks. They vary greatly in terms of application generality and underlying assumptions. However, one common theme is the need for Remote Attestation, a distinct security service that allows a trusted party (verifier) to check the internal state of a remote untrusted embedded device (prover). Many prior methods assume some form of trusted hardware on the prover, which is not a good option for small and low-end embedded devices. To this end, we investigate the feasibility of Remote Attestation without trusted hardware. This paper provides a systematic treatment of Remote Attestation, starting with a precise definition of the desired service and proceeding to its systematic deconstruction into necessary and sufficient properties. Next, these are mapped into a minimal collection of hardware and software components that result in secure Remote Attestation. One distinguishing feature of this line of research is the need to prove (or, at least, argue for) architectural minimality, an aspect rarely encountered in security research. This work also provides a promising platform for building more advanced security services and guarantees.

---

**Abstract**

In today's technology-driven world, it is essential to build secure systems with low faulty behavior. Authentication is one of the primary means to gain access to secure systems. Users need to be authenticated in order to gain access to the services and sensitive information contained within the system. Due to the surge in the number of touch-based smart devices, there arises a need for a compatible authentication system. Historically, fingerprints have served to their fullest capacity to establish the uniqueness of an individual's identity. They can be detected using capacitive sensing techniques. In this paper we present a novel unified device using transparent electronics for both fingerprint scanning and multi-touch interaction. We discuss a high-resolution transparent touch-sensitive device and a read-out circuit that drives the capacitive sensor array for touch interactions at low resolutions and for fingerprint sensing at higher resolutions. Using circuit simulation and a custom Verilog-A model for transparent thin-film transistors, we verified that our design can sense fingerprints in 8.25 ms and detect touches in 0.6 ms with an efficient power consumption of 1 mW. The results show that such a device can be realized and can serve as a very efficient means of user authentication. Furthermore, from the usability aspect, the proposed device is essential as it provides user-transparent and non-intrusive authentication.

---

**Abstract**

As cloud computing becomes mainstream, the need to ensure the privacy of the data entrusted to third parties keeps rising. Cloud providers resort to numerous security controls and encryption to thwart potential attackers. Still, since the actual computation inside cloud microprocessors remains unencrypted, leakage remains theoretically possible. Therefore, in order to address the challenge of protecting the computation inside the microprocessor, we introduce a novel general purpose architecture for secure data processing, called HEROIC (Homomorphically Encrypted One Instruction Computer). This new design utilizes a single instruction architecture and provides native processing of encrypted data at the architecture level. The security of the solution is assured by a variant of Paillier's homomorphic encryption scheme, used to encrypt both instructions and data. Experimental results using our hardware-cognizant software simulator indicate an average execution overhead between 5 and 45 times for the encrypted computation (depending on the security parameter), compared to the unencrypted variant, for a 16-bit single instruction architecture.
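
The additive homomorphism that makes computing on encrypted data possible can be illustrated with a toy version of the textbook Paillier scheme. This is a minimal sketch only: it uses plain Paillier with the common choice g = n + 1, not the variant used by HEROIC, and the tiny primes and the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`) are illustrative assumptions that offer no security. It assumes Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy key generation from two small primes (illustrative, insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt 0 <= m < n as c = (n+1)^m * r^n mod n^2 with random r."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return (c1 * c2) % (n * n)
```

For example, `decrypt(n, priv, add_encrypted(n, encrypt(n, 5), encrypt(n, 7)))` recovers 12 without the party performing the addition ever seeing 5 or 7.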

---

**Abstract**

The security concerns of EDA tools have long been ignored because IC designers and integrators only focus on their functionality and performance. This lack of trusted EDA tools hampers hardware security researchers' efforts to design trusted integrated circuits. To address this concern, a novel EDA tools trust evaluation framework is proposed to ensure the trustworthiness of EDA tools through their functional operation, rather than by scrutinizing the software code. As a result, the newly proposed framework lowers the evaluation cost and is a better fit for hardware security researchers. To support the EDA tools evaluation framework, a new gate-level information assurance scheme is developed for security property checking on any gate-level netlist. Helped by the gate-level scheme, we expand the territory of proof-carrying based IP protection from RT-level designs to gate-level netlists, so that most of the commercially traded third-party IC cores are under the protection of proof-carrying based security properties. Using a sample AES encryption core, we successfully prove the trustworthiness of Synopsys Design Compiler in generating a synthesized netlist.

---

9.4 Timing challenges in validation

**Date:** Thursday 27 March 2014  
**Time:** 08:30 - 10:00  
**Location / Room:** Konferenz 2  
**Chair:** Elena Ioana Vatajelu, Politecnico di Torino, IT  
**Co-Chair:**

Accelerated timing simulation is essential for today’s chip designs, whether it is performed at the gate level or at the system level. This session provides solutions to address the challenges of timing analysis and timing validation performance across multiple levels of design abstraction.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>9.4.1</td>
<td>FAST STA PREDICTION-BASED GATE-LEVEL TIMING SIMULATION</td>
<td>Tariq Bashir Ahmad and Maciej Ciesielski, UMASS Amherst, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Traditional dynamic simulation with standard delay format (SDF) back-annotation cannot be reliably performed on large designs. The large size of SDF files makes the event-driven timing simulation extremely slow as it has to process an excessive number of events. In order to accelerate gate-level timing simulation we propose an automated fast prediction-based gate-level timing simulation that combines static timing analysis (STA) at the block level with dynamic timing simulation at the I/O interfaces. We demonstrate that the proposed timing simulation can be done earlier in the design cycle in parallel with synthesis.</td>
</tr>
<tr>
<td>09:00</td>
<td>9.4.2</td>
<td>A CROSS-LEVEL VERIFICATION METHODOLOGY FOR DIGITAL IPS AUGMENTED WITH EMBEDDED TIMING MONITORS</td>
<td>Valerio Guarnieri¹, Massimo Petricca¹, Alessandro Sassone¹, Sara Vinco¹, Nicola Bombieri¹, Franco Fummi¹, Enrico Macii² and Massimo Poncino²</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Smart systems implement the leading technology advances in the context of embedded devices. Current design methodologies are not suitable to deal with tightly interacting subsystems of different technological domains, namely analog, digital, discrete and power devices, MEMS and power sources. The effects of interaction between components and with the environment must be modeled and simulated at system level to achieve high performance. Focusing on the digital domain, additional design constraints have to be considered as a result of the integration of multi-domain subsystems in a single device. The main digital design challenges, combined with those emerging from the heterogeneous nature of the whole system, directly impact on performance and on propagation delay of the digital component. This paper proposes a design approach to enhance the RTL model of a given digital component for the integration in smart systems, and a methodology to verify the added features at system-level. The design approach consists of augmenting the RTL model through the automatic insertion of delay sensors, which can detect and correct timing failures. The augmented model is abstracted to SystemC TLM and, then, mutants (i.e., code mutations for emulating timing failures) are automatically injected into the model. Experimental results demonstrate the applicability of the proposed design and verification methodology and the effectiveness of the simulation performance.</td>
</tr>
<tr>
<td>09:30</td>
<td>9.4.3</td>
<td>EMPOWERING STUDY OF DELAY BOUND TIGHTNESS WITH SIMULATED ANNEALING</td>
<td>Xueqian Zhao and Zhonghai Lu, KTH Royal Institute of Technology, SE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Studying the delay bound tightness typically takes a practical approach by comparing simulated results against analytic results. However, this is often a manual process in which many simulation parameters have to be configured before the simulations run. This is a tedious and time-consuming process. We propose a technique to automate this process by using a simulated annealing approach. We formulate the problem as an online optimization problem, and embed a simulated annealing algorithm in the simulation environment to guide the search for configuration parameters which give good tightness results. This is a fully automated procedure and thus provides a promising path to automatic design space exploration in similar contexts. Experimental results for an all-to-one communication network with a large search space and complicated constraints illustrate the effectiveness of our method.</td>
</tr>
<tr>
<td>10:00</td>
<td></td>
<td>ANALYSIS AND EVALUATION OF PER-FLOW DELAY BOUND FOR MULTIPLEXING MODELS</td>
<td>Yanchen Long¹, Zhonghai Lu² and Xiaolang Yan³</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Multiplexing models are common in resource sharing communication media such as buses, crossbars and networks. While sending packets over a multiplexing node, the packet delay bound can be computed using network calculus models. The tightness of such delay bound remains an open problem. This paper studies the multiplexing models for weighted round robin scheduling with different traffic arrival curves, and analyzes per-flow packet delay bounds with different service properties. We empirically evaluate the tightness of the delay bounds. Our results show the quality of different analysis models, and how influential each parameter is to tightness.</td>
</tr>
<tr>
<td>10:00</td>
<td></td>
<td>END OF SESSION</td>
<td></td>
</tr>
</tbody>
</table>
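
The simulated-annealing search described in the 9.4.3 abstract can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_gap` is a hypothetical stand-in for "analytic bound minus simulated delay" (in the paper's setting each evaluation would be a simulation run), `step` is a hypothetical neighbourhood move over two integer configuration parameters, and the geometric cooling schedule is an arbitrary choice.

```python
import math
import random


def simulated_annealing(cost, neighbor, start, steps=2000, t0=1.0,
                        alpha=0.995, seed=0):
    """Minimise cost(); return the best configuration seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_cost = cost(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= alpha  # geometric cooling (assumed schedule)
    return best, best_cost


def toy_gap(cfg):
    """Hypothetical gap between analytic bound and simulated delay."""
    burst, rate = cfg
    return (burst - 7) ** 2 + (rate - 3) ** 2


def step(cfg, rng):
    """Hypothetical move: nudge each parameter by -1, 0 or +1."""
    burst, rate = cfg
    return (burst + rng.choice((-1, 0, 1)), rate + rng.choice((-1, 0, 1)))
```

Calling `simulated_annealing(toy_gap, step, (0, 0))` returns the configuration with the smallest gap encountered during the run, which is the role the tightness search plays in the abstract.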

**IP4-15, 665** (Best Paper Award Candidate)

**ANALYSIS AND EVALUATION OF PER-FLOW DELAY BOUND FOR MULTIPLEXING MODELS**

**Speakers:** Yanchen Long¹, Zhonghai Lu² and Xiaolang Yan³

**Abstract**

Multiplexing models are common in resource sharing communication media such as buses, crossbars and networks. While sending packets over a multiplexing node, the packet delay bound can be computed using network calculus models. The tightness of such delay bound remains an open problem. This paper studies the multiplexing models for weighted round robin scheduling with different traffic arrival curves, and analyzes per-flow packet delay bounds with different service properties. We empirically evaluate the tightness of the delay bounds. Our results show the quality of different analysis models, and how influential each parameter is to tightness.
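
As a small worked illustration of the kind of bound the abstract refers to, the sketch below uses the standard network calculus result for a token-bucket arrival curve α(t) = σ + ρt served by a latency-rate service curve β(t) = R·max(0, t − T): when ρ ≤ R, the delay bound is T + σ/R. The mapping from weighted round robin weights to R and T is not reproduced here; both are treated as given inputs, and the function names and the tightness ratio are illustrative assumptions.

```python
def delay_bound(sigma, rho, rate, latency):
    """Delay bound T + sigma/R for a (sigma, rho) flow on a latency-rate server."""
    if rho > rate:
        raise ValueError("long-term arrival rate exceeds service rate: no finite bound")
    return latency + sigma / rate


def tightness(simulated_max_delay, bound):
    """Ratio of the worst observed delay to the analytic bound (1.0 is tight)."""
    return simulated_max_delay / bound
```

For instance, a flow with burst 4 and rate 1 on a server with rate 2 and latency 3 has a bound of 5 time units; if simulation observes a worst delay of 4, the tightness ratio is 0.8.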

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>9.5.1</td>
<td>INTRODUCTION TO RAP (RESILIENCE ARTICULATION POINT)</td>
<td>Andreas Herkersdorf, TU München, DE</td>
</tr>
</tbody>
</table>

Time Label Presentation Title Authors

08:45 9.5.2 SYSTEM LEVEL DESIGN USING RAP (RESILIENCE ARTICULATION POINT)

Speaker:

Ulf Schlichtmann, Technische Universität München, DE

Abstract

We will demonstrate how technology characteristics can be included in system-level reliability analysis using the RAP (Resilience Articulation Point) model. The specific example of a two-wheeled robot will be used.

09:00 9.5.3 CROSS-LAYER RELIABILITY IN THE DESIGN OF AN ERROR RESILIENT COMMUNICATION SYSTEM

Speaker:

Norbert Wehn, University of Kaiserslautern, DE

09:15 9.5.4 RIIF - TOWARD A STANDARD APPROACH FOR CREATING RELIABILITY MODELS FOR COMPLEX SILICON DEVICES

Speaker:

Adrian Evans, IROC Technologies, FR

Abstract

Complex silicon devices are increasingly controlling critical systems where safety and reliability are key concerns. Silicon technology is subject to numerous failure modes which can be broadly classified into soft-error effects (due to natural radiation) and life-time effects (e.g. electro-migration, NBTI, HCI). It is necessary to consider all of these failure modes and how they propagate through the system and produce user-visible effects. There are no consistent tools or methodologies to address this problem. Current ad-hoc approaches are not able to cope with the diversity of technology failure modes, increased design sizes and the complex relationships between consumers and suppliers of electronic components. RIIF (Reliability Information Interchange Format) is an initiative to develop a standard modelling language for specifying the failure mechanisms in silicon devices and systems built using these devices. In this session we give a brief overview of RIIF and present an example that highlights some of the challenges in reliability modelling.

09:30 9.5.5 TEST PERSPECTIVES

Speaker:

Jacob Abraham, UT Austin, US

09:45 9.5.6 INDUSTRIAL PERSPECTIVE

Speaker:

Sani Nassif, IBM, US

10:00 End of session

Coffee Break in Exhibition Area

On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

9.6 Schedulability analysis

Date: Thursday 27 March 2014

Time: 08:30 - 10:00

Location / Room: Konferenz 4

Chair:

Giuseppe Lipari, ENS - Cachan, FR

Co-Chair:

Benny Akesson, CTU Prague, CZ

This session deals with scheduling and schedulability analysis of real-time systems. In particular, it presents different models of scheduling, including fixed and dynamic priority, and real-time calculus.

Time Label Presentation Title Authors

08:30 9.6.1 RATE-ADAPTIVE TASKS: MODEL, ANALYSIS, AND DESIGN ISSUES

Speakers:

Giorgio Buttazzo¹, Enrico Bini² and Darren Buttle³

¹Scuola Superiore Sant’Anna, IT; ²Lund University, SE; ³ETAS-PGA/PRM-E, DE

Abstract

In automotive systems, some of the engine control tasks are triggered by specific crankshaft rotation angles and are designed to adapt their functionality based on the angular velocity of the engine. This paper proposes a new task model for specifying this type of real-time activity and presents an approach for analyzing the system feasibility under deadline scheduling for different scenarios. In particular, a feasibility test is derived for tasks under steady-state conditions (constant speed), as well as in dynamic conditions (constant acceleration). A design method is also discussed to determine the most suitable switching speeds for adapting the functionality of tasks without exceeding a desired utilization. Finally, a number of research directions are highlighted to extend the current results to more complex and realistic scenarios.

09:00 9.6.2 ACCEPTANCE AND RANDOM GENERATION OF EVENT SEQUENCES UNDER REAL TIME CALCULUS CONSTRAINTS

Speakers:

Kajori Banerjee and Pallab Dasgupta, Indian Institute of Technology Kharagpur, IN

Abstract

Simulation platforms for complex networked real-time systems require random input pattern generators for simulating input distributions. They also require monitors for checking whether the output of the system satisfies the desired throughput. In this paper we study the acceptance and generation problems in a setting where the constraints defining the input distributions as well as the constraints defining the expected output distributions are specified in real time calculus (RTC). We prove that event patterns satisfying a given set of RTC constraints can be described by an omega-regular language. We propose a method for constructing an automaton that can be used for online generation of random admissible event patterns. This is significant, considering the known problems of deadlock in less informed generators for streams satisfying RTC constraints.

09:30 9.6.3 GENERAL AND EFFICIENT RESPONSE TIME ANALYSIS FOR EDF SCHEDULING

Speakers:

Nan Guan and Wang Yi, Uppsala University, SE

Abstract

Response Time Analysis (RTA) is one of the key problems in real-time system design. This paper proposes new RTA methods for EDF scheduling, with general system models where workload and resource availability are represented by request/demand bound functions and supply bound functions. The main idea is to derive response time upper bounds by lower-bounding the slack times. We first present a simple over-approximate RTA method, which lower-bounds the slack time by measuring the "horizontal distance" between the demand bound function and the supply bound function. Then we present an exact RTA method based on the above idea but eliminating the pessimism in the first analysis. This new exact RTA method not only allows precise analysis of more general system models than existing EDF RTA techniques, but also significantly improves analysis efficiency. Experiments are conducted to show the efficiency improvement of our new RTA technique, and the tradeoffs between analysis precision and efficiency of the two methods are discussed.
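
To make the notion of a demand bound function concrete, the sketch below implements the classic processor-demand feasibility check for EDF on a dedicated unit-speed processor, where the supply in any interval of length t is simply t. It is not the response time analysis proposed in the paper. Tasks are assumed to be sporadic with integer parameters, written as hypothetical `(wcet, deadline, period)` tuples.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def dbf(tasks, t):
    """Demand bound: total work of jobs released and due within [0, t]."""
    return sum(((t - d) // p + 1) * c for c, d, p in tasks if t >= d)


def edf_feasible(tasks):
    """Check dbf(t) <= t at every absolute deadline up to the horizon."""
    if sum(Fraction(c, p) for c, _, p in tasks) > 1:
        return False  # utilisation above 1 can never be feasible
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for _, _, p in tasks))
    horizon = hyper + max(d for _, d, _ in tasks)
    checkpoints = {d + k * p
                   for _, d, p in tasks
                   for k in range((horizon - d) // p + 1)}
    return all(dbf(tasks, t) <= t for t in checkpoints)
```

For example, `edf_feasible([(1, 4, 4), (2, 6, 6)])` holds, while two tasks that each need 2 units within a deadline of 2 fail at t = 2 because the demand is 4. The gap t − dbf(t) at each checkpoint is the kind of slack that the abstract's "horizontal distance" argument lower-bounds in a more general setting.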
### 9.7 Timing Analysis and Cell Design

**Date:** Thursday 27 March 2014  
**Time:** 08:30 - 10:00  
**Location / Room:** Konferenz 5  
**Chair:** Jose Monteiro, INESC-ID / Tecnico, ULisboa, PT  
**Co-Chair:** Elena Dubrova, Royal Institute of Technology, SE  

The papers in this session present static timing techniques and tools for the analysis and synthesis of logic circuits. The papers take into account new aspects of timing analysis like variability, leakage and sign-off.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>9.7.1</td>
<td><strong>THE SCHEDULABILITY REGION OF TWO-LEVEL MIXED-CRITICALITY SYSTEMS BASED ON EDF-VD</strong></td>
<td>Dirk Mueller and Alejandro Masrur, TU Chemnitz, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>The algorithm Earliest Deadline First with Virtual Deadlines (EDF-VD) was recently proposed to schedule mixed-criticality task sets consisting of high-criticality (HI) and low-criticality (LO) tasks. EDF-VD distinguishes between HI and LO mode. In HI mode, the HI tasks may require executing for longer than in LO mode. As a result, in LO mode, EDF-VD assigns virtual deadlines to HI tasks (i.e., it uniformly downscales deadlines of HI tasks) to account for an increase of workload in HI mode. Different schedulability conditions have been proposed in the literature; however, the schedulability region to fully characterize EDF-VD has not been investigated so far. In this paper, we review EDF-VD's schedulability criteria and determine its schedulability region to better understand and design mixed-criticality systems. Based on this result, we show that EDF-VD has a schedulability region that is around 85% larger than that of the Worst-Case Reservations (WCR) approach.</td>
</tr>
<tr>
<td>09:00</td>
<td>9.7.2</td>
<td><strong>STATISTICAL STATIC TIMING ANALYSIS USING A SKEW-NORMAL CANONICAL DELAY MODEL</strong></td>
<td>Vijaykumar M and V Vasudevan, Department of Electrical Engineering Indian Institute of Technology Madras, IN</td>
</tr>
</tbody>
</table>
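
The virtual-deadline scaling mentioned in the EDF-VD abstract above can be sketched with the utilization-based sufficient test commonly cited for implicit-deadline, two-level task sets. This is an assumption-laden illustration rather than the schedulability region derived in the paper: `u_lo_lo` is the LO-mode utilization of the LO tasks, `u_hi_lo` and `u_hi_hi` are the LO-mode and HI-mode utilizations of the HI tasks, and the function name is hypothetical.

```python
def edf_vd_scaling(u_lo_lo, u_hi_lo, u_hi_hi):
    """Return the deadline scaling factor x, or None if the test fails.

    x == 1.0 means plain EDF on worst-case reservations already suffices.
    """
    if u_lo_lo + u_hi_hi <= 1:
        return 1.0
    if u_lo_lo >= 1:
        return None
    # Shrink HI-task deadlines by x in LO mode to leave room for HI mode.
    x = u_hi_lo / (1 - u_lo_lo)
    if x * u_lo_lo + u_hi_hi <= 1:
        return x
    return None
```

With `u_lo_lo = 0.5`, `u_hi_lo = 0.25` and `u_hi_hi = 0.6`, worst-case reservations fail (0.5 + 0.6 > 1), but the test passes with x = 0.5, which illustrates why the EDF-VD region is larger than the WCR region.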
A DEEP LEARNING METHODOLOGY TO PROLIFERATE GOLDEN SIGNOFF TIMING

Speakers:
Seung-Soo Han¹, Andrew B. Kahng², Siddhartha Nath² and Ashok S. Vidyakumar²
¹Myongji University, Yongin, KR; ²University of California, San Diego, US

Abstract
Signoff timing analysis remains a critical element in the IC design flow. Multiple signoff corners, libraries, design methodologies, and implementation flows make timing closure very complex at advanced technology nodes. Reported timing slacks directly affect chip area and power by forcing additional buffering or sizing (negative slacks), or limiting area and power recovery (positive slacks). Design teams often wish to ensure that one tool’s timing reports are neither optimistic nor pessimistic with respect to another tool’s reports. The resulting “correlation” problem is highly complex because tools contain millions of lines of black-box and legacy code, licenses prevent any reverse-engineering of algorithms, and the nature of the problem is seemingly “unbounded” across possible designs, timing paths, and electrical parameters. In this work, we apply a “big-data” mindset to approach the timer correlation problem. We develop a machine learning-based tool, Golden Timer extension (GTX), to correct divergence in flip-flop setup time, cell arc delay, wire delay, stage delay, and path slack at timing endpoints between timers. Our models are developed with datasets of >300K data points for cell, wire, and stage delays and >30K data points for path slack and flip-flop setup time. We propose a methodology to apply GTX to two arbitrary timers, and we evaluate scalability of GTX across multiple designs and foundry technologies / libraries, both with and without signal integrity analysis. Our experimental results show reduction in divergence between timing tools from 139.3ps to 21.1ps (i.e., 6.6×) in endpoint slack, from 25.6ps to 2.4ps (i.e., 10× reduction) in flip-flop setup time, from 454.4ps to 51.9ps (i.e., 8.7× reduction) in cell delay, from 194.8ps to 17.4ps (i.e., 11.2× reduction) in wire delay, and from 117ps to 23.8ps (4.9× reduction) in stage delay. The average (mean) divergence in timing reports after applying GTX is almost zero. 
We further demonstrate the incremental application of our methods so that models can be adapted to any outlier discrepancies when new designs are taped out in the same technology / library. Last, we demonstrate that GTX can also correlate timing reports between signoff and design implementation tools.

AGING-AWARE STANDARD CELL LIBRARY DESIGN

Speakers:
Saman Kiamehr¹, Farshad Firouzi¹, Mojtaba Ebrahimi² and Mehdi Tahoori¹
¹Karlsruhe Institute of Technology (KIT), DE; ²Karlsruhe Institute of Technology, DE

Abstract
Transistor aging, mostly due to Bias Temperature Instability (BTI), is one of the major sources of unreliability at nano-scale technology nodes. BTI causes the circuit delay to increase and eventually leads to a decrease in the circuit lifetime. Typically, standard cells in the library are optimized according to the design-time delay; however, due to the asymmetric effect of BTI, the rise and fall delays might become significantly imbalanced over the lifetime. In this paper, the BTI effect is mitigated by balancing the rise and fall delays of the standard cells at the expected lifetime. We find an optimal trade-off between the increase in the size of the library and the lifetime improvement (timing margin reduction) by non-uniform extension of the library cells for various ranges of the input signal probabilities. The simulation results reveal that our technique can prolong the circuit lifetime by around 150% with a negligible area overhead.

PASS-XNOR LOGIC: A NEW LOGIC STYLE FOR P-N JUNCTION BASED GRAPHENE CIRCUITS

Speakers:
Valerio Tenace, Andrea Calimera, Enrico Macii and Massimo Poncino, Politecnico di Torino, IT

Abstract
In this work we introduce a new logic style for p-n junction based digital graphene circuits: the pass-XNOR logic style. It enables the realization of compact, energy-efficient circuits that better exploit the characteristics of graphene. We first show how a single p-n junction can be conceived as a pass-XNOR gate, i.e., a transmission gate with embedded logic functionality, the XNOR Boolean operator. Secondly, we propose a smart integration strategy in which series/parallel connections of pass-XNOR gates implement AND/OR logical operations and, therefore, all possible truth tables. Experimental results on a set of representative logic functions show the superiority of pass-XNOR logic circuits w.r.t. standard CMOS circuits and graphene circuits that use p-n junctions in a complementary-like structure.
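
The composition rule described in this abstract (series connection for AND, parallel connection for OR) can be illustrated with a purely behavioural model. The sketch below is an idealized Boolean abstraction, not an electrical model of a graphene p-n junction, and the two example networks (`xor_network`, `majority_network`) are hypothetical illustrations rather than circuits taken from the paper.

```python
from itertools import product


def pass_xnor(a, b):
    """Idealized pass-XNOR gate: the path conducts (True) when a equals b."""
    return a == b


def series(*paths):
    """Paths in series conduct only if every path conducts (logical AND)."""
    return all(paths)


def parallel(*paths):
    """Paths in parallel conduct if at least one path conducts (logical OR)."""
    return any(paths)


def xor_network(a, b):
    """Hypothetical example: XOR from a single gate fed with the complement of b."""
    return pass_xnor(a, not b)


def majority_network(a, b, c):
    """Hypothetical example: 3-input majority as an OR of three AND branches.

    Tying one gate input to logic 1 makes pass_xnor(x, True) conduct iff x is 1.
    """
    def literal(x):
        return pass_xnor(x, True)

    return parallel(
        series(literal(a), literal(b)),
        series(literal(a), literal(c)),
        series(literal(b), literal(c)),
    )


if __name__ == "__main__":
    for a, b, c in product([False, True], repeat=3):
        print(a, b, c, majority_network(a, b, c))
```

Because any truth table can be written as an OR of AND terms over literals, and a literal or its complement is available by tying one gate input to a constant, this abstraction is consistent with the claim that all truth tables are reachable.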

MIXED ALLOCATION OF ADJUSTABLE DELAY BUFFERS COMBINED WITH BUFFER SIZING IN CLOCK TREE SYNTHESIS OF MULTIPLE POWER MODE DESIGNS

Speakers:
Kitaek Park, Geunho Kim and Taewhan Kim, Seoul National University, KR

Abstract
Recently, many works have shown that an adjustable delay buffer (ADB), whose delay can be adjusted dynamically, can effectively solve the clock skew variation problem in designs with multiple power modes. However, all previous works on ADB allocation share two critical limitations: the delays adjusted by an ADB are always increments, and low-cost buffer sizing has either never been considered or not been treated as a primary option. To demonstrate how effective overcoming these two limitations is in meeting the clock skew constraint, we characterize two types of ADBs, CADB (capacitor-based ADB) and IADB (inverter-based ADB), and show that the delays adjusted by an IADB can be decremented. We further show that clock skew violations in some clock trees of multiple power mode designs can be resolved by applying buffer sizing together with only a small number of IADBs and CADBs.

9.8 Embedded Tutorial: Memcomputing: the Cape of Good Hope

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Exhibition Theatre
Organisers:
Yiyu Shi, Missouri University of Science & Technology, US
Hung-Ming Chen, National Chiao Tung University, TW
Chair:
Tsung-Yi Ho, CSIE, NCKU, TW
Co-Chair:
Hung-Ming Chen, National Chiao Tung University, TW

Energy efficiency has emerged as a major barrier to performance scalability for modern processors. On the other hand, significant breakthroughs have recently been achieved in memory technologies. As such, the fascinating idea of memcomputing (i.e., using memory for computation purposes) has drawn wide attention from both academia and industry as an effective remedy. Compared with conventional logic computing, a memory array provides a large set of parallel resources with high bandwidth, which can be configured to perform computing in a spatial/temporal manner, leading to a dramatic reduction in processor-memory traffic. Moreover, memory computing brings the computing engine close to the data, thus drastically reducing the von Neumann bottleneck. Finally, it exploits the advances in memory technologies and integration approaches (e.g. 3D integration) to achieve better technology scalability. This special session offers a broad-spectrum treatment (devices, processes and systems) of this hot topic to the general CAD community, hoping to inspire more contributions from the design automation perspective.

Today's smartphones and tablets contain multiple cellular modems to support 2G/3G/4G standards, including Long Term Evolution (LTE). They run on complex multi-processor hardware platforms and have to meet hard real-time constraints. Dataflow modeling can be used to design an LTE receiver. Static dataflow allows a rich set of analysis techniques, but is too restrictive to model the dynamic behavior in many realistic applications, including LTE receivers. Dynamic dataflow allows modeling of many realistic applications, but does not support rigorous temporal analysis. Mode-Controlled Dataflow (MCDF) is a restricted form of dynamic dataflow, and allows the same analysis techniques as static dataflow, in principle. We demonstrate that MCDF is sufficiently expressive to handle the dynamic behavior of a realistic LTE receiver, by systematically and stepwise developing a complete MCDF model for an LTE receiver.
IP4-4 MODEL-BASED ACTOR MULTIPLEXING WITH APPLICATION TO COMPLEX COMMUNICATION PROTOCOLS

Speakers:
Christian Zebelein1, Christian Haubelt1, Joachim Falk1, Tobias Schwarzer1 and Jürgen Teich2
1University of Rostock, DE; 2University of Erlangen-Nuremberg, DE

Abstract
We propose a dynamic scheduling approach for the concurrent execution of logical actor instances on a single synthesized actor instance. Based on a formal dataflow model of computation, the proposed approach can be applied to a wide range of applications in a model-based design flow. As a case study, we evaluate a bus-cycle-accurate SystemC model based on an InfiniBand network adapter in a PCI Express system.

IP4-5 A NOVEL MODEL FOR SYSTEM-LEVEL DECISION MAKING WITH COMBINED ASP AND SMT SOLVING

Speakers:
Alexander Biewer1, Jens Gladigau1 and Christian Haubelt2
1Robert Bosch GmbH, DE; 2University of Rostock, DE

Abstract
In this paper, we present a novel model enabling system-level decision making for time-triggered many-core architectures in automotive systems. The proposed application model includes shared data entities that need to be bound to memories during decision making. As a key enabler to our approach, we explicitly separate computation and shared memory communication over a network-on-chip (NoC). To deal with contention on a NoC, we model the necessary basis to implement a time-triggered schedule that guarantees freedom of interference. We compute fundamental design decisions, namely (a) spatial binding, (b) multi-hop routing, and (c) time-triggered scheduling, by a novel coupling of answer set programming (ASP) with satisfiability modulo theories (SMT) solvers. First results of an automotive case study demonstrate the applicability of our method for complex real-world applications.

IP4-6 DESPERATE: SPEEDING-UP DESIGN SPACE EXPLORATION BY USING PREDICTIVE SIMULATION SCHEDULING

Speakers:
Giovanni Mariani, Gianluca Palermo, Vittorio Zaccaria and Cristina Silvano, Politecnico di Milano, IT

Abstract
Design Space Exploration (DSE) is the problem of finding the best architecture configuration in a platform-based design problem. To accurately evaluate a configuration, computationally expensive simulations are required. A common approach to reduce DSE execution time is to use analytic performance prediction models to approximate some of the required simulations, thus pruning the design space by removing bad configuration candidates. In this paper we demonstrate that state-of-the-art analytic techniques to speed up the DSE process cannot fully exploit the potential of a parallel simulation environment. We demonstrate that, when different simulations can be run in parallel, predicting simulation time to better schedule the simulations on the parallel simulation environment is a more profitable approach, with a speedup of more than 2x compared to state-of-the-art approaches.
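
To see why predicted simulation times can help on a parallel environment, consider a minimal sketch. The abstract does not say which scheduling policy the authors use, so the code below substitutes a textbook heuristic, longest-processing-time-first list scheduling, and compares it with dispatching simulations in arrival order. The function `list_schedule`, the example run times in `TIMES` and the worker count are all hypothetical.

```python
import heapq


def list_schedule(predicted_times, n_workers, longest_first):
    """Greedy list scheduling of simulations onto identical parallel workers.

    predicted_times: dict mapping a simulation id to its (predicted) run time.
    longest_first:   if True, dispatch in decreasing predicted time (LPT);
                     if False, dispatch in the dict's insertion order.
    Returns (assignment, makespan).
    """
    jobs = list(predicted_times.items())
    if longest_first:
        jobs.sort(key=lambda kv: kv[1], reverse=True)

    # Min-heap of (accumulated load, worker index): pop yields the least-loaded worker.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}

    for sim_id, run_time in jobs:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(sim_id)
        heapq.heappush(heap, (load + run_time, worker))

    makespan = max(load for load, _ in heap)
    return assignment, makespan


# Hypothetical workload: four short simulations and one long one, two workers.
TIMES = {"a": 2.0, "b": 2.0, "c": 2.0, "d": 2.0, "e": 8.0}

if __name__ == "__main__":
    print(list_schedule(TIMES, 2, longest_first=False)[1])
    print(list_schedule(TIMES, 2, longest_first=True)[1])
```

In this toy workload, dispatching in arrival order leaves the long simulation until last and stretches the makespan, whereas knowing the run times in advance lets the long simulation start first while the short ones fill the other worker.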

IP4-7 COMIK: A PREDICTABLE AND CYCLE-ACCURATELY COMPOSABLE REAL-TIME MICROKERNEL

Speakers:
Andrew Nelson1, Ashkan Beyranvand Nejad1, Anca Molnos2, Martijn Koedam3 and Kees Goossens3
1TU Delft, NL; 2CEA Leti, FR; 3TU Eindhoven, NL

Abstract
The functionality of embedded systems is ever increasing. This has led to mixed time-criticality systems, where applications with a variety of real-time requirements co-exist on the same platform and share resources. Due to inter-application interference, verifying the real-time requirements of such systems is generally non-trivial. In this paper, we present the CoMik microkernel that provides temporally predictable and composable processor virtualisation. CoMik’s virtual processors are cycle-accurately composable, i.e. their timing cannot affect the timing of co-existing virtual processors by even a single cycle. Real-time applications executing on dedicated virtual processors can therefore be verified and executed in isolation, simplifying the verification of mixed time-criticality systems. We demonstrate these properties through experimentation on an FPGA prototyped hardware platform.

IP4-8 UTILIZATION-AWARE LOAD BALANCING FOR THE ENERGY EFFICIENT OPERATION ON THE BIG.LITTLE PROCESSOR

Speakers:
Myungsun Kim1, Kibom Kim1, James Geraci1 and Seongsoo Hong3
1Samsung Electronics, KR; 2SAMSUNG Electronics, KR; 3Seoul National University, KR

Abstract
ARM’s big.LITTLE architecture introduces the opportunity to optimize power consumption by selecting the core type most suitable for a level of processing demand. To take advantage of this new axis of optimization, we introduce the processor utilization factor into the Linux kernel’s load balancing algorithm, after carefully analyzing the big.LITTLE power management mechanism in Linux and deriving its state diagram representation. Our mechanism improves the Linux kernel’s ability to assign tasks to cores in an energy-efficient manner without having to make it directly aware of the available core types. Our experiments on a real test bed show that our algorithm reduces energy consumption by up to 11.35% relative to the standard Linux scheduler, with almost no corresponding reduction in performance.

IP4-9 HEVCDTM: APPLICATION-DRIVEN DYNAMIC THERMAL MANAGEMENT FOR HIGH EFFICIENCY VIDEO CODING

Speakers:
Daniel Palomino1, Muhammad Shafique2, Hussam Amrouch2, Altamiro Susin3 and Jörg Henkel2
1Karlsruhe Institute of Technology (KIT), BR; 2Karlsruhe Institute of Technology (KIT), DE; 3Federal University of Rio Grande do Sul, BR

Abstract
This paper presents an application-driven algorithm for Dynamic Thermal Management (DTM) for High Efficiency Video Coding (HEVC). For efficient design of such a DTM policy, we perform an offline thermal analysis of an HEVC encoder and demonstrate the impact of different video sequences and different coding configurations on the processor temperature. Our thermal analysis is leveraged to develop an efficient application-driven DTM policy that performs temperature-aware coding along with an application-driven design of DTM knobs (e.g., frequency scaling) in order to meet the temperature constraints while still providing high video quality (i.e. PSNR loss < 0.01dB). For accurate thermal analysis and evaluation, we deploy an infrared camera-based thermal measurement setup that, in contrast to state-of-the-art setups, does not require adding any extra layer on top of the measured chip, thus allowing the camera to accurately capture the infrared emissions from the die.

IP4-10 IMPROVING EFFICIENCY OF EXTENSIBLE PROCESSORS BY USING APPROXIMATE CUSTOM INSTRUCTIONS

Speakers:
Mehdi Kamal1, Amin Ghasem Azar1, Ali Afzali-Kusha1 and Massoud Pedram2
1University of Tehran, IR; 2University of Southern California, US

Abstract
In this paper, we propose to move the conventional extensible processor design flow to the approximate computing domain to gain more speedup. In this domain, the instruction set architecture (ISA) design flow selects both exact and approximate custom instructions (CIs). The proposed approach could be used for the applications where imprecise results may be tolerated. In the CI identification phase of the flow, the CIs which do not satisfy the maximum propagation delay but can provide approximate results also may be included in the CI candidate set. Next, in the selection phase, we propose a merit function which selects CIs with higher cycle savings and small error rates. The efficacy of the proposed approximate design flow is investigated using the case studies of the discrete cosine transform (DCT) and inverse DCT (IDCT) of the MPEG2 application. Also, the impact of the process variation on the imprecision of the results is investigated.

IP4-11 PROBABILISTIC STANDARD CELL MODELING CONSIDERING NON-GAUSSIAN PARAMETERS AND CORRELATIONS

Speakers:
André Lange1, Christoph Sohrmann1, Roland Jancke1, Joachim Haase1, Ingolf Lorenz2 and Ulf Schlichtmann3
1Fraunhofer Institute for Integrated Circuits (IIS), Design Automation Division (EAS), DE; 2GLOBALFOUNDRIES Inc., DE; 3Technische Universität München, DE

Abstract
Variability continues to pose challenges to integrated circuit design. With statistical static timing analysis and high-yield estimation methods, solutions to particular problems exist, but they do not allow a common view on performance variability including potentially correlated and non-Gaussian parameter distributions. In this paper, we present a probabilistic approach for variability modeling as an alternative: model parameters are treated as multi-dimensional random variables. Such a fully multivariate statistical description formally accounts for correlations and non-Gaussian random components. Statistical characterization and model application are introduced for standard cells and gate-level digital circuits. Example analyses of circuitry in a 28 nm industrial technology illustrate the capabilities of our modeling approach.

IP4-12 DYNAMIC CONSTRUCTION OF CIRCUITS FOR REACTIVE TRAFFIC IN HOMOGENEOUS CMPS

Speakers:
Marta Ortín-Obón1, Darío Suárez-Gracia1, María Villarroya-Gaudó1, Cruz Izu2 and Víctor Viñals-Yúfera1
1University of Zaragoza, ES; 2University of Adelaide, AU

Abstract
Networks on Chip (NoCs) have a large impact on system performance, area and energy. Considering the characteristics of the memory subsystem while designing the NoC helps identify improvement opportunities and build more efficient designs. Leveraging the frequent request-reply pattern, our proposal dynamically builds the reply path in advance, is able to share circuits between messages, and even removes some implicit replies, significantly reducing NoC latency. A careful implementation of this circuit reservation mechanism achieves an average 17% reduction in router energy consumption, 8% smaller router area and a 2% system performance increase, compared with its baseline counterpart.

IP4-13 IMPROVING HAMILTONIAN-BASED ROUTING METHODS FOR ON-CHIP NETWORKS: A TURN MODEL APPROACH

Speakers:
Poona Bahrebar and Dirk Stroobandt, Ghent University, BE

Abstract
The overall performance of Multi-Processor System-on-Chip (MPSoC) platforms depends highly on the efficient communication among their cores in the Network-on-Chip (NoC). Routing algorithms are responsible for the on-chip communication and traffic distribution through the network. Hence, designing efficient and high-performance routing algorithms is of significant importance. In this paper, a deadlock-free and highly adaptive path-based routing method is proposed without using virtual channels. This method strives to exploit the maximum number of minimal paths between any source and destination pair. The simulation results in terms of performance and power consumption demonstrate that the proposed method significantly outperforms the other adaptive and non-adaptive schemes. This efficiency is achieved by reducing the number of hotspots and smoothly distributing the traffic across the network.

IP4-14 EDA TOOLS TRUST EVALUATION THROUGH SECURITY PROPERTY PROOFS

Speakers:
Yier Jin, The University of Central Florida, US

Abstract
The security concerns of EDA tools have long been ignored because IC designers and integrators only focus on their functionality and performance. This lack of trusted EDA tools hampers hardware security researchers' efforts to design trusted integrated circuits. To address this concern, a novel EDA tools trust evaluation framework is proposed to ensure the trustworthiness of EDA tools through their functional operation, rather than by scrutinizing the software code. As a result, the newly proposed framework lowers the evaluation cost and is a better fit for hardware security researchers. To support the EDA tools evaluation framework, a new gate-level information assurance scheme is developed for security property checking on any gate-level netlist. Helped by the gate-level scheme, we expand the territory of proof-carrying based IP protection from RT-level designs to gate-level netlists, so that most commercially traded third-party IP cores are under the protection of proof-carrying based security properties. Using a sample AES encryption core, we successfully prove the trustworthiness of Synopsys Design Compiler in generating a synthesized netlist.

IP4-15 BEST PAPER AWARD CANDIDATE

ANALYSIS AND EVALUATION OF PER-FLOW DELAY BOUND FOR MULTIPLEXING MODELS

Speakers:
Yanchen Long1, Zhonghai Lu2 and Xiaolang Yan3
1Zhejiang University and KTH Royal Institute of Technology, SE; 2KTH Royal Institute of Technology, SE; 3Zhejiang University, CN

Abstract
Multiplexing models are common in resource sharing communication media such as buses, crossbars and networks. While sending packets over a multiplexing node, the packet delay bound can be computed using network calculus models. The tightness of such delay bound remains an open problem. This paper studies the multiplexing models for weighted round robin scheduling with different traffic arrival curves, and analyzes per-flow packet delay bounds with different service properties. We empirically evaluate the tightness of the delay bounds. Our results show the quality of different analysis models, and how influential each parameter is to tightness.
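
For readers unfamiliar with network calculus, the kind of delay bound referred to in this abstract can be illustrated with the standard textbook case. The sketch assumes a token-bucket arrival curve α(t) = σ + ρt and a rate-latency service curve β(t) = R·max(0, t − T), for which the delay bound is T + σ/R whenever ρ ≤ R. How a weighted round robin scheduler maps onto R and T is not given in the abstract and is not modelled here; all parameter names and values are illustrative.

```python
def delay_bound(sigma, rho, rate, latency):
    """Closed-form bound T + sigma / R, valid when rho <= R; unbounded otherwise."""
    if rho > rate:
        return float("inf")
    return latency + sigma / rate


def delay_bound_numeric(sigma, rho, rate, latency, horizon=100.0, steps=10_000):
    """Brute-force check: largest horizontal gap between the two curves on a grid."""
    worst = 0.0
    for i in range(steps + 1):
        t = horizon * i / steps
        arrived = sigma + rho * t
        # Earliest time at which the service curve has caught up with 'arrived'.
        served_at = latency + arrived / rate
        worst = max(worst, served_at - t)
    return worst


if __name__ == "__main__":
    # Hypothetical flow: burst of 4 units, sustained rate 1; server rate 2, latency 3.
    print(delay_bound(4.0, 1.0, 2.0, 3.0))
    print(delay_bound_numeric(4.0, 1.0, 2.0, 3.0))
```

The closed form is an upper bound; whether it is tight for a given multiplexing model is exactly the question the paper studies empirically.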

IP4-17 AGING-AWARE STANDARD CELL LIBRARY DESIGN

Speakers:
Saman Kiamehr1, Farshad Firouzi1, Mojtaba Ebrahimi2 and Mehdi Tahoori2
1Karlsruhe Institute of Technology (KIT), DE; 2Karlsruhe Institute of Technology, DE

Abstract
Transistor aging, mostly due to Bias Temperature Instability (BTI), is one of the major unreliability sources at nano-scale technology nodes. BTI causes the circuit delay to increase and eventually leads to a decrease in the circuit lifetime. Typically, standard cells in the library are optimized according to the design time delay, however, due to the asymmetric effect of BTI, the rise and fall delays might become significantly imbalanced over the lifetime. In this paper, the BTI effect is mitigated by balancing the rise and fall delays of the standard cells at the expected lifetime. We find an optimal trade-off between the increase in the size of the library and the lifetime improvement (timing margin reduction) by non-uniform extension of the library cells for various ranges of the input signal probabilities. The simulation results reveal that our technique can prolong the circuit lifetime by around 150% with a negligible area overhead.

IP4-18 PASS-XNOR LOGIC: A NEW LOGIC STYLE FOR P-N JUNCTION BASED GRAPHENE CIRCUITS

Speakers:
Valerio Tenace, Andrea Calimera, Enrico Macii and Massimo Poncino, Politecnico di Torino, IT

Abstract
In this work we introduce a new logic style for p-n junction based digital graphene circuits: the pass-XNOR logic style. It enables the realization of compact, energy-efficient circuits that better exploit the characteristics of graphene. We first show how a single p-n junction can be conceived as a pass-XNOR gate, i.e., a transmission gate with embedded logic functionality, the XNOR Boolean operator. Secondly, we propose a smart integration strategy in which series/parallel connections of pass-XNOR gates implement AND/OR logical operations and, therefore, all possible truth tables. Experimental results on a set of representative logic functions show the superiority of pass-XNOR logic circuits w.r.t. standard CMOS circuits and graphene circuits that use p-n junctions in a complementary-like structure.
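
MIXED ALLOCATION OF ADJUSTABLE DELAY BUFFERS COMBINED WITH BUFFER SIZING IN CLOCK TREE SYNTHESIS OF MULTIPLE POWER MODE DESIGNS

Speakers:
Kitaek Park, Geunho Kim and Taewhan Kim, Seoul National University, KR

Abstract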

Recently, many works have shown that an adjustable delay buffer (ADB), whose delay can be adjusted dynamically, can effectively solve the clock skew variation problem in designs with multiple power modes. However, all previous works on ADB allocation share two critical limitations: the delays adjusted by an ADB are always increments, and low-cost buffer sizing has either never been considered or not been treated as a primary option. To demonstrate how effective overcoming these two limitations is in meeting the clock skew constraint, we characterize two types of ADBs, CADB (capacitor-based ADB) and IADB (inverter-based ADB), and show that the delays adjusted by an IADB can be decremented. We further show that clock skew violations in some clock trees of multiple power mode designs can be resolved by applying buffer sizing together with only a small number of IADBs and CADBs.

The first paper in this session gives an overview of alternative memory technologies and how each can contribute to or disrupt accepted memory hierarchies. Memory devices and technologies have undergone huge transformations in recent years, and many industrially viable replacements for conventional technologies are on the brink of entering the market.

In the temporal domain, an application's actual execution never varies by even a single clock cycle. Similarly, the energy and power behaviors of applications are also composable. As a result, applications can be designed, developed, verified, and executed in isolation. The virtual execution platforms (VEPs) are also predictable, meaning that all interference is bounded. This makes them virtualized also in terms of performance bounds, which enables firm real-time applications to be verified using formal performance analysis frameworks. The CompSOC platform uses the CoMik microkernel to implement virtual processors on each processor tile through temporal partitioning. Each application can use its own operating system (e.g. CompOSe, μC/OS-III) and model of computation (e.g. CSDF, KPN, TT) in its VEP, to suit its level of time criticality.

As more applications are integrated on a single SOC, the need arises for more dynamic behaviour. The system should be able to start, modify and stop applications at run time without affecting running applications. For this purpose the CompSOC platform has been extended with a predictable and composable resource management framework. It manages application bundles that contain 1) an application in the form of executables (ELFs on multiple processors), and 2) the specifications of the (one or more) particular VEPs that the application executes in, consisting of virtual processors, NOC connections, virtualised memories, etc. At run time, the resource management framework can dynamically load and start application bundles by creating a VEP and then loading, booting, and executing an application within it. VEPs can also be modified, stopped, and deleted at run time.

Our University Booth will present the virtual-execution-platform and application-bundle concepts using an interactive demonstrator. It will show that CompSOC has been extended with dynamic functionality without sacrificing its key strengths: composability and predictability. We will demonstrate this through the use of the resource management framework and application bundles, showing that we can dynamically create, modify and delete virtual execution platforms running a mixed time-criticality application at run time.

This session comprises three papers devoted to studying different aspects of wireless NoC design and optimization. The first paper focuses on energy efficiency, by effectively tuning the transmission power of on-chip antennas. The second paper compares the performance and power of different routing algorithms for wireless NoCs, while the third paper explores the adoption of wireless NoCs in 3D chip designs.

11:30 10.2.2 PERFORMANCE EVALUATION OF WIRELESS NOCS IN PRESENCE OF IRREGULAR NETWORK ROUTING STRATEGIES

Speakers: Paul Wettin, Jacob Murray, Ryan Kim, Xinmin Yu, Partha Pande and Deukhyoun Heo, Washington State University, US

Abstract

The millimeter (mm)-wave small-world wireless NoC (mSWNoC) is an enabling interconnect architecture to design high performance and low power multicore chips. As the mSWNoC has an overall irregular topology, it is extremely important to design suitable deadlock-free routing mechanisms for it. In this paper we quantify the latency, energy dissipation, and thermal profiles of mSWNoC architectures by incorporating irregular network routing strategies. We demonstrate that the latency, energy dissipation, and thermal profile are affected by the adopted routing methodologies. For the benchmarks considered, the variation in latency and energy dissipation is small. However, the network hotspot temperature can vary considerably depending on the exact routing strategy and the characteristics of the benchmark.

12:00 10.2.3 LOW-LATENCY WIRELESS 3D NOCS VIA RANDOMIZED SHORTCUT CHIPS

Speakers: Hiroki Matsutani1, Michihiro Koibuchi2, Ikki Fujiwara2, Takahiro Kagami1, Yasuhiro Take1, Tadahiro Kuroda1, Paul Bogdan3, Radu Marculescu4 and Hideharu Amano1

1Keio University, JP; 2National Institute of Informatics, JP; 3University of Southern California, US; 4Carnegie Mellon University, US

Abstract

In this paper, we demonstrate that by inducing a certain fraction of randomness into wireless 3D NoCs (where CMOS wireless links are used for vertical inter-chip communication) we can reduce the communication latency when considering the physical constraints of the 3D design space. Towards this end, we consider two cases, namely 1) replacing existing horizontal 2D links in a wireless 3D NoC with randomized shortcut NoC links and 2) enabling full connectivity by adding a randomized NoC layer to a wireless 3D system with no or partial horizontal connectivity. Consequently, the packet routing is optimized by exploiting both the existing and the newly added random NoC. Thus, by adding randomly wired shortcut NoCs to a wireless 3D system, one can strike a good balance between the modular design and the minimum randomness needed for achieving low latency. Experimental results show that by adding a random NoC chip to wireless 3D CMPs without built-in horizontal NoCs we can reduce the communication latency by as much as 26.2% when compared to that of adding a 2D mesh NoC. The application execution time and average flit transfer energy can also be improved accordingly.
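
The intuition that random shortcut links lower hop counts can be checked on a toy topology. The sketch below is a deliberately simplified stand-in for the paper's setup: it takes a single-layer 2D mesh, adds a few random links, and compares the average shortest-path hop count before and after. It models neither the wireless vertical links nor the physical constraints discussed in the abstract, and the mesh size, shortcut count and seed are arbitrary assumptions.

```python
import random
from collections import deque


def mesh_links(n):
    """Undirected links of an n x n 2D mesh; nodes are (row, col) tuples."""
    links = set()
    for r in range(n):
        for c in range(n):
            if r + 1 < n:
                links.add(((r, c), (r + 1, c)))
            if c + 1 < n:
                links.add(((r, c), (r, c + 1)))
    return links


def average_hops(nodes, links):
    """Mean shortest-path hop count over all ordered pairs of distinct nodes.

    Uses breadth-first search from every node; assumes the graph is connected.
    """
    adj = {v: [] for v in nodes}
    for u, v in links:
        adj[u].append(v)
        adj[v].append(u)
    total = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (len(nodes) * (len(nodes) - 1))


def add_random_shortcuts(nodes, links, count, seed=0):
    """Return a copy of 'links' with 'count' extra random links between distinct nodes."""
    rng = random.Random(seed)
    result = set(links)
    while len(result) < len(links) + count:
        u, v = rng.sample(nodes, 2)
        if (u, v) not in result and (v, u) not in result:
            result.add((u, v))
    return result


N = 4  # hypothetical mesh dimension
NODES = [(r, c) for r in range(N) for c in range(N)]
MESH = mesh_links(N)
MESH_HOPS = average_hops(NODES, MESH)
SHORTCUT_HOPS = average_hops(NODES, add_random_shortcuts(NODES, MESH, count=4))

if __name__ == "__main__":
    print(MESH_HOPS, SHORTCUT_HOPS)
```

Adding links can never lengthen a shortest path, so the second figure is at most the first; how much it drops depends on where the random links land, which is why the paper evaluates the trade-off experimentally.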

12:15 10.2.4 FAILURE ANALYSIS OF A NETWORK-ON-CHIP FOR REAL-TIME MIXED-CRITICAL SYSTEMS

Speakers: Eberle A. Rambo1, Alexander Tschiene1, Jonas Diemer1, Leonie Ahrendts1 and Rolf Ernst2

1Technische Universität Braunschweig, DE; 2TU Braunschweig, DE

Abstract

Multi- and many-core architectures using Networks-on-Chip (NoC) are being explored for use in real-time safety-critical applications for their performance and efficiency. Such systems must provide isolation between tasks that may present distinct criticality levels. The NoC is critical to maintain the isolation property as it is a heavily used shared resource. To meet safety-standard requirements, such architectures require a systematic evaluation of the effects of all possible failures such as in the form of a Failure Mode and Effects Analysis (FMEA). We present the results of a detailed system-level analysis of a typical real-time mixed-critical network-on-chip architecture. This comprises an FMEA and error effects classification regarding duration and isolation violation.

10.3 Green Computing Systems

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 1
Chair: Ayse Coskun, Boston University, US
Co-Chair: Martino Ruggiero, University of Bologna, IT

This session discusses techniques to improve energy efficiency in large-scale computing systems, many-core systems, servers, and the cloud. The papers in this session particularly emphasize the practical experiences in academia and in industry.

GLOBAL FAN SPEED CONTROL CONSIDERING NON-IDEAL TEMPERATURE MEASUREMENTS IN ENTERPRISE SERVERS

Speakers:
- Jungsoo Kim
- Mohamed M. Sabry
- David Atienza
- Kalyan Vaidyanathan
- Kenny Gross

Abstract

Time lag and quantization in temperature sensors in enterprise servers lead to stability concerns in existing variable fan speed control schemes. Stability challenges become further aggravated when multiple local controllers are running together with the fan control scheme. In this paper, we present a global control scheme which tackles the concerns on the stability of enterprise servers while reducing the performance degradation caused by the variable fan speed control scheme. We first present a stable fan speed control scheme based on the Proportional-Integral-Derivative (PID) controller by adaptively adjusting the PID parameters according to the operating fan speed and eliminating the fan speed oscillation caused by temperature quantization. Then, we present a global control scheme which coordinates control actions among multiple local controllers. In addition, it guarantees the server stability while minimizing the overall performance degradation. We validated the proposed control scheme using a cloud data center and an advanced virtual platform. Our experimental results show that the proposed fan control scheme is stable under the non-ideal temperature measurement system (10 s time lag and 1 °C quantization). Furthermore, the global control scheme enables multiple local controllers to run in a stable manner while reducing the performance degradation by up to 19.2% compared to conventional coordination schemes, with 19.1% savings in server power consumption.
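
The two ingredients named in this abstract, PID gains that depend on the operating fan speed and suppression of oscillation caused by quantized temperature readings, can be sketched as follows. This is not the authors' controller: the deadband used to ignore sub-quantization errors is an assumed mechanism, and the class name `FanPID`, the speed bands, gains and RPM limits are all hypothetical placeholder values.

```python
class FanPID:
    """Discrete PID fan controller with speed-dependent gains and a deadband."""

    def __init__(self, setpoint_c, gain_schedule, base_rpm,
                 deadband_c=1.0, dt_s=1.0, rpm_limits=(1000.0, 10000.0)):
        self.setpoint_c = setpoint_c
        # List of (upper RPM of band, (kp, ki, kd)), sorted by ascending RPM.
        self.gain_schedule = gain_schedule
        self.base_rpm = base_rpm
        self.deadband_c = deadband_c  # assumed equal to the sensor quantization step
        self.dt_s = dt_s
        self.rpm_limits = rpm_limits
        self.integral = 0.0
        self.prev_error = 0.0
        self.rpm = base_rpm

    def _gains(self):
        """Pick the gain set of the first band that contains the current speed."""
        for upper_rpm, gains in self.gain_schedule:
            if self.rpm <= upper_rpm:
                return gains
        return self.gain_schedule[-1][1]

    def update(self, temp_c):
        """Return the new fan speed command for one quantized temperature reading."""
        error = temp_c - self.setpoint_c
        if abs(error) < self.deadband_c:
            # Error is below one quantization step: hold the speed to avoid hunting.
            self.prev_error = error
            return self.rpm
        kp, ki, kd = self._gains()
        self.integral += error * self.dt_s
        derivative = (error - self.prev_error) / self.dt_s
        self.prev_error = error
        raw = self.base_rpm + kp * error + ki * self.integral + kd * derivative
        low, high = self.rpm_limits
        self.rpm = min(high, max(low, raw))
        return self.rpm


if __name__ == "__main__":
    schedule = [(4000.0, (200.0, 5.0, 0.0)), (10000.0, (100.0, 2.0, 0.0))]
    controller = FanPID(setpoint_c=70.0, gain_schedule=schedule, base_rpm=3000.0)
    for reading in (70.0, 71.0, 73.0, 75.0, 74.0, 72.0, 70.0):
        print(reading, controller.update(reading))
```

A real controller would also have to account for the sensor time lag mentioned in the abstract and for coordination with other local controllers, neither of which this sketch attempts.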

UNVEILING EURORA – THERMAL AND POWER CHARACTERIZATION OF THE MOST ENERGY-EFFICIENT SUPERCOMPUTER IN THE WORLD

Speakers:
- Andrea Bartolini
- Matteo Cacciari
- Carlo Cavazzoni
- Giampietro Tecchiolli
- Luca Benini

Abstract

EURORA (EUropean many integrated cORe Architecture) is today the most energy efficient supercomputer in the world. Ranked 1st in the Green500 in July 2013, it is a prototype built by Eurotech and Cineca toward next-generation Tier-1 systems in the PRACE E2P EU project. EURORA’s outstanding energy efficiency is achieved by adopting a direct liquid cooling solution and a heterogeneous architecture with best-in-class general purpose HW components (Intel Xeon Phi and Nvidia Kepler K20). In this paper we present a novel, low-overhead monitoring infrastructure capable of tracking, in detail and in real time, the thermal and power characteristics of EURORA’s components with fine-grained resolution. Our experiments give insights into EURORA’s thermal/power trade-offs and highlight opportunities for run-time power/thermal management and optimization.

CONTENTION AWARE FREQUENCY SCALING ON CMPs WITH GUARANTEED QUALITY OF SERVICE

Speakers:
- Hao Shen
- Qinru Qiu

Abstract

Workload consolidation is usually performed in datacenters to improve server utilization for higher energy efficiency. One of the key issues related to workload consolidation is contention for shared resources such as last level cache, main memory, memory controller, etc. Dynamic voltage and frequency scaling (DVFS) of the CPU is another effective technique that has widely been used to trade performance for power reduction. We have found that the degree of resource contention of a system affects its performance sensitivity to CPU frequency. In this paper, we apply machine learning techniques to construct a model that quantifies runtime performance degradation caused by resource contention and frequency scaling. The inputs of our model are readings from Performance Monitoring Units (PMU), screened using a standard feature selection technique. The model is tested on an SMT-enabled chip multi-processor and it reaches up to 90% accuracy. Experimental results show that, guided by the performance model, runtime power management techniques such as DVFS can achieve a more accurate power and performance tradeoff without violating the quality of service (QoS) agreement. The QoS violation of the proposed system is significantly lower than that of systems that have no performance degradation information.

CONCURRENT PLACEMENT, CAPACITY PROVISIONING, AND REQUEST FLOW CONTROL FOR A DISTRIBUTED CLOUD INFRASTRUCTURE

Speakers:
- Shuang Chen
- Yanzhi Wang
- Massoud Pedram

Abstract

Cloud computing and storage have attracted a lot of attention due to the ever-increasing demand for reliable and cost-effective access to vast resources and services available on the Internet. Cloud services are typically hosted in a set of geographically distributed data centers, which we call the cloud infrastructure. To minimize the total cost of ownership of this cloud infrastructure (which accounts for both the upfront capital cost and the operational cost of the infrastructure resources), the infrastructure owners/operators must carefully plan data center locations in the targeted service area (for example, the US territories) and data center capacity provisioning (i.e., the total CPU cycles per second that can be provided in each data center). In addition, they must have flow control policies that distribute the incoming user requests to the available resources in the cloud infrastructure. This paper presents an approach for solving the unified problem of data center placement and provisioning, and request flow control, in one shot. The solution technique is based on mathematical programming. Experimental results, using Google cluster data and placement/provisioning of up to eight data center sites, demonstrate the cost savings of the proposed problem formulation and solution approach.

COOLIP: SIMPLE YET EFFECTIVE JOB ALLOCATION FOR DISTRIBUTED THERMALLY-THROTTLED PROCESSORS

Speakers:
- Pratyush Kumar
- Hoeseok Yang
- Juliana Bacivarov
- Lothar Thiele


ENERGY OPTIMIZATION IN 3D MPSOCs WITH WIDE-I/O DRAM USING TEMPERATURE VARIATION AWARE BANK-WISE REFRESH

Speakers:
- Mohammadsadegh Sadri
- Matthias Jung
- Christian Weis
- Norbert Wehn
- Luca Benini

Abstract

Heterogeneous 3D integrated systems with Wide-I/O DRAMs are a promising solution to squeeze more functionality and storage bits into an ever-decreasing volume. Unfortunately, with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. We reduce DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. In order to prove the validity of our concepts, we develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. On this platform we run the Android OS with real-world benchmarks to quantify the advantages of our ideas. We show improvements of 16% in DRAM refresh power due to temperature variation aware bank-wise refresh. Furthermore, two solutions are investigated to speed up system simulations: (1) adaptive tuning of sampling intervals based on the estimated chip thermal profile, which results in speedups of 2X; (2) hardware acceleration of thermal simulations using the Maxeler engine, which shows possible speedups of 12X.
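
The bank-wise refresh idea in the abstract above can be illustrated with a minimal Python sketch. The temperature threshold, the two refresh periods, the bank temperatures, and the assumption that refresh power scales with refresh rate are all illustrative choices made here, not values or models taken from the paper.

```python
# Illustrative sketch only: thresholds, periods and temperatures are assumptions.
NOMINAL_PERIOD_MS = 64.0   # assumed refresh period at or below the threshold
HOT_PERIOD_MS = 32.0       # assumed (halved) refresh period above the threshold
THRESHOLD_C = 85.0         # assumed temperature threshold


def refresh_period_ms(temp_c):
    """Refresh period required by a bank at the given temperature."""
    return NOMINAL_PERIOD_MS if temp_c <= THRESHOLD_C else HOT_PERIOD_MS


def relative_refresh_power(bank_temps_c, bank_wise):
    """Refresh power relative to all banks refreshing at the nominal rate.

    Assumes refresh power is proportional to refresh rate (1 / period).
    """
    if bank_wise:
        # Each bank uses the period its own temperature requires.
        periods = [refresh_period_ms(t) for t in bank_temps_c]
    else:
        # One global period dictated by the hottest bank.
        worst = refresh_period_ms(max(bank_temps_c))
        periods = [worst] * len(bank_temps_c)
    return sum(NOMINAL_PERIOD_MS / p for p in periods) / len(periods)


bank_temps = [72.0, 78.0, 88.0, 91.0]  # hypothetical per-bank temperatures
global_power = relative_refresh_power(bank_temps, bank_wise=False)
bank_power = relative_refresh_power(bank_temps, bank_wise=True)
```

With these hypothetical temperatures, only the two hot banks pay for the faster refresh in the bank-wise case, whereas a single global setting forces every bank to the worst-case rate.
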
10.5 Analysis of Components and Systems

10.4 System-level evaluation

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 2

Chair:
Pablo Sanchez, University of Cantabria, ES
Co-Chair:
Florian Letombe, Synopsys, FR

The session presents system-level verification and simulation techniques, as well as specific solutions for particular system components. The first paper analyzes how to detect concurrency errors from multi-threaded software on a virtual platform. The second one proposes a hybrid simulation platform for cache configuration analysis. The last paper explores SSD verification challenges. The session is completed by three IPs that introduce novel approaches for parallel simulation and efficient NoC/Smart systems validation.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:00</td>
<td>10.4.1.1</td>
<td>AUTOMATIC DETECTION OF CONCURRENCY BUGS THROUGH EVENT ORDERING CONSTRAINTS</td>
<td>Luis Gabriel Murillo, Simon Wawroschek, Jeronimo Castrillon, Rainer Leupers and Gerd Ascheid, RWTH Aachen University, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Writing correct parallel software for modern multi-processor systems-on-chip (MPSoCs) is a complicated task. Programmers can rarely anticipate all possible external and internal interactions in complex concurrent systems. Concurrency bugs originating from races and improper synchronization are difficult to understand and reproduce. Furthermore, traditional debug and verification practices for embedded systems lack support to address this issue efficiently. For instance, programmers still need to step through several executions until they find a buggy state, or analyze complex traces, which results in productivity losses. This paper proposes a new debug approach for MPSoCs that combines dynamic analysis and the benefits of virtual platforms. All in all, it (i) enables automatic exploration of SW behavior, (ii) identifies problematic concurrent interactions, (iii) provokes possibly erroneous executions and, ultimately, (iv) detects concurrency bugs. The approach is demonstrated on an industrial-strength virtual platform with a full Linux operating system and real-world parallel benchmarks.</td>
<td></td>
</tr>
<tr>
<td>11:30</td>
<td>10.4.2.1</td>
<td>HARDWARE-BASED FAST EXPLORATION OF CACHE HIERARCHIES IN APPLICATION SPECIFIC MPSOCs</td>
<td>Isuru Nawinne, Josef Schneider, Haris Javaid and Sri Parameswaran, The University of New South Wales, AU</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Multi-level caches are widely used to improve the memory access speed of multiprocessor systems. Deciding on a suitable set of cache memories for an application-specific embedded system's memory hierarchy is a tedious problem, particularly in the case of MPSoCs. To accurately determine the number of hits and misses for all the configurations in the design space of an MPSoC, researchers first extract the trace using instruction set simulators and then simulate it using a software simulator. Such simulations take from several hours to months. We propose a novel method based on specialized hardware which can quickly simulate the design space of cache configurations for a shared-memory multiprocessor system on an FPGA, by analyzing the memory traces and calculating the cache hits and misses simultaneously. We demonstrate that our simulator can explore the cache design space of a quad-core system with private L1 caches and a shared L2 cache, over a range of standard benchmarks, taking as little as 0.106 seconds per million memory accesses, which is up to 456 times faster than the fastest known software-based simulator. Since we emulate the program and analyze memory traces simultaneously, we eliminate the need to extract multiple memory access traces prior to simulation, which saves a significant amount of time during the design stage.</td>
<td></td>
</tr>
<tr>
<td>12:00</td>
<td>10.4.3.1</td>
<td>SSDEXPLORER: A VIRTUAL PLATFORM FOR FINE-GRAINED DESIGN SPACE EXPLORATION OF SOLID STATE DRIVES</td>
<td>Lorenzo Zuolo1, Cristian Zambelli1, Rino Micheloni2, Salvatore Galfano3, Marco Indaco3, Stefano Di Carlo1, Paolo Prinetto3, Piero Olivo1 and Davide Bertozzi3</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Solid State Drives (SSDs) are gaining particular momentum in various frameworks such as multimedia, large data centers and cloud environments. Unfortunately, efficient CAD tools for SSD design space exploration able to assess the optimization of the device microarchitecture w.r.t. the target performance are still missing. This paper tries to close this gap by proposing SSDExplorer, a tool for fine-grained and fast design space exploration of SSD devices. SSDExplorer provides unprecedented insights into the architecture behavior and subcomponent interaction efficiency, while avoiding the need for the actual implementation of an FTL or of key hardware components. This is achieved by the introduction of suitable abstractions of the different components. This is confirmed by the thorough validation of SSDExplorer against a commercial SSD device.</td>
<td></td>
</tr>
<tr>
<td>12:30</td>
<td>10.5.1.1</td>
<td>EFFICIENT SIMULATION AND MODELLING OF NON-RECTANGULAR NOC TOPOLOGIES</td>
<td>Ji Qi and Mark Zwolinski, University of Southampton, GB</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>With increasing chip complexity, Networks-on-Chips (NoCs) are becoming a central platform for future on-chip communications. Many regular NoC architectures have been proposed to eliminate the communication bottlenecks of traditional bus-based networks. Non-rectangular and irregular architectures have also been proposed to increase performance. However, the complexity of designing custom non-rectangular networks leads to a rapid increase in design and verification times. To alleviate the conflict between performance and efficiency, this paper proposes a novel method that efficiently constructs virtual non-rectangular topologies on a mesh network by using time-regulated models to emulate irregular patterns. Data routings on virtual hexagonal and two irregular geometries validate the proposed method. An MPEG-4 decoder is used to exemplify the proposed method for media applications. Results analysis shows the virtual topologies emulated by the proposed method can provide precise timing and energy performance.</td>
<td></td>
</tr>
<tr>
<td>12:30</td>
<td>IP5-6, 221</td>
<td>MOVING FROM CO-SIMULATION TO SIMULATION FOR EFFECTIVE SMART SYSTEMS DESIGN</td>
<td>Franco Fummi1, Michele Lora2, Francesco Stefanni3, Dimitrios Trachanis4, Jan Vanhese4 and Sara Vinco2</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Design of smart systems needs to cover a wide variety of domains, ranging from analogue to digital, with power devices, micro-sensors and actuators, up to MEMS. This high level of heterogeneity makes design a very challenging task, as each domain is supported by specific languages, modeling formalisms and simulation frameworks. A further major issue is posed by simulation, which heavily impacts the design and verification loop and is very hard to build in such a heterogeneous context. On the other hand, achieving efficient simulation would make smart system design feasible with respect to budget constraints. This work provides a formalization of the typical abstraction levels and design domains of a smart system. This taxonomy makes it possible to identify a precise role in the design flow for co-simulation and simulation scenarios. Moreover, a methodology is proposed to move from the co-simulated heterogeneous representation to a simulatable homogeneous representation in C++ of the entire smart system. The impact of heterogeneous or homogeneous models of simulation is also examined. Experimental results prove the effectiveness of the proposed C++ generation for reaching high-speed simulation.</td>
<td></td>
</tr>
</tbody>
</table>

12:30 End of session

Lunch Break: In Exhibition Area
Sandwich lunch
The first paper proposes a new static analysis approach based on segment graphs that identifies a tight set of potential access conflicts in segments that may-happen-in-parallel in system-level models. In the second paper, a technique for latency analysis for shared resource systems is introduced. The third paper proposes a method that improves the trade-off between simulation speed and accuracy of performance models of architectures. Finally, the fourth paper deals with cross-correlating specification and RTL to discover versioning issues, poor documentation, and mismatches between specification and RTL.

11:00 10.5.1 (Best Paper Award Candidate)
MAY-HAPPEN-IN-PARALLEL ANALYSIS BASED ON SEGMENT GRAPHS FOR SAFE ESL MODELS
Speakers:
Weiwei Chen1, Xu Han2 and Rainer Doemer3
1University of California, Irvine, US; 2Qualcomm Inc., US; 3EECS, UC Irvine, US
Abstract
A well-defined system-level model contains explicit parallelism and should be free from parallel access conflicts to shared variables. However, safe parallelism is difficult to achieve since risky shared variables are often hidden deep in the design and are not exposed through simulation. In this paper, we propose a new static analysis approach based on segment graphs that identifies a tight set of potential access conflicts in segments that may-happen-in-parallel (MHP).

11:30 10.5.2 TIMING ANALYSIS OF FIRST-COME FIRST-SERVED SCHEDULED INTERVAL-TIMED DIRECTED ACYCLIC GRAPHS
Speakers:
Raymond Frijns1, Shreya Adyanthaya1, Sander Stuijk1, Jeroen Voeten1, Marc Geilen1, Ramon Schiffelers2 and Henk Corporaal1
1Eindhoven University of Technology, NL; 2ASML, NL
Abstract
Analyzing worst-case application timing for systems with shared resources is difficult, especially when non-monotonic arbitration policies like First-Come-First-Served (FCFS) scheduling are used in combination with varying task execution times. Analysis methods that conservatively analyze these systems are often based on state-space exploration, which is not scalable due to its inherent susceptibility to combinatorial explosion. We propose a scalable timing analysis method for periodically restarted Directed Acyclic Task Graphs that can provide conservative bounds on task timing properties when shared resources with FCFS scheduling are used. By expressing task enabling and completion times as intervals, denoting best-case and worst-case timing properties, contention on the shared resources can be estimated using conservative approximations. With an industrial case study we show that our approach can easily analyze models with thousands of tasks in less than 10 seconds, and that the worst-case bounds obtained show an average improvement of 46% compared to bounds obtained by static worst-case analysis.

12:00 10.5.3 A DYNAMIC COMPUTATION METHOD FOR FAST AND ACCURATE PERFORMANCE EVALUATION OF MULTI-CORE ARCHITECTURES
Speakers:
Sebastien Le Nours1, Adam Postula2 and Neil Bergmann2
1University of Nantes, FR; 2University of Queensland, AU
Abstract
Early estimation of performance has become necessary to facilitate the design of complex multi-core architectures. Performance evaluation based on extensive simulations is time consuming and needs to be improved to allow exploration of different architectures in acceptable time. In this paper, we propose a method that improves the trade-off between simulation speed and accuracy in performance models of architectures. During model execution, this method computes some of the synchronization instants involved in architecture evolution. It allows architecture processes to be grouped and abstracted, and in this way significantly reduces the number of simulation events. Experiments show significant benefits of the computation method for simulation time. In particular, a simulation speed-up by a factor of 4 is achieved in the considered case study, with no loss of accuracy in the estimation of processing resource usage. The proposed method has the potential to support automatic generation of efficient architecture models.

12:15 10.5.4 CROSS-CORRELATION OF SPECIFICATION AND RTL FOR SOFT IP ANALYSIS
Speakers:
Bhanu Singh1, Arunprasath Shankar1, Francis Wolff1, Christos Papachristou1, Daniel Weyer2 and Steve Clay2
1Case Western Reserve University, US; 2Rockwell Automation, US
Abstract
Semiconductor companies often use third-party IPs in order to improve their design productivity. In practice, there are risks involved in using a third-party IP, as bugs may creep in due to versioning issues, poor documentation, and mismatches between specification and RTL. As a result, third-party IP specification and RTL must be carefully evaluated. Our methodology addresses this issue by cross-correlating specification and RTL to discover these discrepancies. The key innovative idea in our approach is to use prior, trusted experience about designs, which includes their specs and RTL code. We have captured this trusted experience in two knowledge bases (KB), Spec-KB and RTL-KB. Finally, knowledge base rules are used to cross-correlate the RTL blocks to the specs. We have tested our approach by analyzing several third-party IPs. We have defined metrics for specification coverage and RTL identification coverage to quantify our results.

12:30 End of session
Lunch Break in Exhibition Area
Sandwich lunch

10.6 Multi-processor and distributed systems
Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 4
Chair:
Orlando Moreira, Ericsson, NL
Co-Chair:
Giuseppe Lipari, ENS - Cachan, FR

This session features new results in scheduling, allocation and management of real-time applications in multi-core and distributed systems. The first paper presents a control algorithm for managing real-time tasks so as to meet thermal constraints in a multi-core chip. The second paper proposes an algorithm for mixed-criticality task allocation on a multiprocessor platform. The third paper proposes a method for generating a schedule for a multi-mode application in a distributed system.

10.7 Advances in Synthesis

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 5
Chair: John Hayes, University of Michigan, US
Co-Chair: Kim Taemin, Intel Labs, US

Papers in this session address synthesis algorithms and tools at different levels, targeting power, area and delay minimization.
10.7.3  
**Title:** AN EFFICIENT MANIPULATION PACKAGE FOR BICONDITIONAL BINARY DECISION DIAGRAMS  
**Speakers:** Luca Amarù, Pierre-Emmanuel Gaillardon and Giovanni De Micheli, EPFL, CH  
**Abstract**  
Biconditional Binary Decision Diagrams (BBDDs) are a novel class of binary decision diagrams where the branching condition, and its associated logic expansion, is biconditional on two variables. Reduced and ordered BBDDs are remarkably compact and unique for a given Boolean function. In order to exploit BBDDs in Electronic Design Automation (EDA) applications, efficient manipulation algorithms must be developed and integrated in a software package. In this paper, we present the theory for efficient BBDD manipulation and its practical software implementation. The key features of the proposed approach are (i) strong canonical form pre-conditioning of stored BBDD nodes, (ii) recursive formulation of Boolean operations in terms of biconditional expansions, (iii) performance-oriented memory management and (iv) dedicated BBDD re-ordering techniques. Experimental results show that the developed BBDD package achieves an average node count reduction of 19.48% and a speed-up factor of 1.63x with respect to a state-of-the-art decision diagram manipulation package. Employed in the synthesis of datapath circuits, the BBDD manipulation package is capable of advantageously restructuring arithmetic operations, producing 11.02% smaller and 32.29% faster circuits compared to a commercial synthesis flow.
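
As a small illustration of the biconditional expansion mentioned in the abstract above, the following Python sketch checks, by exhaustive enumeration, that a function f equals (v XOR w)·f[v := NOT w] + (v XNOR w)·f[v := w]. The helper names and the majority example are chosen here for illustration; this is not the data structure or API of the BBDD package.

```python
from itertools import product


def cofactors(f, v, w):
    """Return the two cofactors of f used by the biconditional expansion.

    f_neq fixes input v to NOT input w; f_eq fixes input v to input w.
    Both ignore whatever value is passed at position v.
    """
    def f_neq(*a):
        a = list(a)
        a[v] = 1 - a[w]
        return f(*a)

    def f_eq(*a):
        a = list(a)
        a[v] = a[w]
        return f(*a)

    return f_neq, f_eq


def biconditional_expansion(f, v, w):
    """Rebuild f as (v XOR w) * f_neq + (v XNOR w) * f_eq."""
    f_neq, f_eq = cofactors(f, v, w)

    def g(*a):
        differ = a[v] ^ a[w]
        return (differ & f_neq(*a)) | ((1 - differ) & f_eq(*a))

    return g


def majority(x, y, z):
    """Three-input majority, used only as a test function."""
    return (x & y) | (y & z) | (x & z)


g = biconditional_expansion(majority, 0, 1)
matches = all(majority(*a) == g(*a) for a in product((0, 1), repeat=3))
```
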

10.7.4  
**Title:** SYNTHESIS ALGORITHM OF PARALLEL INDEX GENERATION UNITS  
**Speaker:** Yusuke Matsunaga, Kyushu University, JP  
**Abstract**  
The index generation function is a multi-valued logic function which checks whether the given input vector is registered or not, and returns its index value if the vector is registered. If the latency of the operation is critical, dedicated hardware is used for implementing the index generation functions. This paper proposes a method for implementing the index generation functions using parallel index generation units. A novel and efficient algorithm called 'conflict free partitioning' is proposed to synthesize parallel index generation units. Experimental results show that the proposed method outperforms other existing methods.
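
To make the notion of an index generation function concrete, here is a minimal Python sketch of a lookup built from several parallel units, each addressed by a subset of the input bits. The greedy placement below is a simplified stand-in chosen for illustration, not the 'conflict free partitioning' algorithm of the paper, and the bit selections and registered vectors are hypothetical.

```python
def hash_bits(vector, positions):
    """Address formed by the selected bit positions of an integer vector."""
    addr = 0
    for p in positions:
        addr = (addr << 1) | ((vector >> p) & 1)
    return addr


def build_units(registered, unit_positions):
    """Greedily place each vector in the first unit where its address is free.

    Within a unit no two stored vectors share an address, so each unit
    needs only one table read and one comparison per lookup.
    """
    units = [dict() for _ in unit_positions]
    for index, vec in enumerate(registered, start=1):
        for table, positions in zip(units, unit_positions):
            addr = hash_bits(vec, positions)
            if addr not in table:
                table[addr] = (vec, index)
                break
        else:
            raise ValueError(f"no conflict-free slot for vector {vec:#x}")
    return units


def lookup(x, units, unit_positions):
    """Return the index of x if it is registered, otherwise 0."""
    for table, positions in zip(units, unit_positions):
        entry = table.get(hash_bits(x, positions))
        if entry is not None and entry[0] == x:
            return entry[1]
    return 0


registered = [0b0011, 0b0111, 0b1011, 0b1100]  # hypothetical registered vectors
unit_positions = [(0, 1), (2, 3)]              # hypothetical bit selection per unit
units = build_units(registered, unit_positions)
```

In hardware the units would be probed in parallel and their outputs combined; the sequential loop in `lookup` only models the result.
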

10.8  
**Title:** AUTOMATING DATA REUSE IN HIGH-LEVEL SYNTHESIS  
**Speakers:** Wim Meeus and Dirk Stroobandt  
**Abstract**  
Current High-Level Synthesis (HLS) tools perform excellently for the synthesis of computation kernels, but they often do not optimize memory bandwidth. As memory access is a bottleneck in many algorithms, the performance of the generated circuit will benefit substantially from memory access optimization. In this paper we present an automated method and a toolchain to detect reuse of array data in loop nests and to build hardware that exploits this data reuse. This saves memory bandwidth and improves circuit performance. We make use of the polyhedral representation of the source program, which makes our method computationally easy. Our software complements the existing HLS flows. Starting from a loop nest written in C, our tool generates a reuse buffer and a loop controller, and preprocesses the loop body for synthesis with an existing HLS tool. Our automated tool produces designs from unoptimized source code that are as efficient as those generated by a commercial HLS tool from manually optimized source code.
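
The effect of a reuse buffer on memory traffic can be sketched in a few lines of Python. The three-point sliding-window loop and the read-counting memory model are assumptions made here for illustration; the tool described in the abstract works on C loop nests and generates hardware, which this sketch does not attempt to reproduce.

```python
from collections import deque


class CountingMemory:
    """Array wrapper that counts how many reads reach the memory."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def read(self, i):
        self.reads += 1
        return self.data[i]


def stencil_naive(mem, n, width=3):
    """Every iteration re-reads all `width` elements from memory."""
    return [sum(mem.read(i + k) for k in range(width))
            for i in range(n - width + 1)]


def stencil_with_reuse(mem, n, width=3):
    """A small buffer keeps the overlapping elements, so each array
    element is read from memory exactly once."""
    window = deque((mem.read(i) for i in range(width - 1)), maxlen=width)
    out = []
    for i in range(width - 1, n):
        window.append(mem.read(i))  # the oldest element drops out automatically
        out.append(sum(window))
    return out


naive_mem = CountingMemory(range(8))
reuse_mem = CountingMemory(range(8))
naive_out = stencil_naive(naive_mem, 8)
reuse_out = stencil_with_reuse(reuse_mem, 8)
```
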

10.9  
**Title:** A UNIVERSAL SYMMETRY DETECTION ALGORITHM  
**Speaker:** Peter Maurer, Dept. of Computer Sci., Baylor University, US  
**Abstract**  
Research on symmetry detection focuses on identifying and detecting new types of symmetry. We present an algorithm that is capable of detecting any type of permutation-based symmetry, including many types for which there are no existing algorithms. General symmetry detection is library-based, but symmetries that can be parameterized (i.e., total, partial, rotational, and dihedral symmetry) can be detected without using libraries. In many cases the algorithm is faster than existing techniques. Furthermore, it is simpler than most existing techniques, and can easily be incorporated into existing software.
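
For readers unfamiliar with the terms, total and partial symmetry of a Boolean function can be checked by brute force on its truth table, as in the Python sketch below. This exhaustive check is only a definition-level illustration with example functions chosen here; it is not the detection algorithm presented in the paper.

```python
from itertools import product


def is_swap_invariant(f, n, i, j):
    """True if exchanging inputs i and j never changes the output of f."""
    for a in product((0, 1), repeat=n):
        b = list(a)
        b[i], b[j] = b[j], b[i]
        if f(*a) != f(*b):
            return False
    return True


def is_totally_symmetric(f, n):
    """Adjacent swaps generate every permutation, so checking them suffices."""
    return all(is_swap_invariant(f, n, i, i + 1) for i in range(n - 1))


def majority(x, y, z):
    return (x & y) | (y & z) | (x & z)


def and_or(x, y, z):
    return (x & y) | z


majority_total = is_totally_symmetric(majority, 3)
and_or_total = is_totally_symmetric(and_or, 3)
and_or_partial_xy = is_swap_invariant(and_or, 3, 0, 1)
```
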

10.10  
**Title:** OPTIMIZATION OF DESIGN COMPLEXITY IN TIME-MULTIPLEXED CONSTANT MULTIPLICATIONS  
**Speakers:** Levent Aksoy1, Paulo Flores2 and Jose Monteiro 3  
**Abstract**  
The multiplication of constants by a data input is an essential operation in digital signal processing (DSP) systems. For applications requiring a large number of constant multiplications under stringent hardware constraints, it is generally realized in a folded architecture, where a single constant selected from a set of multiple constants is multiplied by the data input at each time step; this is called time-multiplexed constant multiplication (TMC). This paper addresses the problem of optimizing the complexity of a TMC design and introduces an algorithm that finds the least complex TMC design by sharing the logic operators, i.e., adders, subtractors, adders/subtractors, and multiplexors (MUXes). It includes efficient search methods, yielding better results than existing TMC algorithms.
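
The sharing of operators described above can be pictured with a tiny behavioural model in Python. The constants 5 and 7 and the single shared adder/subtractor are an example constructed here, not a design produced by the paper's algorithm.

```python
def tmc_5_or_7(x, select):
    """Multiply x by 5 (select == 0) or by 7 (select == 1).

    One shifter input MUX and one adder/subtractor are shared:
      5 * x = (x << 2) + x
      7 * x = (x << 3) - x
    """
    shifted = (x << 2) if select == 0 else (x << 3)     # MUX on the shift amount
    return shifted + x if select == 0 else shifted - x  # shared adder/subtractor


results = [(tmc_5_or_7(x, 0), tmc_5_or_7(x, 1)) for x in range(4)]
```

Both products reuse the same arithmetic operator, so the folded design needs one adder/subtractor plus selection logic instead of two separate multiplier blocks.
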

10.11  
**Title:** HARDWARE PRIMITIVES FOR THE SYNTHESIS OF MULTITHREADED ELASTIC SYSTEMS  
**Speakers:** Giorgos Dimitrakopoulos1, Sebastian Annessis2, Anastasios Psarras1, Konstantinos Tsouri1, Pavlos Mathaiakis1 and Jordi Cortadella  
**Abstract**  
Elastic systems operate in a dataflow-like mode using a distributed scalable control and tolerating variable latency computations. At the same time, multithreading increases the utilization of processing units and hides the latency of each operation by time-multiplexing operations of different threads in the datapath. This paper proposes a model to unify multithreading and elasticity. A new multithreaded elastic control protocol is introduced supported by low-cost elastic buffers that minimize the storage requirements without sacrificing performance. To enable the synthesis of multithreaded elastic architectures, new hardware primitives are proposed and utilized in two circuit examples to prove the applicability of the proposed approach.
Today the most powerful innovations in the major industries and the most promising approaches to tackle burning societal challenges are substantially influenced by, and dependent on, the innovations provided by the microelectronics industry. Breakthroughs in manufacturing technologies enable the realization of novel types of devices and systems, which in turn enable applications with fascinating functionality and enormous performance. However, this innovation chain is not operational without appropriate innovations in design technology: we need an innovation Agenda 2020 for design methodology and EDA tools fueling the innovation chain of electronics. In 2014, the technologies for MEMS and for 3D chips have reached a maturity level that enables them to reshape our lives by 2020. This panel will discuss how to utilize these technologies: which applications will become possible with the upcoming innovations in 3D and MEMS technologies, and what kind of EDA innovations will be required in order to implement these applications effectively and efficiently, yielding powerful yet reliable components and systems. The panel includes the manufacturers GLOBALFOUNDRIES and X-FAB, Bosch as a leading supplier of technology and one of the MEMS pioneers, as well as the leading EDA vendors Cadence and Synopsys.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:00</td>
<td>10.8.1</td>
<td>INTRODUCTION</td>
<td>Ahmed Jerraya, CEA-LETI, FR</td>
</tr>
<tr>
<td>12:30</td>
<td></td>
<td>End of session</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Lunch Break in Exhibition Area</td>
<td>Sandwich lunch</td>
</tr>
</tbody>
</table>

### UB10 Session 10

Date: Thursday 27 March 2014
Time: 12:00 - 14:30
Location / Room: University Booth, Booth 3, Exhibition Area

<table>
<thead>
<tr>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>UB10.01</td>
<td>SOC VERIFICATION: AUTOMATED FUNCTIONAL VERIFICATION OF SYSTEMS-ON-CHIP</td>
<td>Zdenek Prikryl, Marcela Simkova and Karel Masarik, Faculty of Information Technology, Brno University of Technology, CZ</td>
</tr>
<tr>
<td>UB10.02</td>
<td>BRIDGING MATLAB/SIMULINK AND ESL DESIGN VIA AUTOMATIC CODE GENERATION</td>
<td>Liyuan Zhang, Michael Glaß and Jürgen Teich, University of Erlangen-Nuremberg, DE</td>
</tr>
<tr>
<td>UB10.04</td>
<td>GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES</td>
<td>Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT</td>
</tr>
<tr>
<td>UB10.05</td>
<td>RESCV: RESOURCE-AWARE COMPUTER VISION APPLICATION ON HETEROGENEOUS MULTI-TILE ARCHITECTURE</td>
<td>Éricles Sousa1, Johny Paul1, Vahid Lari1, Frank Hannig1, Jürgen Teich1 and Walter Stechele2</td>
</tr>
</tbody>
</table>

1University of Erlangen-Nuremberg, DE; 2Technische Universität München, DE

**UB10.01 SOC VERIFICATION: AUTOMATED FUNCTIONAL VERIFICATION OF SYSTEMS-ON-CHIP**

Authors: Zdenek Prikryl, Marcela Simkova and Karel Masarik, Faculty of Information Technology, Brno University of Technology, CZ

Abstract

An increase in the complexity of systems-on-chip (SoC) induces an increase in the complexity of their verification as well. The reason is that we must verify not only the functions of separate logic blocks, but also their interconnections, timing and functional collaboration. Therefore, there is still a great demand for verification tools that are time-effective, fast and as automated as possible. Our solution targets exactly these issues. You are welcome to see the live demonstration at our booth!

**UB10.02 BRIDGING MATLAB/SIMULINK AND ESL DESIGN VIA AUTOMATIC CODE GENERATION**

Authors: Liyuan Zhang, Michael Glaß and Jürgen Teich, University of Erlangen-Nuremberg, DE

Abstract

Matlab/Simulink is today's de-facto standard for model-based design in domains such as control engineering and signal processing. Commercial tools are available to generate embedded C or HDL code directly from a Simulink model. However, Simulink models are purely functional models and, hence, designers cannot seamlessly consider the architecture that a Simulink model is later implemented on. In particular, it is not possible to explore the different architectural alternatives and investigate the arising interactions and side-effects directly within Simulink. To benefit from Matlab/Simulink's algorithm exploration capabilities and overcome the outlined drawbacks, we introduce a model transformation framework that converts a Simulink model to an executable specification, written in an actor-oriented modeling language. This specification then serves as the input of an established Electronic System Level (ESL) design flow, enabling Design Space Exploration (DSE) and automatic code generation for both hardware and software. In this demonstration, we will show how to automatically bring Simulink models into an established ESL design flow by means of a code generator. Based on the generated code, we will present a co-simulation approach that combines complex environmental models from Matlab/Simulink with the auto-generated model of a controller. We will use an Anti-lock Braking System (ABS) as an example where we investigate the impact of different controller implementations in the automotive E/E architecture.

In detail, the following scientific achievements are included in the proposed demonstration. To bridge Simulink and ESL design flows, we developed an ESL Code-Generator that performs the model transformation. For any given Simulink model, such as a controller in a control system, the designer can simply invoke our Code-Generator to create the ESL model automatically. In our design flow, we use SystemC as the programming language, extended with actors that follow a specific Model of Computation (MoC). We guarantee the preservation of the semantics of the generated model by (a) applying a specific 1-to-1 mapping from Simulink basic blocks to an actor library and (b) considering different transformations to capture single-rate and multi-rate Simulink models. After the model transformation is finished, the auto-generated SystemC model serves as the input of a well-established ESL design flow that enables DSE.

Besides the Code-Generator, we also demonstrate a validation technique that checks functional correctness by comparing the original Simulink model with the generated SystemC model. The main idea behind this technique is (1) to co-simulate the auto-generated model along with the original model and (2) to reuse the environment model and the test bench originally created in Simulink for the auto-generated model as well. Furthermore, the performance of the model can also be measured during co-simulation.

In this demonstration, an ABS model will be transformed from Simulink to SystemC by invoking the ESL Code-Generator. Then, by applying our validation technique, the correctness and the accuracy of the auto-generated model can be examined. Lastly, to evaluate the performance of the model, application-dependent quality of control will be measured, such as the braking distance on an icy road.

**UB10.04 GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES**

Authors: Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT

Abstract

Gemini is a synthesis and optimization tool for graphene-based digital devices. Given a combinational circuit description through its Boolean representation, Gemini produces a SPICE netlist mapped onto graphene PN-junction gates. The software is composed of a parser library to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library used to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. Whether used as stand-alone software or as a library that is easy to integrate into state-of-the-art tools, Gemini represents a first step towards an enabling technology for future synthesis and optimization processes for graphene-based devices.

**UB10.05 RESCV: RESOURCE-AWARE COMPUTER VISION APPLICATION ON HETEROGENEOUS MULTI-TILE ARCHITECTURE**

Authors: Erices Sousa¹, Johny Paul¹, Vahid Lari¹, Frank Hannig¹, Jürgen Teich¹ and Walter Stechele²

¹University of Erlangen-Nuremberg, DE; ²Technische Universität München, DE

Abstract

We demonstrate the benefits of invasive computing by showing the efficiency and utilization improvements obtained in a resource-aware manner through algorithmic selection of different invasive resources, such as a TCPA (tightly-coupled processor array) and RISC processors. More specifically, we present dynamic load balancing of a computer vision application between multiple RISC cores and a TCPA, based on invasive mechanisms supported by our operating system and the agent system.

Virtual prototyping and Electronic System Level (ESL) modeling have become valuable approaches to cope with the ever-increasing complexity of embedded systems. Their effectiveness, however, depends heavily on short development time and on accuracy, two conflicting goals. In this demonstration, we present two contributions. (a) An ESL methodology [1], [2] for the simulation-based evaluation of power and performance of embedded systems by the use of virtual prototypes. Our methodology permits us to develop ESL models for design space exploration of dynamic power and performance management strategies and hardware/software co-design choices. (b) A novel sketch-based tool termed Mahler [3] for the very early design phase of ESL modeling. Mahler provides a playground to quickly model functionality and evaluate performance on different architecture implementations. In Mahler, ESL models are created by literally sketching with a pen or touch interface, e.g. a tablet stylus, or a touchless interface, such as a Leap Motion controller. The application and architecture models are transformed to an executable virtual prototype through sketch recognition. This approach provides a very intuitive and fast way to explore actor-oriented functional modeling and hardware/software partitioning. The output of Mahler is a simulation-ready SystemC-based source-code stub that can be refined for subsequent design iterations. We will show a model of a Voice over LTE (VoLTE) use case, consisting of a heterogeneous cellular SoC platform, together with a wireless channel fading model and a base station network model. State-based [1] and polynomial-equation-based [4] power models are built and co-simulated for the SoC digital module and the RF transceiver module, respectively, to capture their different power consumption characteristics accurately. The end-to-end model enables SoC performance and power simulation with a proper network configuration in seconds, which is highly desirable for early design exploration of cellular systems and for co-optimization with network vendors.
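
The two kinds of power model mentioned above can be illustrated generically. The sketch below is an assumption-laden illustration, not the models from [1] or [4]: the state names, power values, and polynomial coefficients are hypothetical. A state-based model charges a fixed power per operating state, while a polynomial model evaluates power as a function of one operating parameter.

```python
# Hypothetical per-state power figures in milliwatts.
STATE_POWER_MW = {"sleep": 1.0, "idle": 20.0, "active": 150.0}


def energy_mj(timeline):
    """Energy in millijoules for a list of (state, duration in seconds)."""
    return sum(STATE_POWER_MW[state] * duration for state, duration in timeline)


def polynomial_power_mw(coefficients, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... using Horner's scheme."""
    result = 0.0
    for c in reversed(coefficients):
        result = result * x + c
    return result


digital_energy = energy_mj([("active", 0.5), ("idle", 1.0), ("sleep", 2.0)])
rf_power = polynomial_power_mw([10.0, 2.0, 0.5], 4.0)
```

In a co-simulation, the first function would be driven by the state trace of the digital module and the second by a parameter of the RF transceiver, such as its output power setting.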

Organic semiconductors with conjugated electron systems are currently intensively investigated for optoelectronic applications. This interest is spurred by novel devices such as organic light-emitting diodes (OLEDs), organic solar cells, and flexible electronics. In this talk, I will discuss some of the recent progress in realizing devices, in particular highly efficient white OLEDs for lighting and flexible organic solar cells.

Microelectronics has been following Moore's law for almost 40 years. However, this trend tends to run out of steam in recent technology nodes. The continuous improvements in the size of the transistors and in the operating frequencies result in serious power consumption, heat dissipation and reliability issues. Spintronics (Nobel Prize in Physics 2007, awarded to Prof. Fert from Univ. Paris-Sud and Peter Grünberg from Forschungszentrum Jülich) nanodevices can significantly reduce power, improve reliability or enable new functionalities. The 2010 ITRS report on emerging research devices identified the Magnetic Tunnel Junction (MTJ) nanopillar (the preeminent spintronics nanodevice) as one of the most promising technologies to be part of future microelectronics circuits. It provides data non-volatility, radiation hardness, fast data access and low-power operation. Magnetic memories have become the most promising candidate for both low-power logic computing and data storage. This tutorial paper presents multi-disciplinary questions (Device, Circuit, Architecture, System and CAD) related to this topic to share the most recent results and discuss the future challenges.

A process for the fabrication of bottom-gate, top-contact (inverted staggered) organic thin-film transistors (TFTs) with channel lengths as short as 1 μm on flexible plastic substrates has been developed. The TFTs employ vacuum-deposited small-molecule semiconductors and a low-temperature-processed gate dielectric that is sufficiently thin to allow the TFTs to operate with voltages of about 3 V. The p-channel TFTs have an effective field-effect mobility of about 1 cm²/Vs, an on/off ratio of 10⁷, and a signal propagation delay (measured in 11-stage ring oscillators) of 300 ns per stage. For the n-channel TFTs, an effective field-effect mobility of about 0.06 cm²/Vs, an on/off ratio of 10⁶, and a signal propagation delay of 17 μs per stage have been obtained.

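
As a rough cross-check of the reported stage delay, a ring oscillator with N stages and a per-stage delay τ oscillates at approximately 1/(2Nτ), because a transition has to travel around the ring twice per period. The numbers below use the quoted 300 ns per stage and 11 stages for the p-channel TFTs; the resulting frequency is derived here and is not stated in the abstract.

```latex
f_{\mathrm{osc}} \approx \frac{1}{2 N \tau}
  = \frac{1}{2 \cdot 11 \cdot 300\,\mathrm{ns}}
  = \frac{1}{6.6\,\mu\mathrm{s}}
  \approx 152\,\mathrm{kHz}
```
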
THE METAMODELING APPROACH TO SYSTEM LEVEL SYNTHESIS

Abstract
This paper presents an industry-proven, metamodelling-based approach to System-Level Synthesis, which is seen as a generic design automation strategy above today's implementation levels, RTL (for digital) and Schematic Entry (for analog). The approach follows a new synthesis paradigm: the designer develops a simple domain-specific and/or design-specific language and a smart tool that synthesizes implementation-level models according to the designer's needs. The overhead of making both a tool and a model pays off since the tool building is automated by code generation and reuse, both based on metamodelling techniques. The focus on one's own needs also keeps development costs low. Finally, utilization of specification data also keeps modeling effort low and increases design consistency.

**Speakers:**
Wolfgang Ecker and Luigi Porreca, Infineon Technologies, DE

14:15 11.3.2 LOGIC SYNTHESIS OF LOW-POWER ICS WITH ULTRA-WIDE VOLTAGE AND FREQUENCY SCALING

**Speakers:**
Yu Pu, Juan Echeverri, Maurice Meijer and Jose Pineda de Gyvez, NXP Research, NL

**Abstract**
For low-power digital ICs with ultra-wide voltage and frequency scaling (e.g., from the nominal supply voltage to the sub/near-threshold regime), achieving design closure can be a big challenge, especially when speed limits are pushed at very different voltages. This paper shares a practical logic synthesis recipe that helps to fulfill tight timing constraints. Our method includes: i) synthesizing circuits at a high voltage; ii) over-constraining maximal transition time; iii) pruning standard cell library based on cell delay degradation factor across voltages. This approach shows effectiveness on an industrial 90nm low-power micro-controller.
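
Step iii of the recipe above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the "delay degradation factor" is taken here to be the ratio of a cell's delay at the low supply voltage to its delay at the nominal voltage, and cells above a chosen threshold are excluded from synthesis. The cell names, delays, and threshold are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellTiming:
    """Characterized delay of one standard cell at two supply voltages."""
    name: str
    delay_nominal_ns: float   # delay at the nominal supply voltage
    delay_low_ns: float       # delay at the low (near-threshold) supply voltage


def degradation_factor(cell):
    """Ratio of low-voltage delay to nominal-voltage delay (assumed definition)."""
    return cell.delay_low_ns / cell.delay_nominal_ns


def prune_library(cells, max_factor):
    """Keep only cells whose delay degrades by at most max_factor."""
    return [c for c in cells if degradation_factor(c) <= max_factor]


# Hypothetical characterization data, for illustration only.
LIBRARY = [
    CellTiming("INV_X1", 0.02, 0.40),     # factor about 20
    CellTiming("NAND2_X1", 0.03, 0.75),   # factor about 25
    CellTiming("NOR4_X1", 0.05, 2.50),    # factor about 50
]

kept = prune_library(LIBRARY, max_factor=30.0)
kept_names = [c.name for c in kept]
```

With the threshold of 30 used here, the cell that slows down most at low voltage is dropped, so the synthesis tool cannot place it on a path that must also meet timing in the low-voltage regime.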

14:30 11.3.3 FORMAL VERIFICATION OF TAINT-PROPAGATION SECURITY PROPERTIES IN A COMMERCIAL SOC DESIGN

**Speakers:**
Pramod Subramanyan\(^1\) and Divya Arora\(^2\)

\(^1\)Princeton University, US; \(^2\)Intel Corporation, US

**Abstract**
SoCs embedded in mobile phones, tablets and other smart devices come equipped with numerous features that impose specific security requirements on their hardware and firmware. Many security requirements can be formulated as taint-propagation properties that verify information flow between a set of signals in the design. In this work, we take a tablet SoC design, formulate its critical security requirements as taint-propagation properties, and prove them using a formal verification flow. We describe the properties targeted, techniques to help the verifier scale, and security bugs uncovered in the process.
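
To make the notion of a taint-propagation property concrete, the sketch below checks a toy version of one: a value marked as tainted at a source signal must never reach a given sink. It uses plain graph reachability over a signal fan-out map, which is a structural over-approximation and not the formal verification flow used in the paper. The netlist and signal names are hypothetical.

```python
from collections import deque


def tainted_signals(fanout, sources, blockers=frozenset()):
    """Return every signal reachable from the tainted sources.

    fanout maps a signal to the signals it drives. Signals listed in
    blockers receive taint but do not pass it on.
    """
    tainted = set(sources)
    queue = deque(sources)
    while queue:
        signal = queue.popleft()
        if signal in blockers:
            continue
        for driven in fanout.get(signal, ()):
            if driven not in tainted:
                tainted.add(driven)
                queue.append(driven)
    return tainted


def property_holds(fanout, sources, sinks, blockers=frozenset()):
    """Taint-propagation property: no tainted value may reach a sink."""
    return tainted_signals(fanout, sources, blockers).isdisjoint(sinks)


# Hypothetical toy netlist, for illustration only.
FANOUT = {
    "key_reg": ["crypto_in", "debug_mux"],
    "crypto_in": ["crypto_out"],
    "debug_mux": ["debug_port"],
    "crypto_out": ["bus"],
}

leaky = property_holds(FANOUT, {"key_reg"}, {"debug_port"})
fixed = property_holds(FANOUT, {"key_reg"}, {"debug_port"}, blockers={"debug_mux"})
```

In the toy netlist the key register leaks to the debug port through a multiplexer, so the property fails until that multiplexer is modelled as blocking the flow.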

14:45 11.3.4 EARLY DESIGN STAGE THERMAL EVALUATION AND MITIGATION: THE Locomotiv ARCHITECTURAL CASE

**Speakers:**
Tanguy Sassolas\(^1\), Chiara Sandionigi\(^1\), Alexandre Guerre\(^2\), Alexandre Aminot\(^1\), Pascal Vivet\(^1\), Hela Boussetta\(^4\), Luca Ferro\(^4\) and Nicolas Pellier\(^4\)

\(^1\)CEA LIST, FR; \(^2\)CEA LIST, FR; \(^3\)CEA-LETI, FR; \(^4\)DOCEA Power, FR

**Abstract**
To offer more computing power to modern SoCs, transistors keep scaling in new technology nodes. Consequently, the power density is increasing, leading to higher thermal risks. Thermal issues need to be addressed as early as possible in the design flow, when the optimization opportunities are the highest. For early design stages, architects rely on virtual prototypes to model their designs' behavior with an adapted trade-off between accuracy and simulation speed. Unfortunately, accurate virtual prototypes fail to encompass thermal effects timescale. In this paper, we demonstrate that less accurate high-level architectural models, in conjunction with efficient thermal and simulation tools, provide an adapted environment to analyze thermal issues and design software thermal mitigation solutions in the case of the Locomotiv MPSoC architecture.

15:00 11.3.5 MULTI-DISCIPLINARY INTEGRATED DESIGN AUTOMATION TOOL FOR AUTOMOTIVE CYBER-PHYSICAL SYSTEMS

**Speakers:**
Arquimedes Canedo\(^1\), Mohammad Abdullah Al Faruque\(^2\) and Jan Richter\(^3\)

\(^1\)Siemens Corporation, US; \(^2\)University of California Irvine, US

**Abstract**
This paper presents our multi-year experience in the development of a Functional Modeling Compiler (FMC), a new model-based design tool for the development of multi-disciplinary automotive cyber-physical systems. We show how system-level simulation models suitable for design-space exploration of complex architectures can be synthesized from functional specifications to test and validate the interactions between ECUs, control algorithms, and the multi-physics.

15:15 11.3.6 PREDICTIVE PARALLEL EVENT-DRIVEN HDL SIMULATION WITH A NEW POWERFUL PREDICTION STRATEGY

**Speakers:**
Seiyang Yang\(^1\), Jaehoon Han\(^1\), Doowhan Kwon\(^1\), Namdo Kim\(^2\), Daeseo Cha\(^2\), Junhyuck Park\(^2\) and Jay Kim\(^2\)

\(^1\)Pusan National University, KR; \(^2\)Samsung Electronics Co., KR

**Abstract**
Traditional parallel event-driven HDL simulation methods suffer from heavy synchronization and communication overhead caused by the timely transfer of signal data among local simulators, which can easily nullify most of the expected simulation speed-up from parallelization. A new predictive parallel event-driven HDL simulation method is proposed for a series of design changes that are not only timing-oriented but also function-oriented, together with a new, powerful prediction strategy. Experiments with real industrial SoC designs have been performed for actual design changes and have shown the effectiveness of the enhanced approach.

15:30 End of session

Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

### 11.4 Enabling validation on fast platforms

**Date:** Thursday 27 March 2014

**Time:** 14:00 - 15:30

**Location / Room:** Konferenz 2

**Chair:** Ronny Morad, IBM, IL

**Co-Chair:** Franco Fummi, Universita' di Verona, IT

Fast platforms, whether acceleration, post-silicon or virtual prototypes, are key technologies for enabling validation of complex systems. However, they present enormous challenges to become effective. This session presents four papers and two IPs that propose solutions to overcome some of them, thus enabling much higher performance and coverage.

**11.4.1 COVERAGE EVALUATION OF POST-SILICON VALIDATION TESTS WITH VIRTUAL PROTOTYPES**

**Abstract**

High-quality tests for post-silicon validation should be ready before a silicon device becomes available in order to save time spent on preparing, debugging and fixing tests after the device is available. Test coverage is an important metric for evaluating the quality and readiness of post-silicon tests. We propose an online-capture offline-replay approach to coverage evaluation of post-silicon validation tests with virtual prototypes for estimating silicon device test coverage. We first capture necessary data from a concrete execution of the virtual prototype within a virtual platform under a given test, and then compute the test coverage by efficiently replaying this execution offline on the virtual prototype itself. Our approach provides early feedback on the quality of post-silicon validation tests before silicon is ready. To ensure fidelity of early coverage evaluation, our approach has been further extended to support coverage evaluation and conformance checking in the post-silicon stage. We have applied our approach to evaluate a suite of common tests on virtual prototypes of five network adapters. Our approach was able to reliably estimate that this suite achieves high functional coverage on all five silicon devices.

**Authors**

Kai Cong, Li Lei, Zhenkun Yang and Fei Xie, Portland State University, US

**Speakers**

Kai Cong, Li Lei, Zhenkun Yang and Fei Xie, Portland State University, US

**Time**

14:00

**Label**

11.4.1 (Best Paper Award Candidate)

**11.4.2 ARCHIVED: ARCHITECTURAL CHECKING VIA EVENT DIGESTS FOR HIGH PERFORMANCE VALIDATION**

**Abstract**

Simulation-based techniques play a key role in validating the functional correctness of microprocessor designs. A common approach for validating microprocessors (called instruction-by-instruction, or IBI checking) consists of running an RTL and an architectural simulation in lock-step, while comparing processor architectural state at each instruction retirement. This solution, however, cannot be deployed on long regression tests, because of the limited performance of RTL simulators. Acceleration platforms have the performance power to overcome this issue, but are not amenable to the deployment of an IBI checking methodology. Indeed, validation on these platforms requires logging activity on-platform and then checking it against a golden model off-platform. Unfortunately, an IBI checking approach following this paradigm entails a large slowdown for the acceleration platform, because of the sizable amount of data that must be transferred off-platform for comparison against the golden model. In this work we propose a sequence-by-sequence (SBS) checking approach that is efficient and practical for acceleration platforms. Our solution validates the test execution over sequences of instructions (instead of individual ones), thus greatly reducing the amount of data transferred for off-platform checking. We found that SBS checking delivers the same bug-detection accuracy as traditional IBI checking, while reducing the amount of traced data by more than 90%.
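
The difference between checking every instruction and checking one digest per sequence can be illustrated as follows. This is a minimal sketch under assumptions: SHA-256 stands in for whatever digest the authors use, the trace record layout is hypothetical, and both traces are assumed to have the same length.

```python
import hashlib


def sequence_digests(trace, sequence_length):
    """Fold each run of sequence_length retired-instruction records into one digest."""
    digests = []
    for start in range(0, len(trace), sequence_length):
        hasher = hashlib.sha256()
        for pc, reg, value in trace[start:start + sequence_length]:
            hasher.update(f"{pc:x}:{reg}:{value:x};".encode())
        digests.append(hasher.hexdigest())
    return digests


def first_mismatching_sequence(platform_trace, golden_trace, sequence_length):
    """Return the index of the first sequence whose digests differ, or None."""
    platform = sequence_digests(platform_trace, sequence_length)
    golden = sequence_digests(golden_trace, sequence_length)
    for index, (p, g) in enumerate(zip(platform, golden)):
        if p != g:
            return index
    return None


# Hypothetical traces: (program counter, destination register, written value).
GOLDEN = [(0x100 + 4 * i, f"r{i % 4}", i * 3) for i in range(12)]
BUGGY = list(GOLDEN)
BUGGY[9] = (GOLDEN[9][0], GOLDEN[9][1], GOLDEN[9][2] + 1)  # inject one wrong result

bad_sequence = first_mismatching_sequence(BUGGY, GOLDEN, sequence_length=4)
digests_sent = len(sequence_digests(BUGGY, 4))
```

Only three digests leave the platform instead of twelve per-instruction records, and the mismatch still points to the sequence that contains the corrupted instruction.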

**Authors**

Chang-Hong Hsu¹, Debapriya Chatterjee², Ronny Morad¹, Raviv Gal³ and Valeria Bertacco⁴

¹University of Michigan, Ann Arbor, US; ²IBM Research - Austin, US; ³IBM Research - Haifa, IL

**Speakers**

Chang-Hong Hsu, Debapriya Chatterjee, Ronny Morad, Raviv Gal and Valeria Bertacco

**Time**

14:30

**Label**

11.4.2

**11.4.3 EFFECTIVE POST-SILICON FAILURE LOCALIZATION USING DYNAMIC PROGRAM SLICING**

**Abstract**

In post-silicon functional validation, one of the complex and time-consuming processes is the localization of an instruction that exposes a bug detected at silicon level. The task is especially hard due to the silicon’s limited observability and the long time between the failure’s occurrence and its detection. We propose a novel method that automates the architectural localization of post-silicon test-case failures. The proposed tool analyzes a failing test-case, while leveraging the information derived from executing the test on an Instruction Set software Simulator (ISS), to identify a set of instructions that could lead to the faulty final state. The proposed failure localization process comprises the creation of a resource dependency graph based on the execution of the test-case on the ISS, determining a program slice of instructions that influence the faulty resources, and the reduction of the set of suspicious instructions by leveraging the knowledge of the correct resources. We evaluate our proposed solution through extensive experiments. Experimental results show that, in over 97% of all cases, our method was able to narrow down the list of suspicious instructions to under 2 instructions, on average, out of over 200. In over 59% of all cases, our method correctly reduced a test-case to a single faulty instruction.
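
The slicing step described above can be sketched as a backward walk over the executed instructions. This is a simplified illustration, not the authors' tool: each executed instruction is reduced to the resources it reads and writes, as an ISS could report them, and the further reduction that uses known-correct resources is not shown. The example trace is hypothetical.

```python
def backward_slice(trace, faulty_resources):
    """Return indices of executed instructions that influence the faulty resources.

    trace is a list of (reads, writes) pairs in execution order.
    """
    wanted = set(faulty_resources)   # resources whose producers are still sought
    suspects = []
    for index in range(len(trace) - 1, -1, -1):
        reads, writes = trace[index]
        hit = wanted & set(writes)
        if hit:
            suspects.append(index)
            wanted -= hit            # earlier writes to these were overwritten
            wanted |= set(reads)     # the inputs of this instruction now matter
    return sorted(suspects)


# Hypothetical executed trace: (resources read, resources written).
TRACE = [
    ({"mem_a"}, {"r1"}),        # 0: r1 = load [a]
    (set(), {"r2"}),            # 1: r2 = 5
    ({"r1", "r2"}, {"r3"}),     # 2: r3 = r1 + r2
    ({"r2"}, {"r4"}),           # 3: r4 = r2 * 2
    (set(), {"r2"}),            # 4: r2 = 7
]

suspects = backward_slice(TRACE, {"r3"})
```

For a wrong final value in `r3`, instructions 3 and 4 drop out of the slice because nothing they write feeds `r3`.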

**Authors**

Ophir Friedler, Wisam Kadyr, Arkady Morgenshtein, Amir Nahir and Vitali Sokhin, IBM Research - Haifa, IL

**Speakers**

Ophir Friedler, Wisam Kadyr, Arkady Morgenshtein, Amir Nahir and Vitali Sokhin, IBM Research - Haifa, IL

**Time**

15:00

**Label**

11.4.3

**11.4.4 DESIGN-FOR-DEBUG ROUTING FOR FIB PROBING**

**Abstract**

To observe internal signals, physical probing is an important step in post-silicon debug. Focused ion beam (FIB) is one of the most popular probing technologies. However, an unsuitable layout significantly decreases the percentage of nets that can be observed through FIB probing in advanced process technologies. This paper presents the first design-for-debug routing to increase the FIB observable rate. The proposed algorithm, which adopts three FIB states and costs to enhance maze routing, keeps at least one FIB candidate for each net while routing. Experimental results demonstrate that the proposed method can significantly increase the FIB observable rate under 100% routability.

**Authors**

Chia-Yi Lee, Tai-Hung Li and Tai-Chen Chen, National Central University, TW

**Speakers**

Chia-Yi Lee, Tai-Hung Li and Tai-Chen Chen, National Central University, TW

**Time**

15:15

**Label**

11.4.4

### 11.5 Memory Resource Allocation and Scheduling in MPSoC

**Date:** Thursday 27 March 2014

**Time:** 14:00 - 15:30

**Location / Room:** Konferenz 3

**Chair:**

Low-latency data access and efficient interprocess communication are critical to MPSoC performance and power efficiency. This session introduces innovative approaches for data placement, memory bandwidth allocation and scheduling techniques in MPSoC architectures with heterogeneous 2D/3D memory hierarchies.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:00</td>
<td>11.5.1</td>
<td>(Best Paper Award Candidate) SCENARIO-AWARE DATA PLACEMENT AND MEMORY AREA ALLOCATION FOR MULTIPLE-PROCESSOR SYSTEM-ON-CHIPS WITH RECONFIGURABLE 3D-STACKED SRAMS</td>
<td>Mengling Tsai, Yirung Chen, Yiting Chen and Ru-Hua Chang, Department of Computer Science and Information Engineering, National Chi Nan University, TW</td>
</tr>
<tr>
<td>14:15</td>
<td>11.5.2</td>
<td>OPTIMIZED BUFFER ALLOCATION IN MULTICORE PLATFORMS</td>
<td>Maximilian Odendahl, Andres Goens, Rainer Leupers, Gerd Ascheid, Benjamin Ries, Berthold Voening and Tomas Henriksson</td>
</tr>
<tr>
<td>15:00</td>
<td>11.5.3</td>
<td>MEMORY-CONSTRAINED STATIC RATE-OPTIMAL SCHEDULING OF SYNCHRONOUS DATAFLOW GRAPHS VIA RETIMING</td>
<td>Xue-Yang Zhu, Marc Geilen, Twan Basten and Sander Stuijk</td>
</tr>
<tr>
<td>15:15</td>
<td>11.5.4</td>
<td>A CONSTRAINT-BASED DESIGN SPACE EXPLORATION FRAMEWORK FOR REAL-TIME APPLICATIONS ON MPSoC</td>
<td>Kathrin Rosvall and Ingo Sander, KTH Royal Institute of Technology, SE</td>
</tr>
<tr>
<td>15:31</td>
<td>11.5.5</td>
<td>RELIABILITY-AWARE MAPPING OPTIMIZATION OF MULTI-CORE SYSTEMS WITH MIXED-CRITICALITY</td>
<td>Shin-Haeng Kang, Hoesook Yang, Sungchan Kim, Juliana Baciavaro, Soonhoi Ha and Lothar Thiele</td>
</tr>
</tbody>
</table>

Abstracts and details of the presentations can be found in the referenced sources.

MDTM: MULTI-OBJECTIVE DYNAMIC THERMAL MANAGEMENT FOR ON-CHIP SYSTEMS

**Speakers:**

Heba Khdr, Thomas Ebi, Muhammad Shafique, Hussam Amrouch and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE

**Abstract**

Thermal hot spots and unbalanced temperatures between cores on a chip can degrade performance, severely impact reliability, or both. In this paper, we propose mDTM, a proactive dynamic thermal management technique for on-chip systems. It employs multi-objective management for migrating tasks in order both to prevent the system from hitting an undesirable thermal threshold and to balance the temperatures between the cores.

THERMAL MANAGEMENT OF BATTERIES USING A HYBRID SUPERCAPACITOR ARCHITECTURE

**Speakers:**

Donghwa Shin¹, Massimo Poncino² and Enrico Macii³

¹Department of Computer Engineering, Yeungnam University, KR; ²Politecnico di Torino, IT; ³Dipartimento di Automatica e Informatica, Politecnico di Torino, IT

**Abstract**

Thermal analysis and management of batteries have been an important research issue for battery-operated systems such as electric vehicles and mobile devices. Nowadays, battery packs are designed with heat dissipation in mind, and external cooling devices such as cooling fans are also widely used to increase the energy efficiency of the battery system. However, the cooling efficiency of the battery system is not always sufficient, and overheating may reduce the lifetime of the cell and increase the risk of fire.

MINIMAL SPARSE OBSERVABILITY OF COMPLEX NETWORKS: APPLICATION TO MPSoC SENSOR PLACEMENT AND RUN-TIME THERMAL ESTIMATION & TRACKING

**Speakers:**

Sanatan Sarma and Nikil Dutt, University of California Irvine, US

**Abstract**

This paper addresses the fundamental and practically useful question of identifying a minimum set of sensors and their locations through which a large complex dynamical network system and its time-dependent states can be observed. The paper defines the minimal sparse observability problem (MSOP) and provides analytical tools with necessary and sufficient conditions to make an arbitrary complex dynamic network system completely observable. The mathematical tools are then used to develop effective algorithms to find the sparsest measurement vector that provides the ability to estimate the internal states of a complex dynamic network system from experimentally accessible outputs. The developed algorithms are further used in the design of a sparse Kalman filter (SKF) to estimate the time-dependent internal states of a linear time-invariant (LTI) dynamical network system. The approach is applied to illustrate the minimum sensor in-situ run-time thermal estimation and robust hotspot tracking for dynamic thermal management (DTM) of high performance processors and MPSoCs.
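
For reference, the classical notion of complete observability that the abstract builds on can be stated compactly. The notation below is assumed for illustration and is not taken from the paper: a discrete-time LTI system with n states, state matrix A, and measurement matrix C. Each row of C corresponds to one measured output, so a sparse sensor placement can be read as a choice of C with few rows that still satisfies the rank condition.

```latex
\begin{aligned}
x_{k+1} &= A x_k, \qquad y_k = C x_k, \\
\mathcal{O} &= \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix}, \qquad
\text{the system is observable} \iff \operatorname{rank}\mathcal{O} = n .
\end{aligned}
```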

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>15:31</td>
<td>IPS-17</td>
<td>THERMAL ANALYSIS AND MODEL IDENTIFICATION TECHNIQUES FOR A LOGIC + WIDEIO STACKED DRAM TEST CHIP</td>
<td>Francesco Beneventi¹, Andrea Bartolini¹, Pascal Vivet², Denis Dutoit² and Luca Benini¹</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Speakers:</strong></td>
<td>¹DEI - University of Bologna, IT; ²CEA-Leti, Grenoble, FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>High temperature is one of the limiting factors and major concerns in 3D-chip integration. In this paper we use a 3D test chip (WIDEIO DRAM on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects. We correlated real temperature measurements with the power dissipated by the heaters using model learning techniques. The resulting compact thermal model is able to predict temperatures at chip locations far from the temperature sensors and to infer the power dissipation at any location of the chip. Results are verified by means of an off-sample validation technique and show a high accuracy of the compact thermal model when compared with silicon measurements.</td>
</tr>
<tr>
<td>15:32</td>
<td>IPS-18</td>
<td>ADAPTIVE POWER ALLOCATION FOR MANY-CORE SYSTEMS INSPIRED FROM MULTIAGENT AUCTION MODEL</td>
<td>Xiaohang Wang¹, Baoyin Zhao¹, Terrence Mak², Mei Yang³, Yingtao Jiang³, Masoud Daneshfar² and Maurizio Palesi¹</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Speakers:</strong></td>
<td>¹Guangzhou Institute of Advanced Technology, CN; ²The Chinese University of Hong Kong, CN; ³University of Nevada, Las Vegas, US; ⁴University of Turku, FI; ⁵University of Enna, Kore, IT</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Scaling of future many-core chips is hindered by the challenge imposed by ever-escalating power consumption. At its worst, an increasing fraction of the chips will have to be shut down, as power supply is inadequate to simultaneously switch all the transistors. This so-called dark silicon problem brings up a critical issue regarding how to achieve the maximum performance with a given limited power budget. This issue is further complicated by two facts. First, high variation in power budget calls for wide-range power control capability, whereas most current frequency/voltage scaling techniques cannot effectively adjust power over such a wide range. Second, as the applications' behavior becomes more complicated, there is a pressing need for scalability and global coordination, rendering heuristic-based centralized or fully distributed control schemes inefficient. To address the aforementioned problems, in this paper, a power allocation method employing multiagent auction models is proposed, referred to as Hierarchical MultiAgent based Power allocation (HiMAP). Tiles play the role of consumers that bid for power budget, and the whole process is modeled by a combinatorial auction, whereas HiMAP finds the Walrasian equilibria. Experimental results have confirmed that HiMAP can reduce the execution time by as much as 45% compared to three competing methods. The runtime overhead and cost of HiMAP are also small, which makes it suitable for adaptive power allocation in many-core systems.</td>
</tr>
<tr>
<td>15:33</td>
<td>IPS-19</td>
<td>UNIFIED, ULTRA COMPACT, QUADRATIC POWER PROXIES FOR MULTI-CORE PROCESSORS</td>
<td>Muhammad Yasini¹, Ibrahim (Abe) Elfadel² and Anas Shahrouk²</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Speakers:</strong></td>
<td>¹New York University - Abu Dhabi, AE; ²Masdar Institute of Science and Technology, AE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Per-core power proxies for multi-core processors are known to use several dozens of hardware activity monitors to achieve a 2% accuracy on core power estimation. These activity monitors are typically not accessible to the user, and even if they were accessible, there would be a significant overhead in using them at the kernel or OS level for power monitoring or control. Furthermore, when scaled up to hundreds of cores per chip, such power proxies become a computational bottleneck for power management operations such as chip power capping. In this paper, we show that a 4% accuracy or better for per-core power estimation can be achieved using an ultra compact power proxy based on a hybrid set of only four user-accessible parameters, namely core frequency, core temperature, instruction-per-cycle and active-state residency. Our proxy is nonlinear, valid across all P and C states, and is based on a randomized power data collection strategy that aims at exercising all the P and C levels of each core. We illustrate the accuracy of the model using the full suite of the SPEC CPU 2006 benchmarks on a 12-core processor.</td>
</tr>
<tr>
<td>15:30</td>
<td></td>
<td>End of session</td>
<td></td>
</tr>
</tbody>
</table>

### 11.7 Power and Emerging Technologies in Reconfigurable Computing

**Date:** Thursday 27 March 2014  
**Time:** 14:00 - 15:30  
**Location/Room:** Konferenz 5  
**Chair:** Diana Goehringer, Ruhr-University Bochum (RUB), DE  
**Co-Chair:** Fabrizio Ferrandi, Politecnico di Milano, IT

The first two papers in this session propose new architectures that take advantage of emerging nonvolatile memory technologies. The third paper proposes a battery cell aware task partitioning and mapping to maximize battery runtime.

| Time | Label | Presentation Title | Authors |
|---|---|---|---|
| 14:00 | 11.7.1 | EXPLOITING STT-NV TECHNOLOGY FOR RECONFIGURABLE, HIGH PERFORMANCE, LOW POWER, AND LOW TEMPERATURE FUNCTIONAL UNIT DESIGN | Adarsh Reddy¹, Hamid Mahmoodi² and Houman Homayoun¹<br>¹George Mason University, US; ²San Francisco State University, US |
|       |       | **Abstract**                                                                      | The unavailability of functional units and their unequal activity create performance bottlenecks and thermal hot spots in general-purpose processors. We propose to use reconfigurable functional units to overcome these challenges. A selected set of complex functional units that might be under-utilized, such as a multiplier and divider, are realized in a time-multiplexed fashion using a shared programmable Look Up Table (LUT) based fabric. This allows for run-time reconfiguration and migration of their activity. The LUT-based implementation also allows under-utilized functional units to be dynamically reconfigured into the functional units that are a performance bottleneck, thereby improving performance. The programmable LUTs are realized using Spin Transfer Torque (STT) Magnetic technology (also called STT-NV) due to its zero leakage and CMOS compatibility. The results show a significant performance improvement of 16% on average across standard benchmarks when replacing the CMOS multiplier and divider with their reconfigurable STT-NV LUT counterparts. In addition, reconfiguration reduces the maximum temperature of functional units by up to 70°C and almost eliminates the thermal variation across them. This comes with a small power overhead and no area impact. |

15:00 11.7.3 EXTENDING LIFETIME OF BATTERY-POWERED COARSE-GRAINED RECONFIGURABLE COMPUTING PLATFORMS

Shouyi Yin, Peng Guang, Leibo Liu¹ and Shaojun Wei²
¹Tsinghua University, CN; ²Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, CN

Abstract

The coarse-grained reconfigurable architecture (CGRA) is a promising platform for mobile computing. This paper addresses how to prolong the lifetime of a battery-powered reconfigurable computing platform. Considering the nonlinear characteristics of the battery, a multi-objective optimization model is built for extending battery lifetime. Based on this model, a joint task-battery scheduling algorithm is proposed. Experimental results show that the proposed method improves battery runtime by 26.22% on average compared with state-of-the-art methods.

15:10 IPS-20, 659 3D FPGA USING HIGH-DENSITY INTERCONNECT MONOLITHIC INTEGRATION

Ogun Turkyilmaz¹, Gerald Cibrario², Olivier Rozeau², Perrine Batude² and Fabien Clermidy³
¹CEA-LETI, Minatec Campus, FR; ²CEA, FR; ³CEA-LETI, FR

Abstract

New 3D technology, called ‘Monolithic Integration’, offers very dense 3D interconnect capabilities. In this paper, we propose a 3D FPGA architecture with logic-on-memory approach based on this technology. The routing and computation blocks are split into two layers where the logic is placed on the top and memory on the bottom. Using extracted values from layout in 14nm FDSoI technology, typical benchmark circuits are evaluated in the VPR toolflow. The results show an area reduction of 55% compared to the 2D FPGA. More importantly, due to the lowered routing congestion, the EDP of the 3D FPGA is improved by 4%.

15:32 IPS-21, 526 JOINT COMMUNICATION SCHEDULING AND INTERCONNECT SYNTHESIS FOR FPGA-BASED MANY-CORE SYSTEMS

Alessandro Cilardo, Edoardo Fusella, Luca Gallo and Antonino Mazzeo, University of Naples Federico II, IT

Abstract

This work proposes an automated methodology for optimizing FPGA-based many-core interconnect architectures. Based on the application communication requirements, the methodology concurrently defines the structure of the interconnect and the communication task scheduling, taking into account possible dependencies between tasks under given area constraints. The resulting architecture improves the level of communication parallelism that can be exploited while keeping area costs low. The paper thoroughly describes the proposed approach and discusses a few case-studies showing the impact of the proposed technique.

15:33 IPS-22, 688 A NOVEL EMBEDDED SYSTEM FOR VISION TRACKING

Antonis Nikitakis¹, Theofilos Pagonas¹ and Ioannis Papaefstathiou²
¹Technical University of Crete, Department of Electronic and Computer Engineering Kounoupidiana, Chania, Crete, GR; ²Synelixis Solutions Ltd, Farmakidou 10, Chalkida, GR34100, Greece, GR

Abstract

One of the most important challenges in the field of Computer Vision is the implementation of low-power embedded systems that will execute very accurate, yet real-time, algorithms. In the visual tracking sector one of the most promising approaches is the recently introduced OpenTLD algorithm which uses a random forest classification method. While it is very robust, it cannot be efficiently parallelized in its native form as its memory access pattern has certain characteristics that make it hard to take advantage of the conventional memory hierarchies. In this paper, we present a novel embedded system implementing this algorithm. We accelerate the bottleneck of the algorithm by designing and implementing a high bandwidth distributed memory subsystem which is independent of the various software parameters. We demonstrate the applicability and efficiency of this novel approach by implementing our scheme in a modern FPGA.

15:30 End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

### 11.8 Embedded Tutorial: GPGPUs: how to combine high computational power with high reliability

**Date:** Thursday 27 March 2014  
**Time:** 14:00 - 15:30  
**Location/Room:** Exhibition Theatre  
**Organiser:** Matteo Sonza Reorda, Politecnico di Torino, IT  
**Chair:** Dimitris Gizopoulos, University of Athens, GR  
**Co-Chair:** Rob Aitken, ARM, US

The embedded tutorial aims at providing an updated view on what GPGPUs can provide not only in terms of performance and power, but also in terms of reliability, and how the latter can be evaluated and possibly improved.

14:30 11.8.2  GPU RELIABILITY ASSESSMENT AND ENHANCEMENT  Authors: Paolo Rech¹, Luigi Carro¹ and Steve Keckler²  
¹UFRGS, BR; ²NVIDIA, US

15:00 11.8.3  EVALUATING THE ROBUSTNESS OF GPU APPLICATIONS THROUGH FAULT INJECTION  Speakers: Karthik Pattabiraman¹, Bo Fang¹ and Sudhanva Gurumurthi²  
¹UBC, CA; ²AMD, US

15:30  End of session  
Coffee Break in Exhibition Area


### UB11 Session 11

**Date:** Thursday 27 March 2014  
**Time:** 14:30 - 16:30  
**Location/Room:** University Booth, Booth 3, Exhibition Area

UB11.01 CYCLOEE: DESIGNING CLOUD-BASED SELF-HEALING CYBER-PHYSICAL SYSTEMS  Authors:  
Giulio Gambardella¹, Silviu Folea², Miha Lulea², Liviu Miclea², George Mois³, Teodora Sanda³, Marco Incaco¹, Paolo Prinetto¹, Daniele Rolfo¹ and Pascal Trotta¹  
¹Politecnico di Torino, IT; ²Universitatea Tehnica din Cluj-Napoca Departamentul de Automatica, RO

Abstract  
Cyber-Physical Systems (CPSs) are a new generation of systems that go beyond networking and information technology, with information and knowledge being integrated into physical objects. These systems are physical and engineered systems whose actions are monitored, controlled, and integrated by a computing and communication kernel. The Cycloee project aims at developing: (1) an infrastructure for designing self-healing CPSs using cloud computing technology; (2) an experimental model for CPSs using wireless sensor networks (WSNs) for data acquisition, reliable hardware components based on reconfigurable devices, namely Field Programmable Gate Arrays (FPGAs), and cloud computing technology to store, manage and analyse data in a large context.


UB11.02 BRIDGING MATLAB/SIMULINK AND ESL DESIGN VIA AUTOMATIC CODE GENERATION  Authors:  
Liuyuan Zhang, Michael Glaß and Jürgen Teich, University of Erlangen-Nuremberg, DE

Abstract  
Matlab/Simulink is today’s de-facto standard for model-based design in domains such as control engineering and signal processing. Commercial tools are available to generate embedded C or HDL code directly from a Simulink model. However, Simulink models are purely functional models and, hence, designers cannot seamlessly consider the architecture that a Simulink model is later implemented on. In particular, it is not possible to explore the different architectural alternatives and investigate the arising interactions and side-effects directly within Simulink. To benefit from Matlab/Simulink’s algorithm exploration capabilities and overcome the outlined drawbacks, we introduce a model transformation framework that converts a Simulink model to an executable specification, written in an actor-oriented modeling language. This specification then serves as the input of an established Electronic System Level (ESL) design flow, enabling Design Space Exploration (DSE) and automatic code generation for both hardware and software. In this demonstration, we will show how to automatically transform Simulink models to an established ESL design flow by means of a code generator. Based on the generated code, we will present a co-simulation approach that combines complex environmental models from Matlab/Simulink with the auto-generated model of a controller. We will use an Anti-lock Braking System (ABS) as an example where we investigate the impact of different controller implementations in the automotive E/E architecture. In detail, the following scientific achievements are included in the proposed demonstration: To bridge Simulink and ESL design flows, we developed an ESL Code-Generator to perform model transformation. The idea is that for any given Simulink model, such as a controller in a control system, the designer can simply invoke our Code-Generator to create the ESL model automatically.
In our design flow, we use SystemC as a programming language with an extension of actors with a specific Model of Computation (MoC). We guarantee the preservation of the semantics of the generated model by (a) applying a specific 1-to-1 mapping from Simulink basic blocks to an actor library and (b) considering different transformations to capture single-rate and multi-rate Simulink models. After the model transformation is finished, this auto-generated SystemC model serves as the input of a well-established ESL design flow that enables DSE. Besides the Code-Generator, we also demonstrate a validation technique that considers the functional correctness by comparing the original Simulink model with the generated SystemC model. The main idea behind this technique is (1) to co-simulate the auto-generated model along with the original model and (2) to reuse the environment model and the test bench that are originally created in Simulink also for the auto-generated model. Furthermore, the performance of the model can also be measured during co-simulation. In this demonstration, an ABS model will be transformed from Simulink to SystemC by invoking the ESL Code-Generator. Then, by applying our validation technique, the correctness and the accuracy of the auto-generated model can be examined. Lastly, to evaluate the performance of the model, application-dependent quality of control will be measured, such as the braking distance on an icy road.


UB11.03 BICONDITIONAL BINARY DECISION DIAGRAM MANIPULATION PACKAGE  Authors:  
Luca Amaru¹, Alexios Balatsoukas-Stimming², Pierre-Emmanuel Gaillardon¹, Andreas Burg² and Giovanni De Micheli³  
¹EPFL, CH; ²EPFL-TCL, CH; ³EPFL-LSI, CH

Abstract  
In this software demonstration, we present a logic manipulation package based on Biconditional Binary Decision Diagrams (BBDDs). BBDDs are a novel class of canonical binary decision diagrams where the branching condition, and its associated logic expansion, is biconditional on two variables.


UB11.04 HEROES^2: A SYSTEMC FRAMEWORK FOR MODELING, SIMULATION AND TESTING OF HETEROGENEOUS SOFTWARE-INTENSIVE SYSTEMS  Authors:  
Markus Becker¹, Wolfgang Mueller¹, Ulrich Kiffmeier² and Joachim Stropp³  
¹University of Paderborn/C-LAB, DE; ²dSPACE GmbH, DE

Abstract  
Heroes^2 is a SystemC framework for modeling and simulation of heterogeneous SW-intensive systems. It has 8 abstraction levels for co-refinement of application/environment models, from continuous/discrete models to networked embedded SW stacks. Support for various SW and communication abstractions is achieved by combining AMS MoCs, TLM, HhS models (MW, RTOS, HAL) and the QEMU user mode/system emulator. Interfacing with a commercial AUTOSAR toolchain is provided, i.e., code generators, integration and experimentation tools.

### IP5 Interactive Presentations

**Date:** Thursday 27 March 2014  
**Time:** 15:30 - 16:00  
**Location/Room:** Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated to the IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award ‘Best IP of the Day’ is given.

### IP5.01 HYBRID WIRE-SURFACE WAVE ARCHITECTURE FOR ONE-TO-MANY COMMUNICATION IN NETWORK-ON-CHIP

**Speakers:**
Ammar Karkar¹, Nizar Dahr¹, Ra’ed Al-Dujaily², Kenneth Tong³, Terrence Mak⁴ and Alex Yakovlev¹

¹School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, GB; ²General Systems Company, Baghdad - Iraq, IQ; ³Department of Electrical and Electronic Engineering, University College London, GB; ⁴Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, CN

**Abstract**
Network-on-chip (NoC) is a communication paradigm that has emerged to tackle different on-chip challenges and has satisfied different demands in terms of high performance and economical interconnect implementation. However, a purely metal-based NoC offers limited scalability under relentless technology scaling, especially for one-to-many (1-to-M) communication. To meet the scalability demand, this paper proposes a new hybrid architecture empowered by both metal interconnects and Zenneck surface wave interconnects (SWI). This architecture, in conjunction with newly proposed routing and global arbitration schemes, avoids overloading the NoC and alleviates traffic hotspots compared to the trend of handling 1-to-M traffic as unicast. This work addresses the system-level challenges for intra-chip multicasting. Evaluation results, based on a cycle-accurate simulation and hardware description, demonstrate the effectiveness of the proposed architecture, with a power reduction of 2X to 12X and an average delay reduction of 25X or more compared to a regular NoC. These results are achieved with negligible hardware overheads.

### IP5.02 FAILURE ANALYSIS OF A NETWORK-ON-CHIP FOR REAL-TIME MIXED-CRITICAL SYSTEMS

**Speakers:**
Eberle A Rambo¹, Alexander Tschien¹, Jonas Diemer¹, Leonie Ahrendts¹ and Rolf Ernst²

¹Technische Universität Braunschweig, DE; ²TU Braunschweig, DE

**Abstract**
Multi- and many-core architectures using Networks-on-Chip (NoC) are being explored for use in real-time safety-critical applications for their performance and efficiency. Such systems must provide isolation between tasks that may present distinct criticality levels. The NoC is critical to maintain the isolation property as it is a heavily used shared resource. To meet safety-standard requirements, such architectures require a systematic evaluation of the effects of all possible failures such as in the form of a Failure Mode and Effects Analysis (FMEA). We present the results of a detailed system-level analysis of a typical real-time mixed-critical network-on-chip architecture. This comprises an FMEA and error effects classification regarding duration and isolation violation.

### IP5-3 COOLIP: SIMPLE YET EFFECTIVE JOB ALLOCATION FOR DISTRIBUTED THERMALLY-THROTTLED PROCESSORS

**Speakers:**
Pratyush Kumar, Hoesook Yang, Iuliana Bacivarov and Lothar Thiele, ETH Zurich, CH

**Abstract**
Thermal constraints limit the time for which a processor can run at high frequency. Such thermal-throttling complicates the computation of response times of jobs. For multiple processors, a key decision is where to allocate the next job. For distributed thermally-throttled processors, we present COOLIP with a simple allocation policy: a job is allocated to the earliest available processor, and if there are several available simultaneously, to the coolest one. For Poisson distribution of inter-arrival times and Gaussian distribution of execution demand of jobs, COOLIP matches the 95-percentile response time of Earliest Finish-Time (EFT) policy which minimizes response time with full knowledge of execution demand of unfinished jobs and thermal models of processors. We argue that COOLIP performs well because it directs the processors into states such that a defined sufficient condition of optimality holds.
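
The allocation rule stated in the abstract (earliest available processor first, coolest one on a tie) can be illustrated with a minimal Python sketch. The `Processor` record, its field names, and the sample values are hypothetical and chosen only for illustration; how the paper actually determines availability and temperature under thermal throttling is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    # All fields are illustrative assumptions, not taken from the paper.
    name: str
    free_at: float      # time at which the processor becomes available
    temperature: float  # current temperature estimate


def pick_processor(processors):
    """Pick the earliest-available processor; break ties by lowest temperature."""
    return min(processors, key=lambda p: (p.free_at, p.temperature))


procs = [
    Processor("P0", free_at=5.0, temperature=60.0),
    Processor("P1", free_at=3.0, temperature=72.0),
    Processor("P2", free_at=3.0, temperature=55.0),
]
chosen = pick_processor(procs)
print(chosen.name)  # P2: ties with P1 on availability, but is cooler
```

Sorting on the tuple `(free_at, temperature)` keeps the two criteria in strict priority order, so temperature only matters when availability times are equal.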

### IP5-4 ENERGY OPTIMIZATION IN 3D MPSoCs WITH WIDE-I/O DRAM USING TEMPERATURE VARIATION AWARE BANK-WISE REFRESH

**Speakers:**
MohammadSadegh Sadri, Matthias Jung, Christian Weis, Norbert Wehn and Luca Benini

**Abstract**
Heterogeneous 3D integrated systems with Wide-I/O DRAMs are a promising solution to squeeze more functionality and storage bits into an ever decreasing volume. Unfortunately, with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. We improve DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. In order to provide proof of our concepts we develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. On this platform we run the Android OS with real-world benchmarks to quantify the advantages of our ideas. We show improvements of 16% in DRAM refresh power due to temperature variation aware bank-wise refresh. Furthermore, two solutions are investigated to speed up system simulations: (1) adaptive tuning of sampling intervals based on the estimated chip thermal profile, which results in speedups of 2X; (2) hardware acceleration of thermal simulations using the Maxeler engine, which shows possible speedups of 12X.

### IP5-5 EFFICIENT SIMULATION AND MODELLING OF NON-RECTANGULAR NOC TOPOLOGIES

**Speakers:**
Ji Q and Mark Zwolinski, University of Southampton, GB

**Abstract**
With increasing chip complexity, Networks-on-Chips (NoCs) are becoming a central platform for future on-chip communications. Many regular NoC architectures have been proposed to eliminate the communication bottlenecks of traditional bus-based networks. Non-rectangular and irregular architectures have also been proposed to increase performance. However, the complexity of designing custom non-rectangular networks leads to a rapid increase in design and verification times. To alleviate the conflict between performance and efficiency, this paper proposes a novel method that efficiently constructs virtual non-rectangular topologies on a mesh network by using time-regulated models to emulate irregular patterns. Data routings on virtual hexagonal and two irregular geometries validate the proposed method. An MPEG-4 decoder is used to exemplify the proposed method for media applications. Results analysis shows the virtual topologies emulated by the proposed method can provide precise timing and energy performance.

### IP5-6 MOVING FROM CO-SIMULATION TO SIMULATION FOR EFFECTIVE SMART SYSTEMS DESIGN

**Speakers:**
Franco Fummi, Michele Lora, Francesco Stefanelli, Dimitrios Trachanis, Jan Vanhese and Sara Vinco

**Abstract**
Design of smart systems needs to cover a wide variety of domains, ranging from analogue to digital, with power devices, micro-sensors and actuators, up to MEMS. This high level of heterogeneity makes design a very challenging task, as each domain is supported by specific languages, modeling formalisms and simulation frameworks. A further major issue is simulation, which heavily impacts the design and verification loop and is very hard to set up in such a heterogeneous context. On the other hand, achieving efficient simulation would make smart system design feasible with respect to budget constraints. This work provides a formalization of the typical abstraction levels and design domains of a smart system. This taxonomy makes it possible to identify a precise role in the design flow that is based on the end simulation scenario. Moreover, a methodology is proposed to move from the co-simulated heterogeneity to a simulatable homogeneous representation in C++ of the entire smart system. The impact of heterogeneous or homogeneous models of computation is also examined. Experimental results prove the effectiveness of the proposed C++ generation for reaching high-speed simulation.

### IP5-7 AUTOMATING DATA REUSE IN HIGH-LEVEL SYNTHESIS

**Speakers:**
Wim Meeus and Dirk Stroobandt

**Abstract**
Current High-Level Synthesis (HLS) tools perform excellently for the synthesis of computation kernels, but they often don’t optimize memory bandwidth. As memory access is a bottleneck in many algorithms, the performance of the generated circuit will benefit substantially from memory access optimization. In this paper we present an automated method and a toolchain to detect reuse of array data in loop nests and to build hardware that exploits this data reuse. This saves memory bandwidth and improves circuit performance. We make use of the polyhedral representation of the source program, which makes our method computationally easy. Our software complements the existing HLS flows. Starting from a loop nest written in C, our tool generates a reuse buffer and a loop controller, and preprocesses the loop body for synthesis with an existing HLS tool. Our automated tool produces designs from unoptimized source code that are as efficient as those generated by a commercial HLS tool from manually-optimized source code.
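
To illustrate what exploiting array data reuse in a loop nest means, the following Python sketch compares a three-tap sliding-sum loop that re-reads its inputs with a version that keeps them in a small reuse buffer. This is a hand-written, software-only illustration under assumed names (`stencil_naive`, `stencil_reuse`); it does not reproduce the polyhedral analysis or the generated hardware described in the abstract.

```python
from collections import deque


def stencil_naive(a):
    """Three-tap sum; every iteration reads three elements from memory."""
    reads = 0
    out = []
    for i in range(len(a) - 2):
        out.append(a[i] + a[i + 1] + a[i + 2])
        reads += 3
    return out, reads


def stencil_reuse(a):
    """Same computation, but a 3-entry reuse buffer keeps recent elements."""
    reads = 0
    out = []
    window = deque(maxlen=3)  # reuse buffer holding the last three elements
    for value in a:           # each array element is read from memory once
        window.append(value)
        reads += 1
        if len(window) == 3:
            out.append(sum(window))
    return out, reads


data = list(range(10))
naive_out, naive_reads = stencil_naive(data)
reuse_out, reuse_reads = stencil_reuse(data)
print(naive_reads, reuse_reads)  # 24 10
```

Both versions produce identical results; the buffered one trades a few registers for fewer memory accesses, which is the bandwidth saving the abstract refers to.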

### IP5-8 A UNIVERSAL SYMMETRY DETECTION ALGORITHM

**Speaker:**
Peter Maurer, Dept. of Computer Sci., Baylor University, US

**Abstract**
Research on symmetry detection focuses on identifying and detecting new types of symmetry. We present an algorithm that is capable of detecting any type of permutation-based symmetry, including many types for which there are no existing algorithms. General symmetry detection is library-based, but symmetries that can be parameterized (i.e., total, partial, rotational, and dihedral symmetry) can be detected without using libraries. In many cases the algorithm is faster than existing techniques. Furthermore, it is simpler than most existing techniques, and can easily be incorporated into existing software.

### IP5-9 OPTIMIZATION OF DESIGN COMPLEXITY IN TIME-MULTIPLEXED CONSTANT MULTIPLICATIONS

**Speakers:**
Levent Aksoy, Paulo Flores and Jose Monteiro

**Abstract**
The multiplication of constants by a data input is an essential operation in digital signal processing (DSP) systems. For applications requiring a large number of constant multiplications under stringent hardware constraints, it is generally realized under a folded architecture, where a single constant selected from a set of multiple constants is multiplied by the data input at each time, called time-multiplexed constant multiplication (TMCM). This paper addresses the problem of optimizing the complexity of a TMCM design and introduces an algorithm that finds the least complex TMCM design by sharing the logic operators, i.e., adders, subtractors, adders/subtractors, and multiplexors (MUXes). It includes efficient search methods, yielding better results than existing TMCM algorithms.
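
As a small illustration of the TMCM idea, the Python sketch below multiplies an input by one of two constants using shifts, a selector that plays the role of the MUXes, and a single shared adder/subtractor. The constant set {5, 7} and the function name `tmcm` are assumptions made for this example; the sketch does not implement the optimization algorithm of the paper.

```python
def tmcm(x: int, sel: int) -> int:
    """Multiply x by 5 (sel=0) or 7 (sel=1) with one shared adder/subtractor.

    5*x = (x << 2) + x
    7*x = (x << 3) - x
    """
    shifted = x << (3 if sel else 2)            # MUX selects the shift amount
    return shifted - x if sel else shifted + x  # shared adder/subtractor


print(tmcm(3, 0), tmcm(3, 1))  # 15 21
```

Sharing the adder/subtractor across both constants is what keeps the folded design small: only the shift amount and the add/subtract mode change from one time step to the next.
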

### IP5-10 HARDWARE PRIMITIVES FOR THE SYNTHESIS OF MULTITHREADED ELASTIC SYSTEMS

**Speakers:**
Giorgos Dimitrakopoulos¹, Seitanidis Ioannis², Anastasios Psarras¹, Konstantinos Tsouris¹, Pavlos Matthaikos³ and Jordi Cortadella⁴

¹Democritus University of Thrace, GR; ²Democritus University of Thrace, GR; ³Mentor Graphics, FR; ⁴Universitat Politecnica de Catalunya, ES

**Abstract**
Elastic systems operate in a dataflow-like mode using a distributed scalable control and tolerating variable latency computations. At the same time, multithreading increases the utilization of processing units and hides the latency of each operation by time-multiplexing operations of different threads in the datapath. This paper proposes a model to unify multithreading and elasticity. A new multithreaded elastic control protocol is introduced, supported by low-cost elastic buffers that minimize the storage requirements without sacrificing performance. To enable the synthesis of multithreaded elastic architectures, new hardware primitives are proposed and utilized in two circuit examples to prove the applicability of the proposed approach.

### IP5-11 DCM: AN IP FOR THE AUTONOMOUS CONTROL OF OPTICAL AND ELECTRICAL RECONFIGURABLE NOCs

**Speakers:**
Wolfgang Buter¹, Christof Osewold², Daniel Gregorek³ and Alberto Garcia-Ortiz⁴

¹University of Bremen, DE; ²ITEM (U.Bremen), DE

**Abstract**
The increasing requirements for bandwidth and quality-of-service motivate the use of parallel interconnect architectures with several degrees of reconfiguration. This paper presents an IP, called Distributed Channel Management (DCM), to extend existing packet-switched NoCs with a reconfigurable point-to-point network seamlessly, i.e., without the need for any modification on the routers. The configuration of the reconfigurable network takes place dynamically and autonomously, so that the topology can be changed at run time. Furthermore, the architecture is scalable due to the autonomous decentralized administration of the links. The paper reports a thorough experimental analysis of the overhead of the approach at the gate level that considers different network parameters such as flit size and timing constraints.

### IP5-12 MINIMALLY BUFFERED SINGLE-CYCLE DEFLECTION ROUTER

**Speakers:**
Gnaneswara Rao Jonna¹, John Jose², Rachana Radhakrishnan¹ and Madhu Muttyam¹

¹Indian Institute of Technology, Madras., IN; ²Rajagiri School of Engineering & Technology, Kochhi., IN

**Abstract**
With the drift from computation-centric designs to communication-centric designs in the Chip Multi Processor (CMP) era, the interconnect fabric is gaining more importance. An efficient NoC in terms of power, area and average flit latency has a huge impact on the overall performance of a CMP. In the current work, we propose MinBSD, a minimally buffered, single-cycle deflection router. It incorporates different operations (Injection, Ejection, Preemption, Re-injection) in a single module to handle the traffic effectively and ensures smooth flow of flits through the router pipeline. It performs overlapped execution of independent operations. These factors not only enable MinBSD to operate in a single cycle but also reduce the critical path latency, resulting in a faster interconnect network. Experimental results show that MinBSD reduces the average flit latency on real workloads, as well as die area and power consumption, when compared to existing state-of-the-art minimally buffered deflection routers.
PS-13 FUNCTIONAL TEST GENERATION GUIDED BY STEADY-STATE PROBABILITIES OF ABSTRACT DESIGN
Speakers: Jian Wang1, Huawei Li2, Tao Lv2, Tiancheng Wang2 and Xiaowei Li2
1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
This paper presents a novel method for functional test generation that aims at exploring the control state space of the design. The steady-state probabilities (SPs) of the abstract design's control FSM are used to guide test generation. The SPs of the states reflect how hard the states are to reach, and the hard-to-reach states are assigned high priority to be exercised. Experimental results show that our method performs better in test generation than constrained random simulation, and demonstrate that SPs provide good guidance on traversing hard-to-reach states of the design under validation.
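The idea of ranking control states by steady-state probability can be illustrated with a minimal sketch. It assumes the abstract control FSM is available as a Markov chain with known transition probabilities; the three-state FSM, the function names `steady_state` and `hardest_first`, and the use of power iteration are illustrative assumptions, not the authors' method.

```python
def steady_state(transitions, iterations=1000, tol=1e-12):
    """Approximate the stationary distribution of a Markov chain by power iteration.

    transitions[s] maps each successor state of s to its transition probability.
    """
    states = sorted(transitions)
    dist = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        nxt = {s: 0.0 for s in states}
        for s, p in dist.items():
            for t, q in transitions[s].items():
                nxt[t] += p * q
        delta = max(abs(nxt[s] - dist[s]) for s in states)
        dist = nxt
        if delta < tol:
            break
    return dist


def hardest_first(dist):
    """Order states so that the least likely (hardest to reach) come first."""
    return sorted(dist, key=lambda s: dist[s])


# Hypothetical 3-state control FSM with assumed transition probabilities.
fsm = {
    "IDLE": {"IDLE": 0.5, "BUSY": 0.5},
    "BUSY": {"IDLE": 0.9, "ERROR": 0.1},
    "ERROR": {"IDLE": 1.0},
}
sp = steady_state(fsm)
priority = hardest_first(sp)
```

In this toy chain the `ERROR` state has by far the smallest steady-state probability, so a generator following `priority` would target it first.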
PS-14 AUTOMATED SYSTEM TESTING USING DYNAMIC AND RESOURCE RESTRICTED CLIENTS
Speakers: Mirko Caspar, Mirko Lippmann and Wolfram Hardt, Technische Universität Chemnitz, DE
Abstract
Testing on system level using a static and homogeneous architecture of clients is common practice. This paper introduces a new approach that uses a heterogeneous and dynamic set of resource-restricted test clients for automated testing. Due to changing resources and availability of the clients, the test case distribution needs to be recalculated dynamically during the test execution. All necessary conditions and parameters are represented by a formal model. It is shown that the algorithmic problem of DYNAMIC TESTPARTITIONING can be solved in polynomial time by a heuristic recursive algorithm. A testbench architecture is introduced, and simulation shows that the testbench can execute the test requirements within a small variation using several hundred clients. The system can react dynamically to changing resources and availability of the test clients within several seconds. The approach is generic and can be adapted to a large number of systems.
PS-15 RELIABILITY-AWARE MAPPING OPTIMIZATION OF MULTI-CORE SYSTEMS WITH MIXED-CRITICALITY
Speakers: Shin-Haeng Kang1, Hoeseok Yang2, Sungchan Kim3, Iuliana Bacivarov2, Soontha Ha4 and Lothar Thiele4
1Seoul National University, KR; 2ETH Zurich, CH; 3Chonbuk National University, KR; 4Swiss Federal Institute of Technology Zurich, CH
Abstract
This paper presents a novel mapping optimization technique for mixed-criticality multi-core systems with different reliability requirements. To this end, we derived a quantitative reliability metric and presented a scheduling analysis that certifies given mixed-criticality constraints. Our framework is capable of investigating the reconfiguration space of the system.
PS-16 (Best Paper Award Candidate)
FROM SIMULINK TO NOC-BASED MPSoC ON FPGA
Speakers: Francesco Robino and Johnny Öberg, KTH Royal Institute of Technology, SE
Abstract
Network-on-chip (NoC) based multi-processor systems are promising candidates for future embedded system platforms. However, because of their complexity, new high level modeling techniques are needed to design, simulate and synthesize embedded systems targeting NoC-based MPSoC. Simulink is a popular modeling environment suitable for modeling at system level. However, there is no clear standard to synthesize Simulink models into SW and HW towards a NoC-based MPSoC implementation. In addition, many of the proposed solutions require large overhead in terms of SW components and memory requirements, resulting in complex and customized multi-processor platforms. In this paper we present a novel design flow to synthesize Simulink models onto a NoC-based MPSoC running on low-cost FPGAs. Our design flow constrains the MPSoC and the Simulink model to share a common semantics domain. This reduces the need for resource-consuming SW components and thereby the memory requirements on the platform. At the same time, the performance (throughput) of dataflow applications can increase when the number of processors of the target platform is increased. This is shown through a case study on FPGA.
IPS-17  (Best Paper Award Candidate)  THERMAL ANALYSIS AND MODEL IDENTIFICATION TECHNIQUES FOR A LOGIC + wideIO STACKED DRAM TEST CHIP
Speakers: Francesco Beneventi1, Andrea Bartolini1, Pascal Vivet2, Denis Dubot3 and Luca Benini1
1DEI - University of Bologna, IT; 2CEA-Leti, Grenoble, FR
Abstract
High temperature is one of the limiting factors and major concerns in 3D-chip integration. In this paper we use a 3D test chip (WIDEIO DRAM on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects. We correlated real temperature measurements with the power dissipated by the heaters using model learning techniques. The resulting compact thermal model is able to predict temperatures at chip locations far from the temperature sensors and to infer the power dissipation at any location of the chip. Results are verified by means of an off-sample validation technique and show a high accuracy of the compact thermal model when compared with silicon measurements.

IPS-18  ADAPTIVE POWER ALLOCATION FOR MANY-CORE SYSTEMS INSPIRED FROM MULTIAGENT AUCTION MODEL
Speakers: Xiaohang Wang1, Baixin Zhao1, Terrence Mak2, Mei Yang3, Yingtao Jiang3, Masoud Daneshfarl4 and Maurizio Palesi5
1Guangzhou Institute of Advanced Technology, CN; 2The Chinese University of Hong Kong, CN; 3University of Nevada, Las Vegas, US; 4University of Turku, FI; 5University of Enna, Kore, IT
Abstract
Scaling of future many-core chips is hindered by the challenge imposed by ever-escalating power consumption. At its worst, an increasing fraction of the chips will have to be shut down, as power supply is inadequate to simultaneously switch all the transistors. This so-called dark silicon problem brings up a critical issue regarding how to achieve the maximum performance with a given limited power budget. This issue is further complicated by two facts. First, high variation in power budget calls for wide range power control capability, whereas most current frequency/voltage scaling techniques cannot effectively adjust power over such a wide range. Second, as the applications’ behavior becomes more complicated, there is a pressing need for scalability and global coordination, rendering heuristic-based centralized or fully distributed control schemes inefficient. To address the aforementioned problems, in this paper, a power allocation method employing multiagent auction models is proposed, referred to as Hierarchal MultiAgent based Power allocation (HiMAP). Tiles act as consumers that bid for power budget and the whole process is modeled by a combinatorial auction, whereas HiMAP finds the Walrasian equilibria. Experimental results have confirmed that HiMAP can reduce the execution time by as much as 45% compared to three competing methods. The runtime overhead and cost of HiMAP are also small, which makes it suitable for adaptive power allocation in many-core systems.

IPS-19  UNIFIED, ULTRA COMPACT, QUADRATIC POWER PROXIES FOR MULTI-CORE PROCESSORS
Speakers: Muhammad Yasin1, Ibrahim (Abie) Elfadel2 and Anas Shahrou2
1New York University - Abu Dhabi, AE; 2Masdar Institute of Science and Technology, AE
Abstract
Per-core power proxies for multi-core processors are known to use several dozen hardware activity monitors to achieve a 2% accuracy on core power estimation. These activity monitors are typically not accessible to the user, and even if they were accessible, there would be a significant overhead in using them at the kernel or OS level for power monitoring or control. Furthermore, when scaled up to hundreds of cores per chip, such power proxies become a computational bottleneck for power management operations such as chip power capping. In this paper, we show that a 4% accuracy or better for per-core power estimation can be achieved using an ultra compact power proxy based on a hybrid set of only four user-accessible parameters, namely core frequency, core temperature, instruction-per-cycle and active-state residency. Our proxy is nonlinear, valid across all P and C states, and is based on a randomized power data collection strategy that aims at exercising all the P and C levels of each core. We illustrate the accuracy of the model using the full suite of the SPEC CPU 2006 benchmarks on a 12-core processor.

IPS-20  3D FPGA USING HIGH-DENSITY INTERCONNECT MONOLITHIC INTEGRATION
Speakers: Ogun Turkylmaz1, Gerald Cibricaro2, Olivier Rozeau2, Perrine Batude2 and Fabien Clermidy3
1CEA-LETI, Minatec Campus, FR; 2CEA, FR; 3CEA-LETI, FR
Abstract
A new 3D technology called “Monolithic Integration” offers very dense 3D interconnect capabilities. In this paper, we propose a 3D FPGA architecture with a logic-on-memory approach based on this technology. The routing and computation blocks are split into two layers where the logic is placed on the top and memory on the bottom. Using extracted values from layout in 14nm FDSOI technology, typical benchmark circuits are evaluated in the VPR5 toolflow. The results show an area reduction of 55% compared to the 2D FPGA. More importantly, due to the lowered routing congestion, the EDP of the 3D FPGA is improved by 47%.

IPS-21  JOINT COMMUNICATION SCHEDULING AND INTERCONNECT SYNTHESIS FOR FPGA-BASED MANY-CORE SYSTEMS
Speakers: Alessandro Cilardo, Edoardo Fusella, Luca Gallo and Antonino Mazzeo, University of Naples Federico II, IT
Abstract
This work proposes an automated methodology for optimizing FPGA-based many-core interconnect architectures. Based on the application communication requirements, the methodology concurrently defines the structure of the interconnect and the communication task scheduling, taking into account possible dependencies between tasks under given area constraints. The resulting architecture improves the level of communication parallelism that can be exploited while keeping area costs low. The paper thoroughly describes the proposed approach and discusses a few case-studies showing the impact of the proposed technique.

IPS-22  A NOVEL EMBEDDED SYSTEM FOR VISION TRACKING
Speakers: Antonis Nikitakis1, Theofilos Pagonas1 and Ioannis Papaefstathiou2
1Technical University of Crete, Department of Electronic and Computer Engineering, Chania, Crete, GR73100, Greece, GR; 2Synelixis Solutions Ltd, Farmakidou 10, Chalkida, GR34100, Greece, GR
Abstract
One of the most important challenges in the field of Computer Vision is the implementation of low-power embedded systems that will execute very accurate, yet real-time, algorithms. In the visual tracking sector one of the most promising approaches is the recently introduced OpenTLD algorithm, which uses a random forest classification method. While it is very robust, it cannot be efficiently parallelized in its native form, as its memory access pattern has certain characteristics that make it hard to take advantage of conventional memory hierarchies. In this paper, we present a novel embedded system implementing this algorithm. We accelerate the bottleneck of the algorithm by designing and implementing a high bandwidth distributed memory sub-system which is independent of the various software parameters. We demonstrate the applicability and efficiency of this novel approach by implementing our scheme in a modern FPGA.

12.1 SPECIAL DAY Hot Topic: The future of interfacing to the natural world

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Saal 1

Organisers:
Ian O’Connor, Lyon Institute of Nanotechnology, FR
Thomas Mikolajick, NamLab gGmbH, DE

Chair:
Michael Huebner, Ruhr Universitaet Bochum, DE

Co-Chair:
Ian O’Connor, Lyon Institute of Nanotechnology, FR

Challenges for acquiring and processing data from the real world include the development of interfaces capable of extracting relevant information from massive sensor networks or from living organisms, sifting through the wealth of data to arrive systematically at a meaningful conclusion, and building hardware platforms suited to carry out these operations in an energy-efficient way. The first paper in this session looks at the necessarily complex processing of chemical information with hardware components that are capable of responding to various chemical conditions. Interfaces to living organisms are examined in the second paper, which discusses challenges and approaches for efficient detection of disease. In the third paper, novel hardware devices and architectures are explored for use in energy-efficient video analysis applications such as movement detection and face recognition. The fourth paper discusses handling of complex data with large-scale GPU-based recurrent networks, exploiting specific features of the data to improve energy efficiency.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>16:00</td>
<td>12.1.1</td>
<td>INTEGRATED CIRCUITS PROCESSING CHEMICAL INFORMATION: PROSPECTS AND CHALLENGES</td>
<td>Andreas Richter, Axel Voigt, René Schäßbrey, Stephan Henker and Marcus Völpe, Technische Universität Dresden, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>The unbelievable properties of our information processing capabilities regarding the processing of big data, resilience, and energy efficiency are inspiration sources for the optimization and the rethinking of the principles of electronic information processing. Here, we present an approach of integrated circuits intended to solve chemical problems by active processing of chemical information.</td>
</tr>
<tr>
<td>16:25</td>
<td>12.1.2</td>
<td>INTERFACING TO LIVING CELLS</td>
<td>Rudy Lauwereins, IMEC, BE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>Recent advances in More than Moore technology enable close observation of and even direct interfacing to living cells. This paper illustrates this through three use cases. In the first use case, the type or quality of billions of cells is quickly inspected in a fluidic medium. Secondly, the effect of potential drugs is monitored in neural cell cultures. In the third use case, neural brain activity is recorded in vivo using implantable electrodes to understand how the brain functions.</td>
</tr>
<tr>
<td>16:45</td>
<td>12.1.3</td>
<td>VIDEO ANALYTICS USING BEYOND CMOS DEVICES</td>
<td>Vijaykrishnan Narayanan¹, Gert Cauwenberghs², Donald Chiarulli³, Suman Datta⁴, Steve Levitan⁵ and Philip Wong⁵</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹Penn State University, US; ²University of California at San Diego, US; ³University of Pittsburgh, US; ⁴The Pennsylvania State University, US; ⁵Stanford University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>The human vision system understands and interprets complex scenes for a variety of visual tasks in real-time while consuming less than 20 Watts of power. The holistic design of artificial vision systems that will approach and eventually exceed the capabilities of human vision systems is a grand challenge. The design of such a system needs advances in multiple disciplines. This paper focuses on advances needed in the computational fabric and provides an overview of a new-genre of architectures inspired by advances in both the understanding of the visual cortex and the emergence of devices with new mechanisms for state computations.</td>
</tr>
<tr>
<td>17:10</td>
<td>12.1.4</td>
<td>ENERGY EFFICIENT NEURAL NETWORKS FOR BIG DATA ANALYTICS</td>
<td>Wang Yu, Boxun Li, Rong Luo, Yiran Chen, Ningyi Xu and Huazhong Yang, Tsinghua University, CN</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>The world is experiencing a data revolution to discover knowledge in big data. Sequential data, such as text, speech and video, are the primary sources of big data. The recurrent network is a powerful model to process sequential data because of its ability to capture the long-term latent dependencies and features of the data. However, the difficulty of training a recurrent network, especially the huge requirement of computing power, has kept the recurrent network from becoming a mainstream tool in mining big data. In this paper, we propose an efficient GPU implementation of large-scale recurrent network training. The proposed GPU implementation is based on a fast approximation technique of activation functions and a fine-grained two-stage pipeline architecture. We also propose a parallel realization of billion cells through a developed library. The experimental results demonstrate that the proposed GPU implementation is able to realize at least 6x speedup on a single GTX580 GPU compared with the CPU implementation on an Intel Xeon E5-2690 (16 cores) with MKL library. Meanwhile, the trained large-scale recurrent network can achieve the state-of-the-art performance on the Microsoft Research Sentence Completion Challenge, a challenge set for advancing language modeling.</td>
</tr>
</tbody>
</table>

17:30 End of session

12.2 Hot topic: How Secure are PUFs Really? On the Reach and Limits of Recent PUF Attacks

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Konferenz 6
Organiser: Ulrich Rührmair, TU München, DE
Chair: Ulf Schlichtmann, TU München, DE

PUFs are an emerging and promising security primitive. However, some strong attacks on their core security features have been reported recently, for example on their unclonability. We discuss the reach, but also the limits of these attacks, aiming at a well-balanced treatment, and also evaluate the future perspectives of the field.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>16:00</td>
<td>12.2.1</td>
<td>PUFs at a Glance</td>
<td>Ulrich Rührmair¹ and Daniel E. Holcomb²</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹TU München, DE; ²University of Michigan, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>Physical Unclonable Functions (PUFs) are a new, hardware-based security primitive, which was introduced about a decade ago. In this paper, we provide a brief and easily accessible overview of the area. We describe the typical security features, implementations, attacks, protocol uses, and applications of PUFs. Special focus is placed on the two most prominent PUF types, so-called &quot;Weak PUFs&quot; and &quot;Strong PUFs&quot;, and their mutual differences.</td>
</tr>
<tr>
<td>16:15</td>
<td>12.2.2</td>
<td>PUF MODELING ATTACKS: AN INTRODUCTION AND OVERVIEW</td>
<td>Ulrich Rührmair¹ and Jan Sölter²</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹TU München, DE; ²Freie Universität Berlin, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Machine learning (ML) based modeling attacks are the currently most relevant and effective attack form for so-called Strong Physical Unclonable Functions (Strong PUFs). We provide an overview of this method in this paper: We discuss the basic conditions under which it is applicable; the ML algorithms that have been used in this context; the latest and most advanced results on simulated and silicon data; the right interpretation of existing results; and possible future research directions.</td>
</tr>
<tr>
<td>16:30</td>
<td>12.2.3</td>
<td>HYBRID SIDE-CHANNEL / MACHINE-LEARNING ATTACKS ON PUFs: A NEW THREAT?</td>
<td>Theocharides Theocharis, University of Cyprus, CY</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Machine Learning (ML) is a well-studied strategy in modeling Physical Unclonable Functions (PUFs) but reaches its limits when applied to instances of high complexity. To address this issue, side-channel attacks are combined with ML to reduce the computational workload of ML modeling attacks and make them more widely applicable. In this work, we present the currently known hybrid side-channel attacks on PUFs. A taxonomy is proposed based on the characteristics of different side-channel attacks. The practical reach of some published side-channel attacks is discussed. Both challenges and opportunities for PUF attackers are introduced. Countermeasures against certain side-channel attacks are also analyzed. To better understand the side-channel attacks on PUFs, three different methodologies of implementing side-channel attacks are compared. At the end of this paper, we bring forward some open problems for this research area.</td>
</tr>
<tr>
<td>16:45</td>
<td>12.2.4</td>
<td>PHYSICAL VULNERABILITIES OF PHYSICALLY UNCLONABLE FUNCTIONS</td>
<td>Marten van Dijk¹ and Ulrich Rührmair²</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹University of Connecticut, US; ²TU München, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>In recent years one of the most popular areas of research in hardware security has been Physically Unclonable Functions (PUFs). PUFs provide primitives for implementing tamper detection, encryption and device fingerprinting. One particularly common application is replacing Non-volatile Memory (NVM) as key storage in embedded devices like smart cards and secure microcontrollers. Though a wide array of PUFs has been demonstrated in the academic literature, vendors have only begun to roll out PUFs in their end-user products. Moreover, the improvement to overall system security provided by PUFs is still the subject of much debate. This work reviews the state of the art of PUFs in general, and as a replacement for key storage in particular. We also review techniques and methodologies which make the physical response characterization and physical/digital cloning of PUFs possible.</td>
</tr>
<tr>
<td>17:00</td>
<td>12.2.5</td>
<td>PROTOCOL ATTACKS ON ADVANCED PUF PROTOCOLS AND COUNTERMEASURES</td>
<td>Xiaolin Xu and Wayne Burleson, UMass, Amherst, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>In recent years, PUF-based schemes have not only been suggested for the basic security tasks of tamper sensitive key storage or system identification, but also for more complex cryptographic protocols like oblivious transfer (OT), bit commitment (BC), or key exchange (KE). These more complex protocols are secure against adversaries in the stand-alone, good PUF model. In this survey, a shortened version of [17], we explain the stronger bad PUF model and PUF re-use model. We argue why these stronger attack models are realistic, and that existing protocols, if used in practice, will need to face these. One consequence is that the design of advanced cryptographic PUF protocols needs to be strongly reconsidered. It suggests that Strong PUFs require additional hardware properties in order to be broadly usable in such protocols: Firstly, they should ideally be erasable, meaning that single PUF-responses can be erased without affecting other responses. If the area efficient implementation of this feature turns out to be difficult, new forms of Controlled PUFs [3] (such as Logically Erasable and Logically Reconfigurable PUFs [6]) may suffice in certain applications. Secondly, PUFs should be certifiable, meaning that one can verify that the PUF has been produced faithfully and has not been manipulated in any way afterwards. The combined implementation of these features represents a pressing and challenging problem for the PUF hardware community.</td>
</tr>
<tr>
<td>17:15</td>
<td>12.2.6</td>
<td>QUO VADIS, PUF? TRENDS AND CHALLENGES OF EMERGING PHYSICAL-DISORDER BASED SECURITY</td>
<td>Ulrich Rührmair¹ and Jan Sölter²</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹TU München, DE; ²Freie Universität Berlin, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>The Physical Unclonable Function (PUF) has emerged as a popular and widely studied security primitive based on the randomness of the underlying physical medium. To date, most of the research emphasis has been placed on finding new ways to measure randomness, hardware realization and analysis of a few initially proposed structures, and conventional secret-key based protocols. In this work, we offer our subjective analysis of the emerging and future trends in this area that aim to change the scope, widen the application domain, and make lasting impact. We emphasize the development of new PUF-based primitives and paradigms, robust protocols, public-key protocols, digital PUFs, new technologies, implementation, metrics and tests for evaluation/validation, as well as relevant attacks and countermeasures.</td>
</tr>
</tbody>
</table>

12.3 Multimedia Systems

**Date:** Thursday 27 March 2014

**Time:** 16:00 - 17:30

**Location / Room:** Konferenz 1

**Chair:** Theocharides Theocharis, University of Cyprus, CY

**Co-Chair:** Cristiana Bolchini, Politecnico di Milano, IT

The session presents designs for energy efficient and flexible implementations of advanced video coders or image acquisition/processing systems.

12.4 Physical Aspects

**Date:** Thursday 27 March 2014  
**Time:** 16:00 - 17:30  
**Location / Room:** Konferenz 2

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>16:00</td>
<td>12.4.1</td>
<td>OPTIMIZATION OF STANDARD CELL BASED DETAILED PLACEMENT FOR 16 NM FINFET PROCESS</td>
<td>Yuelin Du and Martin D. F. Wong, University of Illinois at Urbana-Champaign, US</td>
</tr>
</tbody>
</table>

**Abstract**  
FinFET transistors have great advantages over traditional planar MOSFET transistors in high performance and low power applications. Major foundries are adopting the FinFET technology for CMOS semiconductor device fabrication in the 16 nm technology node and beyond. Edge device degradation is among the major challenges for the FinFET process. To avoid such degradation, dummy gates are needed on device edges, and the dummy gates have to be tied to power rails in order not to introduce unconnected parasitic transistors. This requires that each dummy gate abut at least one source node after standard cell placement. If the drain nodes at two adjacent cell boundaries abut each other, additional source nodes must be inserted in between for dummy gate power tying, which costs more placement area. Usually there is some flexibility during detailed placement to horizontally flip the cells or switch the positions of adjacent cells, which has little impact on the global placement objectives, such as timing conditions and net congestion. This paper proposes a detailed placement optimization strategy for standard cell based designs. By flipping a subset of cells in a standard cell row and switching pairs of adjacent cells, the number of drain-to-drain abutments between adjacent cell boundaries can be optimally minimized, which saves additional source node insertion and reduces the length of the standard cell row. In addition, the proposed graph model can be easily modified to consider more complicated design rules. The experimental results show that the optimization of 100k cells is completed within 0.1 second, verifying the efficiency of the proposed algorithm.
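To make the flipping step concrete, the following is a minimal sketch of a simplified version of the problem: the cell order in a row is kept fixed and only horizontal flips are explored, using a two-state dynamic program. The cell encoding as `(left, right)` boundary node types and the function name `min_drain_abutments` are assumptions made for illustration; the paper's graph model, which also handles swapping adjacent cells, is not reproduced here.

```python
def min_drain_abutments(cells):
    """Return (count, flips) minimizing drain-to-drain abutments in one row.

    cells: list of (left, right) boundary node types, each "S" or "D".
    flips[i] is True when cell i is mirrored horizontally.
    Cell order is kept fixed; only flipping is explored.
    """
    if not cells:
        return 0, []

    def edges(cell, flipped):
        left, right = cell
        return (right, left) if flipped else (left, right)

    # cost[f]: fewest D-D abutments so far with the current cell in flip state f
    cost = [0, 0]
    back = []  # back[i][f]: best flip state of cell i given state f of cell i+1
    for prev, cur in zip(cells, cells[1:]):
        new_cost, choice = [0, 0], [0, 0]
        for f in (0, 1):
            cur_left = edges(cur, f)[0]
            options = []
            for g in (0, 1):
                prev_right = edges(prev, g)[1]
                clash = 1 if prev_right == "D" and cur_left == "D" else 0
                options.append((cost[g] + clash, g))
            new_cost[f], choice[f] = min(options)
        cost = new_cost
        back.append(choice)

    # Trace the chosen flip states back from the last cell to the first.
    state = 0 if cost[0] <= cost[1] else 1
    best = cost[state]
    flips = [bool(state)]
    for choice in reversed(back):
        state = choice[state]
        flips.append(bool(state))
    flips.reverse()
    return best, flips


# Hypothetical row of four cells described by their boundary node types.
row = [("S", "D"), ("D", "S"), ("D", "D"), ("S", "D")]
count, flips = min_drain_abutments(row)
```

Without flipping, the first two cells of `row` meet drain to drain; mirroring the first cell removes that abutment, so no extra source node would be needed in this toy row.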

**Speaker:**  
Yuelin Du and Martin D. F. Wong, University of Illinois at Urbana-Champaign, US

---

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>16:00</td>
<td>12.3.4</td>
<td>IMAGE PROGRESSIVE ACQUISITION FOR HARDWARE SYSTEMS</td>
<td>Jianxiong Liu, Christos Bouganis and Peter Y.K. Cheung, Imperial College London, GB</td>
</tr>
</tbody>
</table>

**Abstract**  
As the resolution of digital images increases, accessing raw image data from memory has become a major consideration during the design of image/video processing systems. This is because the bandwidth requirement and energy consumption of the image access process have increased. Inspired by the successful application of progressive image sampling techniques in many image processing tasks, this work proposes to apply a similar concept within hardware systems to efficiently trade image quality for reduced bandwidth requirement and lower energy consumption. Based on this idea, a hardware system is proposed that is placed between the memory subsystem and the processing core of the design. The proposed system alters the conventional memory access pattern to progressively and adaptively access pixels from a target memory external to the system. The sampled pixels are used to reconstruct an approximation to the ground truth, which is stored in an internal image buffer for further processing. The system is prototyped on FPGA and its performance evaluation shows that a saving of up to 85% of memory accessing time and 33%/45% of image acquisition time/energy is achieved on the benchmark image "lena" while maintaining a PSNR of about 30 dB.
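For readers unfamiliar with the quality figure quoted above, PSNR is conventionally computed from the mean squared error between the reference image and its approximation. The sketch below uses that standard definition; the flat pixel-list representation, the 8-bit peak value of 255, and the function name `psnr` are assumptions for illustration and say nothing about how the authors measured it.

```python
import math


def psnr(reference, approximation, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(approximation) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - a) ** 2 for r, a in zip(reference, approximation)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

With 8-bit pixels, a PSNR of 30 dB corresponds to a mean squared error of roughly 65, i.e. a typical per-pixel error of about 8 grey levels.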

**Speaker:**  
Jianxiong Liu, Christos Bouganis and Peter Y.K. Cheung, Imperial College London, GB

---

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>16:00</td>
<td>12.3.1</td>
<td>FLEXIBLE AND SCALABLE IMPLEMENTATION OF H.264/AVC ENCODER FOR MULTIPLE RESOLUTIONS USING ASIPS</td>
<td>Hong Chinh Doan, Haris Javadi and Sri Parameswaran, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, AU</td>
</tr>
</tbody>
</table>

**Abstract**  
Real-time encoding of video streams is computationally intensive and rarely carried out at high resolutions. In this paper, for the first time, we propose a platform for an H.264 encoder which is both flexible (allows software upgrades) and scalable (supports multiple resolutions), and supports high video quality (by using both intra prediction and inter prediction) and high throughput (by exploiting slice-level and pixel-level parallelism). Our platform uses multiple Application Specific Instruction Set Processors (ASIPs) with local and shared memories, and hardware accelerators (in the form of custom instructions). Our platform can be configured at design time to use a particular number of ASIPs (slices per video frame) for a specific video resolution. The MPSoC architecture is automatically generated by our platform and the H.264 software does not need any modification, which enables quick design space exploration. We implemented the proposed platform in a commercial design environment, and illustrated its utility by creating systems with up to 170 ASIPs supporting resolutions up to HD1080. We further show how power gating can be used in our platform to reduce energy consumption.

**Speaker:**  
Hong Chinh Doan, Haris Javadi and Sri Parameswaran, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, AU

---

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>16:00</td>
<td>12.3.2</td>
<td>A FLEXIBLE ASIC ARCHITECTURE FOR CONNECTED COMPONENTS LABELING IN EMBEDDED VISION APPLICATIONS</td>
<td>Juan Fernando Eusse¹, Rainer Leupers¹, Gerd Ascheid¹, Patrick Sudowe¹, Bastian Leibe¹ and Tamon Sadasue²</td>
</tr>
</tbody>
</table>

**Abstract**  
Experimental results show that the optimization of 100k cells is completed within 0.1 second, verifying the efficiency of the proposed algorithm.

12.4.2 SIGNATURE INDEXING OF DESIGN LAYOUTS FOR HOTSPOT DETECTION

**Speakers:**
Cristian Andrades¹, Andrea Rodríguez¹ and Charles Chiang²

¹Universidad de Concepcion, CL; ²Synopsys Inc., US

**Abstract**
This work presents a new signature for 2D spatial configurations that is useful for the optimization of a hotspot detection process. The signature is a string of numbers representing changes along the horizontal and vertical slices of a configuration, which serves as the key of an inverted index that groups layout windows with the same signature. The method extracts signatures from a compact specification of similar exact patterns with a fixed size. These signatures are used as search keys of the inverted index to retrieve candidate windows that can match the patterns. Experimental results show that this simple type of signature has 100% recall and, on average, over 85% precision in terms of the area effectively covered by the pattern and the retrieved area of the layout. In addition, the signature shows good discriminative quality, since around 99% of the extracted signatures each match a single pattern.

**Time:** 17:00

12.4.3 METAL LAYER PLANNING FOR SILICON INTERPOSERS WITH CONSIDERATION OF ROUTABILITY AND MANUFACTURING COST

**Speakers:**
Wen-Hao Liu, Tzu-Kai Chien and Ting-Chi Wang, National Tsing Hua University, TW

**Abstract**
A 2.5D IC provides a silicon interposer to integrate multiple dies into a package, which not only offers better performance than 2D ICs but also has lower manufacturing complexity than true 3D ICs. In an interposer, routing wires connect signals between dies or route signals from dies to the package substrate. The number of metal layers in an interposer is one of the critical factors affecting the routability and manufacturing cost of the 2.5D IC. Thus, achieving a 100% routing completion rate in an interposer using a minimum number of metal layers plays a key role in the success of a 2.5D IC. This paper presents a global-routing-based metal layer planner called VGR to identify a minimal number of metal layers for an interposer with consideration of routability and manufacturing cost. VGR can also identify a good stacking order of the horizontal and vertical layers in an interposer such that the routing solution in the interposer uses fewer vias. To the best of our knowledge, this paper is the first study to solve the metal layer planning problem for silicon interposers.

17:30 End of session

12.6 Error Resilience and Power Management

**Date:** Thursday 27 March 2014

**Time:** 16:00 - 17:30

**Location / Room:** Conference Room

**Chair:** William Fornaciari, Politecnico di Milano - DEIB, IT

**Co-Chair:** Kim Gruettner, OFFIS, DE

This session addresses the trade-off between accuracy and power consumption and the management of multi-core/multi-systems. The power management is addressed at several abstraction levels from circuit to system level (operating system).

### 16:00 12.6.1 ASLAN: SYNTHESIS OF APPROXIMATE SEQUENTIAL CIRCUITS

**Speakers:** Ashish Ranjan, Arnab Raha, Swagath Venkataramani, Kaushik Roy and Anand Raghunathan, Purdue University, US

**Abstract**

Applications from several important domains exhibit intrinsic resilience to approximations or inexactness in their underlying computations. Approximate circuits are commonly used to realize highly efficient hardware implementations of such applications. A wide range of manual and automatic techniques for the design of approximate circuits have been proposed. However, all of them target combinational circuits, leaving a gap between these techniques and the natural granularity at which quality is specified. In practice, the designer is concerned with quality or accuracy at the output of a sequential circuit after several cycles of computation, and not at the output of an embedded combinational block. We propose ASLAN (Automatic methodology for Sequential Logic Approximation), the first effort towards the synthesis of approximate sequential circuits. Given an RTL or gate-level description of a sequential circuit and a quality constraint at its output, ASLAN automatically synthesizes an approximate version that guarantees the specified quality bound. The key challenges in approximating sequential circuits are (i) to model how errors due to approximations are generated, propagate through multiple cycles of operation, and eventually impact quality of the final output, and (ii) to select the most beneficial approximations, i.e., those that result in higher energy savings for smaller impact on output quality. We address the first challenge by constructing a virtual Sequential Quality Constraint Circuit (SQCC) and utilizing formal verification to ensure that a given approximation satisfies the quality constraint during synthesis. To address the second challenge, we identify combinational blocks in the sequential circuit that are amenable to approximation, generate local quality-energy trade-off curves for them, and use a gradient-descent approach to iteratively approximate the sequential circuit. 
We used ASLAN to automatically synthesize approximate versions of several sequential benchmarks (DCT, FIR, IIR, etc.). Our experiments demonstrate energy reductions of 1.20X-2.44X for tight error constraints, and 1.32X-4.42X for relaxed error constraints. We also present case studies of using the approximate circuits generated by ASLAN in two well-known applications: MPEG encoding and K-Means clustering. We obtain energy savings of 1.32X with 0.5% average degradation in PSNR for the MPEG encoder and 1.26X with 0.8% quality loss in the case of K-Means clustering.
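
The iterative selection step described above can be sketched as a steepest-descent loop over per-block quality-energy curves. This is only an illustration under strong assumptions: errors are treated as additive across blocks, whereas the abstract states that ASLAN checks the quality constraint by formal verification on the SQCC. The function name, the curve format, and the numbers are hypothetical.

```python
def select_approximations(curves, error_budget):
    """Greedy steepest-descent selection of approximation levels.

    curves: {block: [(energy, error), ...]}, ordered from exact (error 0)
            to most approximate. Returns the chosen level index per block.
    Assumption: total output error is the sum of per-block errors.
    """
    level = {block: 0 for block in curves}
    total_error = 0.0
    while True:
        best = None
        for block, points in curves.items():
            i = level[block]
            if i + 1 >= len(points):
                continue  # block is already at its most approximate level
            d_energy = points[i][0] - points[i + 1][0]
            d_error = points[i + 1][1] - points[i][1]
            if d_energy <= 0 or total_error + d_error > error_budget:
                continue  # no saving, or the step would violate the bound
            slope = d_energy / d_error if d_error > 0 else float("inf")
            if best is None or slope > best[0]:
                best = (slope, block, d_error)
        if best is None:
            return level  # no admissible step remains
        _, block, d_error = best
        level[block] += 1
        total_error += d_error

# Hypothetical trade-off curves for two combinational blocks.
curves = {
    "mult": [(10.0, 0.0), (7.0, 0.1), (5.0, 0.4)],
    "add": [(4.0, 0.0), (3.5, 0.2)],
}
chosen = select_approximations(curves, error_budget=0.35)
```

Each iteration takes the single step with the largest energy saving per unit of added error, which mirrors the stated goal of preferring approximations with higher energy savings for smaller quality impact.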

### 16:30 12.6.2 VRCON: DYNAMIC RECONFIGURATION OF VOLTAGE REGULATORS IN A MULTICORE PLATFORM

**Speakers:** Woojoo Lee, Yanzhi Wang and Massoud Pedram, University of Southern California, US

**Abstract**

Dynamic voltage and frequency scaling (DVFS) is driven by user requirements for high performance and low power. To overcome the limitations of conventional chip-wide DVFS and achieve the maximum possible energy saving, per-core DVFS is being enabled in recent chip multiprocessor (CMP) offerings. While the power consumed by the CMP is reduced by per-core DVFS, the power dissipated by the many voltage regulators (VRs) needed to support per-core DVFS becomes critical. This paper focuses on the dynamic control of the VRs in a CMP platform. Starting with a proposed platform with a configurable VR-to-core power distribution network, two optimization methods are presented to maximize the system-wide energy savings: (i) reactive VR consolidation, which reconfigures the network to maximize the power conversion efficiency of the VRs under the pre-determined DVFS levels for the cores, and (ii) proactive VR consolidation, which determines new DVFS levels to maximize the total energy savings without any performance degradation. Results from detailed experiments demonstrate up to 35% VR energy loss reduction and 14% total energy saving.

### 17:00 12.6.3 COARSE-GRAINED BUBBLE RAZOR TO EXPLOIT THE POTENTIAL OF TWO-PHASE TRANSPARENT LATCH DESIGNS

**Speakers:** Hayoung Kim, Jae-joon Kim, Sungjoo Yoo, Sunggu Lee and Dongyoung Kim, POSTECH, KR

**Abstract**

Timing margin to cover process variation is one of the most critical factors that limit the amount of supply voltage reduction and thereby the reduction in power consumption. To remove overly conservative timing margin, Bubble Razor was introduced to dynamically detect and correct errors in two-phase transparent latch designs. However, it does not fully exploit the potential of two-phase transparent latch design, e.g., time borrowing. Thus, especially at low supply voltage, where the effect of process variation becomes significant, the existing Bubble Razor can suffer from significant overhead in performance and power consumption due to overly frequent bubble generation. We present a design methodology for coarse-grained Bubble Razor, which exploits the time-borrowing characteristic of two-phase transparent latch design. By selectively inserting error checkpoints, i.e., shadow latches and error management logic, in the circuit, time borrowing can be applied between error checkpoints, thereby avoiding bubbles which could occur in the existing Bubble Razor design with a checkpoint at every latch on the critical path. We present a methodology to choose the grain size (the number of stages between error checkpoints) based on the 3-sigma delay distribution. We also verify the benefits of coarse-grained Bubble Razor with a real microprocessor design, Core-A [15], using a 20nm Predictive Technology Model (PTM) [16]. The proposed methodology offers a 62% improvement in performance (MIPS) and 49% less energy consumption (per instruction) at 0.6V operation (zero frequency margin) over the original Bubble Razor scheme. In addition, it gives a 29% area reduction in the core design.

12.7 Built-in Self-Test Solutions for Mixed-Signal and RF ICs

Presentations in this session offer solutions to equip mixed-signal and RF circuits with built-in self-test capabilities. These solutions include the use of an on-chip neural network that maps test signatures directly to a pass/fail decision, loopback test where the transmitter is used to test the receiver, and a reconfiguration principle for pipelined data converters.

12.7.1 AN ANALOG NON-VOLATILE NEURAL NETWORK PLATFORM FOR PROTOTYPING RF BIST SOLUTIONS

**Speakers:** Dzmitry Maliuk¹ and Yiorgos Makris²

¹Yale University, US; ²University of Texas at Dallas, US

**Abstract**

We introduce an analog non-volatile neural network chip which serves as an experimentation platform for prototyping custom classifiers for on-chip integration towards fully stand-alone built-in self-test (BIST) solutions for RF circuits. Our chip consists of a reconfigurable array of synapses and neurons operating below threshold and featuring sub-µW power consumption. The synapse circuits employ dynamic weight storage for fast bidirectional weight updates during training. The learned weights are then copied onto analog floating gate (FG) memory for permanent storage. The chip architecture supports two learning models: a multilayer perceptron and an ontogenic neural network. A benchmark XOR task is first employed to evaluate the overall learning capability of our chip. The BIST-related effectiveness is then evaluated on two case studies: the detection of parametric and catastrophic faults in an LNA and an RF front-end circuit, respectively.

12.7.2 BUILT-IN SELF-TEST AND CHARACTERIZATION OF POLAR TRANSMITTER PARAMETERS IN THE LOOP-BACK MODE

**Speakers:** Jae Woong Jeong¹, Sule Ozev¹, Shreyas Sen², Vishwanath Natarajan² and Mustapha Slamani³

¹Arizona State University, US; ²Intel Corporation, US; ³IBM Corp., US

**Abstract**

This paper presents a low-cost built-in self-test (BIST) solution for polar transmitters. Polar transmitters are desirable for portable devices due to the higher power efficiency they provide compared to traditional Cartesian transmitters. However, they generally require iterative test/measurement/calibration cycles. The delay skew between the envelope and phase signals and the finite envelope bandwidth can create intermodulation distortion (IMD) that leads to the violation of the spectral mask and error vector magnitude (EVM) requirements. These parameters are typically not directly measured but calibrated through spectral performance analysis using expensive RF equipment, leading to lengthy and costly measurement/calibration cycles. Characterization and calibration of these parameters inside the device would reduce the test time and cost considerably. In this paper, we propose a technique to measure the delay skew and the finite envelope bandwidth, two parameters that can be digitally calibrated, based on the measurement of the output of the receiver in the loop-back mode. Simulation and hardware measurement results show that the proposed technique can characterize the targeted impairments in the polar transmitter accurately.

12.7.3 A FLEXIBLE BIST STRATEGY FOR SDR TRANSMITTERS

**Speakers:** Emanuel Dogaru¹, Filipe Vinci dos Santos² and William Rebernak¹

¹Thales Communications & Security, FR; ²Thales Chair on Advanced Analog Design, SUPELEC, FR

**Abstract**

Software-defined radio (SDR) development aims for increased speed and flexibility. The impact of these system-level requirements on the physical layer (PHY) access hardware is leading to more complex architectures, which together with higher levels of integration pose a challenging problem for product testing. For radio units that must be field-upgradeable without specialized equipment, Built-in Self-Test (BIST) schemes are arguably the only way to ensure continued compliance with specifications. In this paper we introduce a loopback RF BIST technique that uses Periodically Nonuniform Sampling (PNS2) of the transmitter (TX) output to evaluate compliance with spectral mask specifications. No significant hardware costs are incurred due to the re-use of available RX resources (I/Q ADCs, DSP, GPP, etc.). Simulation results for a homodyne TX demonstrate that the Adjacent Channel Power Ratio (ACPR) can be accurately estimated. Future work will consist of validating our loopback RF BIST architecture on an in-house SDR testbed.
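
For readers unfamiliar with the metric, ACPR is the ratio of the power in an adjacent channel to the power in the main channel, usually expressed in dB. The sketch below shows only that definition, applied to a power spectrum that is assumed to be already available; it does not model the PNS2 sampling or reconstruction used in the paper, and the function names, bin layout, and numbers are illustrative assumptions.

```python
import math

def band_power(psd, freqs, f_lo, f_hi):
    """Sum the power of all spectrum bins with f_lo <= f < f_hi."""
    return sum(p for p, f in zip(psd, freqs) if f_lo <= f < f_hi)

def acpr_db(psd, freqs, center, bandwidth, spacing):
    """Return (lower, upper) ACPR in dB relative to the main channel."""
    half = bandwidth / 2
    main = band_power(psd, freqs, center - half, center + half)
    lower = band_power(psd, freqs, center - spacing - half, center - spacing + half)
    upper = band_power(psd, freqs, center + spacing - half, center + spacing + half)
    return 10 * math.log10(lower / main), 10 * math.log10(upper / main)

# Synthetic spectrum (assumed data): flat main channel, leakage 30 dB below it.
freqs = list(range(-15, 15))  # bin centre frequencies in arbitrary units
psd = [1.0 if -5 <= f < 5 else 0.001 for f in freqs]
acpr_lower, acpr_upper = acpr_db(psd, freqs, center=0, bandwidth=10, spacing=10)
```

With these synthetic values both adjacent channels sit 30 dB below the main channel, so a spectral mask check reduces to comparing the two returned numbers against the permitted limit.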

SIGMA-DELTA TESTABILITY FOR PIPELINE A/D CONVERTERS

**Speakers:** Antonio Jose Gines Arteaga and Gildas Leger, Instituto de Microelectronica de Sevilla, IMSE-CNM, (CSIC - Universidad de Sevilla), ES

**Abstract**

Pipeline Analog to Digital Converters (ADCs) are widely used in applications that require medium to high resolution at high acquisition speed. Despite their quite simple working principles, they usually form rather complex mixed-signal blocks, particularly if digital correction and calibration are considered. As a result, pipeline converters are difficult to test and diagnose. In this paper, we propose to reconfigure the internal Multiplying DACs (MDACs) that perform residue amplifications as integrators, each one with an analog and a digital input. In this way, we can reuse consecutive pipeline stages to form Sigma-Delta modulators, with very little area overhead. We thus get an on-chip DC (low-frequency) probe with a digital 1-bit output that does not require any extra pin. In addition, digital test techniques developed for Sigma-Delta modulators may be used to enhance the diagnosing capabilities. An industrial 1.8V 15-bit 100Msps pipeline ADC that had previously been fully validated in a 0.18µm CMOS process is used as a case study for the introduction of the DfT modifications.
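
Why a 1-bit Sigma-Delta output can act as a DC probe is easiest to see in a behavioral model. The sketch below simulates a generic first-order modulator, not the reconfigured MDAC stages of the paper (whose modulator order and circuit details are not stated here); the function names and the reference voltage are assumptions. The average of the bitstream converges to the DC input.

```python
def sigma_delta_bits(dc_input, n_samples, vref=1.0):
    """Behavioral first-order Sigma-Delta modulator with a 1-bit quantizer."""
    integrator = 0.0
    feedback = 0.0  # assumed initial DAC feedback
    bits = []
    for _ in range(n_samples):
        # Integrate the difference between the input and the fed-back output.
        integrator += dc_input - feedback
        bit = 1 if integrator >= 0 else 0
        feedback = vref if bit else -vref
        bits.append(bit)
    return bits

def estimate_dc(bits, vref=1.0):
    """Recover the DC level from the density of ones in the bitstream."""
    return vref * (2 * sum(bits) / len(bits) - 1)

bits = sigma_delta_bits(0.3, 10000)
dc_estimate = estimate_dc(bits)
```

Because the integrator stays bounded for inputs inside the reference range, the estimation error shrinks in proportion to the number of samples, which is why a simple digital counter on the 1-bit output is enough to read out a DC level.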

12.8 Panel: Future SoC verification methodology: UVM evolution or revolution?

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Exhibition Theatre

Organiser:
Alex Goryachev, IBM Research - Haifa, IL

Chair:
Rolf Drechsler, University of Bremen/DFKI, DE

It is a recent trend that SoCs are becoming more similar to servers. Many SoCs today are no longer tied to a single application and look more like general purpose PCs and high-end servers. Smartphones are the most notable example of this, but we are also seeing this with TV chips, in-car controllers, network routers, and more. This trend is occurring in parallel to the constantly growing complexity of SoCs, which support diverse IO interfaces and devices, and have complex architectures including multiple heterogeneous cores, multi-level caches, and multiple IO bridges. Today, common practice for verification is based on Universal Verification Methodology (UVM), which, at the system level, is built on reusing and combining unit-level environments, followed by running real software on an SoC. This methodology leaves a large gap. In high-end systems, this gap is covered by system-level verification that focuses on HW-only system integration. This level has its own methodology, dedicated environment, set of tools, and teams. It looks at the system as a whole and is not based on reusing lower level environments. Formal methods are a field of intensive research, but they have not been adopted by the industry for SoC-level verification. In this panel, leading experts from industry (both users and vendors) and academia will discuss the future of SoC verification methodology. Is the gap in today’s SoC verification methodology significant? Is it growing? Or perhaps it does not exist? What is the right way to close the gap, if one exists? Is it sufficient to extend UVM capabilities (e.g., SystemC, TLM) or are dedicated tools and methodology needed? Are formal methods ready to play a significant role in SoC-level verification? In general, we would like to determine the importance of system-level verification and its unique needs, whether in generators, checking, coverage, or teams.

Panelists:
- Lyes Benalycherif, STMicroelectronics, FR
- Franco Fummi, University of Verona, IT
- Alan J. Hu, University of British Columbia, Vancouver, CA
- Ronny Morad, IBM Research - Haifa, IL
- Frank Schirrmeister, Cadence Design Systems, US

17:30 End of session

Source URL: https://past.date-conference.com/date14/booklet-proof_reading