Booklet Proof Reading

Goto Session:

1.1 Opening Session
UB01 Session 1
2.1 EXECUTIVE SESSION: How to Handle Today's Design Complexity
2.2 Panel: Emerging vs. Established Technologies: a Two Sphinxes' Riddle at the Crossroads?
2.3 Making automotive systems safer and more energy efficient
2.4 Modern Challenges in Analog and Mixed-Signal Design
2.5 Low-Power and Efficient Architectures
2.6 Real-Time memory hierarchies
2.7 Yield and Reliability for Robust Systems
2.8 Hot Topic: Technology Transfer towards Horizon 2020
UB02 Session 2
3.1 EXECUTIVE SESSION: Advanced Technology Challenges & Opportunities
3.2 Panel: The World Is Going... Analog & Mixed-Signal! What about EDA?
3.3 Secure Hardware Primitives and Implementations
3.4 Modeling and Optimization of Power Distribution Networks
3.5 Robust Architectures
3.6 Cyber Physical Systems: Security and Co-design
3.7 On line Strategies for Reliability
3.8 Hot Topic: Mission Profile Aware Design - The Solution for Successful Design of Tomorrows Automotive Electronics
UB03 Session 3
IP1 Interactive Presentations
4.1 EXECUTIVE SESSION: Addressing Challenges of Reliable Chips
4.2 Hot Topic: Multicore Systems in Safety Critical Electronic Control Units for Automotive and Avionics
4.3 Secure Device Identification
4.4 "Almost there" emerging technologies
4.5 Memory System Architectures
4.6 Code Generation and Optimization for Embedded Platforms
4.7 Dependable System Design
4.8 State-of-the-art in Verification: European Tertulia IC Design - Enabling AMS Structured Verification / Verification in FPGA & IP design flows
UB04 Session 4
Exhibition-Reception Exhibition Reception
5.1 SPECIAL DAY Hot Topic: Predictable Multi-Core Computing
5.2 Hot Topic: Hacking and Protecting Hardware: Threats and Challenges
5.3 Reliable Systems in the Age of Variability
5.4 Prediction and optimization of timing variations
5.5 Boosting the Scalability of Formal Verification Technologies
5.6 Emerging logic technologies
5.7 Test Generation and Optimization
5.8 Hot Topic: System Integration - The Bridge between More than Moore and More Moore
IP2 Interactive Presentations
UB05 Session 5
6.1 SPECIAL DAY Hot Topic: The fight against Dark Silicon
6.2 Embedded Tutorial: Emerging Transistor Technologies: From Devices to Architectures
6.3 Management of Micro/Macro Renewable Energy Storage Systems
6.4 Power delivery and distribution
6.5 Beyond EDA: Extending the Application Domain of Formal Methods
6.6 Model-Based Design and Hardware/Software Interfaces
6.7 Hardening Approaches at Different Design Levels
6.8 First Time Right in Analog Design Enabling New Business Cases
UB06 Session 6
7.0 Special Day Keynote
UB07 Session 7
7.1 SPECIAL DAY Panel: HW/SW Co-Development - The Industrial Workflow
7.2 Embedded Tutorial: Cross Layer Resiliency in Real World
7.3 Low power methods and multicore architectures for mobile health applications
7.4 Runtime memory optimization and GPU/manycore architectures
7.5 Emerging memory technologies
7.6 Performance and timing analysis
7.7 Design-for-Test and Test Access
7.8 Panel: FD-SOI - the Enabling European Technology for Energy Efficient Solutions - Creating a Solution Hive & Design Hub as Eco-System for Future Success
IP3 Interactive Presentations
UB08 Session 8
8.1 SPECIAL DAY System Simulation and Virtual Prototyping
8.2 Hot Topic: Near Threshold Computing (NTC)
8.3 Physical Attacks and countermeasures
8.4 Efficient Designs for Telecom and Financial Applications
8.5 Modeling & Specification
8.6 Mapping and Scheduling for Many-Core Embedded Systems
8.7 Performance Modeling and Delay Test
8.8 Hot Topic: Beyond CMOS Ultra-low-power Computing
Party DATE Party
9.1 SPECIAL DAY Hot Topic: CMOS scaling - from evolutionary to revolutionary computing
9.2 Low-Cost, High-Performance NoCs
9.3 Hardware Implementations for Data Security
9.4 Timing challenges in validation
9.5 Hot Topic: Connecting Different Worlds - Technology Abstraction for Reliability-Aware Design and Test
9.6 Schedulability analysis
9.7 Timing Analysis and Cell Design
9.8 Embedded Tutorial: Memcomputing: the Cape of Good Hope
IP4 Interactive Presentations
UB09 Session 9
10.1 SPECIAL DAY Hot Topic: Memories today and tomorrow
10.2 Wireless NoCs
10.3 Green Computing Systems
10.4 System-level evaluation
10.5 Analysis of Components and Systems
10.6 Multi-processor and distributed systems
10.7 Advances in Synthesis
10.8 EDA+3D+MEMS Innovation Agenda 2020 Fueling the Innovation Chain of Electronics
UB10 Session 10
11.0 Special Day Keynote
11.1 SPECIAL DAY Embedded Tutorial: Alternatives to CMOS
11.2 Transitioning NoC Design Techniques to Future Challenges
11.3 Industry relevant research and practice for system design
11.4 Enabling validation on fast platforms
11.5 Memory Resource Allocation and Scheduling in MPSoC
11.6 System-Level Thermal Estimation and Management
11.7 Power and Emerging Technologies in Reconfigurable Computing
11.8 Embedded Tutorial: GPGPUs: how to combine high computational power with high reliability
UB11 Session 11
IP5 Interactive Presentations
12.1 SPECIAL DAY Hot Topic: The future of interfacing to the natural world
12.2 Hot topic: How Secure are PUFs Really? On the Reach and Limits of Recent PUF Attacks
12.3 Multimedia Systems
12.4 Physical Aspects
12.5 System-level Design Space Exploration
12.6 Error Resilience and Power Management
12.7 Built-in Self-Test Solutions for Mixed-Signal and RF ICs
12.8 Panel: Future SoC verification methodology: UVM evolution or revolution?

1.1 Opening Session

Date: Tuesday 25 March 2014
Time: 08:30 - 10:30
Location / Room: Grosser Saal

Organiser:
Gerhard Fettweis, Technische Universität Dresden, DE

Time	Label	Presentation Title Authors
08:30	1.1.1	WELCOME ADDRESSES Speakers: Gerhard Fettweis¹ and Luca Fanucci² ¹Technische Universität Dresden, DE; ²University of Pisa, IT
08:50	1.1.2	PRESENTATION OF DISTINGUISHED AWARDS Speaker: DATE Executive Committee Abstract DATE 2014 Best Paper Awards; EDAA Lifetime Achievement Award 2014 (Rolf Ernst, TU Braunschweig, DE); EDAA Outstanding Dissertation Awards 2013; ACM SIGDA Distinguished Service Award (Peter Marwedel, TU Dortmund, DE); DATE Fellow Award (Enrico Macii, Politecnico di Torino, IT); IEEE/CEDA Outstanding Service Contribution Award 2013 (Enrico Macii, Politecnico di Torino, IT); IEEE CS TTTC Outstanding Contribution Award (Enrico Macii, Politecnico di Torino, IT); IEEE Fellow Award (Cecilia Metra, University of Bologna, IT)
09:10	1.1.3	KEYNOTE ADDRESS: SYSTEM DESIGN CHALLENGES FOR NEXT GENERATION WIRELESS AND EMBEDDED SYSTEMS Speaker: David Fuller, National Instruments, US Abstract Application demands in our embedded world are growing dramatically. Consumer expectations and the industry's forward-looking technology roadmaps paint a picture of a connected world full of intelligent devices once thought to have fixed functionalities. Researchers exploring next generation wireless systems, Internet of Things (IoT), and even machine-to-machine (M2M) communications face many challenges in making this vision a reality. Where once a single, isolated design flow addressed the discrete application, heterogeneous multi-processing architectures must be considered and embraced along with the connections to other devices and systems, and real-world sensor data. As the systems grow in complexity, new design approaches must also be developed and employed to expedite the research, design, and development cycle. David Fuller will outline challenges system designers face in developing cyber-physical systems and explore a graphical system design approach that includes hardware abstraction and comprehends a heterogeneous multiprocessing environment while embracing different models of computation. Through this new approach, system designers can shorten design cycles and the time to prototype, ultimately accelerating deployment.
	1.1.4	KEYNOTE ADDRESS: THE GROWING IMPORTANCE OF MICROELECTRONICS FROM A FOUNDRY PERSPECTIVE Speaker: Gerd Teepe, GLOBALFOUNDRIES, DE Abstract Microelectronics is the dominant industrial technology of today. Its rate of innovation, spelled out by Moore's Law, is exceptional by any commercial metric, especially as it has been on this trajectory for almost 40 years. It is not surprising that other industrial sectors are taking advantage of the innovation engine of semiconductors for their own product innovation: cars are safer and more economical, medical diagnostics are performing at a significantly higher level, and energy use from generation to the consumer is a lot more efficient. "The Internet" has become the basis for communication, organization and planning in our economies, with significant impact on our society. However, the semiconductor industry is undergoing a powerful transformation marked by the following trends: - Design complexity is facing new challenges, as technological complexity is transferred to the design space at an accelerated pace - The SOC is dominating the design space - Intelligent Things are emerging with unprecedented cognitive and motion capabilities - The supply chain transformation is in full motion, with the foundry model at the forefront. With these powerful trends in motion, we will have to rethink our approach towards semiconductors as part of the industrial system. It will no longer be sufficient to "enhance" traditional products like cars, TVs, machines or phones with semiconductor content to make them perform at a higher level and increase their value to consumers. We need to rethink the connected world around us to truly assess the next generation of intelligent applications, which we are about to enter.
10:30		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

UB01 Session 1

Date: Tuesday 25 March 2014
Time: 10:30 - 12:30
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB01.01	QUANTUMEDA: A VISUALIZATION AND DESIGN ENVIRONMENT FOR TOPOLOGICAL QUANTUM CIRCUITS Authors: Ilia Polian, Wolfgang Wallner and Alexandru Paler, University of Passau, DE Abstract Quantum circuits use quantum-mechanical properties of certain physical systems, such as superposition and entanglement, to perform massively parallel calculations. They provide polynomial algorithms for problems for which only inefficient algorithms with asymptotically-exponential running time are known in conventional models of computation. Building a scalable quantum computer that can process a large number of quantum bits (qubits) is one of the grand challenges of modern science. While first small quantum computers have been experimentally demonstrated and a number of implementation technologies have been suggested, all of them encounter difficulties when it comes to scaling. The central difficulty is the high susceptibility of such circuits to noise and decoherence, which necessitates the use of special quantum error correction. Topological quantum computing (TQC) is a paradigm that offers a path to scalability. It strikes a balance between systematic, intuitive methods to design large computations, and relatively loose requirements on the vulnerability of individual qubits to errors. The availability of a platform for implementing large quantum algorithms creates the need for methods to manage design complexity, including automatic synthesis, optimization, compaction, verification and visualization of TQC circuits. Topological quantum circuits are based on a three-dimensional cluster of qubits which supports highly efficient topological quantum error-correcting codes. In this way, the circuits can operate even though their individual qubits are subject to relatively high error rates. We will present the first environment for design of TQC circuits. 
The environment allows the user to graphically enter the structure of a circuit, add, delete and re-shape individual qubits, and perform optimization and compaction (both manually and by global replacement). The circuits are represented on an intermediate technology-independent level, where "logical qubits" that consist of a large number of physical qubits perform error-corrected operations. For example, the circuit in Fig. 1 shows an error-corrected CNOT gate implemented by four logical qubits represented by colored structures. The optimized representation can be translated into instruction sequences for a classical computer that operates the actual quantum hardware.
UB01.02	AN AUTOMATED DESIGN FLOW FOR FAST PROTOTYPING OF SIMULINK MODELS ONTO MPSOC Authors: Francesco Robino and Johnny Öberg, Royal Institute of Technology, SE Abstract Simulink is a modelling environment suitable for modelling embedded systems at system level. However, there is no standard way to rapidly prototype Simulink models onto modern multiprocessor systems-on-chip (MPSoC). In this demonstration we show how our NoC System Generator tool can be used as part of an automated platform-based design flow to synthesize a Simulink model to a network-on-chip based MPSoC implementation on FPGA. The performance of the generated prototype scales with the number of processors.
UB01.03	HEROES^2: A SYSTEMC FRAMEWORK FOR MODELING, SIMULATION AND TESTING OF HETEROGENEOUS SOFTWARE-INTENSIVE SYSTEMS Authors: Markus Becker¹, Wolfgang Mueller¹, Ulrich Kiffmeier² and Joachim Stroop² ¹University of Paderborn/C-LAB, DE; ²dSPACE GmbH, DE Abstract HeroeS^2 is a SystemC framework for modeling/simulation of heterogeneous SW-intensive systems. It has 8 abstraction levels for co-refinement of application/environment models from continuous/discrete models to networked embedded SW stacks. Support of various SW/comm. abstractions is achieved by combining AMS MoCs, TLM, HdS models (MW, RTOS, HAL) and QEMU user mode/system emulator. Interfacing with a commercial AUTOSAR toolchain is provided, i.e., code generators, integration and experimentation tools.
UB01.04	BUILDING A PROTOTYPING PLATFORM FOR INVESTIGATING THE IMPACT OF ATTACKS AGAINST AUTOMOTIVE NETWORKS Authors: Alexander Stühring¹, Günter Ehmen¹ and Sibylle Fröschle² ¹University of Oldenburg, DE; ²OFFIS, DE Abstract The University of Oldenburg is working on solutions to ensure secure communication in the automotive domain. This is a key requirement for safe applications in the context of future Car2X applications. In order to achieve this goal we are using a self-developed prototyping platform to analyze and demonstrate the impact of attacks on in-vehicle buses and wireless networks. Moreover, visitors are able to start attacks and observe the consequences in a simulated driving scenario.
UB01.05	MOTORBRAIN: MODEL-BASED DESIGN AND VIRTUAL INTEGRATION OF AN INTELLIGENT AND SAFE ELECTRICAL POWERTRAIN Authors: Sven Rosinger, Maher Fakih and Jörg Walter, OFFIS - Institut für Informatik, DE Abstract Hardware prototypes and hardware in the loop simulations are commonly used during embedded vehicle- and motor-control unit design. This demonstrator presents a platform that is an order of magnitude cheaper than existing systems but still easy to integrate into present workflows: Within an existing model-driven design methodology, a real-time hardware simulation is performed using the Raspberry Pi single-board computer to simulate an e-motor with little development effort and in conjunction with an industrial motor control unit.
UB01.06	ENERGY-MODULATED COMPUTING Authors: Maxim Rykunov, Reza Ramezani, Abdullah Baz, Xuefu Zhang, Delong Shang, Andrey Mokhov, Danil Sokolov, Fei Xia and Alex Yakovlev, Newcastle University, GB Abstract This demo will illustrate the principle of energy-modulated computing according to which the flow of energy entering a computing system determines its computational flow. This principle will be fundamental for building future autonomous systems, such as those powered by energy harvesting sources and aimed for survival in power-deficient conditions. The demo includes a set of experimental circuits (with three VLSI chips and PCBs) to work in variable power supply conditions and software tools for digital and analogue co-design (Workcraft, Petrify, MPSAT).
UB01.07	VERIFIC-MM Authors: Christoph Kuznik and Wolfgang Müller, University of Paderborn, DE Abstract Verific-MM is an approach to systematize and accelerate the coverage plan engineering as well as the verification environment's (functional) metric code generation -- usually a time-consuming and error-prone task -- in particular by (i) improving automation via assisted model-based approaches, utilizing recent industry standards such as UCIS and (ii) a supporting methodology suitable for various target (functional coverage) languages (IEEE-1800 SystemVerilog, IEEE-1647 e, IEEE-1666 SystemC).
UB01.08	MICROTESK: RECONFIGURABLE OPEN-SOURCE FRAMEWORK FOR TEST PROGRAM GENERATION Authors: Andrei Tatarnikov, Alexander Kamkin and Artem Kotsynyak, Institute for System Programming of the Russian Academy of Sciences (ISP RAS), RU Abstract Test program generation plays a major role in functional verification of microprocessors. Due to tremendous growth in complexity of modern designs and rigid constraints on time to market, it becomes an increasingly difficult task. In spite of powerful test program generation tools available in the market, development of functional tests is still known to be the bottleneck of the microprocessor design cycle. The common problem is that it takes a significant effort to reconfigure a test program generation environment for a new microprocessor design. The model-based approach applied in the state-of-the-art tools, like Genesys-Pro (IBM Research), still does not provide enough flexibility since creating a microprocessor model is difficult and requires special knowledge and skills. MicroTESK, the open-source test program generation framework being developed at ISPRAS, offers an approach to ease customization by using light-weight formal specifications to describe the target microprocessor architecture. The approach helps reduce the effort needed to create a microprocessor model and, consequently, minimize the time required to create functional tests. In addition to gaining flexibility, the use of formal specifications also allows automated extraction of knowledge about test situations that occur in a microprocessor (coverage model), thus, facilitating creating directed tests and improving test coverage. By the present moment, a demo prototype of MicroTESK has been implemented. It uses the Sim-nML architecture description language to specify the target microprocessor architecture and provides a convenient Ruby-based language for creating test templates that serve as an abstract description of test programs to be generated. 
The current version of the framework focuses primarily on RISC microprocessors including ARM, MIPS and SPARC. Supported test generation methods include random, combinatorial, template-based and model-based generation. The flexible architecture of the framework allows adding support for new test generation methods.
UB01.09	FAULTIFY: PROBABILISTIC CIRCUIT FAULT EMULATION Authors: David May and Walter Stechele, TUM, DE Abstract We want to demonstrate an FPGA-based probability-aware fault emulator and its corresponding algorithms in the context of a real-time H.264 decoder. The demo will show that reliability constraints can be relaxed inside the circuit without noticeable degradation of the image quality when carefully investigating where the constraints can be relaxed. We will show how this investigation can be done using our emulator and we will show the effect of a relaxed robustness of the circuit in real-time.
UB01.10	UNISON: ASSEMBLY CODE GENERATION USING CONSTRAINT PROGRAMMING Authors: Roberto Castañeda Lozano¹, Gabriel Hjort Blindell², Mats Carlsson¹ and Christian Schulte² ¹Swedish Institute of Computer Science, SE; ²KTH Royal Institute of Technology, SE Abstract We demonstrate Unison - a simple, flexible and potentially optimal code generator that solves interdependent code generation tasks together using constraint programming as a modern combinatorial optimization method. We show how Unison takes into account the task interdependencies and their combinatorial nature to improve the speed of the code generated by LLVM (a state-of-the-art compiler) for Hexagon (a digital signal processor ubiquitous in modern mobile platforms).
12:30	End of session
13:00	Lunch Break in Exhibition Area Sandwich lunch

2.1 EXECUTIVE SESSION: How to Handle Today's Design Complexity

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Saal 1

Organiser:
Yervant Zorian, Fellow & Chief Architect, Synopsys, US

Executives:
Sanjive Agarwala, Fellow & Silicon Director, Texas Instruments, US
Paul Lo, Senior Vice President, Synopsys, US
Rainer Kress, Head Design Methodology, Infineon, DE
Wolfgang Maier, Director, IBM, DE

The widening gap between growing SOC complexity and designer productivity limits traditional chip design methods and flows. This has given rise to several new approaches and innovative methods that work to alleviate the limitations of different aspects of complex SOC design. Executives in this session will discuss the impact of complexity and the new opportunities it may bring in designing today's SOC.

Time	Label	Presentation Title Authors
13:00		End of session Lunch Break in Exhibition Area Sandwich lunch

2.2 Panel: Emerging vs. Established Technologies: a Two Sphinxes' Riddle at the Crossroads?

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 6

Organiser:
Marco Casale-Rossi, Synopsys, Inc., US

Chair:
Giovanni De Micheli, EPFL, CH

Crossroads have always been challenging: they require a decision; in Egyptian and Greek mythology they were often guarded by two sphinxes trying to cheat the traveler with their riddles. The two sphinxes, the knight and the knave, the lady and the tiger, are just a few instances of difficult puzzles that have kept logicians and mathematicians busy for the last 5,000 years. Today, you are walking down Moore's Law road when you come to a crossroads: one road brings you into the land of emerging technologies: 14, 10 and 7 nanometer, FDSOI, FinFET, 3D-IC,... beyond and below; the other road keeps you in the land of established technologies: 28, 40, 65, and 90 nanometers, possibly even above, A&M/S, MEMS,... Choosing the right road is critical to lead your project and your company to success, but making the right decision is increasingly difficult, as it encompasses complex technical and economic considerations. However, unlike the mythological traveler, you won't run into the sphinxes but, rather, into some of our industry's best experts; unlike the sphinxes, they will strive to provide you with honest advice about the "road conditions", and you are allowed to ask them multiple questions to figure out which road is the best for you.

Panelists:

Rob Aitken, ARM Ltd., US
Antun Domic, Synopsys, Inc., US
Manfred Horstmann, GLOBALFOUNDRIES, DE
Robert Hum, Mentor Graphics Corp., US
Philippe Magarshack, STMicroelectronics, FR

13:00		End of session Lunch Break in Exhibition Area Sandwich lunch

2.3 Making automotive systems safer and more energy efficient

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 1

Chair:
Bart Vermeulen, NXP, NL

Co-Chair:
Sebastian Steinhorst, TUM-CREATE, SG

With the transition from hydraulic and mechanical automotive systems to electronic systems, the requirements on safety and energy efficiency are becoming increasingly important. The papers in this session address these issues by presenting robustness improvements at component and system level, advanced energy management at network level, and multi-variant design space exploration.

Time	Label	Presentation Title Authors
11:30	2.3.1	EMULATION-BASED ROBUSTNESS ASSESSMENT FOR AUTOMOTIVE SMART-POWER ICS Speakers: Manuel Harrant¹, Thomas Nirmaier¹, Jerome Kirscher¹, Christoph Grimm² and Georg Pelz¹ ¹Infineon Technologies AG, DE; ²TU Kaiserslautern, DE Abstract In this paper we present a concept for assessing the robustness of automotive smart power ICs through lab measurements with respect to application variance and parameter spread. Classical compliance to the product specification, where only minimum and maximum values are defined, is not enough to assess device robustness since complex transients of application components cannot be defined within single specification parameters. That is why application fitness becomes a necessary task to reduce device failures, which may occur in the application. One solution would be to enhance traditional lab verification methods with a concept that considers application and parameter spread. This innovative concept is demonstrated on an electronic throttle control application. It has been emulated in real-time, including power amplification and application-relevant parameters. Within this application space, Monte Carlo experiments were carried out to evaluate the influence of parameter spread on selected system characteristics. Finally, an appropriate metric was used to quantify the robustness of the micro-electronic device within its application.
12:00	2.3.2	STARTUP ERROR DETECTION AND CONTAINMENT TO IMPROVE THE ROBUSTNESS OF HYBRID FLEXRAY NETWORKS Speakers: Alexander Kordes¹, Bart Vermeulen², Abhijit Deb² and Michael Wahl¹ ¹University of Siegen, DE; ²NXP Semiconductors, NL Abstract The research and development on in-vehicle networks (IVNs) is driven by two main requirements: bandwidth and robustness. In this paper we address the robustness requirement. We focus on FlexRay IVNs that are used for safety-critical applications. We analyze and discuss faults that may affect the startup and operation of a FlexRay network. These failures may not only occur during the startup phase of the vehicle, but they may also happen due to a bus problem that requires the bus to be reinitialized during normal operation. Here any startup failure leads to a critical situation like a brake system failure. The fault scenarios we discuss in this paper are the resetting leading coldstart node (RLCN), the deaf coldstart node (DCN), and the babbling idiot (BI). These faults are described in literature, but neither the precise behavior of all involved nodes, nor a clear solution is provided to contain their impact. The idea of a bus guardian (BG) is given in a draft specification of the FlexRay consortium, but no details are given. In this paper, we extend these ideas by investigating and implementing a detailed BG concept, based on our fault analysis. We subsequently evaluate the successful containment of the three fault types in simulation. We also quantify the chip area cost of our solution.
12:30	2.3.3	A SELF-PROPAGATING WAKEUP MECHANISM FOR POINT-TO-POINT NETWORKS WITH PARTIAL NETWORK SUPPORT Speakers: Jan Reinke Seyler¹, Thilo Streichert¹, Juri Warkentin¹, Matthias Spägele¹, Michael Glaß² and Jürgen Teich² ¹Daimler AG, DE; ²University of Erlangen-Nuremberg, DE Abstract As a result of the increased demand for bandwidth, current automotive networks are getting more heterogeneous. New technologies like Ethernet as a packet-switched point-to-point network are introduced. Nevertheless, the requirements on stand-by power consumption and short activation times are still the same as for existing field buses. Ethernet does not provide wakeup mechanisms that are sufficient for automotive systems. As a remedy, this paper introduces a novel physical-layer mechanism called Low Frequency Wakeup that is largely independent of the communication technology and topology used. It provides parallel and remote wakeup for all nodes even in a point-to-point network as well as full support of partial networking. The overall wakeup detection time is smaller than 10 ms and every node can actively feed a wakeup signal asynchronously to all other nodes. In terms of latency, it is shown that Low Frequency Wakeup reaches a reduction of more than 30 % for a three-hop network and more than 50 % for a five-hop network in comparison to the current state-of-the-art technology for automotive point-to-point networks.
12:45	2.3.4	MULTI-VARIANT-BASED DESIGN SPACE EXPLORATION FOR AUTOMOTIVE EMBEDDED SYSTEMS Speakers: Sebastian Graf¹, Michael Glaß¹, Jürgen Teich¹ and Christoph Lauer² ¹University of Erlangen-Nuremberg, DE; ²AUDI AG Ingolstadt, DE Abstract This paper proposes a novel design method for modern automotive electrical and electronic (E/E) architecture component platforms. The addressed challenge is to derive an optimized component platform termed Baukasten where components, i.e., different manifestations of Electronic Control Units (ECUs), are reused across different car configurations, models, or even OEM companies. The proposed approach derives an efficient graph-based exploration model from defined functional variants. From this, a novel symbolic formulation of multi-variant resource allocation, task binding, and message routing serves as input for a state-of-the-art hybrid optimization technique to derive the individual architecture for each functional variant and the resulting Baukasten at once. For the first time, this enables a concurrent analysis and optimization of individual variants and the Baukasten. Given each manifestation of a component in the Baukasten induces production, storage, and maintenance overhead, we particularly investigate the trade-off between the number of different hardware variants and other established design objectives like monetary cost. We apply the proposed technique to a real-world automotive use case, i.e., a subsystem within the safety domain, to illustrate the advantages of the multi-variant-based design space exploration approach.
13:00	IP1-1, 417	SAFE: SECURITY-AWARE FLEXRAY SCHEDULING ENGINE Speakers: Gang Han¹, Haibo Zeng², Yaping Li³ and Wenhua Dou¹ ¹National University of Defense Technology, CN; ²McGill University, CA; ³The Chinese University of Hong Kong, CN Abstract In this paper, we propose SAFE (Security Aware FlexRay scheduling Engine), to provide a problem definition and a design framework for FlexRay static segment schedule to address the new challenge on security. From a high level specification of the application, the architecture and communication middleware are synthesized to satisfy security requirements, in addition to extensibility, costs, and end-to-end latencies. The proposed design process is applied to two industrial case studies consisting of a set of active safety functions and an X-by-wire system respectively.
13:01	IP1-2, 459	TRANSIENT ERRORS RESILIENCY ANALYSIS TECHNIQUE FOR AUTOMOTIVE SAFETY CRITICAL APPLICATIONS Speakers: Sujan Pandey and Bart Vermeulen, NXP Semiconductors, NL Abstract When a single bit is flipped as a result of a transient error in an electronic circuit, its effect can have a severe impact if the circuit is deployed in safety critical domains such as automotive, aeronautics, and industrial automation. In the design phase it is therefore essential to evaluate, and where necessary improve, the resilience of a circuit to all possible transient errors. In this paper, we present a method to analyze the transient error resiliency of a digital circuit. This method is based on an analytical model. It models a transient error as a random function and finds the vulnerable number of bits for each node. We perform a case study on a circuit implementation of a well-known adaptive filter algorithm. The results from the analytical and simulation models show that the analytical model is accurate enough to estimate the effects of transient errors on the performance of a digital circuit. Our analytical method also reduces the run time significantly in a design phase.
13:00		End of session Lunch Break in Exhibition Area Sandwich lunch

2.4 Modern Challenges in Analog and Mixed-Signal Design

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 2

Chair:
Georges Gielen, KU Leuven, BE

Co-Chair:
Günhan Dündar, Bogazici University, TR

The session addresses complex challenges in analogue and mixed-signal modeling and design. The regular papers present a novel, zonotope-based approach to non-linear macromodeling and a new layout technique in analogue IC design that avoids failures due to IR-drop and electromigration. The two short papers discuss new mechanisms to select solutions from multi-dimensional Pareto-optimal fronts and efficient recording of activities in CMOS neural networks.

Time	Label	Presentation Title Authors
11:30	2.4.1	ELECTROMIGRATION-AWARE AND IR-DROP AVOIDANCE ROUTING IN ANALOG MULTIPORT TERMINAL STRUCTURES Speakers: Ricardo Martins, Nuno Lourenco, António Canelas and Nuno Horta, Instituto de Telecomunicações, Instituto Superior Técnico – TU Lisbon, PT Abstract This paper describes an electromigration-aware and IR-Drop avoidance routing approach considering multiport multiterminal (MP/MT) signal nets of analog integrated circuits (IC). The effects of current densities and temperature in the interconnects may cause the malfunction/failure of a circuit due to IR-Drop or electromigration (EM). These become increasingly relevant with the ongoing reduction in circuit sizes caused by the evolution of the nanoscale integration processes. Therefore, EM and IR-Drop effects must be taken into account in the design of both power networks and signal wires of analog and mixed-signal ICs, to make their impact on the circuits' reliability negligible. In previous EM and IR-Drop-aware analog IC routing approaches, 'dot-models' are assumed for the terminals, i.e., each terminal has only one port that needs to be routed. In practice, however, analog standard cells usually contain multiple electrically-equivalent locations, often distributed over different fabrication layers, where legal connections can be made, i.e., MP terminals, which need to be properly explored. The design flow is detailed, and the applicability of the approach is demonstrated with experimental results and by generating the routing of an analog circuit structure for the UMC 130nm design process.
12:00	2.4.2	(Best Paper Award Candidate) ZONOTOPE-BASED NONLINEAR MODEL ORDER REDUCTION FOR FAST PERFORMANCE BOUND ANALYSIS OF ANALOG CIRCUITS WITH MULTIPLE-INTERVAL-VALUED PARAMETER VARIATIONS Speakers: Yang Song, Sai Manoj Pd and Hao Yu, Nanyang Technological University, SG Abstract It is challenging to efficiently evaluate performance bound of high-precision analog circuits with multiple parameter variations at nano-scale. In this paper, a nonlinear model order reduction is proposed to deploy zonotope-based model for multiple-interval-valued parameter variations. As such, one can have a zonotope-based reachability analysis to generate a set of trajectories with performance bound defined. By further constructing local parameterized subspaces to approximate a number of zonotopes along the set of trajectories, one can perform nonlinear model order reduction to generate the performance bound under parameter variations. As shown by numerical experiments, the zonotope-based nonlinear macromodeling by order of 19 achieves up to 500x speedup when compared to Monte Carlo simulations of the original model; and up to 50% smaller error when compared to previous parameterized nonlinear macromodeling under the same order.
12:30	2.4.3	IMPLEMENTATION ISSUES IN THE HIERARCHICAL COMPOSITION OF PERFORMANCE MODELS OF ANALOG CIRCUITS Speakers: Manuel Velasco-Jiménez, Rafael Castro-López, Elisenda Roca and Francisco Fernández, IMSE-CNM, CSIC and Universidad de Sevilla, ES Abstract Emerging hierarchical design methodologies based on the use of Pareto-optimal fronts (PoFs) are promising candidates to reduce the bottleneck caused by the design of complex analog circuits. However, little work has been reported about how to transmit the information provided by the PoFs of low hierarchical level blocks through the hierarchy to compose the performance models of higher level blocks. This composition actually poses several problems such as the dependence of the PoF performances on the surrounding circuitry and the complexity of dealing with multi-dimensional PoFs in order to explore more efficiently the design space. To deal with these problems, this paper proposes new mechanisms to represent and select candidate solutions from multi-dimensional PoFs that are transformed to the changing operating conditions enforced by the surrounding circuitry. These mechanisms are demonstrated with the generation of the performance model of an active filter by composing previously generated PoFs of operational amplifiers.
12:45	2.4.4	MODELING OF AN ANALOG RECORDING SYSTEM DESIGN FOR ECOG AND AP SIGNALS Speakers: Nils Heidmann¹, Nico Hellwege¹, Tim Hoehlein¹, Thomas Westphal¹, Dagmar Peters-Drolshagen¹ and Steffen Paul² ¹University of Bremen, DE; ²University Bremen, DE Abstract The recording of neural activities has turned out to be a promising approach to understand the basic function of specific brain parts like the visual or motor cortex. However, the development and design of advanced neural recording systems is very challenging since the number of parallel measurement channels increases continuously. Besides the analog recording channels, digital preprocessing becomes mandatory to handle the corresponding amount of data and to adapt this data to the available transmission bandwidth. In this paper we present the design as well as the behavioral modeling of an analog recording front-end. Simulation and measurement results demonstrate the performance of the system for recording neural signals. Since simulation of this analog front-end is very time consuming but essential for large fully-integrated designs, a mixed-signal model approach is introduced that enables a significant simulation acceleration of integrated and external analog front-ends. The simulation can be accelerated by a factor of up to 22.2 for a single front-end. The proposed system has been fabricated in a 0.35 µm CMOS technology and its performance has been measured. This demonstrates that the behavioral model is compatible with the transistor-level design. A neural spike detector shows the transient performance of the modeled design on real input stimuli.
13:00	IP1-3, 411	MODEL BASED HIERARCHICAL OPTIMIZATION STRATEGIES FOR ANALOG DESIGN AUTOMATION Speakers: Engin Afacan¹, Gunhan Dundar¹, Faik Baskaya¹, Simge Ay¹ and Francisco Fernandez² ¹Bogazici University, TR; ²Universidad de Sevilla, ES Abstract The design of complex analog circuits by using flat optimization-based approaches is inefficient, even impossible, due to the high number of design variables and the growth of the cost of performance evaluation with the circuit size. Over the past two decades, top-down hierarchical design approaches have been developed and applied. They are based on hierarchical circuit decomposition and specification transmission from top-level to lower level blocks. However, such specification transmission is usually performed with little knowledge on the feasibility of the specifications, leading, therefore, to costly redesign iterations. Even if the specification transmission is successful, there is no guarantee that it is optimal in terms of e.g., power consumption or area occupation. To palliate this problem, two novel model-based hierarchical synthesis methods are proposed in this paper: Model-Based Hierarchical Optimization (MBHO) and Improved Model-Based Hierarchical Optimization (IMBHO). They are based on the concurrent design at higher and lower hierarchical levels and appropriate communication between the different processes. Experimental results on a filter example comparing the new approaches and the conventional top-down design approach are provided.
13:01	IP1-4, 925	A NOVEL LOW POWER 11-BIT HYBRID ADC USING FLASH AND DELAY LINE ARCHITECTURES Speakers: Hsun-Cheng Lee and Jacob Abraham, the University of Texas at Austin, US Abstract This paper presents a novel low power 11-bit hybrid ADC using flash and delay line architectures, where a 4-bit flash ADC is followed by a 7-bit delay-line ADC. This hybrid ADC inherits accuracy and power efficiency from flash ADCs and delay-line ADCs, respectively. Also, in order to reduce the power of the first stage flash ADC, a power-saving technique is adopted by biasing the DC tail current of the pre-amplifiers at 5 μA in stand-by mode instead of the operational current of 47 μA. The hybrid ADC was designed and simulated in a commercial 65nm process. With a 1.1 V supply and 100 MS/s, the ADC achieves an SNDR of 60 dB and consumes 1.6 mW, which results in a figure of merit (FOM) of 19.4 fJ/conversion-step without any calibration technique. Also, Monte Carlo simulations are performed with a 3σ device mismatch for the SNDR estimation, and the SNDR is observed to be better than 58.5 dB.
13:02	IP1-5, 752	SEMI-SYMBOLIC ANALYSIS OF MIXED-SIGNAL SYSTEMS INCLUDING DISCONTINUITIES Speakers: Carna Radojicic, Christoph Grimm, Javier Moreno and Xiao Pan, TU Kaiserslautern, DE Abstract The paper describes an approach for semi-symbolic analysis of mixed-signal systems that contain discontinuous functions, e.g. due to modeling comparators. For modeling and semi-symbolic simulation, we use extended Affine Arithmetic. Affine Arithmetic is currently limited to accurate analysis of linear functions and mild non-linear functions, but not yet discontinuities. In this paper we extend the approach to also handle discontinuities. For demonstration, we symbolically analyze a Σ∆-modulator.
13:03	IP1-6, 927	(Best Paper Award Candidate) NOVEL CIRCUIT TOPOLOGY SYNTHESIS METHOD USING CIRCUIT FEATURE MINING AND SYMBOLIC COMPARISON Speakers: Cristian Ferent and Alex Doboli, Stony Brook University, US Abstract This paper presents a reasoning-based approach to analog circuit synthesis using ordered node clustering representations (ONCR) to describe alternative circuit features and symbolic circuit comparison to characterize performance trade-offs of synthesized solutions. Case studies illustrate application of the proposed methods to topology selection and refinement.
13:04	IP1-7, 468	AN EMBEDDED OFFSET AND GAIN INSTRUMENT FOR OPAMP IPS Speakers: Jinbo Wan and Hans Kerkhoff, CAES-TDT, CTIT, University of Twente, NL Abstract Analog and mixed-signal IPs are increasingly required to use digital fabrication technologies and are deeply embedded into system-on-chips (SoC). These developments add more requirements and challenges to analog testing methodologies. Traditional analog testing methods suffer from reduced accessibility and control with regard to these embedded analog circuits in SoCs. As an alternative, an embedded instrument for analog OpAmp IP tests is proposed in this paper. It can provide the exact gain and offset values of OpAmps instead of only a pass/fail result. Moreover, it is a non-invasive monitor and can work online without isolating the DUT OpAmp from its surrounding feedback networks. Nor does it require accurate test stimuli. In addition, the monitor can remove its own offsets without additional complex self-calibration circuits. All self-calibrations are completed in the digital domain after each measurement in real time. Therefore it is also suitable for aging-sensitive applications, in which the monitor itself may suffer from aging mechanisms and additional offset drifts. The monitor measurement range for offset is from 0.2mV to 70mV, and for gain it is from 0dB to 40dB. The error for offset measurements can be 10% of the measurement value with plus/minus 0.1mV, and -2.5dB for gain measurements.
13:00		End of session Lunch Break in Exhibition Area Sandwich lunch
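
As a rough cross-check of the figure of merit quoted for the hybrid ADC in IP1-4, the Walden figure of merit can be recomputed from the reported power, sample rate and SNDR. The sketch below assumes the common definitions ENOB = (SNDR - 1.76) / 6.02 and FOM = P / (2^ENOB * f_s); the abstract does not state which definition the authors used, and the function names are illustrative only.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits from SNDR in dB (standard sine-wave formula)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w: float, sample_rate_hz: float, sndr_db: float) -> float:
    """Walden figure of merit in joules per conversion step (assumed definition)."""
    return power_w / (2.0 ** enob_from_sndr(sndr_db) * sample_rate_hz)


# Values taken from the abstract: 1.6 mW, 100 MS/s, 60 dB SNDR.
enob = enob_from_sndr(60.0)
fom_fj = walden_fom(1.6e-3, 100e6, 60.0) * 1e15  # convert J to fJ
print(f"ENOB = {enob:.2f} bits, FOM = {fom_fj:.1f} fJ/conversion-step")
```

With the SNDR taken as exactly 60 dB, this gives roughly 19.6 fJ/conversion-step, in the same range as the 19.4 fJ quoted in the abstract; the small gap would be consistent with rounding of the reported SNDR or power.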

2.5 Low-Power and Efficient Architectures

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 3

Chair:
Cristina Silvano, Politecnico di Milano, IT

Co-Chair:
Todd Austin, University of Michigan, US

This session presents three papers on energy efficiency in memory-intensive systems. The first paper aims at energy-efficient scheduling of cooperative-thread arrays on GPGPUs for memory intensive workloads through throttling of warps on different cores. The second paper leverages the application-specific knowledge of the next-generation parallelized high-efficiency video encoder to design a distributed scratchpad memory system with adaptive SPM data allocation and power management. The third paper explores the feasibility of non-volatile memories for instruction caches to improve energy efficiency. To handle the write delay and energy issues of NVMs, an analysis and extensions to the miss status handling registers are proposed.

Time	Label	Presentation Title Authors
11:30	2.5.1	ENERGY-EFFICIENT SCHEDULING FOR MEMORY-INTENSIVE GPGPU WORKLOADS Speakers: Seokwoo Song¹, Minseok Lee¹, John Kim¹, Woong Seo², Yeongon Cho² and Soojung Ryu² ¹KAIST, KR; ²Samsung, KR Abstract High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative-thread array) Scheduling (TCS) where we leverage two types of throttling - throttling the number of active cores and throttling of warp execution in the cores - to improve energy-efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA or thread block scheduler to reduce the number of cores with assigned thread blocks while leveraging the local warp scheduler to throttle memory requests for some of the cores to further reduce power consumption. The proposed TCS scheduling does not require off-line analysis but can be done dynamically during execution. Instead of relying on conventional metrics such as misses-per-kilo-instruction (MPKI), we leverage the memory access latency metric to determine the memory intensity of the workloads. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workloads while having very little impact on performance for compute-intensive workloads.
12:00	2.5.2	DSVM: ENERGY-EFFICIENT DISTRIBUTED SCRATCHPAD VIDEO MEMORY ARCHITECTURE FOR THE NEXT-GENERATION HIGH EFFICIENCY VIDEO CODING Speakers: Felipe Sampaio¹, Muhammad Shafique², Bruno Zatt³, Sergio Bampi⁴ and Jörg Henkel² ¹Federal University of Rio Grande do Sul, BR; ²Karlsruhe Institute of Technology (KIT), DE; ³Federal University of Pelotas, BR; ⁴Federal University of Rio Grande do Sul, BR Abstract An energy-efficient distributed Scratchpad Video Memory Architecture (dSVM) for the next-generation parallel High Efficiency Video Coding is presented. Our dSVM combines private and overlapping (shared) Scratchpad Memories (SPMs) to support data reuse within and across different cores concurrently executing multiple parallel HEVC threads. We developed a statistical method to size and design the organization of the SPMs along with a supporting memory reading policy for energy efficiency. The key is to leverage the HEVC and video content knowledge. Furthermore, we integrate an adaptive power management policy for SPMs to manage the power states of different memory parts at run time depending upon the varying video content properties. Our experimental results illustrate that our dSVM architecture reduces the overall memory energy consumption by up to 51%-61% compared to parallelized state-of-the-art solutions [11]. The dSVM external memory energy savings increase with an increasing number of parallel HEVC threads and size of search window. Moreover, our SPM power management reacts to the current video properties and achieves up to 54% on-chip leakage energy savings.
12:30	2.5.3	FEASIBILITY EXPLORATION OF NVM BASED I-CACHE THROUGH MSHR ENHANCEMENTS Speakers: Manu Komalan¹, José Ignacio Gómez Pérez², Christian Tenllado², Praveen Raghavan³, Matthias Hartmann³ and Francky Catthoor³ ¹imec, UCM (Universidad Complutense de Madrid), ES; ²Universidad Complutense de Madrid, ES; ³imec, BE Abstract SRAM based memory systems are plagued by a number of problems like sub-threshold leakage and susceptibility to read/write failure with dynamic voltage scaling schemes or low supply voltage. Non-Volatile Memory (NVM) technologies are being explored extensively nowadays to replace the conventional SRAM memories even for level 1 (L1) caches. These NVMs like Spin Torque Transfer RAM (STT-MRAM), Resistive-RAM (ReRAM) and Phase Change RAM (PRAM) are less hindered by leakage problems with technology scaling and consume less area. However, simple replacement of SRAM by NVMs is not a viable option due to their write-related issues. The main focus of this paper is the exploration of write delay and write energy issues in an NVM based L1 Instruction cache (I-cache) for an ARM-like single core system. We propose an NVM I-cache and extend its MSHR (Miss Status Handling Register) functionality to address the NVMs' write-related issues. According to our simulations, appropriate tuning of selected architecture parameters can reduce the performance penalty introduced by the NVM (∼45%) to extremely tolerable levels (∼1%) and show energy gains up to 35%. Furthermore, on configuring our modified NVM based system to occupy area comparable to the original SRAM-based configuration, it outperforms the SRAM baseline and leads to even more energy savings.
13:00	IP1-8, 266	EVX: VECTOR EXECUTION ON LOW POWER EDGE CORES Speakers: Milovan Duric¹, Oscar Palomar¹, Aaron Smith², Osman Unsal¹, Adrian Cristal¹, Mateo Valero¹ and Doug Burger² ¹Barcelona Supercomputing Center, ES; ²Microsoft Research, US Abstract In this paper, we present a vector execution model that provides the advantages of vector processors on low power, general purpose cores, with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases the efficiency and hardware resources utilization. We use a modest dual issue core based on an Explicit Data Graph Execution (EDGE) architecture to implement our approach, called EVX. Unlike most DLP accelerators which utilize additional hardware and increase the complexity of low power processors, EVX leverages the available resources of EDGE cores, and with minimal costs allows for specialization of the resources. EVX adds a control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x compared to a scalar baseline and outperforms multimedia SIMD extensions.
13:01	IP1-9, 730	PROGRAM AFFINITY PERFORMANCE MODELS FOR PERFORMANCE AND UTILIZATION Speakers: Ryan Moore and Bruce Childers, University of Pittsburgh, US Abstract Multithreaded applications have a wide variety of behavior, causing complex interactions with today's chip multiprocessor machines. Application threads may have large private working sets, and may compete for cache space and memory bandwidth. These threads benefit from large private caches. Other threads may share data or communicate, and thus, execute more quickly if using shared caches. Many applications fall somewhere in between, requiring careful thread-to-core assignments to maximize performance. Yet because of the large number of thread-to-core assignments on today's chip multiprocessors, it is time and energy prohibitive to exhaustively try and determine the best assignment. In this paper, we present and demonstrate application performance models that predict application performance given a proposed thread-to-core assignment. We show how these models can be quickly built and used to select thread-to-core assignments for multiple programs and to improve system utilization.
13:02	IP1-10, 791	ADVANCED SIMD: EXTENDING THE REACH OF CONTEMPORARY SIMD ARCHITECTURES Speakers: Matthias Boettcher¹, Giacomo Gabrielli², Mbou Eyole², Alastair Reid² and Bashir M. Al-Hashimi¹ ¹University of Southampton, GB; ²ARM Ltd., GB Abstract SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g. Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity. This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed. We developed an ARMv7 NEON based ISA extension (ARGON), augmented a cycle accurate simulation framework for it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.
13:03	IP1-11, 898	A TIGHTLY-COUPLED HARDWARE CONTROLLER TO IMPROVE SCALABILITY AND PROGRAMMABILITY OF SHARED-MEMORY HETEROGENEOUS CLUSTERS Speakers: Paolo Burgio¹, Robin Danilo², Andrea Marongiu³, Philippe Coussy⁴ and Luca Benini⁵ ¹University of Bologna, Université de Bretagne-Sud, IT; ²Université de Bretagne-Sud, FR; ³University of Bologna, IT; ⁴Universite de Bretagne-Sud / Lab-STICC, FR; ⁵Università di Bologna, IT Abstract Modern designs for embedded many-core systems increasingly include application-specific units to accelerate key computational kernels with orders-of-magnitude higher execution speed and energy efficiency compared to software counterparts. A promising architectural template is based on heterogeneous clusters, where simple RISC cores and specialized HW Processing Units (HWPUs) communicate in a tightly-coupled manner via L1 shared memory. Efficiently integrating processors and a high number of HWPUs in such a system poses two main challenges, namely, architectural scalability and programmability. In this paper we describe an optimized Data Pump (DP) which connects several accelerators to a restricted set of communication ports, and acts as a virtualization layer for programming, exposing FIFO queues to offload "HW tasks" to them through a set of lightweight APIs. In this work, we aim at optimizing both of these mechanisms, to reduce module area and to make the programming sequence easier and lighter, respectively.
13:00		End of session Lunch Break in Exhibition Area Sandwich lunch

2.6 Real-Time memory hierarchies

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 4

Chair:
Benny Akesson, CTU Prague, CZ

Co-Chair:
Marco Di Natale, Scuola Superiore Sant'Anna, IT

The papers in this session deal with analysis and management of memory hierarchies for complex real-time systems, both from the deterministic and the probabilistic point of view.

Time	Label	Presentation Title Authors
11:30	2.6.1	(Best Paper Award Candidate) ON THE CORRECTNESS, OPTIMALITY AND PRECISION OF STATIC PROBABILISTIC TIMING ANALYSIS Speakers: Sebastian Altmeyer¹ and Robert Davis² ¹University of Amsterdam, NL; ²University of York, GB Abstract In this paper, we investigate Static Probabilistic Timing Analysis (SPTA) for single processor systems that use a cache with an evict-on-miss random replacement policy. We show that previously published formulae for the probability of a cache hit can produce results that are optimistic and unsound when used to compute probabilistic Worst-Case Execution Time (pWCET) distributions. We investigate the correctness, optimality, and precision of different approaches to SPTA. We prove that one of the previously published formulae for the probability of a cache hit is optimal with respect to the limited information that it uses. We improve upon this formulation by using extra information about cache contention. To investigate the precision of various approaches to SPTA, we introduce a simple exhaustive method that computes a precise pWCET distribution, albeit at the cost of exponential complexity. Further, we integrate this precise approach, applied to small numbers of frequently accessed memory blocks, with imprecise analysis of other memory blocks, to form a combined approach that improves precision, without significantly increasing its complexity. The performance of the various approaches is compared on benchmark programs.
12:00	2.6.2	WCET-CENTRIC DYNAMIC INSTRUCTION CACHE LOCKING Speakers: Huping Ding¹, Yun Liang² and Tulika Mitra¹ ¹School of Computing, National University of Singapore, SG; ²Center for Energy-efficient Computing and Applications, School of EECS, Peking University, CN Abstract Cache locking is an effective technique to improve timing predictability in real-time systems. In static cache locking, the locked memory blocks remain unchanged throughout the program execution. Thus static locking may not be effective for large programs where multiple memory blocks are competing for few cache lines available for locking. In comparison, dynamic cache locking overcomes cache space limitation through time-multiplexing of locked memory blocks. Prior dynamic locking technique partitions the program into regions and takes independent locking decisions for each region. We propose a flexible loop-based dynamic cache locking approach. We not only select the memory blocks to be locked but also the locking points (e.g., loop level). We judiciously allow memory blocks from the same loop to be locked at different program points for WCET improvement. We design a constraint-based approach that incorporates a global view to decide on the number of locking slots at each loop entry point and then select the memory blocks to be locked for each loop. Experimental evaluation shows that our dynamic cache locking approach achieves substantial improvement of WCET compared to prior techniques.
12:30	2.6.3	MINIMIZING STACK MEMORY FOR HARD REAL-TIME APPLICATIONS ON MULTICORE PLATFORMS Speakers: Chuansheng Dong and Haibo Zeng, McGill University, CA Abstract Multicore platforms are increasingly used in real-time embedded applications. In the development of such applications, an efficient use of RAM is as important as the effective scheduling of software tasks. Preemption Threshold Scheduling is a well-known technique for controlling the degree of preemption, possibly improving system schedulability, and allowing savings in stack space. In this paper, we target the optimal mapping of tasks to cores and the assignment of the scheduling parameters for systems scheduled with preemption thresholds. We formulate the optimization problems using a Mixed Integer Linear Programming framework, and propose an efficient heuristic as an alternative. We demonstrate the efficiency and quality of both approaches with extensive experiments using random systems as well as two industrial case studies.
12:45	2.6.4	TIME-PREDICTABLE EXECUTION OF MULTITHREADED APPLICATIONS ON MULTICORE SYSTEMS Speakers: Ahmed Alhammad and Rodolfo Pellizzoni, University of Waterloo, CA Abstract In multicore systems, contention for access to main memory between application threads complicates timing analysis and may lead to pessimistic bounds on execution time. This is particularly problematic for real-time applications, which require provable bounds on worst-case performance. In this work, we employ a predictable execution model to schedule memory accesses performed by application threads without relying on unpredictable hardware arbiters. In addition, we statically schedule the application's threads with the objective of minimizing the application's makespan. Our experimental evaluation on a 4-core system with NAS Parallel Benchmarks indicates that the proposed execution scheme yields an aggregated improvement of 21% over contention execution, in which the application's threads access the main memory in an uncontrolled manner.
13:00		End of session Lunch Break in Exhibition Area Sandwich lunch
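
The evict-on-miss random replacement policy analysed in 2.6.1 can be explored empirically with a small Monte Carlo simulation. The sketch below is not the SPTA method of the paper; it assumes a fully associative cache that starts empty, and the names `last_access_hits` and `hit_probability` are hypothetical. It simply estimates how often the final access of a trace hits.

```python
import random


def last_access_hits(trace, num_lines, rng):
    """Run one trace through an evict-on-miss random replacement cache.

    Returns True if the final access in the trace is a cache hit.
    """
    cache = [None] * num_lines  # assumed: fully associative, initially empty
    hit = False
    for block in trace:
        hit = block in cache
        if not hit:
            # On a miss, a uniformly chosen line is evicted and refilled.
            cache[rng.randrange(num_lines)] = block
    return hit


def hit_probability(trace, num_lines, trials=20000, seed=1):
    """Estimate the hit probability of the last access by repeated simulation."""
    rng = random.Random(seed)
    hits = sum(last_access_hits(trace, num_lines, rng) for _ in range(trials))
    return hits / trials


estimate = hit_probability(["a", "b", "a"], num_lines=2)
print(f"Estimated hit probability for the final access: {estimate:.3f}")
```

For the trace a, b, a on a two-line cache, the single intervening miss evicts block a with probability 1/2, so the estimate should come out close to 0.5. Exhaustive or analytical approaches such as those discussed in the session replace this sampling with exact reasoning.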

2.7 Yield and Reliability for Robust Systems

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 5

Chair:
Joan Figueras, UPC, ES

Co-Chair:
Jose Pineda de Gyvez, NXP, NL

Robustness is increasingly a requirement for SOCs and memories, and effects such as wearout, BTI, and soft errors are important to consider as part of design. Another important component of robust design is tolerance of rare events. Understanding design robustness helps predict and enhance yield.

Time	Label	Presentation Title Authors
11:30	2.7.1	(Best Paper Award Candidate) COMPREHENSIVE ANALYSIS OF ALPHA AND NEUTRON PARTICLE-INDUCED SOFT ERRORS IN AN EMBEDDED PROCESSOR AT NANOSCALES Speakers: Mojtaba Ebrahimi¹, Adrian Evans², Mehdi B. Tahoori¹, Razi Seyyedi¹, Enrico Costenaro² and Dan Alexandrescu² ¹Karlsruhe Institute of Technology, DE; ²iRoC Technologies, FR Abstract Radiation-induced soft errors have become a key challenge in advanced commercial electronic components and systems. We present results of Soft Error Rate (SER) analysis of an embedded processor. Our SER analysis platform accurately models all generation, propagation and masking effects starting from a technology response model derived using TCAD simulations at the device level all the way to application masking. The platform employs a combination of empirical models at the device level, analytical error propagation at logic level and fault emulation at the architecture/application level to provide the detailed contribution of each component (flip-flops, combinational gates, and SRAMs) to the overall SER. At each stage in the modeling hierarchy, an appropriate level of abstraction is used to propagate the effect of errors to the next higher level. Unlike previous studies which are based on very simple test chips, analyzing the entire processor gives more insight into the contributions of different components to the overall SER. The results of this analysis can assist circuit designers in adopting effective hardening techniques to reduce the overall SER while meeting required power and performance constraints.
12:00	2.7.2	BIAS TEMPERATURE INSTABILITY ANALYSIS OF FINFET BASED SRAM CELLS Speakers: Seyab Khan¹, Innocent Agbo², Said Hamdioui³, Halil Kukner⁴, Ben Kaczer⁴, Praveen Raghavan⁵ and Francky Catthoor⁴ ¹Technical University Delft, NL; ²TU Delft, NL; ³Delft University of Technology, NL; ⁴IMEC, BE; ⁵imec, BE Abstract Bias Temperature Instability (BTI) is posing a major reliability challenge for today's and future nano-devices as it degrades their performance. This paper provides a comprehensive analysis of BTI impact, in terms of time-dependent degradation, on FinFET based SRAM cells. The evaluation metrics are read Static Noise Margin (SNM), hold SNM and Write Trip Point (WTP), while the aspects investigated are the dependence on the supply voltage, cell strength, and design style (6 versus 8 transistor cells). A comparison between the degradation of FinFET and planar CMOS based SRAM cells is also covered. The simulation results for FinFET based cells show that: (a) Read SNM of the cell degrades more (by 16.72%) than the other metrics (6.82% in WTP and 14.19% in hold SNM) (b) 12% increment in the cell's supply voltage enhances its read SNM by 9% (c) Strengthening only the pull-down transistors in the cell by 1.5 reduces BTI induced read SNM degradation by 26.61% (d) 8T SRAM cells have 1.43 higher WTP than 6T cells; however, the cells suffer from 31.13% higher read SNM and 8.05% higher hold SNM degradations than 6T SRAM cells and (e) FinFET based SRAM cells are more vulnerable to BTI degradation than planar CMOS based cells.
12:30	2.7.3	SSFB: A HIGHLY-EFFICIENT AND SCALABLE SIMULATION REDUCTION TECHNIQUE FOR SRAM YIELD ANALYSIS Speakers: Manish Rana and Ramon Canal, Universitat Politecnica de Catalunya, ES Abstract Estimating extremely low SRAM failure probabilities by the conventional Monte Carlo (MC) approach requires hundreds of thousands of simulations, making it impractical. To alleviate this problem, failure-probability estimation methods with a smaller number of simulations have recently been proposed, most notably variants of consecutive mean-shift based Importance Sampling (IS). In this method, a large amount of time is spent simulating data points that will eventually be discarded in favor of other data points with minimum norm. This can potentially increase the simulation time by orders of magnitude. To address this important limitation, in this paper, we introduce SSFB: a novel SRAM failure-probability estimation method that has much better cognizance of the data points compared to conventional approaches. The proposed method starts with radial simulation of a single point and reduces discarded simulations by: a) random sampling -only- when it reaches a failure boundary and after that continues again with radial simulation of a chosen point, and b) random sampling is performed -only- within a specific failure-range which decreases in each iteration. The proposed method is also scalable to higher dimensions (more input variables) as sampling is done on the surface of the hyper-sphere, rather than within the hyper-sphere as other techniques do. Our results show that using our method we can achieve an overall 40x reduction in simulations compared to consecutive mean-shift IS methods while remaining within the 0.01-Sigma accuracy.
13:00	IP1-12, 861	INFORMER: AN INTEGRATED FRAMEWORK FOR EARLY-STAGE MEMORY ROBUSTNESS ANALYSIS Speakers: Shrikanth Ganapathy¹, Ramon Canal¹, Dan Alexandrescu², Enrico Costenaro², Antonio Gonzalez³ and Antonio Rubio¹ ¹Universitat Politecnica de Catalunya, ES; ²iRoC Technologies, FR; ³Intel and Universitat Politecnica de Catalunya, ES Abstract With the growing importance of parametric (process and environmental) variations in advanced technologies, it has become a serious challenge to design reliable, fast and low-power embedded memories. Adopting a variation-aware design paradigm requires a holistic perspective of memory-wide metrics such as yield, power and performance. However, accurate estimation of such metrics is largely dependent on circuit implementation styles, technology parameters and architecture-level specifics. In this paper, we propose a fully automated tool - INFORMER that helps high-level designers estimate memory reliability metrics rapidly and accurately. The tool relies on accurate circuit-level simulations of failure mechanisms such as ageing, soft-errors and parametric failures. The obtained statistics can then help couple low-level metrics with higher-level design choices. A new technique for rapid estimation of low-probability failure events is also proposed. We present three use-cases of our prototype tool to demonstrate its diverse capabilities in autonomously guiding large SRAM based robust memory designs.
13:01	IP1-13, 121	WEAR-OUT ANALYSIS OF ERROR CORRECTION TECHNIQUES IN PHASE-CHANGE MEMORY Speakers: Caio Hoffman, Luiz Ramos, Rodolfo Azevedo and Guido Araújo, University of Campinas, BR Abstract Phase-Change Memory (PCM) is a new memory technology and a possible replacement for DRAM, whose scaling limitations require new lithography technologies. Despite being promising, PCM has limited endurance (its cells withstand roughly 10^8 bit-flips before failing), which prompted the adoption of Error Correction Techniques (ECTs). However, previous lifetime analyses of ECTs did not consider the difference between the bit-flip frequencies of data and code bits, which may lead to inaccurate wear-out analyses for the ECTs. In this work, we improve the wear-out analysis of PCM by modeling and analyzing the bit-flip probabilities of five ECTs. Our models also enable an accurate estimation of energy consumption and analysis of the endurance-energy trade-off for each ECT.
13:02	IP1-14, 344	APPROXIMATING THE AGE OF RF/ANALOG CIRCUITS THROUGH RE-CHARACTERIZATION AND STATISTICAL ESTIMATION Speakers: Doohwang Chang¹, Sule Ozev¹, Ozgur Sinanoglu² and Ramesh Karri³ ¹Arizona State University, US; ²New York University Abu Dhabi, AE; ³Polytechnic Institute of New York University, US Abstract Counterfeit ICs have become an issue for semiconductor manufacturers due to impacts on their reputation and lost revenue. Counterfeit ICs are either products that are intentionally mislabeled or legitimate products that are extracted from electronic waste. The former is easier to detect whereas the latter is harder since they are identical to new devices but display degraded performance due to environmental and use stress conditions. Detecting counterfeit ICs that are extracted from electronic waste requires an approach that can approximate the age of manufactured devices based on their parameters. In this paper, we present a methodology that uses information on both fresh and aged ICs and tries to distinguish between the fresh and aged population based on an estimate of the age. Since analog devices age mainly due to their bias stress, input signals play less of a role. Hence, it is possible to use simulation models to approximate the aging process, which would give us access to a large population of aged devices. Using this information, we can construct a statistical model that approximates the age of a given circuit. We use a Low noise amplifier (LNA) and an NMOS LC oscillator to demonstrate that individual aged devices can be accurately classified using the proposed method.
13:00		End of session Lunch Break in Exhibition Area Sandwich lunch
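
Illustrative note on talk 2.7.3: the abstract contrasts plain Monte Carlo with mean-shift Importance Sampling for estimating very low failure probabilities. The sketch below shows only that generic contrast on a toy problem; it is not the SSFB method. The one-dimensional standard normal parameter, the 4-sigma failure threshold and all function names are assumptions made for illustration.

```python
import math
import random


def failure(x, threshold=4.0):
    # Hypothetical failure criterion: the toy "cell" fails beyond the threshold.
    return x > threshold


def plain_monte_carlo(n, rng):
    # Draw from the nominal N(0, 1) distribution and count failures.
    hits = sum(1 for _ in range(n) if failure(rng.gauss(0.0, 1.0)))
    return hits / n


def mean_shift_importance_sampling(n, shift, rng):
    # Draw from N(shift, 1) so that failures are common, then reweight each
    # failing sample by the likelihood ratio N(0,1)/N(shift,1).
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        if failure(x):
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n


def exact_tail(threshold=4.0):
    # Closed-form tail probability of a standard normal, for reference.
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact      :", exact_tail())
    print("plain MC   :", plain_monte_carlo(20000, rng))
    print("mean-shift :", mean_shift_importance_sampling(20000, 4.0, rng))
```

With the same sample budget, the plain estimator usually sees no or very few failures, while the shifted estimator samples near the failure boundary and corrects for the shift through the weights.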

2.8 Hot Topic: Technology Transfer towards Horizon 2020

Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Exhibition Theatre

Organiser:
Rainer Leupers, RWTH Aachen, DE

Chair:
Norbert Wehn, TU Kaiserslautern, DE

European research projects produce many excellent results, and the quality of research papers at DATE and other major European conferences is often outstanding. But how many academic research results in computing technologies and EDA actually make it into industrial practice? In the context of the transition into the Horizon 2020 framework program, the European research community is currently investigating novel ways of stimulating additional academia-industry technology transfer. This special session contributes by discussing concrete transfer experiences and new concepts. Furthermore it will exemplify several success stories from both academic and industrial perspectives.

Time	Label	Presentation Title Authors
11:30	2.8.1	THE TETRACOM APPROACH TO TECHNOLOGY TRANSFER Speaker: Rainer Leupers, RWTH Aachen University, DE Abstract The mission of TETRACOM is to boost European academia-to-industry technology transfer (TT) in all domains of Computing Systems. The key differentiator of TETRACOM is a novel instrument called Technology Transfer Project (TTP). TTPs help to lower the barrier for researchers to make the first steps towards commercialisation of their research results. TTPs are designed to provide incentives for TT at small to medium scale via partial funding of dedicated, well-defined, and short term academia-industry collaborations that bring concrete R&D results into industrial use. This will be implemented via competitive calls for TTP proposals. It is expected to fund up to 50 TTPs. The TTP activities will be complemented by Technology Transfer Infrastructures (TTIs) that provide training, service, and dissemination actions. These are designed to encourage a larger fraction of the R&D community to engage in TTPs, possibly even for the first time. Altogether, TETRACOM is conceived as the major pilot project of its kind in the area of Computing Systems, acting as a TT catalyst for the mutual benefit of academia and industry. It is expected to acquire more than 20 new contractors over the project duration. TETRACOM complements and actually precedes the use of existing financial instruments such as venture capital or business angels based funding.
11:45	2.8.2	LEVERAGING EUROPEAN RESEARCH TO CREATE VALUE Speaker: Marco Roodzant, ACE Associated Compiler Experts bv, NL Abstract Experiences from bringing advanced system-software R&D to global industrial use by a European high-tech SME. Drawing on some of the European R&D projects and their results from our 38 years of history, which show both business success and failure, we will explain some critical success factors in the different phases of technology transfer in our specific domain.
12:00	2.8.3	SUCCESSFUL TECHNOLOGY TRANSFER - SHARING OF EXPERIENCE Speaker: Johannes Stahl, Synopsys, Inc., US Abstract We will highlight where we see the value of cooperation with universities. We will refer to what the researchers need to do and what industry has to do to make for a successful technology transfer. Our contribution will be based on many years of experience working with RWTH Aachen as our lead university partner.
12:15	2.8.4	FROM RESEARCH TO MARKET: CASE STUDIES IN THE FIELD OF INNOVATIVE INTEGRATED ARCHITECTURES Speaker: Luca Fanucci, University of Pisa, IT Abstract We will present some technology transfer experiences from research to industry of innovative integrated circuit architectures for different application fields (automotive, multimedia and industrial). Starting from the analysis of the relevant scenarios we will discuss the adopted Research/Industry collaboration model based on know-how and human resources sharing. Our intent is to highlight main key points in order to have a successful research/industry technology transfer up to the market.
12:30	2.8.5	OPEN SOURCE TECHNOLOGY TRANSFER: SCENARIOS AND VALUE CREATION Speaker: Albert Cohen, INRIA, FR Abstract When value creation and business cases are at stake, free and open-source software is perceived as an opportunity and also as a threat. We will go through selected examples of the successful transfer of research results into industrial use, based exclusively or in part on open source platforms. The talk will build on personal experience conducting research in production compilers, and collecting the experiences of fellow researchers at IRILL, a joint initiative of INRIA and two French Universities promoting research and innovation on free software.
12:45	2.8.6	SUPPORTING INTERNATIONAL TECHNOLOGY TRANSFER: OBJECTIVES AND OBSTACLES Speaker: Bernd Janson, consultant, DE Abstract Founded in 1984, ZENIT in Mülheim/Ruhr, North Rhine Westphalia, Germany, offers qualified support especially for SMEs that are engaged in innovation business such as research and innovation funding programmes or international technology transfer. As part of the Enterprise Europe Network and as National Contact Point for SMEs, ZENIT is focused on consultancy services for innovative companies and other players like universities and research centres. To get new insights on the impact of FP7 (Framework Programme 7 of the European Union) projects on the NRW market, ZENIT started a series of interviews with companies and universities funded in FP7. First results show that positive effects on science, basic research and further application-oriented research and development dominate. There are some indications of positive effects on the NRW market through innovative products and processes, but they are still quite rare. One important barrier for FP7 project innovation is the fact that the contract negotiation with many players often offers only a suboptimal basis for technology transfer business.
13:00		End of session Lunch Break in Exhibition Area Sandwich lunch

UB02 Session 2

Date: Tuesday 25 March 2014
Time: 12:30 - 15:00
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB02.01	QUANTUMEDA: A VISUALIZATION AND DESIGN ENVIRONMENT FOR TOPOLOGICAL QUANTUM CIRCUITS Authors: Ilia Polian, Wolfgang Wallner and Alexandru Paler, University of Passau, DE Abstract Quantum circuits use quantum-mechanical properties of certain physical systems, such as superposition and entanglement, to perform massively parallel calculations. They provide polynomial algorithms for problems for which only inefficient algorithms with asymptotically-exponential running time are known in conventional models of computation. Building a scalable quantum computer that can process a large number of quantum bits (qubits) is one of the grand challenges of modern science. While first small quantum computers have been experimentally demonstrated and a number of implementation technologies have been suggested, all of them encounter difficulties when it comes to scaling. The central difficulty is the high susceptibility of such circuits to noise and decoherence, which necessitates the use of special quantum error correction. Topological quantum computing (TQC) is a paradigm that offers a path to scalability. It strikes a balance between systematic, intuitive methods to design large computations, and relatively loose requirements on the vulnerability of individual qubits to errors. The availability of a platform for implementing large quantum algorithms creates the need for methods to manage design complexity, including automatic synthesis, optimization, compaction, verification and visualization of TQC circuits. Topological quantum circuits are based on a three-dimensional cluster of qubits which supports highly efficient topological quantum error-correcting codes. In this way, the circuits can operate even though their individual qubits are subject to relatively high error rates. We will present the first environment for design of TQC circuits.
The environment allows the user to graphically enter the structure of a circuit, add, delete and re-shape individual qubits, and perform optimization and compaction (both manually and by global replacement). The circuits are represented on an intermediate technology-independent level, where "logical qubits" that consist of a large number of physical qubits perform error-corrected operations. For example, the circuit in Fig. 1 shows an error-corrected CNOT gate implemented by four logical qubits represented by colored structures. The optimized representation can be translated into instruction sequences for a classical computer that operates the actual quantum hardware. More information ...
UB02.02	AN AUTOMATED DESIGN FLOW FOR FAST PROTOTYPING OF SIMULINK MODELS ONTO MPSOC Authors: Francesco Robino and Johnny Öberg, Royal Institute of Technology, SE Abstract Simulink is a modelling environment suitable to model embedded systems at system-level. However there is no standard to rapidly prototype Simulink models onto modern multiprocessor system-on-chip (MPSoC). In this demonstration we show how our NoC System Generator tool can be used as part of an automated platform-based design flow to synthesize a Simulink model to a network-on-chip based MPSoC implementation on FPGA. The performance of the generated prototype scales with the number of processors. More information ...
UB02.03	CUCUMBER-VERILOG: BEHAVIOR DRIVEN DEVELOPMENT FOR CIRCUIT DESIGN AND VERIFICATION Authors: Melanie Diepenbeck, Mathias Soeken, Ulrich Kühne and Rolf Drechsler, University of Bremen, DE Abstract When designing hardware one usually applies a top-down approach in which, starting from a natural language specification, a design is implemented and afterwards tested and verified for correctness. In contrast, software development is pushed towards agile techniques such as Test Driven Development (TDD), where tests play a central role in driving the implementation. Behavior Driven Development (BDD) extends TDD by using natural language style scenarios to describe the tests. Essentially, in both techniques testing and implementation are interleaved: first, test cases are written, and second, the implementation is extended to satisfy them. Since nowadays 70% of the effort to design hardware systems is spent on verification, test and verification should receive more attention and be applied as soon as possible. We present a BDD tool tailored for the Verilog hardware description language which enables a new design flow for hardware design, test, and verification. BDD acceptance tests are readily given by means of the natural language specification. Assigning test code to their sentences yields a testbench which serves as a starting point for the implementation. At the same time, the natural language scenarios form a test documentation that is easily accessible also to non-experts. Furthermore, our tool allows for the generalization of test cases to properties suitable for formal verification. As properties are typically more difficult to formalize than test cases, our approach facilitates the access to formal verification. In our demonstration, we will show how to implement hardware designs using our BDD tool and how properties are generalized from test cases, which can then be verified by a model checker automatically. More information ...
UB02.04	BUILDING A PROTOTYPING PLATFORM FOR INVESTIGATING THE IMPACT OF ATTACKS AGAINST AUTOMOTIVE NETWORKS Authors: Alexander Stühring¹, Günter Ehmen¹ and Sibylle Fröschle² ¹University of Oldenburg, DE; ²OFFIS, DE Abstract The University of Oldenburg is working on solutions to ensure a secure communication in the automotive domain. This is a key requirement for safe applications in the context of future Car2X applications. In order to achieve this goal we are using a self-developed prototyping platform to analyze and demonstrate the impact of attacks on in-vehicle buses and wireless networks. Moreover, the visitors are able to start attacks and observe the consequences in a simulated driving scenario. More information ...
UB02.05	HWDEBLUR: DESIGN OF A HIGH PERFORMANCE CORE FOR REMOVING BLUR EFFECT ON IMAGES Authors: Giuseppe Airo' Farulla, Giulio Gambardella, Marco Indaco, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT Abstract This work aims at developing a high performance FPGA-based IP-core able to perform a deblurring algorithm in real-time. Modern approaches to deblurring usually either only handle simple types of blur, or need heavy user interaction. Moreover, they usually require several minutes (or even whole hours) to process a single image. Our purpose is to study the current state-of-the-art and identify the best deblurring algorithms that are suitable for a hardware implementation. The selected algorithm is optimized and implemented in hardware in order to perform the deblurring task with the highest possible performance. More information ...
UB02.06	ENERGY-MODULATED COMPUTING Authors: Maxim Rykunov, Reza Ramezani, Abdullah Baz, Xuefu Zhang, Delong Shang, Andrey Mokhov, Danil Sokolov, Fei Xia and Alex Yakovlev, Newcastle University, GB Abstract This demo will illustrate the principle of energy-modulated computing according to which the flow of energy entering a computing system determines its computational flow. This principle will be fundamental for building future autonomous systems, such as those powered by energy harvesting sources and aimed for survival in power-deficient conditions. The demo includes a set of experimental circuits (with three VLSI chips and PCBs) to work in variable power supply conditions and software tools for digital and analogue co-design (Workcraft, Petrify, MPSAT). More information ...
UB02.07	ID.FIX: AN EDA TOOL FOR FIXED-POINT REFINEMENT OF EMBEDDED SYSTEMS Authors: Olivier Sentieys¹, Daniel Menard² and Nicolas Simon³ ¹INRIA, FR; ²INSA Rennes, FR; ³University of Rennes, FR Abstract Most digital image and signal processing algorithms are implemented on architectures based on fixed-point arithmetic to satisfy the cost and power consumption constraints of embedded systems. The fixed-point conversion process (or refinement) is crucial for reducing the time-to-market. Design tools to automate this phase and to explore the design space are thus required. The ID.Fix EDA tool, based on the compiler infrastructure GECOS, allows for the conversion of a floating-point C source code into a C code using fixed-point data types. The data word-lengths are optimized by minimizing the implementation cost under an accuracy constraint. To obtain low optimization time, an analytical approach is used to evaluate the fixed-point computation accuracy. This approach is valid for systems made up of any (smooth) arithmetic operations. More information ...
UB02.08	MICROTESK: RECONFIGURABLE OPEN-SOURCE FRAMEWORK FOR TEST PROGRAM GENERATION Authors: Andrei Tatarnikov, Alexander Kamkin and Artem Kotsynyak, Institute for System Programming of the Russian Academy of Sciences (ISP RAS), RU Abstract Test program generation plays a major role in functional verification of microprocessors. Due to tremendous growth in complexity of modern designs and rigid constraints on time to market, it becomes an increasingly difficult task. In spite of powerful test program generation tools available in the market, development of functional tests is still known to be the bottleneck of the microprocessor design cycle. The common problem is that it takes a significant effort to reconfigure a test program generation environment for a new microprocessor design. The model-based approach applied in the state-of-the-art tools, like Genesys-Pro (IBM Research), still does not provide enough flexibility since creating a microprocessor model is difficult and requires special knowledge and skills. MicroTESK, the open-source test program generation framework being developed at ISP RAS, offers an approach to ease customization by using light-weight formal specifications to describe the target microprocessor architecture. The approach helps reduce the effort needed to create a microprocessor model and, consequently, minimize the time required to create functional tests. In addition to gaining flexibility, the use of formal specifications also allows automated extraction of knowledge about test situations that occur in a microprocessor (coverage model), thus facilitating the creation of directed tests and improving test coverage. So far, a demo prototype of MicroTESK has been implemented. It uses the Sim-nML architecture description language to specify the target microprocessor architecture and provides a convenient Ruby-based language for creating test templates that serve as an abstract description of test programs to be generated.
The current version of the framework focuses primarily on RISC microprocessors including ARM, MIPS and SPARC. Supported test generation methods include random, combinatorial, template-based and model-based generation. The flexible architecture of the framework allows adding support for new test generation methods. More information ...
UB02.09	FAULTIFY: PROBABILISTIC CIRCUIT FAULT EMULATION Authors: David May and Walter Stechele, TUM, DE Abstract We want to demonstrate an FPGA-based probability-aware fault emulator and its corresponding algorithms in the context of a real-time H.264 decoder. The demo will show that reliability constraints can be relaxed inside the circuit without noticeable degradation of the image quality when carefully investigating where the constraints can be relaxed. We will show how this investigation can be done using our emulator and we will show the effect of a relaxed robustness of the circuit in real-time. More information ...
15:00	End of session
16:00	Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
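
Illustrative note on demonstration UB02.07: the abstract describes choosing fixed-point word-lengths under an accuracy constraint. The sketch below shows that idea on a toy 3-tap filter. It measures accuracy by simulation against a floating-point reference, which is a simpler stand-in for the analytical evaluation the abstract mentions, and it is not the ID.Fix tool or its API. The filter, the test signal, the noise-power bound and all function names are assumptions made for illustration.

```python
import math


def to_fixed(value, frac_bits):
    # Round to the nearest multiple of 2**-frac_bits (Python's round uses
    # round-half-to-even); overflow and saturation are not modeled.
    scale = 1 << frac_bits
    return round(value * scale) / scale


def fir(samples, coeffs, quantize=lambda v: v):
    # Direct-form FIR filter; 'quantize' is applied to inputs, coefficients
    # and each output sample to mimic fixed-point storage.
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += quantize(c) * quantize(samples[n - k])
        out.append(quantize(acc))
    return out


def noise_power(frac_bits, samples, coeffs):
    # Mean squared error between the floating-point reference and the
    # fixed-point version of the same filter.
    ref = fir(samples, coeffs)
    fxp = fir(samples, coeffs, lambda v: to_fixed(v, frac_bits))
    return sum((a - b) ** 2 for a, b in zip(ref, fxp)) / len(ref)


def min_frac_bits(samples, coeffs, max_noise_power, limit=32):
    # Smallest number of fractional bits (a proxy for implementation cost)
    # that meets the accuracy constraint, or None if none does up to 'limit'.
    for bits in range(1, limit + 1):
        if noise_power(bits, samples, coeffs) <= max_noise_power:
            return bits
    return None


if __name__ == "__main__":
    signal = [math.sin(0.1 * i) for i in range(200)]
    taps = [0.25, 0.5, 0.25]
    print("fractional bits needed:", min_frac_bits(signal, taps, 1e-6))
```

A single shared word-length keeps the search one-dimensional; assigning a separate word-length to each signal would turn it into the kind of design-space exploration the abstract refers to.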

3.1 EXECUTIVE SESSION: Advanced Technology Challenges & Opportunities

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Saal 1

Organiser:
Yervant Zorian, Fellow & Chief Architect, Synopsys, US

Executives:
Lorent Remont, Vice President, STMicroelectronics, FR
Joachim Kunkel, Senior Vice President & GM, Synopsys, US
Rudy Lauwereins, VP, IMEC, BE
Wenchi Chang, Senior Manager, TSMC, NL
Gerd Teepe, VP, Global Foundries, DE

Continuous technology scaling and its new applications are dramatically impacting the semiconductor industry. This may also significantly affect the dependency between ecosystem players, necessitating smooth interdependency between them. The executives in this session will discuss upcoming innovations in the semiconductor industry and their impact on the solutions offered by the ecosystem players.

Time	Label	Presentation Title Authors
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.2 Panel: The World Is Going... Analog & Mixed-Signal! What about EDA?

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 6

Organiser:
Marco Casale-Rossi, Synopsys, Inc., US

Chair:
Pietro Palella, STMicroelectronics, IT

Contrary to common belief, the world is not going digital! Analog and mixed-signal electronics is more and more important and pervasive. This is due both to increasing systems integration, which by nature leads to heterogeneity, and to complex digital computing functions being complemented by scores of on-chip analog functions that interface and interact with people, the environment, and other systems. Specialty silicon foundries are now stable members of top ten revenue rankings. This technology trend demands more design automation in both the implementation and verification domains. Lossless interfaces between digital and analog design environments, multi-technology support, mixed-signal simulation engines, and also debugging aids, are no longer a nice-to-have. According to IBS, the cost of implementing and verifying the mixed-signal functions is generally over 50% of the design costs even though the mixed-signal transistors can be as low as 3% of the total! What are the critical requirements, moving forward, and what is the EDA industry doing to serve the needs of this increasingly important semiconductor industry segment?

Panelists:

Mario Anton, Micronas, DE
Ori Galzur, TowerJazz, IL
Robert Hum, Mentor Graphics Corp., US
Rainer Kress, Infineon Technologies, DE
Paul Lo, Synopsys, Inc., US

16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.3 Secure Hardware Primitives and Implementations

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 1

Chair:
Paolo Maistri, TIMA, FR

Co-Chair:
Patrick Schaumont, Virginia Tech, US

System designers need secure building blocks for robust system protection against physical attacks. This session presents novel hardware designs and analysis on code-based cryptography, random number generators and IP protection mechanisms using watermarking.

Time	Label	Presentation Title Authors
14:30	3.3.1	LIGHTWEIGHT CODE-BASED CRYPTOGRAPHY: QC-MDPC MCELIECE ENCRYPTION ON RECONFIGURABLE DEVICES Speakers: Ingo von Maurich and Tim Güneysu, Ruhr-Universität Bochum, DE Abstract With the break of RSA and ECC cryptosystems in an era of quantum computing, asymmetric code-based cryptography is an established alternative that can be a potential replacement. A major drawback is the large keys, in the range from 50kByte to several MByte, which have so far prevented real-world applications of code-based cryptosystems. A recent proposal by Misoczki et al. showed that quasi-cyclic moderate density parity-check (QC-MDPC) codes can be used in McEliece encryption, reducing the public key to just 0.6kByte to achieve an 80-bit security level. Despite reasonably small key sizes that could also enable small designs, previous work only reports high-performance implementations with high resource consumption of more than 13,000 slices on a large Xilinx Virtex-6 FPGA for a combined en-/decryption unit. In this work we focus on lightweight implementations of code-based cryptography and demonstrate that McEliece encryption using QC-MDPC codes can be implemented with a significantly smaller resource footprint, still achieving reasonable performance sufficient for many applications, e.g., challenge-response protocols or hybrid firmware encryption. More precisely, our design requires just 68 slices for the encryption and around 150 slices for the decryption unit and is able to en-/decrypt an input block in 2.2ms and 13.4ms, respectively.
15:00	3.3.2	ON THE ASSUMPTION OF MUTUAL INDEPENDENCE OF JITTER REALIZATIONS IN P-TRNG STOCHASTIC MODELS. Speakers: Patrick Haddad¹, Yannick Teglia¹, Florent Bernard² and Viktor Fischer³ ¹STMicroelectronics, FR; ²Laboratory Hubert Curien, University of Lyon, UJM Saint-Etienne, FR; ³Hubert Curien Laboratory, Jean Monnet University, FR Abstract Security in true random number generation in cryptography is based on entropy per bit at the generator output. The entropy is evaluated using stochastic models. Several recent works propose stochastic models based on assumptions related to selected physical analog phenomena such as noisy signals and on the knowledge of the principle of randomness extraction from the obtained noisy analog signal. However, these assumptions often considerably simplify the underlying analog processes, which include several noise sources. In this paper, we present a new comprehensive multilevel approach, which enables building the stochastic model based on detailed analysis of noise sources starting at transistor level and on conversion of the noise to the clock jitter exploited at the generator level. Using this approach, we can estimate the proportion of the jitter coming only from the thermal noise, which is included in the total clock jitter.
15:30	3.3.3	CLOCK-MODULATION BASED WATERMARK FOR PROTECTION OF EMBEDDED PROCESSORS Speakers: Jedrzej Kufel¹, Peter Wilson¹, Stephen Hill², Bashir Al-Hashimi¹, Paul N. Whatmough³ and James Myers³ ¹University of Southampton, GB; ²ARM, US; ³ARM, GB Abstract This paper presents a novel watermark generation technique for the protection of embedded processors. In previous work, a load circuit is used to generate detectable watermark patterns in the ASIC power supply. This approach leads to hardware area overheads. We propose removing the dedicated load circuit entirely. Instead, to compensate for the reduced power consumption, the watermark power pattern is emulated by reusing existing clock gated sequential logic as a zero-overhead load circuit and modulating the clock-gating enable signal with the watermark sequence. The proposed technique has been validated through experiments using two ASICs in 65nm CMOS, one with an ARM Cortex-M0 microcontroller and one with a Cortex-A5 microprocessor. Silicon measurement results verify the viability of the technique for embedded processors. Furthermore, the proposed clock modulation technique demonstrates a significant area reduction, without compromising the detection performance. In our experiments an area overhead reduction of 98% was achieved. Through reuse of existing logic and reduction of watermark hardware implementation costs, the proposed clock modulation technique offers an improved robustness against removal attacks.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.4 Modeling and Optimization of Power Distribution Networks

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 2

Chair:
Luca Daniel, MIT, US

Co-Chair:
Stefano Grivet-Talocia, Politecnico di Torino, IT

The performance and robustness of 3D power distribution networks is of critical importance for state of the art electronic designs. The papers in this session discuss new modeling and optimization approaches for their efficient characterization and robust design, including order reduction, variability impact, via planning, decoupling capacitor selection, and thermal effects.

Time	Label	Presentation Title Authors
14:30	3.4.1	(Best Paper Award Candidate) SENSITIVITY-BASED WEIGHTING FOR PASSIVITY ENFORCEMENT OF LINEAR MACROMODELS IN POWER INTEGRITY APPLICATIONS Speakers: Andrea Ubolli¹, Stefano Grivet-Talocia¹, Michelangelo Bandinu² and Alessandro Chinea² ¹Politecnico di Torino, IT; ²IdemWorks s.r.l., IT Abstract The electrical performance of Power Distribution Networks (PDNs) is usually assessed by computing frequency responses through quasi-static or full-wave electromagnetic solvers. Such responses, often available in the scattering form, are then fed to suitable macromodeling algorithms for the extraction of compact reduced-order behavioral models that can be seamlessly simulated in the time domain by standard circuit solvers. Such algorithms perform a rational fitting of the raw scattering responses, followed by a passivity check and enforcement step. The resulting macromodel is typically very accurate when compared to the raw scattering responses. It may however happen that the responses of the PDN macromodel exhibit significant deviation from the true system responses under realistic loading conditions, which include appropriate models for active device blocks, decoupling capacitors, voltage regulators, etc. We highlight the source of this accuracy loss, and we propose a sensitivity-based weighting strategy that is able to optimize and tune the macromodel accuracy based on its specific nominal termination network. The particular focus of this paper is the definition and the inclusion of optimal weights in the passivity enforcement step, which is recognized as the most challenging step. The result is a reliable macromodeling flow, which is able to produce passive, accurate and efficient reduced-order models of general PDN structures for power integrity analysis and verification.
15:00	3.4.2	EFFICIENT ANALYSIS OF VARIABILITY IMPACT ON INTERCONNECT LINES AND RESISTOR NETWORKS Speakers: Jorge Fernandez Villena¹ and Luis Miguel Silveira² ¹INESC ID, PT; ²INESC ID/IST - Lisbon University, PT Abstract Continued technology scaling coupled with limited lithographic capabilities is a leading cause of increased design variability. In the nanometer regime lithography tools have failed to keep pace with Moore's Law and printed feature sizes are a small fraction of the wavelength of light used in current processes. Such sub-wavelength printing makes features highly susceptible to perturbations in the lithographic process conditions which leads to printed designs exhibiting increased variability. Such variability directly affects design behavior and performance in multiple ways. One of the areas of concern is power grid (PG) design, where lithographic errors may locally modify the wire widths. These variations, that may affect any and all wires in the grid, have a critical impact on the power distribution across the chip, introducing considerable current fluctuations which are a potential cause for electromigration effects. To analyze and account for the impact of these errors requires a complete extraction of the PG, which generates a large resistive network, potentially with several million elements, whose simulation is computationally challenging. This paper proposes a fast and accurate variability analysis of very large resistor networks, such as PG extracted netlists, that allows estimating the effects of multiple parameter settings in reasonable time. The proposed model can be easily combined with Litho/CMP simulators in order to boost much needed design-aware lithography.
15:30	3.4.3	IMPLICIT INDEX-AWARE MODEL ORDER REDUCTION FOR RLC/RC NETWORKS Speakers: Nicodemus Banagaaya¹, Giuseppe Alì², Wil H. A. Schilders¹ and Caren Tischendorf³ ¹Eindhoven University of Technology, NL; ²University of Calabria and INFN, Gruppo collegato di Cosenza, IT; ³Institute of Mathematics, Humboldt-Universität zu Berlin, DE Abstract This paper introduces the implicit-IMOR method for differential algebraic equations. This method is a modification of the Index-aware model order reduction (IMOR) method proposed in our earlier papers which is the explicit-IMOR method. It also involves first splitting the differential-algebraic equations (DAEs) into differential and algebraic parts using a basis of projectors. In contrast with the explicit-IMOR method, the implicit-IMOR method leads to implicit differential and algebraic parts. We demonstrate the implicit-IMOR method using the RLC/RC networks, but it can also be applied to other problems which lead to differential-algebraic equations.
15:45	3.4.4	P/G TSV PLANNING FOR IR-DROP REDUCTION IN 3D-ICS Speakers: Shengcheng Wang¹, Farshad Firouzi², Fabian Oboril¹ and Mehdi Tahoori¹ ¹Karlsruhe Institute of Technology, DE; ²Karlsruhe Institute of Technology (KIT), DE Abstract In recent years, interconnect issues emerged as major performance challenges for Two-Dimensional-Integrated-Circuits (2D-ICs). In this context, Three-Dimensional-ICs (3D-ICs), which consist of several active layers stacked above each other, offer a very attractive alternative to conventional 2D-ICs. However, 3D-ICs also face many challenges associated with the Power Distribution Network (PDN) design due to the increasing power density and larger supply current compared to 2D-ICs. As an important part of 3D-IC PDNs, Power/Ground (P/G) Through-Silicon-Vias (TSVs) should be well-managed. Excessive or ill-placed P/G TSVs impact the power integrity (e.g. IR-drop), and also consume a considerable amount of chip real estate. In this work, we propose a Mixed-Integer-Linear-Programming (MILP)-based technique to plan the P/G TSVs. The goal of our approach is to minimize the average IR-drop while satisfying the total area constraint of TSVs by optimizing the P/G TSV placement. Therefore, the locations, sizes and the total number of the P/G TSVs are co-optimized simultaneously. The experimental results show that the average IR-drop can be reduced by 11.8% on average using the proposed method compared to a random placement technique with a much smaller runtime.
16:00	IP1-15, 69	PACKAGE GEOMETRIC AWARE THERMAL ANALYSIS BY INFRARED-RADIATION THERMAL IMAGES Speakers: Jui-Hung Chien¹, Hao Yu², Ruei-Siang Hsu³, Hsueh-Ju Lin³ and Shih-Chieh Chang³ ¹Industrial Technology Research Institute, TW; ²None, TW; ³NTHU, TW Abstract Since packages affect the amount of heat transfer, it is important to include package and heat sink in thermal analysis. In this paper, we study the full-chip thermal response with different packages. We first discuss the difficulties of obtaining accurate package models for simulation. To help a designer perform thermal simulation with different packages, we propose to use a matrix called the package-transfer matrix which can transform a temperature profile of one package to another temperature profile of the desired package. To estimate and verify a package-transfer matrix, we propose an efficient method which uses Infrared Radiation (IR) images from two carefully designed test chips with PBGA packages. Our experimental results show that the default package model CBGA in HotSpot can be accurately transferred to any other package through the package-transfer matrix.
16:01	IP1-16, 252	COST-EFFECTIVE DECAP SELECTION FOR BEYOND DIE POWER INTEGRITY Speakers: Yi-En Chen¹, Tu-Hsung Tsai¹, Shi-Hao Chen² and Hung-Ming Chen¹ ¹Department of Electronics Engineering National Chiao Tung University Hsinchu, Taiwan 300, R.O.C., TW; ²Global Unichip Corp, Hsinchu, Taiwan, TW Abstract In designing reliable power distribution networks (PDN) for power integrity (PI), it is essential to stabilize voltage supply to devices on chip. We usually employ decoupling capacitors (decaps) to suppress the noise generated by the switching of devices. There have been numerous prior works on how to select/insert decaps in chip, package, or board to maintain PI; however, optimal decap selection is usually not applicable due to design budget and manufacturability. Moreover, design cost is seldom touched or mentioned. In this research, we propose an efficient methodology "PDCPSO" to automatically optimize the selection of available decaps. This algorithm not only takes advantage of particle swarm optimization (PSO) to stochastically search the design space, but also takes the most effective range of decaps into consideration to outperform the basic PSO. We apply this to three real package designs and the results show that, compared to the original decap selection by rules of thumb, our approach could shorten the design period and we have better combination of decaps at the same or lower cost. In addition, our methodology can also consider package-board co-design in optimizing different operation frequencies.
16:02	IP1-17, 554	CHARACTERIZING POWER DELIVERY SYSTEMS WITH ON/OFF-CHIP VOLTAGE REGULATORS FOR MANY-CORE PROCESSORS Speakers: Xuan Wang, Jiang Xu, Zhe Wang, Kevin J. Chen, Xiaowen Wu and Zhehui Wang, HKUST, HK Abstract The design of the power delivery system has great influence on the power management in many-core processor systems. Moving voltage regulators from off-chip to on-chip gains more and more interest in the power delivery system design, because it is able to provide fast voltage scaling and multiple power domains. Previous works have proposed power-efficient on-chip regulators. It is also important to analyze the characteristics of the entire power delivery system to explore the tradeoff between the promising properties and costs of employing on-chip regulators. In this work, we develop an analytical model to evaluate important characteristics of the power delivery system, including on-chip/off-chip voltage regulators and the passive on-chip/on-board parasitics. Compared with SPICE simulations, our model achieves a fast system-level evaluation with comparable accuracy. Based on the model, geometric programming is utilized to find the optimal power efficiency of different architectures of power delivery systems under constraints of output voltage stability and area. Experiments show that compared with the conventional architecture using off-chip regulators, the hybrid one using both on-chip and off-chip voltage regulators achieves 1.0% power efficiency improvement and 68% area reduction of voltage regulators on average. We conclude that the hybrid architecture has potential for high power efficiency and small area at heavy workload, but careful account for the overhead of on-chip regulators is needed.
16:03	IP1-18, 527	MASK-COST-AWARE ECO ROUTING Speakers: Hsi-An Chien¹, Zhen-Yu Peng¹, Yun-Ru Wu², Ting-Hsiung Wang², Hsin-Chang Lin², Chi-Feng Wu² and Ting-Chi Wang¹ ¹National Tsing Hua University, TW; ²Realtek Semiconductor Corp., TW Abstract In this paper, we study a mask-cost-aware routing problem for engineering change order (ECO). By taking into account old routes for possible reuse, we present an approach for the problem. Encouraging experimental results are reported to demonstrate the effectiveness of our approach.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.5 Robust Architectures

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 3

Chair:
Todd Austin, University of Michigan, US

Co-Chair:
Muhammad Shafique, Karlsruhe Institute of Technology, DE

This session presents the design of novel architectures to support real-time and secure systems. The first paper couples a time-division multiplexed NoC with a real-time memory controller to design a cost-effective real-time system with improved worst-case latency at reduced area and power consumption. The next paper proposes bus designs for multi-cores that are analyzable for probabilistic timing analysis. The final paper in this session designs a lightweight hardware solution using lockstep shadow thread execution to detect and prevent code injection attacks.

Time	Label	Presentation Title Authors
14:30	3.5.1	(Best Paper Award Candidate) COUPLING TDM NOC AND DRAM CONTROLLER FOR COST AND PERFORMANCE OPTIMIZATION OF REAL-TIME SYSTEMS Speakers: Manil Dev Gomony¹, Benny Akesson² and Kees Goossens³ ¹Eindhoven University of Technology, NL; ²Czech Technical University in Prague, CZ; ³Eindhoven University of Technology, NL Abstract Existing memory subsystems and TDM NoCs for real-time systems are optimized independently in terms of cost and performance by configuring their arbiters according to the bandwidth and/or latency requirements of their clients. However, when they are used in conjunction, and run in different clock domains, i.e. they are decoupled, there exists no structured methodology to select the NoC interface width and operating frequency for minimizing area and/or power consumption. Moreover, the multiple arbitration points, one in the NoC and the other in the memory subsystem, introduce additional overhead in the worst-case guaranteed latency. This makes it hard to design cost-efficient real-time systems. The three main contributions in this paper are: (1) We present a novel methodology to couple any existing TDM NoC with a real-time memory controller and compute the different NoC interface width and operating frequency combinations for minimal area and/or power consumption. (2) For two different TDM NoC types, one packet-switched and the other circuit-switched, we show the trade-off between area and power consumption with the different NoC configurations, for different DRAM generations. (3) We compare the coupled and decoupled architectures with the two NoCs, in terms of guaranteed worst-case latency, area and power consumption by synthesizing the designs in 40 nm technology. Our experiments show that using a coupled architecture in a system consisting of 16 clients results in savings of over 44% in guaranteed latency, 18% and 17% in area, 19% and 11% in power consumption for a packet-switched and a circuit-switched TDM NoC, respectively, with different DRAM types.
15:00	3.5.2	BUS DESIGNS FOR TIME-PROBABILISTIC MULTICORE PROCESSORS Speakers: Javier Jalle¹, Leonidas Kosmidis¹, Jaume Abella², Eduardo Quinones¹ and Francisco Cazorla³ ¹Barcelona Supercomputing Center, ES; ²Barcelona Supercomputing Center (BSC-CNS), ES; ³Barcelona Supercomputing Center and IIIA-CSIC, ES Abstract Probabilistic Timing Analysis (PTA) reduces the amount of information needed to provide tight WCET estimates in real-time systems with respect to classic timing analysis. PTA imposes new requirements on hardware design that have been shown implementable for single-core architectures. However, no support has been proposed for multicores so far. In this paper, we propose several probabilistically-analysable bus designs for multicore processors ranging from 4 cores connected with a single bus, to 16 cores deploying a hierarchical bus design. We derive analytical models of the probabilistic timing behaviour for the different bus designs, show their suitability for PTA and evaluate their hardware cost. Our results show that the proposed bus designs (i) fulfil PTA requirements, (ii) allow deriving WCET estimates with the same cost and complexity as in single-core processors, and (iii) provide higher guaranteed performance than single-core processors, 3.4x and 6.6x on average for an 8-core and a 16-core setup respectively.
15:30	3.5.3	PROGRAMMABLE DECODER AND SHADOW THREADS: TOLERATE REMOTE CODE INJECTION EXPLOITS WITH DIVERSIFIED REDUNDANCY Speakers: Weidong Shi¹, Ziyi Liu¹, Shouhuai Xu² and Zhiqiang Lin³ ¹University of Houston, US; ²University of Texas at San Antonio, US; ³University of Texas at Dallas, US Abstract We present a lightweight hardware framework for providing high assurance detection and prevention of code injection attacks using a lockstep diversified shadow execution. Recent studies show that hardware diversification can detect software attacks by checking the consistency of their behavior simultaneously. Unfortunately, the severe performance degradation and extra system costs caused by these methods are unacceptable in many applications. This paper presents a hardware-level, lockstep shadow thread framework to enrich the diversity of the software execution, with the help of a programmable hardware decoder and novel CPU support for a tightly coupled non-executing shadow thread technique. Specifically, given a piece of (legacy) binary code, we first generate diversified binary versions using an offline binary rewriter and programmable hardware binary translator at runtime. Two diversified binary code images are launched as dual simultaneous threads in the hardware layer with one as the primary thread and the other one as shadow thread. Instructions from the shadow thread are not executed but just compared, and thus incur no OS side-effects. The extended CPU is able to decode instructions from both threads, and dispatch them to the next pipeline stage for a lockstep comparison. Any mismatch of the decoded instructions from the two threads caused by remotely injected binary code will be detected. Our design provides instruction set randomization (ISR) with minimal cost in performance, when compared with a straightforward ISR implementation. The simulation results indicate that our framework incurs very small overheads and provides protection against code injection attacks.
16:00	IP1-19, 268	EXPLOITING NARROW-WIDTH VALUES FOR IMPROVING NON-VOLATILE CACHE LIFETIME Speakers: Guangshan Duan and Shuai Wang, Nanjing University, CN Abstract Due to their high cell density, low leakage power consumption, and lower vulnerability to soft errors, non-volatile memory technologies are among the most promising alternatives for replacing the traditional DRAM and SRAM technologies used in implementing main memory and caches in the modern microprocessor. However, one of the difficulties is the limited write endurance of most non-volatile memory technologies. In this paper, we propose to exploit the narrow-width values to improve the lifetime of the non-volatile last level caches. A leading-zeros masking scheme is first proposed to reduce the write stress to the upper half of the narrow-width data. To balance the write variations between the upper half and the lower half of the narrow-width data, two swap schemes, the swap on write (SW) and swap on replacement (SRepl), are proposed. To further reduce the write stress to the non-volatile cache, we adopt two optimization schemes, the multiple dirty bit (MDB) and read before write (RBW), to improve its lifetime. Our experimental results show that by combining all our proposed schemes, the lifetime of the non-volatile caches can be improved by 245% on average.
16:01	IP1-20, 166	PARTIAL-SET: WRITE SPEEDUP OF PCM MAIN MEMORY Speakers: Li Bing¹, Shan Shuchang², Hu Yu² and Li XiaoWei³ ¹ICT,UCAS, CN; ²ICT,CAS, CN; ³ICT.CAS, CN Abstract Phase change memory (PCM) is a promising nonvolatile memory technology developed as a possible DRAM replacement. Although it offers a read latency close to that of DRAM, PCM generally suffers from long write latency. Long write requests may block the read requests on the critical path of cache/memory access, incurring adverse impact on the system performance. Besides, the write performance of PCM is very asymmetric, i.e., the SET operation (writing '1') is much slower than the RESET operation (writing '0'). In this work, we re-examine the resistance transform process during the SET operation of PCM and propose a novel Partial-SET scheme to alleviate the long write latency issue of PCM. During a write access to a memory line, a short Partial-SET pulse is applied first to program the PCM cells to a pre-stable state, achieving the same write latency as RESET. The partially-SET cells are then fully programmed within the retention window to preserve the data integrity. Experimental results show that our Partial-SET scheme can improve the memory access performance of PCM by more than 45% on average with very marginal storage overhead.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.6 Cyber Physical Systems: Security and Co-design

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 4

Chair:
Rolf Ernst, Technische Universitaet Braunschweig, DE

Co-Chair:
Anuradha Annaswamy, MIT, US

This session showcases recent results in cybersecurity and codesign in CPS. The first paper analyzes a stealth cyberattack scenario where a distributed sensor system is disturbed by an attacker who tries to reduce the sensor fusion quality and suggests an algorithmic approach to increase robustness against this attack. The second paper addresses the joint design of a feedback controller and a server-based resource reservation mechanism to guarantee closed-loop stability. The third paper describes a codesign approach formally guaranteeing control robustness for a communication channel with a bounded number of frame losses.

Time	Label	Presentation Title Authors
14:30	3.6.1	(Best Paper Award Candidate) ATTACK-RESILIENT SENSOR FUSION Speakers: Radoslav Ivanov, Miroslav Pajic and Insup Lee, University of Pennsylvania, US Abstract This work considers the problem of attack-resilient sensor fusion in an autonomous system where multiple sensors measure the same physical variable. A malicious attacker may corrupt a subset of these sensors and send wrong measurements to the controller on their behalf, potentially compromising the safety of the system. We formalize the goals and constraints of such an attacker who also wants to avoid detection by the system. We argue that the attacker's capabilities depend on the amount of information she has about the correct sensors' measurements. In the presence of a shared bus where messages are broadcast to all components connected to the network, the attacker may consider all other measurements before sending her own in order to achieve maximal impact. Consequently, we investigate effects of communication schedules on sensor fusion performance. We provide worst- and average-case results in support of the Ascending schedule, where sensors send their measurements in a fixed succession based on their precision, starting from the most precise sensors. Finally, we provide a case study to illustrate the use of this approach.
15:00	3.6.2	BANDWIDTH-EFFICIENT CONTROLLER-SERVER CO-DESIGN WITH STABILITY GUARANTEES Speakers: Amir Aminifar¹, Enrico Bini², Petru Eles¹ and Zebo Peng¹ ¹Linköping University, SE; ²Lund University, SE Abstract Many cyber-physical systems comprise several control applications implemented on a shared platform, for which stability is a fundamental requirement. This is as opposed to the classical hard real-time systems where often the criterion is meeting the deadline. However, the stability of control applications depends not only on the delay experienced, but also on the jitter. Therefore, the notion of deadline is considered to be artificial for control applications, which motivates the need for new techniques for designing cyber-physical systems. The approach in this paper is built on a server-based resource reservation mechanism, which provides compositionality, isolation, and the opportunity of systematic controller-server co-design. We address the controller-server co-design of such systems to obtain design solutions with the minimal bandwidth to guarantee stability.
15:30	3.6.3	FAULT-TOLERANT CONTROL SYNTHESIS AND VERIFICATION OF DISTRIBUTED EMBEDDED SYSTEMS Speakers: Matthias Kauer¹, Damoon Soudbakhsh², Dip Goswami³, Samarjit Chakraborty⁴ and Anuradha Annaswamy⁵ ¹TUM CREATE Ltd, SG; ²Massachusetts Institute of Technology, US; ³Eindhoven University of Technology, NL; ⁴TU Munich, DE; ⁵MIT, US Abstract We deal with synthesis of distributed embedded control systems closed over a faulty or severely constrained communication network. Such overloaded communication networks are common in cost-sensitive domains such as automotive. Design of such systems aims to meet all deadlines following the traditional notion of schedulability. In this work, we aim to exploit robustness of the controller and propose a novel implementation approach to achieve a tighter design. Toward this, we answer two research questions: (i) given a distributed architecture, how to characterize and formally verify the bound on deadline misses, (ii) given such a bound, how to design a controller such that desired stability and Quality of Control (QoC) requirements are met. We address question (i) by modeling a distributed embedded architecture as a network of Event Count Automata (ECA), and subsequently introducing and formally verifying a property formulation with reduced complexity. We address question (ii) by introducing a novel fault-tolerant control strategy which adjusts the control input at runtime based on the occurrence of fault or drop. We show that QoC under faulty communication improves significantly using the proposed fault-tolerant strategy.
16:00	IP1-21, 195	GARBAGE COLLECTION FOR MULTI-VERSION INDEX ON FLASH MEMORY Speakers: Kam-Yiu Lam¹, Jian-Tao Wang¹, Yuan-Hao Chang², Jen-Wei Hsieh³, Po-Chun Huang⁴, Chung Keung Poon⁵ and ChunJiang Zhu¹ ¹City University of Hong Kong, HK; ²Academia Sinica, TW; ³National Taiwan University of Science and Technology, TW; ⁴Academia Sinica, TW; ⁵City University of Hong Kong, TW Abstract In this paper, we study the important performance issues in using the purging-range query to reclaim old data versions to be free blocks in a flash-based multi-version database. To reduce the overheads for using the purging-range query in garbage collection, the physical block labeling (PBL) scheme is proposed to provide a better estimation on the purging version number to be used for purging old data versions. With the use of the frequency-based placement (FBP) scheme to place data versions in a block, the efficiency in garbage collection can be further enhanced by increasing the deadspans of data versions and reducing reallocation cost especially when the spaces of the flash memory for the databases are limited.
16:01	IP1-22, 395	D2CYBER: A DESIGN AUTOMATION TOOL FOR DEPENDABLE CYBERCARS Speakers: Arslan Munir and Farinaz Koushanfar, Rice University, US Abstract The next generation of automobiles (also known as cybercars) will increasingly incorporate electronic control units (ECUs) in novel automotive control applications. Recent work has demonstrated vulnerability of modern car control systems to security attacks that directly impact the cybercar's physical safety and dependability. In this paper, we provide an integrated approach for the design of secure and dependable cybercars using a case study: a steer-by-wire (SBW) application over controller area network (CAN). The challenge is to embed both security and dependability over CAN while ensuring that the real-time constraints of the cybercar applications are not violated. Our approach enables early design feasibility analysis by embedding essential security primitives (i.e., confidentiality, integrity, and authentication) over CAN subject to the real-time constraints imposed by the desired quality of service and behavioral reliability. Our method leverages multi-core ECUs for providing fault-tolerance by redundant multi-threading (RMT) and also further enhances RMT for quick error detection. We quantify the error resilience of our approach and evaluate the interplay of performance, fault-tolerance, security, and scalability for our SBW case study.
16:02	IP1-23, 819	CONTRACT-BASED DESIGN OF CONTROL PROTOCOLS FOR SAFETY-CRITICAL CYBER-PHYSICAL SYSTEMS Speakers: Pierluigi Nuzzo, John Finn, Antonio Iannopollo and Alberto Sangiovanni-Vincentelli, University of California at Berkeley, US Abstract We introduce a platform-based design methodology that addresses the complexity and heterogeneity of cyber-physical systems by using assume-guarantee contracts to formalize the design process and enable realization of control protocols in a hierarchical and compositional manner. Given the architecture of the physical plant to be controlled, the design is carried out as a sequence of refinement steps from an initial specification to a final implementation, including synthesis from requirements and mapping of higher-level functional and non-functional models into a set of candidate solutions built out of a library of components at the lower level. Initial top-level requirements are captured as contracts and expressed using linear temporal logic (LTL) and signal temporal logic (STL) formulas to enable requirement analysis and early detection of inconsistencies. Requirements are then refined into a controller architecture by combining reactive synthesis steps from LTL specifications with simulation-based design space exploration steps. We demonstrate our approach on the design of embedded controllers for aircraft electric power distribution.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.7 On line Strategies for Reliability

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 5

Chair:
Fabrizio Lombardi, Northwestern University, US

Co-Chair:
Jie Han, University of Alberta, CA

This session presents different approaches to improve the reliability of circuits and systems by using on-line techniques. It shows different methods that can be applied to caches, processors and multicore architectures.

Time	Label	Presentation Title Authors
14:30	3.7.1	SPATIAL PATTERN PREDICTION BASED MANAGEMENT OF FAULTY DATA CACHES Speakers: Georgios Keramidas¹, Michail Mavropoulos², Anna Karvouniari² and Dimitris Nikolos² ¹Researcher, University of Patras, GR; ²University of Patras, GR Abstract Technology scaling leads to significant faulty bit rates in on-chip caches. In this work, we propose a methodology to mitigate the impact of defective bits (due to permanent faults) in first-level set-associative data caches. Our technique assumes that faulty caches are enhanced with the ability of disabling their defective parts at cache subblock granularity. Our experimental findings reveal that while the occurrence of hard-errors in faulty caches may have a significant impact on performance, a lot of room for improvement exists if one is able to take into account the spatial reuse patterns of the to-be-referenced blocks (not all the data fetched into the cache is accessed). To this end, we propose frugal PC-indexed spatial predictors (with very small storage requirements) to orchestrate the (re)placement decisions among the fully and partially unusable faulty blocks. Using cycle-accurate simulations, a wide range of scientific applications, and a plethora of cache fault maps, we showcase that our approach is able to offer significant benefits in cache performance.
15:00	3.7.2	COMBINED DVFS AND MAPPING EXPLORATION FOR LIFETIME AND SOFT-ERROR SUSCEPTIBILITY IMPROVEMENT IN MPSOCS Speakers: Anup Das¹, Akash Kumar¹, Bharadwaj Veeravalli¹, Cristiana Bolchini² and Antonio Miele² ¹National University of Singapore, SG; ²Politecnico di Milano, IT Abstract Energy and reliability optimization are two of the most critical objectives for the synthesis of multiprocessor systems-on-chip (MPSoCs). Task mapping has shown significant promise as a low cost solution in achieving these objectives as standalone or in tandem as well. This paper proposes a multi-objective design space exploration to determine the mapping of tasks of an application on a multiprocessor system and voltage/frequency level of each task (exploiting the DVFS capabilities of modern processors) such that the reliability of the platform is improved while fulfilling the energy budget and the performance constraint set by system designers. In this respect, the reliability of a given MPSoC platform incorporates not only the impact of voltage and frequency on the aging of the processors (wear-out effect) but also on the susceptibility to soft-errors -- a joint consideration missing in all existing works in this domain. Further, the proposed exploration also incorporates soft-error tolerance by selective replication of tasks, making the proposed approach an interesting blend of reactive and proactive fault-tolerance. The combined objective of minimizing core aging together with the susceptibility to transient faults under a given performance/energy budget is solved by using a multi-objective genetic algorithm exploiting tasks' mapping, DVFS and selective replication as tuning knobs. Experiments conducted with real-life and synthetic application graphs clearly demonstrate the advantage of the proposed approach.
15:30	3.7.3	DARP: DYNAMICALLY ADAPTABLE RESILIENT PIPELINE DESIGN IN MICROPROCESSORS Speakers: Hu Chen, Sanghamitra Roy and Koushik Chakraborty, Utah State University, US Abstract In this paper, we demonstrate that the sensitized path delays in various microprocessor pipe stages exhibit intriguing temporal and spatial variations during the execution of real world applications. To effectively exploit these delay variations, we propose Dynamically Adaptable Resilient Pipeline (DARP)--a series of runtime techniques to boost power performance efficiency and fault tolerance in a pipelined microprocessor. DARP employs early error prediction to avoid a major portion of timing errors. Using a rigorous circuit-architectural infrastructure, we demonstrate substantial improvements in the performance (9.4-20%) and energy efficiency (6.4-27.9%), compared to state-of-the-art techniques.
16:00	IP1-24, 45	A FAULT DETECTION MECHANISM IN A DATA-FLOW SCHEDULED MULTITHREADED PROCESSOR Speakers: Jian Fu¹, Qiang Yang¹, Raphael Poss¹, Chris Jesshope¹ and Chunyuan Zhang² ¹University of Amsterdam, NL; ²National University of Defense Technology, CN Abstract This paper designs and implements Redundant Multi-Threading (RMT) in a Data-flow scheduled Multi-Threaded (DMT) multicore processor, called Data-flow scheduled Redundant Multi-Threading (DRMT). It also presents Asynchronous Output Comparison (AOC) for RMT techniques to avoid fault detection related inter-core communication and alleviate the performance and hardware overheads induced by output comparison. Results show that the performance overhead of DRMT is less than 60% even when the number of threads is four times the number of processing elements. Also, the performance and hardware overheads of AOC are insignificant.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

3.8 Hot Topic: Mission Profile Aware Design - The Solution for Successful Design of Tomorrows Automotive Electronics

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Exhibition Theatre

Organisers:
Goeran Jerke, Robert Bosch GmbH, DE
Oliver Bringmann, University of Tuebingen, DE

Chair:
Goeran Jerke, Robert Bosch GmbH, DE

Co-Chair:
Oliver Bringmann, University of Tuebingen, DE

In order to benefit from modern automotive semiconductor technologies, application robustness must now be considered as a design target. This includes the consistent consideration of environmental stress conditions and functional loads, which are formalized in so-called "mission profiles". We introduce the motivation for using mission profiles from an OEM and Tier n perspective. Additionally, we introduce the mission profile aware design flow and present several application scenarios.

Time	Label	Presentation Title Authors
14:30	3.8.1	MISSION PROFILES - SOLUTION OR CHALLENGE? THE OEM PERSPECTIVE Speaker: Ulrich Abelein, AUDI AG, DE Abstract The original equipment manufacturer (OEM) is driven by its own quality and innovation goals to implement the newest available and suitable semiconductor technologies. In this talk the OEM perspective with regard to mission profiles will be presented and discussed. The difference between the current use of standard sets of requirements and a mission profile approach will be evaluated. This will be demonstrated by actual and upcoming challenges in the automotive industry. Therefore the use of up-to-date technologies in accordance with declining maturing and product development times has to be considered. Mission profiles become increasingly important as they provide the opportunity to cover these requirements. A necessary step to assemble a mission profile is the derivation of all relevant functional load and environmental stress conditions of an electronic component and its sub-components. Therefore a formalized communication within the supply chain is necessary for a consistent availability of all relevant data. Dominant loads must be determined and appropriately allocated. One challenge is to consider the influence of singular events on sporadic failures. Another challenge is the different time frame of the product engineering process of OEM, Tier 1 and semiconductor manufacturer. Despite the existence of multiple challenges to derive mission profiles, the mission profiles approach shows great promise to enable the design of robust electronic components for specific applications even in the presence of yet immature technologies.
15:00	3.8.2	MISSION PROFILE AWARE IC DESIGN - A CASE STUDY Speakers: Goeran Jerke¹ and Andrew Kahng² ¹Robert Bosch GmbH, DE; ²University of California, San Diego, US Abstract Consistent consideration of mission profiles throughout a supply chain is essential for the development of robust electronic components. Consideration of mission profiles is still mainly a manual task today despite rapidly decreasing robustness margins in modern automotive semiconductor technologies. Mission profile awareness aids the automation of robustness aware design by formalizing and partially automating the generation, transformation, propagation and usage of all component-specific functional loads and environmental conditions for design implementation and validation. In addition, it aids the development of electronic components in yet immature technologies or in technologies with tight parameter variation bounds. This paper introduces the general concept, requirements and context of mission profile aware design. The general design approach is presented along with key differences and enhancements to existing design approaches. A case study focusing on mission profile usage and electromigration failure avoidance is presented to demonstrate various aspects of mission profile aware design.
15:30	3.8.3	MISSION PROFILE AWARE ROBUSTNESS ASSESSMENT OF AUTOMOTIVE POWER DEVICES Speakers: Thomas Nirmaier¹, Andreas Burger², Manuel Harrant¹, Alexander Viehl², Oliver Bringmann³, Wolfgang Rosenstiel³ and Georg Pelz¹ ¹Infineon Technologies AG, DE; ²FZI Research Center for Information Technology, DE; ³University of Tuebingen, DE Abstract In this paper we propose to exploit so-called Mission Profiles to address increasing requirements on safety and power efficiency for automotive power ICs. These Mission Profiles constrain the required device performance space to valid application scenarios. Mission Profile data can be represented in arbitrary forms like temperature histograms or cumulated drive cycle data. Hence, the derivation of realistic verification scenarios at device level requires the generation of environmental properties such as temperatures, board net conditions or currents. For the assessment of real application robustness we present a methodology to extract finite state machines from measured vehicle data and integrate them into Mission Profiles. Subsequently, Markov processes are derived from these finite state machines in order to automatically generate Mission Profile compliant test scenarios for the design and verification process. As a motivating example we show industry fault cases in which missing application fitness to power transient variations finally results in device failure. Verification results based on lab data are outlined and show the benefits of a fully mission profile driven IC verification flow.
15:45	3.8.4	APPLICATION OF MISSION PROFILES TO ENABLE CROSS-DOMAIN CONSTRAINT-DRIVEN DESIGN Speakers: Carolin Katzschke¹, Marc-Philipp Sohn¹, Markus Olbrich¹, Volker Meyer zu Bexten², Markus Tristl² and Erich Barke¹ ¹Institute of Microelectronic Systems, Leibniz Universität Hannover, DE; ²Infineon Technologies AG, DE Abstract Mission Profiles contain top-level stress information for the design of future systems. These profiles are refined and transformed to design constraints. We present methods to propagate the constraints between design domains like package and chip. We also introduce a cross-domain methodology for our corresponding constraint transformation system ConDUCT. The proposed methods are demonstrated on the basis of an automotive analog/mixed-signal application.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
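
Illustrative sketch for 3.8.3: the abstract describes extracting finite state machines from measured vehicle data and deriving Markov processes from them to generate test scenarios. The snippet below is only a minimal illustration of that general idea, not the authors' method. The state names, the example trace and the helper names `estimate_transitions` and `generate_scenario` are all hypothetical.

```python
import random
from collections import defaultdict


def estimate_transitions(trace):
    """Estimate Markov transition probabilities from an observed state trace."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(trace, trace[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for cur, successors in counts.items():
        total = sum(successors.values())
        probs[cur] = {nxt: n / total for nxt, n in successors.items()}
    return probs


def generate_scenario(probs, start, length, seed=0):
    """Random walk over the estimated chain; stops early at a state with no successor."""
    rng = random.Random(seed)
    state = start
    scenario = [state]
    while len(scenario) < length and state in probs:
        successors = list(probs[state])
        weights = [probs[state][s] for s in successors]
        state = rng.choices(successors, weights=weights, k=1)[0]
        scenario.append(state)
    return scenario


# Hypothetical operating states, as if labelled from measured vehicle data.
trace = ["idle", "crank", "run", "run", "load_dump",
         "run", "idle", "crank", "run", "idle"]
probs = estimate_transitions(trace)
scenario = generate_scenario(probs, "idle", 8, seed=1)
print(scenario)
```

Every generated scenario uses only transitions that occurred in the trace, which is one simple way to keep the generated test cases within the observed application space.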

UB03 Session 3

Date: Tuesday 25 March 2014
Time: 15:00 - 17:30
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB03.01	LARA: THE LARA COMPILER SUITE Authors: Joao Bispo, Pedro Pinto, Ricardo Nobre, Tiago Carvalho and Joao Cardoso, Universidade do Porto, PT Abstract LARA is an aspect-oriented programming (AOP) language which allows the description of sophisticated code instrumentation schemes, advanced mapping strategies including conditional decisions, based on hardware/software resources, and of sophisticated sequences of compiler transformations. Furthermore, LARA provides mechanisms for controlling all elements of a toolchain in a consistent and systematic way, using a unified programming interface. We present three compiler tools developed around the LARA technology: MATISSE, MANET and ReflectC. MATISSE is a compiler which 1) allows analyses and transformations on MATLAB code and 2) generates C code from the MATLAB code. MATISSE can be fully controlled through LARA aspects, which can define the type and shape of MATLAB variables, specify code insertion/removal actions, and define specialization directives and other additional information. MATISSE can output transformed MATLAB code and specialized C code. The knowledge provided by the LARA aspects allows MATISSE to generate C tailored to specific targets (e.g., use statically declared arrays to be compliant with high-level synthesis tools such as Catapult C). MANET is a source-to-source compiler for ANSI C based on Cetus, and is controlled using LARA aspects. MANET leverages the expressiveness and modularity of LARA to query and manipulate the Cetus AST, providing an easy compilation flow with the main goal of code instrumentation and code transformations. LARA aspects allow for a simple selection of program elements in the code which can be analyzed or transformed, by either consulting their attributes or applying actions.
Thus, MANET can be used to provide information reports based on compiler analyses, to implement sophisticated code instrumentation strategies, or to perform code optimizations and transformations. ReflectC is a C compiler based on CoSy's compiler framework. CoSy's configurability and retargetability make ReflectC particularly effective for exploration of compiler transformations and optimizations on possible architecture variations, and it is being used for hardware/software co-design and design space exploration (DSE). We will present demos of the tools and the use of LARA aspects and strategies to guide our suite of compilation tools providing: 1) C code generation from MATLAB code, according to information provided by LARA aspects; 2) Instrumentation of C code to be used for collecting specific compile and runtime information (e.g., execution time, range of values for specific variables, custom profiling); 3) User-controlled compiler optimizations targeting several architectures and DSE of sequences of compiler optimizations bearing in mind performance improvements. In addition to presenting examples for each of the tools of the LARA compilation suite, we show an execution of the complete toolchain, controlled by LARA aspects. More information ...
UB03.02	AN AUTOMATED DESIGN FLOW FOR FAST PROTOTYPING OF SIMULINK MODELS ONTO MPSOC Authors: Francesco Robino and Johnny Öberg, Royal Institute of Technology, SE Abstract Simulink is a modelling environment suitable for modelling embedded systems at system level. However, there is no standard way to rapidly prototype Simulink models onto modern multiprocessor systems-on-chip (MPSoCs). In this demonstration we show how our NoC System Generator tool can be used as part of an automated platform-based design flow to synthesize a Simulink model to a network-on-chip based MPSoC implementation on FPGA. The performance of the generated prototype scales with the number of processors. More information ...
UB03.03	PATN: A PERFORMANCE ANALYSIS TOOL FOR NOC Authors: Yang Chen and Zhonghai Lu, KTH Royal Institute of Technology, SE Abstract With more processors integrated onto a single chip and more and more time-sensitive applications added to on-chip systems, performance bound analysis becomes essential for QoS Network-on-Chip (NoC) design and evaluation. To provide reliable and automated analysis for QoS NoCs, we propose PATN (Performance Analysis Tool for NoC), which automatically computes the end-to-end delay bounds of data flows and the backlog bounds of buffers for NoCs with arbitrary topology. PATN is based on network calculus, which rests on solid mathematical foundations and provides well-guaranteed accuracy of the results. Network-calculus-based analysis has been successfully employed for various communication networks, such as SpaceWire, AFDX, etc. For example, Airbus adopted and approved network-calculus-based analysis for certification of its A380 aircraft. In this demonstration, we give an overview of PATN in two parts. First, we explain the architecture and main functions, and show the workflow and the printed log by analysing the end-to-end delay bound of a data flow in a simple network. The log shows that the analysis follows the theoretical methodology exactly and hence obtains correct and tight results, as good as the theory can achieve. Second, we use PATN to analyse the delay bounds and backlog bounds for three NoCs with different topologies -- binary tree, mesh, and a hierarchical topology of binary tree and mesh. The analyses demonstrate the computation speed and scalability of PATN. Moreover, the delay bounds computed with different configuration parameters of the flows and routers are compared, showing how the delay bound is affected by the parameters. More information ...
UB03.04	COMPILER FOR MAPPING STREAM PROCESSING APPLICATIONS ONTO REAL-TIME HETEROGENEOUS MULTIPROCESSOR SYSTEMS Authors: Stefan Geuns, Berend Dekens, Philip Wilmanns, Joost Hausmans, Guus Kuiper and Marco Bekooij, University of Twente, NL Abstract Heterogeneous multiprocessor systems are employed for power-efficiency reasons in wearable software defined radios. These systems are hardware cost-effective and deliver superior performance compared to their homogeneous counterparts. However, these systems are notoriously hard to program without tool support, which makes it desirable to simplify programming with the help of an optimizing multiprocessor compiler for stream processing applications. This demonstration shows our multiprocessor compiler for mapping real-time stream processing applications onto our real-time heterogeneous multi-core system. The applications are described as sequential programs and are compiled into parallel task graphs. Buffer capacities are computed using dataflow analysis techniques given the real-time constraints of the application. Our multi-core system contains 16 MicroBlaze processor cores as well as two hardware accelerators and is prototyped on a Xilinx Virtex-6 FPGA. A connection-less communication ring is used for inter-processor communication. Our system is equipped with an analog RF front-end, which enables us to demonstrate PAL-video reception and decoding. More information ...
UB03.05	HWDEBLUR: DESIGN OF A HIGH PERFORMANCE CORE FOR REMOVING BLUR EFFECT ON IMAGES Authors: Giuseppe Airo' Farulla, Giulio Gambardella, Marco Indaco, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT Abstract This work aims at developing a high performance FPGA-based IP-core able to perform a deblurring algorithm in real-time. Modern approaches to deblurring usually either only handle simple types of blur, or need heavy user interaction. Moreover, they usually require several minutes (or even whole hours) to process a single image. Our purpose is to study the current state-of-the-art and identify the best deblurring algorithms that are suitable for a hardware implementation. The selected algorithm is optimized and implemented in hardware in order to perform the deblurring task with the highest possible performance. More information ...
UB03.06	PHARAON: PARALLEL AND HETEROGENEOUS ARCHITECTURES FOR REAL-TIME APPLICATIONS Authors: Luciano Lavagno¹, Mihai Lazarescu¹, Hector Posadas² and Eugenio Villar² ¹Politecnico di Torino, IT; ²Universidad de Cantabria, ES Abstract In this demo, we will present the work-in-progress of the EU FP7 PHARAON project, started in September 2011. The first objective of the project is the development of new techniques and tools capable of assisting the designer in the development of parallel embedded systems, from executable specifications to target-specific implementation and debugging on a multicore platform. This tool chain offers and implements several parallelization strategies, reflecting the functional and non-functional constraints of the system, and guiding the designer through incremental parallelization and adaptation steps. The second objective of the project is to develop monitoring and control techniques in the middleware of the system capable of automatically adapting platform services to application requirements and therefore reducing power consumption transparently. The demo will cover specifically: - the software parallelization tool suite, - the parallel software modeling and code generation suite. More information ...
UB03.07	COMPSOC: VIRTUAL EXECUTION PLATFORMS FOR MIXED TIME-CRITICALITY APPLICATIONS Author: Kees Goossens, TU Eindhoven, NL Abstract System-on-Chip (SOC) design gets increasingly complex, as a growing number of applications are integrated in such systems. These applications have mixed time-criticality, i.e., some have firm-, some soft-, and others non-real-time requirements. Executing such a mix of applications on a SOC poses several challenges. First, to reduce cost, platform resources, e.g., processors, interconnect, memories, are shared between applications. However, sharing causes interference between applications, making their behaviors interdependent. This results in two problems for SOC design and verification: 1) accurate system-level simulation and several approaches to formal verification are infeasible, because of the explosion in the number of possible combinations of applications, inputs, and resource states and 2) verification becomes a circular process that must be repeated if an application is added, removed, or modified, making integration and verification dominant parts of SOC development, in terms of time and money. The CompSOC platform addresses these problems by executing each application on an independent virtual execution platform (VEP). The VEPs are composable, i.e., cannot affect each other's behaviors. In the temporal domain an application's actual execution never varies by even a single clock cycle. Similarly, the energy and power behaviors of applications are also composable. As a result, applications can be designed, developed, verified, and executed in isolation. The VEPs are also predictable, meaning that all interference is bounded. This makes them virtualized also in terms of performance bounds, which enables firm real-time applications to be verified using formal performance analysis frameworks. The CompSOC platform uses the CoMiK microkernel to implement virtual processors on each processor through temporal partitioning.
Each application can use its own operating system (e.g. Compose, μcOS-III) and model of computation (e.g. CSDF, KPN, TT) in its VEP, to suit its level of time criticality. As more applications are integrated on a single SOC, the need arises for more dynamic behaviour. The system should be able to start, modify and stop applications at run time without affecting running applications. For this purpose the CompSOC platform has been extended with a predictable and composable resource management framework. It manages application bundles that contain 1) an application in the form of executables (ELFs on multiple processors), and also 2) the specifications of the (one or more) particular VEPs that the application executes in, consisting of virtual processors, NOC connections, virtualised memories, etc. At run time, the resource management framework can dynamically load and start application bundles by creating a VEP and then loading, booting, and executing an application within it. VEPs can also be modified, stopped, and deleted at run time. Our University Booth will present virtual-execution-platform and application-bundle concepts using an interactive demonstrator. It will show that the CompSOC has been extended with dynamic functionality, without sacrificing its key strengths: composability and predictability. We will demonstrate this through the use of the resource management framework and application bundles, showing that we can create, modify and delete virtual execution platforms running a mixed time-criticality application dynamically at run-time. More information ...
UB03.08	A HOLISTIC APPROACH TO POWER MANAGEMENT FOR ENERGY HARVESTING EMBEDDED SYSTEMS Authors: Kyungsoo Lee, Hideki Takase and Tohru Ishihara, Kyoto University, JP Abstract We present a holistic approach to maximizing the energy efficiency of energy harvesting embedded systems which consist of a processor system and an energy harvesting system. A power management program integrated on a real-time OS optimally switches the operation mode of the processor and the configuration of the energy harvesting system according to the workload of the processor and the harvesting situation. The demonstration will show that our prototype system, consisting of our processor chip and harvesting system board, runs stably using harvested energy only. The processor has multiple cores, each with a different performance level, to improve the energy efficiency of computation. The energy harvesting board has a high transfer efficiency to reduce the power loss. The entire system is controlled efficiently by our power management program implemented on Toppers OS. More information ...
UB03.09	FAULTIFY: PROBABILISTIC CIRCUIT FAULT EMULATION Authors: David May and Walter Stechele, TUM, DE Abstract We demonstrate an FPGA-based probability-aware fault emulator and its corresponding algorithms in the context of a real-time H.264 decoder. The demo will show that reliability constraints can be relaxed inside the circuit without noticeable degradation of the image quality when carefully investigating where the constraints can be relaxed. We will show how this investigation can be done using our emulator and we will show the effect of a relaxed robustness of the circuit in real-time. More information ...
UB03.10	RTL+: DESIGN ENVIRONMENT: WALK BEFORE YOU RUN. Authors: Somayeh Sadeghi-Kohan, Behnaz Pourmohseni, Amir Reza Nekooei, Hanieh Hashemi, Hamed Najafi Haghi and Zainalabedin Navabi, University of Tehran, IR Abstract To enable development of high level designs with hardware correspondence, synthesizability must be satisfied in a top-down manner. Thus, in this work, instead of using TLM-2.0, which is not established for synthesis, we start with a level above RT level, "RTL+". RTL+ basically uses TLM-1.0 channels and includes abstract communication and handshaking that are mainly hidden from the designer. We develop a package of SystemC channels with hardware correspondence (synthesizable HDL) for the communication between various cores (with simple interfaces) and standard buses. More information ...
17:30	End of session
18:30	Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.
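
Illustrative sketch for UB03.03: the PATN abstract refers to end-to-end delay bounds and backlog bounds derived with network calculus. The snippet below shows only the textbook closed-form bounds for a token-bucket-constrained flow (burst sigma, rate rho) crossing a chain of rate-latency servers (rate R, latency T). It is not PATN's implementation, and the function names and all numbers are hypothetical.

```python
def concatenate(servers):
    """Concatenate rate-latency servers given as (rate, latency) pairs.

    The min-plus convolution of rate-latency curves is again rate-latency,
    with the minimum rate and the sum of the latencies.
    """
    rate = min(r for r, _ in servers)
    latency = sum(t for _, t in servers)
    return rate, latency


def delay_bound(sigma, rho, rate, latency):
    """Worst-case delay of a (sigma, rho) flow through a rate-latency server."""
    if rho > rate:
        raise ValueError("flow rate exceeds service rate: no finite bound")
    return latency + sigma / rate


def backlog_bound(sigma, rho, rate, latency):
    """Worst-case backlog of a (sigma, rho) flow through a rate-latency server."""
    if rho > rate:
        raise ValueError("flow rate exceeds service rate: no finite bound")
    return sigma + rho * latency


# Hypothetical flow: burst of 8 flits, sustained rate 0.2 flits/cycle,
# crossing three routers modelled as (flits/cycle, cycles of latency).
routers = [(1.0, 3), (0.5, 4), (1.0, 3)]
R, T = concatenate(routers)
print(delay_bound(8, 0.2, R, T))    # end-to-end delay bound in cycles
print(backlog_bound(8, 0.2, R, T))  # backlog bound in flits, whole path as one server
```

Bounding the delay on the concatenated curve charges the burst only once, which is why an end-to-end analysis can be tighter than summing per-router bounds.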

IP1 Interactive Presentations

Date: Tuesday 25 March 2014
Time: 16:00 - 16:30
Location / Room: Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with each IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session, the 'Best IP of the Day' award is given.

Label	Presentation Title Authors
IP1-1	SAFE: SECURITY-AWARE FLEXRAY SCHEDULING ENGINE Speakers: Gang Han¹, Haibo Zeng², Yaping Li³ and Wenhua Dou¹ ¹National University of Defense Technology, CN; ²McGill University, CA; ³The Chinese University of Hong Kong, CN Abstract In this paper, we propose SAFE (Security Aware FlexRay scheduling Engine), to provide a problem definition and a design framework for FlexRay static segment schedule to address the new challenge on security. From a high level specification of the application, the architecture and communication middleware are synthesized to satisfy security requirements, in addition to extensibility, costs, and end-to-end latencies. The proposed design process is applied to two industrial case studies consisting of a set of active safety functions and an X-by-wire system respectively.
IP1-2	TRANSIENT ERRORS RESILIENCY ANALYSIS TECHNIQUE FOR AUTOMOTIVE SAFETY CRITICAL APPLICATIONS Speakers: Sujan Pandey and Bart Vermeulen, NXP Semiconductors, NL Abstract When a single bit is flipped as a result of a transient error in an electronic circuit, its effect can have a severe impact if the circuit is deployed in safety critical domains such as automotive, aeronautics, and industrial automation. In the design phase it is therefore essential to evaluate, and where necessary improve, the resilience of a circuit to all possible transient errors. In this paper, we present a method to analyze the transient error resiliency of a digital circuit. This method is based on an analytical model. It models a transient error as a random function and finds the vulnerable number of bits for each node. We perform a case study on a circuit implementation of a well-known adaptive filter algorithm. The results from the analytical and simulation models show that the analytical model is accurate enough to estimate the effects of transient errors on the performance of a digital circuit. Our analytical method also reduces the run time significantly in a design phase.
IP1-3	MODEL BASED HIERARCHICAL OPTIMIZATION STRATEGIES FOR ANALOG DESIGN AUTOMATION Speakers: Engin Afacan¹, Gunhan Dundar¹, Faik Baskaya¹, Simge Ay¹ and Francisco Fernandez² ¹Bogazici University, TR; ²Universidad de Sevilla, ES Abstract The design of complex analog circuits by using flat optimization-based approaches is inefficient, or even impossible, due to the high number of design variables and the growth of the cost of performance evaluation with the circuit size. Over the past two decades, top-down hierarchical design approaches have been developed and applied. They are based on hierarchical circuit decomposition and specification transmission from top-level to lower level blocks. However, such specification transmission is usually performed with little knowledge of the feasibility of the specifications, leading, therefore, to costly redesign iterations. Even if the specification transmission is successful, there is no guarantee that it is optimal in terms of, e.g., power consumption or area occupation. To alleviate this problem, two novel model-based hierarchical synthesis methods are proposed in this paper: Model-Based Hierarchical Optimization (MBHO) and Improved Model-Based Hierarchical Optimization (IMBHO). They are based on the concurrent design at higher and lower hierarchical levels and appropriate communication between the different processes. Experimental results on a filter example comparing the new approaches and the conventional top-down design approach are provided.
IP1-4	A NOVEL LOW POWER 11-BIT HYBRID ADC USING FLASH AND DELAY LINE ARCHITECTURES Speakers: Hsun-Cheng Lee and Jacob Abraham, the University of Texas at Austin, US Abstract This paper presents a novel low power 11-bit hybrid ADC using flash and delay line architectures, where a 4-bit flash ADC is followed by a 7-bit delay-line ADC. This hybrid ADC inherits accuracy and power efficiency from flash ADCs and delay-line ADCs, respectively. Also, in order to reduce the power of the first stage flash ADC, a power-saving technique is adopted by biasing the DC tail current of the pre-amplifiers at 5μA in stand-by mode instead of the operational current of 47μA. The hybrid ADC was designed and simulated in a commercial 65nm process. With a 1.1 V supply and 100 MS/s, the ADC achieves an SNDR of 60 dB and consumes 1.6 mW, which results in a figure of merit (FOM) of 19.4 fJ/conversion-step without any calibration technique. Also, Monte Carlo simulations are performed with a 3σ device mismatch for the SNDR estimation, and the SNDR is observed to be better than 58.5 dB.
IP1-5	SEMI-SYMBOLIC ANALYSIS OF MIXED-SIGNAL SYSTEMS INCLUDING DISCONTINUITIES Speakers: Carna Radojicic, Christoph Grimm, Javier Moreno and Xiao Pan, TU Kaiserslautern, DE Abstract The paper describes an approach for semi-symbolic analysis of mixed-signal systems that contain discontinuous functions, e.g. due to modeling comparators. For modeling and semi-symbolic simulation, we use extended Affine Arithmetic. Affine Arithmetic is currently limited to accurate analysis of linear functions and mild non-linear functions, but not yet discontinuities. In this paper we extend the approach to also handle discontinuities. For demonstration, we symbolically analyze a Σ∆-modulator.
IP1-6	(Best Paper Award Candidate) NOVEL CIRCUIT TOPOLOGY SYNTHESIS METHOD USING CIRCUIT FEATURE MINING AND SYMBOLIC COMPARISON Speakers: Cristian Ferent and Alex Doboli, Stony Brook University, US Abstract This paper presents a reasoning-based approach to analog circuit synthesis using ordered node clustering representations (ONCR) to describe alternative circuit features and symbolic circuit comparison to characterize performance trade-offs of synthesized solutions. Case studies illustrate application of the proposed methods to topology selection and refinement.
IP1-7	AN EMBEDDED OFFSET AND GAIN INSTRUMENT FOR OPAMP IPS Speakers: Jinbo Wan and Hans Kerkhoff, CAES-TDT, CTIT, University of Twente, NL Abstract Analog and mixed-signal IPs are increasingly required to use digital fabrication technologies and are deeply embedded into systems-on-chip (SoCs). These developments impose more requirements and challenges on analog testing methodologies. Traditional analog testing methods suffer from limited accessibility and control with regard to these embedded analog circuits in SoCs. As an alternative, an embedded instrument for analog OpAmp IP tests is proposed in this paper. It can provide the exact gain and offset values of OpAmps instead of only a pass/fail result. Moreover, it is a non-invasive monitor and can work online without isolating the DUT OpAmp from its surrounding feedback networks. Nor does it require accurate test stimuli. In addition, the monitor can remove its own offsets without additional complex self-calibration circuits. All self-calibrations are completed in the digital domain after each measurement in real time. Therefore it is also suitable for aging-sensitive applications, in which the monitor may suffer from aging mechanisms and has additional offset drifts as well. The monitor measurement range for offset is from 0.2mV to 70mV, and for gain it is from 0dB to 40dB. The error for offset measurements can be 10% of the measurement value with plus/minus 0.1mV, and -2.5dB for gain measurements.
IP1-8	EVX: VECTOR EXECUTION ON LOW POWER EDGE CORES Speakers: Milovan Duric¹, Oscar Palomar¹, Aaron Smith², Osman Unsal¹, Adrian Cristal¹, Mateo Valero¹ and Doug Burger² ¹Barcelona Supercomputing Center, ES; ²Microsoft Research, US Abstract In this paper, we present a vector execution model that provides the advantages of vector processors on low power, general purpose cores, with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases the efficiency and hardware resource utilization. We use a modest dual issue core based on an Explicit Data Graph Execution (EDGE) architecture to implement our approach, called EVX. Unlike most DLP accelerators, which utilize additional hardware and increase the complexity of low power processors, EVX leverages the available resources of EDGE cores and, at minimal cost, allows for specialization of the resources. EVX adds control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x compared to a scalar baseline and outperforms multimedia SIMD extensions.
IP1-9	PROGRAM AFFINITY PERFORMANCE MODELS FOR PERFORMANCE AND UTILIZATION Speakers: Ryan Moore and Bruce Childers, University of Pittsburgh, US Abstract Multithreaded applications have a wide variety of behavior, causing complex interactions with today's chip multiprocessor machines. Application threads may have large private working sets, and may compete for cache space and memory bandwidth. These threads benefit from large private caches. Other threads may share data or communicate, and thus, execute more quickly if using shared caches. Many applications fall somewhere in between, requiring careful thread-to-core assignments to maximize performance. Yet because of the large number of thread-to-core assignments on today's chip multiprocessors, it is time and energy prohibitive to exhaustively try and determine the best assignment. In this paper, we present and demonstrate application performance models that predict application performance given a proposed thread-to-core assignment. We show how these models can be quickly built and used to select thread-to-core assignments for multiple programs and to improve system utilization.
IP1-10	ADVANCED SIMD: EXTENDING THE REACH OF CONTEMPORARY SIMD ARCHITECTURES Speakers: Matthias Boettcher¹, Giacomo Gabrielli², Mbou Eyole², Alastair Reid² and Bashir M. Al-Hashimi¹ ¹University of Southampton, GB; ²ARM Ltd., GB Abstract SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g. Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity. This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed. We developed an ARMv7 NEON based ISA extension (ARGON), augmented a cycle accurate simulation framework for it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.
IP1-11	A TIGHTLY-COUPLED HARDWARE CONTROLLER TO IMPROVE SCALABILITY AND PROGRAMMABILITY OF SHARED-MEMORY HETEROGENEOUS CLUSTERS Speakers: Paolo Burgio¹, Robin Danilo², Andrea Marongiu³, Philippe Coussy⁴ and Luca Benini⁵ ¹University of Bologna, Université de Bretagne-Sud, IT; ²Université de Bretagne-Sud, FR; ³University of Bologna, IT; ⁴Université de Bretagne-Sud / Lab-STICC, FR; ⁵Università di Bologna, IT Abstract Modern designs for embedded many-core systems increasingly include application-specific units to accelerate key computational kernels with orders-of-magnitude higher execution speed and energy efficiency compared to software counterparts. A promising architectural template is based on heterogeneous clusters, where simple RISC cores and specialized HW units (HWPU) communicate in a tightly-coupled manner via L1 shared memory. Efficiently integrating processors and a high number of HW Processing Units (HWPUs) in such a system poses two main challenges, namely, architectural scalability and programmability. In this paper we describe an optimized Data Pump (DP) which connects several accelerators to a restricted set of communication ports, and acts as a virtualization layer for programming, exposing FIFO queues to offload "HW tasks" to them through a set of lightweight APIs. In this work, we aim at optimizing both these mechanisms, to reduce module area and to make the programming sequence easier and lighter, respectively.
IP1-12	INFORMER: AN INTEGRATED FRAMEWORK FOR EARLY-STAGE MEMORY ROBUSTNESS ANALYSIS Speakers: Shrikanth Ganapathy¹, Ramon Canal¹, Dan Alexandrescu², Enrico Costenaro², Antonio Gonzalez³ and Antonio Rubio¹ ¹Universitat Politecnica de Catalunya, ES; ²iRoC Technologies, FR; ³Intel and Universitat Politecnica de Catalunya, ES Abstract With the growing importance of parametric (process and environmental) variations in advanced technologies, it has become a serious challenge to design reliable, fast and low-power embedded memories. Adopting a variation-aware design paradigm requires a holistic perspective of memory-wide metrics such as yield, power and performance. However, accurate estimation of such metrics is largely dependent on circuit implementation styles, technology parameters and architecture-level specifics. In this paper, we propose a fully automated tool - INFORMER that helps high-level designers estimate memory reliability metrics rapidly and accurately. The tool relies on accurate circuit-level simulations of failure mechanisms such as ageing, soft-errors and parametric failures. The obtained statistics can then help couple low-level metrics with higher-level design choices. A new technique for rapid estimation of low-probability failure events is also proposed. We present three use-cases of our prototype tool to demonstrate its diverse capabilities in autonomously guiding large SRAM based robust memory designs.
IP1-13	WEAR-OUT ANALYSIS OF ERROR CORRECTION TECHNIQUES IN PHASE-CHANGE MEMORY Speakers: Caio Hoffman, Luiz Ramos, Rodolfo Azevedo and Guido Araújo, University of Campinas, BR Abstract Phase-Change Memory (PCM) is a new memory technology and a possible replacement for DRAM, whose scaling limitations require new lithography technologies. Despite being promising, PCM has limited endurance (its cells withstand roughly 10^8 bit-flips before failing), which prompted the adoption of Error Correction Techniques (ECTs). However, previous lifetime analyses of ECTs did not consider the difference between the bit-flip frequencies of data and code bits, which may lead to inaccurate wear-out analyses for the ECTs. In this work, we improve the wear-out analysis of PCM by modeling and analyzing the bit-flip probabilities of five ECTs. Our models also enable an accurate estimation of energy consumption and analysis of the endurance-energy trade-off for each ECT.
IP1-14	APPROXIMATING THE AGE OF RF/ANALOG CIRCUITS THROUGH RE-CHARACTERIZATION AND STATISTICAL ESTIMATION Speakers: Doohwang Chang¹, Sule Ozev¹, Ozgur Sinanoglu² and Ramesh Karri³ ¹Arizona State University, US; ²New York University Abu Dhabi, AE; ³Polytechnic Institute of New York University, US Abstract Counterfeit ICs have become an issue for semiconductor manufacturers due to impacts on their reputation and lost revenue. Counterfeit ICs are either products that are intentionally mislabeled or legitimate products that are extracted from electronic waste. The former is easier to detect whereas the latter is harder since they are identical to new devices but display degraded performance due to environmental and use stress conditions. Detecting counterfeit ICs that are extracted from electronic waste requires an approach that can approximate the age of manufactured devices based on their parameters. In this paper, we present a methodology that uses information on both fresh and aged ICs and tries to distinguish between the fresh and aged population based on an estimate of the age. Since analog devices age mainly due to their bias stress, input signals play less of a role. Hence, it is possible to use simulation models to approximate the aging process, which would give us access to a large population of aged devices. Using this information, we can construct a statistical model that approximates the age of a given circuit. We use a Low noise amplifier (LNA) and an NMOS LC oscillator to demonstrate that individual aged devices can be accurately classified using the proposed method.
IP1-15	PACKAGE GEOMETRIC AWARE THERMAL ANALYSIS BY INFRARED-RADIATION THERMAL IMAGES Speakers: Jui-Hung Chien¹, Hao Yu², Ruei-Siang Hsu³, Hsueh-Ju Lin³ and Shih-Chieh Chang³ ¹Industrial Technology Research Institute, TW; ²None, TW; ³NTHU, TW Abstract Since packages affect the amount of heat transfer, it is important to include package and heat sink in thermal analysis. In this paper, we study the full-chip thermal response with different packages. We first discuss the difficulties of obtaining accurate package models for simulation. To help designers perform thermal simulation with different packages, we propose to use a matrix called the package-transfer matrix, which can transform a temperature profile of one package to another temperature profile of the desired package. To estimate and verify a package-transfer matrix, we propose an efficient method which uses Infrared Radiation (IR) images from two carefully designed test chips with PBGA packages. Our experimental results show that the default package model CBGA in HotSpot can be accurately transferred to any other package through the package-transfer matrix.
IP1-16	COST-EFFECTIVE DECAP SELECTION FOR BEYOND DIE POWER INTEGRITY Speakers: Yi-En Chen¹, Tu-Hsung Tsai¹, Shi-Hao Chen² and Hung-Ming Chen¹ ¹Department of Electronics Engineering National Chiao Tung University Hsinchu, Taiwan 300, R.O.C., TW; ²Global Unichip Corp, Hsinchu, Taiwan, TW Abstract In designing reliable power distribution networks (PDN) for power integrity (PI), it is essential to stabilize the voltage supply to devices on chip. We usually employ decoupling capacitors (decaps) to suppress the noise generated by the switching of devices. There have been numerous prior works on how to select/insert decaps in chip, package, or board to maintain PI; however, optimal decap selection is usually not applicable due to design budget and manufacturability. Moreover, design cost is seldom touched or mentioned. In this research, we propose an efficient methodology, "PDCPSO", to automatically optimize the selection of available decaps. This algorithm not only takes advantage of particle swarm optimization (PSO) to stochastically search the design space, but also takes the most effective range of decaps into consideration to outperform the basic PSO. We apply this to three real package designs and the results show that, compared to the original decap selection by rules of thumb, our approach could shorten the design period and yield a better combination of decaps at the same or lower cost. In addition, our methodology can also consider package-board co-design in optimizing different operation frequencies.
IP1-17	CHARACTERIZING POWER DELIVERY SYSTEMS WITH ON/OFF-CHIP VOLTAGE REGULATORS FOR MANY-CORE PROCESSORS Speakers: Xuan Wang, Jiang Xu, Zhe Wang, Kevin J. Chen, Xiaowen Wu and Zhehui Wang, HKUST, HK Abstract The design of the power delivery system has great influence on power management in many-core processor systems. Moving voltage regulators from off-chip to on-chip is gaining more and more interest in power delivery system design, because it is able to provide fast voltage scaling and multiple power domains. Previous works have proposed power-efficient on-chip regulators. It is also important to analyze the characteristics of the entire power delivery system to explore the tradeoff between the promising properties and costs of employing on-chip regulators. In this work, we develop an analytical model to evaluate important characteristics of the power delivery system, including on-chip/off-chip voltage regulators and the passive on-chip/on-board parasitics. Compared with SPICE simulations, our model achieves a fast system-level evaluation with comparable accuracy. Based on the model, geometric programming is utilized to find the optimal power efficiency of different architectures of power delivery systems under constraints of output voltage stability and area. Experiments show that compared with the conventional architecture using off-chip regulators, the hybrid one using both on-chip and off-chip voltage regulators achieves 1.0% power efficiency improvement and 68% area reduction of voltage regulators on average. We conclude that the hybrid architecture has potential for high power efficiency and small area at heavy workload, but careful accounting for the overhead of on-chip regulators is needed.
IP1-18	MASK-COST-AWARE ECO ROUTING Speakers: Hsi-An Chien¹, Zhen-Yu Peng¹, Yun-Ru Wu², Ting-Hsiung Wang², Hsin-Chang Lin², Chi-Feng Wu² and Ting-Chi Wang¹ ¹National Tsing Hua University, TW; ²Realtek Semiconductor Corp., TW Abstract In this paper, we study a mask-cost-aware routing problem for engineering change order (ECO). By taking into account old routes for possible reuse, we present an approach for the problem. Encouraging experimental results are reported to demonstrate the effectiveness of our approach.
IP1-19	EXPLOITING NARROW-WIDTH VALUES FOR IMPROVING NON-VOLATILE CACHE LIFETIME Speakers: Guangshan Duan and Shuai Wang, Nanjing University, CN Abstract Due to their high cell density, low leakage power consumption, and lower vulnerability to soft errors, non-volatile memory technologies are among the most promising alternatives for replacing the traditional DRAM and SRAM technologies used in implementing main memory and caches in modern microprocessors. However, one of the difficulties is the limited write endurance of most non-volatile memory technologies. In this paper, we propose to exploit narrow-width values to improve the lifetime of non-volatile last level caches. A leading-zeros masking scheme is first proposed to reduce the write stress to the upper half of the narrow-width data. To balance the write variations between the upper half and the lower half of the narrow-width data, two swap schemes, swap on write (SW) and swap on replacement (SRepl), are proposed. To further reduce the write stress to the non-volatile cache, we adopt two optimization schemes, multiple dirty bit (MDB) and read before write (RBW), to improve its lifetime. Our experimental results show that by combining all our proposed schemes, the lifetime of the non-volatile caches can be improved by 245% on average.
IP1-20	PARTIAL-SET: WRITE SPEEDUP OF PCM MAIN MEMORY Speakers: Li Bing¹, Shan Shuchang², Hu Yu² and Li XiaoWei³ ¹ICT, UCAS, CN; ²ICT, CAS, CN; ³ICT, CAS, CN Abstract Phase change memory (PCM) is a promising nonvolatile memory technology developed as a possible DRAM replacement. Although it offers read latency close to that of DRAM, PCM generally suffers from long write latency. Long write requests may block read requests on the critical path of cache/memory access, with an adverse impact on system performance. Besides, the write performance of PCM is very asymmetric, i.e., the SET operation (writing '1') is much slower than the RESET operation (writing '0'). In this work, we re-examine the resistance transform process during the SET operation of PCM and propose a novel Partial-SET scheme to alleviate the long write latency issue of PCM. During a write access to a memory line, a short Partial-SET pulse is applied first to program the PCM cells to a pre-stable state, achieving the same write latency as RESET. The partially-SET cells are then fully programmed within the retention window to preserve data integrity. Experimental results show that our Partial-SET scheme can improve the memory access performance of PCM by more than 45% on average with very marginal storage overhead.
IP1-21	GARBAGE COLLECTION FOR MULTI-VERSION INDEX ON FLASH MEMORY Speakers: Kam-Yiu Lam¹, Jian-Tao Wang¹, Yuan-Hao Chang², Jen-Wei Hsieh³, Po-Chun Huang⁴, Chung Keung Poon⁵ and ChunJiang Zhu¹ ¹City University of Hong Kong, HK; ²Academia Sinica, TW; ³National Taiwan University of Science and Technology, TW; ⁴Academia Sinica, TW; ⁵City University of Hong Kong, TW Abstract In this paper, we study the important performance issues in using the purging-range query to reclaim old data versions as free blocks in a flash-based multi-version database. To reduce the overheads of using the purging-range query in garbage collection, the physical block labeling (PBL) scheme is proposed to provide a better estimation of the purging version number to be used for purging old data versions. With the use of the frequency-based placement (FBP) scheme to place data versions in a block, the efficiency of garbage collection can be further enhanced by increasing the deadspans of data versions and reducing reallocation cost, especially when the flash memory space for the databases is limited.
IP1-22	D2CYBER: A DESIGN AUTOMATION TOOL FOR DEPENDABLE CYBERCARS Speakers: Arslan Munir and Farinaz Koushanfar, Rice University, US Abstract The next generation of automobiles (also known as cybercars) will increasingly incorporate electronic control units (ECUs) in novel automotive control applications. Recent work has demonstrated the vulnerability of modern car control systems to security attacks that directly impact the cybercar's physical safety and dependability. In this paper, we provide an integrated approach for the design of secure and dependable cybercars using a case study: a steer-by-wire (SBW) application over controller area network (CAN). The challenge is to embed both security and dependability over CAN while ensuring that the real-time constraints of the cybercar applications are not violated. Our approach enables early design feasibility analysis by embedding essential security primitives (i.e., confidentiality, integrity, and authentication) over CAN subject to the real-time constraints imposed by the desired quality of service and behavioral reliability. Our method leverages multi-core ECUs for providing fault-tolerance by redundant multi-threading (RMT) and also further enhances RMT for quick error detection. We quantify the error resilience of our approach and evaluate the interplay of performance, fault-tolerance, security, and scalability for our SBW case study.
IP1-23	CONTRACT-BASED DESIGN OF CONTROL PROTOCOLS FOR SAFETY-CRITICAL CYBER-PHYSICAL SYSTEMS Speakers: Pierluigi Nuzzo, John Finn, Antonio Iannopollo and Alberto Sangiovanni-Vincentelli, University of California at Berkeley, US Abstract We introduce a platform-based design methodology that addresses the complexity and heterogeneity of cyber-physical systems by using assume-guarantee contracts to formalize the design process and enable realization of control protocols in a hierarchical and compositional manner. Given the architecture of the physical plant to be controlled, the design is carried out as a sequence of refinement steps from an initial specification to a final implementation, including synthesis from requirements and mapping of higher-level functional and non-functional models into a set of candidate solutions built out of a library of components at the lower level. Initial top-level requirements are captured as contracts and expressed using linear temporal logic (LTL) and signal temporal logic (STL) formulas to enable requirement analysis and early detection of inconsistencies. Requirements are then refined into a controller architecture by combining reactive synthesis steps from LTL specifications with simulation-based design space exploration steps. We demonstrate our approach on the design of embedded controllers for aircraft electric power distribution.
IP1-24	A FAULT DETECTION MECHANISM IN A DATA-FLOW SCHEDULED MULTITHREADED PROCESSOR Speakers: Jian Fu¹, Qiang Yang¹, Raphael Poss¹, Chris Jesshope¹ and Chunyuan Zhang² ¹University of Amsterdam, NL; ²National University of Defense Technology, CN Abstract This paper designs and implements Redundant Multi-Threading (RMT) in a Data-flow scheduled Multi-Threaded (DMT) multicore processor, called Data-flow scheduled Redundant Multi-Threading (DRMT). It also presents Asynchronous Output Comparison (AOC) for RMT techniques to avoid fault-detection-related inter-core communication and alleviate the performance and hardware overheads induced by output comparison. Results show that the performance overhead of DRMT is less than 60% even when the number of threads is four times the number of processing elements. The performance and hardware overheads of AOC are also insignificant.
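
The narrow-width idea described in IP1-19 can be illustrated with a small Python sketch. This is not the authors' implementation: the 64-bit word width, the single flag bit per word, the simplified write model (every stored cell counts as one write), and the function names are all assumptions made for illustration. A value is treated as narrow when its upper half is all zeros, in which case only the lower half plus the assumed flag bit is written.

```python
WORD_BITS = 64  # assumed word width; the abstract does not specify one


def is_narrow(value: int, width: int = WORD_BITS) -> bool:
    """Return True if the upper half of a width-bit unsigned value is all zeros."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return value < (1 << (width // 2))


def bits_written(value: int, width: int = WORD_BITS, masking: bool = True) -> int:
    """Count the cells touched by one store under a simplified write model.

    Without masking, every store writes all `width` cells.
    With masking, a narrow value writes only the lower half; one extra
    flag cell (an assumption) records whether the upper half is masked.
    """
    narrow = is_narrow(value, width)  # also validates the value range
    if not masking:
        return width
    flag_bits = 1
    return (width // 2 if narrow else width) + flag_bits


# Hypothetical store trace: three narrow values and one wide value.
trace = [3, 1 << 40, 255, 0]
baseline = sum(bits_written(v, masking=False) for v in trace)
masked = sum(bits_written(v) for v in trace)
print(baseline, masked)
```

In this toy trace the masked variant writes fewer cells because three of the four values are narrow. Real savings depend on the workload's value distribution and on the cache's actual write policy.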

4.1 EXECUTIVE SESSION: Addressing Challenges of Reliable Chips

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Saal 1

Organiser:
Yervant Zorian, Fellow & Chief Architect, Synopsys, US

Executives:
Dan Alexandrescu, President & CEO, iROC Technologies, FR
Robert Aitken, Fellow, ARM, US
Robert Hum, GM & VP, Mentor Graphics, US
Stefan Singer, Fellow, Freescale, DE

While today's SOCs systematically use semiconductor production quality assessment and optimization solutions, meeting end-product requirements for reliability and availability increases the need to prepare the SOC design in advance to address such requirements. The speakers in this executive session will address the current trends and challenges in semiconductor reliability and discuss the level of readiness needed in a chip to meet today's SOC requirements.

Time	Label	Presentation Title Authors
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

4.2 Hot Topic: Multicore Systems in Safety Critical Electronic Control Units for Automotive and Avionics

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 6

Organisers:
Jürgen Becker, KIT, DE
Oliver Sander, KIT, DE

Chair:
Jürgen Becker, KIT, DE

Co-Chair:
Oliver Sander, KIT, DE

Future applications in automotive and avionics show an ever increasing demand for computational processing power. The use of multicore devices is now emerging in embedded electronics. However, these solutions are not directly applicable because of technical requirements that come along with the domain of safety critical and mixed critical applications, such as in automotive or avionics. The major challenge for deployment of multicore devices in safety critical applications such as automotive or avionics is the lack of determinism and support of segregation due to shared resources. The goal of this session is to present the challenges that arise from the use of multicore devices in embedded safety-critical systems and mixed critical systems.

Time	Label	Presentation Title Authors
17:00	4.2.1	AUTOSAR AND MULTICORE Speakers: Stefan Kuntz¹ and Rolf Schneider² ¹Continental Automotive GmbH, DE; ²AUDI AG, DE Abstract AUTOSAR already supports developing applications for and integrating software components onto multicore based platforms. In addition, these capabilities pave the way for helping to migrate existing applications, originally developed for being executed on single core platforms, to multicore based platforms. This talk provides a brief introduction to the current state of AUTOSAR's multicore support and presents some scenarios that draw attention to multicore specific questions and challenges in this particular context. Possible future directions for improving the AUTOSAR standard with regard to multicore, and for gaining more benefit from the availability of multiple cores as independent execution units, are sketched out.
17:30	4.2.2	CONCEPTS TO VALIDATE THE SAFE APPLICATION OF MULTICORE ARCHITECTURES IN THE AVIONICS DOMAIN Speaker: Ottmar Bender, Airbus Defence and Space, DE Abstract This presentation explains how commercially available multicore processors can be applied for safety critical applications in avionics systems. It also describes remaining difficulties which need to be solved for a full exploitation of multicore technology in the avionics domain. Furthermore a concept of an airborne radar application demonstrator built on multicore architecture is shown. This demonstrator shall allow the validation of essential solutions for the specific difficulties emerging from current multicore architectures.
18:00	4.2.3	MONITORING AND WCET ANALYSIS IN COTS MULTI-CORE-SOC-BASED MIXED-CRITICALITY SYSTEMS Speakers: Jan Nowotsch¹, Michael Paulitsch², Arne Henrichsen³, Werner Pongratz³ and Andreas Schacht³ ¹EADS Innovation Works, DE; ²EADS Innovation Works, DE; ³Cassidian, DE Abstract The performance and power efficiency of multi-core processors are attractive features for safety-critical applications, for example in avionics. But the inherent use of shared resources complicates timing analysability. In this paper we discuss a novel approach to compute the Worst-Case Execution Time (WCET) of multiple hard real-time applications scheduled on a Commercial Off-The-Shelf (COTS) multi-core processor. The analysis is closely coupled with mechanisms for temporal partitioning as, for instance, required in ARINC 653-based systems. Based on a discussion of the challenges for temporal partitioning and timing analysis in multi-core systems, we deduce a generic architecture model. Considering the requirements for re-usability and incremental development and certification, we use this model to describe our integrated analysis approach.
18:15	4.2.4	HARDWARE VIRTUALIZATION SUPPORT FOR SHARED RESOURCES IN MIXED-CRITICALITY MULTICORE SYSTEMS Speakers: Oliver Sander¹, Timo Sandmann², Viet Vu Duy³, Steffen Bähr³, Falco Bapp³, Juergen Becker³, Hans Ulrich Michel⁴, Dirk Kaule⁴, Daniel Adam⁴, Enno Luebbers⁵, Jürgen Hairbucher⁵, Andre Richter⁶, Christian Herber⁷ and Andreas Herkersdorf⁸ ¹KIT, DE; ²Karlsruhe Institute of Technology (KIT), DE; ³Karlsruhe Institute of Technology, DE; ⁴BMW F+T, DE; ⁵Intel GmbH, DE; ⁶TUM, DE; ⁷Technische Universität München, DE; ⁸TU München, DE Abstract Electric/Electronic architectures in modern automobiles evolve towards a hierarchical approach where functionalities from several ECUs are consolidated into few domain computers. Performance requirements directly lead to multicore solutions but also to a combination of very different requirements on such ECUs. Using virtualization in addition is one promising way of achieving segregation in time and space of shared resources. Based on examples taken from the automotive domain, several concepts for efficient hardware extensions of coprocessors and I/O devices are shown in this contribution. These provide mechanisms to ensure quality of service (QoS) levels in terms of execution time, throughput and latency. The resulting infotainment architecture is a feasibility study and is integrated into a vehicle demonstrator as centralized infotainment platform (VCT).
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

4.3 Secure Device Identification

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 1

Chair:
Tim Gueneysu, RUB, DE

Co-Chair:
Patrick Schaumont, Virginia Tech, US

Physically Unclonable Functions (PUF) have received much attention for fingerprinting of electronic devices. This session presents novel constructions and threats on Ring-Oscillator-based and Sense-Amplifier-based PUFs.

Time	Label	Presentation Title Authors
17:00	4.3.1	ARO-PUF: AN AGING-RESISTANT RING OSCILLATOR PUF DESIGN Speakers: Md. Tauhidur Rahman¹, Domenic Forte¹, Jim Fahrny² and Mohammad Tehranipoor¹ ¹University of Connecticut, US; ²Comcast, US Abstract Physically Unclonable Functions (PUFs) have emerged as a security block with the potential to generate chip-specific identifiers and cryptographic keys. However, it has been shown that the stability of these identifiers and keys is heavily impacted by aging and environmental variations. Previous techniques have mostly focused on improving PUF robustness against supply noise and temperature but aging has been largely neglected. In this paper, we propose a new aging resistant design for the popular ring-oscillator (RO)-PUF. Simulation results demonstrate that our aging resistant RO-PUF (called ARO-PUF) can produce unique, random, and more reliable keys. Only 7.7% of bits get flipped on average over a 10-year operation period for an ARO-PUF due to aging, whereas the value is 32% for a conventional RO-PUF. The ARO-PUF shows an average inter-chip HD of 49.67% (close to the ideal value of 50%), better than the conventional RO-PUF (~45%). With lower error, ARO-PUF offers ~24X area reduction for a 128-bit key because of reduced ECC complexity and smaller PUF footprint.
17:30	4.3.2	(Best Paper Award Candidate) AN EFFICIENT RELIABLE PUF-BASED CRYPTOGRAPHIC KEY GENERATOR IN 65NM CMOS Speakers: Mudit Bhargava¹ and Ken Mai² ¹ARM, US; ²Carnegie Mellon University, US Abstract Physical unclonable functions (PUFs) are primitives that generate high-entropy, tamper resistant bits for use in secure systems. For applications such as cryptographic key generation, the PUF response bits must be highly reliable, consistent across multiple evaluations under voltage and temperature variations. Conventionally, error correcting codes (ECC) have been used to improve response reliability, but these techniques have significant area, power, and delay overheads and are vulnerable to information leakage. In this work, we present a highly-reliable, PUF-based, cryptographic key generator that uses no ECC, but instead uses built-in self-test to determine which PUF bits are reliable and only uses those bits for key generation. We implemented a prototype of the key generator in a 65nm bulk CMOS testchip. The key generator generates 1213 bits in an area of <50kμm² with a measured bit error rate of < 5 × 10⁻⁹ in both the nominal and worst case corners (100k measurements each). This is equivalent to a 128-bit key failure rate of < 10⁻⁶. The system can generate a 128-bit key in 1.15μs. Finally, we present a realization of a "strong"-PUF that uses 128 of these highly reliable bits in conjunction with an Advanced Encryption Standard (AES) cryptographic primitive and has a response time of 40ns and is realized in an area of 84kμm².
18:00	4.3.3	INCREASING THE EFFICIENCY OF SYNDROME CODING FOR PUFS WITH HELPER DATA COMPRESSION Speakers: Matthias Hiller and Georg Sigl, Institute for Security in Information Technology; Technische Universität München, DE Abstract Physical Unclonable Functions (PUFs) provide secure cryptographic keys for resource constrained embedded systems without secure storage. A PUF measures internal manufacturing variations to create a unique, but noisy secret inside a device. Syndrome coding schemes create and store helper data about the structure of a specific PUF to correct errors within subsequent PUF measurements and generate a reliable key. This helper data can contain redundancy. We analyze existing schemes and show that data compression can be applied to decrease the size of the helper data of existing implementations. We introduce compressed Differential Sequence Coding (DSC), which is the most efficient syndrome coding scheme known to date for a popular reference scenario. Adding helper data compression to the DSC algorithm leads to an overall decrease of 68% in helper data size compared to other algorithms in a reference scenario. This is achieved without increasing the number of PUF bits and a minimal increase in logic size.
18:15	4.3.4	KEY-RECOVERY ATTACKS ON VARIOUS RO PUF CONSTRUCTIONS VIA HELPER DATA MANIPULATION Speakers: Jeroen Delvaux¹ and Ingrid Verbauwhede² ¹KU Leuven, BE; ²KU Leuven - COSIC, BE Abstract Physically Unclonable Functions (PUFs) are security primitives that exploit the unique manufacturing variations of an integrated circuit (IC). They are mainly used to generate secret keys. Ring oscillator (RO) PUFs are among the most widely researched PUFs. In this work, we claim various RO PUF constructions to be vulnerable to manipulation of their public helper data. Partial/full key-recovery is a threat for the following constructions, in chronological order. (1) Temperature-aware cooperative RO PUFs, proposed at HOST 2009. (2) The sequential pairing algorithm, proposed at HOST 2010. (3) Group-based RO PUFs, proposed at DATE 2013. (4) Or, more generally, all entropy distiller constructions proposed at DAC 2013.
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.
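
The key failure rate quoted in 4.3.2 can be checked with a short calculation. The sketch below assumes that bit errors are independent, which the abstract does not state: under that assumption a 128-bit key fails if any one of its bits is wrong, so the failure probability is 1 - (1 - BER)^128. The function name is hypothetical.

```python
import math


def key_failure_rate(bit_error_rate: float, key_bits: int = 128) -> float:
    """Probability that at least one of key_bits independent bits is wrong.

    Computes 1 - (1 - BER)^key_bits using log1p/expm1 so that very small
    bit error rates do not lose precision to floating-point rounding.
    """
    if not 0.0 <= bit_error_rate < 1.0:
        raise ValueError("bit_error_rate must be in [0, 1)")
    return -math.expm1(key_bits * math.log1p(-bit_error_rate))


# Upper bound on the measured bit error rate reported in the abstract.
rate = key_failure_rate(5e-9)
print(f"{rate:.2e}")  # about 6.4e-07
```

With the abstract's bound of 5 × 10⁻⁹ on the bit error rate, this gives roughly 6.4 × 10⁻⁷ (close to the simple product 128 × 5 × 10⁻⁹), which is consistent with the stated key failure rate of < 10⁻⁶.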

4.4 "Almost there" emerging technologies

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 2

Chair:
Ian O'Connor, University of Lyon, FR

Co-Chair:
Michael Niemier, University of Notre Dame, US

The three papers in this session all address "nearer-term" emerging technologies. Stochastic computing techniques are becoming increasingly relevant as CMOS becomes more error prone, numerous industrial and academic efforts are targeting 3D integration, and integrated microfluidics promise to have a profound impact on healthcare and other domains.

Time	Label	Presentation Title Authors
17:00	4.4.1	IIR FILTERS USING STOCHASTIC ARITHMETIC Speakers: Naman Saraf, Kia Bazargan, David J Lilja and Marc D Riedel, University of Minnesota, Twin Cities, US Abstract We consider the design of IIR filters operating on oversampled sigma-delta modulated bit streams using stochastic arithmetic. Conventional digital filters process multi-bit data at the Nyquist rate using multi-bit multipliers and adders. High resolution ADCs based on the sigma-delta modulation generate random bits at an oversampled rate as intermediate data. We propose to filter the sigma-delta modulated bit streams directly and present first and second order low pass IIR filters based on the stochastic integrator. Experimental results show a significant reduction in hardware area by using stochastic filters.
17:30	4.4.2	EFFICIENT TRANSIENT THERMAL SIMULATION OF 3D ICS WITH LIQUID-COOLING AND THROUGH SILICON VIAS Speakers: Alain Fourmigue, Giovanni Beltrame and Gabriela Nicolescu, Polytechnique Montreal, CA Abstract Three-dimensional integrated circuits (3D ICs) with advanced cooling systems are emerging as a viable solution for many-core platforms. These architectures generate a high and rapidly changing thermal flux. Their design requires accurate transient thermal models. Several models have been proposed, either with limited capabilities, or poor simulation performance. This work introduces an efficient algorithm based on the Finite Difference Method to compute the transient temperature in 3D ICs. Our experiments show a 5x speedup versus state-of-the-art models, while maintaining the same level of accuracy, and demonstrate the effect of large through silicon vias arrays on thermal dissipation.
18:00	4.4.3	A LOGIC INTEGRATED OPTIMAL PIN-COUNT DESIGN FOR DIGITAL MICROFLUIDIC BIOCHIPS Speakers: Trung Anh Dinh¹, Shigeru Yamashita¹ and Tsung-Yi Ho² ¹Ritsumeikan University, JP; ²National Cheng Kung University, TW Abstract Digital microfluidic biochips have become one of the most promising technologies for biomedical experiments. In modern microfluidic technology, reducing the number of independent control pins, which reflects most of the fabrication cost, power consumption and reliability of a microfluidic system, is a key challenge for every digital microfluidic biochip design. However, all previous chip designs sacrifice optimality, and only a limited reduction in the number of control pins is observed. Moreover, most existing designs cannot satisfy the high-throughput demand of bioassays and are thus inapplicable in practical contexts. In this paper, we propose the first optimal pin-count design scheme for digital microfluidic biochips. By integrating a very simple combinational logic circuit into the original chip, the proposed scheme can provide high throughput for bioassays with an information-theoretic minimum number of control pins. Furthermore, to cope with the rapid growth of the chip's scale, we also propose a scalable and efficient heuristic. Experiments demonstrate that the proposed scheme requires far fewer control pins than previous state-of-the-art works.
18:30	IP2-1, 978	FAST AND ACCURATE COMPUTATION USING STOCHASTIC CIRCUITS Speakers: Armin Alaghi and John P. Hayes, University of Michigan - Ann Arbor, US Abstract Stochastic computing (SC) is a low-cost design technique that has great promise in applications such as image processing. SC enables arithmetic operations to be performed on stochastic bit-streams using ultra-small and low-power circuitry. However, accurate computations tend to require long run-times due to the random fluctuations inherent in stochastic numbers (SNs). We present novel techniques for SN generation that lead to better accuracy/run-time trade-offs. First, we analyze a property called progressive precision (PP) which allows computational accuracy to grow systematically with run-time. Second, borrowing from Monte Carlo methods, we show that SC performance can be greatly improved by replacing the usual pseudo-random number sources by low-discrepancy (LD) sequences that are predictably progressive. Finally, we evaluate the use of LD stochastic numbers in SC, and show they can produce significantly faster and more accurate results than existing stochastic designs.
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.
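
As a rough illustration of the stochastic computing idea behind papers 4.4.1 and IP2-1, the following Python sketch encodes a value in [0, 1] as a bit-stream and multiplies two such streams with a bitwise AND. The use of van der Corput sequences in bases 2 and 3 as the low-discrepancy source, the stream length, and all function names are assumptions made for this example; they are not taken from the papers.

```python
def van_der_corput(index, base):
    """Return the index-th element of the van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def stochastic_stream(p, length, base):
    """Encode p as a bit-stream: bit i is 1 when the i-th sequence element is below p."""
    return [1 if van_der_corput(i, base) < p else 0 for i in range(length)]


def stochastic_multiply(stream_a, stream_b):
    """Multiply two unipolar stochastic numbers with a bitwise AND."""
    return [a & b for a, b in zip(stream_a, stream_b)]


def decode(stream):
    """Recover the encoded value as the fraction of 1 bits."""
    return sum(stream) / len(stream)


if __name__ == "__main__":
    n = 256  # stream length (assumed for this example)
    a = stochastic_stream(0.5, n, base=2)
    b = stochastic_stream(0.25, n, base=3)
    # The decoded product should be close to 0.5 * 0.25 = 0.125.
    print(decode(stochastic_multiply(a, b)))
```

The two streams are generated from different bases so that they are not correlated with each other; an AND gate only approximates multiplication when its input streams are uncorrelated.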

4.5 Memory System Architectures

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 3

Chair:
Muhammad Shafique, Karlsruhe Institute of Technology, DE

Co-Chair:
Cristina Silvano, Politecnico di Milano, IT

The memory sub-system plays an increasingly important role in modern multicore systems. Novel solutions are needed in order to deliver the expected performance improvements with minimal energy overheads. In addition, new solutions should preferably be backward compatible with already existing approaches. In this session we have four papers dealing with different aspects of the memory hierarchy in modern computing systems. The first paper presents a novel packet-based interface with compression, which reduces communication overhead. ALLARM provides a novel yet power-efficient strategy for cache coherence that simultaneously improves performance and reduces energy. The third paper deals with prefetcher aggressiveness and proposes a sound solution to reduce overall execution time. The last paper of this session proposes a novel extension of the shared L2 cache memory system, providing a very high aggregated bandwidth with a very low impact on L2 cache design complexity or operating frequency.

Time	Label	Presentation Title Authors
17:00	4.5.1	ACHIEVING EFFICIENT PACKET-BASED MEMORY SYSTEM BY EXPLOITING CORRELATION OF MEMORY REQUESTS Speakers: Tianyue Lu, Licheng Chen and Mingyu Chen, Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract A packet-based interface is a probable trend for future memory systems to alleviate capacity and bandwidth bottlenecks; meanwhile, fine-grained memory access has been proven to reduce memory power efficiently. However, leveraging both technologies results in high packet overhead, because previous implementations all adopt a simple design in which a single packet is dedicated to a single request (SPSR). In this paper, we propose three optimizations to overcome the problem by exploiting correlations of memory requests. First, we propose a novel single packet multiple requests (SPMR) interface that encapsulates multiple requests into a packet to share the packet head and tail. Second, we propose an adaptive compression mechanism for addresses within a packet by adopting a base-difference algorithm. Third, we propose a mechanism to merge multiple memory requests with continuous access addresses into a single request before packing. In this way, the granularity constraint of 64 bytes is broken, so that the efficiency of request scheduling and of the row buffer is improved. The experimental results show that, for memory-intensive workloads, the optimizations can effectively reduce packet overhead by about 53.9% and improve system performance by about 63.6%.
17:30	4.5.2	ALLARM: OPTIMIZING SPARSE DIRECTORIES FOR THREAD-LOCAL DATA Speakers: Amitabha Roy¹ and Timothy Jones² ¹EPFL, CH; ²University of Cambridge, GB Abstract Large-scale cache-coherent systems often impose unnecessary overhead on data that is thread-private for the whole of its lifetime. These include resources devoted to tracking the coherence state of the data, as well as unnecessary coherence messages sent out over the interconnect. In this paper we show how the memory allocation strategy for non-uniform memory access (NUMA) systems can be exploited to remove any coherence-related traffic for thread-local data, as well as removing the need to track those cache lines in sparse directories. Our strategy is to allocate directory state only on a miss from a node in a different affinity domain from the directory. We call this ALLocAte on Remote Miss, or ALLARM. Our solution is entirely backward compatible with existing operating systems and software, and provides a means to scale cache coherence into the many-core era. On a mix of SPLASH2 and Parsec workloads, ALLARM is able to improve performance by 13% on average while reducing dynamic energy consumption by 9% in the on-chip network and 15% in the directory controller. This is achieved through a 46% reduction in the number of sparse directory entries evicted.
18:00	4.5.3	INTRODUCING THREAD CRITICALITY AWARENESS IN PREFETCHER AGGRESSIVENESS CONTROL Speakers: Biswabandan Panda and Shankar Balachandran, IIT Madras, IN Abstract A single parallel application running on a multi-core system shows sub-linear speedup because of the slow progress of one or more threads, known as critical threads. Some of the reasons for the slow progress of threads are (1) load imbalance, (2) frequent cache misses and (3) the effect of synchronization primitives. Identifying critical threads and minimizing their cache miss latencies can improve system performance. One way to hide and tolerate cache misses is hardware prefetching, one of the most commonly used memory latency hiding techniques. Previous studies have shown the effectiveness of hardware prefetchers for multiprogrammed workloads (multiple sequential applications running independently on different cores). In contrast to multiprogrammed workloads, the performance of a single parallel application depends on the progress of its slow-progress (critical) threads. This paper introduces Thread Criticality-aware Prefetcher Aggressiveness Control (TCPAC). TCPAC controls the aggressiveness of prefetchers at the L2 prefetching controllers (known as TCPAC-P), at the DRAM controller (known as TCPAC-D) and at the Last Level Cache (LLC) controller (known as TCPAC-C), based on the prefetch accuracy and the thread progress. Each TCPAC sub-technique outperforms its respective state-of-the-art counterpart (HPAC [2], PADC [4], and PACMan [3]). Moreover, the combination of all the TCPAC sub-techniques, named TCPAC-PDC, outperforms the combination of HPAC, PADC, and PACMan. On average, on an 8-core system, TCPAC-PDC improves execution time over the combination of HPAC, PADC, and PACMan by 7.61%. For 12 and 16 cores, TCPAC-PDC beats the state-of-the-art combinations by 7.21% and 8.32% respectively.
18:15	4.5.4	A MULTI BANKED - MULTI PORTED - NON BLOCKING SHARED L2 CACHE FOR MPSOC PLATFORMS Speakers: Igor Loi¹ and Luca Benini² ¹University of Bologna, IT; ²Università di Bologna, IT Abstract On-chip L2 cache architectures, well established in high-performance parallel computing systems, are now becoming a performance-critical component also for multi/many-core architectures targeted at lower-power, embedded applications. The very stringent requirements on power and cost of these systems result in one of the key challenges in many-core designs, mandating the deployment of highly efficient L2 caches. In this perspective, sharing the L2 cache layer among all system cores has important advantages, such as increased utilization, fast inter-core communication, and reduced aggregate footprint because no undesired replication of lines occurs. This paper presents and explores a novel architecture for a shared L2 cache system with multi-port and multi-bank features. We target this L2 cache to a many-core platform based on hierarchical cluster structure that does not employ private data caches, and therefore does not require complex coherency mechanisms. In fact, our shared L2 cache can be seen logically as a Last Level Cache (LLC) adopting the terminology of higher-performance many-core products, although in these latter the LLC is more often an L3 layer. Our experimental results show a maximum aggregate bandwidth of 28GB/s (89% of the maximum channel capacity) for 100% hit traffic with random banking conflicts, as a realistic case. Physical implementation results in 28nm Fully-Depleted-Silicon-on-Insulator (FDSoI) show that our L2 cache can operate at up to 1GHz with a memory density loss of only 20% with respect to an L2 scratchpad for a 2 MB configuration.
18:30	IP2-2, 150	DRAM-BASED COHERENT CACHES AND HOW TO TAKE ADVANTAGE OF THE COHERENCE PROTOCOL TO REDUCE THE REFRESH ENERGY Speakers: Zoran Jaksic and Ramon Canal, Universitat Politecnica de Catalunya, ES Abstract Recent technology trends have turned DRAMs into an interesting candidate to substitute traditional SRAM-based on-chip memory structures (i.e. register file, cache memories). Nevertheless, a major problem in introducing these cells is that they lose their state (i.e. value) over time and have to be refreshed. This paper proposes the implementation of coherent caches with DRAM cells. Furthermore, we propose to use the coherence state to tune the refresh overhead. According to our analysis, an average of up to 57% of refresh energy can be saved. Also, compared to caches implemented in SRAM, total energy savings are on average up to 39%, depending on the refresh policy, with a performance loss below 8%.
18:31	IP2-3, 302	(Best Paper Award Candidate) REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3) Speakers: Alen Bardizbanyan¹, Magnus Själander², David Whalley² and Per Larsson-Edefors¹ ¹Chalmers University of Technology, SE; ²Florida State University, US Abstract Fast set-associative level-one data caches (L1 DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1 DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependency detection (ELD3) technique for in-order pipelines. This technique makes it possible to detect if a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are sequentially performed so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD3 technique enables about 49% of all load operations to access the L1 DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1 DC energy by 13%.
18:32	IP2-4, 444	DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP Speakers: Preethi Parayil Mana Damodaran¹, Stefan Wallentowitz² and Andreas Herkersdorf³ ¹LIS, Technical University of Munich, DE; ²Technische Universität München, Institute for Integrated Systems, DE; ³TU München, DE Abstract In a shared-memory based tiled many-core system-on-chip architecture, memory accesses present a huge performance bottleneck in terms of access latency as well as bandwidth requirements. The best practice approach to address this issue is to provide a multi-level cache hierarchy and a suitable cache-coherency mechanism. This paper presents a method to increase the memory access performance in distributed-directory-coherency-protocol based tiled many-core systems. The proposed method introduces an alternate design for the system-wide shared last-level caches (LLC) placed between the memory and the node private caches (NPC). The proposed system-wide shared LLC layer is distributed over the entire network and it interacts with the home directories of specific cache lines. Results from simulating SPEC2000 benchmark applications executed on a SystemC model of the proposed design show a minimum performance improvement of 20-25% when compared to a model without the shared cache layer at the expense of an additional 2% of the total cache memory space (NPC + LLC memory). In addition, the proposed design shows a minimum 7-15% and an average 14-15% improvement in performance in comparison to centralized system-wide shared LLC of equivalent size and dynamic mapped distributed LLC of equivalent size respectively.
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.
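
The base-difference address compression mentioned in the abstract of paper 4.5.1 can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the choice of the first address as the base, the 32-bit address width, the fixed signed difference width and all function names are hypothetical and do not describe the authors' actual packet format.

```python
ADDRESS_BITS = 32  # assumed width of an uncompressed address


def compress_addresses(addresses):
    """Split a non-empty list of addresses into a base and per-request differences."""
    base = addresses[0]
    deltas = [addr - base for addr in addresses[1:]]
    return base, deltas


def decompress_addresses(base, deltas):
    """Rebuild the original address list from the base and the differences."""
    return [base] + [base + d for d in deltas]


def delta_bits(deltas):
    """A signed bit width that is sufficient to hold every difference."""
    if not deltas:
        return 0
    widest = max(abs(d) for d in deltas)
    return widest.bit_length() + 1  # +1 for the sign bit


def compressed_bits(addresses):
    """Bits needed for one full base address plus fixed-width differences."""
    _, deltas = compress_addresses(addresses)
    return ADDRESS_BITS + len(deltas) * delta_bits(deltas)


if __name__ == "__main__":
    requests = [0x10000040, 0x10000080, 0x10000000, 0x10001000]
    base, deltas = compress_addresses(requests)
    print(hex(base), deltas)
    print(len(requests) * ADDRESS_BITS, "bits raw,", compressed_bits(requests), "bits compressed")
```

The saving comes from correlated requests: when the addresses in one packet lie close together, each difference needs far fewer bits than a full address.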

4.6 Code Generation and Optimization for Embedded Platforms

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 4

Chair:
Heiko Falk, Ulm University, DE

Co-Chair:
Florence Maraninchi, Grenoble IMP/VERIMAG, FR

This session covers the broad spectrum of topics in compilers, code optimization, and validation under consideration of today's embedded platforms. The first paper addresses the automated validation of binary translators. The second paper focusses on the on-device optimization of apps and system libraries of mobile platforms. The third paper deals with the code generation of Android image processing applications for heterogeneous GPU-based architectures. The session is rounded off by short presentations of work-in-progress ideas on model transformation, energy and wear-leveling optimization, and scheduling/register allocation.

Time	Label	Presentation Title Authors
17:00	4.6.1	EATBIT: EFFECTIVE AUTOMATED TEST FOR BINARY TRANSLATION WITH HIGH CODE COVERAGE Speakers: Hui Guo¹, Zhenjiang Wang¹, Chenggang Wu¹ and Ruining He² ¹Institute of Computing Technology, Chinese Academy of Sciences, CN; ²University of California, San Diego, US Abstract Binary translation makes it convenient to emulate one instruction set by another. Nowadays, it is growing in popularity in various applications, especially on embedded platforms. When it comes to testing binary translators, traditional methodologies, which still rely mainly on manual unit tests, are costly, labor intensive and often inadequate for testing complicated algorithms in the translators. Some standard benchmark suites, like SPEC CPU2006, are compiled with different compilation options for further tests. However, the translation modules still have over 30% of their code unexecuted after such tests, according to our experimental results. Methodologies based on randomization can generate a vast variety of tests and thus improve the code coverage in the translation system. In this paper, we propose such an approach named EATBit. Test binaries are generated with randomly selected instructions and operands. The binaries and a large amount of input data are then refined to exclude invalid ones. Experimental results on a real binary translator demonstrate that EATBit can not only improve code coverage by over 20%, but also successfully find some new bugs in the translator.
17:30	4.6.2	ON-DEVICE OBJECTIVE-C APPLICATION OPTIMIZATION FRAMEWORK FOR HIGH-PERFORMANCE MOBILE PROCESSORS Speakers: Garo Bournoutian and Alex Orailoglu, University of California, San Diego, US Abstract Smartphones provide applications that are increasingly similar to those of interactive desktop programs, providing rich graphics and animations. To simplify the creation of these interactive applications, mobile operating systems employ high-level object-oriented programming languages and shared libraries to manipulate the device's peripherals and provide common user-interface frameworks. The presence of dynamic dispatch and polymorphism allows for robust and extensible application coding. Unfortunately, the presence of dynamic dispatch also introduces significant overheads during method calls, which directly impact execution time. Furthermore, since these applications rely heavily on shared libraries and helper routines, the quantity of these method calls is higher than that found in typical desktop-based programs. Optimizing these method calls centrally, before consumers download the application onto a given phone, is made difficult by the large diversity of hardware and operating system versions that the application could run on. This paper proposes a methodology to tailor a given Objective-C application and its associated device-specific shared library codebase using on-device post-compilation code optimization and transformation. In doing so, many polymorphic sites can be resolved statically, improving the overall application performance.
18:00	4.6.3	CODE GENERATION FOR EMBEDDED HETEROGENEOUS ARCHITECTURES ON ANDROID Speakers: Richard Membarth, Oliver Reiche, Frank Hannig and Jürgen Teich, University of Erlangen-Nuremberg, DE Abstract The success of Android is based on its unified Java programming model, which allows developers to write platform-independent programs for a variety of different target platforms. However, this comes at the cost of performance. As a consequence, Google introduced APIs that allow developers to write native applications and to exploit multiple cores as well as embedded GPUs for compute-intensive parts. This paper proposes code generation techniques in order to target the Renderscript and Filterscript APIs. Renderscript harnesses multi-core CPUs and unified shader GPUs, while the more restricted Filterscript also supports GPUs with earlier shader models. Our techniques focus on image processing applications and make it possible to target these APIs and OpenCL from a common description. We further avoid memory transfers by sharing the same memory region among different processing elements on HSA platforms. As a reference, we use an embedded platform hosting a multi-core ARM CPU and an ARM Mali GPU. We show that our generated source code is faster than native implementations in OpenCV as well as the pre-implemented script intrinsics provided by Google for acceleration on the embedded GPU.
18:30	IP2-5, 990	DESIGN OF SAFETY CRITICAL SYSTEMS BY REFINEMENT Speakers: Alex Iliasov, Arseniy Alekseyev, Danil Sokolov and Andrey Mokhov, Newcastle University, GB Abstract An increasingly large number of safety-critical embedded systems rely on software to prevent and mitigate hazards occurring due to design errors and unexpected interactions of the system with its users and the environment. Implementing a safety instrumented function in the way advocated by the traditional software methods requires an intimate understanding and thorough validation of a complex ecosystem of programming languages, compilers, operating systems and hardware. We propose to consider an alternative where a system designer, for each individual problem, creates in a correct-by-construction manner both the design of a system and its compilation and execution infrastructure. This permits an uninterrupted chain of a formal correctness argument spanning from formalised requirements all the way to the gate-level characterisation of an execution environment. The past decade of advances in verification technology turned the mechanical verification of large-scale models into a reality while the pressure of certification makes the cost of a formally verified development routine increasingly acceptable. The proposed technique fits the Grand Challenge for Computer Research posed by Hoare in 2003, namely, development of a Verifying Compiler which not only mechanically translates a given program from one language to another but also verifies its correctness according to a formal specification. This allows meeting the most stringent software certification requirements such as SIL 4. We illustrate the idea with a small case-study developed using the Event-B modelling notation and tools.
18:31	IP2-6, 651	ENERGY OPTIMIZATION IN ANDROID APPLICATIONS THROUGH WAKELOCK PLACEMENT Speakers: Faisal Alam¹, Preeti Ranjan Panda¹, Nikhil Tripathi², Namita Sharma³ and Sanjiv Narayan² ¹IIT Delhi, IN; ²Calypto Design Systems, IN; ³Indian Institute of Technology Delhi, IN Abstract Energy efficiency is a critical factor in mobile systems, and a significant body of recent research efforts has focused on reducing the energy dissipation in mobile hardware and applications. The Android OS Power Manager provides programming interface routines called wakelocks for controlling the activation state of devices on a mobile system. An appropriate placement of wakelock acquire and release functions in the application can make a significant difference to the energy consumption. In this paper, we propose a data flow analysis based strategy for determining the placement of wakelock statements corresponding to the uses of devices in an application. Our experimental evaluation on a set of Android applications shows significant (up to 32%) energy savings with the proposed optimization strategy.
18:32	IP2-7, 778	A WEAR-LEVELING-AWARE DYNAMIC STACK FOR PCM MEMORY IN EMBEDDED SYSTEMS Speakers: Qingan Li¹, Yanxiang He², Yong Chen², Chun Xue³, Nan Jiang² and Chao Xu² ¹Wuhan University & City University of Hong Kong, CN; ²Wuhan University, CN; ³City University of Hong Kong, CN Abstract Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics such as extremely low leakage power, high storage density and good scalability. However, PCM's low endurance constrains its practical applications. In this paper, we propose a Wear Leveling aware dynamic stack to extend PCM's lifetime when it is adopted in embedded systems as main memory. Through a dynamic stack, the memory space is circularly allocated to stack objects, and thus an even usage of PCM memory is achieved. The experimental results show that the proposed method can significantly reduce the write variation on PCM cells and enhance the lifetime of PCM memory.
18:33	IP2-8, 1056	LIFETIME HOLES AWARE REGISTER ALLOCATION FOR CLUSTERED VLIW PROCESSORS Speakers: Xuemeng Zhang¹, Hui Wu², Haiyan Sun¹ and Jingling Xue³ ¹National University of Defense Technology, CN; ²The University of New South Wales, AU; ³UNSW, AU Abstract This paper presents an on-the-fly register allocator which dynamically detects and utilises lifetime holes for clustered VLIW processors. A lifetime hole is an interval in which a variable does not contain a valid value. A register holding a lifetime hole can be allocated to another variable whose live range fits in the lifetime hole, leading to more efficient utilisation of registers. We propose efficient techniques for dynamically utilising lifetime holes and incorporate these techniques into our on-the-fly register allocator. We have simulated our register allocator and a linear scan register allocator without considering lifetime holes by using the MediaBench II benchmark suite. Our simulation results show that our register allocator reduces the number of spills by 12.5%, 11.7%, 12.7%, for three different processor models, respectively.
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

4.7 Dependable System Design

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 5

Chair:
Yiorgos Makris, University of Texas at Dallas, US

Co-Chair:
Haralampos Stratigopoulos, TIMA, FR

This session presents a variety of techniques to improve the dependability of digital systems, showing how to improve security and fault tolerance at system level.

Time	Label	Presentation Title Authors
17:00	4.7.1	REAL-TIME TRUST EVALUATION IN INTEGRATED CIRCUITS Speakers: Yier Jin and Dean Sullivan, The University of Central Florida, US Abstract The use of side-channel measurements and fingerprinting, in conjunction with statistical analysis, has proven to be the most effective method for accurately detecting hardware Trojans in fabricated integrated circuits. However, these post-fabrication trust evaluation methods overlook the capabilities of advanced design skills that attackers can use in designing sophisticated Trojans. To this end, we have designed a Trojan using power-gating techniques and demonstrate that it can be masked from advanced side-channel fingerprinting detection while dormant. We then propose a real-time trust evaluation framework that continuously monitors the on-board global power consumption to monitor chip trustworthiness. The measurements obtained corroborate our framework's effectiveness for detecting Trojans. Finally, the results presented are experimentally verified by performing measurements on fabricated Trojan-free and Trojan-infected variants of a reconfigurable linear feedback shift register (LFSR) array.
17:30	4.7.2	(Best Paper Award Candidate) VERIFICATION-GUIDED VOTER MINIMIZATION IN TRIPLE-MODULAR REDUNDANT CIRCUITS Speakers: Dmitry Burlyaev, Pascal Fradet and Alain Girault, INRIA, FR Abstract We present a formal approach to minimize the number of voters in triple-modular redundant sequential circuits. Our technique actually works on a single copy of the circuit and considers a user-defined fault model (under the form "at most 1 bit-flip every k clock cycles"). Verification-based voter minimization guarantees that the resulting circuit (i) is fault tolerant to the soft-errors defined by the fault model and (ii) is functionally equivalent to the initial one. Our approach operates at the logic level and takes into account the input and output interface specifications of the circuit. Its implementation makes use of graph traversal algorithms, fixed-point iterations, and BDDs. Experimental results on the ITC'99 benchmark suite indicate that our method significantly decreases the number of inserted voters which entails a hardware reduction of up to 55% and a clock frequency increase of up to 35% compared to full TMR. We address scalability issues arising from formal verification with approximations and assess their efficiency and precision.
18:00	4.7.3	TRADE-OFFS IN EXECUTION SIGNATURE COMPRESSION FOR RELIABLE PROCESSOR SYSTEMS Speakers: Jonah Caplan¹, Maria Mera², Peter Milder² and Brett Meyer¹ ¹McGill University, CA; ²SUNY Stonybrook, US Abstract As semiconductor processes scale, making transistors more vulnerable to transient upset, a wide variety of microarchitectural and system-level strategies are emerging to perform efficient error detection and correction in computer systems. While these approaches often target various application domains and address error detection and correction at different granularities and with different overheads, an emerging trend is the use of state compression, e.g., cyclic redundancy check (CRC), to reduce the cost of redundancy checking. Prior work in the literature has shown that Fletcher's checksum (FC), while less effective where error detection probability is concerned, is less computationally complex when implemented in software than the more-effective CRC. In this paper, we reexamine the suitability of CRC and FC as compression algorithms when implemented in hardware for embedded safety-critical systems. We have developed and evaluated parameterizable implementations of CRC and FC in FPGA, and we observe that what was true for software implementations does not hold in hardware: CRC is more efficient than FC across a wide variety of target input bandwidths and compression strengths.
18:15	4.7.4	AN ENERGY-AWARE FAULT TOLERANT SCHEDULING FRAMEWORK FOR SOFT ERROR RESILIENT CLOUD COMPUTING SYSTEMS Speakers: Yue Gao, Sandeep Gupta, Yanzhi Wang and Massoud Pedram, University of Southern California, US Abstract For modern high performance systems, aggressive technology and voltage scaling have drastically increased their susceptibility to soft errors. At the grand scale of cloud computing, it is clear that soft error induced failures will occur far more frequently, but it is unclear how to effectively apply current error detection and fault tolerance techniques at scale. In this paper, we focus on energy-aware fault tolerant scheduling in public, multi-user cloud systems, and explore the three-way tradeoff between reliability (in terms of soft error resiliency), performance and energy. Through a systematically optimized resource allocation, error detection approach selection, virtual machine placement, spatial/temporal redundancy augmentation and task scheduling process, the cloud service provider can achieve high error coverage and fault tolerance confidence while minimizing global energy costs under user deadline constraints. Our scheduling algorithm includes a static scheduling phase that operates on task graph based workload inputs prior to execution, and a light-weight dynamic scheduler that migrates tasks during execution in case of excessive re-executions. All schedules are evaluated on a runtime simulation engine that (1) mimics the performance fluctuations in cloud systems, and (2) supports the injection of arbitrary fault patterns. Compared to current virtual machine or task replication techniques, we are able to reduce overall application failure rates by over 50% with approximately 76% total energy overhead.
18:30	IP2-9, 384	A LOW-POWER, HIGH-PERFORMANCE APPROXIMATE MULTIPLIER WITH CONFIGURABLE PARTIAL ERROR RECOVERY Speakers: Cong Liu¹, Jie Han¹ and Fabrizio Lombardi² ¹University of Alberta, CA; ²Northeastern University, US Abstract Approximate circuits have been considered for error-tolerant applications that can tolerate some loss of accuracy with improved performance and energy efficiency. Multipliers are key arithmetic circuits in many such applications such as digital signal processing (DSP). In this paper, a novel approximate multiplier with a lower power consumption and a shorter critical path than traditional multipliers is proposed for high-performance DSP applications. This multiplier leverages a newly-designed approximate adder that limits its carry propagation to the nearest neighbors for fast partial product accumulation. Different levels of accuracy can be achieved through a configurable error recovery by using different numbers of most significant bits (MSBs) for error reduction. The approximate multiplier has a low mean error distance, i.e., most of the errors are not significant in magnitude. Compared to the Wallace multiplier, a 16-bit approximate multiplier implemented in a 28nm CMOS process shows a reduction in delay and power of 20% and up to 69%, respectively. It is shown that by utilizing an appropriate error recovery, the proposed approximate multiplier achieves similar processing accuracy as traditional exact multipliers but with significant improvements in power and performance.
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.
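As a companion to talk 4.7.3, the following minimal Python sketch shows the two compression schemes that the abstract compares. It is a software illustration only and not the authors' parameterizable FPGA implementation; the function names, the 16-bit widths, and the choice of the CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF) are assumptions made for this example.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16 checksum: two running sums, each kept modulo 255."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16 (CCITT-FALSE variant: no reflection, no final XOR)."""
    crc = init
    for byte in data:
        crc ^= byte << 8              # feed the next byte into the top bits
        for _ in range(8):            # shift out one bit at a time
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    good = b"\x00\x01"
    bad = b"\xff\x01"                 # first byte corrupted from 0x00 to 0xFF
    print("Fletcher-16 detects corruption:", fletcher16(good) != fletcher16(bad))
    print("CRC-16 detects corruption:     ", crc16_ccitt(good) != crc16_ccitt(bad))
```

Because the Fletcher sums are taken modulo 255, the bytes 0x00 and 0xFF contribute identically, so the corruption above goes unnoticed, while the CRC catches it. This is one concrete instance of the weaker error detection that the abstract attributes to Fletcher's checksum.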

4.8 State-of-the-art in Verification: European Tertulia IC Design - Enabling AMS Structured Verification / Verification in FPGA & IP design flows

Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Exhibition Theatre

Organiser:
Andreas Brüning, Silicon Saxony, DE

Time	Label	Presentation Title Authors
17:00	4.8.1	BRING ASIC-ALIKE VERIFICATION TO YOUR FPGA & IP DESIGN FLOW Speaker: Scott Calkins, Blue Pearl Software Inc, US Abstract This talk will highlight how successful design teams and IP firms such as PLDA are able to develop high quality code by using a process to control and optimize the HDL which is developed by different designers in different locations, even those with varied skill sets. PLDA designs and sells intellectual property (IP) cores and prototyping tools for ASIC and FPGA that aim to accelerate time-to-market for embedded electronic designers. PLDA specializes in high-speed interface protocols and technologies such as PCIe. Through the use of Blue Pearl Software's Symbolic Engine, which maps code to the RTL level and then analyzes it for known structures, PLDA is able to generate deterministic results for the handful of synthesizers and target fabrics their customers demand. Analyzing the HDL before it is brought into cycle-based simulators allows designers to run FPGA-centric structural checks for Xilinx and Altera, which helps to detect bugs and identify specific optimizations earlier in the flow, and automatically, for the success and satisfaction of our customers' designers: "Blue Pearl Software's design analysis tool enables integration of formal verification techniques to our design flow, in order for us to detect structural bugs at the very early stage of code integration, and thus to deliver highest quality IP to our customers. On top, we definitely recommend Blue Pearl Software's solution to anyone who needs to increase design team productivity." Hugues Deneux, R&D Director of PLDA
17:20	4.8.2	TOWARDS CO-DESIGN AND CO-VERIFICATION OF HW, SW, AND ANALOG SYSTEMS Speaker: Christoph Grimm, TU Kaiserslautern, DE Abstract We can today design and verify digital hardware and software in a way that deserves the word co-design. Co-design achieves significantly higher design productivity and better product performance. Unfortunately, co-design and co-verification are not yet done in a similarly productive way for analog and RF systems. The presentation will give an overview of methodology, tools, and languages that include analog and RF design into a comprehensive co-design methodology. Particular focus is on tool integration and power profiling crossing the discrete-analog border.
17:40	4.8.3	ENABLING AMS STRUCTURED VERIFICATION Speakers: Gunter Strube¹ and Stefan Getzlaff² ¹MunEDA GmbH, DE; ²ZMDI, DE Abstract The verification of the robustness of design specifications with respect to all combinations of worst-case parameter conditions not only improves the design confidence, but it is increasingly becoming a requirement for quality assurance and documentation for norms. It is a complex task consuming significant manpower and compute power and it tends to be sacrificed under time pressure in the final stage of a project. We present an automated structured approach that differentiates through its thoroughness, its efficiency and most of all its ease-of-use. It enables even novice designers to apply advanced state-of-the-art statistical tools to create a report including a measure of robustness for each specification and for the circuit.
18:20	4.8.4	TERTULIA IC-DESIGN - EUROPE TEAMS UP Speaker: Jürgen Haase, edacentrum, DE Abstract The clusters of Grenoble and Dresden have developed into leading clusters of world-wide importance. Now these clusters have launched substantial initiatives for collaboration in order to strengthen Europe's position in the world-wide competition of microelectronics sites. This talk gives an overview of current initiatives, including the tertulia IC-Design.
18:30		End of session Exhibition Reception in Several serving points inside the Exhibition Area (Terrace Level) The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

UB04 Session 4

Date: Tuesday 25 March 2014
Time: 17:30 - 19:30
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB04.01	QUANTUMEDA: A VISUALIZATION AND DESIGN ENVIRONMENT FOR TOPOLOGICAL QUANTUM CIRCUITS Authors: Ilia Polian, Wolfgang Wallner and Alexandru Paler, University of Passau, DE Abstract Quantum circuits use quantum-mechanical properties of certain physical systems, such as superposition and entanglement, to perform massively parallel calculations. They provide polynomial algorithms for problems for which only inefficient algorithms with asymptotically-exponential running time are known in conventional models of computation. Building a scalable quantum computer that can process a large number of quantum bits (qubits) is one of the grand challenges of modern science. While first small quantum computers have been experimentally demonstrated and a number of implementation technologies have been suggested, all of them encounter difficulties when it comes to scaling. The central difficulty is the high susceptibility of such circuits to noise and decoherence, which necessitates the use of special quantum error correction. Topological quantum computing (TQC) is a paradigm that offers a path to scalability. It strikes a balance between systematic, intuitive methods to design large computations, and relatively loose requirements on the vulnerability of individual qubits to errors. The availability of a platform for implementing large quantum algorithms creates the need for methods to manage design complexity, including automatic synthesis, optimization, compaction, verification and visualization of TQC circuits. Topological quantum circuits are based on a three-dimensional cluster of qubits which supports highly efficient topological quantum error-correcting codes. In this way, the circuits can operate even though their individual qubits are subject to relatively high error rates. We will present the first environment for design of TQC circuits.
The environment allows the user to graphically enter the structure of a circuit, add, delete and re-shape individual qubits, and perform optimization and compaction (both manually and by global replacement). The circuits are represented on an intermediate technology-independent level, where "logical qubits" that consist of a large number of physical qubits perform error-corrected operations. For example, the circuit in Fig. 1 shows an error-corrected CNOT gate implemented by four logical qubits represented by colored structures. The optimized representation can be translated into instruction sequences for a classical computer that operates the actual quantum hardware.
UB04.02	AIDA: ANALOG IC DESIGN AUTOMATION Authors: Nuno Horta¹, Nuno Lourenço², Ricardo Martins², Ricardo Póvoa², António Canelas² and Pedro Ventura¹ ¹Instituto de Telecomunicacoes, PT; ²Instituto de Telecomunicacoes / Instituto Superior Técnico, PT Abstract This demonstration presents AIDA, an analog integrated circuit (IC) design automation environment. AIDA includes two main modules, namely, AIDA-C and AIDA-L. AIDA-C is a circuit-level synthesis tool which uses state-of-the-art multi-objective multi-constrained optimization kernels, based on evolutionary computation techniques, where the robustness of the solutions is attained by considering a layout-aware approach and, also, extreme process variations by means of PVT corner analysis. The circuit's performance is measured using Spectre®, ELDO® or HSPICE® electrical simulators as evaluation engines. AIDA-L considers the device sizes and the best floorplan, obtained with AIDA-C, and generates the complete layout by placing and routing the devices, while fulfilling the technology design rules by using built-in design-rule check (DRC) and layout-versus-schematic (LVS) procedures. To demonstrate the AIDA design environment, several analog circuit structures, e.g., OTAs, LNAs, LC-Oscillators, etc., will be synthesized in a 130nm CMOS technology. AIDA-C is demonstrated for circuit-level sizing and optimization by generating a family of Pareto Optimal solutions based on user performance and functional specifications. AIDA-L is demonstrated by generating the layout of a user selected solution from AIDA-C, taking into account electrical currents information to mitigate electromigration and IR-drop effects, and also wiring symmetry for multiport multi-terminal signal nets of analog ICs.
UB04.03	PATN: A PERFORMANCE ANALYSIS TOOL FOR NOC Authors: Yang Chen and Zhonghai Lu, KTH Royal Institute of Technology, SE Abstract With more processors integrated onto a single chip, and more and more time-sensitive applications added to on-chip systems, performance bound analysis becomes essential for QoS Network-on-Chip (NoC) designs and evaluations. To provide reliable and automated analysis for QoS NoC, we propose PATN (Performance Analysis Tool for NoC), which automatically computes the end-to-end delay bounds of data flows, and backlog bounds of buffers for NoC with arbitrary topology. PATN is designed based on network calculus, which rests on solid mathematical foundations and provides well-guaranteed accuracy of the results. Network calculus based analysis has been successfully employed for various communications networks, such as SpaceWire, AFDX, etc. For example, Airbus adopted and approved the network calculus based analysis for certification of its A380 aircraft. In this demonstration, we give an overview of PATN in two parts. First, we explain the architecture and main functions, and show the workflow and printed log by analysing the end-to-end delay bound of a data flow in a simple network. The log shows that the analysis follows the theoretical methodology exactly and hence obtains correct and tight results, as good as the theory can achieve. Second, we use PATN to analyse the delay bounds and backlog bounds for 3 NoCs with different topologies -- binary tree, mesh, and hierarchical topology of binary tree and mesh. The analyses demonstrate the computation speed and scalability of PATN. Moreover, comparisons of the delay bound, computed with different configuration parameters of the flows and routers, are conducted. They show how the delay bound is affected by the parameters.
UB04.04	GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES Authors: Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT Abstract Gemini is a synthesis and optimization tool for graphene-based digital devices. Given a combinational circuit description through its Boolean representation, Gemini produces a SPICE netlist mapped with graphene PN-Junction gates. The software is composed of a parser library to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library used to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. As stand-alone software or as a library that is easy to integrate into state-of-the-art tools, Gemini represents a first step of an enabling technology for future synthesis and optimization processes for graphene-based devices.
UB04.05	HWDEBLUR: DESIGN OF A HIGH PERFORMANCE CORE FOR REMOVING BLUR EFFECT ON IMAGES Authors: Giuseppe Airo' Farulla, Giulio Gambardella, Marco Indaco, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT Abstract This work aims at developing a high performance FPGA-based IP-core able to perform a deblurring algorithm in real-time. Modern approaches to deblurring usually either only handle simple types of blur, or need heavy user interaction. Moreover, they usually require several minutes (or even whole hours) to process a single image. Our purpose is to study the current state-of-the-art and identify the best deblurring algorithms that are suitable for a hardware implementation. The selected algorithm is optimized and implemented in hardware in order to perform the deblurring task with the highest possible performance.
UB04.06	ENERGY-MODULATED COMPUTING Authors: Maxim Rykunov, Reza Ramezani, Abdullah Baz, Xuefu Zhang, Delong Shang, Andrey Mokhov, Danil Sokolov, Fei Xia and Alex Yakovlev, Newcastle University, GB Abstract This demo will illustrate the principle of energy-modulated computing, according to which the flow of energy entering a computing system determines its computational flow. This principle will be fundamental for building future autonomous systems, such as those powered by energy harvesting sources and aimed at survival in power-deficient conditions. The demo includes a set of experimental circuits (with three VLSI chips and PCBs) that work in variable power supply conditions, and software tools for digital and analogue co-design (Workcraft, Petrify, MPSAT).
UB04.07	ID.FIX: AN EDA TOOL FOR FIXED-POINT REFINEMENT OF EMBEDDED SYSTEMS Authors: Olivier Sentieys¹, Daniel Menard² and Nicolas Simon³ ¹INRIA, FR; ²INSA Rennes, FR; ³University of Rennes, FR Abstract Most digital image and signal processing algorithms are implemented on architectures based on fixed-point arithmetic to satisfy the cost and power consumption constraints of embedded systems. The fixed-point conversion process (or refinement) is crucial for reducing the time-to-market. Design tools to automate this phase and to explore the design space are thus required. The ID.Fix EDA tool, based on the compiler infrastructure GECOS, allows for the conversion of floating-point C source code into C code using fixed-point data types. The data word-lengths are optimized by minimizing the implementation cost under an accuracy constraint. To obtain low optimization time, an analytical approach is used to evaluate the fixed-point computation accuracy. This approach is valid for systems made up of any (smooth) arithmetic operations.
UB04.09	FAULTIFY: PROBABILISTIC CIRCUIT FAULT EMULATION Authors: David May and Walter Stechele, TUM, DE Abstract We want to demonstrate an FPGA-based probability-aware fault emulator and its corresponding algorithms in the context of a real-time H.264 decoder. The demo will show that reliability constraints can be relaxed inside the circuit without noticeable degradation of the image quality when carefully investigating where the constraints can be relaxed. We will show how this investigation can be done using our emulator and we will show the effect of a relaxed robustness of the circuit in real-time.
19:30	End of session
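The delay and backlog bounds that PATN (UB04.03) computes come from network calculus. As a hedged illustration of the textbook closed-form results only, and not of PATN's actual algorithms or interface, the sketch below assumes a token-bucket arrival curve α(t) = σ + ρt and rate-latency service curves β(t) = R·max(0, t − T). All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TokenBucket:
    """Arrival curve alpha(t) = burst + rate * t (assumed flow model)."""
    burst: float  # sigma, e.g. in flits
    rate: float   # rho, e.g. in flits per cycle


@dataclass(frozen=True)
class RateLatency:
    """Service curve beta(t) = rate * max(0, t - latency) (assumed router model)."""
    rate: float     # R, e.g. in flits per cycle
    latency: float  # T, e.g. in cycles


def concatenate(servers: Iterable[RateLatency]) -> RateLatency:
    """Min-plus convolution of rate-latency curves: minimum rate, summed latency."""
    hops = list(servers)
    if not hops:
        raise ValueError("at least one server is required")
    return RateLatency(rate=min(h.rate for h in hops),
                       latency=sum(h.latency for h in hops))


def delay_bound(flow: TokenBucket, server: RateLatency) -> float:
    """Worst-case delay T + sigma / R, valid when rho <= R."""
    if flow.rate > server.rate:
        raise ValueError("flow rate exceeds service rate: the bound is infinite")
    return server.latency + flow.burst / server.rate


def backlog_bound(flow: TokenBucket, server: RateLatency) -> float:
    """Worst-case backlog sigma + rho * T, valid when rho <= R."""
    if flow.rate > server.rate:
        raise ValueError("flow rate exceeds service rate: the bound is infinite")
    return flow.burst + flow.rate * server.latency


if __name__ == "__main__":
    flow = TokenBucket(burst=4.0, rate=1.0)
    path = concatenate([RateLatency(rate=2.0, latency=3.0),
                        RateLatency(rate=4.0, latency=1.0)])
    print("end-to-end delay bound:", delay_bound(flow, path))
    print("backlog bound over the path:", backlog_bound(flow, path))
```

Concatenating the per-hop service curves before computing the delay charges the burst term σ/R only once, which is why an end-to-end analysis of this kind can give tighter bounds than adding up per-hop delays.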

Exhibition-Reception Exhibition Reception

Date: Tuesday 25 March 2014
Time: 18:30 - 19:30
Location / Room: Several serving points inside the Exhibition Area (Terrace Level)

The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.

Time	Label	Presentation Title Authors
19:30		End of session

5.1 SPECIAL DAY Hot Topic: Predictable Multi-Core Computing

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Saal 1

Organiser:
Jürgen Teich, University of Erlangen-Nuremberg, DE

Chair:
Petru Eles, Linköping University, SE

Co-Chair:
Jürgen Teich, University of Erlangen-Nuremberg, DE

The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. This session treats this important problem of time predictability of applications on multi-core platforms by presenting results of the impact of resource sharing on performance, an architecture that has been designed to meet predictability requirements as well as new results on scheduling mixed critical applications on multi-core platforms.

Time	Label	Presentation Title Authors
08:30	5.1.1	IMPACT OF RESOURCE SHARING ON PERFORMANCE AND PERFORMANCE PREDICTION Speakers: Jan Reineke and Reinhard Wilhelm, Informatik, Universität des Saarlandes, DE Abstract Multi-core processors are increasingly considered as execution platforms for embedded systems because of their good performance/energy ratio. However, the interference on shared resources poses several problems. It may severely reduce the performance of tasks executed on the cores, and it increases the complexity of timing analysis and/or decreases the precision of its results. Many applications implemented on multi-core platforms are safety- and some also time-critical. A critical issue for these applications is the reduced predictability of such systems resulting from the interference of different applications on shared resources. These interferences can be at least of two kinds: Several applications may request a resource at the same time, but the resource can only admit one access at a time. As a consequence, an arbitration mechanism may delay the request of all but one application, thus slowing down the other applications. This is the case of resources like buses, typically called bandwidth resources. On the other hand, one application may also change the state of a shared resource such that another application using that resource will suffer from a slowdown. This is the case with shared caches, which fall into the class of storage resources. Interference of shared resources makes worst-case execution time (WCET) analysis of applications more difficult since a task or a thread can no longer be analyzed for its timing behavior in isolation. All potential interferences slowing down the task under analysis have to be considered. This leads to a combinatorial explosion of the analysis complexity, as all possible interleavings of different threads have to be analyzed.
09:00	5.1.2	TIME-CRITICAL COMPUTING ON A SINGLE CHIP MASSIVELY PARALLEL PROCESSOR Speaker: Benoît Dupont de Dinechin, Kalray, FR Abstract The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. We illustrate how this problem has been addressed by suitably designing the architecture, implementation, and programming model, of the Kalray MPPA-256 single-chip many-core processor. The MPPA-256 (Multi-Purpose Processing Array) processor integrates 256 processing engine (PE) cores and 32 resource management (RM) cores on a single 28nm CMOS chip. These VLIW cores are distributed across 16 compute clusters and 4 I/O subsystems, each with a locally shared memory. On-chip communication and synchronization are supported by an explicitly addressed dual network-on-chip (NoC), with one node per compute cluster and 4 nodes per I/O subsystem. Off-chip interfaces include DDR, PCI and Ethernet, and a direct access to the NoC for low-latency processing of data streams. The key architectural features that support time-critical applications are timing compositional cores, independent memory banks inside the compute clusters, and the data NoC whose guaranteed services are determined by network calculus. The programming model provides communicators that effectively support distributed computing primitives such as remote writes, barrier synchronizations, active messages, and communication by sampling. POSIX time functions expose synchronous clocks inside compute clusters and mesosynchronous clocks across the MPPA-256 processor.
09:30	5.1.3	MAPPING MIXED-CRITICALITY APPLICATIONS ON MULTI-CORE ARCHITECTURES Speakers: Georgia Giannopoulou¹, Nikolay Stoimenov¹, Pengcheng Huang² and Lothar Thiele³ ¹ETH Zurich, CH; ²ETHZ, CH; ³Swiss Federal Institute of Technology Zurich, CH Abstract A common trend in real-time embedded systems is to integrate multiple applications on a single platform. Such systems are known as mixed-criticality (MC) systems when the applications are characterized by different criticality levels. Nowadays, multicore platforms are promoted due to cost and performance benefits. However, certification of multicore MC systems is challenging as concurrently executed applications of different criticalities may block each other when accessing shared platform resources. Most of the existing research on multicore MC scheduling ignores the effects of resource sharing on the response times of applications. Recently, an MC scheduling strategy was proposed, which explicitly accounts for these effects. This paper discusses how to combine this policy with an optimization method for the partitioning of tasks to cores as well as the static mapping of memory blocks, i.e., task data and communication buffers, to the banks of a shared memory architecture. Optimization is performed at design time, aiming to minimize the worst-case response times of tasks and to achieve efficient resource utilization. The proposed optimization method is evaluated using an industrial application.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

5.2 Hot Topic: Hacking and Protecting Hardware: Threats and Challenges

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 6

Organisers:
Said Hamdioui, TU Delft, NL
Giorgio Di Natale, LIRMM, FR

Chair:
Said Hamdioui, TU Delft, NL

Co-Chair:
Giorgio Di Natale, LIRMM, FR

For this Hot-Topic Session, we will have four leading researchers and experienced speakers from different companies to address both hacking and protecting ICs for chip data. Two speakers will focus on the weaknesses of IC and systems and the ways they can be hacked to retrieve secret data, while the other two will cover smart schemes that can be used to protect ICs from such attacks.

Time	Label	Presentation Title Authors
08:30	5.2.1	HARDWARE ATTACKS ON SECURE ICS Speaker: Gerard van Battum, Brightsight, NL Abstract He will talk a little bit about the history of attacks and their evolution till today. Thereafter, an overview and a classification of different attacks and their effects will be discussed. Examples will be given of hardware attack techniques on actual secure ICs, such as reverse engineering, mechanical probing, (e-beam) microscopy, etching and polishing, ROM code analysis and Focused Ion Beam modification. This will be put in perspective with commonly applied design practices to protect state-of-the-art secure ICs, which make hardware attacks more difficult.
08:52	5.2.2	ATTACKING SMART PHONES Speaker: Jean-Luc Danger, Secure IC, FR Abstract He will address the attack on mobile phones by side-channel. The cryptographic functions executed by the mobile phone processor leak information via the electromagnetic channel. Thus, this non-intrusive observation of the leakage is exploitable at distance to retrieve the secret keys of the cryptographic algorithms. Some attack examples will be given to demonstrate the power of such threats.
09:15	5.2.3	SECURING SYSTEM ON CHIPS Speaker: Fethulah Smailbegovic, ESCRYPT GmbH – Embedded Security, DE Abstract He will focus on the challenge of securing generic System on Chip (SoC) architectures. Growing SoC complexity, costs and short time-to-market requirements limit the availability of dedicated hardware security solutions in SoC architectures introducing new potential security risks. Answering the question of how to build secure and reliable SoCs is one of the major challenges in the near future. First, he will briefly talk about current state-of-the-art security solutions in SoCs and afterwards about architectural requirements for future secure System on Chip architectures.
09:37	5.2.4	SILICONAP: A SILICON AUTHENTICATION PLATFORM FOR SECURITY AND ANTI-COUNTERFEITING Speaker: Mohammad Tehranipoor, TrueLogic, US Abstract He will talk about design for security and anti-counterfeiting. His talk includes new design techniques for Trojan detection, Trojan prevention, vulnerability analysis, as well as design techniques for preventing counterfeiting of integrated circuits and providing means for easy detection.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

5.3 Reliable Systems in the Age of Variability

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 1

Chair:
Antonio Miele, Politecnico di Milano, IT

Co-Chair:
José L. Ayala, Complutense University of Madrid, ES

The evolution of the silicon industry over past decades has been fueled by continued scaling. This has motivated the rapid evolution of integration technologies. In future technology nodes, reliability is expected to become a first-order design constraint. This session tackles this with novel techniques, spanning from memoization to latency-insensitive systems, proposing to tolerate, recover and manage reliability issues in a more variable scenario.

Time	Label	Presentation Title Authors
08:30	5.3.1	(Best Paper Award Candidate) TEMPORAL MEMOIZATION FOR ENERGY-EFFICIENT TIMING ERROR RECOVERY IN GPGPUS Speakers: Abbas Rahimi¹, Luca Benini² and Rajesh Gupta¹ ¹UC San Diego, US; ²Università di Bologna, IT Abstract Manufacturing and environmental variability lead to timing errors in computing systems that are typically corrected by error detection and correction mechanisms at the circuit level. The cost and speed of recovery can be improved by memoization-based optimization methods that exploit spatial or temporal parallelisms in suitable computing fabrics such as general-purpose graphics processing units (GPGPUs). We propose here a temporal memoization technique for use in floating-point units (FPUs) in GPGPUs that uses value locality inside data-parallel programs. The technique recalls (memorizes) the context of error-free execution of an instruction on a FPU. To enable scalable and independent recovery, a single-cycle lookup table (LUT) is tightly coupled to every FPU to maintain contexts of recent error-free executions. The LUT reuses these memorized contexts to exactly, or approximately, correct errant FP instructions based on application needs. In real-world applications, the temporal memoization technique achieves an average energy saving of 8%-28% for a wide range of timing error rates (0%-4%) and outperforms recent advances in resilient architectures. This technique also enhances robustness in the voltage overscaling regime and achieves relative average energy saving of 66% with 11% voltage overscaling.
09:00	5.3.2	RELIABILITY-AWARE EXCEPTIONS: TOLERATING INTERMITTENT FAULTS IN MICROPROCESSOR ARRAY STRUCTURES Speakers: Waleed Dweik, Murali Annavaram and Michel Dubois, University of Southern California, US Abstract In future technology nodes, reliability is expected to become a first-order design constraint. Faults encountered in a chip can be classified into three categories: transient, intermittent, and permanent. Fault classification allows a chip to take the appropriate corrective action. Mechanisms have been proposed to distinguish transient from non-transient faults where all non-transient faults are handled as permanent. Intermittent faults induced by wearout phenomena have become the dominant reliability concern in nanoscale technology, yet there is no mechanism that provides finer classification of non-transient faults into intermittent and permanent faults. In this paper, we present a new class of exceptions called Reliability-Aware Exceptions (RAEs) which provide the ability to distinguish intermittent faults in microprocessor array structures. The RAE handlers have the ability to manipulate microprocessor array structures to recover from all three categories of faults. Using RAEs, we demonstrate that the reliability of two representative microarchitecture structures, load/store queue and reorder buffer in an out-of-order processor, is improved by average factors of 1.3 and 1.95, respectively.
09:30	5.3.3	TEMPERATURE AWARE ENERGY-RELIABILITY TRADE-OFFS FOR MAPPING OF THROUGHPUT-CONSTRAINED APPLICATIONS ON MULTIMEDIA MPSOCS Speakers: Anup Das, Akash Kumar and Bharadwaj Veeravalli, National University of Singapore, SG Abstract This paper proposes a design-time (offline) analysis technique to determine application task mapping and scheduling on a multiprocessor system and the voltage and frequency levels of each core (offline DVFS) that minimize application computation and communication energy, simultaneously minimizing processor aging. The proposed technique incorporates (1) the effect of the voltage and frequency on the temperature of a core; (2) the effect of neighboring core voltage and frequency on the temperature (spatial effect); (3) pipelined execution and cyclic dependencies among tasks; and (4) the communication energy component which often constitutes a significant fraction of the total energy for multimedia applications. The temperature model proposed here can be easily integrated in the design space exploration for multiprocessor systems. Experiments conducted with applications modeled as synchronous data-flow graphs in conjunction with the HotSpot tool for temperature modeling clearly demonstrate the quality and the speed-up achieved using the proposed approach. Further, they also show 40% savings in energy consumption with 6% increase in system lifetime.
09:45	5.3.4	RECOVERY-BASED RESILIENT LATENCY-INSENSITIVE SYSTEMS Speakers: Yuankai Chen¹, Xuan Zeng² and Hai Zhou¹ ¹Northwestern University, US; ²Fudan University, CN Abstract As the interconnect delay is becoming a larger fraction of the clock cycle time, the conventional global stalling mechanism, which is used to correct errors in general synchronous circuits, would be no longer feasible because of the expensive timing cost for the stalling signal to travel across the circuit. In this paper, we propose recovery-based resilient latency-insensitive systems (RLISs) that efficiently integrate error-recovery techniques with latency-insensitive design to replace the global stalling. We first demonstrate a baseline RLIS as the motivation of our work that uses an additional output buffer which guarantees that only correct data can enter the output channel. However, this baseline RLIS suffers from performance degradations even when errors do not occur. We propose a novel improved RLIS that allows erroneous data to propagate in the system. Equipped with improved queues that prevent accumulation of erroneous data, the improved RLIS retains the system performance. We provide theoretical studies that analyze the impact of errors on system performance and the queue sizing problem. We also theoretically prove that the improved RLIS performs no worse than the global stalling mechanism. Experimental results show that the improved RLIS has 40.3% and even 3.1% throughput improvements compared to the baseline RLIS and the infeasible global stalling mechanism respectively, with less than 10% hardware overhead.
10:00	IP2-10, 80	A LINUX-GOVERNOR BASED DYNAMIC RELIABILITY MANAGER FOR ANDROID MOBILE DEVICES Speakers: Pietro Mercati¹, Andrea Bartolini², Francesco Paterna¹, Tajana Simunic Rosing¹ and Luca Benini² ¹UCSD, US; ²University of Bologna, IT Abstract Reliability is a major concern in multiprocessors. Dynamic Reliability Management (DRM) aims at trading off processor performance with lifetime. The state-of-the-art publications study only the theory supported by simulation. This paper presents the first complete software implementation, working on real hardware, of a low-overhead, Android-compatible workload-aware DRM Governor for mobile multiprocessors. We discuss the design challenges and the run-time overhead involved. We show the effectiveness of our governor in guaranteeing the predefined target lifetime and show that it achieves up to 100% of lifetime improvement with respect to traditional governors, while providing comparable performance for critical applications.
10:01	IP2-11, 182	YIELD AND TIMING CONSTRAINED SPARE TSV ASSIGNMENT FOR THREE-DIMENSIONAL INTEGRATED CIRCUITS Speakers: Yu-Guang Chen¹, Kuan-Yu Lai¹, Ming-Chao Lee², Yiyu Shi³, Wing-Kai Hon¹ and Shih-Chieh Chang¹ ¹National Tsing Hua University, TW; ²MediaTek Inc., TW; ³Missouri University of Science and Technology, US Abstract Through Silicon Via (TSV) is a critical enabling technique in three-dimensional integrated circuits (3D ICs). However, it may suffer from many reliability issues. Various fault-tolerance mechanisms have been proposed in literature to improve yield, at the cost of significant area overhead. In this paper, we focus on the structure that uses one spare TSV for a group of original TSVs, and study the optimal assignment of spare TSVs under yield and timing constraints to minimize the total area overhead. We show that such problem can be modeled through constrained graph decomposition. An efficient heuristic is further developed to address this problem. Experimental results show that under the same yield and timing constraints, our heuristic can reduce the area overhead induced by the fault-tolerance mechanisms by up to 38%, compared with a seemingly more intuitive nearest-neighbor based heuristic.
10:02	IP2-12, 568	COMPILER-DRIVEN DYNAMIC RELIABILITY MANAGEMENT FOR ON-CHIP SYSTEMS UNDER VARIABILITIES Speakers: Semeen Rehman, Florian Kriebel, Muhammad Shafique and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE Abstract This paper presents a novel Dynamic Reliability Management System (DyReMS) for on-chip systems that performs resilience-driven resource allocation and mapping. It accounts for both the tasks' resilience properties and heterogeneous error recovery features of different cores. DyReMS also chooses a reliable task version (out of multiple reliability-aware transformed options) depending upon the reliability level of the allocated core. In case of error detection, rollbacks are performed. Our system provides 70%-87% improved task reliability compared to a timing reliability-optimizing core assignment, i.e. minimizing the probability of deadline misses (with EDF scheduling).
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

5.4 Prediction and optimization of timing variations

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 2

Chair:
Antonio Rubio, UPC Barcelona, ES

Co-Chair:
Marisa López Vallejo, UPM Madrid, ES

The session addresses yield analysis due to timing variations as well as various flip flop design techniques improving timing margins under variability.

Time	Label	Presentation Title Authors
08:30	5.4.1	EFFICIENT HIGH-SIGMA YIELD ANALYSIS FOR HIGH DIMENSIONAL PROBLEMS Speakers: Moning Zhang, Zuochang Ye and Yan Wang, Tsinghua National Laboratory for Information Science and Technology, Institute of Microelectronics, Tsinghua University, CN Abstract High-sigma analysis is important for estimating the probability of rare events. Traditional high-sigma analysis can only work for small-size (low-dimension) problems limited to 10 to 20 random variables, mostly due to the difficulty of finding optimal boundary points. In this paper we propose an efficient method to deal with high-dimension problems. The proposed method is based on performing optimization in a series of low-dimension parameter spaces. The final solution can be regarded as a greedy version of the global optimization. Experiments show that the proposed method can efficiently work with problems with > 100 independent variables.
09:00	5.4.2	SUB-THRESHOLD LOGIC CIRCUIT DESIGN USING FEEDBACK EQUALIZATION Speakers: Mahmoud Zangeneh and Ajay Joshi, Boston University, US Abstract Low energy has become one of the primary constraints in the design of digital VLSI circuits in recent years. Minimum-energy consumption can be achieved in digital circuits by operating in the sub-threshold regime. However, in this regime process variation can result in up to an order of magnitude variations in Ion/Ioff ratios leading to timing errors, which can have a detrimental impact on the functionality of the sub-threshold circuits. These timing errors become more frequent in scaled technology nodes where process variations are highly prevalent. Therefore, mechanisms to mitigate these timing errors while minimizing the energy consumption in sub-threshold circuits are required. In this paper, we propose the use of a variable threshold feedback equalizer circuit with combinational logic blocks to mitigate the timing errors, which can then be leveraged to reduce the dominant leakage energy by scaling supply voltage or decreasing the propagation delay. At the fixed supply voltage, we can decrease the propagation delay of the critical path using equalizer circuits and, correspondingly, decrease the leakage energy consumption. For an 8-bit carry lookahead adder designed in UMC 130 nm process, the operating frequency can be increased by 22.87% (on average), while reducing the leakage energy by 22.6% in the sub-threshold regime. Overall the feedback equalization technique provides up to 35.4% lower energy-delay product compared to the conventional non-equalized logic. Alternatively, for an 8-bit carry lookahead adder, the proposed technique enables us to reduce the critical voltage (beyond which timing errors occur) from 300 mV (nominal design) to 270 mV (design with feedback circuit), and provides a 16.72% decrease in energy per operation while maintaining performance.
09:30	5.4.3	STOCHASTIC ANALYSIS OF BUBBLE RAZOR Speakers: Guowei Zhang¹ and Peter Beerel² ¹Tsinghua University, CN; ²Univ. of Southern California, US Abstract Bubble Razor has been proposed to eliminate required timing margins in synchronous design caused by increasing delay variation due to process variation and aging. However, the theoretical analysis of its performance under variability is unknown. This paper presents a Markov Chain model to describe the behavior of Bubble Razor. Using this model, we analyze its performance and provide an optimizing strategy to maximize its benefits.
10:00	IP2-13, 1013	(Best Paper Award Candidate) MINIMIZING STATE-OF-HEALTH DEGRADATION IN HYBRID ELECTRICAL ENERGY STORAGE SYSTEMS WITH ARBITRARY SOURCE AND LOAD PROFILES Speakers: Yanzhi Wang¹, Xue Lin¹, Qing Xie¹, Naehyuck Chang² and Massoud Pedram¹ ¹University of Southern California, US; ²Seoul National University, KR Abstract Hybrid electrical energy storage (HEES) systems consisting of heterogeneous electrical energy storage (EES) elements are proposed to exploit the strengths of different EES elements and hide their weaknesses. The cycle life of the EES elements is one of the most important metrics. The cycle life is directly related to the state-of-health (SoH), which is defined as the ratio of full charge capacity of an aged EES element to its designed (or nominal) capacity. The SoH degradation models of battery in the previous literature can only be applied to charging/discharging cycles with the same state-of-charge (SoC) swing. To address this shortcoming, this paper derives a novel SoH degradation model of battery for charging/discharging cycles with arbitrary patterns. Based on the proposed model, this paper presents a near-optimal charge management policy focusing on extending the cycle life of battery elements in the HEES systems while simultaneously improving the overall cycle efficiency.
10:01	IP2-14, 517	DYNAMIC FLIP-FLOP CONVERSION TO TOLERATE PROCESS VARIATION IN LOW POWER CIRCUITS Speakers: Mehrzad Nejat, Bijan Alizadeh and Ali Afzali Kusha, School of Electrical and Computer Eng., College of Eng., University of Tehran, IR Abstract A novel time borrowing method called dynamic Flip-Flop conversion is presented in this paper. A timing violation predictor detects the violations halfway in the critical path and dynamically converts the critical Flip-Flop to a latch. This way, time borrowing benefits of latches are utilized in a Flip-Flop based design which is more adaptable with Computer-Aided-Design tools. The overhead of this method is smaller than that of similar methods due to the elimination of delay elements. According to the post-synthesis simulations and Monte-Carlo analysis of Spice simulations on some ITC'99 benchmark circuits, the power overhead of the proposed method is about 15% and 19% smaller than that of Soft-Edge-Flip-Flop and Dynamic-Clock-Stretching circuits respectively in a simple case of about 40% yield improvement. This overhead would be relatively even smaller for higher performance and yield improvements.
10:02	IP2-15, 900	A LOW POWER AND ROBUST CARBON NANOTUBE 6T SRAM DESIGN WITH METALLIC TOLERANCE Speakers: Luo Sun¹, Jimson Mathew¹, Rishad Shafik², Dhiraj Pradhan¹ and Zhen Li¹ ¹University of Bristol, GB; ²University of Southampton, GB Abstract Carbon nanotube field-effect transistor (CNTFET) is envisioned as a promising device to overcome the limitations of traditional CMOS based MOSFETs due to its favourable physical properties. This paper presents a novel six-transistor (6T) static random access memory (SRAM) bitcell design using CNTFETs. Extensive validations and comparative analyses are carried out with the proposed SRAM design using SPICE based simulations. We show that the proposed CNTFET based SRAM has a significantly better static noise margin (SNM) and write ability margin (WAM) compared to a CNTFET-based standard 6T bitcell, equivalent to isolated read-port 8T cell based on CNTFET, while consuming less dynamic power. We further demonstrate that it exhibits higher robustness under process, voltage and temperature (PVT) variations when compared with the traditional CMOS SRAM cell designs. Furthermore, metallic CNTs removal technique is used considering metallic tolerance to make the proposed SRAM design more reliable.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
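
Illustrative note on rare-event estimation (cf. 5.4.1): the abstract of 5.4.1 states that high-sigma analysis estimates the probability of rare events. The sketch below is not the method of that paper; it is a minimal, one-dimensional example of mean-shifted importance sampling, a standard way to estimate such probabilities, assuming a standard normal variable and a failure region X > t. The function names and the choice of proposal distribution N(t, 1) are assumptions made for this illustration.

```python
import math
import random


def tail_probability_exact(t):
    """P(X > t) for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def tail_probability_importance(t, n_samples, seed=0):
    """Estimate P(X > t) by sampling from N(t, 1) and reweighting each hit."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        y = rng.gauss(t, 1.0)  # proposal centred on the failure boundary
        if y > t:
            # Likelihood ratio phi(y) / phi(y - t) = exp(-t*y + t*t/2)
            total += math.exp(-t * y + 0.5 * t * t)
    return total / n_samples


if __name__ == "__main__":
    t = 5.0  # a "5-sigma" event
    print("exact     :", tail_probability_exact(t))
    print("estimated :", tail_probability_importance(t, 20000))
```

Plain Monte Carlo with the same number of samples would almost never observe a 5-sigma event, which is why the proposal is shifted toward the failure region. Finding a good shift in many dimensions is the hard part that the abstract refers to as finding optimal boundary points.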

5.5 Boosting the Scalability of Formal Verification Technologies

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 3

Chair:
Fahim Rahim, Atrenta, FR

Co-Chair:
Bernd Becker, University of Freiburg, DE

While the industrial usage of formal methods has proliferated in the past decade, the capacity limitations of these techniques remains a challenge to their applicability. This session introduces a set of novel advances to boost the scalability of numerous state-of-the-art verification core technologies.

Time	Label	Presentation Title Authors
08:30	5.5.1	SCALABLE LIVENESS VERIFICATION FOR COMMUNICATION FABRICS Speakers: Sebastiaan Joosten and Julien Schmaltz, Open University, NL Abstract In the realm of multi-core processors and systems-on-chip, communication fabrics constitute a key element. A large number of queues and distributed control are two important aspects of this class of designs. These aspects make decomposition and abstraction techniques difficult to apply. For this class of designs, the application of formal methods is a real challenge. In particular, the verification of liveness properties is often intractable. Communication fabrics can be seen as a set of queues and flops interconnected by combinatorial logic. Based on this simple but powerful observation, we propose a novel method for liveness verification. Our method directly applies to Register Transfer Level designs. The essential aspects of our approach are (1) to abstract away from the details of queue implementations and (2) an efficient encoding of liveness properties in an SMT instance. Experimental results are promising. Designs with hundreds of queues can be analysed for liveness within minutes.
09:00	5.5.2	PROPERTY DIRECTED INVARIANT REFINEMENT FOR PROGRAM VERIFICATION Speakers: Tobias Welp¹ and Andreas Kuehlmann² ¹UC Berkeley, US; ²Coverity, Inc., US Abstract We present a novel, sound, and complete algorithm for deciding safety properties in programs with static memory allocation. The new algorithm extends the program verification paradigm using loop invariants with a counterexample guided abstraction refinement (CEGAR) loop where the refinement is achieved by strengthening loop invariants using the QF_BV generalization of Property Directed Reachability (PDR). We compare the algorithm with other approaches to program verification and report experimental results.
09:30	5.5.3	SIMPLE INTERPOLANTS FOR LINEAR ARITHMETIC Speakers: Christoph Scholl¹, Florian Pigorsch¹, Stefan Disch¹ and Ernst Althaus² ¹University Freiburg, DE; ²University Mainz, DE Abstract Craig interpolation has turned out to be an essential method for many applications in formal verification. In this paper we focus on the computation of simple interpolants for the theory of linear arithmetic with rational coefficients. We successfully minimize the number of linear constraints in the final interpolant by several methods including proof transformations, linear programming, and SMT solving. Experimental results comparing the approach to standard methods from the literature prove the effectiveness of the approach and show reductions of up to 70% in the number of linear constraints.
09:45	5.5.4	TIGHTENING BDD-BASED APPROXIMATE REACHABILITY WITH SAT-BASED CLAUSE GENERALIZATION Speakers: Gianpiero Cabodi, Paolo Pasini, Stefano Quer and Danilo Vendraminetto, Politecnico di Torino, IT Abstract In the framework of symbolic model checking, BDD-based approximate reachability is potentially much more scalable than its exact counterpart. However, its practical applicability is highly limited by its static approach to abstraction, and the intrinsic difficulty to find an acceptable trade-off between accuracy and (memory/time) complexity. In this paper, we explore the use of CNF clauses, and of recent improvements in SAT algorithms, as additional players in BDD-based reachability. Cube generalization, a core step of the IC3 model checking algorithm, is the process of finding a minimal sub-clause, by removing as many literals as possible, such that it over-approximates a set of reachable states while excluding the cube. Generalization is used in IC3 to refine clause-based representations of state sets. We use it, in both the inductive and non inductive version, in order to strengthen BDD-based representations of state sets, computed by Machine By Machine (MBM) and Frame By Frame (FBF) over-approximate forward traversal algorithms. The resulting approach benefits from the orthogonal power of BDD and CNF representations, and it improves the scalability of BDD-based methods. Preliminary experimental results confirm that this approach can provide tighter representations of reachable state sets. Applications include fully BDD-based engines, as well as using over-approximate state sets as invariants or constraints in SAT-based model checking.
10:00	IP2-16, 831	MAKE IT REAL: EFFECTIVE FLOATING-POINT REASONING VIA EXACT ARITHMETIC Speakers: Miriam Leeser¹, Saoni Mukherjee¹, Jaideep Ramachandran¹ and Thomas Wahl² ¹Northeastern University, US; ²Northeastern University, Boston, US Abstract Floating-point arithmetic is widely used in scientific computing. While many programmers are subliminally aware that floating-point numbers only approximate the reals, few are cognizant of the dangers this entails for programming. Such dangers range from tolerable rounding errors in sequential programs, to unexpected, divergent control flow in parallel code. To address these problems, we present a decision procedure for floating-point arithmetic (FPA) that exploits the proximity to real arithmetic (RA), via a lossless reduction from FPA to RA. Our procedure does not involve any form of bit-blasting or bit-vectorization, and can thus generate much smaller back-end decision problems, albeit in a more complex logic. This tradeoff is beneficial for the exact and reliable analysis of parallel scientific software, which tends to give rise to large but benignly structured formulas. We have implemented a prototype decision engine and present encouraging results analyzing such software for numerical accuracy.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
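
Illustrative note on floating-point versus exact arithmetic (cf. IP2-16): the abstract of IP2-16 points out that rounding errors can lead to divergent control flow when the order of operations changes, as it may in parallel code. The sketch below is not the decision procedure of that paper; it is a small example, using Python's standard `fractions` module, of how summing the same three values in two different orders gives different floating-point results but identical exact rational results. The helper names and the threshold 0.6 are assumptions chosen for this illustration.

```python
from fractions import Fraction


def sum_left_to_right(values):
    """Add the values strictly in the given order."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def sum_right_to_left(values):
    """Add the same values in the reverse order."""
    return sum_left_to_right(list(reversed(values)))


floats = [0.1, 0.2, 0.3]
# Fraction(v) converts each double to the exact rational number it represents.
exact = [Fraction(v) for v in floats]

if __name__ == "__main__":
    a = sum_left_to_right(floats)
    b = sum_right_to_left(floats)
    print(a, b, a == b)            # the two floating-point sums differ
    print(a > 0.6, b > 0.6)        # so a branch on the sum can diverge
    print(sum_left_to_right(exact) == sum_right_to_left(exact))
```

Exact rational arithmetic is associative, so the order of a reduction cannot change the outcome; floating-point addition rounds after every step, so it can.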

5.6 Emerging logic technologies

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 4

Chair:
Mehdi Tahoori, KIT, DE

Co-Chair:
Marco Ottavi, University of Rome "Tor Vergata", IT

The papers in this session consider new ways to realize both Boolean and non-Boolean logic. Potential implementations are based on graphene, spin, and resonance energy transfer.

Time	Label	Presentation Title Authors
08:30	5.6.1	RETLAB: A FAST DESIGN-AUTOMATION FRAMEWORK FOR ARBITRARY RET NETWORKS Speakers: Mohammad Mottaghi, Arjun Rallapalli and Chris Dwyer, Duke University, US Abstract Resonance energy transfer (RET) circuits are networks of photo-active molecules that can implement arbitrary logic functions. The nanoscale size of these structures can bring high-density computation to new domains, e.g., in vivo sensing and computation. A key challenge in the design of a RET network is to find, among a huge set of configurations (i.e., design space), the optimum choice and arrangement of molecules on a nanostructure. The prohibitively large size of the design space makes it impractical to evaluate every possible configuration, motivating the need for design-space pruning to be integrated into the design flow. To this end, we have developed a computer-aided design framework, called RETLab, that enables structured pruning of the design space to extract a sufficiently small subset, which is fully evaluated and ranked based on user-defined metrics to yield the best configuration. More importantly, we have developed a new RET-simulation algorithm, which is several orders of magnitude (e.g., for a 4-node network, one million times) faster than the conventional Monte-Carlo-based simulation (MCS). This speedup in configuration evaluation enables a significantly more extensive design-space exploration with fewer and less constrained heuristics, compared to existing RET-network design methods which are ad-hoc and rely on MCS for configuration evaluation.
09:00	5.6.2	DESIGN OF 3D NANOMAGNETIC LOGIC CIRCUITS: A FULL-ADDER CASE STUDY Speakers: Robert Perricone, X. Sharon Hu, Joe Nahas and Michael Niemier, University of Notre Dame, US Abstract Nanomagnetic logic (NML) is a "beyond-CMOS" technology that combines logic and memory capabilities through field-coupled interactions between nanoscale magnets. NML is intrinsically non-volatile, low-power, and radiation-hard when compared to CMOS equivalents. Moreover, there have been numerous demonstrations of NML circuit functionality within the last decade. These fabricated structures typically employ devices with in-plane magnetization to move and process data. However, in-plane layouts imply circuits and interconnects in only two dimensions (2D), which makes signal routing, and hence circuits, more complex. In this paper, we introduce NML circuits that move and process data in three dimensions (3D). We employ devices with perpendicular magnetic anisotropy (PMA) (i.e., out-of-plane magnetization states) and discuss their behavior when utilized in 3D designs. Furthermore, we provide a systematic design approach for 3D NML circuits using a threshold full adder as a case study. We compare our 3D adder to 2D adders to highlight the benefits of 3D NML circuits, which include simpler signal routing and a smaller area footprint.
09:30	5.6.3	HIGHLY ACCURATE SPICE-COMPATIBLE MODELING FOR SINGLE- AND DOUBLE-GATE GNRFETS WITH STUDIES ON TECHNOLOGY SCALING Speakers: Morteza Gholipour¹, Ying-Yu Chen², Amit Sangai² and Deming Chen² ¹University of Tehran, IR; ²University of Illinois at Urbana-Champaign, US Abstract In this paper, we present a highly accurate closed-form compact model for Schottky-Barrier-type Graphene Nano-Ribbon Field-Effect Transistors (SB-GNRFETs). This is a physics-based analytical model for the current-voltage (I-V) characteristics of SB-GNRFETs. We carry out accurate approximations of Schottky barrier tunneling, channel charge and current, which provide improved accuracy while maintaining compactness. This SPICE-compatible compact model surpasses the existing model [15] in accuracy, and enables efficient circuit-level simulations of futuristic GNRFET-based circuits. The proposed model considers various design parameters and process variation effects, including graphene-specific edge roughness, which allows complete and thorough exploration and evaluation of SB-GNRFET circuits. We are able to model both single- and double-gate SB-GNRFETs, so we can evaluate and compare these two types of SB-GNRFET. We also compare circuit-level performance of SB-GNRFETs with multi-gate (MG) Si-CMOS for a scalability study in future generation technology. Our circuit simulations indicate that SB-GNRFET has an energy-delay product (EDP) advantage over Si-CMOS; the EDP of the ideal SB-GNRFET (assuming no process variation) is ~1.3% of that of Si-CMOS, while the EDP of the non-ideal case with process variation is 136% of that of Si-CMOS. Finally, we study technology scaling with SB-GNRFET and MG Si-CMOS. We show that the EDP of ideal (non-ideal) SB-GNRFET is ~0.88% (54%) of that of Si-CMOS as the technology node scales down to 7 nm.
09:45	5.6.4	REWIRING FOR THRESHOLD LOGIC CIRCUIT MINIMIZATION Speakers: Chia-Chun Lin¹, Chun-Yao Wang¹, Yung-Chih Chen² and Ching-Yi Huang¹ ¹Dept. of Computer Science, National Tsing Hua University, TW; ²Dept. of Computer Science and Engineering, Yuan Ze University, TW Abstract Recently, there have been many works focusing on synthesis, verification, and testing of threshold circuits due to the rapid development in efficient implementation of threshold logic circuits. To minimize the hardware cost of threshold circuit implementation, this paper proposes a heuristic that consists of rewiring operations and a simplification procedure. Additionally, a subset of input vectors of a gate, called critical-effect vectors, are proved to be complete for formally verifying the equivalence of two threshold logic gates, instead of the whole truth table in this paper. This achievement can accelerate the equivalence checking of two threshold logic gates. The experimental results show that the proposed heuristic can efficiently reduce the cost.
10:00	IP2-17, 238	WIDTH MINIMIZATION IN THE SINGLE-ELECTRON TRANSISTOR ARRAY SYNTHESIS Speakers: Chian-Wei Liu¹, Chang-En Chiang¹, Ching-Yi Huang¹, Chun-Yao Wang¹, Yung-Chih Chen², Suman Datta³ and Vijaykrishnan Narayanan⁴ ¹Dept. of Computer Science, National Tsing Hua University, TW; ²Dept. of Computer Science and Engineering, Yuan Ze University, TW; ³Department of Electrical Engineering, The Pennsylvania State University, US; ⁴Department of Computer Science and Engineering, The Pennsylvania State University, US Abstract Power consumption has become one of the primary challenges to meeting Moore's law. For reducing power consumption, the Single-Electron Transistor (SET) at room temperature has been demonstrated as a promising device for extending Moore's law due to its ultra-low power consumption during operation. Prior work has proposed an automated mapping approach for SET architecture which focuses on minimizing the number of hexagons in an SET array. However, the area of an SET array is more related to the width. Consequently, in this work, we propose an approach for width minimization of the SET arrays. The experimental results show that the proposed approach saves 26% of width compared with the state-of-the-art for a set of MCNC and IWLS 2005 benchmarks while spending similar CPU time.
10:01	IP2-18, 704	AREA MINIMIZATION SYNTHESIS FOR RECONFIGURABLE SINGLE-ELECTRON TRANSISTOR ARRAYS WITH FABRICATION CONSTRAINTS Speakers: Yi-Hang Chen, Jian-Yu Chen and Juinn-Dar Huang, Department of Electronics Engineering, National Chiao Tung University, TW Abstract As fabrication processes exploit even deeper submicron technology, power dissipation has become a crucial issue for most electronic circuit and system designs nowadays. In particular, leakage power is becoming a dominant source of power consumption. Recently, the reconfigurable single-electron transistor (SET) array has been proposed as an emerging circuit design style for continuing Moore's Law due to its ultra-low power consumption. Several automated synthesis approaches have been developed for the reconfigurable SET array in the past few years. Nevertheless, all of those existing methods consider fabrication constraints, which are mandatory, merely in late synthesis stages. In this paper, we propose a synthesis algorithm, featuring both variable reordering and product term reordering, for area minimization. In addition, our algorithm takes those mandatory fabrication constraints into account in early stages for better outcomes. Experimental results show that our new method can achieve an area reduction of up to 24% as compared to current state-of-the-art techniques.
10:02	IP2-19, 247	SOFTWARE-BASED PAULI TRACKING IN FAULT-TOLERANT QUANTUM CIRCUITS Speakers: Alexandru Paler¹, Simon Devitt², Kae Nemoto² and Ilia Polian¹ ¹University of Passau, DE; ²National Institute of Informatics, JP Abstract The realisation of large-scale quantum computing is no longer simply a hardware question. The rapid development of quantum technology has resulted in dozens of control and programming problems that should be directed towards the classical computer science and engineering community. One such problem is known as Pauli tracking. Methods for implementing quantum algorithms that are compatible with crucial error correction technology utilise extensive quantum teleportation protocols. These protocols are intrinsically probabilistic and result in correction operators that occur as byproducts of teleportation. These byproduct operators do not need to be corrected in the quantum hardware itself, but are tracked through the circuit and output results reinterpreted. This tracking is routinely ignored in quantum information as it is assumed that tracking algorithms will eventually be developed. In this work we help fill this gap and present an algorithm for tracking byproduct operators through a quantum computation.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
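
Illustrative note on Pauli tracking (cf. IP2-19): the abstract of IP2-19 describes byproduct operators that are tracked in software rather than corrected in hardware, with measurement results reinterpreted afterwards. The sketch below is not the algorithm of that paper; it is a minimal Pauli-frame tracker using the standard conjugation rules of the Clifford gates H, S and CNOT, ignoring global phase. The class name `PauliFrame` and its method names are hypothetical, chosen for this illustration.

```python
class PauliFrame:
    """Tracks one pending Pauli byproduct per qubit, up to global phase.

    Each qubit holds two bits: x = 1 means a pending X component,
    z = 1 means a pending Z component (both set means Y).
    """

    def __init__(self, n_qubits):
        self.x = [0] * n_qubits
        self.z = [0] * n_qubits

    def record(self, qubit, pauli):
        """Multiply a new byproduct 'X', 'Y' or 'Z' into the frame."""
        if pauli in ("X", "Y"):
            self.x[qubit] ^= 1
        if pauli in ("Z", "Y"):
            self.z[qubit] ^= 1

    def h(self, qubit):
        """Hadamard exchanges X and Z."""
        self.x[qubit], self.z[qubit] = self.z[qubit], self.x[qubit]

    def s(self, qubit):
        """Phase gate maps X to Y and leaves Z unchanged."""
        self.z[qubit] ^= self.x[qubit]

    def cnot(self, control, target):
        """CNOT copies X from control to target and Z from target to control."""
        self.x[target] ^= self.x[control]
        self.z[control] ^= self.z[target]

    def flips_z_measurement(self, qubit):
        """A Z-basis outcome must be inverted if an X component is pending."""
        return self.x[qubit] == 1


if __name__ == "__main__":
    frame = PauliFrame(2)
    frame.record(0, "X")   # byproduct reported by a teleportation step
    frame.cnot(0, 1)       # the X byproduct spreads to the target qubit
    frame.h(0)             # on qubit 0 it turns into a Z byproduct
    print(frame.flips_z_measurement(0), frame.flips_z_measurement(1))
```

Only classical bits are updated as gates are applied, so no extra quantum operations are needed; the frame is consulted once, when a measurement result is read out.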

5.7 Test Generation and Optimization

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 5

Chair:
Xiaoqing Wen, Kyushu Institute of Technology, JP

Co-Chair:
Grzegorz Mrugalski, Mentor Graphics, PL

The session covers generation of tests for different fault models including interconnect opens, interconnect for 3D memories, and small delay faults. Additionally test optimization for SoC designs is presented.

Time	Label	Presentation Title Authors
08:30	5.7.1	(Best Paper Award Candidate) EFFICIENT SMT-BASED ATPG FOR INTERCONNECT OPEN DEFECTS Speakers: Dominik Erb¹, Karsten Scheibler¹, Matthias Sauer² and Bernd Becker² ¹University of Freiburg, Chair of Computer Architecture, DE; ²University of Freiburg, DE Abstract Interconnect opens are known to be one of the predominant defects in nanoscale technologies. However, automatic test pattern generation for open faults is challenging, because of their rather unstable behaviour and the numerous electric parameters which need to be considered. Thus, most approaches try to avoid accurate modeling of all constraints and use simplified fault models in order to detect as many faults as possible or make assumptions which decrease both complexity and accuracy. This paper presents a new SMT-based approach which for the first time supports the Robust Enhanced Aggressor Victim model without restrictions and handles oscillations. It is combined with the first open fault simulator fully supporting the Robust Enhanced Aggressor Victim model and thereby accurately considering unknown values. Experimental results show the high efficiency of the new method outperforming previous approaches by up to two orders of magnitude.
09:00	5.7.2	INTERCONNECT TEST FOR 3D STACKED MEMORY-ON-LOGIC Speakers: Mottaqiallah Taouil¹, Mahmoud Masadeh¹, Said Hamdioui¹ and Erik Jan Marinissen² ¹Delft University of Technology, NL; ²IMEC, BE Abstract Three-dimensional stacked IC (3D-SIC) technology based on Through-Silicon Vias (TSVs) provides numerous advantages as compared to traditional 2D-ICs. A potential application is memory stacked on logic, providing enhanced throughput, and reduced latency and power consumption. However, testing the TSV interconnects between the two dies is challenging, as the memory and the logic die might come from different manufacturers. Currently, no standard exists and the proposed solutions fail to address dynamic and time-critical faults (at-speed testing). In addition, memory vendors have not been in favor of putting additional DfT structures such as JTAG for interconnect testing on their memory devices. This paper proposes a new Memory Based Interconnect Test (MBIT) approach for 3D stacked memories. Our test patterns are applied by read and write instructions to the memory and are validated by a case study where a 3D memory is assumed to be stacked on a MIPS64 processor. The main benefits of the MBIT approach are: (1) zero area overhead, (2) the ability to detect both static and dynamic faults and perform at-speed testing, (3) flexibility in applying any test pattern, as this can be executed by the CPU on the logic die and (4) extremely short test execution time.
09:30	5.7.3	AN EFFECTIVE APPROACH TO AUTOMATIC FUNCTIONAL PROCESSOR TEST GENERATION FOR SMALL-DELAY FAULTS Speakers: Andreas Riefert¹, Lyl Ciganda², Matthias Sauer¹, Paolo Bernardi², Matteo Sonza Reorda³ and Bernd Becker¹ ¹University of Freiburg, DE; ²Politecnico di Torino, IT; ³Politecnico di Torino - DAUIN, IT Abstract Functional microprocessor test methods provide several advantages compared to DFT approaches, like reduced chip cost and at speed execution. However, the automatic generation of functional test patterns is an open issue. In this work we present an approach for the automatic generation of functional microprocessor test sequences for small-delay faults based on Bounded Model Checking. We utilize an ATPG framework for small-delay faults in sequential, non-scan circuits and propose a method for constraining the input space for generating functional test sequences (i.e., test programs). We verify our approach by evaluating the miniMIPS microprocessor. In our experiments we were able to reach over 97 % fault efficiency. To the best of our knowledge, this is the first fully automated approach to functional microprocessor test for small-delay faults.
09:45	5.7.4	MULTI-SITE TEST OPTIMIZATION FOR MULTI-VDD SOCS USING SPACE- AND TIME-DIVISION MULTIPLEXING Speakers: Fotios Vartziotis¹, Chrysovalantis Kavousianos², Krishnendu Chakrabarty³, Rubin Parekhji⁴ and Arvind Jain⁴ ¹University of Ioannina, GR; ²Department of Computer Science and Engineering, University of Ioannina, GR; ³Duke University, US; ⁴Texas Instruments, IN Abstract Even though system-on-chip (SoC) testing at multiple voltage settings significantly increases test complexity, the use of a different shift frequency at each voltage setting offers parallelism that can be exploited by time-division multiplexing (TDM) to reduce test length. We show that TDM is especially effective for small-bitwidth and heavily loaded test-access mechanisms (TAMs), thereby tangibly increasing the effectiveness of multi-site testing. However, TDM suffers from some inherent limitations that do not allow the fullest possible exploitation of TAM bandwidth. To overcome these limitations, we propose space-division multiplexing (SDM), which complements TDM and offers higher multi-site test efficiency. We implement space- and time-division multiplexing (STDM) using a new, scalable test-time minimization method based on a combination of bin packing and simulated annealing. Results for industrial SoCs highlight the advantages of the proposed optimization method.
10:00	IP2-20, 50	AN EFFICIENT TEMPERATURE-GRADIENT BASED BURN-IN TECHNIQUE FOR 3D STACKED ICS Speakers: Nima Aghaee, Zebo Peng and Petru Eles, Linköping University, SE Abstract Burn-in is usually carried out with high temperature and elevated voltage. Since some of the early-life failures depend not only on high temperature but also on temperature gradients, simply raising the temperature of an IC is not sufficient to detect them. This is especially true for 3D stacked ICs, since they usually have very large temperature gradients. The efficient detection of these early-life failures requires that specific temperature gradients are enforced as a part of the burn-in process. This paper presents an efficient method to do so by applying high-power stimuli to the cores of the IC under burn-in through the test access mechanism. Therefore, no external heating equipment is required. The scheduling of the heating and cooling intervals to achieve the required temperature gradients is based on thermal simulations and is guided by functions derived from a set of thermal equations. Experimental results demonstrate the efficiency of the proposed method.
10:01	IP2-21, 17	TEST AND NON-TEST CUBES FOR DIAGNOSTIC TEST GENERATION BASED ON MERGING OF TEST CUBES Speaker: Irith Pomeranz, Purdue University, US Abstract Test generation by merging of test cubes supports test compaction and test data compression. This paper describes a new approach to the use of test cube merging for the generation of compact diagnostic test sets. For this the paper uses the new concept of non-test cubes. While a test cube for a fault fi0 detects the fault, a non-test cube for a fault fi1 prevents the fault from being detected. Merging a test cube for a fault fi0 and a non-test cube for a fault fi1 produces a diagnostic test cube that distinguishes the two faults. The paper describes a procedure for diagnostic test generation based on merging of test and non-test cubes. Experimental results demonstrate that compact diagnostic test sets are obtained.
10:02	IP2-22, 905	NEW IMPLEMENTATIONS OF PREDICTIVE ALTERNATE ANALOG/RF TEST WITH AUGMENTED MODEL REDUNDANCY Speakers: Haithem Ayari, Florence Azais, Serge Bernard, Mariane Comte, Vincent Kerzerho and Michel Renovell, LIRMM, CNRS/Univ. Montpellier 2, FR Abstract This paper discusses new implementations of the predictive alternate test strategy that exploit model redundancy in order to improve test confidence. The key idea is to build, during the training phase, not just one regression model for each specification, as in the classical implementation, but several regression models. This redundancy is then used during the testing phase to identify suspect predictions and remove the corresponding devices from the alternate test flow. In this paper, we explore various options for implementing model redundancy, based on the use of different indirect measurement combinations and/or different partitions of the training set. The proposed implementations are evaluated on a real case study for which we have production test data from 10,000 devices.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

5.8 Hot Topic: System Integration - The Bridge between More than Moore and More Moore

Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Exhibition Theatre

Organisers:
Manfred Dietrich, Fraunhofer IIS/EAS Dresden, DE
Kai Hahn, University Siegen, DE

Chair:
Manfred Dietrich, Fraunhofer IIS/EAS Dresden, DE

Co-Chair:
Kai Hahn, University Siegen, DE

System Integration using 3D technology is a very promising way to cope with current and future requirements for electronic systems. Since the pure shrinking of devices (known as "More Moore") will come to an end due to physical and economic restrictions, the integration of systems (e.g. by stacking dies, or by adding sensor functions) shows a way to maintain the growth in complexity as well as in diversity which is necessary for future applications. This so-called "More than Moore" approach complements the conventional SoC product engineering. This session gives insights into System Integration design challenges from different perspectives, ranging from design technology through MEMS product engineering and 3D interconnect to automotive cyber-physical systems.

Time	Label	Presentation Title Authors
08:30	5.8.1	DESIGN TECHNOLOGY FOR 3-D INTEGRATED SYSTEMS Speaker: Andy Heinig, Fraunhofer IIS/EAS, DE Abstract More than Moore technologies (MtM) enable the dense integration of different circuits in a package. The short length and small spacing of wires enable high-speed and highly parallel interconnects between system parts such as processor and memory. In the first part of the presentation we will give an overview of the current status of MtM, from system-in-package up to 3D stacking with through-silicon vias, either on interposers or stacked directly. The second part of the presentation is dedicated to the design of MtM systems. One challenge is the tight integration of analog and digital dies, which requires stronger consideration of several electrical and multi-physical interactions (e.g. thermal management, power distribution and electromagnetic compatibility) than in 2D SoC design. The second challenge is the wide design space opened by MtM. It requires new methods that guide the designer to find the best trade-off between system performance and production costs. Using the integration of a processor and WideIO memory on a silicon interposer, which increases the memory bandwidth in future high-end applications, we demonstrate new EDA methods for design space exploration, estimation of routing congestion and interposer routing.
08:45	5.8.2	SEMICONDUCTOR PACKAGING IS BACK TO EUROPE - ADVANCES IN SYSTEM INTEGRATION IN WAFER LEVEL PACKAGING Speaker: Steffen Kroehnert, NANIUM S.A. - Niederlassung Dresden, DE Abstract Different market segments, from mobile communication and consumer to automotive, see the increasing need to focus on system integration in less space instead of single components or functional groups. This drives advanced semiconductor packaging to diversify and become more complex, but at the same time an integrated functional part of the system. The demand for more and more diversified functionality in the same or even less space drives the development of "More-than-Moore" (MtM) solutions in the packaging world. The keyword is again "System-in-Package" (SiP). Chip-Package-Board Co-Design and Co-Development are essential keys to success. Besides some theory, the paper will show some real product examples where system integration in the package saved up to 4X space on the board for the same functionality with even more performance. While today the majority of SiP is still realized using laminated organic substrate interposers, the need to close the gap to System-on-Chip (SoC) performance is driving closer distances of the single functional elements to each other. This can be realized by Fan-Out Wafer Level Packaging (FO-WLP) technologies, like eWLB (embedded Wafer Level Ball Grid Array), which overcomes Fan-In Wafer Level Packaging (FI-WLP) limitations especially in terms of system integration, keeping the advantages of scalability and cost-efficient batch processing. In the paper the good progress made in developing eWLB as a technology platform will be shown, mainly using FO-WLP as an enabler for System-in-Package on Wafer Level (WLSiP).
09:00	5.8.3	MEMS AND 3D-IC PRODUCT ENGINEERING - TECHNOLOGY DESIGN FOR SYSTEM INTEGRATION Speaker: Kai Hahn, University Siegen, DE Abstract Taking into account the diversity of technologies from die manufacturing to packaging, it becomes clear that for product engineering of integrated systems such as MEMS or stacked 3D circuits the constraints and inter-dependencies of design and manufacturing are of special interest. The configuration of these technologies is strongly application specific and design methods differ completely from the approach known from the development of conventional two-dimensional ICs. The presentation will cover methods and tools for technology design in the area of MEMS as well as for 3D integration.
09:15	5.8.4	3D-TSV-HUB: POTENTIALS AND CHALLENGES FOR VERTICAL INTERCONNECTS IN NETWORKS-ON-CHIPS Speaker: Andreas Herkersdorf, TU München, DE Abstract Sophisticated Networks-on-Chip (NoCs) will form the backbone for on-chip communication in future System-on-Chip designs. Already in conventional planar systems the synthesis of application-specific NoCs is a complex task. When shifting to a stacked die environment, further degrees of freedom are added and a large design space is created. Through Silicon Vias (TSVs) are deployed for building vertical NoC links in 3D systems. However, TSVs are cost intensive in several respects. Area consumption is high due to large TSV diameters (compared to planar metal layer interconnect) and keep-out areas in intermediate die layers. Furthermore, mechanically induced stress can lead to runtime failures and an overall low system yield. Therefore, in order to ensure that a minimum number of TSVs are operated to their full capacity in a performance- and cost-efficient manner, a 3D-TSV-Hub has been proposed to support smart mapping of communication flows onto TSVs during NoC synthesis. Mechanisms to handle production and runtime failures are integrated into the 3D-TSV-Hub concept and also considered during synthesis. Design aspects like compliance with thermal requirements and manufacturing process requirements influence the optimization of 3D-NoCs. In order to consider such factors, we operate the NoC synthesis tool in interplay with Design Space Exploration and 3D-Floorplanner tools of project partners.
09:30	5.8.5	SENSORS AND POWER DRIVERS, BRIDGE BETWEEN SYSTEM ENVIRONMENT AND COMPUTING Speaker: Jochen Reisinger, Infineon Technologies Austria AG, AT Abstract There are many main drivers enabling the most significant innovations in system solutions which are based on, or only supported by, electronics. No doubt, the best known driver is the availability of deep submicron technology nodes for the implementation of computing functions (More Moore). But competitive system architectures and functional system partitionings (technology selections) strongly depend on highly efficient interfaces to the system environment. Those interfaces support many functions, for example: a) sensing of physical parameters (temperature, pressure, speed, power, ...), b) providing power and control signals for actuators (drivers for motors, pumps, ...), c) providing power for the computing system (including safety, power up/down), d) interfacing to human bodies and/or other system elements and e) communication of system operative and control data (WiFi, Bluetooth, ...). Those interfaces ask for highly efficient (in terms of space, power, performance, ..., and cost) 3D integration technologies and design methodologies. Infineon examples for sensors and drivers will be presented.
09:45	5.8.6	CONCLUSIONS AND DISCUSSION Speaker: Manfred Dietrich, Fraunhofer, DE
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

IP2 Interactive Presentations

Date: Wednesday 26 March 2014
Time: 10:00 - 10:30
Location / Room: Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with the IP paper is on display throughout the morning. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award 'Best IP of the Day' is given.

Label	Presentation Title Authors
IP2-1	FAST AND ACCURATE COMPUTATION USING STOCHASTIC CIRCUITS Speakers: Armin Alaghi and John P. Hayes, University of Michigan - Ann Arbor, US Abstract Stochastic computing (SC) is a low-cost design technique that has great promise in applications such as image processing. SC enables arithmetic operations to be performed on stochastic bit-streams using ultra-small and low-power circuitry. However, accurate computations tend to require long run-times due to the random fluctuations inherent in stochastic numbers (SNs). We present novel techniques for SN generation that lead to better accuracy/run-time trade-offs. First, we analyze a property called progressive precision (PP) which allows computational accuracy to grow systematically with run-time. Second, borrowing from Monte Carlo methods, we show that SC performance can be greatly improved by replacing the usual pseudo-random number sources by low-discrepancy (LD) sequences that are predictably progressive. Finally, we evaluate the use of LD stochastic numbers in SC, and show they can produce significantly faster and more accurate results than existing stochastic designs.
IP2-2	DRAM-BASED COHERENT CACHES AND HOW TO TAKE ADVANTAGE OF THE COHERENCE PROTOCOL TO REDUCE THE REFRESH ENERGY Speakers: Zoran Jaksic and Ramon Canal, Universitat Politecnica de Catalunya, ES Abstract Recent technology trends have turned DRAMs into an interesting candidate to substitute traditional SRAM-based on-chip memory structures (i.e. register file, cache memories). Nevertheless, a major problem in introducing these cells is that they lose their state (i.e. value) over time, and they have to be refreshed. This paper proposes the implementation of coherent caches with DRAM cells. Furthermore, we propose to use the coherence state to tune the refresh overhead. According to our analysis, an average of up to 57% of refresh energy can be saved. Also, compared to caches implemented in SRAM, total energy savings are on average up to 39%, depending on the refresh policy, with a performance loss below 8%.
IP2-3	(Best Paper Award Candidate) REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3) Speakers: Alen Bardizbanyan¹, Magnus Själander², David Whalley² and Per Larsson-Edefors¹ ¹Chalmers University of Technology, SE; ²Florida State University, US Abstract Fast set-associative level-one data caches (L1 DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1 DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependency detection (ELD3) technique for in-order pipelines. This technique makes it possible to detect if a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are sequentially performed so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD3 technique enables about 49% of all load operations to access the L1 DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1 DC energy by 13%.
IP2-4	DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP Speakers: Preethi Parayil Mana Damodaran¹, Stefan Wallentowitz² and Andreas Herkersdorf³ ¹LIS, Technical University of Munich, DE; ²Technische Universität München, Institute for Integrated Systems, DE; ³TU München, DE Abstract In a shared-memory based tiled many-core system-on-chip architecture, memory accesses present a huge performance bottleneck in terms of access latency as well as bandwidth requirements. The best practice approach to address this issue is to provide a multi-level cache hierarchy and a suitable cache-coherency mechanism. This paper presents a method to increase the memory access performance in distributed-directory-coherency-protocol based tiled many-core systems. The proposed method introduces an alternate design for the system-wide shared last-level caches (LLC) placed between the memory and the node private caches (NPC). The proposed system-wide shared LLC layer is distributed over the entire network and it interacts with the home directories of specific cache lines. Results from simulating SPEC2000 benchmark applications executed on a SystemC model of the proposed design show a minimum performance improvement of 20-25% when compared to a model without the shared cache layer at the expense of an additional 2% of the total cache memory space (NPC + LLC memory). In addition, the proposed design shows a minimum 7-15% and an average 14-15% improvement in performance in comparison to centralized system-wide shared LLC of equivalent size and dynamic mapped distributed LLC of equivalent size respectively.
IP2-5	DESIGN OF SAFETY CRITICAL SYSTEMS BY REFINEMENT Speakers: Alex Iliasov, Arseniy Alekseyev, Danil Sokolov and Andrey Mokhov, Newcastle University, GB Abstract An increasingly large number of safety-critical embedded systems rely on software to prevent and mitigate hazards occurring due to design errors and unexpected interactions of the system with its users and the environment. Implementing a safety instrumented function in the way advocated by the traditional software methods requires an intimate understanding and thorough validation of a complex ecosystem of programming languages, compilers, operating systems and hardware. We propose to consider an alternative where a system designer, for each individual problem, creates in a correct-by-construction manner both the design of a system and its compilation and execution infrastructure. This permits an uninterrupted chain of a formal correctness argument spanning from formalised requirements all the way to the gate-level characterisation of an execution environment. The past decade of advances in verification technology turned the mechanical verification of large-scale models into a reality while the pressure of certification makes the cost of a formally verified development routine increasingly acceptable. The proposed technique fits the Grand Challenge for Computer Research posed by Hoare in 2003, namely, development of a Verifying Compiler which not only mechanically translates a given program from one language to another but also verifies its correctness according to a formal specification. This allows meeting the most stringent software certification requirements such as SIL 4. We illustrate the idea with a small case study developed using the Event-B modelling notation and tools.
IP2-6	ENERGY OPTIMIZATION IN ANDROID APPLICATIONS THROUGH WAKELOCK PLACEMENT Speakers: Faisal Alam¹, Preeti Ranjan Panda¹, Nikhil Tripathi², Namita Sharma³ and Sanjiv Narayan² ¹IIT Delhi, IN; ²Calypto Design Systems, IN; ³Indian Institute of Technology Delhi, IN Abstract Energy efficiency is a critical factor in mobile systems, and a significant body of recent research efforts has focused on reducing the energy dissipation in mobile hardware and applications. The Android OS Power Manager provides programming interface routines called wakelocks for controlling the activation state of devices on a mobile system. An appropriate placement of wakelock acquire and release functions in the application can make a significant difference to the energy consumption. In this paper, we propose a data flow analysis based strategy for determining the placement of wakelock statements corresponding to the uses of devices in an application. Our experimental evaluation on a set of Android applications shows significant (up to 32%) energy savings with the proposed optimization strategy.
IP2-7	A WEAR-LEVELING-AWARE DYNAMIC STACK FOR PCM MEMORY IN EMBEDDED SYSTEMS Speakers: Qingan Li¹, Yanxiang He², Yong Chen², Chun Xue³, Nan Jiang² and Chao Xu² ¹Wuhan University & City University of Hong Kong, CN; ²Wuhan University, CN; ³City University of Hong Kong, CN Abstract Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics such as extremely low leakage power, high storage density and good scalability. However, PCM's low endurance constrains its practical applications. In this paper, we propose a Wear Leveling aware dynamic stack to extend PCM's lifetime when it is adopted in embedded systems as main memory. Through a dynamic stack, the memory space is circularly allocated to stack objects, and thus an even usage of PCM memory is achieved. The experimental results show that the proposed method can significantly reduce the write variation on PCM cells and enhance the lifetime of PCM memory.
IP2-8	LIFETIME HOLES AWARE REGISTER ALLOCATION FOR CLUSTERED VLIW PROCESSORS Speakers: Xuemeng Zhang¹, Hui Wu², Haiyan Sun¹ and Jingling Xue³ ¹National University of Defense Technology, CN; ²The University of New South Wales, AU; ³UNSW, AU Abstract This paper presents an on-the-fly register allocator which dynamically detects and utilises lifetime holes for clustered VLIW processors. A lifetime hole is an interval in which a variable does not contain a valid value. A register holding a lifetime hole can be allocated to another variable whose live range fits in the lifetime hole, leading to more efficient utilisation of registers. We propose efficient techniques for dynamically utilising lifetime holes and incorporate these techniques into our on-the-fly register allocator. We have simulated our register allocator and a linear scan register allocator without considering lifetime holes by using the MediaBench II benchmark suite. Our simulation results show that our register allocator reduces the number of spills by 12.5%, 11.7%, 12.7%, for three different processor models, respectively.
IP2-9	A LOW-POWER, HIGH-PERFORMANCE APPROXIMATE MULTIPLIER WITH CONFIGURABLE PARTIAL ERROR RECOVERY Speakers: Cong Liu¹, Jie Han¹ and Fabrizio Lombardi² ¹University of Alberta, CA; ²Northeastern University, US Abstract Approximate circuits have been considered for error-tolerant applications that can tolerate some loss of accuracy with improved performance and energy efficiency. Multipliers are key arithmetic circuits in many such applications such as digital signal processing (DSP). In this paper, a novel approximate multiplier with a lower power consumption and a shorter critical path than traditional multipliers is proposed for high-performance DSP applications. This multiplier leverages a newly-designed approximate adder that limits its carry propagation to the nearest neighbors for fast partial product accumulation. Different levels of accuracy can be achieved through a configurable error recovery by using different numbers of most significant bits (MSBs) for error reduction. The approximate multiplier has a low mean error distance, i.e., most of the errors are not significant in magnitude. Compared to the Wallace multiplier, a 16-bit approximate multiplier implemented in a 28nm CMOS process shows a reduction in delay and power of 20% and up to 69%, respectively. It is shown that by utilizing an appropriate error recovery, the proposed approximate multiplier achieves similar processing accuracy as traditional exact multipliers but with significant improvements in power and performance.
IP2-10	A LINUX-GOVERNOR BASED DYNAMIC RELIABILITY MANAGER FOR ANDROID MOBILE DEVICES Speakers: Pietro Mercati¹, Andrea Bartolini², Francesco Paterna¹, Tajana Simunic Rosing¹ and Luca Benini² ¹UCSD, US; ²University of Bologna, IT Abstract Reliability is a major concern in multiprocessors. Dynamic Reliability Management (DRM) aims at trading off processor performance with lifetime. The state-of-the-art publications study only the theory supported by simulation. This paper presents the first complete software implementation, working on real hardware, of a low-overhead, Android-compatible workload-aware DRM Governor for mobile multiprocessors. We discuss the design challenges and the run-time overhead involved. We show the effectiveness of our governor in guaranteeing the predefined target lifetime and show that it achieves up to 100% lifetime improvement with respect to traditional governors, while providing comparable performance for critical applications.
IP2-11	YIELD AND TIMING CONSTRAINED SPARE TSV ASSIGNMENT FOR THREE-DIMENSIONAL INTEGRATED CIRCUITS Speakers: Yu-Guang Chen¹, Kuan-Yu Lai¹, Ming-Chao Lee², Yiyu Shi³, Wing-Kai Hon¹ and Shih-Chieh Chang¹ ¹National Tsing Hua University, TW; ²MediaTek Inc., TW; ³Missouri University of Science and Technology, US Abstract Through Silicon Via (TSV) is a critical enabling technique in three-dimensional integrated circuits (3D ICs). However, it may suffer from many reliability issues. Various fault-tolerance mechanisms have been proposed in the literature to improve yield, at the cost of significant area overhead. In this paper, we focus on the structure that uses one spare TSV for a group of original TSVs, and study the optimal assignment of spare TSVs under yield and timing constraints to minimize the total area overhead. We show that this problem can be modeled through constrained graph decomposition. An efficient heuristic is further developed to address this problem. Experimental results show that under the same yield and timing constraints, our heuristic can reduce the area overhead induced by the fault-tolerance mechanisms by up to 38%, compared with a seemingly more intuitive nearest-neighbor based heuristic.
IP2-12	COMPILER-DRIVEN DYNAMIC RELIABILITY MANAGEMENT FOR ON-CHIP SYSTEMS UNDER VARIABILITIES Speakers: Semeen Rehman, Florian Kriebel, Muhammad Shafique and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE Abstract This paper presents a novel Dynamic Reliability Management System (DyReMS) for on-chip systems that performs resilience-driven resource allocation and mapping. It accounts for both the tasks' resilience properties and heterogeneous error recovery features of different cores. DyReMS also chooses a reliable task version (out of multiple reliability-aware transformed options) depending upon the reliability level of the allocated core. In case of error detection, rollbacks are performed. Our system provides 70%-87% improved task reliability compared to a timing reliability-optimizing core assignment, i.e. minimizing the probability of deadline misses (with EDF scheduling).
IP2-13	(Best Paper Award Candidate) MINIMIZING STATE-OF-HEALTH DEGRADATION IN HYBRID ELECTRICAL ENERGY STORAGE SYSTEMS WITH ARBITRARY SOURCE AND LOAD PROFILES Speakers: Yanzhi Wang¹, Xue Lin¹, Qing Xie¹, Naehyuck Chang² and Massoud Pedram¹ ¹University of Southern California, US; ²Seoul National University, KR Abstract Hybrid electrical energy storage (HEES) systems consisting of heterogeneous electrical energy storage (EES) elements are proposed to exploit the strengths of different EES elements and hide their weaknesses. The cycle life of the EES elements is one of the most important metrics. The cycle life is directly related to the state-of-health (SoH), which is defined as the ratio of full charge capacity of an aged EES element to its designed (or nominal) capacity. The SoH degradation models of battery in the previous literature can only be applied to charging/discharging cycles with the same state-of-charge (SoC) swing. To address this shortcoming, this paper derives a novel SoH degradation model of battery for charging/discharging cycles with arbitrary patterns. Based on the proposed model, this paper presents a near-optimal charge management policy focusing on extending the cycle life of battery elements in the HEES systems while simultaneously improving the overall cycle efficiency.
IP2-14	DYNAMIC FLIP-FLOP CONVERSION TO TOLERATE PROCESS VARIATION IN LOW POWER CIRCUITS Speakers: Mehrzad Nejat, Bijan Alizadeh and Ali Afzali Kusha, School of Electrical and Computer Eng., College of Eng., University of Tehran, IR Abstract A novel time borrowing method called dynamic Flip-Flop conversion is presented in this paper. A timing violation predictor detects the violations halfway in the critical path and dynamically converts the critical Flip-Flop to a latch. This way, time borrowing benefits of latches are utilized in a Flip-Flop based design which is more adaptable with Computer-Aided-Design tools. The overhead of this method is smaller than that of similar methods due to the elimination of delay elements. According to the post-synthesis simulations and Monte-Carlo analysis of Spice simulations on some ITC'99 benchmark circuits, the power overhead of the proposed method is about 15% and 19% smaller than that of Soft-Edge-Flip-Flop and Dynamic-Clock-Stretching circuits respectively in a simple case of about 40% yield improvement. This overhead would be relatively even smaller for higher performance and yield improvements.
IP2-15	A LOW POWER AND ROBUST CARBON NANOTUBE 6T SRAM DESIGN WITH METALLIC TOLERANCE Speakers: Luo Sun¹, Jimson Mathew¹, Rishad Shafik², Dhiraj Pradhan¹ and Zhen Li¹ ¹University of Bristol, GB; ²University of Southampton, GB Abstract Carbon nanotube field-effect transistor (CNTFET) is envisioned as a promising device to overcome the limitations of traditional CMOS based MOSFETs due to its favourable physical properties. This paper presents a novel six-transistor (6T) static random access memory (SRAM) bitcell design using CNTFETs. Extensive validations and comparative analyses are carried out with the proposed SRAM design using SPICE based simulations. We show that the proposed CNTFET based SRAM has a significantly better static noise margin (SNM) and write ability margin (WAM) compared to a CNTFET-based standard 6T bitcell, equivalent to isolated read-port 8T cell based on CNTFET, while consuming less dynamic power. We further demonstrate that it exhibits higher robustness under process, voltage and temperature (PVT) variations when compared with the traditional CMOS SRAM cell designs. Furthermore, metallic CNTs removal technique is used considering metallic tolerance to make the proposed SRAM design more reliable.
IP2-16	MAKE IT REAL: EFFECTIVE FLOATING-POINT REASONING VIA EXACT ARITHMETIC Speakers: Miriam Leeser¹, Saoni Mukherjee¹, Jaideep Ramachandran¹ and Thomas Wahl² ¹Northeastern University, US; ²Northeastern University, Boston, US Abstract Floating-point arithmetic is widely used in scientific computing. While many programmers are subliminally aware that floating-point numbers only approximate the reals, few are cognizant of the dangers this entails for programming. Such dangers range from tolerable rounding errors in sequential programs, to unexpected, divergent control flow in parallel code. To address these problems, we present a decision procedure for floating-point arithmetic (FPA) that exploits the proximity to real arithmetic (RA), via a lossless reduction from FPA to RA. Our procedure does not involve any form of bit-blasting or bit-vectorization, and can thus generate much smaller back-end decision problems, albeit in a more complex logic. This tradeoff is beneficial for the exact and reliable analysis of parallel scientific software, which tends to give rise to large but benignly structured formulas. We have implemented a prototype decision engine and present encouraging results analyzing such software for numerical accuracy.
IP2-17	WIDTH MINIMIZATION IN THE SINGLE-ELECTRON TRANSISTOR ARRAY SYNTHESIS Speakers: Chian-Wei Liu¹, Chang-En Chiang¹, Ching-Yi Huang¹, Chun-Yao Wang¹, Yung-Chih Chen², Suman Datta³ and Vijaykrishnan Narayanan⁴ ¹Dept. of Computer Science, National Tsing Hua University, TW; ²Dept. of Computer Science and Engineering, Yuan Ze University, TW; ³Department of Electrical Engineering, The Pennsylvania State University, US; ⁴Department of Computer Science and Engineering, The Pennsylvania State University, US Abstract Power consumption has become one of the primary challenges to sustaining Moore's law. The Single-Electron Transistor (SET) operating at room temperature has been demonstrated as a promising device for extending Moore's law due to its ultra-low power consumption during operation. Prior work has proposed an automated mapping approach for SET architecture which focuses on minimizing the number of hexagons in an SET array. However, the area of an SET array is more related to the width. Consequently, in this work, we propose an approach for width minimization of the SET arrays. The experimental results show that the proposed approach saves 26% of width compared with the state-of-the-art for a set of MCNC and IWLS 2005 benchmarks while spending similar CPU time.
IP2-18	AREA MINIMIZATION SYNTHESIS FOR RECONFIGURABLE SINGLE-ELECTRON TRANSISTOR ARRAYS WITH FABRICATION CONSTRAINTS Speakers: Yi-Hang Chen, Jian-Yu Chen and Juinn-Dar Huang, Department of Electronics Engineering, National Chiao Tung University, TW Abstract As fabrication processes exploit even deeper submicron technology, power dissipation has become a crucial issue for most electronic circuit and system designs nowadays. In particular, leakage power is becoming a dominant source of power consumption. Recently, the reconfigurable single-electron transistor (SET) array has been proposed as an emerging circuit design style for continuing Moore's Law due to its ultra-low power consumption. Several automated synthesis approaches have been developed for the reconfigurable SET array in the past few years. Nevertheless, all of those existing methods consider fabrication constraints, which are mandatory, merely in late synthesis stages. In this paper, we propose a synthesis algorithm, featuring both variable reordering and product term reordering, for area minimization. In addition, our algorithm takes those mandatory fabrication constraints into account in early stages for better outcomes. Experimental results show that our new method can achieve an area reduction of up to 24% as compared to current state-of-the-art techniques.
IP2-19	SOFTWARE-BASED PAULI TRACKING IN FAULT-TOLERANT QUANTUM CIRCUITS Speakers: Alexandru Paler¹, Simon Devitt², Kae Nemoto² and Ilia Polian¹ ¹University of Passau, DE; ²National Institute of Informatics, JP Abstract The realisation of large-scale quantum computing is no longer simply a hardware question. The rapid development of quantum technology has resulted in dozens of control and programming problems that should be directed towards the classical computer science and engineering community. One such problem is known as Pauli tracking. Methods for implementing quantum algorithms that are compatible with crucial error correction technology utilise extensive quantum teleportation protocols. These protocols are intrinsically probabilistic and result in correction operators that occur as byproducts of teleportation. These byproduct operators do not need to be corrected in the quantum hardware itself, but are tracked through the circuit, and the output results are reinterpreted. This tracking is routinely ignored in quantum information as it is assumed that tracking algorithms will eventually be developed. In this work we help fill this gap and present an algorithm for tracking byproduct operators through a quantum computation.
IP2-20	AN EFFICIENT TEMPERATURE-GRADIENT BASED BURN-IN TECHNIQUE FOR 3D STACKED ICS Speakers: Nima Aghaee, Zebo Peng and Petru Eles, Linköping University, SE Abstract Burn-in is usually carried out with high temperature and elevated voltage. Since some of the early-life failures depend not only on high temperature but also on temperature gradients, simply raising the temperature of an IC is not sufficient to detect them. This is especially true for 3D stacked ICs, since they usually have very large temperature gradients. The efficient detection of these early-life failures requires that specific temperature gradients are enforced as a part of the burn-in process. This paper presents an efficient method to do so by applying high power stimuli to the cores of the IC under burn-in through the test access mechanism. Therefore, no external heating equipment is required. The scheduling of the heating and cooling intervals to achieve the required temperature gradients is based on thermal simulations and is guided by functions derived from a set of thermal equations. Experimental results demonstrate the efficiency of the proposed method.
IP2-21	TEST AND NON-TEST CUBES FOR DIAGNOSTIC TEST GENERATION BASED ON MERGING OF TEST CUBES Speaker: Irith Pomeranz, Purdue University, US Abstract Test generation by merging of test cubes supports test compaction and test data compression. This paper describes a new approach to the use of test cube merging for the generation of compact diagnostic test sets. For this the paper uses the new concept of non-test cubes. While a test cube for a fault f_i0 detects the fault, a non-test cube for a fault f_i1 prevents the fault from being detected. Merging a test cube for a fault f_i0 and a non-test cube for a fault f_i1 produces a diagnostic test cube that distinguishes the two faults. The paper describes a procedure for diagnostic test generation based on merging of test and non-test cubes. Experimental results demonstrate that compact diagnostic test sets are obtained.
IP2-22	NEW IMPLEMENTATIONS OF PREDICTIVE ALTERNATE ANALOG/RF TEST WITH AUGMENTED MODEL REDUNDANCY Speakers: Haithem Ayari, Florence Azais, Serge Bernard, Mariane Comte, Vincent Kerzerho and Michel Renovell, LIRMM, CNRS/Univ. Montpellier 2, FR Abstract This paper discusses new implementations of the predictive alternate test strategy that exploit model redundancy in order to improve test confidence. The key idea is to build during the training phase, not only one regression model for each specification as in the classical implementation, but several regression models. This redundancy is then used during the testing phase to identify suspect predictions and remove the corresponding devices from the alternate test flow. In this paper, we explore various options for implementing model redundancy, based on the use of different indirect measurement combinations and/or different partitions of the training set. The proposed implementations are evaluated on a real case study for which we have production test data from 10,000 devices.

UB05 Session 5

Date: Wednesday 26 March 2014
Time: 10:00 - 12:00
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB05.01	DESIGN SPACE EXPLORATION FOR A LANE-KEEPING-SUPPORT CASE STUDY Authors: Raphael Weber¹, Eike Thaden¹, Stefan Henkler², Jens Höfflinger³ and Steffen Prochnow⁴ ¹OFFIS, DE; ²OFFIS, Germany, DE; ³Robert Bosch GmbH, Germany, DE; ⁴ETAS GmbH, Germany, DE Abstract We present a design space exploration demonstration applied to an industrial lane-keeping-support case study. We minimize communication, costs, weight, and the number of processing elements while also satisfying hard real-time constraints for distributed embedded systems. The input system is modeled in SysML with TADL2 extensions and the SPES modeling framework from the SPES-XT project. The case study is derived from real data from the operational division of Bosch with promising results.
UB05.02	AIDA: ANALOG IC DESIGN AUTOMATION Authors: Nuno Horta¹, Nuno Lourenço², Ricardo Martins², Ricardo Póvoa², António Canelas² and Pedro Ventura¹ ¹Instituto de Telecomunicacoes, PT; ²Instituto de Telecomunicacoes / Instituto Superior Técnico, PT Abstract This demonstration presents AIDA, an analog integrated circuit (IC) design automation environment. AIDA includes two main modules, namely, AIDA-C and AIDA-L. AIDA-C is a circuit-level synthesis tool which uses state-of-the-art multi-objective multi-constrained optimization kernels, based on evolutionary computation techniques, where the robustness of the solutions is attained by considering a layout-aware approach and, also, extreme process variations by means of PVT corner analysis. The circuit's performance is measured using Spectre®, ELDO® or HSPICE® electrical simulators as evaluation engines. AIDA-L considers the device sizes and the best floorplan, obtained with AIDA-C, and generates the complete layout by placing and routing the devices, while fulfilling the technology design rules by using built-in design-rule check (DRC) and layout-versus-schematic (LVS) procedures. To demonstrate the AIDA design environment, several analog circuit structures, e.g., OTAs, LNAs, LC-Oscillators, etc., will be synthesized in a 130nm CMOS technology. AIDA-C is demonstrated for circuit-level sizing and optimization by generating a family of Pareto Optimal solutions based on user performance and functional specifications. AIDA-L is demonstrated by generating the layout of a user selected solution from AIDA-C, taking into account electrical currents information to mitigate electromigration and IR-drop effects, and also wiring symmetry for multiport multi-terminal signal nets of analog ICs.
UB05.03	KAOLIN: A MODEL-BASED EDA TOOL TO PROGRAM, REUSE OR RETARGET EMBEDDED SYSTEMS ON FPGAS. Authors: Yvan Eustache, Dominique Blouin, Mickaël Lanoé, Jean-Philippe Diguet and Philippe Coussy, Lab-STICC, Université de Bretagne-Sud, FR Abstract The demonstration presents the Kaolin EDA tool to improve and speed up embedded systems development on FPGAs. It provides modeling abstractions to shield the user from implementation details and prevent frequent time-consuming errors. It allows you to reuse legacy projects and IPs and retarget them to other platforms with different back-end tools. The Kaolin technology is based on models of components, platforms and FPGA development tools. It allows automating platform-independent system generation including vendor tool files and scripts, verification and high-level analysis, and template-based documentation generation. Kaolin nicely fits in the development flow as a bridge between the user and low level FPGA vendor tools. User appropriation is facilitated: it requires no new language to be learnt; it allows the import of legacy code (HDL) and the software (C/C++) to hardware migration with built-in High-Level Synthesis capabilities. Kaolin can be customized to meet domain and user-specific requirements. During the demonstration, Kaolin will be used to quickly implement a control and signal processing system deployed on an FPGA and embedded on a radio-controlled toy car.
UB05.04	SECURE CLOUD-BASED WORKFLOW-AS-A-SERVICE (WFAAS) ENVIRONMENT WITH ROLE-BASED-ACCESS-CONTROL (RBAC) FOR SOC DESIGN Authors: Sai Manoj P. D.¹, Hao Yu¹ and Joseph Lee² ¹Nanyang Technological University, SG; ²Silicon Cloud International, US Abstract The SoC design process requires multiple EDA tools, custom IPs, and technology design kits from multiple providers. The design environment needs to be secure and collaborative. These requirements can be realized by using an integrated cloud-based Workflow-as-a-Service (WFaaS) design environment. We demonstrate a cloud-based design environment for a SoC design with multiple CPU cores and analog IOs. This design environment uses an innovative Role-Based-Access-Control user security model where designers interact through a web portal dashboard to perform the design workflows.
UB05.05	MOTORBRAIN: MODEL-BASED DESIGN AND VIRTUAL INTEGRATION OF AN INTELLIGENT AND SAFE ELECTRICAL POWERTRAIN Authors: Sven Rosinger, Maher Fakih and Jörg Walter, OFFIS - Institut für Informatik, DE Abstract Hardware prototypes and hardware-in-the-loop simulations are commonly used during embedded vehicle- and motor-control unit design. This demonstrator presents a platform that is an order of magnitude cheaper than existing systems but still easy to integrate into present workflows: Within an existing model-driven design methodology, a real-time hardware simulation is performed using the Raspberry Pi single-board computer to simulate an e-motor with little development effort and in conjunction with an industrial motor control unit.
UB05.06	PHARAON: PARALLEL AND HETEROGENEOUS ARCHITECTURES FOR REAL-TIME APPLICATIONS Authors: Luciano Lavagno¹, Mihai Lazarescu¹, Hector Posadas² and Eugenio Villar² ¹Politecnico di Torino, IT; ²Universidad de Cantabria, ES Abstract In this demo, we will present the work-in-progress of the EU FP7 PHARAON project, started in September 2011. The first objective of the project is the development of new techniques and tools capable of assisting the designer in the development of parallel embedded systems, from executable specifications to target-specific implementation and debugging on a multicore platform. This tool chain offers and implements several parallelization strategies, reflecting the functional and non-functional constraints of the system, and guiding the designer through incremental parallelization and adaptation steps. The second objective of the project is to develop monitoring and control techniques in the middleware of the system capable of automatically adapting platform services to application requirements and therefore reducing power consumption transparently. The demo will cover specifically: - the software parallelization tool suite, - the parallel software modeling and code generation suite.
UB05.07	LARA: THE LARA COMPILER SUITE Authors: Joao Bispo, Pedro Pinto, Ricardo Nobre, Tiago Carvalho and Joao Cardoso, Universidade do Porto, PT Abstract LARA is an aspect-oriented programming (AOP) language which allows the description of sophisticated code instrumentation schemes, advanced mapping strategies including conditional decisions, based on hardware/software resources, and of sophisticated sequences of compiler transformations. Furthermore, LARA provides mechanisms for controlling all elements of a toolchain in a consistent and systematic way, using a unified programming interface. We present three compiler tools developed around the LARA technology: MATISSE, MANET and ReflectC. MATISSE is a compiler which 1) allows analyses and transformations on MATLAB code and 2) generates C code from the MATLAB code. MATISSE can be fully controlled through LARA aspects, which can define the type and shape of MATLAB variables, specify code insertion/removal actions, and define specialization directives and other additional information. MATISSE can output transformed MATLAB code and specialized C code. The knowledge provided by the LARA aspects allows MATISSE to generate C code tailored to specific targets (e.g., use statically declared arrays to be compliant with high-level synthesis tools such as Catapult C). MANET is a source-to-source compiler for ANSI C based on Cetus, and is controlled using LARA aspects. MANET leverages the expressiveness and modularity of LARA to query and manipulate the Cetus AST, providing an easy compilation flow with the main goal of code instrumentation and code transformations. LARA aspects allow for a simple selection of program elements in the code which can be analyzed or transformed, by either consulting their attributes or applying actions.
Thus, MANET can be used to provide information reports based on compiler analyses, to implement sophisticated code instrumentation strategies, or to perform code optimizations and transformations. ReflectC is a C compiler based on CoSy's compiler framework. CoSy's configurability and retargetability make ReflectC particularly effective for exploration of compiler transformations and optimizations on possible architecture variations, and it is being used for hardware/software co-design and design space exploration (DSE). We will present demos of the tools and the use of LARA aspects and strategies to guide our suite of compilation tools providing: 1) C code generation from MATLAB code, according to information provided by LARA aspects; 2) Instrumentation of C code to be used for collecting specific compile and runtime information (e.g., execution time, range of values for specific variables, custom profiling); 3) User-controlled compiler optimizations targeting several architectures and DSE of sequences of compiler optimizations bearing in mind performance improvements. In addition to presenting examples for each of the tools of the LARA compilation suite, we show an execution of the complete toolchain, controlled by LARA aspects.
UB05.08	MICROTESK: RECONFIGURABLE OPEN-SOURCE FRAMEWORK FOR TEST PROGRAM GENERATION Authors: Andrei Tatarnikov, Alexander Kamkin and Artem Kotsynyak, Institute for System Programming of the Russian Academy of Sciences (ISP RAS), RU Abstract Test program generation plays a major role in functional verification of microprocessors. Due to tremendous growth in complexity of modern designs and rigid constraints on time to market, it becomes an increasingly difficult task. In spite of powerful test program generation tools available in the market, development of functional tests is still known to be the bottleneck of the microprocessor design cycle. The common problem is that it takes a significant effort to reconfigure a test program generation environment for a new microprocessor design. The model-based approach applied in the state-of-the-art tools, like Genesys-Pro (IBM Research), still does not provide enough flexibility since creating a microprocessor model is difficult and requires special knowledge and skills. MicroTESK, the open-source test program generation framework being developed at ISP RAS, offers an approach to ease customization by using light-weight formal specifications to describe the target microprocessor architecture. The approach helps reduce the effort needed to create a microprocessor model and, consequently, minimize the time required to create functional tests. In addition to gaining flexibility, the use of formal specifications also allows automated extraction of knowledge about test situations that occur in a microprocessor (coverage model), thus facilitating the creation of directed tests and improving test coverage. To date, a demo prototype of MicroTESK has been implemented. It uses the Sim-nML architecture description language to specify the target microprocessor architecture and provides a convenient Ruby-based language for creating test templates that serve as an abstract description of test programs to be generated.
The current version of the framework focuses primarily on RISC microprocessors including ARM, MIPS and SPARC. Supported test generation methods include random, combinatorial, template-based and model-based generation. The flexible architecture of the framework allows adding support for new test generation methods.
UB05.09	LEVERAGING DYNAMIC RECONFIGURATION TO INCREASE FAULT-TOLERANCE IN FPGA-BASED SATELLITE SYSTEMS Authors: Sebastian Korf¹, Dario Cozzi¹, Dirk Jungewelter¹, Jens Hagemeyer¹, Mario Porrmann¹ and Jorgen Ilstad² ¹CITEC (Bielefeld University), DE; ²ESTEC (European Space Agency), DE Abstract This demonstrator shows how today's SoCs for satellite payload processing can be extended with high-speed interfaces and computing power utilizing commercial dynamically reconfigurable FPGAs. The use of these FPGAs in the space environment will lead to faults due to radiation. Therefore, special methods have been developed to increase the system reliability. We will demonstrate an environment for automatic fault detection and correction in relevant applications like image and video processing.
UB05.10	RTL+: DESIGN ENVIRONMENT: WALK BEFORE YOU RUN. Authors: Somayeh Sadeghi-Kohan, Behnaz Pourmohseni, Amir Reza Nekooei, Hanieh Hashemi, Hamed Najafi Haghi and Zainalabedin Navabi, University of Tehran, IR Abstract To enable development of high level designs with hardware correspondence, synthesizability must be satisfied in a top-down manner. Thus in this work, instead of using TLM-2.0, which is not established for synthesis, we start with a level above the RT level, "RTL+". RTL+ essentially uses TLM-1.0 channels and includes abstract communications and handshaking that are mainly hidden from the designer. We develop a package of SystemC channels with hardware correspondence (synthesizable HDL) for the communication between various cores (with simple interfaces) and standard buses.
12:00	End of session
12:30	Lunch Break in Exhibition Area Sandwich lunch

6.1 SPECIAL DAY Hot Topic: The fight against Dark Silicon

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Saal 1

Organiser:
Jörg Henkel, Karlsruhe Institute of Technology, DE

Chair:
Jörg Henkel, Karlsruhe Institute of Technology, DE

Co-Chair:
Jürgen Teich, University of Erlangen-Nuremberg, DE

Dark Silicon is predicted to dominate the chip footage of upcoming many-core systems within a decade, since Dennard Scaling fails mainly due to the voltage-scaling problem that results in higher power densities. It would render upcoming technology nodes inefficient since a majority of cores would lie fallow. Significant research efforts have started within the last couple of years to investigate and mitigate Dark Silicon effects to ensure an effective use of available chip footage. This special session gives a snapshot of current research activities on this grand challenge. In particular, the three talks present the newest trends and developments, starting with the problem of Dennard Scaling and how it mandates new design constraints, followed by the problem of power delivery and cooling, and concluding with the newest directions in efficient resource management for many-core systems.

Time	Label	Presentation Title Authors
11:00	6.1.1	A LANDSCAPE OF THE NEW DARK SILICON DESIGN REGIME Speaker: Michael Taylor, University of California, San Diego, US Abstract Due to the breakdown of Dennard scaling, the percentage of a silicon chip that can switch at full frequency is dropping exponentially with each process generation. This utilization wall forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark silicon, i.e., significantly underclocked or idle for large periods of time. As exponentially larger fractions of a chip's transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift is driving a new class of architectural techniques that "spend" area to "buy" energy efficiency. All of these techniques seek to introduce new forms of heterogeneity into the computational stack. This work examines four key approaches - the four horsemen - that have emerged as top contenders for thriving in the dark silicon age. Each class carries with its virtues deep-seated restrictions that require a careful understanding of the underlying tradeoffs and benefits. Further, we propose a set of dark silicon design principles, and examine how one of the darkest computing architectures of all, the human brain, trades off energy and area in ways that provide potential insights into future directions for computer architecture.
11:30	6.1.2	INTEGRATED MICROFLUIDIC POWER GENERATION AND COOLING FOR BRIGHT SILICON MPSOCS Speakers: Mohamed M. Sabry¹, Arvind Sridhar¹, Patrick Ruch², David Atienza¹ and Bruno Michel² ¹EPFL, CH; ²IBM Research, CH Abstract The soaring demand for computing power in our digital information age has produced, as an undesirable side effect, a surge in power consumption and heat density for Multiprocessor Systems-on-Chip (MPSoCs). Accordingly, a significant portion of the energy consumed in state-of-the-art MPSoCs is dissipated in cooling. The remaining energy is used for computation, and causes the temperature to ramp up to operating conditions that already preclude operating all the cores at maximum performance levels, in order to prevent system overheating and failures. This situation is set to worsen as shipments of high-end (i.e., even denser) many-core servers are increasing at a 25% compound annual growth rate. With more power demands, MPSoCs will face a power delivery wall due to the reliability limitations of the underlying power delivery medium. Thus, state-of-the-art worst-case power and cooling delivery solutions are reaching their limits and it will no longer be possible to power up all the available on-chip cores simultaneously (a situation known as the existence of dark silicon), drastically limiting the benefits of technology scaling. In this paper we propose a disruptive approach to overcome the prevailing worst-case power and cooling provisioning paradigm for MPSoCs. This proposed approach integrates the MPSoC with an on-chip microfluidic fuel cell network for joint cooling delivery and power supply (i.e., local power generation and delivery). By providing an alternative means of power delivery integrated with cooling, MPSoCs are expected to gain in IO connectivity.
Thanks to this disruptive technology, we can envision the removal of the current limits of power delivery and heat dissipation in server designs, subsequently avoiding dark silicon in future MPSoCs and enabling new perspectives in future energy-proportional computing architecture designs.
12:00	6.1.3	EFFECTIVE RESOURCE MANAGEMENT TOWARDS EFFICIENT COMPUTING Speaker: Per Stenström, Chalmers University of Technology, SE Abstract Improving performance of computers at historical rates, as dictated by Moore's Law, is becoming increasingly challenging, especially because we are hitting the chip power-budget wall. But challenges usually direct us to focus on opportunities we have neglected in the past. I will focus on some of these overlooked opportunities in this talk. One such opportunity is to question what meaningful performance goals are for individual applications. I will present a resource management framework in which architectural resources are assigned to applications based on their performance requirements. I will also talk about some innovations that enable us to compute more power-efficiently by using memory resources more effectively, for example by exploiting value locality.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

6.2 Embedded Tutorial: Emerging Transistor Technologies: From Devices to Architectures

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 6

Organisers:
Michael Niemier, University of Notre Dame, US
X. Sharon Hu, University of Notre Dame, US

Chair:
Michael Niemier, University of Notre Dame, US

This "vertically integrated" session is focused on emerging transistor technologies - particularly devices that operate at low voltages and that have steep slopes. It will: (1) introduce desirable (and undesirable) features of new device technologies; (2) highlight how new transistor technologies could impact von Neumann architectures; a particular emphasis will be placed on (a) heterogeneous multi-core architectures and accelerators (where heterogeneity stems from different device technologies) and (b) modeling efforts at all levels of the chip hierarchy (i.e., from the device-level to the architectural-level); (3) illustrate how new device technologies could lead to significant improvements in the performance/efficiency of non-von Neumann architectures. Notably, talks (2) and (3) will identify roles for new device technologies in hybrid analog-digital systems with an end goal of improved application-level performance/efficiency.

Time	Label	Presentation Title Authors
11:00	6.2.1	ENERGY EFFICIENT COMPUTING WITH TUNNEL FETS Speakers: Adrian Ionescu, Arnab Biswas, Nilay Dagtekin and Livio Lattanzio, Nanolab, Ecole Polytechnique Fédérale de Lausanne, CH Abstract This paper will review the state-of-the-art in energy efficient computing using tunnel FETs from device to circuit level, including digital IC and memory applications. At device level we will particularly discuss the major challenges remaining for tunnel FETs, with particular emphasis on: (i) selection of the most appropriate material systems and band-gap engineering of heterostructure Tunnel FETs to simultaneously offer the best performance trade-off: low Ioff, high Ion, high Ion/Ioff, subthermal swing over more than 4 decades of current, and operation below 0.3V, (ii) specifically optimized device design (i.e. field aligned to the tunneling path, avoidance of super-linear onset, minimized Miller effect), (iii) understanding the role of defects for BTBT and providing appropriate control, (iv) understanding and controlling parameter sensitivity and variability, (v) accurate physics-based BTBT modeling of heterojunction tunnel FETs. We will detail the Electron-Hole Bilayer Tunnel FET (EHBTFET), as a switch candidate for sub-0.1V operation exploiting tunneling through a bias-induced electron-hole bilayer based on a calibrated quantum-mechanical simulator. We will make performance projections for EHBTFET complementary logic compared to CMOS logic of same dimensions and using recent energy benchmarking. Finally, the design and use of Tunnel FETs as capacitorless DRAM cells, implemented as a double-gate (DG) fully-depleted Silicon-On-Insulator (FD-SOI) architecture, will be reported and their principle, embodiment and scalability discussed. We will present recent experimental results on Tunnel FET DRAM memory operation schemes and demonstrate their potential for ultra-low power memories.
In conclusion, this paper demonstrates that Tunnel FETs stand as the most promising steep slope switch candidates to reduce the supply voltage below 0.3 V and offer significant power dissipation savings for digital computing.
11:30	6.2.2	MODELING STEEP SLOPE DEVICES: FROM CIRCUITS TO ARCHITECTURES Speakers: Karthik Swaminathan¹, Moon Seok Kim¹, Nandhini Chandramoorthy², Behnam Sedighi³, Robert Perricone³, Jack Sampson¹ and Vijaykrishnan Narayanan⁴ ¹Pennsylvania State University, US; ²The Pennsylvania State University, US; ³University of Notre Dame, US; ⁴Penn State University, US Abstract Steep Slope devices, with Heterojunction Tunnel FETs (TFETs) in particular, have been proposed as a viable solution to overcome the subthreshold slope limitation in existing CMOS technology and achieve ultra-low voltage operation with acceptable performance. However, state-of-the-art FinFET technologies continue to demonstrate better performance than steep slope devices in application domains demanding peak single threaded performance. In this context, we examine different computing paradigms where TFET technologies can be used, not just as a 'drop in' replacement, but as an additional parameter to augment the architectural design space. This greatly widens the scope of optimizations for performance and power. We investigate the tradeoffs between device and architectures in general purpose processors when performance, power and temperature are individually constrained. We also synthesize examples of domain-specific accelerators used in computer vision using in-house TFET standard cell libraries to demonstrate the energy benefits of designing TFET-based accelerators. We demonstrate that synthesizing these accelerators using TFETs reduces energy by over 6X in comparison to an equivalent iso-voltage CMOS-based design and by over 30% in comparison to an iso-performance CMOS design.
12:00	6.2.3	STEEP SLOPE TRANSISTOR TECHNOLOGIES: IMPACTS ON CNN ARCHITECTURES Speakers: Indranil Palit¹, Behnam Sedighi¹, Xiaobo Sharon Hu¹, Joseph Nahas¹, Michael Niemier¹ and András Horváth² ¹University of Notre Dame, US; ²Pázmány Péter Catholic University, HU Abstract A Cellular Neural Network (CNN) is a highly-parallel, analog processor that can significantly outperform von Neumann architectures for certain classes of problems. In this paper, we illustrate how emerging, beyond-CMOS devices could help to further enhance the capabilities of CNNs, particularly for solving problems with non-binary outputs. We show how CNNs based on devices such as graphene transistors - with multiple steep current growth regions separated by negative differential regions (NDR) in their I-V characteristics - could be used to recognize multiple patterns simultaneously. (This would require multiple steps given a conventional, binary CNN.) We also demonstrate how circuits based on tunneling field effect transistors (TFETs) can be used to perform similar tasks. With this approach, more "exotic" I-V characteristics are not required - which should be an asset when considering issues such as cell-to-cell mismatch, etc. As a case study, we present a CNN-cell design that employs TFET-based circuitry to realize ternary outputs. We then illustrate how this hardware could be employed to efficiently solve a tactile sensing problem. The total number of computation steps, as well as the required hardware, could be reduced significantly when compared to an approach based on a conventional CNN.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

6.3 Management of Micro/Macro Renewable Energy Storage Systems

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 1

Chair:
Geoff Merrett, University of Southampton, UK

Co-Chair:
Davide Brunelli, University of Trento, IT

Modern energy storage systems affect all areas of power electronics, from micro-power energy harvesting systems to megawatt Smart Grid systems. Papers in this session address novel approaches for on-chip power electronics operating under variable Vdd, and optimisation approaches to the efficient design of smart grid energy storage.

Time	Label	Presentation Title Authors
11:00	6.3.1	(Best Paper Award Candidate) ASYNCHRONOUS DESIGN FOR NEW ON-CHIP WIDE DYNAMIC RANGE POWER ELECTRONICS Speakers: Delong Shang¹, Xuefu Zhang², Fei Xia³ and Alex Yakovlev² ¹School of EEE, Newcastle University, GB; ²School of EEE, Newcastle University, GB; ³School of EEE, GB Abstract Asynchronous circuits will play an important role in microelectronic systems in the future, especially in energy harvesting and autonomous (EHA) systems where such circuits will be able to offer robustness and deliver high efficiency in a wide range of power-energy conditions. The concept of Capacitor Bank Block (CBB) mechanisms was proposed to form the basis of electronics for powering asynchronous loads. These mechanisms will benefit EHA systems by enabling effective co-scheduling of computational tasks and energy supply. This paper demonstrates how the CBB mechanisms can themselves be controlled by asynchronous circuits, thereby forming a new type of power delivery unit (PDU) that will be able to deliver power to intelligent digital logic in future EHA systems. These PDUs are superior to traditional power converters largely because the latter can only regulate sufficiently high power and energy levels (regular and periodic) and their controllers themselves require stable power levels. This makes them unsuitable for the intermittent and sporadic conditions inherent to EHA systems. In this paper, a novel asynchronous control for the CBB is described. Experiments and analysis of the new PDUs, comprising CBBs and asynchronous control, are presented and discussed in detail.
11:30	6.3.2	REAL-TIME OPTIMIZATION OF THE BATTERY BANKS LIFETIME IN HYBRID RESIDENTIAL ELECTRICAL SYSTEMS Speakers: Maurizio Rossi, Alessandro Toppano and Davide Brunelli, University of Trento, IT Abstract We present a real-time optimization framework to manage Hybrid Residential Electrical Systems (HRES) with multiple energy sources and heterogeneous storage units. HRES represents urban buildings where photovoltaic (PV) or other renewable sources are installed along with the traditional connection to the main grid. In this paper, heterogeneous storage units are used as energy buffers for the excess energy produced by the renewable sources when buildings and the grid are not available to accept it. We consider two different battery banks as electric energy storage: lead-acid as the primary bank for its low price and low self-discharge rate, and lithium-ion as the secondary bank because of its higher energy density and higher number of cycles. The proposed optimization strategy aims at maximizing the lifetime of the battery banks and reducing the energy bill by managing the variability of the PV source in price-varying scenarios. We used a Dynamic-Programming (DP) algorithm to schedule off-line the use of the lead-acid bank, minimizing the number of cycles and the Depth-of-Discharge (DoD) under given irradiance forecasts and user load profiles. Forecasts of the user loads and of the renewable energy intake are introduced in the optimization. Moreover, a Real-Time scheme is introduced to manage the lithium bank and to minimize the need for, and the purchase of, energy from the Grid when the actual demand does not fit the forecast. Our simulation results outperform the state of the art where the efficiency of both banks is not taken into consideration, even if complex approaches based on DP are used.
12:00	6.3.3	OPTIMAL DIMENSIONING OF ACTIVE CELL BALANCING ARCHITECTURES Speakers: Swaminathan Narayanaswamy¹, Sebastian Steinhorst¹, Martin Lukasiewycz², Matthias Kauer³ and Samarjit Chakraborty⁴ ¹TUM CREATE, SG; ²TUM CREATE Singapore, SG; ³TUM CREATE Ltd, SG; ⁴TU Munich, DE Abstract This paper presents an approach to optimal dimensioning of active cell balancing architectures, which are of increasing relevance in EES for EV or stationary applications such as smart grids. Active cell balancing equalizes the state of charge of cells within a battery pack via charge transfers, increasing the effective capacity and lifetime. While optimization approaches have been introduced into the design process of several aspects of EES, active cell balancing architectures have, until now, not been systematically optimized in terms of their components. Therefore, this paper analyzes existing architectures to develop design metrics for energy dissipation, installation volume, and balancing current. Based on these design metrics, a methodology to efficiently obtain Pareto-optimal configurations for a wide range of inductors and transistors at different balancing currents is developed. Our methodology is then applied to a case study, optimizing two state-of-the-art architectures using realistic balancing algorithms. The results give evidence of the applicability of systematic optimization in the domain of cell balancing, leading to higher energy efficiencies with minimized installation space.
12:15	6.3.4	OPTIMAL DESIGN AND MANAGEMENT OF A SMART RESIDENTIAL PV AND ENERGY STORAGE SYSTEM Speakers: Di Zhu¹, Yanzhi Wang¹, Naehyuck Chang² and Massoud Pedram¹ ¹Univ. of Southern California, US; ²Seoul National University, KR Abstract Solar photovoltaic (PV) technology has been widely deployed in large power plants operated by utility companies. However, home owners are not yet convinced of the cost-saving benefits of this technology, and consequently, in spite of government subsidies, they have been reluctant to install PV systems in their homes. The main reason for this is the absence of a complete and truthful analysis which could explain to home owners under what conditions spending money on a PV system can actually save them money over a long-term, but known, time horizon. This paper thus presents a design and management mechanism for a smart residential energy system comprising PV modules, electrical energy storage banks, and conversion circuits connected to the power grid. First, we determine the savings that can be achieved by a system with given PV modules and EES bank capacities by optimally solving the daily energy flow control problem of such a system. Based on the daily optimization results, we derive the optimal system specifications for a fixed budget. Experiments are conducted for various electricity prices and different profiles of PV output power and load demand. Results show that the designed system breaks even in 6 years and in the system lifetime achieves up to 8% annual profit besides paying back the budget.
12:30	IP3-1, 939	DESIGN AND FABRICATION OF A 315 μH BONDWIRE MICRO-TRANSFORMER FOR ULTRA-LOW VOLTAGE ENERGY HARVESTING Speakers: Enrico Macrelli¹, Ningning Wang², Saibal Roy², Michael Hayes², Rudi Paolo Paganelli³, Marco Tartagni¹ and Aldo Romani¹ ¹DEI, University of Bologna, IT; ²Tyndall National Institute, UCC, IE; ³CNR-IEIIT, University of Bologna, IT Abstract This paper presents a design study of a new topology for miniaturized bondwire transformers fabricated and assembled with standard IC bonding wires and a toroidal ferrite (Fair-Rite 5975000801) as a magnetic core. The micro-transformer, realized on a PCB substrate, enables the build of magnetics on-top-of-chip, thus leading to the design of high power density components. Impedance measurements in a frequency range from 100 kHz to 5 MHz show that the secondary self-inductance is enhanced from 0.3 μH with an epoxy core to 315 μH with the ferrite core. Moreover, the micro-machined ferrite improves the coupling coefficient from 0.1 to 0.9 and increases the effective turns ratio from 0.5 to 35. Finally, a low-voltage IC DC-DC converter solution, with the transformer mounted on-top, is proposed for energy harvesting applications.
12:31	IP3-2, 85	PROVIDING REGULATION SERVICES AND MANAGING DATA CENTER PEAK POWER BUDGETS Speakers: Baris Aksanli and Tajana Rosing, University of California San Diego, US Abstract Data centers are good candidates for providing regulation services in the power markets due to their large power consumption and flexibility. In this paper, we develop a framework that explores the feasibility of data center participation in these markets. We use a battery-based design that can not only help with providing ancillary services, but can also limit peak power costs without any workload performance degradation. The results of our study using data for a 21 MW data center show that savings of up to $480,000/year can be obtained, corresponding to 1280 more servers providing services.
12:32	IP3-3, 812	THE ENERGY BENEFIT OF LEVEL-CROSSING SAMPLING INCLUDING THE ACTUATOR'S ENERGY CONSUMPTION Speakers: Burkhard Hensel and Klaus Kabitzsch, Dresden University of Technology, DE Abstract When using level-crossing (also called send-on-delta) sampling in control loops, messages can be saved compared to periodic sampling without degrading control performance. While it is clear that reducing messages also improves the energy efficiency of battery-powered sensor devices, this can be disadvantageous for the energy efficiency of the actuator device. This paper addresses the question of under which conditions level-crossing sampling is also more energy-efficient than periodic sampling for the actuator device. It is shown that there is an optimum inter-sample interval. Methods for reaching this optimum by appropriate controller and transmission settings are given. The theory is demonstrated using several known, standardized wireless network protocols.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch
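
Illustration for IP3-3: the level-crossing (send-on-delta) rule named in that abstract can be sketched in a few lines. This is only a minimal sketch of the general sampling rule; the function name `send_on_delta`, the threshold `delta` and the sine test signal are assumptions made here for illustration and are not taken from the paper. Actuator energy is not modelled.

```python
import math


def send_on_delta(samples, delta):
    """Return the (index, value) pairs a level-crossing sampler would transmit.

    A sample is sent only when it differs from the last transmitted
    value by at least `delta`; the first sample is always sent.
    """
    sent = []
    last = None
    for i, x in enumerate(samples):
        if last is None or abs(x - last) >= delta:
            sent.append((i, x))
            last = x
    return sent


# Hypothetical test signal: four periods of a sine wave, 50 samples per period.
signal = [math.sin(2 * math.pi * k / 50) for k in range(200)]
messages = send_on_delta(signal, delta=0.2)
print(len(messages), "of", len(signal), "samples transmitted")
```

A larger `delta` suppresses more messages but gives the receiver a coarser picture of the signal. The abstract reports that, once the actuator's energy is included, an optimum inter-sample interval exists.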

6.4 Power delivery and distribution

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 2

Chair:
Edith Beigné, CEA LETI Grenoble, FR

Co-Chair:
Domenik Helms, OFFIS Oldenburg, DE

This session will present innovative solutions for power delivery in complex SoCs using configurable structures working over a large range of conditions. Configurable DC-DC and LDO architectures will be considered, underlining power efficiency issues of off-chip and on-chip regulators. Fine-grain approaches are also proposed to deal with distributed in-die power generation, reducing static and dynamic power in complex SoCs.

Time	Label	Presentation Title Authors
11:00	6.4.1	DESIGN AND EVALUATION OF FINE-GRAINED POWER-GATING FOR EMBEDDED MICROPROCESSORS Speakers: Masaaki Kondo¹, Hiroaki Kobayashi², Ryuichi Sakamoto², Motoki Wada², Jun Tsukamoto², Mitaro Namiki², Weihan Wang³, Hideharu Amano³, Kensaku Matsunaga⁴, Masaru Kudo⁴, Kimiyoshi Usami⁴, Toshiya Komoda⁵ and Hiroshi Nakamura⁵ ¹The University of Electro-Communications, JP; ²Tokyo University of Agriculture and Technology, JP; ³Keio University, JP; ⁴Shibaura Institute of Technology, JP; ⁵The University of Tokyo, JP Abstract Power-performance efficiency remains a primary concern for microprocessor designers. One of the sources of power inefficiency for recent LSI chips is increasing leakage power consumption. Power-gating is a well known technique to reduce leakage power consumption by switching off the power supply to idle logic blocks. Recently, fine-grained power-gating has emerged as a technique to minimize leakage current during active processor cycles by switching logic blocks on and off at a much finer temporal/spatial granularity. Though fine-grained power-gating is useful, a comprehensive evaluation and analysis has not been conducted on real LSI chips. In this paper, we evaluate fine-grained run-time power-gating for microprocessors' functional units using a real embedded microprocessor. We also introduce an architecture and compiler co-operative power-gating scheme which mitigates negative power reduction caused by the energy overhead associated with fine-grained power-gating. The experimental results with a fabricated core show that a hardware-based scheme reduces the power consumption of functional units by 44% and that the hardware/compiler co-operative scheme further improves power efficiency by 5.9% when the core temperature is 25 °C.
11:30	6.4.2	SUPERRANGE: WIDE OPERATIONAL RANGE POWER DELIVERY DESIGN FOR BOTH STV AND NTV COMPUTING Speakers: Xin He¹, Guihai Yan², Yinhe Han³ and Xiaowei Li³ ¹Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; ²Institute of Computing Technology, Chinese Academy of Sciences, CN; ³Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract The load power range of modern processors is greatly enlarged because many advanced power management techniques like dynamic voltage frequency scaling, Turbo boosting, and Near Threshold Voltage technologies are incorporated. However, the power saving may be offset by power loss in power delivery; moreover, as the efficiency of power delivery varies greatly with different load conditions, conventional power delivery designs cannot maintain high efficiency over the entire voltage range. We propose SuperRange, a wide operational range power delivery scheme. SuperRange complements the power delivery capability of the on-chip voltage regulator and the off-chip voltage regulator. Experimental results show that SuperRange achieves an average power conversion efficiency of 70% over a wide operational range, outperforming conventional power delivery schemes. It also exhibits superior resilience to power-constrained systems.
12:00	6.4.3	MODELING AND ANALYSIS OF DIGITAL LDOS WITH ADAPTIVE CONTROL FOR HIGH EFFICIENCY UNDER WIDE DYNAMIC RANGE DIGITAL LOADS Speakers: Samantak Gangopadhyay, Youngtak Lee, Saad Bin Nasir and Arijit Raychowdhury, Georgia Institute of Technology, US Abstract Discrete time digital linear regulators, including low dropout regulators (LDOs), have become competitive in multi-Vcc digital systems for fine-grained spatio-temporal voltage regulation and distribution. However, the wide dynamic current range of digital load circuits poses serious problems in maintaining stability and high efficiency at all corners. In this paper we present a control model for discrete time LDOs and demonstrate how online adaptive control can be employed for consistent performance and high efficiency across the load current range.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

6.5 Beyond EDA: Extending the Application Domain of Formal Methods

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 3

Chair:
Christoph Scholl, University of Freiburg, DE

Co-Chair:
Gianpiero Cabodi, Politecnico di Torino, IT

Formal methods are traditionally used to verify the correctness of hardware, software, or protocols. This session introduces a set of applications which extend the use of formal methods into new domains. The first three papers demonstrate novel ways to bridge formal verification results into the synthesis domain. The fourth leverages formal reasoning to certify the correctness of photonic systems.

Time	Label	Presentation Title Authors
11:00	6.5.1	(Best Paper Award Candidate) USING MAXBMC FOR PARETO-OPTIMAL CIRCUIT INITIALIZATION Speakers: Sven Reimer, Matthias Sauer, Tobias Schubert and Bernd Becker, University of Freiburg, DE Abstract In this paper we present MaxBMC, a novel formalism for solving optimization problems in sequential systems. Our approach combines techniques from symbolic SAT-based Bounded Model Checking (BMC) and incremental MaxSAT, leading to the first MaxBMC solver. In traditional BMC safety and liveness properties are validated. We extend this formalism: in case the required property is satisfied, an optimization problem is defined to maximize the quality of the reached witnesses. Further, we compare its qualities in different depths of the system, leading to Pareto-optimal solutions. We state a sound and complete algorithm that not only tackles the optimization problem but moreover verifies whether a global optimum has been identified by using a complete BMC solver as back-end. As a first reference application we present the problem of circuit initialization. Additionally, we give pointers to other tasks which can be covered by our formalism quite naturally and further demonstrate the efficiency and effectiveness of our approach.
11:30	6.5.2	PARTIAL WITNESSES FROM PREPROCESSED QUANTIFIED BOOLEAN FORMULAS Speakers: Martina Seidl¹ and Robert Könighofer² ¹JKU Linz, AT; ²TU Graz, AT Abstract For effectively solving quantified Boolean formulas (QBF) in prenex conjunctive normal form, preprocessors have been shown to be indispensable. A preprocessor rewrites a formula in such a manner that information valuable for the solver is made explicit and irrelevant information is removed. For this purpose, rewriting techniques, which would be too costly when repeatedly applied during the solving process, are used. Unfortunately, many of these techniques are not model preserving and therefore incompatible with recent certification frameworks. In consequence, the application of a preprocessor prohibits the extraction of witnesses encoding a solution or a counterexample. In this paper, we show how to obtain an assignment for the variables of the outermost quantifier block as a partial witness, which is sufficient for many practical applications. We modified the publicly available preprocessor bloqqer for extracting partial witnesses. We empirically compare the effectiveness of the modified and the original version of bloqqer. Further, we apply the new version of bloqqer for solving hardware synthesis problems for which it turns out to be extremely beneficial.
12:00	6.5.3	EQUIVALENCE CHECKING FOR FUNCTION PIPELINING IN BEHAVIORAL SYNTHESIS Speakers: Kecheng Hao¹, Sandip Ray² and Fei Xie³ ¹Xilinx Inc., US; ²Strategic CAD Labs, Intel Corporation, US; ³Portland State University, US Abstract Function pipelining is a key transformation in high-level synthesis. However, synthesizing the complex pipeline logic is an error-prone process. Sequential equivalence checking (SEC) support is highly desired to provide confidence in the correctness of synthesized pipelines. However, SEC for function pipelining is challenging due to the significant difference between the behavioral specification and the synthesized RTL. Furthermore, function pipelines include hardware logic for dynamically inserting "bubbles" (pipeline stalls), which bring additional difficulties in equivalence checking. We develop an SEC framework for behaviorally synthesized function pipelines by (1) building a reference pipeline model with a certified function pipelining transformation, which faithfully captures bubble insertion; and (2) checking the equivalence between the reference model and synthesized RTL implementation. We demonstrate the scalability of our approach on industry-strength designs synthesized by a commercial tool.
12:15	6.5.4	TOWARDS THE FORMAL ANALYSIS OF MICRORESONATORS BASED PHOTONIC SYSTEMS Speakers: Umair Siddique¹ and Sofiene Tahar² ¹Concordia University, Montreal, Canada, CA; ²Department of Electrical and Computer Engineering, Concordia University, CA Abstract Recent developments in the fabrication technology attracted the attention of optical engineers and physicists in the area of VLSI photonics. Due to the physical nature of light-wave systems and their usage in safety critical domains such as human surgeries and high budget space missions, it is indispensable to build high assurance systems. Traditionally, the analysis of such systems has been carried out by paper-and-pencil based proofs and numerical computations. However, these techniques cannot provide perfectly accurate results due to the risk of human error and inherent approximations of numerical algorithms. In order to overcome these limitations, we propose to use higher-order logic theorem proving to improve the analysis in the domain of integrated optics or VLSI photonics. In particular, this paper provides a higher-order logic formalization of optical microresonators which are the most fundamental building blocks of many photonic devices. In order to illustrate the practical utilization of our work, we present the formal analysis of 2-D microresonator lattice optical filters.
12:30	IP3-4, 108	SKETCHILOG: SKETCHING COMBINATIONAL CIRCUITS Speakers: Andrew Becker, David Novo and Paolo Ienne, École Polytechnique Fédérale de Lausanne, CH Abstract Despite the progress of higher-level languages and tools, Register Transfer Level (RTL) is still by far the dominant input format for high performance digital designs. Experienced designers can directly express their microarchitectural intuitions in RTL. Yet, RTL is terribly verbose, burdened with trivial details, and thus error prone. In this paper, we augment a modern RTL language (Chisel) with new semantic elements to express an imprecise specification: a sketch. We show how, in combination with a naive, unoptimized, but functionally correct reference, a designer can utilize the language and supporting infrastructure to focus on the key design intuition and omit some of the necessary details. The resulting design is exactly or almost exactly as good as the one the designer could have achieved by spending the time to manually complete the sketch. We show that, even limiting ourselves to combinational circuits, realistic instances of meaningful design problems are solved quickly, saving considerable design and debugging effort.
12:31	IP3-5, 557	TOWARDS VERIFYING DETERMINISM OF SYSTEMC DESIGNS Speakers: Hoang M. Le and Rolf Drechsler, University of Bremen, DE Abstract Ensuring the correctness of high-level SystemC designs is an important and challenging problem in today's Electronic System Level (ESL) methodology. Prevalently, a design is checked against a functional specification given by, e.g., a test case with reference output or a user-defined property. Another research direction takes the view of a SystemC design as a piece of concurrent software. The design is then checked for common concurrency problems and thus, a functional specification is not required. Along this line, several methods for deadlock detection and race analysis have been developed. In this work, we propose to consider a new concurrency verification problem, namely input-output determinism, for SystemC designs. That means for each possible input, the design must produce the same output under any valid process schedule. We argue that determinism verification is stronger than both deadlock detection and race analysis. Besides being an attractive correctness criterion itself, proven determinism helps to accelerate both simulative and formal verification. We also present a preliminary study to show the feasibility of determinism verification for SystemC designs.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch
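
Illustration for IP3-5: the input-output determinism criterion stated in that abstract (same output for each input under any valid process schedule) can be made concrete with a brute-force check on a toy model. This is a sketch under strong assumptions, not the paper's method: processes are modelled as plain Python functions that each run once and atomically, every ordering counts as a valid schedule, and the names `is_deterministic`, `make_state`, `inc`, `dbl`, `add2` and `add3` are hypothetical.

```python
from itertools import permutations


def is_deterministic(processes, make_state, inputs):
    """Check that every process ordering yields the same output for each input."""
    for x in inputs:
        outputs = set()
        for order in permutations(processes):
            state = make_state(x)      # fresh shared state per schedule
            for proc in order:
                proc(state)            # each process runs once, atomically
            outputs.add(state["out"])
        if len(outputs) > 1:           # two schedules disagreed on the output
            return False
    return True


def make_state(x):
    return {"out": x}


# Two processes whose effects do not commute: the schedule changes the result.
def inc(state):
    state["out"] += 1


def dbl(state):
    state["out"] *= 2


# Two processes whose effects commute: any schedule gives the same result.
def add2(state):
    state["out"] += 2


def add3(state):
    state["out"] += 3


print(is_deterministic([inc, dbl], make_state, range(5)))    # False
print(is_deterministic([add2, add3], make_state, range(5)))  # True
```

Exhaustive enumeration grows factorially with the number of processes, which is one reason a dedicated verification approach is of interest.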

6.6 Model-Based Design and Hardware/Software Interfaces

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 4

Chair:
Wang Yi, Uppsala University, SE

Co-Chair:
Wolfgang Nebel, OFFIS, DE

This session covers multiple abstraction levels in embedded system design. The first paper proposes a scalable approach to refinement checking of component-based systems using contracts and local refinement assertions. The second paper revisits the paradigm of using a set of communicating asynchronous components for the implementation of synchronous models. The third paper presents hardware scheduling support for OpenMP and the fourth paper proposes an object-aware translation layer for flash memories.

Time	Label	Presentation Title Authors
11:00	6.6.1	LIBRARY-BASED SCALABLE REFINEMENT CHECKING FOR CONTRACT-BASED DESIGN Speakers: Antonio Iannopollo, Pierluigi Nuzzo, Stavros Tripakis and Alberto Sangiovanni-Vincentelli, University of California, Berkeley, US Abstract Given a global specification contract and a system described by a composition of contracts, system verification reduces to checking that the composite contract refines the specification contract, i.e. that any implementation of the composite contract implements the specification contract and is able to operate in any environment admitted by it. Contracts are captured using high-level declarative languages, for example, linear temporal logic (LTL). In this case, refinement checking reduces to an LTL satisfiability checking problem, which can be very expensive to solve for large composite contracts. This paper proposes a scalable refinement checking approach that relies on a library of contracts and local refinement assertions. We propose an algorithm that, given such a library, breaks down the refinement checking problem into multiple successive refinement checks, each of smaller scale. We illustrate the benefits of the approach on an industrial case study of an aircraft electric power system, with up to two orders of magnitude improvement in terms of execution time.
11:30	6.6.2	ISOCHRONOUS NETWORKS BY CONSTRUCTION Speakers: Yu Bai and Klaus Schneider, University of Kaiserslautern, DE Abstract While synchronous system models have many advantages over asynchronous models concerning verification and validation, many implementation platforms do not provide efficient means for synchronization. For this reason, we consider a design flow that starts with a synchronous system model that is then transformed into an asynchronous one for synthesis. In essence, it partitions the synchronous system into a set of asynchronous components that communicate with each other via FIFO buffers. Of course, the synthesized system still has to behave as the original synchronous model, i.e., for each variable exactly the same flow of data values must be observed and only the membership to synchronous reaction steps is no longer explicitly given. In this paper, we prove that this correctness guarantee is given provided that (1) each component knows which of the input values have to be used for the next reaction (endochrony), (2) each component is able to perform the reaction (constructiveness), and (3) components agree on the clocks of their shared variables (isochrony/clock-consistency).
11:45	6.6.3	TIGHTLY-COUPLED HARDWARE SUPPORT TO DYNAMIC PARALLELISM ACCELERATION IN EMBEDDED SHARED MEMORY CLUSTERS Speakers: Paolo Burgio¹, Giuseppe Tagliavini², Francesco Conti², Andrea Marongiu² and Luca Benini³ ¹University of Bologna, Université de Bretagne-Sud, IT; ²University of Bologna, IT; ³Università di Bologna, IT Abstract Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a tightly-coupled block to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enable fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
12:00	6.6.4	P-OFTL: AN OBJECT-BASED SEMANTIC-AWARE PARALLEL FLASH TRANSLATION LAYER Speakers: Wei Wang, Youyou Lu and Jiwu Shu, Tsinghua University, CN Abstract With increased density and decreased price, flash memory has been widely used in storage systems for its low latency and low power features. However, traditional storage systems are designed and excessively optimized for magnetic disks, and the potential of flash memory is not brought into full play in the form of Solid State Drives (SSDs). In this paper, we propose p-OFTL, an object-based semantic-aware parallel flash translation layer (FTL). p-OFTL removes the mapping table in the FTL and directly manages the flash memory in file objects, which enables optimization of data layout in the flash using object semantics. While removing the mapping table improves system performance, a challenge remains to exploit the internal parallelism while maintaining the continuity of logical addresses in each object, which is essential for efficient garbage collection. To address this challenge, p-OFTL statically remaps the addresses by shifting the bits in the addresses, which spreads writes to different internal parallel units without another mapping table. Also, p-OFTL employs a semantic-aware data grouping algorithm to group data pages by trading off the hot-cold clustering for the continuity of logical addresses, so as to reduce the page movement in garbage collection. Experiments show that p-OFTL improves system performance by 4.0%~10.3% and reduces garbage collection overhead by 15.1%~32.5% in semantic-aware data grouping compared to those in semantic-unaware data grouping algorithms.
12:30	IP3-6, 148	USING GUIDED LOCAL SEARCH FOR ADAPTIVE RESOURCE RESERVATION IN LARGE-SCALE EMBEDDED SYSTEMS Speaker: Timon ter Braak, University of Twente, NL Abstract To maintain a predictable execution environment, an embedded system must ensure that applications are, in advance, provided with sufficient resources to process tasks, exchange information and to control peripherals. The problem of assigning tasks to processing elements with limited resources, and routing communication channels through a capacitated interconnect is combined into an integer linear programming formulation. We describe a guided local search algorithm to solve this problem at run-time. This algorithm allows for a hybrid strategy where configurations computed at design-time may be used as references to lower the computational overhead at run-time. Computational experiments on a dataset with 100 tasks and 20 processing elements show the effectiveness of this algorithm compared to state-of-the-art solvers CPLEX and Gurobi. The guided local search algorithm finds an initial solution within 100 milliseconds, is competitive for small platforms, scales better with the size of the platform, and has lower memory usage (2-19%).
12:32	IP3-7, 797	(Best Paper Award Candidate) ACCELERATING GRAPH COMPUTATION WITH RACETRACK MEMORY AND POINTER-ASSISTED GRAPH REPRESENTATION Speakers: Eunhyek Park¹, Helen Li², Sungjoo Yoo¹ and Sunggu Lee¹ ¹POSTECH, KR; ²Univ. of Pittsburgh, US Abstract The poor performance of NAND Flash memory, such as long access latency and large granularity access, is the major bottleneck of graph processing. This paper proposes an intelligent storage for graph processing which is based on fast and low cost racetrack memory and a pointer-assisted graph representation. Our experiments show that the proposed intelligent storage based on racetrack memory reduces the total processing time of three representative graph computations by 40.2%~86.9% compared to the graph processing system GraphChi, which exploits sequential accesses on a normal NAND Flash memory-based SSD. Faster execution also reduces energy consumption by 39.6%~90.0%. The in-storage processing capability gives an additional 10.5%~16.4% performance improvement and a 12.0%~14.4% reduction in energy consumption.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch
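
Illustrative note on IP3-6 above: the guided local search idea can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' algorithm: it covers only the assignment of tasks to capacity-limited processing elements (no channel routing), and the names cost, load, capacity, lam and rounds are hypothetical. Penalties are added to the (task, element) pairs that keep the search stuck in a local minimum, following the general guided local search scheme.

```python
def guided_local_search(cost, load, capacity, lam=1.0, rounds=50):
    """Assign each task to one processing element (PE) at low total cost.

    cost[t][p]  : hypothetical cost of running task t on PE p
    load[t]     : resource demand of task t
    capacity[p] : resource budget of PE p
    """
    n_tasks, n_pes = len(cost), len(capacity)

    # Initial feasible solution by first-fit (an assumption of this sketch).
    assign, used = [], [0] * n_pes
    for t in range(n_tasks):
        for p in range(n_pes):
            if used[p] + load[t] <= capacity[p]:
                assign.append(p)
                used[p] += load[t]
                break
        else:
            raise ValueError("first-fit found no feasible assignment")

    penalty = [[0] * n_pes for _ in range(n_tasks)]

    def true_cost(a):
        return sum(cost[t][a[t]] for t in range(n_tasks))

    best, best_cost = assign[:], true_cost(assign)

    for _ in range(rounds):
        # Local search on the augmented cost: cost + lam * penalty.
        improved = True
        while improved:
            improved = False
            for t in range(n_tasks):
                cur = assign[t]
                for p in range(n_pes):
                    if p == cur or used[p] + load[t] > capacity[p]:
                        continue
                    delta = (cost[t][p] + lam * penalty[t][p]) \
                        - (cost[t][cur] + lam * penalty[t][cur])
                    if delta < 0:
                        used[cur] -= load[t]
                        used[p] += load[t]
                        assign[t] = cur = p
                        improved = True

        # Remember the best solution with respect to the true cost.
        current = true_cost(assign)
        if current < best_cost:
            best, best_cost = assign[:], current

        # Penalise the features with maximal utility cost / (1 + penalty).
        utils = [cost[t][assign[t]] / (1 + penalty[t][assign[t]])
                 for t in range(n_tasks)]
        top = max(utils)
        for t in range(n_tasks):
            if utils[t] == top:
                penalty[t][assign[t]] += 1

    return best, best_cost


if __name__ == "__main__":
    cost = [[1, 5], [4, 2], [3, 3]]
    print(guided_local_search(cost, load=[1, 1, 1], capacity=[2, 2]))
```

A design-time configuration, as mentioned in the abstract, could replace the first-fit start in such a sketch so that the run-time search begins near a known good solution.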

6.7 Hardening Approaches at Different Design Levels

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 5

Chair:
Lorena Anghel, TIMA, FR

Co-Chair:
Cecilia Metra, University of Bologna, IT

New solutions for the design of hardened hardware components, from circuit to processor level.

Time	Label	Presentation Title Authors
11:00	6.7.1	NOSTRADAMUS: LOW-COST HARDWARE-ONLY ERROR DETECTION FOR PROCESSOR CORES Speakers: Ralph Nathan and Daniel Sorin, Duke University, US Abstract We propose a new, low-cost, hardware-only scheme to detect errors in superscalar, out-of-order processor cores. For each instruction decoded, Nostradamus compares what the instruction is expected to do against what the instruction actually does. We implement Nostradamus in RTL on top of a baseline superscalar, out-of-order core, and we experimentally evaluate its ability to detect injected errors. We also evaluate Nostradamus's area and power overheads.
11:30	6.7.2	WORD-LINE POWER SUPPLY SELECTOR FOR STABILITY IMPROVEMENT OF EMBEDDED SRAMS IN HIGH RELIABILITY APPLICATIONS Speakers: Bartomeu Alorda, Cristian Carmona and Sebastia Bota, Balearic Islands University, ES Abstract Embedded SRAM yield dominates the overall ASIC yield; therefore, methodologies centered on improving SRAM cell stability will become a mandatory part of the design. Word-line voltage modulation has been shown to improve cell stability during access operations. The high variability of physical and performance parameters introduces the need to adopt adaptable solutions to adequately improve SRAM cell stability. In this work, we present a word-line voltage selector circuit designed to modulate the word-line power-supply voltage at each individual embedded SRAM block. The final area overhead is minimal, and several strategies can be implemented with the embedded SRAM, allowing the word-line voltage value to be adjusted during the life of the ASIC, taking into account different operation, aging and degradation effects.
12:00	6.7.3	A HIGH PERFORMANCE SEU-TOLERANT LATCH FOR NANOSCALE CMOS TECHNOLOGY Speaker: Zhengfeng Huang, Hefei University of Technology, CN Abstract This paper presents a high performance latch to tolerate radiation-induced single event upsets in 45 nm CMOS technology. The latch improves robustness by masking soft errors using a Muller C-element and dual modular redundancy hardening. The power dissipation, propagation delay and reliability of the presented SEU-tolerant latch are analyzed by SPICE simulations. The results show that the presented latch provides higher robustness and a lower power-delay product than classical implementations and alternative hardened solutions.
12:15	6.7.4	A LOW-COST RADIATION HARDENED FLIP-FLOP Speakers: Yang Lin, Mark Zwolinski and Basel Halak, University of Southampton, GB Abstract The aggressive scaling of semiconductor devices has caused a significant increase in the soft error rate caused by radiation hits. This has led to an increasing need for fault-tolerant techniques to maintain system reliability. Conventional radiation hardening techniques, typically used in safety-critical applications, are prohibitively expensive for non-safety-critical electronics. This work proposes a novel flip-flop architecture named SETTOFF which significantly improves circuit resilience to radiation hits over previous techniques. In addition, compared to other techniques such as a TMR latch, SETTOFF reduces the area and performance overhead by up to 50% and 80%, respectively; the power consumption is also reduced by up to 85%. In addition, a novel reliability metric called radiation-induced failure rate is developed which can be a valuable tool to predict the impact of radiation hits and quantitatively compare the reliability of various radiation hardened techniques. Our analysis shows that the proposed technique can achieve zero SEU failure rate, and significantly reduce the SET failure rate.
12:30	IP3-8, 98	PSP-CACHE: A LOW-COST FAULT-TOLERANT CACHE MEMORY ARCHITECTURE Speakers: Hamed Farbeh and Seyed Ghassem Miremadi, Sharif University of Technology, IR Abstract Cache memories constitute a large fraction of processor chip area and are highly vulnerable to soft errors caused by energetic particles. To protect these memories, most of the modern processors employ Error Detection Codes (EDCs) or Error Correction Codes (ECCs). EDCs/ECCs impose significant overheads in terms of area and energy; these overheads increase as a function of interleaving EDCs/ECCs to detect/correct multiple errors. This paper proposes a new cache architecture to minimize the area and energy overheads of EDCs/ECCs in set-associative L1-caches. Simulation results for a 4-way set-associative cache show that the proposed architecture reduces both the area and static power overheads of parity code by about 75% and the dynamic energy overhead by about 73% in comparison to conventional cache architecture. These reduction figures are about 68% and about 66%, respectively, for SEC-DED code. The above reductions are achieved without affecting the error coverage.
12:31	IP3-9, 31	A HYBRID NON-VOLATILE SRAM CELL WITH CONCURRENT SEU DETECTION AND CORRECTION Speakers: Pilin Junsangsri¹, Fabrizio Lombardi¹ and Jie Han² ¹Northeastern University, US; ²University of Alberta, CA Abstract This paper presents a hybrid non-volatile (NV) SRAM cell with a new scheme for SEU tolerance. The proposed NVSRAM cell consists of a 6T SRAM core and a Resistive RAM (RRAM), made of a 1T and a Programmable Metallization Cell (PMC). The proposed cell has concurrent error detection (CED) and correction capabilities; CED is accomplished using a dual-rail checker, while correction is accomplished by utilizing the restore operation; data from the non-volatile memory element is copied back to the SRAM core. The dual-rail checker utilizes two XOR gates each made of 2 inverters and 2 ambipolar transistors, hence, it has a hybrid nature. Extensive simulation results are provided. The simulation results show that the proposed scheme is very efficient in terms of numerous figures of merit such as delay and circuit complexity and thus applicable to integrated circuits such as FPGAs requiring secure on-chip non-volatile storage (i.e. LUTs) for multi-context configurability.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch
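
Illustrative note on 6.7.3 above: the masking effect of a Muller C-element combined with dual modular redundancy can be shown with a small behavioural model. This is a sketch under stated assumptions, not the circuit from the paper: the class DmrLatch and its methods are hypothetical names, and timing, power and transistor-level behaviour are ignored. The C-element passes its inputs to the output only when both agree, and otherwise holds the previous value, so an upset in a single redundant node does not reach the output.

```python
def c_element(a, b, prev):
    """Muller C-element: follow the inputs when they agree, else hold prev."""
    return a if a == b else prev


class DmrLatch:
    """Behavioural model: two redundant storage nodes feeding a C-element."""

    def __init__(self, value=0):
        self.n1 = self.n2 = value  # redundant copies of the stored bit
        self.out = value

    def write(self, value):
        """Store the same bit in both redundant nodes."""
        self.n1 = self.n2 = value
        self.out = c_element(self.n1, self.n2, self.out)

    def upset(self, node):
        """Model a single event upset by flipping one storage node."""
        if node == 1:
            self.n1 ^= 1
        else:
            self.n2 ^= 1
        self.out = c_element(self.n1, self.n2, self.out)


if __name__ == "__main__":
    latch = DmrLatch()
    latch.write(1)
    latch.upset(1)    # node 1 flips, the nodes now disagree
    print(latch.out)  # the C-element holds the value written before the upset
```

In this model only a simultaneous upset of both nodes would change the output, which is the usual limitation of dual modular redundancy.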

6.8 First Time Right in Analog Design Enabling New Business Cases

Date: Wednesday 26 March 2014
Time: 11:00 - 12:30
Location / Room: Exhibition Theatre

Time	Label	Presentation Title Authors
11:00	6.8.1	ACHIEVING FIRST-TIME-RIGHT SILICON IN ANALOG DESIGNS - A FOUNDRY PERSPECTIVE Speaker: Jörg Doblaski, X-FAB, DE Abstract Today's demanding analog and mixed-signal applications often do not allow for a "second shot": due to both schedule and budget requirements, costly and time-consuming re-spins of all components need to be avoided to be successful. "First-Time-Right" is the goal for these designs. The presentation will outline the challenges involved in achieving first-time-right analog designs. It will highlight what impact the choice of process architecture makes, and will discuss the pros and cons of different process architectures, such as BCD and SOI. Fabless companies rely on their foundry to provide not only the right processes, but also excellent modeling of process and devices, and high-quality, feature-rich process design kits. The influence of the design kit quality will be discussed in the second part of the presentation, as well as the choice of the right EDA tools and design flows. Finally, it will be discussed how the relationship between foundry, fabless company and EDA provider needs to be developed in order to better support First-Time-Right in analog and mixed-signal designs.
11:30	6.8.2	EXPLORING THE DESIGN-SPACE WITH "FAATS" TO ACHIEVE FIRST-TIME-RIGHT SILICON IN ANALOG DESIGNS Speaker: Markus Meissner, University of Frankfurt, DE Abstract The demand for lower supply voltages, faster processing speeds and smaller technology nodes, together with the accompanying higher impact of variation under constantly shortened product cycles, significantly increases the necessity for automation in the design of analog modules. This presentation demonstrates the recent progress in research on the "Fully Automated Analog Topology Synthesis Framework" (FAATS) by introducing its unique approach to taking automated analog circuit design to the next step. How an extensive design-space exploration can support "First-Time-Right" requirements is presented through different (design) case studies and an exclusive peek into an ongoing ASIC development strongly driven by FAATS.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

UB06 Session 6

Date: Wednesday 26 March 2014
Time: 12:00 - 14:00
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB06.01	SOC VERIFICATION: AUTOMATED FUNCTIONAL VERIFICATION OF SYSTEMS-ON-CHIP Authors: Zdenek Prikryl, Marcela Simkova and Karel Masarik, Faculty of Information Technology, Brno University of Technology, CZ Abstract An increase in the complexity of systems-on-chip (SoC) induces an increase in the complexity of their verification as well. The reason is that we must verify not only the functions of separate logic blocks, but also their interconnections, timing and functional collaboration. Therefore, there is still a great demand for verification tools which are time-effective, fast and as automated as possible. Our solution targets exactly these issues. You are welcome to see the live demonstration at our booth! More information ...
UB06.02	HIPACC: AUTOMATIC GPU CODE GENERATION FOR ANDROID Authors: Oliver Reiche¹, Richard Membarth², Frank Hannig¹ and Jürgen Teich¹ ¹University of Erlangen-Nuremberg, DE; ²Saarland University, DE Abstract We present the Heterogeneous Image Processing Acceleration (HIPAcc) framework. It allows programmers to develop image preprocessing applications while providing high productivity, flexibility, and portability as well as competitive performance. The same algorithm description serves as the basis for targeting different GPU accelerators and low-level languages. Imaging algorithms can thus be expressed in a compact and productive way by using a domain-specific language (DSL) that is embedded into C++ code. Using the HIPAcc source-to-source compiler, DSL code is compiled to CUDA, OpenCL, C/C++, or even Renderscript code, which targets heterogeneous architectures on recent MPSoCs running Android. Programming those MPSoCs can be challenging, in particular when targeting different architectures (CPU/GPU/DSP). HIPAcc lifts this burden from programmers by automatically applying source code transformations based on domain knowledge and a built-in architecture model. This demonstration shows the seamless integration of HIPAcc into the Android Developer Tools and provides a live comparison of generated code to functionally identical handwritten naive implementations of image filters on recent MPSoCs running Android. More information ...
UB06.03	CUCUMBER-VERILOG: BEHAVIOR DRIVEN DEVELOPMENT FOR CIRCUIT DESIGN AND VERIFICATION Authors: Melanie Diepenbeck, Mathias Soeken, Ulrich Kühne and Rolf Drechsler, University of Bremen, DE Abstract When designing hardware, one usually applies a top-down approach in which, starting from a natural language specification, a design is implemented and afterwards tested and verified for correctness. In contrast, software development is pushed towards agile techniques such as Test Driven Development (TDD), where tests play a central role in driving the implementation. Behavior Driven Development (BDD) extends TDD by using natural language style scenarios to describe the tests. Essentially, in both techniques testing and implementation are interleaved: first, test cases are written, and second, the implementation is extended to satisfy them. Since nowadays 70% of the effort to design hardware systems is spent on verification, test and verification should receive more attention and be applied as soon as possible. We present a BDD tool tailored for the Verilog hardware description language which enables a new design flow for hardware design, test, and verification. BDD acceptance tests are readily given by means of the natural language specification. Assigning test code to their sentences yields a testbench which serves as a starting point for the implementation. At the same time, the natural language scenarios form a test documentation that is easily accessible also to non-experts. Furthermore, our tool allows for the generalization of test cases to properties suitable for formal verification. As properties are typically more difficult to formalize than test cases, our approach facilitates access to formal verification. In our demonstration, we will show how to implement hardware designs using our BDD tool and how properties are generalized from test cases, which can then be verified by a model checker automatically. More information ...
UB06.04	COMPILER FOR MAPPING STREAM PROCESSING APPLICATIONS ONTO REAL-TIME HETEROGENEOUS MULTIPROCESSOR SYSTEMS Authors: Stefan Geuns, Berend Dekens, Philip Wilmanns, Joost Hausmans, Guus Kuiper and Marco Bekooij, University of Twente, NL Abstract Heterogeneous multiprocessor systems are employed for power-efficiency reasons in wearable software defined radios. These systems are hardware cost-effective and deliver superior performance compared to their homogeneous counterparts. However, these systems are notoriously hard to program without tool support, which makes it desirable to simplify programming with the help of an optimizing multiprocessor compiler for stream processing applications. This demonstration shows our multiprocessor compiler for mapping real-time stream processing applications onto our real-time heterogeneous multi-core system. The applications are described as sequential programs and are compiled into parallel task graphs. Buffer capacities are computed using dataflow analysis techniques given the real-time constraints of the application. Our multi-core system contains 16 MicroBlaze processor cores as well as two hardware accelerators and is prototyped on a Xilinx Virtex-6 FPGA. A connection-less communication ring is used for inter-processor communication. Our system is equipped with an analog RF front-end, which enables us to demonstrate PAL-video reception and decoding. More information ...
UB06.05	S4ECOB APU: ENERGY-EFFICIENT HIGH-PERFORMANCE ACOUSTIC PROCESSING UNIT Authors: Wolfram Kattanek¹, Sebastian Uziel¹, Thomas Elste¹, Stephan Gerlach², Danilo Hollosi² and Stefan Goetze² ¹Institut für Mikroelektronik- und Mechatronik-Systeme gemeinnützige GmbH, DE; ²Fraunhofer Institute for Digital Media Technology, IDMT Project Group Hearing, Speech and Audio Technology, DE Abstract An embedded 24-channel acoustic processing system consisting of an FPGA based front-end and a multi-core microcontroller subsystem is presented here. It is specifically designed for a smart building solution estimating the occupancy level of rooms and areas solely based on acoustic features and source localization. The overall goal is to use this occupancy estimate to lower the energy consumption of large buildings. An overview of the hardware and software concept as well as a brief description of the acoustic occupancy level estimation is given. The APU was developed as part of the EU FP7 project - Sounds for Energy Control of Buildings (S4ECoB). More information ...
UB06.06	SCOPE: TIME-DECOUPLED PARALLEL SYSTEMC SIMULATION Authors: Jan Weinstock¹, Christoph Schumacher², Rainer Leupers², Gerd Ascheid² and Laura Tosoratto³ ¹RWTH Aachen University, DE; ²Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; ³Istituto Nazionale di Fisica Nucleare - Sezione di Roma, IT Abstract With increasing system size and complexity, designers of embedded systems face the challenge of efficiently simulating these systems in order to enable target specific software development and design space exploration as early as possible. Today's multicore workstations offer enormous computational power, but traditional simulation engines like the OSCI SystemC kernel only operate on a single thread, thereby leaving a lot of computational potential unused. Most modern embedded system designs include multiple processors. This work proposes SCope, a SystemC kernel that aims at exploiting the inherent parallelism of such systems by simulating the processors on different threads. A lookahead mechanism is employed to reduce the required synchronization between the simulation threads, thereby further increasing simulation speed. The virtual prototype of the European FP7 project EURETILE system simulator is used as demonstrator for the proposed work, showing a speedup of 4x on a four core host system compared to sequential simulation. The demo will visualize the simulation of the EURETILE system using an OpenGL based graphical user interface. The simulator will be presented as a regular sequential version based on OSCI SystemC, and as a parallel version based on the new SCope parallel SystemC kernel. More information ...
UB06.07	COMPSOC: VIRTUAL EXECUTION PLATFORMS FOR MIXED TIME-CRITICALITY APPLICATIONS Author: Kees Goossens, TU Eindhoven, NL Abstract System-on-Chip (SOC) design gets increasingly complex, as a growing number of applications are integrated in such systems. These applications have mixed time-criticality, i.e., some have firm-, some soft-, and others non-real-time requirements. Executing such a mix of applications on a SOC poses several challenges. First, to reduce cost, platform resources, e.g., processors, interconnect, memories, are shared between applications. However, sharing causes interference between applications, making their behaviors interdependent. This results in two problems for SOC design and verification: 1) accurate system-level simulation and several approaches to formal verification are infeasible, because of the explosion in the number of possible combinations of applications, inputs, and resource states and 2) verification becomes a circular process that must be repeated if an application is added, removed, or modified, making integration and verification dominant parts of SOC development, in terms of time and money. The CompSOC platform addresses these problems by executing each application on an independent virtual execution platform (VEP). The VEPs are composable, i.e., cannot affect each other's behaviors. In the temporal domain an application's actual execution never varies by even a single clock cycle. Similarly, the energy and power behaviors of applications are also composable. As a result, applications can be designed, developed, verified, and executed in isolation. The VEPs are also predictable, meaning that all interference is bounded. This makes them virtualized also in terms of performance bounds, which enables firm real-time applications to be verified using formal performance analysis frameworks. The CompSOC platform uses the CoMiK microkernel to implement virtual processors on each processor time through temporal partitioning. Each application can use its own operating system (e.g. Compose, μcOS-III) and model of computation (e.g. CSDF, KPN, TT) in its VEP, to suit its level of time criticality. As more applications are integrated on a single SOC, the need arises for more dynamic behaviour. The system should be able to start, modify and stop applications at run time without affecting running applications. For this purpose the CompSOC platform has been extended with a predictable and composable resource management framework. It manages application bundles that contain 1) an application in the form of executables (ELFs on multiple processors), and also 2) the specifications of the (one or more) particular VEPs that the application executes in, consisting of virtual processors, NOC connections, virtualised memories, etc. At run time, the resource management framework can dynamically load and start application bundles by creating a VEP and then loading, booting, and executing an application within it. VEPs can also be modified, stopped, and deleted at run time. Our University Booth will present virtual-execution-platform and application-bundle concepts using an interactive demonstrator. It will show that the CompSOC platform has been extended with dynamic functionality, without sacrificing its key strengths: composability and predictability. We will demonstrate this through the use of the resource management framework and application bundles, showing that we can create, modify and delete virtual execution platforms running a mixed time-criticality application dynamically at run time. More information ...
UB06.08	TTOOL/DIPLODOCUSDF: A UML ENVIRONMENT FOR HARDWARE/SOFTWARE CO-DESIGN OF DATA-DOMINATED SYSTEMS-ON-CHIP Authors: Andrea Enrici, Ludovic Apvrille and Renaud Pacalet, Telecom ParisTech, FR Abstract The development of new Systems on Chip commonly relies on previous products for which, due to factors such as system complexity and time and cost constraints, little design space exploration can be performed. Hardware and software are typically composed as if they were separate components, whereas their interactions yield more than the sum of the two parts. In the scope of the demonstration, we present our enhanced version of TTool/DiplodocusDF, a UML model-driven engineering tool and methodology for the design of heterogeneous data processing systems. Our contributions enrich the modeling and design space exploration capabilities of TTool/DiplodocusDF to target complex transfer schemes and control information exchange at different abstraction levels. Our improved methodology is applied to two signal processing applications, showing the analysis of novel interactions between typically conflicting aspects such as computations vs communications and dataflows vs controlflows. More information ...
UB06.09	PIGGY'S WEAVER: A DEMONSTRATION FOR FOCUSING ON SEPARATION OF DEBUGGING CONCERNS BASED ON DYNAMIC PROGRAM REWRITING TOOL: PIGGY'S WEAVER Authors: Ikuta Tanigawa¹, Nobuhiko Ogura², Midori Sugaya³ and Harumi Watanabe¹ ¹Tokai University, JP; ²Tokyo City University, JP; ³Shibaura Institute of Technology, JP Abstract Dynamic program rewriting is needed for continuous operation and reduces maintenance costs. We propose a dynamic rewriting tool, "Piggy's Weaver", for C# programs. The tool attaches and detaches pieces of code to and from a program at any point for each concern. These attachments focus in particular on the debugging concern. In the demonstration, we will apply the tool to a cloud and embedded system, "Piggy Net", which is a cooperating charity pot with SNS and was awarded 2nd prize at D2C2012 by Microsoft Japan. More information ...
UB06.10	UNISON: ASSEMBLY CODE GENERATION USING CONSTRAINT PROGRAMMING Authors: Roberto Castañeda Lozano¹, Gabriel Hjort Blindell², Mats Carlsson¹ and Christian Schulte² ¹Swedish Institute of Computer Science, SE; ²KTH Royal Institute of Technology, SE Abstract We demonstrate Unison - a simple, flexible and potentially optimal code generator that solves interdependent code generation tasks together using constraint programming as a modern combinatorial optimization method. We show how Unison takes into account the task interdependencies and their combinatorial nature to improve the speed of the code generated by LLVM (a state-of-the-art compiler) for Hexagon (a digital signal processor ubiquitous in modern mobile platforms). More information ...
14:00	End of session
16:00	Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.0 Special Day Keynote

Date: Wednesday 26 March 2014
Time: 13:30 - 14:00
Location / Room: Saal 1

The automotive industry is in a radical change process driven by technology. On the one hand, the proliferation of communication technologies into the car leads to internet-connected vehicles. The vehicle will become an integral part of the internet, opening new processing paradigms for the car itself. On the other hand, the vehicle itself significantly expands its sensor and processing capabilities through the use of radar, video and ultrasound sensors and state-of-the-art CPU and GPU processor architectures. In our talk we will address both developments and outline foreseen future applications such as future driver assistance and infotainment systems as well as highly automated driving. We will discuss major requirements for future electrical architectures and implications for future automotive chips.

Time	Label	Presentation Title Authors
13:30	7.0.1	SPECIAL DAY KEYNOTE: THE CONNECTED CAR AND ITS IMPLICATION TO THE AUTOMOTIVE CHIP ROADMAP Speaker: Dr.-Ing. Michael Bolle, Robert Bosch GmbH, DE
14:00		End of session
16:00		Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

UB07 Session 7

Date: Wednesday 26 March 2014
Time: 14:00 - 16:00
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB07.01	VIDEO-BASED ABSOLUTE NAVIGATION APPROACH: A NOVEL APPROACH FOR VIDEO-BASED ABSOLUTE NAVIGATION IN SPACE EXPLORATION MISSIONS Authors: Pascal Trotta, Tadewos Getahun Tadewos, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT Abstract Nowadays, space agencies have increased their research efforts in order to enhance the success rate of space exploration missions. Future space missions will increasingly adopt Video Based Navigation (VBN) systems to assist the entry, descent and landing (EDL) phase of space modules. This poster will show a preliminary work on a novel approach for Video-based Absolute Navigation (VBAN). Moreover, the poster depicts how a VBAN processing chain can exploit FPGA devices to achieve high throughput. Several visual results will be shown to highlight the peculiarities of the proposed approach. More information ...
UB07.02	AIDA: ANALOG IC DESIGN AUTOMATION Authors: Nuno Horta¹, Nuno Lourenço², Ricardo Martins², Ricardo Póvoa², António Canelas² and Pedro Ventura¹ ¹Instituto de Telecomunicacoes, PT; ²Instituto de Telecomunicacoes / Instituto Superior Técnico, PT Abstract This demonstration presents AIDA, an analog integrated circuit (IC) design automation environment. AIDA includes two main modules, namely, AIDA-C and AIDA-L. AIDA-C is a circuit-level synthesis tool which uses state-of-the-art multi-objective multi-constrained optimization kernels, based on evolutionary computation techniques, where the robustness of the solutions is attained by considering a layout-aware approach and, also, extreme process variations by means of PVT corner analysis. The circuit's performance is measured using Spectre®, ELDO® or HSPICE® electrical simulators as evaluation engines. AIDA-L takes the device sizes and the best floorplan, obtained with AIDA-C, and generates the complete layout by placing and routing the devices, while fulfilling the technology design rules by using built-in design-rule check (DRC) and layout-versus-schematic (LVS) procedures. To demonstrate the AIDA design environment, several analog circuit structures, e.g., OTAs, LNAs, LC-Oscillators, etc., will be synthesized in a 130nm CMOS technology. AIDA-C is demonstrated for circuit-level sizing and optimization by generating a family of Pareto-optimal solutions based on user performance and functional specifications. AIDA-L is demonstrated by generating the layout of a user-selected solution from AIDA-C, taking into account electrical current information to mitigate electromigration and IR-drop effects, and also wiring symmetry for multiport multi-terminal signal nets of analog ICs. More information ...
UB07.03	BICONDITIONAL BINARY DECISION DIAGRAM MANIPULATION PACKAGE Authors: Luca Amaru¹, Alexios Balatsoukas-Stimming², Pierre-Emmanuel Gaillardon³, Andreas Burg² and Giovanni De Micheli³ ¹EPFL, CH; ²EPFL-TCL, CH; ³EPFL-LSI, CH Abstract In this software demonstration, we present a logic manipulation package based on Biconditional Binary Decision Diagrams (BBDDs). BBDDs are a novel class of canonical binary decision diagrams where the branching condition, and its associated logic expansion, is biconditional on two variables. We show how Verilog files from real life designs can be rapidly read and processed by the BBDD manipulation package, for verification, testing or synthesis purposes. In particular, we demonstrate the benefit deriving from BBDD re-writing of arithmetic circuits in the synthesis of a product code iterative decoder. More information ...
UB07.04	GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES Authors: Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT Abstract Gemini is a synthesis and optimization software for graphene-based digital devices. Given a combinational circuit description through its boolean representation, Gemini produces a SPICE netlist mapped with graphene PN-Junction gates. The software is composed of a parser library to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library used to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. As a stand-alone software or as a library easy to integrate into state-of-the-art tools, Gemini represents a first step of an enabling technology for future synthesis and optimization processes for graphene-based devices. More information ...
UB07.05	TOMAHAWK2: PERFORMANCE IMPACT OF INSTRUCTION SET ARCHITECTURE EXTENSIONS FOR DYNAMIC TASK SCHEDULING UNITS Author: Oliver Arnold, Technische Universität Dresden, DE Abstract In this demo a heterogeneous MPSoC is controlled by a dynamic task scheduling unit called CoreManager. The instruction set architecture of this unit has been extended to improve performance for dynamic data dependency checking, task scheduling, processing element allocation and data transfer management. The MPSoC as well as the NoC are integrated in a cycle-accurate virtual system prototype. The performance impact of the CoreManager is analyzed at component as well as at system level.
UB07.06	LEGO: TOOLS FOR HYBRID INTEGRATION Author: Fredrik Jonsson, Royal Institute of Technology, SE Abstract Performance of printed devices depends on the geometry, but is also affected by processing steps of other components integrated onto the same substrate. Since different designs use different devices, process stack, models and design rules must be dynamically determined. In this work we propose and demonstrate an experimental design flow to allow efficient design of hybrid and printed electronic circuits.
UB07.07	UVM-SYSTEMC-AMS: UVM STANDARD-COMPLIANT SYSTEMC (AMS)-BASED VERIFICATION FRAMEWORK FOR HETEROGENEOUS SYSTEMS Authors: Zhi Wang¹, Yao Li², Marie-Minerve Louerat², Francois Pecheux², Martin Barnasconi³, Thilo Vörtler⁴ and Karsten Einwich⁴ ¹Laboratoire d'informatique de Paris 6, FR; ²UPMC-LIP6, FR; ³NXP, NL; ⁴Fraunhofer IIS, DE Abstract Today's societal needs for innovative products in terms of communication, mobility, health, entertainment, and safety directly impact microelectronics design methodologies. Embedded systems are simultaneously software-driven, digitally assisted, complex and heterogeneous, but existing verification methodologies are mostly focused on pure digital devices and are completely decoupled from analog verification. This presentation shows how the principles of the new UVM methodology can be soundly enhanced to offer the test designer a flexible framework for the virtual prototyping of multi-discipline testbenches that supports both digital and Analog Mixed-Signal (AMS) at the architectural level.
UB07.08	TTOOL/DIPLODOCUSDF: A UML ENVIRONMENT FOR HARDWARE/SOFTWARE CO-DESIGN OF DATA-DOMINATED SYSTEMS-ON-CHIP Authors: Andrea Enrici, Ludovic Apvrille and Renaud Pacalet, Telecom ParisTech, FR Abstract The development of new Systems on Chip commonly relies on previous products for which, due to factors such as system complexity, time and cost constraints, little design space exploration can be performed. Hardware and software are typically composed as if they were separate components, whereas their interactions yield more than the sum of the two parts. In the scope of the demonstration, we present our enhanced version of TTool/DiplodocusDF, a UML model-driven engineering tool and methodology for the design of heterogeneous data processing systems. Our contributions enrich the modeling and design space exploration capabilities of TTool/DiplodocusDF to target complex transfer schemes and control information exchange at different abstraction levels. Our improved methodology is applied to two signal processing applications, showing the analysis of novel interactions between typically conflicting aspects such as computations vs communications and dataflows vs controlflows.
UB07.09	A HOLISTIC APPROACH TO POWER MANAGEMENT FOR ENERGY HARVESTING EMBEDDED SYSTEMS Authors: Kyungsoo Lee, Hideki Takase and Tohru Ishihara, Kyoto University, JP Abstract We present a holistic approach to maximizing the energy efficiency of energy harvesting embedded systems which consist of a processor system and an energy harvesting system. A power management program integrated on a real-time OS optimally switches the operation mode of the processor and the configuration of the energy harvesting system according to the workload of the processor and the harvesting situation. The demonstration will show that our prototype system, consisting of our processor chip and harvesting system board, runs stably using harvested energy only. The processor has multiple cores, each with different performance, to improve the energy efficiency of computation. The energy harvesting board has high transfer efficiency to reduce the power loss. The entire system is controlled efficiently by our power management program implemented on Toppers OS.
UB07.10	STMC TOOLS: A STATE TRANSITION MODEL DESCRIPTION LANGUAGE STMC AND ITS TOOLS - AN EXTENSION OF THE C PROGRAMMING LANGUAGE FOR DEVELOPING DRIVER SOFTWARE AND FIRMWARE WITH MODELS - Authors: Nobuhiko Ogura¹, Ikuta Tanigawa², Takuya Todoroki¹, Kenji Arai¹ and Harumi Watanabe² ¹Tokyo City University, JP; ²Tokai University, JP Abstract We present a state transition model description programming language. It can be translated to pure standard C programs without any OS or handwritten frameworks, hence it is suited to developing low level driver software and firmware, unlike many other tools for automatic software generation from software models, which usually focus on higher level models. We show the language, its translator to executable software, a visual diagram generator and analysis tools, with embedded software examples.
16:00	End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.1 SPECIAL DAY Panel: HW/SW Co-Development - The Industrial Workflow

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Saal 1

Organiser:
Johannes Stahl, Synopsys, US

Chair:
Iris Stroh, Markt & Technik, DE

This panel brings together the entire supply chain for the use of virtual prototyping, starting with the end users at an automotive Tier 1, a semiconductor supplier, IP providers and the virtual prototyping and software development tool providers. The panelists will discuss the benefits and challenges of accelerating software development using virtual prototyping for deployment in industrial projects.

Panelists:

Andreas Schwerin, Siemens, DE
Martin Vaupel, Bosch, DE
Albrecht Mayer, Infineon, DE
Nick Gatherer, ARM, GB
Frank Schirrmeister, Cadence, US
Stephan Lauterbach, Lauterbach, US
Colin Walls, Mentor Graphics, US
Andreas Hoffmann, Synopsys, US

16:00

End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.2 Embedded Tutorial: Cross Layer Resiliency in Real World

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 6

Organiser:
Vikas Chandra, ARM, US

Chair:
Yanjing Li, Intel, US

Co-Chair:
Ulf Schlichtmann, TUM, DE

Resilience at different levels of the design hierarchy will be needed in complex SoCs to handle failures due to variability, reliability and design errors (logical or electrical). The main reasons for the marginal behavior are sheer design complexity, uncertainties in manufacturing processes, temporal variability and operating conditions. In this session, we will cover the basics of cross layer resiliency and explore the reliability challenges in both embedded processors and large scale computing resources.

Time	Label	Presentation Title Authors
14:30	7.2.1	CROSS-LAYER RESILIENCE EXPLORATION AND OPTIMIZATION Speaker: Subhasish Mitra, Stanford University, US Abstract This talk will discuss systematic methodologies for exploring cross-layer resilience, encompassing error detection, correction and recovery techniques, for complex SoCs. The objective is to address several key questions such as: 1. Given a design, is cross-layer resilience always the best option? 2. What are the right models that link resilience techniques across multiple layers for quick, yet accurate, estimation of coverage and costs? 3. What is the proper framework to explore the large space of existing resilience techniques for error detection, correction, and recovery across various abstraction layers?
15:00	7.2.2	RELIABILITY CHALLENGES IN EMBEDDED PROCESSORS Speaker: Vikas Chandra, ARM, US Abstract Embedded processors are now at the heart of the mobile revolution and aspire to power even high performance data centers. It is of utmost importance to understand the reliability challenges in embedded processors and find ways to tackle them across different layers of design abstraction. In this talk, I will discuss the reliability requirements in embedded processors, the challenges we are facing and our approach to making designs more robust. We will discuss our approaches to measuring wearout in commercial processors as well as the efficient design of in-situ monitors to track timing errors.
15:30	7.2.3	BILLION CHIPS OF TRILLION TRANSISTORS: HOW TO MAKE THEM RELIABLE? Speakers: Chen-Yong Cher¹ and Silvia Mueller² ¹IBM Research, US; ²IBM Boeblingen, DE Abstract Due to increasing demand for personal devices, high performance computing systems and commercial data centers, microprocessor and main memory designers face numerous challenges in delivering a large number of chips at effective cost. While frequency scaling has effectively ended, technology scaling continues to provide an increasing number of transistors. To effectively utilize these transistors for performance, designers turn to sophisticated and highly integrated chip designs such as multi-core (e.g., Intel i7, IBM POWER7, BlueGene/Q), GPGPU (e.g., NVIDIA Tegra) and heterogeneous SoC (e.g., IBM Wirespeed). The increasing demand for chips and transistors presents numerous challenges in reliability, power and manufacturing costs. In large scale HPC systems and data centers, the increasing number of chips also raises the per-chip reliability requirement in order to achieve system reliability targets.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.3 Low power methods and multicore architectures for mobile health applications

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 1

Chair:
Giovanni Ansaloni, EPFL, CH

Co-Chair:
Andrea Bartolini, University of Bologna, IT

Achieving low power operation is essential for battery operated mobile health applications. In this session, the papers address this important issue. The first two papers present multicore architectural methods for bio-signal processing, dealing with synchronisation and innovative memory architecture design. The last two papers focus on low power design of applications for bio-signal processing: tuning of sensor usage based on applications and methods to selectively drop computations to save power, without affecting the accuracy.

Time	Label	Presentation Title Authors
14:30	7.3.1	HARDWARE/SOFTWARE APPROACH FOR CODE SYNCHRONIZATION IN LOW-POWER MULTI-CORE SENSOR NODES Speakers: Rubén Braojos¹, Ahmed Dogan², Ivan Beretta², Giovanni Ansaloni² and David Atienza² ¹École Polytechnique Fédérale de Lausanne, CH; ²EPFL, CH Abstract The latest embedded bio-signal analysis applications, targeting low-power Wireless Body Sensor Nodes (WBSNs), present conflicting requirements. On one hand, bio-signal analysis applications are continuously increasing their demand for high computing capabilities. On the other hand, long-term signal processing in WBSNs must be provided within their highly constrained energy budget. In this context, parallel processing effectively increases the power efficiency of WBSNs, but only if the execution can be properly synchronized among computing elements. To address this challenge, in this work we propose a hardware/software approach to synchronize the execution of bio-signal processing applications in multi-core WBSNs. This new approach requires few hardware resources and very few adaptations in the source code. Moreover, it provides the necessary flexibility to execute applications with an arbitrarily large degree of complexity and parallelism, enabling considerable reductions in power consumption for all multi-core WBSN execution conditions. Experimental results show that a multi-core WBSN architecture using the illustrated approach can obtain energy savings of up to 40%, with respect to an equivalent single-core architecture, when performing advanced bio-signal analysis.
15:00	7.3.2	HYBRID MEMORY ARCHITECTURE FOR VOLTAGE SCALING IN ULTRA-LOW POWER MULTI-CORE BIOMEDICAL PROCESSORS Speakers: Daniele Bortolotti¹, Andrea Bartolini¹, Christian Weis², Davide Rossi¹ and Luca Benini¹ ¹University of Bologna, IT; ²University of Kaiserslautern, DE Abstract Technology scaling today enables the design of sensor-based ultra-low cost chips well suited for emerging applications such as wireless body sensor networks, urban life and environment monitoring. Energy consumption is the key limiting factor of this upcoming revolution and memories are often the energy bottleneck, mainly due to leakage power. This paper proposes an ultra-low power multi-core architecture targeting eHealth monitoring systems, where applications involve collection of sequences of slow biomedical signals and highly parallel computations at very low voltage. We propose a hybrid memory architecture that combines 6T-SRAM and 8T-SRAM operating in the same voltage domain. At high voltage it supports normal operation; at low voltage it provides a small, fully reliable memory partition (8T) while the rest of the memory (6T) is state-retentive. Our architecture offers significant energy savings with a low area overhead in typical eHealth Compressed Sensing-based applications.
15:30	7.3.3	CONTEXT AWARE POWER MANAGEMENT FOR MOTION-SENSING BODY AREA NETWORK NODES Speakers: Filippo Casamassima¹, Elisabetta Farella² and Luca Benini³ ¹University of Bologna, IT; ²DEI - University of Bologna, IT; ³Università di Bologna, IT Abstract Body Area Networks (BANs) are widely used, mainly for healthcare and fitness purposes. In both cases, the lifetime of sensor nodes included in the BAN is a key aspect that may affect the functionality of the whole system. Typical approaches to power management are based on a trade-off between the data rate and the monitoring time. Our work introduces a power management layer capable of opportunistically using data sampled by sensors to detect contextual information such as user activity and adapting the node operating point accordingly. The use of this layer has been demonstrated on a commercial sensor node, increasing its battery lifetime by up to a factor of 5.
15:45	7.3.4	A QUALITY-SCALABLE AND ENERGY-EFFICIENT APPROACH FOR SPECTRAL ANALYSIS OF HEART RATE VARIABILITY Speakers: Georgios Karakonstantis¹, Aviinaash Sankaranarayanan², Mohamed Sabry¹, David Atienza¹ and Andreas Burg¹ ¹EPFL, CH; ²Debiotech S.A., CH Abstract Today there is a growing interest in the integration of health monitoring applications in portable devices, necessitating the development of methods that improve the energy efficiency of such systems. In this paper, we present a systematic approach that enables energy-quality trade-offs in spectral analysis systems for bio-signals, which are useful in monitoring various health conditions such as those associated with the heart rate. To enable such trade-offs, the processed signals are expressed initially in a basis in which significant components that carry most of the relevant information can be easily distinguished from the parts that influence the output to a lesser extent. Such a classification allows the pruning of operations associated with the less significant signal components, leading to power savings with minor quality loss since only less useful parts are pruned under the given requirements. To exploit the attributes of the modified spectral analysis system, thresholding rules are determined and adopted at design- and run-time, allowing the static or dynamic pruning of less-useful operations based on the accuracy and energy requirements. The proposed algorithm is implemented on a typical sensor node simulator and results show up to 82% energy savings when static pruning is combined with voltage and frequency scaling, compared to the conventional algorithm in which such trade-offs were not available. In addition, experiments with numerous cardiac samples of various patients show that such energy savings come with a 4.9% average accuracy loss, which does not affect the system's ability to detect sinus arrhythmia, which was used as a test case.
16:00	IP3-10, 633	BATTERY AWARE STOCHASTIC QOS BOOSTING IN MOBILE COMPUTING DEVICES Speakers: Hao Shen, Qiuwen Chen and Qinru Qiu, Syracuse University, US Abstract Mobile computing has been woven into everyday life to a great extent. Its usage is clearly imprinted with the user's personal signature. The ability to learn such a signature enables immense potential in workload prediction and resource management. In this work, we investigate user behavior modeling and apply the model to energy management. Our goal is to maximize the quality of service (QoS) provided by the mobile device (i.e., smartphone), while keeping the risk of battery depletion below a given threshold. A Markov Decision Process (MDP) is constructed from historical user behavior. The optimal management policy is solved using linear programming. Simulations based on real user traces validate that, compared to existing battery energy management techniques, the stochastic control performs better in boosting the mobile devices' QoS without significantly increasing the chance of battery depletion.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
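
The energy-quality trade-off described in talk 7.3.4 above rests on pruning the less significant components of a signal once it is expressed in a suitable basis. The following is a minimal sketch of that idea only. It assumes a one-level, unnormalised Haar basis and a fixed threshold, both chosen purely for illustration; the function names are hypothetical and the code is not the authors' algorithm.

```python
def haar_step(signal):
    """One level of an unnormalised Haar transform (assumed basis).

    Returns pairwise averages (coarse part) and pairwise half-differences
    (detail part). The signal length must be even.
    """
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def prune(details, threshold):
    """Zero every detail coefficient whose magnitude is below the threshold.

    Returns the pruned coefficients and the number of coefficients dropped;
    each dropped coefficient stands for downstream work that can be skipped.
    """
    kept = [d if abs(d) >= threshold else 0.0 for d in details]
    pruned = sum(1 for d in details if abs(d) < threshold)
    return kept, pruned


def inverse_haar_step(averages, details):
    """Rebuild the signal from averages and (possibly pruned) details."""
    out = []
    for avg, det in zip(averages, details):
        out.extend([avg + det, avg - det])
    return out


if __name__ == "__main__":
    # Hypothetical samples; the threshold 0.5 is an arbitrary illustration.
    samples = [4.0, 4.2, 5.0, 5.0, 9.0, 1.0, 2.0, 2.1]
    avgs, dets = haar_step(samples)
    kept, dropped = prune(dets, threshold=0.5)
    approx = inverse_haar_step(avgs, kept)
    worst = max(abs(a - b) for a, b in zip(samples, approx))
    print("coefficients pruned:", dropped)
    print("worst-case reconstruction error:", worst)
```

Raising the threshold prunes more coefficients and increases the reconstruction error, which is the kind of knob a design-time or run-time thresholding rule would set.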

7.4 Runtime memory optimization and GPU/manycore architectures

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 2

Chair:
Alberto Nannarelli, DTU Copenhagen, DK

Co-Chair:
Alberto Macii, PoliTo Torino, IT

The session starts with memory design techniques under PVT variation and ageing for DRAMs and SRAM caches. Afterwards, bus, memory and partitioning techniques for 2D and 3D GPUs and manycores are presented.

Time	Label	Presentation Title Authors
14:30	7.4.1	EXPLOITING EXPENDABLE PROCESS-MARGINS IN DRAMS FOR RUN-TIME PERFORMANCE OPTIMIZATION Speakers: Karthik Chandrasekar¹, Sven Goossens², Christian Weis³, Martijn Koedam², Benny Akesson⁴, Norbert Wehn³ and Kees Goossens⁵ ¹Delft University of Technology, NL; ²Eindhoven University of Technology, NL; ³University of Kaiserslautern, DE; ⁴Czech Technical University in Prague, CZ; ⁵Eindhoven University of Technology, NL Abstract Manufacturing-time process (P) variations and runtime variations in voltage (V) and temperature (T) can affect a DRAM's performance (internal delays) severely. To counter the effects of these variations, DRAM vendors provide substantial design-time PVT margins to guarantee correct DRAM functionality under worst-case conditions. Unfortunately, with technology scaling these design margins have become large and very pessimistic for a majority of the manufactured DRAMs. While run-time variations are specific to operating conditions and their margins are difficult to optimize, process variations are manufacturing-time effects and excessive process-margins can be reduced on a per-device basis, if properly identified. In this paper, we propose a generic post-manufacturing performance characterization methodology for DRAMs that identifies this excess in process-margins for any given DRAM device at runtime, while retaining the requisite margins for voltage (noise) and temperature variations. By doing so, the methodology ascertains the actual impact of process-variations on the particular DRAM device and optimizes its access latencies, thereby improving its overall performance. We evaluate this methodology on 48 DDR3 devices (from 12 DIMMs) and verify the derived timings under worst-case operating conditions, showing up to 33.3% and 25.9% reduction in DRAM read and write latencies, respectively.
15:00	7.4.2	CACHE AGING REDUCTION WITH IMPROVED PERFORMANCE USING DYNAMICALLY RE-SIZABLE CACHE Speakers: Haroon Mahmood, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT Abstract Aging of transistors is a limiting factor for long term reliability of devices in sub-100nm technologies. It is a worst-case metric where the lifetime of a device is determined by the earliest failing component. The impact is more serious on memory arrays, where failure of a single SRAM cell would cause the failure of the whole system. Previous works have shown that partitioning strategies based on power management techniques can effectively control aging effects and can extend the lifetime of the cache significantly. However, such a benefit comes as a trade-off with performance, which reduces proportionally as time elapses. To address this problem and provide a single solution to concurrently improve aging, energy and performance of the cache, we propose an architectural solution based on the dynamically re-sizable cache and cache partitioning approaches. With this strategy, the cache is dynamically re-sized and reconfigured whenever a cache block becomes unreliable. Coupling this aging mitigation technique with the dynamically re-sizable cache approach provides on average 30% lifetime improvement with less than 0.4x degradation in performance whereas, in previous solutions, performance degradation sometimes goes up to 10x.
15:15	7.4.3	ON GPU BUS POWER REDUCTION WITH 3D IC TECHNOLOGIES Speakers: Young-joon Lee¹ and Sung Kyu Lim² ¹Intel Corporation, US; ²Georgia Institute of Technology, US Abstract Complex buses consume significant power in graphics processing units (GPUs). In this paper, we demonstrate how the power consumption of buses in GPUs can be reduced with 3D IC technologies. Based on layout simulations, we found that partitioning and floorplanning of 3D ICs affect the size of the power benefit, as do the technology setup, target clock frequency, and circuit switching activity. With 3D IC technologies, we achieved a total power reduction of up to 21.5% for our GPU.
15:45	7.4.4	PROCESS VARIATION-AWARE WORKLOAD PARTITIONING ALGORITHMS FOR GPUS SUPPORTING SPATIAL MULTITASKING Speakers: Paula Aguilera¹, Jungseob Lee¹, Amin Farmahini Farahani¹, Michael Schulte², Katherine Morrow¹ and Nam Sung Kim¹ ¹University of Wisconsin-Madison, US; ²AMD, US Abstract High-level programming languages have transformed graphics processing units (GPUs) from domain-restricted devices into powerful compute platforms. Yet many "general-purpose GPU" (GPGPU) applications fail to fully utilize the GPU resources. Executing multiple applications simultaneously on different regions of the GPU (spatial multitasking) thus improves system performance. However, within-die process variations lead to significantly different maximum operating frequencies (Fmax) of the streaming multiprocessors (SMs) within a GPU. As the chip size and number of SMs per chip increase, the frequency variation is also expected to increase, exacerbating the problem. The increased number of SMs also provides a unique opportunity: we can allocate resources to concurrently-executing applications based on how those applications are affected by the different available Fmax values. In this paper, we study the effects of per-SM clocking on spatial multitasking-capable GPUs. We demonstrate two factors that affect the performance of simultaneously-running applications: (i) the SM partitioning algorithm that decides how many resources to assign to each application, and (ii) the assignment of SMs to applications based on the operating frequencies of those SMs and the applications' characteristics. Our experimental results show that spatial multitasking that partitions SMs based on application characteristics, when combined with per-SM clocking, can greatly improve application performance by up to 46% on average compared to cooperative multitasking with global clocking.
16:00	IP3-11, 240	A THERMAL RESILIENT INTEGRATION OF MANY-CORE MICROPROCESSORS AND MAIN MEMORY BY 2.5D TSI I/OS Speakers: Sih-Sian Wu¹, Kanwen Wang¹, Sai Manoj P. D.¹, Tsung-Yi Ho² and Hao Yu¹ ¹Nanyang Technological University, SG; ²National Cheng Kung University, TW Abstract A memory-logic-integration design platform is developed in this paper with thermal reliability analysis provided for 2.5D through-silicon-interposer (TSI) and 3D through-silicon-via (TSV) based integrations. Temperature-dependent delay and power models have been developed at microarchitecture level for 2.5D and 3D integrations of many-core microprocessors and main memory, respectively. Experiments are performed with general-purpose benchmarks from SPEC CPU2006 and also cloud-oriented benchmarks from Phoenix, with the following observations. The memory-logic integration by 3D RC-interconnected TSV I/Os can result in thermal runaway failures due to strong electrical-thermal couplings. On the other hand, the one by 2.5D transmission-line-interconnected TSI I/Os has shown almost the same energy efficiency and better thermal resilience.
16:01	IP3-12, 24	LEVERAGING ON-CHIP NETWORKS FOR EFFICIENT PREDICTION ON MULTICORE COHERENCE Speaker: Libo Huang, National University of Defense Technology, CN Abstract Coherent data prediction has been introduced as a promising architectural technique for reducing cache-to-cache accesses in directory protocols. However, limited on-chip resources cause the accuracy of current prediction to be generally low. Low accuracy results in a large number of unnecessary or incorrect predictions, which consequently generate excessive network traffic. This leads to large power and performance overhead for coherent memory access. This paper proposes an early abort mechanism (EBT) that leverages NoC design to reduce the negative effect of wrong prediction operations, thus facilitating overall performance improvement and traffic reduction. Using detailed full-system simulations, we conclude that EBT provides a cost-effective solution for designing efficient multicore processors. To the best of our knowledge, this study is the first to leverage the on-chip network for prediction optimization on multicore coherence.
16:02	IP3-13, 184	AN ADAPTIVE MEMORY INTERFACE CONTROLLER FOR IMPROVING BANDWIDTH UTILIZATION OF HYBRID AND RECONFIGURABLE SYSTEMS Speakers: Vito Giovanni Castellana¹, Antonino Tumeo² and Fabrizio Ferrandi¹ ¹Politecnico di Milano, DEIB, IT; ²Pacific Northwest National Laboratory, US Abstract Data mining, bioinformatics, knowledge discovery and social network analysis are emerging irregular applications that exploit data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and are generally memory bandwidth bound, but also present large amounts of inherent dynamic parallelism because they can potentially spawn concurrent activities for each of the elements they are exploring. Hybrid architectures, which integrate general purpose processors with reconfigurable devices, appear to be promising target platforms for accelerating irregular applications. These systems often connect to distributed and multi-ported memories, potentially enabling parallel memory operations. However, these memory architectures introduce several challenges, such as the necessity to manage concurrency and synchronization to avoid structural conflicts on shared memory locations and to guarantee consistency. In this paper we present an adaptive Memory Interface Controller (MIC) that addresses these issues. The MIC is a general and customizable solution that can target several different memory structures, and is suitable for High Level Synthesis frameworks. It implements a dynamic arbitration scheme, which avoids conflicts on memory resources at runtime, and supports atomic memory operations, commonly exploited for synchronization directives in parallel programming paradigms. The MIC simultaneously maps multiple accesses to different memory ports, allowing fine grained parallelism exploitation and ensuring correctness also in the presence of irregular and statically unpredictable memory access patterns. We evaluated the effectiveness of our approach on a typical irregular kernel, graph Breadth First Search (BFS), exploring different design alternatives.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
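
Talk 7.4.4 above assigns streaming multiprocessors (SMs) to applications according to each SM's Fmax and the applications' characteristics. The abstract does not give the algorithm, so the sketch below shows only one plausible greedy heuristic: applications that benefit most from clock frequency receive the fastest SMs. The `frequency_sensitivity` score, the SM counts and the function name are hypothetical inputs invented for this illustration.

```python
def assign_sms(sm_fmax_mhz, apps):
    """Greedy, illustrative SM assignment (not the authors' algorithm).

    sm_fmax_mhz: list where index i holds the Fmax of SM i in MHz.
    apps: list of (name, sm_count, frequency_sensitivity) tuples, where
          frequency_sensitivity is an assumed score (higher = gains more
          from a faster clock).
    Returns a dict mapping application name to a list of SM indices.
    """
    requested = sum(count for _, count, _ in apps)
    if requested > len(sm_fmax_mhz):
        raise ValueError("applications request more SMs than the GPU has")

    # SM indices ordered from fastest to slowest.
    by_speed = sorted(range(len(sm_fmax_mhz)),
                      key=lambda i: sm_fmax_mhz[i], reverse=True)

    assignment = {}
    next_free = 0
    # Most frequency-sensitive application picks first.
    for name, count, _ in sorted(apps, key=lambda a: a[2], reverse=True):
        assignment[name] = by_speed[next_free:next_free + count]
        next_free += count
    return assignment


if __name__ == "__main__":
    # Hypothetical per-SM Fmax values and application mix.
    fmax = [700, 650, 720, 600, 690, 710]
    mix = [("memory_bound", 2, 0.1), ("compute_bound", 3, 0.9)]
    print(assign_sms(fmax, mix))
```

How many SMs each application should receive is a separate decision (factor (i) in the abstract); here the counts are simply taken as given.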

7.5 Emerging memory technologies

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 3

Chair:
Aida Todri, CNRS, FR

Co-Chair:
Lars Bauer, KIT, DE

The papers in this session consider ways to improve the energy, performance, and reliability of emerging memory technologies. STT-RAM and PCRAM are addressed.

Time	Label	Presentation Title Authors
14:30	7.5.1	ASYNCHRONOUS ASYMMETRICAL WRITE TERMINATION (AAWT) FOR A LOW POWER STT-MRAM Speakers: Rajendra Bishnoi¹, Mojtaba Ebrahimi², Fabian Oboril² and Mehdi Tahoori² ¹Karlsruhe Institute of Technology, DE; ²Karlsruhe Institute of Technology, DE Abstract Spin Transfer Torque (STT) memory is an emerging and promising non-volatile storage technology. However, the high write current is still a major challenge which leads to a huge power consumption of the memory. Due to an inherent torque asymmetry of the Magnetic Tunnel Junction (MTJ) device employed in STT memories, the switching times for parallel to anti-parallel and anti-parallel to parallel magnetization are significantly different. Hence, the write latencies for writing '0' and '1' are also considerably different. In this paper, we propose a technique called Asynchronous Asymmetrical Write Termination (AAWT) which utilizes this asymmetrical behavior to terminate the write operations asynchronously and as a result significantly reduces the write power consumption. Furthermore, we present two different AAWT implementations to determine the actual write termination times. The first one makes use of a clock signal and the second one employs a self-timing approach based on an internal delay element. As shown by our experimental results, AAWT can reduce the total write energy by 30% on average with a negligible area overhead.
15:00	7.5.2	WRITE-ONCE-MEMORY-CODE PHASE CHANGE MEMORY Speakers: Jiayin Li and Kartik Mohanram, University of Pittsburgh, US Abstract This paper describes a write-once-memory-code phase change memory (WOM-code PCM) architecture for next-generation non-volatile memory applications. Specifically, we address the long latency of the write operation in PCM -- attributed to PCM SET -- by proposing a novel PCM memory architecture that integrates WOM-codes at the memory organization and memory controller levels. The proposed <2^2>^2/3 WOM-code PCM architecture is able to reduce memory write (read) latency by 20.1% (10.2%) on average across general-purpose (SPEC CPU2006), embedded (MiBench), and high-performance (SPLASH-2) benchmarks. To further improve the write latency of WOM-code PCM, we propose a PCM-refresh approach that uses idle cycles to preemptively set PCM rows to the initial WOM-code state. Results show that WOM-code PCM with PCM-refresh can reduce memory write (read) latency by 54.9% (47.9%) on average across the benchmarks. Finally, to balance write latency improvements against WOM-code PCM overhead, we propose a WOM-code cached PCM (WCPCM) architecture that uses WOM-code PCM as the cache alongside conventional PCM main memory. For just 4.7% memory overhead, WCPCM reduces memory write (read) latency by 47.2% (44.0%) on average across the benchmarks.
15:30	7.5.3	IMPROVING STT-MRAM DENSITY THROUGH MULTI-BIT ERROR CORRECTION Speakers: Brandon Del Bel, Jongyeon Kim, Chris H. Kim and Sachin S. Sapatnekar, University of Minnesota, US Abstract STT-MRAMs are prone to data corruption due to inadvertent bit flips. Traditional methods enhance robustness at the cost of area/energy by using larger cell sizes to improve the thermal stability of the MTJ cells. This paper employs multi-bit error correction with DRAM-style refreshing to mitigate errors and provides a methodology for determining the optimal level of correction. A detailed analysis demonstrates that the reduction in non-volatility requirements afforded by strong error correction translates to significantly lower area for the memory array compared to simpler ECC schemes, even when accounting for the increased overhead of error correction.
16:00	IP3-14, 458	ENERGY EFFICIENT IN-MEMORY AES ENCRYPTION BASED ON NONVOLATILE DOMAIN-WALL NANOWIRE Speakers: Yuhao Wang¹, Pingfan Kong¹, Hao Yu¹ and Dennis Sylvester² ¹Nanyang Technological University, SG; ²University of Michigan, US Abstract The widely applied Advanced Encryption Standard (AES) encryption algorithm is critical in secure big-data storage. Data oriented applications have imposed high throughput and low power, i.e., energy efficiency (J/bit), requirements when applying AES encryption. This paper explores an in-memory AES encryption using the newly introduced domain-wall nanowire. We show that all AES operations can be fully mapped to a logic-in-memory architecture by non-volatile domain-wall nanowire, called DW-AES. The experimental results show that DW-AES can achieve the best energy efficiency of 24 pJ/bit, which is 9X and 6.5X times better than CMOS ASIC and ReRAM-CMOL implementations, respectively. Under the same area budget, the proposed DW-AES exhibits 6.4X higher throughput and 29\% power saving compared to a CMOS ASIC implementation; 1.7X higher throughput and 74\% power reduction compared to a ReRAM-CMOL implementation.
16:01	IP3-15, 391	ICE: INLINE CALIBRATION FOR MEMRISTOR CROSSBAR-BASED COMPUTING ENGINE Speakers: Boxun Li¹, Yu Wang¹, Yiran Chen², Helen Li² and Huazhong Yang¹ ¹Tsinghua University, CN; ²University of Pittsburgh, US Abstract The emerging neuromorphic computation provides a revolutionary solution to the alternative computing architecture and effectively extends Moore's Law. The discovery of the memristor presents a promising hardware realization of neuromorphic systems with incredible power efficiency, allowing efficiently executing the analog matrix-vector multiplication on the memristor crossbar architecture. However, during computations, the memristor will slowly drift from its initial programmed state, leading to a gradual decline of the computation precision of memristor crossbar-based computing engine (MCE). In this paper, we propose an inline calibration mechanism to guarantee the computation quality of the MCE. The inline calibration mechanism collects the MCE's computation error through `interrupt-and-benchmark (I&B)' operations and predicts the best calibration time through polynomial fitting of the computation error data. We also develop an adaptive technique to adjust the time interval between two neighbor I&B operations and minimize the negative impact of the I&B operation on system performance. The experiment results demonstrate that the proposed inline calibration mechanism achieves a calibration efficiency of 91.18% on average and negligible performance overhead (i.e., 0.439%)
16:02	IP3-16, 533	COMPLEMENTARY RESISTIVE SWITCH BASED STATEFUL LOGIC OPERATIONS USING MATERIAL IMPLICATION Speakers: Yuanfan Yang¹, Jimson Mathew¹, Dhiraj K Pradhan¹, Marco Ottavi² and Salvatore Pontarelli² ¹University of Bristol, GB; ²University of Rome "Tor Vergata", IT Abstract Memristor-based logic and memories are increasingly becoming one of the fundamental building blocks for future system design. Hence, it is important to explore various methodologies for implementing these blocks. In this paper, we present novel Complementary Resistive Switching (CRS) based stateful logic operations using material implication. The proposed solution benefits from an exponential reduction in sneak path current in crossbar-implemented logic. We validated the effectiveness of our solution through SPICE simulations on a number of logic circuits. It has been shown that only 4 steps are required to implement an N-input NAND gate, whereas memristor-based stateful logic needs N+1 steps.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.6 Performance and timing analysis

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 4

Chair:
Wang Yi, Uppsala University, SE

Co-Chair:
Petru Eles, Linköping University, SE

This session includes three papers. The first uses data mining techniques to detect performance bottlenecks to improve the scalability of multicore platforms for embedded applications. The second proposes to use regular expressions for specifying the patterns of deadline misses and hits to relax schedulability analysis for cyber physical systems. The third presents an approach to the scheduling of streaming applications, considering latency constraints and minimization of the number of processors required.
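
As a minimal illustration of the idea behind the second paper (specifying acceptable patterns of deadline hits and misses with a regular expression), the sketch below checks recorded job outcomes against one such pattern. The trace encoding ("h" for a hit, "m" for a miss), the chosen pattern and the function name are assumptions made for this example only; they are not taken from the paper, whose method derives the language from a network of timed automata.

```python
import re

# Assumed encoding: one character per job of the observed task,
# "h" = deadline hit, "m" = deadline miss.
# Illustrative guarantee: any two misses are separated by at least two hits.
#   h*          leading hits
#   (?:mhh+)*   a miss followed by two or more hits, repeated
#   (?:mh?)?    optionally, a final miss followed by at most one hit
ALLOWED_PATTERN = re.compile(r"h*(?:mhh+)*(?:mh?)?")


def satisfies_guarantee(trace: str) -> bool:
    """Return True if the whole hit/miss trace belongs to the allowed language."""
    return ALLOWED_PATTERN.fullmatch(trace) is not None


if __name__ == "__main__":
    for trace in ["hhmhhmh", "hmhm", "mm", "hhhh"]:
        print(trace, satisfies_guarantee(trace))
```

A plain schedulability test would reject every trace containing an "m"; a pattern of this kind accepts occasional, well-separated misses, which is the kind of relaxation the session summary refers to.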

Time	Label	Presentation Title Authors
14:30	7.6.1	(Best Paper Award Candidate) SCALABILITY BOTTLENECKS DISCOVERY IN MPSOC PLATFORMS USING DATA MINING ON SIMULATION TRACES Speakers: Sofiane Lagraa¹, Alexandre Termier² and Frédéric Pétrot¹ ¹Grenoble Institute of Technology, FR; ²University of Joseph Fourier, FR Abstract Nowadays, a challenge faced by many developers is the profiling of parallel applications so that they can scale over more and more cores. This is especially critical for embedded systems powered by Multi-Processor System-on-Chip (MPSoC), where ever more demanding applications have to run smoothly on numerous cores, each with modest computational performance. The reasons for the lack of scalability of parallel applications are numerous, and it can be time consuming for a developer to pinpoint the correct one. In this paper, we propose a fully automatic method which detects the instructions of the code which lead to a lack of scalability. The method is based on data mining techniques exploiting low level execution traces produced by MPSoC simulators. Our experiments show the accuracy of the proposed technique on five different kinds of applications, and how the information reported can be exploited by application developers.
15:00	7.6.2	COMPUTING A LANGUAGE-BASED GUARANTEE FOR TIMING PROPERTIES OF CYBER-PHYSICAL SYSTEMS Speakers: Neil Dhruva, Pratyush Kumar, Georgia Giannopoulou and Lothar Thiele, ETH Zurich, CH Abstract Real-time systems are often guaranteed in terms of schedulability, which verifies whether or not all jobs meet their deadlines. However, such a guarantee can be insufficient in certain applications. In this paper, we propose a method to compute a language-based guarantee which provides a more detailed description of the deadline miss patterns of an observed task. The only requirement of our method is that the timing behavior of the real-time system be modelled by a network of timed automata. We compute the language-based guarantee by constructing an equivalent finite state automaton in an iterative manner, using a counter-example guided procedure. We illustrate the language-based guarantee for two applications: design of a networked control system and scheduling in a mixed criticality system. In both cases, we show that the language-based guarantee leads to a more efficient design than the schedulability guarantee.
15:30	7.6.3	RESOURCE OPTIMIZATION FOR CSDF-MODELED STREAMING APPLICATIONS WITH LATENCY CONSTRAINTS Speakers: Di Liu¹, Jelena Spasic¹, Jiali Teddy Zhai¹, Todor Stefanov¹ and Gang Chen² ¹Leiden University, NL; ²Technical University Munich, DE Abstract In this paper, we study the problem of minimizing the number of processors required for scheduling latency-constrained streaming applications modeled as CSDF graphs, where the actors of a CSDF are executed as strictly periodic tasks. We formalize the problem and prove that due to the strict periodicity of actors the problem is an integer convex programming problem, that can be solved efficiently by using an existing convex programming solver. We evaluate our solution approach on a set of 13 real-life streaming applications modeled as CSDF graphs and demonstrate that it can reduce the number of processors in more than 52% of the conducted experiments in comparison to an existing approach.
16:00	IP3-17, 163	A LAYERED APPROACH FOR TESTING TIMING IN THE MODEL-BASED IMPLEMENTATION Speakers: BaekGyu Kim¹, Hyeon I Hwang², Taejoon Park², Sanghyuk Son² and Insup Lee¹ ¹University of Pennsylvania, US; ²Daegu Gyeongbuk Institute of Science & Technology, KR Abstract Model-based implementation derives an implementation from a model that has been shown to meet requirements. Even though this approach can be used to guarantee that an implementation satisfies functional requirements that are shown to be correct at the model level, it is still challenging to assure timing requirements at the implementation level. We propose a layered approach to testing the timing-requirement conformance of implemented systems developed by model-based implementation. In our approach, the abstraction boundary of the implemented system is formally defined using Parnas' four-variables model. Then, the proposed approach tests timing aspects of the interaction between the auto-generated code and the target platform-dependent code based on the four variables. This approach aims not only at detecting timing requirement violations, but also at measuring the delay segments that contribute to the timing deviation of the implemented system w.r.t. the model. We present a case study of testing the timing requirements of an infusion pump system to illustrate the applicability of the proposed framework.
16:01	IP3-18, 222	MODEL-BASED PROTOCOL LOG GENERATION FOR TESTING A TELECOMMUNICATION TEST HARNESS USING CLP Speakers: Kenneth Balck¹, Olga Grinchtein¹ and Justin Pearson² ¹Ericsson AB, SE; ²Uppsala University, SE Abstract Within telecommunications development it is vital to have frameworks and systems to replay complicated scenarios on equipment under test, but often there are not enough available scenarios. In this paper we study the problem of testing a test harness, which replays scenarios and analyses protocol logs for the Public Warning System service, which is a part of the Long Term Evolution (LTE) 4G standard. Protocol logs are sequences of messages with timestamps and are generated by different mobile network entities. In our case study we focus on user equipment protocol logs. In order to test the test harness we require that logs have both incorrect and correct behaviour. It is easy to collect logs from real system runs, but these logs do not show much variation in the behaviour of the system under test. We present an approach where we use constraint logic programming (CLP) for both modelling and test generation, where each test case is a protocol log. In this case study, we uncovered previously unknown faults in the test harness.
16:02	IP3-19, 294	TIME-DECOUPLED PARALLEL SYSTEMC SIMULATION Speakers: Jan Weinstock¹, Christoph Schumacher¹, Rainer Leupers¹, Gerd Ascheid¹ and Laura Tosoratto² ¹RWTH Aachen, DE; ²Istituto Nazionale di Fisica Nucleare, Sezione di Roma, IT Abstract With increasing system size and complexity, designers of embedded systems face the challenge of efficiently simulating these systems in order to enable target specific software development and design space exploration as early as possible. Today's multicore workstations offer enormous computational power, but traditional simulation engines like the OSCI SystemC kernel only operate on a single thread, thereby leaving a lot of computational potential unused. Most modern embedded system designs include multiple processors. This work proposes SCope, a SystemC kernel that aims at exploiting the inherent parallelism of such systems by simulating the processors on different threads. A lookahead mechanism is employed to reduce the required synchronization between the simulation threads, thereby further increasing simulation speed. The virtual prototype of the European FP7 project EURETILE system simulator is used as demonstrator for the proposed work, showing a speedup of 4.01x on a four core host system compared to sequential simulation.
16:03	IP3-20, 128	A UNIFIED METHODOLOGY FOR A FAST BENCHMARKING OF PARALLEL ARCHITECTURE Speakers: Alexandre Guerre, Jean-Thomas Acquaviva and Yves Lhuillier, CEA LIST, FR Abstract Benchmarking of architectures is today jeopardized by the explosion of parallel architectures and the dispersion of parallel programming models. Parallel programming requires architecture dependent compilers and languages as well as high programmer expertise. Thus, an objective comparison has become a harder task. This paper presents a novel methodology to evaluate and to compare parallel architectures in order to ease the programmer's work. It is based on the usage of micro-benchmarks, code profiling and characterization tools. The main contribution of this methodology is a semi-automatic prediction of the performance for sequential applications on a set of parallel architectures. In addition, the performance estimation is correlated with the cost of other criteria such as power or portability. Our methodology prediction was validated on an industrial application. Results are within a range of 20%.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.7 Design-for-Test and Test Access

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 5

Chair:
Erik Jan Marinissen, IMEC, BE

Co-Chair:
Hans-Joachim Wunderlich, Univ. of Stuttgart, DE

This session covers topics that blend test with fault tolerance, security, and logic placement. As the area of IC test matures the core technology is adapting to the needs of the design and its implementation. As we move forward to advanced nodes in manufacturing the need to tolerate errors could blend with test methods. Test structures provide access to key design IP which is of concern in some situations. Papers in this session address solutions in IC Test for security.

Time	Label	Presentation Title Authors
14:30	7.7.1	BIT-FLIPPING SCAN - A UNIFIED ARCHITECTURE FOR FAULT TOLERANCE AND OFFLINE TEST Speakers: Michael Imhof¹ and Hans-Joachim Wunderlich² ¹Institute of Computer Architecture and Computer Engineering, University of Stuttgart, DE; ²University of Stuttgart, DE Abstract Test has been an essential task since the early days of digital circuits. Every produced chip undergoes at least a production test supported by on-chip test infrastructure to reduce test cost. Throughout the technology evolution fault tolerance gained importance and is now necessary in many applications to mitigate soft errors threatening consistent operation. While a variety of effective solutions exists to tackle both areas, test and fault tolerance are often implemented orthogonally, and hence do not exploit the potential synergies of a combined solution. The unified architecture presented here facilitates fault tolerance and test by combining a checksum of the sequential state with the ability to flip arbitrary bits. Experimental results confirm a reduced area overhead compared to an orthogonal combination of classical test and fault tolerance schemes. In combination with heuristically generated test sequences the test application time and test data volume are reduced significantly.
15:00	7.7.2	TESTING PUF-BASED SECURE KEY STORAGE CIRCUITS Speakers: Mafalda Cortez¹, Gijs Roelofs¹, Said Hamdioui¹ and Giorgio Di Natale² ¹Delft University of Technology, NL; ²LIRMM, FR Abstract Design for test is an integral part of any VLSI chip. However, for secure systems extra precautions have to be taken to prevent the test circuitry from revealing secret information. This paper addresses secure test for Physical Unclonable Function (PUF) based systems. In particular it provides the testability analysis and a secure Built-In Self-Test (BIST) solution for the Fuzzy Extractor (FE), which is the main component of PUF-based systems. The scheme targets high stuck-at-fault (SAF) coverage by performing scan-chain free functional testing, to prevent scan-chain abuse for attacks. The scheme reuses existing FE sub-blocks (for pattern generation and compression) to minimize the area overhead. The scheme is integrated in the FE design and simulated; the results show that a SAF coverage of 95.1% can be realized with no more than 50k clock cycles at the cost of a negligible area overhead of only 2.2%. Higher fault coverage can be realized at extra cost.
15:30	7.7.3	MAKING IT HARDER TO UNLOCK AN LSIB: HONEYTRAPS AND MISDIRECTION IN A P1687 NETWORK Speakers: Adam Zygmontowicz¹, Jennifer Dworak¹, Al Crouch² and John Potter² ¹Southern Methodist University, US; ²ASSET InterTech, US Abstract Today's chips often contain a wealth of embedded instruments and data, including sensors, hardware monitors, built-in self test (BIST) engines, and chip IDs, among others. IEEE P1687 was specifically designed to provide access to such instruments in an efficient manner, and some companies are already implementing the proposed standard on their chips. However, while such instruments provide valuable information and features to authorized users who need to harness them for test, debug, diagnosis, and possibly counterfeit detection, it may be desirable to restrict unauthorized access to them through the P1687 network. Previous work has proposed replacing some of the segment insertion bits (SIBs), which add scan path segments in a P1687 network, with locking SIBs (LSIBs). LSIBs use the data that is naturally scanned through the network as keys to hide instruments from attackers. However, that previous work did not investigate many of the techniques and structures that can be used to significantly increase the time an attacker is likely to need to unlock LSIBs and gain access to hidden instruments. In this work, we explore some of these techniques and show how simple modifications to a P1687 network protected with LSIBs can significantly increase the difficulty an attacker faces in attempting to access protected instruments.
15:45	7.7.4	CO-OPTIMIZATION OF MEMORY BIST GROUPING, TEST SCHEDULING, AND LOGIC PLACEMENT Speakers: Ilgweon Kang and Andrew B. Kahng, UC San Diego, US Abstract Built-in self-test (BIST) is a well-known design technique in which part of a circuit is used to test the circuit itself. BIST plays an important role for embedded memories, which do not have pins or pads exposed toward the periphery of the chip for testing with automatic test equipment. With the rapidly increasing number of embedded memories in modern SOCs (up to hundreds of memories in each hard macro of the SOC), product designers incur substantial costs of test time (subject to possible power constraints) and BIST logic physical resources (area, routing, power). However, only limited previous work addresses the physical design optimization of BIST logic; notably, Chien et al. [7] optimize BIST design with respect to test time, routing length, and area. In our work, we propose a new three-step heuristic approach to minimize test time as well as test physical layout resources, subject to given upper bounds on power consumption. A key contribution is an integer linear programming (ILP) framework that determines optimal test time for a given cluster of memories using either one or two BIST controllers, subject to test power limits and with full comprehension of available serialization and parallelization. Our heuristic approach integrates (i) generation of a hypergraph over the memories, with test time-aware weighting of hyperedges, along with top-down, FM-style min-cut partitioning; (ii) solution of an ILP that comprehends parallel and serial testing to optimize test scheduling per BIST controller; and (iii) placement of BIST logic to minimize routing and buffering costs. When evaluated on hard macros from a recent industrial 28nm networking SOC, our heuristic solutions reduce test time estimates by up to 11.57% with strictly fewer BIST controllers per hard macro, compared to the industrial solutions.
16:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

7.8 Panel: FD-SOI - the Enabling European Technology for Energy Efficient Solutions - Creating a Solution Hive & Design Hub as Eco-System for Future Success

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Exhibition Theatre

Moderator:
Oliver Bringmann, University of Tübingen, DE

Fully-Depleted Silicon On Insulator (FD-SOI) is emerging as a promising solution to continue the CMOS scaling roadmap at the 22nm technology node and beyond, especially for low power and System-on-Chip applications. After a short introduction to the FD-SOI technology, this panel discusses the role of FD-SOI as the key enabling technology to tackle the challenges of the major European application domains. This includes the creation of a European ecosystem to provide easy access for industry and SMEs to a leading-edge semiconductor technology with manageable costs. The panelists take a look at different perspectives and discuss the technology, the SME, the application, the EDA, and the research viewpoints on FD-SOI and its impact on European industry.

16:00

End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

IP3 Interactive Presentations

Date: Wednesday 26 March 2014
Time: 16:00 - 16:30
Location / Room: Conference Level, foyer

Label	Presentation Title Authors
IP3-1	DESIGN AND FABRICATION OF A 315 μH BONDWIRE MICRO-TRANSFORMER FOR ULTRA-LOW VOLTAGE ENERGY HARVESTING Speakers: Enrico Macrelli¹, Ningning Wang², Saibal Roy², Michael Hayes², Rudi Paolo Paganelli³, Marco Tartagni¹ and Aldo Romani¹ ¹DEI, University of Bologna, IT; ²Tyndall National Institute, UCC, IE; ³CNR-IEIIT, University of Bologna, IT Abstract This paper presents a design study of a new topology for miniaturized bondwire transformers fabricated and assembled with standard IC bonding wires and toroidal ferrite (Fair-Rite 5975000801) as a magnetic core. The micro-transformer, realized on a PCB substrate, enables the build of magnetics on-top-of-chip, thus leading to the design of high power density components. Impedance measurements in a frequency range from 100 kHz to 5 MHz show that the secondary self-inductance is enhanced from 0.3 μH with an epoxy core to 315 μH with the ferrite core. Moreover, the micro-machined ferrite improves the coupling coefficient from 0.1 to 0.9 and increases the effective turns ratio from 0.5 to 35. Finally, a low-voltage IC DC-DC converter solution, with the transformer mounted on-top, is proposed for energy harvesting applications.
IP3-2	PROVIDING REGULATION SERVICES AND MANAGING DATA CENTER PEAK POWER BUDGETS Speakers: Baris Aksanli and Tajana Rosing, University of California San Diego, US Abstract Data centers are good candidates for providing regulation services in the power markets due to their large power consumption and flexibility. In this paper, we develop a framework that explores the feasibility of data center participation in these markets. We use a battery-based design that can not only help with providing ancillary services, but can also limit peak power costs without any workload performance degradation. The results of our study using data for a 21MW data center show up to $480,000/year savings can be obtained, corresponding to 1280 more servers providing services.
IP3-3	THE ENERGY BENEFIT OF LEVEL-CROSSING SAMPLING INCLUDING THE ACTUATOR'S ENERGY CONSUMPTION Speakers: Burkhard Hensel and Klaus Kabitzsch, Dresden University of Technology, DE Abstract When using level-crossing (also called send-on-delta) sampling in control loops, messages can be saved compared to periodic sampling without degrading control performance. While it is clear that reducing messages also improves the energy efficiency of battery-powered sensor devices, this can be disadvantageous for the energy efficiency of the actuator device. This paper addresses the question of under which conditions level-crossing sampling is more energy-efficient than periodic sampling for the actuator device as well. It is shown that there is an optimum inter-sample interval. Methods for reaching this optimum by appropriate controller and transmission settings are given. The theory is demonstrated using several known, standardized wireless network protocols.
IP3-4	SKETCHILOG: SKETCHING COMBINATIONAL CIRCUITS Speakers: Andrew Becker, David Novo and Paolo Ienne, École Polytechnique Fédérale de Lausanne, CH Abstract Despite the progress of higher-level languages and tools, Register Transfer Level (RTL) is still by far the dominant input format for high performance digital designs. Experienced designers can directly express their microarchitectural intuitions in RTL. Yet, RTL is terribly verbose, burdened with trivial details, and thus error prone. In this paper, we augment a modern RTL language (Chisel) with new semantic elements to express an imprecise specification: a sketch. We show how, in combination with a naive, unoptimized, but functionally correct reference, a designer can utilize the language and supporting infrastructure to focus on the key design intuition and omit some of the necessary details. The resulting design is exactly or almost exactly as good as the one the designer could have achieved by spending the time to manually complete the sketch. We show that, even limiting ourselves to combinational circuits, realistic instances of meaningful design problems are solved quickly, saving considerable design and debugging effort.
IP3-5	TOWARDS VERIFYING DETERMINISM OF SYSTEMC DESIGNS Speakers: Hoang M. Le and Rolf Drechsler, University of Bremen, DE Abstract Ensuring the correctness of high-level SystemC designs is an important and challenging problem in today's Electronic System Level (ESL) methodology. Prevalently, a design is checked against a functional specification given by e.g. a testcase with reference output or a user-defined property. Another research direction takes the view of a SystemC design as a piece of concurrent software. The design is then checked for common concurrency problems and thus, a functional specification is not required. Along this line, several methods for deadlock detection and race analysis have been developed. In this work, we propose to consider a new concurrency verification problem, namely input-output determinism, for SystemC designs. That means that for each possible input, the design must produce the same output under any valid process schedule. We argue that determinism verification is stronger than both deadlock detection and race analysis. Besides being an attractive correctness criterion itself, proven determinism helps to accelerate both simulative and formal verification. We also present a preliminary study to show the feasibility of determinism verification for SystemC designs.
IP3-6	USING GUIDED LOCAL SEARCH FOR ADAPTIVE RESOURCE RESERVATION IN LARGE-SCALE EMBEDDED SYSTEMS Speaker: Timon ter Braak, University of Twente, NL Abstract To maintain a predictable execution environment, an embedded system must ensure that applications are, in advance, provided with sufficient resources to process tasks, exchange information and to control peripherals. The problem of assigning tasks to processing elements with limited resources, and routing communication channels through a capacitated interconnect is combined into an integer linear programming formulation. We describe a guided local search algorithm to solve this problem at run-time. This algorithm allows for a hybrid strategy where configurations computed at design-time may be used as references to lower the computational overhead at run-time. Computational experiments on a dataset with 100 tasks and 20 processing elements show the effectiveness of this algorithm compared to state-of-the-art solvers CPLEX and Gurobi. The guided local search algorithm finds an initial solution within 100 milliseconds, is competitive for small platforms, scales better with the size of the platform, and has lower memory usage (2-19%).
IP3-7	(Best Paper Award Candidate) ACCELERATING GRAPH COMPUTATION WITH RACETRACK MEMORY AND POINTER-ASSISTED GRAPH REPRESENTATION Speakers: Eunhyek Park¹, Helen Li², Sungjoo Yoo¹ and Sunggu Lee¹ ¹POSTECH, KR; ²Univ. of Pittsburgh, US Abstract The poor performance of NAND Flash memory, such as long access latency and large granularity access, is the major bottleneck of graph processing. This paper proposes an intelligent storage for graph processing which is based on fast and low cost racetrack memory and a pointer-assisted graph representation. Our experiments show that the proposed intelligent storage based on racetrack memory reduces total processing time of three representative graph computations by 40.2%~86.9% compared to the graph processing, GraphChi, which exploits sequential accesses based on normal NAND Flash memory-based SSD. Faster execution also reduces energy consumption by 39.6%~90.0%. The in-storage processing capability gives additional 10.5%~16.4% performance improvements and 12.0%~14.4% reduction of energy consumption.
IP3-8	PSP-CACHE: A LOW-COST FAULT-TOLERANT CACHE MEMORY ARCHITECTURE Speakers: Hamed Farbeh and Seyed Ghassem Miremadi, Sharif University of Technology, IR Abstract Cache memories constitute a large fraction of processor chip area and are highly vulnerable to soft errors caused by energetic particles. To protect these memories, most of the modern processors employ Error Detection Codes (EDCs) or Error Correction Codes (ECCs). EDCs/ECCs impose significant overheads in terms of area and energy; these overheads increase as a function of interleaving EDCs/ECCs to detect/correct multiple errors. This paper proposes a new cache architecture to minimize the area and energy overheads of EDCs/ECCs in set-associative L1-caches. Simulation results for a 4-way set-associative cache show that the proposed architecture reduces both the area and static power overheads of parity code by about 75% and the dynamic energy overhead by about 73% in comparison to conventional cache architecture. These reduction figures are about 68% and about 66%, respectively, for SEC-DED code. The above reductions are achieved without affecting the error coverage.
IP3-9	A HYBRID NON-VOLATILE SRAM CELL WITH CONCURRENT SEU DETECTION AND CORRECTION Speakers: Pilin Junsangsri¹, Fabrizio Lombardi¹ and Jie Han² ¹Northeastern University, US; ²University of Alberta, CA Abstract This paper presents a hybrid non-volatile (NV) SRAM cell with a new scheme for SEU tolerance. The proposed NVSRAM cell consists of a 6T SRAM core and a Resistive RAM (RRAM), made of a 1T and a Programmable Metallization Cell (PMC). The proposed cell has concurrent error detection (CED) and correction capabilities; CED is accomplished using a dual-rail checker, while correction is accomplished by utilizing the restore operation; data from the non-volatile memory element is copied back to the SRAM core. The dual-rail checker utilizes two XOR gates each made of 2 inverters and 2 ambipolar transistors, hence, it has a hybrid nature. Extensive simulation results are provided. The simulation results show that the proposed scheme is very efficient in terms of numerous figures of merit such as delay and circuit complexity and thus applicable to integrated circuits such as FPGAs requiring secure on-chip non-volatile storage (i.e. LUTs) for multi-context configurability.
IP3-10	BATTERY AWARE STOCHASTIC QOS BOOSTING IN MOBILE COMPUTING DEVICES Speakers: Hao Shen, Qiuwen Chen and Qinru Qiu, Syracuse University, US Abstract Mobile computing has been woven into everyday life to a great extent. The usage of mobile devices is clearly imprinted with the user's personal signature. The ability to learn such a signature enables immense potential in workload prediction and resource management. In this work, we investigate user behavior modeling and apply the model to energy management. Our goal is to maximize the quality of service (QoS) provided by the mobile device (i.e., smartphone), while keeping the risk of battery depletion below a given threshold. A Markov Decision Process (MDP) is constructed from historical user behavior. The optimal management policy is solved using linear programming. Simulations based on real user traces validate that, compared to existing battery energy management techniques, the stochastic control performs better in boosting the mobile devices' QoS without significantly increasing the chance of battery depletion.
IP3-11	A THERMAL RESILIENT INTEGRATION OF MANY-CORE MICROPROCESSORS AND MAIN MEMORY BY 2.5D TSI I/OS Speakers: Sih-Sian Wu¹, Kanwen Wang¹, Sai Manoj P. D.¹, Tsung-Yi Ho² and Hao Yu¹ ¹Nanyang Technological University, SG; ²National Cheng Kung University, TW Abstract A memory-logic-integration design platform is developed in this paper with thermal reliability analysis provided for 2.5D through-silicon-interposer (TSI) and 3D through-silicon-via (TSV) based integrations. Temperature-dependent delay and power models have been developed at microarchitecture level for 2.5D and 3D integrations of many-core microprocessors and main memory, respectively. Experiments are performed with general-purpose benchmarks from SPEC CPU2006 and also cloud-oriented benchmarks from Phoenix, with the following observations. The memory-logic integration by 3D RC-interconnected TSV I/Os can result in thermal runaway failures due to strong electrical-thermal couplings. On the other hand, the one by 2.5D transmission-line-interconnected TSI I/Os has shown almost the same energy efficiency and better thermal resilience.
IP3-12	LEVERAGING ON-CHIP NETWORKS FOR EFFICIENT PREDICTION ON MULTICORE COHERENCE Speaker: Libo Huang, National University of Defense Technology, CN Abstract Coherent data prediction is introduced as a promising architectural technique for reducing cache-to-cache accesses in directory protocol. However, limited on-chip resources cause the accuracy of current prediction to be generally low. Low accuracy would result in a large number of unnecessary or incorrect predictions, which would consequently generate excessive network traffic. This leads to large power and performance overhead for coherent memory access. This paper proposes an early abort mechanism (EBT) that leverages NoC design to reduce the negative effect of wrong prediction operations, thus facilitating overall performance improvement and traffic reduction. Using detailed full-system simulations, we conclude that EBT provides a cost-effective solution for designing efficient multicore processors. To the best of our knowledge, this study is the first to leverage on-chip network for the prediction optimization on multicore coherence.
IP3-13	AN ADAPTIVE MEMORY INTERFACE CONTROLLER FOR IMPROVING BANDWIDTH UTILIZATION OF HYBRID AND RECONFIGURABLE SYSTEMS Speakers: Vito Giovanni Castellana¹, Antonino Tumeo² and Fabrizio Ferrandi¹ ¹Politecnico di Milano, DEIB, IT; ²Pacific Northwest National Laboratory, US Abstract Data mining, bioinformatics, knowledge discovery and social network analysis are emerging irregular applications that exploit data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and generally are memory bandwidth bound, but also present large amounts of inherent dynamic parallelism because they can potentially spawn concurrent activities for each one of the elements they are exploring. Hybrid architectures, which integrate general purpose processors with reconfigurable devices, appear to be promising target platforms for accelerating irregular applications. These systems often connect to distributed and multi-ported memories, potentially enabling parallel memory operations. However, these memory architectures introduce several challenges, such as the necessity to manage concurrency and synchronization to avoid structural conflicts on shared memory locations and to guarantee consistency. In this paper we present an adaptive Memory Interface Controller (MIC) that addresses these issues. The MIC is a general and customizable solution that can target several different memory structures, and is suitable for High Level Synthesis frameworks. It implements a dynamic arbitration scheme, which avoids conflicts on memory resources at runtime, and supports atomic memory operations, commonly exploited for synchronization directives in parallel programming paradigms. The MIC simultaneously maps multiple accesses to different memory ports, allowing fine grained parallelism exploitation and ensuring correctness also in the presence of irregular and statically unpredictable memory access patterns. We evaluated the effectiveness of our approach on a typical irregular kernel, graph Breadth First Search (BFS), exploring different design alternatives.
IP3-14	ENERGY EFFICIENT IN-MEMORY AES ENCRYPTION BASED ON NONVOLATILE DOMAIN-WALL NANOWIRE Speakers: Yuhao Wang¹, Pingfan Kong¹, Hao Yu¹ and Dennis Sylvester² ¹Nanyang Technological University, SG; ²University of Michigan, US Abstract The widely applied Advanced Encryption Standard (AES) encryption algorithm is critical in secure big-data storage. Data-oriented applications have imposed high-throughput and low-power, i.e., energy efficiency (J/bit), requirements when applying AES encryption. This paper explores an in-memory AES encryption using the newly introduced domain-wall nanowire. We show that all AES operations can be fully mapped to a logic-in-memory architecture by non-volatile domain-wall nanowire, called DW-AES. The experimental results show that DW-AES can achieve the best energy efficiency of 24 pJ/bit, which is 9X and 6.5X better than CMOS ASIC and ReRAM-CMOL implementations, respectively. Under the same area budget, the proposed DW-AES exhibits 6.4X higher throughput and 29% power saving compared to a CMOS ASIC implementation, and 1.7X higher throughput and 74% power reduction compared to a ReRAM-CMOL implementation.
IP3-15	ICE: INLINE CALIBRATION FOR MEMRISTOR CROSSBAR-BASED COMPUTING ENGINE Speakers: Boxun Li¹, Yu Wang¹, Yiran Chen², Helen Li² and Huazhong Yang¹ ¹Tsinghua University, CN; ²University of Pittsburgh, US Abstract The emerging neuromorphic computation provides a revolutionary solution to the alternative computing architecture and effectively extends Moore's Law. The discovery of the memristor presents a promising hardware realization of neuromorphic systems with incredible power efficiency, allowing efficient execution of analog matrix-vector multiplication on the memristor crossbar architecture. However, during computations, the memristor will slowly drift from its initial programmed state, leading to a gradual decline of the computation precision of the memristor crossbar-based computing engine (MCE). In this paper, we propose an inline calibration mechanism to guarantee the computation quality of the MCE. The inline calibration mechanism collects the MCE's computation error through 'interrupt-and-benchmark' (I&B) operations and predicts the best calibration time through polynomial fitting of the computation error data. We also develop an adaptive technique to adjust the time interval between two neighboring I&B operations and minimize the negative impact of the I&B operation on system performance. The experimental results demonstrate that the proposed inline calibration mechanism achieves a calibration efficiency of 91.18% on average and negligible performance overhead (i.e., 0.439%).
IP3-16	COMPLEMENTARY RESISTIVE SWITCH BASED STATEFUL LOGIC OPERATIONS USING MATERIAL IMPLICATION Speakers: Yuanfan Yang¹, Jimson Mathew¹, Dhiraj K Pradhan¹, Marco Ottavi² and Salvatore Pontarelli² ¹University of Bristol, GB; ²University of Rome "Tor Vergata", IT Abstract Memristor-based logic and memories are increasingly becoming one of the fundamental building blocks for future system design. Hence, it is important to explore various methodologies for implementing these blocks. In this paper, we present novel Complementary Resistive Switching (CRS) based stateful logic operations using material implication. The proposed solution benefits from an exponential reduction in sneak path current in crossbar-implemented logic. We validated the effectiveness of our solution through SPICE simulations on a number of logic circuits. It is shown that only 4 steps are required for implementing an N-input NAND gate, whereas memristor-based stateful logic needs N+1 steps.
IP3-17	A LAYERED APPROACH FOR TESTING TIMING IN THE MODEL-BASED IMPLEMENTATION Speakers: BaekGyu Kim¹, Hyeon I Hwang², Taejoon Park², Sanghyuk Son² and Insup Lee¹ ¹University of Pennsylvania, US; ²Daegu Gyeongbuk Institute of Science & Technology, KR Abstract Model-based implementation derives an implementation from a model that has been shown to meet requirements. Even though this approach can be used to guarantee that an implementation satisfies functional requirements that are shown to be correct at the model level, it is still challenging to assure timing requirements at the implementation level. We propose a layered approach for testing the conformance of implemented systems, developed by model-based implementation, to timing requirements. In our approach, the abstraction boundary of the implemented system is formally defined using Parnas' four-variables model. Then, the proposed approach tests timing aspects of the interaction between the auto-generated code and the target platform-dependent code based on the four variables. This approach aims not only at detecting timing requirement violations, but also at measuring the delay segments that contribute to the timing deviation of the implemented system w.r.t. the model. We present a case study of testing timing requirements of an infusion pump system to illustrate the applicability of the proposed framework.
IP3-18	MODEL-BASED PROTOCOL LOG GENERATION FOR TESTING A TELECOMMUNICATION TEST HARNESS USING CLP Speakers: Kenneth Balck¹, Olga Grinchtein¹ and Justin Pearson² ¹Ericsson AB, SE; ²Uppsala University, SE Abstract Within telecommunications development it is vital to have frameworks and systems to replay complicated scenarios on equipment under test; however, there are often not enough available scenarios. In this paper we study the problem of testing a test harness, which replays scenarios and analyses protocol logs for the Public Warning System service, which is a part of the Long Term Evolution (LTE) 4G standard. Protocol logs are sequences of messages with timestamps and are generated by different mobile network entities. In our case study we focus on user equipment protocol logs. In order to test the test harness we require logs that show both incorrect and correct behaviour. It is easy to collect logs from real system runs, but these logs do not show much variation in the behaviour of the system under test. We present an approach where we use constraint logic programming (CLP) for both modelling and test generation, where each test case is a protocol log. In this case study, we uncovered previously unknown faults in the test harness.
IP3-19	TIME-DECOUPLED PARALLEL SYSTEMC SIMULATION Speakers: Jan Weinstock¹, Christoph Schumacher¹, Rainer Leupers¹, Gerd Ascheid¹ and Laura Tosoratto² ¹RWTH Aachen, DE; ²Istituto Nazionale di Fisica Nucleare, Sezione di Roma, IT Abstract With increasing system size and complexity, designers of embedded systems face the challenge of efficiently simulating these systems in order to enable target specific software development and design space exploration as early as possible. Today's multicore workstations offer enormous computational power, but traditional simulation engines like the OSCI SystemC kernel only operate on a single thread, thereby leaving a lot of computational potential unused. Most modern embedded system designs include multiple processors. This work proposes SCope, a SystemC kernel that aims at exploiting the inherent parallelism of such systems by simulating the processors on different threads. A lookahead mechanism is employed to reduce the required synchronization between the simulation threads, thereby further increasing simulation speed. The virtual prototype of the European FP7 project EURETILE system simulator is used as demonstrator for the proposed work, showing a speedup of 4.01x on a four core host system compared to sequential simulation.
IP3-20	A UNIFIED METHODOLOGY FOR A FAST BENCHMARKING OF PARALLEL ARCHITECTURE Speakers: Alexandre Guerre, Jean-Thomas Acquaviva and Yves Lhuillier, CEA LIST, FR Abstract Benchmarking of architectures is today jeopardized by the explosion of parallel architectures and the dispersion of parallel programming models. Parallel programming requires architecture-dependent compilers and languages as well as high programmer expertise. Thus, an objective comparison has become a harder task. This paper presents a novel methodology to evaluate and to compare parallel architectures in order to ease the programmer's work. It is based on the usage of micro-benchmarks, code profiling and characterization tools. The main contribution of this methodology is a semi-automatic prediction of the performance for sequential applications on a set of parallel architectures. In addition, the performance estimation is correlated with the cost of other criteria such as power or portability. Our methodology's prediction was validated on an industrial application. Results are within a range of 20%.
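
As a point of reference for the IP3-16 abstract above, the N+1-step baseline quoted for memristor-based stateful logic can be sketched in a few lines of Python. This is only an illustration of the textbook material-implication (IMPLY) construction of NAND with a single work cell, not the CRS-based four-step scheme of the paper; the function names `imply` and `nand_via_imply` are hypothetical.

```python
def imply(p: int, q: int) -> int:
    """Material implication p -> q, i.e. (NOT p) OR q, on bits 0/1."""
    return (1 - p) | q


def nand_via_imply(inputs):
    """Compute NAND of the input bits using IMPLY steps on one work cell.

    Returns (result, steps). Step 1 resets the work cell to 0; each
    further step overwrites it with (input IMPLY work cell).
    """
    work = 0   # step 1: unconditional reset of the work cell to FALSE
    steps = 1
    for bit in inputs:
        work = imply(bit, work)   # work becomes work OR (NOT bit)
        steps += 1
    return work, steps


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, nand_via_imply([a, b]))
```

For N inputs the loop performs one reset plus N implication steps, which is where the N+1 count comes from.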

UB08 Session 8

Date: Wednesday 26 March 2014
Time: 16:00 - 18:00
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB08.01	VIDEO-BASED ABSOLUTE NAVIGATION APPROACH: A NOVEL APPROACH FOR VIDEO-BASED ABSOLUTE NAVIGATION IN SPACE EXPLORATION MISSIONS Authors: Pascal Trotta, Tadewos Getahun Tadewos, Paolo Prinetto, Daniele Rolfo and Pascal Trotta, Politecnico di Torino, IT Abstract Nowadays, space agencies have increased their research efforts in order to enhance the success rate of space exploration missions. Future space missions will increasingly adopt Video Based Navigation (VBN) systems to assist the entry, descent and landing (EDL) phase of space modules. This poster will show a preliminary work on a novel approach for Video-based Absolute Navigation (VBAN). Moreover, the poster depicts how a VBAN processing chain can exploit FPGA devices to achieve high throughput. Several visual results will be shown to highlight the peculiarities of the proposed approach.
UB08.02	HIPACC: AUTOMATIC GPU CODE GENERATION FOR ANDROID Authors: Oliver Reiche¹, Richard Membarth², Frank Hannig¹ and Jürgen Teich¹ ¹University of Erlangen-Nuremberg, DE; ²Saarland University, DE Abstract We present the Heterogeneous Image Processing Acceleration (HIPAcc) framework. It allows programmers to develop image preprocessing applications while providing high productivity, flexibility, and portability as well as competitive performance. The same algorithm description serves as the basis for targeting different GPU accelerators and low-level languages. Imaging algorithms can thus be expressed in a compact and productive way by using a domain-specific language (DSL) that is embedded into C++ code. Using the HIPAcc source-to-source compiler, DSL code is compiled to CUDA, OpenCL, C/C++, or even Renderscript code, which targets heterogeneous architectures on recent MPSoCs running Android. Programming those MPSoCs can be challenging, in particular when targeting different architectures (CPU/GPU/DSP). HIPAcc lifts this burden from programmers by automatically applying source code transformations based on domain knowledge and a built-in architecture model. This demonstration shows the seamless integration of HIPAcc into the Android Developer Tools and provides a live comparison of generated code to functionally identical handwritten naive implementations of image filters on recent MPSoCs running Android.
UB08.04	GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES Authors: Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT Abstract Gemini is a synthesis and optimization software for graphene-based digital devices. Given a combinational circuit description through its Boolean representation, Gemini produces a SPICE netlist mapped with graphene PN-Junction gates. The software is composed of a parser library to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library used to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. As stand-alone software or as a library that is easy to integrate into state-of-the-art tools, Gemini represents a first step of an enabling technology for future synthesis and optimization processes for graphene-based devices.
UB08.05	TOMAHAWK2: PERFORMANCE IMPACT OF INSTRUCTION SET ARCHITECTURE EXTENSIONS FOR DYNAMIC TASK SCHEDULING UNITS Author: Oliver Arnold, Technische Universität Dresden, DE Abstract In this demo a heterogeneous MPSoC is controlled by a dynamic task scheduling unit called CoreManager. The instruction set architecture of this unit has been extended to improve performance for dynamic data dependency checking, task scheduling, processing element allocation and data transfer management. The MPSoC as well as the NoC are integrated in a cycle-accurate virtual system prototype. The performance impact of the CoreManager is analyzed on component as well as on system level.
UB08.06	LEGO: TOOLS FOR HYBRID INTEGRATION Author: Fredrik Jonsson, Royal Institute of Technology, SE Abstract Performance of printed devices depends on the geometry, but is also affected by processing steps of other components integrated onto the same substrate. Since different designs use different devices, process stack, models and design rules must be dynamically determined. In this work we propose and demonstrate an experimental design flow to allow efficient design of hybrid and printed electronic circuits.
UB08.07	UVM-SYSTEMC-AMS: UVM STANDARD-COMPLIANT SYSTEMC (AMS)-BASED VERIFICATION FRAMEWORK FOR HETEROGENEOUS SYSTEMS Authors: Zhi Wang¹, Yao Li², Marie-Minerve Louerat², Francois Pecheux², Martin Barnasconi³, Thilo Vörtler⁴ and Karsten Einwich⁴ ¹Laboratoire d'informatique de Paris 6, FR; ²UPMC-LIP6, FR; ³NXP, NL; ⁴Fraunhofer IIS, DE Abstract Today's societal needs for innovative products in terms of communication, mobility, health, entertainment, and safety directly impact microelectronics design methodologies. The embedded systems are simultaneously software-driven, digitally assisted, complex and heterogeneous, but existing verification methodologies are mostly focused on pure digital devices and are completely decoupled from analog verification. This presentation shows how the principles of the new UVM methodology can be soundly enhanced to offer to the test designer a flexible framework for the virtual prototyping of multi-discipline testbenches that supports both digital and Analog Mixed-Signal (AMS) at the architectural level.
18:00	End of session
19:30	DATE Party in "Gläserne Manufaktur" of the Volkswagen AG The DATE Party is again scheduled on the second conference day, Wednesday, March 26, 2014, starting from 19:30 h. This year, it will take place in one of Dresden's most exciting and modern buildings, the "Gläserne Manufaktur" of the car manufacturer Volkswagen AG (www.glaesernemanufaktur.de/en/). The party will feature a flying buffet style dinner with various catering points and accompanying drinks. Light background music and the possibility of guided visits through the extraordinary premises will round off the evening. It provides a perfect opportunity to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities. Please note that this is not a seated dinner. All delegates, exhibitors and their guests are encouraged to attend the party. Please be aware that entrance is only possible with a party ticket. Each full conference registration includes a ticket for the DATE Party. Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Ticket price for the full Evening Social Programme: 75 € per person.

8.1 SPECIAL DAY System Simulation and Virtual Prototyping

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Saal 1

Organiser:
Johannes Stahl, Synopsys, US

Chair:
Johannes Stahl, Synopsys, US

In this session we will review several practical applications of virtual prototyping for architecture design work and software development across different markets such as mobile, industrial and automotive. The authors will share their practical experiences in using the virtual prototyping methodology and current commercial tools.

Time	Label	Presentation Title Authors
17:00	8.1.1	POWER MODELING AND ANALYSIS IN EARLY DESIGN PHASES Speakers: Bernhard Fischer, Christian Cech and Hannes Muhr, Siemens, AT Abstract Low power consumption of electronic devices has been an important requirement for many cyber-physical systems in the field. Today, power dissipation is often estimated by spreadsheet-based power analysis. A leading-edge high-level power analysis method has the objective of providing high confidence levels in early design stages, where power design decisions have severe impact. This work examines and compares three high-level power analysis approaches (spreadsheet-based, Synopsys Platform Architect MCO, and DOCEA Aceplorer) using an industrial use case.
17:30	8.1.2	SYSTEM-LEVEL DESIGN METHODOLOGY ENABLING FAST DEVELOPMENT OF BASEBAND MP-SOC FOR 4G SMALL CELL BASE STATION Speakers: Shan Tang, Zhu Ziyuan and Yongtao Su, Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract "Small Cell" is regarded as the solution to optimize 4G wireless networks with improved coverage and capacity, and is expected to be deployed in large numbers. To meet performance requirements and special constraints on cost and size, we design a heterogeneous multi-processor SoC for small cell base stations, which is composed of ASP (Application Specific Processor) cores, hardware accelerators, a general-purpose processor core, and infrastructure and interface blocks. The challenges of developing such a complex chip drive us to employ a system-level design methodology in both single-core and multi-core architecture optimizations. The paper discusses in detail the LISA (Language for Instruction-Set Architectures)/SystemC based ASP-algorithm joint optimization, and task-graph driven multi-core architecture exploration. Finally, the results of silicon implementation on SMIC 55nm technology are presented.
18:00	8.1.3	VIRTUAL PROTOTYPE LIFE CYCLE IN AUTOMOTIVE APPLICATIONS Speaker: Manfred Thanner, Freescale, Germany, DE Abstract Virtual prototypes for automotive applications see a unique life cycle in the context of the supply chain from semiconductor to Tier1 to OEMs and within the eco-system. The presentation gives an overview of current experiences and findings in this field and the challenges observed. The virtual platforms target the mid to high end application spaces of chassis, powertrain and driver information systems. The use cases today primarily address semiconductor-internal developments and Tier1 level deployment. Additionally, different software vendors use the models in their development cycle, which drives model requirements like stimulus and abstraction levels. The development of virtual prototypes often starts with the reuse of existing cores, accelerators and IP models. These models had certain use cases to address and were created accordingly. Therefore the models sometimes do not fully match the requirements of the overall virtual prototype, and compromises are made. Further to this, models are often from different design centers, vendors, etc. This can lead to conflicts between model features and the primary use case requirements of the virtual platform for the intended usage. Examples are cycle accuracy vs. functional, correct behavior vs. error behavior and error injection. The virtual platform life cycle is also affected by the availability and integration of 3rd party IP models, which adds commercial terms and license dependencies. Further to this, the virtual prototypes need to be integrated or connected to the EDA environments of the "receiving companies". In the deployment phase of the virtual prototype within the automotive eco-system a supply chain needs to be in place. This creates challenges in terms of model interfaces, tool compatibility and integration, and the support chain.
18:30		End of session
19:30		DATE Party in "Gläserne Manufaktur" of the Volkswagen AG

8.2 Hot Topic: Near Threshold Computing (NTC)

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 6

Organiser:
Michael Huebner, Ruhr-University Bochum, DE

Chair:
Michael Huebner, Ruhr-University Bochum, DE

To face the power/utilization wall, Near-Threshold Computing (NTC) has emerged as one of the most promising approaches to achieve an improvement of an order of magnitude or more in the energy efficiency of microprocessors and reconfigurable hardware. NTC takes advantage of the quadratic relation between the supply voltage (Vdd) and the dynamic power, by lowering the supply voltage of chips to a value only slightly higher than the threshold voltage.
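
The quadratic relation mentioned above can be made concrete with a small calculation. The sketch below assumes the usual first-order CMOS switching-power estimate P = alpha * C * Vdd^2 * f and ignores leakage and the frequency reduction that comes with a lower supply voltage; the function names and the example voltages are illustrative assumptions, not figures from the session.

```python
def dynamic_power(alpha: float, c_eff: float, vdd: float, freq: float) -> float:
    """First-order CMOS switching power: alpha * C * Vdd^2 * f (in watts)."""
    return alpha * c_eff * vdd ** 2 * freq


def dynamic_power_ratio(vdd_nominal: float, vdd_scaled: float) -> float:
    """Dynamic power at vdd_scaled relative to vdd_nominal, all else equal."""
    return (vdd_scaled / vdd_nominal) ** 2


if __name__ == "__main__":
    # Hypothetical example: halving the supply from 1.0 V to 0.5 V.
    print(dynamic_power_ratio(1.0, 0.5))
```

Under these assumptions, halving Vdd cuts the dynamic power to one quarter at a fixed clock frequency. In practice the achievable frequency also falls near threshold, so the net energy gain depends on the workload.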

Time	Label	Presentation Title Authors
17:00	8.2.1	EXTREME-SCALE COMPUTER ARCHITECTURE: ENERGY EFFICIENCY FROM THE GROUND UP Speaker: Josep Torrellas, University of Illinois Urbana Champaign, US Abstract As we move to integration levels of 1,000-core processor chips, it is clear that energy and power consumption are the most formidable obstacles. To construct such a chip, we need to rethink the whole compute stack from the ground up for energy efficiency, and attain Extreme-Scale Computing. First of all, we want to operate at low voltage, since this is the point of maximum energy efficiency. Unfortunately, in such an environment, we have to tackle substantial process variation. Hence, it is important to design efficient voltage regulation, so that each region of the chip can operate at the most efficient voltage and frequency point. At the architecture level, we require simple cores organized in a hierarchy of clusters. Moreover, we also need techniques to reduce the leakage of on-chip memories and to lower the voltage guardbands of logic. Finally, data movement should be minimized, through both hardware and software techniques. With a systematic approach that cuts across multiple layers of the computing stack, we can deliver the required energy efficiencies.
17:30	8.2.2	VOLTAGE ISLAND MANAGEMENT IN NEAR THRESHOLD MANYCORE ARCHITECTURES TO MITIGATE DARK SILICON Speakers: Cristina Silvano¹, Gianluca Palermo¹, Sotirios Xydis² and Ioannis Stamelakos¹ ¹Politecnico di Milano, IT; ²National Technical University of Athens, GR Abstract The power-wall problem, driven by the stagnation of supply voltages in deep-submicron technology nodes, is now the major scaling barrier for moving towards the manycore era. Although technology scaling enables extreme volumes of computational power, power budget violations will permit only a limited portion to be actually exploited, leading to the so-called dark silicon. Near-Threshold voltage Computing (NTC) has emerged as a promising approach to overcome the manycore power-wall, at the expense of reduced performance and higher sensitivity to process variations. Given that several application domains operate under specific performance constraints, performance sustainability is considered a major issue for the wide adoption of NTC. Thus, in this paper, we investigate how performance guarantees can be ensured when moving towards NTC manycores through variability-aware voltage and frequency allocation schemes. We propose three aggressive NTC voltage tuning and allocation strategies, showing that STC performance can be efficiently sustained or even optimized in the NTC regime. Finally, we show that NTC highly depends on the underlying workload characteristics, delivering average power gains of 65% for thread-parallel workloads and up to 90% for process-parallel workloads, while offering an extensive analysis of the effects of different voltage tuning/allocation strategies and voltage regulator configurations.
18:00	8.2.3	RESOLVING THE MEMORY BOTTLENECK FOR SINGLE SUPPLY NEAR-THRESHOLD COMPUTING Speakers: Tobias Gemmeke¹, Mohamed Sabry², Jan Stuijt¹, Praveen Raghavan³, Francky Catthoor³ and David Atienza² ¹Holst-Centre / imec, NL; ²ESL-EPFL, CH; ³imec, BE Abstract This paper focuses on state-of-the-art memory designs for NTC. It presents new ways to design reliable low-voltage memories cost-effectively by reusing available cell libraries, or by adding a digital wrapper around existing commercially available memories. The approach is based on modeling at system level supported by silicon measurement on a test chip in a 40nm low-power processing technology. Advanced monitoring, control and run-time error mitigation schemes enable the operation of these memories at the same optimal near-Vt voltage level as the digital logic. Reliability degradation is thus overcome and this opens the way to solve the memory bottleneck in NTC systems. Starting from the available silicon measurements, the analysis is extended to future 14 and 10 nm technology nodes.
18:30		End of session
19:30		DATE Party in "Gläserne Manufaktur" of the Volkswagen AG

8.3 Physical Attacks and countermeasures

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 1

Chair:
Francesco Regazzoni, Alari, CH

Co-Chair:
Shivam Bhasin, Telecom Paristech, FR

Physical attacks are a major security threat for embedded system applications. This session focuses on several aspects of this problem. The presented papers cover countermeasures against power analysis and fault-based attacks, including electromagnetic and laser injection.

Time	Label	Presentation Title Authors
17:00	8.3.1	EFFICIENCY OF A GLITCH DETECTOR AGAINST ELECTROMAGNETIC FAULT INJECTION Speakers: Loic Zussa¹, Amine Dehbaoui¹, Karim Tobich², Jean-Max Dutertre¹, Philippe Maurine², Ludovic Guillaume-Sage², Jessy Clediere³ and Assia Tria³ ¹ENSM-SE, FR; ²LIRMM, FR; ³CEA, FR Abstract The use of electromagnetic glitches has recently emerged as an effective fault injection technique for the purpose of conducting physical attacks against integrated circuits. First research works have shown that electromagnetic faults are induced by timing constraint violations and that they are also located in the vicinity of the injection probe. This paper reports the study of the efficiency of a glitch detector against EM injection. This detector was originally designed to detect any attempt at inducing timing violations by means of clock or power glitches. Because electromagnetic disturbances are more local than global, the use of a single detector proved to be inefficient. Our subsequent investigation of the use of several detectors to obtain full fault detection coverage is reported; it also provides further insights into the properties of electromagnetic injection and into the key role played by the injection probe.
17:30	8.3.2	ANALYZING AND ELIMINATING THE CAUSES OF FAULT SENSITIVITY ANALYSIS Speakers: Nahid Farhady Ghalaty, Aydin Aysu and Patrick Schaumont, Virginia Tech, US Abstract Fault Sensitivity Analysis (FSA) is a new type of side-channel attack that exploits the relation between the sensitive data and the faulty behavior of a circuit, the so-called fault sensitivity. This paper analyzes the behavior of different implementations of AES S-box architectures against FSA, and proposes a systematic countermeasure against this attack. This paper has two contributions. First, we study the behavior and structure of several S-box implementations, to understand the causes behind the fault sensitivity. We identify two factors: the timing of fault sensitive paths, and the number of logic levels of fault sensitive gates within the netlist. Next, we propose a systematic countermeasure against FSA. The countermeasure masks the effect of these factors by intelligent insertion of delay elements. We evaluate our methodology by means of an FPGA prototype with built-in timing-measurement. We show that FSA can be thwarted at low hardware overhead. Compared to earlier work, our method operates at the logic-level, is systematic, and can be easily generalized to bigger circuits.
18:00	8.3.3	A SMALLER AND FASTER VARIANT OF RSM Speakers: Noritaka Yamashita, Kazuhiko Minematsu, Toshihiko Okamura and Yukiyasu Tsunoo, NEC, JP Abstract Masking is one of the major countermeasures against side-channel attacks on cryptographic modules. Nassar et al. recently proposed a highly efficient masking method, called Rotating S-boxes Masking (RSM), which can be applied to a block cipher based on a Substitution-Permutation Network. It arranges multiple masked S-boxes in parallel, which are rotated in each round. This rotation requires a remasking process in each round to adjust the current masks to those of the S-boxes. In this paper, we propose a method for further reducing the complexity of RSM by omitting the remasking process when the linear diffusion layer of the encryption algorithm has a certain algebraic property. Our method can be applied to AES with reduced complexity compared to RSM, while keeping an equivalent security level.
18:30	IP4-1, 140	A MULTIPLE FAULT INJECTION METHODOLOGY BASED ON CONE PARTITIONING TOWARDS RTL MODELING OF LASER ATTACKS Speakers: Athanasios Papadimitriou¹, David Hely¹, Vincent Beroulle¹, Paolo Maistri² and Regis Leveugle³ ¹LCIS Laboratory - Grenoble INP, FR; ²TIMA Laboratory / CNRS, FR; ³TIMA Laboratory / Grenoble INP, FR Abstract Laser attacks, especially on circuits manufactured with recent deep submicron semiconductor technologies, pose a threat to secure integrated circuits due to the multiplicity of errors induced by a single attack. An efficient way to neutralize such effects is the design of appropriate countermeasures, according to the circuit implementation and characteristics. Therefore tools which allow the early evaluation of security implementations are necessary. Our efforts involve the development of an RTL fault injection approach more representative of laser attacks than random multi-bit fault injections and the utilization and evolution of state of the art emulation techniques to reduce the duration of the fault injection campaigns. This will ultimately lead to the design and validation of new countermeasures against laser attacks, on ASICs implementing cryptographic algorithms.
18:30		End of session
19:30		DATE Party in "Gläserne Manufaktur" of the Volkswagen AG The DATE Party is again scheduled on the second conference day, Wednesday, March 26, 2014, starting from 19:30 h. This year, it will take place in one of Dresden's most exciting and modern buildings, the "Gläserne Manufaktur" of the car manufacturer Volkswagen AG (www.glaesernemanufaktur.de/en/). The party will feature a flying buffet style dinner with various catering points and accompanying drinks. Light background music and the possibility of guided visits through the extraordinary premises will round off the evening. It provides a perfect opportunity to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities. Please kindly note that it is no seated dinner. All delegates, exhibitors and their guests are encouraged to attend the party. Please be aware that entrance is only possible with a party ticket. Each full conference registration includes a ticket for the DATE Party. Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Ticket price for the full Evening Social Programme: 75 € per person.

8.4 Efficient Designs for Telecom and Financial Applications

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 2

Chair:
Sergio Saponara, University of Pisa, IT

Co-Chair:
Amer Baghdadi, Telecom Bretagne, FR

The session presents energy- and performance-efficient implementations of wireless communication and financial applications.

Time	Label	Presentation Title Authors
17:00	8.4.1	(Best Paper Award Candidate) ENERGY EFFICIENT MIMO PROCESSING: A CASE STUDY OF OPPORTUNISTIC RUN-TIME APPROXIMATIONS Speakers: David Novo¹, Nazanin Farahpour¹, Ubaid Ahmad², Francky Catthoor² and Paolo Ienne¹ ¹EPFL, CH; ²IMEC, BE Abstract Worst-case design is one of the keys to practical engineering: create solutions that can withstand the most adverse possible conditions. Yet, the ever-growing need for higher energy efficiency suggests a grim outlook for worst-case design in the future. In this paper, we propose opportunistic run-time approximations to enable a continuous adaptation of the processing precision (operator type and bitwidth) to the actual execution context without modifying the algorithm functionality. We show that by relaxing the processing precision whenever possible, a VLSI implementation of an advanced wireless receiver algorithm based on opportunistic run-time approximations can save about 40% of the energy consumed by an optimized static implementation. These energy savings are achieved at the expense of a slight increase in overall chip area.
17:30	8.4.2	ENERGY-EFFICIENT FPGA IMPLEMENTATION FOR BINOMIAL OPTION PRICING USING OPENCL Speakers: Valentin Mena Morales¹, Pierre-Henri Horrein¹, Erik Hochapfel², Sandrine Vaton³ and Amer Baghdadi¹ ¹Institut Mines-Telecom; Telecom Bretagne; Lab-STICC, FR; ²ADACSYS, FR; ³Institut Mines-Telecom; Telecom Bretagne; IRISA, FR Abstract Energy efficiency of financial computations is a performance criterion that can no longer be dismissed, and is as crucial as raw acceleration and accuracy of the solution. In order to reduce the energy consumption of financial accelerators, FPGAs offer a good compromise with low power consumption and high parallelism. However, designing and prototyping an application on an FPGA-based platform are typically very time-consuming and require significant skills in hardware design. This issue constitutes a major drawback with respect to software-centric acceleration platforms and approaches. A high-level approach has been chosen, using Altera's implementation of the OpenCL standard, to answer this issue. We present two implementations on such a target of the binomial option pricing model on American options. The results obtained on a Terasic DE4 - Stratix IV board form a solid basis to hold all the constraints necessary for a real world application. The best implementation can evaluate more than 2000 options/s with an average power of less than 20W.
18:00	8.4.3	HARDWARE IMPLEMENTATION OF A REED-SOLOMON SOFT DECODER BASED ON INFORMATION SET DECODING Speakers: Stefan Scholl and Norbert Wehn, TU Kaiserslautern, DE Abstract Soft decision decoding of Reed-Solomon codes can largely improve frame error rates over currently used hard decision decoding. In this paper, we present a new hardware implementation for soft decoding of Reed-Solomon codes based on information set decoding. To the best of our knowledge, this is the first hardware implementation of information set decoding for long Reed-Solomon codes. We propose a reduced complexity version of the decoding algorithm that is optimized for efficient hardware implementation and enables high throughput. The decoder was implemented on a Virtex 7 FPGA, achieving a gain of 0.75 dB compared to conventional hard decision decoding and a throughput of up to 1.19 GBit/s for the widely used RS(255,239). This gain in FER is achieved with less complexity and more than 15x larger throughput than other state-of-the-art architectures.
18:15	8.4.4	AMBIENT VARIATION-TOLERANT AND INTER COMPONENTS AWARE THERMAL MANAGEMENT FOR MOBILE SYSTEM ON CHIPS Speakers: Francesco Paterna¹, Joe Zanotelli² and Tajana Rosing¹ ¹University of California, San Diego, US; ²Qualcomm Inc., US Abstract In this work we measure and study two key aspects of the thermal behavior of smartphones: 1) thermal interaction between the components on the printed circuit board and 2) the influence of the phone's ambient temperature, which is subject to large variations. The measurements on the smartphone running typical workloads show that the heat generated by the communication subsystem and the high temperatures on the back cover of the phone can increase the SoC temperature by as much as 17 °C. None of the run-time thermal management studies presented to date considered this interaction, as there was no model available. We design a thermal model that captures this thermal dependency and a policy able to avoid thermal emergencies while minimizing the impact on performance.
18:30	IP4-2, 183	ENERGY EFFICIENT DATA FLOW TRANSFORMATION FOR GIVENS ROTATION BASED QR DECOMPOSITION Speakers: Namita Sharma¹, Preeti Ranjan Panda¹, Min Li², Prashant Agrawal² and Francky Catthoor² ¹Indian Institute of Technology Delhi, IN; ²IMEC, BE Abstract QR Decomposition (QRD) is a typical matrix decomposition algorithm that shares many common features with other algorithms such as LU and Cholesky decomposition. The principle can be realized in a large number of valid processing sequences that differ significantly in the number of memory accesses and computations, and hence, the overall implementation energy. With modern low power embedded processors evolving towards register files with wide memory interfaces and vector functional units (FUs), the data flow in matrix decomposition algorithms needs to be carefully devised to achieve energy efficient implementation. In this paper, we present an efficient data flow transformation strategy for the Givens Rotation based QRD that optimizes data memory accesses. We also explore different possible implementations for QRD of multiple matrices using the SIMD feature of the processor. With the proposed data flow transformation, a reduction of up to 36% is achieved in the overall energy over conventional QRD sequences.
18:31	IP4-3, 781	MODE-CONTROLLED DATAFLOW BASED MODELING & ANALYSIS OF A 4G-LTE RECEIVER Speakers: Hrishikesh Salunkhe¹, Orlando Moreira² and Kees van Berkel³ ¹PhD Candidate, NL; ²Principal DSP Systems Engineer, NL; ³Prof. Dr., NL Abstract Today's smartphones and tablets contain multiple cellular modems to support 2G/3G/4G standards, including Long Term Evolution (LTE). They run on complex multi-processor hardware platforms and have to meet hard real-time constraints. Dataflow modeling can be used to design an LTE receiver. Static dataflow allows a rich set of analysis techniques, but is too restrictive to model the dynamic behavior in many realistic applications, including LTE receivers. Dynamic dataflow allows modeling of many realistic applications, but does not support rigorous temporal analysis. Mode-Controlled Dataflow (MCDF) is a restricted form of dynamic dataflow, and allows the same analysis techniques as static dataflow, in principle. We prove that MCDF is sufficiently expressive to handle the dynamic behavior of a realistic LTE receiver, by systematically and stepwise developing a complete MCDF model for an LTE receiver.
18:30		End of session

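As a rough illustration of the binomial option pricing model named in presentation 8.4.2, the following minimal Python sketch prices an American put on a recombining binomial tree. It assumes the textbook Cox-Ross-Rubinstein parameterisation, which the abstract does not specify, and it is not the authors' OpenCL/FPGA implementation; the function name and all parameter names are hypothetical.

```python
import math


def american_put_binomial(spot, strike, rate, sigma, maturity, steps):
    """Price an American put on a recombining binomial tree.

    Assumption: Cox-Ross-Rubinstein up/down factors (not stated in the abstract).
    """
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))   # multiplicative up move
    down = 1.0 / up                        # multiplicative down move
    disc = math.exp(-rate * dt)            # one-step discount factor
    p = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability

    # Option values at maturity; node j has seen j up moves and steps - j down moves.
    values = [max(strike - spot * up ** (2 * j - steps), 0.0)
              for j in range(steps + 1)]

    # Backward induction: at each node keep the larger of the discounted
    # continuation value and the immediate exercise value.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            continuation = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            exercise = strike - spot * up ** (2 * j - i)
            values[j] = max(continuation, exercise)
    return values[0]


if __name__ == "__main__":
    print(american_put_binomial(100.0, 100.0, 0.05, 0.2, 1.0, 200))
```

Each option needs on the order of steps² node updates, and the nodes within one time level are independent of each other, which is the kind of parallelism an FPGA or other accelerator can exploit.
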
8.5 Modeling & Specification

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 3

Chair:
Wolfgang Mueller, University of Paderborn, DE

Co-Chair:
Francois Pecheux, UPMC, FR

The first presentation proposes an analytical model to estimate the contention and the resulting delays on accessing shared components in a multi-core environment. In order to find the right granularity for design space exploration, the second presentation provides an algorithm for automatic aggregation of design blocks based upon their static computation demands. Finally, the last presentation proposes a novel formal notation for reactive system requirements in order to reduce translational efforts and thus make specifications both easier and quicker to create.

Time	Label	Presentation Title Authors
17:00	8.5.1	AN ACTIVITY-SENSITIVE CONTENTION DELAY MODEL FOR HIGHLY EFFICIENT DETERMINISTIC FULL-SYSTEM SIMULATIONS Speakers: Shu-Yung Chen, Chien-Hao Chen and Ren-Song Tsay, The Department of Computer Science National Tsing Hua University, Taiwan, TW Abstract As modern systems integrate an ever-increasing number of components for better performance and functionality, early full-system simulation tools have become essential for validating complex concurrent system interaction activities. In the past decades, many useful timing-accurate system simulation tools have been developed; however, we find that even for the most efficient techniques, more than 90% of overhead occurs when simulating shared devices, such as buses. Instead of adopting the constant-delay model that compromises accuracy or using the time-consuming precise scheduling approach, we propose in this paper an effective system activity-sensitive contention delay model that can dynamically capture runtime contention situations and system configuration changes. To verify the idea, we construct an analytical bus delay model and integrate that into a system simulation tool. The experimental results show 20 to 80 times performance improvement over the scheduling-based bus model on full-system simulations and the estimated timing difference is less than 3%.
17:30	8.5.2	AUTOMATIC SPECIFICATION GRANULARITY TUNING FOR DESIGN SPACE EXPLORATION Speakers: Jiaxing Zhang and Gunar Schirner, Northeastern University, US Abstract Algorithm Design Environments (ADE), such as Simulink, have been shown to be efficient for development, analysis, and evaluation of algorithms. Recent tools propose to facilitate algorithm / architecture co-design by bridging the gap from ADE to System-Level Design Environments (SLDE) through automatic synthesis from algorithm models to SLDL specifications. With the wide range of block characteristic (from simple logic functions to complex kernels) in the algorithm model, however, it is challenging to select a suitable compositional granularity for SLD Language (SLDL) blocks in the synthesized specification. A high volume of SLDL blocks of little computation will increase the number of mapping possibilities, whereas large blocks with heavy computation on the other hand allow inter-block fusion reducing the computational demands in the overall specification yet sacrificing the mapping flexibility. In this paper, we introduce an automatic specification granularity tuning mechanism to determine the granularity in the synthesized specification model hierarchy guided by the computational demands of algorithm blocks. Our granularity selection significantly simplifies the early design space exploration as only a meaningful block decomposition is exposed in the synthesized specification. It leads to an overall system with less computational demands by leveraging the block fusion capabilities in the ADE. At the same time our granularity decision ensures that sufficient flexibility remains in the system for exploring heterogeneous mapping of the algorithm. Our results on real world examples show that specification models can be synthesized with 80% efficiency through block fusion with 70-90% fewer but coarser grained blocks.
18:00	8.5.3	EDT: A SPECIFICATION NOTATION FOR REACTIVE SYSTEMS Speakers: Murali Krishna Goldsmith, Venkatesh R, Ulka Shrotri and Supriya Agrawal, Tata Research Development and Design Centre, Tata Consultancy Services Limited, IN Abstract Requirements of reactive systems express the relationship between sensors and actuators and are usually described in a natural language and a mix of state-based and stream-based paradigms. Translating these into a formal language is an important pre-requisite to automate the verification of requirements. The analysis effort required for the translation is a prime hurdle to formalization gaining acceptance among software engineers and testers. We present Expressive Decision Tables (EDT), a novel formal notation designed to reduce the translation efforts from both state-based and stream-based informal requirements. We have also built a tool, EDTTool, to generate test data and expected output from EDT specifications. In a case study consisting of more than 200 informal requirements of a real-life automotive application, translation of the informal requirements into EDT needed 43% less time than their translation into Statecharts. Further, we tested the Statecharts using test data generated by EDTTool from the corresponding EDT specifications. This testing detected one bug in a mature feature and exposed several missing requirements in another. The paper presents the EDT notation, comparison to other similar notations and the details of the case study.
18:30	IP4-4, 636	MODEL-BASED ACTOR MULTIPLEXING WITH APPLICATION TO COMPLEX COMMUNICATION PROTOCOLS Speakers: Christian Zebelein¹, Christian Haubelt¹, Joachim Falk², Tobias Schwarzer² and Jürgen Teich² ¹University of Rostock, DE; ²University of Erlangen-Nuremberg, DE Abstract We propose a dynamic scheduling approach for the concurrent execution of logical actor instances on a single synthesized actor instance. Based on a formal dataflow model of computation, the proposed approach can be applied to a wide range of applications in a model-based design flow. As a case study, we evaluate a bus-cycle-accurate SystemC RTL model based on an InfiniBand network adapter in a PCI Express system.
18:31	IP4-5, 743	A NOVEL MODEL FOR SYSTEM-LEVEL DECISION MAKING WITH COMBINED ASP AND SMT SOLVING Speakers: Alexander Biewer¹, Jens Gladigau¹ and Christian Haubelt² ¹Robert Bosch GmbH, DE; ²University of Rostock, DE Abstract In this paper, we present a novel model enabling system-level decision making for time-triggered many-core architectures in automotive systems. The proposed application model includes shared data entities that need to be bound to memories during decision making. As a key enabler to our approach, we explicitly separate computation and shared memory communication over a network-on-chip (NoC). To deal with contention on a NoC, we model the necessary basis to implement a time-triggered schedule that guarantees freedom of interference. We compute fundamental design decisions, namely (a) spatial binding, (b) multi-hop routing, and (c) time-triggered scheduling, by a novel coupling of answer set programming (ASP) with satisfiability modulo theories (SMT) solvers. First results of an automotive case study demonstrate the applicability of our method for complex real-world applications.
18:32	IP4-6, 102	DESPERATE: SPEEDING-UP DESIGN SPACE EXPLORATION BY USING PREDICTIVE SIMULATION SCHEDULING Speakers: Giovanni Mariani, Gianluca Palermo, Vittorio Zaccaria and Cristina Silvano, Politecnico di Milano, IT Abstract Design Space Exploration (DSE) is the problem of finding the best architecture configuration in a platform-based design problem. To accurately evaluate a configuration, computationally expensive simulations are required. A common approach to reduce DSE execution time is to use analytic performance prediction models to approximate some of the required simulations, and thus to prune the design space by removing bad configuration candidates. In this paper we will demonstrate that state-of-the-art analytic techniques to speed up the DSE process are not capable of fully exploiting the potential of a parallel simulation environment. We will demonstrate that, when different simulations can be run in parallel, predicting simulation time to better schedule the simulations on the parallel simulation environment is a more profitable approach, with a speedup of more than 2x when compared to state-of-the-art approaches.
18:30		End of session

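The idea in IP4-6 of using predicted simulation times to schedule simulations on a parallel environment can be illustrated with a generic heuristic. The sketch below is not the DeSPErate algorithm; it uses the classic longest-processing-time-first (LPT) list-scheduling rule as a stand-in and assumes, for simplicity, that the predicted durations are exact. The function names and the job durations are hypothetical.

```python
import heapq


def makespan(durations, workers):
    """Greedy list scheduling: each job, in the given order, is assigned to
    the worker that becomes free first. Returns the overall finish time."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for duration in durations:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + duration)
    return max(finish_times)


def lpt_order(predicted_durations):
    """Longest-processing-time-first: start the longest predicted jobs first."""
    return sorted(predicted_durations, reverse=True)


if __name__ == "__main__":
    jobs = [1, 1, 1, 1, 4]  # hypothetical predicted simulation times
    print("arrival order:", makespan(jobs, 2))
    print("LPT order:    ", makespan(lpt_order(jobs), 2))
```

In this toy case the long simulation is started last in arrival order and delays the whole batch, whereas ordering by predicted duration lets the short simulations fill the other worker in the meantime.
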
8.6 Mapping and Scheduling for Many-Core Embedded Systems

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 4

Chair:
Marc Geilen, Eindhoven University of Technology, NL

Co-Chair:
Sébastien Le Beux, Ecole Centrale de Lyon, FR

This session discusses novel ideas for embedded software implementation on many-core architectures. The first presentation deals with an optimized implementation of an H.265 video coding algorithm on many-core architectures. A run-time scheduling approach for GPGPU architectures for priority-based systems is presented in the second presentation. The third talk presents an efficient run-time resource manager heuristic for many-core architectures based on a Lagrangian relaxation technique.

Time	Label	Presentation Title Authors
17:00	8.6.1	SOFTWARE ARCHITECTURE OF HIGH EFFICIENCY VIDEO CODING FOR MANY-CORE SYSTEMS WITH POWER-EFFICIENT WORKLOAD BALANCING Speakers: Muhammad Usman Karim Khan, Muhammad Shafique and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE Abstract The High Efficiency Video Coding (HEVC) standard aims at providing ~50% better compression compared to its predecessor (H.264) at the cost of high computational complexity. To enable HEVC video encoding in real-time scenarios, special coding support for parallelization is provided in HEVC that can be exploited by many-core systems. In this work, we present an HEVC software architecture where a video frame is adaptively divided into independent video frame regions (i.e. so-called video tiles) which are processed concurrently on multiple cores. By balancing the workload of each video tile mapped to a particular core, the total power consumption of a system is reduced (through dynamically scaling the operating frequency) under a given frame-rate constraint. We also exploit user tolerance to further curtail the HEVC workload with insignificant video quality degradation. Experimental results illustrate that the proposed approach results in ~43% power savings on a many-core system.
17:30	8.6.2	GPU-EVR: RUN-TIME EVENT BASED REAL-TIME SCHEDULING FRAMEWORK ON GPGPU PLATFORM Speakers: Haeseung Lee¹ and Mohammad Abdullah Al Faruque² ¹University of California, Irvine, US; ²University of California Irvine, US Abstract GPU architecture has traditionally been used in graphics applications because of its enormous computing capability. Moreover, GPU architecture is now also used for general purpose computing. Most of the current scheduling frameworks that are developed to handle GPGPU workload operate sequentially. This is problematic since this sequential approach may not be scalable for real-time systems, which is a consequence of the approach's inability to support preemption. We propose a novel scheduling framework that provides real-time support for the GPGPU platform. In contrast to existing frameworks, our proposed framework considers both concurrent execution of applications on the GPU and mapping between streaming multiprocessors and thread blocks. By considering both concurrent execution and mapping, our framework is able to guarantee timing for up to 6.4 times as many applications compared to TimeGraph and Global EDF. In addition, our experimental applications use up to 20% less power under our scheduling framework compared to TimeGraph and Global EDF.
18:00	8.6.3	MULTI-OBJECTIVE DISTRIBUTED RUN-TIME RESOURCE MANAGEMENT FOR MANY-CORES Speakers: Stefan Wildermann, Michael Glaß and Jürgen Teich, University of Erlangen-Nuremberg, DE Abstract Dynamic usage scenarios of many-core systems require sophisticated run-time resource management that can deal with multiple often conflicting application and system objectives. This paper proposes an approach based on non-linear programming techniques that is able to trade off between objectives while respecting targets regarding their values. We propose a distributed application embedding for dealing with soft system-wide constraints as well as a centralized one for strict constraints. The experiments show that both approaches may significantly outperform related heuristics.
18:30	IP4-7, 323	COMIK: A PREDICTABLE AND CYCLE-ACCURATELY COMPOSABLE REAL-TIME MICROKERNEL Speakers: Andrew Nelson¹, Ashkan Beyranvand Nejad¹, Anca Molnos², Martijn Koedam³ and Kees Goossens³ ¹TU Delft, NL; ²CEA Leti, FR; ³TU Eindhoven, NL Abstract The functionality of embedded systems is ever increasing. This has led to mixed time-criticality systems, where applications with a variety of real-time requirements co-exist on the same platform and share resources. Due to inter-application interference, verifying the real-time requirements of such systems is generally non-trivial. In this paper, we present the CoMik microkernel that provides temporally predictable and composable processor virtualisation. CoMik's virtual processors are cycle-accurately composable, i.e. their timing cannot affect the timing of co-existing virtual processors by even a single cycle. Real-time applications executing on dedicated virtual processors can therefore be verified and executed in isolation, simplifying the verification of mixed time-criticality systems. We demonstrate these properties through experimentation on an FPGA prototyped hardware platform.
18:31	IP4-8, 71	UTILIZATION-AWARE LOAD BALANCING FOR THE ENERGY EFFICIENT OPERATION ON THE BIG.LITTLE PROCESSOR Speakers: Myungsun Kim¹, Kibeom Kim², James Geraci¹ and Seongsoo Hong³ ¹Samsung Electronics, KR; ²SAMSUNG Electronics, KR; ³Seoul National University, KR Abstract ARM's big.LITTLE architecture introduces the opportunity to optimize power consumption by selecting the core type most suitable for a level of processing demand. To take advantage of this new axis of optimization, we introduce the processor utilization factor into the Linux kernel's load balancing algorithm after carefully analyzing the power management mechanism of the big.LITTLE processor's port of Linux and deriving its state diagram representation. Our mechanism improves the Linux kernel's ability to assign tasks to cores in an energy efficient manner without having to make it directly aware of the available core types. Our experiments with a real test bed show that our algorithm improves energy consumption over the standard Linux scheduler by up to 11.35% with almost no corresponding reduction in performance.
18:32	IP4-9, 538	HEVCDTM: APPLICATION-DRIVEN DYNAMIC THERMAL MANAGEMENT FOR HIGH EFFICIENCY VIDEO CODING Speakers: Daniel Palomino¹, Muhammad Shafique², Hussam Amrouch², Altamiro Susin³ and Jörg Henkel² ¹Karlsruhe Institute of Technology (KIT), BR; ²Karlsruhe Institute of Technology (KIT), DE; ³Federal University of Rio Grande do Sul, BR Abstract This paper presents an application-driven algorithm for Dynamic Thermal Management (DTM) for High Efficiency Video Coding (HEVC). For efficient design of such a DTM policy, we perform an offline thermal analysis of an HEVC encoder and demonstrate the impact of different video sequences and different coding configurations on the processor temperature. Our thermal analysis is leveraged to develop an efficient application-driven DTM policy that performs temperature-aware coding along with an application-driven control of DTM knobs (e.g., frequency scaling) in order to meet the temperature constraints while still providing high video quality (i.e. PSNR loss < 0.01dB). For accurate thermal analysis and evaluation, we deploy an infrared camera-based thermal measurement setup that, in contrast to state-of-the-art setups, does not require adding any extra layer on top of the measured chip, thus allowing the camera to accurately capture the infrared emissions from the die.
18:33	IP4-10, 714	IMPROVING EFFICIENCY OF EXTENSIBLE PROCESSORS BY USING APPROXIMATE CUSTOM INSTRUCTIONS Speakers: Mehdi Kamal¹, Amin Ghasem Azar¹, Ali Afzali-Kusha¹ and Massoud Pedram² ¹University of Tehran, IR; ²University of Southern California, US Abstract In this paper, we propose to move the conventional extensible processor design flow to the approximate computing domain to gain more speedup. In this domain, the instruction set architecture (ISA) design flow selects both exact and approximate custom instructions (CIs). The proposed approach could be used for the applications where imprecise results may be tolerated. In the CI identification phase of the flow, the CIs which do not satisfy the maximum propagation delay but can provide approximate results also may be included in the CI candidate set. Next, in the selection phase, we propose a merit function which selects CIs with higher cycle savings and small error rates. The efficacy of the proposed approximate design flow is investigated using the case studies of the discrete cosine transform (DCT) and inverse DCT (iDCT) of the MPEG2 application. Also, the impact of the process variation on the impreciseness of the results is investigated.
18:30		End of session

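The reasoning behind presentation 8.6.1, that balancing tile workloads lets every core run at a lower frequency under a frame-rate constraint, can be sketched numerically. The Python fragment below is a simplified illustration rather than the authors' method: it assumes dynamic power grows with the cube of the clock frequency (supply voltage scaled in proportion to frequency), ignores static power, and assumes the total workload can be split perfectly evenly. All names and cycle counts are hypothetical.

```python
def min_frequencies(tile_cycles, frames_per_second):
    """Lowest clock frequency (Hz) per core that still finishes its tile
    within one frame period."""
    return [cycles * frames_per_second for cycles in tile_cycles]


def relative_dynamic_power(frequencies):
    """Relative dynamic power, assuming P is proportional to f**3
    (a common first-order model, not taken from the abstract)."""
    return sum(f ** 3 for f in frequencies)


def balanced(tile_cycles):
    """Idealised balancing: spread the total cycle count evenly over the cores."""
    share = sum(tile_cycles) / len(tile_cycles)
    return [share] * len(tile_cycles)


if __name__ == "__main__":
    tiles = [24e6, 8e6, 8e6, 8e6]  # hypothetical cycles per tile and frame
    fps = 30
    before = relative_dynamic_power(min_frequencies(tiles, fps))
    after = relative_dynamic_power(min_frequencies(balanced(tiles), fps))
    print("power ratio balanced / unbalanced:", after / before)
```

Because the cubic term penalises the most heavily loaded core disproportionately, shifting work away from it reduces total power even though the other cores must run faster.
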
8.7 Performance Modeling and Delay Test

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 5

Chair:
Robert Aitken, ARM, US

Co-Chair:
Mehdi Tahoori, KIT, DE

As technology dimensions shrink and process complexity increases, it becomes vital to accurately model performance limiters such as device and metal variability, as well as to determine when these effects become so critical that delay requirements are exceeded.

Time	Label	Presentation Title Authors
17:00	8.7.1	EFFICIENT PERFORMANCE ESTIMATION WITH VERY SMALL SAMPLE SIZE VIA PHYSICAL SUBSPACE PROJECTION AND MAXIMUM A POSTERIORI ESTIMATION Speakers: Li Yu¹, Sharad Saxena², Christopher Hess², Ibrahim (Abe) Elfadel³, Dimitri Antoniadis⁴ and Duane Boning⁴ ¹Massachusetts Institute of Technology, US; ²PDF Solution, Inc, US; ³Masdar Institute of Science and Technology, AE; ⁴MIT, US Abstract In this paper, we propose a novel integrated circuit performance estimation algorithm through a physical subspace projection and maximum-a-posteriori (MAP) estimation. Our goal is to estimate the distribution of a target circuit performance with very small measurement samples from on-chip monitor circuits. The key idea in this work is to exploit the fact that simulation and measurement data are physically correlated under different circuit configurations and topologies. First, different groups of measurements are projected to a subspace spanned by a set of physical variables. The projection is achieved by performing a sensitivity analysis of measurement parameters with respect to the subspace variables using a virtual source compact model. Then a Bayesian treatment is developed by introducing prior distributions over these subspace variables. Maximum a posteriori estimation is then applied using the prior, and an expectation-maximization (EM) algorithm is used to estimate the circuit performance. The proposed method is validated by post-silicon measurement for a commercial 28-nm process. An average error reduction of 2x is achieved, which can be translated to a 32x reduction in the data needed for samples on the same die. A 150x and 70x sample size reduction on training dies is also achieved compared to the traditional least-square fitting method and least-angle regression method, respectively, without reducing accuracy.
17:30	8.7.2	JOINT VIRTUAL PROBE: JOINT EXPLORATION OF MULTIPLE TEST ITEMS' SPATIAL PATTERNS FOR EFFICIENT SILICON CHARACTERIZATION AND TEST PREDICTION Speakers: Shuangyue Zhang¹, Fan Lin², Chun-Kai Hsu², Kwang-Ting Cheng² and Hong Wang¹ ¹Department of Automation, Tsinghua University, CN; ²Department of Electrical and Computer Engineering, University of California, Santa Barbara, US Abstract Virtual Probe (VP), proposed for characterization of spatial variations and for test time reduction, can effectively reconstruct the spatial pattern of a test item for an entire wafer using measurement values from only a small fraction of dies on the wafer. However, VP calculates the spatial signature of each test item separately, one item at a time, resulting in very long runtime for complex chips which often require hundreds, or even thousands, of test items in production. In this paper, we propose a new method, named Joint Virtual Probe (JVP), which can jointly derive spatial patterns of multiple test items. By simultaneously handling a large group of test items, JVP significantly reduces the overall runtime. The prediction accuracy can also be improved because of JVP's implicit use of inter-test-item correlations in predicting spatial patterns. The experimental results on two industrial products, with 277 and 985 parametric test items in the production test programs respectively, demonstrate that JVP achieves an average speedup of ~170X and ~50X over VP in the pre-test analysis and the test application phases respectively, as well as a slightly higher prediction accuracy than VP.
18:00	8.7.3	SUBSTITUTING TRANSITION FAULTS WITH PATH DELAY FAULTS AS A BASIC DELAY FAULT MODEL Speaker: Irith Pomeranz, Purdue University, US Abstract Comparing a single transition fault with a single path delay fault, targeting (i.e., simulating or generating a test for) a path delay fault is not more complex than targeting a transition fault. However, targeting a set of path delay faults is significantly more complex than targeting a set of transition faults when the goal is to consider the testable path delay faults that are associated with the longest paths. The reason is the large fraction of untestable path delay faults among these faults. This complication is removed if the requirement on the lengths of the paths is removed. In this case, it is possible to use path delay faults instead of transition faults as a basic delay fault model for better coverage of small delay defects. This paper studies the effects of using path delay faults as a basic delay fault model instead of transition faults.
18:15	8.7.4	STANDARD CELL LIBRARY TUNING FOR VARIABILITY TOLERANT DESIGNS Speakers: Sebastien Fabrie¹, Juan Diego Echeverri², Maarten Vertregt² and Jose Pineda² ¹Eindhoven University of Technology, NL; ²NXP Semiconductors, NL Abstract In today's semiconductor industry we see a move towards smaller technology feature sizes. These smaller feature sizes pose a problem due to mismatch between identical cells on a single die known as local variation. In this paper a library tuning method is proposed which makes a smart selection of cells in a standard cell library to reduce the design's sensitivity to local variability. This results in a robust IC design with an identifiable behavior towards local variations. Experimental results performed on a widely used microprocessor design synthesized for a high performance timing show that we can achieve a timing spread reduction of 37% at an area increase cost of 7%.
18:30	IP4-11, 194	PROBABILISTIC STANDARD CELL MODELING CONSIDERING NON-GAUSSIAN PARAMETERS AND CORRELATIONS Speakers: André Lange¹, Christoph Sohrmann¹, Roland Jancke¹, Joachim Haase¹, Ingolf Lorenz² and Ulf Schlichtmann³ ¹Fraunhofer Institute for Integrated Circuits (IIS), Design Automation Division (EAS), DE; ²GLOBALFOUNDRIES Inc., DE; ³Technische Universität München, DE Abstract Variability continues to pose challenges to integrated circuit design. With statistical static timing analysis and high-yield estimation methods, solutions to particular problems exist, but they do not allow a common view on performance variability including potentially correlated and non-Gaussian parameter distributions. In this paper, we present a probabilistic approach for variability modeling as an alternative: model parameters are treated as multi-dimensional random variables. Such a fully multivariate statistical description formally accounts for correlations and non-Gaussian random components. Statistical characterization and model application are introduced for standard cells and gate-level digital circuits. Example analyses of circuitry in a 28 nm industrial technology illustrate the capabilities of our modeling approach.
18:30		End of session
19:30		DATE Party in "Gläserne Manufaktur" of the Volkswagen AG The DATE Party is again scheduled on the second conference day, Wednesday, March 26, 2014, starting from 19:30 h. This year, it will take place in one of Dresden's most exciting and modern buildings, the "Gläserne Manufaktur" of the car manufacturer Volkswagen AG (www.glaesernemanufaktur.de/en/). The party will feature a flying buffet style dinner with various catering points and accompanying drinks. Light background music and the possibility of guided visits through the extraordinary premises will round off the evening. It provides a perfect opportunity to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities. Please kindly note that it is no seated dinner. All delegates, exhibitors and their guests are encouraged to attend the party. Please be aware that entrance is only possible with a party ticket. Each full conference registration includes a ticket for the DATE Party. Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Ticket price for the full Evening Social Programme: 75 € per person.

8.8 Hot Topic: Beyond CMOS Ultra-low-power Computing

Date: Wednesday 26 March 2014
Time: 17:00 - 18:30
Location / Room: Exhibition Theatre

Organiser:
Saibal Mukhopadhyay, Georgia Institute of Technology, US

Chair:
Arijit Raychowdhury, Georgia Institute of Technology, US

Co-Chair:
Saibal Mukhopadhyay, Georgia Institute of Technology, US

With conventional CMOS scaling becoming increasingly challenging, designers wonder what opportunities and challenges exist beyond CMOS for both Boolean and non-Boolean computing. This session will discuss three very different and promising emerging technologies -- Tunneling Field-Effect Transistors, Spintronics, and nano-electro-mechanical switches (NEMS) -- for low-power electronics. The talks will discuss the need for innovating and evaluating new circuit and system design methods as new device technologies emerge.

Time	Label	Presentation Title Authors
17:00	8.8.1	ULTRA-LOW POWER ELECTRONICS WITH SI/GE TUNNEL FET Speakers: Amit Trivedi, Mohammad Faisal Amir and Saibal Mukhopadhyay, Georgia Institute of Technology, US Abstract Si/Ge Tunnel FET (TFET) with its subthermal subthreshold swing is attractive for low power analog and digital designs. Greater Ion/Ioff ratio of TFET can reduce the dynamic power in digital designs, while higher gm/IDS can lower the bias power of analog amplifier. However, the above benefits of TFET are eclipsed by MOSFET at a higher power/performance point. Ultra low power scalability of the key analog and digital circuits, SRAM and operational transconductance amplifier (OTA), with TFET is demonstrated. Analyzing a TFET based cellular neural network, this work shows the feasibility of ultra-low-power neuromorphic computing with TFET.
17:30	8.8.2	BRAIN-INSPIRED COMPUTING WITH SPIN TORQUE DEVICES Speakers: Kaushik Roy, Mrigank Sharad, Deliang Fan and Karthik Yogendra, Purdue University, US Abstract In this paper we discuss the potential of emerging spin-torque devices for computing applications. Recent proposals for spin-based computing schemes may be differentiated as 'all-spin' vs. hybrid, programmable vs. fixed, and Boolean vs. non-Boolean. All-spin logic-styles may offer high area-density due to the small form-factor of nano-magnetic devices. However, circuit and system-level design techniques need to be explored that leverage the specific spin-device characteristics to achieve energy-efficiency, performance and reliability comparable to those of CMOS. The non-volatility of nano-magnets can be exploited in the design of energy and area-efficient programmable logic. In such logic-styles, spin-devices may play the dual-role of computing as well as memory-elements that provide field-programmability. Spin-based threshold logic design is presented as an example. Emerging spintronic phenomena may lead to ultra-low-voltage, current-mode, spin-torque switches that can offer attractive computing capabilities, beyond digital switches. Such devices may be suitable for non-Boolean data-processing applications which involve analog processing. Integration of such spin-torque devices with charge-based devices like CMOS and resistive memory can lead to highly energy-efficient information processing hardware for applications like pattern-matching, neuromorphic-computing, image-processing and data-conversion. Finally, we discuss the possibility of using coupled spin-torque nano oscillators for low-power non-Boolean computing.
18:00	8.8.3	TOWARD ULTRALOW-POWER COMPUTING AT EXTREME WITH SILICON CARBIDE (SIC) NANOELECTROMECHANICAL LOGIC Speakers: Swarup Bhunia¹, Vaishnavi Ranganathan², Tina He², Srihari Rajgopal², Rui Wang², Mehran Mehregany² and Philip Feng² ¹Case Western Reserve University, US; ²Case Western Reserve U., US Abstract A growing number of important application areas, including automotive and industrial applications as well as space, avionics, combustion engine, intelligent propulsion systems, and geo-thermal exploration, require electronics that can work reliably at extreme conditions - in particular at a temperature > 250°C and at high radiation (1-30 Mrad), where conventional electronics fail to work reliably. Traditionally, existing wide-band-gap semiconductors, e.g., silicon carbide (SiC) transistor-based electronics, have been considered most viable for high temperature and high radiation applications. However, the large size, high threshold voltage, low switching speed and high leakage current make logic design with these devices unattractive. Additionally, the leakage current markedly increases at high temperature (in the range of 10 µA for a 2-input NAND gate), which induces a self-heating effect and makes power delivery at high temperature very challenging. To address these issues, in this paper we present a computing platform for low-power reliable operation in extreme environments using SiC electromechanical switches. We show that a device-circuit-architecture co-design approach can provide reliable long-term operation with virtually zero leakage power.
18:30		End of session
19:30		DATE Party in "Gläserne Manufaktur" of the Volkswagen AG

Party DATE Party

Date: Wednesday 26 March 2014
Time: 19:30 - 23:00
Location / Room: "Gläserne Manufaktur" of the Volkswagen AG

The DATE Party is again scheduled on the second conference day, Wednesday, March 26, 2014, starting from 19:30 h. This year, it will take place in one of Dresden's most exciting and modern buildings, the "Gläserne Manufaktur" of the car manufacturer Volkswagen AG (www.glaesernemanufaktur.de/en/). The party will feature a flying buffet style dinner with various catering points and accompanying drinks. Light background music and the possibility of guided visits through the extraordinary premises will round off the evening. It provides a perfect opportunity to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities. Please kindly note that it is no seated dinner. All delegates, exhibitors and their guests are encouraged to attend the party. Please be aware that entrance is only possible with a party ticket. Each full conference registration includes a ticket for the DATE Party. Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Ticket price for the full Evening Social Programme: 75 € per person.

Time	Label	Presentation Title Authors
23:00		End of session

9.1 SPECIAL DAY Hot Topic: CMOS scaling - from evolutionary to revolutionary computing

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Saal 1

Organisers:
Thomas Mikolajick, NamLab gGmbH, DE
Ian O'Connor, Lyon Institute of Nanotechnology, FR

Chair:
Thomas Mikolajick, NamLab gGmbH, DE

Co-Chair:
Ian O'Connor, Lyon Institute of Nanotechnology, FR

Transistors as switches have now scaled down to a point where the classical bulk structure is no longer tenable and it is necessary to change the nature of the channel structure. In this session, the three principal contenders for following on from conventional devices will be examined. The first paper looks at the use of III-V nanowires, with expected benefits in terms of speed and energy, as well as integration challenges. The second paper looks at how the use of switches with controllable polarity, such as in silicon nanowire devices, can improve the energy efficiency of systems on chip. The devices themselves are explored in detail in the third paper, with the concept of fine-grain reconfigurability at the fore. The fourth and final paper gives a reality check on carbon electronics and the most promising devices in this class.

Time	Label	Presentation Title Authors
08:30	9.1.1	III-V SEMICONDUCTOR NANOWIRES FOR FUTURE DEVICES Speakers: H. Schmid, B. Borg, K. Moselund, P. Das Kunungo, G. Signorello, S. Karg, P. Mensch, V. Schmidt and H. Riel, IBM Research, CH Abstract The monolithic integration of III-V nanowires on silicon by direct epitaxial growth enables new possibilities for the design and fabrication of electronic as well as optoelectronic devices. We demonstrate a new growth technique to directly integrate III-V semiconducting nanowires on silicon using selective area epitaxy within a nanotube template. Thus we achieve small diameter nanowires, controlled doping profiles and sharp heterojunctions essential for future device applications. We experimentally demonstrate vertical tunnel diodes and gate-all-around tunnel FETs based on InAs-Si nanowire heterojunctions. The results indicate the benefits of the InAs-Si material system combining the possibility of achieving high Ion with high Ion/Ioff ratio.
08:50	9.1.2	ADVANCED SYSTEM ON A CHIP DESIGN BASED ON CONTROLLABLE-POLARITY FETS Speakers: Pierre-Emmanuel Gaillardon, Luca Amaru, Jian Zhang and Giovanni De Micheli, Integrated Systems Laboratory – Swiss Federal Institute of Technology, CH Abstract Field-Effect Transistors (FETs) with on-line controllable-polarity are promising candidates to support next generation System-on-Chip (SoC). Thanks to their enhanced functionality, controllable-polarity FETs enable a superior design of critical components in a SoC, such as processing units and memories, while also providing native solutions to control power consumption. In this paper, we present the efficient design of a SoC core with controllable-polarity FETs. Processing units are sped up at the datapath level, as arithmetic operations require fewer physical resources than in standard CMOS. Power consumption is decreased via embedded power-gating techniques and tunable high-performance/low-power device operation. Memory cells are made smaller by merging the access interface with the storage circuitry. We foresee the advantages deriving from these techniques by evaluating their impact on the design of a SoC for a contemporary telecommunication application. Using a 22-nm vertically-stacked silicon nanowire technology, we estimate a delay and power reduction of 20% and 19% respectively, at the cost of a moderate area overhead of 15%, with respect to a state-of-the-art FinFET technology.
09:15	9.1.3	RECONFIGURABLE SILICON NANOWIRE DEVICES AND CIRCUITS: OPPORTUNITIES AND CHALLENGES Speakers: Walter Weber¹, André Heinzig², Jens Trommer¹, Markus König², Matthias Grube¹ and Thomas Mikolajick¹ ¹NamLab gGmbH, DE; ²Technische Universität Dresden, DE Abstract Reconfigurable fine-grain electronics target an increase in the number of integrated logic functions per chip by enhancing the functionality at the device level and by implementing a compact and technologically simple hardware platform. Here we study a promising realization approach by employing reconfigurable nanowire transistors (RFETs) as the multifunctional building-blocks to be integrated therein. RFETs merge the electrical characteristics of unipolar n- and p-type FETs into a single universal device. The switch comprises four terminals, where three of them act as the conventional FET electrodes and the fourth acts as an electric select signal to dynamically program the desired switch type. The transistor consists of two independent charge carrier injection valves as represented by two gated Schottky junctions integrated within an intrinsic silicon nanowire. Radial compressive strain applied to the channel is used as a scalable method to adjust n- and p-FET currents to each other, thereby enabling complementary logic circuits. Simple but relevant examples for the reconfiguration of complete gates will be given, demonstrating the potential of this technology.
09:35	9.1.4	ADVANCING CMOS WITH CARBON ELECTRONICS Speaker: Franz Kreupl, TU Munich, DE Abstract A fresh look at carbon-based transistor channel materials like single-walled carbon nanotubes (CNT) and graphene nanoribbons (GNR) in future electronic applications is given. Although theoretical predictions initially promised that GNR (which do have a bandgap) would perform equally well as transistors based on CNTs, experimental evidence for well-behaved transistor action is missing up to now. Possible reasons for the shortcomings as well as possible solutions to overcome the performance gap will be addressed. In contrast to GNR, short channel CNT field effect transistors (FET) demonstrate in the experimental realization almost ideal transistor characteristics down to very low bias voltages. Therefore, CNT-FETs are clear frontrunners in the search for a future CMOS switch that will enable further voltage and gate length scaling. Essential features which distinguish CNT-FETs from alternative solutions will be discussed and benchmarked. Finally, the gap to industrial wafer-level scale SWCNT integration will be addressed and strategies for achieving highly aligned carbon nanotube fabrics will be discussed. Without such a high yield wafer-scale integration, SWCNT circuits will remain an illusory dream.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

9.2 Low-Cost, High-Performance NoCs

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 6

Chair:
Kees Goossens, Eindhoven University, NL

Co-Chair:
Luca Ramini, University of Ferrara, IT

This session pushes the boundaries of NoC performance optimization while at the same time accounting for implementation constraints. The first paper takes a perspective where express channels are added to the topology, and then smart application mapping is performed. The second paper instead chooses the TDM NoC route to provide guaranteed performance, and significantly optimizes the TDM scheduling process. Finally, the last paper reduces buffer sizes, while also providing elasticity, in a router's virtual channel buffers.

Time	Label	Presentation Title Authors
08:30	9.2.1	APPLICATION MAPPING FOR EXPRESS CHANNEL-BASED NETWORKS-ON-CHIP Speakers: Di Zhu¹, Lizhong Chen¹, Siyu Yue² and Massoud Pedram¹ ¹Univ. of Southern California, US; ²University of Southern California, US Abstract With the emergence of many-core multiprocessor system-on-chips (MPSoCs), the on-chip networks are facing serious challenges in providing fast communication for various tasks and cores. One promising solution shown in recent studies is to add express channels to the network as shortcuts to bypass intermediate routers, thereby reducing packet latency. However, this approach also greatly changes the packet delay estimation and traffic behaviors of the network, both of which have not yet been exploited in existing mapping algorithms. In this paper, we explore the opportunities in optimizing application mapping for express channel-based on-chip networks. Specifically, we derive a new delay model for this type of network, identify its unique characteristics, and propose an efficient heuristic mapping algorithm that increases the bypassing opportunities by reducing unnecessary turns that would otherwise impose the entire router pipeline delay on packets. Simulation results show that the proposed algorithm can achieve a 2~4X reduction in the number of turns and a 10~26% reduction in the average packet delay.
09:00	9.2.2	PARALLEL PROBE BASED DYNAMIC CONNECTION SETUP IN TDM NOCS Speakers: Shaoteng Liu, Axel Jantsch and Zhonghai Lu, KTH, SE Abstract We propose a Time-Division Multiplexing (TDM) based connection oriented NoC with a novel double-time wheel router architecture combined with a run-time parallel probing setup method. In comparison with traditional TDM connection setup methods, our design has the following advantages: (1) it allocates paths and time slots at run-time; (2) it is fast with predictable and bounded setup latency; (3) it avoids additional resources (no auxiliary network or central processor to find and manage connections); (4) it is fully distributed and therefore it scales nicely with network size. Compared to a packet based setup method, our probe based design can reduce path setup delay by up to 81% and increase network load by 110% in an 8x8 mesh, while avoiding the auxiliary network. Compared to a centralized method, our solution can double the success rate, while eliminating the central resource for path setup and reducing the wire overhead. Synthesis results suggest that our design is faster and smaller than all comparable solutions.
09:30	9.2.3	ELASTISTORE: AN ELASTIC BUFFER ARCHITECTURE FOR NETWORK-ON-CHIP ROUTERS Speakers: Giorgos Dimitrakopoulos¹, Ioannis Seitanidis¹, Anastasios Psarras¹ and Chrysostomos Nicopoulos² ¹Democritus University of Thrace, GR; ²University of Cyprus, CY Abstract The design of scalable Network-on-Chip (NoC) architectures calls for new implementations that achieve high-throughput and low-latency operation, without exceeding the stringent area-energy constraints of modern Systems-on-Chip (SoC). The router's buffer architecture is a critical design aspect that affects both network-wide performance and implementation characteristics. In this paper, we extend Elastic Buffer (EB) architectures to support multiple Virtual Channels (VC) and we derive ElastiStore, a novel lightweight elastic buffer architecture that minimizes buffering requirements, without sacrificing performance. The integration of the proposed elastic buffering scheme in the NoC router enables the design of new router architectures -- both single-cycle and two-stage pipelined -- which offer the same performance as baseline VC-based routers, albeit at a significantly lower area/power cost.
10:00	IP4-12, 581	DYNAMIC CONSTRUCTION OF CIRCUITS FOR REACTIVE TRAFFIC IN HOMOGENEOUS CMPS Speakers: Marta Ortín-Obón¹, Darío Suárez-Gracia¹, María Villaroya-Gaudó¹, Cruz Izu² and Víctor Viñals-Yúfera¹ ¹University of Zaragoza, ES; ²University of Adelaide, AU Abstract Networks on Chip (NoCs) have a large impact on system performance, area and energy. Considering the characteristics of the memory subsystem while designing the NoC helps identify improvement opportunities and build more efficient designs. Leveraging the frequent request-reply pattern, our proposal dynamically builds the reply path in advance, is able to share circuits between messages, and even removes some implicit replies, significantly reducing NoC latency. A careful implementation of this circuit reservation mechanism achieves an average 17% reduction in router energy consumption, 8% smaller router area and a 2% system performance increase, compared with its baseline counterpart.
10:01	IP4-13, 646	IMPROVING HAMILTONIAN-BASED ROUTING METHODS FOR ON-CHIP NETWORKS: A TURN MODEL APPROACH Speakers: Poona Bahrebar and Dirk Stroobandt, Ghent University, BE Abstract The overall performance of Multi-Processor System-on-Chip (MPSoC) platforms depends highly on the efficient communication among their cores in the Network-on-Chip (NoC). Routing algorithms are responsible for the on-chip communication and traffic distribution through the network. Hence, designing efficient and high-performance routing algorithms is of significant importance. In this paper, a deadlock-free and highly adaptive path-based routing method is proposed without using virtual channels. This method strives to exploit the maximum number of minimal paths between any source and destination pair. The simulation results in terms of performance and power consumption demonstrate that the proposed method significantly outperforms the other adaptive and non-adaptive schemes. This efficiency is achieved by reducing the number of hotspots and smoothly distributing the traffic across the network.
10:00		End of session Coffee Break in Exhibition Area
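
The run-time allocation of paths and time slots mentioned in the abstract of 9.2.2 can be illustrated with a minimal sketch of generic contention-free TDM slot reservation. This is not the double-time-wheel router or the parallel probing method of the paper; it only assumes the common TDM convention that data occupying slot s on one link occupies slot s+1 (modulo the slot-table size) on the next link. The names `find_free_slot`, `reserve` and `slot_tables` are hypothetical and chosen for illustration.

```python
def find_free_slot(slot_tables, path, num_slots):
    """Return the first start slot s for which every link on the path is
    free in slot (s + hop) % num_slots, or None if no such slot exists."""
    for s in range(num_slots):
        if all(slot_tables[link][(s + hop) % num_slots] is None
               for hop, link in enumerate(path)):
            return s
    return None


def reserve(slot_tables, path, num_slots, conn_id):
    """Reserve a contention-free slot sequence along the path.

    slot_tables maps a link name to a list of length num_slots whose
    entries are None (free) or the id of the owning connection.
    Returns the start slot, or None if the connection cannot be set up.
    """
    s = find_free_slot(slot_tables, path, num_slots)
    if s is None:
        return None
    for hop, link in enumerate(path):
        slot_tables[link][(s + hop) % num_slots] = conn_id
    return s
```

With four slots per link, reserving connection 1 on links A then B takes slot 0 on A and slot 1 on B; a later connection over the same two links has to start one slot later.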

9.3 Hardware Implementations for Data Security

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 1

Chair:
Viktor Fischer, St Etienne, FR

Co-Chair:
Tim Gueneysu, RUB, DE

Hardware features are used as a trust anchor in many secure systems. This includes design obfuscation techniques, encrypted processing, and biometric systems which are discussed in this session.

Time	Label	Presentation Title Authors
08:30	9.3.1	EMBEDDED RECONFIGURABLE LOGIC FOR ASIC DESIGN OBFUSCATION AGAINST SUPPLY CHAIN ATTACKS Speakers: Bao Liu¹ and Brandon Wang² ¹University of Texas at San Antonio, US; ²Cadence Design Systems, Inc., US Abstract Hardware is the foundation and the root of trust of any security system. However, in today's global IC industry, an IP provider, an IC design house, a CAD company, or a foundry may subvert a VLSI system with back doors or logic bombs. Such a supply chain adversary's capability is rooted in his knowledge on the hardware design. Successful hardware design obfuscation would severely limit a supply chain adversary's capability if not preventing all supply chain attacks. However, not all designs are obfuscatable in traditional technologies. We propose to achieve ASIC design obfuscation based on embedded reconfigurable logic which is determined by the end user and unknown to any party in the supply chain. Combined with other security techniques, embedded reconfigurable logic can provide the root of ASIC design obfuscation, data confidentiality and tamper-proofness. As a case study, we evaluate hardware-based code injection attacks and reconfiguration-based instruction set obfuscation based on an open source SPARC processor LEON2. We prevent program monitor Trojan attacks and increase the area of a minimum code injection Trojan with a 1KB ROM by 2.38% for every 1% area increase of the LEON2 processor.
09:00	9.3.2	A MINIMALIST APPROACH TO REMOTE ATTESTATION Speakers: Aurélien Francillon¹, Quan Nguyen², Kasper Rasmussen² and Gene Tsudik² ¹EURECOM, FR; ²University of California, Irvine, US Abstract Embedded computing devices increasingly permeate many aspects of modern life: from medical to automotive, from building and factory automation to weapons, from critical infrastructures to home entertainment. Despite their specialized nature as well as limited resources and connectivity, these devices are now becoming an increasingly popular and attractive target for attacks, especially, malware infections. A number of approaches have been suggested to detect and/or mitigate such attacks. They vary greatly in terms of application generality and underlying assumptions. However, one common theme is the need for Remote Attestation, a distinct security service that allows a trusted party (verifier) to check the internal state of a remote untrusted embedded device (prover). Many prior methods assume some form of trusted hardware on the prover, which is not a good option for small and low-end embedded devices. To this end, we investigate the feasibility of Remote Attestation without trusted hardware. This paper provides a systematic treatment of Remote Attestation, starting with a precise definition of the desired service and proceeding to its systematic deconstruction into necessary and sufficient properties. Next, these are mapped into a minimal collection of hardware and software components that result in secure Remote Attestation. One distinguishing feature of this line of research is the need to prove (or, at least argue for) architectural minimality -- an aspect rarely encountered in security research. This work also provides a promising platform for attaining more advanced security services and guarantees.
09:15	9.3.3	MULTI RESOLUTION TOUCH PANEL WITH BUILT-IN FINGERPRINT SENSING SUPPORT Speakers: Pranav Koundinya, Sandhya Theril, Tao Feng, Varun Prakash, Jimming Bao and Weidong Shi, University of Houston, US Abstract In today's technology driven world, it is essential to build secure systems with low faulty behavior. Authentication is one of the primary means to gain access to secure systems. Users need to be authenticated in order to gain access to the services and sensitive information contained within the system. Due to the surge in the number of touch based smart devices, there arises a need for a compatible authentication system. Historically, fingerprints have served to establish the uniqueness of an individual's identity. They can be detected using capacitive sensing techniques. In this paper we present a novel unified device using transparent electronics for both fingerprint scan and multi-touch interaction. We discuss a high resolution transparent touch sensitive device and a read out circuit that drives the capacitive sensor array for touch interactions at low resolutions and for fingerprint sensing at higher resolutions. Using circuit simulation and a custom Verilog-A model for transparent thin-film transistors, we verified that our design can sense fingerprints in 8.25 ms and detect touches in 0.6 ms with an efficient power consumption of 1 mW. The results show that such a device can be realized and can serve as a very efficient means of user authentication. Furthermore, from the usability aspect, the proposed device is essential as it provides user transparent and non intrusive authentication.
09:30	9.3.4	HEROIC: HOMOMORPHICALLY ENCRYPTED ONE INSTRUCTION COMPUTER Speakers: Nektarios Georgios Tsoutsos¹ and Michail Maniatakos² ¹NYU Polytechnic School of Engineering, US; ²NYU Abu Dhabi, AE Abstract As cloud computing becomes mainstream, the need to ensure the privacy of the data entrusted to third parties keeps rising. Cloud providers resort to numerous security controls and encryption to thwart potential attackers. Still, since the actual computation inside cloud microprocessors remains unencrypted, leakage remains theoretically possible. Therefore, in order to address the challenge of protecting the computation inside the microprocessor, we introduce a novel general purpose architecture for secure data processing, called HEROIC (Homomorphically EncRypted One Instruction Computer). This new design utilizes a single instruction architecture and provides native processing of encrypted data at the architecture level. The security of the solution is assured by a variant of Paillier's homomorphic encryption scheme, used to encrypt both instructions and data. Experimental results using our hardware-cognizant software simulator indicate an average execution overhead between 5 and 45 times for the encrypted computation (depending on the security parameter), compared to the unencrypted variant, for a 16-bit single instruction architecture.
10:00	IP4-14, 836	EDA TOOLS TRUST EVALUATION THROUGH SECURITY PROPERTY PROOFS Speaker: Yier Jin, The University of Central Florida, US Abstract The security concerns of EDA tools have long been ignored because IC designers and integrators only focus on their functionality and performance. This lack of trusted EDA tools hampers hardware security researchers' efforts to design trusted integrated circuits. To address this concern, a novel EDA tools trust evaluation framework has been proposed to ensure the trustworthiness of EDA tools through its functional operation, rather than scrutinizing the software code. As a result, the newly proposed framework lowers the evaluation cost and is a better fit for hardware security researchers. To support the EDA tools evaluation framework, a new gate-level information assurance scheme is developed for security property checking on any gate-level netlist. Helped by the gate-level scheme, we expand the territory of proof-carrying based IP protection from RT-level designs to gate-level netlist, so that most of the commercially trading third-party IP cores are under the protection of proof-carrying based security properties. Using a sample AES encryption core, we successfully prove the trustworthiness of Synopsys Design Compiler in generating a synthesized netlist.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
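
Illustrative note on 9.3.4: the abstract relies on the additive homomorphism of Paillier encryption, where multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The sketch below shows that property for the textbook Paillier scheme only; it is not the variant used by the authors. The toy primes and the function names (keygen, encrypt, decrypt, add_encrypted) are assumptions chosen for illustration and are far too small for real security.

```python
import math
import random


def keygen(p, q):
    """Textbook Paillier key generation from two primes (toy sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # common simple choice of generator
    # L(u) = (u - 1) // n; mu is the inverse of L(g^lam mod n^2) modulo n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m, rng):
    """Encrypt 0 <= m < n with fresh randomness r coprime to n."""
    n, g = pub
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c):
    """Recover the plaintext from ciphertext c."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(pub, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts m1 + m2 mod n."""
    n, _ = pub
    return (c1 * c2) % (n * n)


# Hypothetical toy parameters: n = 293 * 433 is large enough for 16-bit operands.
pub, priv = keygen(293, 433)
rng = random.Random(1)
c1 = encrypt(pub, 1200, rng)
c2 = encrypt(pub, 345, rng)
total = decrypt(pub, priv, add_encrypted(pub, c1, c2))
```

Here the addition is carried out on c1 and c2 without decrypting either operand, which is the kind of operation an encrypted single-instruction datapath would need to support natively.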

9.4 Timing challenges in validation

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 2

Chair:
Elena Ioana Vatajelu, Politecnico di Torino, IT

Co-Chair:
Mark Zwolinski, University of Southampton, UK

Accelerated timing simulation is essential for today's chip designs, whether it is performed at the gate level or at the system level. This session provides solutions to address the challenges of timing analysis and timing validation performance across multiple levels of design abstraction.

Time	Label	Presentation Title Authors
08:30	9.4.1	FAST STA PREDICTION-BASED GATE-LEVEL TIMING SIMULATION Speakers: Tariq Bashir Ahmad and Maciej Ciesielski, UMASS Amherst, US Abstract Traditional dynamic simulation with standard delay format (SDF) back-annotation cannot be reliably performed on large designs. The large size of SDF files makes the event-driven timing simulation extremely slow as it has to process an excessive number of events. In order to accelerate gate-level timing simulation we propose an automated fast prediction-based gate-level timing simulation that combines static timing analysis (STA) at the block level with dynamic timing simulation at the I/O interfaces. We demonstrate that the proposed timing simulation can be done earlier in the design cycle in parallel with synthesis.
09:00	9.4.2	A CROSS-LEVEL VERIFICATION METHODOLOGY FOR DIGITAL IPS AUGMENTED WITH EMBEDDED TIMING MONITORS Speakers: Valerio Guarnieri¹, Massimo Petricca², Alessandro Sassone², Sara Vinco¹, Nicola Bombieri¹, Franco Fummi³, Enrico Macii² and Massimo Poncino² ¹University of Verona, IT; ²Politecnico di Torino, IT; ³Universita' di Verona, IT Abstract Smart systems implement the leading technology advances in the context of embedded devices. Current design methodologies are not suitable to deal with tightly interacting subsystems of different technological domains, namely analog, digital, discrete and power devices, MEMS and power sources. The effects of interaction between components and with the environment must be modeled and simulated at system level to achieve high performance. Focusing on the digital domain, additional design constraints have to be considered as a result of the integration of multi-domain subsystems in a single device. The main digital design challenges, combined with those emerging from the heterogeneous nature of the whole system, directly impact on performance and on propagation delay of the digital component. This paper proposes a design approach to enhance the RTL model of a given digital component for the integration in smart systems, and a methodology to verify the added features at system-level. The design approach consists of augmenting the RTL model through the automatic insertion of delay sensors, which can detect and correct timing failures. The augmented model is abstracted to SystemC TLM and, then, mutants (i.e., code mutations for emulating timing failures) are automatically injected into the model. Experimental results demonstrate the applicability of the proposed design and verification methodology and the effectiveness of the simulation performance.
09:30	9.4.3	EMPOWERING STUDY OF DELAY BOUND TIGHTNESS WITH SIMULATED ANNEALING Speakers: Xueqian Zhao and Zhonghai Lu, KTH Royal Institute of Technology, SE Abstract Studying the delay bound tightness typically takes a practical approach by comparing simulated results against analytic results. However, this is often a manual process in which many simulation parameters have to be configured before the simulations run, which is tedious and time-consuming. We propose a technique to automate this process by using a simulated annealing approach. We formulate the problem as an online optimization problem, and embed a simulated annealing algorithm in the simulation environment to guide the search for configuration parameters which give good tightness results. This is a fully automated procedure and thus provides a promising path to automatic design space exploration in similar contexts. Experimental results for an all-to-one communication network with a large search space and complicated constraints illustrate the effectiveness of our method.
10:00	IP4-15, 665	(Best Paper Award Candidate) ANALYSIS AND EVALUATION OF PER-FLOW DELAY BOUND FOR MULTIPLEXING MODELS Speakers: Yanchen Long¹, Zhonghai Lu² and Xiaolang Yan³ ¹Zhejiang University and KTH Royal Institute of Technology, SE; ²KTH Royal Institute of Technology, SE; ³Zhejiang University, CN Abstract Multiplexing models are common in resource sharing communication media such as buses, crossbars and networks. While sending packets over a multiplexing node, the packet delay bound can be computed using network calculus models. The tightness of such delay bound remains an open problem. This paper studies the multiplexing models for weighted round robin scheduling with different traffic arrival curves, and analyzes per-flow packet delay bounds with different service properties. We empirically evaluate the tightness of the delay bounds. Our results show the quality of different analysis models, and how influential each parameter is to tightness.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
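
Illustrative note on 9.4.3: the abstract describes using simulated annealing to search simulation configuration parameters for good delay bound tightness. The sketch below is a generic simulated annealing loop, not the authors' implementation. The cost function toy_gap stands in for "analytic bound minus simulated delay" and is purely hypothetical, as are the two integer parameters, their range 0..15, and the cooling schedule.

```python
import math
import random


def simulated_annealing(cost, neighbour, start, rng,
                        steps=2000, t0=1.0, cooling=0.995):
    """Minimise cost(config); returns the best configuration seen and its cost."""
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(steps):
        cand = neighbour(current, rng)
        cand_cost = cost(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling  # geometric cooling schedule (an assumption)
    return best, best_cost


def toy_gap(cfg):
    """Hypothetical tightness gap; a real setup would run a simulation here."""
    burst, offset = cfg
    return (burst - 7) ** 2 + (offset - 3) ** 2


def neighbour(cfg, rng):
    """Move one randomly chosen parameter by +/-1, clamped to the range 0..15."""
    i = rng.randrange(len(cfg))
    new = list(cfg)
    new[i] = min(15, max(0, new[i] + rng.choice((-1, 1))))
    return tuple(new)


start = (0, 0)
best_cfg, best_gap = simulated_annealing(toy_gap, neighbour, start,
                                         random.Random(42))
```

Because the loop tracks the best configuration seen so far, the returned gap is never worse than that of the starting configuration, whatever the random seed.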

9.5 Hot Topic: Connecting Different Worlds - Technology Abstraction for Reliability-Aware Design and Test

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 3

Organisers:
Ulf Schlichtmann, Technische Universität München, DE
Andreas Herkersdorf, Technische Universität München, DE

Chair:
Nikil Dutt, University of California, Irvine, US

Co-Chair:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE

The rapid shrinking of device geometries in the nanometer regime requires new technology-aware design methodologies. These must be able to evaluate the resilience of the circuit throughout all System on Chip (SoC) abstraction levels. To successfully guide design decisions at the system level, reliability models, which abstract technology information, are required to identify those parts of the system where additional protection in the form of hardware or software countermeasures is most effective. Interfaces such as the presented Resilience Articulation Point (RAP) or the Reliability Information Interchange Format (RIIF) are required to enable EDA-assisted analysis and propagation of reliability information. The models are discussed from different perspectives, such as design and test.

Time	Label	Presentation Title Authors
08:30	9.5.1	INTRODUCTION TO RAP (RESILIENCE ARTICULATION POINT) Speaker: Andreas Herkersdorf, TU München, DE
08:45	9.5.2	SYSTEM LEVEL DESIGN USING RAP (RESILIENCE ARTICULATION POINT) Speaker: Ulf Schlichtmann, Technische Universität München, DE Abstract We will demonstrate how technology characteristics can be included in system-level reliability analysis using the RAP (Resilience Articulation Point) model. The specific example of a two-wheeled robot will be used.
09:00	9.5.3	CROSS-LAYER RELIABILITY IN THE DESIGN OF AN ERROR RESILIENT COMMUNICATION SYSTEM Speaker: Norbert Wehn, University of Kaiserslautern, DE
09:15	9.5.4	RIIF - TOWARD A STANDARD APPROACH FOR CREATING RELIABILITY MODELS FOR COMPLEX SILICON DEVICES Speaker: Adrian Evans, IROC Technologies, FR Abstract Complex silicon devices are increasingly controlling critical systems where safety and reliability are key concerns. Silicon technology is subject to numerous failure modes which can be broadly classified into soft-error effects (due to natural radiation) and life-time effects (e.g. electro-migration, NBTI, HCI). It is necessary to consider all of these failure modes and how they propagate through the system and produce user-visible effects. There are no consistent tools or methodologies to address this problem. Current ad-hoc approaches are not able to cope with the diversity of technology failure modes, increased design sizes and the complex relationships between consumers and suppliers of electronic components. RIIF (Reliability Information Interchange Format) is an initiative to develop a standard modelling language for specifying the failure mechanisms in silicon devices and systems built using these devices. In this session we give a brief overview of RIIF and present an example that highlights some of the challenges in reliability modelling.
09:30	9.5.5	TEST PERSPECTIVES Speaker: Jacob Abraham, UT Austin, US
09:45	9.5.6	INDUSTRIAL PERSPECTIVE Speaker: Sani Nassif, IBM, US
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

9.6 Schedulability analysis

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 4

Chair:
Giuseppe Lipari, ENS - Cachan, FR

Co-Chair:
Benny Akesson, CTU Prague, CZ

This session deals with scheduling and schedulability analysis of real-time systems. In particular, it presents different models of scheduling, including fixed and dynamic priority, and real-time calculus.

Time	Label	Presentation Title Authors
08:30	9.6.1	RATE-ADAPTIVE TASKS: MODEL, ANALYSIS, AND DESIGN ISSUES Speakers: Giorgio Buttazzo¹, Enrico Bini² and Darren Buttle³ ¹Scuola Superiore Sant'Anna, IT; ²Lund University, SE; ³ETAS-PGA/PRM-E, DE Abstract In automotive systems, some of the engine control tasks are triggered by specific crankshaft rotation angles and are designed to adapt their functionality based on the angular velocity of the engine. This paper proposes a new task model for specifying this type of real-time activity and presents an approach for analyzing the system feasibility under deadline scheduling for different scenarios. In particular, a feasibility test is derived for tasks under steady-state conditions (constant speed), as well as in dynamic conditions (constant acceleration). A design method is also discussed to determine the most suitable switching speeds for adapting the functionality of tasks without exceeding a desired utilization. Finally, a number of research directions are highlighted to extend the current results to more complex and realistic scenarios.
09:00	9.6.2	ACCEPTANCE AND RANDOM GENERATION OF EVENT SEQUENCES UNDER REAL TIME CALCULUS CONSTRAINTS Speakers: Kajori Banerjee and Pallab Dasgupta, Indian Institute of Technology Kharagpur, IN Abstract Simulation platforms for complex networked real time systems require random input pattern generators for simulating input distributions. They also require monitors for checking whether the output of the system satisfies the desired throughput. In this paper we study the acceptance and generation problems in a setting where the constraints defining the input distributions as well as the constraints defining the expected output distributions are specified in real time calculus (RTC). We prove that event patterns satisfying a given set of RTC constraints can be described by an omega-regular language. We propose a method for constructing an automaton that can be used for online generation of random admissible event patterns. This is significant, considering the known problems of deadlock in less informed generators for streams satisfying RTC constraints.
09:30	9.6.3	GENERAL AND EFFICIENT RESPONSE TIME ANALYSIS FOR EDF SCHEDULING Speakers: Nan Guan and Wang Yi, Uppsala University, SE Abstract Response Time Analysis (RTA) is one of the key problems in real-time system design. This paper proposes new RTA methods for EDF scheduling, with general system models where workload and resource availability are represented by request/demand bound functions and supply bound functions. The main idea is to derive response time upper bounds by lower-bounding the slack times. We first present a simple over-approximate RTA method, which lower bounds the slack time by measuring the "horizontal distance" between the demand bound function and the supply bound function. Then we present an exact RTA method based on the above idea but eliminating the pessimism in the first analysis. This new exact RTA method not only allows more general system models to be analyzed precisely than existing EDF RTA techniques do, but also significantly improves analysis efficiency. Experiments are conducted to show the efficiency improvement of our new RTA technique, and tradeoffs between the analysis precision and efficiency of the two methods in this paper are discussed.
09:45	9.6.4	THE SCHEDULABILITY REGION OF TWO-LEVEL MIXED-CRITICALITY SYSTEMS BASED ON EDF-VD Speakers: Dirk Mueller and Alejandro Masrur, TU Chemnitz, DE Abstract The algorithm Earliest Deadline First with Virtual Deadlines (EDF-VD) was recently proposed to schedule mixed-criticality task sets consisting of high-criticality (HI) and low-criticality (LO) tasks. EDF-VD distinguishes between HI and LO mode. In HI mode, the HI tasks may require executing for longer than in LO mode. As a result, in LO mode, EDF-VD assigns virtual deadlines to HI tasks (i.e., it uniformly downscales deadlines of HI tasks) to account for an increase of workload in HI mode. Different schedulability conditions have been proposed in the literature; however, the schedulability region to fully characterize EDF-VD has not been investigated so far. In this paper, we review EDF-VD's schedulability criteria and determine its schedulability region to better understand and design mixed-criticality systems. Based on this result, we show that EDF-VD has a schedulability region around 85% larger than that of the Worst-Case Reservations (WCR) approach.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
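
Illustrative note on 9.6.4: the abstract describes how EDF-VD uniformly downscales the deadlines of HI tasks in LO mode. The sketch below computes such a scaling factor using a sufficient utilization-based condition commonly cited in the EDF-VD literature for implicit-deadline task sets on one processor. It is an assumption that this condition is among those reviewed in the paper, and it is not the schedulability region derived there. The function name and the three utilization parameters are hypothetical.

```python
def edf_vd_scaling_factor(u_lo_lo, u_hi_lo, u_hi_hi):
    """Return a deadline scaling factor x in (0, 1], or None if the test fails.

    u_lo_lo: total utilization of LO tasks (LO-mode budgets)
    u_hi_lo: total utilization of HI tasks using their LO-mode budgets
    u_hi_hi: total utilization of HI tasks using their HI-mode budgets
    """
    # Worst-case reservations: plain EDF suffices, no downscaling needed.
    if u_lo_lo + u_hi_hi <= 1.0:
        return 1.0
    if u_lo_lo >= 1.0:
        return None
    # Smallest factor that keeps LO mode schedulable with virtual deadlines.
    x = u_hi_lo / (1.0 - u_lo_lo)
    # HI-mode condition assumed from the literature: x * U_LO^LO + U_HI^HI <= 1.
    if x * u_lo_lo + u_hi_hi <= 1.0:
        return x
    return None


# Hypothetical task-set utilizations.
x_ok = edf_vd_scaling_factor(0.4, 0.3, 0.7)    # fails WCR, passes with x = 0.5
x_fail = edf_vd_scaling_factor(0.6, 0.3, 0.9)  # rejected by this test
x_wcr = edf_vd_scaling_factor(0.3, 0.2, 0.5)   # worst-case reservations suffice
```

The first example shows a task set whose worst-case utilizations sum to more than 1, so worst-case reservations cannot accept it, while the virtual-deadline test still does.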

9.7 Timing Analysis and Cell Design

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 5

Chair:
Jose Monteiro, INESC-ID / Tecnico, ULisboa, PT

Co-Chair:
Elena Dubrova, Royal Institute of Technology, SE

The papers in this session present static timing techniques and tools for the analysis and synthesis of logic circuits. The papers take into account new aspects of timing analysis like variability, leakage and sign-off.

Time	Label	Presentation Title Authors
08:30	9.7.1	(Best Paper Award Candidate) FACILITATING TIMING DEBUG BY LOGIC PATH CORRESPONDENCE Speakers: Oshri Adler, Eli Arbel, Ilia Averbouch, Ilan Beer and Inna Grijnevitch, IBM, IL Abstract Synthesis tools for high-performance VLSI designs employ aggressive logic optimization techniques in order to meet physical requirements such as area and cycle time. During these optimizations, the original structure of the design, which is usually written in a hardware description language (HDL), is lost. It is difficult, and often impossible, to relate signals after synthesis to the original signals in the HDL code. Some signals only lose their names while for others there are no equivalent counterparts in the design after synthesis. Debugging timing problems is based on timing reports which are usually represented in terms of the post-synthesis design. Hence, it is difficult to relate critical paths in the timing reports to the relevant paths in the HDL code when a logic fix is needed. In this paper, we propose a different approach for dealing with the correspondence problem: instead of trying to relate signals we relate paths. Given a critical path in a post-synthesis representation, our method is able to find all corresponding paths in the pre-synthesis (HDL) representation. As a result, locating the parts in the HDL which are relevant to the given timing problem becomes trivial. A novel SAT-based algorithm for dealing with the path-correspondence problem is described. Experimental results on various industrial high-end processor designs show the effectiveness of our algorithm in substantially reducing the number of paths in the HDL which one will have to consider when debugging a given critical path.
09:00	9.7.2	STATISTICAL STATIC TIMING ANALYSIS USING A SKEW-NORMAL CANONICAL DELAY MODEL Speakers: Vijaykumar M and V Vasudevan, Department of Electrical Engineering, Indian Institute of Technology Madras, IN Abstract In its simplest form, a parameterized block based statistical static timing analysis (SSTA) is performed by assuming that both gate delays and the arrival times at various nodes are Gaussian random variables. These assumptions are not true in many cases. Quadratic models are used for more accurate analysis, but at the cost of increased computational complexity. In this paper, we propose a model based on skew-normal random variables. It can take into account the skewness in the gate delay distribution as well as the nonlinearity of the MAX operation. We derive analytical expressions for the moments of the MAX operator based on the conditional expectations. The computational complexity of using this model is marginally higher than the linear model based on Clark's approximations. The results obtained using this model match well with Monte-Carlo simulations.
09:30	9.7.3	LEAKAGE-POWER-AWARE CLOCK PERIOD MINIMIZATION Speakers: Hua-Hsin Yeh¹, Shih-Hsu Huang¹ and Yow-Tyng Nieh² ¹Chung Yuan Christian University, TW; ²Industrial Technology Research Institute, TW Abstract In the design of nonzero clock skew circuits, an increase of the path delay may improve circuit speed and reduce leakage power. However, the impact of increasing path delay on the trade-off between circuit speed and leakage power has not been well studied. In this paper, we propose a two-step approach for leakage-power-aware clock period minimization. Compared with previous works, our approach makes the following two significant contributions. First, our approach is the first leakage-power-aware clock skew scheduling that can guarantee working with the lower bound of the clock period. Second, our approach is also the first work to demonstrate that the problem of minimizing the number of extra buffers can be solved in polynomial time. Benchmark data show that our approach achieves the best results in terms of the clock period and the leakage power.
09:45	9.7.4	A DEEP LEARNING METHODOLOGY TO PROLIFERATE GOLDEN SIGNOFF TIMING Speakers: Seung-Soo Han¹, Andrew B. Kahng², Siddhartha Nath² and Ashok S. Vydyanathan² ¹Myongji University, Yongin, KR; ²University of California, San Diego, US Abstract Signoff timing analysis remains a critical element in the IC design flow. Multiple signoff corners, libraries, design methodologies, and implementation flows make timing closure very complex at advanced technology nodes. Reported timing slacks directly affect chip area and power by forcing additional buffering or sizing (negative slacks), or limiting area and power recovery (positive slacks). Design teams often wish to ensure that one tool's timing reports are neither optimistic nor pessimistic with respect to another tool's reports. The resulting "correlation" problem is highly complex because tools contain millions of lines of black-box and legacy code, licenses prevent any reverse-engineering of algorithms, and the nature of the problem is seemingly "unbounded" across possible designs, timing paths, and electrical parameters. In this work, we apply a "big-data" mindset to approach the timer correlation problem. We develop a machine learning-based tool, Golden Timer eXtension (GTX), to correct divergence in flip-flop setup time, cell arc delay, wire delay, stage delay, and path slack at timing endpoints between timers. Our models are developed with datasets of >300K data points for cell, wire, and stage delays and >30K data points for path slack and flip-flop setup time. We propose a methodology to apply GTX to two arbitrary timers, and we evaluate scalability of GTX across multiple designs and foundry technologies / libraries, both with and without signal integrity analysis. Our experimental results show reduction in divergence between timing tools from 139.3ps to 21.1ps (i.e., 6.6×) in endpoint slack, from 25.6ps to 2.4ps (i.e., 10× reduction) in flip-flop setup time, from 454.4ps to 51.9ps (i.e., 8.7× reduction) in cell delay, from 194.8ps to 17.4ps (i.e., 11.2× reduction) in wire delay, and from 117ps to 23.8ps (4.9× reduction) in stage delay. The average (mean) divergence in timing reports after applying GTX is almost zero. We further demonstrate the incremental application of our methods so that models can be adapted to any outlier discrepancies when new designs are taped out in the same technology / library. Last, we demonstrate that GTX can also correlate timing reports between signoff and design implementation tools.
10:00	IP4-17, 759	AGING-AWARE STANDARD CELL LIBRARY DESIGN Speakers: Saman Kiamehr¹, Farshad Firouzi¹, Mojtaba Ebrahimi² and Mehdi Tahoori² ¹Karlsruhe Institute of Technology (KIT), DE; ²Karlsruhe Institute of Technology, DE Abstract Transistor aging, mostly due to Bias Temperature Instability (BTI), is one of the major unreliability sources at nano-scale technology nodes. BTI causes the circuit delay to increase and eventually leads to a decrease in the circuit lifetime. Typically, standard cells in the library are optimized according to the design-time delay; however, due to the asymmetric effect of BTI, the rise and fall delays might become significantly imbalanced over the lifetime. In this paper, the BTI effect is mitigated by balancing the rise and fall delays of the standard cells at the expected lifetime. We find an optimal trade-off between the increase in the size of the library and the lifetime improvement (timing margin reduction) by non-uniform extension of the library cells for various ranges of the input signal probabilities. The simulation results reveal that our technique can prolong the circuit lifetime by around 150% with a negligible area overhead.
10:01	IP4-18, 279	PASS-XNOR LOGIC: A NEW LOGIC STYLE FOR P-N JUNCTION BASED GRAPHENE CIRCUITS Speakers: Valerio Tenace, Andrea Calimera, Enrico Macii and Massimo Poncino, Politecnico di Torino, IT Abstract In this work we introduce a new logic style for p-n junction based digital graphene circuits: the pass-XNOR logic style. The latter enables the realization of compact, energy efficient circuits that better exploit the characteristics of graphene. We first show how a single p-n junction can be conceived as a pass-XNOR gate, i.e., a transmission gate with embedded logic functionality, the XNOR Boolean operator. Secondly, we propose a smart integration strategy in which series/parallel connections of pass-XNOR gates allow AND/OR logical conjunctions, and therefore all possible truth tables, to be implemented. Experimental results conducted on a set of representative logic functions show the superiority of pass-XNOR logic circuits w.r.t. standard CMOS circuits and graphene circuits that use p-n junctions in a complementary-like structure.
10:02	IP4-19, 365	MIXED ALLOCATION OF ADJUSTABLE DELAY BUFFERS COMBINED WITH BUFFER SIZING IN CLOCK TREE SYNTHESIS OF MULTIPLE POWER MODE DESIGNS Speakers: Kitae Park, Geunho Kim and Taewhan Kim, Seoul National University, KR Abstract Recently, many works have shown that an adjustable delay buffer (ADB), whose delay can be adjusted dynamically, can effectively solve the clock skew variation problem in designs with multiple power modes. However, all previous works on ADB allocation inherently entail two critical limitations: the delays adjusted by an ADB are always increments, and low-cost buffer sizing has either never been taken into account or not been the primary consideration. To demonstrate how effective overcoming these two limitations is in resolving the clock skew constraint, we characterize two types of ADBs, called CADB (capacitor based ADB) and IADB (inverter based ADB), show that the delays adjusted by an IADB can be decremented, and show that the clock skew violation in some clock trees of multiple power modes can be resolved by applying buffer sizing together with only a small number of IADBs and CADBs.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
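
Illustrative note on 9.7.2: the abstract compares its skew-normal model against the linear Gaussian model based on Clark's approximations for the MAX operation. The sketch below computes Clark's first two moments of the maximum of two jointly Gaussian arrival times. It illustrates only that Gaussian baseline, not the skew-normal model proposed in the paper, and the function name clark_max is an assumption.

```python
import math


def _pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clark_max(mu1, sigma1, mu2, sigma2, rho=0.0):
    """Mean and variance of max(X, Y) for jointly Gaussian X and Y.

    rho is the correlation coefficient between X and Y.
    """
    a2 = sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2
    if a2 <= 1e-18:
        # X - Y is (numerically) deterministic: the larger mean always wins.
        return (mu1, sigma1 ** 2) if mu1 >= mu2 else (mu2, sigma2 ** 2)
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    first = mu1 * _cdf(alpha) + mu2 * _cdf(-alpha) + a * _pdf(alpha)
    second = ((mu1 ** 2 + sigma1 ** 2) * _cdf(alpha)
              + (mu2 ** 2 + sigma2 ** 2) * _cdf(-alpha)
              + (mu1 + mu2) * a * _pdf(alpha))
    return first, max(0.0, second - first * first)


# Two independent standard normal arrival times.
mean_iid, var_iid = clark_max(0.0, 1.0, 0.0, 1.0)
# One arrival time dominates the other almost surely.
mean_dom, var_dom = clark_max(100.0, 1.0, 0.0, 1.0)
```

For two independent standard normal inputs the exact mean of the maximum is 1/sqrt(pi) and the exact variance is 1 - 1/pi, which gives a quick sanity check. The true maximum is not Gaussian, and that mismatch is the kind of error a skew-aware model aims to reduce.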

9.8 Embedded Tutorial: Memcomputing: the Cape of Good Hope

Date: Thursday 27 March 2014
Time: 08:30 - 10:00
Location / Room: Exhibition Theatre

Organisers:
Yiyu Shi, Missouri University of Science & Technology, US
Hung-Ming Chen, National Chiao Tung University, TW

Chair:
Tsung-Yi Ho, CSIE, NCKU, TW

Co-Chair:
Hung-Ming Chen, National Chiao Tung University, TW

Energy efficiency has emerged as a major barrier to performance scalability for modern processors. On the other hand, significant breakthroughs have been achieved in memory technologies recently. As such, the fascinating idea of memcomputing (i.e., using memory for computation purposes) has drawn wide attention from both academia and industry as an effective remedy. Compared with conventional logic computing, a memory array provides a large set of parallel resources with high bandwidth, which can be configured to perform computing in a spatial/temporal manner, leading to a dramatic reduction in processor-memory traffic. Moreover, memory computing brings the computing engine close to the data, thus drastically minimizing the von Neumann bottleneck. Finally, it exploits the advances in memory technologies and integration approaches (e.g. 3D integration) to achieve better technology scalability. This special session offers a broad-spectrum retreat (devices, processes and systems) on this hot topic to the general CAD community, hoping to inspire more contributions from the design automation perspective.

Time	Label	Presentation Title Authors
08:30	9.8.1	MEMCOMPUTING: A BRAIN-INSPIRED COMPUTING PARADIGM Speaker: Yuriy Pershin, University of South Carolina, US
09:00	9.8.2	MSIM: A GENERAL CYCLE ACCURATE SIMULATION PLATFORM FOR MEMCOMPUTING STUDIES Speakers: Chun Zhang¹, Peng Deng², Hui Geng¹, Jianming Liu¹, Qi Zhu², Jinjun Xiong³ and Yiyu Shi¹ ¹Missouri University of Science and Technology, US; ²University of California, Riverside, US; ³IBM T.J. Watson Research Center, US Abstract The lack of an accurate yet publicly available simulation infrastructure has puzzled researchers in the memcomputing area for some time. In this paper, we propose for the first time a full tool chain called MSim that supports cycle-accurate microarchitecture level simulation for memcomputing studies. With MSim, the performance gains of utilizing memcomputing for arbitrary applications on user configurable computer system architectures can be evaluated with high accuracy. In addition, MSim provides flexible interfaces with pervasive object-oriented design, which makes it well-suited as a base platform for researchers to explore new memcomputing technologies.
09:30	9.8.3	ENERGY-EFFICIENT HARDWARE ACCELERATION THROUGH COMPUTING IN THE MEMORY Speakers: Somnath Paul¹, Robert Karam², Swarup Bhunia² and Ruchir Puri³ ¹Intel Corporation, US; ²Case Western Reserve University, US; ³IBM Watson Research Center, US Abstract Energy-efficiency has emerged as a major barrier to performance scalability for modern processors. We note that a significant part of a processor's energy requirement is contributed by processor-memory communication. To address the energy issue in processors, we propose a novel hardware accelerator framework that transforms a high-density memory array into a configurable computing resource to accelerate a variety of tasks - both compute- and data-intensive. It exploits the block-based architecture of nanoscale memory to create a spatially connected array of lightweight processors, each of which uses a memory block as its local memory. The proposed framework provides some unique advantages for hardware acceleration compared to conventional accelerators: 1) the memory array provides a large set of parallel resources with high bandwidth, which can be configured to perform computing in a spatio-temporal manner, leading to a dramatic reduction in processor-memory traffic; 2) it brings the computing engine close to the data, thus drastically minimizing the von Neumann bottleneck; 3) finally, it exploits the advances in memory technologies and integration approaches, e.g. 3D integration, to achieve better technology scalability compared to alternative reconfigurable accelerator platforms. Simulation results for several data-intensive applications show that the proposed computing approach provides significant improvement in energy-efficiency compared to software while achieving significantly lower hardware overhead.
10:00		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

IP4 Interactive Presentations

Date: Thursday 27 March 2014
Time: 10:00 - 10:30
Location / Room: Conference Level, foyer

Label	Presentation Title Authors
IP4-1	A MULTIPLE FAULT INJECTION METHODOLOGY BASED ON CONE PARTITIONING TOWARDS RTL MODELING OF LASER ATTACKS Speakers: Athanasios Papadimitriou¹, David Hely¹, Vincent Beroulle¹, Paolo Maistri² and Regis Leveugle³ ¹LCIS Laboratory - Grenoble INP, FR; ²TIMA Laboratory / CNRS, FR; ³TIMA Laboratory / Grenoble INP, FR Abstract Laser attacks, especially on circuits manufactured with recent deep submicron semiconductor technologies, pose a threat to secure integrated circuits due to the multiplicity of errors induced by a single attack. An efficient way to neutralize such effects is the design of appropriate countermeasures, according to the circuit implementation and characteristics. Therefore tools which allow the early evaluation of security implementations are necessary. Our efforts involve the development of an RTL fault injection approach more representative of laser attacks than random multi-bit fault injections and the utilization and evolution of state of the art emulation techniques to reduce the duration of the fault injection campaigns. This will ultimately lead to the design and validation of new countermeasures against laser attacks, on ASICs implementing cryptographic algorithms.
IP4-2	ENERGY EFFICIENT DATA FLOW TRANSFORMATION FOR GIVENS ROTATION BASED QR DECOMPOSITION Speakers: Namita Sharma¹, Preeti Ranjan Panda¹, Min Li², Prashant Agrawal² and Francky Catthoor² ¹Indian Institute of Technology Delhi, IN; ²IMEC, BE Abstract QR Decomposition (QRD) is a typical matrix decomposition algorithm that shares many common features with other algorithms such as LU and Cholesky decomposition. The principle can be realized in a large number of valid processing sequences that differ significantly in the number of memory accesses and computations, and hence, the overall implementation energy. With modern low power embedded processors evolving towards register files with wide memory interfaces and vector functional units (FUs), the data flow in matrix decomposition algorithms needs to be carefully devised to achieve energy efficient implementation. In this paper, we present an efficient data flow transformation strategy for the Givens Rotation based QRD that optimizes data memory accesses. We also explore different possible implementations for QRD of multiple matrices using the SIMD feature of the processor. With the proposed data flow transformation, a reduction of up to 36% is achieved in the overall energy over conventional QRD sequences.
IP4-3	MODE-CONTROLLED DATAFLOW BASED MODELING & ANALYSIS OF A 4G-LTE RECEIVER Speakers: Hrishikesh Salunkhe¹, Orlando Moreira² and Kees van Berkel³ ¹PhD Candidate, NL; ²Principal DSP Systems Engineer, NL; ³Prof. Dr., NL Abstract Today's smartphones and tablets contain multiple cellular modems to support 2G/3G/4G standards, including Long Term Evolution (LTE). They run on complex multi-processor hardware platforms and have to meet hard real-time constraints. Dataflow modeling can be used to design an LTE receiver. Static dataflow allows a rich set of analysis techniques, but is too restrictive to model the dynamic behavior in many realistic applications, including LTE receivers. Dynamic dataflow allows modeling of many realistic applications, but does not support rigorous temporal analysis. Mode-Controlled Dataflow (MCDF) is a restricted form of dynamic dataflow, and allows the same analysis techniques as static dataflow, in principle. We prove that MCDF is sufficiently expressive to handle the dynamic behavior of a realistic LTE receiver, by systematically and stepwise developing a complete MCDF model for an LTE receiver.
IP4-4	MODEL-BASED ACTOR MULTIPLEXING WITH APPLICATION TO COMPLEX COMMUNICATION PROTOCOLS Speakers: Christian Zebelein¹, Christian Haubelt¹, Joachim Falk², Tobias Schwarzer² and Jürgen Teich² ¹University of Rostock, DE; ²University of Erlangen-Nuremberg, DE Abstract We propose a dynamic scheduling approach for the concurrent execution of logical actor instances on a single synthesized actor instance. Based on a formal dataflow model of computation, the proposed approach can be applied to a wide range of applications in a model-based design flow. As a case study, we evaluate a bus-cycle-accurate SystemC RTL model based on an InfiniBand network adapter in a PCI Express system.
IP4-5	A NOVEL MODEL FOR SYSTEM-LEVEL DECISION MAKING WITH COMBINED ASP AND SMT SOLVING Speakers: Alexander Biewer¹, Jens Gladigau¹ and Christian Haubelt² ¹Robert Bosch GmbH, DE; ²University of Rostock, DE Abstract In this paper, we present a novel model enabling system-level decision making for time-triggered many-core architectures in automotive systems. The proposed application model includes shared data entities that need to be bound to memories during decision making. As a key enabler to our approach, we explicitly separate computation and shared memory communication over a network-on-chip (NoC). To deal with contention on a NoC, we model the necessary basis to implement a time-triggered schedule that guarantees freedom of interference. We compute fundamental design decisions, namely (a) spatial binding, (b) multi-hop routing, and (c) time-triggered scheduling, by a novel coupling of answer set programming (ASP) with satisfiability modulo theories (SMT) solvers. First results of an automotive case study demonstrate the applicability of our method for complex real-world applications.
IP4-6	DESPERATE: SPEEDING-UP DESIGN SPACE EXPLORATION BY USING PREDICTIVE SIMULATION SCHEDULING Speakers: Giovanni Mariani, Gianluca Palermo, Vittorio Zaccaria and Cristina Silvano, Politecnico di Milano, IT Abstract Design Space Exploration (DSE) is the problem of finding the best architecture configuration in a platform-based design problem. To accurately evaluate a configuration, computationally expensive simulations are required. A common approach to reduce DSE execution time is to use analytic performance prediction models to approximate some of the required simulations, thus pruning the design space by removing bad configuration candidates. In this paper we will demonstrate that state-of-the-art analytic techniques to speed up the DSE process are not capable of fully exploiting the potential of a parallel simulation environment. We will demonstrate that, when different simulations can be run in parallel, predicting simulation time to better schedule the simulations on the parallel simulation environment is a more profitable approach, with a speedup of more than 2x compared to state-of-the-art approaches.
IP4-7	COMIK: A PREDICTABLE AND CYCLE-ACCURATELY COMPOSABLE REAL-TIME MICROKERNEL Speakers: Andrew Nelson¹, Ashkan Beyranvand Nejad¹, Anca Molnos², Martijn Koedam³ and Kees Goossens³ ¹TU Delft, NL; ²CEA Leti, FR; ³TU Eindhoven, NL Abstract The functionality of embedded systems is ever increasing. This has led to mixed time-criticality systems, where applications with a variety of real-time requirements co-exist on the same platform and share resources. Due to inter-application interference, verifying the real-time requirements of such systems is generally non-trivial. In this paper, we present the CoMik microkernel that provides temporally predictable and composable processor virtualisation. CoMik's virtual processors are cycle-accurately composable, i.e. their timing cannot affect the timing of co-existing virtual processors by even a single cycle. Real-time applications executing on dedicated virtual processors can therefore be verified and executed in isolation, simplifying the verification of mixed time-criticality systems. We demonstrate these properties through experimentation on an FPGA prototyped hardware platform.
IP4-8	UTILIZATION-AWARE LOAD BALANCING FOR THE ENERGY EFFICIENT OPERATION ON THE BIG.LITTLE PROCESSOR Speakers: Myungsun Kim¹, Kibeom Kim², James Geraci¹ and Seongsoo Hong³ ¹Samsung Electronics, KR; ²Samsung Electronics, KR; ³Seoul National University, KR Abstract ARM's big.LITTLE architecture introduces the opportunity to optimize power consumption by selecting the core type most suitable for a level of processing demand. To take advantage of this new axis of optimization, we introduce the processor utilization factor into the Linux kernel's load balancing algorithm after carefully analyzing the power management mechanism of the big.LITTLE processor's port of Linux and deriving its state diagram representation. Our mechanism improves the Linux kernel's ability to assign tasks to cores in an energy efficient manner without having to make it directly aware of the available core types. Our experiments with a real test bed show that our algorithm improves energy consumption over the standard Linux scheduler by up to 11.35% with almost no corresponding reduction in performance.
IP4-9	HEVCDTM: APPLICATION-DRIVEN DYNAMIC THERMAL MANAGEMENT FOR HIGH EFFICIENCY VIDEO CODING Speakers: Daniel Palomino¹, Muhammad Shafique², Hussam Amrouch², Altamiro Susin³ and Jörg Henkel² ¹Karlsruhe Institute of Technology (KIT), BR; ²Karlsruhe Institute of Technology (KIT), DE; ³Federal University of Rio Grande do Sul, BR Abstract This paper presents an application-driven algorithm for Dynamic Thermal Management (DTM) for High Efficiency Video Coding (HEVC). For efficient design of such a DTM policy, we perform an offline thermal analysis of an HEVC encoder and demonstrate the impact of different video sequences and different coding configurations on the processor temperature. Our thermal analysis is leveraged to develop an efficient application-driven DTM policy that performs temperature-aware coding along with an application-driven control of DTM knobs (e.g., frequency scaling) in order to meet the temperature constraints while still providing high video quality (i.e. PSNR loss < 0.01dB). For accurate thermal analysis and evaluation, we deploy an infrared camera-based thermal measurement setup that, in contrast to state-of-the-art setups, does not require adding any extra layer on top of the measured chip, thus allowing the camera to accurately capture the infrared emissions from the die.
IP4-10	IMPROVING EFFICIENCY OF EXTENSIBLE PROCESSORS BY USING APPROXIMATE CUSTOM INSTRUCTIONS Speakers: Mehdi Kamal¹, Amin Ghasem Azar¹, Ali Afzali-Kusha¹ and Massoud Pedram² ¹University of Tehran, IR; ²University of Southern California, US Abstract In this paper, we propose to move the conventional extensible processor design flow to the approximate computing domain to gain more speedup. In this domain, the instruction set architecture (ISA) design flow selects both exact and approximate custom instructions (CIs). The proposed approach can be used for applications where imprecise results may be tolerated. In the CI identification phase of the flow, CIs which do not satisfy the maximum propagation delay but can provide approximate results may also be included in the CI candidate set. Next, in the selection phase, we propose a merit function which selects CIs with higher cycle savings and small error rates. The efficacy of the proposed approximate design flow is investigated using the case studies of the discrete cosine transform (DCT) and inverse DCT (iDCT) of the MPEG2 application. Also, the impact of process variation on the impreciseness of the results is investigated.
IP4-11	PROBABILISTIC STANDARD CELL MODELING CONSIDERING NON-GAUSSIAN PARAMETERS AND CORRELATIONS Speakers: André Lange¹, Christoph Sohrmann¹, Roland Jancke¹, Joachim Haase¹, Ingolf Lorenz² and Ulf Schlichtmann³ ¹Fraunhofer Institute for Integrated Circuits (IIS), Design Automation Division (EAS), DE; ²GLOBALFOUNDRIES Inc., DE; ³Technische Universität München, DE Abstract Variability continues to pose challenges to integrated circuit design. With statistical static timing analysis and high-yield estimation methods, solutions to particular problems exist, but they do not allow a common view on performance variability including potentially correlated and non-Gaussian parameter distributions. In this paper, we present a probabilistic approach for variability modeling as an alternative: model parameters are treated as multi-dimensional random variables. Such a fully multivariate statistical description formally accounts for correlations and non-Gaussian random components. Statistical characterization and model application are introduced for standard cells and gate-level digital circuits. Example analyses of circuitry in a 28 nm industrial technology illustrate the capabilities of our modeling approach.
IP4-12	DYNAMIC CONSTRUCTION OF CIRCUITS FOR REACTIVE TRAFFIC IN HOMOGENEOUS CMPS Speakers: Marta Ortín-Obón¹, Darío Suárez-Gracia¹, María Villaroya-Gaudó¹, Cruz Izu² and Víctor Viñals-Yúfera¹ ¹University of Zaragoza, ES; ²University of Adelaide, AU Abstract Networks on Chip (NoCs) have a large impact on system performance, area and energy. Considering the characteristics of the memory subsystem while designing the NoC helps identify improvement opportunities and build more efficient designs. Leveraging the frequent request-reply pattern, our proposal dynamically builds the reply path in advance, is able to share circuits between messages, and even removes some implicit replies, significantly reducing NoC latency. A careful implementation of this circuit reservation mechanism achieves an average 17% reduction in router energy consumption, 8% smaller router area and a 2% system performance increase, compared with its baseline counterpart.
IP4-13	IMPROVING HAMILTONIAN-BASED ROUTING METHODS FOR ON-CHIP NETWORKS: A TURN MODEL APPROACH Speakers: Poona Bahrebar and Dirk Stroobandt, Ghent University, BE Abstract The overall performance of Multi-Processor System-on-Chip (MPSoC) platforms depends highly on the efficient communication among their cores in the Network-on-Chip (NoC). Routing algorithms are responsible for the on-chip communication and traffic distribution through the network. Hence, designing efficient and high-performance routing algorithms is of significant importance. In this paper, a deadlock-free and highly adaptive path-based routing method is proposed without using virtual channels. This method strives to exploit the maximum number of minimal paths between any source and destination pair. The simulation results in terms of performance and power consumption demonstrate that the proposed method significantly outperforms the other adaptive and non-adaptive schemes. This efficiency is achieved by reducing the number of hotspots and smoothly distributing the traffic across the network.
IP4-14	EDA TOOLS TRUST EVALUATION THROUGH SECURITY PROPERTY PROOFS Speaker: Yier Jin, The University of Central Florida, US Abstract The security concerns of EDA tools have long been ignored because IC designers and integrators only focus on their functionality and performance. This lack of trusted EDA tools hampers hardware security researchers' efforts to design trusted integrated circuits. To address this concern, a novel EDA tools trust evaluation framework has been proposed to ensure the trustworthiness of EDA tools through its functional operation, rather than scrutinizing the software code. As a result, the newly proposed framework lowers the evaluation cost and is a better fit for hardware security researchers. To support the EDA tools evaluation framework, a new gate-level information assurance scheme is developed for security property checking on any gate-level netlist. Helped by the gate-level scheme, we expand the territory of proof-carrying based IP protection from RT-level designs to gate-level netlist, so that most of the commercially trading third-party IP cores are under the protection of proof-carrying based security properties. Using a sample AES encryption core, we successfully prove the trustworthiness of Synopsys Design Compiler in generating a synthesized netlist.
IP4-15	(Best Paper Award Candidate) ANALYSIS AND EVALUATION OF PER-FLOW DELAY BOUND FOR MULTIPLEXING MODELS Speakers: Yanchen Long¹, Zhonghai Lu² and Xiaolang Yan³ ¹Zhejiang University and KTH Royal Institute of Technology, SE; ²KTH Royal Institute of Technology, SE; ³Zhejiang University, CN Abstract Multiplexing models are common in resource sharing communication media such as buses, crossbars and networks. While sending packets over a multiplexing node, the packet delay bound can be computed using network calculus models. The tightness of such delay bound remains an open problem. This paper studies the multiplexing models for weighted round robin scheduling with different traffic arrival curves, and analyzes per-flow packet delay bounds with different service properties. We empirically evaluate the tightness of the delay bounds. Our results show the quality of different analysis models, and how influential each parameter is to tightness.
IP4-17	AGING-AWARE STANDARD CELL LIBRARY DESIGN Speakers: Saman Kiamehr¹, Farshad Firouzi¹, Mojtaba Ebrahimi² and Mehdi Tahoori² ¹Karlsruhe Institute of Technology (KIT), DE; ²Karlsruhe Institute of Technology, DE Abstract Transistor aging, mostly due to Bias Temperature Instability (BTI), is one of the major unreliability sources at nano-scale technology nodes. BTI causes the circuit delay to increase and eventually leads to a decrease in the circuit lifetime. Typically, standard cells in the library are optimized according to the design-time delay; however, due to the asymmetric effect of BTI, the rise and fall delays might become significantly imbalanced over the lifetime. In this paper, the BTI effect is mitigated by balancing the rise and fall delays of the standard cells at the expected lifetime. We find an optimal trade-off between the increase in the size of the library and the lifetime improvement (timing margin reduction) by non-uniform extension of the library cells for various ranges of the input signal probabilities. The simulation results reveal that our technique can prolong the circuit lifetime by around 150% with a negligible area overhead.
IP4-18	PASS-XNOR LOGIC: A NEW LOGIC STYLE FOR P-N JUNCTION BASED GRAPHENE CIRCUITS Speakers: Valerio Tenace, Andrea Calimera, Enrico Macii and Massimo Poncino, Politecnico di Torino, IT Abstract In this work we introduce a new logic style for p-n junction based digital graphene circuits: the pass-XNOR logic style. It enables the realization of compact, energy efficient circuits that better exploit the characteristics of graphene. We first show how a single p-n junction can be conceived as a pass-XNOR gate, i.e., a transmission gate with embedded logic functionality, the XNOR Boolean operator. Secondly, we propose a smart integration strategy in which series/parallel connections of pass-XNOR gates make it possible to implement AND/OR logical conjunctions, and, therefore, all possible truth tables. Experimental results conducted on a set of representative logic functions show the superiority of pass-XNOR logic circuits w.r.t. standard CMOS circuits and graphene circuits that use p-n junctions in a complementary-like structure.
IP4-19	MIXED ALLOCATION OF ADJUSTABLE DELAY BUFFERS COMBINED WITH BUFFER SIZING IN CLOCK TREE SYNTHESIS OF MULTIPLE POWER MODE DESIGNS Speakers: Kitae Park, Geunho Kim and Taewhan Kim, Seoul National University, KR Abstract Recently, many works have shown that an adjustable delay buffer (ADB), whose delay can be adjusted dynamically, can effectively solve the clock skew variation problem in designs with multiple power modes. However, all previous works on ADB allocation inherently entail two critical limitations: the delays adjusted by an ADB are always increments, and low-cost buffer sizing has never been, or not primarily been, taken into account. To demonstrate how effective overcoming the two limitations is in resolving the clock skew constraint, we characterize two types of ADBs, called CADB (capacitor-based ADB) and IADB (inverter-based ADB), and show that the delays adjusted by an IADB can be decremented and that the clock skew violation in some clock trees of multiple power modes can be resolved by applying buffer sizing together with only a small number of IADBs and CADBs.
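
Illustrative note on IP4-2: the abstract observes that Givens Rotation based QRD can be realized in many valid processing sequences. As a minimal sketch of the underlying operation only, the Python function below applies the rotations in one conventional order (column by column, bottom-up within each column). The function name `givens_qr` and the list-of-lists matrix layout are assumptions made for this illustration; the code does not reproduce the data flow transformation or the energy results of the paper.

```python
import math


def givens_qr(a):
    """Return (q, r) such that q times r equals a, using Givens rotations.

    a is an m x n matrix given as a list of rows (hypothetical layout).
    """
    m, n = len(a), len(a[0])
    r = [list(map(float, row)) for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the entries below the diagonal of column j, bottom-up.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # nothing to eliminate
            c, s = x / h, y / h
            # Rotate rows i-1 and i of r so that r[i][j] becomes zero.
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # Accumulate the transposed rotation into columns i-1 and i of q.
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
    return q, r


if __name__ == "__main__":
    q, r = givens_qr([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    for row in r:
        print(row)
```

Each rotation touches only two rows of the matrix, so the order in which (row, column) pairs are eliminated determines which rows are loaded and stored; this ordering is the degree of freedom the abstract refers to.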

UB09 Session 9

Date: Thursday 27 March 2014
Time: 10:00 - 12:00
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB09.01	SOC VERIFICATION: AUTOMATED FUNCTIONAL VERIFICATION OF SYSTEMS-ON-CHIP Authors: Zdenek Prikryl, Marcela Simkova and Karel Masarik, Faculty of Information Technology, Brno University of Technology, CZ Abstract The growing complexity of systems-on-chip (SoC) increases the complexity of their verification as well. The reason is that we must verify not only the functions of separate logic blocks, but also their interconnections, timing and functional collaboration. Therefore, there is still a great demand for verification tools that are time-effective, fast and as automated as possible. Our solution targets exactly these issues. You are welcome to see the live demonstration at our booth! More information ...
UB09.02	AN AUTOMATED DESIGN FLOW FOR FAST PROTOTYPING OF SIMULINK MODELS ONTO MPSOC Authors: Francesco Robino and Johnny Öberg, Royal Institute of Technology, SE Abstract Simulink is a modelling environment suitable to model embedded systems at system-level. However there is no standard to rapidly prototype Simulink models onto modern multiprocessor system-on-chip (MPSoC). In this demonstration we show how our NoC System Generator tool can be used as part of an automated platform-based design flow to synthesize a Simulink model to a network-on-chip based MPSoC implementation on FPGA. The performance of the generated prototype scales with the number of processors. More information ...
UB09.03	LARA: THE LARA COMPILER SUITE Authors: Joao Bispo, Pedro Pinto, Ricardo Nobre, Tiago Carvalho and Joao Cardoso, Universidade do Porto, PT Abstract LARA is an aspect-oriented programming (AOP) language which allows the description of sophisticated code instrumentation schemes, advanced mapping strategies including conditional decisions, based on hardware/software resources, and of sophisticated sequences of compiler transformations. Furthermore, LARA provides mechanisms for controlling all elements of a toolchain in a consistent and systematic way, using a unified programming interface. We present three compiler tools developed around the LARA technology, MATISSE, MANET and ReflectC. MATISSE is a compiler which 1) allows analyses and transformations on MATLAB code and 2) generates C code from the MATLAB code. MATISSE can be fully controlled through LARA aspects, which can define the type and shape of MATLAB variables, specify code insertion/removal actions, and define specialization directives and other additional information. MATISSE can output transformed MATLAB code and specialized C code. The knowledge provided by the LARA aspects allows MATISSE to generate C tailored to specific targets (e.g., use statically declared arrays to be compliant with the high-level synthesis tools such as Catapult C). MANET is a source-to-source compiler for ANSI C based on Cetus, and is controlled using LARA aspects. MANET manages to leverage the expressiveness and modularity of LARA to query and manipulate the Cetus AST, providing an easy compilation flow with main goal of code instrumentation and code transformations. LARA aspects allow for a simple selection of program elements in the code which can be analyzed or transformed, by either consulting their attributes or applying actions. 
Thus, MANET can be used to provide information reports based on compiler analyses, to implement sophisticated code instrumentation strategies, or to perform code optimizations and transformations. ReflectC is a C compiler based on CoSy's compiler framework. CoSy's configurability and retargetability make ReflectC particularly effective for exploration of compiler transformations and optimizations on possible architecture variations, and it is being used for hardware/software co-design and design space exploration (DSE). We will present demos of the tools and the use of LARA aspects and strategies to guide our suite of compilation tools providing: 1) C code generation from MATLAB code, according to information provided by LARA aspects; 2) Instrumentation of C code to be used for collecting specific compile and runtime information (e.g., execution time, range of values for specific variables, custom profiling); 3) User-controlled compiler optimizations targeting several architectures and DSE of sequences of compiler optimizations bearing in mind performance improvements. In addition to presenting examples for each of the tools of the LARA compilation suite, we show an execution of the complete toolchain, controlled by LARA aspects. More information ...
UB09.04	SECURE CLOUD-BASED WORKFLOW-AS-A-SERVICE (WFAAS) ENVIRONMENT WITH ROLE-BASED-ACCESS-CONTROL (RBAC) FOR SOC DESIGN Authors: Sai Manoj P. D.¹, Hao Yu¹ and Joseph Lee² ¹Nanyang Technological University, SG; ²Silicon Cloud International, US Abstract The SoC design process requires multiple EDA tools, custom IPs, and technology design kits from multiple providers. The design environment needs to be secure and collaborative. These requirements can be realized by using an integrated cloud-based Workflow-as-a-Service (WFaaS) design environment. We demonstrate a cloud-based design environment for a SoC design with multiple CPU cores and analog I/Os. This design environment uses an innovative Role-Based-Access-Control user security model where designers interact through a web portal dashboard to perform the design workflows. More information ...
UB09.05	RTL+: DESIGN ENVIRONMENT: WALK BEFORE YOU RUN. Authors: Somayeh Sadeghi-Kohan, Behnaz Pourmohseni, Amir Reza Nekooei, Hanieh Hashemi, Hamed Najafi Haghi and Zainalabedin Navabi, University of Tehran, IR Abstract To enable development of high level designs with hardware correspondence, synthesizability must be satisfied in a top-down manner. Thus in this work, instead of using TLM-2.0 which is not established for synthesis, we will start with a level above RT level, "RTL+". RTL+ is basically using TLM-1.0 channels and includes abstract communications and handshakings that are mainly hidden from the designer. We develop a package of SystemC channels with hardware correspondence (synthesizable HDL) for the communication between various cores (with simple interfaces) and standard buses. More information ...
UB09.06	ENERGY-MODULATED COMPUTING Authors: Maxim Rykunov, Reza Ramezani, Abdullah Baz, Xuefu Zhang, Delong Shang, Andrey Mokhov, Danil Sokolov, Fei Xia and Alex Yakovlev, Newcastle University, GB Abstract This demo will illustrate the principle of energy-modulated computing according to which the flow of energy entering a computing system determines its computational flow. This principle will be fundamental for building future autonomous systems, such as those powered by energy harvesting sources and aimed for survival in power-deficient conditions. The demo includes a set of experimental circuits (with three VLSI chips and PCBs) to work in variable power supply conditions and software tools for digital and analogue co-design (Workcraft, Petrify, MPSAT). More information ...
UB09.07	COMPSOC: VIRTUAL EXECUTION PLATFORMS FOR MIXED TIME-CRITICALITY APPLICATIONS Author: Kees Goossens, TU Eindhoven, NL Abstract System-on-Chip (SOC) design gets increasingly complex, as a growing number of applications are integrated in such systems. These applications have mixed time-criticality, i.e., some have firm-, some soft-, and others non-real-time requirements. Executing such a mix of applications on a SOC poses several challenges. First, to reduce cost, platform resources, e.g., processors, interconnect, memories, are shared between applications. However, sharing causes interference between applications, making their behaviors interdependent. This results in two problems for SOC design and verification: 1) accurate system-level simulation and several approaches to formal verification are infeasible, because of the explosion in the number of possible combinations of applications, inputs, and resource states and 2) verification becomes a circular process that must be repeated if an application is added, removed, or modified, making integration and verification dominant parts of SOC development, in terms of time and money. The CompSOC platform addresses these problems by executing each application on an independent virtual execution platform (VEP). The VEPs are composable, i.e., cannot affect each other's behaviors. In the temporal domain an application's actual execution never varies by even a single clock cycle. Similarly, the energy and power behaviors of applications are also composable. As a result, applications can be designed, developed, verified, and executed in isolation. The VEPs are also predictable, meaning that all interference is bounded. This makes them virtualized also in terms of performance bounds, which enables firm real-time applications to be verified using formal performance analysis frameworks. The CompSOC platform uses the CoMiK microkernel to implement virtual processors on each processor time through temporal partitioning. 
Each application can use its own operating system (e.g. Compose, μcOS-III) and model of computation (e.g. CSDF, KPN, TT) in its VEP, to suit its level of time criticality. As more applications are integrated on a single SOC, the need arises for more dynamic behaviour. The system should be able to start, modify and stop applications at run time without affecting running applications. For this purpose the CompSOC platform has been extended with a predictable and composable resource management framework. It manages application bundles that contain 1) an application in the form of executables (ELFs on multiple processors), and also 2) the specifications of the (one or more) particular VEPs that the application executes in, consisting of virtual processors, NOC connections, virtualised memories, etc. At run time, the resource management framework can dynamically load and start application bundles by creating a VEP and then loading, booting, and executing an application within it. VEPs can also be modified, stopped, and deleted at run time. Our University Booth will present virtual-execution-platform and application-bundle concepts using an interactive demonstrator. It will show that the CompSOC has been extended with dynamic functionality, without sacrificing its key strengths: composability and predictability. We will demonstrate this through the use of the resource management framework and application bundles, showing that we can create, modify and delete virtual execution platforms running a mixed time-criticality application dynamically at run-time. More information ...
UB09.09	LEVERAGING DYNAMIC RECONFIGURATION TO INCREASE FAULT-TOLERANCE IN FPGA-BASED SATELLITE SYSTEMS Authors: Sebastian Korf¹, Dario Cozzi¹, Dirk Jungewelter¹, Jens Hagemeyer¹, Mario Porrmann¹ and Jorgen Ilstad² ¹CITEC (Bielefeld University), DE; ²ESTEC (European Space Agency), DE Abstract This demonstrator shows how today's SoCs for satellite payload processing can be extended with high-speed interfaces and computing power utilizing commercial dynamically reconfigurable FPGAs. The use of these FPGAs in a space environment will lead to faults due to radiation. Therefore, special methods have been developed to increase the system reliability. We will demonstrate an environment for automatic fault detection and correction in relevant applications like image and video processing. More information ...
UB09.10	UNISON: ASSEMBLY CODE GENERATION USING CONSTRAINT PROGRAMMING Authors: Roberto Castañeda Lozano¹, Gabriel Hjort Blindell², Mats Carlsson¹ and Christian Schulte² ¹Swedish Institute of Computer Science, SE; ²KTH Royal Institute of Technology, SE Abstract We demonstrate Unison - a simple, flexible and potentially optimal code generator that solves interdependent code generation tasks together using constraint programming as a modern combinatorial optimization method. We show how Unison takes into account the task interdependencies and their combinatorial nature to improve the speed of the code generated by LLVM (a state-of-the-art compiler) for Hexagon (a digital signal processor ubiquitous in modern mobile platforms). More information ...
12:00	End of session
12:30	Lunch Break in Exhibition Area Sandwich lunch

10.1 SPECIAL DAY Hot Topic: Memories today and tomorrow

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Saal 1

Organisers:
Thomas Mikolajick, NamLab gGmbH, DE
Ian O'Connor, Lyon Institute of Nanotechnology, FR

Chair:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE

Co-Chair:
Thomas Mikolajick, NamLab gGmbH, DE

Memory devices and technologies have undergone huge transformations in recent years and many industrially viable replacements for conventional technologies are on the brink of entering the market. The first paper in this session gives an overview of alternative memory technologies and how each can contribute to or disrupt accepted memory hierarchies. The quest for a universal memory device is still underway, and the other papers in this session focus on various approaches for future memory devices. The second paper examines phase change memories, while magnetic memories are discussed in the third paper, both in terms of standard memory applications and in terms of how they can improve logic performance. Resistive memories are the topic of the fourth paper, where new applications are considered: FPGAs, NoCs and crossbars. The fifth paper in this session looks at low-cost memory with a printable manufacturing approach, leading to other applications and market segments.

Time	Label	Presentation Title Authors
11:00	10.1.1	SEMICONDUCTOR MEMORY PERSPECTIVE Speaker: Roberto Bez, Micron, IT Abstract Memories are gaining importance as they become fundamental in the definition of the electronic system. Presently the industry standard technologies are still DRAM and Flash, which have been able to guarantee cost sustainability thanks to continuous scaling. NAND/DRAM miniaturization is becoming increasingly difficult and, moreover, new applications require higher memory density and better performance. Therefore there are good opportunities and important challenges for alternative memory technologies to enter the market and replace/displace the standard ones.
11:20	10.1.2	EXPLORING THE LIMITS OF PHASE CHANGE MEMORIES Speaker: Matthias Wuttig, RWTH Aachen University of Technology, DE Abstract Phase change materials are among the most promising compounds in information technology. They can be very rapidly switched between the amorphous and the crystalline state, indicative of peculiar crystallization behaviour. Phase change materials are already employed in rewriteable optical data storage, where the pronounced difference of optical properties between the amorphous and crystalline state is used. This unconventional class of materials is also the basis of a storage concept to replace flash memory. This talk will discuss the unique material properties which characterize phase change materials. In particular, it will be shown that the crystalline state of phase change materials is characterized by the occurrence of resonant bonding, a particular flavour of covalent bonding. This insight is employed to predict systematic property trends and to develop non-volatile memories with DRAM-like switching speeds, potentially paving the road towards a universal memory. Phase change materials not only provide exciting opportunities for applications including 'greener' storage devices, but also form a unique quantum state of matter, as will be demonstrated by transport measurements. In this talk, potential limits of phase change memories in terms of switching speed, scalability and power consumption will be discussed.
11:35	10.1.3	MAGNETIC MEMORIES: FROM DRAM REPLACEMENT TO ULTRA LOW POWER LOGIC CHIPS Speaker: Jean-Pierre Nozières, Spintec, FR Abstract The recent advent of spin transfer torque (STT) has shed a new light on MRAM with the promises of much improved performances and greater scalability to very advanced technology nodes. As a result, MRAM is now viewed as a credible solution for stand-alone and embedded applications where the combination of non-volatility, speed and endurance is key. Whereas the technology is nearing maturity for DRAM replacement, with the exception of process scaling to sub-20nm which remains a challenge, circuit designers are now actively looking at SoCs where MRAM could bring in better performance and lower power consumption in data intensive applications as well as instant-on capability in mobile applications. In this paper we present a review of the MRAM technology and a methodology for ASIC design using a custom full digital hybrid CMOS/Magnetic Process Design Kit. We finish by a few examples showing that magnetic memories can be efficiently integrated in logic designs, for both safety and low power purposes.
11:55	10.1.4	RESISTIVE MEMORIES: WHICH APPLICATIONS? Speakers: Fabien Clermidy¹, Natalija Jovanovic¹, Santhosh Onkaraiah¹, Houcine Oucheikh¹, Ogun Turkyilmaz¹, Olivier Thomas¹, Elisa Vianello¹, Jean-Michel Portal² and Marc Bocquet² ¹CEA-LETI, FR; ²Université d'Aix-Marseille, FR Abstract The recent announcement of a 16 Gbit resistive memory from Sony shows the trend to quickly adopt resistive memories as an alternative to DRAM. However, using ReRAM for embedded computing is still a futuristic goal. This paper approaches two applications based on ReRAM devices for gains in area, performance or power consumption. The first application is the FPGA, one of the architectures that can benefit the most from ReRAM integration to reduce footprint and save energy. The second application relates to ultra-low-power systems and the way to obtain an instantaneous "freeze" mode in devices for the Internet of Things.
12:10	10.1.5	THINFILM PRINTED FERRO-ELECTRIC MEMORIES AND INTEGRATED PRODUCTS Speakers: Christer Karlsson and Peter Fischer, Thin Film Electronics AB, SE Abstract Printed electronics has recently moved from a focus on the production of individual components towards the design and initial commercialization of integrated systems. This paper describes the current status and further trends of ferroelectric nonvolatile memories as developed and produced by Thin Film Electronics.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

10.2 Wireless NoCs

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 6

Chair:
Giorgos Dimitrakopoulos, Democritus University of Thrace, GR

Co-Chair:
Valeria Bertacco, University of Michigan, US

This session comprises three papers devoted to studying different aspects of wireless NoC design and optimization. The first paper focuses on energy efficiency, by effectively tuning the transmission power of on-chip antennas. The second paper compares the performance and power of different routing algorithms for wireless NoCs, while the third paper explores the adoption of wireless NoCs in 3D chip designs.

Time	Label	Presentation Title Authors
11:00	10.2.1	AN ADAPTIVE TRANSMITTING POWER TECHNIQUE FOR ENERGY EFFICIENT MM-WAVE WIRELESS NOCS Speakers: Andrea Mineo¹, Maurizio Palesi², Giuseppe Ascia¹ and Vincenzo Catania¹ ¹University of Catania, IT; ²Kore University, IT Abstract Several emerging techniques have been recently proposed for alleviating the communication latency and the energy consumption issues in multi/many-core architectures. One of such emerging communication techniques, namely, WiNoC replaces the traditional wired links with the use of wireless medium. Unfortunately, the energy consumed by the RF transceiver (i.e., the main building block of a WiNoC), and in particular by its transmitter, accounts for a significant fraction of the overall communication energy. In this paper we propose a runtime tunable transmitting power technique for improving the energy efficiency of the transceiver in wireless NoC architectures. The basic idea is tuning the transmitting power based on the location of the recipient of the current communication. The integration of the proposed technique into two known WiNoC architectures, namely, iWise64 and McWiNoC resulted in an energy reduction of 43% and 60%, respectively.
11:30	10.2.2	PERFORMANCE EVALUATION OF WIRELESS NOCS IN PRESENCE OF IRREGULAR NETWORK ROUTING STRATEGIES Speakers: Paul Wettin, Jacob Murray, Ryan Kim, Xinmin Yu, Partha Pande and Deukhyoun Heo, Washington State University, US Abstract The millimeter (mm)-wave small-world wireless NoC (mSWNoC) is an enabling interconnect architecture to design high performance and low power multicore chips. As the mSWNoC has an overall irregular topology, it is extremely important to design suitable deadlock-free routing mechanisms for it. In this paper we quantify the latency, energy dissipation, and thermal profiles of mSWNoC architectures by incorporating irregular network routing strategies. We demonstrate that the latency, energy dissipation, and thermal profile are affected by the adopted routing methodologies. In presence of the benchmarks considered, the variation in latency and energy dissipation is small. However, the network hotspot temperature can vary considerably depending on the exact routing strategy and the characteristics of the benchmark.
12:00	10.2.3	LOW-LATENCY WIRELESS 3D NOCS VIA RANDOMIZED SHORTCUT CHIPS Speakers: Hiroki Matsutani¹, Michihiro Koibuchi², Ikki Fujiwara², Takahiro Kagami¹, Yasuhiro Take¹, Tadahiro Kuroda¹, Paul Bogdan³, Radu Marculescu⁴ and Hideharu Amano¹ ¹Keio University, JP; ²National Institute of Informatics, JP; ³University of Southern California, US; ⁴Carnegie Mellon University, US Abstract In this paper, we demonstrate that by inducing a certain fraction of randomness into wireless 3D NoCs (where CMOS wireless links are used for vertical inter-chip communication) we can reduce the communication latency when considering the physical constraints of 3D design space. Towards this end, we consider two cases, namely 1) replacing existing horizontal 2D links in a wireless 3D NoC with randomized shortcut NoC links and 2) enabling full connectivity via adding a randomized NoC layer to a wireless 3D system with no or partial horizontal connectivity. Consequently, the packet routing is optimized by exploiting both the existing and the newly added random NoC. Thus, by adding randomly wired shortcut NoCs to a wireless 3D system, one can strike a good balance between the modular design and the minimum randomness needed for achieving low-latency. Experimental results show that by adding a random NoC chip to wireless 3D CMPs without built-in horizontal NoCs we can reduce the communication latency by as much as 26.2% when compared to that of adding a 2D mesh NoC. Also, the application execution time and average flit transfer energy can also be improved accordingly.
12:30	IP5-1, 578	HYBRID WIRE-SURFACE WAVE ARCHITECTURE FOR ONE-TO-MANY COMMUNICATION IN NETWORK-ON-CHIP Speakers: Ammar Karkar¹, Nizar Dahir¹, Ra'ed Al-Dujaily², Kenneth Tong³, Terrence Mak⁴ and Alex Yakovlev¹ ¹School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, GB; ²General Systems Company, Baghdad - Iraq, IQ; ³Department of Electrical and Electronic Engineering, University College London, GB; ⁴Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong, CN Abstract Network-on-chip (NoC) is a communication paradigm that has emerged to tackle different on-chip challenges and has satisfied different demands in terms of high performance and economical interconnect implementation. However, merely metal based NoC pursuit offers limited scalability with the relentless technology scaling, especially in one-to-many (1-to-M) communication. To meet the scalability demand, this paper proposes a new hybrid architecture empowered by both metal interconnects and Zenneck surface wave interconnects (SWI). This architecture, in conjunction with newly proposed routing and global arbitration schemes, avoids overloading the NoC and alleviates traffic hotspots compared to the trend of handling 1-to-M traffic as unicast. This work addresses the system level challenges for intra chip multicasting. Evaluation results, based on a cycle-accurate simulation and hardware description, demonstrate the effectiveness of the proposed architecture in terms of power reduction ratio of 2 to 12X and average delay reduction of 25X or more, compared to a regular NoC. These results are achieved with negligible hardware overheads.
12:31	IP5-2, 101	FAILURE ANALYSIS OF A NETWORK-ON-CHIP FOR REAL-TIME MIXED-CRITICAL SYSTEMS Speakers: Eberle A Rambo¹, Alexander Tschiene¹, Jonas Diemer¹, Leonie Ahrendts¹ and Rolf Ernst² ¹Technische Universität Braunschweig, DE; ²TU Braunschweig, DE Abstract Multi- and many-core architectures using Networks-on-Chip (NoC) are being explored for use in real-time safety-critical applications for their performance and efficiency. Such systems must provide isolation between tasks that may present distinct criticality levels. The NoC is critical to maintain the isolation property as it is a heavily used shared resource. To meet safety-standard requirements, such architectures require a systematic evaluation of the effects of all possible failures such as in the form of a Failure Mode and Effects Analysis (FMEA). We present the results of a detailed system-level analysis of a typical real-time mixed-critical network-on-chip architecture. This comprises an FMEA and error effects classification regarding duration and isolation violation.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch
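
Illustrative note on 10.2.1: the abstract describes tuning the transmitting power according to the location of the recipient. The sketch below is one possible reading of that idea, not the authors' implementation. The power table `POWER_LEVELS`, its reach and power values, and the function name `select_tx_power` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical table of discrete transmitter settings:
# (maximum reach in mm, transmit power in mW). Values are made up.
POWER_LEVELS = [(5.0, 1.0), (10.0, 2.5), (20.0, 6.0)]


def select_tx_power(src, dst, levels=POWER_LEVELS):
    """Return the lowest power setting whose reach covers the src-dst distance.

    src and dst are (x, y) antenna positions in mm (an assumed coordinate model).
    """
    distance = math.dist(src, dst)
    # Try settings from shortest reach to longest and stop at the first that fits.
    for reach, power in sorted(levels):
        if distance <= reach:
            return power
    raise ValueError("destination is beyond the reach of every power level")
```

Under these assumptions a nearby recipient is served at the lowest setting, and only distant recipients pay for the highest one.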

10.3 Green Computing Systems

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 1

Chair:
Ayse Coskun, Boston University, US

Co-Chair:
Martino Ruggiero, University of Bologna, IT

This session discusses techniques to improve energy efficiency in large-scale computing systems, many-core systems, servers, and the cloud. The papers in this session particularly emphasize the practical experiences in academia and in industry.

Time	Label	Presentation Title Authors
11:00	10.3.1	GLOBAL FAN SPEED CONTROL CONSIDERING NON-IDEAL TEMPERATURE MEASUREMENTS IN ENTERPRISE SERVERS Speakers: Jungsoo Kim¹, Mohamed M. Sabry², David Atienza¹, Kalyan Vaidyanathan³ and Kenny Gross³ ¹EPFL, CH; ²ESL-EPFL, CH; ³Physical Sciences Research Center, Oracle, US Abstract Time lag and quantization in temperature sensors in enterprise servers lead to stability concerns about existing variable fan speed control schemes. Stability challenges become further aggravated when multiple local controllers are running together with the fan control scheme. In this paper, we present a global control scheme which tackles the concerns about the stability of enterprise servers while reducing the performance degradation caused by the variable fan speed control scheme. We first present a stable fan speed control scheme based on the Proportional-Integral-Derivative (PID) controller by adaptively adjusting the PID parameters according to the operating fan speed and eliminating the fan speed oscillation caused by temperature quantization. Then, we present a global control scheme which coordinates control actions among multiple local controllers. In addition, it guarantees the server stability while minimizing the overall performance degradation. We validated the proposed control scheme using a presently shipping commercial enterprise server. Our experimental results show that the proposed fan control scheme is stable under the non-ideal temperature measurement system (10 s time lag and 1 °C quantization). Furthermore, the global control scheme enables multiple local controllers to run in a stable manner while reducing the performance degradation by up to 19.2% compared to conventional coordination schemes, with 19.1% savings in server power consumption.
11:30	10.3.2	UNVEILING EURORA - THERMAL AND POWER CHARACTERIZATION OF THE MOST ENERGY-EFFICIENT SUPERCOMPUTER IN THE WORLD Speakers: Andrea Bartolini¹, Matteo Cacciari¹, Carlo Cavazzoni², Giampietro Tecchiolli³ and Luca Benini⁴ ¹University of Bologna, IT; ²CINECA, IT; ³EUROTECH, IT; ⁴Università di Bologna, IT Abstract Eurora (EURopean many integrated cORe Architecture) is today the most energy efficient supercomputer in the world. Ranked 1st in the Green500 in July 2013, it is a prototype built by Eurotech and Cineca toward next-generation Tier0 systems in the PRACE 2IP EU project. Eurora's outstanding energy efficiency is achieved by adopting a direct liquid cooling solution and a heterogeneous architecture with best-in-class general purpose HW components (Intel Xeon E5, Intel Xeon Phi and NVIDIA Kepler K20). In this paper we present a novel, low-overhead monitoring infrastructure capable of tracking in detail and in real time the thermal and power characteristics of Eurora's components with fine-grained resolution. Our experiments give insights into Eurora's thermal/power trade-offs and highlight opportunities for run-time power/thermal management and optimization.
12:00	10.3.3	CONTENTION AWARE FREQUENCY SCALING ON CMPS WITH GUARANTEED QUALITY OF SERVICE Speakers: Hao Shen and Qinru Qiu, Syracuse University, US Abstract Workload consolidation is usually performed in datacenters to improve server utilization for higher energy efficiency. One of the key issues related to workload consolidation is contention for shared resources such as last level cache, main memory, memory controller, etc. Dynamic voltage and frequency scaling (DVFS) of CPU is another effective technique that has widely been used to trade the performance for power reduction. We have found that the degree of resource contention of a system affects its performance sensitivity to CPU frequency. In this paper, we apply machine learning techniques to construct a model that quantifies runtime performance degradation caused by resource contention and frequency scaling. The inputs of our model are readings from Performance Monitoring Units (PMU) screened using standard feature selection technique. The model is tested on an SMT-enabled chip multi-processor and it reaches up to 90% accuracy. Experimental results show that, guided by the performance model, runtime power management techniques such as DVFS can achieve more accurate power and performance tradeoff without violating the quality of service (QoS) agreement. The QoS violation of the proposed system is significantly lower than systems that have no performance degradation information.
12:15	10.3.4	CONCURRENT PLACEMENT, CAPACITY PROVISIONING, AND REQUEST FLOW CONTROL FOR A DISTRIBUTED CLOUD INFRASTRUCTURE Speakers: Shuang Chen, Yanzhi Wang and Massoud Pedram, University of Southern California, US Abstract Cloud computing and storage have attracted a lot of attention due to the ever increasing demand for reliable and cost-effective access to vast resources and services available on the Internet. Cloud services are typically hosted in a set of geographically distributed data centers, which we will call the cloud infrastructure. To minimize the total cost of ownership of this cloud infrastructure (which accounts for both the upfront capital cost and the operational cost of the infrastructure resources), the infrastructure owners/operators must carefully plan data center locations in the targeted service area (for example the US territories) and data center capacity provisioning (i.e., the total CPU cycles per second that can be provided in each data center). In addition, they must have flow control policies that will distribute the incoming user requests to the available resources in the cloud infrastructure. This paper presents an approach for solving the unified problem of data center placement and provisioning, and request flow control in one shot. The solution technique is based on mathematical programming. Experimental results, using Google cluster data and placement/provisioning of up to eight data center sites, demonstrate the cost savings of the proposed problem formulation and solution approach.
12:31	IP5-3, 664	COOLIP: SIMPLE YET EFFECTIVE JOB ALLOCATION FOR DISTRIBUTED THERMALLY-THROTTLED PROCESSORS Speakers: Pratyush Kumar, Hoeseok Yang, Iuliana Bacivarov and Lothar Thiele, ETH Zurich, CH Abstract Thermal constraints limit the time for which a processor can run at high frequency. Such thermal throttling complicates the computation of response times of jobs. For multiple processors, a key decision is where to allocate the next job. For distributed thermally-throttled processors, we present COOLIP with a simple allocation policy: a job is allocated to the earliest available processor, and if there are several available simultaneously, to the coolest one. For Poisson distribution of inter-arrival times and Gaussian distribution of execution demand of jobs, COOLIP matches the 95-percentile response time of Earliest Finish-Time (EFT) policy which minimizes response time with full knowledge of execution demand of unfinished jobs and thermal models of processors. We argue that COOLIP performs well because it directs the processors into states such that a defined sufficient condition of optimality holds.
12:33	IP5-4, 942	ENERGY OPTIMIZATION IN 3D MPSOCS WITH WIDE-I/O DRAM USING TEMPERATURE VARIATION AWARE BANK-WISE REFRESH Speakers: Mohammadsadegh Sadri¹, Matthias Jung², Christian Weis², Norbert Wehn² and Luca Benini¹ ¹Department of Electrical, Electronic and Information Engineering (DEI) University of Bologna, IT; ²Microelectronic Systems Design Research Group, University of Kaiserslautern, DE Abstract Heterogeneous 3D integrated systems with Wide-I/O DRAMs are a promising solution to squeeze more functionality and storage bits into an ever decreasing volume. Unfortunately, with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. We improve DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. In order to provide proof of our concepts we develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. On this platform we run the Android OS with real-world benchmarks to quantify the advantages of our ideas. We show improvements of 16% in DRAM refresh power due to temperature variation aware bank-wise refresh. Furthermore, two solutions are investigated to speedup system simulations: (1) Adaptive tuning of sampling intervals based on the estimated chip thermal profile, which results in speedups of 2X. (2) Hardware acceleration of thermal simulations using the Maxeler engine, which shows possible speedups of 12X.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch
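
Illustrative note on IP5-3: the COOLIP abstract states its allocation policy in one sentence (earliest available processor first, coolest one on a tie). The sketch below is a minimal rendering of that sentence only. The `Processor` record, its field names, and the function `coolip_pick` are hypothetical, and the thermal model and throttling behaviour discussed in the paper are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    available_at: float  # time at which the processor becomes free (assumed field)
    temperature: float   # current temperature estimate in deg C (assumed field)


def coolip_pick(processors, now):
    """Pick the earliest available processor; break ties by lowest temperature."""
    if not processors:
        raise ValueError("no processors to choose from")

    def key(p):
        # A processor that is already idle counts as available "now",
        # so all idle processors tie and the coolest of them is chosen.
        return (max(p.available_at, now), p.temperature)

    return min(processors, key=key)
```

For example, with two idle processors at 70 and 55 degrees and a third that is still busy, the policy as sketched selects the idle 55-degree processor.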

10.4 System-level evaluation

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 2

Chair:
Pablo Sanchez, University of Cantabria, ES

Co-Chair:
Florian Letombe, Synopsys, FR

The session presents system-level verification and simulation techniques, as well as specific solutions for particular system components. The first paper analyzes how to detect concurrency errors from multi-threaded software on a virtual platform. The second one proposes a hybrid simulation platform for cache configuration analysis. The last paper explores SSD verification challenges. The session is completed by three IPs that introduce novel approaches for parallel simulation and efficient NoC/Smart systems validation.

Time	Label	Presentation Title Authors
11:00	10.4.1	AUTOMATIC DETECTION OF CONCURRENCY BUGS THROUGH EVENT ORDERING CONSTRAINTS Speakers: Luis Gabriel Murillo, Simon Wawroschek, Jeronimo Castrillon, Rainer Leupers and Gerd Ascheid, RWTH Aachen University, DE Abstract Writing correct parallel software for modern multi-processor systems-on-chip (MPSoCs) is a complicated task. Programmers can rarely anticipate all possible external and internal interactions in complex concurrent systems. Concurrency bugs originating from races and improper synchronization are difficult to understand and reproduce. Furthermore, traditional debug and verification practices for embedded systems lack support to address this issue efficiently. For instance, programmers still need to step through several executions until finding a buggy state or analyze complex traces, which results in productivity losses. This paper proposes a new debug approach for MPSoCs that combines dynamic analysis and the benefits of virtual platforms. All in all, it (i) enables automatic exploration of SW behavior, (ii) identifies problematic concurrent interactions, (iii) provokes possibly erroneous executions and, ultimately, (iv) detects concurrency bugs. The approach is demonstrated on an industrial-strength virtual platform with a full Linux operating system and real-world parallel benchmarks.
11:30	10.4.2	HARDWARE-BASED FAST EXPLORATION OF CACHE HIERARCHIES IN APPLICATION SPECIFIC MPSOCS Speakers: Isuru Nawinne, Josef Schneider, Haris Javaid and Sri Parameswaran, The University of New South Wales, AU Abstract Multi-level caches are widely used to improve the memory access speed of multiprocessor systems. Deciding on a suitable set of cache memories for an application specific embedded system's memory hierarchy is a tedious problem, particularly in the case of MPSoCs. To accurately determine the number of hits and misses for all the configurations in the design space of an MPSoC, researchers extract the trace first using instruction set simulators and then simulate using a software simulator. Such simulations take several hours to months. We propose a novel method based on specialized hardware which can quickly simulate the design space of cache configurations for a shared memory multiprocessor system on an FPGA, by analyzing the memory traces and calculating the cache hits and misses simultaneously. We demonstrate that our simulator can explore the cache design space of a quad-core system with private L1 caches and a shared L2 cache, over a range of standard benchmarks, taking as little as 0.106 seconds per million memory accesses, which is up to 456 times faster than the fastest known software based simulator. Since we emulate the program and analyze memory traces simultaneously, we eliminate the need to extract multiple memory access traces prior to simulation, which saves a significant amount of time during the design stage.
12:00	10.4.3	SSDEXPLORER: A VIRTUAL PLATFORM FOR FINE-GRAINED DESIGN SPACE EXPLORATION OF SOLID STATE DRIVES Speakers: Lorenzo Zuolo¹, Cristian Zambelli¹, Rino Micheloni², Salvatore Galfano³, Marco Indaco³, Stefano Di Carlo³, Paolo Prinetto³, Pirero Olivo¹ and Davide Bertozzi¹ ¹Università degli Studi di Ferrara, IT; ²PMC-Sierra, IT; ³Politecnico di Torino, IT Abstract Solid State Drives (SSDs) are gaining particular momentum in various frameworks such as multimedia, large data centers and cloud environments. Unfortunately, efficient CAD tools for SSD design space exploration able to assess the optimization of the device microarchitecture w.r.t. the target performance are still missing. This paper tries to close this gap by proposing SSDExplorer, a tool for fine-grained and fast design space exploration of SSD devices. SSDExplorer provides unprecedented insights into the architecture behavior and subcomponent interaction efficiency, while avoiding the need for the actual implementation of an FTL or of key hardware components. This is achieved by the introduction of suitable abstractions of the different components. This is confirmed by the thorough validation of SSDExplorer against a commercial SSD device.
12:30	IP5-5, 221	EFFICIENT SIMULATION AND MODELLING OF NON-RECTANGULAR NOC TOPOLOGIES Speakers: Ji Qi and Mark Zwolinski, University of Southampton, GB Abstract With increasing chip complexity, Networks-on-Chips (NoCs) are becoming a central platform for future on-chip communications. Many regular NoC architectures have been proposed to eliminate the communication bottlenecks of traditional bus-based networks. Non-rectangular and irregular architectures have also been proposed to increase performance. However, the complexity of designing custom non-rectangular networks leads to a rapid increase in design and verification times. To alleviate the conflict between performance and efficiency, this paper proposes a novel method that efficiently constructs virtual non-rectangular topologies on a mesh network by using time-regulated models to emulate irregular patterns. Data routings on virtual hexagonal and two irregular geometries validate the proposed method. An MPEG-4 decoder is used to exemplify the proposed method for media applications. Results analysis shows the virtual topologies emulated by the proposed method can provide precise timing and energy performance.
12:32	IP5-6, 520	MOVING FROM CO-SIMULATION TO SIMULATION FOR EFFECTIVE SMART SYSTEMS DESIGN Speakers: Franco Fummi¹, Michele Lora², Francesco Stefanni³, Dimitrios Trachanis⁴, Jan Vanhese⁴ and Sara Vinco² ¹University of Verona, EDALab s.r.l., IT; ²University of Verona, IT; ³EDALab s.r.l., IT; ⁴Agilent Technologies, BE Abstract Design of smart systems needs to cover a wide variety of domains, ranging from analogue to digital, with power devices, micro-sensors and actuators, up to MEMS. This high level of heterogeneity makes design a very challenging task, as each domain is supported by specific languages, modeling formalisms and simulation frameworks. A major issue is furthermore posed by simulation, which heavily impacts the design and verification loop and is very hard to build in such a heterogeneous context. On the other hand, achieving efficient simulation would indeed make smart system design feasible with respect to budget constraints. This work provides a formalization of the typical abstraction levels and design domains of a smart system. This taxonomy makes it possible to identify a precise role in the design flow for co-simulation and simulation scenarios. Moreover, a methodology is proposed to move from the co-simulated heterogeneity to a simulatable homogeneous representation in C++ of the entire smart system. The impact of heterogeneous or homogeneous models of computation is also examined. Experimental results prove the effectiveness of the proposed C++ generation for reaching high-speed simulation.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

10.5 Analysis of Components and Systems

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 3

Chair:
Frank Oppenheimer, OFFIS, DE

Co-Chair:
Todor Stefanov, Leiden University, NL

The first paper proposes a new static analysis approach based on segment graphs that identifies a tight set of potential access conflicts in segments that may-happen-in-parallel in system-level models. In the second paper, a technique for latency analysis for shared resource systems is introduced. The third paper proposes a method that improves the tradeoff between simulation speed and accuracy of performance models of architectures. Finally, the fourth paper deals with cross-correlating specification and RTL to discover versioning issues, poor documentation, and mismatches between specification and RTL.

Time	Label	Presentation Title Authors
11:00	10.5.1	(Best Paper Award Candidate) MAY-HAPPEN-IN-PARALLEL ANALYSIS BASED ON SEGMENT GRAPHS FOR SAFE ESL MODELS Speakers: Weiwei Chen¹, Xu Han² and Rainer Doemer³ ¹University of California, Irvine, US; ²Qualcomm Inc., US; ³EECS, UC Irvine, US Abstract A well-defined system-level model contains explicit parallelism and should be free from parallel access conflicts to shared variables. However, safe parallelism is difficult to achieve since risky shared variables are often hidden deep in the design and are not exposed through simulation. In this paper, we propose a new static analysis approach based on segment graphs that identifies a tight set of potential access conflicts in segments that may-happen-in-parallel (MHP). Our experimental results show that the analysis is complete, accurate and fast to reveal dangerous shared variables in several embedded application models. Compared to earlier work, our approach significantly reduces the number of false conflict reports and thus saves the designer time.
11:30	10.5.2	TIMING ANALYSIS OF FIRST-COME FIRST-SERVED SCHEDULED INTERVAL-TIMED DIRECTED ACYCLIC GRAPHS Speakers: Raymond Frijns¹, Shreya Adyanthaya¹, Sander Stuijk¹, Jeroen Voeten¹, Marc Geilen¹, Ramon Schiffelers² and Henk Corporaal¹ ¹Eindhoven University of Technology, NL; ²ASML, NL Abstract Analyzing worst-case application timing for systems with shared resources is difficult, especially when non-monotonic arbitration policies like First-Come-First-Served (FCFS) scheduling are used in combination with varying task execution times. Analysis methods that conservatively analyze these systems are often based on state-space exploration, which is not scalable due to its inherent susceptibility to combinatorial explosion. We propose a scalable timing analysis method on periodically restarted Directed Acyclic Task Graphs, that can provide conservative bounds on task timing properties when shared resources with FCFS scheduling are used. By expressing task enabling and completion times in intervals, denoting best-case and worst-case timing properties, contention on the shared resources can be estimated using conservative approximations. With an industrial case study we show that our approach can easily analyze models with thousands of tasks in less than 10 seconds, and the worst-case bounds obtained show an average improvement of 46% compared to bounds obtained by static worst-case analysis.
12:00	10.5.3	A DYNAMIC COMPUTATION METHOD FOR FAST AND ACCURATE PERFORMANCE EVALUATION OF MULTI-CORE ARCHITECTURES Speakers: Sebastien Le Nours¹, Adam Postula² and Neil Bergmann² ¹University of Nantes, FR; ²University of Queensland, AU Abstract Early estimation of performance has become necessary to facilitate design of complex multi-core architectures. Performance evaluation based on extensive simulations is time consuming and needs to be improved to allow exploration of different architectures in acceptable time. In this paper, we propose a method that improves the tradeoff between simulation speed and accuracy in performance models of architectures. This method computes during model execution some of the synchronization instants involved in architecture evolution. It allows grouping and abstracting architecture processes and this way significantly reduces the number of simulation events. Experiments show significant benefits from the computation method on the simulation time. Especially, a simulation speed-up by a factor of 4 is achieved in the considered case study, with no loss of accuracy about estimation of processing resource usage. The proposed method has potential to support automatic generation of efficient architecture models.
12:15	10.5.4	CROSS-CORRELATION OF SPECIFICATION AND RTL FOR SOFT IP ANALYSIS Speakers: Bhanu Singh¹, Arunprasath Shankar¹, Francis Wolff¹, Christos Papachristou¹, Daniel Weyer² and Steve Clay² ¹Case Western Reserve University, US; ²Rockwell Automation, US Abstract Semiconductor companies often use third-party IPs in order to improve their design productivity. In practice, there are risks involved in using a third-party IP as bugs may creep in due to versioning issues, poor documentation, and mismatches between specification and RTL. As a result of this, third-party IP specification and RTL must be carefully evaluated. Our methodology addresses this issue, which cross-correlates specification and RTL to discover these discrepancies. The key innovative ideas in our approach are to use prior and trusted experience about designs, which include their specs and RTL code. Also, we have captured this trusted experience into two knowledge bases (KB), Spec-KB and RTL-KB. Finally, knowledge base rules are used to cross-correlate the RTL blocks to the specs. We have tested our approach by analyzing several third-party IPs. We have defined metrics for specification coverage and RTL identification coverage to quantify our results.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

10.6 Multi-processor and distributed systems

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 4

Chair:
Orlando Moreira, Ericsson, NL

Co-Chair:
Giuseppe Lipari, ENS - Cachan, FR

This session features new results in scheduling, allocation and management of real-time applications in multi-core and distributed systems. The first paper presents a control algorithm for managing real-time tasks so as to meet thermal constraints in a multi-core chip. The second paper proposes an algorithm for mixed-criticality task allocation in a multiprocessor platform. The third paper proposes a method for generating a schedule for a multi-mode application in a distributed system.

Time	Label	Presentation Title Authors
11:00	10.6.1	THERMAL-AWARE FREQUENCY SCALING FOR ADAPTIVE WORKLOADS ON HETEROGENEOUS MPSOCS Speakers: Heng Yu, Rizwan Syed and Yajun Ha, National University of Singapore, SG Abstract For applications featuring adaptive workloads, the quality of their task execution can be dynamically adjusted given the runtime constraints. When mapping them to heterogeneous MPSoCs, it is expected not only to achieve the highest possible execution quality, but also meet the critical thermal challenges from the continuously increasing chip density. Prior thermal management techniques, such as Dynamic Voltage/Frequency Scaling (DVFS) and thread migration, do not take into account the trade-off possibility between execution quality and temperature control. In this paper, we explore the capability of adaptive workloads for effective temperature control, while maximally ensuring the execution Quality-of-Service (QoS). We present a thermal-aware dynamic frequency scaling (DFS) algorithm on heterogeneous MPSoCs, where judicious frequency selection achieves QoS maximization under the temperature threshold, which is converted to the thermal-timing deadline as an additional execution constraint. Results show that our frequency scaling algorithm achieves as large as 31.5% execution cycle/QoS improvement under thermal constraints.
11:30	10.6.2	PARTITIONED MIXED-CRITICALITY SCHEDULING ON MULTIPROCESSOR PLATFORMS Speakers: Chuancai Gu¹, Nan Guan², Qingxu Deng¹ and Wang Yi² ¹Northeastern University, CN; ²Uppsala University, SE Abstract Scheduling mixed-criticality systems that integrate multiple functionalities with different criticality levels into a shared platform appears to be a challenging problem, even on single-processor platforms. Multi-core processors are more and more widely used in embedded systems, which provide great computing capacities for such mixed-criticality systems. In this paper, we propose a partitioned scheduling algorithm MPVD to extend the state-of-the-art single-processor mixed-criticality scheduling algorithm EY to multiprocessor platforms. The key idea of MPVD is to evenly allocate tasks with different criticality levels to different processors, in order to better explore the asymmetry between different criticality levels and improve the system schedulability. Then we propose two enhancements to further improve the schedulability of MPVD. Experiments with randomly generated task sets show significant performance improvement of our proposed approach over existing algorithms.
12:00	10.6.3	GENERATION OF COMMUNICATION SCHEDULES FOR MULTI-MODE DISTRIBUTED REAL-TIME APPLICATIONS Speakers: Akramul Azim, Gonzalo Carvajal, Rodolfo Pellizzoni and Sebastian Fischmeister, University of Waterloo, CA Abstract A key problem in designing multi-mode real-time systems is the generation of schedules to reduce the complexities of transforming the model semantics to code. Moreover, multi-mode applications that require communications are prone to suffer from delays incurred during mode changes. We therefore aim to generate communications schedules that have low average mode-change delay for multi-mode real-time distributed applications. In this paper, we use optimization constraints associated to timing requirements to generate state-based schedules that are specific to multi-mode communication systems, and also demonstrate the workflow for generating schedules from specifications through a real-time video monitoring case-study. In terms of average mode-change delays, our experiments demonstrate that schedules generated using our method are more efficient than either a randomized algorithm or the well-known EDF scheduling algorithm.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

10.7 Advances in Synthesis

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Konferenz 5

Chair:
John Hayes, University of Michigan, US

Co-Chair:
Kim Taemin, Intel Labs, US

Papers in this session address synthesis algorithms and tools at different levels, targeting power, area and delay minimization.

Time	Label	Presentation Title Authors
11:00	10.7.1	PROVABLY MINIMAL ENERGY USING COORDINATED DVS AND POWER GATING Speakers: Nathaniel Conos, Saro Meguerdichian, Foad Dabiri and Miodrag Potkonjak, UCLA, US Abstract Both energy and execution speed can be greatly impacted by clock and power gating, nonlinear voltage scaling, and leakage power. We address the problem of coordinated power gating and dynamic voltage scaling (DVS) to minimize the overall energy consumption of an application under user-specified timing constraints. We prove that a solution provided by our convex programming formulation that uses at most two versions of hardware, where each version uses its own constant voltages, is optimal. Comprehensive evaluation of the new approach demonstrates energy improvements over traditional DVS and DVS and power gating techniques by factors of 1.44X-2.97X and 1.44X-2.82X, respectively.
11:30	10.7.2	A TREE ARBITER CELL FOR HIGH SPEED RESOURCE SHARING IN ASYNCHRONOUS ENVIRONMENTS Speakers: Syed Rameez Naqvi and Andreas Steininger, Vienna University of Technology, AT Abstract We present a novel tree arbiter cell that allows a pipelined processing of asynchronous requests. In this way it can achieve significantly lower delay in the critical case of frequent requests coming from different clients. We elaborate the necessary extension to facilitate a cascaded use of this cell in a tree-like fashion, and we show by theoretical analysis that in this configuration our cell provides better fairness than the standard approach. We implement our approach and quantitatively compare its performance properties with related work in a gate-level simulation. In our sample asynchronous Networks-on-Chip application our new cell proves to increase the throughput of three different designs available in literature by approximately 61.28%, 69.24%, and 186.85% respectively.
12:00	10.7.3	AN EFFICIENT MANIPULATION PACKAGE FOR BICONDITIONAL BINARY DECISION DIAGRAMS Speakers: Luca Amaru, Pierre-Emmanuel Gaillardon and Giovanni De Micheli, EPFL, CH Abstract Biconditional Binary Decision Diagrams (BBDDs) are a novel class of binary decision diagrams where the branching condition, and its associated logic expansion, is biconditional on two variables. Reduced and ordered BBDDs are remarkably compact and unique for a given Boolean function. In order to exploit BBDDs in Electronic Design Automation (EDA) applications, efficient manipulation algorithms must be developed and integrated in a software package. In this paper, we present the theory for efficient BBDD manipulation and its practical software implementation. The key features of the proposed approach are (i) strong canonical form pre-conditioning of stored BBDD nodes, (ii) recursive formulation of Boolean operations in terms of biconditional expansions, (iii) performance-oriented memory management and (iv) dedicated BBDD re-ordering techniques. Experimental results show that the developed BBDD package achieves an average node count reduction of 19.48% and a speed-up factor of 1.63x with respect to a state-of-the-art decision diagram manipulation package. Employed in the synthesis of datapath circuits, the BBDD manipulation package is capable of advantageously restructuring arithmetic operations, producing 11.02% smaller and 32.29% faster circuits as compared to a commercial synthesis flow.
12:15	10.7.4	SYNTHESIS ALGORITHM OF PARALLEL INDEX GENERATION UNITS Speaker: Yusuke Matsunaga, Kyushu University, JP Abstract The index generation function is a multi-valued logic function which checks whether the given input vector is registered or not, and returns its index value if the vector is registered. If the latency of the operation is critical, dedicated hardware is used for implementing the index generation functions. This paper proposes a method for implementing the index generation functions using parallel index generation units. A novel and efficient algorithm called 'conflict free partitioning' is proposed to synthesize parallel index generation units. Experimental results show that the proposed method outperforms other existing methods.
12:30	IP5-7, 104	AUTOMATING DATA REUSE IN HIGH-LEVEL SYNTHESIS Speakers: Wim Meeus¹ and Dirk Stroobandt² ¹Imec and Ghent University, BE; ²Ghent University, BE Abstract Current High-Level Synthesis (HLS) tools perform excellently for the synthesis of computation kernels, but they often don't optimize memory bandwidth. As memory access is a bottleneck in many algorithms, the performance of the generated circuit will benefit substantially from memory access optimization. In this paper we present an automated method and a toolchain to detect reuse of array data in loop nests and to build hardware that exploits this data reuse. This saves memory bandwidth and improves circuit performance. We make use of the polyhedral representation of the source program, which makes our method computationally easy. Our software complements the existing HLS flows. Starting from a loop nest written in C, our tool generates a reuse buffer and a loop controller, and preprocesses the loop body for synthesis with an existing HLS tool. Our automated tool produces designs from unoptimized source code that are as efficient as those generated by a commercial HLS tool from manually-optimized source code.
12:31	IP5-8, 12	A UNIVERSAL SYMMETRY DETECTION ALGORITHM Speaker: Peter Maurer, Dept. of Computer Sci., Baylor University, US Abstract Research on symmetry detection focuses on identifying and detecting new types of symmetry. We present an algorithm that is capable of detecting any type of permutation-based symmetry, including many types for which there are no existing algorithms. General symmetry detection is library-based, but symmetries that can be parameterized (i.e., total, partial, rotational, and dihedral symmetry) can be detected without using libraries. In many cases the algorithm is faster than existing techniques. Furthermore, it is simpler than most existing techniques, and can easily be incorporated into existing software.
12:32	IP5-9, 525	OPTIMIZATION OF DESIGN COMPLEXITY IN TIME-MULTIPLEXED CONSTANT MULTIPLICATIONS Speakers: Levent Aksoy¹, Paulo Flores² and Jose Monteiro³ ¹INESC-ID, PT; ²INESC-ID/IST ULisbon, PT; ³INESC-ID / IST, ULisbon, PT Abstract The multiplication of constants by a data input is an essential operation in digital signal processing (DSP) systems. For applications requiring a large number of constant multiplications under stringent hardware constraints, it is generally realized under a folded architecture, where a single constant selected from a set of multiple constants is multiplied by the data input at each time, called time-multiplexed constant multiplication (TMCM). This paper addresses the problem of optimizing the complexity of a TMCM design and introduces an algorithm that finds the least complex TMCM design by sharing the logic operators, i.e., adders, subtractors, adders/subtractors, and multiplexors (MUXes). It includes efficient search methods, yielding better results than existing TMCM algorithms.
12:33	IP5-10, 807	HARDWARE PRIMITIVES FOR THE SYNTHESIS OF MULTITHREADED ELASTIC SYSTEMS Speakers: Giorgos Dimitrakopoulos¹, Seitanidis Ioannis², Anastasios Psarras¹, Konstantinos Tsiouris¹, Pavlos Matthaiakis³ and Jordi Cortadella⁴ ¹Democritus University of Thrace, GR; ²Democritus University of Thrace, GR; ³Mentor Graphics, FR; ⁴Universitat Politecnica de Catalunya, ES Abstract Elastic systems operate in a dataflow-like mode using a distributed scalable control and tolerating variable latency computations. At the same time, multithreading increases the utilization of processing units and hides the latency of each operation by time-multiplexing operations of different threads in the datapath. This paper proposes a model to unify multithreading and elasticity. A new multithreaded elastic control protocol is introduced, supported by low-cost elastic buffers that minimize the storage requirements without sacrificing performance. To enable the synthesis of multithreaded elastic architectures, new hardware primitives are proposed and utilized in two circuit examples to prove the applicability of the proposed approach.
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch
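
Illustrative note on 10.7.4: the index generation function described in that abstract can be sketched in software as a lookup from registered vectors to index values. This is only a minimal sketch under stated assumptions, not the synthesis algorithm of the paper: the names build_unit, lookup and parallel_lookup are hypothetical, returning 0 for an unregistered vector is an assumed convention, and the split of the registered vectors into two units is arbitrary rather than the result of 'conflict free partitioning'.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[int, ...]


def build_unit(registered: Sequence[Vector], first_index: int = 1) -> Dict[Vector, int]:
    """Map each registered vector to a distinct index, starting at first_index."""
    return {vec: first_index + i for i, vec in enumerate(registered)}


def lookup(unit: Dict[Vector, int], vec: Vector) -> int:
    """Return the index of vec, or 0 (assumed convention) if vec is not registered."""
    return unit.get(vec, 0)


def parallel_lookup(units: List[Dict[Vector, int]], vec: Vector) -> int:
    """Query every unit; at most one unit holds vec, so OR-ing the outputs is safe."""
    result = 0
    for unit in units:
        result |= lookup(unit, vec)
    return result


# Four registered 4-bit vectors (illustrative data only).
registered = [(0, 0, 1, 0), (0, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 0)]

# One monolithic unit versus an arbitrary split into two smaller units.
single = build_unit(registered)
units = [build_unit(registered[:2], 1), build_unit(registered[2:], 3)]

for vec in registered + [(1, 1, 1, 1)]:
    print(vec, lookup(single, vec), parallel_lookup(units, vec))
```

Because the units hold disjoint sets of registered vectors, at most one of them produces a non-zero output for any input, which is why combining the outputs with a bitwise OR gives the same answer as the single unit.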

10.8 EDA+3D+MEMS Innovation Agenda 2020 Fueling the Innovation Chain of Electronics

Date: Thursday 27 March 2014
Time: 11:00 - 12:30
Location / Room: Exhibition Theatre

Organiser:
Jürgen Haase, edacentrum, DE

Moderator:
Ahmed Jerraya, CEA-LETI, FR

Panelists:
Gabriel Kittler, X-FAB, DE
Brandon Wang, Cadence, US
Brent Gregory, Synopsys, US
Horst Symanzik, Bosch-Sensortec GmbH, DE
Gerd Teepe, GLOBALFOUNDRIES, DE

Today the most powerful innovations in the major industries and the most promising approaches to tackling burning societal challenges are substantially influenced by, and dependent on, the innovations provided by the microelectronics industry. Breakthroughs in manufacturing technologies enable the realization of novel types of devices and systems, which in turn enable applications with fascinating functionality and enormous performance. However, this innovation chain is not operational without appropriate innovations in design technology: we need an Innovation Agenda 2020 for design methodology and EDA tools fueling the innovation chain of electronics. In 2014, the technologies for MEMS and for 3D chips have reached a maturity level that enables them to reshape our lives by 2020. This panel will discuss how to utilize these technologies: which applications will become possible with the upcoming innovations in 3D and MEMS technologies, and what kind of EDA innovations will be required to implement these applications effectively and efficiently, yielding powerful yet reliable components and systems. The panel includes the manufacturers GLOBALFOUNDRIES and X-FAB, Bosch as a leading supplier of technology and one of the MEMS pioneers, and the leading EDA vendors Cadence and Synopsys.

Time	Label	Presentation Title Authors
11:00	10.8.1	INTRODUCTION Moderator: Ahmed Jerraya, CEA-LETI, FR
12:30		End of session Lunch Break in Exhibition Area Sandwich lunch

UB10 Session 10

Date: Thursday 27 March 2014
Time: 12:00 - 14:30
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB10.01	SOC VERIFICATION: AUTOMATED FUNCTIONAL VERIFICATION OF SYSTEMS-ON-CHIP Authors: Zdenek Prikryl, Marcela Simkova and Karel Masarik, Faculty of Information Technology, Brno University of Technology, CZ Abstract An increase in the complexity of systems-on-chip (SoC) induces an increase in the complexity of their verification as well. The reason is that we must verify not only the functions of separate logic blocks, but also their interconnections, timing and functional collaboration. Therefore, there is still a great demand for verification tools that are time-effective, fast and as automated as possible. Our solution targets exactly these issues. You are welcome to see the live demonstration at our booth! More information ...
UB10.02	BRIDGING MATLAB/SIMULINK AND ESL DESIGN VIA AUTOMATIC CODE GENERATION Authors: Liyuan Zhang, Michael Glaß and Jürgen Teich, University of Erlangen-Nuremberg, DE Abstract Matlab/Simulink is today's de-facto standard for model-based design in domains such as control engineering and signal processing. Commercial tools are available to generate embedded C or HDL code directly from a Simulink model. However, Simulink models are purely functional models and, hence, designers cannot seamlessly consider the architecture that a Simulink model is later implemented on. In particular, it is not possible to explore the different architectural alternatives and investigate the arising interactions and side-effects directly within Simulink. To benefit from Matlab/Simulink's algorithm exploration capabilities and overcome the outlined drawbacks, we introduce a model transformation framework that converts a Simulink model to an executable specification, written in an actor-oriented modeling language. This specification then serves as the input of an established Electronic System Level (ESL) design flow, enabling Design Space Exploration (DSE) and automatic code generation for both hardware and software. In this demonstration, we will show how to automatically transform Simulink models to an established ESL design flow by means of a code generator. Based on the generated code, we will present a co-simulation approach that combines complex environmental models from Matlab/Simulink with the auto-generated model of a controller. We will use an Anti-lock Braking System (ABS) as an example where we investigate the impact of different controller implementations in the automotive E/E architecture. In detail, the following scientific achievements are included in the proposed demonstration: To bridge Simulink and ESL design flows, we developed an ESL Code-Generator to perform model transformation.
The idea is that for any given Simulink model, such as a controller in a control system, the designer can simply invoke our Code-Generator to create the ESL model automatically. In our design flow, we use SystemC as a programming language with an extension of actors with a specific Model of Computation (MoC). We guarantee the preservation of the semantics of the generated model by (a) applying a specific 1-to-1 mapping from Simulink basic blocks to an actor library and (b) considering different transformations to capture single-rate and multi-rate Simulink models. After the model transformation is finished, this auto-generated SystemC model serves as the input of a well-established ESL design flow that enables DSE. Besides the Code-Generator, we also demonstrate a validation technique that checks functional correctness by comparing the original Simulink model with the generated SystemC model. The main idea behind this technique is (1) to co-simulate the auto-generated model along with the original model and (2) to reuse the environment model and the test bench that were originally created in Simulink for the auto-generated model as well. Furthermore, the performance of the model can also be measured during co-simulation. In this demonstration, an ABS model will be transformed from Simulink to SystemC by invoking the ESL Code-Generator. Then, by applying our validation technique, the correctness and the accuracy of the auto-generated model can be examined. Lastly, to evaluate the performance of the model, the application-dependent quality of control will be measured, such as the braking distance on an icy road. More information ...
UB10.04	GEMINI: A NEW SYNTHESIS AND OPTIMIZATION TOOL FOR GRAPHENE-BASED DIGITAL DEVICES Authors: Valerio Tenace, Andrea Calimera, Massimo Poncino and Enrico Macii, Politecnico di Torino, IT Abstract Gemini is a synthesis and optimization software for graphene-based digital devices. Given a combinational circuit description through its boolean representation, Gemini produces a SPICE netlist mapped with graphene PN-Junction gates. The software is composed of a parser library to handle input circuit descriptions, a characterization library of graphene gates used in the synthesis process, a Biconditional Binary Decision Diagram library used to manipulate logic networks in Pass-XNOR logic in order to better exploit the intrinsic characteristics of the adopted graphene gates, and a number of optimization algorithms designed to produce better results in terms of area and thus power consumption. As a stand-alone software or as a library easy to integrate into state-of-the-art tools, Gemini represents a first step of an enabling technology for future synthesis and optimization processes for graphene-based devices. More information ...
UB10.05	RESCV: RESOURCE-AWARE COMPUTER VISION APPLICATION ON HETEROGENEOUS MULTI-TILE ARCHITECTURE Authors: Ericles Sousa¹, Johny Paul², Vahid Lari¹, Frank Hannig¹, Jürgen Teich¹ and Walter Stechele² ¹University of Erlangen-Nuremberg, DE; ²Technische Universität München, DE Abstract We demonstrate the benefits of invasive computing by showing the efficiency and utilization improvements in a resource-aware manner by algorithmic selection of different invasive resources, such as TCPA (tightly-coupled processor array), and RISC processors. More specifically, we present dynamic load balancing of a computer vision application between multiple RISC cores and a TCPA, based on invasive mechanisms supported by our operating system and the agent system. More information ...
UB10.06	SKETCH-BASED ESL VIRTUAL PROTOTYPING: SKETCH-BASED DESIGN AND SIMULATION-BASED EVALUATION FOR ESL VIRTUAL PROTOTYPING Authors: Rafael Rosales¹, Michael Glaß¹, Jürgen Teich¹, Bo Wang², Yang Xu² and Ralph Hasholzner² ¹University of Erlangen-Nuremberg, DE; ²Intel Mobile Communications, DE Abstract Virtual prototyping and Electronic System Level (ESL) modeling have become valuable approaches to cope with the ever-increasing complexity of embedded systems. Their effectiveness, however, is highly dependent on their quick development time and accuracy, two conflicting goals. In this demonstration, we present (a) an ESL methodology [1] [2] for the simulation-based evaluation of power and performance of embedded systems by the use of virtual prototypes. Our methodology permits us to develop ESL models for design space exploration of dynamic power and performance management strategies and hardware/software co-design choices. (b) We present a novel sketch-based tool termed Mahler [3] for the very early design phase of ESL modeling. Mahler provides a playground to quickly model functionality and evaluate performance on different architecture implementations. In Mahler, ESL models are created by literally sketching with a pen or touch interface, e.g. a tablet stylus, or a touchless interface, such as a Leap Motion controller. The application and architecture models are transformed to an executable virtual prototype through sketch recognition. This approach provides a very intuitive and fast way to explore actor-oriented functional modeling and hardware/software partitioning. The output of Mahler is a simulation-ready SystemC-based source-code stub that can be refined for subsequent design iterations. We will show a model of a Voice over LTE (VoLTE) use case, consisting of a heterogeneous cellular SoC platform, together with a wireless channel fading model and a base station network model.
State-based [1] and polynomial-equation-based [4] power models are built and co-simulated for the SoC digital module and the RF transceiver module, respectively, to abstract their different power consumption characteristics accurately. The entire end-to-end modeling enables efficient SoC performance and power simulation with proper network configuration in seconds, which is highly desirable in the early design exploration phase of cellular systems and in co-optimization with network vendors. More information ...
UB10.09	LEVERAGING DYNAMIC RECONFIGURATION TO INCREASE FAULT-TOLERANCE IN FPGA-BASED SATELLITE SYSTEMS Authors: Sebastian Korf¹, Dario Cozzi¹, Dirk Jungewelter¹, Jens Hagemeyer¹, Mario Porrmann¹ and Jorgen Ilstad² ¹CITEC (Bielefeld University), DE; ²ESTEC (European Space Agency), DE Abstract This demonstrator shows how today's SoCs for satellite payload processing can be extended with high-speed interfaces and computing power utilizing commercial dynamically reconfigurable FPGAs. The use of these FPGAs in a space environment will lead to faults due to radiation. Therefore, special methods have been developed to increase the system reliability. We will demonstrate an environment for automatic fault detection and correction in relevant applications like image and video processing. More information ...
14:30	End of session
15:30	Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

11.0 Special Day Keynote

Date: Thursday 27 March 2014
Time: 13:30 - 14:00
Location / Room: Saal 1

Organic semiconductors with conjugated electron systems are currently being intensively investigated for optoelectronic applications. This interest is spurred by novel devices such as organic light-emitting diodes (OLED), organic solar cells, and flexible electronics. In this talk, I will discuss some of the recent progress in realizing devices, in particular highly efficient white OLED for lighting and flexible organic solar cells.

Time	Label	Presentation Title Authors
13:30	11.0.1	ORGANIC ELECTRONICS - FROM LAB TO MARKETS Speaker: Martin Knupfer, IFW Dresden, DE
14:00		End of session
15:30		Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

11.1 SPECIAL DAY Embedded Tutorial: Alternatives to CMOS

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Saal 1

Organisers:
Ian O'Connor, Lyon Institute of Nanotechnology, FR
Thomas Mikolajick, NamLab gGmbH, DE

Chair:
Aida Todri, LIRMM, FR

Co-Chair:
Thomas Mikolajick, NamLab gGmbH, DE

Alternative approaches to CMOS-based computing structures abound, for logic, memory, interconnect and interfaces. This embedded tutorial aims to give in-depth analyses of three promising technologies. The first paper covers spintronics and its use in logic and memory to achieve low-power computing architectures. Silicon photonics, with anticipated benefits for interconnect structures, is examined in the second paper. The third paper looks at the status of organic electronics and the properties of thin-film transistors for large displays and sensor arrays on flexible supports.

Time	Label	Presentation Title Authors
14:00	11.1.1	SPINTRONICS FOR LOW-POWER COMPUTING Speakers: Yue Zhang¹, Weisheng Zhao¹, Jacques-Olivier Klein¹, Wang Kang¹, Damien Querlioz¹, Youguang Zhang², Dafiné Ravelosona¹ and Claude Chappert¹ ¹IEF - Univ. Paris Sud, FR; ²Univ. Beihang, CN Abstract Microelectronics has been following Moore's law for almost 40 years. However, this trend is running out of steam in recent technology nodes. The continuous improvements in the size of the transistors and in the operating frequencies result in serious power consumption, heat dissipation and reliability issues. Spintronics (Nobel Prize in Physics 2007, awarded to Prof. Fert from Univ. Paris-Sud and Peter Grünberg from Forschungszentrum Jülich) nanodevices can significantly reduce power, improve reliability or allow new functionalities. The 2010 ITRS report on emerging research devices identified the Magnetic Tunnel Junction (MTJ) nanopillar (the preeminent spintronics nanodevice) as one of the most promising technologies to be part of future microelectronics circuits. It provides data non-volatility, radiation hardness, fast data access and low-power operation. Magnetic memories thus become the most promising candidates for both low-power logic computing and data storage. This tutorial paper presents multi-discipline questions (Device, Circuit, Architecture, System and CAD) related to this topic to share the most recent results and discuss the future challenges.
14:30	11.1.2	CHAMELEON: CHANNEL EFFICIENT OPTICAL NETWORK-ON-CHIP Speakers: Sébastien Le Beux¹, Hui Li¹, Ian O'Connor¹, Kazem Cheshmi², Xuchen Liu¹, Jelena Trajkovic² and Gabriela Nicolescu³ ¹Lyon Institute of Nanotechnology, FR; ²Concordia University, CA; ³Ecole Polytechnique de Montréal, CA Abstract The next generation of MPSoC points to the integration of thousands of IP cores, requiring high performance interconnect for high throughput communications. Optical on-chip interconnect enables significantly increased bandwidth and decreased latency in MPSoC. However, the interface between electrical and photonic devices implies strong layout constraints that may impact the system performance and scalability. In this paper, we propose a novel optical interconnect named CHAMELEON. The interface simplifies the layout and allows the bandwidth between IP cores to be adapted according to the communication requirements. Compared to related networks, CHAMELEON demonstrates improved scalability and flexibility at the cost of a minor increase in power consumption.
15:00	11.1.3	LOW-VOLTAGE ORGANIC TRANSISTORS FOR FLEXIBLE ELECTRONICS Speakers: Ute Zschieschang¹, Reinhold Rödel¹, Ulrike Kraft¹, Kazuo Takimiya², Tarek Zaki³, Florian Letzkus⁴, Jörg Butschke⁴, Harald Richter⁴, Joachim Burghartz⁴, Wei Xiong⁵, Boris Murmann⁵ and Hagen Klauk¹ ¹Max Planck Institute for Solid State Research, DE; ²Riken Advanced Science Institute, JP; ³University of Stuttgart, DE; ⁴IMS CHIPS, DE; ⁵Stanford University, US Abstract A process for the fabrication of bottom-gate, top-contact (inverted staggered) organic thin-film transistors (TFTs) with channel lengths as short as 1 μm on flexible plastic substrates has been developed. The TFTs employ vacuum-deposited small-molecule semiconductors and a low-temperature-processed gate dielectric that is sufficiently thin to allow the TFTs to operate with voltages of about 3 V. The p-channel TFTs have an effective field-effect mobility of about 1 cm²/Vs, an on/off ratio of 10⁷, and a signal propagation delay (measured in 11-stage ring oscillators) of 300 ns per stage. For the n-channel TFTs, an effective field-effect mobility of about 0.06 cm²/Vs, an on/off ratio of 10⁶, and a signal propagation delay of 17 μs per stage have been obtained.
15:30		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

11.2 Transitioning NoC Design Techniques to Future Challenges

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Konferenz 6

Chair:
Masoud Daneshtalab, University of Turku, FI

Co-Chair:
Hiroki Matsutani, Keio University, JP

The first paper of the session presents an approach to tolerating faults in NoCs through runtime reconfiguration, which is of increasing importance. The second paper focuses on management of thermal behaviour in NoCs to improve the reliability of optical communication given the tight tolerances of silicon photonics. Finally, the third paper also provides an outlook on optical NoCs by contrasting their properties against those of aggressive electrical baselines, to provide directions for future research in the field.

Time	Label	Presentation Title Authors
14:00	11.2.1	(Best Paper Award Candidate) BRISK AND LIMITED-IMPACT NOC ROUTING RECONFIGURATION Speakers: Doowon Lee, Ritesh Parikh and Valeria Bertacco, University of Michigan, US Abstract The expected low reliability of the silicon substrate at upcoming technology nodes presents a key challenge for digital system designers. Networks-on-chip (NoCs) are especially concerning because they are often the only communication infrastructure for the chips in which they are deployed. Recently, routing reconfiguration solutions have been proposed to address this problem. However, they come at a high silicon cost, and often require suspending the normal network activity while executing a centralized, resource-hungry reconfiguration algorithm. This paper proposes a novel, fast and minimalistic routing reconfiguration algorithm, called BLINC. BLINC utilizes precomputed routing metadata to quickly evaluate localized detours upon each fault manifestation. We showcase the efficacy of our algorithm by deploying it in a novel NoC fault detection and reconfiguration solution, where BLINC enables uninterrupted NoC operation during aggressive online testing. If a fault seems likely to occur, we circumvent it in advance with the aid of our BLINC reconfiguration algorithm. Experimental results show an 80% reduction in the average number of routers affected by a reconfiguration event, compared to state-of-the-art techniques. BLINC enables negligible performance degradation in our detection and reconfiguration solution, while solutions based on current techniques suffer a 17-fold latency increase.
14:30	11.2.2	THERMAL MANAGEMENT OF MANYCORE SYSTEMS WITH SILICON-PHOTONIC NETWORKS Speakers: Tiansheng Zhang, José L. Abellán, Ajay Joshi and Ayse K. Coskun, Boston University, US Abstract Silicon-photonic networks-on-chip (NoCs) provide high bandwidth density; therefore, they are promising candidates to replace electrical NoCs in manycore systems. Silicon-photonic NoCs, however, are sensitive to the temperature gradients that typically occur on the chip, and hence require proactive thermal management. This paper first provides a design space exploration of silicon-photonic networks in manycore systems and quantifies the performance impact of the temperature gradients for various network bandwidths. The paper then introduces a novel job allocation technique that minimizes the temperature gradients among the ring modulators/filters to improve the application performance. Experimental results for a single-chip 256-core system demonstrate that our policy is able to maintain the maximum network bandwidth. Compared to existing workload allocation policies, the proposed policy improves system performance by up to 26.1% when running a single application and 18.3% for multi-program scenarios.
15:00	11.2.3	ASSESSING THE ENERGY BREAK-EVEN POINT BETWEEN AN OPTICAL NOC ARCHITECTURE AND AN AGGRESSIVE ELECTRONIC BASELINE Speakers: Luca Ramini¹, Paolo Grani², Herve Tatenguem Fankem¹, Alberto Ghiribaldi¹, Sandro Bartolini² and Davide Bertozzi¹ ¹Engineering Department of the University of Ferrara, IT; ²University of Siena, IT Abstract Many cross-benchmarking results reported in the open literature raise optimistic expectations on the use of optical networks-on-chip (ONoCs) for high-performance and low-power on-chip communication. However, most of those previous works ultimately fail to make a compelling case for chip-level nanophotonic NoCs, especially because of the lack of aggressive electronic baselines (ENoC) and the poor accuracy in physical- and architecture-layer analysis of the ONoC. This paper aims at providing the guidelines and minimum requirements so that emerging nanophotonic technology may become of practical relevance. The key differentiating factor of this work consists of contrasting ONoC solutions with an aggressive ENoC architecture with realistic complexity, performance, and power figures, synthesized on an industrial 40nm low-power technology. At the same time, key physical design issues and network interface architecture requirements for the ONoC under test are carefully assessed, thus paving the way for a well-grounded definition of the requirements for the emerging ONoC technology to achieve the energy break-even point with respect to pure electronic interconnect solutions in future multi- and many-core systems.
15:30	IP5-11, 618	DCM: AN IP FOR THE AUTONOMOUS CONTROL OF OPTICAL AND ELECTRICAL RECONFIGURABLE NOCS Speakers: Wolfgang Büter¹, Christof Osewold¹, Daniel Gregorek¹ and Alberto Garcia-Ortiz² ¹University of Bremen, DE; ²ITEM (U.Bremen), DE Abstract The increasing requirements for bandwidth and quality-of-service motivate the use of parallel interconnect architectures with several degrees of reconfiguration. This paper presents an IP, called Distributed Channel Management (DCM), to extend existing packet-switched NoCs with a reconfigurable point-to-point network seamlessly, i.e., without the need for any modification of the routers. The configuration of the reconfigurable network takes place dynamically and autonomously, so that the topology can be changed at run time. Furthermore, the architecture is scalable due to the autonomous decentralized administration of the links. The paper reports a thorough experimental analysis of the overhead of the approach at the gate level that considers different network parameters such as flit size and timing constraints.
15:31	IP5-12, 726	MINIMALLY BUFFERED SINGLE-CYCLE DEFLECTION ROUTER Speakers: Gnaneswara Rao Jonna¹, John Jose¹, Rachana Radhakrishnan² and Madhu Mutyam¹ ¹Indian Institute of Technology, Madras, IN; ²Rajagiri School of Engineering & Technology, Kochhi, IN Abstract With the drift from computation-centric designs to communication-centric designs in the Chip Multi Processor (CMP) era, the interconnect fabric is gaining more importance. An NoC that is efficient in terms of power, area and average flit latency has a huge impact on the overall performance of a CMP. In the current work, we propose MinBSD, a minimally buffered, single-cycle deflection router. It incorporates different operations (Injection, Ejection, Preemption, Re-injection) in a single module to handle the traffic effectively and ensures smooth flow of flits through the router pipeline. It performs overlapped execution of independent operations. These factors not only enable MinBSD to operate in a single cycle but also reduce the critical path latency, resulting in a faster interconnect network. Experimental results show that MinBSD reduces the average flit latency on real workloads, and reduces die area and power consumption when compared to the existing state-of-the-art minimally buffered deflection routers.
15:30		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

11.3 Industry relevant research and practice for system design

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Konferenz 1

Chair:
Emil Matus, Technische Universität Dresden, DE

Co-Chair:
Norbert Wehn, TU Kaiserslautern, DE

This session addresses various aspects of system modeling, synthesis, validation and verification with a strong focus on industrial relevance.

Time	Label	Presentation Title Authors
14:00	11.3.1	THE METAMODELING APPROACH TO SYSTEM LEVEL SYNTHESIS Speakers: Wolfgang Ecker¹, Michael Velten¹, Leily Zafari¹ and Ajay Goyal² ¹Infineon Technologies, DE; ²Infineon Technologies, IN Abstract This paper presents an industry-proven, metamodeling-based approach to System-Level-Synthesis, which is seen as a generic design automation strategy above today's implementation levels RTL (for digital) and Schematic Entry (for analog). The approach follows a new synthesis paradigm: the designer develops a simple domain- and/or design-specific language and a smart tool synthesizing implementation-level models according to the designer's needs. The overhead of making both a tool and a model pays off since the tool building is automated by code generation and reuse, both based on metamodeling techniques. The focus on the designer's own demands also keeps development costs low. Finally, utilization of specification data also keeps modeling effort low and increases design consistency, and thus decreases debug time. Using these concepts, single design steps have been sped up by a factor of up to 20x and implementations of chips (specification-to-tapeout) have been sped up by a factor of up to 3x.
14:15	11.3.2	LOGIC SYNTHESIS OF LOW-POWER ICS WITH ULTRA-WIDE VOLTAGE AND FREQUENCY SCALING Speakers: Yu Pu, Juan Echeverri, Maurice Meijer and Jose Pineda de Gyvez, NXP Research, NL Abstract For low-power digital ICs with ultra-wide voltage and frequency scaling (e.g., from the nominal supply voltage to the sub/near-threshold regime), achieving design closure can be a big challenge, especially when speed limits are pushed at very different voltages. This paper shares a practical logic synthesis recipe that helps to fulfill tight timing constraints. Our method includes: i) synthesizing circuits at a high voltage; ii) over-constraining maximal transition time; iii) pruning standard cell library based on cell delay degradation factor across voltages. This approach shows effectiveness on an industrial 90nm low-power micro-controller.
14:30	11.3.3	FORMAL VERIFICATION OF TAINT-PROPAGATION SECURITY PROPERTIES IN A COMMERCIAL SOC DESIGN Speakers: Pramod Subramanyan¹ and Divya Arora² ¹Princeton University, US; ²Intel Corporation, US Abstract SoCs embedded in mobile phones, tablets and other smart devices come equipped with numerous features that impose specific security requirements on their hardware and firmware. Many security requirements can be formulated as taint-propagation properties that verify information flow between a set of signals in the design. In this work, we take a tablet SoC design, formulate its critical security requirements as taint-propagation properties, and prove them using a formal verification flow. We describe the properties targeted, techniques to help the verifier scale, and security bugs uncovered in the process.
14:45	11.3.4	EARLY DESIGN STAGE THERMAL EVALUATION AND MITIGATION: THE LOCOMOTIV ARCHITECTURAL CASE Speakers: Tanguy Sassolas¹, Chiara Sandionigi¹, Alexandre Guerre², Alexandre Aminot¹, Pascal Vivet³, Hela Boussetta⁴, Luca Ferro⁴ and Nicolas Peltier⁴ ¹CEA, LIST, FR; ²CEA LIST, FR; ³CEA-LETI, FR; ⁴DOCEA Power, FR Abstract To offer more computing power to modern SoCs, transistors keep scaling in new technology nodes. Consequently, the power density is increasing, leading to higher thermal risks. Thermal issues need to be addressed as early as possible in the design flow, when the optimization opportunities are the highest. For early design stages, architects rely on virtual prototypes to model their designs' behavior with an adapted trade-off between accuracy and simulation speed. Unfortunately, accurate virtual prototypes fail to encompass the timescale of thermal effects. In this paper, we demonstrate that less accurate high-level architectural models, in conjunction with efficient power and thermal simulation tools, provide an adapted environment to analyze thermal issues and design software thermal mitigation solutions in the case of the Locomotiv MPSoC architecture.
15:00	11.3.5	MULTI-DISCIPLINARY INTEGRATED DESIGN AUTOMATION TOOL FOR AUTOMOTIVE CYBER-PHYSICAL SYSTEMS Speakers: Arquimedes Canedo¹, Mohammad Abdullah Al Faruque² and Jan Richter¹ ¹Siemens Corporation, US; ²University of California Irvine, US Abstract This paper presents our multi-year experience in the development of a Functional Modeling Compiler (FMC), a new model-based design tool for the development of multi-disciplinary automotive cyber-physical systems. We show how system-level simulation models suitable for design-space exploration of complex architectures can be synthesized from functional specifications to test and validate the interactions between ECUs, control algorithms, and the multi-physics.
15:15	11.3.6	PREDICTIVE PARALLEL EVENT-DRIVEN HDL SIMULATION WITH A NEW POWERFUL PREDICTION STRATEGY Speakers: Seiyang Yang¹, Jaehoon Han¹, Doowhan Kwak¹, Namdo Kim², Daeseo Cha², Junhyuck Park² and Jay Kim² ¹Pusan National University, KR; ²Samsung Electronics Co., KR Abstract Traditional parallel event-driven HDL simulation methods suffer from heavy synchronization and communication overhead for the timely transfer of signal data among local simulators, which can easily nullify most of the expected simulation speed-up from parallelization. Predictive parallel event-driven HDL simulation has been proposed as a promising new approach for enhancing simulation performance. In this paper, we further enhance this novel parallel simulation method for a series of not only timing-oriented but also function-oriented design changes with a new powerful prediction strategy. Experimentation with real SoC designs from industry has been performed for actual design changes, and has shown the effectiveness of the enhanced approach.
15:30		End of session Coffee Break in Exhibition Area

11.4 Enabling validation on fast platforms

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Konferenz 2

Chair:
Ronny Morad, IBM, IL

Co-Chair:
Franco Fummi, Università di Verona, IT

Fast platforms, whether acceleration, post-silicon or virtual prototypes, are key technologies for enabling validation of complex systems. However, making them effective presents enormous challenges. This session presents four papers and two IPs that propose solutions to overcome some of these challenges, thus enabling much higher performance and coverage.

Time	Label	Presentation Title Authors
14:00	11.4.1	(Best Paper Award Candidate) COVERAGE EVALUATION OF POST-SILICON VALIDATION TESTS WITH VIRTUAL PROTOTYPES Speakers: Kai Cong, Li Lei, Zhenkun Yang and Fei Xie, Portland State University, US Abstract High-quality tests for post-silicon validation should be ready before a silicon device becomes available in order to save time spent on preparing, debugging and fixing tests after the device is available. Test coverage is an important metric for evaluating the quality and readiness of post-silicon tests. We propose an online-capture offline-replay approach to coverage evaluation of post-silicon validation tests with virtual prototypes for estimating silicon device test coverage. We first capture necessary data from a concrete execution of the virtual prototype within a virtual platform under a given test, and then compute the test coverage by efficiently replaying this execution offline on the virtual prototype itself. Our approach provides early feedback on the quality of post-silicon validation tests before silicon is ready. To ensure fidelity of early coverage evaluation, our approach has been further extended to support coverage evaluation and conformance checking in the post-silicon stage. We have applied our approach to evaluate a suite of common tests on virtual prototypes of five network adapters. Our approach was able to reliably estimate that this suite achieves high functional coverage on all five silicon devices.
14:30	11.4.2	ARCHIVED: ARCHITECTURAL CHECKING VIA EVENT DIGESTS FOR HIGH PERFORMANCE VALIDATION Speakers: Chang-Hong Hsu¹, Debapriya Chatterjee², Ronny Morad³, Raviv Gal³ and Valeria Bertacco¹ ¹University of Michigan, Ann Arbor, US; ²IBM Research - Austin, US; ³IBM Research - Haifa, IL Abstract Simulation-based techniques play a key role in validating the functional correctness of microprocessor designs. A common approach for validating microprocessors (called instruction-by-instruction, or IBI checking) consists of running an RTL and an architectural simulation in lock-step, while comparing processor architectural state at each instruction retirement. This solution, however, cannot be deployed on long regression tests, because of the limited performance of RTL simulators. Acceleration platforms have the performance power to overcome this issue, but are not amenable to the deployment of an IBI checking methodology. Indeed, validation on these platforms requires logging activity on-platform and then checking it against a golden model off-platform. Unfortunately, an IBI checking approach following this paradigm entails a large slowdown for the acceleration platform, because of the sizable amount of data that must be transferred off-platform for comparison against the golden model. In this work we propose a sequence-by-sequence (SBS) checking approach that is efficient and practical for acceleration platforms. Our solution validates the test execution over sequences of instructions (instead of individual ones), thus greatly reducing the amount of data transferred for off-platform checking. We found that SBS checking delivers the same bug-detection accuracy as traditional IBI checking, while reducing the amount of traced data by more than 90%.
15:00	11.4.3	EFFECTIVE POST-SILICON FAILURE LOCALIZATION USING DYNAMIC PROGRAM SLICING Speakers: Ophir Friedler, Wisam Kadry, Arkadiy Morgenshtein, Amir Nahir and Vitali Sokhin, IBM Research - Haifa, IL Abstract In post-silicon functional validation, one of the complex and time-consuming processes is the localization of an instruction that exposes a bug detected at system level. The task is especially hard due to the silicon's limited observability and the long time between the failure's occurrence and its detection. We propose a novel method that automates the architectural localization of post-silicon test-case failures. The proposed tool analyzes a failing test-case, while leveraging the information derived from executing the test on an Instruction Set software Simulator (ISS), to identify a set of instructions that could lead to the faulty final state. The proposed failure localization process comprises the creation of a resource dependency graph based on the execution of the test-case on the ISS, the determination of a program slice of instructions that influence the faulty resources, and the reduction of the set of suspicious instructions by leveraging the knowledge of the correct resources. We evaluate our proposed solution through extensive experiments. Experimental results show that, in over 97% of all cases, our method was able to narrow down the list of suspicious instructions to under 2 instructions, on average, out of over 200. In over 59% of all cases, our method correctly reduced a test-case to a single faulty instruction.
15:15	11.4.4	DESIGN-FOR-DEBUG ROUTING FOR FIB PROBING Speakers: Chia-Yi Lee, Tai-Hung Li and Tai-Chen Chen, National Central University, TW Abstract To observe internal signals, physical probing is an important step in post-silicon debug. Focused ion beam (FIB) is one of the most popular probing technologies. However, an unsuitable layout significantly decreases the percentage of nets which can be observed through FIB probing for advanced process technologies. This paper presents the first design-for-debug routing to increase the FIB observable rate. The proposed algorithm, which adopts three FIB states and costs to enhance the maze routing, keeps at least one FIB candidate for each net while routing. Experimental results demonstrate that the proposed method can significantly increase the FIB observable rate under 100% routability.
15:30	IP5-13, 244	FUNCTIONAL TEST GENERATION GUIDED BY STEADY-STATE PROBABILITIES OF ABSTRACT DESIGN Speakers: Jian Wang¹, Huawei Li², Tao Lv², Tiancheng Wang² and Xiaowei Li² ¹Institute of Computing Technology, Chinese Academy of Sciences, CN; ²Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract This paper presents a novel method for functional test generation aiming at exploring control state space of the design. The steady-state probabilities (SP's) of the abstract design's control FSM are used to guide test generation. The SP's of the states can reflect how hard the states can be reached, and the hard-to-reach states are assigned with high priority to be exercised. Experimental results show that our method has better performance in test generation in comparison with constrained random simulation, and demonstrate that SP's provide good guidance on traversing hard-to-reach states of the design under validation.
15:31	IP5-14, 991	AUTOMATED SYSTEM TESTING USING DYNAMIC AND RESOURCE RESTRICTED CLIENTS Speakers: Mirko Caspar, Mirko Lippmann and Wolfram Hardt, Technische Universität Chemnitz, DE Abstract Testing on system level using a static and homogeneous architecture of clients is common practice. This paper introduces a new approach that uses a heterogeneous and dynamic set of resource-restricted test clients for automated testing. Due to changing resources and availability of the clients, the test case distribution needs to be recalculated dynamically during the test execution. All necessary conditions and parameters are represented by a formal model. It is shown that the algorithmic problem of DYNAMIC TESTPARTITIONING can be solved in polynomial time by a heuristic recursive algorithm. A testbench architecture is introduced, and simulation shows that the testbench can execute the test requirements within a small variation using several hundred clients. The system can react dynamically to changing resources and availability of the test clients within several seconds. The approach is generic and can be adapted to a huge number of systems.
15:30		End of session Coffee Break in Exhibition Area

11.5 Memory Resource Allocation and Scheduling in MPSoC

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Konferenz 3

Chair:
Andreas Herkersdorf, Technische Universität München, DE

Co-Chair:
Donatella Sciuto, Politecnico di Milano, IT

Low-latency data access and efficient interprocess communication are critical to MPSoC performance and power efficiency. This session introduces innovative approaches for data placement, memory bandwidth allocation and scheduling techniques in MPSoC architectures with heterogeneous 2D/3D memory hierarchies.

Time	Label	Presentation Title Authors
14:00	11.5.1	(Best Paper Award Candidate) SCENARIO-AWARE DATA PLACEMENT AND MEMORY AREA ALLOCATION FOR MULTI-PROCESSOR SYSTEM-ON-CHIPS WITH RECONFIGURABLE 3D-STACKED SRAMS Speakers: Meng-Ling Tsai, Yi-Jung Chen, Yi-Ting Chen and Ru-Hua Chang, Department of Computer Science and Information Engineering, National Chi Nan University, TW Abstract Integrating Multi-Processor System-on-Chips (MPSoCs) with 3D-stacked reconfigurable SRAM tiles has been proposed for embedded systems with high memory demands. At runtime, the SRAM tiles are configured into several memory areas, which can be reconfigured according to the dynamic behavior of the system. Targeting this architecture, in this paper, we propose a data placement and memory area allocation algorithm. The goal of the proposed algorithm is to optimize the performance of the memory system by minimizing the on-chip memory access latency, the number of off-chip memory accesses, and the number of reconfigurations. Since the behavior of an embedded system can be described by a set of scenarios, where each scenario specifies a set of applications that would execute concurrently, the proposed algorithm synthesizes data placements and the memory area allocation for each scenario. Data placement considers not only the data access patterns within each scenario but also those among all scenarios. We evaluate the proposed algorithm on a set of synthetic and real-world applications. The experimental results show that, compared to the existing data placement method designed for MPSoCs with distributed memory modules, the proposed algorithm reduces data access latency by up to 11.72%.
14:30	11.5.2	OPTIMIZED BUFFER ALLOCATION IN MULTICORE PLATFORMS Speakers: Maximilian Odendahl¹, Andres Goens¹, Rainer Leupers¹, Gerd Ascheid¹, Benjamin Ries¹, Berthold Vöcking¹ and Tomas Henriksson² ¹RWTH Aachen University, DE; ²Huawei Technologies, SE Abstract With the availability of advanced MPSoC and emerging Dynamic RAM (DRAM) interface technologies, an optimal allocation of logical data buffers to physical memory cannot be handled manually anymore due to the huge design space. An allocation does not only need to decide between an on- or off-chip memory, but also needs to take an increasing number of available memory channels, different bandwidth capacities and several routing possibilities into account. We formalize this problem and introduce a Mixed Integer Linear Programming (MILP) model based on two different optimization criteria. We implement the MILP model into a retargetable tool and present a case study with representative data of the Long-Term-Evolution (LTE) standard to show the real-life applicability of our approach.
15:00	11.5.3	MEMORY-CONSTRAINED STATIC RATE-OPTIMAL SCHEDULING OF SYNCHRONOUS DATAFLOW GRAPHS VIA RETIMING Speakers: Xue-Yang Zhu¹, Marc Geilen², Twan Basten³ and Sander Stuijk² ¹State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, CN; ²Department of Electrical Engineering, Eindhoven University of Technology, NL; ³Department of Electrical Engineering, Eindhoven University of Technology. Embedded Systems Institute, NL Abstract Synchronous dataflow graphs (SDFGs) are widely used to model digital signal processing and streaming media applications. In this paper, we use retiming to optimize SDFGs to achieve a high throughput with low storage requirement. Using a memory constraint as an additional enabling condition, we define a memory constrained self-timed execution of an SDFG. Exploring the state-space generated by the execution, we can check whether a retiming exists that leads to a rate-optimal schedule under the memory constraint. Combining this with a binary search strategy, we present a heuristic method to find a proper retiming and a static scheduling which schedules the retimed SDFG with optimal rate (i.e., maximal throughput) and with as little storage space as possible. Our experiments are carried out on hundreds of synthetic SDFGs and several models of real applications. Differential synthetic graph results and real application results show that, in 79% of the tested models, our method leads to a retimed SDFG whose rate-optimal schedule requires less storage space than the proven minimal storage requirement of the original graph, and in 20% of the cases, the returned storage requirements equal the minimal ones. The average improvement is about 7.3%. The results also show that our method is computationally efficient.
15:15	11.5.4	A CONSTRAINT-BASED DESIGN SPACE EXPLORATION FRAMEWORK FOR REAL-TIME APPLICATIONS ON MPSOCS Speakers: Kathrin Rosvall and Ingo Sander, KTH Royal Institute of Technology, SE Abstract Design space exploration (DSE) is a critical step in the design process of real-time multiprocessor systems. Combining a formal base in the form of SDF graphs with predictable platforms providing guaranteed QoS, the paper proposes a flexible and extendable DSE framework that can provide performance guarantees for multiple applications implemented on a shared platform. The DSE framework is formulated in a declarative style as an interprocess communication-aware constraint programming (CP) model. Apart from mapping and scheduling of application graphs, the model supports design constraints on several cost and performance metrics, such as memory consumption and achievable throughput. Using constraints with different compliance levels, the framework introduces support for mixed criticality in the CP model. The potential of the approach is demonstrated by means of experiments using a Sobel filter, a SUSAN filter, a RASTA-PLP application and a JPEG encoder.
15:31	IP5-15, 472	RELIABILITY-AWARE MAPPING OPTIMIZATION OF MULTI-CORE SYSTEMS WITH MIXED-CRITICALITY Speakers: Shin-Haeng Kang¹, Hoeseok Yang², Sungchan Kim³, Iuliana Bacivarov², Soonhoi Ha¹ and Lothar Thiele⁴ ¹Seoul National University, KR; ²ETH Zurich, CH; ³Chonbuk National University, KR; ⁴Swiss Federal Institute of Technology Zurich, CH Abstract This paper presents a novel mapping optimization technique for mixed-criticality multi-core systems with different reliability requirements. For this purpose, we derive a quantitative reliability metric and present a scheduling analysis that certifies given mixed-criticality constraints. Our framework is capable of investigating re-execution, passive replication, and modular redundancy with optimized voter placement, while typical hardening approaches consider only one or two of these techniques. The proposed technique complies with existing safety standards and is power-efficient, as demonstrated by our experiments.
15:32	IP5-16, 498	(Best Paper Award Candidate) FROM SIMULINK TO NOC-BASED MPSOC ON FPGA Speakers: Francesco Robino and Johnny Öberg, KTH Royal Institute of Technology, SE Abstract Network-on-chip (NoC) based multi-processor systems are promising candidates for future embedded system platforms. However, because of their complexity, new high-level modeling techniques are needed to design, simulate and synthesize embedded systems targeting NoC-based MPSoCs. Simulink is a popular modeling environment suitable for modeling at system level. However, there is no clear standard for synthesizing Simulink models into SW and HW towards a NoC-based MPSoC implementation. In addition, many of the proposed solutions require large overhead in terms of SW components and memory requirements, resulting in complex and customized multi-processor platforms. In this paper we present a novel design flow to synthesize Simulink models onto a NoC-based MPSoC running on low-cost FPGAs. Our design flow constrains the MPSoC and the Simulink model to share a common semantics domain. This reduces the need for resource-consuming SW components, lowering the memory requirements on the platform. At the same time, the performance (throughput) of dataflow applications can increase when the number of processors of the target platform is increased. This is shown through a case study on FPGA.
15:30		End of session Coffee Break in Exhibition Area

11.6 System-Level Thermal Estimation and Management

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Konferenz 4

Chair:
Petru Eles, Linköping University, SE

Co-Chair:
Oliver Bringmann, University of Tübingen, DE

This session deals with thermal management issues in multicore and battery-operated systems. In particular the papers cover three orthogonal aspects, specifically, thermal estimation and tracking from sparse sensor readings, proactive dynamic thermal management via task migration, and thermal management of hybrid energy storage devices based on idle period insertion.

Time	Label	Presentation Title Authors
14:00	11.6.1	MINIMAL SPARSE OBSERVABILITY OF COMPLEX NETWORKS: APPLICATION TO MPSOC SENSOR PLACEMENT AND RUN-TIME THERMAL ESTIMATION & TRACKING Speakers: Santanu Sarma and Nikil Dutt, University of California Irvine, US Abstract This paper addresses the fundamental and practically useful question of identifying a minimum set of sensors and their locations through which a large complex dynamical network system and its time-dependent states can be observed. The paper defines the minimal sparse observability problem (MSOP) and provides analytical tools with necessary and sufficient conditions to make an arbitrary complex dynamic network system completely observable. The mathematical tools are then used to develop effective algorithms to find the sparsest measurement vector that provides the ability to estimate the internal states of a complex dynamic network system from experimentally accessible outputs. The developed algorithms are further used in the design of a sparse Kalman filter (SKF) to estimate the time-dependent internal states of a linear time-invariant (LTI) dynamical network system. The approach is applied to illustrate the minimum sensor in-situ run-time thermal estimation and robust hotspot tracking for dynamic thermal management (DTM) of high performance processors and MPSoCs.
14:30	11.6.2	MDTM: MULTI-OBJECTIVE DYNAMIC THERMAL MANAGEMENT FOR ON-CHIP SYSTEMS Speakers: Heba Khdr, Thomas Ebi, Muhammad Shafique, Hussam Amrouch and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE Abstract Thermal hot spots and unbalanced temperatures between cores on chip can degrade performance, severely impact reliability, or both. In this paper, we propose mDTM, a proactive dynamic thermal management technique for on-chip systems. It employs multi-objective management for migrating tasks in order both to prevent the system from hitting an undesirable thermal threshold and to balance the temperatures between the cores. Our evaluation on the Intel SCC platform shows that mDTM can successfully avoid a given thermal threshold and reduce spatial thermal variation by 22%. Compared to the state of the art, our mDTM achieves up to 58% performance gain. Additionally, we deploy an FPGA and IR camera based setup to analyze the effectiveness of our technique.
15:00	11.6.3	THERMAL MANAGEMENT OF BATTERIES USING A HYBRID SUPERCAPACITOR ARCHITECTURE Speakers: Donghwa Shin¹, Massimo Poncino² and Enrico Macii³ ¹Department of Computer Engineering, Yeungnam University, KR; ²Politecnico di Torino, IT; ³Dipartimento di Automatica e Informatica, Politecnico di Torino, IT Abstract Thermal analysis and management of batteries have been an important research issue for battery-operated systems such as electric vehicles and mobile devices. Nowadays, battery packs are designed considering heat dissipation, and external cooling devices such as cooling fans are also widely used to ensure the reliability and extend the lifetime of a battery. Approaches of this type, which target the enhancement of the cooling efficiency via the reduction of the thermal resistance, cannot achieve an immediate temperature drop to avoid a thermal emergency situation. Approaches based on removing the heat from the heat sources via idle period insertion (similar to what is done for silicon devices) would allow faster thermal response; however, it is not obvious how to implement these schemes in the context of batteries. In this paper, we propose the use of a simple parallel battery-supercapacitor hybrid architecture with a dual-mode discharging strategy that can provide immediate temperature management, in which the supercapacitor is used as an energy buffer during the idle periods of the battery. Simulation results show that the proposed method can keep the battery temperature within the safe range without external cooling devices while exploiting the advantage of the battery-supercapacitor parallel connection.
15:31	IP5-17, 873	(Best Paper Award Candidate) THERMAL ANALYSIS AND MODEL IDENTIFICATION TECHNIQUES FOR A LOGIC + WIDEIO STACKED DRAM TEST CHIP Speakers: Francesco Beneventi¹, Andrea Bartolini¹, Pascal Vivet², Denis Dutoit² and Luca Benini¹ ¹DEI - University of Bologna, IT; ²CEA-Leti, Grenoble, FR Abstract High temperature is one of the limiting factors and major concerns in 3D-chip integration. In this paper we use a 3D test chip (WIDEIO DRAM on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects. We correlated real temperature measurements with the power dissipated by the heaters using model learning techniques. The resulting compact thermal model is able to predict temperatures at chip locations far from the temperature sensors and to infer the power dissipation at any location of the chip. Results are verified by means of an off-sample validation technique and show a high accuracy of the compact thermal model when compared with silicon measurements.
15:32	IP5-18, 349	ADAPTIVE POWER ALLOCATION FOR MANY-CORE SYSTEMS INSPIRED FROM MULTIAGENT AUCTION MODEL Speakers: Xiaohang Wang¹, Baoxin Zhao¹, Terrence Mak², Mei Yang³, Yingtao Jiang³, Masoud Daneshtalab⁴ and Maurizio Palesi⁵ ¹Guangzhou Institute of Advanced Technology, CN; ²The Chinese University of Hong Kong, CN; ³University of Nevada, Las Vegas, US; ⁴University of Turku, FI; ⁵University of Enna, Kore, IT Abstract Scaling of future many-core chips is hindered by the challenge imposed by ever-escalating power consumption. At its worst, an increasing fraction of the chips will have to be shut down, as power supply is inadequate to simultaneously switch all the transistors. This so-called dark silicon problem brings up a critical issue regarding how to achieve the maximum performance with a given limited power budget. This issue is further complicated by two facts. First, high variation in power budget calls for wide range power control capability, whereas most current frequency/voltage scaling techniques cannot effectively adjust power over such a wide range. Second, as the applications' behavior becomes more complicated, there is a pressing need for scalability and global coordination, rendering heuristic-based centralized or fully distributed control schemes inefficient. To address the aforementioned problems, in this paper, a power allocation method employing multiagent auction models is proposed, referred to as Hierarchal MultiAgent based Power allocation (HiMAP). Tiles play the role of consumers that bid for power budget and the whole process is modeled by a combinatorial auction, whereas HiMAP finds the Walrasian equilibria. Experimental results have confirmed that HiMAP can reduce the execution time by as much as 45% compared to three competing methods. The runtime overhead and cost of HiMAP are also small, which makes it suitable for adaptive power allocation in many-core systems.
15:33	IP5-19, 815	UNIFIED, ULTRA COMPACT, QUADRATIC POWER PROXIES FOR MULTI-CORE PROCESSORS Speakers: Muhammad Yasin¹, Ibrahim (Abe) Elfadel² and Anas Shahrour² ¹New York University - Abu Dhabi, AE; ²Masdar Institute of Science and Technology, AE Abstract Per-core power proxies for multi-core processors are known to use several dozens of hardware activity monitors to achieve a 2% accuracy on core power estimation. These activity monitors are typically not accessible to the user, and even if they were accessible, there would be a significant overhead in using them at the kernel or OS level for power monitoring or control. Furthermore, when scaled up to hundreds of cores per chip, such power proxies become a computational bottleneck for power management operations such as chip power capping. In this paper, we show that a 4% accuracy or better for per-core power estimation can be achieved using an ultra compact power proxy based on a hybrid set of only four user-accessible parameters, namely core frequency, core temperature, instruction-per-cycle and active-state residency. Our proxy is nonlinear, valid across all P and C states, and is based on a randomized power data collection strategy that aims at exercising all the P and C levels of each core. We illustrate the accuracy of the model using the full suite of the SPEC CPU 2006 benchmarks on a 12-core processor.
15:30		End of session Coffee Break in Exhibition Area On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).

11.7 Power and Emerging Technologies in Reconfigurable Computing

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Konferenz 5

Chair:
Diana Goehringer, Ruhr-University Bochum (RUB), DE

Co-Chair:
Fabrizio Ferrandi, Politecnico di Milano, IT

The first two papers in this session propose new architectures that take advantage of emerging nonvolatile memory technologies. The third paper proposes a battery cell aware task partitioning and mapping to maximize battery runtime.

Time	Label	Presentation Title Authors
14:00	11.7.1	EXPLOITING STT-NV TECHNOLOGY FOR RECONFIGURABLE, HIGH PERFORMANCE, LOW POWER, AND LOW TEMPERATURE FUNCTIONAL UNIT DESIGN Speakers: Adarsh Reddy¹, Hamid Mahmoodi² and Houman Homayoun¹ ¹George Mason University, US; ²San Francisco State University, US Abstract Unavailability of functional units and their unequal activity create performance bottlenecks and thermal hot spots in general-purpose processors. We propose to use reconfigurable functional units to overcome these challenges. A selected set of complex functional units that might be under-utilized, such as a multiplier and divider, are realized in a time multiplexed fashion using a shared programmable Look Up Table (LUT) based fabric. This allows for run-time reconfiguration and migration of their activity. LUT based implementation also allows under-utilized functional units to be dynamically reconfigured to the functional units that have a performance bottleneck, hence improving performance. The programmable LUTs are realized using Spin Transfer Torque (STT) Magnetic technology (also called STT-NV) due to its zero leakage and CMOS compatibility. The results show significant performance improvement of 16% on average across standard benchmarks, when replacing CMOS multiplier and divider with reconfigurable STT-NV LUT counterpart. In addition, reconfiguration reduces the maximum temperature of functional units by up to 27°C and almost eliminates the thermal variation across them. This comes with small power overhead and no area impact.
14:30	11.7.2	A POWER-EFFICIENT RECONFIGURABLE ARCHITECTURE USING PCM CONFIGURATION TECHNOLOGY Speakers: Ali Ahari¹, Hossein Asadi¹, Behnam Khaleghi¹ and Mehdi Tahoori² ¹Sharif University of Technology, IR; ²Karlsruhe Institute of Technology, DE Abstract Promising advantages offered by resistive Non-Volatile Memories (NVMs) have brought great attention to replace existing volatile memory technologies. While NVMs were primarily studied to be used in the memory hierarchy, they can also provide benefits in Field-Programmable Gate Arrays (FPGAs). One major limitation of employing NVMs in FPGAs is significant power and area overheads imposed by the Peripheral Circuitry (PC) of NVM configuration bits. In this paper, we investigate the applicability of different NVM technologies for configuration bits of FPGAs and propose a power-efficient reconfigurable architecture based on Phase Change Memory (PCM). The proposed PCM-based architecture has been evaluated using different technology nodes and it is compared to the SRAM-based FPGA architecture. Power and Power Delay Product (PDP) estimations of the proposed architecture show up to 37.7% and 35.7% improvements over SRAM-based FPGAs, respectively, with less than 3.2% performance overhead.
15:00	11.7.3	EXTENDING LIFETIME OF BATTERY-POWERED COARSE-GRAINED RECONFIGURABLE COMPUTING PLATFORMS Speakers: Shouyi Yin¹, Peng Ouyang¹, Leibo Liu² and Shaojun Wei¹ ¹Tsinghua University, CN; ²Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, CN Abstract The coarse-grained reconfigurable architecture (CGRA) is a promising platform for mobile computing. This paper addresses how to prolong the lifetime of battery-powered reconfigurable computing platforms. Considering the nonlinear characteristics of the battery, a multi-objective optimization model is built for extending battery lifetime. Based on this model, a joint task-battery scheduling algorithm is proposed. The experimental results show that the proposed method achieves a 26.22% improvement in battery runtime on average compared to state-of-the-art methods.
15:30	IP5-20, 659	3D FPGA USING HIGH-DENSITY INTERCONNECT MONOLITHIC INTEGRATION Speakers: Ogun Turkyilmaz¹, Gerald Cibrario², Olivier Rozeau², Perrine Batude² and Fabien Clermidy³ ¹CEA-LETI, Minatec Campus, FR; ²CEA, FR; ³CEA-LETI, FR Abstract New 3D technology, called "Monolithic Integration", offers very dense 3D interconnect capabilities. In this paper, we propose a 3D FPGA architecture with logic-on-memory approach based on this technology. The routing and computation blocks are split into two layers where the logic is placed on the top and memory on the bottom. Using extracted values from layout in 14nm FDSOI technology, typical benchmark circuits are evaluated in the VPR5 toolflow. The results show an area reduction of 55% compared to the 2D FPGA. More importantly, due to the lowered routing congestion, the EDP of the 3D FPGA is improved by 47%.
15:32	IP5-21, 526	JOINT COMMUNICATION SCHEDULING AND INTERCONNECT SYNTHESIS FOR FPGA-BASED MANY-CORE SYSTEMS Speakers: Alessandro Cilardo, Edoardo Fusella, Luca Gallo and Antonino Mazzeo, University of Naples Federico II, IT Abstract This work proposes an automated methodology for optimizing FPGA-based many-core interconnect architectures. Based on the application communication requirements, the methodology concurrently defines the structure of the interconnect and the communication task scheduling, taking into account possible dependencies between tasks under given area constraints. The resulting architecture improves the level of communication parallelism that can be exploited while keeping area costs low. The paper thoroughly describes the proposed approach and discusses a few case-studies showing the impact of the proposed technique.
15:33	IP5-22, 688	A NOVEL EMBEDDED SYSTEM FOR VISION TRACKING Speakers: Antonis Nikitakis¹, Theofilos Paganos¹ and Ioannis Papaefstathiou² ¹Technical University of Crete, Department of Electronic and Computer Engineering Kounoupidiana, Chania, Crete, GR73100, Greece, GR; ²Synelixis Solutions Ltd, Farmakidou 10,Chalkida, GR34100, Greece, GR Abstract One of the most important challenges in the field of Computer Vision is the implementation of low-power embedded systems that will execute very accurate, yet real-time, algorithms. In the visual tracking sector one of the most promising approaches is the recently introduced OpenTLD algorithm which uses a random forest classification method. While it is very robust, it cannot be efficiently parallelized in its native form as its memory access pattern has certain characteristics that make it hard to take advantage of the conventional memory hierarchies. In this paper, we present a novel embedded system implementing this algorithm. We accelerate the bottleneck of the algorithm by designing and implementing a high bandwidth distributed memory sub-system which is independent of the various software parameters. We demonstrate the applicability and efficiency of this novel approach by implementing our scheme in a modern FPGA.
15:30		End of session Coffee Break in Exhibition Area

11.8 Embedded Tutorial: GPGPUs: how to combine high computational power with high reliability

Date: Thursday 27 March 2014
Time: 14:00 - 15:30
Location / Room: Exhibition Theatre

Organiser:
Matteo Sonza Reorda, Politecnico di Torino, IT

Chair:
Dimitris Gizopoulos, University of Athens, GR

Co-Chair:
Rob Aitken, ARM, US

The embedded tutorial aims to provide an updated view of what GPGPUs can offer not only in terms of performance and power, but also in terms of reliability, and of how the latter can be evaluated and possibly improved.

Time	Label	Presentation Title Authors
14:00	11.8.1	RELIABILITY REQUIREMENTS FOR GPUS IN HPC Speakers: Nathan A. DeBardeleben¹, Leonardo Bautista Gomez² and Franck Cappello² ¹Los Alamos National Laboratory, US; ²Argonne National Laboratory, US
14:30	11.8.2	GPU RELIABILITY ASSESSMENT AND ENHANCEMENT Speakers: Paolo Rech¹, Luigi Carro¹ and Steve Keckler² ¹UFRGS, BR; ²NVIDIA, US
15:00	11.8.3	EVALUATING THE ROBUSTNESS OF GPU APPLICATIONS THROUGH FAULT INJECTION Speakers: Karthik Pattabiraman¹, Bo Fang¹ and Sudhanva Gurumurthi² ¹UBC, CA; ²AMD, US
15:30		End of session Coffee Break in Exhibition Area

UB11 Session 11

Date: Thursday 27 March 2014
Time: 14:30 - 16:30
Location / Room: University Booth, Booth 3, Exhibition Area

Label	Presentation Title Authors
UB11.01	CYCLOSE: DESIGNING CLOUD-BASED SELF-HEALING CYBER-PHYSICAL SYSTEMS Authors: Giulio Gambardella¹, Silviu Folea², Mihai Hulea², Liviu Miclea², George Mois², Teodora Sanislav², Marco Indaco¹, Paolo Prinetto¹, Daniele Rolfo¹ and Pascal Trotta¹ ¹Politecnico di Torino, IT; ²Universitatea Tehnica din Cluj-Napoca Departamentul de Automatica, RO Abstract Cyber-Physical Systems (CPSs) are a new generation of systems capable of representing more than networking and information technology, with information and knowledge being integrated into physical objects. These types of systems are physical and engineered systems whose actions are monitored, controlled, and integrated by a computing and communication kernel. The Cyclose project aims at developing: (1) an infrastructure for designing self-healing Cyber-Physical Systems (CPSs) using cloud computing technology; (2) an experimental model for CPSs using wireless sensor networks (WSNs) for data acquisition, reliable hardware components based on reconfigurable devices - Field Programmable Gate Arrays (FPGAs) and cloud computing technology to store, manage and analyse data in a large context.
UB11.02	BRIDGING MATLAB/SIMULINK AND ESL DESIGN VIA AUTOMATIC CODE GENERATION Authors: Liyuan Zhang, Michael Glaß and Jürgen Teich, University of Erlangen-Nuremberg, DE Abstract Matlab/Simulink is today's de-facto standard for model-based design in domains such as control engineering and signal processing. Commercial tools are available to generate embedded C or HDL code directly from a Simulink model. However, Simulink models are purely functional models and, hence, designers cannot seamlessly consider the architecture that a Simulink model is later implemented on. In particular, it is not possible to explore the different architectural alternatives and investigate the arising interactions and side-effects directly within Simulink. To benefit from Matlab/Simulink's algorithm exploration capabilities and overcome the outlined drawbacks, we introduce a model transformation framework that converts a Simulink model to an executable specification, written in an actor-oriented modeling language. This specification then serves as the input of an established Electronic System Level (ESL) design flow, enabling Design Space Exploration (DSE) and automatic code generation for both hardware and software. In this demonstration, we will show how to automatically transform Simulink models to an established ESL design flow by means of a code generator. Based on the generated code, we will present a co-simulation approach that combines complex environmental models from Matlab/Simulink with the auto-generated model of a controller. We will use an Anti-lock Braking System (ABS) as an example where we investigate the impact of different controller implementations in the automotive E/E architecture. In detail, the following scientific achievements are included in the proposed demonstration: To bridge Simulink and ESL design flows, we developed an ESL Code-Generator to perform model transformation.
The idea is that for any given Simulink model, such as a controller in a control system, the designer can simply invoke our Code-Generator to create the ESL model automatically. In our design flow, we use SystemC as a programming language with an extension of actors with a specific Model of Computation (MoC). We guarantee the preservation of the semantics of the generated model by (a) applying a specific 1-to-1 mapping from Simulink basic blocks to an actor library and (b) considering different transformations to capture single-rate and multi-rate Simulink models. After the model transformation is finished, this auto-generated SystemC model serves as the input of a well-established ESL design flow that enables DSE. Besides the Code-Generator, we also demonstrate a validation technique that considers the functional correctness by comparing the original Simulink model with the generated SystemC model. The main idea behind this technique is (1) to co-simulate the auto-generated model along with the original model and (2) to reuse the environment model and the test bench that are originally created in Simulink also for the auto-generated model. Furthermore, the performance of the model can also be measured during co-simulation. In this demonstration, an ABS model will be transformed from Simulink to SystemC by invoking the ESL Code-Generator. Then, by applying our validation technique, the correctness and the accuracy of the auto-generated model can be examined. Lastly, to evaluate the performance of the model, application-dependent quality of control will be measured, such as the braking distance on an icy road.
UB11.03	BICONDITIONAL BINARY DECISION DIAGRAM MANIPULATION PACKAGE Authors: Luca Amaru¹, Alexios Balatsoukas-Stimming², Pierre-Emmanuel Gaillardon³, Andreas Burg² and Giovanni De Micheli³ ¹EPFL, CH; ²EPFL-TCL, CH; ³EPFL-LSI, CH Abstract In this software demonstration, we present a logic manipulation package based on Biconditional Binary Decision Diagrams (BBDDs). BBDDs are a novel class of canonical binary decision diagrams where the branching condition, and its associated logic expansion, is biconditional on two variables. We show how Verilog files from real life designs can be rapidly read and processed by the BBDD manipulation package, for verification, testing or synthesis purposes. In particular, we demonstrate the benefit deriving from BBDD re-writing of arithmetic circuits in the synthesis of a product code iterative decoder.
UB11.04	HEROES^2: A SYSTEMC FRAMEWORK FOR MODELING, SIMULATION AND TESTING OF HETEROGENEOUS SOFTWARE-INTENSIVE SYSTEMS Authors: Markus Becker¹, Wolfgang Mueller¹, Ulrich Kiffmeier² and Joachim Stroop² ¹University of Paderborn/C-LAB, DE; ²dSPACE GmbH, DE Abstract HeroeS^2 is a SystemC framework for modeling/simulation of heterogeneous SW-intensive systems. It has 8 abstraction levels for co-refinement of application/environment models from continuous/discrete models to networked embedded SW stacks. Support of various SW/comm. abstractions is achieved by combining AMS MoCs, TLM, HdS models (MW, RTOS, HAL) and QEMU user mode/system emulator. Interfacing with a commercial AUTOSAR toolchain is provided, i.e., code generators, integration and experimentation tools.
UB11.05	RESCV: RESOURCE-AWARE COMPUTER VISION APPLICATION ON HETEROGENEOUS MULTI-TILE ARCHITECTURE Authors: Ericles Sousa¹, Johny Paul², Vahid Lari¹, Frank Hannig¹, Jürgen Teich¹ and Walter Stechele² ¹University of Erlangen-Nuremberg, DE; ²Technische Universität München, DE Abstract We demonstrate the benefits of invasive computing by showing the efficiency and utilization improvements in a resource-aware manner by algorithmic selection of different invasive resources, such as TCPA (tightly-coupled processor array), and RISC processors. More specifically, we present a dynamic load balancing of a computer vision application between multiple RISC cores and a TCPA, based on invasive mechanisms supported by our operating system and the agent system.
UB11.06	SKETCH-BASED ESL VIRTUAL PROTOTYPING: SKETCH-BASED DESIGN AND SIMULATION-BASED EVALUATION FOR ESL VIRTUAL PROTOTYPING Authors: Rafael Rosales¹, Michael Glaß¹, Jürgen Teich¹, Bo Wang², Yang Xu² and Ralph Hasholzner² ¹University of Erlangen-Nuremberg, DE; ²Intel Mobile Communications, DE Abstract Virtual prototyping and Electronic System Level (ESL) modeling have become valuable approaches to cope with the ever-increasing complexity of embedded systems. Their effectiveness, however, is highly dependent on their quick development time and accuracy, which are conflicting goals. In this demonstration, we present (a) an ESL methodology [1] [2] for the simulation-based evaluation of power and performance of embedded systems by the use of virtual prototypes. Our methodology permits us to develop ESL models for design space exploration of dynamic power and performance management strategies and hardware/software co-design choices. (b) We present a novel sketch-based tool termed Mahler [3] for the very early design phase of ESL modeling. Mahler provides a playground to quickly model functionality and evaluate performance on different architecture implementations. In Mahler, ESL models are created by literally sketching with a pen or touch interface, e.g. a tablet stylus, or a touchless interface, such as a Leap Motion controller. The application and architecture models are transformed to an executable virtual prototype through sketch recognition. This approach provides a very intuitive and fast way to explore actor-oriented functional modeling and hardware/software partitioning. The output of Mahler is a simulation-ready SystemC-based source-code stub that can be refined for subsequent design iterations. We will show a model of a Voice over LTE (VoLTE) use case, consisting of a heterogeneous cellular SoC platform, together with a wireless channel fading model and a base station network model.
State-based [1] and polynomial-equation-based [4] power models are built and co-simulated for the SoC digital module and the RF transceiver module, respectively, to accurately capture their different power consumption characteristics. The entire end-to-end modeling enables efficient SoC performance and power simulation with proper network configuration in seconds, which is highly desirable in the early design exploration phase of cellular systems and in co-optimization with network vendors.
UB11.07	VERIFIC-MM Authors: Christoph Kuznik and Wolfgang Müller, University of Paderborn, DE Abstract Verific-MM is an approach to systematize and accelerate the coverage plan engineering as well as the verification environment's (functional) metric code generation -- usually a time-consuming and error-prone task -- in particular by (i) improving automation via assisted model-based approaches, utilizing recent industry standards such as UCIS and (ii) a supporting methodology suitable for various target (functional coverage) languages (IEEE-1800 SystemVerilog, IEEE-1647 e, IEEE-1666 SystemC).
16:30	End of session

IP5 Interactive Presentations

Date: Thursday 27 March 2014
Time: 15:30 - 16:00
Location / Room: Conference Level, foyer

Label	Presentation Title Authors
IP5-1	HYBRID WIRE-SURFACE WAVE ARCHITECTURE FOR ONE-TO-MANY COMMUNICATION IN NETWORK-ON-CHIP Speakers: Ammar Karkar¹, Nizar Dahir¹, Ra'ed Al-Dujaily², Kenneth Tong³, Terrence Mak⁴ and Alex Yakovlev¹ ¹School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, GB; ²General Systems Company, Baghdad - Iraq, IQ; ³Department of Electrical and Electronic Engineering, University College London, GB; ⁴Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong, CN Abstract Network-on-chip (NoC) is a communication paradigm that has emerged to tackle different on-chip challenges and has satisfied different demands in terms of high performance and economical interconnect implementation. However, merely metal based NoC pursuit offers limited scalability with the relentless technology scaling, especially in one-to-many (1-to-M) communication. To meet the scalability demand, this paper proposes a new hybrid architecture empowered by both metal interconnects and Zenneck surface wave interconnects (SWI). This architecture, in conjunction with newly proposed routing and global arbitration schemes, avoids overloading the NoC and alleviates traffic hotspots compared to the trend of handling 1-to-M traffic as unicast. This work addresses the system level challenges for intra chip multicasting. Evaluation results, based on a cycle-accurate simulation and hardware description, demonstrate the effectiveness of the proposed architecture in terms of power reduction ratio of 2 to 12X and average delay reduction of 25X or more, compared to a regular NoC. These results are achieved with negligible hardware overheads.
IP5-2	FAILURE ANALYSIS OF A NETWORK-ON-CHIP FOR REAL-TIME MIXED-CRITICAL SYSTEMS Speakers: Eberle A Rambo¹, Alexander Tschiene¹, Jonas Diemer¹, Leonie Ahrendts¹ and Rolf Ernst² ¹Technische Universität Braunschweig, DE; ²TU Braunschweig, DE Abstract Multi- and many-core architectures using Networks-on-Chip (NoC) are being explored for use in real-time safety-critical applications for their performance and efficiency. Such systems must provide isolation between tasks that may present distinct criticality levels. The NoC is critical to maintain the isolation property as it is a heavily used shared resource. To meet safety-standard requirements, such architectures require a systematic evaluation of the effects of all possible failures such as in the form of a Failure Mode and Effects Analysis (FMEA). We present the results of a detailed system-level analysis of a typical real-time mixed-critical network-on-chip architecture. This comprises an FMEA and error effects classification regarding duration and isolation violation.
IP5-3	COOLIP: SIMPLE YET EFFECTIVE JOB ALLOCATION FOR DISTRIBUTED THERMALLY-THROTTLED PROCESSORS Speakers: Pratyush Kumar, Hoeseok Yang, Iuliana Bacivarov and Lothar Thiele, ETH Zurich, CH Abstract Thermal constraints limit the time for which a processor can run at high frequency. Such thermal-throttling complicates the computation of response times of jobs. For multiple processors, a key decision is where to allocate the next job. For distributed thermally-throttled processors, we present COOLIP with a simple allocation policy: a job is allocated to the earliest available processor, and if there are several available simultaneously, to the coolest one. For Poisson distribution of inter-arrival times and Gaussian distribution of execution demand of jobs, COOLIP matches the 95-percentile response time of Earliest Finish-Time (EFT) policy which minimizes response time with full knowledge of execution demand of unfinished jobs and thermal models of processors. We argue that COOLIP performs well because it directs the processors into states such that a defined sufficient condition of optimality holds.
IP5-4	ENERGY OPTIMIZATION IN 3D MPSOCS WITH WIDE-I/O DRAM USING TEMPERATURE VARIATION AWARE BANK-WISE REFRESH Speakers: Mohammadsadegh Sadri¹, Matthias Jung², Christian Weis², Norbert Wehn² and Luca Benini¹ ¹Department of Electrical, Electronic and Information Engineering (DEI) University of Bologna, IT; ²Microelectronic Systems Design Research Group, University of Kaiserslautern, DE Abstract Heterogeneous 3D integrated systems with Wide-I/O DRAMs are a promising solution to squeeze more functionality and storage bits into an ever decreasing volume. Unfortunately, with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. We improve DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. In order to provide proof of our concepts we develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. On this platform we run the Android OS with real-world benchmarks to quantify the advantages of our ideas. We show improvements of 16% in DRAM refresh power due to temperature variation aware bank-wise refresh. Furthermore, two solutions are investigated to speed up system simulations: (1) Adaptive tuning of sampling intervals based on the estimated chip thermal profile, which results in speedups of 2X. (2) Hardware acceleration of thermal simulations using the Maxeler engine, which shows possible speedups of 12X.
IP5-5	EFFICIENT SIMULATION AND MODELLING OF NON-RECTANGULAR NOC TOPOLOGIES Speakers: Ji Qi and Mark Zwolinski, University of Southampton, GB Abstract With increasing chip complexity, Networks-on-Chips (NoCs) are becoming a central platform for future on-chip communications. Many regular NoC architectures have been proposed to eliminate the communication bottlenecks of traditional bus-based networks. Non-rectangular and irregular architectures have also been proposed to increase performance. However, the complexity of designing custom non-rectangular networks leads to a rapid increase in design and verification times. To alleviate the conflict between performance and efficiency, this paper proposes a novel method that efficiently constructs virtual non-rectangular topologies on a mesh network by using time-regulated models to emulate irregular patterns. Data routings on virtual hexagonal and two irregular geometries validate the proposed method. An MPEG-4 decoder is used to exemplify the proposed method for media applications. Results analysis shows the virtual topologies emulated by the proposed method can provide precise timing and energy performance.
IP5-6	MOVING FROM CO-SIMULATION TO SIMULATION FOR EFFECTIVE SMART SYSTEMS DESIGN Speakers: Franco Fummi¹, Michele Lora², Francesco Stefanni³, Dimitrios Trachanis⁴, Jan Vanhese⁴ and Sara Vinco² ¹University of Verona, EDALab s.r.l., IT; ²University of Verona, IT; ³EDALab s.r.l., IT; ⁴Agilent Technologies, BE Abstract Design of smart systems needs to cover a wide variety of domains, ranging from analogue to digital, with power devices, micro-sensors and actuators, up to MEMS. This high level of heterogeneity makes design a very challenging task, as each domain is supported by specific languages, modeling formalisms and simulation frameworks. A major issue is furthermore posed by simulation, which heavily impacts the design and verification loop and is very hard to build in such a heterogeneous context. On the other hand, achieving efficient simulation would indeed make smart system design feasible with respect to budget constraints. This work provides a formalization of the typical abstraction levels and design domains of a smart system. This taxonomy makes it possible to identify a precise role in the design flow for co-simulation and simulation scenarios. Moreover, a methodology is proposed to move from the co-simulated heterogeneity to a simulatable homogeneous representation in C++ of the entire smart system. The impact of heterogeneous or homogeneous models of computation is also examined. Experimental results prove the effectiveness of the proposed C++ generation for reaching high-speed simulation.
IP5-7	AUTOMATING DATA REUSE IN HIGH-LEVEL SYNTHESIS Speakers: Wim Meeus¹ and Dirk Stroobandt² ¹Imec and Ghent University, BE; ²Ghent University, BE Abstract Current High-Level Synthesis (HLS) tools perform excellently for the synthesis of computation kernels, but they often don't optimize memory bandwidth. As memory access is a bottleneck in many algorithms, the performance of the generated circuit will benefit substantially from memory access optimization. In this paper we present an automated method and a toolchain to detect reuse of array data in loop nests and to build hardware that exploits this data reuse. This saves memory bandwidth and improves circuit performance. We make use of the polyhedral representation of the source program, which makes our method computationally easy. Our software complements the existing HLS flows. Starting from a loop nest written in C, our tool generates a reuse buffer and a loop controller, and preprocesses the loop body for synthesis with an existing HLS tool. Our automated tool produces designs from unoptimized source code that are as efficient as those generated by a commercial HLS tool from manually-optimized source code.
IP5-8	A UNIVERSAL SYMMETRY DETECTION ALGORITHM Speaker: Peter Maurer, Dept. of Computer Sci., Baylor University, US Abstract Research on symmetry detection focuses on identifying and detecting new types of symmetry. We present an algorithm that is capable of detecting any type of permutation-based symmetry, including many types for which there are no existing algorithms. General symmetry detection is library-based, but symmetries that can be parameterized (i.e., total, partial, rotational, and dihedral symmetry) can be detected without using libraries. In many cases the algorithm is faster than existing techniques. Furthermore, it is simpler than most existing techniques and can easily be incorporated into existing software.
IP5-9	OPTIMIZATION OF DESIGN COMPLEXITY IN TIME-MULTIPLEXED CONSTANT MULTIPLICATIONS Speakers: Levent Aksoy¹, Paulo Flores² and Jose Monteiro³ ¹INESC-ID, PT; ²INESC-ID/IST ULisbon, PT; ³INESC-ID / IST, ULisbon, PT Abstract The multiplication of constants by a data input is an essential operation in digital signal processing (DSP) systems. For applications requiring a large number of constant multiplications under stringent hardware constraints, it is generally realized under a folded architecture, where a single constant selected from a set of multiple constants is multiplied by the data input at each time, called time-multiplexed constant multiplication (TMCM). This paper addresses the problem of optimizing the complexity of a TMCM design and introduces an algorithm that finds the least complex TMCM design by sharing the logic operators, i.e., adders, subtractors, adders/subtractors, and multiplexors (MUXes). It includes efficient search methods, yielding better results than existing TMCM algorithms.
IP5-10	HARDWARE PRIMITIVES FOR THE SYNTHESIS OF MULTITHREADED ELASTIC SYSTEMS Speakers: Giorgos Dimitrakopoulos¹, Seitanidis Ioannis², Anastasios Psarras¹, Konstantinos Tsiouris¹, Pavlos Matthaiakis³ and Jordi Cortadella⁴ ¹Democritus University of Thrace, GR; ²Democritus University of Thrace, GR; ³Mentor Graphics, FR; ⁴Universitat Politecnica de Catalunya, ES Abstract Elastic systems operate in a dataflow-like mode using a distributed scalable control and tolerating variable latency computations. At the same time, multithreading increases the utilization of processing units and hides the latency of each operation by time-multiplexing operations of different threads in the datapath. This paper proposes a model to unify multithreading and elasticity. A new multithreaded elastic control protocol is introduced, supported by low-cost elastic buffers that minimize the storage requirements without sacrificing performance. To enable the synthesis of multithreaded elastic architectures, new hardware primitives are proposed and utilized in two circuit examples to prove the applicability of the proposed approach.
IP5-11	DCM: AN IP FOR THE AUTONOMOUS CONTROL OF OPTICAL AND ELECTRICAL RECONFIGURABLE NOCS Speakers: Wolfgang Büter¹, Christof Osewold¹, Daniel Gregorek¹ and Alberto Garcia-Ortiz² ¹University of Bremen, DE; ²ITEM (U.Bremen), DE Abstract The increasing requirements for bandwidth and quality-of-service motivate the use of parallel interconnect architectures with several degrees of reconfiguration. This paper presents an IP, called Distributed Channel Management (DCM), to extend existing packet-switched NoCs with a reconfigurable point-to-point network seamlessly, i.e., without the need for any modification to the routers. The configuration of the reconfigurable network takes place dynamically and autonomously, so that the topology can be changed at run time. Furthermore, the architecture is scalable due to the autonomous decentralized administration of the links. The paper reports a thorough experimental analysis of the overhead of the approach at the gate level that considers different network parameters such as flit size and timing constraints.
IP5-12	MINIMALLY BUFFERED SINGLE-CYCLE DEFLECTION ROUTER Speakers: Gnaneswara Rao Jonna¹, John Jose¹, Rachana Radhakrishnan² and Madhu Mutyam¹ ¹Indian Institute of Technology, Madras, IN; ²Rajagiri School of Engineering & Technology, Kochhi, IN Abstract With the drift from computation centric designs to communication centric designs in the Chip Multi Processor (CMP) era, the interconnect fabric is gaining more importance. An efficient NoC in terms of power, area and average flit latency has a huge impact on the overall performance of a CMP. In the current work, we propose MinBSD - a minimally buffered, single cycle, deflection router. It incorporates different operations (Injection, Ejection, Preemption, Re-injection) in a single module to handle the traffic effectively and ensures smooth flow of flits through the router pipeline. It performs overlapped execution of independent operations. These factors not only enable MinBSD to operate in a single cycle but also reduce the critical path latency, resulting in a faster interconnect network. Experimental results show that MinBSD reduces the average flit latency on real workloads, and reduces die area and power consumption when compared to the existing state-of-the-art minimally buffered deflection routers.
IP5-13	FUNCTIONAL TEST GENERATION GUIDED BY STEADY-STATE PROBABILITIES OF ABSTRACT DESIGN Speakers: Jian Wang¹, Huawei Li², Tao Lv², Tiancheng Wang² and Xiaowei Li² ¹Institute of Computing Technology, Chinese Academy of Sciences, CN; ²Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract This paper presents a novel method for functional test generation aiming at exploring the control state space of the design. The steady-state probabilities (SPs) of the abstract design's control FSM are used to guide test generation. The SPs of the states reflect how hard the states are to reach, and the hard-to-reach states are assigned high priority to be exercised. Experimental results show that our method performs better in test generation than constrained random simulation, and demonstrate that SPs provide good guidance on traversing hard-to-reach states of the design under validation.
IP5-14	AUTOMATED SYSTEM TESTING USING DYNAMIC AND RESOURCE RESTRICTED CLIENTS Speakers: Mirko Caspar, Mirko Lippmann and Wolfram Hardt, Technische Universität Chemnitz, DE Abstract Testing on system level using a static and homogeneous architecture of clients is common practice. This paper introduces a new approach that uses a heterogeneous and dynamic set of resource restricted test clients for automated testing. Due to changing resources and availability of the clients, the test case distribution needs to be recalculated dynamically during the test execution. All necessary conditions and parameters are represented by a formal model. It is shown that the algorithmic problem of DYNAMIC TESTPARTITIONING can be solved in polynomial time by a heuristic recursive algorithm. A testbench architecture is introduced, and simulation shows that the testbench can execute the test requirements within a small variation using several hundred clients. The system can react dynamically to changing resources and availability of the test clients within several seconds. The approach is generic and can be adapted to a huge number of systems.
IP5-15	RELIABILITY-AWARE MAPPING OPTIMIZATION OF MULTI-CORE SYSTEMS WITH MIXED-CRITICALITY Speakers: Shin-Haeng Kang¹, Hoeseok Yang², Sungchan Kim³, Iuliana Bacivarov², Soonhoi Ha¹ and Lothar Thiele⁴ ¹Seoul National University, KR; ²ETH Zurich, CH; ³Chonbuk National University, KR; ⁴Swiss Federal Institute of Technology Zurich, CH Abstract This paper presents a novel mapping optimization technique for mixed-criticality multi-core systems with different reliability requirements. To this end, we derive a quantitative reliability metric and present a scheduling analysis that certifies given mixed-criticality constraints. Our framework is capable of investigating re-execution, passive replication, and modular redundancy with optimized voter placement, while typical hardening approaches consider only one or two of these techniques. The proposed technique complies with existing safety standards and is power-efficient, as demonstrated by our experiments.
IP5-16	(Best Paper Award Candidate) FROM SIMULINK TO NOC-BASED MPSOC ON FPGA Speakers: Francesco Robino and Johnny Öberg, KTH Royal Institute of Technology, SE Abstract Network-on-chip (NoC) based multi-processor systems are promising candidates for future embedded system platforms. However, because of their complexity, new high level modeling techniques are needed to design, simulate and synthesize embedded systems targeting NoC-based MPSoC. Simulink is a popular modeling environment suitable for modeling at system level. However, there is no clear standard to synthesize Simulink models into SW and HW towards a NoC-based MPSoC implementation. In addition, many of the proposed solutions require large overhead in terms of SW components and memory requirements, resulting in complex and customized multi-processor platforms. In this paper we present a novel design flow to synthesize Simulink models onto a NoC-based MPSoC running on low-cost FPGAs. Our design flow constrains the MPSoC and the Simulink model to share a common semantics domain. This reduces the need for resource-consuming SW components and thus the memory requirements on the platform. At the same time, the performance (throughput) of dataflow applications can increase when the number of processors of the target platform is increased. This is shown through a case study on FPGA.
IP5-17	(Best Paper Award Candidate) THERMAL ANALYSIS AND MODEL IDENTIFICATION TECHNIQUES FOR A LOGIC + WIDEIO STACKED DRAM TEST CHIP Speakers: Francesco Beneventi¹, Andrea Bartolini¹, Pascal Vivet², Denis Dutoit² and Luca Benini¹ ¹DEI - University of Bologna, IT; ²CEA-Leti, Grenoble, FR Abstract High temperature is one of the limiting factors and major concerns in 3D-chip integration. In this paper we use a 3D test chip (WIDEIO DRAM on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects. We correlated real temperature measurements with the power dissipated by the heaters using model learning techniques. The resulting compact thermal model is able to predict temperatures at chip locations far from the temperature sensors and to infer the power dissipation at any location of the chip. Results are verified by means of an off-sample validation technique and show a high accuracy of the compact thermal model when compared with silicon measurements.
IP5-18	ADAPTIVE POWER ALLOCATION FOR MANY-CORE SYSTEMS INSPIRED FROM MULTIAGENT AUCTION MODEL Speakers: Xiaohang Wang¹, Baoxin Zhao¹, Terrence Mak², Mei Yang³, Yingtao Jiang³, Masoud Daneshtalab⁴ and Maurizio Palesi⁵ ¹Guangzhou Institute of Advanced Technology, CN; ²The Chinese University of Hong Kong, CN; ³University of Nevada, Las Vegas, US; ⁴University of Turku, FI; ⁵University of Enna, Kore, IT Abstract Scaling of future many-core chips is hindered by the challenge imposed by ever-escalating power consumption. At its worst, an increasing fraction of the chips will have to be shut down, as power supply is inadequate to simultaneously switch all the transistors. This so-called dark silicon problem brings up a critical issue regarding how to achieve the maximum performance with a given limited power budget. This issue is further complicated by two facts. First, high variation in power budget calls for wide range power control capability, whereas most current frequency/voltage scaling techniques cannot effectively adjust power over such a wide range. Second, as the applications' behavior becomes more complicated, there is a pressing need for scalability and global coordination, rendering heuristic-based centralized or fully distributed control schemes inefficient. To address the aforementioned problems, this paper proposes a power allocation method employing multiagent auction models, referred to as Hierarchal MultiAgent based Power allocation (HiMAP). Tiles play the role of consumers that bid for power budget, and the whole process is modeled by a combinatorial auction for which HiMAP finds the Walrasian equilibria. Experimental results have confirmed that HiMAP can reduce the execution time by as much as 45% compared to three competing methods. The runtime overhead and cost of HiMAP are also small, which makes it suitable for adaptive power allocation in many-core systems.
IP5-19	UNIFIED, ULTRA COMPACT, QUADRATIC POWER PROXIES FOR MULTI-CORE PROCESSORS Speakers: Muhammad Yasin¹, Ibrahim (Abe) Elfadel² and Anas Shahrour² ¹New York University - Abu Dhabi, AE; ²Masdar Institute of Science and Technology, AE Abstract Per-core power proxies for multi-core processors are known to use several dozens of hardware activity monitors to achieve a 2% accuracy on core power estimation. These activity monitors are typically not accessible to the user, and even if they were accessible, there would be a significant overhead in using them at the kernel or OS level for power monitoring or control. Furthermore, when scaled up to hundreds of cores per chip, such power proxies become a computational bottleneck for power management operations such as chip power capping. In this paper, we show that a 4% accuracy or better for per-core power estimation can be achieved using an ultra compact power proxy based on a hybrid set of only four user-accessible parameters, namely core frequency, core temperature, instruction-per-cycle and active-state residency. Our proxy is nonlinear, valid across all P and C states, and is based on a randomized power data collection strategy that aims at exercising all the P and C levels of each core. We illustrate the accuracy of the model using the full suite of the SPEC CPU 2006 benchmarks on a 12-core processor.
IP5-20	3D FPGA USING HIGH-DENSITY INTERCONNECT MONOLITHIC INTEGRATION Speakers: Ogun Turkyilmaz¹, Gerald Cibrario², Olivier Rozeau², Perrine Batude² and Fabien Clermidy³ ¹CEA-LETI, Minatec Campus, FR; ²CEA, FR; ³CEA-LETI, FR Abstract A new 3D technology, called "Monolithic Integration", offers very dense 3D interconnect capabilities. In this paper, we propose a 3D FPGA architecture with a logic-on-memory approach based on this technology. The routing and computation blocks are split into two layers where the logic is placed on the top and memory on the bottom. Using values extracted from layout in 14 nm FDSOI technology, typical benchmark circuits are evaluated in the VPR5 toolflow. The results show an area reduction of 55% compared to the 2D FPGA. More importantly, due to the lowered routing congestion, the EDP of the 3D FPGA is improved by 47%.
IP5-21	JOINT COMMUNICATION SCHEDULING AND INTERCONNECT SYNTHESIS FOR FPGA-BASED MANY-CORE SYSTEMS Speakers: Alessandro Cilardo, Edoardo Fusella, Luca Gallo and Antonino Mazzeo, University of Naples Federico II, IT Abstract This work proposes an automated methodology for optimizing FPGA-based many-core interconnect architectures. Based on the application communication requirements, the methodology concurrently defines the structure of the interconnect and the communication task scheduling, taking into account possible dependencies between tasks under given area constraints. The resulting architecture improves the level of communication parallelism that can be exploited while keeping area costs low. The paper thoroughly describes the proposed approach and discusses a few case-studies showing the impact of the proposed technique.
IP5-22	A NOVEL EMBEDDED SYSTEM FOR VISION TRACKING Speakers: Antonis Nikitakis¹, Theofilos Paganos¹ and Ioannis Papaefstathiou² ¹Technical University of Crete, Department of Electronic and Computer Engineering Kounoupidiana, Chania, Crete, GR73100, Greece, GR; ²Synelixis Solutions Ltd, Farmakidou 10,Chalkida, GR34100, Greece, GR Abstract One of the most important challenges in the field of Computer Vision is the implementation of low-power embedded systems that will execute very accurate, yet real-time, algorithms. In the visual tracking sector one of the most promising approaches is the recently introduced OpenTLD algorithm which uses a random forest classification method. While it is very robust, it cannot be efficiently parallelized in its native form as its memory access pattern has certain characteristics that make it hard to take advantage of the conventional memory hierarchies. In this paper, we present a novel embedded system implementing this algorithm. We accelerate the bottleneck of the algorithm by designing and implementing a high bandwidth distributed memory sub-system which is independent of the various software parameters. We demonstrate the applicability and efficiency of this novel approach by implementing our scheme in a modern FPGA.
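
Illustration for IP5-9: the folded architecture described in that abstract can be pictured with a small sketch. The Python below is a hypothetical example, not the authors' algorithm. It assumes a constant set {5, 7} and a hand-chosen datapath in which one shifter and one adder/subtractor are shared, with a select input playing the role of the MUX controls. The function name tmcm_5_or_7 is invented for this illustration.

```python
def tmcm_5_or_7(x, select):
    """Multiply x by 5 (select == 0) or by 7 (select == 1).

    Hypothetical folded datapath: a single shifter and a single
    adder/subtractor are shared between both constants.
    """
    # MUX 1: choose the shift amount (5x = 4x + x, 7x = 8x - x).
    shift = 2 if select == 0 else 3
    shifted = x << shift
    # MUX 2: choose whether the shared operator adds or subtracts.
    if select == 0:
        return shifted + x
    return shifted - x


# One constant is applied to the data input at each time step.
outputs = [tmcm_5_or_7(3, s) for s in (0, 1, 0)]
```

Sharing the operator means both constants cost one adder/subtractor plus two small MUXes instead of two separate multiplier blocks, which is the kind of trade-off a TMCM optimizer searches over.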

12.1 SPECIAL DAY Hot Topic: The future of interfacing to the natural world

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Saal 1

Organisers:
Ian O'Connor, Lyon Institute of Nanotechnology, FR
Thomas Mikolajick, NamLab gGmbH, DE

Chair:
Michael Huebner, Ruhr Universitaet Bochum, DE

Co-Chair:
Ian O'Connor, Lyon Institute of Nanotechnology, FR

Challenges for acquiring and processing data from the real world include the development of interfaces capable of extracting relevant information from massive sensor networks or from living organisms, sifting through the wealth of data to arrive systematically at a meaningful conclusion, and building hardware platforms suited to carry out these operations in an energy-efficient way. The first paper in this session looks at the necessarily complex processing of chemical information with hardware components that are capable of responding to various chemical conditions. Interfaces to living organisms are examined in the second paper, which discusses challenges and approaches for efficient detection of disease. In the third paper, novel hardware devices and architectures are explored for use in energy-efficient video analysis applications such as movement detection and face recognition. The fourth paper discusses handling of complex data with large-scale GPU-based recurrent networks, exploiting specific features of the data to improve energy efficiency.

Time	Label	Presentation Title Authors
16:00	12.1.1	INTEGRATED CIRCUITS PROCESSING CHEMICAL INFORMATION: PROSPECTS AND CHALLENGES Speakers: Andreas Richter, Axel Voigt, René Schüffny, Stephan Henker and Marcus Völp, Technische Universität Dresden, DE Abstract The unbelievable properties of our information processing capabilities regarding the processing of big data, resilience, and energy efficiency are inspiration sources for the optimization and the rethinking of the principles of electronic information processing. Here, we present an approach of integrated circuits intended to solve chemical problems by active processing of chemical information.
16:25	12.1.2	INTERFACING TO LIVING CELLS Speaker: Rudy Lauwereins, IMEC, BE Abstract Recent advances in More than Moore technology enable close inspection of and even direct interfacing to living cells. This paper illustrates this through three use cases. In the first use case, the type or quality of billions of cells is quickly inspected in a fluidic medium. Secondly, the effect of potential drugs is monitored in neural cell cultures. In the third use case, neural brain activity is recorded in vivo using implantable electrodes to understand how the brain functions.
16:45	12.1.3	VIDEO ANALYTICS USING BEYOND CMOS DEVICES Speakers: Vijaykrishnan Narayanan¹, Gert Cauwenberghs², Donald Chiarulli³, Suman Datta⁴, Steve Levitan³ and Philip Wong⁵ ¹Penn State University, US; ²University of California at San Diego, US; ³University of Pittsburgh, US; ⁴The Pennsylvania State University, US; ⁵Stanford University, US Abstract The human vision system understands and interprets complex scenes for a variety of visual tasks in real-time while consuming less than 20 Watts of power. The holistic design of artificial vision systems that will approach and eventually exceed the capabilities of human vision systems is a grand challenge. The design of such a system needs advances in multiple disciplines. This paper focuses on advances needed in the computational fabric and provides an overview of a new genre of architectures inspired by advances in both the understanding of the visual cortex and the emergence of devices with new mechanisms for state computations.
17:10	12.1.4	ENERGY EFFICIENT NEURAL NETWORKS FOR BIG DATA ANALYTICS Speakers: Wang Yu, Boxun Li, Rong Luo, Yiran Chen, Ningyi Xu and Huazhong Yang, Tsinghua University, CN Abstract The world is experiencing a data revolution to discover knowledge in big data. Sequential data, such as text, speech and video, are the primary sources of big data. The recurrent network is a powerful model for processing sequential data because of its ability to capture the long-term latent dependencies and features of the data. However, the difficulty of training a recurrent network, especially the huge requirement of computing power, has kept the recurrent network from becoming a mainstream tool in mining big data. In this paper, we propose an efficient GPU implementation of large-scale recurrent network training. The proposed GPU implementation is based on a fast approximation technique of activation functions and a fine-grained two-stage pipeline architecture. We also propose a parallel realization of the stochastic gradient descent (SGD), one of the most popular but sequential algorithms for network training. The experimental results demonstrate that the proposed GPU implementation is able to realize at least 6x speedup on a single GTX580 GPU compared with the CPU implementation on an Intel Xeon E5-2690 (16 cores) with MKL library. Meanwhile, the trained large-scale recurrent network can achieve state-of-the-art performance on the Microsoft Research Sentence Completion Challenge, a challenge set for advancing language modeling.
17:30		End of session

12.2 Hot Topic: How Secure are PUFs Really? On the Reach and Limits of Recent PUF Attacks

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Konferenz 6

Organiser:
Ulrich Rührmair, TU München, DE

Chair:
Ulf Schlichtmann, TU München, DE

PUFs are an emerging and promising security primitive. However, some strong attacks on their core security features have been reported recently, for example on their unclonability. We discuss the reach, but also the limits of these attacks, aiming at a well-balanced treatment, and also evaluate the future perspectives of the field.

Time	Label	Presentation Title Authors
16:00	12.2.1	PUFS AT A GLANCE Speakers: Ulrich Rührmair¹ and Daniel E. Holcomb² ¹TU München, DE; ²University of Michigan, US Abstract Physical Unclonable Functions (PUFs) are a new, hardware-based security primitive, which was introduced about a decade ago. In this paper, we provide a brief and easily accessible overview of the area. We describe the typical security features, implementations, attacks, protocol uses, and applications of PUFs. Special focus is placed on the two most prominent PUF types, so-called "Weak PUFs" and "Strong PUFs", and their mutual differences.
16:15	12.2.2	PUF MODELING ATTACKS: AN INTRODUCTION AND OVERVIEW Speakers: Ulrich Rührmair¹ and Jan Sölter² ¹TU München, DE; ²Freie Universität Berlin, DE Abstract Machine learning (ML) based modeling attacks are the currently most relevant and effective attack form for so-called Strong Physical Unclonable Functions (Strong PUFs). We provide an overview of this method in this paper: We discuss the basic conditions under which it is applicable; the ML algorithms that have been used in this context; the latest and most advanced results on simulated and silicon data; the right interpretation of existing results; and possible future research directions.
16:30	12.2.3	HYBRID SIDE-CHANNEL / MACHINE-LEARNING ATTACKS ON PUFS: A NEW THREAT? Speakers: Xiaolin Xu and Wayne Burleson, Umass, Amherst, US Abstract Machine Learning (ML) is a well-studied strategy for modeling Physical Unclonable Functions (PUFs) but reaches its limits when applied to instances of high complexity. To address this issue, side-channel attacks are combined with ML to help reduce the computational workload of ML modeling attacks and make them more applicable. In this work, we present the currently known hybrid side-channel attacks on PUFs. A taxonomy is proposed based on the characteristics of different side-channel attacks. The practical reach of some published side-channel attacks is discussed. Both challenges and opportunities for PUF attackers are introduced. Countermeasures against certain side-channel attacks are also analyzed. To better understand the side-channel attacks on PUFs, three different methodologies of implementing side-channel attacks are compared. At the end of this paper, we bring forward some open problems for this research area.
16:45	12.2.4	PHYSICAL VULNERABILITIES OF PHYSICALLY UNCLONABLE FUNCTIONS Speakers: Clemens Helfmeier, Dmitry Nedospasov, Shahin Tajik, Christian Boit and Jean-Pierre Seifert, Technische Universität Berlin, DE Abstract In recent years one of the most popular areas of research in hardware security has been Physically Unclonable Functions (PUFs). PUFs provide primitives for implementing tamper detection, encryption and device fingerprinting. One particularly common application is replacing Non-volatile Memory (NVM) as key storage in embedded devices like smart cards and secure microcontrollers. Though a wide array of PUFs has been demonstrated in the academic literature, vendors have only begun to roll out PUFs in their end-user products. Moreover, the improvement to overall system security provided by PUFs is still the subject of much debate. This work reviews the state of the art of PUFs in general, and as a replacement for key storage in particular. We also review techniques and methodologies which make the physical response characterization and physical/digital cloning of PUFs possible.
17:00	12.2.5	PROTOCOL ATTACKS ON ADVANCED PUF PROTOCOLS AND COUNTERMEASURES Speakers: Marten van Dijk¹ and Ulrich Rührmair² ¹University of Connecticut, US; ²TU München, DE Abstract In recent years, PUF-based schemes have not only been suggested for the basic security tasks of tamper sensitive key storage or system identification, but also for more complex cryptographic protocols like oblivious transfer (OT), bit commitment (BC), or key exchange (KE). These more complex protocols are secure against adversaries in the stand-alone, good PUF model. In this survey, a shortened version of [17], we explain the stronger bad PUF model and PUF re-use model. We argue why these stronger attack models are realistic, and that existing protocols, if used in practice, will need to face these. One consequence is that the design of advanced cryptographic PUF protocols needs to be strongly reconsidered. It suggests that Strong PUFs require additional hardware properties in order to be broadly usable in such protocols: Firstly, they should ideally be erasable, meaning that single PUF-responses can be erased without affecting other responses. If the area efficient implementation of this feature turns out to be difficult, new forms of Controlled PUFs [3] (such as Logically Erasable and Logically Reconfigurable PUFs [6]) may suffice in certain applications. Secondly, PUFs should be certifiable, meaning that one can verify that the PUF has been produced faithfully and has not been manipulated in any way afterwards. The combined implementation of these features represents a pressing and challenging problem for the PUF hardware community.
17:15	12.2.6	QUO VADIS, PUF? TRENDS AND CHALLENGES OF EMERGING PHYSICAL-DISORDER BASED SECURITY Speakers: Masoud Rostami¹, Farinaz Koushanfar², James Wendt³ and Miodrag Potkonjak³ ¹Rice University, US; ²Rice University, US; ³UCLA, US Abstract The Physical Unclonable Function (PUF) has emerged as a popular and widely studied security primitive based on the randomness of the underlying physical medium. To date, most of the research emphasis has been placed on finding new ways to measure randomness, hardware realization and analysis of a few initially proposed structures, and conventional secret-key based protocols. In this work, we offer our subjective analysis of the emerging and future trends in this area that aim to change the scope, widen the application domain, and make lasting impact. We emphasize the development of new PUF-based primitives and paradigms, robust protocols, public-key protocols, digital PUFs, new technologies, implementation, metrics and tests for evaluation/validation, as well as relevant attacks and countermeasures.
17:30		End of session
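
Illustration for 12.2.2: the idea of an ML-based modeling attack on a Strong PUF can be sketched on simulated data. The Python below is a minimal sketch under stated assumptions, not the speakers' method. It assumes an arbiter-style PUF described by the commonly used additive linear delay model, a simulated secret weight vector, and plain logistic regression trained by gradient descent. All function names and parameter values (16 stages, 1000 training pairs) are hypothetical choices for this example.

```python
import math
import random


def parity_features(challenge):
    """Map a 0/1 challenge to the +/-1 suffix products used by the
    additive delay model, plus a constant term for the arbiter bias."""
    features = []
    product = 1
    for bit in reversed(challenge):
        product *= 1 - 2 * bit  # 0 -> +1, 1 -> -1
        features.append(product)
    features.reverse()
    features.append(1)
    return features


def response(weights, challenge):
    """Sign of the weighted feature sum: the modeled PUF response bit."""
    total = sum(w * f for w, f in zip(weights, parity_features(challenge)))
    return 1 if total > 0 else 0


def train_logistic(samples, labels, epochs=100, rate=0.5):
    """Batch gradient descent on the logistic loss."""
    model = [0.0] * len(samples[0])
    for _ in range(epochs):
        gradient = [0.0] * len(model)
        for x, y in zip(samples, labels):
            z = sum(m * xi for m, xi in zip(model, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for i, xi in enumerate(x):
                gradient[i] += error * xi
        model = [m - rate * g / len(samples) for m, g in zip(model, gradient)]
    return model


def run_attack(stages=16, n_train=1000, n_test=500, seed=0):
    """Simulate a PUF, learn it from challenge-response pairs, and
    return the prediction accuracy on unseen challenges."""
    rng = random.Random(seed)
    secret = [rng.gauss(0.0, 1.0) for _ in range(stages + 1)]  # simulated device
    train = [[rng.randint(0, 1) for _ in range(stages)] for _ in range(n_train)]
    test = [[rng.randint(0, 1) for _ in range(stages)] for _ in range(n_test)]
    model = train_logistic([parity_features(c) for c in train],
                           [response(secret, c) for c in train])
    hits = sum(response(model, c) == response(secret, c) for c in test)
    return hits / n_test


accuracy = run_attack()
```

The attacker never sees the secret weights. It only observes challenge-response pairs, which is why the number of exposed pairs is a central quantity when judging the reach of such attacks.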

12.3 Multimedia Systems

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Konferenz 1

Chair:
Theocharis Theocharides, University of Cyprus, CY

Co-Chair:
Cristiana Bolchini, Politecnico di Milano, IT

The session presents designs for energy-efficient and flexible implementations of advanced video coders and image acquisition/processing systems.

Time	Label	Presentation Title Authors
16:00	12.3.1	FLEXIBLE AND SCALABLE IMPLEMENTATION OF H.264/AVC ENCODER FOR MULTIPLE RESOLUTIONS USING ASIPS Speakers: Hong Chinh Doan, Haris Javaid and Sri Parameswaran, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, AU Abstract Real-time encoding of video streams is computationally intensive and rarely carried out at high resolutions. In this paper, for the first time, we propose a platform for an H.264 encoder which is both flexible (allows software upgrades) and scalable (supports multiple resolutions), and supports high video quality (by using both intraprediction and interprediction) and high throughput (by exploiting slice-level and pixel-level parallelism). Our platform uses multiple Application Specific Instruction set Processors (ASIPs) with local and shared memories, and hardware accelerators (in the form of custom instructions). Our platform can be configured at design time to use a particular number of ASIPs (slices per video frame) for a specific video resolution. The MPSoC architecture is automatically generated by our platform and the H.264 software does not need any modification, which enables quick design space exploration. We implemented the proposed platform in a commercial design environment, and illustrated its utility by creating systems with up to 170 ASIPs supporting resolutions up to HD1080. We further show how power gating can be used in our platform to reduce energy consumption.
16:30	12.3.2	A FLEXIBLE ASIP ARCHITECTURE FOR CONNECTED COMPONENTS LABELING IN EMBEDDED VISION APPLICATIONS Speakers: Juan Fernando Eusse¹, Rainer Leupers¹, Gerd Ascheid¹, Patrick Sudowe¹, Bastian Leibe¹ and Tamon Sadasue² ¹RWTH Aachen University, DE; ²RICOH Company LTD., JP Abstract Real-time identification of connected regions of pixels in large (e.g. FullHD) frames is a mandatory and expensive step in many computer vision applications that are becoming increasingly popular in embedded mobile devices such as smartphones, tablets and head mounted devices. Standard off-the-shelf embedded processors are not yet able to cope with the performance/flexibility trade-offs required by such applications. Therefore, in this work we present an Application Specific Instruction Set Processor (ASIP) tailored to concurrently execute thresholding, connected components labeling and basic feature extraction of image frames. The proposed architecture is able to cope with frame complexities ranging from QCIF to FullHD frames with 1 to 4 bytes-per-pixel formats, while achieving an average frame rate of 30 frames-per-second (fps). Synthesis was performed for a standard 65 nm CMOS library, obtaining an operating frequency of 350 MHz and an area of 2.1 mm². Moreover, evaluations were conducted both on typical and synthetic data sets, in order to thoroughly assess the achievable performance. Finally, an entire planar-marker based augmented reality application was developed and simulated for the ASIP.
17:00	12.3.3	IMAGE PROGRESSIVE ACQUISITION FOR HARDWARE SYSTEMS Speakers: Jianxiong Liu, Christos Bouganis and Peter Y.K. Cheung, Imperial College London, GB Abstract As the resolution of digital images increases, accessing raw image data from memory has become a major consideration during the design of image/video processing systems. This is because the bandwidth requirement and energy consumption of such image accesses have increased. Inspired by the successful application of progressive image sampling techniques in many image processing tasks, this work proposes to apply a similar concept within hardware systems to efficiently trade image quality for reduced memory bandwidth requirement and lower energy consumption. Based on this idea, a hardware system is proposed that is placed between the memory subsystem and the processing core of the design. The proposed system alters the conventional memory access pattern to progressively and adaptively access pixels from a target memory external to the system. The sampled pixels are used to reconstruct an approximation to the ground truth, which is stored in an internal image buffer for further processing. The system is prototyped on FPGA and its performance evaluation shows that a saving of up to 85% of memory accessing time and 33%/45% of image acquisition time/energy is achieved on the benchmark image "lena" while maintaining a PSNR of about 30 dB.
17:15	12.3.4	HIGH-QUALITY REAL-TIME HARDWARE STEREO MATCHING BASED ON GUIDED IMAGE FILTERING Speakers: Christos Ttofis and Theocharis Theocharides, University of Cyprus, CY Abstract Stereo matching is a vital task in several emerging embedded vision applications requiring high-quality depth computation and real-time frame-rate. Although several stereo matching dedicated-hardware systems have been proposed in recent years, only a few of them focus on balancing accuracy and speed. This paper proposes a hardware-based stereo matching architecture that aims to provide high accuracy and concurrently high performance in embedded vision applications. The proposed architecture integrates a compact and efficient design of the recently proposed guided image filter, an edge-preserving filter that reduces the hardware complexity of the implemented stereo algorithm while maintaining high-quality results. A prototype of the architecture has been implemented on a Kintex-7 FPGA board, achieving 60 fps for 720p resolution images. Moreover, the proposed design delivers leading accuracy when compared to state-of-the-art hardware implementations.
17:30		End of session

12.4 Physical Aspects

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Konferenz 2

Chair:
Carl Sechen, University of Texas at Dallas, US

Co-Chair:
Jens Lienig, Technische Universität Dresden, DE

This session focuses on contemporary issues in physical design. The first paper concerns detailed placement for sub-20nm technologies. Then pattern matching for more efficient hotspot detection is introduced. Finally, a flow for minimizing the number of wiring layers on multichip interposers is presented.

Time	Label	Presentation Title Authors
16:00	12.4.1	OPTIMIZATION OF STANDARD CELL BASED DETAILED PLACEMENT FOR 16 NM FINFET PROCESS Speakers: Yuelin Du and Martin D. F. Wong, University of Illinois at Urbana-Champaign, US Abstract FinFET transistors have great advantages over traditional planar MOSFET transistors in high performance and low power applications. Major foundries are adopting the FinFET technology for CMOS semiconductor device fabrication in the 16 nm technology node and beyond. Edge device degradation is among the major challenges for the FinFET process. To avoid such degradation, dummy gates are needed on device edges, and the dummy gates have to be tied to power rails in order not to introduce unconnected parasitic transistors. This requires that each dummy gate must abut at least one source node after standard cell placement. If the drain nodes at two adjacent cell boundaries abut each other, additional source nodes must be inserted in between for dummy gate power tying, which costs more placement area. Usually there is some flexibility during detailed placement to horizontally flip the cells or switch the positions of adjacent cells, which has little impact on the global placement objectives, such as timing conditions and net congestion. This paper proposes a detailed placement optimization strategy for standard cell based designs. By flipping a subset of cells in a standard cell row and switching pairs of adjacent cells, the number of drain to drain abutments between adjacent cell boundaries can be optimally minimized, which saves additional source node insertion and reduces the length of the standard cell row. In addition, the proposed graph model can be easily modified to consider more complicated design rules. The experimental results show that the optimization of 100k cells is completed within 0.1 second, verifying the efficiency of the proposed algorithm.
16:30	12.4.2	SIGNATURE INDEXING OF DESIGN LAYOUTS FOR HOTSPOT DETECTION Speakers: Cristian Andrades¹, Andrea Rodriguez¹ and Charles Chiang² ¹Universidad de Concepcion, CL; ²Synopsys Inc., US Abstract This work presents a new signature for 2D spatial configurations that is useful for the optimization of a hotspot detection process. The signature is a string of numbers representing changes along the horizontal and vertical slices of a configuration, which serves as the key of an inverted index that groups layout windows with the same signature. The method extracts signatures from a compact specification of similar exact patterns with a fixed size. Then, these signatures are used as search keys of the inverted index to retrieve candidate windows that can match the patterns. Experimental results show that this simple type of signature has 100% recall and, on average, over 85% precision in terms of the area effectively covered by the pattern and the retrieved area of the layout. In addition, the signature shows good discriminative quality, since around 99% of the extracted signatures each match a single pattern.
17:00	12.4.3	METAL LAYER PLANNING FOR SILICON INTERPOSERS WITH CONSIDERATION OF ROUTABILITY AND MANUFACTURING COST Speakers: Wen-Hao Liu, Tzu-Kai Chien and Ting-Chi Wang, National Tsing Hua University, TW Abstract A 2.5D IC provides a silicon interposer to integrate multiple dies into a package, which not only offers better performance than 2D ICs but also has lower manufacturing complexity than true 3D ICs. In an interposer, routing wires connect signals between dies or route signals from dies to the package substrate. The number of metal layers in an interposer is one of the critical factors affecting the routability and manufacturing cost of the 2.5D IC. Thus, achieving a 100% routing completion rate in an interposer using a minimum number of metal layers plays a key role in the success of a 2.5D IC. This paper presents a global-routing-based metal layer planner called VGR to identify a minimal number of metal layers for an interposer with consideration of routability and manufacturing cost. Also, VGR can identify a good stacking order of the horizontal and vertical layers in an interposer such that the routing solution in the interposer requires fewer vias. To the best of our knowledge, this paper is the first study to solve the metal layer planning problem for silicon interposers.
17:30		End of session
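
The cell-flipping idea summarized in 12.4.1 can be illustrated with a small dynamic program over one standard cell row. This is a minimal sketch under stated assumptions, not the authors' graph model: it considers horizontal flips only (no switching of adjacent cells), and the names `orient`, `count_drain_abutments` and `min_drain_abutments`, as well as the "S"/"D" encoding of the left and right boundary nodes, are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Cell = Tuple[str, str]  # (left boundary node, right boundary node), each "S" or "D"


def orient(cell: Cell, flipped: bool) -> Cell:
    """Return the cell's boundary nodes after an optional horizontal flip."""
    return (cell[1], cell[0]) if flipped else cell


def count_drain_abutments(cells: List[Cell], flips: List[bool]) -> int:
    """Count adjacent cell boundaries where a drain faces a drain."""
    placed = [orient(c, f) for c, f in zip(cells, flips)]
    return sum(1 for a, b in zip(placed, placed[1:]) if a[1] == "D" and b[0] == "D")


def min_drain_abutments(cells: List[Cell]) -> Tuple[int, List[bool]]:
    """Choose flips for one row that minimize drain-to-drain abutments."""
    if not cells:
        return 0, []
    # cost[f]: fewest abutments so far if the current cell has flip state f
    cost = [0, 0]
    back: List[List[int]] = []  # back[i-1][f]: best flip state of cell i-1
    for i in range(1, len(cells)):
        new_cost, choice = [0, 0], [0, 0]
        for f in (0, 1):
            left = orient(cells[i], bool(f))[0]
            options = []
            for pf in (0, 1):
                prev_right = orient(cells[i - 1], bool(pf))[1]
                penalty = 1 if (prev_right == "D" and left == "D") else 0
                options.append((cost[pf] + penalty, pf))
            new_cost[f], choice[f] = min(options)
        cost = new_cost
        back.append(choice)
    # Trace back from the cheaper final state to recover the flips.
    f = 0 if cost[0] <= cost[1] else 1
    total = cost[f]
    flips = [False] * len(cells)
    for i in range(len(cells) - 1, -1, -1):
        flips[i] = bool(f)
        if i > 0:
            f = back[i - 1][f]
    return total, flips
```

Because each cell has only two orientations, the sketch runs in time linear in the number of cells in the row; handling adjacent-cell switching or further design rules would need a richer model than this one.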

12.5 System-level Design Space Exploration

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Konferenz 3

Chair:
Frederic Petrot, TIMA, FR

Co-Chair:
Luciano Lavagno, Politecnico di Torino, IT

This session discusses novel aspects and objectives in the exploration of embedded architectures. Papers cover topics including integration of diagnosis, approximate circuit design, custom instruction optimization, and scheduling issues.

Time	Label	Presentation Title Authors
16:00	12.5.1	NON-INTRUSIVE INTEGRATION OF ADVANCED DIAGNOSIS FEATURES IN AUTOMOTIVE E/E-ARCHITECTURES Speakers: Ulrich Abelein¹, Alejandro Cook², Piet Engelke³, Michael Glaß⁴, Felix Reimann⁴, Laura Rodríguez Gómez², Thomas Russ⁴, Jürgen Teich⁴, Dominik Ull² and Hans-Joachim Wunderlich² ¹AUDI AG, Ingolstadt, DE; ²University of Stuttgart, DE; ³Infineon Technologies AG, DE; ⁴University of Erlangen-Nuremberg, DE Abstract With ever more complex automotive systems, the current approach of using functional tests to locate faulty components results in very long analysis procedures and poor diagnostic accuracy. Built-In Self-Test (BIST) offers a promising alternative to collect structural diagnostic information during E/E-architecture test. However, as the automotive industry is quite cost-driven, structural diagnosis shall not deteriorate traditional design objectives. With this goal in mind, the work at hand proposes a design space exploration to integrate structural diagnostic capabilities into an E/E-architecture design. The proposed integration is performed non-intrusively, i.e., the addition and execution of tests (a) does not affect any functional applications and (b) does not require any costly changes in the communication schedules.
16:30	12.5.2	ABACUS: A TECHNIQUE FOR AUTOMATED BEHAVIORAL SYNTHESIS OF APPROXIMATE COMPUTING CIRCUITS Speakers: Kumud Nepal, Yueting Li, R. Iris Bahar and Sherief Reda, Brown University, Providence, Rhode Island, US Abstract Many classes of applications, especially in the domains of signal and image processing, computer graphics, computer vision, and machine learning, are inherently tolerant to inaccuracies in their underlying computations. This tolerance can be exploited to design approximate circuits that perform within acceptable accuracies but have much lower power consumption and smaller area footprints (and often better run times) than their exact counterparts. In this paper, we propose a new class of automated synthesis methods for generating approximate circuits directly from behavioral-level descriptions. In contrast to previous methods that operate at the Boolean level or use custom modifications, our automated behavioral synthesis method enables a wider range of possible approximations and can operate on arbitrary designs. Our method first creates an abstract synthesis tree (AST) from the input behavioral description, and then applies variant operators to the AST using an iterative stochastic greedy approach to identify the optimal inexact designs in an efficient way. Our method is able to identify the optimal designs that represent the Pareto frontier trade-off between accuracy and power consumption. Our methodology is developed into a tool we call ABACUS, which we integrate with a standard ASIC experimental flow based on industrial tools. We validate our methods on three realistic Verilog-based benchmarks from three different domains --- signal processing, computer vision and machine learning. Our tool automatically discovers optimal designs, providing area and power savings of up to 50% while maintaining good accuracy.
17:00	12.5.3	AUTOMATIC GENERATION OF CUSTOM SIMD INSTRUCTIONS FOR SUPERWORD LEVEL PARALLELISM Speakers: Taemin Kim and Yatin Hoskote, Intel/Intel Labs, US Abstract Application specific instruction-set processors (ASIPs) have drawn significant attention from the System-on-a-Chip (SoC) community due to their fine-grained flexibility and customizability. In order to maximize the benefit of ASIPs, automatic instruction set extension (ISE) is required. In the past decade, there has been a plethora of research on automatic ISE for custom scalar instructions. However, due to the increasing usage of SIMD instructions to exploit data level parallelism (DLP), which exists both across loop iterations and within a basic block (the latter called Superword Level Parallelism, SLP), automatic generation of custom SIMD instructions is an inevitable direction for automatic ISE. In this paper, we propose an algorithm that automatically generates custom SIMD instructions from a set of custom scalar instructions to exploit SLP. We have demonstrated 52.4% and 30.8% performance improvement on average over the base instruction set and over the base set extended with custom scalar instructions, respectively.
17:15	12.5.4	SYSTEM-LEVEL SCHEDULING OF REAL-TIME STREAMING APPLICATIONS USING A SEMI-PARTITIONED APPROACH Speakers: Emanuele Cannella, Mohamed Bamakhrama and Todor Stefanov, Leiden University, NL Abstract Modern multiprocessor streaming systems have hard real-time constraints that must always be met to ensure correct functionality. At the same time, these streaming systems must be designed to use the minimum required amount of resources (such as processors and memory). In order to meet such constraints, using scheduling algorithms from the classical real-time scheduling theory represents an attractive solution approach. These algorithms enable: (1) providing timing guarantees to the applications running on the system, and (2) deriving analytically the minimum number of processors required to schedule the applications. So far, designers in the embedded systems community have focused on global and partitioned scheduling algorithms. However, recently, a new hybrid class of scheduling algorithms has been proposed. In this work, we investigate the applicability of a sub-class of these hybrid algorithms, called semi-partitioned algorithms, to applications modeled as Cyclo-Static Dataflow (CSDF) graphs. The contribution of this paper is twofold. First, we devise an approach that enables semi-partitioned scheduling algorithms, even soft real-time ones, to be applied to CSDF graphs while providing hard real-time guarantees at the input/output interfaces with the external environment. Second, we focus on an existing soft real-time semi-partitioned approach, for which we propose an allocation heuristic, called FFD-SP. The proposed heuristic reduces the minimum number of processors required to schedule the applications compared to a pure partitioned scheduling algorithm, while trying to minimize the buffer size and latency increases incurred by the soft real-time approach.
17:30		End of session
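
The accuracy/power trade-off mentioned in 12.5.2 is usually reported as a Pareto frontier. The snippet below is a minimal sketch of how such a frontier can be extracted from a set of candidate designs; it is not the ABACUS tool or its search procedure, and the function name `pareto_front` and the `(error, power)` tuple layout are assumptions made for illustration. Lower is assumed to be better for both metrics.

```python
from typing import List, Tuple

Design = Tuple[float, float]  # (error, power); lower is better for both


def pareto_front(points: List[Design]) -> List[Design]:
    """Return the non-dominated designs, sorted by increasing error."""
    front: List[Design] = []
    best_power = float("inf")
    # After sorting by error (then power), a design is non-dominated exactly
    # when its power is strictly lower than every design seen before it.
    for err, pwr in sorted(set(points)):
        if pwr < best_power:
            front.append((err, pwr))
            best_power = pwr
    return front
```

For example, given candidates with (error, power) values (0.0, 10.0), (0.01, 8.0), (0.02, 9.0) and (0.05, 5.0), the third is dropped because the second has both lower error and lower power.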

12.6 Error Resilience and Power Management

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Konferenz 4

Chair:
William Fornaciari, Politecnico di Milano - DEIB, IT

Co-Chair:
Kim Gruettner, OFFIS, DE

This session addresses the trade-off between accuracy and power consumption and the management of multi core/multi systems. The power management is addressed at several abstraction levels from circuit and performance counters up to the system level (operating system).

Time	Label	Presentation Title Authors
16:00	12.6.1	ASLAN: SYNTHESIS OF APPROXIMATE SEQUENTIAL CIRCUITS Speakers: Ashish Ranjan, Arnab Raha, Swagath Venkataramani, Kaushik Roy and Anand Raghunathan, PURDUE UNIVERSITY, US Abstract Applications from several important domains exhibit intrinsic resilience to approximations or inexactness in their underlying computations. Approximate circuits are commonly used to realize highly efficient hardware implementations of such applications. A wide range of manual and automatic techniques for the design of approximate circuits have been proposed. However, all of them target combinational circuits, leaving a gap between these techniques and the natural granularity at which quality is specified. In practice, the designer is concerned with quality or accuracy at the output of a sequential circuit after several cycles of computation, and not at the output of an embedded combinational block. We propose ASLAN (Automatic methodology for Sequential Logic ApproximatioN), the first effort towards the synthesis of approximate sequential circuits. Given an RTL or gate-level description of a sequential circuit and a quality constraint at its output, ASLAN automatically synthesizes an approximate version that guarantees the specified quality bound. The key challenges in approximating sequential circuits are (i) to model how errors due to approximations are generated, propagate through multiple cycles of operation, and eventually impact quality of the final output, and (ii) to select the most beneficial approximations, i.e., those that result in higher energy savings for smaller impact on output quality. We address the first challenge by constructing a virtual Sequential Quality Constraint Circuit (SQCC) and utilizing formal verification to ensure that a given approximation satisfies the quality constraint during synthesis. To address the second challenge, we identify combinational blocks in the sequential circuit that are amenable to approximation, generate local quality-energy trade-off curves for them, and use a gradient-descent approach to iteratively approximate the sequential circuit. We used ASLAN to automatically synthesize approximate versions for several sequential benchmarks (DCT, FIR, IIR, etc.). Our experiments demonstrate energy reductions of 1.20X-2.44X for tight error constraints, and 1.32X-4.42X for relaxed error constraints. We also present case studies of using the approximate circuits generated by ASLAN in two well known applications: MPEG Encoding and K-Means Clustering. We obtain energy savings of 1.32X with 0.5% average degradation in PSNR for the MPEG Encoder and 1.26X with 0.8% quality loss in case of K-Means Clustering.
16:30	12.6.2	VRCON: DYNAMIC RECONFIGURATION OF VOLTAGE REGULATORS IN A MULTICORE PLATFORM Speakers: Woojoo Lee, Yanzhi Wang and Massoud Pedram, University of Southern California, US Abstract The emerging trend toward utilizing chip multi-core processors (CMPs) that support dynamic voltage and frequency scaling (DVFS) is driven by user requirements for high performance and low power. To overcome limitations of the conventional chip-wide DVFS and achieve the maximum possible energy saving, per-core DVFS is being enabled in recent CMP offerings. While power consumed by the CMP is reduced by per-core DVFS, power dissipated by the many voltage regulators (VRs) needed to support per-core DVFS becomes critical. This paper focuses on the dynamic control of the VRs in a CMP platform. Starting with a proposed platform with a configurable VR-to-core power distribution network, two optimization methods are presented to maximize the system-wide energy savings: (i) reactive VR consolidation to reconfigure the network for maximizing the power conversion efficiency of the VRs performed under the pre-determined DVFS levels for the cores, and (ii) proactive VR consolidation to determine new DVFS levels for maximizing the total energy savings without any performance degradation. Results from detailed experiments demonstrate up to 35% VR energy loss reduction and 14% total energy saving.
17:00	12.6.3	COARSE-GRAINED BUBBLE RAZOR TO EXPLOIT THE POTENTIAL OF TWO-PHASE TRANSPARENT LATCH DESIGNS Speakers: Hayoung Kim, Jae-joon Kim, Sungjoo Yoo, Sunggu Lee and Dongyoung Kim, POSTECH, KR Abstract The timing margin needed to cover process variation is one of the most critical factors limiting supply voltage reduction and thereby power savings. To remove overly conservative timing margins, Bubble Razor was introduced to dynamically detect and correct errors in two-phase transparent latch designs [13]. However, it does not fully exploit the potential of two-phase transparent latch design, e.g. time borrowing. Thus, especially at low supply voltage where the effect of process variation becomes significant, the existing Bubble Razor can suffer from significant overhead in performance and power consumption due to overly frequent bubble generation. We present a design methodology for coarse-grained Bubble Razor, which exploits the time-borrowing characteristic of two-phase transparent latch design. By selectively inserting error checkpoints, i.e., shadow latches and error management logic, in the circuit, time borrowing can be applied between error checkpoints, thereby avoiding bubbles which could occur in the existing Bubble Razor design with a checkpoint at every latch on the critical path. We present a methodology to choose the grain size (the number of stages between error checkpoints) based on the 3-sigma delay distribution. We also verify the benefits of coarse-grained Bubble Razor with a real microprocessor design, Core-A [15], using the 20nm Predictive Technology Model (PTM) [16]. The proposed methodology offers 62% improvement in performance (MIPS) and 49% less energy consumption (per instruction) at 0.6V operation (zero frequency margin) over the original Bubble Razor scheme. In addition, it gives 25% area reduction in core design.
17:15	12.6.4	FEPMA: FINE-GRAINED EVENT-DRIVEN POWER METER FOR ANDROID SMARTPHONES BASED ON DEVICE DRIVER LAYER EVENT MONITORING Speakers: Kitae Kim¹, Donghwa Shin², Qing Xie³, Yanzhi Wang³, Massoud Pedram³ and Naehyuck Chang¹ ¹Seoul National University, KR; ²Politecnico di Torino, IT; ³University of Southern California, US Abstract This paper introduces a novel sensor-less, event-driven power analysis framework called FEPMA for providing highly accurate and nearly instantaneous estimates of power dissipation in an Android smartphone. The key idea is to collect and correctly record various events of interest within a smartphone as applications are running on the application processor within it. This is in turn done by instrumenting the Android operating system to provide information about power/performance state changes of various smartphone components at the lowest layer of the kernel to avoid time stamping delays and component state observability issues. This technique then enables one to perform fine-grained (in time and space) power metering in the smartphone. Experimental results show significant accuracy improvement compared to previous approaches and good fidelity with respect to actual current measurements. The estimation error of the proposed method is lower by a factor of two than the state-of-the-art method.
17:30		End of session
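
The event-driven metering idea summarized in 12.6.4 can be illustrated by accumulating energy between recorded state changes. This is a minimal sketch only, not the FEPMA framework: the event tuple layout, the function name `energy_from_events`, and the per-state power table are hypothetical, and the power values in the usage note are made-up placeholders rather than measured data.

```python
from typing import Dict, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in s, component, new state)


def energy_from_events(
    events: List[Event],
    power_mw: Dict[str, Dict[str, float]],
    end_time_s: float,
) -> Dict[str, float]:
    """Accumulate per-component energy in mJ from time-sorted state changes."""
    current: Dict[str, Tuple[str, float]] = {}  # component -> (state, since)
    energy: Dict[str, float] = {}
    for t, comp, state in events:
        if comp in current:
            prev_state, since = current[comp]
            # mW * s = mJ spent in the state that just ended
            energy[comp] = energy.get(comp, 0.0) + power_mw[comp][prev_state] * (t - since)
        current[comp] = (state, t)
    # Close the last open interval of every component at end_time_s.
    for comp, (state, since) in current.items():
        energy[comp] = energy.get(comp, 0.0) + power_mw[comp][state] * (end_time_s - since)
    return energy
```

With a placeholder table such as `{"cpu": {"idle": 50.0, "busy": 500.0}}`, a CPU that is idle for 1 s, busy for 2 s and idle again for 1 s would be charged 50 + 1000 + 50 = 1100 mJ.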

12.7 Built-in Self-Test Solutions for Mixed-Signal and RF ICs

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Konferenz 5

Chair:
Jacob A. Abraham, University of Texas at Austin, US

Co-Chair:
Marian Verhelst, KU Leuven, BE

Presentations in this session offer solutions to equip mixed-signal and RF circuits with built-in self-test capabilities. These solutions include the use of an on-chip neural network that maps test signatures directly to a pass/fail decision, loopback test where the transmitter is used to test the receiver, and a reconfiguration principle for pipelined data converters.

Time	Label	Presentation Title Authors
16:00	12.7.1	(Best Paper Award Candidate) AN ANALOG NON-VOLATILE NEURAL NETWORK PLATFORM FOR PROTOTYPING RF BIST SOLUTIONS Speakers: Dzmitry Maliuk¹ and Yiorgos Makris² ¹Yale University, US; ²University of Texas at Dallas, US Abstract We introduce an analog non-volatile neural network chip which serves as an experimentation platform for prototyping custom classifiers for on-chip integration towards fully stand-alone built-in self-test (BIST) solutions for RF circuits. Our chip consists of a reconfigurable array of synapses and neurons operating below threshold and featuring sub-μW power consumption. The synapse circuits employ dynamic weight storage for fast bidirectional weight updates during training. The learned weights are then copied onto analog floating gate (FG) memory for permanent storage. The chip architecture supports two learning models: a multilayer perceptron and an ontogenic neural network. A benchmark XOR task is first employed to evaluate the overall learning capability of our chip. The BIST-related effectiveness is then evaluated on two case studies: the detection of parametric and catastrophic faults in an LNA and an RF front-end circuit, respectively.
16:30	12.7.2	BUILT-IN SELF-TEST AND CHARACTERIZATION OF POLAR TRANSMITTER PARAMETERS IN THE LOOP-BACK MODE Speakers: Jae Woong Jeong¹, Sule Ozev¹, Shreyas Sen², Vishwanath Natarajan² and Mustapha Slamani³ ¹Arizona State University, US; ²Intel Corporation, US; ³IBM Corp., US Abstract This paper presents a low-cost built-in self-test (BIST) solution for polar transmitters. Polar transmitters are desirable for portable devices due to the higher power efficiency they provide compared to traditional Cartesian transmitters. However, they generally require iterative test/measurement/calibration cycles. The delay skew between the envelope and phase signals and the finite envelope bandwidth can create intermodulation distortion (IMD) that leads to the violation of the spectral mask and error vector magnitude (EVM) requirements. These parameters are typically not directly measured but calibrated through spectral performance analysis using expensive RF equipment, leading to lengthy and costly measurement/calibration cycles. Characterization and calibration of these parameters inside the device would reduce the test time and cost considerably. In this paper, we propose a technique to measure the delay skew and the finite envelope bandwidth, two parameters that can be digitally calibrated, based on the measurement of the output of the receiver in the loop-back mode. Simulation and hardware measurement results show that the proposed technique can characterize the targeted impairments in the polar transmitter accurately.
17:00	12.7.3	A FLEXIBLE BIST STRATEGY FOR SDR TRANSMITTERS Speakers: Emanuel Dogaru¹, Filipe Vinci dos Santos² and William Rebernak¹ ¹Thales Communications & Security, FR; ²Thales Chair on Advanced Analog Design, SUPELEC, FR Abstract Software-defined radio (SDR) development aims for increased speed and flexibility. The impact of these system-level requirements on the physical layer (PHY) access hardware is leading to more complex architectures, which together with higher levels of integration pose a challenging problem for product testing. For radio units that must be field-upgradeable without specialized equipment, Built-in Self-Test (BIST) schemes are arguably the only way to ensure continued compliance with specifications. In this paper we introduce a loopback RF BIST technique that uses Periodically Nonuniform Sampling (PNS2) of the transmitter (TX) output to evaluate compliance with spectral mask specifications. No significant hardware costs are incurred due to the re-use of available RX resources (I/Q ADCs, DSP, GPP, etc.). Simulation results of a homodyne TX demonstrate that Adjacent Channel Power Ratio (ACPR) can be accurately estimated. Future work will consist of validating our loopback RF BIST architecture on an in-house SDR testbed.
17:15	12.7.4	SIGMA-DELTA TESTABILITY FOR PIPELINE A/D CONVERTERS Speakers: Antonio Jose Gines Arteaga and Gildas Leger, Instituto de Microelectronica de Sevilla, IMSE-CNM, (CSIC - Universidad de Sevilla), ES Abstract Pipeline Analog to Digital Converters (ADCs) are widely used in applications that require medium to high resolution at high acquisition speed. Despite their quite simple working principles, they usually form rather complex mixed-signal blocks, particularly if digital correction and calibration are considered. As a result, pipeline converters are difficult to test and diagnose. In this paper, we propose to reconfigure the internal Multiplying DACs (MDACs) that perform residue amplifications as integrators, each one with an analog and a digital input. In this way, we can reuse consecutive pipeline stages to form Sigma Delta modulators, with very reduced area overhead. We thus get an on-chip DC (low-frequency) probe with a digital 1-bit output that does not require any extra pin. In addition, digital test techniques developed for Sigma Delta modulators may be used to enhance the diagnosing capabilities. An industrial 1.8V 15-bit 100Msps pipeline ADC that had previously been fully validated in a 0.18 µm CMOS process is used as a case study for the introduction of the DfT modifications.
17:30		End of session
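
The "DC probe with a digital 1-bit output" described in 12.7.4 relies on the basic property that the average of a Sigma Delta bitstream tracks a slowly varying input. The snippet below is a minimal behavioural sketch of an ideal first-order modulator, not the reconfigured MDAC circuit from the paper: the function names, the normalized input range (-1, 1) and the ideal integrator and comparator are all assumptions made for illustration.

```python
from typing import List


def sigma_delta_bits(x: float, n: int) -> List[int]:
    """Ideal first-order Sigma Delta modulator for a DC input x in (-1, 1)."""
    integ = 0.0
    bits: List[int] = []
    for _ in range(n):
        bit = 1 if integ >= 0.0 else 0   # 1-bit quantizer (comparator)
        feedback = 1.0 if bit else -1.0  # 1-bit DAC level fed back
        integ += x - feedback            # integrator accumulates the error
        bits.append(bit)
    return bits


def estimate_dc(bits: List[int]) -> float:
    """Recover the DC input by averaging the bitstream (maps 0/1 to -1/+1)."""
    return 2.0 * sum(bits) / len(bits) - 1.0
```

Because the integrator state stays bounded for inputs inside (-1, 1), the estimation error of this idealized model shrinks in proportion to 1/n as more output bits are averaged.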

12.8 Panel: Future SoC verification methodology: UVM evolution or revolution?

Date: Thursday 27 March 2014
Time: 16:00 - 17:30
Location / Room: Exhibition Theatre

Organiser:
Alex Goryachev, IBM Research - Haifa, IL

Chair:
Rolf Drechsler, University of Bremen/DFKI, DE

It is a recent trend that SoCs are becoming more similar to servers. Many SoCs today are no longer tied to a single application and look more like general purpose PCs and high-end servers. Smartphones are the most notable example of this, but we are also seeing this with TV chips, in-car controllers, network routers, and more. This trend is occurring in parallel to the constantly growing complexity of SoCs, which support diverse IO interfaces and devices, and have complex architectures including multiple heterogeneous cores, multi-level caches, and multiple IO bridges. Today, common practice for verification is based on Universal Verification Methodology (UVM), which, at the system level, is built on reusing and combining unit-level environments, followed by running real software on an SoC. This methodology leaves a large gap. In high-end systems, this gap is covered by system-level verification that focuses on HW-only system integration. This level has its own methodology, dedicated environment, set of tools, and teams. It looks at the system as a whole and is not based on reusing lower level environments. Formal methods are a field of intensive research, but they have not been adopted by the industry for SoC-level verification. In this panel, leading experts from industry (both users and vendors) and academia will discuss the future of SoC verification methodology. Is the gap in today's SoC verification methodology significant? Is it growing? Or perhaps it does not exist? What is the right way to close the gap, if one exists? Is it sufficient to extend UVM capabilities (e.g., SystemC, TLM) or are dedicated tools and methodology needed? Are formal methods ready to play a significant role in SoC-level verification? In general, we would like to determine the importance of system-level verification and its unique needs, whether in generators, checking, coverage, or teams.

Panelists:

Lyes Benalycherif, STMicroelectronics, FR
Franco Fummi, University of Verona, IT
Alan J. Hu, University of British Columbia, Vancouver, CA
Ronny Morad, IBM Research - Haifa, IL
Frank Schirrmeister, Cadence Design Systems, US

17:30	End of session
